Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Htsget Support for VCF Format #233

Closed
wants to merge 9 commits into from

Conversation

amilamanoj
Copy link

Extends htsget spec to support VCF format by adding a new method: Get Variants by ID.

@jeromekelleher
Copy link
Member

This seems pretty sensible to me. Perhaps we should leave this feature out until after we've hit 1.0 though? The deadline is about about 6 weeks away and it would probably be better to harden up the rest of the spec before introducing a major new feature like this.

htsget.md Outdated
`format`
_optional string_
</td><td>
Request read data in this format. Allowed values: VCF.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/read/variant

Also: we should allow BCF as well as VCF.

@cyenyxe cyenyxe added the htsget label Sep 21, 2017
htsget.md Outdated
`id`
_required_
</td><td>
Study ids from which variants are to be returned.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be just 'Ids' instead of 'Study IDs' (For EGA this will initially have to be File IDs)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also: Would it be sufficient to use a single ID (in line with BAM/CRAM specs), and wait for a unified way to submit multiple ID via POST?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would be fine with changing it to "one or multiple IDs". Using only one wouldn't be sufficient for the EVA use case for instance as we support cross-study queries.

It could be useful to list examples of possible types of identifiers (studies, files, samples) for both reads and variants endpoints, just to make clear what could be supported. I helped to define this addition and we were a bit unsure at first.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the last call, we agreed that we would add POST support to support mulitple IDs. Otherwise, the url could/will get truncated unless there is a hard bound on the total number of characters (maybe http already defines this).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems orthogonal to VCF support --- perhaps we should limit to using GET with a single ID here for now, and tackle multiple IDs/POST for both reads and variants in a separate PR (as @AlexanderSenf suggests)?

htsget.md Outdated
`format`
_optional string_
</td><td>
Request variant data in this format. Allowed values: VCF.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not BCF also?

@jeromekelleher
Copy link
Member

Looks good to me. I see no reason not to support this, since the protocol is identical. Perhaps we should think about abstracting a bit rather than copy-pasting? It might be better to get VCF support into a few clients and servers first though and see how it works before attempting this.

@mgcam
Copy link

mgcam commented Nov 21, 2017

I agree with @jeromekelleher . Our implementation already supports VCF via 'format' param. We previously treated the 'reads/' part of URL as a suggestion; we are using 'sample/' since our IDs are sample accession IDs. Serving variants on the 'sample/' URL seems OK.

@cyenyxe
Copy link
Member

cyenyxe commented Nov 22, 2017

@jeromekelleher This was defined based on an implementation, although it is not public yet. We can try to deploy it in a test environment before the Christmas break.
I would be okay with having a more generic section like "URL structure" with a list of entity types (reads, variants, samples, etc) that must be used.

@mgcam I'm not completely sold about using 'sample' for variants, it doesn't sound very intuitive... It can be enough for streaming those reported by a single sample, which I think it's a reasonable use case, by what if a VCF has multiple samples or it is aggregated and has none? In that case a study/file ID would make more sense.

@mgcam
Copy link

mgcam commented Nov 22, 2017 via email

@tk2
Copy link

tk2 commented Nov 22, 2017

@cyenyxe it looks like this pull request will need some changes and is unlikely to be merged in current form. Will you or @amilamanoj be making the edits required or preparing a follow-up pull request?

@cyenyxe
Copy link
Member

cyenyxe commented Feb 20, 2018

I have made some changes to address the following comments:

  • Generic ID support instead of "study ID"
  • Limit GET requests to single value
  • Support VCF and BCF
  • Abstract to make more generic for reads and variants (less copy-pasting)

htsget.md Outdated
@@ -239,8 +239,9 @@ The client can request only variants overlapping a given genomic range. The resp
`id`
_required_
</td><td>
Study ids from which variants are to be returned.
A string specifying which variants to return.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we use 'identifier'? That would be more general than 'Study ID' but more specific than 'string'?

Copy link
Member

@cyenyxe cyenyxe Feb 21, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree, it just replicated what was already in the reads section (this comment refers to an old commit btw).

htsget.md Outdated

# Method: get reads by ID
## Methods

GET /reads/<id>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this actually be part of the specs? The EGA API doesn't use '/reads/' at all in the URL, because at the EGA I have to use file IDs; so I am using '/files/' instead. I think this should be more generic, so that different implementations can choose their own way. I do like the addition of '/variants/' for VCF/BCF files.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the URLs are completely arbitrary, how would a client know what endpoint to call to? Yet another one would be necessary to discover which is the URL to search for reads, variants, etc., in a particular server.

Copy link
Member

@jeromekelleher jeromekelleher Feb 21, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure there's much point in mandating the format of the URL either, since most people are actively ignoring the current 'reads' prefix. In terms of the client knowing where to look for reads or variants, we have no idea how a client got the URL in the first place (explicitly out of band), so I think we can assume that the service that provided the URL will know whether it points to reads or variants.

@@ -162,29 +175,25 @@ _optional 32-bit unsigned integer_
</td><td>
The start position of the range on the reference, 0-based, inclusive.

The server SHOULD respond with an `InvalidInput` error if `start` is specified and a reference is not specified
(see `referenceName`).
The server SHOULD respond with an `InvalidInput` error if `start` is specified and a reference is not specified (see `referenceName`).
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's after changing here? I think we should keep the diff to the minimum.

Copy link
Member

@cyenyxe cyenyxe Feb 23, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just trying to make this section follow the same style as the rest of the document, where lines are not split.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah right, that's a good idea. It might be better to do such housekeeping stuff separately to semantic changes like this though. I know @jmarshall likes a nice clean diff!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven't been able to find any other formatting issues in the whole document. Is removing 3 line breaks worth rebasing this PR and creating a new one?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Definitely not worth another PR. Could split this into two commits (one housekeeping, one VCF) when squashing? It really makes no difference though. If nobody else complains, whatever you prefer is fine by me.

@jeromekelleher
Copy link
Member

Looks good to me. Ther'e's a few 'drive-by' edits that should be removed in the interest of keeping the diff on topic, but other that I'd vote to squash and merge.

@AlexanderSenf
Copy link
Contributor

+1

1 similar comment
@mlin
Copy link
Member

mlin commented Feb 26, 2018

👍

@cyenyxe
Copy link
Member

cyenyxe commented Mar 5, 2018

Is anything else needed (such as more +1) to get this merged?

@mlin
Copy link
Member

mlin commented Mar 6, 2018

Let me make a final call for comments-- I'll merge this in a couple of days if no concerns are raised. I will try to do so with @jeromekelleher's suggestion to organize it into main and housekeeping diffs. Thanks @cyenyxe!

@mlin
Copy link
Member

mlin commented Mar 12, 2018

I've reorganized the commits as suggested, in #301

Closing this PR (the new one links back here for the record)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants