-
Notifications
You must be signed in to change notification settings - Fork 173
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Htsget Support for VCF Format #233
Conversation
This seems pretty sensible to me. Perhaps we should leave this feature out until after we've hit 1.0 though? The deadline is about about 6 weeks away and it would probably be better to harden up the rest of the spec before introducing a major new feature like this. |
htsget.md
Outdated
`format` | ||
_optional string_ | ||
</td><td> | ||
Request read data in this format. Allowed values: VCF. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
s/read/variant
Also: we should allow BCF as well as VCF.
htsget.md
Outdated
`id` | ||
_required_ | ||
</td><td> | ||
Study ids from which variants are to be returned. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this be just 'Ids' instead of 'Study IDs' (For EGA this will initially have to be File IDs)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also: Would it be sufficient to use a single ID (in line with BAM/CRAM specs), and wait for a unified way to submit multiple ID via POST?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would be fine with changing it to "one or multiple IDs". Using only one wouldn't be sufficient for the EVA use case for instance as we support cross-study queries.
It could be useful to list examples of possible types of identifiers (studies, files, samples) for both reads and variants endpoints, just to make clear what could be supported. I helped to define this addition and we were a bit unsure at first.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On the last call, we agreed that we would add POST support to support mulitple IDs. Otherwise, the url could/will get truncated unless there is a hard bound on the total number of characters (maybe http already defines this).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems orthogonal to VCF support --- perhaps we should limit to using GET with a single ID here for now, and tackle multiple IDs/POST for both reads and variants in a separate PR (as @AlexanderSenf suggests)?
htsget.md
Outdated
`format` | ||
_optional string_ | ||
</td><td> | ||
Request variant data in this format. Allowed values: VCF. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not BCF also?
Looks good to me. I see no reason not to support this, since the protocol is identical. Perhaps we should think about abstracting a bit rather than copy-pasting? It might be better to get VCF support into a few clients and servers first though and see how it works before attempting this. |
I agree with @jeromekelleher . Our implementation already supports VCF via 'format' param. We previously treated the 'reads/' part of URL as a suggestion; we are using 'sample/' since our IDs are sample accession IDs. Serving variants on the 'sample/' URL seems OK. |
@jeromekelleher This was defined based on an implementation, although it is not public yet. We can try to deploy it in a test environment before the Christmas break. @mgcam I'm not completely sold about using 'sample' for variants, it doesn't sound very intuitive... It can be enough for streaming those reported by a single sample, which I think it's a reasonable use case, by what if a VCF has multiple samples or it is aggregated and has none? In that case a study/file ID would make more sense. |
On 22/11/2017 09:33, Cristina Yenyxe Gonzalez Garcia wrote:
It can be enough for streaming those reported by a single sample,
which I think it's a reasonable use case, by what if a VCF has
multiple samples or it is aggregated and has none? In that case a
study/file ID would make more sense.
I agree. The sample/study/file URL is almost orthogonal to
reads/variants. The former hints on the origin of the data, the latter
on the nature of the data served.
…--
The Wellcome Trust Sanger Institute is operated by Genome Research
Limited, a charity registered in England with number 1021457 and a
company registered in England with number 2742969, whose registered
office is 215 Euston Road, London, NW1 2BE.
|
@cyenyxe it looks like this pull request will need some changes and is unlikely to be merged in current form. Will you or @amilamanoj be making the edits required or preparing a follow-up pull request? |
I have made some changes to address the following comments:
|
htsget.md
Outdated
@@ -239,8 +239,9 @@ The client can request only variants overlapping a given genomic range. The resp | |||
`id` | |||
_required_ | |||
</td><td> | |||
Study ids from which variants are to be returned. | |||
A string specifying which variants to return. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we use 'identifier'? That would be more general than 'Study ID' but more specific than 'string'?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agree, it just replicated what was already in the reads section (this comment refers to an old commit btw).
htsget.md
Outdated
|
||
# Method: get reads by ID | ||
## Methods | ||
|
||
GET /reads/<id> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this actually be part of the specs? The EGA API doesn't use '/reads/' at all in the URL, because at the EGA I have to use file IDs; so I am using '/files/' instead. I think this should be more generic, so that different implementations can choose their own way. I do like the addition of '/variants/' for VCF/BCF files.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the URLs are completely arbitrary, how would a client know what endpoint to call to? Yet another one would be necessary to discover which is the URL to search for reads, variants, etc., in a particular server.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure there's much point in mandating the format of the URL either, since most people are actively ignoring the current 'reads' prefix. In terms of the client knowing where to look for reads or variants, we have no idea how a client got the URL in the first place (explicitly out of band), so I think we can assume that the service that provided the URL will know whether it points to reads or variants.
@@ -162,29 +175,25 @@ _optional 32-bit unsigned integer_ | |||
</td><td> | |||
The start position of the range on the reference, 0-based, inclusive. | |||
|
|||
The server SHOULD respond with an `InvalidInput` error if `start` is specified and a reference is not specified | |||
(see `referenceName`). | |||
The server SHOULD respond with an `InvalidInput` error if `start` is specified and a reference is not specified (see `referenceName`). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's after changing here? I think we should keep the diff to the minimum.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just trying to make this section follow the same style as the rest of the document, where lines are not split.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah right, that's a good idea. It might be better to do such housekeeping stuff separately to semantic changes like this though. I know @jmarshall likes a nice clean diff!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I haven't been able to find any other formatting issues in the whole document. Is removing 3 line breaks worth rebasing this PR and creating a new one?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Definitely not worth another PR. Could split this into two commits (one housekeeping, one VCF) when squashing? It really makes no difference though. If nobody else complains, whatever you prefer is fine by me.
Looks good to me. Ther'e's a few 'drive-by' edits that should be removed in the interest of keeping the diff on topic, but other that I'd vote to squash and merge. |
+1 |
1 similar comment
👍 |
Is anything else needed (such as more +1) to get this merged? |
Let me make a final call for comments-- I'll merge this in a couple of days if no concerns are raised. I will try to do so with @jeromekelleher's suggestion to organize it into main and housekeeping diffs. Thanks @cyenyxe! |
I've reorganized the commits as suggested, in #301 Closing this PR (the new one links back here for the record) |
Extends htsget spec to support VCF format by adding a new method: Get Variants by ID.