Htsget Support for VCF Format #233

amilamanoj · 2017-08-18T15:12:01Z

Extends htsget spec to support VCF format by adding a new method: Get Variants by ID.

jeromekelleher · 2017-08-21T15:26:19Z

This seems pretty sensible to me. Perhaps we should leave this feature out until after we've hit 1.0 though? The deadline is about 6 weeks away and it would probably be better to harden up the rest of the spec before introducing a major new feature like this.

jeromekelleher · 2017-08-21T15:27:07Z

htsget.md

+`format`  
+_optional string_
+</td><td>
+Request read data in this format. Allowed values: VCF.


s/read/variant

Also: we should allow BCF as well as VCF.

AlexanderSenf · 2017-11-21T08:43:23Z

htsget.md

+`id`  
+_required_
+</td><td>
+Study ids from which variants are to be returned.


Should this be just 'IDs' instead of 'Study IDs'? (For EGA this will initially have to be File IDs.)

Also: Would it be sufficient to use a single ID (in line with BAM/CRAM specs), and wait for a unified way to submit multiple IDs via POST?

I would be fine with changing it to "one or multiple IDs". Using only one wouldn't be sufficient for the EVA use case for instance as we support cross-study queries.

It could be useful to list examples of possible types of identifiers (studies, files, samples) for both reads and variants endpoints, just to make clear what could be supported. I helped to define this addition and we were a bit unsure at first.

On the last call, we agreed that we would add POST support for multiple IDs. Otherwise, the URL could/will get truncated unless there is a hard bound on the total number of characters (maybe HTTP already defines this).

This seems orthogonal to VCF support --- perhaps we should limit to using GET with a single ID here for now, and tackle multiple IDs/POST for both reads and variants in a separate PR (as @AlexanderSenf suggests)?

jeromekelleher · 2017-11-21T16:41:08Z

htsget.md

+`format`  
+_optional string_
+</td><td>
+Request variant data in this format. Allowed values: VCF.


Why not BCF also?

jeromekelleher · 2017-11-21T16:45:26Z

Looks good to me. I see no reason not to support this, since the protocol is identical. Perhaps we should think about abstracting a bit rather than copy-pasting? It might be better to get VCF support into a few clients and servers first though and see how it works before attempting this.

mgcam · 2017-11-21T23:18:19Z

I agree with @jeromekelleher . Our implementation already supports VCF via 'format' param. We previously treated the 'reads/' part of URL as a suggestion; we are using 'sample/' since our IDs are sample accession IDs. Serving variants on the 'sample/' URL seems OK.

cyenyxe · 2017-11-22T09:33:54Z

@jeromekelleher This was defined based on an implementation, although it is not public yet. We can try to deploy it in a test environment before the Christmas break.
I would be okay with having a more generic section like "URL structure" with a list of entity types (reads, variants, samples, etc) that must be used.

@mgcam I'm not completely sold on using 'sample' for variants, it doesn't sound very intuitive... It can be enough for streaming those reported by a single sample, which I think is a reasonable use case, but what if a VCF has multiple samples, or it is aggregated and has none? In that case a study/file ID would make more sense.

mgcam · 2017-11-22T09:50:01Z

On 22/11/2017 09:33, Cristina Yenyxe Gonzalez Garcia wrote: It can be enough for streaming those reported by a single sample, which I think is a reasonable use case, but what if a VCF has multiple samples, or it is aggregated and has none? In that case a study/file ID would make more sense.

I agree. The sample/study/file URL is almost orthogonal to reads/variants. The former hints on the origin of the data, the latter on the nature of the data served.


tk2 · 2017-11-22T14:04:33Z

@cyenyxe it looks like this pull request will need some changes and is unlikely to be merged in its current form. Will you or @amilamanoj be making the required edits or preparing a follow-up pull request?


cyenyxe · 2018-02-20T12:18:21Z

I have made some changes to address the following comments:

Generic ID support instead of "study ID"
Limit GET requests to single value
Support VCF and BCF
Abstract to make more generic for reads and variants (less copy-pasting)

AlexanderSenf · 2018-02-21T13:55:10Z

htsget.md

@@ -239,8 +239,9 @@ The client can request only variants overlapping a given genomic range. The resp
 `id`  
 _required_
 </td><td>
-Study ids from which variants are to be returned.
+A string specifying which variants to return.


Should we use 'identifier'? That would be more general than 'Study ID' but more specific than 'string'.

Agree, it just replicated what was already in the reads section (this comment refers to an old commit btw).

AlexanderSenf · 2018-02-21T13:59:54Z

htsget.md


-# Method: get reads by ID
+## Methods

    GET /reads/<id>


Should this actually be part of the specs? The EGA API doesn't use '/reads/' at all in the URL, because at the EGA I have to use file IDs; so I am using '/files/' instead. I think this should be more generic, so that different implementations can choose their own way. I do like the addition of '/variants/' for VCF/BCF files.

If the URLs are completely arbitrary, how would a client know which endpoint to call? Yet another endpoint would be necessary to discover the URL to query for reads, variants, etc. on a particular server.

I'm not sure there's much point in mandating the format of the URL either, since most people are actively ignoring the current 'reads' prefix. In terms of the client knowing where to look for reads or variants, we have no idea how a client got the URL in the first place (explicitly out of band), so I think we can assume that the service that provided the URL will know whether it points to reads or variants.
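To make the position above concrete, here is a minimal sketch of a client that does not assume any particular URL prefix: it is handed the endpoint out of band and only appends a single identifier plus the query parameters discussed in this PR. The function name `build_variants_url` and the `example.org` hosts are hypothetical and not part of the spec; `format`, `referenceName` and `start` are the parameter names quoted in this thread.

```python
from urllib.parse import quote, urlencode

# Formats this PR proposes for variant data (VCF, plus BCF as requested in review).
ALLOWED_VARIANT_FORMATS = {"VCF", "BCF"}


def build_variants_url(endpoint, record_id, fmt=None, reference_name=None, start=None):
    """Build a single-ID GET URL for a variants request.

    `endpoint` is whatever prefix the client received out of band
    (for example ".../variants" or ".../files"); nothing is hard-coded here.
    """
    if fmt is not None and fmt not in ALLOWED_VARIANT_FORMATS:
        raise ValueError(f"unsupported variant format: {fmt!r}")

    # Collect only the parameters that were actually supplied.
    params = {}
    if fmt is not None:
        params["format"] = fmt
    if reference_name is not None:
        params["referenceName"] = reference_name
    if start is not None:
        params["start"] = start

    # Percent-encode the identifier so it is safe as a single path segment.
    url = endpoint.rstrip("/") + "/" + quote(record_id, safe="")
    if params:
        url += "?" + urlencode(params)
    return url
```

The same function works whether a server chose `/variants/`, `/files/` or `/sample/`, which is the point being argued: the prefix is configuration, not protocol.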

jeromekelleher · 2018-02-23T13:10:11Z

htsget.md

@@ -162,29 +175,25 @@ _optional 32-bit unsigned integer_
 </td><td>
 The start position of the range on the reference, 0-based, inclusive. 

-The server SHOULD respond with an `InvalidInput` error if `start` is specified and a reference is not specified
-(see `referenceName`).
+The server SHOULD respond with an `InvalidInput` error if `start` is specified and a reference is not specified (see `referenceName`).


What's after changing here? I think we should keep the diff to the minimum.

Just trying to make this section follow the same style as the rest of the document, where lines are not split.

Ah right, that's a good idea. It might be better to do such housekeeping stuff separately to semantic changes like this though. I know @jmarshall likes a nice clean diff!

I haven't been able to find any other formatting issues in the whole document. Is removing 3 line breaks worth rebasing this PR and creating a new one?

Definitely not worth another PR. Could split this into two commits (one housekeeping, one VCF) when squashing? It really makes no difference though. If nobody else complains, whatever you prefer is fine by me.
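For readers following the hunk above, the rule it touches (reject `start` when no `referenceName` is given) could be checked on the server roughly as follows. This is only a sketch: the function name `check_start` and the dict-of-strings input are hypothetical, and returning `InvalidInput` for a non-integer or out-of-range `start` is an assumption based on the parameter being described as a 32-bit unsigned integer, not something stated in this diff.

```python
UINT32_MAX = 2**32 - 1  # `start` is described as an optional 32-bit unsigned integer


def check_start(params):
    """Return None if `start` is acceptable, else the error name to respond with."""
    start = params.get("start")
    if start is None:
        # `start` is optional, so its absence is always fine.
        return None

    # Rule quoted in the diff: `start` without a reference is invalid input.
    if params.get("referenceName") is None:
        return "InvalidInput"

    # Assumption: unparsable or out-of-range values are also invalid input.
    try:
        value = int(start)
    except (TypeError, ValueError):
        return "InvalidInput"
    if not 0 <= value <= UINT32_MAX:
        return "InvalidInput"
    return None
```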

jeromekelleher · 2018-02-23T13:11:24Z

Looks good to me. There are a few 'drive-by' edits that should be removed in the interest of keeping the diff on topic, but other than that I'd vote to squash and merge.

AlexanderSenf · 2018-02-23T13:17:02Z

+1

mlin · 2018-02-26T00:54:05Z

👍

cyenyxe · 2018-03-05T15:37:08Z

Is anything else needed (such as more +1) to get this merged?

mlin · 2018-03-06T22:28:50Z

Let me make a final call for comments-- I'll merge this in a couple of days if no concerns are raised. I will try to do so with @jeromekelleher's suggestion to organize it into main and housekeeping diffs. Thanks @cyenyxe!

mlin · 2018-03-12T18:33:04Z

I've reorganized the commits as suggested, in #301

Closing this PR (the new one links back here for the record)

jeromekelleher reviewed Aug 21, 2017


cyenyxe added the htsget label Sep 21, 2017

AlexanderSenf reviewed Nov 21, 2017


jeromekelleher reviewed Nov 21, 2017


amilamanoj and others added 7 commits February 20, 2018 11:20

Added support for VCF

2392fec

Update htsget.md

e9dabe4

BCF allowed in variants endpoint

eb59b58

More generic definition of 'ID' parameter in variants endpoint

7dbdf2e

More specific definition of the tickets received for variants endpoint

1d25119

Unified reads and variants endpoint description under a single section 'Request'

56bdf6f

htsget response allowed formats

5859595

cyenyxe force-pushed the htsget-vcf branch from 1c2581d to 5859595 Compare February 20, 2018 12:16

AlexanderSenf reviewed Feb 21, 2018


Cristina Yenyxe Gonzalez Garcia added 2 commits February 23, 2018 07:58

Endpoints are recommended instead of mandatory

f3ad3b7

Replaced 'string' with 'identifier' in URL parameter

f5d0408

jeromekelleher reviewed Feb 23, 2018


mlin closed this Mar 12, 2018

jmarshall mentioned this pull request Feb 21, 2019

Add htsget 1.2.0, OpenAPI v3.0.2 spec #385

Closed

brainstorm mentioned this pull request Feb 21, 2019

htsget VCF fields specification #386

Open
