Added class to URLs in the response #322

cyenyxe · 2018-06-07T14:37:59Z

New class attribute for URLs in the htsget response, with supported values 'header' and 'body'.

In the example, I first try writing the class after the body but it was a bit confusing because the URL was the same, only the byte range changed. So I decided to write the class at the end of each object.

jeromekelleher · 2018-06-11T08:07:13Z

I think the idea is very elegant. class seems a good word to use here, and "header" and "body" seem like the right values as well.

My only issue here is that we may need to include more explanatory text. In #311 I tried to lay out the assumptions that we're making about the file formats, but maybe that was overboard. What do others think?

jmarshall · 2018-06-11T08:26:51Z

I was mulling over explanatory text over the weekend, and can write something up if @cyenyxe prefers.

Should this be optional so that existing implementations are still compliant? Leaving class off might mean you could make no assumptions about the contents of that sub-URL. Perhaps it would be best to say that class must be present on all of them or none of them.

jeromekelleher · 2018-06-11T08:31:17Z

Should this be optional so that existing implementations are still compliant? Leaving class off might mean you could make no assumptions about the contents of that sub-URL. Perhaps it would be best to say that class must be present on all of them or none of them.

Good question. I think we have to support older implementations that don't know about this, so the burden is on the client. I think you're right: either ALL urls should have a class attribute, or none of them should have it.

jmarshall · 2018-06-12T15:26:24Z

I've added a suggested draft of some more explanatory text and also an idea about using ETag and other HTTP headers to ensure header and body consistency if the client chooses to reuse already-downloaded headers. (I think the latter was mentioned in passing on the Toronto meeting call.)

I've moved the suggestion of some explanatory text and the Cache-Control text to a separate PR — see #325. You may wish to mine that for its explanatory text.

mlin · 2018-09-16T21:50:06Z

htsget.md

+`class`  
+_optional string_
+</td><td>
+A list of URL classes to include, see below.


I'd add the briefest explanation here too such as "allowing client to request the data header only, or to opt-out of receiving the header in subsequent requests"

mlin · 2018-09-16T21:51:45Z

htsget.md

+`class`
+_string_
+</td><td>
+For file formats whose specification describes a header and a body, the class indicates which of the two will be retrieved when querying this URL. Either all or none of the URLs in the response must have a class attribute. The allowed values are `header` and `body`.


For explicitness, I'd add something like "if class attributes are absent, client should assume data blocks include both header and body, possibly mixed"

I don't think this is a good idea. For instance, the EVA server provides separate URLs for headers and body by default, to facilitate downloading them separately even when the class field isn't available.

But if the server doesn't annotate the URLs with class fields, there is no way for the client to know that the EVA server is doing that. The text @mlin suggests just makes explicit the reality that clients can't make assumptions when class fields are missing.

Maybe I misunderstood what @mlin means. Which of these 2 is it?

Each URL could contain header or body

Each URL could contain header and body

If it is the latter, then a client would have to split the contents from each URL in order to build a valid output file. If it is the former, then I will think about a rewording that is completely unambiguous.

If class fields are absent, the client can make no assumptions about the contents of individual ticket URLs or the boundaries between their contents: each URL could contain headers, body records, partial headers or records, or both.¹

(Consider e.g. a request with no referenceName/start/end thus a request for the entire file, and the server just returns a ticket that chops it up into 1 Mb chunks.)

It's ~~implicit~~ explicit (in the diagram of core mechanic section) that the client proceeds as if it's concatenating the contents of all the ticket URLs in order, to get the full file contents. It might choose to avoid redownloading an URL or two because it's already got it (effectively) cached, but that's its business. I don't quite see what you're getting at about the “client [having] to split the contents”…?

¹ Actually perhaps some implementations would like data records or compression blocks not to be split across ticket URLs, but at present I don't think the protocol says anything about that — so such splitting is allowed.

Thanks for the explanation, I hadn't considered how they could be mixed due to the fix-sized chunks 😅 Now it's all clear.

I thought it meant that all the URLs could contain header and body, which would make individual blocks correct, but not the response as a whole.

Oh I see 😄 — yes: the client can't assume anything about the individual URL contents, but it is safe in expecting that the whole concatenated response is a valid header-body-body-…-body-EOFtrailer stream.

jmarshall · 2018-09-24T22:15:07Z

[Query parameters]
+class optional string A list of URL classes to include, see below.

The original functionality desired was to enable requesting just the headers (see #311 (comment)). Later we decided it would also be useful to enable clients to avoid re-downloading headers they already have, but this is a comparatively minor piece of functionality.

The elegance of the class: header / body in the JSON urls array is that it enables clients to do the latter (if they wish) while maintaining the invariant that all requests are asking for syntactically-valid BAM/CRAM/VCF/BCF files.

So IMHO class as a query parameter should implement only the originally desired functionality: the only allowable value for class as a query parameter should be header.

jeromekelleher · 2018-09-25T15:19:49Z

So IMHO class as a query parameter should implement only the originally desired functionality: the only allowable value for class as a query parameter should be header.

I'd support this in the spirit of us making the smallest possible change that's useful --- particularly if we can get it implemented. If we have an active server I'm happy to put together a client interface.

mlin · 2018-09-25T17:03:56Z

Opting-out of the header should be useful for RNA-seq data where the header is liable to include all the transcript IDs mapped to. However, I support the principle of simplifying a prototype implementation.

jmarshall · 2018-09-25T17:35:47Z

@mlin: This PR now adds two related facilities. The main point of this PR, “Add[ing] class to URLs in the response“ is that such a request for RNA-seq data would get an htsget ticket back that says

{ … "urls": [ { "url": "abc…", "class": "header" }, { "url": "def…", "class": "body" }, … ], … }

and it would be able to opt out of getting the header (if it wished) by not bothering to download the abc URL. That is functionality that is enabled by the main part of this PR. (This might be unclear, which IIRC is why “more explanatory text” was requested in the PR.)

In addition to this, since September 7th the PR also adds a class= query parameter to directly express the original desired functionality — of making a request for just the headers.

So not only is class=header-or-class-queryparam-not-present the smallest possible useful change, it's also the only query parameter value we'll ever need for this sort of opting-out! The PR adds the functionality @mlin describes by providing the tools for clever clients to cleverly interpret the htsget ticket.

cyenyxe · 2018-11-28T11:32:38Z

I feel it would be confusing to have a query parameter and response field with the same name, but only allowing a subset of values to be requested. What are your thoughts about using a parameter like header-url=<true|false> instead?

jmarshall · 2018-11-28T14:12:49Z

I don't think it is confusing. Describing the query parameter as “Use …?class=header to have the server return the SAM or VCF headers only” or similar will make this clear. (The purpose of this query parameter is to implement the original desire for a way for the client to request SAM/VCF headers only, not to be a general-purpose selector.)

IMHO it is elegant for the protocol to use the same class / header terminology in both places.

cyenyxe · 2018-11-30T14:58:02Z

I hope the last changes address all the concerns about lack of clarity, please let me know if that is not the case.

mlin · 2019-01-23T06:45:03Z

I've made an initial implementation of this in https://github.com/dnanexus-rnd/htsnexus/pull/29/files and think it looks very nice. Great job @cyenyxe @jmarshall @jeromekelleher! I think we can ask for +1s on this on the call today.

jmarshall · 2019-01-23T13:08:22Z

This describes the new class annotations, but it doesn't really say anything about what they're for or how to use them. This is what I think #322 (comment) was getting at.

The Diagram of core mechanic section is an opportunity to add explanatory text. We should add something like:

While the blocks must be finally concatenated in the given order, the client may fetch them in parallel and/or reuse cached data from URLs that have previously been downloaded.

When making a series of requests to fetch reads or variants within different regions of the same <id> resource, clients may wish to avoid re-fetching the SAM/CRAM/VCF headers each time, especially if they are large. If the ticket contains class fields, the client may reuse previously downloaded and parsed headers rather than re-fetching the header-class URLs.

In the response JSON, class should be listed as optional string.

IMHO the “If it is absent, clients should assume data blocks include both header and body, possibly mixed in one of the blocks” sentence is not particularly clear and would be better phrased as not assuming:

If class fields are not supplied, no assumptions can be made about which data blocks contain headers, body records, or parts of both.

jmarshall · 2019-01-23T13:47:45Z

Re the query parameter …?class=header facility:

It's somewhat unclear what kind of data headers it's talking about.
The text about “not possible to return the data body only” is irrelevant (the spec shouldn't complain about the protocol's design!) and incorrect (the desired effect is totally possible — the way to do it is by interpreting class in the response ticket). It might be worth adding something about “All responses elegantly denote complete SAM/CRAM/VCF files” to the Design principles section, but it's misplaced here.

IMHO we should write this parameter's description something like the following, to facilitate future additions:

format
optional string

Request data in this format. The allowed values for each type of record are: […snip…]

class
optional string

Request different classes of data. By default, i.e., when class is not specified, the response will represent a complete read or variant data stream, encompassing SAM/CRAM/VCF headers, body data records, and EOF marker.

If class is specified, its value MUST be one of the following:

`header`	Request the SAM/CRAM/VCF headers only. The server SHOULD respond with an `InvalidInput` error if any other htsget query parameters other than `format` are specified at the same time as `class=header`.

Should we specify whether the class=header response has an EOF marker? IMHO we should probably be explicit that it might or might not.
Servers might want to return the BAM/etc headers directly rather than returning a ticket and needing an extra round trip. Should this be allowed? If so, we might mention it here and we would need to relax the “All API invocations are made to a configurable HTTP(S) endpoint, receive URL-encoded query string parameters, and return JSON output” text in Protocol essentials.

jmarshall

👍 to the design, but text can be improved as per recent comments.

jeromekelleher · 2019-01-23T16:59:00Z

I basically agree to everything that @jmarshall has put in above. Perhaps he could open a PR on @cyenyxe's feature branch with these changes so that they will show up here in this PR? (Git history minutia: since both John and Cristina have put in so much here, I think it's probably fair if we have two commits, one authored by each, representing the actual history of the feature.)

cyenyxe · 2019-01-23T17:01:08Z

Given that it is just a change in wording, and I wasn't too convinced about mine to being with, I am happy with @jmarshall pushing to my branch without a PR. Please let me know if you don't have permissions to do so.

Describe the class=header request query parameter more fully. Add description of using response URL classes to the "diagram of core mechanic" section. [SQUASH INTO PREVIOUS] Add "optional", some wordsmithing, and add blank lines so formatting works when displayed in the GH repo too.]

mlin · 2019-02-20T17:11:41Z

bumping @jmarshall questions

Should we specify whether the class=header response has an EOF marker? IMHO we should probably be explicit that it might or might not.

Servers might want to return the BAM/etc headers directly rather than returning a ticket and needing an extra round trip. Should this be allowed? If so, we might mention it here and we would need to relax the “All API invocations are made to a configurable HTTP(S) endpoint, receive URL-encoded query string parameters, and return JSON output” text in Protocol essentials.

jeromekelleher · 2019-02-20T17:17:18Z

Weighing in here:

Servers might want to return the BAM/etc headers directly rather than returning a ticket and needing an extra round trip. Should this be allowed? If so, we might mention it here and we would need to relax the “All API invocations are made to a configurable HTTP(S) endpoint, receive URL-encoded query string parameters, and return JSON output” text in Protocol essentials.

I think we should always return a ticket, the alternative seems like it would cause a lot of issues.

Should we specify whether the class=header response has an EOF marker? IMHO we should probably be explicit that it might or might not.

Would you mind explaining when you think we would get an EOF marker @jmarshall? It's not clear to me what the tradeoffs are here.

jmarshall · 2019-02-20T19:43:08Z

Re returning headers directly, @daviesrob's point (made during the meeting) that clients expecting a ticket might be greatly surprised to get something that's not a ticket is a very good one.

Re EOF markers (as in the 28-byte empty BGZF block at the end of BGZFed files, and the CRAM equivalent):

The proposed class=header query text has “By default, i.e., when class is not specified, the response will represent a complete read or variant data stream, encompassing SAM/CRAM/VCF headers, body data records, and EOF marker”, which is perhaps more information than anyone really needs 😄 — it's there to set the scene for which headers/body the class stuff is talking about.

So for formats that have an EOF marker (BAM, BCF, and CRAM; not non-bgzipped SAM or VCF), this text spells out that the complete file datastream will have one.

For a class=header request for say BAM, the response will encompass BAM header and maybe an EOF marker. For a server implemented using HTSlib, it might write the response using hts_open/ sam_hdr_write/ hts_close in which case hts_close will write an EOF marker. Or it might hold the relevant stream open across multiple requests and only use sam_hdr_write, in which case there will be no EOF marker.

Hmmm… I guess if we're always returning a ticket for a class=header request, the server will have no difficulty in always putting a data:,<EOF-marker> URL at the end of the ticket. So we could say the response will be a BAM/etc header that SHOULD be followed by an EOF marker (for formats that have them).

mlin · 2019-02-20T20:20:46Z

Servers might want to return the BAM/etc headers directly rather than returning a ticket and needing an extra round trip. Should this be allowed? If so, we might mention it here and we would need to relax the “All API invocations are made to a configurable HTTP(S) endpoint, receive URL-encoded query string parameters, and return JSON output” text in Protocol essentials.

We have the way of returning a data: URI inside the ticket, this requires base64 encoding and decoding it but that shouldn't be too bad depending on the header size (and if it's very large, in principle the server can still offload it). Does that satisfy this idea?

Should we specify whether the class=header response has an EOF marker? IMHO we should probably be explicit that it might or might not.

I suppose the worry is that the response includes header & EOF marker, and the client later tries to be clever by reusing a large header by concatenation with body responses of subsequent requests, and ends up with an EOF marker in the middle of their file. For BAM and VCF this might actually be OK, in view of how the EOF marker is just gzip of the empty string. The CRAM EOF container might be more likely to cause problems, though.

On the other hand, including the EOF preserves the property that the concatenated response is a fully-formed file of the requested format, which is nice. Perhaps the clever client described above should be expected to be so clever as to detect and strip the EOF bytes from the header response, if they're present.

jeromekelleher · 2019-02-21T08:39:04Z

On the other hand, including the EOF preserves the property that the concatenated response is a fully-formed file of the requested format, which is nice. Perhaps the clever client described above should be expected to be so clever as to detect and strip the EOF bytes from the header response, if they're present.

I think this a good general principle, and the clever client should be expected to be clever enough to know about the EOF markers.

mlin · 2019-04-17T05:01:01Z

htsget.md

+`header`
+</td><td>
+
+Request the SAM/CRAM/VCF headers only.


@jmarshall What do you think of adding "(and EOF marker)" here? Useful clarification, thanks for bringing it up earlier.

mlin · 2019-04-18T04:36:14Z

The discussion on today's call leaned toward merging this and leaving the EOF as something we can clarify later if it's an issue. Rob Davies reported that the htslib CRAM reader isn't misled by having a CRAM EOF marker in the middle of the file, and even that is something that a carefully written client should be able to avoid.

I will merge this in another day if there are nothing further is raised. Thanks all!

jmarshall · 2019-04-25T07:29:04Z

Mike, when merging it would be good to remember things like #322 (comment) that had been agreed on. I thought you accepted my offer during the meeting to prepare this PR accordingly…

jeromekelleher · 2019-04-25T10:58:12Z

Mike, when merging it would be good to remember things like #322 (comment) that had been agreed on. I thought you accepted my offer during the meeting to prepare this PR accordingly…

This was my understanding as well. It probably doesn't make much difference for a low-traffic repo like this, but a clean and linear(ish) git history is worth having.

jmarshall · 2019-04-25T11:29:56Z

I've pushed a small formatting fix that was in what I was preparing.

Also, we probably ought to bump the spec version number for this.

jmarshall · 2019-04-25T11:53:21Z

And a second formatting fix. This is the sort of thing that ought to be caught by previewing the Jekyll formatting while doing the final PR merge or at least by checking the published website afterwards.

mlin · 2019-04-28T16:58:51Z

OK, sorry all, I evidently mis-remembered the action assignment. Let me know how I can help clean up. I agree on bumping the spec version number.

jmarshall · 2019-05-02T12:01:21Z

It doesn't much matter who does what (as long as attributions are maintained), but merging to master does need to be done with care and attention. You're the only one empowered to push to htsget.md on master and IMHO the checklist should include:

Read over the PR conversation to check that nothing has been omitted;
Check the kramdown formatting of the PR as merged (which would usually be done by previewing locally with jekyll serve);
Check the GitHub repository formatting of the PR as it will be merged (which can be approximated with “View file” from the PR's Files changed tab or by pushing a temporary branch containing the eventual commit(s) to your own fork).

This pretty much means you get to do this locally rather than by using GitHub's UI merge buttons.

Let me know how I can help clean up.

Fortunately nothing else has yet been pushed to master, so you could clean this up by bumping the version number in/via an explicit merge commit so that the whole PR arrives on master as a unit. (This will make for a bubble in the history, but that's probably the lesser of two evils…)

In particular: reset your master to ad374ab and merge --no-ff --edit the 45c9feb branch into it with an appropriate commit message, either bumping the version number in that merge commit or by committing the bump to the branch first.

jmarshall · 2019-05-07T09:39:32Z

@mlin: It's not the end of the world for the git history if this is not cleaned up, but you asked for advice on how to clean up. You have only a limited amount of time to do so, as other specifications want to carry on with their work.

In particular, the SAM people want to commit f48b522. If you wished, you could use this to clean up nicely: check out the sam-tp branch, git merge --no-commit master (where master == 45c9feb or that plus a commit bumping the htsget version) into it with an appropriate commit message, either bumping the version number in that merge commit or by committing the bump to your local master first. Then fast-forward master to that merge commit and push it, and you will have committed to origin/master this PR as a tidy branch merge and the small SAM change appropriately timed within its bubble.

@cyenyxe

Principal drafters @cyenyxe @jmarshall

mlin · 2019-05-14T03:31:47Z

OK, I took your first suggestion as I didn't feel I myself ought to effect merger of the SAM change. Thanks for that!

Policywise, I wouldn't claim to be "the only one empowered to push to htsget.md on master" and when I ask "Let me know how I can help" I wouldn't add "and no one else may do anything to master until I decide." I wouldn't think such rigidity called for in our small, volunteer contributor group. Somebody has to be the default person to coordinate (and follow the helpful checklist), but if it's broke, fix it!

jeromekelleher · 2019-05-14T08:45:51Z

Policywise, I wouldn't claim to be "the only one empowered to push to htsget.md on master" and when I ask "Let me know how I can help" I wouldn't add "and no one else may do anything to master until I decide." I wouldn't think such rigidity called for in our small, volunteer contributor group. Somebody has to be the default person to coordinate (and follow the helpful checklist), but if it's broke, fix it!

I agree with @mlin here --- once there's a consensus reached on a PR, anyone with write permissions on the repo should feel free to merge (following the very helpful checklist from @jmarshall above), so we don't need to block on Mike's limited availability. I'm assuming you have write permission on this repo @jmarshall , and I (for one) would be perfectly happy with you taking on the PR merge responsibility if it was a PR you're involved in.

jmarshall · 2019-05-14T09:06:16Z

Mike is the sole htsget maintainer listed in MAINTAINERS.md, and LSG leadership has previously castigated people for not respecting this. If you want a different policy, you'll have to promulgate it either in MAINTAINERS.md or on the mailing list.

jeromekelleher · 2019-05-14T09:14:06Z

Mike is the sole htsget maintainer listed in MAINTAINERS.md, and LSG leadership has previously castigated people for not respecting this. If you want a different policy, you'll have to promulgate it either in MAINTAINERS.md or on the mailing list.

Fair enough - sounds like a discussion point for the next meeting then.

Added class to each URL in the response

3eaeabd

cyenyxe added the htsget label Jun 7, 2018

jeromekelleher mentioned this pull request Jun 13, 2018

Header and data boolean flags for htsget. #311

Closed

jmarshall force-pushed the feature/url-class branch from edcea92 to 3eaeabd Compare July 12, 2018 09:40

jmarshall mentioned this pull request Jul 12, 2018

[DRAFT] Cache control for re-using previously-downloaded headers #325

Draft

Cristina Yenyxe Gonzalez Garcia added 2 commits September 7, 2018 15:09

All or none URL class in the response

8ae6f01

URL class as request parameter

1ece917

cyenyxe requested review from jmarshall and mlin September 7, 2018 14:23

mlin reviewed Sep 16, 2018

View reviewed changes

Cristina Yenyxe Gonzalez Garcia added 3 commits November 29, 2018 11:01

Clarification about class attribute in response

c9968e9

Capitalization according to RFC 2119

82791f1

Description of 'class' query parameter

677e158

mlin mentioned this pull request Jan 22, 2019

htsget start-end #369

Closed

jmarshall requested changes Jan 23, 2019

View reviewed changes

jmarshall approved these changes Jan 28, 2019

View reviewed changes

jmarshall mentioned this pull request Feb 20, 2019

Add POST method to htsget #285

Merged

mlin reviewed Apr 17, 2019

View reviewed changes

mlin added 2 commits April 22, 2019 11:19

Merge branch 'master' into feature/url-class

3cfc844

revert my merge error

d13b103

mlin merged commit 476afba into samtools:master Apr 24, 2019

mlin added a commit that referenced this pull request May 14, 2019

Merge htsget #322 ('class' attribute, formatting corrections, v1.2.0)

f62c928

Principal drafters @cyenyxe @jmarshall

jmarshall mentioned this pull request Jun 15, 2021

Is the variants endpoint of htsget protocal supported? igvteam/igv.js#1187

Closed

Added class to URLs in the response #322

Added class to URLs in the response #322

Conversation

cyenyxe commented Jun 7, 2018

jeromekelleher commented Jun 11, 2018

jmarshall commented Jun 11, 2018

jeromekelleher commented Jun 11, 2018

jmarshall commented Jun 12, 2018 • edited Loading

mlin Sep 16, 2018

Choose a reason for hiding this comment

mlin Sep 16, 2018

Choose a reason for hiding this comment

cyenyxe Nov 28, 2018

Choose a reason for hiding this comment

jmarshall Nov 28, 2018

Choose a reason for hiding this comment

cyenyxe Nov 29, 2018 • edited Loading

Choose a reason for hiding this comment

jmarshall Nov 29, 2018 • edited Loading

Choose a reason for hiding this comment

cyenyxe Nov 29, 2018

Choose a reason for hiding this comment

jmarshall Nov 29, 2018

Choose a reason for hiding this comment

jmarshall commented Sep 24, 2018 • edited Loading

jeromekelleher commented Sep 25, 2018

mlin commented Sep 25, 2018 • edited Loading

jmarshall commented Sep 25, 2018

cyenyxe commented Nov 28, 2018 • edited Loading

jmarshall commented Nov 28, 2018

cyenyxe commented Nov 30, 2018

mlin commented Jan 23, 2019

jmarshall commented Jan 23, 2019

jmarshall commented Jan 23, 2019 • edited Loading

jmarshall left a comment

Choose a reason for hiding this comment

jeromekelleher commented Jan 23, 2019

cyenyxe commented Jan 23, 2019 • edited Loading

mlin commented Feb 20, 2019 • edited Loading

jeromekelleher commented Feb 20, 2019

jmarshall commented Feb 20, 2019 • edited Loading

mlin commented Feb 20, 2019

jeromekelleher commented Feb 21, 2019

mlin Apr 17, 2019

Choose a reason for hiding this comment

mlin commented Apr 18, 2019

jmarshall commented Apr 25, 2019 • edited Loading

jeromekelleher commented Apr 25, 2019

jmarshall commented Apr 25, 2019

jmarshall commented Apr 25, 2019

mlin commented Apr 28, 2019

jmarshall commented May 2, 2019

jmarshall commented May 7, 2019

mlin commented May 14, 2019

jeromekelleher commented May 14, 2019

jmarshall commented May 14, 2019

jeromekelleher commented May 14, 2019

jmarshall commented Jun 12, 2018 •

edited

Loading

cyenyxe Nov 29, 2018 •

edited

Loading

jmarshall Nov 29, 2018 •

edited

Loading

jmarshall commented Sep 24, 2018 •

edited

Loading

mlin commented Sep 25, 2018 •

edited

Loading

cyenyxe commented Nov 28, 2018 •

edited

Loading

jmarshall commented Jan 23, 2019 •

edited

Loading

cyenyxe commented Jan 23, 2019 •

edited

Loading

mlin commented Feb 20, 2019 •

edited

Loading

jmarshall commented Feb 20, 2019 •

edited

Loading

jmarshall commented Apr 25, 2019 •

edited

Loading