Added class to URLs in the response #322
I think the idea is very elegant. My only issue here is that we may need to include more explanatory text. In #311 I tried to lay out the assumptions that we're making about the file formats, but maybe that was overboard. What do others think? |
I was mulling over explanatory text over the weekend, and can write something up if @cyenyxe prefers. Should this be optional so that existing implementations are still compliant? Leaving |
Good question. I think we have to support older implementations that don't know about this, so the burden is on the client. I think you're right: either ALL urls should have a class attribute, or none of them should have it. |
I've moved the suggestion of some explanatory text and the Cache-Control text to a separate PR — see #325. You may wish to mine that for its explanatory text. |
htsget.md
Outdated
`class` | ||
_optional string_ | ||
</td><td> | ||
A list of URL classes to include, see below. |
I'd add the briefest explanation here too such as "allowing client to request the data header only, or to opt-out of receiving the header in subsequent requests"
htsget.md
Outdated
`class` | ||
_string_ | ||
</td><td> | ||
For file formats whose specification describes a header and a body, the class indicates which of the two will be retrieved when querying this URL. Either all or none of the URLs in the response must have a class attribute. The allowed values are `header` and `body`. |
For explicitness, I'd add something like "if class attributes are absent, client should assume data blocks include both header and body, possibly mixed"
I don't think this is a good idea. For instance, the EVA server provides separate URLs for headers and body by default, to facilitate downloading them separately even when the `class` field isn't available.
But if the server doesn't annotate the URLs with `class` fields, there is no way for the client to know that the EVA server is doing that. The text @mlin suggests just makes explicit the reality that clients can't make assumptions when `class` fields are missing.
Maybe I misunderstood what @mlin means. Which of these 2 is it?
- Each URL could contain header or body
- Each URL could contain header and body
If it is the latter, then a client would have to split the contents from each URL in order to build a valid output file. If it is the former, then I will think about a rewording that is completely unambiguous.
If `class` fields are absent, the client can make no assumptions about the contents of individual ticket URLs or the boundaries between their contents: each URL could contain headers, body records, partial headers or records, or both.¹
(Consider e.g. a request with no referenceName/start/end thus a request for the entire file, and the server just returns a ticket that chops it up into 1 Mb chunks.)
It's explicit (in the diagram of core mechanic section) that the client proceeds as if it's concatenating the contents of all the ticket URLs in order, to get the full file contents. It might choose to avoid redownloading a URL or two because it's already got it (effectively) cached, but that's its business. I don't quite see what you're getting at about the “client [having] to split the contents”…?
¹ Actually perhaps some implementations would like data records or compression blocks not to be split across ticket URLs, but at present I don't think the protocol says anything about that — so such splitting is allowed.
Thanks for the explanation, I hadn't considered how they could be mixed due to the fixed-size chunks 😅 Now it's all clear.
I thought it meant that all the URLs could contain header and body, which would make individual blocks correct, but not the response as a whole.
Oh I see 😄 — yes: the client can't assume anything about the individual URL contents, but it is safe in expecting that the whole concatenated response is a valid header-body-body-…-body-EOFtrailer stream.
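To make the rule discussed in this thread concrete, here is a minimal client-side sketch. It assumes the ticket's `urls` entries are already parsed into Python dicts with an optional `class` key; the function name `urls_to_fetch` and the `have_header` flag are hypothetical names used only for illustration, not part of the spec text.

```python
def urls_to_fetch(urls, have_header=False):
    """Return the ticket URL objects a client should download, in ticket order.

    `urls` is the list of URL objects from a ticket (assumed already parsed).
    `have_header` is a hypothetical flag meaning the client already holds the header.
    """
    classes = [u.get("class") for u in urls]

    if all(c is None for c in classes):
        # No class fields at all: the client can assume nothing about the
        # contents of individual URLs, so it must fetch and concatenate them all.
        return list(urls)

    if any(c is None for c in classes):
        # The proposed text requires all-or-none.
        raise ValueError("either all or none of the URLs must have a class")

    unknown = sorted({c for c in classes if c not in ("header", "body")})
    if unknown:
        raise ValueError(f"unexpected class values: {unknown}")

    if have_header:
        # Safe to skip header URLs only because every URL is annotated.
        return [u for u in urls if u["class"] != "header"]
    return list(urls)
```

Whatever subset is skipped, the remaining blocks are still concatenated in ticket order.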
The original functionality desired was to enable requesting just the headers (see #311 (comment)). Later we decided it would also be useful to enable clients to avoid re-downloading headers they already have, but this is a comparatively minor piece of functionality. The elegance of the So IMHO |
I'd support this in the spirit of us making the smallest possible change that's useful --- particularly if we can get it implemented. If we have an active server I'm happy to put together a client interface. |
Opting-out of the header should be useful for RNA-seq data where the header is liable to include all the transcript IDs mapped to. However, I support the principle of simplifying a prototype implementation. |
@mlin: This PR now adds two related facilities. The main point of this PR, “Add[ing] class to URLs in the response“ is that such a request for RNA-seq data would get an htsget ticket back that says
and it would be able to opt out of getting the header (if it wished) by not bothering to download the In addition to this, since September 7th the PR also adds a So not only is |
I feel it would be confusing to have a query parameter and response field with the same name, but only allowing a subset of values to be requested. What are your thoughts about using a parameter like |
I don't think it is confusing. Describing the query parameter as “Use IMHO it is elegant for the protocol to use the same |
I hope the last changes address all the concerns about lack of clarity, please let me know if that is not the case. |
I've made an initial implementation of this in https://github.com/dnanexus-rnd/htsnexus/pull/29/files and think it looks very nice. Great job @cyenyxe @jmarshall @jeromekelleher! I think we can ask for +1s on this on the call today. |
This describes the new
|
Re the query parameter
IMHO we should write this parameter's description something like the following, to facilitate future additions:
|
👍 to the design, but text can be improved as per recent comments.
I basically agree to everything that @jmarshall has put in above. Perhaps he could open a PR on @cyenyxe's feature branch with these changes so that they will show up here in this PR? (Git history minutia: since both John and Cristina have put in so much here, I think it's probably fair if we have two commits, one authored by each, representing the actual history of the feature.) |
Given that it is just a change in wording, and I wasn't too convinced about mine to begin with, I am happy with @jmarshall pushing to my branch without a PR. Please let me know if you don't have permissions to do so. |
Describe the class=header request query parameter more fully. Add description of using response URL classes to the "diagram of core mechanic" section. [SQUASH INTO PREVIOUS] Add "optional", some wordsmithing, and add blank lines so formatting works when displayed in the GH repo too.]
bumping @jmarshall questions
|
Weighing in here:
I think we should always return a ticket, the alternative seems like it would cause a lot of issues.
Would you mind explaining when you think we would get an EOF marker @jmarshall? It's not clear to me what the tradeoffs are here. |
Re returning headers directly, @daviesrob's point (made during the meeting) that clients expecting a ticket might be greatly surprised to get something that's not a ticket is a very good one. Re EOF markers (as in the 28-byte empty BGZF block at the end of BGZFed files, and the CRAM equivalent): The proposed So for formats that have an EOF marker (BAM, BCF, and CRAM; not non-bgzipped SAM or VCF), this text spells out that the complete file datastream will have one. For a Hmmm… I guess if we're always returning a ticket for a |
We have the way of returning a
I suppose the worry is that the response includes header & EOF marker, and the client later tries to be clever by reusing a large header by concatenation with body responses of subsequent requests, and ends up with an EOF marker in the middle of their file. For BAM and VCF this might actually be OK, in view of how the EOF marker is just gzip of the empty string. The CRAM EOF container might be more likely to cause problems, though. On the other hand, including the EOF preserves the property that the concatenated response is a fully-formed file of the requested format, which is nice. Perhaps the clever client described above should be expected to be so clever as to detect and strip the EOF bytes from the header response, if they're present. |
I think this is a good general principle, and the clever client should be expected to be clever enough to know about the EOF markers. |
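As an illustration of what such a "clever client" might do for BGZF-based formats, here is a minimal sketch that strips a trailing 28-byte empty BGZF block from a downloaded header before reusing it. The function name `strip_bgzf_eof` is hypothetical, and the sketch does not handle the CRAM EOF container, which is a different byte sequence.

```python
# The 28-byte empty BGZF block used as an EOF marker, as given in the SAM/BAM
# specification. Split into 4-byte groups for readability.
BGZF_EOF = bytes.fromhex(
    "1f8b0804" "00000000" "00ff0600" "42430200"
    "1b000300" "00000000" "00000000"
)


def strip_bgzf_eof(data: bytes) -> bytes:
    """Return `data` without a trailing BGZF EOF block, if one is present.

    Only the final 28 bytes are examined; data without the marker is
    returned unchanged.
    """
    if data.endswith(BGZF_EOF):
        return data[: -len(BGZF_EOF)]
    return data
```

A client reusing a cached header would call this on the header bytes before prepending them to the body blocks of a later response, so that no EOF marker ends up in the middle of the reassembled file.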
`header` | ||
</td><td> | ||
|
||
Request the SAM/CRAM/VCF headers only. |
@jmarshall What do you think of adding "(and EOF marker)" here? Useful clarification, thanks for bringing it up earlier.
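For reference, a headers-only request using the `class=header` query parameter could be built as in the sketch below. The base URL, the record ID, and the helper name `header_request_url` are placeholders chosen for illustration; the `reads` endpoint name is an assumption about the server layout.

```python
from urllib.parse import urlencode


def header_request_url(base_url, record_id, endpoint="reads"):
    """Build a ticket request URL asking for the headers only.

    `base_url`, `record_id` and `endpoint` are illustrative placeholders.
    """
    query = urlencode({"class": "header"})
    return f"{base_url.rstrip('/')}/{endpoint}/{record_id}?{query}"
```

The response to such a request is still a ticket, in line with the discussion above; the client then fetches the ticket's URLs as usual.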
The discussion on today's call leaned toward merging this and leaving the EOF as something we can clarify later if it's an issue. Rob Davies reported that the htslib CRAM reader isn't misled by having a CRAM EOF marker in the middle of the file, and even that is something that a carefully written client should be able to avoid. I will merge this in another day if nothing further is raised. Thanks all! |
Mike, when merging it would be good to remember things like #322 (comment) that had been agreed on. I thought you accepted my offer during the meeting to prepare this PR accordingly… |
This was my understanding as well. It probably doesn't make much difference for a low-traffic repo like this, but a clean and linear(ish) git history is worth having. |
I've pushed a small formatting fix that was in what I was preparing. Also, we probably ought to bump the spec version number for this. |
And a second formatting fix. This is the sort of thing that ought to be caught by previewing the Jekyll formatting while doing the final PR merge or at least by checking the published website afterwards. |
OK, sorry all, I evidently mis-remembered the action assignment. Let me know how I can help clean up. I agree on bumping the spec version number. |
It doesn't much matter who does what (as long as attributions are maintained), but merging to master does need to be done with care and attention. You're the only one empowered to push to htsget.md on master and IMHO the checklist should include:
This pretty much means you get to do this locally rather than by using GitHub's UI merge buttons.
Fortunately nothing else has yet been pushed to master, so you could clean this up by bumping the version number in/via an explicit merge commit so that the whole PR arrives on master as a unit. (This will make for a bubble in the history, but that's probably the lesser of two evils…) In particular: reset your master to ad374ab and |
@mlin: It's not the end of the world for the git history if this is not cleaned up, but you asked for advice on how to clean up. You have only a limited amount of time to do so, as other specifications want to carry on with their work. In particular, the SAM people want to commit f48b522. If you wished, you could use this to clean up nicely: check out the sam-tp branch, |
Principal drafters @cyenyxe @jmarshall
OK, I took your first suggestion as I didn't feel I myself ought to effect merger of the SAM change. Thanks for that! Policywise, I wouldn't claim to be "the only one empowered to push to htsget.md on master" and when I ask "Let me know how I can help" I wouldn't add "and no one else may do anything to master until I decide." I wouldn't think such rigidity called for in our small, volunteer contributor group. Somebody has to be the default person to coordinate (and follow the helpful checklist), but if it's broke, fix it! |
I agree with @mlin here --- once there's a consensus reached on a PR, anyone with write permissions on the repo should feel free to merge (following the very helpful checklist from @jmarshall above), so we don't need to block on Mike's limited availability. I'm assuming you have write permission on this repo @jmarshall , and I (for one) would be perfectly happy with you taking on the PR merge responsibility if it was a PR you're involved in. |
Mike is the sole htsget maintainer listed in MAINTAINERS.md, and LSG leadership has previously castigated people for not respecting this. If you want a different policy, you'll have to promulgate it either in MAINTAINERS.md or on the mailing list. |
Fair enough - sounds like a discussion point for the next meeting then. |
New `class` attribute for URLs in the htsget response, with supported values 'header' and 'body'.
In the example, I first tried writing the class after the body, but it was a bit confusing because the URL was the same and only the byte range changed. So I decided to write the class at the end of each object.