Throw on unrecognized CRAM encoding tag. #593
vadimzalunin was assigned by cmnbroad on Apr 27, 2016
@vadimzalunin can you review this and make sure it makes sense?
Hm. 4 of the tests fail with this change because the code detects an unrecognized encoding tag, "BB_bases" (test_processMultiContainer). Maybe there is a deeper problem. Deferring to @vadimzalunin.
yfarjoun added the Review-party candidate label on Jun 28, 2016
@vadimzalunin Can you have a look at this, and give us your thoughts on why this simple patch has caused tests to fail?
Data series may be optional, and this can be implemented in two ways:
The proposed change basically prohibits 1), so we would require all data series to be present even if not used. But I think this does not solve any problems, because data series can still be effectively missing via 2). I would suggest replacing the first exception with a warning. Would it be useful to take validation stringency into account? The second exception, on line 174, should be OK to throw, as in most cases this means the implementation is outdated.
PS: The decision for silent decoding was due to the TM encoding tag, I think: the C and Java implementations conflicted on this some time ago, and it was decided to ignore or warn about unknown data series. This also provides some forward compatibility if new data series are added.
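To make the stringency suggestion concrete, here is a rough sketch of how an unknown tag could be handled at three strictness levels. Everything in it is an assumption for illustration: the `UnknownSeriesPolicy` class, its `Stringency` enum, the `accept` method, and the made-up tag "XX" are not taken from the CRAM code, and the known-tag set is only a small subset.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class UnknownSeriesPolicy {
    /** Hypothetical stand-in for a validation stringency setting. */
    enum Stringency { STRICT, LENIENT, SILENT }

    /** Illustrative subset of data series tags, not the full list. */
    static final Set<String> KNOWN_TAGS =
            new HashSet<>(Arrays.asList("BF", "CF", "RN", "QS"));

    /** Collected warnings; a real reader would use its logger instead. */
    static final List<String> WARNINGS = new ArrayList<>();

    /** Returns true for a known tag; otherwise throws, warns, or stays silent. */
    static boolean accept(String tag, Stringency stringency) {
        if (KNOWN_TAGS.contains(tag)) {
            return true;
        }
        switch (stringency) {
            case STRICT:
                throw new IllegalStateException("Unrecognized data series tag: " + tag);
            case LENIENT:
                WARNINGS.add("Ignoring unrecognized data series tag: " + tag);
                return false;
            default: // SILENT: skip the tag without reporting it
                return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(accept("RN", Stringency.STRICT));
        System.out.println(accept("XX", Stringency.LENIENT));
        System.out.println(WARNINGS);
    }
}
```

With something like this, the forward-compatible behaviour (warn and skip) stays available under a lenient setting, while a strict setting surfaces an outdated implementation immediately.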
@vadimzalunin Thanks - that helps a bit. I was under the impression that the data series were all required (maybe I just made that assumption since the spec enumerates them and I don't recall it saying they were optional). Reverting the throw in the DataFactory code does allow the tests to pass - I'm not sure why we'd want to leave a warning there, though, if it's really the case that ALL data series are optional - we'd be warning about something harmless. If, on the other hand, some data series really are required, maybe we should add an optional/required tag to the DataSeries annotation and use it to decide how to handle the missing case. It sounds like we should leave the throw in the CompressionHeader code path, though - that's the case where we see a data series in the input stream that the code can't handle. Is the list of data series fixed for a given CRAM version? I don't, for example, see a data series tag named "BB" in the spec anywhere, though the code seems to know about it.
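As a rough illustration of the optional/required idea, the sketch below uses a stand-in annotation with a `required` attribute and checks it by reflection. The names `RequiredSeriesCheck`, `Series`, `Reader`, and `missingRequired` are hypothetical and do not mirror the actual DataSeries annotation; which series are marked required here is arbitrary, since that question is still open.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Field;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Set;

class RequiredSeriesCheck {
    /** Hypothetical annotation carrying a tag and a required flag. */
    @Retention(RetentionPolicy.RUNTIME)
    @Target(ElementType.FIELD)
    @interface Series {
        String tag();
        boolean required() default false; // optional unless stated otherwise
    }

    /** Hypothetical reader; the required/optional choices are arbitrary. */
    static class Reader {
        @Series(tag = "BF", required = true) int bitFlags;
        @Series(tag = "RN") String readName;
    }

    /** Lists tags that are marked required but absent from the header. */
    static List<String> missingRequired(Class<?> type, Set<String> presentTags) {
        List<String> missing = new ArrayList<>();
        for (Field field : type.getDeclaredFields()) {
            Series series = field.getAnnotation(Series.class);
            if (series != null && series.required() && !presentTags.contains(series.tag())) {
                missing.add(series.tag());
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        // Only "RN" is present, so the required "BF" is reported as missing.
        System.out.println(missingRequired(Reader.class, Collections.singleton("RN")));
    }
}
```

A caller could then throw only when the returned list is non-empty, and stay quiet about optional series that are simply absent.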
@cmnbroad the question about required data series is still open. Compare this, for example, with a BAM file where all bases and quality scores are missing (replaced with a single star symbol). Another odd example would be all reads having the same name (effectively no name); perhaps the user is not interested in read identities. These sorts of tricks are OK for low-level BAM parsers, but they may fail higher-level validation checks.
cmnbroad was assigned by yfarjoun on Aug 9, 2016
@cmnbroad The ball seems to be in your court.
I think this reflects the changes we settled on and should be ready to merge now.
cmnbroad commented Apr 27, 2016 (edited)
Description
Fixes #549. Change the CRAM reader to throw when it sees an encoding tag it doesn't recognize.
Explain the motivation for making this change. What existing problem does the pull request solve?
The CRAM code currently logs and ignores unrecognized encoding tags, but we really should throw if we ever see one of these.
This is a purely defensive change - I don't really know of a way to force this condition.
Checklist