Throw on unrecognized CRAM encoding tag. #593

Merged
merged 1 commit into from Aug 23, 2016

Conversation

Projects
None yet
6 participants
Contributor

cmnbroad commented Apr 27, 2016 edited

Description

Fixes #549. Change the CRAM reader to throw when it sees an encoding tag it doesn't recognize.

Explain the motivation for making this change. What existing problem does the pull request solve?

The CRAM code currently logs and ignores unrecognized encoding tags, but we really should throw if we ever see one of these.

This is a purely defensive change - I don't really know of a way to force this condition.

Checklist

  • Code compiles correctly
  • New tests covering changes and new functionality
  • All tests passing
  • Extended the README / documentation, if necessary
Contributor

cmnbroad commented Apr 27, 2016

@vadimzalunin can you review this and make sure it makes sense ?

Contributor

cmnbroad commented Apr 27, 2016 edited

Hm. 4 of the tests fail with this change because the code detects an unrecognized encoding tag "BB_bases".

test_processMultiContainer
testCRAMContainerStreamWithIndex
testMappedAndUnmappedContainer
testMultirefContainer

Maybe there is a deeper problem. Deferring to @vadimzalunin .

Contributor

droazen commented Jun 28, 2016

@vadimzalunin Can you have a look at this, and give us your thoughts on why this simple patch has caused tests to fail?

droazen referenced this pull request Jun 28, 2016

Open

Review "Party" #548

Contributor

vadimzalunin commented Jul 13, 2016

Data series may be optional and this can be implemented in two way:

  1. data series is not mentioned in the data series encoding map (part of compression header)
  2. unused/empty data series encodings are explicitly set to NULL encoding.

The proposed change basically prohibits 1), so we require all data series to be present even if not used. But I think this does not solve any problems because data series can still be effectively missing via 2). I would suggest to replace the first exception with a warning.

Would it be useful to take into account validation stringency?

The second exception on line 174 should be ok to throw as in most cases this means the impl is outdated.

PS The decision for silent decoding was due to TM encoding tag I think: C and Java impls conflicted on this some time ago and it was decided to ignore/warn about unknown data series. This provides some forward compatibility as well if new data series are added.
In a way data series tags are similar to (generalization of) SAM tags, some of them have special meaning (like NM and MD) whereas by default SAM tags are not interpreted.

Contributor

cmnbroad commented Jul 13, 2016

@vadimzalunin Thanks - that helps a bit. I was under the impression that the data series were all required (maybe I just made that assumption since the spec enumerates them and I don't recall it saying they were optional). Reverting the throw in the DataFactory code does allow the tests to pass - I'm not sure why we'd want to leave a warning there though if its really that case that ALL data series are optional - we'd be warning about something harmless. If on the other hand some data series really are required, maybe we should add an optional/required tag to the DataSeries annotation to use to decide how to handle the missing case.

It sounds like we should leave the throw in the CompressionHeader code path, though - thats the case where we see a data series in the input stream that the code can't handle. Is it the case that the list of data series are fixed for a given CRAM version ? I don't, for example, see a data series tag named "BB" in the spec anywhere, though the code seems to know about it.

Contributor

vadimzalunin commented Jul 14, 2016

@cmnbroad the question about required data series is still open. Compare this for example with BAM file where all bases and quality scores are missing (replaced with a single star symbol). Another odd example would be all reads having the same name (effectively no name), perhaps user is not interested in read identities. These sort of tricks are ok for low-level BAM parsers but they may fail higher level validation checks.

coveralls commented Jul 26, 2016 edited

Coverage Status

Coverage decreased (-0.002%) to 68.454% when pulling 301f23d on cmnbroad:cn_fix_cram_encoder into 87b1e87 on samtools:master.

cmnbroad was assigned by yfarjoun Aug 9, 2016

Contributor

yfarjoun commented Aug 9, 2016

@cmnbroad The ball seems to be in your court.

@cmnbroad cmnbroad Throw on unrecognized CRAM encoding tag.
8cafd20

Coverage Status

Coverage decreased (-0.008%) to 68.799% when pulling 8cafd20 on cmnbroad:cn_fix_cram_encoder into c8202f2 on samtools:master.

Contributor

cmnbroad commented Aug 19, 2016

I think this reflects the changes we settled on and should be ready to merge now.

@lbergelson lbergelson merged commit 0d0da87 into samtools:master Aug 23, 2016

2 checks passed

continuous-integration/travis-ci/pr The Travis CI build passed
Details
coverage/coveralls Coverage decreased (-0.008%) to 68.799%
Details

cmnbroad deleted the cmnbroad:cn_fix_cram_encoder branch Mar 7, 2017

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment