New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Bam->CRAM loses MQ, start, cigar on unmapped reads #714
Comments
This would be great as it is affecting a project I'm working on....:+1: On Thu, Sep 22, 2016 at 4:42 PM, Chris Norman notifications@github.com
|
On a closer look, it appears that for unmapped segments, the mapping quality is NOT emitted to the stream, so the fix is probably not as simple as I originally thought. |
This was done deliberately i'm afraid - non-zero mapping quality score as well as non-empty CIGAR string on unmapped reads do not seem valid to me. Here is from the SAM specs:
POS is the only field that sort of makes sense to allow for half-mapped pairs stored together. So the case is in the grey area where BAM format as transport just doesn't care but validators would probably classify such reads as invalid or raise a warning. |
I think that a file format should not be in the business of validating the On Tue, Sep 27, 2016 at 5:18 AM, Vadim Zalunin notifications@github.com
|
This is a demonstration of the argument we had a while back as to whether the spec should say that meaningless fields (like RNAME/POS/CIGAR/MAPQ for unmapped reads) are undefined / don't care, or whether it should prescribe particular dummy placeholder values to be used. Unfortunately the former viewpoint won and is now coming home to roost. Me, I am 👍 with @vadimzalunin. You can expect to roundtrip if your meaningless unmapped MAPQ fields use the sensible 0 placeholder value (and we should add that to the best practices section of the spec). Insisting that meaningless nonzero unmapped MAPQ values roundtrip successfully would over-constrain the formats. |
@yfarjoun A file format surely relies on data validity, like bases and scores length must match - text SAM for example can handle it. Are there any examples of valuable data stored in RNAME/CIGAR/MAPQ fields for unmapped reads somewhere? Is this just for testing or we have to prove losslessness :D to people that don't accept explanations? I can accept the latter but hopefully it is the former! |
See the best-practices section 2.4 for when RNAME (and POS) are meaningful, namely placing an unmapped read next to its mapped mate. Its cigar may be meaningful if hard-clipping exists, though that is neither in the spec nor the best practices. Nonetheless, I agree that it is perfectly fine to return a |
In the case where a read is unmapped but has a non-zero mapping quality (!), the mapping quality (and alignment start, if any) are dropped on round trip through CRAM. It appears that the actual values are in fact maintained in the CRAM stream, but the code just doesn't restore them in the unmapped case.
This was discovered in the effort to try to characterize what we can expect in round trip fidelity for Bam->CRAM-Bam. Admittedly probably a rare case, but do we need to drop these values if they;re there ?
See the file ce#unmapp.sam in the cram test folder (in the src/test/resources/htsjdk/samtools/cram test folder).
@vadimzalunin Any thoughts on if this is required for any reason - it looks like a pretty minor change would fix it.
The text was updated successfully, but these errors were encountered: