Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bam->CRAM loses MQ, start, cigar on unmapped reads #714

Open
cmnbroad opened this issue Sep 22, 2016 · 7 comments
Open

Bam->CRAM loses MQ, start, cigar on unmapped reads #714

cmnbroad opened this issue Sep 22, 2016 · 7 comments
Labels

Comments

@cmnbroad
Copy link
Collaborator

In the case where a read is unmapped but has a non-zero mapping quality (!), the mapping quality (and alignment start, if any) are dropped on round trip through CRAM. It appears that the actual values are in fact maintained in the CRAM stream, but the code just doesn't restore them in the unmapped case.

This was discovered in the effort to try to characterize what we can expect in round trip fidelity for Bam->CRAM-Bam. Admittedly probably a rare case, but do we need to drop these values if they;re there ?

See the file ce#unmapp.sam in the cram test folder (in the src/test/resources/htsjdk/samtools/cram test folder).

@vadimzalunin Any thoughts on if this is required for any reason - it looks like a pretty minor change would fix it.

@yfarjoun
Copy link
Contributor

This would be great as it is affecting a project I'm working on....:+1:

On Thu, Sep 22, 2016 at 4:42 PM, Chris Norman notifications@github.com
wrote:

In the case where a read is unmapped but has a non-zero mapping quality
(!), the mapping quality (and alignment start, if any) are dropped on round
trip through CRAM. It appears that the actual values are in fact maintained
in the CRAM stream, but the code just doesn't restore them in the unmapped
case.

This was discovered in the effort to try to characterize what we can
expect in round trip fidelity for Bam->CRAM-Bam. Admittedly probably a rare
case, but do we need to drop these values if they;re there ?

See the file ce#unmapp.sam in the cram test folder (in the
src/test/resources/htsjdk/samtools/cram test folder).

@vadimzalunin https://github.com/vadimzalunin Any thoughts on if this
is required for any reason - it looks like a pretty minor change would fix
it.


You are receiving this because you are subscribed to this thread.
Reply to this email directly, view it on GitHub
#714, or mute the thread
https://github.com/notifications/unsubscribe-auth/ACnk0kPZhyeKrh0xeY3mMxfUn2fNHC3Qks5qsugkgaJpZM4KEVu4
.

@cmnbroad
Copy link
Collaborator Author

On a closer look, it appears that for unmapped segments, the mapping quality is NOT emitted to the stream, so the fix is probably not as simple as I originally thought.

@vadimzalunin
Copy link
Contributor

This was done deliberately i'm afraid - non-zero mapping quality score as well as non-empty CIGAR string on unmapped reads do not seem valid to me.

Here is from the SAM specs:

Bit 0x4 is the only reliable place to tell whether the read is unmapped. If 0x4 is set, no assumptions
can be made about RNAME, POS, CIGAR, MAPQ, and bits 0x2, 0x100, and 0x800.

POS is the only field that sort of makes sense to allow for half-mapped pairs stored together.

So the case is in the grey area where BAM format as transport just doesn't care but validators would probably classify such reads as invalid or raise a warning.
What do others think? I understand the desire for easy roundtrip tests but this alone should not justify loose validation.

@yfarjoun
Copy link
Contributor

I think that a file format should not be in the business of validating the
spec. a warning would be fine, but changing the data should be avoided as
much as possible.

On Tue, Sep 27, 2016 at 5:18 AM, Vadim Zalunin notifications@github.com
wrote:

This was done deliberately i'm afraid - non-zero mapping quality score as
well as non-empty CIGAR string on unmapped reads do not seem valid to me.

Here is from the SAM specs:

Bit 0x4 is the only reliable place to tell whether the read is unmapped.
If 0x4 is set, no assumptions
can be made about RNAME, POS, CIGAR, MAPQ, and bits 0x2, 0x100, and 0x800.

POS is the only field that sort of makes sense to allow for half-mapped
pairs stored together.

So the case is in the grey area where BAM format as transport just doesn't
care but validators would probably classify such reads as invalid or raise
a warning.
What do others think? I understand the desire for easy roundtrip tests but
this alone should not justify loose validation.


You are receiving this because you commented.
Reply to this email directly, view it on GitHub
#714 (comment),
or mute the thread
https://github.com/notifications/unsubscribe-auth/ACnk0kmr7J7R6BJkCVk3iX00kHXkxqDAks5quN9ggaJpZM4KEVu4
.

@jmarshall
Copy link
Member

jmarshall commented Sep 27, 2016

This is a demonstration of the argument we had a while back as to whether the spec should say that meaningless fields (like RNAME/POS/CIGAR/MAPQ for unmapped reads) are undefined / don't care, or whether it should prescribe particular dummy placeholder values to be used. Unfortunately the former viewpoint won and is now coming home to roost.

Me, I am 👍 with @vadimzalunin. You can expect to roundtrip if your meaningless unmapped MAPQ fields use the sensible 0 placeholder value (and we should add that to the best practices section of the spec). Insisting that meaningless nonzero unmapped MAPQ values roundtrip successfully would over-constrain the formats.

@vadimzalunin
Copy link
Contributor

@yfarjoun A file format surely relies on data validity, like bases and scores length must match - text SAM for example can handle it.

Are there any examples of valuable data stored in RNAME/CIGAR/MAPQ fields for unmapped reads somewhere?

Is this just for testing or we have to prove losslessness :D to people that don't accept explanations? I can accept the latter but hopefully it is the former!

@nh13
Copy link
Member

nh13 commented Sep 27, 2016

See the best-practices section 2.4 for when RNAME (and POS) are meaningful, namely placing an unmapped read next to its mapped mate. Its cigar may be meaningful if hard-clipping exists, though that is neither in the spec nor the best practices. Nonetheless, I agree that it is perfectly fine to return a 0 for the MAPQ unmapped reads when reading from CRAM, since the spec says no assumption can be made about the MAPQ for unmapped reads, and therefore an implementing format should be able to exploit that for its performance goals.

@cmnbroad cmnbroad changed the title Bam->CRAM loses mapping quality on unmapped reads Bam->CRAM loses MQ, start, cigar on unmapped reads Oct 7, 2016
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

5 participants