Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Unmapped read validation causes validation errors in specifications-compliant records. #1584

Open
d-cameron opened this issue Dec 3, 2021 · 3 comments

Comments

@d-cameron
Copy link
Contributor

Description of the issue:

Validation of specifications-compliant unmapped records raises validation failures.

The SAM specifications state:
• Bit 0x4 is the only reliable place to tell whether the read is unmapped. If 0x4 is set, no assumptionscan be made about RNAME, POS, CIGAR, MAPQ, and bits 0x2, 0x100, and 0x800."

Your environment:

htsjdk 2.23.0

Steps to reproduce

	@Test
	public void htsjdk_mapq_unmapped_validation_violates_sam_specifications() {
		SAMRecord r = new SAMRecord(null);
		r.setReferenceName("chr");
		r.setAlignmentStart(1);
		r.setMappingQuality(1);
		r.setCigarString("1M");
		r.setReadUnmappedFlag(true);
		r.setSupplementaryAlignmentFlag(true);
		r.setSecondaryAlignment(true);
		r.setProperPairFlag(true);
		List<SAMValidationError> errors = r.isValid();
		assertEquals(0, errors.size());
	}

Expected behaviour

Test case should pass. Maybe a warning is more appropriate?

Actual behaviour

4 unnecessary validation failures are raised:

0 = {SAMValidationError@1269} "ERROR::INVALID_FLAG_PROPER_PAIR:Proper pair flag should not be set for unpaired read."
1 = {SAMValidationError@1270} "ERROR::INVALID_FLAG_NOT_PRIM_ALIGNMENT:Secondary alignment flag should not be set for unmapped read."
2 = {SAMValidationError@1271} "ERROR::INVALID_FLAG_SUPPLEMENTARY_ALIGNMENT:Supplementary alignment flag should not be set for unmapped read."
3 = {SAMValidationError@1272} "ERROR::INVALID_MAPPING_QUALITY:MAPQ should be 0 for unmapped read."
d-cameron pushed a commit to PapenfussLab/gridss that referenced this issue Dec 3, 2021
@lbergelson
Copy link
Member

@d-cameron I think this is a place where someone decided to be more strict than is specified in order to catch things that are most likely caused by bugs elsewhere in the pipeline, and to reduce the space state we have to deal with. It looks like you're working around it, but it's not clear to me if this is a situation where there is value to you in maintaining the now meaningless flags. (I.e. Do you want to keep the mapping quality from before you clipped it?). Or is it just annoying that these technically valid reads are causing errors and it's a pain to have to reset the values manually?

You're definitely correct in that these values are allowed by the spec, but I read the intent of the spec to be more of "if unmapped is set than the listed values are meaningless and should be ignored" more than "if unmapped is set then any combination of these values is valid."

Relaxing these errors to a warning seems like a good potential solution.

@d-cameron
Copy link
Contributor Author

(I.e. Do you want to keep the mapping quality from before you clipped it?)

In this particular instance, it's bwa reporting a 50S50I alignment, I'm cleaning up the start/end indels thus it gets unmapped. The mapq is still as meaningful as it was before it became unmapped as there were 0 bases actually aligned to the ref and 0 afterwards.

SevenBridges do a similar thing where they graph align reads to a non-reference insertion which then gets mapped to 100I in the SAM output. 0 bases consumed by the reference (so according to the SAM specs the alignment position doesn't have a meaning) but the alignment location and mapq is meaningful.

DanielNaro added a commit to DanielNaro/gridss that referenced this issue Jan 26, 2022
@rwestram
Copy link

Currently I am working with SAM/BAM files containing circular consensus sequences generated with PacBio ccs. These sequences are always unmapped, so FLAG is always set to 4, RNAME is always set to * and MAPQ is always set to 255 (as documented by PacBio in their SMRT Tools Reference Guide, page 9).

While trying to read these files using the SAMRecordIterator (from
htsjdk 3.0.4), calling the next() method unfortunately triggers the following exception:
htsjdk.samtools.SAMFormatException: Error parsing text SAM file. MAPQ must be zero if RNAME is not specified; Line 1

The only possibility I've found so far to read these files, is to completely disable all validations using validationStringency(ValidationStringency.SILENT) (or LENIENT) on the reader factory. But completely switching off validation really feels uncomfortable to me, like missing all its benefits.

Is there another recommended way to read such files, which i am missing?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants