Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bam record should validate sequence and quality length before writing #59

Closed
veldsla opened this issue Nov 23, 2021 · 2 comments · Fixed by #61
Closed

Bam record should validate sequence and quality length before writing #59

veldsla opened this issue Nov 23, 2021 · 2 comments · Fixed by #61
Labels

Comments

@veldsla
Copy link
Contributor

veldsla commented Nov 23, 2021

In the encoded bam there is one field that encodes the length for both the sequence and the base qualities. When writing a bam::Record this is not validated resulting in possibly corrupt BAM files.

According to the bam spec when base qualities are omitted 0xFF should be written instead.

This can be done in the writer code, but it's probably cleaner to do a validate function on the record? Should I do a PR?

@zaeleus
Copy link
Owner

zaeleus commented Nov 23, 2021

Sure, if you'd like to do this, let'd do it in the BAM record writer. The SAM record writer can be used as reference:

// § 4.2.3 SEQ and QUAL encoding (2021-06-03)
let sequence = record.sequence();
let quality_scores = record.quality_scores();
if !sequence.is_empty() {
write_seq(writer, sequence)?;
if sequence.len() == quality_scores.len() {
write_qual(writer, quality_scores)?;
} else if quality_scores.is_empty() {
for _ in 0..sequence.len() {
writer.write_u8(NULL_QUALITY_SCORE)?;
}
} else {
return Err(io::Error::new(
io::ErrorKind::InvalidInput,
format!(
"quality scores length mismatch: expected {}, got {}",
sequence.len(),
quality_scores.len()
),
));
}
}
.

@zaeleus zaeleus added the bam label Nov 23, 2021
@veldsla
Copy link
Contributor Author

veldsla commented Nov 24, 2021

Looking at the specification there's another constraint that is enforced by the current version of samtools:

Sum of lengths of the M/I/S/=/X operations shall equal the length of SEQ.

It's probably best to also require this.

veldsla added a commit to veldsla/noodles that referenced this issue Nov 24, 2021
To avoid writing corrupt or incompatible SAM/BAM files an extra
validation step is applied before writing. This makes sure:
 - The number of base qualities is equal to the sequence length or empty
 - The length of the cigar operations matches the sequence length

Closes zaeleus#59
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
2 participants