fix(bam): subtract Phred+33 offset from QUAL before writing to BAM#38
Open
pinin4fjords wants to merge 2 commits into
Open
fix(bam): subtract Phred+33 offset from QUAL before writing to BAM#38pinin4fjords wants to merge 2 commits into
pinin4fjords wants to merge 2 commits into
Conversation
The BAM binary QUAL field stores raw Phred values (0-93) per SAM spec §4.2.3. rustar was passing FASTQ ASCII bytes (Phred+33 encoded, e.g. 'B' = ASCII 66 = Phred 33) straight to noodles' QualityScores::from, which expects raw Phred values. Every QUAL byte in a rustar BAM was therefore +33 above what the spec mandates; samtools view rendered the SAM text with another +33 on top, producing biologically impossible quality strings like 'cccgg...'. Subtract 33 from each FASTQ ASCII byte at every BAM-write call site before constructing the QualityScores. Use saturating_sub so a malformed FASTQ byte < 33 clamps to 0 rather than underflowing. samtools stats average_quality on the nf-core/rnaseq test profile now reads 35.3 (matching STAR) rather than the previous 68.3. Fixes scverse#34
Author
|
Verified end-to-end on macOS/aarch64 against the rebuilt fix branch.
Pre-fix on the same input the same first record showed The LGTM. |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
The BAM binary QUAL field stores raw Phred values (0-93) per SAM spec §4.2.3. rustar was passing FASTQ ASCII bytes (Phred+33 encoded) straight to
noodles::sam::QualityScores::from, which expects raw Phred. Result: every QUAL byte in a rustar BAM was +33 above the spec value;samtools viewre-added 33 for the SAM text rendering, yielding biologically impossible quality strings.samtools stats average_qualitywas reading 68.3 vs STAR's 35.3 on the nf-core/rnaseq test profile (consistent +33 across every sample);error_ratewas 0 (compounding the NM/nM bug in #29).Fix
Subtract 33 from each FASTQ ASCII byte at every BAM-write call site before constructing the
QualityScores. Usessaturating_subso malformed FASTQ bytes < 33 clamp to 0 rather than underflowing.Three call sites in
src/io/sam.rs(SE genome, PE genome, transcriptome) plus the doc comment onEncodedRead.qualityinsrc/io/fastq.rsupdated to flag the offset.Test plan
b"III"produces BAM Phred bytes[40, 40, 40]cargo buildcargo clippy --all-targets -- -D warningscargo fmt --checkAfter this fix,
samtools stats <rustar.bam> | grep 'average quality'returns the same value as STAR on the same FASTQ inputs.Fixes #34