Skip to content

fix(bam): emit SAM-spec NM:i: tag distinct from STAR-internal nM:i:#42

Open
pinin4fjords wants to merge 1 commit into
scverse:mainfrom
pinin4fjords:fix/nm-tag-sam-spec
Open

fix(bam): emit SAM-spec NM:i: tag distinct from STAR-internal nM:i:#42
pinin4fjords wants to merge 1 commit into
scverse:mainfrom
pinin4fjords:fix/nm-tag-sam-spec

Conversation

@pinin4fjords
Copy link
Copy Markdown

Summary

--outSAMattributes NM was routed to the existing nM:i: writer (substitutions only). The SAM-spec NM tag is the edit distance to the reference: substitutions + inserted bases + deleted bases. STAR and SAM-conformant tools (samtools stats, Picard, MultiQC) read NM:i: and were getting nothing back.

BAM NM:i: records nM:i: records
rustar (pre-fix) 0 88 592
STAR 88 072 0

On records with indels, the per-record values also differed (rustar nM:i:0 where STAR's NM:i:2 reflected the deleted bases).

Fix

Two changes:

  1. In the attribute consumer, treat NM and nM as distinct tokens - request for one no longer silently enables the other.
  2. When NM is requested, compute SAM-spec edit distance: n_mismatch + sum(I-op lengths) + sum(D-op lengths) from the existing Transcript::cigar and Transcript::n_mismatch. Intron N skips are excluded per SAM v1 section 1.4.

Both tags can now be emitted together when both are in --outSAMattributes.

Test plan

  • New unit test asserts an 80M2D21M record with one mismatch produces NM:i:3 and nM:i:1
  • Existing tests pass
  • cargo build
  • cargo clippy --lib -- -D warnings
  • cargo fmt --check

After this fix, samtools view <bam> | grep -c 'NM:i:' matches the total record count and samtools stats error_rate reports a non-zero edit distance.

Fixes #29

--outSAMattributes NM was being routed to the existing nM writer
(substitutions only), so requests for the SAM-spec edit-distance tag
silently produced wrong values. Downstream tools that read NM:i:
(samtools stats, picard, MultiQC) saw nothing.

Treat NM and nM as distinct attribute tokens. When NM is requested,
compute SAM-spec edit distance per the SAM v1 spec section 1.4:
substitutions + inserted bases + deleted bases (excluding intron N
skips). Keep nM emission unchanged when explicitly requested.

Fixes scverse#29
@pinin4fjords
Copy link
Copy Markdown
Author

Verified end-to-end on macOS/aarch64 against the rebuilt fix branch.

$ samtools view <bam> | grep -c 'NM:i:'    -> 88496
$ samtools view <bam> | grep -c 'nM:i:'    -> 0    (only NM was in --outSAMattributes)

Per-record NM values on indel-containing records match STAR's exactly:

CIGAR n_mismatch rustar NM:i: STAR NM:i: (same read)
80M2D21M 0 2 2
43M1D58M 0 1 1
18M1I82M 0 1 1
77M1D24M 1 2 2
77M1I23M 0 1 1

samtools stats after the fix:

SN  mismatches:  77025  # from NM fields
SN  error rate:  9.074168e-03  # mismatches / bases mapped (cigar)

Pre-fix the same run produced 0 NM:i: tags and samtools stats reported mismatches: 0 / error rate: 0. LGTM.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

--outSAMattributes NM emits nM:i: instead of NM:i: (SAM-spec NM tag missing)

1 participant