Skip to content

fix(transcriptome-bam): populate mate fields on paired-end records#41

Open
pinin4fjords wants to merge 1 commit into
scverse:mainfrom
pinin4fjords:fix/transcriptome-mate-fields
Open

fix(transcriptome-bam): populate mate fields on paired-end records#41
pinin4fjords wants to merge 1 commit into
scverse:mainfrom
pinin4fjords:fix/transcriptome-mate-fields

Conversation

@pinin4fjords
Copy link
Copy Markdown

Summary

Paired-end records in Aligned.toTranscriptome.out.bam were missing every mate field: no PROPERLY_SEGMENTED (0x2) flag, no MATE_REVERSE_COMPLEMENTED (0x20), RNEXT=*, PNEXT=0, TLEN=0. Salmon's alignment-mode fragment-length inference reads these fields; with them missing it falls back to its default prior (mean=250, sd=25) and EffectiveLength for short transcripts gets over-shrunk, inflating their TPMs.

samtools flagstat pre-fix: 0 / 77 300 properly-paired records. STAR on the same data: 78 886 / 78 886.

Fix

In build_transcriptome_records_pe, after the existing flag-stamping loops and before the interleave, walk the projected (rec1, rec2) pairs alongside their (p1, p2) transcripts and stamp:

  • PROPERLY_SEGMENTED on both (concordance is guaranteed at this point - both mates project to the same transcript by construction).
  • MATE_REVERSE_COMPLEMENTED mirroring the other mate's REVERSE_COMPLEMENTED.
  • mate_reference_sequence_id and mate_alignment_start from the other projection.
  • template_length with STAR's sign convention: magnitude max(end1,end2) - min(start1,start2) + 1, leftmost mate positive, rightmost negative.

Mirrors the genome-space build_paired_mate_record (which already does all of this) so the transcriptome path is now byte-symmetric for shared fields. The TLEN-signing logic is the same calculate_insert_size helper the genome path uses (lifted to pub(crate)); the mate-bookkeeping itself is factored into a new apply_pe_transcriptome_mate_fields helper in src/io/sam.rs so it can be unit-tested in isolation.

Test plan

  • New unit tests assert PROPERLY_SEGMENTED, populated mate ref id, non-zero signed TLEN (forward and reverse cluster cases)
  • Existing transcriptome tests pass
  • cargo build
  • cargo clippy --lib -- -D warnings
  • cargo fmt --check

After the fix, samtools view -c -f 2 <transcriptome.bam> returns the count of primary paired records (not 0).

Fixes #22

build_transcriptome_records_pe was stamping SEGMENTED + FIRST/LAST_SEGMENT
but leaving PROPERLY_SEGMENTED, MATE_REVERSE_COMPLEMENTED, RNEXT, PNEXT,
and TLEN unset on every transcriptome record. Salmon's alignment-mode
fragment-length inference reads those mate fields; with them missing it
falls back to its 250+/-25 prior, distorting paired-end TPMs.

Walk the projected (rec1, rec2) pairs after the existing flag-stamping
loops and mirror the mate-bookkeeping the genome-space build_paired_mate_record
already does: PROPERLY_SEGMENTED + MATE_REVERSE_COMPLEMENTED + mate ref id +
mate position + signed TLEN (leftmost +, rightmost -).

Fixes scverse#22
@pinin4fjords
Copy link
Copy Markdown
Author

Verified end-to-end on macOS/aarch64 against the rebuilt fix branch with the issue's exact reproducer (PE yeast + --quantMode TranscriptomeSAM --quantTranscriptomeSAMoutput BanSingleEnd).

$ samtools flagstat <transcriptome.bam>
77300 + 0 in total (QC-passed reads + QC-failed reads)
68408 + 0 primary
68408 + 0 paired in sequencing
34204 + 0 read1
34204 + 0 read2
68408 + 0 properly paired (100.00% : N/A)

First record pair SRR6357072.7414694 on YAR009C:

FLAG=83   POS=847   RNEXT==  PNEXT=776  TLEN=-172
FLAG=163  POS=776   RNEXT==  PNEXT=847  TLEN=+172
  • FLAG 83 / 163 both carry 0x2 (PROPERLY_SEGMENTED)
  • RNEXT = matches
  • PNEXT populated from the mate's POS
  • TLEN signed correctly (leftmost mate at POS 776 gets +172, rightmost gets −172)

Pre-fix the same invocation produced 0 / 73172 properly-paired records and every record had RNEXT=* PNEXT=0 TLEN=0. LGTM.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

--quantMode TranscriptomeSAM paired-end BAM omits mate fields, breaking Salmon TPMs

1 participant