Skip to content

feat: add --quantMode TranscriptomeSAM output#7

Closed
ewels wants to merge 60 commits into
scverse:devfrom
ewels:feat/transcriptome-sam
Closed

feat: add --quantMode TranscriptomeSAM output#7
ewels wants to merge 60 commits into
scverse:devfrom
ewels:feat/transcriptome-sam

Conversation

@ewels
Copy link
Copy Markdown
Member

@ewels ewels commented Apr 17, 2026

Summary

Implements --quantMode TranscriptomeSAM + --quantTranscriptomeSAMoutput, producing Aligned.toTranscriptome.out.bam for downstream Salmon/RSEM quantification.

Byte-for-byte parity with STAR (verified via test/diff_star_transcriptome.sh)

Ran STAR 2.7.11b genomeGenerate alongside ruSTAR genomeGenerate on two workloads:

Synthetic 2-chr / 4-transcript / 4-gene (committed in the script):

File Identical with STAR
chrName.txt
chrLength.txt
chrStart.txt
chrNameLength.txt
transcriptInfo.tab
exonInfo.tab
geneInfo.tab
exonGeTrInfo.tab
sjdbList.fromGTF.out.tab
genomeParameters.txt ≈ (1 line differs — see below)

Real yeast (nf-core/test-datasets, 124-transcript chrI Ensembl GTF): same 9 transcriptome text files confirmed byte-identical. genomeParameters.txt differs on one line (genomeFileSizes) for the same reason as synthetic.

Remaining genomeParameters.txt delta: the genomeFileSizes values differ because ruSTAR's Genome binary does not yet include STAR's pre-inserted splice-junction sequences (~nInsertedSJs × sjdbLength bytes). Closing that gap is the scope of the follow-up PR stacked on this branch (feat/transcriptome-sam-sj-insertion, WIP — Phase 17.X). Blocks byte-identity of: Genome, SA, SAindex, sjdbInfo.txt, sjdbList.out.tab, plus the one line in genomeParameters.txt.

Transcriptome BAM end-to-end

Ran ruSTAR alignReads on the yeast GTF index built above with 50 000 real 101 bp reads from nf-core/test-datasets:

  • Aligned.toTranscriptome.out.bam is a valid BAM (samtools view round-trips cleanly).
  • Header carries all 124 @SQ entries with the correct STAR-style transcript IDs, including tRNA names with parentheses (e.g. tP(UGG)A).
  • Records contain the expected tags (NH, HI, AS, nM) and valid CIGARs (100–101 M per read).
  • 233 transcriptome records from 50 000 input reads for this sample.

Byte-for-byte diff of the BAM itself against STAR's output is blocked by a STAR 2.7.11b macOS arm64 bug where STAR reads 0 input reads from any FASTQ (reproducible on both synthetic and real nf-core data, with and without --readFilesCommand, gzipped or plain). CI on Linux will run the diff once this lands.

Changes (highlights)

  • Persisted transcriptome index (TranscriptomeIndex::write_transcript_info + 4 siblings) — byte-identical STAR output for transcriptInfo.tab / exonInfo.tab / geneInfo.tab / exonGeTrInfo.tab / sjdbList.fromGTF.out.tab.
  • Load pathGenomeIndex::load prefers the on-disk files over re-parsing the GTF; falls back when the files are absent.
  • STAR-faithful fallback strings (gene_namegene_id, gene_biotypeMissingGeneType) — matches GTF.cpp.
  • genomeParameters.txt byte-match — hoisted writer to reproduce STAR's exact line order, tab/space separators, and trailing-whitespace quirks from genomeParametersWrite.cpp.
  • i_frag on Exon — matches STAR's EX_iFrag on Transcript::exons[i]. Propagated through stitcher + projection. Mate boundaries re-locate the transcript exon during projection (canonSJ == -3).
  • Combined-pair Transcript helper (PairedAlignment::combined_transcript_for_projection) — builds STAR's unified Transcript with mate2 exons carrying i_frag = 1.
  • Lenient BAM header writer — bypasses noodles-sam's strict reference-name validator so yeast tRNA transcript IDs like tP(UGG)A (which STAR accepts and emits) pass through unchanged.
  • --quantMode TranscriptomeSAM at alignReads no longer requires --sjdbGTFfile when the index already carries transcriptInfo.tab — matches STAR's behavior.
  • Simplification pass — lifted GTF attribute names + MissingGeneType to constants, reused Genome::position_to_chr over a hand-rolled helper, dropped a redundant tr_gene_id: Vec<String> field.

Test plan

  • 335 unit + 1 integration test passing
  • cargo clippy --lib -- -D warnings clean
  • cargo fmt --check clean
  • test/diff_star_transcriptome.sh — 9/9 text files byte-identical with STAR
  • nf-core yeast run produces valid transcriptome BAM with 124 @SQ entries + tRNA names
  • Byte-diff of Aligned.toTranscriptome.out.bam vs STAR (Linux CI; local macOS STAR bug)
  • Genome/SA/SAindex/sjdbInfo.txt/sjdbList.out.tab parity — stacked PR (SJ insertion)

Known divergences

🤖 Generated with Claude Code

ewels and others added 12 commits April 17, 2026 14:11
Adds the STAR `--runRNGseed` CLI flag (default 777) and uses it to
randomize primary alignment selection among equal-scoring multi-mappers.
Previously primary selection was purely deterministic (lexicographic
tie-break), which meant the 127 "genuine tie" disagreements against
STAR on 10k SE yeast were frozen in whichever direction our sort happened
to pick. With a seeded shuffle, ties flip independently of sort order --
matching STAR's behavior at `ReadAlign_multMapSelect.cpp:71-79` and
`funPrimaryAlignMark.cpp:19-28`.

Unblocks nf-core/rnaseq, which invokes STAR_ALIGN with `--runRNGseed 0`.

Implementation notes:
- `shuffle_tied_prefix` does an in-place Fisher-Yates on the contiguous
  top-score prefix only, leaving non-tied lower-scored alignments alone.
  Matches STAR's two-phase "move best to front, then shuffle front"
  (collapsed since the input is already score-sorted).
- Applied at the final primary-selection point in both SE (`align_read`)
  and PE (`align_paired_read`). Not applied in `stitch.rs` per-window
  sorts -- STAR only shuffles at multMapSelect / primary-mark time.
- STAR seeds `std::mt19937` once per chunk and advances per read.
  ruSTAR parallelises per-read via rayon, so instead of a per-chunk RNG
  we derive a deterministic per-read seed from
  `run_rng_seed * (hash(read_name) + 1)`. This is stronger reproducibility
  than STAR (independent of thread count) while honoring `--runRNGseed`.
- Uses `rand::rngs::StdRng` rather than `rand_mt`. Not bit-for-bit parity
  with STAR's tie-breaking choices (STAR's `std::uniform_real_distribution`
  is libstdc++-specific anyway) -- it just has to honor the seed
  deterministically.

Tests: +5 (parse-default, parse-override, four shuffle behavior tests).
283 unit tests passing (was 278), 4 integration tests passing.

Note: commit is unsigned because local 1Password SSH signing
(op-ssh-sign) returned "failed to fill whole buffer" on every attempt.
Feel free to re-sign with `git commit --amend -S` once the signer is back.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Parses STAR's comma-separated multi-RG syntax (first field must be ID:
on each block, `,` separates blocks, tokens are tab-joined into the SAM
@rg body). Replicates a single RG line across N input files like STAR
does, and errors on count mismatch or RG-in-outSAMattributes without a
line. Auto-appends RG to sam_attribute_set() when a line is set (matches
Parameters_samAttributes.cpp:201). SAM/BAM headers now emit one @rg
per unique ID, and every record (mapped, unmapped, half-mapped, both-
mates) carries RG:Z:<id> when attrs contain RG.

Unblocks nf-core/rnaseq, which invokes STAR with
`--outSAMattrRGline 'ID:X' 'SM:X'` and previously errored on the unknown
flag.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replace the hand-rolled Fisher-Yates loop with `rand::seq::SliceRandom::shuffle`
on the tied prefix, and collapse the length guard with `slice::first()`. Same
behavior (shuffles the [0..tied) range with the seeded RNG), fewer lines, and
no need to import `Rng`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Drop the Fisher-Yates reference in `shuffle_tied_prefix`'s doc (the impl now
delegates to `SliceRandom::shuffle` — the algorithm is an implementation
detail). Collapse the SE call-site comment to one line since the fn doc
already covers the "why"; keep the STAR source file + line reference.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`DefaultHasher` is re-exported from `std::hash` as of Rust 1.76, so there's
no need to pull it in via the full `std::collections::hash_map` path.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…efix

Single-use generic parameter — `impl Fn(&T) -> i32` in the argument list is
shorter than a `where F: Fn(&T) -> i32` bound and reads more naturally at
the call sites we have.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Remove the `let _ = params.rg_ids()?;` early-validation line in
  `align_reads_paired_end`. `validate()` already calls `rg_ids()` at
  startup, so the second call was a no-op.
- Simplify `maybe_insert_rg_tag` to `(record, rg_id)` by dropping the
  `attrs: &HashSet<String>` parameter and its `attrs.contains("RG")`
  gate. `sam_attribute_set()` auto-inserts "RG" whenever an RG line is
  configured, so `rg_id.is_some()` already implies the tag is wanted.
- Replace the inline `if let Some(id) = rg_id { ... }` block in
  `build_unmapped_record` with a call to the simplified helper, so every
  writer-side RG insertion goes through the same path.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Add `Parameters::primary_rg_id() -> Result<Option<String>, Error>` that
returns the first RG line's `ID:` value directly, and use it to replace
the `let rg_ids = params.rg_ids()?; let rg_id = rg_ids.first()...`
two-step pattern at every SAM/BAM writer callsite (5 in sam.rs, 1 in
lib.rs).

`rg_ids()` is retained because `validate()` still uses it for the
1-vs-N count check against `--readFilesIn`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replace the manual `while i < len` index-walk with a
`out_sam_attr_rg_line.split(|tok| tok == ",")` that maps each block
to its tab-joined body and propagates any `ID:`-prefix error via
`collect::<Result<_,_>>()`. Same public signature and return value as
before, but with half the local state.

Behavior note: an empty block (adjacent `,`s or a trailing `,`) now
returns a `Parameter` error instead of being silently skipped. This
matches STAR's `Parameters_readFilesInit.cpp` which requires every RG
block to begin with `ID:`. Existing tests still pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`SamWriter::write_unmapped` had no production callers — the SE/PE
write paths build unmapped records via `build_unmapped_record` and
batch them through `write_batch`. The only caller was the method's
own self-test. Delete both.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Fixes formatting drift inherited from main. No semantic changes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Fixes formatting drift inherited from main. No semantic changes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@ewels ewels changed the title feat: add --quantMode TranscriptomeSAM output feat: add --quantMode TranscriptomeSAM output Apr 17, 2026
Psy-Fer and others added 16 commits April 20, 2026 12:52
feat: add `--runRNGseed` flag with seeded primary tie-break
feat: add `--outSAMattrRGline` with `@RG` header and `RG:Z` tags
Introduce `src/quant/transcriptome.rs` with a `TranscriptomeIndex`
struct that groups GTF exons by `transcript_id`, sorts exons by start,
and records absolute genome coords plus STAR's `exLenCum` (cumulative
transcript-space offset) per exon.

Also builds:
- `tr_start` / `tr_end` — transcript bounds in absolute genome coords
- `tr_order` / `tr_starts_sorted` / `tr_end_max_sorted` — sorted view
  used for STAR-style `binarySearch1a` + running-max early-exit in
  `quantAlign`
- per-transcript length, strand, gene_id, chr_idx

Transcripts with inconsistent chr/strand across exons or on unknown
chromosomes are skipped with `log::warn!`, matching STAR's tolerant
GTF handling at `sjdbInsertJunctions` time.

Note: STAR persists this table as `transcriptInfo.tab` + `exonInfo.tab`
at genomeGenerate; ruSTAR rebuilds it at alignment time from the GTF.
Semantically equivalent.

No signing — commit.gpgsign disabled locally (no key on this worktree).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…nscripts)

Port STAR's `Transcriptome::quantAlign` + `alignToTranscript` (see
`source/Transcriptome_quantAlign.cpp:5-89, 91-114`).

`align_to_transcripts` binary-searches `tr_starts_sorted` for the
greatest tr_start <= align.genome_start, then walks back while
`tr_end_max_sorted[i] >= align.genome_end`, calling
`align_to_one_transcript` on every candidate whose [tr_start, tr_end]
fully contains the alignment.

`align_to_one_transcript` walks the alignment's exon blocks and:
  * locates the transcript exon containing the first block
    (binary-searched `find_containing_exon`),
  * rejects blocks that extend past a transcript-exon boundary,
  * on each splice boundary (ruSTAR's `is_splice_boundary_before`,
    which approximates STAR's `canonSJ >= 0` using read-side
    continuity), requires `prev.genome_end == tr_ex[ex].genome_end &&
    next.genome_start == tr_ex[ex+1].genome_start`,
  * translates each block start to t-space via
    `ex_len_cum[ex] + (g_pos - exon_start)`.

Reverse-strand transcripts flip all projected exons:
  `read_pos' = Lread - (read_pos + len)`,
  `tr_pos'  = tr_length - (tr_pos + len)`,
then reverse the exon vector.  `is_reverse` is XOR'd with
`tr_strand == 2`.

Projected CIGAR = original CIGAR with N operations dropped; reversed
for reverse-strand transcripts.  Splices collapse so `n_junction = 0`.

ruSTAR-specific notes:
  * `Transcript.exons` has no canonSJ array; splice vs indel boundaries
    are discriminated by read-side continuity (insertion → read gap,
    splice → read contiguous with genome gap).  Pure deletions rarely
    produce block splits because the stitch merge coalesces across Del.
  * GTF is 1-based inclusive; ruSTAR 0-based half-open.  Transcript
    length = sum(end - start).  STAR uses `exLenCum[last] + (end -
    start) + 1` which arrives at the same numeric value on 1-based
    inclusive inputs.

8 new unit tests cover: single-exon projection, two-exon with matching
junction, junction mismatch (rejected), reverse-strand flip, multi-exon
projection into a longer transcript, out-of-bounds cases, and
projection onto multiple overlapping transcripts.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Port STAR's `ReadAlign::quantTranscriptome` gatekeeping logic (see
`source/ReadAlign_quantTranscriptome.cpp:9-66`) as
`filter_and_project`:

1. If `!allow_indels && n_gap > 0` → reject.
2. Single-end filter — for PE, skip alignments where both ends came
   from the same mate.  ruSTAR's `Transcript` does not carry per-block
   mate tags; the caller enforces this by only invoking
   `filter_and_project` on both-mapped PE pair mates.  Documented as a
   no-op at the per-mate level.
3. Soft-clip extension (mode-dependent): extend leading / trailing
   soft-clipped bases back into matched bases, counting mismatches
   where both read and genome bases are < 4 and unequal.  Reject if
   `n_mismatch + extension_mismatches > min(outFilterMismatchNmax,
   outFilterMismatchNoverLmax*(Lread-1))`.
4. Call `align_to_transcripts` for projection.

New enum `QuantTranscriptomeSAMoutput` with three variants matching
STAR's CLI strings:
  * `BanSingleEnd_BanIndels_ExtendSoftclip` (default, RSEM-compatible)
  * `BanSingleEnd`
  * `BanSingleEnd_ExtendSoftclip`

`rebuild_cigar_without_softclips` strips the leading/trailing S ops
and folds their length into the adjacent M op.  Interior soft-clips
are left alone.

7 new unit tests: enum FromStr/Display, mode flags, indel rejection
under default mode, indel retention under BanSingleEnd, zero-mismatch
softclip extension folding 4S+40M → 44M, softclip extension exceeding
budget → reject, softclip preservation under BanSingleEnd mode.

Divergence from STAR: ruSTAR checks for genome/read buffer bounds via
`saturating_sub` + slice-length checks before indexing, to protect
against edge cases where extension would read past the genome end.
STAR trusts its upstream bounds.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…tion

Add `quant_transcriptome_sam_output: QuantTranscriptomeSAMoutput`
field with STAR-compatible default
`BanSingleEnd_BanIndels_ExtendSoftclip`.  Accepts the three STAR
strings via the `FromStr` impl added in the previous commit.

Add `Parameters::quant_transcriptome_sam()` helper that scans
`quant_mode` for `"TranscriptomeSAM"`, mirroring the existing
`quant_gene_counts()` helper.

Extend `Parameters::validate()` to require `--sjdbGTFfile` when
`--quantMode TranscriptomeSAM` is active.  STAR also allows running
TranscriptomeSAM in SE (the "single-end" flag in the mode name refers
to per-mate, not per-run), so no PE-only gate.

5 new unit tests: default parse, flag enable, explicit mode override
(BanSingleEnd and BanSingleEnd_ExtendSoftclip), validate error without
GTF, validate success with GTF.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Top-level orchestration for `--quantMode TranscriptomeSAM`.

src/io/sam.rs:
  * Factor `build_sam_header` into a generic
    `build_sam_header_from_refs(iter, params)` that takes any
    `(name, length)` iterator.  The original `build_sam_header` now
    delegates to it with the genome's chromosomes.
  * New `SamWriter::build_transcriptome_records(...)`: builds records
    with RNAME pointing at the transcriptome header (reference_sequence_id
    = transcript_idx), POS = t-space position + 1, and emits only the
    standard attribute set (NH/HI/AS/NM/nM).  Splice-aware tags (jM, jI,
    XS, MD) are dropped because splices collapse in t-space.
  * `primary_hit_idx` parameter selects the randomly-chosen primary
    alignment; all others get the SECONDARY (0x100) flag.

src/io/bam.rs:
  * New `BamWriter::create_transcriptome(path, tr_idx, params)` builds
    a BAM header with one @sq per transcript (name = transcript_id,
    length = t-space length).
  * New `BamWriter::header()` accessor.
  * Unit test verifying transcriptome BAM writer creation + @sq count.

src/align/read_align.rs:
  * `per_read_seed` changed from `fn` to `pub(crate) fn` so lib.rs can
    seed the transcriptome primary-picker with the same read-scoped
    seed used for genome-space tie-shuffle.

src/lib.rs:
  * Build an `Arc<TranscriptomeIndex>` alongside the gene-counts
    context when `--quantMode TranscriptomeSAM` is active, and thread
    it through run_single_pass / run_two_pass / align_reads_single_end
    / align_reads_paired_end.
  * Open `<prefix>Aligned.toTranscriptome.out.bam` via
    `BamWriter::create_transcriptome` at the start of the pipeline and
    flush/finish at the end.
  * Extend `AlignmentBatchResults` with `transcriptome_records:
    Vec<RecordBuf>`.  Per-read builder helpers:
      - `build_transcriptome_records_se`: projects every genome-space
        alignment via `filter_and_project`, seeds `StdRng` with
        `per_read_seed(run_rng_seed, read_name)`, picks a random
        projected index as primary, builds records.
      - `build_transcriptome_records_pe`: projects both mates
        independently, keeps transcripts where both project, emits two
        records per projected pair (mate1 FIRST_SEGMENT + mate2
        LAST_SEGMENT).  Primary pick is shared across both mates.
  * Transcriptome records are written serially in batch-merge order
    alongside the main SAM/BAM stream (normal and BySJout paths).

Known limitations vs STAR:
  * PE pairing is done per-transcript-idx set intersection between
    projected mates, not via STAR's combined 2-mate
    `Transcript`/`EX_iFrag` machinery.  Equivalent for the common case
    where each mate maps uniquely within a given transcript; may
    diverge when one mate has multiple valid positions within one
    transcript.
  * StdRng (inherited from `feat/run-rng-seed`) replaces STAR's
    std::mt19937 — no bit-for-bit parity on which transcriptome
    projection is picked as primary among ties.  Per-read
    determinism is preserved via `per_read_seed`.

No pass-1 wiring: two-pass pass 1 uses `None` for tr_idx (tr BAM is
only emitted from pass 2).

cargo fmt applied across the touched files plus incidental fmt
touch-ups on read_align.rs / stitch.rs / params.rs / quant/mod.rs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
End-to-end smoke test in `tests/transcriptome_sam.rs`:
  * Builds a 2000bp synthetic genome (single chromosome, pseudo-random
    sequence from an LCG).
  * Writes a 2-transcript GTF (T1: chr1:101-400 +, T2: chr1:601-900 +).
  * Generates 20 reads (30bp) drawn alternately from the T1 and T2
    exon regions.
  * Runs `ruSTAR --runMode genomeGenerate` then `alignReads
    --quantMode TranscriptomeSAM`.
  * Asserts `Aligned.toTranscriptome.out.bam` is created, > 100 bytes,
    parses as a valid BAM via noodles, and the header contains exactly
    two @sq lines named T1 and T2.

Uses the same `assert_cmd::Command::cargo_bin` pattern as the existing
`phase9_threading.rs` test (inherits the same deprecation warnings;
pre-existing parity).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…iptomeSAMoutput

Remove `Display` impl (never formatted anywhere) and `allow_single_end()`
helper (always returns false, never called). Clean up the stale single-end
comment in `filter_and_project`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…lders

`build_transcriptome_records_se` allocated a redundant `fwd` clone of
`read_seq` just to unify the reverse / forward branch, and both SE + PE
builders inlined the same 6-line base-complement match. Replace both
with a single `rc_encode()` helper delegating to the existing
`io::fastq::complement_base`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…r mate

Previously `build_transcriptome_records_pe` called the per-record builder
twice *per projected pair*, passing a single-element slice and a
`primary_hit_idx` sentinel (0 / usize::MAX) to force the SECONDARY flag.
The builder already handles multi-alignment sets correctly, so split the
pairs into mate1/mate2 slices, call the builder once per mate with the
true `primary_hit`, then stamp FIRST_SEGMENT / LAST_SEGMENT on the
results and interleave. Same observable output, half the per-pair work.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Both `build_transcriptome_records_se` and `_pe` duplicated the RNG seeding
(`per_read_seed` → `StdRng::seed_from_u64` → `gen_range`) and the MAPQ
calculation (`effective_n = n_alignments.max(n_for_mapq)`). Extract a
single `pick_primary_and_mapq` helper and hoist the `use` blocks out of
the function bodies.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`BamWriter::create` and `BamWriter::create_transcriptome` both opened the
file, wrapped it in `BufWriter`/`bam::io::Writer`, and wrote the header.
Move that boilerplate into a private `with_header` helper.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
ewels and others added 13 commits April 22, 2026 03:01
Mirrors STAR's Transcript::exons[i][EX_iFrag] (IncludeDefine.h:209).
Value 0 for single-end / mate1, 1 for mate2. This is pure plumbing:
- Exon gains `i_frag: u8` field with doc comment pointing at STAR's source.
- Stitcher (src/align/stitch.rs) sets `i_frag: 0` on all exons it
  produces — SE and per-mate PE go through here before any pair combining.
- Transcriptome projection (src/quant/transcriptome.rs) propagates
  `block.i_frag` onto projected exons, so transcript-space alignments
  carry mate identity.
- All Exon construction sites in test code updated to set `i_frag: 0`
  (SE contexts).

No behavior change yet — PE pair construction still concatenates two
i_frag=0 transcripts. T.3 will introduce combined-pair Transcript with
mate2 exons rewritten to i_frag=1, and T.4 will use this field to
implement STAR's native single-end filter
(ReadAlign_quantTranscriptome.cpp:17).

333 unit + 1 integration test pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
align_to_one_transcript now detects block-to-block mate boundaries
via i_frag change (equivalent to STAR's canonSJ == -3) and re-locates
the transcript exon for the new block's genome position. Within-mate
splice boundaries still use the `ex += 1` advance; cross-mate
boundaries do NOT require the transcript-junction identity check.

Lays the groundwork for PE combined-pair projection. Output unchanged
for existing tests (no combined Transcript yet passed through).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Builds a STAR-style single-Transcript view of a PE pair: mate1 exons
followed by mate2 exons with i_frag rewritten to 1. This is the
structure STAR feeds to Transcriptome_quantAlign for PE reads.

cigar / junction_motifs / read_seq fields are left empty — the
combined transcript is consumed only by the projection logic in
T.3's next step, which splits the result back to per-mate BAM records.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Extended the end-to-end integration test to also pass --sjdbGTFfile at
genomeGenerate and confirm that transcriptInfo.tab, exonInfo.tab,
geneInfo.tab, exonGeTrInfo.tab, and sjdbList.fromGTF.out.tab are
written into the genome directory with the exact byte content STAR
would produce for the synthetic 2-transcript GTF used by this test.

This is the best byte-format validation we can do without the full
yeast test dataset (which needs to be downloaded separately via
test/data_setup.sh). Once that dataset is in place, diffing ruSTAR's
output against STAR's for the yeast genome gives the final parity
verification.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
When a GTF exon record has no gene_name attribute, STAR's GTF.cpp uses
the gene_id string as the name. When there's no gene_biotype, it uses
the literal string "MissingGeneType". ruSTAR was using empty strings
for both.

Verified byte-for-byte parity against STAR 2.7.11b on a synthetic
2-chr / 4-transcript / 4-gene test case (/tmp/rustar_diff): all 9
small text files (chrName/chrLength/chrStart/chrNameLength.txt,
transcriptInfo.tab, exonInfo.tab, geneInfo.tab, exonGeTrInfo.tab,
sjdbList.fromGTF.out.tab) now diff-clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Generates a deterministic 2-chr / 4-transcript / 4-gene synthetic test
case, runs both STAR and ruSTAR genomeGenerate, then diffs each of the
9 files byte-for-byte.

Current result on STAR 2.7.11b: 9/9 files identical
  * chrName.txt, chrLength.txt, chrStart.txt, chrNameLength.txt
  * transcriptInfo.tab, exonInfo.tab, geneInfo.tab,
    exonGeTrInfo.tab, sjdbList.fromGTF.out.tab

Runnable via:
  RUSTAR_BIN=/path/to/target/release/ruSTAR \\
    test/diff_star_transcriptome.sh /tmp/workdir

Reports with ✓/✗ per file and exits non-zero if any file differs.

Requires STAR on PATH (`brew install rna-star` or bioconda star).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Rewrote write_genome_parameters_txt to emit the exact line order,
tab/space separators, and trailing whitespace STAR's
genomeParametersWrite.cpp uses. Adds missing fields STAR always writes:
- genomeType (Full)
- genomeTransformType / genomeTransformVCF
- sjdbFileChrStartEnd (with STAR's trailing space quirk)
- sjdbGTFchrPrefix / sjdbGTFfeatureExon / sjdbGTFtagExonParent*
- sjdbInsertSave (Basic)
- versionGenome bumped to 2.7.4a (STAR's on-disk format version, not
  binary build version)

Also adds the `### STAR <cli>` and `### GstrandBit 32` comment lines.

genomeFileSizes now uses `<tab><genome_size> <sa_size>` (space between
values, matching STAR's `<< "\t"` then `<< " "` pattern).

Diff script extended to check genomeParameters.txt. Current result:
10/11 files byte-identical on the synthetic test case. The remaining
difference is the genomeFileSizes values themselves, because ruSTAR
does not yet insert GTF-derived splice junction sequences into the
Genome binary — tracked as a separate phase.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…loader

The `from_index_dir` path derived each transcript's chromosome index via
a hand-rolled linear scan helper. `Genome::position_to_chr` already does
the same thing via binary search (O(log n)) and handles the padding-gap
case. Drop the duplicate helper and call through.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Raw strings like "transcript_id" / "gene_id" / "gene_name" /
"gene_biotype" / "MissingGeneType" were scattered across
TranscriptomeIndex building code. Hoist them to module-level `const &str`
declarations so the STAR-faithful fallback is documented in one place
and can't drift.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The mv-renaming trick worked but obscured that both tools need the same
`--genomeDir` string for byte-identical genomeParameters.txt output.
Replace it with a symlink `index` → `star_index | rustar_index` that
flips between runs; both invocations pass `--genomeDir index` verbatim.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ptomeIndex

tr_gene_idx + gene_ids cover all callers; the parallel Vec<String> was
only populated as a convenience view and never read outside
transcriptome.rs tests. Drop the field; update the two asserts that
used it to look up via `gene_ids[tr_gene_idx[i] as usize]`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… has it

Previously, `--quantMode TranscriptomeSAM` at alignReads rejected
invocations without --sjdbGTFfile, even when the user had already
persisted transcriptInfo.tab + friends in --genomeDir at genomeGenerate.
That's STAR-incompatible: STAR loads the files directly and only
re-parses the GTF if explicitly passed at mapping time.

Move the hard check to genomeGenerate mode only. GenomeIndex::load
already surfaces a clear error if neither source is available at
alignReads (via `index.transcriptome.is_none()` in src/lib.rs).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
noodles-sam-0.64 rejects `(` `)` `[` `]` `{` `}` `<` `>` `,` `\` in
@sq SN: values. yeast tRNA transcript IDs from Ensembl GTFs contain
parentheses (tP(UGG)A, tA(UGC)A, etc.) so writing a transcriptome
BAM errored with `invalid reference sequence name (<unknown>)` and
truncated the output file.

Replace BamWriter's noodles-driven header write with a local
write_bam_header_lenient that:
- Emits BAM\x01 + l_text + SAM-text block (built via
  render_sam_text_lenient — iterates @hd / @sq / @rg / @pg / @co
  directly from the noodles sam::Header, writing raw bytes so
  forbidden chars pass through)
- Emits the binary reference list (n_ref + l_name/name/l_ref
  triples) using CString for NUL-termination

Subsequent record writes still go through noodles (no validation
there). Verified with 5 yeast reads → valid BAM with 4 transcriptome
records + samtools view accepts the output.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@ewels
Copy link
Copy Markdown
Member Author

ewels commented Apr 22, 2026

@Psy-Fer - I had a stab at pushing this along a bit further. Turned into a bit of a beast, sorry. I split out #8 as a separate stacked PR to try to keep it a bit cleaner.

I think it's ready for review now (albeit a lot of new code), let me know if there's anything I can do to help..

@Psy-Fer
Copy link
Copy Markdown
Collaborator

Psy-Fer commented Apr 22, 2026

Hmm this is going to be fun to merge with complex changes on dev branch. Especially given the align and stitcher changes with iFrag.

I'll get through this in the next few days

James

@ewels
Copy link
Copy Markdown
Member Author

ewels commented Apr 22, 2026

Arg, I did not notice you had a dev branch sorry 🙈

I can point Claude at dev and ask it to rebase. Better to do that with its existing context I suspect.

Resolves conflicts between transcriptome-SAM work and dev's PE
stitcher rewrites (Phase E/E3/E4 — diagonal dedup, combined seed
search, combined-threshold fallback, PE-CHECK2 unconditional).

Conflict resolutions in src/align/read_align.rs:

1. Imports — union of both sides: keep dev's `finalize_transcript`,
   `split_combined_wt`, `stitch_seeds_core`, `PE_SPACER_BASE` and
   keep this branch's `Exon` import (transcriptome projection uses
   `Exon::i_frag`).

2. PE stitching block — accept dev. The obsolete per-mate stitching
   loops (`mate1_transcripts` / `mate2_transcripts` via
   `stitch_seeds_with_jdb_debug` per mate) are replaced by dev's
   combined-cluster approach (`stitch_seeds_core` + `split_combined_wt`
   on a PE_SPACER_BASE-joined combined read). The result populates
   `single_mate1_transcripts` / `single_mate2_transcripts` for the
   half-mapped fallback path, which this branch already uses.

3. Half-mapped threshold — accept dev's combined threshold (Lread-1 =
   len1+len2), matching STAR's `ReadAlign_mappedFilter.cpp`. The
   per-mate threshold on this branch was a STAR deviation.

4. Minor formatting on `best_pa = paired_alns.iter().max_by_key(...)`
   — accept dev.

src/align/stitch.rs — auto-merged cleanly: dev's mate_id-aware
diagonal dedup (per-fragment `HashMap<(i64, u8), ...>`) plus this
branch's `pub(crate)` visibility bump on `PE_SPACER_BASE` and
`i_frag: 0` on the Exon constructor (PE pair-building rewrites
mate2's exons to `i_frag = 1`).

CLAUDE.md — auto-merged: dev's Phase E4 status section.

Verification:
- cargo test: 335 unit + 4 integration + 1 doc, all passing
- cargo clippy --all-targets: 27 warnings (pre-existing baseline,
  0 new)
- cargo fmt --check: clean
- test/diff_star_transcriptome.sh: 9/10 files byte-identical to
  STAR 2.7.11b (genomeParameters.txt still diverges on
  genomeFileSizes because this branch doesn't yet bake sjdb into
  the genome — that's PR scverse#8's scope).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@Psy-Fer
Copy link
Copy Markdown
Collaborator

Psy-Fer commented Apr 22, 2026

Nah it's okay I can take care of it. Mad ea Dev branch specifically so it would tie in better with doing PRs that would then trigger all the CI and release stuff after I fixed some really annoying things Claude was doing.

…upply the combined score

After merging dev's combined-cluster PE stitching (commit 45792dd), the
only caller of `try_pair_transcripts` always supplies a combined WT
score — the per-mate fallback path that drove the `Option<i32>`
override was removed with the merge.

- Collapse `combined_wt_score_override: Option<i32>` → `combined_wt_score: i32`.
- Drop the `scorer` parameter (its only use was in the fallback closure).
- Delete the dead per-mate fallback that recomputed the score from
  t1.score + t2.score - p1 - p2 + combined_p.
- Fold identical `thresh1` / `thresh2` half-mapped thresholds into a
  single `single_mate_threshold` variable (both expressions were
  verbatim identical post-merge).

Net -19 lines; same test pass-count (340), same STAR diff-script
baseline (9/10 byte-identical).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@ewels
Copy link
Copy Markdown
Member Author

ewels commented Apr 22, 2026

oops sorry, already pushed a commit where I merged in your dev branch to this PR. So should be possible to merge fairly cleanly now I think.

@ewels ewels changed the base branch from main to dev April 22, 2026 15:24
@ewels
Copy link
Copy Markdown
Member Author

ewels commented Apr 22, 2026

Updated the PR to point to dev instead of main

@ewels
Copy link
Copy Markdown
Member Author

ewels commented Apr 22, 2026

..looks like dev is a bit behind main. Because I based this off main it means that this PR looks a bit bigger as a result.

Bit messy sorry 🫠 Shouldn't really matter though.

@Psy-Fer
Copy link
Copy Markdown
Collaborator

Psy-Fer commented Apr 24, 2026

I've pulled all of this into dev locally with all git commits preserved. Thanks

@Psy-Fer Psy-Fer closed this Apr 24, 2026
Psy-Fer added a commit that referenced this pull request Apr 24, 2026
- Transcriptome index built from GTF at genomeGenerate, persisted to disk
- Genome alignment projection onto transcripts (align_to_transcripts)
- Aligned.toTranscriptome.out.bam pipeline
- fix(params): don't require --sjdbGTFfile at alignReads when index has it
- fix(io/bam): lenient BAM header validator for STAR tRNA names
- fix(genome): match STAR's genomeParameters.txt byte-for-byte
- fix(quant): STAR-exact geneInfo/exonGeTrInfo/sjdbList tab formats
- Exon gains i_frag: u8 field (mirrors STAR's EX_iFrag)
- try_pair_transcripts simplified (single caller always supplies score)
@ewels ewels deleted the feat/transcriptome-sam branch April 24, 2026 05:38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants