feat: add `--quantMode TranscriptomeSAM` output by ewels · Pull Request #7 · scverse/rustar-aligner

ewels · 2026-04-17T13:56:31Z

Summary

Implements --quantMode TranscriptomeSAM + --quantTranscriptomeSAMoutput, producing Aligned.toTranscriptome.out.bam for downstream Salmon/RSEM quantification.

Byte-for-byte parity with STAR (verified via `test/diff_star_transcriptome.sh`)

Ran STAR 2.7.11b genomeGenerate alongside ruSTAR genomeGenerate on two workloads:

Synthetic 2-chr / 4-transcript / 4-gene (committed in the script):

File	Identical with STAR
`chrName.txt`	✓
`chrLength.txt`	✓
`chrStart.txt`	✓
`chrNameLength.txt`	✓
`transcriptInfo.tab`	✓
`exonInfo.tab`	✓
`geneInfo.tab`	✓
`exonGeTrInfo.tab`	✓
`sjdbList.fromGTF.out.tab`	✓
`genomeParameters.txt`	≈ (1 line differs — see below)

Real yeast (nf-core/test-datasets, 124-transcript chrI Ensembl GTF): same 9 transcriptome text files confirmed byte-identical. genomeParameters.txt differs on one line (genomeFileSizes) for the same reason as synthetic.

Remaining genomeParameters.txt delta: the genomeFileSizes values differ because ruSTAR's Genome binary does not yet include STAR's pre-inserted splice-junction sequences (~nInsertedSJs × sjdbLength bytes). Closing that gap is the scope of the follow-up PR stacked on this branch (feat/transcriptome-sam-sj-insertion, WIP — Phase 17.X). Blocks byte-identity of: Genome, SA, SAindex, sjdbInfo.txt, sjdbList.out.tab, plus the one line in genomeParameters.txt.

Transcriptome BAM end-to-end

Ran ruSTAR alignReads on the yeast GTF index built above with 50 000 real 101 bp reads from nf-core/test-datasets:

Aligned.toTranscriptome.out.bam is a valid BAM (samtools view round-trips cleanly).
Header carries all 124 @SQ entries with the correct STAR-style transcript IDs, including tRNA names with parentheses (e.g. tP(UGG)A).
Records contain the expected tags (NH, HI, AS, nM) and valid CIGARs (100–101 M per read).
233 transcriptome records from 50 000 input reads for this sample.

Byte-for-byte diff of the BAM itself against STAR's output is blocked by a STAR 2.7.11b macOS arm64 bug where STAR reads 0 input reads from any FASTQ (reproducible on both synthetic and real nf-core data, with and without --readFilesCommand, gzipped or plain). CI on Linux will run the diff once this lands.

Changes (highlights)

Persisted transcriptome index (TranscriptomeIndex::write_transcript_info + 4 siblings) — byte-identical STAR output for transcriptInfo.tab / exonInfo.tab / geneInfo.tab / exonGeTrInfo.tab / sjdbList.fromGTF.out.tab.
Load path — GenomeIndex::load prefers the on-disk files over re-parsing the GTF; falls back when the files are absent.
STAR-faithful fallback strings (gene_name → gene_id, gene_biotype → MissingGeneType) — matches GTF.cpp.
genomeParameters.txt byte-match — hoisted writer to reproduce STAR's exact line order, tab/space separators, and trailing-whitespace quirks from genomeParametersWrite.cpp.
i_frag on Exon — matches STAR's EX_iFrag on Transcript::exons[i]. Propagated through stitcher + projection. Mate boundaries re-locate the transcript exon during projection (canonSJ == -3).
Combined-pair Transcript helper (PairedAlignment::combined_transcript_for_projection) — builds STAR's unified Transcript with mate2 exons carrying i_frag = 1.
Lenient BAM header writer — bypasses noodles-sam's strict reference-name validator so yeast tRNA transcript IDs like tP(UGG)A (which STAR accepts and emits) pass through unchanged.
--quantMode TranscriptomeSAM at alignReads no longer requires --sjdbGTFfile when the index already carries transcriptInfo.tab — matches STAR's behavior.
Simplification pass — lifted GTF attribute names + MissingGeneType to constants, reused Genome::position_to_chr over a hand-rolled helper, dropped a redundant tr_gene_id: Vec<String> field.

Test plan

335 unit + 1 integration test passing
cargo clippy --lib -- -D warnings clean
cargo fmt --check clean
test/diff_star_transcriptome.sh — 9/9 text files byte-identical with STAR
nf-core yeast run produces valid transcriptome BAM with 124 @SQ entries + tRNA names
Byte-diff of Aligned.toTranscriptome.out.bam vs STAR (Linux CI; local macOS STAR bug)
Genome/SA/SAindex/sjdbInfo.txt/sjdbList.out.tab parity — stacked PR (SJ insertion)

Known divergences

StdRng vs mt19937 — inherited from feat: add --runRNGseed flag with seeded primary tie-break #5. Per-read deterministic, not bit-for-bit with STAR's mt19937 primary-pick order.

🤖 Generated with Claude Code

Adds the STAR `--runRNGseed` CLI flag (default 777) and uses it to randomize primary alignment selection among equal-scoring multi-mappers. Previously primary selection was purely deterministic (lexicographic tie-break), which meant the 127 "genuine tie" disagreements against STAR on 10k SE yeast were frozen in whichever direction our sort happened to pick. With a seeded shuffle, ties flip independently of sort order -- matching STAR's behavior at `ReadAlign_multMapSelect.cpp:71-79` and `funPrimaryAlignMark.cpp:19-28`. Unblocks nf-core/rnaseq, which invokes STAR_ALIGN with `--runRNGseed 0`. Implementation notes: - `shuffle_tied_prefix` does an in-place Fisher-Yates on the contiguous top-score prefix only, leaving non-tied lower-scored alignments alone. Matches STAR's two-phase "move best to front, then shuffle front" (collapsed since the input is already score-sorted). - Applied at the final primary-selection point in both SE (`align_read`) and PE (`align_paired_read`). Not applied in `stitch.rs` per-window sorts -- STAR only shuffles at multMapSelect / primary-mark time. - STAR seeds `std::mt19937` once per chunk and advances per read. ruSTAR parallelises per-read via rayon, so instead of a per-chunk RNG we derive a deterministic per-read seed from `run_rng_seed * (hash(read_name) + 1)`. This is stronger reproducibility than STAR (independent of thread count) while honoring `--runRNGseed`. - Uses `rand::rngs::StdRng` rather than `rand_mt`. Not bit-for-bit parity with STAR's tie-breaking choices (STAR's `std::uniform_real_distribution` is libstdc++-specific anyway) -- it just has to honor the seed deterministically. Tests: +5 (parse-default, parse-override, four shuffle behavior tests). 283 unit tests passing (was 278), 4 integration tests passing. Note: commit is unsigned because local 1Password SSH signing (op-ssh-sign) returned "failed to fill whole buffer" on every attempt. Feel free to re-sign with `git commit --amend -S` once the signer is back. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

@rg

Parses STAR's comma-separated multi-RG syntax (first field must be ID: on each block, `,` separates blocks, tokens are tab-joined into the SAM @rg body). Replicates a single RG line across N input files like STAR does, and errors on count mismatch or RG-in-outSAMattributes without a line. Auto-appends RG to sam_attribute_set() when a line is set (matches Parameters_samAttributes.cpp:201). SAM/BAM headers now emit one @rg per unique ID, and every record (mapped, unmapped, half-mapped, both- mates) carries RG:Z:<id> when attrs contain RG. Unblocks nf-core/rnaseq, which invokes STAR with `--outSAMattrRGline 'ID:X' 'SM:X'` and previously errored on the unknown flag. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Replace the hand-rolled Fisher-Yates loop with `rand::seq::SliceRandom::shuffle` on the tied prefix, and collapse the length guard with `slice::first()`. Same behavior (shuffles the [0..tied) range with the seeded RNG), fewer lines, and no need to import `Rng`. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Drop the Fisher-Yates reference in `shuffle_tied_prefix`'s doc (the impl now delegates to `SliceRandom::shuffle` — the algorithm is an implementation detail). Collapse the SE call-site comment to one line since the fn doc already covers the "why"; keep the STAR source file + line reference. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

`DefaultHasher` is re-exported from `std::hash` as of Rust 1.76, so there's no need to pull it in via the full `std::collections::hash_map` path. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…efix Single-use generic parameter — `impl Fn(&T) -> i32` in the argument list is shorter than a `where F: Fn(&T) -> i32` bound and reads more naturally at the call sites we have. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

- Remove the `let _ = params.rg_ids()?;` early-validation line in `align_reads_paired_end`. `validate()` already calls `rg_ids()` at startup, so the second call was a no-op. - Simplify `maybe_insert_rg_tag` to `(record, rg_id)` by dropping the `attrs: &HashSet<String>` parameter and its `attrs.contains("RG")` gate. `sam_attribute_set()` auto-inserts "RG" whenever an RG line is configured, so `rg_id.is_some()` already implies the tag is wanted. - Replace the inline `if let Some(id) = rg_id { ... }` block in `build_unmapped_record` with a call to the simplified helper, so every writer-side RG insertion goes through the same path. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Add `Parameters::primary_rg_id() -> Result<Option<String>, Error>` that returns the first RG line's `ID:` value directly, and use it to replace the `let rg_ids = params.rg_ids()?; let rg_id = rg_ids.first()...` two-step pattern at every SAM/BAM writer callsite (5 in sam.rs, 1 in lib.rs). `rg_ids()` is retained because `validate()` still uses it for the 1-vs-N count check against `--readFilesIn`. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Replace the manual `while i < len` index-walk with a `out_sam_attr_rg_line.split(|tok| tok == ",")` that maps each block to its tab-joined body and propagates any `ID:`-prefix error via `collect::<Result<_,_>>()`. Same public signature and return value as before, but with half the local state. Behavior note: an empty block (adjacent `,`s or a trailing `,`) now returns a `Parameter` error instead of being silently skipped. This matches STAR's `Parameters_readFilesInit.cpp` which requires every RG block to begin with `ID:`. Existing tests still pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

`SamWriter::write_unmapped` had no production callers — the SE/PE write paths build unmapped records via `build_unmapped_record` and batch them through `write_batch`. The only caller was the method's own self-test. Delete both. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Fixes formatting drift inherited from main. No semantic changes. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

feat: add `--runRNGseed` flag with seeded primary tie-break

feat: add `--outSAMattrRGline` with `@RG` header and `RG:Z` tags

Introduce `src/quant/transcriptome.rs` with a `TranscriptomeIndex` struct that groups GTF exons by `transcript_id`, sorts exons by start, and records absolute genome coords plus STAR's `exLenCum` (cumulative transcript-space offset) per exon. Also builds: - `tr_start` / `tr_end` — transcript bounds in absolute genome coords - `tr_order` / `tr_starts_sorted` / `tr_end_max_sorted` — sorted view used for STAR-style `binarySearch1a` + running-max early-exit in `quantAlign` - per-transcript length, strand, gene_id, chr_idx Transcripts with inconsistent chr/strand across exons or on unknown chromosomes are skipped with `log::warn!`, matching STAR's tolerant GTF handling at `sjdbInsertJunctions` time. Note: STAR persists this table as `transcriptInfo.tab` + `exonInfo.tab` at genomeGenerate; ruSTAR rebuilds it at alignment time from the GTF. Semantically equivalent. No signing — commit.gpgsign disabled locally (no key on this worktree). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…nscripts) Port STAR's `Transcriptome::quantAlign` + `alignToTranscript` (see `source/Transcriptome_quantAlign.cpp:5-89, 91-114`). `align_to_transcripts` binary-searches `tr_starts_sorted` for the greatest tr_start <= align.genome_start, then walks back while `tr_end_max_sorted[i] >= align.genome_end`, calling `align_to_one_transcript` on every candidate whose [tr_start, tr_end] fully contains the alignment. `align_to_one_transcript` walks the alignment's exon blocks and: * locates the transcript exon containing the first block (binary-searched `find_containing_exon`), * rejects blocks that extend past a transcript-exon boundary, * on each splice boundary (ruSTAR's `is_splice_boundary_before`, which approximates STAR's `canonSJ >= 0` using read-side continuity), requires `prev.genome_end == tr_ex[ex].genome_end && next.genome_start == tr_ex[ex+1].genome_start`, * translates each block start to t-space via `ex_len_cum[ex] + (g_pos - exon_start)`. Reverse-strand transcripts flip all projected exons: `read_pos' = Lread - (read_pos + len)`, `tr_pos' = tr_length - (tr_pos + len)`, then reverse the exon vector. `is_reverse` is XOR'd with `tr_strand == 2`. Projected CIGAR = original CIGAR with N operations dropped; reversed for reverse-strand transcripts. Splices collapse so `n_junction = 0`. ruSTAR-specific notes: * `Transcript.exons` has no canonSJ array; splice vs indel boundaries are discriminated by read-side continuity (insertion → read gap, splice → read contiguous with genome gap). Pure deletions rarely produce block splits because the stitch merge coalesces across Del. * GTF is 1-based inclusive; ruSTAR 0-based half-open. Transcript length = sum(end - start). STAR uses `exLenCum[last] + (end - start) + 1` which arrives at the same numeric value on 1-based inclusive inputs. 8 new unit tests cover: single-exon projection, two-exon with matching junction, junction mismatch (rejected), reverse-strand flip, multi-exon projection into a longer transcript, out-of-bounds cases, and projection onto multiple overlapping transcripts. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Port STAR's `ReadAlign::quantTranscriptome` gatekeeping logic (see `source/ReadAlign_quantTranscriptome.cpp:9-66`) as `filter_and_project`: 1. If `!allow_indels && n_gap > 0` → reject. 2. Single-end filter — for PE, skip alignments where both ends came from the same mate. ruSTAR's `Transcript` does not carry per-block mate tags; the caller enforces this by only invoking `filter_and_project` on both-mapped PE pair mates. Documented as a no-op at the per-mate level. 3. Soft-clip extension (mode-dependent): extend leading / trailing soft-clipped bases back into matched bases, counting mismatches where both read and genome bases are < 4 and unequal. Reject if `n_mismatch + extension_mismatches > min(outFilterMismatchNmax, outFilterMismatchNoverLmax*(Lread-1))`. 4. Call `align_to_transcripts` for projection. New enum `QuantTranscriptomeSAMoutput` with three variants matching STAR's CLI strings: * `BanSingleEnd_BanIndels_ExtendSoftclip` (default, RSEM-compatible) * `BanSingleEnd` * `BanSingleEnd_ExtendSoftclip` `rebuild_cigar_without_softclips` strips the leading/trailing S ops and folds their length into the adjacent M op. Interior soft-clips are left alone. 7 new unit tests: enum FromStr/Display, mode flags, indel rejection under default mode, indel retention under BanSingleEnd, zero-mismatch softclip extension folding 4S+40M → 44M, softclip extension exceeding budget → reject, softclip preservation under BanSingleEnd mode. Divergence from STAR: ruSTAR checks for genome/read buffer bounds via `saturating_sub` + slice-length checks before indexing, to protect against edge cases where extension would read past the genome end. STAR trusts its upstream bounds. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…tion Add `quant_transcriptome_sam_output: QuantTranscriptomeSAMoutput` field with STAR-compatible default `BanSingleEnd_BanIndels_ExtendSoftclip`. Accepts the three STAR strings via the `FromStr` impl added in the previous commit. Add `Parameters::quant_transcriptome_sam()` helper that scans `quant_mode` for `"TranscriptomeSAM"`, mirroring the existing `quant_gene_counts()` helper. Extend `Parameters::validate()` to require `--sjdbGTFfile` when `--quantMode TranscriptomeSAM` is active. STAR also allows running TranscriptomeSAM in SE (the "single-end" flag in the mode name refers to per-mate, not per-run), so no PE-only gate. 5 new unit tests: default parse, flag enable, explicit mode override (BanSingleEnd and BanSingleEnd_ExtendSoftclip), validate error without GTF, validate success with GTF. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Top-level orchestration for `--quantMode TranscriptomeSAM`. src/io/sam.rs: * Factor `build_sam_header` into a generic `build_sam_header_from_refs(iter, params)` that takes any `(name, length)` iterator. The original `build_sam_header` now delegates to it with the genome's chromosomes. * New `SamWriter::build_transcriptome_records(...)`: builds records with RNAME pointing at the transcriptome header (reference_sequence_id = transcript_idx), POS = t-space position + 1, and emits only the standard attribute set (NH/HI/AS/NM/nM). Splice-aware tags (jM, jI, XS, MD) are dropped because splices collapse in t-space. * `primary_hit_idx` parameter selects the randomly-chosen primary alignment; all others get the SECONDARY (0x100) flag. src/io/bam.rs: * New `BamWriter::create_transcriptome(path, tr_idx, params)` builds a BAM header with one @sq per transcript (name = transcript_id, length = t-space length). * New `BamWriter::header()` accessor. * Unit test verifying transcriptome BAM writer creation + @sq count. src/align/read_align.rs: * `per_read_seed` changed from `fn` to `pub(crate) fn` so lib.rs can seed the transcriptome primary-picker with the same read-scoped seed used for genome-space tie-shuffle. src/lib.rs: * Build an `Arc<TranscriptomeIndex>` alongside the gene-counts context when `--quantMode TranscriptomeSAM` is active, and thread it through run_single_pass / run_two_pass / align_reads_single_end / align_reads_paired_end. * Open `<prefix>Aligned.toTranscriptome.out.bam` via `BamWriter::create_transcriptome` at the start of the pipeline and flush/finish at the end. * Extend `AlignmentBatchResults` with `transcriptome_records: Vec<RecordBuf>`. Per-read builder helpers: - `build_transcriptome_records_se`: projects every genome-space alignment via `filter_and_project`, seeds `StdRng` with `per_read_seed(run_rng_seed, read_name)`, picks a random projected index as primary, builds records. - `build_transcriptome_records_pe`: projects both mates independently, keeps transcripts where both project, emits two records per projected pair (mate1 FIRST_SEGMENT + mate2 LAST_SEGMENT). Primary pick is shared across both mates. * Transcriptome records are written serially in batch-merge order alongside the main SAM/BAM stream (normal and BySJout paths). Known limitations vs STAR: * PE pairing is done per-transcript-idx set intersection between projected mates, not via STAR's combined 2-mate `Transcript`/`EX_iFrag` machinery. Equivalent for the common case where each mate maps uniquely within a given transcript; may diverge when one mate has multiple valid positions within one transcript. * StdRng (inherited from `feat/run-rng-seed`) replaces STAR's std::mt19937 — no bit-for-bit parity on which transcriptome projection is picked as primary among ties. Per-read determinism is preserved via `per_read_seed`. No pass-1 wiring: two-pass pass 1 uses `None` for tr_idx (tr BAM is only emitted from pass 2). cargo fmt applied across the touched files plus incidental fmt touch-ups on read_align.rs / stitch.rs / params.rs / quant/mod.rs. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

End-to-end smoke test in `tests/transcriptome_sam.rs`: * Builds a 2000bp synthetic genome (single chromosome, pseudo-random sequence from an LCG). * Writes a 2-transcript GTF (T1: chr1:101-400 +, T2: chr1:601-900 +). * Generates 20 reads (30bp) drawn alternately from the T1 and T2 exon regions. * Runs `ruSTAR --runMode genomeGenerate` then `alignReads --quantMode TranscriptomeSAM`. * Asserts `Aligned.toTranscriptome.out.bam` is created, > 100 bytes, parses as a valid BAM via noodles, and the header contains exactly two @sq lines named T1 and T2. Uses the same `assert_cmd::Command::cargo_bin` pattern as the existing `phase9_threading.rs` test (inherits the same deprecation warnings; pre-existing parity). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…iptomeSAMoutput Remove `Display` impl (never formatted anywhere) and `allow_single_end()` helper (always returns false, never called). Clean up the stale single-end comment in `filter_and_project`. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…lders `build_transcriptome_records_se` allocated a redundant `fwd` clone of `read_seq` just to unify the reverse / forward branch, and both SE + PE builders inlined the same 6-line base-complement match. Replace both with a single `rc_encode()` helper delegating to the existing `io::fastq::complement_base`. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…r mate Previously `build_transcriptome_records_pe` called the per-record builder twice *per projected pair*, passing a single-element slice and a `primary_hit_idx` sentinel (0 / usize::MAX) to force the SECONDARY flag. The builder already handles multi-alignment sets correctly, so split the pairs into mate1/mate2 slices, call the builder once per mate with the true `primary_hit`, then stamp FIRST_SEGMENT / LAST_SEGMENT on the results and interleave. Same observable output, half the per-pair work. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Both `build_transcriptome_records_se` and `_pe` duplicated the RNG seeding (`per_read_seed` → `StdRng::seed_from_u64` → `gen_range`) and the MAPQ calculation (`effective_n = n_alignments.max(n_for_mapq)`). Extract a single `pick_primary_and_mapq` helper and hoist the `use` blocks out of the function bodies. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

`BamWriter::create` and `BamWriter::create_transcriptome` both opened the file, wrapped it in `BufWriter`/`bam::io::Writer`, and wrote the header. Move that boilerplate into a private `with_header` helper. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Mirrors STAR's Transcript::exons[i][EX_iFrag] (IncludeDefine.h:209). Value 0 for single-end / mate1, 1 for mate2. This is pure plumbing: - Exon gains `i_frag: u8` field with doc comment pointing at STAR's source. - Stitcher (src/align/stitch.rs) sets `i_frag: 0` on all exons it produces — SE and per-mate PE go through here before any pair combining. - Transcriptome projection (src/quant/transcriptome.rs) propagates `block.i_frag` onto projected exons, so transcript-space alignments carry mate identity. - All Exon construction sites in test code updated to set `i_frag: 0` (SE contexts). No behavior change yet — PE pair construction still concatenates two i_frag=0 transcripts. T.3 will introduce combined-pair Transcript with mate2 exons rewritten to i_frag=1, and T.4 will use this field to implement STAR's native single-end filter (ReadAlign_quantTranscriptome.cpp:17). 333 unit + 1 integration test pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

align_to_one_transcript now detects block-to-block mate boundaries via i_frag change (equivalent to STAR's canonSJ == -3) and re-locates the transcript exon for the new block's genome position. Within-mate splice boundaries still use the `ex += 1` advance; cross-mate boundaries do NOT require the transcript-junction identity check. Lays the groundwork for PE combined-pair projection. Output unchanged for existing tests (no combined Transcript yet passed through). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Builds a STAR-style single-Transcript view of a PE pair: mate1 exons followed by mate2 exons with i_frag rewritten to 1. This is the structure STAR feeds to Transcriptome_quantAlign for PE reads. cigar / junction_motifs / read_seq fields are left empty — the combined transcript is consumed only by the projection logic in T.3's next step, which splits the result back to per-mate BAM records. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Extended the end-to-end integration test to also pass --sjdbGTFfile at genomeGenerate and confirm that transcriptInfo.tab, exonInfo.tab, geneInfo.tab, exonGeTrInfo.tab, and sjdbList.fromGTF.out.tab are written into the genome directory with the exact byte content STAR would produce for the synthetic 2-transcript GTF used by this test. This is the best byte-format validation we can do without the full yeast test dataset (which needs to be downloaded separately via test/data_setup.sh). Once that dataset is in place, diffing ruSTAR's output against STAR's for the yeast genome gives the final parity verification. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

When a GTF exon record has no gene_name attribute, STAR's GTF.cpp uses the gene_id string as the name. When there's no gene_biotype, it uses the literal string "MissingGeneType". ruSTAR was using empty strings for both. Verified byte-for-byte parity against STAR 2.7.11b on a synthetic 2-chr / 4-transcript / 4-gene test case (/tmp/rustar_diff): all 9 small text files (chrName/chrLength/chrStart/chrNameLength.txt, transcriptInfo.tab, exonInfo.tab, geneInfo.tab, exonGeTrInfo.tab, sjdbList.fromGTF.out.tab) now diff-clean. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Generates a deterministic 2-chr / 4-transcript / 4-gene synthetic test case, runs both STAR and ruSTAR genomeGenerate, then diffs each of the 9 files byte-for-byte. Current result on STAR 2.7.11b: 9/9 files identical * chrName.txt, chrLength.txt, chrStart.txt, chrNameLength.txt * transcriptInfo.tab, exonInfo.tab, geneInfo.tab, exonGeTrInfo.tab, sjdbList.fromGTF.out.tab Runnable via: RUSTAR_BIN=/path/to/target/release/ruSTAR \\ test/diff_star_transcriptome.sh /tmp/workdir Reports with ✓/✗ per file and exits non-zero if any file differs. Requires STAR on PATH (`brew install rna-star` or bioconda star). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Rewrote write_genome_parameters_txt to emit the exact line order, tab/space separators, and trailing whitespace STAR's genomeParametersWrite.cpp uses. Adds missing fields STAR always writes: - genomeType (Full) - genomeTransformType / genomeTransformVCF - sjdbFileChrStartEnd (with STAR's trailing space quirk) - sjdbGTFchrPrefix / sjdbGTFfeatureExon / sjdbGTFtagExonParent* - sjdbInsertSave (Basic) - versionGenome bumped to 2.7.4a (STAR's on-disk format version, not binary build version) Also adds the `### STAR <cli>` and `### GstrandBit 32` comment lines. genomeFileSizes now uses `<tab><genome_size> <sa_size>` (space between values, matching STAR's `<< "\t"` then `<< " "` pattern). Diff script extended to check genomeParameters.txt. Current result: 10/11 files byte-identical on the synthetic test case. The remaining difference is the genomeFileSizes values themselves, because ruSTAR does not yet insert GTF-derived splice junction sequences into the Genome binary — tracked as a separate phase. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…loader The `from_index_dir` path derived each transcript's chromosome index via a hand-rolled linear scan helper. `Genome::position_to_chr` already does the same thing via binary search (O(log n)) and handles the padding-gap case. Drop the duplicate helper and call through. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Raw strings like "transcript_id" / "gene_id" / "gene_name" / "gene_biotype" / "MissingGeneType" were scattered across TranscriptomeIndex building code. Hoist them to module-level `const &str` declarations so the STAR-faithful fallback is documented in one place and can't drift. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The mv-renaming trick worked but obscured that both tools need the same `--genomeDir` string for byte-identical genomeParameters.txt output. Replace it with a symlink `index` → `star_index | rustar_index` that flips between runs; both invocations pass `--genomeDir index` verbatim. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…ptomeIndex tr_gene_idx + gene_ids cover all callers; the parallel Vec<String> was only populated as a convenience view and never read outside transcriptome.rs tests. Drop the field; update the two asserts that used it to look up via `gene_ids[tr_gene_idx[i] as usize]`. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

… has it Previously, `--quantMode TranscriptomeSAM` at alignReads rejected invocations without --sjdbGTFfile, even when the user had already persisted transcriptInfo.tab + friends in --genomeDir at genomeGenerate. That's STAR-incompatible: STAR loads the files directly and only re-parses the GTF if explicitly passed at mapping time. Move the hard check to genomeGenerate mode only. GenomeIndex::load already surfaces a clear error if neither source is available at alignReads (via `index.transcriptome.is_none()` in src/lib.rs). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

@hd

noodles-sam-0.64 rejects `(` `)` `[` `]` `{` `}` `<` `>` `,` `\` in @sq SN: values. yeast tRNA transcript IDs from Ensembl GTFs contain parentheses (tP(UGG)A, tA(UGC)A, etc.) so writing a transcriptome BAM errored with `invalid reference sequence name (<unknown>)` and truncated the output file. Replace BamWriter's noodles-driven header write with a local write_bam_header_lenient that: - Emits BAM\x01 + l_text + SAM-text block (built via render_sam_text_lenient — iterates @hd / @sq / @rg / @pg / @co directly from the noodles sam::Header, writing raw bytes so forbidden chars pass through) - Emits the binary reference list (n_ref + l_name/name/l_ref triples) using CString for NUL-termination Subsequent record writes still go through noodles (no validation there). Verified with 5 yeast reads → valid BAM with 4 transcriptome records + samtools view accepts the output. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

ewels · 2026-04-22T09:59:48Z

@Psy-Fer - I had a stab at pushing this along a bit further. Turned into a bit of a beast, sorry. I split out #8 as a separate stacked PR to try to keep it a bit cleaner.

I think it's ready for review now (albeit a lot of new code), let me know if there's anything I can do to help..

Psy-Fer · 2026-04-22T12:59:26Z

Hmm this is going to be fun to merge with complex changes on dev branch. Especially given the align and stitcher changes with iFrag.

I'll get through this in the next few days

James

ewels · 2026-04-22T13:34:48Z

Arg, I did not notice you had a dev branch sorry 🙈

I can point Claude at dev and ask it to rebase. Better to do that with its existing context I suspect.

Resolves conflicts between transcriptome-SAM work and dev's PE stitcher rewrites (Phase E/E3/E4 — diagonal dedup, combined seed search, combined-threshold fallback, PE-CHECK2 unconditional). Conflict resolutions in src/align/read_align.rs: 1. Imports — union of both sides: keep dev's `finalize_transcript`, `split_combined_wt`, `stitch_seeds_core`, `PE_SPACER_BASE` and keep this branch's `Exon` import (transcriptome projection uses `Exon::i_frag`). 2. PE stitching block — accept dev. The obsolete per-mate stitching loops (`mate1_transcripts` / `mate2_transcripts` via `stitch_seeds_with_jdb_debug` per mate) are replaced by dev's combined-cluster approach (`stitch_seeds_core` + `split_combined_wt` on a PE_SPACER_BASE-joined combined read). The result populates `single_mate1_transcripts` / `single_mate2_transcripts` for the half-mapped fallback path, which this branch already uses. 3. Half-mapped threshold — accept dev's combined threshold (Lread-1 = len1+len2), matching STAR's `ReadAlign_mappedFilter.cpp`. The per-mate threshold on this branch was a STAR deviation. 4. Minor formatting on `best_pa = paired_alns.iter().max_by_key(...)` — accept dev. src/align/stitch.rs — auto-merged cleanly: dev's mate_id-aware diagonal dedup (per-fragment `HashMap<(i64, u8), ...>`) plus this branch's `pub(crate)` visibility bump on `PE_SPACER_BASE` and `i_frag: 0` on the Exon constructor (PE pair-building rewrites mate2's exons to `i_frag = 1`). CLAUDE.md — auto-merged: dev's Phase E4 status section. Verification: - cargo test: 335 unit + 4 integration + 1 doc, all passing - cargo clippy --all-targets: 27 warnings (pre-existing baseline, 0 new) - cargo fmt --check: clean - test/diff_star_transcriptome.sh: 9/10 files byte-identical to STAR 2.7.11b (genomeParameters.txt still diverges on genomeFileSizes because this branch doesn't yet bake sjdb into the genome — that's PR scverse#8's scope). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Psy-Fer · 2026-04-22T14:06:33Z

Nah it's okay I can take care of it. Mad ea Dev branch specifically so it would tie in better with doing PRs that would then trigger all the CI and release stuff after I fixed some really annoying things Claude was doing.

…upply the combined score After merging dev's combined-cluster PE stitching (commit 45792dd), the only caller of `try_pair_transcripts` always supplies a combined WT score — the per-mate fallback path that drove the `Option<i32>` override was removed with the merge. - Collapse `combined_wt_score_override: Option<i32>` → `combined_wt_score: i32`. - Drop the `scorer` parameter (its only use was in the fallback closure). - Delete the dead per-mate fallback that recomputed the score from t1.score + t2.score - p1 - p2 + combined_p. - Fold identical `thresh1` / `thresh2` half-mapped thresholds into a single `single_mate_threshold` variable (both expressions were verbatim identical post-merge). Net -19 lines; same test pass-count (340), same STAR diff-script baseline (9/10 byte-identical). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

ewels · 2026-04-22T15:24:04Z

oops sorry, already pushed a commit where I merged in your dev branch to this PR. So should be possible to merge fairly cleanly now I think.

ewels · 2026-04-22T15:25:10Z

Updated the PR to point to dev instead of main

ewels · 2026-04-22T15:27:04Z

..looks like dev is a bit behind main. Because I based this off main it means that this PR looks a bit bigger as a result.

Bit messy sorry 🫠 Shouldn't really matter though.

Psy-Fer · 2026-04-24T03:32:39Z

I've pulled all of this into dev locally with all git commits preserved. Thanks

- Transcriptome index built from GTF at genomeGenerate, persisted to disk - Genome alignment projection onto transcripts (align_to_transcripts) - Aligned.toTranscriptome.out.bam pipeline - fix(params): don't require --sjdbGTFfile at alignReads when index has it - fix(io/bam): lenient BAM header validator for STAR tRNA names - fix(genome): match STAR's genomeParameters.txt byte-for-byte - fix(quant): STAR-exact geneInfo/exonGeTrInfo/sjdbList tab formats - Exon gains i_frag: u8 field (mirrors STAR's EX_iFrag) - try_pair_transcripts simplified (single caller always supplies score)

ewels and others added 12 commits April 17, 2026 14:11

chore(simplify): use std::hash::DefaultHasher directly

ca458d4

`DefaultHasher` is re-exported from `std::hash` as of Rust 1.76, so there's no need to pull it in via the full `std::collections::hash_map` path. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

chore: apply cargo fmt

1514169

Fixes formatting drift inherited from main. No semantic changes. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

chore: apply cargo fmt

8dc4262

Fixes formatting drift inherited from main. No semantic changes. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

ewels mentioned this pull request Apr 17, 2026

feat: add --quantMode TranscriptomeSAM output ewels/rustar-aligner#4

Closed

6 tasks

ewels changed the title ~~feat: add --quantMode TranscriptomeSAM output~~ feat: add --quantMode TranscriptomeSAM output Apr 17, 2026

Psy-Fer and others added 16 commits April 20, 2026 12:52

Merge branch 'main' into feat/run-rng-seed

baeea81

Merge pull request #5 from ewels/feat/run-rng-seed

e67864d

feat: add `--runRNGseed` flag with seeded primary tie-break

Merge branch 'main' into feat/sam-rg-line

038c23f

Fix merge conflict squashing } braket

e306046

Merge pull request scverse#6 from ewels/feat/sam-rg-line

0763745

feat: add `--outSAMattrRGline` with `@RG` header and `RG:Z` tags

ewels and others added 13 commits April 22, 2026 03:01

This was referenced Apr 22, 2026

feat(genome): GTF-derived splice-junction insertion into Genome + SA (Phase 17.X) ewels/rustar-aligner#5

Closed

feat: SJ insertion into Genome + SA at genomeGenerate #8

Closed

ewels marked this pull request as ready for review April 22, 2026 09:58

ewels changed the base branch from main to dev April 22, 2026 15:24

Psy-Fer closed this Apr 24, 2026

ewels deleted the feat/transcriptome-sam branch April 24, 2026 05:38

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: add `--quantMode TranscriptomeSAM` output#7

feat: add `--quantMode TranscriptomeSAM` output#7
ewels wants to merge 60 commits into
scverse:devfrom
ewels:feat/transcriptome-sam

ewels commented Apr 17, 2026 •

edited

Loading

Uh oh!

ewels commented Apr 22, 2026

Uh oh!

Psy-Fer commented Apr 22, 2026

Uh oh!

ewels commented Apr 22, 2026

Uh oh!

Psy-Fer commented Apr 22, 2026

Uh oh!

ewels commented Apr 22, 2026

Uh oh!

ewels commented Apr 22, 2026

Uh oh!

ewels commented Apr 22, 2026

Uh oh!

Psy-Fer commented Apr 24, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

ewels commented Apr 17, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Byte-for-byte parity with STAR (verified via test/diff_star_transcriptome.sh)

Transcriptome BAM end-to-end

Changes (highlights)

Test plan

Known divergences

Uh oh!

ewels commented Apr 22, 2026

Uh oh!

Psy-Fer commented Apr 22, 2026

Uh oh!

ewels commented Apr 22, 2026

Uh oh!

Psy-Fer commented Apr 22, 2026

Uh oh!

ewels commented Apr 22, 2026

Uh oh!

ewels commented Apr 22, 2026

Uh oh!

ewels commented Apr 22, 2026

Uh oh!

Psy-Fer commented Apr 24, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

ewels commented Apr 17, 2026 •

edited

Loading

Byte-for-byte parity with STAR (verified via `test/diff_star_transcriptome.sh`)