feat: SJ insertion into Genome + SA at genomeGenerate by ewels · Pull Request #8 · scverse/rustar-aligner

ewels · 2026-04-22T09:09:11Z

Note

Stacked on top of #7 (--quantMode TranscriptomeSAM). This PR currently includes the 41 commits from that branch plus the 9 new SJ-insertion commits. Once #7 is merged, this branch will be rebased onto main and the diff will shrink to the SJ-insertion work only. Review the latest 9 commits (from bc1aa6e onwards) for the new material.

Summary

Ports STAR's sjdbPrepare.cpp + sjdbBuildIndex.cpp to ruSTAR. At genomeGenerate, GTF-derived splice junctions are now baked into the Genome binary and indexed by the suffix array, matching STAR's layout byte-for-byte.

Validation — 15/15 files match STAR at genomeGenerate

test/diff_star_transcriptome.sh against STAR 2.7.11b on the synthetic 2-chr / 4-transcript / 3-junction test case:

✓ chrName.txt
✓ chrLength.txt
✓ chrStart.txt
✓ chrNameLength.txt
✓ transcriptInfo.tab
✓ exonInfo.tab
✓ geneInfo.tab
✓ exonGeTrInfo.tab
✓ sjdbList.fromGTF.out.tab
✓ sjdbInfo.txt
✓ sjdbList.out.tab
✓ Genome
✓ genomeParameters.txt
~ SA size-only match (51929 bytes) — content differs (expected, different SA algorithm)
~ SAindex size-only match (6027 bytes) — content differs (expected, different SA algorithm)
RESULT: 15/15 files identical (or size-matched) to STAR's output

SA/SAindex content divergence is expected: STAR's bucketed parallel SA construction breaks ties differently than ruSTAR's comparator-based sort. Size parity proves we indexed the same genome+Gsj length. Byte-for-byte SA parity would require reimplementing STAR's SA algorithm — out of scope for this PR.

Round-trip verified: ruSTAR can load the extended index it built and align reads end-to-end (load path correctly reads n_genome=524585 from genomeFileSizes, confirming Gsj bytes are included alongside the real genome bytes).

What changed (SJ-insertion commits only)

`src/junction/sjdb_insert.rs` (new module, 22 unit tests)

compute_shifts — STAR's sjdbShiftLeft / sjdbShiftRight (repeat length across intron boundary, capped at 255, stops at N).
PreparedJunction + prepare_junction — combines motif detection, shifts, strand derivation. stored_start/end (for sjdbInfo.txt) vs original_start/end (for Gsj extraction) follow STAR's convention where canonical motifs store the pre-shift ORIGINAL coord and non-canonical ones store the SHIFTED coord.
sort_and_dedup — STAR's post-sort + cross-strand dedup (sjdbPrepare.cpp:141-192).
build_gsj — concatenated flanking sequences, 2*overhang+1 bytes per junction (donor + acceptor + spacer=5).
write_sjdb_info_tab + write_sjdb_list_out_tab — byte-exact file writers.

`src/genome/mod.rs` — `Genome::append_sjdb()`

Extends the forward buffer with Gsj, rebuilds the RC over the extended range. chr_start[n_chr_real] stays pinned at the pre-sjdb forward total (matches STAR's chrStart.txt). n_genome grows to include Gsj.

`src/index/mod.rs` — `GenomeIndex::build` orchestration

Parse GTF once (shared between junction_db, transcriptome, and sjdb_insert — was parsed twice before).
Convert chr-local 1-based GTF intron coords to 0-based absolute genome offsets.
Sort + dedup prepared junctions.
Build Gsj and call Genome::append_sjdb() BEFORE the SA is built (STAR indexes Gsj alongside the real genome in a single SA, sjdbBuildIndex.cpp:293).
Stash sorted prepared junctions in a new prepared_junctions field so write() can emit sjdbInfo.txt + sjdbList.out.tab.

`src/index/io.rs` — load path

Reads n_genome from genomeParameters.txt's genomeFileSizes line. The chr_start boundary is only the pre-sjdb forward total, so with sjdb-extended indices the two values diverge and chr_start cannot be used as the total genome size.

Other

src/junction/mod.rs — SpliceJunctionDb::from_raw_junctions helper (shared by from_gtf and the build path, avoiding GTF-parse-twice).
test/diff_star_transcriptome.sh — extended from 10 files to 15 (added Genome, sjdbInfo.txt, sjdbList.out.tab, plus size-only check for SA/SAindex).

Test plan

cargo test — 359 unit + 4 integration + 1 doc test, all passing
cargo clippy --all-targets — 0 new warnings
cargo fmt --check — clean
test/diff_star_transcriptome.sh — 15/15 files identical/size-matched against STAR 2.7.11b
Round-trip: ruSTAR reads its own sjdb-extended index and aligns reads end-to-end

🤖 Generated with Claude Code

Adds the STAR `--runRNGseed` CLI flag (default 777) and uses it to randomize primary alignment selection among equal-scoring multi-mappers. Previously primary selection was purely deterministic (lexicographic tie-break), which meant the 127 "genuine tie" disagreements against STAR on 10k SE yeast were frozen in whichever direction our sort happened to pick. With a seeded shuffle, ties flip independently of sort order -- matching STAR's behavior at `ReadAlign_multMapSelect.cpp:71-79` and `funPrimaryAlignMark.cpp:19-28`. Unblocks nf-core/rnaseq, which invokes STAR_ALIGN with `--runRNGseed 0`. Implementation notes: - `shuffle_tied_prefix` does an in-place Fisher-Yates on the contiguous top-score prefix only, leaving non-tied lower-scored alignments alone. Matches STAR's two-phase "move best to front, then shuffle front" (collapsed since the input is already score-sorted). - Applied at the final primary-selection point in both SE (`align_read`) and PE (`align_paired_read`). Not applied in `stitch.rs` per-window sorts -- STAR only shuffles at multMapSelect / primary-mark time. - STAR seeds `std::mt19937` once per chunk and advances per read. ruSTAR parallelises per-read via rayon, so instead of a per-chunk RNG we derive a deterministic per-read seed from `run_rng_seed * (hash(read_name) + 1)`. This is stronger reproducibility than STAR (independent of thread count) while honoring `--runRNGseed`. - Uses `rand::rngs::StdRng` rather than `rand_mt`. Not bit-for-bit parity with STAR's tie-breaking choices (STAR's `std::uniform_real_distribution` is libstdc++-specific anyway) -- it just has to honor the seed deterministically. Tests: +5 (parse-default, parse-override, four shuffle behavior tests). 283 unit tests passing (was 278), 4 integration tests passing. Note: commit is unsigned because local 1Password SSH signing (op-ssh-sign) returned "failed to fill whole buffer" on every attempt. Feel free to re-sign with `git commit --amend -S` once the signer is back. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

@rg

Parses STAR's comma-separated multi-RG syntax (first field must be ID: on each block, `,` separates blocks, tokens are tab-joined into the SAM @rg body). Replicates a single RG line across N input files like STAR does, and errors on count mismatch or RG-in-outSAMattributes without a line. Auto-appends RG to sam_attribute_set() when a line is set (matches Parameters_samAttributes.cpp:201). SAM/BAM headers now emit one @rg per unique ID, and every record (mapped, unmapped, half-mapped, both- mates) carries RG:Z:<id> when attrs contain RG. Unblocks nf-core/rnaseq, which invokes STAR with `--outSAMattrRGline 'ID:X' 'SM:X'` and previously errored on the unknown flag. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Replace the hand-rolled Fisher-Yates loop with `rand::seq::SliceRandom::shuffle` on the tied prefix, and collapse the length guard with `slice::first()`. Same behavior (shuffles the [0..tied) range with the seeded RNG), fewer lines, and no need to import `Rng`. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Drop the Fisher-Yates reference in `shuffle_tied_prefix`'s doc (the impl now delegates to `SliceRandom::shuffle` — the algorithm is an implementation detail). Collapse the SE call-site comment to one line since the fn doc already covers the "why"; keep the STAR source file + line reference. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

`DefaultHasher` is re-exported from `std::hash` as of Rust 1.76, so there's no need to pull it in via the full `std::collections::hash_map` path. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…efix Single-use generic parameter — `impl Fn(&T) -> i32` in the argument list is shorter than a `where F: Fn(&T) -> i32` bound and reads more naturally at the call sites we have. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

- Remove the `let _ = params.rg_ids()?;` early-validation line in `align_reads_paired_end`. `validate()` already calls `rg_ids()` at startup, so the second call was a no-op. - Simplify `maybe_insert_rg_tag` to `(record, rg_id)` by dropping the `attrs: &HashSet<String>` parameter and its `attrs.contains("RG")` gate. `sam_attribute_set()` auto-inserts "RG" whenever an RG line is configured, so `rg_id.is_some()` already implies the tag is wanted. - Replace the inline `if let Some(id) = rg_id { ... }` block in `build_unmapped_record` with a call to the simplified helper, so every writer-side RG insertion goes through the same path. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Add `Parameters::primary_rg_id() -> Result<Option<String>, Error>` that returns the first RG line's `ID:` value directly, and use it to replace the `let rg_ids = params.rg_ids()?; let rg_id = rg_ids.first()...` two-step pattern at every SAM/BAM writer callsite (5 in sam.rs, 1 in lib.rs). `rg_ids()` is retained because `validate()` still uses it for the 1-vs-N count check against `--readFilesIn`. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Replace the manual `while i < len` index-walk with a `out_sam_attr_rg_line.split(|tok| tok == ",")` that maps each block to its tab-joined body and propagates any `ID:`-prefix error via `collect::<Result<_,_>>()`. Same public signature and return value as before, but with half the local state. Behavior note: an empty block (adjacent `,`s or a trailing `,`) now returns a `Parameter` error instead of being silently skipped. This matches STAR's `Parameters_readFilesInit.cpp` which requires every RG block to begin with `ID:`. Existing tests still pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

`SamWriter::write_unmapped` had no production callers — the SE/PE write paths build unmapped records via `build_unmapped_record` and batch them through `write_batch`. The only caller was the method's own self-test. Delete both. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Fixes formatting drift inherited from main. No semantic changes. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

feat: add `--runRNGseed` flag with seeded primary tie-break

feat: add `--outSAMattrRGline` with `@RG` header and `RG:Z` tags

Introduce `src/quant/transcriptome.rs` with a `TranscriptomeIndex` struct that groups GTF exons by `transcript_id`, sorts exons by start, and records absolute genome coords plus STAR's `exLenCum` (cumulative transcript-space offset) per exon. Also builds: - `tr_start` / `tr_end` — transcript bounds in absolute genome coords - `tr_order` / `tr_starts_sorted` / `tr_end_max_sorted` — sorted view used for STAR-style `binarySearch1a` + running-max early-exit in `quantAlign` - per-transcript length, strand, gene_id, chr_idx Transcripts with inconsistent chr/strand across exons or on unknown chromosomes are skipped with `log::warn!`, matching STAR's tolerant GTF handling at `sjdbInsertJunctions` time. Note: STAR persists this table as `transcriptInfo.tab` + `exonInfo.tab` at genomeGenerate; ruSTAR rebuilds it at alignment time from the GTF. Semantically equivalent. No signing — commit.gpgsign disabled locally (no key on this worktree). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…nscripts) Port STAR's `Transcriptome::quantAlign` + `alignToTranscript` (see `source/Transcriptome_quantAlign.cpp:5-89, 91-114`). `align_to_transcripts` binary-searches `tr_starts_sorted` for the greatest tr_start <= align.genome_start, then walks back while `tr_end_max_sorted[i] >= align.genome_end`, calling `align_to_one_transcript` on every candidate whose [tr_start, tr_end] fully contains the alignment. `align_to_one_transcript` walks the alignment's exon blocks and: * locates the transcript exon containing the first block (binary-searched `find_containing_exon`), * rejects blocks that extend past a transcript-exon boundary, * on each splice boundary (ruSTAR's `is_splice_boundary_before`, which approximates STAR's `canonSJ >= 0` using read-side continuity), requires `prev.genome_end == tr_ex[ex].genome_end && next.genome_start == tr_ex[ex+1].genome_start`, * translates each block start to t-space via `ex_len_cum[ex] + (g_pos - exon_start)`. Reverse-strand transcripts flip all projected exons: `read_pos' = Lread - (read_pos + len)`, `tr_pos' = tr_length - (tr_pos + len)`, then reverse the exon vector. `is_reverse` is XOR'd with `tr_strand == 2`. Projected CIGAR = original CIGAR with N operations dropped; reversed for reverse-strand transcripts. Splices collapse so `n_junction = 0`. ruSTAR-specific notes: * `Transcript.exons` has no canonSJ array; splice vs indel boundaries are discriminated by read-side continuity (insertion → read gap, splice → read contiguous with genome gap). Pure deletions rarely produce block splits because the stitch merge coalesces across Del. * GTF is 1-based inclusive; ruSTAR 0-based half-open. Transcript length = sum(end - start). STAR uses `exLenCum[last] + (end - start) + 1` which arrives at the same numeric value on 1-based inclusive inputs. 8 new unit tests cover: single-exon projection, two-exon with matching junction, junction mismatch (rejected), reverse-strand flip, multi-exon projection into a longer transcript, out-of-bounds cases, and projection onto multiple overlapping transcripts. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Port STAR's `ReadAlign::quantTranscriptome` gatekeeping logic (see `source/ReadAlign_quantTranscriptome.cpp:9-66`) as `filter_and_project`: 1. If `!allow_indels && n_gap > 0` → reject. 2. Single-end filter — for PE, skip alignments where both ends came from the same mate. ruSTAR's `Transcript` does not carry per-block mate tags; the caller enforces this by only invoking `filter_and_project` on both-mapped PE pair mates. Documented as a no-op at the per-mate level. 3. Soft-clip extension (mode-dependent): extend leading / trailing soft-clipped bases back into matched bases, counting mismatches where both read and genome bases are < 4 and unequal. Reject if `n_mismatch + extension_mismatches > min(outFilterMismatchNmax, outFilterMismatchNoverLmax*(Lread-1))`. 4. Call `align_to_transcripts` for projection. New enum `QuantTranscriptomeSAMoutput` with three variants matching STAR's CLI strings: * `BanSingleEnd_BanIndels_ExtendSoftclip` (default, RSEM-compatible) * `BanSingleEnd` * `BanSingleEnd_ExtendSoftclip` `rebuild_cigar_without_softclips` strips the leading/trailing S ops and folds their length into the adjacent M op. Interior soft-clips are left alone. 7 new unit tests: enum FromStr/Display, mode flags, indel rejection under default mode, indel retention under BanSingleEnd, zero-mismatch softclip extension folding 4S+40M → 44M, softclip extension exceeding budget → reject, softclip preservation under BanSingleEnd mode. Divergence from STAR: ruSTAR checks for genome/read buffer bounds via `saturating_sub` + slice-length checks before indexing, to protect against edge cases where extension would read past the genome end. STAR trusts its upstream bounds. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…tion Add `quant_transcriptome_sam_output: QuantTranscriptomeSAMoutput` field with STAR-compatible default `BanSingleEnd_BanIndels_ExtendSoftclip`. Accepts the three STAR strings via the `FromStr` impl added in the previous commit. Add `Parameters::quant_transcriptome_sam()` helper that scans `quant_mode` for `"TranscriptomeSAM"`, mirroring the existing `quant_gene_counts()` helper. Extend `Parameters::validate()` to require `--sjdbGTFfile` when `--quantMode TranscriptomeSAM` is active. STAR also allows running TranscriptomeSAM in SE (the "single-end" flag in the mode name refers to per-mate, not per-run), so no PE-only gate. 5 new unit tests: default parse, flag enable, explicit mode override (BanSingleEnd and BanSingleEnd_ExtendSoftclip), validate error without GTF, validate success with GTF. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Top-level orchestration for `--quantMode TranscriptomeSAM`. src/io/sam.rs: * Factor `build_sam_header` into a generic `build_sam_header_from_refs(iter, params)` that takes any `(name, length)` iterator. The original `build_sam_header` now delegates to it with the genome's chromosomes. * New `SamWriter::build_transcriptome_records(...)`: builds records with RNAME pointing at the transcriptome header (reference_sequence_id = transcript_idx), POS = t-space position + 1, and emits only the standard attribute set (NH/HI/AS/NM/nM). Splice-aware tags (jM, jI, XS, MD) are dropped because splices collapse in t-space. * `primary_hit_idx` parameter selects the randomly-chosen primary alignment; all others get the SECONDARY (0x100) flag. src/io/bam.rs: * New `BamWriter::create_transcriptome(path, tr_idx, params)` builds a BAM header with one @sq per transcript (name = transcript_id, length = t-space length). * New `BamWriter::header()` accessor. * Unit test verifying transcriptome BAM writer creation + @sq count. src/align/read_align.rs: * `per_read_seed` changed from `fn` to `pub(crate) fn` so lib.rs can seed the transcriptome primary-picker with the same read-scoped seed used for genome-space tie-shuffle. src/lib.rs: * Build an `Arc<TranscriptomeIndex>` alongside the gene-counts context when `--quantMode TranscriptomeSAM` is active, and thread it through run_single_pass / run_two_pass / align_reads_single_end / align_reads_paired_end. * Open `<prefix>Aligned.toTranscriptome.out.bam` via `BamWriter::create_transcriptome` at the start of the pipeline and flush/finish at the end. * Extend `AlignmentBatchResults` with `transcriptome_records: Vec<RecordBuf>`. Per-read builder helpers: - `build_transcriptome_records_se`: projects every genome-space alignment via `filter_and_project`, seeds `StdRng` with `per_read_seed(run_rng_seed, read_name)`, picks a random projected index as primary, builds records. - `build_transcriptome_records_pe`: projects both mates independently, keeps transcripts where both project, emits two records per projected pair (mate1 FIRST_SEGMENT + mate2 LAST_SEGMENT). Primary pick is shared across both mates. * Transcriptome records are written serially in batch-merge order alongside the main SAM/BAM stream (normal and BySJout paths). Known limitations vs STAR: * PE pairing is done per-transcript-idx set intersection between projected mates, not via STAR's combined 2-mate `Transcript`/`EX_iFrag` machinery. Equivalent for the common case where each mate maps uniquely within a given transcript; may diverge when one mate has multiple valid positions within one transcript. * StdRng (inherited from `feat/run-rng-seed`) replaces STAR's std::mt19937 — no bit-for-bit parity on which transcriptome projection is picked as primary among ties. Per-read determinism is preserved via `per_read_seed`. No pass-1 wiring: two-pass pass 1 uses `None` for tr_idx (tr BAM is only emitted from pass 2). cargo fmt applied across the touched files plus incidental fmt touch-ups on read_align.rs / stitch.rs / params.rs / quant/mod.rs. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

End-to-end smoke test in `tests/transcriptome_sam.rs`: * Builds a 2000bp synthetic genome (single chromosome, pseudo-random sequence from an LCG). * Writes a 2-transcript GTF (T1: chr1:101-400 +, T2: chr1:601-900 +). * Generates 20 reads (30bp) drawn alternately from the T1 and T2 exon regions. * Runs `ruSTAR --runMode genomeGenerate` then `alignReads --quantMode TranscriptomeSAM`. * Asserts `Aligned.toTranscriptome.out.bam` is created, > 100 bytes, parses as a valid BAM via noodles, and the header contains exactly two @sq lines named T1 and T2. Uses the same `assert_cmd::Command::cargo_bin` pattern as the existing `phase9_threading.rs` test (inherits the same deprecation warnings; pre-existing parity). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…iptomeSAMoutput Remove `Display` impl (never formatted anywhere) and `allow_single_end()` helper (always returns false, never called). Clean up the stale single-end comment in `filter_and_project`. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…lders `build_transcriptome_records_se` allocated a redundant `fwd` clone of `read_seq` just to unify the reverse / forward branch, and both SE + PE builders inlined the same 6-line base-complement match. Replace both with a single `rc_encode()` helper delegating to the existing `io::fastq::complement_base`. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…r mate Previously `build_transcriptome_records_pe` called the per-record builder twice *per projected pair*, passing a single-element slice and a `primary_hit_idx` sentinel (0 / usize::MAX) to force the SECONDARY flag. The builder already handles multi-alignment sets correctly, so split the pairs into mate1/mate2 slices, call the builder once per mate with the true `primary_hit`, then stamp FIRST_SEGMENT / LAST_SEGMENT on the results and interleave. Same observable output, half the per-pair work. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Both `build_transcriptome_records_se` and `_pe` duplicated the RNG seeding (`per_read_seed` → `StdRng::seed_from_u64` → `gen_range`) and the MAPQ calculation (`effective_n = n_alignments.max(n_for_mapq)`). Extract a single `pick_primary_and_mapq` helper and hoist the `use` blocks out of the function bodies. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

`BamWriter::create` and `BamWriter::create_transcriptome` both opened the file, wrapped it in `BufWriter`/`bam::io::Writer`, and wrote the header. Move that boilerplate into a private `with_header` helper. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

- Merge the two `actual_left` shadowed lets into one `.min(..).min(..)` chain. - Drop the `actual_right` alias (right-clip never needs clamping). - Replace `ext.genome_start = ext.exons.first().map(..).unwrap_or(..)` with a simple `if let Some(first) = ext.exons.first()` — the fallback branch was unreachable because extend_softclips always enters with a non-empty exon list. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…eq clone - The 5-line single-end carve-out inside the `filter_and_project` doc said the same thing as the inline comment; fold both into a one-line mention. - Drop redundant `// (1) ... (4) ...` step labels from the function body. - `align_to_one_transcript` was cloning `align.read_seq` into the projected transcript but no consumer reads that field on a projected record (transcriptome SAM takes the read from the caller). Use `Vec::new()` to avoid the per-projection allocation. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Extended the end-to-end integration test to also pass --sjdbGTFfile at genomeGenerate and confirm that transcriptInfo.tab, exonInfo.tab, geneInfo.tab, exonGeTrInfo.tab, and sjdbList.fromGTF.out.tab are written into the genome directory with the exact byte content STAR would produce for the synthetic 2-transcript GTF used by this test. This is the best byte-format validation we can do without the full yeast test dataset (which needs to be downloaded separately via test/data_setup.sh). Once that dataset is in place, diffing ruSTAR's output against STAR's for the yeast genome gives the final parity verification. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

When a GTF exon record has no gene_name attribute, STAR's GTF.cpp uses the gene_id string as the name. When there's no gene_biotype, it uses the literal string "MissingGeneType". ruSTAR was using empty strings for both. Verified byte-for-byte parity against STAR 2.7.11b on a synthetic 2-chr / 4-transcript / 4-gene test case (/tmp/rustar_diff): all 9 small text files (chrName/chrLength/chrStart/chrNameLength.txt, transcriptInfo.tab, exonInfo.tab, geneInfo.tab, exonGeTrInfo.tab, sjdbList.fromGTF.out.tab) now diff-clean. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Generates a deterministic 2-chr / 4-transcript / 4-gene synthetic test case, runs both STAR and ruSTAR genomeGenerate, then diffs each of the 9 files byte-for-byte. Current result on STAR 2.7.11b: 9/9 files identical * chrName.txt, chrLength.txt, chrStart.txt, chrNameLength.txt * transcriptInfo.tab, exonInfo.tab, geneInfo.tab, exonGeTrInfo.tab, sjdbList.fromGTF.out.tab Runnable via: RUSTAR_BIN=/path/to/target/release/ruSTAR \\ test/diff_star_transcriptome.sh /tmp/workdir Reports with ✓/✗ per file and exits non-zero if any file differs. Requires STAR on PATH (`brew install rna-star` or bioconda star). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Rewrote write_genome_parameters_txt to emit the exact line order, tab/space separators, and trailing whitespace STAR's genomeParametersWrite.cpp uses. Adds missing fields STAR always writes: - genomeType (Full) - genomeTransformType / genomeTransformVCF - sjdbFileChrStartEnd (with STAR's trailing space quirk) - sjdbGTFchrPrefix / sjdbGTFfeatureExon / sjdbGTFtagExonParent* - sjdbInsertSave (Basic) - versionGenome bumped to 2.7.4a (STAR's on-disk format version, not binary build version) Also adds the `### STAR <cli>` and `### GstrandBit 32` comment lines. genomeFileSizes now uses `<tab><genome_size> <sa_size>` (space between values, matching STAR's `<< "\t"` then `<< " "` pattern). Diff script extended to check genomeParameters.txt. Current result: 10/11 files byte-identical on the synthetic test case. The remaining difference is the genomeFileSizes values themselves, because ruSTAR does not yet insert GTF-derived splice junction sequences into the Genome binary — tracked as a separate phase. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…loader The `from_index_dir` path derived each transcript's chromosome index via a hand-rolled linear scan helper. `Genome::position_to_chr` already does the same thing via binary search (O(log n)) and handles the padding-gap case. Drop the duplicate helper and call through. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Raw strings like "transcript_id" / "gene_id" / "gene_name" / "gene_biotype" / "MissingGeneType" were scattered across TranscriptomeIndex building code. Hoist them to module-level `const &str` declarations so the STAR-faithful fallback is documented in one place and can't drift. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The mv-renaming trick worked but obscured that both tools need the same `--genomeDir` string for byte-identical genomeParameters.txt output. Replace it with a symlink `index` → `star_index | rustar_index` that flips between runs; both invocations pass `--genomeDir index` verbatim. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…ptomeIndex tr_gene_idx + gene_ids cover all callers; the parallel Vec<String> was only populated as a convenience view and never read outside transcriptome.rs tests. Drop the field; update the two asserts that used it to look up via `gene_ids[tr_gene_idx[i] as usize]`. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

… has it Previously, `--quantMode TranscriptomeSAM` at alignReads rejected invocations without --sjdbGTFfile, even when the user had already persisted transcriptInfo.tab + friends in --genomeDir at genomeGenerate. That's STAR-incompatible: STAR loads the files directly and only re-parses the GTF if explicitly passed at mapping time. Move the hard check to genomeGenerate mode only. GenomeIndex::load already surfaces a clear error if neither source is available at alignReads (via `index.transcriptome.is_none()` in src/lib.rs). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

@hd

noodles-sam-0.64 rejects `(` `)` `[` `]` `{` `}` `<` `>` `,` `\` in @sq SN: values. yeast tRNA transcript IDs from Ensembl GTFs contain parentheses (tP(UGG)A, tA(UGC)A, etc.) so writing a transcriptome BAM errored with `invalid reference sequence name (<unknown>)` and truncated the output file. Replace BamWriter's noodles-driven header write with a local write_bam_header_lenient that: - Emits BAM\x01 + l_text + SAM-text block (built via render_sam_text_lenient — iterates @hd / @sq / @rg / @pg / @co directly from the noodles sam::Header, writing raw bytes so forbidden chars pass through) - Emits the binary reference list (n_ref + l_name/name/l_ref triples) using CString for NUL-termination Subsequent record writes still go through noodles (no validation there). Verified with 5 yeast reads → valid BAM with 4 transcriptome records + samtools view accepts the output. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

New module src/junction/sjdb_insert.rs with classify_motif() that returns the 0-6 motif code STAR uses (sjdbPrepare.cpp:37-50): 0 = non-canonical 1 = GT/AG(+) (donor bases GT, acceptor bases AG) 2 = CT/AC(-) 3 = GC/AG(+) 4 = CT/GC(-) 5 = AT/AC(+) 6 = GT/AT(-) First of several pieces of sjdbPrepare + sjdbBuildIndex needed to reach byte-parity on Genome / SA / SAindex / sjdbInfo.txt / sjdbList.out.tab. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Ports STAR's sjdbPrepare.cpp:52-73 repeat-shift calculation: counts how many bases the intron boundary can shift left / right while preserving the donor/acceptor base identity. Caps at 255 per junction (STAR limit). Stops on genome bounds and on any N-base. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

New `PreparedJunction` struct carrying STAR's per-junction fields (chr_idx, shift-adjusted start/end, motif, shift_left, shift_right, strand) and a `prepare_junction` free function that pulls them together from raw db coordinates. Follows STAR's sjdbPrepare.cpp semantics: - Compute motif via classify_motif - Compute shifts via compute_shifts - Subtract shift_left from both coordinates (lines 71-72) to sit at the canonical left-most representation - Derive strand from motif when db_strand is unknown (lines 184-188: 2 - motif % 2, or 0 if motif is non-canonical) Three tests cover: canonical + strand, strand derivation fallback, shift-left applied to coords. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The /simplify review flagged my classify_motif + MOTIF_* consts as a duplicate of src/align/score.rs::detect_splice_motif (behind a SpliceMotif enum) + src/junction/sj_output.rs::encode_motif (0-6 code). Two independent motif truth tables is a drift risk. - Move detect_splice_motif from AlignmentScorer method (stateless) to a free fn in src/align/score.rs; the method is now a thin wrapper. - Drop classify_motif + MOTIF_* consts from sjdb_insert. - prepare_junction calls detect_splice_motif + encode_motif instead. - compute_shifts signature changed from &[u8] → &Genome so both helpers share the same input shape; takes n_genome_real to scope the forward-strand slice and avoids redundant bounds checks per iteration. - Tests updated to build Genome instances via a local helper. 343 unit tests pass (up from 335 — the test-coverage consolidation netted more tests via score.rs's existing detect_splice_motif coverage plus the new sjdb_insert tests). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Adds PreparedJunction::stored_start/end/original_start/original_end so callers can request either STAR's on-disk coordinate form (canonical → original; non-canonical → shifted) or the canonical-for-extraction form used when reading Gsj flanking bytes. sort_and_dedup() ports STAR's second-pass sort + cross-strand dedup from sjdbPrepare.cpp:141-192. ruSTAR's SpliceJunctionDb already deduplicates on (chr, start, end, strand), so the first-pass intra-strand dedup branches never fire for our inputs — kept second-pass cross-strand logic: - Undefined vs defined strand → keep defined - Both non-canonical → collapse to strand 0 (undefined) - One canonical → keep canonical - Both canonical → keep the one whose strand matches 2 - motif%2 6 new tests exercise each branch of merge_cross_strand plus the sort ordering; all pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

New build_gsj() in src/junction/sjdb_insert.rs reproduces STAR's sjdbPrepare.cpp:203-215 in-memory buffer: per junction, sjdb_overhang donor bases + sjdb_overhang acceptor bases + 1 spacer byte (value 5, STAR's GENOME_spacingChar). Extraction uses ORIGINAL (pre-shift) coordinates regardless of motif, relying on the new PreparedJunction::original_{start,end} helpers. 5 tests: canonical single junction, non-canonical junction (verifies shift reversal), multi-junction concatenation with independent spacers, plus two bounds-error cases (donor underflow, acceptor overrun). Next commit will wire the Gsj bytes onto the Genome binary + extend the SA. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Two STAR-faithful text writers matching `sjdbPrepare.cpp:196-222`: - `write_sjdb_info_tab`: header `<n>\t<overhang>\n`, then `stored_start\tstored_end\tmotif\tshiftL\tshiftR\tstrand\n` per junction. - `write_sjdb_list_out_tab`: `chr\t<1-based start>\t<1-based end>\t<strand_char>\n` with chromosome-local, pre-shift ("original") coordinates. Both use BufWriter + Error::io(path) matching the existing writer convention in sj_output.rs. Strand char table matches STAR's `{'.','+','-'}` for 0/1/2. Three byte-match tests covering the header, canonical vs non-canonical coord handling, and multi-chromosome chr_start offsets. 22/22 sjdb_insert tests pass; full suite 357/357, 0 new clippy warnings. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Matches STAR's in-memory layout after `sjdbBuildIndex.cpp:293`: Gsj bytes live immediately after the forward real-genome bytes; chr_start[n_chr_real] stays pinned at the pre-sjdb forward total (verified against STAR's chrStart.txt for the 10k yeast test case: pre-sjdb boundary = 262144, n_genome post-sjdb = 263139 = 262144 + 5*199). n_genome grows to include Gsj. chr_start / chr_length / chr_name / n_chr_real are NOT updated — Gsj lives outside the chromosome accounting. Two unit tests: one verifies forward layout + RC regeneration; one verifies empty-Gsj is a no-op. 359/359 unit + 4/4 integration tests pass; 0 clippy warnings in genome/mod.rs. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…iff harness With a GTF provided, GenomeIndex::build now: 1. Parses the GTF once (shared between junction_db, transcriptome, and sjdb_insert — was parsed twice before). 2. Converts chr-local 1-based GTF intron coords to 0-based absolute genome offsets for sjdb_insert::prepare_junction. 3. Sort + dedup prepared junctions via sjdb_insert::sort_and_dedup. 4. Build Gsj buffer and append to Genome (Genome::append_sjdb) BEFORE the suffix array is built — STAR indexes Gsj alongside the real genome in a single SA (sjdbBuildIndex.cpp:293). 5. Stash sorted prepared junctions in a new `prepared_junctions` field so GenomeIndex::write() can emit sjdbInfo.txt + sjdbList.out.tab. GenomeIndex::load() now reads n_genome from genomeParameters.txt's `genomeFileSizes` line (the chr_start boundary is only the real-genome forward total, not the post-sjdb total). Also adds SpliceJunctionDb::from_raw_junctions helper so from_gtf and the build path share the HashMap-building code. **Validation** (test/diff_star_transcriptome.sh, extended): 15/15 files byte-identical to STAR docker output, including: - Genome (real + Gsj appended; verified against STAR 2.7.11b) - sjdbInfo.txt - sjdbList.out.tab - genomeParameters.txt (genomeFileSizes reflects post-sjdb n_genome) SA / SAindex: size-matched (51929 / 6027 bytes), content differs as expected — STAR's bucketed-parallel SA construction breaks ties differently than ruSTAR's comparator-based sort. Size parity proves we indexed the same genome+Gsj length. Round-trip verified: ruSTAR can load the extended index it built and align reads end-to-end (load path reads `n_genome=524585` from the `genomeFileSizes` line, confirming Gsj bytes are included). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Resolves conflicts between transcriptome-SAM work and dev's PE stitcher rewrites (Phase E/E3/E4 — diagonal dedup, combined seed search, combined-threshold fallback, PE-CHECK2 unconditional). Conflict resolutions in src/align/read_align.rs: 1. Imports — union of both sides: keep dev's `finalize_transcript`, `split_combined_wt`, `stitch_seeds_core`, `PE_SPACER_BASE` and keep this branch's `Exon` import (transcriptome projection uses `Exon::i_frag`). 2. PE stitching block — accept dev. The obsolete per-mate stitching loops (`mate1_transcripts` / `mate2_transcripts` via `stitch_seeds_with_jdb_debug` per mate) are replaced by dev's combined-cluster approach (`stitch_seeds_core` + `split_combined_wt` on a PE_SPACER_BASE-joined combined read). The result populates `single_mate1_transcripts` / `single_mate2_transcripts` for the half-mapped fallback path, which this branch already uses. 3. Half-mapped threshold — accept dev's combined threshold (Lread-1 = len1+len2), matching STAR's `ReadAlign_mappedFilter.cpp`. The per-mate threshold on this branch was a STAR deviation. 4. Minor formatting on `best_pa = paired_alns.iter().max_by_key(...)` — accept dev. src/align/stitch.rs — auto-merged cleanly: dev's mate_id-aware diagonal dedup (per-fragment `HashMap<(i64, u8), ...>`) plus this branch's `pub(crate)` visibility bump on `PE_SPACER_BASE` and `i_frag: 0` on the Exon constructor (PE pair-building rewrites mate2's exons to `i_frag = 1`). CLAUDE.md — auto-merged: dev's Phase E4 status section. Verification: - cargo test: 335 unit + 4 integration + 1 doc, all passing - cargo clippy --all-targets: 27 warnings (pre-existing baseline, 0 new) - cargo fmt --check: clean - test/diff_star_transcriptome.sh: 9/10 files byte-identical to STAR 2.7.11b (genomeParameters.txt still diverges on genomeFileSizes because this branch doesn't yet bake sjdb into the genome — that's PR scverse#8's scope). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…upply the combined score After merging dev's combined-cluster PE stitching (commit 45792dd), the only caller of `try_pair_transcripts` always supplies a combined WT score — the per-mate fallback path that drove the `Option<i32>` override was removed with the merge. - Collapse `combined_wt_score_override: Option<i32>` → `combined_wt_score: i32`. - Drop the `scorer` parameter (its only use was in the fallback closure). - Delete the dead per-mate fallback that recomputed the score from t1.score + t2.score - p1 - p2 + combined_p. - Fold identical `thresh1` / `thresh2` half-mapped thresholds into a single `single_mate_threshold` variable (both expressions were verbatim identical post-merge). Net -19 lines; same test pass-count (340), same STAR diff-script baseline (9/10 byte-identical). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…t/transcriptome-sam-sj-insertion

Psy-Fer · 2026-04-24T03:32:01Z

Okay i've pull this over into dev and resolved the conflicts. all of your commits are preserved. cheers

- PreparedJunction pipeline: motif detection, shift_left/shift_right - build_gsj(): concatenated SJ flanking-sequence buffer - sort + cross-strand dedup for prepared junctions - Genome::append_sjdb(): extend forward genome + rebuild RC - sjdbInfo.txt + sjdbList.out.tab writers - Wire sjdb insertion into GenomeIndex::build - Extended byte-parity diff harness

ewels and others added 30 commits April 17, 2026 14:11

chore(simplify): use std::hash::DefaultHasher directly

ca458d4

`DefaultHasher` is re-exported from `std::hash` as of Rust 1.76, so there's no need to pull it in via the full `std::collections::hash_map` path. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

chore: apply cargo fmt

1514169

Fixes formatting drift inherited from main. No semantic changes. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

chore: apply cargo fmt

8dc4262

Fixes formatting drift inherited from main. No semantic changes. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Merge branch 'main' into feat/run-rng-seed

baeea81

Merge pull request #5 from ewels/feat/run-rng-seed

e67864d

feat: add `--runRNGseed` flag with seeded primary tie-break

Merge branch 'main' into feat/sam-rg-line

038c23f

Fix merge conflict squashing } braket

e306046

Merge pull request scverse#6 from ewels/feat/sam-rg-line

0763745

feat: add `--outSAMattrRGline` with `@RG` header and `RG:Z` tags

ewels and others added 19 commits April 22, 2026 03:11

ewels mentioned this pull request Apr 22, 2026

feat(genome): GTF-derived splice-junction insertion into Genome + SA (Phase 17.X) ewels/rustar-aligner#5

Closed

5 tasks

ewels marked this pull request as ready for review April 22, 2026 09:58

ewels mentioned this pull request Apr 22, 2026

feat: add --quantMode TranscriptomeSAM output #7

Closed

7 tasks

ewels and others added 3 commits April 22, 2026 16:05

Merge remote-tracking branch 'origin/feat/transcriptome-sam' into fea…

148f9a4

…t/transcriptome-sam-sj-insertion

ewels changed the base branch from main to dev April 22, 2026 15:33

Psy-Fer closed this Apr 24, 2026

ewels deleted the feat/transcriptome-sam-sj-insertion branch April 24, 2026 05:38

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: SJ insertion into Genome + SA at genomeGenerate#8

feat: SJ insertion into Genome + SA at genomeGenerate#8
ewels wants to merge 70 commits into
scverse:devfrom
ewels:feat/transcriptome-sam-sj-insertion

ewels commented Apr 22, 2026

Uh oh!

Psy-Fer commented Apr 24, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

ewels commented Apr 22, 2026

Summary

Validation — 15/15 files match STAR at genomeGenerate

What changed (SJ-insertion commits only)

src/junction/sjdb_insert.rs (new module, 22 unit tests)

src/genome/mod.rs — Genome::append_sjdb()

src/index/mod.rs — GenomeIndex::build orchestration

src/index/io.rs — load path

Other

Test plan

Uh oh!

Psy-Fer commented Apr 24, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

`src/junction/sjdb_insert.rs` (new module, 22 unit tests)

`src/genome/mod.rs` — `Genome::append_sjdb()`

`src/index/mod.rs` — `GenomeIndex::build` orchestration

`src/index/io.rs` — load path