feat: SJ insertion into Genome + SA at genomeGenerate#8
Closed
ewels wants to merge 70 commits into
Closed
Conversation
Adds the STAR `--runRNGseed` CLI flag (default 777) and uses it to randomize primary alignment selection among equal-scoring multi-mappers. Previously primary selection was purely deterministic (lexicographic tie-break), which meant the 127 "genuine tie" disagreements against STAR on 10k SE yeast were frozen in whichever direction our sort happened to pick. With a seeded shuffle, ties flip independently of sort order -- matching STAR's behavior at `ReadAlign_multMapSelect.cpp:71-79` and `funPrimaryAlignMark.cpp:19-28`. Unblocks nf-core/rnaseq, which invokes STAR_ALIGN with `--runRNGseed 0`. Implementation notes: - `shuffle_tied_prefix` does an in-place Fisher-Yates on the contiguous top-score prefix only, leaving non-tied lower-scored alignments alone. Matches STAR's two-phase "move best to front, then shuffle front" (collapsed since the input is already score-sorted). - Applied at the final primary-selection point in both SE (`align_read`) and PE (`align_paired_read`). Not applied in `stitch.rs` per-window sorts -- STAR only shuffles at multMapSelect / primary-mark time. - STAR seeds `std::mt19937` once per chunk and advances per read. ruSTAR parallelises per-read via rayon, so instead of a per-chunk RNG we derive a deterministic per-read seed from `run_rng_seed * (hash(read_name) + 1)`. This is stronger reproducibility than STAR (independent of thread count) while honoring `--runRNGseed`. - Uses `rand::rngs::StdRng` rather than `rand_mt`. Not bit-for-bit parity with STAR's tie-breaking choices (STAR's `std::uniform_real_distribution` is libstdc++-specific anyway) -- it just has to honor the seed deterministically. Tests: +5 (parse-default, parse-override, four shuffle behavior tests). 283 unit tests passing (was 278), 4 integration tests passing. Note: commit is unsigned because local 1Password SSH signing (op-ssh-sign) returned "failed to fill whole buffer" on every attempt. Feel free to re-sign with `git commit --amend -S` once the signer is back. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Parses STAR's comma-separated multi-RG syntax (first field must be ID: on each block, `,` separates blocks, tokens are tab-joined into the SAM @rg body). Replicates a single RG line across N input files like STAR does, and errors on count mismatch or RG-in-outSAMattributes without a line. Auto-appends RG to sam_attribute_set() when a line is set (matches Parameters_samAttributes.cpp:201). SAM/BAM headers now emit one @rg per unique ID, and every record (mapped, unmapped, half-mapped, both- mates) carries RG:Z:<id> when attrs contain RG. Unblocks nf-core/rnaseq, which invokes STAR with `--outSAMattrRGline 'ID:X' 'SM:X'` and previously errored on the unknown flag. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replace the hand-rolled Fisher-Yates loop with `rand::seq::SliceRandom::shuffle` on the tied prefix, and collapse the length guard with `slice::first()`. Same behavior (shuffles the [0..tied) range with the seeded RNG), fewer lines, and no need to import `Rng`. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Drop the Fisher-Yates reference in `shuffle_tied_prefix`'s doc (the impl now delegates to `SliceRandom::shuffle` — the algorithm is an implementation detail). Collapse the SE call-site comment to one line since the fn doc already covers the "why"; keep the STAR source file + line reference. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`DefaultHasher` is re-exported from `std::hash` as of Rust 1.76, so there's no need to pull it in via the full `std::collections::hash_map` path. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…efix Single-use generic parameter — `impl Fn(&T) -> i32` in the argument list is shorter than a `where F: Fn(&T) -> i32` bound and reads more naturally at the call sites we have. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Remove the `let _ = params.rg_ids()?;` early-validation line in
`align_reads_paired_end`. `validate()` already calls `rg_ids()` at
startup, so the second call was a no-op.
- Simplify `maybe_insert_rg_tag` to `(record, rg_id)` by dropping the
`attrs: &HashSet<String>` parameter and its `attrs.contains("RG")`
gate. `sam_attribute_set()` auto-inserts "RG" whenever an RG line is
configured, so `rg_id.is_some()` already implies the tag is wanted.
- Replace the inline `if let Some(id) = rg_id { ... }` block in
`build_unmapped_record` with a call to the simplified helper, so every
writer-side RG insertion goes through the same path.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Add `Parameters::primary_rg_id() -> Result<Option<String>, Error>` that returns the first RG line's `ID:` value directly, and use it to replace the `let rg_ids = params.rg_ids()?; let rg_id = rg_ids.first()...` two-step pattern at every SAM/BAM writer callsite (5 in sam.rs, 1 in lib.rs). `rg_ids()` is retained because `validate()` still uses it for the 1-vs-N count check against `--readFilesIn`. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replace the manual `while i < len` index-walk with a `out_sam_attr_rg_line.split(|tok| tok == ",")` that maps each block to its tab-joined body and propagates any `ID:`-prefix error via `collect::<Result<_,_>>()`. Same public signature and return value as before, but with half the local state. Behavior note: an empty block (adjacent `,`s or a trailing `,`) now returns a `Parameter` error instead of being silently skipped. This matches STAR's `Parameters_readFilesInit.cpp` which requires every RG block to begin with `ID:`. Existing tests still pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`SamWriter::write_unmapped` had no production callers — the SE/PE write paths build unmapped records via `build_unmapped_record` and batch them through `write_batch`. The only caller was the method's own self-test. Delete both. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Fixes formatting drift inherited from main. No semantic changes. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Fixes formatting drift inherited from main. No semantic changes. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
feat: add `--runRNGseed` flag with seeded primary tie-break
feat: add `--outSAMattrRGline` with `@RG` header and `RG:Z` tags
Introduce `src/quant/transcriptome.rs` with a `TranscriptomeIndex` struct that groups GTF exons by `transcript_id`, sorts exons by start, and records absolute genome coords plus STAR's `exLenCum` (cumulative transcript-space offset) per exon. Also builds: - `tr_start` / `tr_end` — transcript bounds in absolute genome coords - `tr_order` / `tr_starts_sorted` / `tr_end_max_sorted` — sorted view used for STAR-style `binarySearch1a` + running-max early-exit in `quantAlign` - per-transcript length, strand, gene_id, chr_idx Transcripts with inconsistent chr/strand across exons or on unknown chromosomes are skipped with `log::warn!`, matching STAR's tolerant GTF handling at `sjdbInsertJunctions` time. Note: STAR persists this table as `transcriptInfo.tab` + `exonInfo.tab` at genomeGenerate; ruSTAR rebuilds it at alignment time from the GTF. Semantically equivalent. No signing — commit.gpgsign disabled locally (no key on this worktree). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…nscripts)
Port STAR's `Transcriptome::quantAlign` + `alignToTranscript` (see
`source/Transcriptome_quantAlign.cpp:5-89, 91-114`).
`align_to_transcripts` binary-searches `tr_starts_sorted` for the
greatest tr_start <= align.genome_start, then walks back while
`tr_end_max_sorted[i] >= align.genome_end`, calling
`align_to_one_transcript` on every candidate whose [tr_start, tr_end]
fully contains the alignment.
`align_to_one_transcript` walks the alignment's exon blocks and:
* locates the transcript exon containing the first block
(binary-searched `find_containing_exon`),
* rejects blocks that extend past a transcript-exon boundary,
* on each splice boundary (ruSTAR's `is_splice_boundary_before`,
which approximates STAR's `canonSJ >= 0` using read-side
continuity), requires `prev.genome_end == tr_ex[ex].genome_end &&
next.genome_start == tr_ex[ex+1].genome_start`,
* translates each block start to t-space via
`ex_len_cum[ex] + (g_pos - exon_start)`.
Reverse-strand transcripts flip all projected exons:
`read_pos' = Lread - (read_pos + len)`,
`tr_pos' = tr_length - (tr_pos + len)`,
then reverse the exon vector. `is_reverse` is XOR'd with
`tr_strand == 2`.
Projected CIGAR = original CIGAR with N operations dropped; reversed
for reverse-strand transcripts. Splices collapse so `n_junction = 0`.
ruSTAR-specific notes:
* `Transcript.exons` has no canonSJ array; splice vs indel boundaries
are discriminated by read-side continuity (insertion → read gap,
splice → read contiguous with genome gap). Pure deletions rarely
produce block splits because the stitch merge coalesces across Del.
* GTF is 1-based inclusive; ruSTAR 0-based half-open. Transcript
length = sum(end - start). STAR uses `exLenCum[last] + (end -
start) + 1` which arrives at the same numeric value on 1-based
inclusive inputs.
8 new unit tests cover: single-exon projection, two-exon with matching
junction, junction mismatch (rejected), reverse-strand flip, multi-exon
projection into a longer transcript, out-of-bounds cases, and
projection onto multiple overlapping transcripts.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Port STAR's `ReadAlign::quantTranscriptome` gatekeeping logic (see `source/ReadAlign_quantTranscriptome.cpp:9-66`) as `filter_and_project`: 1. If `!allow_indels && n_gap > 0` → reject. 2. Single-end filter — for PE, skip alignments where both ends came from the same mate. ruSTAR's `Transcript` does not carry per-block mate tags; the caller enforces this by only invoking `filter_and_project` on both-mapped PE pair mates. Documented as a no-op at the per-mate level. 3. Soft-clip extension (mode-dependent): extend leading / trailing soft-clipped bases back into matched bases, counting mismatches where both read and genome bases are < 4 and unequal. Reject if `n_mismatch + extension_mismatches > min(outFilterMismatchNmax, outFilterMismatchNoverLmax*(Lread-1))`. 4. Call `align_to_transcripts` for projection. New enum `QuantTranscriptomeSAMoutput` with three variants matching STAR's CLI strings: * `BanSingleEnd_BanIndels_ExtendSoftclip` (default, RSEM-compatible) * `BanSingleEnd` * `BanSingleEnd_ExtendSoftclip` `rebuild_cigar_without_softclips` strips the leading/trailing S ops and folds their length into the adjacent M op. Interior soft-clips are left alone. 7 new unit tests: enum FromStr/Display, mode flags, indel rejection under default mode, indel retention under BanSingleEnd, zero-mismatch softclip extension folding 4S+40M → 44M, softclip extension exceeding budget → reject, softclip preservation under BanSingleEnd mode. Divergence from STAR: ruSTAR checks for genome/read buffer bounds via `saturating_sub` + slice-length checks before indexing, to protect against edge cases where extension would read past the genome end. STAR trusts its upstream bounds. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…tion Add `quant_transcriptome_sam_output: QuantTranscriptomeSAMoutput` field with STAR-compatible default `BanSingleEnd_BanIndels_ExtendSoftclip`. Accepts the three STAR strings via the `FromStr` impl added in the previous commit. Add `Parameters::quant_transcriptome_sam()` helper that scans `quant_mode` for `"TranscriptomeSAM"`, mirroring the existing `quant_gene_counts()` helper. Extend `Parameters::validate()` to require `--sjdbGTFfile` when `--quantMode TranscriptomeSAM` is active. STAR also allows running TranscriptomeSAM in SE (the "single-end" flag in the mode name refers to per-mate, not per-run), so no PE-only gate. 5 new unit tests: default parse, flag enable, explicit mode override (BanSingleEnd and BanSingleEnd_ExtendSoftclip), validate error without GTF, validate success with GTF. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Top-level orchestration for `--quantMode TranscriptomeSAM`.
src/io/sam.rs:
* Factor `build_sam_header` into a generic
`build_sam_header_from_refs(iter, params)` that takes any
`(name, length)` iterator. The original `build_sam_header` now
delegates to it with the genome's chromosomes.
* New `SamWriter::build_transcriptome_records(...)`: builds records
with RNAME pointing at the transcriptome header (reference_sequence_id
= transcript_idx), POS = t-space position + 1, and emits only the
standard attribute set (NH/HI/AS/NM/nM). Splice-aware tags (jM, jI,
XS, MD) are dropped because splices collapse in t-space.
* `primary_hit_idx` parameter selects the randomly-chosen primary
alignment; all others get the SECONDARY (0x100) flag.
src/io/bam.rs:
* New `BamWriter::create_transcriptome(path, tr_idx, params)` builds
a BAM header with one @sq per transcript (name = transcript_id,
length = t-space length).
* New `BamWriter::header()` accessor.
* Unit test verifying transcriptome BAM writer creation + @sq count.
src/align/read_align.rs:
* `per_read_seed` changed from `fn` to `pub(crate) fn` so lib.rs can
seed the transcriptome primary-picker with the same read-scoped
seed used for genome-space tie-shuffle.
src/lib.rs:
* Build an `Arc<TranscriptomeIndex>` alongside the gene-counts
context when `--quantMode TranscriptomeSAM` is active, and thread
it through run_single_pass / run_two_pass / align_reads_single_end
/ align_reads_paired_end.
* Open `<prefix>Aligned.toTranscriptome.out.bam` via
`BamWriter::create_transcriptome` at the start of the pipeline and
flush/finish at the end.
* Extend `AlignmentBatchResults` with `transcriptome_records:
Vec<RecordBuf>`. Per-read builder helpers:
- `build_transcriptome_records_se`: projects every genome-space
alignment via `filter_and_project`, seeds `StdRng` with
`per_read_seed(run_rng_seed, read_name)`, picks a random
projected index as primary, builds records.
- `build_transcriptome_records_pe`: projects both mates
independently, keeps transcripts where both project, emits two
records per projected pair (mate1 FIRST_SEGMENT + mate2
LAST_SEGMENT). Primary pick is shared across both mates.
* Transcriptome records are written serially in batch-merge order
alongside the main SAM/BAM stream (normal and BySJout paths).
Known limitations vs STAR:
* PE pairing is done per-transcript-idx set intersection between
projected mates, not via STAR's combined 2-mate
`Transcript`/`EX_iFrag` machinery. Equivalent for the common case
where each mate maps uniquely within a given transcript; may
diverge when one mate has multiple valid positions within one
transcript.
* StdRng (inherited from `feat/run-rng-seed`) replaces STAR's
std::mt19937 — no bit-for-bit parity on which transcriptome
projection is picked as primary among ties. Per-read
determinism is preserved via `per_read_seed`.
No pass-1 wiring: two-pass pass 1 uses `None` for tr_idx (tr BAM is
only emitted from pass 2).
cargo fmt applied across the touched files plus incidental fmt
touch-ups on read_align.rs / stitch.rs / params.rs / quant/mod.rs.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
End-to-end smoke test in `tests/transcriptome_sam.rs`:
* Builds a 2000bp synthetic genome (single chromosome, pseudo-random
sequence from an LCG).
* Writes a 2-transcript GTF (T1: chr1:101-400 +, T2: chr1:601-900 +).
* Generates 20 reads (30bp) drawn alternately from the T1 and T2
exon regions.
* Runs `ruSTAR --runMode genomeGenerate` then `alignReads
--quantMode TranscriptomeSAM`.
* Asserts `Aligned.toTranscriptome.out.bam` is created, > 100 bytes,
parses as a valid BAM via noodles, and the header contains exactly
two @sq lines named T1 and T2.
Uses the same `assert_cmd::Command::cargo_bin` pattern as the existing
`phase9_threading.rs` test (inherits the same deprecation warnings;
pre-existing parity).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…iptomeSAMoutput Remove `Display` impl (never formatted anywhere) and `allow_single_end()` helper (always returns false, never called). Clean up the stale single-end comment in `filter_and_project`. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…lders `build_transcriptome_records_se` allocated a redundant `fwd` clone of `read_seq` just to unify the reverse / forward branch, and both SE + PE builders inlined the same 6-line base-complement match. Replace both with a single `rc_encode()` helper delegating to the existing `io::fastq::complement_base`. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…r mate Previously `build_transcriptome_records_pe` called the per-record builder twice *per projected pair*, passing a single-element slice and a `primary_hit_idx` sentinel (0 / usize::MAX) to force the SECONDARY flag. The builder already handles multi-alignment sets correctly, so split the pairs into mate1/mate2 slices, call the builder once per mate with the true `primary_hit`, then stamp FIRST_SEGMENT / LAST_SEGMENT on the results and interleave. Same observable output, half the per-pair work. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Both `build_transcriptome_records_se` and `_pe` duplicated the RNG seeding (`per_read_seed` → `StdRng::seed_from_u64` → `gen_range`) and the MAPQ calculation (`effective_n = n_alignments.max(n_for_mapq)`). Extract a single `pick_primary_and_mapq` helper and hoist the `use` blocks out of the function bodies. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`BamWriter::create` and `BamWriter::create_transcriptome` both opened the file, wrapped it in `BufWriter`/`bam::io::Writer`, and wrote the header. Move that boilerplate into a private `with_header` helper. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Merge the two `actual_left` shadowed lets into one `.min(..).min(..)` chain. - Drop the `actual_right` alias (right-clip never needs clamping). - Replace `ext.genome_start = ext.exons.first().map(..).unwrap_or(..)` with a simple `if let Some(first) = ext.exons.first()` — the fallback branch was unreachable because extend_softclips always enters with a non-empty exon list. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…eq clone - The 5-line single-end carve-out inside the `filter_and_project` doc said the same thing as the inline comment; fold both into a one-line mention. - Drop redundant `// (1) ... (4) ...` step labels from the function body. - `align_to_one_transcript` was cloning `align.read_seq` into the projected transcript but no consumer reads that field on a projected record (transcriptome SAM takes the read from the caller). Use `Vec::new()` to avoid the per-projection allocation. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Extended the end-to-end integration test to also pass --sjdbGTFfile at genomeGenerate and confirm that transcriptInfo.tab, exonInfo.tab, geneInfo.tab, exonGeTrInfo.tab, and sjdbList.fromGTF.out.tab are written into the genome directory with the exact byte content STAR would produce for the synthetic 2-transcript GTF used by this test. This is the best byte-format validation we can do without the full yeast test dataset (which needs to be downloaded separately via test/data_setup.sh). Once that dataset is in place, diffing ruSTAR's output against STAR's for the yeast genome gives the final parity verification. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
When a GTF exon record has no gene_name attribute, STAR's GTF.cpp uses the gene_id string as the name. When there's no gene_biotype, it uses the literal string "MissingGeneType". ruSTAR was using empty strings for both. Verified byte-for-byte parity against STAR 2.7.11b on a synthetic 2-chr / 4-transcript / 4-gene test case (/tmp/rustar_diff): all 9 small text files (chrName/chrLength/chrStart/chrNameLength.txt, transcriptInfo.tab, exonInfo.tab, geneInfo.tab, exonGeTrInfo.tab, sjdbList.fromGTF.out.tab) now diff-clean. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Generates a deterministic 2-chr / 4-transcript / 4-gene synthetic test
case, runs both STAR and ruSTAR genomeGenerate, then diffs each of the
9 files byte-for-byte.
Current result on STAR 2.7.11b: 9/9 files identical
* chrName.txt, chrLength.txt, chrStart.txt, chrNameLength.txt
* transcriptInfo.tab, exonInfo.tab, geneInfo.tab,
exonGeTrInfo.tab, sjdbList.fromGTF.out.tab
Runnable via:
RUSTAR_BIN=/path/to/target/release/ruSTAR \\
test/diff_star_transcriptome.sh /tmp/workdir
Reports with ✓/✗ per file and exits non-zero if any file differs.
Requires STAR on PATH (`brew install rna-star` or bioconda star).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Rewrote write_genome_parameters_txt to emit the exact line order, tab/space separators, and trailing whitespace STAR's genomeParametersWrite.cpp uses. Adds missing fields STAR always writes: - genomeType (Full) - genomeTransformType / genomeTransformVCF - sjdbFileChrStartEnd (with STAR's trailing space quirk) - sjdbGTFchrPrefix / sjdbGTFfeatureExon / sjdbGTFtagExonParent* - sjdbInsertSave (Basic) - versionGenome bumped to 2.7.4a (STAR's on-disk format version, not binary build version) Also adds the `### STAR <cli>` and `### GstrandBit 32` comment lines. genomeFileSizes now uses `<tab><genome_size> <sa_size>` (space between values, matching STAR's `<< "\t"` then `<< " "` pattern). Diff script extended to check genomeParameters.txt. Current result: 10/11 files byte-identical on the synthetic test case. The remaining difference is the genomeFileSizes values themselves, because ruSTAR does not yet insert GTF-derived splice junction sequences into the Genome binary — tracked as a separate phase. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…loader The `from_index_dir` path derived each transcript's chromosome index via a hand-rolled linear scan helper. `Genome::position_to_chr` already does the same thing via binary search (O(log n)) and handles the padding-gap case. Drop the duplicate helper and call through. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Raw strings like "transcript_id" / "gene_id" / "gene_name" / "gene_biotype" / "MissingGeneType" were scattered across TranscriptomeIndex building code. Hoist them to module-level `const &str` declarations so the STAR-faithful fallback is documented in one place and can't drift. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The mv-renaming trick worked but obscured that both tools need the same `--genomeDir` string for byte-identical genomeParameters.txt output. Replace it with a symlink `index` → `star_index | rustar_index` that flips between runs; both invocations pass `--genomeDir index` verbatim. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ptomeIndex tr_gene_idx + gene_ids cover all callers; the parallel Vec<String> was only populated as a convenience view and never read outside transcriptome.rs tests. Drop the field; update the two asserts that used it to look up via `gene_ids[tr_gene_idx[i] as usize]`. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… has it Previously, `--quantMode TranscriptomeSAM` at alignReads rejected invocations without --sjdbGTFfile, even when the user had already persisted transcriptInfo.tab + friends in --genomeDir at genomeGenerate. That's STAR-incompatible: STAR loads the files directly and only re-parses the GTF if explicitly passed at mapping time. Move the hard check to genomeGenerate mode only. GenomeIndex::load already surfaces a clear error if neither source is available at alignReads (via `index.transcriptome.is_none()` in src/lib.rs). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
noodles-sam-0.64 rejects `(` `)` `[` `]` `{` `}` `<` `>` `,` `\` in
@sq SN: values. yeast tRNA transcript IDs from Ensembl GTFs contain
parentheses (tP(UGG)A, tA(UGC)A, etc.) so writing a transcriptome
BAM errored with `invalid reference sequence name (<unknown>)` and
truncated the output file.
Replace BamWriter's noodles-driven header write with a local
write_bam_header_lenient that:
- Emits BAM\x01 + l_text + SAM-text block (built via
render_sam_text_lenient — iterates @hd / @sq / @rg / @pg / @co
directly from the noodles sam::Header, writing raw bytes so
forbidden chars pass through)
- Emits the binary reference list (n_ref + l_name/name/l_ref
triples) using CString for NUL-termination
Subsequent record writes still go through noodles (no validation
there). Verified with 5 yeast reads → valid BAM with 4 transcriptome
records + samtools view accepts the output.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
New module src/junction/sjdb_insert.rs with classify_motif() that returns the 0-6 motif code STAR uses (sjdbPrepare.cpp:37-50): 0 = non-canonical 1 = GT/AG(+) (donor bases GT, acceptor bases AG) 2 = CT/AC(-) 3 = GC/AG(+) 4 = CT/GC(-) 5 = AT/AC(+) 6 = GT/AT(-) First of several pieces of sjdbPrepare + sjdbBuildIndex needed to reach byte-parity on Genome / SA / SAindex / sjdbInfo.txt / sjdbList.out.tab. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Ports STAR's sjdbPrepare.cpp:52-73 repeat-shift calculation: counts how many bases the intron boundary can shift left / right while preserving the donor/acceptor base identity. Caps at 255 per junction (STAR limit). Stops on genome bounds and on any N-base. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
New `PreparedJunction` struct carrying STAR's per-junction fields (chr_idx, shift-adjusted start/end, motif, shift_left, shift_right, strand) and a `prepare_junction` free function that pulls them together from raw db coordinates. Follows STAR's sjdbPrepare.cpp semantics: - Compute motif via classify_motif - Compute shifts via compute_shifts - Subtract shift_left from both coordinates (lines 71-72) to sit at the canonical left-most representation - Derive strand from motif when db_strand is unknown (lines 184-188: 2 - motif % 2, or 0 if motif is non-canonical) Three tests cover: canonical + strand, strand derivation fallback, shift-left applied to coords. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The /simplify review flagged my classify_motif + MOTIF_* consts as a duplicate of src/align/score.rs::detect_splice_motif (behind a SpliceMotif enum) + src/junction/sj_output.rs::encode_motif (0-6 code). Two independent motif truth tables is a drift risk. - Move detect_splice_motif from AlignmentScorer method (stateless) to a free fn in src/align/score.rs; the method is now a thin wrapper. - Drop classify_motif + MOTIF_* consts from sjdb_insert. - prepare_junction calls detect_splice_motif + encode_motif instead. - compute_shifts signature changed from &[u8] → &Genome so both helpers share the same input shape; takes n_genome_real to scope the forward-strand slice and avoids redundant bounds checks per iteration. - Tests updated to build Genome instances via a local helper. 343 unit tests pass (up from 335 — the test-coverage consolidation netted more tests via score.rs's existing detect_splice_motif coverage plus the new sjdb_insert tests). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds PreparedJunction::stored_start/end/original_start/original_end so callers can request either STAR's on-disk coordinate form (canonical → original; non-canonical → shifted) or the canonical-for-extraction form used when reading Gsj flanking bytes. sort_and_dedup() ports STAR's second-pass sort + cross-strand dedup from sjdbPrepare.cpp:141-192. ruSTAR's SpliceJunctionDb already deduplicates on (chr, start, end, strand), so the first-pass intra-strand dedup branches never fire for our inputs — kept second-pass cross-strand logic: - Undefined vs defined strand → keep defined - Both non-canonical → collapse to strand 0 (undefined) - One canonical → keep canonical - Both canonical → keep the one whose strand matches 2 - motif%2 6 new tests exercise each branch of merge_cross_strand plus the sort ordering; all pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
New build_gsj() in src/junction/sjdb_insert.rs reproduces STAR's
sjdbPrepare.cpp:203-215 in-memory buffer: per junction, sjdb_overhang
donor bases + sjdb_overhang acceptor bases + 1 spacer byte (value 5,
STAR's GENOME_spacingChar). Extraction uses ORIGINAL (pre-shift)
coordinates regardless of motif, relying on the new
PreparedJunction::original_{start,end} helpers.
5 tests: canonical single junction, non-canonical junction (verifies
shift reversal), multi-junction concatenation with independent spacers,
plus two bounds-error cases (donor underflow, acceptor overrun).
Next commit will wire the Gsj bytes onto the Genome binary + extend
the SA.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two STAR-faithful text writers matching `sjdbPrepare.cpp:196-222`:
- `write_sjdb_info_tab`: header `<n>\t<overhang>\n`, then
`stored_start\tstored_end\tmotif\tshiftL\tshiftR\tstrand\n` per junction.
- `write_sjdb_list_out_tab`: `chr\t<1-based start>\t<1-based end>\t<strand_char>\n`
with chromosome-local, pre-shift ("original") coordinates.
Both use BufWriter + Error::io(path) matching the existing writer
convention in sj_output.rs. Strand char table matches STAR's
`{'.','+','-'}` for 0/1/2.
Three byte-match tests covering the header, canonical vs non-canonical
coord handling, and multi-chromosome chr_start offsets. 22/22
sjdb_insert tests pass; full suite 357/357, 0 new clippy warnings.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Matches STAR's in-memory layout after `sjdbBuildIndex.cpp:293`: Gsj bytes live immediately after the forward real-genome bytes; chr_start[n_chr_real] stays pinned at the pre-sjdb forward total (verified against STAR's chrStart.txt for the 10k yeast test case: pre-sjdb boundary = 262144, n_genome post-sjdb = 263139 = 262144 + 5*199). n_genome grows to include Gsj. chr_start / chr_length / chr_name / n_chr_real are NOT updated — Gsj lives outside the chromosome accounting. Two unit tests: one verifies forward layout + RC regeneration; one verifies empty-Gsj is a no-op. 359/359 unit + 4/4 integration tests pass; 0 clippy warnings in genome/mod.rs. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…iff harness With a GTF provided, GenomeIndex::build now: 1. Parses the GTF once (shared between junction_db, transcriptome, and sjdb_insert — was parsed twice before). 2. Converts chr-local 1-based GTF intron coords to 0-based absolute genome offsets for sjdb_insert::prepare_junction. 3. Sort + dedup prepared junctions via sjdb_insert::sort_and_dedup. 4. Build Gsj buffer and append to Genome (Genome::append_sjdb) BEFORE the suffix array is built — STAR indexes Gsj alongside the real genome in a single SA (sjdbBuildIndex.cpp:293). 5. Stash sorted prepared junctions in a new `prepared_junctions` field so GenomeIndex::write() can emit sjdbInfo.txt + sjdbList.out.tab. GenomeIndex::load() now reads n_genome from genomeParameters.txt's `genomeFileSizes` line (the chr_start boundary is only the real-genome forward total, not the post-sjdb total). Also adds SpliceJunctionDb::from_raw_junctions helper so from_gtf and the build path share the HashMap-building code. **Validation** (test/diff_star_transcriptome.sh, extended): 15/15 files byte-identical to STAR docker output, including: - Genome (real + Gsj appended; verified against STAR 2.7.11b) - sjdbInfo.txt - sjdbList.out.tab - genomeParameters.txt (genomeFileSizes reflects post-sjdb n_genome) SA / SAindex: size-matched (51929 / 6027 bytes), content differs as expected — STAR's bucketed-parallel SA construction breaks ties differently than ruSTAR's comparator-based sort. Size parity proves we indexed the same genome+Gsj length. Round-trip verified: ruSTAR can load the extended index it built and align reads end-to-end (load path reads `n_genome=524585` from the `genomeFileSizes` line, confirming Gsj bytes are included). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
5 tasks
7 tasks
Resolves conflicts between transcriptome-SAM work and dev's PE stitcher rewrites (Phase E/E3/E4 — diagonal dedup, combined seed search, combined-threshold fallback, PE-CHECK2 unconditional). Conflict resolutions in src/align/read_align.rs: 1. Imports — union of both sides: keep dev's `finalize_transcript`, `split_combined_wt`, `stitch_seeds_core`, `PE_SPACER_BASE` and keep this branch's `Exon` import (transcriptome projection uses `Exon::i_frag`). 2. PE stitching block — accept dev. The obsolete per-mate stitching loops (`mate1_transcripts` / `mate2_transcripts` via `stitch_seeds_with_jdb_debug` per mate) are replaced by dev's combined-cluster approach (`stitch_seeds_core` + `split_combined_wt` on a PE_SPACER_BASE-joined combined read). The result populates `single_mate1_transcripts` / `single_mate2_transcripts` for the half-mapped fallback path, which this branch already uses. 3. Half-mapped threshold — accept dev's combined threshold (Lread-1 = len1+len2), matching STAR's `ReadAlign_mappedFilter.cpp`. The per-mate threshold on this branch was a STAR deviation. 4. Minor formatting on `best_pa = paired_alns.iter().max_by_key(...)` — accept dev. src/align/stitch.rs — auto-merged cleanly: dev's mate_id-aware diagonal dedup (per-fragment `HashMap<(i64, u8), ...>`) plus this branch's `pub(crate)` visibility bump on `PE_SPACER_BASE` and `i_frag: 0` on the Exon constructor (PE pair-building rewrites mate2's exons to `i_frag = 1`). CLAUDE.md — auto-merged: dev's Phase E4 status section. Verification: - cargo test: 335 unit + 4 integration + 1 doc, all passing - cargo clippy --all-targets: 27 warnings (pre-existing baseline, 0 new) - cargo fmt --check: clean - test/diff_star_transcriptome.sh: 9/10 files byte-identical to STAR 2.7.11b (genomeParameters.txt still diverges on genomeFileSizes because this branch doesn't yet bake sjdb into the genome — that's PR scverse#8's scope). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…upply the combined score After merging dev's combined-cluster PE stitching (commit 45792dd), the only caller of `try_pair_transcripts` always supplies a combined WT score — the per-mate fallback path that drove the `Option<i32>` override was removed with the merge. - Collapse `combined_wt_score_override: Option<i32>` → `combined_wt_score: i32`. - Drop the `scorer` parameter (its only use was in the fallback closure). - Delete the dead per-mate fallback that recomputed the score from t1.score + t2.score - p1 - p2 + combined_p. - Fold identical `thresh1` / `thresh2` half-mapped thresholds into a single `single_mate_threshold` variable (both expressions were verbatim identical post-merge). Net -19 lines; same test pass-count (340), same STAR diff-script baseline (9/10 byte-identical). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…t/transcriptome-sam-sj-insertion
Collaborator
|
Okay i've pull this over into dev and resolved the conflicts. all of your commits are preserved. cheers |
Psy-Fer
added a commit
that referenced
this pull request
Apr 24, 2026
- PreparedJunction pipeline: motif detection, shift_left/shift_right - build_gsj(): concatenated SJ flanking-sequence buffer - sort + cross-strand dedup for prepared junctions - Genome::append_sjdb(): extend forward genome + rebuild RC - sjdbInfo.txt + sjdbList.out.tab writers - Wire sjdb insertion into GenomeIndex::build - Extended byte-parity diff harness
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Note
Stacked on top of #7 (
--quantMode TranscriptomeSAM). This PR currently includes the 41 commits from that branch plus the 9 new SJ-insertion commits. Once #7 is merged, this branch will be rebased ontomainand the diff will shrink to the SJ-insertion work only. Review the latest 9 commits (frombc1aa6eonwards) for the new material.Summary
Ports STAR's
sjdbPrepare.cpp+sjdbBuildIndex.cppto ruSTAR. AtgenomeGenerate, GTF-derived splice junctions are now baked into theGenomebinary and indexed by the suffix array, matching STAR's layout byte-for-byte.Validation — 15/15 files match STAR at genomeGenerate
test/diff_star_transcriptome.shagainst STAR 2.7.11b on the synthetic 2-chr / 4-transcript / 3-junction test case:SA/SAindex content divergence is expected: STAR's bucketed parallel SA construction breaks ties differently than ruSTAR's comparator-based sort. Size parity proves we indexed the same genome+Gsj length. Byte-for-byte SA parity would require reimplementing STAR's SA algorithm — out of scope for this PR.
Round-trip verified: ruSTAR can load the extended index it built and align reads end-to-end (load path correctly reads
n_genome=524585fromgenomeFileSizes, confirming Gsj bytes are included alongside the real genome bytes).What changed (SJ-insertion commits only)
src/junction/sjdb_insert.rs(new module, 22 unit tests)compute_shifts— STAR'ssjdbShiftLeft/sjdbShiftRight(repeat length across intron boundary, capped at 255, stops at N).PreparedJunction+prepare_junction— combines motif detection, shifts, strand derivation.stored_start/end(forsjdbInfo.txt) vsoriginal_start/end(for Gsj extraction) follow STAR's convention where canonical motifs store the pre-shift ORIGINAL coord and non-canonical ones store the SHIFTED coord.sort_and_dedup— STAR's post-sort + cross-strand dedup (sjdbPrepare.cpp:141-192).build_gsj— concatenated flanking sequences,2*overhang+1bytes per junction (donor + acceptor + spacer=5).write_sjdb_info_tab+write_sjdb_list_out_tab— byte-exact file writers.src/genome/mod.rs—Genome::append_sjdb()Extends the forward buffer with Gsj, rebuilds the RC over the extended range.
chr_start[n_chr_real]stays pinned at the pre-sjdb forward total (matches STAR'schrStart.txt).n_genomegrows to include Gsj.src/index/mod.rs—GenomeIndex::buildorchestrationGenome::append_sjdb()BEFORE the SA is built (STAR indexes Gsj alongside the real genome in a single SA,sjdbBuildIndex.cpp:293).prepared_junctionsfield sowrite()can emitsjdbInfo.txt+sjdbList.out.tab.src/index/io.rs— load pathReads
n_genomefromgenomeParameters.txt'sgenomeFileSizesline. Thechr_startboundary is only the pre-sjdb forward total, so with sjdb-extended indices the two values diverge andchr_startcannot be used as the total genome size.Other
src/junction/mod.rs—SpliceJunctionDb::from_raw_junctionshelper (shared byfrom_gtfand the build path, avoiding GTF-parse-twice).test/diff_star_transcriptome.sh— extended from 10 files to 15 (added Genome, sjdbInfo.txt, sjdbList.out.tab, plus size-only check for SA/SAindex).Test plan
cargo test— 359 unit + 4 integration + 1 doc test, all passingcargo clippy --all-targets— 0 new warningscargo fmt --check— cleantest/diff_star_transcriptome.sh— 15/15 files identical/size-matched against STAR 2.7.11b🤖 Generated with Claude Code