feat: add --quantMode TranscriptomeSAM output#7
Conversation
Adds the STAR `--runRNGseed` CLI flag (default 777) and uses it to randomize primary alignment selection among equal-scoring multi-mappers. Previously primary selection was purely deterministic (lexicographic tie-break), which meant the 127 "genuine tie" disagreements against STAR on 10k SE yeast were frozen in whichever direction our sort happened to pick. With a seeded shuffle, ties flip independently of sort order -- matching STAR's behavior at `ReadAlign_multMapSelect.cpp:71-79` and `funPrimaryAlignMark.cpp:19-28`. Unblocks nf-core/rnaseq, which invokes STAR_ALIGN with `--runRNGseed 0`. Implementation notes: - `shuffle_tied_prefix` does an in-place Fisher-Yates on the contiguous top-score prefix only, leaving non-tied lower-scored alignments alone. Matches STAR's two-phase "move best to front, then shuffle front" (collapsed since the input is already score-sorted). - Applied at the final primary-selection point in both SE (`align_read`) and PE (`align_paired_read`). Not applied in `stitch.rs` per-window sorts -- STAR only shuffles at multMapSelect / primary-mark time. - STAR seeds `std::mt19937` once per chunk and advances per read. ruSTAR parallelises per-read via rayon, so instead of a per-chunk RNG we derive a deterministic per-read seed from `run_rng_seed * (hash(read_name) + 1)`. This is stronger reproducibility than STAR (independent of thread count) while honoring `--runRNGseed`. - Uses `rand::rngs::StdRng` rather than `rand_mt`. Not bit-for-bit parity with STAR's tie-breaking choices (STAR's `std::uniform_real_distribution` is libstdc++-specific anyway) -- it just has to honor the seed deterministically. Tests: +5 (parse-default, parse-override, four shuffle behavior tests). 283 unit tests passing (was 278), 4 integration tests passing. Note: commit is unsigned because local 1Password SSH signing (op-ssh-sign) returned "failed to fill whole buffer" on every attempt. Feel free to re-sign with `git commit --amend -S` once the signer is back. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Parses STAR's comma-separated multi-RG syntax (first field must be ID: on each block, `,` separates blocks, tokens are tab-joined into the SAM @rg body). Replicates a single RG line across N input files like STAR does, and errors on count mismatch or RG-in-outSAMattributes without a line. Auto-appends RG to sam_attribute_set() when a line is set (matches Parameters_samAttributes.cpp:201). SAM/BAM headers now emit one @rg per unique ID, and every record (mapped, unmapped, half-mapped, both- mates) carries RG:Z:<id> when attrs contain RG. Unblocks nf-core/rnaseq, which invokes STAR with `--outSAMattrRGline 'ID:X' 'SM:X'` and previously errored on the unknown flag. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replace the hand-rolled Fisher-Yates loop with `rand::seq::SliceRandom::shuffle` on the tied prefix, and collapse the length guard with `slice::first()`. Same behavior (shuffles the [0..tied) range with the seeded RNG), fewer lines, and no need to import `Rng`. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Drop the Fisher-Yates reference in `shuffle_tied_prefix`'s doc (the impl now delegates to `SliceRandom::shuffle` — the algorithm is an implementation detail). Collapse the SE call-site comment to one line since the fn doc already covers the "why"; keep the STAR source file + line reference. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`DefaultHasher` is re-exported from `std::hash` as of Rust 1.76, so there's no need to pull it in via the full `std::collections::hash_map` path. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…efix Single-use generic parameter — `impl Fn(&T) -> i32` in the argument list is shorter than a `where F: Fn(&T) -> i32` bound and reads more naturally at the call sites we have. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Remove the `let _ = params.rg_ids()?;` early-validation line in
`align_reads_paired_end`. `validate()` already calls `rg_ids()` at
startup, so the second call was a no-op.
- Simplify `maybe_insert_rg_tag` to `(record, rg_id)` by dropping the
`attrs: &HashSet<String>` parameter and its `attrs.contains("RG")`
gate. `sam_attribute_set()` auto-inserts "RG" whenever an RG line is
configured, so `rg_id.is_some()` already implies the tag is wanted.
- Replace the inline `if let Some(id) = rg_id { ... }` block in
`build_unmapped_record` with a call to the simplified helper, so every
writer-side RG insertion goes through the same path.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Add `Parameters::primary_rg_id() -> Result<Option<String>, Error>` that returns the first RG line's `ID:` value directly, and use it to replace the `let rg_ids = params.rg_ids()?; let rg_id = rg_ids.first()...` two-step pattern at every SAM/BAM writer callsite (5 in sam.rs, 1 in lib.rs). `rg_ids()` is retained because `validate()` still uses it for the 1-vs-N count check against `--readFilesIn`. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replace the manual `while i < len` index-walk with a `out_sam_attr_rg_line.split(|tok| tok == ",")` that maps each block to its tab-joined body and propagates any `ID:`-prefix error via `collect::<Result<_,_>>()`. Same public signature and return value as before, but with half the local state. Behavior note: an empty block (adjacent `,`s or a trailing `,`) now returns a `Parameter` error instead of being silently skipped. This matches STAR's `Parameters_readFilesInit.cpp` which requires every RG block to begin with `ID:`. Existing tests still pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`SamWriter::write_unmapped` had no production callers — the SE/PE write paths build unmapped records via `build_unmapped_record` and batch them through `write_batch`. The only caller was the method's own self-test. Delete both. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Fixes formatting drift inherited from main. No semantic changes. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Fixes formatting drift inherited from main. No semantic changes. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
--quantMode TranscriptomeSAM output
feat: add `--runRNGseed` flag with seeded primary tie-break
feat: add `--outSAMattrRGline` with `@RG` header and `RG:Z` tags
Introduce `src/quant/transcriptome.rs` with a `TranscriptomeIndex` struct that groups GTF exons by `transcript_id`, sorts exons by start, and records absolute genome coords plus STAR's `exLenCum` (cumulative transcript-space offset) per exon. Also builds: - `tr_start` / `tr_end` — transcript bounds in absolute genome coords - `tr_order` / `tr_starts_sorted` / `tr_end_max_sorted` — sorted view used for STAR-style `binarySearch1a` + running-max early-exit in `quantAlign` - per-transcript length, strand, gene_id, chr_idx Transcripts with inconsistent chr/strand across exons or on unknown chromosomes are skipped with `log::warn!`, matching STAR's tolerant GTF handling at `sjdbInsertJunctions` time. Note: STAR persists this table as `transcriptInfo.tab` + `exonInfo.tab` at genomeGenerate; ruSTAR rebuilds it at alignment time from the GTF. Semantically equivalent. No signing — commit.gpgsign disabled locally (no key on this worktree). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…nscripts)
Port STAR's `Transcriptome::quantAlign` + `alignToTranscript` (see
`source/Transcriptome_quantAlign.cpp:5-89, 91-114`).
`align_to_transcripts` binary-searches `tr_starts_sorted` for the
greatest tr_start <= align.genome_start, then walks back while
`tr_end_max_sorted[i] >= align.genome_end`, calling
`align_to_one_transcript` on every candidate whose [tr_start, tr_end]
fully contains the alignment.
`align_to_one_transcript` walks the alignment's exon blocks and:
* locates the transcript exon containing the first block
(binary-searched `find_containing_exon`),
* rejects blocks that extend past a transcript-exon boundary,
* on each splice boundary (ruSTAR's `is_splice_boundary_before`,
which approximates STAR's `canonSJ >= 0` using read-side
continuity), requires `prev.genome_end == tr_ex[ex].genome_end &&
next.genome_start == tr_ex[ex+1].genome_start`,
* translates each block start to t-space via
`ex_len_cum[ex] + (g_pos - exon_start)`.
Reverse-strand transcripts flip all projected exons:
`read_pos' = Lread - (read_pos + len)`,
`tr_pos' = tr_length - (tr_pos + len)`,
then reverse the exon vector. `is_reverse` is XOR'd with
`tr_strand == 2`.
Projected CIGAR = original CIGAR with N operations dropped; reversed
for reverse-strand transcripts. Splices collapse so `n_junction = 0`.
ruSTAR-specific notes:
* `Transcript.exons` has no canonSJ array; splice vs indel boundaries
are discriminated by read-side continuity (insertion → read gap,
splice → read contiguous with genome gap). Pure deletions rarely
produce block splits because the stitch merge coalesces across Del.
* GTF is 1-based inclusive; ruSTAR 0-based half-open. Transcript
length = sum(end - start). STAR uses `exLenCum[last] + (end -
start) + 1` which arrives at the same numeric value on 1-based
inclusive inputs.
8 new unit tests cover: single-exon projection, two-exon with matching
junction, junction mismatch (rejected), reverse-strand flip, multi-exon
projection into a longer transcript, out-of-bounds cases, and
projection onto multiple overlapping transcripts.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Port STAR's `ReadAlign::quantTranscriptome` gatekeeping logic (see `source/ReadAlign_quantTranscriptome.cpp:9-66`) as `filter_and_project`: 1. If `!allow_indels && n_gap > 0` → reject. 2. Single-end filter — for PE, skip alignments where both ends came from the same mate. ruSTAR's `Transcript` does not carry per-block mate tags; the caller enforces this by only invoking `filter_and_project` on both-mapped PE pair mates. Documented as a no-op at the per-mate level. 3. Soft-clip extension (mode-dependent): extend leading / trailing soft-clipped bases back into matched bases, counting mismatches where both read and genome bases are < 4 and unequal. Reject if `n_mismatch + extension_mismatches > min(outFilterMismatchNmax, outFilterMismatchNoverLmax*(Lread-1))`. 4. Call `align_to_transcripts` for projection. New enum `QuantTranscriptomeSAMoutput` with three variants matching STAR's CLI strings: * `BanSingleEnd_BanIndels_ExtendSoftclip` (default, RSEM-compatible) * `BanSingleEnd` * `BanSingleEnd_ExtendSoftclip` `rebuild_cigar_without_softclips` strips the leading/trailing S ops and folds their length into the adjacent M op. Interior soft-clips are left alone. 7 new unit tests: enum FromStr/Display, mode flags, indel rejection under default mode, indel retention under BanSingleEnd, zero-mismatch softclip extension folding 4S+40M → 44M, softclip extension exceeding budget → reject, softclip preservation under BanSingleEnd mode. Divergence from STAR: ruSTAR checks for genome/read buffer bounds via `saturating_sub` + slice-length checks before indexing, to protect against edge cases where extension would read past the genome end. STAR trusts its upstream bounds. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…tion Add `quant_transcriptome_sam_output: QuantTranscriptomeSAMoutput` field with STAR-compatible default `BanSingleEnd_BanIndels_ExtendSoftclip`. Accepts the three STAR strings via the `FromStr` impl added in the previous commit. Add `Parameters::quant_transcriptome_sam()` helper that scans `quant_mode` for `"TranscriptomeSAM"`, mirroring the existing `quant_gene_counts()` helper. Extend `Parameters::validate()` to require `--sjdbGTFfile` when `--quantMode TranscriptomeSAM` is active. STAR also allows running TranscriptomeSAM in SE (the "single-end" flag in the mode name refers to per-mate, not per-run), so no PE-only gate. 5 new unit tests: default parse, flag enable, explicit mode override (BanSingleEnd and BanSingleEnd_ExtendSoftclip), validate error without GTF, validate success with GTF. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Top-level orchestration for `--quantMode TranscriptomeSAM`.
src/io/sam.rs:
* Factor `build_sam_header` into a generic
`build_sam_header_from_refs(iter, params)` that takes any
`(name, length)` iterator. The original `build_sam_header` now
delegates to it with the genome's chromosomes.
* New `SamWriter::build_transcriptome_records(...)`: builds records
with RNAME pointing at the transcriptome header (reference_sequence_id
= transcript_idx), POS = t-space position + 1, and emits only the
standard attribute set (NH/HI/AS/NM/nM). Splice-aware tags (jM, jI,
XS, MD) are dropped because splices collapse in t-space.
* `primary_hit_idx` parameter selects the randomly-chosen primary
alignment; all others get the SECONDARY (0x100) flag.
src/io/bam.rs:
* New `BamWriter::create_transcriptome(path, tr_idx, params)` builds
a BAM header with one @sq per transcript (name = transcript_id,
length = t-space length).
* New `BamWriter::header()` accessor.
* Unit test verifying transcriptome BAM writer creation + @sq count.
src/align/read_align.rs:
* `per_read_seed` changed from `fn` to `pub(crate) fn` so lib.rs can
seed the transcriptome primary-picker with the same read-scoped
seed used for genome-space tie-shuffle.
src/lib.rs:
* Build an `Arc<TranscriptomeIndex>` alongside the gene-counts
context when `--quantMode TranscriptomeSAM` is active, and thread
it through run_single_pass / run_two_pass / align_reads_single_end
/ align_reads_paired_end.
* Open `<prefix>Aligned.toTranscriptome.out.bam` via
`BamWriter::create_transcriptome` at the start of the pipeline and
flush/finish at the end.
* Extend `AlignmentBatchResults` with `transcriptome_records:
Vec<RecordBuf>`. Per-read builder helpers:
- `build_transcriptome_records_se`: projects every genome-space
alignment via `filter_and_project`, seeds `StdRng` with
`per_read_seed(run_rng_seed, read_name)`, picks a random
projected index as primary, builds records.
- `build_transcriptome_records_pe`: projects both mates
independently, keeps transcripts where both project, emits two
records per projected pair (mate1 FIRST_SEGMENT + mate2
LAST_SEGMENT). Primary pick is shared across both mates.
* Transcriptome records are written serially in batch-merge order
alongside the main SAM/BAM stream (normal and BySJout paths).
Known limitations vs STAR:
* PE pairing is done per-transcript-idx set intersection between
projected mates, not via STAR's combined 2-mate
`Transcript`/`EX_iFrag` machinery. Equivalent for the common case
where each mate maps uniquely within a given transcript; may
diverge when one mate has multiple valid positions within one
transcript.
* StdRng (inherited from `feat/run-rng-seed`) replaces STAR's
std::mt19937 — no bit-for-bit parity on which transcriptome
projection is picked as primary among ties. Per-read
determinism is preserved via `per_read_seed`.
No pass-1 wiring: two-pass pass 1 uses `None` for tr_idx (tr BAM is
only emitted from pass 2).
cargo fmt applied across the touched files plus incidental fmt
touch-ups on read_align.rs / stitch.rs / params.rs / quant/mod.rs.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
End-to-end smoke test in `tests/transcriptome_sam.rs`:
* Builds a 2000bp synthetic genome (single chromosome, pseudo-random
sequence from an LCG).
* Writes a 2-transcript GTF (T1: chr1:101-400 +, T2: chr1:601-900 +).
* Generates 20 reads (30bp) drawn alternately from the T1 and T2
exon regions.
* Runs `ruSTAR --runMode genomeGenerate` then `alignReads
--quantMode TranscriptomeSAM`.
* Asserts `Aligned.toTranscriptome.out.bam` is created, > 100 bytes,
parses as a valid BAM via noodles, and the header contains exactly
two @sq lines named T1 and T2.
Uses the same `assert_cmd::Command::cargo_bin` pattern as the existing
`phase9_threading.rs` test (inherits the same deprecation warnings;
pre-existing parity).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…iptomeSAMoutput Remove `Display` impl (never formatted anywhere) and `allow_single_end()` helper (always returns false, never called). Clean up the stale single-end comment in `filter_and_project`. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…lders `build_transcriptome_records_se` allocated a redundant `fwd` clone of `read_seq` just to unify the reverse / forward branch, and both SE + PE builders inlined the same 6-line base-complement match. Replace both with a single `rc_encode()` helper delegating to the existing `io::fastq::complement_base`. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…r mate Previously `build_transcriptome_records_pe` called the per-record builder twice *per projected pair*, passing a single-element slice and a `primary_hit_idx` sentinel (0 / usize::MAX) to force the SECONDARY flag. The builder already handles multi-alignment sets correctly, so split the pairs into mate1/mate2 slices, call the builder once per mate with the true `primary_hit`, then stamp FIRST_SEGMENT / LAST_SEGMENT on the results and interleave. Same observable output, half the per-pair work. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Both `build_transcriptome_records_se` and `_pe` duplicated the RNG seeding (`per_read_seed` → `StdRng::seed_from_u64` → `gen_range`) and the MAPQ calculation (`effective_n = n_alignments.max(n_for_mapq)`). Extract a single `pick_primary_and_mapq` helper and hoist the `use` blocks out of the function bodies. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`BamWriter::create` and `BamWriter::create_transcriptome` both opened the file, wrapped it in `BufWriter`/`bam::io::Writer`, and wrote the header. Move that boilerplate into a private `with_header` helper. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Mirrors STAR's Transcript::exons[i][EX_iFrag] (IncludeDefine.h:209). Value 0 for single-end / mate1, 1 for mate2. This is pure plumbing: - Exon gains `i_frag: u8` field with doc comment pointing at STAR's source. - Stitcher (src/align/stitch.rs) sets `i_frag: 0` on all exons it produces — SE and per-mate PE go through here before any pair combining. - Transcriptome projection (src/quant/transcriptome.rs) propagates `block.i_frag` onto projected exons, so transcript-space alignments carry mate identity. - All Exon construction sites in test code updated to set `i_frag: 0` (SE contexts). No behavior change yet — PE pair construction still concatenates two i_frag=0 transcripts. T.3 will introduce combined-pair Transcript with mate2 exons rewritten to i_frag=1, and T.4 will use this field to implement STAR's native single-end filter (ReadAlign_quantTranscriptome.cpp:17). 333 unit + 1 integration test pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
align_to_one_transcript now detects block-to-block mate boundaries via i_frag change (equivalent to STAR's canonSJ == -3) and re-locates the transcript exon for the new block's genome position. Within-mate splice boundaries still use the `ex += 1` advance; cross-mate boundaries do NOT require the transcript-junction identity check. Lays the groundwork for PE combined-pair projection. Output unchanged for existing tests (no combined Transcript yet passed through). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Builds a STAR-style single-Transcript view of a PE pair: mate1 exons followed by mate2 exons with i_frag rewritten to 1. This is the structure STAR feeds to Transcriptome_quantAlign for PE reads. cigar / junction_motifs / read_seq fields are left empty — the combined transcript is consumed only by the projection logic in T.3's next step, which splits the result back to per-mate BAM records. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Extended the end-to-end integration test to also pass --sjdbGTFfile at genomeGenerate and confirm that transcriptInfo.tab, exonInfo.tab, geneInfo.tab, exonGeTrInfo.tab, and sjdbList.fromGTF.out.tab are written into the genome directory with the exact byte content STAR would produce for the synthetic 2-transcript GTF used by this test. This is the best byte-format validation we can do without the full yeast test dataset (which needs to be downloaded separately via test/data_setup.sh). Once that dataset is in place, diffing ruSTAR's output against STAR's for the yeast genome gives the final parity verification. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
When a GTF exon record has no gene_name attribute, STAR's GTF.cpp uses the gene_id string as the name. When there's no gene_biotype, it uses the literal string "MissingGeneType". ruSTAR was using empty strings for both. Verified byte-for-byte parity against STAR 2.7.11b on a synthetic 2-chr / 4-transcript / 4-gene test case (/tmp/rustar_diff): all 9 small text files (chrName/chrLength/chrStart/chrNameLength.txt, transcriptInfo.tab, exonInfo.tab, geneInfo.tab, exonGeTrInfo.tab, sjdbList.fromGTF.out.tab) now diff-clean. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Generates a deterministic 2-chr / 4-transcript / 4-gene synthetic test
case, runs both STAR and ruSTAR genomeGenerate, then diffs each of the
9 files byte-for-byte.
Current result on STAR 2.7.11b: 9/9 files identical
* chrName.txt, chrLength.txt, chrStart.txt, chrNameLength.txt
* transcriptInfo.tab, exonInfo.tab, geneInfo.tab,
exonGeTrInfo.tab, sjdbList.fromGTF.out.tab
Runnable via:
RUSTAR_BIN=/path/to/target/release/ruSTAR \\
test/diff_star_transcriptome.sh /tmp/workdir
Reports with ✓/✗ per file and exits non-zero if any file differs.
Requires STAR on PATH (`brew install rna-star` or bioconda star).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Rewrote write_genome_parameters_txt to emit the exact line order, tab/space separators, and trailing whitespace STAR's genomeParametersWrite.cpp uses. Adds missing fields STAR always writes: - genomeType (Full) - genomeTransformType / genomeTransformVCF - sjdbFileChrStartEnd (with STAR's trailing space quirk) - sjdbGTFchrPrefix / sjdbGTFfeatureExon / sjdbGTFtagExonParent* - sjdbInsertSave (Basic) - versionGenome bumped to 2.7.4a (STAR's on-disk format version, not binary build version) Also adds the `### STAR <cli>` and `### GstrandBit 32` comment lines. genomeFileSizes now uses `<tab><genome_size> <sa_size>` (space between values, matching STAR's `<< "\t"` then `<< " "` pattern). Diff script extended to check genomeParameters.txt. Current result: 10/11 files byte-identical on the synthetic test case. The remaining difference is the genomeFileSizes values themselves, because ruSTAR does not yet insert GTF-derived splice junction sequences into the Genome binary — tracked as a separate phase. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…loader The `from_index_dir` path derived each transcript's chromosome index via a hand-rolled linear scan helper. `Genome::position_to_chr` already does the same thing via binary search (O(log n)) and handles the padding-gap case. Drop the duplicate helper and call through. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Raw strings like "transcript_id" / "gene_id" / "gene_name" / "gene_biotype" / "MissingGeneType" were scattered across TranscriptomeIndex building code. Hoist them to module-level `const &str` declarations so the STAR-faithful fallback is documented in one place and can't drift. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The mv-renaming trick worked but obscured that both tools need the same `--genomeDir` string for byte-identical genomeParameters.txt output. Replace it with a symlink `index` → `star_index | rustar_index` that flips between runs; both invocations pass `--genomeDir index` verbatim. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ptomeIndex tr_gene_idx + gene_ids cover all callers; the parallel Vec<String> was only populated as a convenience view and never read outside transcriptome.rs tests. Drop the field; update the two asserts that used it to look up via `gene_ids[tr_gene_idx[i] as usize]`. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… has it Previously, `--quantMode TranscriptomeSAM` at alignReads rejected invocations without --sjdbGTFfile, even when the user had already persisted transcriptInfo.tab + friends in --genomeDir at genomeGenerate. That's STAR-incompatible: STAR loads the files directly and only re-parses the GTF if explicitly passed at mapping time. Move the hard check to genomeGenerate mode only. GenomeIndex::load already surfaces a clear error if neither source is available at alignReads (via `index.transcriptome.is_none()` in src/lib.rs). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
noodles-sam-0.64 rejects `(` `)` `[` `]` `{` `}` `<` `>` `,` `\` in
@sq SN: values. yeast tRNA transcript IDs from Ensembl GTFs contain
parentheses (tP(UGG)A, tA(UGC)A, etc.) so writing a transcriptome
BAM errored with `invalid reference sequence name (<unknown>)` and
truncated the output file.
Replace BamWriter's noodles-driven header write with a local
write_bam_header_lenient that:
- Emits BAM\x01 + l_text + SAM-text block (built via
render_sam_text_lenient — iterates @hd / @sq / @rg / @pg / @co
directly from the noodles sam::Header, writing raw bytes so
forbidden chars pass through)
- Emits the binary reference list (n_ref + l_name/name/l_ref
triples) using CString for NUL-termination
Subsequent record writes still go through noodles (no validation
there). Verified with 5 yeast reads → valid BAM with 4 transcriptome
records + samtools view accepts the output.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
Hmm this is going to be fun to merge with complex changes on dev branch. Especially given the align and stitcher changes with iFrag. I'll get through this in the next few days James |
|
Arg, I did not notice you had a dev branch sorry 🙈 I can point Claude at dev and ask it to rebase. Better to do that with its existing context I suspect. |
Resolves conflicts between transcriptome-SAM work and dev's PE stitcher rewrites (Phase E/E3/E4 — diagonal dedup, combined seed search, combined-threshold fallback, PE-CHECK2 unconditional). Conflict resolutions in src/align/read_align.rs: 1. Imports — union of both sides: keep dev's `finalize_transcript`, `split_combined_wt`, `stitch_seeds_core`, `PE_SPACER_BASE` and keep this branch's `Exon` import (transcriptome projection uses `Exon::i_frag`). 2. PE stitching block — accept dev. The obsolete per-mate stitching loops (`mate1_transcripts` / `mate2_transcripts` via `stitch_seeds_with_jdb_debug` per mate) are replaced by dev's combined-cluster approach (`stitch_seeds_core` + `split_combined_wt` on a PE_SPACER_BASE-joined combined read). The result populates `single_mate1_transcripts` / `single_mate2_transcripts` for the half-mapped fallback path, which this branch already uses. 3. Half-mapped threshold — accept dev's combined threshold (Lread-1 = len1+len2), matching STAR's `ReadAlign_mappedFilter.cpp`. The per-mate threshold on this branch was a STAR deviation. 4. Minor formatting on `best_pa = paired_alns.iter().max_by_key(...)` — accept dev. src/align/stitch.rs — auto-merged cleanly: dev's mate_id-aware diagonal dedup (per-fragment `HashMap<(i64, u8), ...>`) plus this branch's `pub(crate)` visibility bump on `PE_SPACER_BASE` and `i_frag: 0` on the Exon constructor (PE pair-building rewrites mate2's exons to `i_frag = 1`). CLAUDE.md — auto-merged: dev's Phase E4 status section. Verification: - cargo test: 335 unit + 4 integration + 1 doc, all passing - cargo clippy --all-targets: 27 warnings (pre-existing baseline, 0 new) - cargo fmt --check: clean - test/diff_star_transcriptome.sh: 9/10 files byte-identical to STAR 2.7.11b (genomeParameters.txt still diverges on genomeFileSizes because this branch doesn't yet bake sjdb into the genome — that's PR scverse#8's scope). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
Nah it's okay I can take care of it. Mad ea Dev branch specifically so it would tie in better with doing PRs that would then trigger all the CI and release stuff after I fixed some really annoying things Claude was doing. |
…upply the combined score After merging dev's combined-cluster PE stitching (commit 45792dd), the only caller of `try_pair_transcripts` always supplies a combined WT score — the per-mate fallback path that drove the `Option<i32>` override was removed with the merge. - Collapse `combined_wt_score_override: Option<i32>` → `combined_wt_score: i32`. - Drop the `scorer` parameter (its only use was in the fallback closure). - Delete the dead per-mate fallback that recomputed the score from t1.score + t2.score - p1 - p2 + combined_p. - Fold identical `thresh1` / `thresh2` half-mapped thresholds into a single `single_mate_threshold` variable (both expressions were verbatim identical post-merge). Net -19 lines; same test pass-count (340), same STAR diff-script baseline (9/10 byte-identical). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
oops sorry, already pushed a commit where I merged in your |
|
Updated the PR to point to |
|
..looks like Bit messy sorry 🫠 Shouldn't really matter though. |
|
I've pulled all of this into dev locally with all git commits preserved. Thanks |
- Transcriptome index built from GTF at genomeGenerate, persisted to disk - Genome alignment projection onto transcripts (align_to_transcripts) - Aligned.toTranscriptome.out.bam pipeline - fix(params): don't require --sjdbGTFfile at alignReads when index has it - fix(io/bam): lenient BAM header validator for STAR tRNA names - fix(genome): match STAR's genomeParameters.txt byte-for-byte - fix(quant): STAR-exact geneInfo/exonGeTrInfo/sjdbList tab formats - Exon gains i_frag: u8 field (mirrors STAR's EX_iFrag) - try_pair_transcripts simplified (single caller always supplies score)
Summary
Implements
--quantMode TranscriptomeSAM+--quantTranscriptomeSAMoutput, producingAligned.toTranscriptome.out.bamfor downstream Salmon/RSEM quantification.Byte-for-byte parity with STAR (verified via
test/diff_star_transcriptome.sh)Ran STAR 2.7.11b genomeGenerate alongside ruSTAR genomeGenerate on two workloads:
Synthetic 2-chr / 4-transcript / 4-gene (committed in the script):
chrName.txtchrLength.txtchrStart.txtchrNameLength.txttranscriptInfo.tabexonInfo.tabgeneInfo.tabexonGeTrInfo.tabsjdbList.fromGTF.out.tabgenomeParameters.txtReal yeast (nf-core/test-datasets, 124-transcript chrI Ensembl GTF): same 9 transcriptome text files confirmed byte-identical.
genomeParameters.txtdiffers on one line (genomeFileSizes) for the same reason as synthetic.Remaining
genomeParameters.txtdelta: thegenomeFileSizesvalues differ because ruSTAR's Genome binary does not yet include STAR's pre-inserted splice-junction sequences (~nInsertedSJs × sjdbLengthbytes). Closing that gap is the scope of the follow-up PR stacked on this branch (feat/transcriptome-sam-sj-insertion, WIP — Phase 17.X). Blocks byte-identity of:Genome,SA,SAindex,sjdbInfo.txt,sjdbList.out.tab, plus the one line ingenomeParameters.txt.Transcriptome BAM end-to-end
Ran ruSTAR alignReads on the yeast GTF index built above with 50 000 real 101 bp reads from nf-core/test-datasets:
Aligned.toTranscriptome.out.bamis a valid BAM (samtools view round-trips cleanly).@SQentries with the correct STAR-style transcript IDs, including tRNA names with parentheses (e.g.tP(UGG)A).NH,HI,AS,nM) and valid CIGARs (100–101 M per read).Byte-for-byte diff of the BAM itself against STAR's output is blocked by a STAR 2.7.11b macOS arm64 bug where STAR reads 0 input reads from any FASTQ (reproducible on both synthetic and real nf-core data, with and without
--readFilesCommand, gzipped or plain). CI on Linux will run the diff once this lands.Changes (highlights)
TranscriptomeIndex::write_transcript_info+ 4 siblings) — byte-identical STAR output fortranscriptInfo.tab/exonInfo.tab/geneInfo.tab/exonGeTrInfo.tab/sjdbList.fromGTF.out.tab.GenomeIndex::loadprefers the on-disk files over re-parsing the GTF; falls back when the files are absent.gene_name→gene_id,gene_biotype→MissingGeneType) — matchesGTF.cpp.genomeParameters.txtbyte-match — hoisted writer to reproduce STAR's exact line order, tab/space separators, and trailing-whitespace quirks fromgenomeParametersWrite.cpp.i_fragonExon— matches STAR'sEX_iFragonTranscript::exons[i]. Propagated through stitcher + projection. Mate boundaries re-locate the transcript exon during projection (canonSJ == -3).Transcripthelper (PairedAlignment::combined_transcript_for_projection) — builds STAR's unifiedTranscriptwith mate2 exons carryingi_frag = 1.tP(UGG)A(which STAR accepts and emits) pass through unchanged.--quantMode TranscriptomeSAMat alignReads no longer requires--sjdbGTFfilewhen the index already carriestranscriptInfo.tab— matches STAR's behavior.MissingGeneTypeto constants, reusedGenome::position_to_chrover a hand-rolled helper, dropped a redundanttr_gene_id: Vec<String>field.Test plan
cargo clippy --lib -- -D warningscleancargo fmt --checkcleantest/diff_star_transcriptome.sh— 9/9 text files byte-identical with STAR@SQentries + tRNA namesAligned.toTranscriptome.out.bamvs STAR (Linux CI; local macOS STAR bug)Genome/SA/SAindex/sjdbInfo.txt/sjdbList.out.tabparity — stacked PR (SJ insertion)Known divergences
--runRNGseedflag with seeded primary tie-break #5. Per-read deterministic, not bit-for-bit with STAR's mt19937 primary-pick order.🤖 Generated with Claude Code