Conversation
Introduce `src/quant/transcriptome.rs` with a `TranscriptomeIndex` struct that groups GTF exons by `transcript_id`, sorts exons by start, and records absolute genome coords plus STAR's `exLenCum` (cumulative transcript-space offset) per exon. Also builds: - `tr_start` / `tr_end` — transcript bounds in absolute genome coords - `tr_order` / `tr_starts_sorted` / `tr_end_max_sorted` — sorted view used for STAR-style `binarySearch1a` + running-max early-exit in `quantAlign` - per-transcript length, strand, gene_id, chr_idx Transcripts with inconsistent chr/strand across exons or on unknown chromosomes are skipped with `log::warn!`, matching STAR's tolerant GTF handling at `sjdbInsertJunctions` time. Note: STAR persists this table as `transcriptInfo.tab` + `exonInfo.tab` at genomeGenerate; ruSTAR rebuilds it at alignment time from the GTF. Semantically equivalent. No signing — commit.gpgsign disabled locally (no key on this worktree). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…nscripts)
Port STAR's `Transcriptome::quantAlign` + `alignToTranscript` (see
`source/Transcriptome_quantAlign.cpp:5-89, 91-114`).
`align_to_transcripts` binary-searches `tr_starts_sorted` for the
greatest tr_start <= align.genome_start, then walks back while
`tr_end_max_sorted[i] >= align.genome_end`, calling
`align_to_one_transcript` on every candidate whose [tr_start, tr_end]
fully contains the alignment.
`align_to_one_transcript` walks the alignment's exon blocks and:
* locates the transcript exon containing the first block
(binary-searched `find_containing_exon`),
* rejects blocks that extend past a transcript-exon boundary,
* on each splice boundary (ruSTAR's `is_splice_boundary_before`,
which approximates STAR's `canonSJ >= 0` using read-side
continuity), requires `prev.genome_end == tr_ex[ex].genome_end &&
next.genome_start == tr_ex[ex+1].genome_start`,
* translates each block start to t-space via
`ex_len_cum[ex] + (g_pos - exon_start)`.
Reverse-strand transcripts flip all projected exons:
`read_pos' = Lread - (read_pos + len)`,
`tr_pos' = tr_length - (tr_pos + len)`,
then reverse the exon vector. `is_reverse` is XOR'd with
`tr_strand == 2`.
Projected CIGAR = original CIGAR with N operations dropped; reversed
for reverse-strand transcripts. Splices collapse so `n_junction = 0`.
ruSTAR-specific notes:
* `Transcript.exons` has no canonSJ array; splice vs indel boundaries
are discriminated by read-side continuity (insertion → read gap,
splice → read contiguous with genome gap). Pure deletions rarely
produce block splits because the stitch merge coalesces across Del.
* GTF is 1-based inclusive; ruSTAR 0-based half-open. Transcript
length = sum(end - start). STAR uses `exLenCum[last] + (end -
start) + 1` which arrives at the same numeric value on 1-based
inclusive inputs.
8 new unit tests cover: single-exon projection, two-exon with matching
junction, junction mismatch (rejected), reverse-strand flip, multi-exon
projection into a longer transcript, out-of-bounds cases, and
projection onto multiple overlapping transcripts.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Port STAR's `ReadAlign::quantTranscriptome` gatekeeping logic (see `source/ReadAlign_quantTranscriptome.cpp:9-66`) as `filter_and_project`: 1. If `!allow_indels && n_gap > 0` → reject. 2. Single-end filter — for PE, skip alignments where both ends came from the same mate. ruSTAR's `Transcript` does not carry per-block mate tags; the caller enforces this by only invoking `filter_and_project` on both-mapped PE pair mates. Documented as a no-op at the per-mate level. 3. Soft-clip extension (mode-dependent): extend leading / trailing soft-clipped bases back into matched bases, counting mismatches where both read and genome bases are < 4 and unequal. Reject if `n_mismatch + extension_mismatches > min(outFilterMismatchNmax, outFilterMismatchNoverLmax*(Lread-1))`. 4. Call `align_to_transcripts` for projection. New enum `QuantTranscriptomeSAMoutput` with three variants matching STAR's CLI strings: * `BanSingleEnd_BanIndels_ExtendSoftclip` (default, RSEM-compatible) * `BanSingleEnd` * `BanSingleEnd_ExtendSoftclip` `rebuild_cigar_without_softclips` strips the leading/trailing S ops and folds their length into the adjacent M op. Interior soft-clips are left alone. 7 new unit tests: enum FromStr/Display, mode flags, indel rejection under default mode, indel retention under BanSingleEnd, zero-mismatch softclip extension folding 4S+40M → 44M, softclip extension exceeding budget → reject, softclip preservation under BanSingleEnd mode. Divergence from STAR: ruSTAR checks for genome/read buffer bounds via `saturating_sub` + slice-length checks before indexing, to protect against edge cases where extension would read past the genome end. STAR trusts its upstream bounds. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…tion Add `quant_transcriptome_sam_output: QuantTranscriptomeSAMoutput` field with STAR-compatible default `BanSingleEnd_BanIndels_ExtendSoftclip`. Accepts the three STAR strings via the `FromStr` impl added in the previous commit. Add `Parameters::quant_transcriptome_sam()` helper that scans `quant_mode` for `"TranscriptomeSAM"`, mirroring the existing `quant_gene_counts()` helper. Extend `Parameters::validate()` to require `--sjdbGTFfile` when `--quantMode TranscriptomeSAM` is active. STAR also allows running TranscriptomeSAM in SE (the "single-end" flag in the mode name refers to per-mate, not per-run), so no PE-only gate. 5 new unit tests: default parse, flag enable, explicit mode override (BanSingleEnd and BanSingleEnd_ExtendSoftclip), validate error without GTF, validate success with GTF. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Top-level orchestration for `--quantMode TranscriptomeSAM`.
src/io/sam.rs:
* Factor `build_sam_header` into a generic
`build_sam_header_from_refs(iter, params)` that takes any
`(name, length)` iterator. The original `build_sam_header` now
delegates to it with the genome's chromosomes.
* New `SamWriter::build_transcriptome_records(...)`: builds records
with RNAME pointing at the transcriptome header (reference_sequence_id
= transcript_idx), POS = t-space position + 1, and emits only the
standard attribute set (NH/HI/AS/NM/nM). Splice-aware tags (jM, jI,
XS, MD) are dropped because splices collapse in t-space.
* `primary_hit_idx` parameter selects the randomly-chosen primary
alignment; all others get the SECONDARY (0x100) flag.
src/io/bam.rs:
* New `BamWriter::create_transcriptome(path, tr_idx, params)` builds
a BAM header with one @sq per transcript (name = transcript_id,
length = t-space length).
* New `BamWriter::header()` accessor.
* Unit test verifying transcriptome BAM writer creation + @sq count.
src/align/read_align.rs:
* `per_read_seed` changed from `fn` to `pub(crate) fn` so lib.rs can
seed the transcriptome primary-picker with the same read-scoped
seed used for genome-space tie-shuffle.
src/lib.rs:
* Build an `Arc<TranscriptomeIndex>` alongside the gene-counts
context when `--quantMode TranscriptomeSAM` is active, and thread
it through run_single_pass / run_two_pass / align_reads_single_end
/ align_reads_paired_end.
* Open `<prefix>Aligned.toTranscriptome.out.bam` via
`BamWriter::create_transcriptome` at the start of the pipeline and
flush/finish at the end.
* Extend `AlignmentBatchResults` with `transcriptome_records:
Vec<RecordBuf>`. Per-read builder helpers:
- `build_transcriptome_records_se`: projects every genome-space
alignment via `filter_and_project`, seeds `StdRng` with
`per_read_seed(run_rng_seed, read_name)`, picks a random
projected index as primary, builds records.
- `build_transcriptome_records_pe`: projects both mates
independently, keeps transcripts where both project, emits two
records per projected pair (mate1 FIRST_SEGMENT + mate2
LAST_SEGMENT). Primary pick is shared across both mates.
* Transcriptome records are written serially in batch-merge order
alongside the main SAM/BAM stream (normal and BySJout paths).
Known limitations vs STAR:
* PE pairing is done per-transcript-idx set intersection between
projected mates, not via STAR's combined 2-mate
`Transcript`/`EX_iFrag` machinery. Equivalent for the common case
where each mate maps uniquely within a given transcript; may
diverge when one mate has multiple valid positions within one
transcript.
* StdRng (inherited from `feat/run-rng-seed`) replaces STAR's
std::mt19937 — no bit-for-bit parity on which transcriptome
projection is picked as primary among ties. Per-read
determinism is preserved via `per_read_seed`.
No pass-1 wiring: two-pass pass 1 uses `None` for tr_idx (tr BAM is
only emitted from pass 2).
cargo fmt applied across the touched files plus incidental fmt
touch-ups on read_align.rs / stitch.rs / params.rs / quant/mod.rs.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
End-to-end smoke test in `tests/transcriptome_sam.rs`:
* Builds a 2000bp synthetic genome (single chromosome, pseudo-random
sequence from an LCG).
* Writes a 2-transcript GTF (T1: chr1:101-400 +, T2: chr1:601-900 +).
* Generates 20 reads (30bp) drawn alternately from the T1 and T2
exon regions.
* Runs `ruSTAR --runMode genomeGenerate` then `alignReads
--quantMode TranscriptomeSAM`.
* Asserts `Aligned.toTranscriptome.out.bam` is created, > 100 bytes,
parses as a valid BAM via noodles, and the header contains exactly
two @sq lines named T1 and T2.
Uses the same `assert_cmd::Command::cargo_bin` pattern as the existing
`phase9_threading.rs` test (inherits the same deprecation warnings;
pre-existing parity).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…iptomeSAMoutput Remove `Display` impl (never formatted anywhere) and `allow_single_end()` helper (always returns false, never called). Clean up the stale single-end comment in `filter_and_project`. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…lders `build_transcriptome_records_se` allocated a redundant `fwd` clone of `read_seq` just to unify the reverse / forward branch, and both SE + PE builders inlined the same 6-line base-complement match. Replace both with a single `rc_encode()` helper delegating to the existing `io::fastq::complement_base`. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…r mate Previously `build_transcriptome_records_pe` called the per-record builder twice *per projected pair*, passing a single-element slice and a `primary_hit_idx` sentinel (0 / usize::MAX) to force the SECONDARY flag. The builder already handles multi-alignment sets correctly, so split the pairs into mate1/mate2 slices, call the builder once per mate with the true `primary_hit`, then stamp FIRST_SEGMENT / LAST_SEGMENT on the results and interleave. Same observable output, half the per-pair work. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Both `build_transcriptome_records_se` and `_pe` duplicated the RNG seeding (`per_read_seed` → `StdRng::seed_from_u64` → `gen_range`) and the MAPQ calculation (`effective_n = n_alignments.max(n_for_mapq)`). Extract a single `pick_primary_and_mapq` helper and hoist the `use` blocks out of the function bodies. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`BamWriter::create` and `BamWriter::create_transcriptome` both opened the file, wrapped it in `BufWriter`/`bam::io::Writer`, and wrote the header. Move that boilerplate into a private `with_header` helper. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Merge the two `actual_left` shadowed lets into one `.min(..).min(..)` chain. - Drop the `actual_right` alias (right-clip never needs clamping). - Replace `ext.genome_start = ext.exons.first().map(..).unwrap_or(..)` with a simple `if let Some(first) = ext.exons.first()` — the fallback branch was unreachable because extend_softclips always enters with a non-empty exon list. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…eq clone - The 5-line single-end carve-out inside the `filter_and_project` doc said the same thing as the inline comment; fold both into a one-line mention. - Drop redundant `// (1) ... (4) ...` step labels from the function body. - `align_to_one_transcript` was cloning `align.read_seq` into the projected transcript but no consumer reads that field on a projected record (transcriptome SAM takes the read from the caller). Use `Vec::new()` to avoid the per-projection allocation. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…o_one_transcript The comment referenced a function name that never existed in ruSTAR (it was presumably copied from an earlier draft); the real call is `is_splice_boundary_before`. Collapse the now-out-of-date four-line block to a single factual line. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The `header()` method was added "for serial-write code paths that build records outside of `write_batch`" but no such caller exists; only the unit test used it. The test lives inside the same module so it can read the private `header` field directly. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Caller had a `Vec<&Box<PairedAlignment>>` and was building a throwaway `Vec<&PairedAlignment>` to match the function's `&[&PairedAlignment]` signature. Change the signature to `IntoIterator<Item = &PairedAlignment>` and call `.iter().map(|b| b.as_ref())` at the call-site, eliminating the intermediate Vec allocation. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The test module already imports `HashMap` via `use super::*` (the parent module re-exports `std::collections::HashMap`). The `StdHashMap` alias was defensive but unused. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Trailing whitespace cleanup after rebase conflict resolution. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
First field for STAR-faithful transcriptInfo.tab emission. `tr_exi[i]` is the starting offset of transcript `i`'s exons in a flat global exon array — matches STAR's `trExI` column (GTF_transcriptGeneSJ.cpp:86-112). Pure addition, no behavior change. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Required for STAR-faithful geneInfo.tab + transcriptInfo.tab emission. Each unique gene_id gets a position-based integer index (matches STAR's trGene column). First-seen-wins for gene_name / gene_biotype. No behavior change — tr_gene_id (string) is still populated for existing callers. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
STAR's trExI column in transcriptInfo.tab indexes into exonInfo.tab, which writes exons grouped by transcript in SORTED (trStart, trEnd) order. My initial tr_exi computation used insertion order, producing offsets that wouldn't match STAR's output when insertion order differs from sorted order. Added second test exercising the sort-order semantics explicitly. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
TranscriptomeIndex::write_transcript_info() emits the 8-column tab-separated file that STAR's GTF_transcriptGeneSJ.cpp:86-112 writes at genomeGenerate time. Transcripts are sorted by (trStart, trEnd) in the output; trStart is 0-based absolute, trEnd is 0-based inclusive, trEmax is the running max of prior sorted transcripts' ends with the STAR initializer quirk (first row has trEmax == trEnd). Three byte-level tests cover: single-strand emission, reverse-strand encoding (2), and the running-max-excludes-current trEmax semantics. Next commits will add exonInfo.tab, geneInfo.tab, exonGeTrInfo.tab, sjdbList.fromGTF.out.tab, then the load path. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
TranscriptomeIndex::write_exon_info() emits STAR's 3-column file (GTF_transcriptGeneSJ.cpp:108): transcript-relative exon start, 0-based-inclusive end, and cumulative transcript-space length of prior exons. Exons are grouped by transcript in STAR's sorted (trStart, trEnd) order. Two byte-level tests cover: multi-exon per-transcript emission with correct exLenCum progression, and insertion-vs-sort order divergence. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
TranscriptomeIndex::write_gene_info() emits 3 columns (geneID, geneName, geneBiotype) per STAR GTF_transcriptGeneSJ.cpp:55-60. Empty attribute strings for GTF records that don't set gene_name or gene_biotype. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
TranscriptomeIndex::write_exon_ge_tr_info() emits STAR's 5-column exon-gene-transcript cross-reference, sorted by (exStart, exEnd, exStrand, gene_idx, trIdx) to match STAR's funCompareArrays<uint64,5>. Coordinates are 0-based absolute with inclusive exEnd; trIdx column is the insertion-order transcript index (STAR's own code calls this "wrong" because transcripts are re-sorted later — we replicate the behavior for byte parity). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
TranscriptomeIndex::write_sjdb_list_from_gtf() emits STAR's 5-column
GTF-derived splice junction list (GTF_transcriptGeneSJ.cpp:115-171):
chr, 1-based chrom-local intron start, 1-based intron end, strand
('.', '+', '-'), and comma-separated 1-based gene indices for
transcripts sharing that junction.
No header line (matches STAR). Junctions are derived from consecutive
exon pairs within each transcript where an intron exists; duplicates
across transcripts are merged into a single line with the union of
gene indices. Sorted by intron start (STAR's funCompareUint2 behavior).
Three byte-level tests: basic + dedup/merge + single-exon zero case.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
TranscriptomeIndex::from_index_dir() reconstructs a fully-populated index from transcriptInfo.tab + exonInfo.tab + geneInfo.tab. Converts STAR's 0-based-inclusive trEnd / exEnd back to ruSTAR's exclusive convention. Derives tr_chr_idx from first-exon position, tr_order / tr_starts_sorted / tr_end_max_sorted from the already-sorted on-disk layout. Round-trip tests verify that from_gtf_exons → write_* → from_index_dir produces an equivalent index (tr_ids, tr_start, tr_end, tr_strand, tr_gene_idx, tr_chr_idx, tr_length, tr_exons, gene_ids/names/biotypes all match). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
GenomeIndex now carries an Option<TranscriptomeIndex>: - GenomeIndex::build (genomeGenerate) builds the TranscriptomeIndex once from the GTF and stores it alongside genome/SA/SAindex. Write emits transcriptInfo.tab + exonInfo.tab + geneInfo.tab + exonGeTrInfo.tab + sjdbList.fromGTF.out.tab into the genome dir. - GenomeIndex::load (alignReads) prefers the on-disk files when transcriptInfo.tab is present, matching STAR. Falls back to re-parsing the GTF only when the files are absent and the user supplies --sjdbGTFfile at alignReads (STAR's sjdbInsertJunctions path). - src/lib.rs stops calling TranscriptomeIndex::from_gtf_exons at align time — it uses the pre-loaded instance from GenomeIndex. - Test helpers in read_align.rs / stitch.rs initialize the new field with None. 333 unit + 1 integration test pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Mirrors STAR's Transcript::exons[i][EX_iFrag] (IncludeDefine.h:209). Value 0 for single-end / mate1, 1 for mate2. This is pure plumbing: - Exon gains `i_frag: u8` field with doc comment pointing at STAR's source. - Stitcher (src/align/stitch.rs) sets `i_frag: 0` on all exons it produces — SE and per-mate PE go through here before any pair combining. - Transcriptome projection (src/quant/transcriptome.rs) propagates `block.i_frag` onto projected exons, so transcript-space alignments carry mate identity. - All Exon construction sites in test code updated to set `i_frag: 0` (SE contexts). No behavior change yet — PE pair construction still concatenates two i_frag=0 transcripts. T.3 will introduce combined-pair Transcript with mate2 exons rewritten to i_frag=1, and T.4 will use this field to implement STAR's native single-end filter (ReadAlign_quantTranscriptome.cpp:17). 333 unit + 1 integration test pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
align_to_one_transcript now detects block-to-block mate boundaries via i_frag change (equivalent to STAR's canonSJ == -3) and re-locates the transcript exon for the new block's genome position. Within-mate splice boundaries still use the `ex += 1` advance; cross-mate boundaries do NOT require the transcript-junction identity check. Lays the groundwork for PE combined-pair projection. Output unchanged for existing tests (no combined Transcript yet passed through). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…upply the combined score After merging dev's combined-cluster PE stitching (commit 45792dd), the only caller of `try_pair_transcripts` always supplies a combined WT score — the per-mate fallback path that drove the `Option<i32>` override was removed with the merge. - Collapse `combined_wt_score_override: Option<i32>` → `combined_wt_score: i32`. - Drop the `scorer` parameter (its only use was in the fallback closure). - Delete the dead per-mate fallback that recomputed the score from t1.score + t2.score - p1 - p2 + combined_p. - Fold identical `thresh1` / `thresh2` half-mapped thresholds into a single `single_mate_threshold` variable (both expressions were verbatim identical post-merge). Net -19 lines; same test pass-count (340), same STAR diff-script baseline (9/10 byte-identical). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…t/transcriptome-sam-sj-insertion
- shuffle_tied_prefix: deterministic per-read RNG for tied primary selection - --runRNGseed param (default 777, matches STAR) - --outSAMattrRGline: @rg header + RG:Z tags - Various code simplifications and formatting
- Transcriptome index built from GTF at genomeGenerate, persisted to disk - Genome alignment projection onto transcripts (align_to_transcripts) - Aligned.toTranscriptome.out.bam pipeline - fix(params): don't require --sjdbGTFfile at alignReads when index has it - fix(io/bam): lenient BAM header validator for STAR tRNA names - fix(genome): match STAR's genomeParameters.txt byte-for-byte - fix(quant): STAR-exact geneInfo/exonGeTrInfo/sjdbList tab formats - Exon gains i_frag: u8 field (mirrors STAR's EX_iFrag) - try_pair_transcripts simplified (single caller always supplies score)
- PreparedJunction pipeline: motif detection, shift_left/shift_right - build_gsj(): concatenated SJ flanking-sequence buffer - sort + cross-strand dedup for prepared junctions - Genome::append_sjdb(): extend forward genome + rebuild RC - sjdbInfo.txt + sjdbList.out.tab writers - Wire sjdb insertion into GenomeIndex::build - Extended byte-parity diff harness
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Most things done now other than STARsolo.