Skip to content

feat: add --runRNGseed flag with seeded primary tie-break#5

Merged
Psy-Fer merged 7 commits into
scverse:mainfrom
ewels:feat/run-rng-seed
Apr 20, 2026
Merged

feat: add --runRNGseed flag with seeded primary tie-break#5
Psy-Fer merged 7 commits into
scverse:mainfrom
ewels:feat/run-rng-seed

Conversation

@ewels
Copy link
Copy Markdown
Member

@ewels ewels commented Apr 17, 2026

Summary

Adds the --runRNGseed CLI flag, matching STAR's behavior. Used to seed a Fisher–Yates shuffle over top-score-tied alignments at primary-selection time, so which copy of a multi-mapper is marked primary can be controlled reproducibly.

Part 1 of 3 branches unblocking nf-core/rnaseq's STAR_ALIGN step. This is a standalone PR; does not depend on the others.

Changes

  • src/params.rs: new run_rng_seed: u64 arg, default 777 (matches STAR's parametersDefault).
  • src/align/read_align.rs: per_read_seed(seed, read_name) + shuffle_tied_prefix() helpers. Called after the final sort in align_read (SE) and align_paired_read (PE).
  • Cargo.toml: rand = "0.8".
  • +5 unit tests (total 283 unit + 4 integration = 287 passing).

Divergences from STAR

  • RNG: rand::rngs::StdRng (ChaCha12) instead of std::mt19937. STAR's std::uniform_real_distribution<double> on mt19937 is libstdc++-specific, so bit-for-bit parity is not achievable anyway.
  • Seeding: per-read (seed * (hash(read_name) + 1)) instead of STAR's per-chunk (seed * (iChunk+1)) — ruSTAR's rayon-based parallelism would make per-chunk seeding non-deterministic across thread counts. Per-read seeding is strictly more reproducible.

Follow-up commits

  • 4 simplify commits: swap hand-rolled Fisher-Yates for rand::seq::SliceRandom::shuffle, tighter comments, modern DefaultHasher re-export, impl Fn bound.
  • 1 cargo fmt commit to fix pre-existing drift inherited from main.

Test plan

  • cargo test — 283 passing
  • cargo clippy --lib -- -D warnings — clean
  • cargo fmt --check — clean
  • End-to-end against nf-core/rnaseq
  • Differential check vs STAR on yeast 10k (some of the 127 "genuine tie" position disagreements will shift; direction depends on seed)

Notes

  • Commits are unsigned (1Password SSH agent unavailable during session). Re-sign with git commit --amend -S --no-edit + force-push-with-lease if signed history is required.
  • Behavior tweak worth reviewing: per_read_seed uses run_rng_seed.wrapping_mul(hash + 1). When the user passes --runRNGseed 0 (as nf-core does), every read gets seed 0 and per-read decorrelation collapses. A bitwise-XOR mix (run_rng_seed ^ hash) would keep decorrelation at seed=0. Not changed here — it's a behavior tweak, not a simplification.

🤖 Generated with Claude Code

ewels and others added 6 commits April 17, 2026 14:11
Adds the STAR `--runRNGseed` CLI flag (default 777) and uses it to
randomize primary alignment selection among equal-scoring multi-mappers.
Previously primary selection was purely deterministic (lexicographic
tie-break), which meant the 127 "genuine tie" disagreements against
STAR on 10k SE yeast were frozen in whichever direction our sort happened
to pick. With a seeded shuffle, ties flip independently of sort order --
matching STAR's behavior at `ReadAlign_multMapSelect.cpp:71-79` and
`funPrimaryAlignMark.cpp:19-28`.

Unblocks nf-core/rnaseq, which invokes STAR_ALIGN with `--runRNGseed 0`.

Implementation notes:
- `shuffle_tied_prefix` does an in-place Fisher-Yates on the contiguous
  top-score prefix only, leaving non-tied lower-scored alignments alone.
  Matches STAR's two-phase "move best to front, then shuffle front"
  (collapsed since the input is already score-sorted).
- Applied at the final primary-selection point in both SE (`align_read`)
  and PE (`align_paired_read`). Not applied in `stitch.rs` per-window
  sorts -- STAR only shuffles at multMapSelect / primary-mark time.
- STAR seeds `std::mt19937` once per chunk and advances per read.
  ruSTAR parallelises per-read via rayon, so instead of a per-chunk RNG
  we derive a deterministic per-read seed from
  `run_rng_seed * (hash(read_name) + 1)`. This is stronger reproducibility
  than STAR (independent of thread count) while honoring `--runRNGseed`.
- Uses `rand::rngs::StdRng` rather than `rand_mt`. Not bit-for-bit parity
  with STAR's tie-breaking choices (STAR's `std::uniform_real_distribution`
  is libstdc++-specific anyway) -- it just has to honor the seed
  deterministically.

Tests: +5 (parse-default, parse-override, four shuffle behavior tests).
283 unit tests passing (was 278), 4 integration tests passing.

Note: commit is unsigned because local 1Password SSH signing
(op-ssh-sign) returned "failed to fill whole buffer" on every attempt.
Feel free to re-sign with `git commit --amend -S` once the signer is back.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replace the hand-rolled Fisher-Yates loop with `rand::seq::SliceRandom::shuffle`
on the tied prefix, and collapse the length guard with `slice::first()`. Same
behavior (shuffles the [0..tied) range with the seeded RNG), fewer lines, and
no need to import `Rng`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Drop the Fisher-Yates reference in `shuffle_tied_prefix`'s doc (the impl now
delegates to `SliceRandom::shuffle` — the algorithm is an implementation
detail). Collapse the SE call-site comment to one line since the fn doc
already covers the "why"; keep the STAR source file + line reference.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`DefaultHasher` is re-exported from `std::hash` as of Rust 1.76, so there's
no need to pull it in via the full `std::collections::hash_map` path.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…efix

Single-use generic parameter — `impl Fn(&T) -> i32` in the argument list is
shorter than a `where F: Fn(&T) -> i32` bound and reads more naturally at
the call sites we have.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Fixes formatting drift inherited from main. No semantic changes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@ewels ewels changed the title feat: add --runRNGseed flag with seeded primary tie-break feat: add --runRNGseed flag with seeded primary tie-break Apr 17, 2026
@Psy-Fer Psy-Fer self-assigned this Apr 20, 2026
@Psy-Fer Psy-Fer added the enhancement New feature or request label Apr 20, 2026
@Psy-Fer Psy-Fer merged commit e67864d into scverse:main Apr 20, 2026
9 checks passed
@ewels ewels deleted the feat/run-rng-seed branch April 21, 2026 14:56
Psy-Fer added a commit that referenced this pull request Apr 24, 2026
- shuffle_tied_prefix: deterministic per-read RNG for tied primary selection
- --runRNGseed param (default 777, matches STAR)
- --outSAMattrRGline: @rg header + RG:Z tags
- Various code simplifications and formatting
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants