fix(sjdb): key SpliceJunctionDb in genome-absolute 0-based coordinates#45
Open
pinin4fjords wants to merge 2 commits into
Open
fix(sjdb): key SpliceJunctionDb in genome-absolute 0-based coordinates#45pinin4fjords wants to merge 2 commits into
SpliceJunctionDb in genome-absolute 0-based coordinates#45pinin4fjords wants to merge 2 commits into
Conversation
The DB was keyed in chromosome-local 1-based coords (straight from the GTF) but consulted in genome-absolute 0-based at stitch time, so every sjdb lookup during alignment missed. Annotated splice events never got the sjdb_score bonus and the stricter overhang gate fired in their place, dropping ~50 % of GT/AG splices on the test profile. The stats-time site queried in genome-absolute 1-based-equivalent, so it accidentally matched on chr 0 (chr_start=0) but missed on every other chromosome -- producing inconsistent answers between SJ.out.tab and Log.final.out on the same BAM. Normalise the DB to genome-absolute 0-based at construction, matching the convention prepared_junctions and SpliceJunctionStats already use. Update the stats-time call site to query in the new space. Stitch-time needs no change. Fixes scverse#27 Co-Authored-By: Claude <noreply@anthropic.com>
The previous commit normalised the DB to genome-absolute 0-based and updated the stats-time query. Stitch-time was left unchanged on the assumption that it already passed genome-absolute 0-based — half right. donor_fwd was correct (first intron base, 0-based) but acceptor_fwd = donor_fwd + del landed on the first base AFTER the intron, while the DB keys store the last intron base. Lookup missed every annotated junction. Subtract 1 from the acceptor to land on the last intron base. After this, Log.final.out Annotated (sjdb) goes from 0 to a non-zero count that's consistent with the annotated=1 row count in SJ.out.tab on the same BAM. Co-Authored-By: Claude <noreply@anthropic.com>
Author
|
Verified end-to-end on macOS/aarch64 against the rebuilt fix branch (PE yeast + The two counters are now consistent — the pre-fix Remaining gap (separate from this PR, tracked as #47): total splice count is still 366 vs STAR's ~720 — pass-1 isn't using sjdb-derived junctions as alignment candidates, only at scoring time. That's a deeper change. |
This was referenced May 12, 2026
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
SpliceJunctionDbwas keyed in chromosome-local 1-based coordinates (extracted straight from GTF column 5) but consulted in two different coordinate spaces at runtime:src/align/stitch.rs:1305-1314)chr_start[N]-> misssrc/lib.rs:1860-1894)chr_start[N]-> missNeither matched the DB. Smoking gun: on the same WT_REP2 BAM,
SJ.out.tabhad 2 rowsannotated=1(stats-time accidentally hits chr-0) whileLog.final.outreportedNumber of splices: Annotated (sjdb) | 0(stitch-time always misses). The stitch-time miss is the load-bearing one - it dropssjdb_scoreand pulls in the stricteralign_sj_overhang_mingate, costing ~50 % of GT/AG splices on the nf-core/rnaseq test profile.Fix
Normalise the DB to genome-absolute 0-based at construction (matching the existing convention used by
prepared_junctionsinsrc/index/mod.rsand bySpliceJunctionStats). Single source of truth contained in the GTF extraction step.extract_junctions_configuredinsrc/junction/gtf.rsnow addschr_start[chr_idx]and subtracts 1 before returning each(intron_start, intron_end).SpliceJunctionDb::from_raw_junctionsconsumes those tuples verbatim.src/lib.rs:1860andsrc/lib.rs:796now record(genome_pos, genome_pos + intron_len - 1)instead of(genome_pos + 1, genome_pos + intron_len).genome_posis the 0-based first intronic base, which is whatdetect_splice_motifalready expects.src/align/stitch.rsis unchanged - it already passes genome-absolute 0-based.SJ.out.tabwriter insrc/junction/sj_output.rsnow adds+ 1after subtractingchr_start[chr_idx]so the chr-local 1-based output is unchanged.prepared_junctionsconstruction insrc/index/mod.rsis simplified: the chr-local-to-absolute conversion that used to live there now happens upstream in the GTF extractor.Test plan
test_db_keyed_in_genome_absolute_zero_based_multi_chrbuilds aSpliceJunctionDbon a 2-chromosome toy genome (chr_start[1] = 1000) and assertsis_annotatedsucceeds for the chr-1 junction at its genome-absolute 0-based key. The pre-fix chr-local key and the pre-fix stitch-time off-by-one key must both misscargo buildcargo clippy --lib -- -D warningscargo fmt --checkcargo test(384 lib tests + integration suites pass)After this fix,
Number of splices: Annotated (sjdb)inLog.final.outmatches the count ofannotated=1rows inSJ.out.tabon the same BAM, and the per-sample Annotated count is in the same order as STAR's on equivalent inputs.Fixes #27