
fix(sjdb): key SpliceJunctionDb in genome-absolute 0-based coordinates #45

Open
pinin4fjords wants to merge 2 commits into scverse:main from pinin4fjords:fix/sjdb-coord-space


Summary

SpliceJunctionDb was keyed in chromosome-local 1-based coordinates (extracted straight from GTF column 5) but consulted in two different coordinate spaces at runtime:

| Call site | Convention | On chr 0 (chr_start=0) | On chr N+ |
|---|---|---|---|
| stitch-time (src/align/stitch.rs:1305-1314) | genome-absolute 0-based | off by 1 -> miss | off by chr_start[N] -> miss |
| stats-time (src/lib.rs:1860-1894) | genome-absolute 1-based | matches -> hit | off by chr_start[N] -> miss |

Neither matched the DB. Smoking gun: on the same WT_REP2 BAM, SJ.out.tab had 2 rows annotated=1 (stats-time accidentally hits chr-0) while Log.final.out reported Number of splices: Annotated (sjdb) | 0 (stitch-time always misses). The stitch-time miss is the load-bearing one - it drops sjdb_score and pulls in the stricter align_sj_overhang_min gate, costing ~50 % of GT/AG splices on the nf-core/rnaseq test profile.
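The mismatch in the table can be reproduced with a toy model. This is a minimal sketch, not the project's code: the helper names `stitch_time_key` and `stats_time_key`, the example junction, and the assumption that DB keys are plain `(start, end)` pairs with no chromosome index are all illustrative.

```rust
use std::collections::HashSet;

/// Hypothetical helper: the key a stitch-time caller would build
/// (genome-absolute 0-based) for a junction given in chr-local 1-based coords.
fn stitch_time_key(chr_start: u64, local: (u64, u64)) -> (u64, u64) {
    (chr_start + local.0 - 1, chr_start + local.1 - 1)
}

/// Hypothetical helper: the key a stats-time caller would build
/// (genome-absolute 1-based) for the same junction.
fn stats_time_key(chr_start: u64, local: (u64, u64)) -> (u64, u64) {
    (chr_start + local.0, chr_start + local.1)
}

fn main() {
    // Pre-fix DB: keyed by the chr-local 1-based pair, no chromosome offset.
    let local = (101u64, 200u64);
    let db: HashSet<(u64, u64)> = HashSet::from([local]);

    for (name, chr_start) in [("chr 0", 0u64), ("chr 1", 1000u64)] {
        let stitch_hit = db.contains(&stitch_time_key(chr_start, local));
        let stats_hit = db.contains(&stats_time_key(chr_start, local));
        println!("{name}: stitch-time hit={stitch_hit}, stats-time hit={stats_hit}");
    }
    // Prints:
    // chr 0: stitch-time hit=false, stats-time hit=true
    // chr 1: stitch-time hit=false, stats-time hit=false
}
```

Only the stats-time query on chr 0 lands on a DB key, which is the accidental hit described above.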

Fix

Normalise the DB to genome-absolute 0-based at construction (matching the existing convention used by prepared_junctions in src/index/mod.rs and by SpliceJunctionStats). The conversion lives in one place, the GTF extraction step.

  • extract_junctions_configured in src/junction/gtf.rs now adds chr_start[chr_idx] and subtracts 1 before returning each (intron_start, intron_end). SpliceJunctionDb::from_raw_junctions consumes those tuples verbatim.
  • Stats-time call sites at src/lib.rs:1860 and src/lib.rs:796 now record (genome_pos, genome_pos + intron_len - 1) instead of (genome_pos + 1, genome_pos + intron_len). genome_pos is the 0-based first intronic base, which is what detect_splice_motif already expects.
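  • Stitch-time call site at src/align/stitch.rs: donor_fwd already passes the genome-absolute 0-based first intron base and is unchanged. The acceptor is now donor_fwd + del - 1, so it lands on the last intron base that the DB keys store (second commit in this branch).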
  • SJ.out.tab writer in src/junction/sj_output.rs now adds + 1 after subtracting chr_start[chr_idx] so the chr-local 1-based output is unchanged.
  • prepared_junctions construction in src/index/mod.rs is simplified: the chr-local-to-absolute conversion that used to live there now happens upstream in the GTF extractor.
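The coordinate arithmetic in the list above can be sketched as follows. The function names (`to_abs0`, `to_local1`, `stats_key`) and the toy `chr_start` values are assumptions for illustration; the real functions in src/junction/gtf.rs and src/junction/sj_output.rs are not shown in this PR description and may be shaped differently.

```rust
/// chr-local 1-based -> genome-absolute 0-based (what the GTF extractor now does).
/// Assumes local1 >= 1, as GTF coordinates are 1-based.
fn to_abs0(chr_start: &[u64], chr_idx: usize, local1: u64) -> u64 {
    chr_start[chr_idx] + local1 - 1
}

/// genome-absolute 0-based -> chr-local 1-based (what the SJ.out.tab writer now does).
fn to_local1(chr_start: &[u64], chr_idx: usize, abs0: u64) -> u64 {
    abs0 - chr_start[chr_idx] + 1
}

/// Stats-time key: genome_pos is the 0-based first intronic base,
/// so the last intronic base is genome_pos + intron_len - 1.
/// Assumes intron_len >= 1.
fn stats_key(genome_pos: u64, intron_len: u64) -> (u64, u64) {
    (genome_pos, genome_pos + intron_len - 1)
}

fn main() {
    let chr_start = [0u64, 1000]; // toy genome, two chromosomes
    // Intron on chr 1 at chr-local 1-based 101..=200 (length 100).
    let start = to_abs0(&chr_start, 1, 101);
    let end = to_abs0(&chr_start, 1, 200);
    assert_eq!((start, end), (1100, 1199));

    // The stats-time caller derives the same key from its own inputs.
    assert_eq!(stats_key(1100, 100), (start, end));

    // The writer round-trips back to the original chr-local 1-based values.
    assert_eq!(to_local1(&chr_start, 1, start), 101);
    assert_eq!(to_local1(&chr_start, 1, end), 200);
}
```

Because the forward and inverse conversions are exact inverses, the SJ.out.tab output stays in chr-local 1-based coordinates while every internal lookup shares one key space.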

Test plan

  • New unit test test_db_keyed_in_genome_absolute_zero_based_multi_chr builds a SpliceJunctionDb on a 2-chromosome toy genome (chr_start[1] = 1000) and asserts is_annotated succeeds for the chr-1 junction at its genome-absolute 0-based key. The pre-fix chr-local key and the pre-fix stitch-time off-by-one key must both miss.
  • Existing tests that encoded the chr-local 1-based assumption have been updated to genome-absolute 0-based equivalents (those were testing the bug)
  • cargo build
  • cargo clippy --lib -- -D warnings
  • cargo fmt --check
  • cargo test (384 lib tests + integration suites pass)
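For readers without the diff at hand, the shape of the new multi-chromosome test might look roughly like this. `ToyDb` and `from_local_junctions` are hypothetical stand-ins; the real SpliceJunctionDb fields and constructor signature are not shown here, and the exact "off-by-one" key used by the real test is an assumption.

```rust
use std::collections::HashSet;

/// Toy stand-in for SpliceJunctionDb (assumed shape, not the real type).
struct ToyDb {
    keys: HashSet<(u64, u64)>,
}

impl ToyDb {
    /// Build from (chr_idx, intron_start, intron_end) in chr-local 1-based
    /// coordinates, storing genome-absolute 0-based keys.
    fn from_local_junctions(chr_start: &[u64], raw: &[(usize, u64, u64)]) -> Self {
        let keys = raw
            .iter()
            .map(|&(chr, s, e)| (chr_start[chr] + s - 1, chr_start[chr] + e - 1))
            .collect();
        ToyDb { keys }
    }

    fn is_annotated(&self, start: u64, end: u64) -> bool {
        self.keys.contains(&(start, end))
    }
}

fn main() {
    let chr_start = [0u64, 1000]; // chr_start[1] = 1000, as in the test plan
    let db = ToyDb::from_local_junctions(&chr_start, &[(1, 101, 200)]);

    // Genome-absolute 0-based key hits.
    assert!(db.is_annotated(1100, 1199));
    // Pre-fix chr-local 1-based key misses.
    assert!(!db.is_annotated(101, 200));
    // A key shifted by one base misses (assumed form of the off-by-one key).
    assert!(!db.is_annotated(1101, 1200));
}
```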

After this fix, Number of splices: Annotated (sjdb) in Log.final.out matches the count of annotated=1 rows in SJ.out.tab on the same BAM, and the per-sample Annotated count is in the same order as STAR's on equivalent inputs.

Fixes #27

pinin4fjords and others added 2 commits May 12, 2026 18:38
The DB was keyed in chromosome-local 1-based coords (straight from the
GTF) but consulted in genome-absolute 0-based at stitch time, so every
sjdb lookup during alignment missed. Annotated splice events never got
the sjdb_score bonus and the stricter overhang gate fired in their
place, dropping ~50 % of GT/AG splices on the test profile.

The stats-time site queried in genome-absolute 1-based-equivalent, so
it accidentally matched on chr 0 (chr_start=0) but missed on every
other chromosome -- producing inconsistent answers between SJ.out.tab
and Log.final.out on the same BAM.

Normalise the DB to genome-absolute 0-based at construction, matching
the convention prepared_junctions and SpliceJunctionStats already use.
Update the stats-time call site to query in the new space. Stitch-time
needs no change.

Fixes scverse#27

Co-Authored-By: Claude <noreply@anthropic.com>

The previous commit normalised the DB to genome-absolute 0-based and
updated the stats-time query. Stitch-time was left unchanged on the
assumption that it already passed genome-absolute 0-based — half right.
donor_fwd was correct (first intron base, 0-based) but acceptor_fwd =
donor_fwd + del landed on the first base AFTER the intron, while the
DB keys store the last intron base. Lookup missed every annotated
junction.

Subtract 1 from the acceptor to land on the last intron base. After
this, Log.final.out Annotated (sjdb) goes from 0 to a non-zero count
that's consistent with the annotated=1 row count in SJ.out.tab on
the same BAM.

Co-Authored-By: Claude <noreply@anthropic.com>
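The acceptor off-by-one from the second commit is easy to check by hand. A minimal sketch, assuming `del` is the intron length in bases as the commit message implies; the two function names are hypothetical and only `donor_fwd`, `acceptor_fwd`, and `del` come from the commit text.

```rust
use std::collections::HashSet;

/// Pre-fix stitch-time key: the acceptor lands on the first base AFTER the intron.
fn lookup_key_before(donor_fwd: u64, del: u64) -> (u64, u64) {
    (donor_fwd, donor_fwd + del)
}

/// Post-fix stitch-time key: the acceptor lands on the last intron base.
/// Assumes del >= 1 (a splice always spans at least one base).
fn lookup_key_after(donor_fwd: u64, del: u64) -> (u64, u64) {
    (donor_fwd, donor_fwd + del - 1)
}

fn main() {
    // DB key for an intron covering 0-based bases 1100..=1199 (length 100).
    let db: HashSet<(u64, u64)> = HashSet::from([(1100, 1199)]);

    let (donor_fwd, del) = (1100u64, 100u64);
    assert!(!db.contains(&lookup_key_before(donor_fwd, del))); // (1100, 1200) misses
    assert!(db.contains(&lookup_key_after(donor_fwd, del))); // (1100, 1199) hits
}
```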

pinin4fjords commented May 12, 2026

Verified end-to-end on macOS/aarch64 against the rebuilt fix branch (PE yeast + --twopassMode Basic --sjdbGTFfile genes.gtf):

=== rustar Log.final.out splice section ===
Number of splices: Total            | 366
Number of splices: Annotated (sjdb) | 95         <- was 0 pre-fix
Number of splices: GT/AG            | 266
Number of splices: GC/AG            | 2
Number of splices: AT/AC            | 5
Number of splices: Non-canonical    | 93

=== rustar SJ.out.tab annotated col distribution ===
    14  annotated=0
     2  annotated=1

The two counters are now consistent — the pre-fix Log.final.out = 0 vs SJ.out.tab = 2 annotated smoking gun is gone. Both fixes in this branch are load-bearing: the coord-space normalisation, plus the acceptor_fwd - 1 follow-up on the stitch-time lookup (without the latter, donor_fwd + del lands one base past the last intron base and every annotated lookup still misses).

Remaining gap (separate from this PR, tracked as #47): total splice count is still 366 vs STAR's ~720 — pass-1 isn't using sjdb-derived junctions as alignment candidates, only at scoring time. That's a deeper change.


Development

Successfully merging this pull request may close these issues.

Log.final.out always reports Number of splices: Annotated (sjdb) = 0 despite --sjdbGTFfile
