Skip to content

Add samtools-compatible outputs, TIN analysis, and gene body coverage#17

Merged
ewels merged 21 commits into
mainfrom
samtools-tin-genebody
Feb 16, 2026
Merged

Add samtools-compatible outputs, TIN analysis, and gene body coverage#17
ewels merged 21 commits into
mainfrom
samtools-tin-genebody

Conversation

@ewels
Copy link
Copy Markdown
Member

@ewels ewels commented Feb 15, 2026

Summary

Adds three major feature groups to RustQC's RNA-seq QC pipeline, all computed in the existing single-pass BAM counting engine:

  • Samtools-compatible outputs: flagstat, idxstats, and stats (SN section) — reimplementations that produce output matching samtools format, integrated into the RSeQC accumulator framework
  • TIN (Transcript Integrity Number): Reimplementation of RSeQC's tin.py, computing per-transcript integrity scores from read coverage uniformity across exonic regions
  • Gene body coverage + Qualimap-compatible output: Computes a normalized 100-bin coverage profile across transcript bodies, producing coverage_profile_along_genes_(total).txt and rnaseq_qc_results.txt with alignment stats, genomic origin percentages, and 5'/3' transcript coverage bias

Changes

New files

  • src/rna/rseqc/flagstat.rs — samtools flagstat reimplementation
  • src/rna/rseqc/idxstats.rs — samtools idxstats reimplementation
  • src/rna/rseqc/stats.rs — samtools stats SN section reimplementation
  • src/rna/rseqc/tin.rs — TIN analysis module
  • src/rna/genebody.rs — Gene body coverage module with Qualimap-compatible output

Modified files

  • src/config.rs — Config structs for all new tools (flagstat, idxstats, stats, TIN, genebody coverage)
  • src/main.rs — Wiring for all new outputs, transcript position map construction, output writing
  • src/rna/dupradar/counting.rs — Gene body coverage recording at all gene assignment points
  • src/rna/rseqc/accumulators.rs — Accumulator integration for flagstat/idxstats/stats
  • src/rna/rseqc/bam_stat.rs — Extended to support flagstat/stats data collection
  • src/rna/rseqc/mod.rs — Module declarations
  • src/rna/mod.rs — Module declaration for genebody

Testing

  • All 120 unit tests and 12 integration tests pass (cargo test --release)
  • cargo fmt --check and cargo clippy -- -D warnings clean
  • Verified output on test BAM: all new output files produced with correct content

Extend BamStatAccum with counters for flagstat (paired, mapped, singletons,
mate-diff-chr), idxstats (per-chromosome mapped/unmapped counts), and
samtools stats SN section (sequence lengths, quality, insert size, CIGAR
bases, pair orientation).

New output modules:
- flagstat.rs: samtools flagstat-compatible format
- idxstats.rs: samtools idxstats-compatible TSV format
- stats.rs: samtools stats SN section (MultiQC-parseable)

All three are enabled by default and can be toggled via YAML config.
Outputs are written to samtools/ subdirectory alongside existing tools.
Exact match with samtools on test data; close match on large BAM with
minor differences in computed averages (insert size filtering).
Implement RSeQC tin.py-compatible transcript integrity analysis that
measures coverage uniformity across transcript bodies using Shannon entropy.

New module tin.rs implements:
- TinIndex: per-transcript exonic position sampling with binary-search index
- TinAccum: per-position coverage tracking during BAM pass
- TIN computation: entropy-based uniformity score (0-100 scale)
- Output: .tin.xls (per-transcript scores) and .summary.txt (MultiQC-parseable)

Builds from GTF transcripts or BED12, configurable sample_size and min_coverage.
Integrated into the single-pass BAM processing pipeline alongside existing tools.
Implement gene body coverage analysis that computes a normalized coverage
profile across 100 transcript body bins, producing two output files:
- coverage_profile_along_genes_(total).txt (Qualimap-compatible)
- rnaseq_qc_results.txt with alignment stats, genomic origin, and 5'/3' bias

Wire coverage recording into all gene assignment points in counting.rs
(SE/PE parallel and sequential paths). Cross-chromosome singleton mates
are skipped as aligned blocks are unavailable.
@netlify
Copy link
Copy Markdown

netlify Bot commented Feb 15, 2026

Deploy Preview for rustqc canceled.

Name Link
🔨 Latest commit 0b162b9
🔍 Latest deploy log https://app.netlify.com/projects/rustqc/deploys/6992f9176931660008a4a739

ewels added 18 commits February 15, 2026 17:43
Implement preseq-compatible library complexity estimation (Phase 3):
- PreseqAccum: fragment counting via hashed keys (PE read1 / SE)
- Frequency-of-frequencies histogram construction
- Heck 1975 interpolation with Lanczos ln_gamma
- Good-Toulmin power series → QD algorithm → continued fraction
  extrapolation with Euler forward recurrence evaluation
- Optimal CF degree selection with stability checks
- Bootstrap multinomial resampling with quantile CIs
- TSV output matching preseq format (TOTAL_READS, EXPECTED_DISTINCT,
  LOWER/UPPER CI)

Config: PreseqConfig with max_extrap, step_size, n_bootstraps,
confidence_level, seed, max_terms, defects toggle.
CLI: --skip-preseq, --preseq-max-extrap, --preseq-step-size,
     --preseq-n-bootstraps flags.

Wired into RseqcAccumulators for both GTF and BED modes.
18 unit tests covering all numerical components.
Fix critical performance bug in bootstrap_resample where remaining
probability was recomputed by summing the entire remaining slice on
each iteration — O(n²) on 25M distinct molecules. Replace with
precomputed running sum decremented in-place — now O(n).

Add preseq reference outputs generated from preseq 3.2.0 Docker
image for both the small test BAM and large benchmark BAM.

Note: numerical accuracy vs reference preseq still needs work —
extrapolation diverges at high multiples and degree selection
falls back to max available degree too often.
Fix the root cause of ~40% singleton inflation in preseq library complexity
estimation: the negative TLEN branch computed frag_start as pos + tlen, but
the correct formula is cigar_end + tlen (since frag_end = cigar_end for the
rightmost read, and frag_start = frag_end - abs(tlen)).

This affected 43.6% of read pairs (all pairs where read1 maps right of read2).
Singletons now match preseq v3.2.0 within 0.001% (was +40.3%), and the full
extrapolation curve matches within <0.1% across the entire range.

Additional changes:
- Match preseq v3.2.0 read filtering: only exclude secondary (0x100), not
  supplementary (0x800); use proper pair flag (0x2) for PE/SE distinction
- Remove single-threaded --preseq-mode preseq-compat (hash-based parallel
  counting now matches preseq, making compat mode obsolete)
- Remove temporary histogram debug logging
- Add preseq v3.2.0 PE reference output for benchmarking
…secondary

- Fix TIN slot_idx u8 truncation: widen to u16 to support sample_size > 255
  without silent data corruption from index aliasing
- Fix genebody exonic/intronic base counting to use actual overlap length
  instead of full block length when blocks span exon-intron boundaries;
  cap exonic_bases at block_len to handle overlapping merged exons
- Fix Qualimap 'secondary' count: use actual BamStat secondary+supplementary
  values instead of computing as residual (which included QC-fail, paired-
  status-filtered reads, producing incorrect 'reads aligned' in Qualimap)
- Remove unused _interner parameter from process_chromosome_batch
- Add rust-version = "1.87" to Cargo.toml (required for is_multiple_of)
…ools

New output docs pages for preseq (library complexity), samtools
(flagstat, idxstats, stats), and TIN/gene body coverage. New benchmark
comparison pages for preseq and samtools tools.

Updated all existing docs to include the new tools: README, AGENTS.md,
CLI reference, configuration, quickstart, introduction, index, combined
benchmarks, credits, and Astro sidebar. Added benchmark reference outputs
for flagstat, idxstats, stats, TIN, preseq, and Qualimap gene body
coverage (both large and small datasets).
- Replace Vec<u8> qname in MateBufferKey with FNV-1a hash (u64),
  eliminating ~200M heap allocations per BAM file in paired-end mode
- Replace gene_hits.clone() with std::mem::take() for zero-cost
  ownership transfer when buffering mates
- Replace HashMap-based gene scoring with inline sorted-merge algorithm
  for paired-end mate gene assignment
- Replace String-based position keys in read_duplication with hash-based
  keys (HashMap<u64, u64>), eliminating per-read string formatting
- Parallelize preseq bootstrap replicates using rayon

10 GB BAM benchmark (10 threads):
  Counting pass: 260s -> 176s (32% faster)
  Total runtime: 414s -> 303s (27% faster)

All outputs remain bit-identical to the baseline.
… lookup

- Replace seq.as_bytes() allocation in read_duplication with direct
  4-bit BAM encoding iteration (hash_sequence_encoded), eliminating
  ~100M Vec<u8> allocations of ~100 bytes each
- Pre-compute chromosome name prefix mapping per TID instead of
  calling format!() on every read

10 GB BAM benchmark (10 threads):
  Counting pass: 176s -> 152s (14% faster, cumulative 41% vs baseline)
  Total runtime: 303s -> 280s (8% faster, cumulative 32% vs baseline)

All outputs remain bit-identical.
TIN is a reimplementation of RSeQC tin.py, so its benchmark belongs
with the other RSeQC tools rather than on the samtools page.
Add benchmark runners for preseq, samtools (flagstat/idxstats/stats),
TIN (tin.py), and Qualimap gene body coverage to run_benchmarks.py.
Update hand-crafted SVG benchmark charts with rows for all 16 tools.
Update timing references across README and docs to reflect the full
tool suite (~2h 45m traditional vs ~5m RustQC).
…ared utilities

- TIN: widen slot_idx and n_samples from u16 to u32, preventing silent
  truncation if sample_size exceeds 65535
- Gene body coverage: replace per-base iteration with bin-range approach,
  reducing inner loop from O(read_length) to O(bins_spanned)
- samtools stats: add TLEN >65536 outlier filter for insert size metrics,
  matching samtools behavior; move reads_mq0 counting to include all
  mapped reads (not just primary)
- Qualimap: use BamStatAccum.total_records for total_alignments instead
  of filtered stat_total_reads
- Extract shared median() function to io.rs, replacing duplicate
  implementations in genebody.rs and tin.rs
… benchmark/input/large/

Consolidate all upstream tool reference outputs into benchmark/input/large/
with descriptive names (samtools_reference.*, tin_reference.*) to match
the existing pattern used for preseq references. Removes the orphaned
benchmark/expected/ directory.
Move benchmark comparison images from flat docs/public/benchmarks/large/rseqc/
into subdirectories named after their respective scripts (junction_annotation/,
junction_saturation/, inner_distance/, read_duplication/). Update image paths
in rseqc.mdx.
…ctories

Move all Python RSeQC reference outputs from flat benchmark/rseqc/{large,small}/
into subdirectories named after their scripts: bam_stat/, infer_experiment/,
read_distribution/, inner_distance/, junction_annotation/, junction_saturation/,
read_duplication/, and tin/. Matches the structure used by RustQC outputs.
The RSeQC tin.py reference summary file was empty (0 bytes). Generated
proper summary from the reference tin.xls: mean=55.22, median=62.04,
stdev=28.58 (transcript-level, 97,750 transcripts).
…irectories

Update run_benchmarks.py to write upstream RSeQC outputs to per-tool
subdirectories under benchmark/rseqc/{dataset}/{tool}/ instead of a
flat benchmark/RSeQC/{dataset}/ directory. Move TIN output from
benchmark/tin/ to benchmark/rseqc/{dataset}/tin/ since it's an RSeQC
tool. Update benchmark/README.md, CONTRIBUTING.md, and .gitignore
with the new paths.
Move samtools and preseq reference outputs from benchmark/input/large/
into dedicated benchmark/samtools/ and benchmark/preseq/ directories
matching the structure of benchmark/rseqc/. Add small dataset references
for samtools (flagstat, idxstats, stats), preseq (lc_extrap_pe), and
Qualimap (rnaseq_qc_results, coverage_profile). Large Qualimap reference
omitted as it requires excessive disk for the name-sort pass.

Update benchmark/README.md with current directory structure and
reference generation commands for all upstream tools.
@ewels ewels merged commit a31b9c3 into main Feb 16, 2026
4 of 8 checks passed
@ewels ewels deleted the samtools-tin-genebody branch February 16, 2026 13:21
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant