Add samtools-compatible outputs, TIN analysis, and gene body coverage by ewels · Pull Request #17 · seqeralabs/RustQC

ewels · 2026-02-15T13:52:12Z

Summary

Adds three major feature groups to RustQC's RNA-seq QC pipeline, all computed in the existing single-pass BAM counting engine:

Samtools-compatible outputs: flagstat, idxstats, and stats (SN section) — reimplementations that produce output matching samtools format, integrated into the RSeQC accumulator framework
TIN (Transcript Integrity Number): Reimplementation of RSeQC's tin.py, computing per-transcript integrity scores from read coverage uniformity across exonic regions
Gene body coverage + Qualimap-compatible output: Computes a normalized 100-bin coverage profile across transcript bodies, producing coverage_profile_along_genes_(total).txt and rnaseq_qc_results.txt with alignment stats, genomic origin percentages, and 5'/3' transcript coverage bias

Changes

New files

src/rna/rseqc/flagstat.rs — samtools flagstat reimplementation
src/rna/rseqc/idxstats.rs — samtools idxstats reimplementation
src/rna/rseqc/stats.rs — samtools stats SN section reimplementation
src/rna/rseqc/tin.rs — TIN analysis module
src/rna/genebody.rs — Gene body coverage module with Qualimap-compatible output

Modified files

src/config.rs — Config structs for all new tools (flagstat, idxstats, stats, TIN, genebody coverage)
src/main.rs — Wiring for all new outputs, transcript position map construction, output writing
src/rna/dupradar/counting.rs — Gene body coverage recording at all gene assignment points
src/rna/rseqc/accumulators.rs — Accumulator integration for flagstat/idxstats/stats
src/rna/rseqc/bam_stat.rs — Extended to support flagstat/stats data collection
src/rna/rseqc/mod.rs — Module declarations
src/rna/mod.rs — Module declaration for genebody

Testing

All 120 unit tests and 12 integration tests pass (cargo test --release)
cargo fmt --check and cargo clippy -- -D warnings clean
Verified output on test BAM: all new output files produced with correct content

Extend BamStatAccum with counters for flagstat (paired, mapped, singletons, mate-diff-chr), idxstats (per-chromosome mapped/unmapped counts), and samtools stats SN section (sequence lengths, quality, insert size, CIGAR bases, pair orientation).

New output modules:
- flagstat.rs: samtools flagstat-compatible format
- idxstats.rs: samtools idxstats-compatible TSV format
- stats.rs: samtools stats SN section (MultiQC-parseable)

All three are enabled by default and can be toggled via YAML config. Outputs are written to the samtools/ subdirectory alongside existing tools. Exact match with samtools on test data; close match on large BAM with minor differences in computed averages (insert size filtering).
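To illustrate the kind of flag-driven counters described above, here is a minimal sketch. The `FlagCounts` struct and its `record` method are hypothetical names, not the actual `BamStatAccum` API, and the counting rules are a simplified reading of flagstat semantics (pair-level counters only consider primary records), so treat it as an assumption rather than a reproduction of samtools.

```rust
/// Hypothetical flagstat-style counters (not the real BamStatAccum).
#[derive(Debug, Default, PartialEq)]
pub struct FlagCounts {
    pub total: u64,
    pub mapped: u64,
    pub paired: u64,
    pub singletons: u64,
    pub mate_diff_chr: u64,
}

// SAM flag bits, as defined in the SAM specification.
const PAIRED: u16 = 0x1;
const UNMAPPED: u16 = 0x4;
const MATE_UNMAPPED: u16 = 0x8;
const SECONDARY: u16 = 0x100;
const SUPPLEMENTARY: u16 = 0x800;

impl FlagCounts {
    /// Update the counters from one record's flag and reference ids.
    pub fn record(&mut self, flag: u16, tid: i32, mate_tid: i32) {
        self.total += 1;
        let mapped = (flag & UNMAPPED) == 0;
        if mapped {
            self.mapped += 1;
        }
        // Assumption: pair-level counters skip secondary/supplementary records.
        if (flag & (SECONDARY | SUPPLEMENTARY)) != 0 || (flag & PAIRED) == 0 {
            return;
        }
        self.paired += 1;
        let mate_mapped = (flag & MATE_UNMAPPED) == 0;
        if mapped && !mate_mapped {
            // The read is mapped but its mate is not.
            self.singletons += 1;
        }
        if mapped && mate_mapped && tid != mate_tid {
            // Both mates are mapped, to different references.
            self.mate_diff_chr += 1;
        }
    }
}
```

Because every counter is derived from the flag word and two reference ids, this kind of update can run inside an existing per-record loop without a second pass over the BAM file.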

Implement RSeQC tin.py-compatible transcript integrity analysis that measures coverage uniformity across transcript bodies using Shannon entropy.

New module tin.rs implements:
- TinIndex: per-transcript exonic position sampling with binary-search index
- TinAccum: per-position coverage tracking during BAM pass
- TIN computation: entropy-based uniformity score (0-100 scale)
- Output: .tin.xls (per-transcript scores) and .summary.txt (MultiQC-parseable)

Builds from GTF transcripts or BED12, configurable sample_size and min_coverage. Integrated into the single-pass BAM processing pipeline alongside existing tools.
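The entropy-based score can be sketched as follows. This assumes the commonly cited TIN definition, 100 × exp(H) / n, where H is the Shannon entropy of the coverage distribution over n sampled positions; the function name `tin_score` is illustrative, and filtering such as min_coverage is left out.

```rust
/// Sketch of an entropy-based uniformity score on a 0-100 scale.
/// `coverage` holds read depth at each sampled exonic position.
pub fn tin_score(coverage: &[u32]) -> f64 {
    let n = coverage.len();
    let total: f64 = coverage.iter().map(|&c| c as f64).sum();
    if n == 0 || total == 0.0 {
        return 0.0;
    }
    // Shannon entropy over positions with non-zero coverage.
    let entropy: f64 = coverage
        .iter()
        .filter(|&&c| c > 0)
        .map(|&c| {
            let p = c as f64 / total;
            -p * p.ln()
        })
        .sum();
    // exp(entropy) is the "effective" number of uniformly covered positions.
    100.0 * entropy.exp() / n as f64
}
```

Under this definition, perfectly even coverage scores 100, while coverage concentrated at a single one of four sampled positions scores 25.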

Implement gene body coverage analysis that computes a normalized coverage profile across 100 transcript body bins, producing two output files:
- coverage_profile_along_genes_(total).txt (Qualimap-compatible)
- rnaseq_qc_results.txt with alignment stats, genomic origin, and 5'/3' bias

Wire coverage recording into all gene assignment points in counting.rs (SE/PE parallel and sequential paths). Cross-chromosome singleton mates are skipped as aligned blocks are unavailable.
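A minimal sketch of the binning idea: a transcript-relative position is mapped to one of 100 bins, and the resulting profile is scaled. The helper names are hypothetical, and max-normalization is an assumption chosen for illustration; the scaling used for the Qualimap-compatible output may differ.

```rust
pub const N_BINS: usize = 100;

/// Map a 0-based transcript-relative position to a bin in 0..N_BINS.
/// Returns None when the position lies outside the transcript.
pub fn bin_for_position(pos: u64, transcript_len: u64) -> Option<usize> {
    if transcript_len == 0 || pos >= transcript_len {
        return None;
    }
    // Widen to u128 so pos * N_BINS cannot overflow.
    Some((pos as u128 * N_BINS as u128 / transcript_len as u128) as usize)
}

/// Scale a coverage profile so that its largest bin equals 1.0
/// (assumed normalization, for illustration only).
pub fn normalize_by_max(bins: &[u64]) -> Vec<f64> {
    let max = bins.iter().copied().max().unwrap_or(0);
    if max == 0 {
        return vec![0.0; bins.len()];
    }
    bins.iter().map(|&b| b as f64 / max as f64).collect()
}
```

Integer arithmetic keeps the bin assignment deterministic, which matters when outputs are compared against reference files.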

netlify · 2026-02-15T13:52:16Z

✅ Deploy Preview for rustqc canceled.

Name	Link
🔨 Latest commit	`0b162b9`
🔍 Latest deploy log	https://app.netlify.com/projects/rustqc/deploys/6992f9176931660008a4a739

Implement preseq-compatible library complexity estimation (Phase 3):
- PreseqAccum: fragment counting via hashed keys (PE read1 / SE)
- Frequency-of-frequencies histogram construction
- Heck 1975 interpolation with Lanczos ln_gamma
- Good-Toulmin power series → QD algorithm → continued fraction extrapolation with Euler forward recurrence evaluation
- Optimal CF degree selection with stability checks
- Bootstrap multinomial resampling with quantile CIs
- TSV output matching preseq format (TOTAL_READS, EXPECTED_DISTINCT, LOWER/UPPER CI)

Config: PreseqConfig with max_extrap, step_size, n_bootstraps, confidence_level, seed, max_terms, defects toggle.

CLI: --skip-preseq, --preseq-max-extrap, --preseq-step-size, --preseq-n-bootstraps flags.

Wired into RseqcAccumulators for both GTF and BED modes. 18 unit tests covering all numerical components.
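The frequency-of-frequencies step can be illustrated with a small sketch. Given one hashed key per observed fragment, it counts how many distinct fragments were seen exactly j times. The function name and the use of plain `u64` keys are assumptions for illustration, not the `PreseqAccum` interface.

```rust
use std::collections::HashMap;

/// Build a counts histogram: hist[j] = number of distinct fragments
/// observed exactly j times (hist[0] is always 0).
pub fn counts_histogram(keys: &[u64]) -> Vec<u64> {
    // First pass: number of observations per distinct fragment key.
    let mut per_fragment: HashMap<u64, u64> = HashMap::new();
    for &k in keys {
        *per_fragment.entry(k).or_insert(0) += 1;
    }
    // Second pass: how many fragments share each observation count.
    let max = per_fragment.values().copied().max().unwrap_or(0) as usize;
    let mut hist = vec![0u64; max + 1];
    for &c in per_fragment.values() {
        hist[c as usize] += 1;
    }
    hist
}
```

For the keys `[1, 1, 2, 3, 3, 3, 4]` this yields `[0, 2, 1, 1]`: two singletons, one fragment seen twice, and one seen three times. The singleton count `hist[1]` is the quantity the later TLEN fix refers to.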

Fix critical performance bug in bootstrap_resample where remaining probability was recomputed by summing the entire remaining slice on each iteration — O(n²) on 25M distinct molecules. Replace with precomputed running sum decremented in-place — now O(n). Add preseq reference outputs generated from preseq 3.2.0 Docker image for both the small test BAM and large benchmark BAM. Note: numerical accuracy vs reference preseq still needs work — extrapolation diverges at high multiples and degree selection falls back to max available degree too often.
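The running-sum idea behind this fix can be sketched in isolation. In sequential multinomial sampling, category i is drawn with conditional probability w_i divided by the weight not yet consumed; keeping that remainder as a decremented running total avoids re-summing the tail on every iteration. The function below is a hypothetical illustration that only computes the conditional probabilities and omits the random draws.

```rust
/// Conditional probabilities for sequential multinomial sampling,
/// computed in O(n) with a running remainder instead of re-summing
/// weights[i..] at every step (which would be O(n^2)).
pub fn conditional_probs(weights: &[f64]) -> Vec<f64> {
    let mut remaining: f64 = weights.iter().sum();
    let mut out = Vec::with_capacity(weights.len());
    for &w in weights {
        // Clamp to 1.0 to absorb small floating-point drift in `remaining`.
        let p = if remaining > 0.0 { (w / remaining).min(1.0) } else { 0.0 };
        out.push(p);
        remaining -= w; // decrement in place rather than summing the tail
    }
    out
}
```

One trade-off of the running total is accumulated floating-point error over very long inputs, which is why the sketch clamps each probability.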

Fix the root cause of ~40% singleton inflation in preseq library complexity estimation: the negative TLEN branch computed frag_start as pos + tlen, but the correct formula is cigar_end + tlen (since frag_end = cigar_end for the rightmost read, and frag_start = frag_end - abs(tlen)). This affected 43.6% of read pairs (all pairs where read1 maps right of read2). Singletons now match preseq v3.2.0 within 0.001% (was +40.3%), and the full extrapolation curve matches within <0.1% across the entire range.

Additional changes:
- Match preseq v3.2.0 read filtering: only exclude secondary (0x100), not supplementary (0x800); use proper pair flag (0x2) for PE/SE distinction
- Remove single-threaded --preseq-mode preseq-compat (hash-based parallel counting now matches preseq, making compat mode obsolete)
- Remove temporary histogram debug logging
- Add preseq v3.2.0 PE reference output for benchmarking
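The corrected fragment arithmetic can be sketched as follows, assuming 0-based half-open coordinates. The function name `fragment_span` is hypothetical; only the `cigar_end + tlen` formula comes from the commit message.

```rust
/// Fragment span (start, end) derived from one read of a pair.
/// `pos` is the read's leftmost aligned base, `cigar_end` the end of its
/// alignment, and `tlen` the signed template length.
pub fn fragment_span(pos: i64, cigar_end: i64, tlen: i64) -> (i64, i64) {
    if tlen >= 0 {
        // Leftmost read: the fragment starts where this read starts.
        (pos, pos + tlen)
    } else {
        // Rightmost read: the fragment ends where this read ends, so
        // start = cigar_end - |tlen| = cigar_end + tlen (tlen is negative).
        (cigar_end + tlen, cigar_end)
    }
}
```

With a pair spanning 100..300, the rightmost read (pos 250, cigar_end 300, tlen -200) yields (100, 300), the same span as its mate. The earlier `pos + tlen` form would have given a start of 50, so the two orientations of an identical fragment produced different keys.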

…secondary

- Fix TIN slot_idx u8 truncation: widen to u16 to support sample_size > 255 without silent data corruption from index aliasing
- Fix genebody exonic/intronic base counting to use actual overlap length instead of full block length when blocks span exon-intron boundaries; cap exonic_bases at block_len to handle overlapping merged exons
- Fix Qualimap 'secondary' count: use actual BamStat secondary+supplementary values instead of computing as residual (which included QC-fail, paired-status-filtered reads, producing incorrect 'reads aligned' in Qualimap)
- Remove unused _interner parameter from process_chromosome_batch
- Add rust-version = "1.87" to Cargo.toml (required for is_multiple_of)
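The overlap-length fix for exonic base counting can be illustrated with a short sketch. Both function names are hypothetical, intervals are assumed to be half-open `(start, end)` pairs, and the cap mirrors the "cap exonic_bases at block_len" note above.

```rust
/// Length of the overlap between two half-open intervals (0 if disjoint).
pub fn overlap_len(a: (u64, u64), b: (u64, u64)) -> u64 {
    a.1.min(b.1).saturating_sub(a.0.max(b.0))
}

/// Exonic bases for one aligned block: sum the actual overlaps with each
/// exon, then cap at the block length so that overlapping exons cannot
/// push the total above the number of bases in the block.
pub fn exonic_bases(block: (u64, u64), exons: &[(u64, u64)]) -> u64 {
    let block_len = block.1.saturating_sub(block.0);
    let total: u64 = exons.iter().map(|&e| overlap_len(block, e)).sum();
    total.min(block_len)
}
```

A block at 90..110 against an exon at 0..100 contributes 10 exonic bases rather than the full 20, which is the difference between counting the overlap and counting the whole block.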

…ools New output docs pages for preseq (library complexity), samtools (flagstat, idxstats, stats), and TIN/gene body coverage. New benchmark comparison pages for preseq and samtools tools. Updated all existing docs to include the new tools: README, AGENTS.md, CLI reference, configuration, quickstart, introduction, index, combined benchmarks, credits, and Astro sidebar. Added benchmark reference outputs for flagstat, idxstats, stats, TIN, preseq, and Qualimap gene body coverage (both large and small datasets).

- Replace Vec<u8> qname in MateBufferKey with FNV-1a hash (u64), eliminating ~200M heap allocations per BAM file in paired-end mode
- Replace gene_hits.clone() with std::mem::take() for zero-cost ownership transfer when buffering mates
- Replace HashMap-based gene scoring with inline sorted-merge algorithm for paired-end mate gene assignment
- Replace String-based position keys in read_duplication with hash-based keys (HashMap<u64, u64>), eliminating per-read string formatting
- Parallelize preseq bootstrap replicates using rayon

10 GB BAM benchmark (10 threads):
- Counting pass: 260s -> 176s (32% faster)
- Total runtime: 414s -> 303s (27% faster)

All outputs remain bit-identical to the baseline.
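Two of the techniques above are small enough to sketch directly: hashing a read name with 64-bit FNV-1a instead of storing it, and moving a vector out of a mutable slot with `std::mem::take`. The function names are illustrative; the constants are the standard 64-bit FNV offset basis and prime.

```rust
const FNV_OFFSET: u64 = 0xcbf2_9ce4_8422_2325;
const FNV_PRIME: u64 = 0x0000_0100_0000_01b3;

/// 64-bit FNV-1a: XOR each byte into the state, then multiply by the prime.
pub fn fnv1a_64(bytes: &[u8]) -> u64 {
    let mut h = FNV_OFFSET;
    for &b in bytes {
        h ^= b as u64;
        h = h.wrapping_mul(FNV_PRIME);
    }
    h
}

/// Move the buffered hits out of `gene_hits`, leaving an empty Vec behind.
/// No elements are copied; only the Vec's pointer, length and capacity move.
pub fn take_hits(gene_hits: &mut Vec<u32>) -> Vec<u32> {
    std::mem::take(gene_hits)
}
```

A fixed-width hash key trades a very small collision probability for the removal of one heap allocation per buffered mate, so whether that trade is acceptable depends on how collisions would affect mate pairing.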

… lookup

- Replace seq.as_bytes() allocation in read_duplication with direct 4-bit BAM encoding iteration (hash_sequence_encoded), eliminating ~100M Vec<u8> allocations of ~100 bytes each
- Pre-compute chromosome name prefix mapping per TID instead of calling format!() on every read

10 GB BAM benchmark (10 threads):
- Counting pass: 176s -> 152s (14% faster, cumulative 41% vs baseline)
- Total runtime: 303s -> 280s (8% faster, cumulative 32% vs baseline)

All outputs remain bit-identical.
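Hashing directly over the packed sequence might look like the sketch below. BAM stores two bases per byte with the first base in the high nibble; the function walks those nibbles without decoding to ASCII. The name `hash_packed_bases` and the choice of FNV-1a as the mixing function are assumptions, and this is not the project's `hash_sequence_encoded`.

```rust
const FNV_OFFSET: u64 = 0xcbf2_9ce4_8422_2325;
const FNV_PRIME: u64 = 0x0000_0100_0000_01b3;

/// Hash `n_bases` 4-bit base codes from a BAM-style packed sequence.
/// Panics if `packed` holds fewer than `n_bases` nibbles.
pub fn hash_packed_bases(packed: &[u8], n_bases: usize) -> u64 {
    let mut h = FNV_OFFSET;
    for i in 0..n_bases {
        let byte = packed[i / 2];
        // Even indices use the high nibble, odd indices the low nibble.
        let code = if i % 2 == 0 { byte >> 4 } else { byte & 0x0f };
        h ^= code as u64;
        h = h.wrapping_mul(FNV_PRIME);
    }
    h
}
```

Passing the base count explicitly means the padding nibble of an odd-length read never enters the hash, so two reads with the same bases hash identically regardless of what the final unused nibble contains.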

TIN is a reimplementation of RSeQC tin.py, so its benchmark belongs with the other RSeQC tools rather than on the samtools page.

Add benchmark runners for preseq, samtools (flagstat/idxstats/stats), TIN (tin.py), and Qualimap gene body coverage to run_benchmarks.py. Update hand-crafted SVG benchmark charts with rows for all 16 tools. Update timing references across README and docs to reflect the full tool suite (~2h 45m traditional vs ~5m RustQC).

…ared utilities

- TIN: widen slot_idx and n_samples from u16 to u32, preventing silent truncation if sample_size exceeds 65535
- Gene body coverage: replace per-base iteration with bin-range approach, reducing inner loop from O(read_length) to O(bins_spanned)
- samtools stats: add TLEN >65536 outlier filter for insert size metrics, matching samtools behavior; move reads_mq0 counting to include all mapped reads (not just primary)
- Qualimap: use BamStatAccum.total_records for total_alignments instead of filtered stat_total_reads
- Extract shared median() function to io.rs, replacing duplicate implementations in genebody.rs and tin.rs
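The bin-range approach for gene body coverage can be sketched as below: instead of visiting every base of an aligned block, it visits only the bins the block touches and adds the number of overlapping bases to each. The function name, the half-open transcript-relative coordinates, and the choice to accumulate bases per bin are assumptions for illustration.

```rust
pub const N_BINS: usize = 100;

/// Integer ceiling division.
fn ceil_div(a: u64, b: u64) -> u64 {
    (a + b - 1) / b
}

/// Add the block [start, end) to a 100-bin profile in O(bins_spanned).
/// Blocks that are empty or extend past the transcript are ignored.
pub fn add_block(bins: &mut [u64; N_BINS], start: u64, end: u64, transcript_len: u64) {
    if transcript_len == 0 || start >= end || end > transcript_len {
        return;
    }
    let n = N_BINS as u64;
    let first = start * n / transcript_len;
    let last = (end - 1) * n / transcript_len;
    for b in first..=last {
        // Bin b covers positions [ceil(b*len/n), ceil((b+1)*len/n)).
        let bin_lo = ceil_div(b * transcript_len, n);
        let bin_hi = ceil_div((b + 1) * transcript_len, n);
        let lo = start.max(bin_lo);
        let hi = end.min(bin_hi);
        bins[b as usize] += hi - lo;
    }
}
```

For a 1000-base transcript, a block at 95..125 touches four bins and contributes 5, 10, 10 and 5 bases to bins 9 through 12, the same totals a per-base loop would produce in 30 iterations.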

… benchmark/input/large/ Consolidate all upstream tool reference outputs into benchmark/input/large/ with descriptive names (samtools_reference.*, tin_reference.*) to match the existing pattern used for preseq references. Removes the orphaned benchmark/expected/ directory.

Move benchmark comparison images from flat docs/public/benchmarks/large/rseqc/ into subdirectories named after their respective scripts (junction_annotation/, junction_saturation/, inner_distance/, read_duplication/). Update image paths in rseqc.mdx.

…ctories Move all Python RSeQC reference outputs from flat benchmark/rseqc/{large,small}/ into subdirectories named after their scripts: bam_stat/, infer_experiment/, read_distribution/, inner_distance/, junction_annotation/, junction_saturation/, read_duplication/, and tin/. Matches the structure used by RustQC outputs.

The RSeQC tin.py reference summary file was empty (0 bytes). Generated proper summary from the reference tin.xls: mean=55.22, median=62.04, stdev=28.58 (transcript-level, 97,750 transcripts).

…irectories Update run_benchmarks.py to write upstream RSeQC outputs to per-tool subdirectories under benchmark/rseqc/{dataset}/{tool}/ instead of a flat benchmark/RSeQC/{dataset}/ directory. Move TIN output from benchmark/tin/ to benchmark/rseqc/{dataset}/tin/ since it's an RSeQC tool. Update benchmark/README.md, CONTRIBUTING.md, and .gitignore with the new paths.

Move samtools and preseq reference outputs from benchmark/input/large/ into dedicated benchmark/samtools/ and benchmark/preseq/ directories matching the structure of benchmark/rseqc/. Add small dataset references for samtools (flagstat, idxstats, stats), preseq (lc_extrap_pe), and Qualimap (rnaseq_qc_results, coverage_profile). Large Qualimap reference omitted as it requires excessive disk for the name-sort pass. Update benchmark/README.md with current directory structure and reference generation commands for all upstream tools.

This reverts commit bcbfcc6.

ewels added 3 commits February 15, 2026 01:31

ewels added 18 commits February 15, 2026 17:43

docs: move TIN and gene body coverage benchmarks to RSeQC page

6b3345f


fix: populate TIN reference summary with actual upstream tin.py values

60ed4ea


Add TODO for generating large Qualimap reference output

bcbfcc6

Revert "Add TODO for generating large Qualimap reference output"

0b162b9


ewels merged commit a31b9c3 into main Feb 16, 2026
4 of 8 checks passed

ewels deleted the samtools-tin-genebody branch February 16, 2026 13:21

ewels merged 21 commits into main from samtools-tin-genebody
