Add samtools-compatible outputs, TIN analysis, and gene body coverage#17
Merged
Conversation
Extend BamStatAccum with counters for flagstat (paired, mapped, singletons, mate-diff-chr), idxstats (per-chromosome mapped/unmapped counts), and samtools stats SN section (sequence lengths, quality, insert size, CIGAR bases, pair orientation). New output modules: - flagstat.rs: samtools flagstat-compatible format - idxstats.rs: samtools idxstats-compatible TSV format - stats.rs: samtools stats SN section (MultiQC-parseable) All three are enabled by default and can be toggled via YAML config. Outputs are written to samtools/ subdirectory alongside existing tools. Exact match with samtools on test data; close match on large BAM with minor differences in computed averages (insert size filtering).
Implement RSeQC tin.py-compatible transcript integrity analysis that measures coverage uniformity across transcript bodies using Shannon entropy. New module tin.rs implements: - TinIndex: per-transcript exonic position sampling with binary-search index - TinAccum: per-position coverage tracking during BAM pass - TIN computation: entropy-based uniformity score (0-100 scale) - Output: .tin.xls (per-transcript scores) and .summary.txt (MultiQC-parseable) Builds from GTF transcripts or BED12, configurable sample_size and min_coverage. Integrated into the single-pass BAM processing pipeline alongside existing tools.
Implement gene body coverage analysis that computes a normalized coverage profile across 100 transcript body bins, producing two output files: - coverage_profile_along_genes_(total).txt (Qualimap-compatible) - rnaseq_qc_results.txt with alignment stats, genomic origin, and 5'/3' bias Wire coverage recording into all gene assignment points in counting.rs (SE/PE parallel and sequential paths). Cross-chromosome singleton mates are skipped as aligned blocks are unavailable.
✅ Deploy Preview for rustqc canceled.
|
Implement preseq-compatible library complexity estimation (Phase 3):
- PreseqAccum: fragment counting via hashed keys (PE read1 / SE)
- Frequency-of-frequencies histogram construction
- Heck 1975 interpolation with Lanczos ln_gamma
- Good-Toulmin power series → QD algorithm → continued fraction
extrapolation with Euler forward recurrence evaluation
- Optimal CF degree selection with stability checks
- Bootstrap multinomial resampling with quantile CIs
- TSV output matching preseq format (TOTAL_READS, EXPECTED_DISTINCT,
LOWER/UPPER CI)
Config: PreseqConfig with max_extrap, step_size, n_bootstraps,
confidence_level, seed, max_terms, defects toggle.
CLI: --skip-preseq, --preseq-max-extrap, --preseq-step-size,
--preseq-n-bootstraps flags.
Wired into RseqcAccumulators for both GTF and BED modes.
18 unit tests covering all numerical components.
Fix critical performance bug in bootstrap_resample where remaining probability was recomputed by summing the entire remaining slice on each iteration — O(n²) on 25M distinct molecules. Replace with precomputed running sum decremented in-place — now O(n). Add preseq reference outputs generated from preseq 3.2.0 Docker image for both the small test BAM and large benchmark BAM. Note: numerical accuracy vs reference preseq still needs work — extrapolation diverges at high multiples and degree selection falls back to max available degree too often.
Fix the root cause of ~40% singleton inflation in preseq library complexity estimation: the negative TLEN branch computed frag_start as pos + tlen, but the correct formula is cigar_end + tlen (since frag_end = cigar_end for the rightmost read, and frag_start = frag_end - abs(tlen)). This affected 43.6% of read pairs (all pairs where read1 maps right of read2). Singletons now match preseq v3.2.0 within 0.001% (was +40.3%), and the full extrapolation curve matches within <0.1% across the entire range. Additional changes: - Match preseq v3.2.0 read filtering: only exclude secondary (0x100), not supplementary (0x800); use proper pair flag (0x2) for PE/SE distinction - Remove single-threaded --preseq-mode preseq-compat (hash-based parallel counting now matches preseq, making compat mode obsolete) - Remove temporary histogram debug logging - Add preseq v3.2.0 PE reference output for benchmarking
…secondary - Fix TIN slot_idx u8 truncation: widen to u16 to support sample_size > 255 without silent data corruption from index aliasing - Fix genebody exonic/intronic base counting to use actual overlap length instead of full block length when blocks span exon-intron boundaries; cap exonic_bases at block_len to handle overlapping merged exons - Fix Qualimap 'secondary' count: use actual BamStat secondary+supplementary values instead of computing as residual (which included QC-fail, paired- status-filtered reads, producing incorrect 'reads aligned' in Qualimap) - Remove unused _interner parameter from process_chromosome_batch - Add rust-version = "1.87" to Cargo.toml (required for is_multiple_of)
…ools New output docs pages for preseq (library complexity), samtools (flagstat, idxstats, stats), and TIN/gene body coverage. New benchmark comparison pages for preseq and samtools tools. Updated all existing docs to include the new tools: README, AGENTS.md, CLI reference, configuration, quickstart, introduction, index, combined benchmarks, credits, and Astro sidebar. Added benchmark reference outputs for flagstat, idxstats, stats, TIN, preseq, and Qualimap gene body coverage (both large and small datasets).
- Replace Vec<u8> qname in MateBufferKey with FNV-1a hash (u64), eliminating ~200M heap allocations per BAM file in paired-end mode - Replace gene_hits.clone() with std::mem::take() for zero-cost ownership transfer when buffering mates - Replace HashMap-based gene scoring with inline sorted-merge algorithm for paired-end mate gene assignment - Replace String-based position keys in read_duplication with hash-based keys (HashMap<u64, u64>), eliminating per-read string formatting - Parallelize preseq bootstrap replicates using rayon 10 GB BAM benchmark (10 threads): Counting pass: 260s -> 176s (32% faster) Total runtime: 414s -> 303s (27% faster) All outputs remain bit-identical to the baseline.
… lookup - Replace seq.as_bytes() allocation in read_duplication with direct 4-bit BAM encoding iteration (hash_sequence_encoded), eliminating ~100M Vec<u8> allocations of ~100 bytes each - Pre-compute chromosome name prefix mapping per TID instead of calling format!() on every read 10 GB BAM benchmark (10 threads): Counting pass: 176s -> 152s (14% faster, cumulative 41% vs baseline) Total runtime: 303s -> 280s (8% faster, cumulative 32% vs baseline) All outputs remain bit-identical.
TIN is a reimplementation of RSeQC tin.py, so its benchmark belongs with the other RSeQC tools rather than on the samtools page.
Add benchmark runners for preseq, samtools (flagstat/idxstats/stats), TIN (tin.py), and Qualimap gene body coverage to run_benchmarks.py. Update hand-crafted SVG benchmark charts with rows for all 16 tools. Update timing references across README and docs to reflect the full tool suite (~2h 45m traditional vs ~5m RustQC).
…ared utilities - TIN: widen slot_idx and n_samples from u16 to u32, preventing silent truncation if sample_size exceeds 65535 - Gene body coverage: replace per-base iteration with bin-range approach, reducing inner loop from O(read_length) to O(bins_spanned) - samtools stats: add TLEN >65536 outlier filter for insert size metrics, matching samtools behavior; move reads_mq0 counting to include all mapped reads (not just primary) - Qualimap: use BamStatAccum.total_records for total_alignments instead of filtered stat_total_reads - Extract shared median() function to io.rs, replacing duplicate implementations in genebody.rs and tin.rs
… benchmark/input/large/ Consolidate all upstream tool reference outputs into benchmark/input/large/ with descriptive names (samtools_reference.*, tin_reference.*) to match the existing pattern used for preseq references. Removes the orphaned benchmark/expected/ directory.
Move benchmark comparison images from flat docs/public/benchmarks/large/rseqc/ into subdirectories named after their respective scripts (junction_annotation/, junction_saturation/, inner_distance/, read_duplication/). Update image paths in rseqc.mdx.
…ctories
Move all Python RSeQC reference outputs from flat benchmark/rseqc/{large,small}/
into subdirectories named after their scripts: bam_stat/, infer_experiment/,
read_distribution/, inner_distance/, junction_annotation/, junction_saturation/,
read_duplication/, and tin/. Matches the structure used by RustQC outputs.
The RSeQC tin.py reference summary file was empty (0 bytes). Generated proper summary from the reference tin.xls: mean=55.22, median=62.04, stdev=28.58 (transcript-level, 97,750 transcripts).
…irectories
Update run_benchmarks.py to write upstream RSeQC outputs to per-tool
subdirectories under benchmark/rseqc/{dataset}/{tool}/ instead of a
flat benchmark/RSeQC/{dataset}/ directory. Move TIN output from
benchmark/tin/ to benchmark/rseqc/{dataset}/tin/ since it's an RSeQC
tool. Update benchmark/README.md, CONTRIBUTING.md, and .gitignore
with the new paths.
Move samtools and preseq reference outputs from benchmark/input/large/ into dedicated benchmark/samtools/ and benchmark/preseq/ directories matching the structure of benchmark/rseqc/. Add small dataset references for samtools (flagstat, idxstats, stats), preseq (lc_extrap_pe), and Qualimap (rnaseq_qc_results, coverage_profile). Large Qualimap reference omitted as it requires excessive disk for the name-sort pass. Update benchmark/README.md with current directory structure and reference generation commands for all upstream tools.
This reverts commit bcbfcc6.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Adds three major feature groups to RustQC's RNA-seq QC pipeline, all computed in the existing single-pass BAM counting engine:
flagstat,idxstats, andstats(SN section) — reimplementations that produce output matching samtools format, integrated into the RSeQC accumulator frameworktin.py, computing per-transcript integrity scores from read coverage uniformity across exonic regionscoverage_profile_along_genes_(total).txtandrnaseq_qc_results.txtwith alignment stats, genomic origin percentages, and 5'/3' transcript coverage biasChanges
New files
src/rna/rseqc/flagstat.rs— samtools flagstat reimplementationsrc/rna/rseqc/idxstats.rs— samtools idxstats reimplementationsrc/rna/rseqc/stats.rs— samtools stats SN section reimplementationsrc/rna/rseqc/tin.rs— TIN analysis modulesrc/rna/genebody.rs— Gene body coverage module with Qualimap-compatible outputModified files
src/config.rs— Config structs for all new tools (flagstat, idxstats, stats, TIN, genebody coverage)src/main.rs— Wiring for all new outputs, transcript position map construction, output writingsrc/rna/dupradar/counting.rs— Gene body coverage recording at all gene assignment pointssrc/rna/rseqc/accumulators.rs— Accumulator integration for flagstat/idxstats/statssrc/rna/rseqc/bam_stat.rs— Extended to support flagstat/stats data collectionsrc/rna/rseqc/mod.rs— Module declarationssrc/rna/mod.rs— Module declaration for genebodyTesting
cargo test --release)cargo fmt --checkandcargo clippy -- -D warningsclean