Skip to content

sinmojito/MCPcounter-rust

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

3 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

mcpcounter-rust

Crates.io Docs.rs CI License: GPL-3.0-or-later

mcpcounter-rust is a pure-Rust, zero-dependency port of the MCPcounter family of cell-population quantification methods — MCP-counter for human tissue (Becht et al. 2016) and mMCP-counter for mouse (Petitprez et al. 2020). It reproduces the original R packages bit-for-bit on synthetic inputs and to the last bit of an IEEE-754 double (worst relative error 2.2e-16, ~1 ULP) on real public RNA-seq, with no BLAS, LAPACK, Python, or R at runtime — and no third-party crates at all.

Disclaimer. mcpcounter-rust is an independent, unofficial Rust port and is not affiliated with, endorsed by, or supported by the authors of MCP-counter or mMCP-counter. For the canonical, authoritative implementations, use the upstream R packages ebecht/MCPcounter and cit-bioinfo/mMCP-counter.

Highlights

  • Numeric parity with R, validated at two levels. Seeded synthetic fixtures match the R packages to a relative error of 0 (bit-identical); real public RNA-seq matches to ≤ 1 ULP (2.18e-16 human, 1.69e-16 mouse). Nothing is accepted unless it reproduces R 4.6.0.
  • Both family methods, one engine. Human MCP-counter (10 immune & stromal populations, arithmetic-mean scores) and mouse mMCP-counter (16 populations, median scores) share a single R-faithful mean/median core.
  • Faithful to R's quirks, on purpose. Reproduces R's order-sensitive two-pass mean, even-n median (mean(c(a, b)), not (a + b) / 2), MCP-counter's intersect de-duplication, and mMCP-counter's marker multiplicity (a marker listed k times counts k times) — the details that make last-bit parity hold. See Fidelity notes.
  • Zero dependencies. No native libraries, no build script, no transitive crates — a single, audit-tiny library. Marker signatures are embedded at compile time.
  • Library-first. use mcpcounter_rust::..., plus a dependency-free TSV reader for matrix IO.

Installation

cargo add mcpcounter-rust

The crate is mcpcounter-rust; the library imports as mcpcounter_rust (Cargo maps the dash to an underscore). It builds on Rust 1.74+ and pulls in no dependencies.

Quick start

mcp_counter on a tiny human matrix — the canonical call, exactly as MCPcounter.estimate in R. Markers are matched by name against the bundled signature; genes that aren't markers are ignored, and populations with no present markers are dropped:

use mcpcounter_rust::{mcp_counter, ExprMatrix, FeaturesType};

fn main() {
    // 6 genes (rows) x 3 samples (cols), log2-scale expression, stored row-major.
    // Five are MCP-counter markers; GAPDH is a non-marker housekeeping gene and
    // is ignored, exactly as in R.
    let genes = vec![
        "CD3D".to_string(),  // T cells
        "CD28".to_string(),  // T cells
        "CD8B".to_string(),  // CD8 T cells
        "CD19".to_string(),  // B lineage
        "CD79A".to_string(), // B lineage
        "GAPDH".to_string(), // not a marker -> ignored
    ];
    let samples = vec![
        "tumor_A".to_string(),
        "tumor_B".to_string(),
        "normal".to_string(),
    ];
    #[rustfmt::skip]
    let data = vec![
        //  tumor_A  tumor_B  normal
            8.1,     7.4,     2.0, // CD3D
            7.6,     7.1,     1.8, // CD28
            6.9,     6.2,     1.5, // CD8B
            4.2,     8.3,     2.1, // CD19
            4.0,     8.0,     1.9, // CD79A
           11.0,    11.1,    11.0, // GAPDH (flat housekeeping)
    ];

    let expr = ExprMatrix::new(genes, samples, data);
    let result = mcp_counter(&expr, FeaturesType::HugoSymbols);

    // Each score is the mean of a population's present markers; populations
    // appear in MCP-counter's signature order.
    for (p, pop) in result.populations.iter().enumerate() {
        print!("{pop:<14}");
        for s in 0..result.samples.len() {
            print!("{:>10.3}", result.score(p, s));
        }
        println!();
    }
}
T cells            7.850     7.250     1.900
CD8 T cells        6.900     6.200     1.500
B lineage          4.100     8.150     2.000

The mouse estimator has the same shape — mmcp_counter(&expr, MurineFeaturesType::GeneSymbol, GenomeVersion::GCRm39) — returning median scores over 16 murine populations. The runnable version of the above is examples/quickstart.rs; the real-data parity harness is examples/parity_realdata.rs.

Methods, inputs & licensing

Each method re-implements a published algorithm and inherits its upstream license; the crate is therefore distributed under GPL-3.0-or-later.

Method Function Populations Score Feature namespaces Upstream / license
MCP-counter (human) mcp_counter 10 mean HugoSymbols, Affy133P2Probesets, EntrezId, EnsemblId ebecht/MCPcounter — code GPL-3; signatures CC0 / public domain
mMCP-counter (mouse) mmcp_counter 16 median GeneSymbol, EnsemblId, ProbesGCRm38 / GCRm39) cit-bioinfo/mMCP-counter — code + signatures GPL-3

Input is an ExprMatrix (genes × samples; build with ExprMatrix::new, or parse a TSV via mcpcounter_rust::io::read_tsv_matrix). No internal transform or normalization is applied, matching R; NA maps to f64::NAN and is dropped per population (na.rm = TRUE).

Scope. This crate is the MCPcounter family only. Other deconvolution methods (ssGSEA-based, solver-based, …) are intentionally not bundled here — they belong in their own focused crates.

Validation

Every method is checked against the unmodified upstream R package (R 4.6.0), never against hand- or AI-computed values. Parity is verified at two levels, and both are required — the real-data level catches what synthetic data hides:

Method Dataset Organism Shape (genes × samples) Worst relative error
MCP-counter seeded synthetic committed fixtures 0 (bit-identical)
mMCP-counter seeded synthetic committed fixtures 0 (bit-identical)
MCP-counter dataset_racle (Racle 2017, eLife) human 32,467 × 4 2.18e-16
mMCP-counter dataset_petitprez (Petitprez 2020, Genome Med) mouse 53,819 × 6 1.69e-16

Synthetic inputs report a perfect 0; only real expression — with its hundreds of present markers per population — surfaces the sub-ULP rounding, because R accumulates means and medians in 80-bit long double that an f64 cannot match to the final bit. A flawless synthetic score would have hidden this, so real-data validation is treated as a required gate, not a bonus.

Reproduce it from public data only (the large input matrices are never committed):

Rscript benchmarks/parity_realdata.R   # dump R reference outputs to scratch/
cargo run --example parity_realdata    # revalidate; writes the metrics JSON

Synthetic fixtures run as plain unit tests (cargo test); aggregate real-data metrics live in benchmarks/parity_realdata_results.json.

Fidelity notes

The behaviours below are reproduced deliberately — they are why the numbers land on the last bit rather than merely "close":

  • Order-sensitive two-pass mean. R's mean accumulates in 80-bit long double over two passes; marker rows are summed in ascending matrix-row order (R's intersect(rownames, markers) order) so the rounding matches.
  • Even-n median = mean(c(a, b)). R averages the two central order statistics with that same two-pass mean, not the naive (a + b) / 2 — they can differ in the last ULP.
  • MCP-counter de-duplicates; mMCP-counter does not. MCP-counter selects rows via intersect (one row per marker, first occurrence); mMCP-counter selects via exp[sig[, features], ], so a marker listed k times contributes k copies to the median.
  • Fixed 16-population mMCP output order, with all-NA/absent populations dropped but all-zero populations kept.
  • Empty-result divergence (documented). When no markers match, R stop()s; this port returns an empty DeconvResult instead.

R → mcpcounter-rust name map

R mcpcounter-rust
MCPcounter.estimate(expr, featuresType = …) mcp_counter(&expr, FeaturesType::…)
mMCPcounter.estimate(exp, features = …, genomeVersion = …) mmcp_counter(&expr, MurineFeaturesType::…, GenomeVersion::…)
"HUGO_symbols" / "affy133P2_probesets" FeaturesType::HugoSymbols / Affy133P2Probesets
result matrix [population, sample] DeconvResult::score(pop, sample)

Performance

The crate is dependency-free and resolves each population's markers by direct row lookup, so scoring is O(present markers) per population rather than a per-population rescan of the whole gene panel — microsecond-scale per call on a 32k-gene human matrix (compute only). Note the shipped benchmark validates numeric parity, not wall-clock speed: there is no head-to-head timing harness against R, so no speedup multiple is claimed here.

Building from source

cargo build --release
cargo test                      # unit + synthetic-parity tests
cargo run --example quickstart

The crate has a single build configuration (no feature flags, no dependencies), so cargo build and cargo build --all-features are identical. The local CI gate is cargo fmt --all -- --check, then cargo clippy --all-targets --all-features -- -D warnings, then cargo test.

How this port was produced

mcpcounter-rust was written with AI assistance (Anthropic's Claude, via Claude Code); the Co-Authored-By trailers in the git history record this. Its outputs are checked numerically against the upstream R packages, as described under Validation — never against AI-generated expected values. Upstream sources and signature tables are fetched/exported locally for study; nothing is vendored beyond the embedded marker signatures.

License

MCP-counter and mMCP-counter are licensed GPL-3. As a derivative port, mcpcounter-rust is distributed under GPL-3.0-or-later; the full text is in LICENSE. The bundled MCP-counter marker signatures are upstream CC0 / public-domain data; the mMCP-counter signatures are GPL-3 (exported from the package's data tables).

Credits & citation

MCP-counter and mMCP-counter are the work of Etienne Becht, Florent Petitprez, Aurélien de Reyniès, and colleagues. This port reimplements their algorithms; it does not originate them. If you use mcpcounter-rust in research, please cite the original papers:

Becht, E., Giraldo, N.A., Lacroix, L., et al. (2016). Estimating the population abundance of tissue-infiltrating immune and stromal cell populations using gene expression. Genome Biology 17, 218. doi:10.1186/s13059-016-1070-5

Petitprez, F., Levy, S., Sun, C.-M., et al. (2020). The murine Microenvironment Cell Population counter method to estimate abundance of tissue-infiltrating immune and stromal cell populations in murine samples using gene expression. Genome Medicine 12, 86. doi:10.1186/s13073-020-00783-w

About

Rust port of MCPcounter and mMCPcounter using claude code

Resources

License

Stars

Watchers

Forks

Packages

 
 
 

Contributors