MitoDrift reconstructs single-cell lineage trees from mitochondrial DNA (mtDNA) mutations by modeling heteroplasmy drift and measurement noise with a Wright–Fisher hidden Markov Tree (WF-HMT). It applies population genetics principles (genetic drift) to model mtDNA heteroplasmy in single cells in order to reconstruct high-precision lineage trees from single-cell genomics/multiome data. MitoDrift uses expectation-maximization (EM) to obtain maximum-likelihood estimates of drift, mutation, and error rates, then performs phylogenetic MCMC to quantify the uncertainty in tree topology. The primary output is a phylogeny with posterior clade supports and refined tree topologies at different levels of confidence. Inputs can be mtDNA allele counts from any single-cell genomics assays that capture mtDNA variation (e.g., ReDeeM, mtscATAC-seq, MAESTER).
# from a local checkout
if (!requireNamespace("remotes", quietly = TRUE)) install.packages("remotes")
remotes::install_github("sankaranlab/mitodrift")For faster MCMC trace IO, install qs2 with TBB:
remotes::install_cran("qs2", type = "source", configure.args = "--with-TBB --with-simd=AVX2")This uses a pL1000 subset dataset packaged under inst/extdata/ (200 cells, 186 variants).
You can run this from a terminal in the package root:
Rscript inst/bin/run_mitodrift_em.R \
--mut_dat inst/extdata/pL1000_mut_dat.csv \
--outdir mitodrift_demo \
--tree_mcmc_iter 5000 \
--tree_mcmc_chains 4 \
--tree_mcmc_burnin 1000Outputs in mitodrift_demo/ include mitodrift_object.rds plus tree files and diagnostics.
Once the run finishes:
library(mitodrift)
mut_dat <- read.csv(
system.file("extdata", "pL1000_mut_dat.csv", package = "mitodrift")
)
md <- readRDS("mitodrift_demo/mitodrift_object.rds")
phy_trim <- trim_tree(md$tree_annot, conf = 0.5)
plot_phylo_heatmap2(
phy_trim,
mut_dat,
dot_size = 1,
branch_width = 0.3,
branch_length = FALSE,
node_conf = TRUE,
het_max = 1
)One row per cell–variant pair, with counts for alternate allele (a) and total depth (d).
| cell | variant | a | d |
|---|---|---|---|
| PD45534aj | MT_10448_T_C | 0 | 2348 |
| PD45534aj | MT_11787_T_C | 1462 | 2000 |
| PD45534aj | MT_1244_T_C | 2 | 1500 |
| PD45534ak | MT_10448_T_C | 5 | 2100 |
IMPORTANT: Include rows where a = 0 so that every cell × variant combination is represented (the observation model uses total depth).
Wide matrices with variant in the first column and cells in subsequent columns. Use the same layout for amat (alt counts) and dmat (total depth).
Full lineage inference pipeline is run with inst/bin/run_mitodrift_em.R.
--fit_params: Enable automatic parameter fitting via EM (default:TRUE)--fit_param_max_iter: Maximum EM iterations (default: 100)--fit_param_epsilon: EM convergence threshold on parameter changes (default: 1e-3)--k: Number of VAF bins for discretizing heteroplasmy levels (default: 20)--npop: Population size for Wright-Fisher model (default: 600)--eps: Mutation rate per branch (default: 0.001; auto-fitted iffit_params=TRUE)--err: Variant detection error rate (default: 0; auto-fitted iffit_params=TRUE)--ngen: Number of WF generations (default: 100; auto-fitted iffit_params=TRUE)
--tree_mcmc_chains: Number of independent MCMC runs (default: 1; recommended 10-50 for robust inference. For large trees e.g., >8000 cells, use less chains to avoid memory limits)--tree_mcmc_iter: Maximum iterations per chain (default: 100; can be overridden by automatic termination via ASDSF convergence check)--conv_thres: ASDSF threshold for MCMC termination (default:NULL; e.g., 0.05-0.1 for auto-convergence). ASDSF (Average Standard Deviation of Split Frequencies) summarizes topology agreement across chains. A lower value indicates good mixing. Values < 0.05 indicate good convergence; < 0.1 is acceptable for exploratory analyses or large trees.--tree_mcmc_batch_size: Iteration interval for saving traces and checking convergence (default: 1000)--tree_mcmc_burnin: Initial samples to discard from each chain (default: 0; recommended 10-20% of iterations)
--ncores: Default number of cores for likelihood computation during tree optimization (default: 1)--ncores_em: Cores for parallel EM parameter fitting across variants (default: 1)--ncores_nj: Cores for initial NJ tree distance matrix computation viaparallelDist(default: 1)- NOTE: Different
ncores_njvalues may yield slightly different NJ topologies due to floating-point rounding
- NOTE: Different
--ncores_qs: Cores for parallel file I/O (qs2compression) during MCMC and annotation (default: 1)--ncores_annot: Cores for computing clade frequencies from MCMC traces (default: same as--ncores)
- Use
--resume TRUEto continue interrupted runs from saved traces
For a complete walkthrough, see the Analysis Workflow tutorial.
