ROCIT (Read Origin Classifier In Tumors) is a deep learning tool that classifies individual sequencing reads from long-read bulk tumor sequencing as tumor-derived or normal-derived. By leveraging CpG methylation patterns, ROCIT enables read-level resolution of tumor heterogeneity from PacBio sequencing data.
ROCIT currently supports training and prediction on PacBio HiFi Tumor BAMs with CpG methylation probabilities produced by Jasmine. Oxford Nanopore support is planned for future releases.
ROCIT uses a multi-step approach:
- Data Preprocessing: Extracts CpG methylation from PacBio BAM files and labels the origin of a subset of reads based on somatic variants (SNVs) and loss of heterozygosity (LOH) events
- Input Features: Combines read-level methylation patterns with cell-type reference atlases and bulk sample methylation distributions
- Model Training: Trains a transformer-based neural network to classify the labelled read subset using chromosomal cross-validation
- Prediction: Applies the trained model to classify all reads in the sample
Install with pip:
pip install rocit

Or install from source:
git clone https://github.com/tobybaker/rocit.git
cd rocit
pip install -e .

Requirements:
- Python ≥ 3.10
- PyTorch ≥ 2.9.1
- PyTorch Lightning ≥ 2.6.0
- Polars ≥ 1.36.1
- pysam == 0.22.1
- Additional dependencies listed in pyproject.toml
ROCIT requires a reference cell-type methylation atlas derived from whole-genome bisulfite sequencing data (GSE186458). Download the pre-computed atlas:
mkdir -p reference
wget https://zenodo.org/records/18859554/files/cell_map_reference_atlas.parquet -O reference/cell_map_reference_atlas.parquet

Alternatively, you can generate the atlas from source using the provided scripts.
Citation: Loyfer, N., et al. (2023). A DNA methylation atlas of normal human cell types. Nature.
ROCIT provides a complete end-to-end pipeline via the rocit run command:
rocit run --config config.yaml

For more control, you can run individual steps:
# 1. Extract methylation from BAM
rocit extract-bam-methylation --sample-id SAMPLE_ID \
--sample-bam aligned.bam \
--output-dir methylation/
# 2. Compute methylation distribution
rocit extract-cpg-distribution --sample-id SAMPLE_ID \
--methylation-dir methylation/ \
--output-dir distribution/
# 3. Label reads for training
rocit preprocess --config preprocess_config.yaml
# 4. Train the model
rocit train --config train_config.yaml
# 5. Run predictions
rocit predict --config predict_config.yaml

The examples/ directory contains a worked example using the HG008 cancer cell line, including sample config files and instructions for downloading the example dataset from Zenodo. This is the best starting point if you want to see the expected input data formats and run ROCIT end-to-end on real data.
ROCIT uses YAML configuration files for reproducibility and ease of use. Below are templates for each command. Tabular input files (variants, copy number, haplotags, etc.) can be provided in any supported format; the configuration examples use .parquet, but .csv, .tsv, .arrow, and .ndjson are equally valid.

Training configuration (train_config.yaml):
sample_id: "SAMPLE_ID"
labelled_data: "preprocessing/labelled_methylation_data.parquet"
sample_distribution: "distribution/SAMPLE_ID_methylation_distribution.parquet"
cell_atlas: "reference/cell_atlas.parquet"
val_chromosomes: ["chr20", "chr21"]
test_chromosomes: ["chr22"]
output_dir: "training/"
cache_dir: "/scratch/"

Required fields:
- sample_id: Unique identifier for the sample
- labelled_data: Path to labelled methylation data from preprocessing
- sample_distribution: Path to sample methylation distribution
- cell_atlas: Path to cell-type methylation reference
- val_chromosomes: Chromosomes reserved for validation (must not overlap with test)
- test_chromosomes: Chromosomes reserved for testing (must not overlap with validation)
- output_dir: Directory for training outputs
- cache_dir: Cache directory for dataset processing (default: /scratch)
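The required fields and the non-overlap rule for validation and test chromosomes can be checked before a long training run starts. The following is a minimal sketch, assuming the YAML file has already been parsed into a dict; `validate_train_config` is a hypothetical helper and not part of ROCIT.

```python
# Hypothetical pre-flight check for a training config; not part of the ROCIT API.
REQUIRED_TRAIN_FIELDS = (
    "sample_id", "labelled_data", "sample_distribution", "cell_atlas",
    "val_chromosomes", "test_chromosomes", "output_dir", "cache_dir",
)


def validate_train_config(config: dict) -> list:
    """Return a list of problems; an empty list means the config looks valid."""
    problems = [
        f"missing required field: {name}"
        for name in REQUIRED_TRAIN_FIELDS
        if name not in config
    ]
    # Validation and test chromosomes must be disjoint.
    val = set(config.get("val_chromosomes", []))
    test = set(config.get("test_chromosomes", []))
    overlap = sorted(val & test)
    if overlap:
        problems.append(f"val/test chromosomes overlap: {', '.join(overlap)}")
    return problems


example = {
    "sample_id": "SAMPLE_ID",
    "labelled_data": "preprocessing/labelled_methylation_data.parquet",
    "sample_distribution": "distribution/SAMPLE_ID_methylation_distribution.parquet",
    "cell_atlas": "reference/cell_atlas.parquet",
    "val_chromosomes": ["chr20", "chr21"],
    "test_chromosomes": ["chr22"],
    "output_dir": "training/",
    "cache_dir": "/scratch/",
}
print(validate_train_config(example))  # prints []
```

Returning a list of problems rather than raising on the first one lets all issues in a config be reported at once.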
Prediction configuration (predict_config.yaml):
sample_id: "SAMPLE_ID"
best_checkpoint_path: "training/SAMPLE_ID/version_0/checkpoints/best-checkpoint.ckpt"
sample_distribution: "distribution/SAMPLE_ID_methylation_distribution.parquet"
cell_atlas: "reference/cell_atlas.parquet"
read_store_dir: "methylation/" # OR use read_store for single file
output_dir: "predictions/"
cache_dir: "/scratch/"

Required fields:
- sample_id: Unique identifier for the sample
- best_checkpoint_path: Path to trained model checkpoint
- sample_distribution: Path to sample methylation distribution
- cell_atlas: Path to cell-type methylation reference
- read_store OR read_store_dir: Single file or directory of methylation parquet files
- output_dir: Directory for prediction outputs
- cache_dir: Cache directory (default: /scratch)
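The choice between `read_store` and `read_store_dir` can be resolved into a concrete list of files. This sketch assumes the two fields are mutually exclusive and that a directory contributes every `*.parquet` file it contains; `resolve_read_store` is a hypothetical helper, and ROCIT's own handling may differ.

```python
import tempfile
from pathlib import Path


def resolve_read_store(config: dict) -> list:
    """Turn read_store / read_store_dir into a sorted list of parquet paths."""
    has_file = "read_store" in config
    has_dir = "read_store_dir" in config
    # Assumption: exactly one of the two fields should be set.
    if has_file == has_dir:
        raise ValueError("provide exactly one of read_store or read_store_dir")
    if has_file:
        return [Path(config["read_store"])]
    return sorted(Path(config["read_store_dir"]).glob("*.parquet"))


# Usage: a temporary directory stands in for methylation/.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("chr2_cpg_methylation_data.parquet", "chr1_cpg_methylation_data.parquet"):
        (Path(tmp) / name).touch()
    found = [p.name for p in resolve_read_store({"read_store_dir": tmp})]
print(found)
```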
Preprocessing configuration (preprocess_config.yaml):
sample_id: "SAMPLE_ID"
bam: "data/aligned.bam"
methylation_dir: "methylation/"
copy_number: "data/copy_number_segments.parquet"
variants: "data/somatic_variants.parquet"
haplotags: "data/haplotags.parquet"
haploblocks: "data/haploblocks.parquet"
snv_clusters: "data/snv_clusters.parquet"
snv_cluster_assignments: "data/snv_cluster_assignments.parquet"
output_dir: "preprocessing/"

Required fields:
- sample_id: Unique identifier for the sample
- bam: Path to aligned BAM file with methylation tags
- methylation_dir: Directory containing extracted methylation data
- copy_number: Path to copy number segments file
- variants: Path to somatic variants file
- haplotags: Path to read haplotype assignments
- haploblocks: Path to phased haplotype blocks
- snv_clusters: Path to SNV clusters file
- output_dir: Directory for preprocessing outputs
Optional fields:
- snv_cluster_assignments: Path to SNV cluster assignments (if not provided, will be inferred)

Full pipeline configuration (run_config.yaml):
sample_id: "SAMPLE_ID"
bam: "data/aligned.bam"
bam_index: "data/aligned.bam.bai"
copy_number: "data/copy_number_segments.parquet"
variants: "data/somatic_variants.parquet"
haplotags: "data/haplotags.parquet"
haploblocks: "data/haploblocks.parquet"
snv_clusters: "data/snv_clusters.parquet"
snv_cluster_assignments: "data/snv_cluster_assignments.parquet"
cell_atlas: "reference/cell_atlas.parquet"
val_chromosomes: ["chr20", "chr21"]
test_chromosomes: ["chr22"]
min_mapq: 0
workers: 8
output_dir: "output/"
cache_dir: "/scratch/"

Required fields:
- sample_id: Unique identifier for the sample
- bam: Path to aligned BAM file with methylation tags
- copy_number: Path to copy number segments file
- variants: Path to somatic variants file
- haplotags: Path to read haplotype assignments
- haploblocks: Path to phased haplotype blocks
- snv_clusters: Path to SNV clusters file
- cell_atlas: Path to cell-type methylation reference
- val_chromosomes: Chromosomes reserved for validation (must not overlap with test)
- test_chromosomes: Chromosomes reserved for testing (must not overlap with validation)
- output_dir: Directory for all pipeline outputs
- cache_dir: Cache directory for dataset processing
Optional fields:
- bam_index: BAM index file (auto-detected if not provided)
- snv_cluster_assignments: Path to SNV cluster assignments (if not provided, will be inferred)
- chromosomes: Specific chromosomes to process (defaults to chr1-chrY)
- min_mapq: Minimum mapping quality for reads (default: 0)
- workers: Number of parallel workers for BAM processing (default: 1)
Run the complete ROCIT pipeline from BAM to predictions.
rocit run --config run_config.yaml

Pipeline steps:
- Extract BAM methylation
- Compute CpG distribution
- Label reads using somatic variants
- Train classification model
- Generate predictions
Outputs:
- output/methylation/: Per-chromosome methylation data
- output/distribution/: Sample methylation distribution
- output/preprocessing/: Labelled reads and methylation data
- output/training/: Model checkpoints and training logs
- output/predictions/: Final tumor/normal predictions
Train a ROCIT classification model.
rocit train --config train_config.yaml

Outputs:
- {output_dir}/{sample_id}/version_X/checkpoints/best-checkpoint.ckpt: Best model
- {output_dir}/{sample_id}/version_X/metrics.csv: Training metrics (loss, AUROC, etc.)
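Because each training run writes to its own `version_X` directory, it can help to locate the most recent checkpoint automatically before filling in `best_checkpoint_path`. The sketch below assumes that `X` is an integer and that the highest number is the latest run; `latest_checkpoint` is a hypothetical helper, not a ROCIT function.

```python
import re
import tempfile
from pathlib import Path


def latest_checkpoint(output_dir, sample_id):
    """Return the best-checkpoint.ckpt of the highest-numbered version, or None."""
    sample_dir = Path(output_dir) / sample_id
    candidates = []
    for version_dir in sample_dir.glob("version_*"):
        match = re.fullmatch(r"version_(\d+)", version_dir.name)
        ckpt = version_dir / "checkpoints" / "best-checkpoint.ckpt"
        if match and ckpt.is_file():
            candidates.append((int(match.group(1)), ckpt))
    if not candidates:
        return None
    # Compare numerically so version_10 sorts after version_2.
    return max(candidates, key=lambda item: item[0])[1]


# Usage: build a throwaway directory tree that mimics the training outputs.
with tempfile.TemporaryDirectory() as tmp:
    for version in (0, 2, 10):
        ckpt_dir = Path(tmp) / "SAMPLE_ID" / f"version_{version}" / "checkpoints"
        ckpt_dir.mkdir(parents=True)
        (ckpt_dir / "best-checkpoint.ckpt").touch()
    newest = latest_checkpoint(tmp, "SAMPLE_ID").parts[-3]
    nothing = latest_checkpoint(tmp, "OTHER_SAMPLE")
print(newest)
```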
Training parameters (modifiable in code via TrainingParams in the Python API):
- Model architecture: 384-dim, 6 heads, 3 layers
- Max epochs: 100 with early stopping (patience=10)
- Learning rate: 1e-4 with warmup
- Batch size: 256
- Early Stopping Metric: Validation AUROC
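To make the early-stopping rule concrete, the sketch below replays a list of per-epoch validation AUROC values and reports where training would stop for a given patience. It is a simplified illustration of patience-based stopping on a metric that should increase, not ROCIT's training loop; `early_stop_epoch` is a hypothetical name.

```python
def early_stop_epoch(val_auroc, patience=10):
    """Return (stop_epoch, best_epoch) as zero-based epoch indices.

    Training stops once `patience` consecutive epochs pass without the
    validation AUROC exceeding its best value so far.
    """
    if not val_auroc:
        raise ValueError("val_auroc must contain at least one epoch")
    best_score = float("-inf")
    best_epoch = 0
    for epoch, score in enumerate(val_auroc):
        if score > best_score:
            best_score, best_epoch = score, epoch
        elif epoch - best_epoch >= patience:
            return epoch, best_epoch
    # Ran out of epochs before patience was exhausted.
    return len(val_auroc) - 1, best_epoch


history = [0.70, 0.75, 0.80, 0.79, 0.78, 0.80, 0.77]
print(early_stop_epoch(history, patience=3))  # prints (5, 2)
```

With a patience of 3, the best AUROC is reached at epoch 2 and the three following epochs fail to beat it, so the run stops at epoch 5 and the checkpoint from epoch 2 is the one to keep.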
Generate predictions using a trained model.
rocit predict --config predict_config.yaml

Outputs:
- {output_dir}/{sample_id}_tumor_origin_predictions.parquet: Read-level predictions with columns:
  - read_index: Unique read identifier
  - chromosome: Chromosome name
  - tumor_probability: Predicted probability of tumor origin (0-1)
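Downstream analyses often need discrete calls rather than probabilities. The sketch below turns `tumor_probability` into a three-way call; the thresholds (0.8 and 0.2) and the `call_read_origin` helper are illustrative assumptions, not values recommended by ROCIT. Rows are plain dicts here, which is the shape returned by `DataFrame.to_dicts()` in polars.

```python
def call_read_origin(rows, tumor_threshold=0.8, normal_threshold=0.2):
    """Add a 'call' key ('tumor', 'normal' or 'uncertain') to each prediction row."""
    called = []
    for row in rows:
        probability = row["tumor_probability"]
        if probability >= tumor_threshold:
            label = "tumor"
        elif probability <= normal_threshold:
            label = "normal"
        else:
            label = "uncertain"
        called.append({**row, "call": label})
    return called


rows = [
    {"read_index": 1001, "chromosome": "chr1", "tumor_probability": 0.87},
    {"read_index": 1002, "chromosome": "chr1", "tumor_probability": 0.12},
    {"read_index": 1003, "chromosome": "chr1", "tumor_probability": 0.50},
]
calls = [row["call"] for row in call_read_origin(rows)]
print(calls)  # prints ['tumor', 'normal', 'uncertain']
```

Keeping an explicit "uncertain" class avoids forcing a call on reads whose probability sits near 0.5.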
Label reads for training using somatic variant information.
rocit preprocess --config preprocess_config.yaml

Outputs:
- {output_dir}/labelled_reads.parquet: Read labels (tumor/normal)
- {output_dir}/labelled_methylation_data.parquet: Methylation data with labels
Extract CpG methylation from PacBio BAM files.
rocit extract-bam-methylation \
--sample-id SAMPLE_ID \
--sample-bam aligned.bam \
--output-dir methylation/ \
--workers 8 \
--min-mapq 0 \
--chromosomes "chr1 chr2 chr3"

Options:
- --sample-id: Sample identifier for output naming
- --sample-bam: Input BAM file with MM/ML tags
- --output-dir: Output directory
- --index: BAM index file (optional, auto-detected)
- --min-mapq: Minimum mapping quality (default: 0)
- --workers: Number of parallel workers (default: 1)
- --chromosomes: Space-separated chromosomes to process (default: chr1-chrY)
Outputs:
- {output_dir}/{chromosome}_cpg_methylation_data.parquet for each chromosome
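After extraction, a quick check that every expected per-chromosome file exists can catch a partially failed run. This sketch assumes that the default "chr1-chrY" means chr1 to chr22 plus chrX and chrY; the helper names are hypothetical.

```python
from pathlib import Path

# Assumption: the chr1-chrY default covers the 22 autosomes plus chrX and chrY.
DEFAULT_CHROMOSOMES = [f"chr{i}" for i in range(1, 23)] + ["chrX", "chrY"]


def expected_methylation_files(output_dir, chromosomes=None):
    """List the parquet files extraction should produce, one per chromosome."""
    selected = DEFAULT_CHROMOSOMES if chromosomes is None else chromosomes
    return [Path(output_dir) / f"{chrom}_cpg_methylation_data.parquet" for chrom in selected]


def missing_methylation_files(output_dir, chromosomes=None):
    """Return the expected files that are not present on disk."""
    return [p for p in expected_methylation_files(output_dir, chromosomes) if not p.is_file()]


paths = expected_methylation_files("methylation", ["chr1", "chr2", "chr3"])
print([p.name for p in paths])
```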
Aggregate methylation distribution from extracted data.
rocit extract-cpg-distribution \
--sample-id SAMPLE_ID \
--methylation-dir methylation/ \
--output-dir distribution/

Outputs:
- {output_dir}/{sample_id}_methylation_distribution.parquet: An aggregated distribution of methylation values across the sample, used for model context.
The primary output from ROCIT is a parquet file with read-level predictions:
import polars as pl
predictions = pl.read_parquet("predictions/SAMPLE_ID_tumor_origin_predictions.parquet")
print(predictions.head())
# Example output:
# ┌────────────┬────────────┬───────────────────┐
# │ read_index │ chromosome │ tumor_probability │
# ├────────────┼────────────┼───────────────────┤
# │ 1001 │ chr1 │ 0.87 │
# │ 1002 │ chr1 │ 0.12 │
# │ 1003 │ chr1 │ 0.94 │
# └────────────┴────────────┴───────────────────┘

Training progress is logged to CSV:
metrics = pl.read_csv("training/SAMPLE_ID/version_0/metrics.csv")
# Contains: epoch, train_loss, train_auroc, val_loss, val_auroc, etc.

ROCIT uses a transformer-based architecture designed for long-read methylation data:
- Input: CpG methylation patterns, cell atlas features, sample distribution features
ROCIT can also be used programmatically:
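import polars as pl
import rocit
from pathlib import Path
# Load inputs first. This assumes the API takes polars frames, as the *_df names suggest;
# the paths are illustrative and mirror the config templates.
labelled_df = pl.read_parquet("preprocessing/labelled_methylation_data.parquet")
distribution_df = pl.read_parquet("distribution/SAMPLE_ID_methylation_distribution.parquet")
atlas_df = pl.read_parquet("reference/cell_atlas.parquet")
methylation_lazy_df = pl.scan_parquet("methylation/chr22_cpg_methylation_data.parquet")
# Training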
train_result = rocit.train(
sample_id="SAMPLE_ID",
labelled_data=labelled_df,
sample_distribution=distribution_df,
cell_atlas=atlas_df,
val_chromosomes=["chr20", "chr21"],
test_chromosomes=["chr22"],
output_dir=Path("training/"),
cache_dir=Path("/scratch/")
)
# Prediction
predictions = rocit.predict(
sample_id="SAMPLE_ID",
best_checkpoint_path=Path("training/best-checkpoint.ckpt"),
read_store=[methylation_lazy_df], # List of polars DataFrames or LazyFrames
sample_distribution=distribution_df,
cell_atlas=atlas_df,
output_dir=Path("predictions/"),
cache_dir=Path("/scratch/")
)

If you prefer to build the cell-type methylation atlas yourself rather than using the pre-computed version, two scripts are provided. This process downloads and processes whole-genome bisulfite sequencing data from GEO accession GSE186458, which contains methylation profiles across diverse normal human cell types.
pip install pyBigWig polars tqdm requests

Step 1: Download the bigwig files
This will download ~60 GB of raw data from GEO NCBI:
python setup_scripts/download_bigwig_files.py \
--outdir /path/to/download_directory/

Options:
- --outdir: Output directory for downloaded files (default: GSE186458_hg38_bigwigs)
- --max-concurrent: Number of concurrent downloads (default: 4; do not exceed ~5)
- --dry-run: Print URLs without downloading
Step 2: Build the atlas from downloaded files
python setup_scripts/generate_cell_map_df.py \
--data-dir /path/to/download_directory/ \
--output reference/cell_atlas.parquet

- Download (download_bigwig_files.py): Fetches all *.hg38.bigwig files from the GSE186458 GEO series via NCBI FTP, with resume support and retry logic
- Process (generate_cell_map_df.py): For each cell type, aggregates methylation values across biological replicates
- Combine: Joins all cell types into a single reference atlas
- Output: Saves a Parquet file with columns:
  - chromosome: chr1-chr22, chrX
  - position: CpG genomic position
  - average_methylation_{cell_type}: Mean methylation value (0-1) for each cell type
The resulting atlas enables ROCIT to contextualize read-level methylation patterns using cell-type-specific reference signatures.
GSE186458 contains whole-genome bisulfite sequencing (WGBS) data from normal human tissues and cell types. Each cell type typically has multiple biological replicates, which the script averages to produce robust methylation estimates.
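The replicate averaging described above can be illustrated in a few lines. This is a simplified sketch, not the logic of generate_cell_map_df.py: it assumes each replicate is a mapping from (chromosome, position) to a methylation fraction, and that a CpG missing from one replicate is averaged over the replicates that do cover it.

```python
from collections import defaultdict
from statistics import fmean


def average_replicates(replicates):
    """Average methylation per CpG across replicates of one cell type.

    replicates: list of dicts mapping (chromosome, position) -> fraction in [0, 1].
    """
    values_by_site = defaultdict(list)
    for replicate in replicates:
        for site, fraction in replicate.items():
            values_by_site[site].append(fraction)
    # Assumption: sites absent from a replicate are simply skipped for that replicate.
    return {site: fmean(values) for site, values in values_by_site.items()}


replicate_a = {("chr1", 10468): 0.75, ("chr1", 10470): 0.5}
replicate_b = {("chr1", 10468): 0.25}
averaged = average_replicates([replicate_a, replicate_b])
print(averaged[("chr1", 10468)])  # prints 0.5
```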
Citation: Loyfer, N., et al. (2023). A DNA methylation atlas of normal human cell types. Nature.
ROCIT accepts the following tabular file formats for all non-BAM inputs (copy number, variants, haplotags, haploblocks, SNV clusters, cell atlas, etc.):
| Format | Extensions |
|---|---|
| Parquet | .parquet, .pqt |
| CSV | .csv |
| TSV | .tsv |
| Arrow IPC / Feather | .arrow, .feather, .ipc |
| Newline-delimited JSON | .ndjson |
All outputs from ROCIT are written as Parquet.
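The extension table above amounts to a lookup from file suffix to format. A minimal sketch of such a lookup follows; `detect_format` is a hypothetical helper, and treating extensions case-insensitively is an assumption rather than documented ROCIT behaviour. In polars, the matching readers would be `read_parquet`, `read_csv` (with `separator="\t"` for TSV), `read_ipc`, and `read_ndjson`.

```python
from pathlib import Path

# Extension -> format name, mirroring the table of supported formats.
FORMAT_BY_EXTENSION = {
    ".parquet": "parquet",
    ".pqt": "parquet",
    ".csv": "csv",
    ".tsv": "tsv",
    ".arrow": "ipc",
    ".feather": "ipc",
    ".ipc": "ipc",
    ".ndjson": "ndjson",
}


def detect_format(path):
    """Return the tabular format implied by a file's extension."""
    extension = Path(path).suffix.lower()  # assumption: case-insensitive match
    if extension not in FORMAT_BY_EXTENSION:
        supported = ", ".join(sorted(FORMAT_BY_EXTENSION))
        raise ValueError(f"unsupported extension {extension!r}; expected one of {supported}")
    return FORMAT_BY_EXTENSION[extension]


print(detect_format("data/somatic_variants.tsv"))  # prints tsv
```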
Required columns (copy number segments):
- chromosome: Chromosome name (e.g., "chr1")
- start: Segment start position
- end: Segment end position
- minor_cn: Minor allele copy number
- major_cn: Major allele copy number
- total_cn: Total copy number
- purity: Tumor purity estimate
- normal_total_cn: Normal total copy number (typically 2 except for chrX/chrY in XY subjects)
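A few sanity checks on copy number segments can catch malformed input early. The sketch below rests on conventional assumptions that ROCIT may or may not enforce: `total_cn` equals `minor_cn + major_cn`, `minor_cn` does not exceed `major_cn`, and a segment shows loss of heterozygosity (LOH) when `minor_cn` is 0 while `major_cn` is positive. Both helpers are hypothetical.

```python
def check_segment(segment):
    """Return a list of consistency problems for one copy number segment."""
    problems = []
    if segment["start"] >= segment["end"]:
        problems.append("start must be smaller than end")
    if segment["minor_cn"] > segment["major_cn"]:
        problems.append("minor_cn exceeds major_cn")
    if segment["minor_cn"] + segment["major_cn"] != segment["total_cn"]:
        problems.append("total_cn is not minor_cn + major_cn")
    if not 0 < segment["purity"] <= 1:
        problems.append("purity must be in (0, 1]")
    return problems


def is_loh(segment):
    """Assumed LOH definition: one parental allele lost, the other retained."""
    return segment["minor_cn"] == 0 and segment["major_cn"] > 0


segment = {
    "chromosome": "chr1", "start": 1_000_000, "end": 5_000_000,
    "minor_cn": 0, "major_cn": 2, "total_cn": 2,
    "purity": 0.8, "normal_total_cn": 2,
}
print(check_segment(segment), is_loh(segment))  # prints [] True
```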
Required columns (somatic variants):
- chromosome: Chromosome name
- position: Variant position
- ref: Reference allele
- alt: Alternate allele
- Additional variant metadata as needed
Required columns (haplotags):
- read_index: Unique read identifier
- chromosome: Chromosome name
- haplotag: Haplotype assignment (1 or 2)
- start: Read start position
- end: Read end position
Required columns (haploblocks):
- chromosome: Chromosome name
- block_start: Block start position
- block_end: Block end position
- block_size: Size of phased block
- haploblock_id: Unique block identifier
Required columns (SNV clusters):
- cluster_id: Unique cluster identifier
- cluster_ccf: Cancer cell fraction for the cluster (0-1)
- cluster_fraction: Fraction of variants assigned to this cluster (0-1)
The SNV cluster assignments file is optional. If not provided, cluster assignments will be inferred using a binomial model.
Required columns (SNV cluster assignments):
- chromosome: Chromosome name
- position: Variant position
- cluster_id: Cluster identifier (must match IDs in SNV Clusters)
- n_copies: Number of allelic copies of the variant
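To show how purity, copy number, `cluster_ccf`, and `n_copies` relate to observed read counts, the sketch below uses a standard purity- and copy-number-adjusted expected variant allele fraction with a binomial likelihood. This is a generic textbook formulation, not ROCIT's inference code; using `cluster_fraction` as a prior weight and all function names are assumptions made for illustration.

```python
from math import comb, log


def expected_vaf(purity, ccf, n_copies, total_cn, normal_total_cn=2):
    """Expected fraction of reads carrying the variant at a locus."""
    variant_alleles = purity * ccf * n_copies
    all_alleles = purity * total_cn + (1 - purity) * normal_total_cn
    return variant_alleles / all_alleles


def binomial_log_likelihood(alt_reads, depth, vaf):
    """Log probability of alt_reads variant reads out of depth at a given VAF."""
    vaf = min(max(vaf, 1e-9), 1 - 1e-9)  # keep log() finite at the boundaries
    return log(comb(depth, alt_reads)) + alt_reads * log(vaf) + (depth - alt_reads) * log(1 - vaf)


def assign_cluster(alt_reads, depth, clusters, purity, n_copies, total_cn, normal_total_cn=2):
    """Pick the cluster with the highest prior-weighted binomial likelihood.

    clusters: dict mapping cluster_id -> (cluster_ccf, cluster_fraction).
    """
    scores = {}
    for cluster_id, (ccf, fraction) in clusters.items():
        vaf = expected_vaf(purity, ccf, n_copies, total_cn, normal_total_cn)
        # Assumption: cluster_fraction acts as a prior over clusters.
        scores[cluster_id] = log(max(fraction, 1e-12)) + binomial_log_likelihood(alt_reads, depth, vaf)
    return max(scores, key=scores.get)


clusters = {"clonal": (1.0, 0.6), "subclonal": (0.3, 0.4)}
high = assign_cluster(12, 30, clusters, purity=0.8, n_copies=1, total_cn=2)
low = assign_cluster(4, 30, clusters, purity=0.8, n_copies=1, total_cn=2)
print(high, low)  # prints clonal subclonal
```

At 80% purity in a diploid region, a clonal single-copy variant is expected at a VAF of 0.4 and a variant in the 30% CCF cluster at 0.12, so 12 of 30 variant reads favour the clonal cluster while 4 of 30 favour the subclonal one.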
If you use ROCIT in your research, please cite:
Baker TM, Matulionis N, Andrasz C, Gerke D, Garcia-Dutton N, Atkinson D, et al. Genome-wide classification of tumor-derived reads from bulk long-read sequencing. bioRxiv [Preprint]. 2026 Mar 5. DOI: 10.64898/2026.03.03.709085
ROCIT is licensed under the BSD 3-Clause License. See the LICENSE file for details.
