SeekSoul™ Methyl Tools (SeekSoulMethyl) is a single-cell transcriptome + methylation analysis pipeline designed to analyze data generated using the Beijing SeekGene BioSciences Co., Ltd. SeekOne DD Single Cell Multiome Methylation + RNA kit.
SeekOne DD Single Cell Multiome Methylation + RNA kit comes in two chemistries. DD-MET3 (dual-label) means the RNA and DNA methylation data barcodes are different for the same cell, and the RNA library is a 3′-end transcriptome library. DD-MET5 (single-label) means the RNA and DNA methylation data barcodes are the same for the same cell, and the RNA library is a 5′-end transcriptome library. Below we describe the DNA methylation library structures for both chemistries.
DD-MET3 Methylation Library Structure
Figure 1. DD-MET3 methylation library structureStructure notes:
- SP1/SP2: Adapter sequences
- barcode: 17 bp cell barcode
- 7F: 7 bp linker sequence
- 17L: 17 bp fixed sequence CgtCCgtCgttgCtCgt
- ME: 19 bp fixed sequence AGATGTGTATAAGAGACAG
- 9 bp: extension sequence from the Tn5 insertion fragment
DD-MET5 Methylation Library Structure
Figure 2. DD-MET5 methylation library structureStructure notes:
- SP1/SP2: Adapter sequences
- barcode: 17 bp cell barcode
- UMI: 12 bp UMI sequence
- TSO: 13 bp TSO sequence TTTCTTATATGGG
- 17L: 17 bp fixed sequence CgtCCgtCgttgCtCgt
- ME: 19 bp fixed sequence AGATGTGTATAAGAGACAG
- 9 bp: extension sequence from the Tn5 insertion fragment
Since the enzymatic treatment converts unmethylated cytosines (C) to thymines (T), the C bases in SP1 and SP2 are methylated to prevent errors in the sequencing adapters during this conversion. Furthermore, the barcodes used for methylation data do not contain any C bases. In contrast, the C bases in 7F, 17L, and ME are not methylated and will be converted to T during the enzymatic process; we use these fixed sequences to calculate the C-to-T conversion rate.
- Clone the repository:
git clone https://github.com/seekgene/SeekSoulMethyl.git
cd SeekSoulMethyl- Create and activate conda environment:
For users in China:
conda env create -n seeksoulmethyl -f conda_dependencies.zh.yml
conda activate seeksoulmethylFor international users:
conda env create -n seeksoulmethyl -f conda_dependencies.yml
conda activate seeksoulmethyl- Install the package:
cd dependence
pip install . \
simpleqc/target/wheels/simpleqc-0.1.0-py3-none-manylinux_2_17_x86_64.manylinux2014_x86_64.whl \
search-pattern/target/wheels/search_pattern-0.1.0-py3-none-manylinux_2_5_x86_64.manylinux1_x86_64.whl
cd ..
We will download our modified versions of Bismark and ALLCools for analysis.
- Bismark adds the CB (error-corrected barcode) tag and the UR (raw UMI) tag to BAM files.
- ALLCools performs UMI deduplication and methylation level calculation based on the UR tag.
# Clone and install custom ALLCools
conda activate seeksoulmethyl
git clone https://github.com/seekgene/ALLCools.git && \
pip install ./ALLCools && \
rm -rf ./ALLCools
# Clone and install custom Bismark
git clone https://github.com/seekgene/Bismark.git && \
bin_path=$(dirname `which python`)
cp -r ./Bismark/* $bin_path/ && \
chmod +x $bin_path/bismark* && \
chmod +x $bin_path/deduplicate_bismark && \
rm -rf ./Bismark
# Download human reference genome (GRCh38)
wget -dc -O human-reference-GRCh38.tar.gz "https://seekgene-public.oss-cn-beijing.aliyuncs.com/methy_demo/methy_exp/v1.1/human-reference-GRCh38.tar.gz"
wget -dc -O human-reference-GRCh38.tar.gz.md5 "https://seekgene-public.oss-cn-beijing.aliyuncs.com/methy_demo/methy_exp/v1.1/human-reference-GRCh38.tar.gz.md5"
# Download mouse reference genome (GRCm39)
wget -dc -O mouse-reference-GRCm39.tar.gz "https://seekgene-public.oss-cn-beijing.aliyuncs.com/methy_demo/methy_exp/v1.1/mouse-reference-GRCm39.tar.gz"
wget -dc -O mouse-reference-GRCm39.tar.gz.md5 "https://seekgene-public.oss-cn-beijing.aliyuncs.com/methy_demo/methy_exp/v1.1/mouse-reference-GRCm39.tar.gz.md5"
# Extract reference genomes
tar -xzf human-reference-GRCh38.tar.gz
tar -xzf mouse-reference-GRCm39.tar.gzFor testing the pipeline with a small dataset, you can download the tiny test data:
# Download transcriptome test data
wget -dc -O XYRD-WTJW880-E_S1_L005_R1_001.fastq.gz "https://seekgene-public.oss-cn-beijing.aliyuncs.com/methy_demo/methy_exp/fastq/XYRD-WTJW880-E_S1_L005_R1_001.fastq.gz"
wget -dc -O XYRD-WTJW880-E_S1_L005_R1_001.fastq.gz.md5 "https://seekgene-public.oss-cn-beijing.aliyuncs.com/methy_demo/methy_exp/fastq/XYRD-WTJW880-E_S1_L005_R1_001.fastq.gz.md5"
wget -dc -O XYRD-WTJW880-E_S1_L005_R2_001.fastq.gz "https://seekgene-public.oss-cn-beijing.aliyuncs.com/methy_demo/methy_exp/fastq/XYRD-WTJW880-E_S1_L005_R2_001.fastq.gz"
wget -dc -O XYRD-WTJW880-E_S1_L005_R2_001.fastq.gz.md5 "https://seekgene-public.oss-cn-beijing.aliyuncs.com/methy_demo/methy_exp/fastq/XYRD-WTJW880-E_S1_L005_R2_001.fastq.gz.md5"
# Download methylation test data
wget -dc -O XYRD-WTJW880-MET_S01_L001_R1_001.fastq.gz "https://seekgene-public.oss-cn-beijing.aliyuncs.com/methy_demo/methy_exp/tiny_fastq/XYRD-WTJW880-MET_S01_L001_R1_001.fastq.gz"
wget -dc -O XYRD-WTJW880-MET_S01_L001_R1_001.fastq.gz.md5 "https://seekgene-public.oss-cn-beijing.aliyuncs.com/methy_demo/methy_exp/tiny_fastq/XYRD-WTJW880-MET_S01_L001_R1_001.fastq.gz.md5"
wget -dc -O XYRD-WTJW880-MET_S01_L001_R2_001.fastq.gz "https://seekgene-public.oss-cn-beijing.aliyuncs.com/methy_demo/methy_exp/tiny_fastq/XYRD-WTJW880-MET_S01_L001_R2_001.fastq.gz"
wget -dc -O XYRD-WTJW880-MET_S01_L001_R2_001.fastq.gz.md5 "https://seekgene-public.oss-cn-beijing.aliyuncs.com/methy_demo/methy_exp/tiny_fastq/XYRD-WTJW880-MET_S01_L001_R2_001.fastq.gz.md5"
Note: This is a small test dataset for pipeline validation. For production analysis, use your own sequencing data.
After cloning, the key Nextflow entry points and modules are:
nf/main.nf: Top-level entry. Select sub-workflow via--workflow(rna_met,methy_only,force_cell).nf/subworkflows/: Workflow definitions:rna_met/main.nf: Transcriptome + methylation end-to-end processing.met_only/main.nf: Methylation-only workflow.forcecell/main.nf: Force-cell workflow (recomputes/updates results using previous outputs).
nf/modules/: Step-wise process modules:step1.nfpreprocessing, QC, barcode extraction, transcriptome analysis.step2.nfBismark alignment and BAM sorting.step3.nfper-cell BAM splitting, ALLC generation/merge, multi-scale datasets.step4.nfsummaries, dimensionality reduction, joint report.utils.nfhelpers for methylation-only workflow (reads counting and cell estimation).
nf/bin/: Helper scripts and resources (e.g., barcode whitelists).nf/nextflow.config: Executors and resource configuration.nf/nextflow_schema.json: Pipeline parameter schema.sc_methy_workflow.sh: Shell script to run the dual-omics analysis pipeline.
We provide two methods for data analysis:
- Shell Script: Run the analysis pipeline directly via the
sc_methy_workflow.shscript. - Nextflow Pipeline: Run the Nextflow pipeline via
nf/main.nf.
Details of both methods are provided below.
conda activate seeksoulmethyl# sc_methy_workflow.sh can be found in the SeekSoulMethyl directory you cloned
bash sc_methy_workflow.sh \
/path/to/expression_R1.fastq.gz \
/path/to/expression_R2.fastq.gz \
/path/to/methy_R1.fastq.gz \
/path/to/methy_R2.fastq.gz \
--sample WTJW880 \
--outdir /path/to/results \
--database_dir /path/to/human-reference-GRCh38 \
--chemistry DD-MET3 \
--core 64 \
--filter_ch 2If a sample has multiple datasets, separate file paths with commas. Ensure that the FASTQ files are listed in the correct order for each dataset.
bash sc_methy_workflow.sh \
/path/to/WTJW969_E_L003_R1.fq.gz,/path/to/WTJW969_E_L004_R1.fq.gz \
/path/to/WTJW969_E_L003_R2.fq.gz,/path/to/WTJW969_E_L004_R2.fq.gz \
/path/to/WTJW969_Met_L000_R1.fq.gz,/path/to/WTJW969_Met_L001_R1.fq.gz,/path/to/WTJW969_Met_L002_R1.fq.gz,/path/to/WTJW969_Met_L003_R1.fq.gz,/path/to/WTJW969_Met_L004_R1.fq.gz \
/path/to/WTJW969_Met_L000_R2.fq.gz,/path/to/WTJW969_Met_L001_R2.fq.gz,/path/to/WTJW969_Met_L002_R2.fq.gz,/path/to/WTJW969_Met_L003_R2.fq.gz,/path/to/WTJW969_Met_L004_R2.fq.gz \
--sample WTJW969 \
--outdir /path/to/results \
--database_dir /path/to/human-reference-GRCh38 \
--chemistry DD-MET5 \
--core 64 \
--filter_ch 2- $1: Single-cell transcriptome Read1 fastq file path.
- $2: Single-cell transcriptome Read2 fastq file path.
- $3: Single-cell methylation Read1 fastq file path.
- $4: Single-cell methylation Read2 fastq file path.
- sample: Sample name.
- outdir: Output directory path.
- database_dir: Reference genome database path.
- chemistry: Methylation chemistry (DD-MET3 or DD-MET5). DD-MET3 is a dual-label dataset, meaning the RNA and DNA methylation data barcodes are different for the same cell, while DD-MET5 is single-label, meaning the RNA and DNA methylation data barcodes are the same for the same cell.
- core: Number of CPU cores.
- filter_ch: Filter reads that contain > n CH methylation sites. If you do not want to enable this filter, set filter_ch to 0.
Transcriptome data is analyzed using SeekSoulTools. See the official Algorithms Overviewfor detailed steps. Cells used in the downstream methylation library are determined based on the transcriptome library cell barcodes.
Barcode extraction and correction
Based on the designed structure, we locate the barcode in the read and extract the corresponding sequence. If the extracted barcode is in the whitelist, it is counted as a valid barcode; otherwise, SeekSoulMethyl attempts barcode correction, if the barcode has a one-base mismatch (Hamming distance = 1) from an entry in the whitelist:
- If exactly one whitelist candidate matches: correct the invalid barcode to that whitelist barcode.
- If multiple whitelist candidates match: correct to the candidate supported by the highest read count.
If correction fails, the read is discarded and considered a final invalid barcode read.
UMI extraction
UMI positions are read from the designed structure and extracted without correction.
Forward and Reverse reads determination
From the positions corresponding to 17L and ME, there are 7 bases that can be C or converted T (highlighted below). We use the first and the last two C/T positions to determine forward vs. reverse reads: if all three positions are C, it indicates a reverse read; otherwise, it is a forward read.
-
Forward: TgtTTgtTgttgTtTgtAGATGTGTATAAGAGAT
-
Reverse: CgtCCgtCgttgCtCgtAGATGTGTATAAGAGAC
Reverse reads correspond to CTOT/CTOB (reverse complement of the original strand) in methylation terminology; forward reads correspond to OT/OB (original strand). The "forward" or "reverse" determination is annotated in the read name.
Figure 3. Forward and reverse reads determinationC–T conversion rate
We calculate the C-to-T conversion rate using the original C positions within 17L and ME sequences. Since these are fixed sequences prone to sequencing errors, we restrict the calculation to reads with verified structures:
-
In DD-MET3, the 7F sequence must be
TTGCTGTorTTGTTGT, the sequence spanning 17L and ME must beGTAGATGTGTATAAGAGA, and the bases at first and the last two original C positions must be T. -
In DD-MET5, the sequence spanning 17L and ME must be
GTAGATGTGTATAAGAGA, and the bases at the first and last two original C positions must be T.
For the retained reads, we extract the bases at the corresponding positions to calculate the C-to-T conversion rate:
Figure 4. CT conversion rateNote
The above filtering steps are used only for calculating the C-to-T conversion rate; reads that do not meet these criteria are not filtered out from the final output FASTQ files.
Artifact removal
Remove TSO/7F linker, 17L and ME sequences from Read1 according to their predefined positions in the library design.
Adapter trimming
Use cutadapt to remove ME adapter sequences introduced by R1/R2 read-through events (overlapping paired-end reads).
Trim 9 bp gaps introduced by Tn5 transposase
After removing adapters and other artificial sequences, we additionally trim the 9 bp gaps flanking the inserted fragment that are introduced by Tn5 transposition. These 9 bp regions can carry artificial methylation and spuriously elevate CH methylation adjacent to the insert boundaries, so they are removed prior to downstream analysis.
Filter reads with too many non-CpG methylated C bases (optional)
Filter based on the number of non-CpG methylated C bases in a read pair. By default, pairs with > 2 non-CpG methylated Cs detected in read1/read2 are removed. If you do not want to enable this filter, set filter_ch to 0.
Note
This filtering strategy is based on findings by Lu et al. [1], which suggest that nicks in synthesized adapters can trigger Bst polymerase nick translation. This activity incorporates 5-methyl-dCTPs, leading to completely unconverted reads that appear as artificial methylation signals.
Filter too short reads After the preceding filtering and adapter trimming steps, if the length of R1 in a read pair is less than 20 bp or the length of R2 is less than 60 bp, the read pair is filtered out.
Alignment and tagging
We use Bismark for methylation alignment. Our modified Bismark adds --add_barcode and --add_umi to tag BAM files by read name with CB (error-corrected barcode) and UR (raw UMI). For forward reads, we use -X 1000 to allow insert sizes up to 1000 bp; for reverse reads, we use --pbat and -X 1000.
After alignment, sort BAMs by read name using samtools sort -n; the name-sorted BAMs serve as inputs for downstream analysis.
Split by cell barcode
Split name-sorted BAMs by RNA-derived cell barcodes into per-cell BAM files, each containing uniquely mapped reads for one cell.
Generate ALLC files
Sort each per-cell BAM by position and convert to ALLC using ALLCools bam-to-allc. Our modified ALLCools performs UR-tag-based UMI correction and deduplication per C site.
See the SeekGene ALLCools repository for details.
Generate MCDS
Run allcools generate-dataset to bin the genome (chrom10k/20k/50k/100k/500k/1M/geneslop2k) and compute per-cell methylation matrices. Geneslop2k bins are defined as 2k bp flanking each gene.
By default, perform dimensionality reduction with LSI on chrom20k bins using ALLCools, followed by UMAP visualization.
If you use sc_methy_workflow.sh, the system requirements are as follows:
- CPU: 64 cores (recommended)
- Memory: 128GB RAM (recommended)
- OS: Linux (recommended Ubuntu 18.04+ or CentOS 7+)
If you want to process samples in batch and get pipeline-level reports, use Nextflow:
Figure 6. SeekSoulMethyl nextflow pipeline workflow- Install nextflow:
conda install -n seeksoulmethyl -c bioconda nextflow- Prepare your input samplesheet
cat > samplelist.csv << EOF
sample_id,expression_r1,expression_r2,methylation_r1,methylation_r2
XYRD-WTJW880,/path/to/XYRD-WTJW880-E_S1_L005_R1_001.fastq.gz,/path/to/XYRD-WTJW880-E_S1_L005_R2_001.fastq.gz,/path/to/XYRD-WTJW880-MET_S01_L001_R1_001.fastq.gz,/path/to/XYRD-WTJW880-MET_S01_L001_R2_001.fastq.gz
EOF
# expression_r1: Single-cell transcriptome Read1 fastq file
# expression_r2: Single-cell transcriptome Read2 fastq file
# methylation_r1: Single-cell methylation Read1 fastq file
# methylation_r2: Single-cell methylation Read2 fastq file
If a single sample has multiple datasets and the transcriptome and methylation FASTQ counts do not match (e.g., WTJW969), add multiple rows to the samplesheet, with each row representing one dataset.
sample_id,expression_r1,expression_r2,methylation_r1,methylation_r2
WTJW969,/path/to/WTJW969_E_L003_R1.fq.gz,/path/to/WTJW969_E_L003_R2.fq.gz,/path/to/WTJW969_Met_L000_R1.fq.gz,/path/to/WTJW969_Met_L000_R2.fq.gz
WTJW969,/path/to/WTJW969_E_L004_R1.fq.gz,/path/to/WTJW969_E_L004_R2.fq.gz,/path/to/WTJW969_Met_L001_R1.fq.gz,/path/to/WTJW969_Met_L001_R2.fq.gz
WTJW969,,,/path/to/WTJW969_Met_L002_R1.fq.gz,/path/to/WTJW969_Met_L002_R2.fq.gz
WTJW969,,,/path/to/WTJW969_Met_L003_R1.fq.gz,/path/to/WTJW969_Met_L003_R2.fq.gz
WTJW969,,,/path/to/WTJW969_Met_L004_R1.fq.gz,/path/to/WTJW969_Met_L004_R2.fq.gz
- Run the pipeline:
nextflow run -bg SeekSoulMethyl/nf/main.nf \
--outdir /path/to/tiny_demo/results/ \
--samplesheet samplelist.csv \
-w /path/to/tiny_demo/results/work \
-c SeekSoulMethyl/nf/nextflow.config \
-profile aliyun_k8s \
--database_dir /path/to/human-reference-GRCh38/ \
--split_fastq 1 \
--filter_ch 2 \
--chemistry DD-MET3 > methy.log
# --outdir: final results directory
# --samplesheet: input samplesheet file
# -w: working directory for nextflow
# -c: nextflow configuration file. Must be modified according to your server configuration.
# -profile: nextflow profile for aliyun k8s
# --database_dir: reference genome database directory
# --split_fastq: To speed up the analysis process, split fastq according to first n bases of barcode. Default is 4.
# --filter_ch: filter reads that contain > 2 CH methylation sites.
# --chemistry: methylation chemistry (DD-MET3 or DD-MET5)nextflow.config declares the execution environment, resource quotas, and run policies. You must customize it to your own server or cluster.
- Location:
SeekSoulMethyl/nf/nextflow.config(specify with-c SeekSoulMethyl/nf/nextflow.configwhen running). - Key items to tailor to your infrastructure:
- Executor:
process.executor(e.g.,local,slurm,pbs,k8s,awsbatch). - Resources:
process.cpus,process.memory,process.time, or fine-grained overrides viawithLabel/withName. - Work directory:
workDir(can also be set via-w); ensure it is writable and has sufficient space. - Environment: enable one of
conda.enabled,docker.enabled, orsingularity.enabledaccording to what your server supports.
- Executor:
Example configurations (replace paths and parameters with values valid on your servers):
profiles {
// Local machine
local {
process.executor = 'local'
workDir = '/path/to/work'
process.cpus = 8
process.memory = '32 GB'
conda.enabled = true
// Or containers: singularity.enabled = true / docker.enabled = true
}
// Slurm cluster (HPC)
slurm {
process.executor = 'slurm'
workDir = '/lustre/work/your_user'
process.cpus = 8
process.memory = '32 GB'
process.queue = 'normal'
process.clusterOptions = '-A your_account --qos=normal'
withLabel: 'high_mem' {
cpus = 16
memory = '64 GB'
}
}
// Kubernetes (e.g., Alibaba Cloud ACK)
aliyun_k8s {
process.executor = 'k8s'
workDir = '/mnt/nf-work' // persistent volume path
k8s {
namespace = 'your-namespace'
storageClaimName = 'your-pvc'
cpu = 4
memory = '16 GB'
}
// If using container images globally: docker.enabled = true
}
}Tips:
- Pick the
-profilethat matches your environment (e.g.,slurm,local,aliyun_k8s), then adapt the parameters. - Keep
workDiron fast storage with ample capacity; the final results directory is controlled by--outdir. - If you use the README’s conda environment, prefer
conda.enabled; if your cluster favors containers, use Docker/Singularity.
References:
- Nextflow configuration & profiles: https://www.nextflow.io/docs/latest/config.html
- Executors (local, Slurm, K8s, etc.): https://www.nextflow.io/docs/latest/executor.html
- Kubernetes guide: https://www.nextflow.io/docs/latest/kubernetes.html
This section describes each Nextflow process with inputs, core logic, key parameters, and outputs for troubleshooting and interpretation. The workflow is defined in nf/main.nf and processes are implemented in nf/modules/*.nf.
-
Compute genome-wide CpG count:
COMPUTE_CPG_SITES- Input:
params.genomefa,params.chrom_size_path - Action: run
count_cg_sites.pyto count CpG sites - Output:
genome_cg_info.json
- Input:
-
Expression FASTQ QC (multi-group):
FASTP_EXPRESSION_MULTI- Input: paired FASTQs per sample (groups G1/G2/...)
- Action:
fastptrimming and QC - Output: cleaned
*_expression_clean_R1/2.fastq.gz,*.html,*.json
-
Methylation FASTQ QC (multi-group):
FASTP_METHYLATION_MULTI- Input: paired FASTQs per sample (groups G1/G2/...)
- Action:
fastpQC (adapter detection disabled, trimming as per pipeline) - Output: cleaned
*_methylation_clean_R1/2.fastq.gz,*.html,*.json
-
RNA alignment and quantification:
SEEKSOULTOOLS_RNA- Input: cleaned expression R1/R2 lists
- Action:
seeksoultools rna run(STAR mapping, counting, filtering, clustering, DE) - Output:
Analysis/step3/filtered_feature_bc_matrix/etc.
-
Methylation barcode parsing:
METHYLATION_BARCODE_EXTRACTION- Input: cleaned methylation R1/R2 lists, whitelist
params.methy_barcode_wl - Action: run
barcode_cs_multi.pyto parse/correct cell barcodes and UMIs byparams.chemistry; optional--split_fastq nto shard reads by first n barcode bases - Output:
step1/${sample}_forward_*{1,2}.fq.gz,step1/${sample}_reverse_*{1,2}.fq.gz,${sample}_methy_summary.json, optional${sample}_barcode_stats.txt
- Input: cleaned methylation R1/R2 lists, whitelist
-
Build forward/reverse pairing lists:
PARSE_FASTQ_FILES- Input: forward/reverse FASTQ sets
- Action: scan
step1/and pair files by identifiers - Output:
forward_pairs.txt,reverse_pairs.txt
-
Post-barcode-extraction QC:
FASTP_METHYLATION_BARCODE_EXTRACT- Input: paired sub-FASTQs
- Action:
fastpQC - Output: per-pair
*.html,*.json
-
Forward-strand alignment:
BISMARK_ALIGNMENT_FORWARD- Key flags:
--add_barcode,--add_umi; max insert size-X 1000 - Output:
*_bismark_bt2_pe.bam,*_bismark_bt2_PE_report.txt
- Key flags:
-
Reverse (PBAT) alignment:
BISMARK_ALIGNMENT_REVERSE- Key flags:
--pbat,--add_barcode,--add_umi - Output: same as above
- Key flags:
-
Sort by read name:
SORT_BAM_BY_NAME- Action:
samtools sort -n - Output:
*_sortbyname.bam
- Action:
-
Split BAMs by cell barcode:
SPLIT_BAM_FILES- Input: name-sorted BAM and GEX barcodes
barcodes.tsv.gz - Action: run
step3_split_bams.pyto split by cell barcode and keep shared cells only - Output: per-cell BAM dir,
*_filtered_barcode,*_filtered_barcode_reads_counts.csv
- Input: name-sorted BAM and GEX barcodes
-
Merge forward/reverse per-cell BAMs and counts:
MERGE_BISMARK_BAM- Action:
samtools merge -nper matching cell; merge/deduplicate barcodes and read counts - Output:
*_merged_fr_bam/,*_merge_filtered_barcode,*_merge_filtered_barcode_reads_counts.csv
- Action:
-
Generate per-cell ALLC:
ALLCOOLS_BAM_TO_ALLC- Action: run
step3_bam_to_allc.py(custom ALLCools), UR-based dedup and methylation quantification - Output: per-cell
*_allc.gzand index
- Action: run
-
Merge cell metrics:
MERGE_FILTERED_BARCODE_READS_COUNTS- Action: merge barcodes and read counts, create
${sample}_cells.csvand.json - Output:
filtered_barcode,filtered_barcode_reads_counts.csv,${sample}_cells.{csv,json}
- Action: merge barcodes and read counts, create
-
Build multi-scale dataset:
ALLCOOLS_GENERATE_DATASETS- Action:
allcools generate-datasetfor chrom10k/20k/50k/100k/500k/1M, metrics likecountandhypo-score - Output:
${sample}.mcds
- Action:
-
Merge ALLC (when sharded):
ALLCOOLS_SUBMERGE,ALLCOOLS_MERGE- Action: merge per-shard/per-sample ALLCs
- Output:
${sample}_merge_allc.gzand index
-
Extract CG context ALLC:
ALLCOOLS_EXTRACT- Action:
allcools extract-allc --mc_contexts CGN --strandness merge - Output:
*.CGN-Merge*
- Action:
(Methylation-only workflow methy_only.nf additionally includes COUNTS_MAPPED_READS and ESTIMATED_CELLS for read-count-based cell estimation and barcode filtering)
-
Methylation summary:
METHYLATION_SUMMARY- Action:
step4_wgs_summary.pyaggregates Bismark reports, cell metrics and CpG stats to produce${sample}_methy_summary.jsonand${sample}_wgs_summary.csv
- Action:
-
LSI/PCA clustering and visualization:
METHYLATION_LSI_PCA_CLUSTERING- Action:
step4_allcools_PCA_cluster.py --var_dim chrom20k --reduc lsi - Output:
*.h5ad,*.pdf,*.png
- Action:
-
Transcriptome+Methylation joint report:
MULTI_REPORT- Action:
step4_report_rna_met.pyintegrates transcriptome and methylation outputs - Output:
${sample}_rna_methyl_report.html,${sample}_rna_met.json
- Action:
fastp/: QC reports for raw and post-barcode FASTQs (html/json)${sample}/${sample}_exp/: Transcriptome analysis directory with filtered matrix, clustering, DE results${sample}/${sample}_methy/step1/: Barcode-parsed and sharded FASTQs${sample}/${sample}_methy/step2/: Bismark BAMs and reports${sample}/${sample}_methy/step3/:split_bams/andsplit_bams/merged/: Per-cell BAMs, merged BAMs, barcode countsallcools/andallcools_generate_datasets/: Per-cell ALLCs and${sample}.mcds${sample}_merge_allc.gz,*.CGN-Merge*
${sample}/${sample}_methy/step4/: Clustering plots and*.h5ad${sample}/:${sample}_methy_summary.json,${sample}_wgs_summary.csv${sample}_rna_methyl_report.html(if running the main workflow)
- Nextflow run artifacts (as configured by
-c nf/nextflow.config):execution_report.html,execution_timeline.html,pipeline_dag.html,execution_trace.txt
Use the simplified workflow when you only have methylation reads:
nextflow run SeekSoulMethyl/nf/main.nf \
--workflow methy_only \
--outdir /path/to/results \
--samplesheet samplelist.csv \
-w /path/to/work \
-c SeekSoulMethyl/nf/nextflow.config \
-profile aliyun_k8s \
--database_dir /path/to/reference \
--split_fastq 4 \
--filter_ch 2 \
--chemistry DD-MET3--database_dir: reference directory withfasta/genome.fa,genes/genes.gtf,star/,bed/chr_nochrM.bed--chemistry:DD-MET3orDD-MET5; also sets transcriptome chemistry and barcode whitelist--split_fastq: shard by the first n barcode bases (default 4; increases parallelism but adds scheduling/merge overhead)--filter_ch: filter reads with > n CH methylation sites (default 2). If you do not want to enable this filter, setfilter_chto 0.- The samplesheet header must be
sample_id
- Recommended container image:
ghcr.io/seekgene/seeksoulmethyl_docker:v1.0.0 - Choose
-profile aliyun_k8sor-profile docker, and editnf/nextflow.configfor your infra - Heavy steps: Bismark/ALLCools need substantial CPU/RAM; defaults are set in
withNameblocks, scale up if needed
- Samplesheet parsing error: ensure first column is
sample_id, use absolute paths - Missing
${sample}.mcds: checkALLCOOLS_BAM_TO_ALLCproduced per-cell*_allc.gzandchrom_size_pathis correct - Stuck at Bismark: verify reference indices and that
params.bismark_refis visible in the container - Resume runs: use
-resumewith the same-wwork directory
Tutorials and reference materials are available under the docs/ directory:
- How to build the reference genome database (
--database_dir) - How to obtain single-cell BAM files (中文)
- How to deduplicate single-cell BAM files (中文)
This project is licensed under the MIT License - see the LICENSE file for details.
[1] Lu X, Yuan Y, et al. Improved tagmentation-based whole-genome bisulfite sequencing for input DNA from less than 100 mammalian cells. Epigenomics. 2015;7(1):47-56. doi:10.2217/epi.14.76.
"Furthermore, by manually checking the reads, we found a part of the reads were completely unconverted. We suspected that a nick in the synthesized adapters will cause the whole fragment displaced with incorporation of 5-methyl-dCTPs due to nick translation activity of Bst polymerase."





