SeekSoulMethyl

SeekSoul™ Methyl Tools (SeekSoulMethyl) is a single-cell transcriptome + methylation analysis pipeline designed to analyze data generated using the Beijing SeekGene BioSciences Co., Ltd. SeekOne DD Single Cell Multiome Methylation + RNA kit.

Data Structure

SeekOne DD Single Cell Multiome Methylation + RNA kit comes in two chemistries. DD-MET3 (dual-label) means the RNA and DNA methylation data barcodes are different for the same cell, and the RNA library is a 3′-end transcriptome library. DD-MET5 (single-label) means the RNA and DNA methylation data barcodes are the same for the same cell, and the RNA library is a 5′-end transcriptome library. Below we describe the DNA methylation library structures for both chemistries.

DD-MET3 Methylation Library Structure

Figure 1. DD-MET3 methylation library structure

Structure notes:

SP1/SP2: Adapter sequences
barcode: 17 bp cell barcode
7F: 7 bp linker sequence
17L: 17 bp fixed sequence CgtCCgtCgttgCtCgt
ME: 19 bp fixed sequence AGATGTGTATAAGAGACAG
9 bp: extension sequence from the Tn5 insertion fragment

DD-MET5 Methylation Library Structure

Figure 2. DD-MET5 methylation library structure

Structure notes:

SP1/SP2: Adapter sequences
barcode: 17 bp cell barcode
UMI: 12 bp UMI sequence
TSO: 13 bp TSO sequence TTTCTTATATGGG
17L: 17 bp fixed sequence CgtCCgtCgttgCtCgt
ME: 19 bp fixed sequence AGATGTGTATAAGAGACAG
9 bp: extension sequence from the Tn5 insertion fragment

Since the enzymatic treatment converts unmethylated cytosines (C) to thymines (T), the C bases in SP1 and SP2 are methylated to prevent errors in the sequencing adapters during this conversion. Furthermore, the barcodes used for methylation data do not contain any C bases. In contrast, the C bases in 7F, 17L, and ME are not methylated and will be converted to T during the enzymatic process; we use these fixed sequences to calculate the C-to-T conversion rate.

Installation

Clone the repository:

git clone https://github.com/seekgene/SeekSoulMethyl.git
cd SeekSoulMethyl

Create and activate conda environment:

For users in China:

conda env create -n seeksoulmethyl -f conda_dependencies.zh.yml
conda activate seeksoulmethyl

For international users:

conda env create -n seeksoulmethyl -f conda_dependencies.yml
conda activate seeksoulmethyl

Install the package:

cd dependence
pip install . \
  simpleqc/target/wheels/simpleqc-0.1.0-py3-none-manylinux_2_17_x86_64.manylinux2014_x86_64.whl \
  search-pattern/target/wheels/search_pattern-0.1.0-py3-none-manylinux_2_5_x86_64.manylinux1_x86_64.whl
cd ..

We will download our modified versions of Bismark and ALLCools for analysis.

Bismark adds the CB (error-corrected barcode) tag and the UR (raw UMI) tag to BAM files.
ALLCools performs UMI deduplication and methylation level calculation based on the UR tag.

# Clone and install custom ALLCools
conda activate seeksoulmethyl
git clone https://github.com/seekgene/ALLCools.git && \
pip install ./ALLCools && \
rm -rf ./ALLCools

# Clone and install custom Bismark
git clone https://github.com/seekgene/Bismark.git && \
bin_path=$(dirname `which python`)
cp -r ./Bismark/* $bin_path/ && \
    chmod +x $bin_path/bismark* && \
    chmod +x $bin_path/deduplicate_bismark && \
    rm -rf ./Bismark

Download Reference Database

# Download human reference genome (GRCh38)
wget -dc -O human-reference-GRCh38.tar.gz "https://seekgene-public.oss-cn-beijing.aliyuncs.com/methy_demo/methy_exp/v1.1/human-reference-GRCh38.tar.gz"
wget -dc -O human-reference-GRCh38.tar.gz.md5 "https://seekgene-public.oss-cn-beijing.aliyuncs.com/methy_demo/methy_exp/v1.1/human-reference-GRCh38.tar.gz.md5"

# Download mouse reference genome (GRCm39)
wget -dc -O mouse-reference-GRCm39.tar.gz "https://seekgene-public.oss-cn-beijing.aliyuncs.com/methy_demo/methy_exp/v1.1/mouse-reference-GRCm39.tar.gz"
wget -dc -O mouse-reference-GRCm39.tar.gz.md5 "https://seekgene-public.oss-cn-beijing.aliyuncs.com/methy_demo/methy_exp/v1.1/mouse-reference-GRCm39.tar.gz.md5"

# Extract reference genomes
tar -xzf human-reference-GRCh38.tar.gz
tar -xzf mouse-reference-GRCm39.tar.gz

Download Test Data (Optional)

For testing the pipeline with a small dataset, you can download the tiny test data:

# Download transcriptome test data
wget -dc -O XYRD-WTJW880-E_S1_L005_R1_001.fastq.gz "https://seekgene-public.oss-cn-beijing.aliyuncs.com/methy_demo/methy_exp/fastq/XYRD-WTJW880-E_S1_L005_R1_001.fastq.gz"
wget -dc -O XYRD-WTJW880-E_S1_L005_R1_001.fastq.gz.md5 "https://seekgene-public.oss-cn-beijing.aliyuncs.com/methy_demo/methy_exp/fastq/XYRD-WTJW880-E_S1_L005_R1_001.fastq.gz.md5"
wget -dc -O XYRD-WTJW880-E_S1_L005_R2_001.fastq.gz "https://seekgene-public.oss-cn-beijing.aliyuncs.com/methy_demo/methy_exp/fastq/XYRD-WTJW880-E_S1_L005_R2_001.fastq.gz"
wget -dc -O XYRD-WTJW880-E_S1_L005_R2_001.fastq.gz.md5 "https://seekgene-public.oss-cn-beijing.aliyuncs.com/methy_demo/methy_exp/fastq/XYRD-WTJW880-E_S1_L005_R2_001.fastq.gz.md5"

# Download methylation test data
wget -dc -O XYRD-WTJW880-MET_S01_L001_R1_001.fastq.gz "https://seekgene-public.oss-cn-beijing.aliyuncs.com/methy_demo/methy_exp/tiny_fastq/XYRD-WTJW880-MET_S01_L001_R1_001.fastq.gz"
wget -dc -O XYRD-WTJW880-MET_S01_L001_R1_001.fastq.gz.md5 "https://seekgene-public.oss-cn-beijing.aliyuncs.com/methy_demo/methy_exp/tiny_fastq/XYRD-WTJW880-MET_S01_L001_R1_001.fastq.gz.md5"
wget -dc -O XYRD-WTJW880-MET_S01_L001_R2_001.fastq.gz "https://seekgene-public.oss-cn-beijing.aliyuncs.com/methy_demo/methy_exp/tiny_fastq/XYRD-WTJW880-MET_S01_L001_R2_001.fastq.gz"
wget -dc -O XYRD-WTJW880-MET_S01_L001_R2_001.fastq.gz.md5 "https://seekgene-public.oss-cn-beijing.aliyuncs.com/methy_demo/methy_exp/tiny_fastq/XYRD-WTJW880-MET_S01_L001_R2_001.fastq.gz.md5"

Note: This is a small test dataset for pipeline validation. For production analysis, use your own sequencing data.

Repository Layout

After cloning, the key Nextflow entry points and modules are:

nf/main.nf: Top-level entry. Select sub-workflow via --workflow (rna_met, methy_only, force_cell).
nf/subworkflows/: Workflow definitions:
- rna_met/main.nf: Transcriptome + methylation end-to-end processing.
- met_only/main.nf: Methylation-only workflow.
- forcecell/main.nf: Force-cell workflow (recomputes/updates results using previous outputs).
nf/modules/: Step-wise process modules:
- step1.nf preprocessing, QC, barcode extraction, transcriptome analysis.
- step2.nf Bismark alignment and BAM sorting.
- step3.nf per-cell BAM splitting, ALLC generation/merge, multi-scale datasets.
- step4.nf summaries, dimensionality reduction, joint report.
- utils.nf helpers for methylation-only workflow (reads counting and cell estimation).
nf/bin/: Helper scripts and resources (e.g., barcode whitelists).
nf/nextflow.config: Executors and resource configuration.
nf/nextflow_schema.json: Pipeline parameter schema.
sc_methy_workflow.sh: Shell script to run the dual-omics analysis pipeline.

We provide two methods for data analysis:

Shell Script: Run the analysis pipeline directly via the sc_methy_workflow.sh script.
Nextflow Pipeline: Run the Nextflow pipeline via nf/main.nf.

Details of both methods are provided below.

Usage

Activate Environment

conda activate seeksoulmethyl

Run Dual-omics Analysis (Shell script)

# sc_methy_workflow.sh can be found in the SeekSoulMethyl directory you cloned
bash sc_methy_workflow.sh \
/path/to/expression_R1.fastq.gz \
/path/to/expression_R2.fastq.gz \
/path/to/methy_R1.fastq.gz \
/path/to/methy_R2.fastq.gz \
--sample WTJW880 \
--outdir /path/to/results \
--database_dir /path/to/human-reference-GRCh38 \
--chemistry DD-MET3 \
--core 64 \
--filter_ch 2

If a sample has multiple datasets, separate file paths with commas. Ensure that the FASTQ files are listed in the correct order for each dataset.

bash sc_methy_workflow.sh \
/path/to/WTJW969_E_L003_R1.fq.gz,/path/to/WTJW969_E_L004_R1.fq.gz \
/path/to/WTJW969_E_L003_R2.fq.gz,/path/to/WTJW969_E_L004_R2.fq.gz \
/path/to/WTJW969_Met_L000_R1.fq.gz,/path/to/WTJW969_Met_L001_R1.fq.gz,/path/to/WTJW969_Met_L002_R1.fq.gz,/path/to/WTJW969_Met_L003_R1.fq.gz,/path/to/WTJW969_Met_L004_R1.fq.gz \
/path/to/WTJW969_Met_L000_R2.fq.gz,/path/to/WTJW969_Met_L001_R2.fq.gz,/path/to/WTJW969_Met_L002_R2.fq.gz,/path/to/WTJW969_Met_L003_R2.fq.gz,/path/to/WTJW969_Met_L004_R2.fq.gz \
--sample WTJW969 \
--outdir /path/to/results \
--database_dir /path/to/human-reference-GRCh38 \
--chemistry DD-MET5 \
--core 64 \
--filter_ch 2

Input Parameters:

$1: Single-cell transcriptome Read1 fastq file path.
$2: Single-cell transcriptome Read2 fastq file path.
$3: Single-cell methylation Read1 fastq file path.
$4: Single-cell methylation Read2 fastq file path.
sample: Sample name.
outdir: Output directory path.
database_dir: Reference genome database path.
chemistry: Methylation chemistry (DD-MET3 or DD-MET5). DD-MET3 is a dual-label dataset, meaning the RNA and DNA methylation data barcodes are different for the same cell, while DD-MET5 is single-label, meaning the RNA and DNA methylation data barcodes are the same for the same cell.
core: Number of CPU cores.
filter_ch: Filter reads that contain > n CH methylation sites. If you do not want to enable this filter, set filter_ch to 0.

Process Details

Transcriptome Processing Workflow

Transcriptome data is analyzed using SeekSoulTools. See the official Algorithms Overviewfor detailed steps. Cells used in the downstream methylation library are determined based on the transcriptome library cell barcodes.

Methylation Processing Workflow

Step 1: Preprocessing and Barcode Parsing

Barcode extraction and correction

Based on the designed structure, we locate the barcode in the read and extract the corresponding sequence. If the extracted barcode is in the whitelist, it is counted as a valid barcode; otherwise, SeekSoulMethyl attempts barcode correction, if the barcode has a one-base mismatch (Hamming distance = 1) from an entry in the whitelist:

If exactly one whitelist candidate matches: correct the invalid barcode to that whitelist barcode.
If multiple whitelist candidates match: correct to the candidate supported by the highest read count.

If correction fails, the read is discarded and considered a final invalid barcode read.

UMI extraction

UMI positions are read from the designed structure and extracted without correction.

Forward and Reverse reads determination

From the positions corresponding to 17L and ME, there are 7 bases that can be C or converted T (highlighted below). We use the first and the last two C/T positions to determine forward vs. reverse reads: if all three positions are C, it indicates a reverse read; otherwise, it is a forward read.

Forward: TgtTTgtTgttgTtTgtAGATGTGTATAAGAGAT
Reverse: CgtCCgtCgttgCtCgtAGATGTGTATAAGAGAC

Reverse reads correspond to CTOT/CTOB (reverse complement of the original strand) in methylation terminology; forward reads correspond to OT/OB (original strand). The "forward" or "reverse" determination is annotated in the read name.

Figure 3. Forward and reverse reads determination

C–T conversion rate

We calculate the C-to-T conversion rate using the original C positions within 17L and ME sequences. Since these are fixed sequences prone to sequencing errors, we restrict the calculation to reads with verified structures:

In DD-MET3, the 7F sequence must be TTGCTGT or TTGTTGT, the sequence spanning 17L and ME must be GTAGATGTGTATAAGAGA, and the bases at first and the last two original C positions must be T.
In DD-MET5, the sequence spanning 17L and ME must be GTAGATGTGTATAAGAGA, and the bases at the first and last two original C positions must be T.

For the retained reads, we extract the bases at the corresponding positions to calculate the C-to-T conversion rate:

Figure 4. CT conversion rate

Note

The above filtering steps are used only for calculating the C-to-T conversion rate; reads that do not meet these criteria are not filtered out from the final output FASTQ files.

Artifact removal

Remove TSO/7F linker, 17L and ME sequences from Read1 according to their predefined positions in the library design.

Adapter trimming

Use cutadapt to remove ME adapter sequences introduced by R1/R2 read-through events (overlapping paired-end reads).

Trim 9 bp gaps introduced by Tn5 transposase

After removing adapters and other artificial sequences, we additionally trim the 9 bp gaps flanking the inserted fragment that are introduced by Tn5 transposition. These 9 bp regions can carry artificial methylation and spuriously elevate CH methylation adjacent to the insert boundaries, so they are removed prior to downstream analysis.

Filter reads with too many non-CpG methylated C bases (optional)

Filter based on the number of non-CpG methylated C bases in a read pair. By default, pairs with > 2 non-CpG methylated Cs detected in read1/read2 are removed. If you do not want to enable this filter, set filter_ch to 0.

Note

This filtering strategy is based on findings by Lu et al. [1], which suggest that nicks in synthesized adapters can trigger Bst polymerase nick translation. This activity incorporates 5-methyl-dCTPs, leading to completely unconverted reads that appear as artificial methylation signals.

Filter too short reads After the preceding filtering and adapter trimming steps, if the length of R1 in a read pair is less than 20 bp or the length of R2 is less than 60 bp, the read pair is filtered out.

Step 2: Bismark alignment and sorting by name

Alignment and tagging

We use Bismark for methylation alignment. Our modified Bismark adds --add_barcode and --add_umi to tag BAM files by read name with CB (error-corrected barcode) and UR (raw UMI). For forward reads, we use -X 1000 to allow insert sizes up to 1000 bp; for reverse reads, we use --pbat and -X 1000. After alignment, sort BAMs by read name using samtools sort -n; the name-sorted BAMs serve as inputs for downstream analysis.

Step 3: ALLCools analysis

Split by cell barcode

Split name-sorted BAMs by RNA-derived cell barcodes into per-cell BAM files, each containing uniquely mapped reads for one cell.

Generate ALLC files

Sort each per-cell BAM by position and convert to ALLC using ALLCools bam-to-allc. Our modified ALLCools performs UR-tag-based UMI correction and deduplication per C site.

Figure 5. UMI correction detailed diagram

See the SeekGene ALLCools repository for details.

Generate MCDS

Run allcools generate-dataset to bin the genome (chrom10k/20k/50k/100k/500k/1M/geneslop2k) and compute per-cell methylation matrices. Geneslop2k bins are defined as 2k bp flanking each gene.

Step 4: Reduction and clustering

By default, perform dimensionality reduction with LSI on chrom20k bins using ALLCools, followed by UMAP visualization.

System Requirements

If you use sc_methy_workflow.sh, the system requirements are as follows:

CPU: 64 cores (recommended)
Memory: 128GB RAM (recommended)
OS: Linux (recommended Ubuntu 18.04+ or CentOS 7+)

Running with Nextflow (Recommended)

If you want to process samples in batch and get pipeline-level reports, use Nextflow:

Figure 6. SeekSoulMethyl nextflow pipeline workflow

Install nextflow:

conda install -n seeksoulmethyl -c bioconda nextflow

Prepare your input samplesheet

cat > samplelist.csv << EOF
sample_id,expression_r1,expression_r2,methylation_r1,methylation_r2
XYRD-WTJW880,/path/to/XYRD-WTJW880-E_S1_L005_R1_001.fastq.gz,/path/to/XYRD-WTJW880-E_S1_L005_R2_001.fastq.gz,/path/to/XYRD-WTJW880-MET_S01_L001_R1_001.fastq.gz,/path/to/XYRD-WTJW880-MET_S01_L001_R2_001.fastq.gz
EOF

# expression_r1: Single-cell transcriptome Read1 fastq file
# expression_r2: Single-cell transcriptome Read2 fastq file
# methylation_r1: Single-cell methylation Read1 fastq file
# methylation_r2: Single-cell methylation Read2 fastq file

If a single sample has multiple datasets and the transcriptome and methylation FASTQ counts do not match (e.g., WTJW969), add multiple rows to the samplesheet, with each row representing one dataset.

sample_id,expression_r1,expression_r2,methylation_r1,methylation_r2
WTJW969,/path/to/WTJW969_E_L003_R1.fq.gz,/path/to/WTJW969_E_L003_R2.fq.gz,/path/to/WTJW969_Met_L000_R1.fq.gz,/path/to/WTJW969_Met_L000_R2.fq.gz
WTJW969,/path/to/WTJW969_E_L004_R1.fq.gz,/path/to/WTJW969_E_L004_R2.fq.gz,/path/to/WTJW969_Met_L001_R1.fq.gz,/path/to/WTJW969_Met_L001_R2.fq.gz
WTJW969,,,/path/to/WTJW969_Met_L002_R1.fq.gz,/path/to/WTJW969_Met_L002_R2.fq.gz
WTJW969,,,/path/to/WTJW969_Met_L003_R1.fq.gz,/path/to/WTJW969_Met_L003_R2.fq.gz
WTJW969,,,/path/to/WTJW969_Met_L004_R1.fq.gz,/path/to/WTJW969_Met_L004_R2.fq.gz

Run the pipeline:

nextflow run -bg SeekSoulMethyl/nf/main.nf \
--outdir /path/to/tiny_demo/results/ \
--samplesheet samplelist.csv \
-w /path/to/tiny_demo/results/work \
-c SeekSoulMethyl/nf/nextflow.config \
-profile aliyun_k8s \
--database_dir /path/to/human-reference-GRCh38/ \
--split_fastq 1 \
--filter_ch 2 \
--chemistry DD-MET3 > methy.log

# --outdir: final results directory
# --samplesheet: input samplesheet file
# -w: working directory for nextflow
# -c: nextflow configuration file. Must be modified according to your server configuration.
# -profile: nextflow profile for aliyun k8s
# --database_dir: reference genome database directory
# --split_fastq: To speed up the analysis process, split fastq according to first n bases of barcode. Default is 4.
# --filter_ch: filter reads that contain > 2 CH methylation sites.
# --chemistry: methylation chemistry (DD-MET3 or DD-MET5)

Notes on nextflow.config

nextflow.config declares the execution environment, resource quotas, and run policies. You must customize it to your own server or cluster.

Location: SeekSoulMethyl/nf/nextflow.config (specify with -c SeekSoulMethyl/nf/nextflow.config when running).
Key items to tailor to your infrastructure:
- Executor: process.executor (e.g., local, slurm, pbs, k8s, awsbatch).
- Resources: process.cpus, process.memory, process.time, or fine-grained overrides via withLabel/withName.
- Work directory: workDir (can also be set via -w); ensure it is writable and has sufficient space.
- Environment: enable one of conda.enabled, docker.enabled, or singularity.enabled according to what your server supports.

Example configurations (replace paths and parameters with values valid on your servers):

profiles {
  // Local machine
  local {
    process.executor = 'local'
    workDir          = '/path/to/work'
    process.cpus     = 8
    process.memory   = '32 GB'
    conda.enabled    = true
    // Or containers: singularity.enabled = true / docker.enabled = true
  }

  // Slurm cluster (HPC)
  slurm {
    process.executor   = 'slurm'
    workDir            = '/lustre/work/your_user'
    process.cpus       = 8
    process.memory     = '32 GB'
    process.queue      = 'normal'
    process.clusterOptions = '-A your_account --qos=normal'

    withLabel: 'high_mem' {
      cpus   = 16
      memory = '64 GB'
    }
  }

  // Kubernetes (e.g., Alibaba Cloud ACK)
  aliyun_k8s {
    process.executor   = 'k8s'
    workDir            = '/mnt/nf-work'    // persistent volume path
    k8s {
      namespace        = 'your-namespace'
      storageClaimName = 'your-pvc'
      cpu              = 4
      memory           = '16 GB'
    }
    // If using container images globally: docker.enabled = true
  }
}

Tips:

Pick the -profile that matches your environment (e.g., slurm, local, aliyun_k8s), then adapt the parameters.
Keep workDir on fast storage with ample capacity; the final results directory is controlled by --outdir.
If you use the README’s conda environment, prefer conda.enabled; if your cluster favors containers, use Docker/Singularity.

References:

Nextflow configuration & profiles: https://www.nextflow.io/docs/latest/config.html
Executors (local, Slurm, K8s, etc.): https://www.nextflow.io/docs/latest/executor.html
Kubernetes guide: https://www.nextflow.io/docs/latest/kubernetes.html

Nextflow Step-by-Step Details

This section describes each Nextflow process with inputs, core logic, key parameters, and outputs for troubleshooting and interpretation. The workflow is defined in nf/main.nf and processes are implemented in nf/modules/*.nf.

Step 1: Preprocessing and Barcode Parsing (step1.nf)

Compute genome-wide CpG count: COMPUTE_CPG_SITES
- Input: params.genomefa, params.chrom_size_path
- Action: run count_cg_sites.py to count CpG sites
- Output: genome_cg_info.json
Expression FASTQ QC (multi-group): FASTP_EXPRESSION_MULTI
- Input: paired FASTQs per sample (groups G1/G2/...)
- Action: fastp trimming and QC
- Output: cleaned *_expression_clean_R1/2.fastq.gz, *.html, *.json
Methylation FASTQ QC (multi-group): FASTP_METHYLATION_MULTI
- Input: paired FASTQs per sample (groups G1/G2/...)
- Action: fastp QC (adapter detection disabled, trimming as per pipeline)
- Output: cleaned *_methylation_clean_R1/2.fastq.gz, *.html, *.json
RNA alignment and quantification: SEEKSOULTOOLS_RNA
- Input: cleaned expression R1/R2 lists
- Action: seeksoultools rna run (STAR mapping, counting, filtering, clustering, DE)
- Output: Analysis/step3/filtered_feature_bc_matrix/ etc.
Methylation barcode parsing: METHYLATION_BARCODE_EXTRACTION
- Input: cleaned methylation R1/R2 lists, whitelist params.methy_barcode_wl
- Action: run barcode_cs_multi.py to parse/correct cell barcodes and UMIs by params.chemistry; optional --split_fastq n to shard reads by first n barcode bases
- Output: step1/${sample}_forward_*{1,2}.fq.gz, step1/${sample}_reverse_*{1,2}.fq.gz, ${sample}_methy_summary.json, optional ${sample}_barcode_stats.txt
Build forward/reverse pairing lists: PARSE_FASTQ_FILES
- Input: forward/reverse FASTQ sets
- Action: scan step1/ and pair files by identifiers
- Output: forward_pairs.txt, reverse_pairs.txt
Post-barcode-extraction QC: FASTP_METHYLATION_BARCODE_EXTRACT
- Input: paired sub-FASTQs
- Action: fastp QC
- Output: per-pair *.html, *.json

Step 2: Bismark alignment and BAM sorting (step2.nf)

Forward-strand alignment: BISMARK_ALIGNMENT_FORWARD
- Key flags: --add_barcode, --add_umi; max insert size -X 1000
- Output: *_bismark_bt2_pe.bam, *_bismark_bt2_PE_report.txt
Reverse (PBAT) alignment: BISMARK_ALIGNMENT_REVERSE
- Key flags: --pbat, --add_barcode, --add_umi
- Output: same as above
Sort by read name: SORT_BAM_BY_NAME
- Action: samtools sort -n
- Output: *_sortbyname.bam

Step 3: Per-cell split, ALLC generation/merge, dataset building (step3.nf)

Split BAMs by cell barcode: SPLIT_BAM_FILES
- Input: name-sorted BAM and GEX barcodes barcodes.tsv.gz
- Action: run step3_split_bams.py to split by cell barcode and keep shared cells only
- Output: per-cell BAM dir, *_filtered_barcode, *_filtered_barcode_reads_counts.csv
Merge forward/reverse per-cell BAMs and counts: MERGE_BISMARK_BAM
- Action: samtools merge -n per matching cell; merge/deduplicate barcodes and read counts
- Output: *_merged_fr_bam/, *_merge_filtered_barcode, *_merge_filtered_barcode_reads_counts.csv
Generate per-cell ALLC: ALLCOOLS_BAM_TO_ALLC
- Action: run step3_bam_to_allc.py (custom ALLCools), UR-based dedup and methylation quantification
- Output: per-cell *_allc.gz and index
Merge cell metrics: MERGE_FILTERED_BARCODE_READS_COUNTS
- Action: merge barcodes and read counts, create ${sample}_cells.csv and .json
- Output: filtered_barcode, filtered_barcode_reads_counts.csv, ${sample}_cells.{csv,json}
Build multi-scale dataset: ALLCOOLS_GENERATE_DATASETS
- Action: allcools generate-dataset for chrom10k/20k/50k/100k/500k/1M, metrics like count and hypo-score
- Output: ${sample}.mcds
Merge ALLC (when sharded): ALLCOOLS_SUBMERGE, ALLCOOLS_MERGE
- Action: merge per-shard/per-sample ALLCs
- Output: ${sample}_merge_allc.gz and index
Extract CG context ALLC: ALLCOOLS_EXTRACT
- Action: allcools extract-allc --mc_contexts CGN --strandness merge
- Output: *.CGN-Merge*

(Methylation-only workflow methy_only.nf additionally includes COUNTS_MAPPED_READS and ESTIMATED_CELLS for read-count-based cell estimation and barcode filtering)

Step 4: Summary, DR & joint report (step4.nf)

Methylation summary: METHYLATION_SUMMARY
- Action: step4_wgs_summary.py aggregates Bismark reports, cell metrics and CpG stats to produce ${sample}_methy_summary.json and ${sample}_wgs_summary.csv
LSI/PCA clustering and visualization: METHYLATION_LSI_PCA_CLUSTERING
- Action: step4_allcools_PCA_cluster.py --var_dim chrom20k --reduc lsi
- Output: *.h5ad, *.pdf, *.png
Transcriptome+Methylation joint report: MULTI_REPORT
- Action: step4_report_rna_met.py integrates transcriptome and methylation outputs
- Output: ${sample}_rna_methyl_report.html, ${sample}_rna_met.json

Outputs (by outdir structure)

fastp/: QC reports for raw and post-barcode FASTQs (html/json)
${sample}/${sample}_exp/: Transcriptome analysis directory with filtered matrix, clustering, DE results
${sample}/${sample}_methy/step1/: Barcode-parsed and sharded FASTQs
${sample}/${sample}_methy/step2/: Bismark BAMs and reports
${sample}/${sample}_methy/step3/:
- split_bams/ and split_bams/merged/: Per-cell BAMs, merged BAMs, barcode counts
- allcools/ and allcools_generate_datasets/: Per-cell ALLCs and ${sample}.mcds
- ${sample}_merge_allc.gz, *.CGN-Merge*
${sample}/${sample}_methy/step4/: Clustering plots and *.h5ad
${sample}/:
- ${sample}_methy_summary.json, ${sample}_wgs_summary.csv
- ${sample}_rna_methyl_report.html (if running the main workflow)
Nextflow run artifacts (as configured by -c nf/nextflow.config): execution_report.html, execution_timeline.html, pipeline_dag.html, execution_trace.txt

Methylation-only workflow (test version, currently not recommended for use)

Use the simplified workflow when you only have methylation reads:

nextflow run SeekSoulMethyl/nf/main.nf \
  --workflow methy_only \
  --outdir /path/to/results \
  --samplesheet samplelist.csv \
  -w /path/to/work \
  -c SeekSoulMethyl/nf/nextflow.config \
  -profile aliyun_k8s \
  --database_dir /path/to/reference \
  --split_fastq 4 \
  --filter_ch 2 \
  --chemistry DD-MET3

Key parameters and tips

--database_dir: reference directory with fasta/genome.fa, genes/genes.gtf, star/, bed/chr_nochrM.bed
--chemistry: DD-MET3 or DD-MET5; also sets transcriptome chemistry and barcode whitelist
--split_fastq: shard by the first n barcode bases (default 4; increases parallelism but adds scheduling/merge overhead)
--filter_ch: filter reads with > n CH methylation sites (default 2). If you do not want to enable this filter, set filter_ch to 0.
The samplesheet header must be sample_id

Execution environment and resources

Recommended container image: ghcr.io/seekgene/seeksoulmethyl_docker:v1.0.0
Choose -profile aliyun_k8s or -profile docker, and edit nf/nextflow.config for your infra
Heavy steps: Bismark/ALLCools need substantial CPU/RAM; defaults are set in withName blocks, scale up if needed

FAQ

Samplesheet parsing error: ensure first column is sample_id, use absolute paths
Missing ${sample}.mcds: check ALLCOOLS_BAM_TO_ALLC produced per-cell *_allc.gz and chrom_size_path is correct
Stuck at Bismark: verify reference indices and that params.bismark_ref is visible in the container
Resume runs: use -resume with the same -w work directory

Documentation

Tutorials and reference materials are available under the docs/ directory:

License

This project is licensed under the MIT License - see the LICENSE file for details.

References

[1] Lu X, Yuan Y, et al. Improved tagmentation-based whole-genome bisulfite sequencing for input DNA from less than 100 mammalian cells. Epigenomics. 2015;7(1):47-56. doi:10.2217/epi.14.76.

"Furthermore, by manually checking the reads, we found a part of the reads were completely unconverted. We suspected that a nick in the synthesized adapters will cause the whole fragment displaced with incorporation of 5-methyl-dCTPs due to nick translation activity of Bst polymerase."

Name		Name	Last commit message	Last commit date
Latest commit History 164 Commits
.github/workflows		.github/workflows
dependence		dependence
docs		docs
nf		nf
.gitignore		.gitignore
CHANGELOG.md		CHANGELOG.md
LICENSE		LICENSE
README.md		README.md
conda_dependencies.yml		conda_dependencies.yml
conda_dependencies.zh.yml		conda_dependencies.zh.yml
sc_methy_workflow.sh		sc_methy_workflow.sh

Folders and files

Latest commit

History

Repository files navigation

SeekSoulMethyl

Data Structure

Installation

Download Reference Database

Download Test Data (Optional)

Repository Layout

Usage

Activate Environment

Run Dual-omics Analysis (Shell script)

Input Parameters:

Process Details

Transcriptome Processing Workflow

Methylation Processing Workflow

Step 1: Preprocessing and Barcode Parsing

Step 2: Bismark alignment and sorting by name

Step 3: ALLCools analysis

Step 4: Reduction and clustering

System Requirements

Running with Nextflow (Recommended)

Notes on nextflow.config

Nextflow Step-by-Step Details

Step 1: Preprocessing and Barcode Parsing (step1.nf)

Step 2: Bismark alignment and BAM sorting (step2.nf)

Step 3: Per-cell split, ALLC generation/merge, dataset building (step3.nf)

Step 4: Summary, DR & joint report (step4.nf)

Outputs (by outdir structure)

Methylation-only workflow (test version, currently not recommended for use)

Key parameters and tips

Execution environment and resources

FAQ

Documentation

License

References

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages