Fugue: A simple batch-correction method that is scalable for integrating super large-scale single-cell transcriptomes

Introduction of Fugue

Batch effects are a fundamental issue that must be addressed when integrating single-cell transcriptomes. Here, we present Fugue, a simple yet efficient solution for batch correction of super large-scale single-cell transcriptomes.

A fugue is a musical form in which the principal theme is repeated or imitated at different times and pitches. Conceptually, a gene expression profile can be seen as a superposition of biological information and batch information, much like the interwoven voices of a fugue.

The Fugue algorithm is built on a self-supervised learning framework. We encode batch information as trainable parameters and add them to the expression profiles. A contrastive learning method is then applied to narrow the gap between each single cell and its augmented views. After training, these parameters properly represent the batch information. Because batch information is treated as a trainable variable, Fugue scales to atlas-scale data integration with fixed memory usage.
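As a minimal sketch of the idea above (not the repository's implementation), the following NumPy snippet adds a per-batch vector to each expression profile and computes an InfoNCE-style contrastive loss between two augmented views. The linear `encode` function, the temperature of 0.2, the random-zero augmentation and the use of in-batch negatives are all assumptions made for illustration; in real training the batch vectors and the encoder would be updated by gradient descent.

```python
import numpy as np

rng = np.random.default_rng(0)
n_cells, n_genes, n_batches, dim = 8, 20, 3, 4

X = rng.random((n_cells, n_genes))            # toy log-normalized expression profiles
batch = rng.integers(0, n_batches, n_cells)   # batch label of each cell

# One trainable vector per batch (initialized to zero here, never trained in this sketch).
batch_embedding = np.zeros((n_batches, n_genes))

def add_batch_information(x, batch_labels, embedding):
    """Superpose each cell's batch vector onto its expression profile."""
    return x + embedding[batch_labels]

def encode(x, weights):
    """Hypothetical linear encoder followed by L2 normalization."""
    z = x @ weights
    return z / (np.linalg.norm(z, axis=1, keepdims=True) + 1e-12)

def info_nce(query, key, temperature=0.2):
    """InfoNCE loss: the positive of query i is key i; the other keys are negatives."""
    logits = query @ key.T / temperature
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))

weights = rng.normal(size=(n_genes, dim))
inputs = add_batch_information(X, batch, batch_embedding)

# Two augmented views of the same cells (assumed augmentation: random zeroing).
view_1 = inputs * (rng.random(inputs.shape) > 0.3)
view_2 = inputs * (rng.random(inputs.shape) > 0.3)

loss = info_nce(encode(view_1, weights), encode(view_2, weights))
print(f"contrastive loss: {loss:.4f}")
```

Because the batch vectors are a small matrix of shape (number of batches, number of genes), their memory cost does not grow with the number of cells, which is consistent with the fixed memory usage described above.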

Architecture of Fugue

Given a set of uncorrected single cells (A), Fugue embeds their batch information as a learnable matrix and adds it to the expression profiles for feature encoder training (B). The feature encoder is trained with a contrastive loss (C). At the feature extraction stage, single-cell expression profiles are passed to the feature encoder to extract embedding representations (D). The embeddings can then be used for downstream analyses such as visualization and cell clustering (E).

Installation

The Fugue package can be obtained by cloning the repository:

git clone https://github.com/xilinshen/Fugue.git

The following packages are required:

python==3.8.8 
numpy==1.21.4
torch==1.8.0
torchvision==0.9.0
scanpy==1.7.0rc1

Some of the source code was copied from facebookresearch/MoCo.

Training Fugue on your own data

Data preprocessing

We recommend normalizing the single-cell count matrix to counts per million (CPM) and then applying a logarithmic transformation, i.e. log2(CPM+1).

import numpy as np
from utils.preprocessing import data_preprocessing

# load the raw archive once and read each array from it
data = np.load("./data/splatter_simulation_count.npz")
count_matrix = data["x"] # expression profiles
batch = data["y"] # batch labels
celltype = data["g"] # cell type labels

# preprocess the count matrix
X = data_preprocessing(count_matrix)

# save the preprocessed data in compressed .npz format
np.savez_compressed("./data/splatter_simulation.npz", x = X, y = batch, g = celltype)

The preprocessed data must be saved in NumPy compressed .npz format. Key "x" should hold the expression profiles and key "y" the batch labels.
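For readers preparing their own data, the sketch below shows one way to compute log2(CPM+1) and to check that an archive has the expected keys. It is an illustration under stated assumptions, not the code behind `data_preprocessing`: the helper names `log2_cpm` and `check_npz` are hypothetical, and the matrix is assumed to be cells by genes (rows are cells).

```python
import io
import numpy as np

def log2_cpm(counts):
    """Return log2(CPM + 1) for a cells x genes count matrix."""
    counts = np.asarray(counts, dtype=np.float64)
    library_size = counts.sum(axis=1, keepdims=True)
    library_size[library_size == 0] = 1.0   # avoid division by zero for empty cells
    cpm = counts / library_size * 1e6
    return np.log2(cpm + 1.0)

def check_npz(file):
    """Check that the archive has keys "x" and "y" with one batch label per cell."""
    with np.load(file) as data:
        missing = {"x", "y"} - set(data.files)
        if missing:
            raise ValueError(f"missing keys: {sorted(missing)}")
        x, y = data["x"], data["y"]
    if x.ndim != 2 or x.shape[0] != y.shape[0]:
        raise ValueError("x must be 2-D with one batch label in y per row")
    return x.shape

# Toy example: 3 cells, 4 genes, 2 batches, written to an in-memory buffer.
toy_counts = np.array([[10, 0, 5, 5], [0, 0, 0, 8], [1, 1, 1, 1]])
toy_batch = np.array([0, 1, 1])

buffer = io.BytesIO()
np.savez_compressed(buffer, x=log2_cpm(toy_counts), y=toy_batch)
buffer.seek(0)
print(check_npz(buffer))
```

Replacing the buffer with a file path such as "./data/my_data.npz" (a placeholder name) gives a file in the layout described above.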

Model training

The model is trained on GPUs:

python main.py \
--file "./data/splatter_simulation.npz" \
--arch densenet21 \
--batch-size 128 \
--dist-url "tcp://localhost:10000" \
--outdir "./result/" \
--mlp --moco-k 1408 --moco-m 0.999 \
--in_features 2000 \
--num_batches 5 \
--shuffle-ratio 0.1 \
--randomzero-ratio 0.3 \
--multiprocessing-distributed
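Several of the flags above map onto simple operations. In the MoCo reference implementation, `--moco-m` is the momentum used to update the key encoder and `--moco-k` is the size of the queue of negative keys. The meaning of `--shuffle-ratio` and `--randomzero-ratio` is not documented here; the sketch below assumes they control two augmentations, shuffling the values of a fraction of genes within each cell and zeroing each value with a given probability. The function names are hypothetical and the code is an illustration, not the repository's implementation.

```python
import numpy as np

def random_zero(x, ratio, rng):
    """Zero each entry independently with probability `ratio` (assumed semantics)."""
    keep = rng.random(x.shape) >= ratio
    return x * keep

def shuffle_genes(x, ratio, rng):
    """Permute the values of a random `ratio` of genes within each cell (assumed semantics)."""
    out = x.copy()
    n_genes = x.shape[1]
    n_shuffle = int(round(ratio * n_genes))
    for row in out:   # each row is a view, so assignments modify `out`
        idx = rng.choice(n_genes, size=n_shuffle, replace=False)
        row[idx] = row[rng.permutation(idx)]
    return out

def momentum_update(key_params, query_params, m=0.999):
    """MoCo-style update: the key encoder slowly tracks the query encoder."""
    return m * key_params + (1.0 - m) * query_params

rng = np.random.default_rng(0)
cells = rng.random((4, 2000))   # 4 toy cells with 2000 features, as in --in_features

augmented = random_zero(shuffle_genes(cells, 0.1, rng), 0.3, rng)
print("fraction zeroed:", np.mean(augmented == 0))

key_w = momentum_update(np.zeros(3), np.ones(3), m=0.999)
print("key weights after one update:", key_w)
```

With a momentum close to 1, the key encoder changes slowly, which keeps the keys stored in the queue consistent with one another across training steps.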

Extracting batch-corrected representations of single cells from the pretrained feature encoder

import utils
import numpy as np
import scanpy as sc

# load data (open the archive once and read each array from it)
data = np.load("./data/splatter_simulation.npz")
X = data["x"] # expression profile
batch = data["y"] # batch label
celltype = data["g"] # cell type label

in_features = 2000
assert in_features == X.shape[1]

# load the pretrained feature encoder
arch = "densenet21"
checkpoint = "./result/checkpoint_0029.pth.tar"
model = utils.load_pretrained_model(arch, in_features, checkpoint, return_feature=True)

# extract embeddings of single-cells
features = utils.extract_features(model, X)

# visualization
adata=sc.AnnData(features)
adata.obs["celltype"] = celltype
adata.obs["batch"] = batch
adata.obs["celltype"] = adata.obs["celltype"].astype("category")
adata.obs["batch"] = adata.obs["batch"].astype("category")

sc.pp.neighbors(adata)
sc.tl.leiden(adata)
sc.tl.umap(adata)
sc.pl.umap(adata, color=["celltype","batch"])
