scCDCG

scCDCG is a clustering model based on a deep cut-informed graph for scRNA-seq data. For details, see our paper "scCDCG: Efficient Deep Structural Clustering for single-cell RNA-seq via Deep Cut-informed Graph Embedding", accepted as a long paper for the research track at DASFAA 2024 (CCF-B).

(arXiv: https://arxiv.org/abs/2404.06167)

(DOI: 10.48550/arXiv.2404.06167)

Overview

Single-cell RNA sequencing (scRNA-seq) is essential for unraveling cellular heterogeneity and diversity, offering invaluable insights for bioinformatics advancements. Despite its potential, traditional clustering methods for scRNA-seq data often neglect the structural information embedded in gene expression profiles, which is crucial for understanding cellular correlations and dependencies. Existing strategies, including graph neural networks, become inefficient on scRNA-seq data because of its intrinsic high dimensionality and high sparsity. To address these limitations, we introduce scCDCG (single-cell RNA-seq Clustering via Deep Cut-informed Graph), a novel framework for efficient and accurate clustering of scRNA-seq data that simultaneously utilizes intercellular high-order structural information. scCDCG comprises three main components: (i) a graph embedding module utilizing deep cut-informed techniques, which effectively captures intercellular high-order structural information and overcomes the over-smoothing and inefficiency issues prevalent in prior graph neural network methods; (ii) a self-supervised learning module guided by optimal transport, tailored to accommodate the unique complexities of scRNA-seq data, specifically its high dimensionality and high sparsity; and (iii) an autoencoder-based feature learning module that simplifies model complexity through effective dimension reduction and feature extraction. Our extensive experiments on 6 datasets demonstrate scCDCG's superior performance and efficiency compared to 7 established models, underscoring scCDCG's potential as a transformative tool in scRNA-seq data analysis.

We propose scCDCG, a deep cut-informed graph model-based single-cell clustering method (see Fig. 1 for its architecture), which includes (i) an autoencoder-based feature learning module for learning gene expression embeddings, (ii) a graph embedding module based on deep cut-informed techniques for capturing intercellular high-order structural information, and (iii) a self-supervision learning module via optimal transport for generating clustering assignments.
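To give a feel for how optimal transport can turn soft scores into clustering assignments, here is a minimal sketch of generic Sinkhorn-Knopp scaling in plain Python. It is not taken from the scCDCG code: the function name `sinkhorn_assignments`, the parameters `epsilon` and `n_iters`, and the assumption that scores lie in [0, 1] are all illustrative choices, and the actual module in model.py may differ.

```python
import math


def sinkhorn_assignments(scores, epsilon=0.05, n_iters=50):
    """Approximately balanced soft assignments from an (n_cells x n_clusters) matrix.

    Generic Sinkhorn-Knopp scaling, shown for illustration only (hypothetical
    helper, not the scCDCG implementation). Assumes scores lie in [0, 1].
    """
    n, k = len(scores), len(scores[0])
    top = max(max(row) for row in scores)  # subtract the max for numerical stability
    q = [[math.exp((s - top) / epsilon) for s in row] for row in scores]
    for _ in range(n_iters):
        # Scale columns so that every cluster receives total mass 1/k.
        col_sums = [sum(q[i][j] for i in range(n)) for j in range(k)]
        q = [[q[i][j] / (col_sums[j] * k) for j in range(k)] for i in range(n)]
        # Scale rows so that every cell carries total mass 1/n.
        row_sums = [sum(row) for row in q]
        q = [[v / (row_sums[i] * n) for v in q[i]] for i in range(n)]
    # Rescale so that each row is a probability distribution over clusters.
    return [[v * n for v in row] for row in q]


# Four cells that all prefer cluster 0; the balancing pushes some toward cluster 1.
toy_scores = [[0.9, 0.1], [0.8, 0.2], [0.7, 0.3], [0.6, 0.4]]
for row in sinkhorn_assignments(toy_scores, epsilon=0.5):
    print([round(v, 3) for v in row])
```

The equal-mass constraint on clusters is what discourages the degenerate solution in which every cell is assigned to a single cluster; a smaller `epsilon` gives sharper assignments but needs more iterations to balance.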

In conclusion, our study introduces scCDCG, an innovative framework for the efficient and accurate clustering of single-cell RNA sequencing (scRNA-seq) data. scCDCG successfully navigates the challenges of high-dimension and high-sparsity through a synergistic combination of a graph embedding module with deep cut-informed techniques, a self-supervised learning module guided by optimal transport, and an autoencoder-based feature learning module. Our extensive evaluations on six datasets confirm scCDCG's superior performance over seven established models, marking it as a transformative tool for bioinformatics and cellular heterogeneity analysis. Looking forward, we aim to extend scCDCG's capabilities to integrate multi-omics data, enhancing its applicability in more complex biological contexts. Additionally, further exploration into the interpretability of the clustering results generated by scCDCG will be crucial for providing deeper biological insights and facilitating its adoption in clinical research settings. This future work will continue to expand the frontiers of scRNA-seq data analysis and its impact on understanding the complexities of cellular systems.

Run Example

python train_scCDCG.py --dataname 'Meuro_human_Pancreas_cell' --num_class 9 --epochs 200 --foldername 'logger_folder' --gpu 0 --learning_rate 5e-3 --weight_decay 5e-3 --balancer 0.42 --factor_ort 0.87 --factor_KL 0.45 --factor_corvar 0.2 --factor_construct 0.93 --alpha_pre 0.8

Here, we give the hyperparameters used for the Meuro_human_Pancreas_cell dataset. The hyperparameters for the rest of the datasets are found in the file train_scCDCG.py.
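When running several datasets, it can be convenient to keep each hyperparameter set in a dictionary and generate the command line from it. The sketch below does this for the values listed above. The names `PANCREAS_ARGS` and `build_command` are hypothetical helpers and not part of the repository; the flags themselves are copied from the run example.

```python
import shlex
import sys

# Hyperparameters copied from the run example for this dataset.
PANCREAS_ARGS = {
    "dataname": "Meuro_human_Pancreas_cell",
    "num_class": 9,
    "epochs": 200,
    "foldername": "logger_folder",
    "gpu": 0,
    "learning_rate": 5e-3,
    "weight_decay": 5e-3,
    "balancer": 0.42,
    "factor_ort": 0.87,
    "factor_KL": 0.45,
    "factor_corvar": 0.2,
    "factor_construct": 0.93,
    "alpha_pre": 0.8,
}


def build_command(args, script="train_scCDCG.py"):
    """Turn a dict of hyperparameters into an argv list for the training script."""
    argv = [sys.executable, script]
    for name, value in args.items():
        argv += ["--" + name, str(value)]
    return argv


# Only print the command here; pass the list to subprocess.run to launch training.
print(" ".join(shlex.quote(part) for part in build_command(PANCREAS_ARGS)))
```

Building the command as a list avoids shell-quoting mistakes and makes it easy to loop over the settings of the other datasets.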

If you want to replicate our experimental results, please use the hyperparameters we provided.

Please contact us if you encounter problems during the replication process.

Requirements

We implement scCDCG in Python 3.7 based on PyTorch (version 1.12+cu113).

Keras --- 2.4.3
numpy --- 1.19.5
pandas --- 1.3.5
Scanpy --- 1.8.2
torch --- 1.12.0

Please note that the results reported in our paper might not be reproducible with different versions.

The raw data

Set data_file to the path of the data file (stored in h5 format with two components, X and Y, where X is the cell-by-gene count matrix and Y holds the true labels), and n_clusters to the number of clusters.
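As a quick sanity check on a data file, the following sketch reads the two components and derives the number of clusters from the labels. It assumes that X and Y are stored as dense datasets at the root of the file and that the h5py package is installed (it is not listed in the requirements); the function name `load_h5_dataset` is hypothetical, and the loader in the repository may work differently.

```python
import h5py
import numpy as np


def load_h5_dataset(data_file):
    """Read the count matrix X (cells x genes) and the labels Y from an h5 file.

    Assumes dense datasets named "X" and "Y" at the root of the file.
    """
    with h5py.File(data_file, "r") as f:
        x = np.asarray(f["X"][()], dtype=np.float32)
        y = np.asarray(f["Y"][()]).ravel()
    if x.shape[0] != y.shape[0]:
        raise ValueError(
            "X has {} cells but Y has {} labels".format(x.shape[0], y.shape[0])
        )
    # The number of distinct labels is a natural value for n_clusters.
    n_clusters = int(np.unique(y).size)
    return x, y, n_clusters
```

Calling `load_h5_dataset(data_file)` returns the matrix, the label vector, and a cluster count that can be compared against the value passed to the training script.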

To reduce the bias caused by randomness and obtain more reliable and stable results, we ran each dataset more than 10 times and report the mean and variance of these runs. Hyperparameter settings for all datasets can be found in the code. The final output reports the clustering performance; here is an example on the Meuro_human_Pancreas_cell scRNA-seq data:

Final: ACC= 0.9265, NMI= 0.8681, ARI= 0.9137
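To aggregate repeated runs, the "Final:" lines can be parsed and summarized with the standard library. The sketch below assumes every run prints a line in exactly the format shown above; `parse_final` and `summarize` are hypothetical helper names, and the use of the sample variance is an assumption, since the text does not say which variance estimator was used.

```python
import re
import statistics

FINAL_LINE = re.compile(
    r"Final:\s*ACC=\s*([\d.]+),\s*NMI=\s*([\d.]+),\s*ARI=\s*([\d.]+)"
)


def parse_final(line):
    """Extract ACC, NMI and ARI from one 'Final:' line; return None if absent."""
    match = FINAL_LINE.search(line)
    if match is None:
        return None
    acc, nmi, ari = (float(group) for group in match.groups())
    return {"ACC": acc, "NMI": nmi, "ARI": ari}


def summarize(lines):
    """Return {metric: (mean, sample variance)} over all 'Final:' lines found."""
    runs = [run for run in map(parse_final, lines) if run is not None]
    if not runs:
        raise ValueError("no 'Final:' lines found")
    summary = {}
    for metric in ("ACC", "NMI", "ARI"):
        values = [run[metric] for run in runs]
        # The sample variance needs at least two runs; report 0.0 for a single run.
        variance = statistics.variance(values) if len(values) > 1 else 0.0
        summary[metric] = (statistics.mean(values), variance)
    return summary


print(parse_final("Final: ACC= 0.9265, NMI= 0.8681, ARI= 0.9137"))
```

Passing the collected log lines of all runs to `summarize` yields one (mean, variance) pair per metric.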

The raw data used in this paper can be found at: https://github.com/XPgogogo/scCDCG/tree/master/datasets

Please cite our paper if you use this code or the dataset we provide in your own work:

@article{xu2024sccdcg,
  title={scCDCG: Efficient Deep Structural Clustering for single-cell RNA-seq via Deep Cut-informed Graph Embedding},
  author={Xu, Ping and Ning, Zhiyuan and Xiao, Meng and Feng, Guihai and Li, Xin and Zhou, Yuanchun and Wang, Pengfei},
  journal={arXiv preprint arXiv:2404.06167},
  year={2024}
}

Contact

Ph.D student Ping XU

Computer Network Information Center, Chinese Academy of Sciences

University of Chinese Academy of Sciences

No.2 Dongshen South St

Beijing, P.R China, 100190

Personal Email: xuping0098@gmail.com

Official Email: xuping@cnic.cn
