scCDCG is a clustering model for scRNA-seq data based on deep cut-informed graph embedding. For details, see our paper, "scCDCG: Efficient Deep Structural Clustering for single-cell RNA-seq via Deep Cut-informed Graph Embedding", accepted as a long paper in the research track at DASFAA 2024 (CCF-B).
(arXiv: https://arxiv.org/abs/2404.06167)
(DOI: 10.48550/arXiv.2404.06167)
Single-cell RNA sequencing (scRNA-seq) is essential for unraveling cellular heterogeneity and diversity, offering invaluable insights for bioinformatics advancements. Despite its potential, traditional clustering methods in scRNA-seq data analysis often neglect the structural information embedded in gene expression profiles, crucial for understanding cellular correlations and dependencies. Existing strategies, including graph neural networks, face challenges in handling the inefficiency due to scRNA-seq data's intrinsic high-dimension and high-sparsity. Addressing these limitations, we introduce scCDCG (single-cell RNA-seq Clustering via Deep Cut-informed Graph), a novel framework designed for efficient and accurate clustering of scRNA-seq data that simultaneously utilizes intercellular high-order structural information. scCDCG comprises three main components: (i) A graph embedding module utilizing deep cut-informed techniques, which effectively captures intercellular high-order structural information, overcoming the over-smoothing and inefficiency issues prevalent in prior graph neural network methods. (ii) A self-supervised learning module guided by optimal transport, tailored to accommodate the unique complexities of scRNA-seq data, specifically its high-dimension and high-sparsity. (iii) An autoencoder-based feature learning module that simplifies model complexity through effective dimension reduction and feature extraction. Our extensive experiments on 6 datasets demonstrate scCDCG's superior performance and efficiency compared to 7 established models, underscoring scCDCG's potential as a transformative tool in scRNA-seq data analysis.
We propose scCDCG, a deep cut-informed graph model-based single-cell clustering method (see Fig. 1 for its architecture), which includes (i) an autoencoder-based feature learning module for learning gene expression embeddings, (ii) a graph embedding module based on deep cut-informed techniques for capturing intercellular high-order structural information, and (iii) a self-supervision learning module via optimal transport for generating clustering assignments.
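To make the idea behind module (iii) more concrete, the sketch below shows one common way to obtain balanced soft cluster assignments with optimal transport: Sinkhorn-Knopp normalization of a cell-by-cluster score matrix. This is an illustration only and is not taken from the scCDCG code; the function name `sinkhorn_assignments` and the parameters `epsilon` and `n_iters` are hypothetical, and the actual module in the repository may differ.

```python
import numpy as np


def sinkhorn_assignments(scores, epsilon=1.0, n_iters=50):
    """Turn (n_cells, n_clusters) scores into balanced soft assignments.

    Illustrative Sinkhorn-Knopp sketch (assumption, not the scCDCG code):
    rows of the result sum to 1, and every cluster receives roughly the
    same total mass, n_cells / n_clusters.
    """
    scores = np.asarray(scores, dtype=np.float64)
    n_cells, n_clusters = scores.shape
    # Subtract the maximum before exponentiating for numerical stability.
    q = np.exp((scores - scores.max()) / epsilon)
    q /= q.sum()
    for _ in range(n_iters):
        # Give every cluster (column) a total mass of 1 / n_clusters.
        q = q / q.sum(axis=0, keepdims=True) / n_clusters
        # Give every cell (row) a total mass of 1 / n_cells.
        q = q / q.sum(axis=1, keepdims=True) / n_cells
    # Rescale so that each row is a probability distribution over clusters.
    return q * n_cells
```

A smaller `epsilon` yields sharper assignments but needs more iterations to balance the clusters, and very small values can underflow.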
In conclusion, our study introduces scCDCG, an innovative framework for the efficient and accurate clustering of single-cell RNA sequencing (scRNA-seq) data. scCDCG successfully navigates the challenges of high-dimension and high-sparsity through a synergistic combination of a graph embedding module with deep cut-informed techniques, a self-supervised learning module guided by optimal transport, and an autoencoder-based feature learning module. Our extensive evaluations on six datasets confirm scCDCG's superior performance over seven established models, marking it as a transformative tool for bioinformatics and cellular heterogeneity analysis. Looking forward, we aim to extend scCDCG's capabilities to integrate multi-omics data, enhancing its applicability in more complex biological contexts. Additionally, further exploration into the interpretability of the clustering results generated by scCDCG will be crucial for providing deeper biological insights and facilitating its adoption in clinical research settings. This future work will continue to expand the frontiers of scRNA-seq data analysis and its impact on understanding the complexities of cellular systems.
python train_scCDCG.py --dataname 'Meuro_human_Pancreas_cell' --num_class 9 --epochs 200 --foldername 'logger_folder' --gpu 0 --learning_rate 5e-3 --weight_decay 5e-3 --balancer 0.42 --factor_ort 0.87 --factor_KL 0.45 --factor_corvar 0.2 --factor_construct 0.93 --alpha_pre 0.8
Here, we give the hyperparameters used for the Meuro_human_Pancreas_cell dataset. The hyperparameters for the rest of the datasets are found in the file train_scCDCG.py.
If you want to replicate our experimental results, please use the hyperparameters we provided.
Please contact us if you encounter problems during the replication process.
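When repeating runs, it can be convenient to keep the hyperparameters in one place and build the command from them. The sketch below uses only the flags and values from the command shown above; the helper names `build_command` and `as_shell_string` are hypothetical and not part of the repository.

```python
import shlex

# Flags and values copied from the example command above.
# Values are kept as strings so that "5e-3" is passed through unchanged.
HYPERPARAMS = {
    "dataname": "Meuro_human_Pancreas_cell",
    "num_class": "9",
    "epochs": "200",
    "foldername": "logger_folder",
    "gpu": "0",
    "learning_rate": "5e-3",
    "weight_decay": "5e-3",
    "balancer": "0.42",
    "factor_ort": "0.87",
    "factor_KL": "0.45",
    "factor_corvar": "0.2",
    "factor_construct": "0.93",
    "alpha_pre": "0.8",
}


def build_command(script, params):
    """Return the argument list for running `script` with `params`."""
    parts = ["python", script]
    for flag, value in params.items():
        parts += ["--" + flag, str(value)]
    return parts


def as_shell_string(parts):
    """Render an argument list as a shell-safe command line."""
    return " ".join(shlex.quote(part) for part in parts)
```

The list returned by `build_command("train_scCDCG.py", HYPERPARAMS)` can be passed to `subprocess.run` from the repository root, for example inside a loop that repeats the run.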
We implement scCDCG in Python 3.7 based on PyTorch (version 1.12+cu113).
Keras --- 2.4.3
numpy --- 1.19.5
pandas --- 1.3.5
Scanpy --- 1.8.2
torch --- 1.12.0
Please note that with different versions, the results reported in our paper might not be reproducible.
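A quick way to check an environment against the versions listed above is sketched below. It is not part of the repository: the helper names are hypothetical, the lowercase `scanpy` distribution name is an assumption, and on Python 3.7 the lookup assumes the `importlib_metadata` backport is installed. Local build suffixes such as `+cu113` are ignored when comparing.

```python
# Versions copied from the list above.
EXPECTED_VERSIONS = {
    "Keras": "2.4.3",
    "numpy": "1.19.5",
    "pandas": "1.3.5",
    "scanpy": "1.8.2",
    "torch": "1.12.0",
}


def installed_version(package):
    """Return the installed version of `package`, or None if it is missing."""
    try:
        from importlib.metadata import PackageNotFoundError, version
    except ImportError:  # Python 3.7: fall back to the backport package
        from importlib_metadata import PackageNotFoundError, version
    try:
        return version(package)
    except PackageNotFoundError:
        return None


def version_mismatches(expected, lookup=installed_version):
    """Map each mismatching package to a (wanted, found) pair."""
    mismatches = {}
    for package, wanted in expected.items():
        found = lookup(package)
        # Drop a local build tag, e.g. "1.12.0+cu113" -> "1.12.0".
        base = found.split("+")[0] if found is not None else None
        if base != wanted:
            mismatches[package] = (wanted, found)
    return mismatches


if __name__ == "__main__":
    for package, (wanted, found) in version_mismatches(EXPECTED_VERSIONS).items():
        print(f"{package}: expected {wanted}, found {found}")
```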
Set data_file to the path of the data file (stored in h5 format with two components, X and Y, where X is the cell-by-gene count matrix and Y holds the true labels), and set n_clusters to the number of clusters.
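For reference, a file with the X and Y layout described above could be read roughly as follows. This is a minimal sketch, assuming `h5py` is available and that X and Y are stored as plain datasets at the top level of the file; the function name `load_h5_dataset` is hypothetical, and the loader in the repository may differ.

```python
import h5py
import numpy as np


def load_h5_dataset(data_file):
    """Read the count matrix X and the labels Y from an h5 file.

    Returns (x, y, n_clusters): x is a float32 cell-by-gene matrix, y holds
    integer label codes starting at 0, and n_clusters is the number of
    distinct labels found in Y.
    """
    with h5py.File(data_file, "r") as f:
        x = f["X"][()]
        y = f["Y"][()]
    x = np.asarray(x, dtype=np.float32)
    # Labels may be stored as strings or arbitrary integers; map them to 0..k-1.
    classes, y = np.unique(np.asarray(y).ravel(), return_inverse=True)
    if x.shape[0] != y.shape[0]:
        raise ValueError("X and Y must describe the same number of cells")
    return x, y, len(classes)
```

The returned `n_clusters` is derived from the true labels, so it is only useful for benchmark data where Y is known.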
To reduce the bias caused by randomness and obtain more reliable and stable results, we ran the model more than 10 times on every dataset and report the mean and variance over these runs. Hyperparameter settings for all datasets can be found in the code. The final output reports the clustering performance; here is an example on the Meuro_human_Pancreas_cell scRNA-seq data:
Final: ACC= 0.9265, NMI= 0.8681, ARI= 0.9137
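To summarize repeated runs, the final lines of each run's output can be collected and reduced to a mean and variance per metric. The sketch below parses lines in the format shown above; the helper names are hypothetical, and the use of the sample variance (`statistics.variance`) is an assumption, since the kind of variance reported is not stated here.

```python
import re
import statistics

# Matches lines such as "Final: ACC= 0.9265, NMI= 0.8681, ARI= 0.9137".
FINAL_LINE = re.compile(
    r"Final:\s*ACC=\s*([\d.]+),\s*NMI=\s*([\d.]+),\s*ARI=\s*([\d.]+)"
)


def parse_final_line(line):
    """Return {"ACC": ..., "NMI": ..., "ARI": ...}, or None for other lines."""
    match = FINAL_LINE.search(line)
    if match is None:
        return None
    acc, nmi, ari = (float(group) for group in match.groups())
    return {"ACC": acc, "NMI": nmi, "ARI": ari}


def summarize_runs(lines):
    """Return {metric: (mean, sample variance)} over all parsed final lines."""
    runs = [run for run in map(parse_final_line, lines) if run is not None]
    if len(runs) < 2:
        raise ValueError("need at least two runs to compute a variance")
    summary = {}
    for metric in ("ACC", "NMI", "ARI"):
        values = [run[metric] for run in runs]
        summary[metric] = (statistics.mean(values), statistics.variance(values))
    return summary
```

Lines that do not match the format are skipped, so complete log files can be passed in line by line.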
The raw data used in this paper can be found at: https://github.com/XPgogogo/scCDCG/tree/master/datasets
@article{xu2024sccdcg,
title={scCDCG: Efficient Deep Structural Clustering for single-cell RNA-seq via Deep Cut-informed Graph Embedding},
author={Xu, Ping and Ning, Zhiyuan and Xiao, Meng and Feng, Guihai and Li, Xin and Zhou, Yuanchun and Wang, Pengfei},
journal={arXiv preprint arXiv:2404.06167},
year={2024}
}
Ph.D student Ping XU
Computer Network Information Center, Chinese Academy of Sciences
University of Chinese Academy of Sciences
No.2 Dongshen South St
Beijing, P.R China, 100190
Personal Email: xuping0098@gmail.com
Official Email: xuping@cnic.cn