scClustEval

📖 Documentation: https://zaoqu-liu.github.io/scClustEval/

Single Cell Clustering Evaluation and Optimization Framework

scClustEval is a comprehensive R package designed for rigorous evaluation and iterative optimization of cell clustering in single-cell RNA sequencing (scRNA-seq) data. The package implements a self-projection machine learning framework that systematically assesses clustering reliability and identifies biologically indistinguishable cell populations for potential merging.

This package provides an R implementation inspired by the SCCAF (Single Cell Clustering Assessment Framework) Python package (Miao et al., 2020, Nature Methods).

Overview

Accurate cell type identification is fundamental to scRNA-seq analysis. However, conventional clustering algorithms often produce results that are sensitive to parameter choices and may not reflect true biological distinctions. scClustEval addresses this challenge through:

Quantitative Assessment: Objectively measures clustering quality using cross-validated classification accuracy
Confusion Matrix Analysis: Identifies cluster pairs with high misclassification rates indicating potential over-clustering
Iterative Optimization: Systematically merges indistinguishable clusters until a target discrimination accuracy is achieved

Installation

From R-Universe (Recommended)

install.packages("scClustEval", repos = "https://zaoqu-liu.r-universe.dev")

From GitHub

# Install remotes if not available
if (!requireNamespace("remotes", quietly = TRUE)) install.packages("remotes")

# Install scClustEval from GitHub
remotes::install_github("Zaoqu-Liu/scClustEval")

System Requirements

R (≥ 4.0.0)
C++ compiler with C++11 support (for Rcpp components)

Dependencies

Core dependencies (installed automatically):

Seurat (≥ 4.0.0), SeuratObject, Matrix
glmnet, caret, rpart, igraph
pROC, ggplot2, rlang

Optional dependencies (for extended functionality):

install.packages(c("randomForest", "ranger", "e1071", "xgboost", 
                   "leiden", "patchwork", "ggalluvial", "ComplexHeatmap"))

Methodology

Self-Projection Framework

The core algorithm employs a self-projection strategy:

1. Data Partitioning: Stratified splitting into training and test sets while preserving cluster proportions
2. Classifier Training: A multi-class classifier is trained to discriminate between clusters
3. Cross-Validation: Performance estimation via k-fold cross-validation on training data
4. Confusion Matrix Computation: Quantifies misclassification patterns between all cluster pairs
5. Normalization: Two complementary normalization schemes:
   - R1: Confusion rate relative to correctly classified cells
   - R2: Confusion rate relative to total cell count
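
The two normalizations can be illustrated on a toy confusion matrix. The sketch below is base R only and does not call scClustEval; the function names `normalize_r1` and `normalize_r2` are hypothetical, and the exact formulas are an assumption based on the one-line descriptions above (the package's own definitions may differ in detail).

```r
# Toy confusion matrix: rows are true clusters, columns are predicted clusters.
clusters <- c("A", "B", "C")
conf <- matrix(
  c(90,  8,  2,
    10, 85,  5,
     1,  4, 95),
  nrow = 3, byrow = TRUE, dimnames = list(clusters, clusters)
)

# Assumed R1: confusion between clusters i and j relative to the correctly
# classified cells (the diagonal), taking the worse of the two directions.
normalize_r1 <- function(cm) {
  ratio <- sweep(cm, 2, diag(cm), "/")  # ratio[i, j] = cm[i, j] / cm[j, j]
  r1 <- pmax(ratio, t(ratio))
  diag(r1) <- 0
  r1
}

# Assumed R2: cells confused between i and j in either direction,
# relative to the total number of cells in the matrix.
normalize_r2 <- function(cm) {
  r2 <- (cm + t(cm)) / sum(cm)
  diag(r2) <- 0
  r2
}

r1 <- normalize_r1(conf)
r2 <- normalize_r2(conf)
round(r1, 3)
round(r2, 3)
```

Under these assumed definitions, R1 highlights pairs that are confused often relative to how many cells are classified correctly, while R2 stays small for rare clusters because it is scaled by the whole dataset.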

Optimization Strategy

The iterative optimization proceeds as follows:

1. Begin with over-clustered data (high resolution)
2. Assess clustering quality via self-projection
3. Identify cluster pairs exceeding confusion thresholds
4. Construct adjacency graph from confusion matrix
5. Apply community detection (Louvain/Leiden) to merge confused clusters
6. Repeat until the target accuracy is reached or the clustering converges
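
One merging round (thresholding, graph construction, community detection) might look roughly like the sketch below. It uses igraph directly rather than any scClustEval function; the toy R1 matrix, the use of the R1 cutoff alone, and the variable names are assumptions for illustration, not the package's internal logic.

```r
library(igraph)

# Toy R1-normalized confusion matrix for four clusters (values are made up).
cl <- c("0", "1", "2", "3")
r1 <- matrix(
  c(0.00, 0.62, 0.03, 0.01,
    0.62, 0.00, 0.05, 0.02,
    0.03, 0.05, 0.00, 0.08,
    0.01, 0.02, 0.08, 0.00),
  nrow = 4, byrow = TRUE, dimnames = list(cl, cl)
)
r1_cutoff <- 0.5

# Connect two clusters when their confusion exceeds the cutoff.
adj <- (r1 > r1_cutoff) * 1
g <- graph_from_adjacency_matrix(adj, mode = "undirected", diag = FALSE)

# Louvain community detection; each community becomes one merged cluster.
comm <- cluster_louvain(g)
merge_map <- setNames(comm$membership, V(g)$name)

# Relabel cells: clusters "0" and "1" share a community, "2" and "3" stay apart.
old_labels <- c("0", "1", "2", "3", "1", "0")
new_labels <- unname(merge_map[old_labels])
new_labels
```

In a full optimization loop, the relabeled clusters would be assessed again and the procedure repeated until the accuracy target is met or no pair exceeds the cutoff.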

Usage

Quick Assessment with Seurat Objects

library(scClustEval)
library(Seurat)

# Load preprocessed Seurat object
seurat_obj <- readRDS("your_seurat_object.rds")

# Rapid assessment of current clustering
result <- RunAssessment(
  seurat_obj,
  cluster_col = "seurat_clusters",
  use = "pca",
  dims = 1:30,
  classifier = "LR"
)

# Summary statistics
print(result)
# Test Accuracy: 0.8532 (85.3%)
# CV Accuracy:   0.8467 (84.7%)
# Max R1:        0.2341
# Max R2:        0.0156

# Visualize ROC curves
plot_roc(result)
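
To make the reported numbers more concrete, the sketch below reproduces the idea behind a self-projection assessment on simulated data using glmnet directly. It is not the code path of `RunAssessment`; the simulated embedding, the 50/50 split, and all variable names are assumptions chosen for illustration.

```r
library(glmnet)

set.seed(42)

# Simulate a 5-dimensional embedding with three moderately separated clusters.
n_per_cluster <- 150
centers <- c(0, 2, 4)
X <- do.call(rbind, lapply(centers, function(mu) {
  matrix(rnorm(n_per_cluster * 5, mean = mu), ncol = 5)
}))
labels <- factor(rep(c("c0", "c1", "c2"), each = n_per_cluster))

# Stratified split: hold out half of the cells in every cluster for testing.
train_idx <- unlist(
  lapply(split(seq_along(labels), labels),
         function(idx) sample(idx, length(idx) / 2)),
  use.names = FALSE
)
test_idx <- setdiff(seq_along(labels), train_idx)

# L1-penalized multinomial logistic regression with 5-fold cross-validation.
fit <- cv.glmnet(X[train_idx, ], labels[train_idx],
                 family = "multinomial", alpha = 1,
                 nfolds = 5, type.measure = "class")

# Cross-validated accuracy on the training half.
cv_accuracy <- 1 - min(fit$cvm)

# Accuracy and confusion matrix on the held-out half.
pred <- predict(fit, newx = X[test_idx, ], s = "lambda.min", type = "class")
pred <- factor(pred[, 1], levels = levels(labels))
test_accuracy <- mean(pred == labels[test_idx])
confusion <- table(true = labels[test_idx], predicted = pred)

cv_accuracy
test_accuracy
confusion
```

If the clusters were not genuinely distinct in the embedding, both accuracies would drop and the off-diagonal entries of `confusion` would grow, which is the signal the assessment relies on.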

Clustering Optimization Pipeline

# Start with high-resolution clustering
seurat_obj <- FindClusters(seurat_obj, resolution = 2.0)

# Iterative optimization
seurat_obj <- RunOptimization(
  seurat_obj,
  cluster_col = "seurat_clusters",
  result_col = "optimized_clusters",
  min_accuracy = 0.90,
  max_rounds = 10,
  classifier = "LR",
  r1_cutoff = 0.5,
  verbose = TRUE
)

# Compare clustering results
DimPlot(seurat_obj, group.by = c("seurat_clusters", "optimized_clusters"), 
        ncol = 2)

Direct Matrix Input

For non-Seurat workflows:

# Assessment with expression/embedding matrix
result <- sc_assessment(
  X = pca_embeddings,      # cells × features matrix
  labels = cluster_labels,
  classifier = "LR",
  penalty = "l1",
  test_size = 0.5,
  n_per_class = 100,
  cv = 5,
  seed = 42
)

# Full optimization pipeline
optim_result <- sc_optimize_all(
  X = pca_embeddings,
  labels = initial_clusters,
  min_accuracy = 0.90,
  r1_cutoff = 0.5,
  r2_cutoff = 0.05,
  classifier = "LR"
)

# Extract final clustering
final_clusters <- optim_result$final_labels

Visualization Functions

# ROC and Precision-Recall curves
plot_roc(result, plot_type = "both", show_auc = TRUE)

# R1-normalized confusion matrix heatmap
plot_confusion_heatmap(result, normalized = "R1")

# Optimization trajectory
plot_optimization_history(optim_result, metric = "both")

# Sankey diagram of cluster reassignments
plot_cluster_sankey(
  labels_from = initial_clusters,
  labels_to = final_clusters,
  title = "Cluster Optimization Flow"
)
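
The information a Sankey diagram of reassignments conveys can also be checked numerically with a plain cross-tabulation. The sketch below uses base R and toy label vectors; it is independent of `plot_cluster_sankey`, and the labels are made up for illustration.

```r
# Toy labels before and after optimization (hypothetical values).
initial_clusters <- c("0", "1", "2", "3", "1", "0", "2", "3")
final_clusters   <- c("a", "a", "b", "c", "a", "a", "b", "c")

# Cell counts flowing from each initial cluster to each final cluster.
flow <- table(from = initial_clusters, to = final_clusters)
flow

# Initial clusters that were merged into each final cluster.
merged_from <- lapply(split(initial_clusters, final_clusters), unique)
merged_from
```

Here the table shows that initial clusters "0" and "1" both map entirely to final cluster "a", while "2" and "3" are left unmerged.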

Supported Classifiers

| Classifier | Identifier | R Package | Notes |
|---|---|---|---|
| Logistic Regression | `"LR"` | glmnet | L1/L2/Elastic-net regularization; recommended |
| Random Forest | `"RF"` | randomForest | Feature importance available |
| Ranger | `"RANGER"` | ranger | Fast RF implementation |
| Support Vector Machine | `"SVM"` | e1071 | RBF/linear/polynomial kernels |
| Naive Bayes | `"NB"` | e1071 | Efficient for high-dimensional data |
| Decision Tree | `"DT"` | rpart | Interpretable; feature importance |
| XGBoost | `"XGB"` | xgboost | Gradient boosting; requires installation |

Key Parameters

| Parameter | Description | Default |
|---|---|---|
| `classifier` | Machine learning algorithm | `"LR"` |
| `test_size` | Fraction of data for testing | `0.5` |
| `n_per_class` | Maximum training samples per cluster | `100` |
| `cv` | Cross-validation folds | `5` |
| `r1_cutoff` | R1 confusion threshold for merging | `0.5` |
| `r2_cutoff` | R2 confusion threshold for merging | `0.05` |
| `min_accuracy` | Target accuracy for optimization | `0.9` |
| `max_rounds` | Maximum optimization iterations | `10` |
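
How `test_size` and `n_per_class` could interact during data partitioning is sketched below in base R. The helper `stratified_split` is hypothetical, and the rule that every cell not used for training goes to the test set is an assumption; the package may partition differently.

```r
# Hypothetical helper: per-cluster split with a cap on training cells.
stratified_split <- function(labels, test_size = 0.5, n_per_class = 100,
                             seed = 42) {
  set.seed(seed)
  idx_by_class <- split(seq_along(labels), labels)
  train <- unlist(lapply(idx_by_class, function(idx) {
    # Keep (1 - test_size) of the cluster for training, at least one cell,
    # and never more than n_per_class cells.
    n_train <- max(1, floor(length(idx) * (1 - test_size)))
    n_train <- min(n_train, n_per_class)
    idx[sample.int(length(idx), n_train)]
  }), use.names = FALSE)
  list(train = sort(train), test = setdiff(seq_along(labels), train))
}

# Three clusters of very different sizes.
labels <- rep(c("A", "B", "C"), times = c(400, 60, 10))
split_idx <- stratified_split(labels)

table(labels[split_idx$train])  # A is capped at 100; B and C keep half
table(labels[split_idx$test])
```

Capping the training cells per cluster keeps large clusters from dominating the classifier, so that small clusters still contribute comparably to the fit.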

Performance

C++ Acceleration: Core confusion matrix computations implemented in C++ via Rcpp/RcppArmadillo
Parallel Processing: Optional multi-core support via the future framework
Memory Efficient: Native sparse matrix support for large datasets
Seurat Compatible: Full support for Seurat v4 and v5 object structures

Citation

If you use scClustEval in your research, please cite:

@Manual{scClustEval2026,
  title = {scClustEval: Single Cell Clustering Evaluation and Optimization Framework},
  author = {Zaoqu Liu},
  year = {2026},
  note = {R package version 1.0.0},
  url = {https://github.com/Zaoqu-Liu/scClustEval}
}

Please also cite the original SCCAF methodology:

@Article{Miao2020,
  title = {Putative cell type discovery from single-cell gene expression data},
  author = {Miao, Zhichao and Moreno, Pablo and Huang, Ni and Papatheodorou, Irene and Brazma, Alvis and Teichmann, Sarah A.},
  journal = {Nature Methods},
  year = {2020},
  volume = {17},
  pages = {621--628},
  doi = {10.1038/s41592-020-0825-9}
}

Related Resources

Original SCCAF: github.com/SCCAF/sccaf (Python implementation)
Publication: Miao et al., 2020, Nature Methods

Contact

Author: Zaoqu Liu
Email: liuzaoqu@163.com
GitHub: github.com/Zaoqu-Liu/scClustEval
Issues: github.com/Zaoqu-Liu/scClustEval/issues

License

scClustEval is released under the MIT License. It incorporates concepts from the SCCAF Python package, which is also released under the MIT License.
