Train, evaluate, compare, and visualize baseline deep-learning models for single-cell data without writing PyTorch from scratch.
- Documentation site: https://uddamvathanak.github.io/scDLKit/
- Primary notebook tutorial: `examples/train_vae_pbmc.ipynb`
- Install path for tutorials: `python -m pip install "scdlkit[tutorials]"`
- CPU and GPU use the same notebook path through `device="auto"`
- Secondary notebooks: `examples/compare_models_pbmc.ipynb`, `examples/classification_demo.ipynb`
- Synthetic smoke examples: `examples/first_run_synthetic.ipynb`, `examples/first_run_synthetic.py`
- AnnData-native workflow for single-cell users.
- Baseline-first model zoo: AE, VAE, DAE, Transformer AE, and MLP classification.
- Built-in training, evaluation, comparison, and plotting.
- Reproducible reports and notebooks for portfolio-ready demonstrations.
- Gene-expression-focused scope while the core toolkit stabilizes.
- Linux: supported
- macOS: supported
- Windows: supported
Primary tutorial install path:

```shell
python -m pip install "scdlkit[tutorials]"
```

Windows note: if you install into a deeply nested virtual environment path, Jupyter dependencies can hit Windows path-length limits. Use a short environment path such as `C:\venvs\scdlkit`, or enable Windows Long Paths if needed.
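As a quick sanity check before installing, a tiny helper like the following (hypothetical, not part of scDLKit) can flag environment paths that risk crossing the classic 260-character `MAX_PATH` limit:

```python
# Hypothetical helper: warn when a virtual-environment path risks hitting
# the classic 260-character MAX_PATH limit on Windows.
def path_too_long(env_path: str, deepest_relative_file: str, limit: int = 260) -> bool:
    """Return True if env_path plus the deepest installed file would exceed limit."""
    full = env_path.rstrip("\\/") + "\\" + deepest_relative_file
    return len(full) > limit

# A short path such as C:\venvs\scdlkit leaves plenty of headroom.
print(path_too_long(r"C:\venvs\scdlkit", "Lib\\site-packages\\jupyter\\some_long_file.py"))  # False
```

The `deepest_relative_file` argument is an illustration; in practice the longest paths come from deeply nested Jupyter and notebook-extension packages.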
Optional extras:

```shell
python -m pip install "scdlkit[scanpy]"
python -m pip install "scdlkit[notebook]"
python -m pip install scdlkit
python -m pip install "scdlkit[dev,docs]"
```

For GPU users, install the matching PyTorch build first using the official selector:

Then install `scdlkit[tutorials]`. The same notebook examples run on CPU or GPU with `device="auto"`.
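scDLKit's internal resolution logic isn't shown here, but `device="auto"` style selection typically reduces to a small fallback chain; a minimal sketch, assuming a boolean CUDA probe rather than importing torch:

```python
# Sketch of a typical device="auto" fallback chain (an assumption about how
# such selection usually works, not scDLKit's actual implementation).
def resolve_device(requested: str, cuda_available: bool) -> str:
    """Map 'auto' to 'cuda' when a GPU is visible, else 'cpu'; pass through explicit choices."""
    if requested == "auto":
        return "cuda" if cuda_available else "cpu"
    return requested

print(resolve_device("auto", cuda_available=False))  # cpu
print(resolve_device("auto", cuda_available=True))   # cuda
print(resolve_device("cpu", cuda_available=True))    # cpu
```

In a real implementation the `cuda_available` flag would come from a probe such as `torch.cuda.is_available()`.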
Primary tutorial example. The notebook now uses a quickstart profile by default and exposes a full profile in its first config cell:

- `quickstart`: CPU-friendly, docs-friendly, reproducible
- `full`: a longer run for stronger qualitative separation
```python
import scanpy as sc
from scdlkit import TaskRunner

adata = sc.datasets.pbmc3k_processed()

runner = TaskRunner(
    model="vae",
    task="representation",
    label_key="louvain",
    device="auto",
    epochs=20,
    batch_size=128,
    model_kwargs={"kl_weight": 1e-3},
)
runner.fit(adata)
adata.obsm["X_scdlkit_vae"] = runner.encode(adata)
```

For the PBMC quickstart, use a light VAE KL term so the latent UMAP preserves broad cell-type structure instead of collapsing into a uniform blob. A healthy result should show broad cell-type groups as visibly separated regions rather than a single mixed cloud.
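To see why the `kl_weight` scale matters, recall the analytic KL term a Gaussian VAE adds to its loss. The sketch below is standard VAE math, not scDLKit internals:

```python
import math

def gaussian_kl(mu: float, log_var: float) -> float:
    """Analytic KL divergence between N(mu, sigma^2) and N(0, 1) for one latent dimension."""
    return 0.5 * (math.exp(log_var) + mu * mu - 1.0 - log_var)

# With kl_weight=1e-3 the regularizer is scaled down a thousandfold,
# so the reconstruction term dominates and the latent space keeps structure.
kl = gaussian_kl(mu=0.5, log_var=0.0)
print(kl)         # 0.125
print(1e-3 * kl)  # the (much smaller) contribution under the quickstart setting
```

A heavier KL weight pulls every posterior toward the standard normal prior, which is exactly the "uniform blob" failure mode described above.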
Then continue with Scanpy:

```python
import scanpy as sc

sc.pp.neighbors(adata, use_rep="X_scdlkit_vae")
sc.tl.umap(adata)
sc.pl.umap(adata, color="louvain")
```

Most researchers should start with the Scanpy PBMC quickstart:
```shell
python -m pip install "scdlkit[tutorials]"
jupyter notebook examples/train_vae_pbmc.ipynb
```

This notebook:
- loads PBMC data through Scanpy
- trains a VAE baseline with scDLKit
- writes the latent representation into `adata.obsm`
- continues with Scanpy neighbors and UMAP
- explains the quickstart versus full tutorial profiles
- works on CPU or GPU through `device="auto"`
Additional Scanpy-first notebooks:
- `examples/compare_models_pbmc.ipynb`: compare `PCA`, `autoencoder`, `vae`, and `transformer_ae`
- `examples/classification_demo.ipynb`: run the `mlp_classifier` baseline and inspect a confusion matrix
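The confusion matrix in the classification demo is just counts over (true, predicted) label pairs; a minimal pure-Python sketch with hypothetical labels, independent of scDLKit's actual reporting:

```python
from collections import Counter

def confusion_matrix(y_true, y_pred, labels):
    """Rows are true labels, columns are predicted labels."""
    counts = Counter(zip(y_true, y_pred))
    return [[counts[(t, p)] for p in labels] for t in labels]

labels = ["B", "T", "NK"]
y_true = ["B", "B", "T", "T", "NK", "NK"]
y_pred = ["B", "T", "T", "T", "NK", "B"]
for row in confusion_matrix(y_true, y_pred, labels):
    print(row)
# Off-diagonal counts show which cell types the classifier confuses.
```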
The synthetic notebook and script are still available, but they are now the smoke-test path rather than the primary researcher onboarding flow:

```shell
python -m pip install "scdlkit[notebook]"
jupyter notebook examples/first_run_synthetic.ipynb
python examples/first_run_synthetic.py
```

These write small reproducible artifacts to `artifacts/first_run_notebook/` and `artifacts/first_run/`.
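The value of the smoke path is seeded reproducibility: the same seed must produce byte-identical artifacts. The pattern, in outline (hypothetical file name and payload, not the scripts' actual output):

```python
import json
import random
import tempfile
from pathlib import Path

def write_smoke_artifact(out_dir: Path, seed: int = 0) -> Path:
    """Generate a small seeded payload and write it so repeat runs are byte-identical."""
    random.seed(seed)
    payload = {"seed": seed, "losses": [round(random.random(), 6) for _ in range(3)]}
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / "smoke_metrics.json"
    path.write_text(json.dumps(payload, sort_keys=True))
    return path

with tempfile.TemporaryDirectory() as d:
    a = write_smoke_artifact(Path(d) / "run1").read_text()
    b = write_smoke_artifact(Path(d) / "run2").read_text()
    print(a == b)  # True: same seed, same artifact
```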
Conda is kept for contributors and demos. It is not the primary public install path.
Official installers:
- Miniconda install guide: https://www.anaconda.com/docs/getting-started/miniconda/install
- Anaconda Distribution download: https://www.anaconda.com/download
From the repo root:

```shell
conda env create -f environment.yml
conda activate scdlkit
```

High-level:

```python
from scdlkit import TaskRunner
```

Lower-level:

```python
from scdlkit import Trainer, create_model, prepare_data
```

Comparison:
```python
from scdlkit import compare_models

benchmark = compare_models(
    adata,
    models=["autoencoder", "vae", "transformer_ae"],
    task="representation",
    shared_kwargs={"epochs": 10, "label_key": "cell_type"},
    output_dir="artifacts/compare",
)
```

Models: `autoencoder`, `vae`, `denoising_autoencoder`, `transformer_ae`, `mlp_classifier`
Tasks: `representation`, `reconstruction`, `classification`
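Downstream of a model comparison, the usual reduction is "best model per metric". The sketch below assumes a dict-of-metrics result shape; scDLKit's actual `compare_models` return type may differ:

```python
# Assumed result shape: {model_name: {metric_name: value}} -- this only
# illustrates the comparison step, not scDLKit's real benchmark object.
def best_model(results: dict, metric: str, higher_is_better: bool = True) -> str:
    scored = {name: metrics[metric] for name, metrics in results.items()}
    pick = max if higher_is_better else min
    return pick(scored, key=scored.get)

results = {
    "autoencoder":    {"silhouette": 0.41, "recon_mse": 0.12},
    "vae":            {"silhouette": 0.48, "recon_mse": 0.15},
    "transformer_ae": {"silhouette": 0.45, "recon_mse": 0.10},
}
print(best_model(results, "silhouette"))                        # vae
print(best_model(results, "recon_mse", higher_is_better=False))  # transformer_ae
```

Note that `higher_is_better` must be set per metric: silhouette scores improve upward, reconstruction error improves downward.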
- Gene-expression baselines for AnnData workflows
- Scanpy-first tutorial and downstream embedding usage
- Built-in deep-learning baselines plus classical comparison context in notebooks
Spatial omics, multimodal workflows, and custom PyTorch model adapters are future work, planned once the gene-expression toolkit's quality gates are consistently stable.
Project documentation is published as a Sphinx-based scientific docs site:
- Docs site: https://uddamvathanak.github.io/scDLKit/
- Tutorials: Scanpy-first notebook walkthroughs rendered in the docs site
- API reference: `docs/api/index.md`
- Example notebooks: `examples/`
The docs workflow expects GitHub Pages to be enabled once at the repository level.
- Open `Settings -> Pages` for this repo: https://github.com/uddamvathanak/scDLKit/settings/pages
- Under `Build and deployment`, set `Source` to `GitHub Actions`.
- Save the setting.
- Re-run the `docs` workflow.
Without that one-time setting, GitHub returns a 404 when actions/configure-pages or actions/deploy-pages tries to access the Pages site.
If you want the workflow to bootstrap Pages automatically instead of doing the one-time manual setup:
- Create a repository secret named `PAGES_ENABLEMENT_TOKEN`.
- Use a Personal Access Token with `repo` scope or Pages write permission.
- Re-run the `docs` workflow.
- Stage to TestPyPI first with `release-testpypi.yml`.
- Publish the final release from a `v*` tag with `release.yml`.
- Use trusted publishing instead of long-lived PyPI API tokens.
- See `RELEASING.md` for the full checklist.
- `examples/train_vae_pbmc.ipynb` is the primary Scanpy-first notebook tutorial.
- `examples/compare_models_pbmc.ipynb` compares `autoencoder`, `vae`, and `transformer_ae` on PBMC data.
- `examples/classification_demo.ipynb` covers the `mlp_classifier` workflow and confusion-matrix reporting.
- `examples/first_run_synthetic.ipynb` is the secondary smoke-test notebook with minimal setup.
- `examples/first_run_synthetic.py` is the secondary smoke-test script.
Immediate roadmap target:
- quality-only hardening toward the next patch release
- longer notebook tutorials with quickstart and full profiles
- explicit toolkit-quality benchmarking on small Scanpy built-ins
- internal release gates for latent quality, classification quality, and seed stability
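A seed-stability gate can be phrased as "same seed in, identical metrics out". A minimal sketch with a stand-in training function (the real gate would call scDLKit's training loop instead of `fake_train`):

```python
import random

def fake_train(seed: int) -> list:
    """Stand-in for a seeded training run; returns a short deterministic loss curve."""
    rng = random.Random(seed)
    loss = 1.0
    curve = []
    for _ in range(5):
        loss *= 0.9 + 0.05 * rng.random()
        curve.append(round(loss, 8))
    return curve

def seed_stable(train, seed: int = 42) -> bool:
    """Gate passes only if two runs with the same seed match exactly."""
    return train(seed) == train(seed)

print(seed_stable(fake_train))  # True
```

For deep-learning runs the real gate is stricter than this sketch suggests: it also requires pinning dataloader workers and any nondeterministic GPU kernels.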
Released so far:
v0.1
- Expanded core workflow with training, evaluation, reporting, and plotting.
- Staged TestPyPI and PyPI publishing.
- Cross-platform smoke validation and reproducible notebooks.
Later:
- adapter-based custom PyTorch model support
- deeper downstream tutorials
- spatial baselines only after the gene-expression toolkit is stable
If you use scDLKit, cite the software entry in CITATION.cff.