embpy

embpy is a Python package for generating embeddings of biological perturbations and cell lines using 60+ foundation models through a unified interface.

Given a perturbation (genetic or chemical) and/or single-cell expression data, embpy resolves the underlying biological sequences, routes them to the appropriate foundation models, and returns dense vector representations ready for downstream machine learning.

Workflow

Architecture

Mermaid diagram (click to expand)

flowchart TD
    subgraph inputs [Input Modalities]
        direction TB
        GeneInput["Genetic Perturbations\nGene Symbol | Ensembl ID\nDNA Sequence | Exons/Introns"]
        ProtInput["Protein Targets\nUniProt ID | Canonical\nIsoforms"]
        MolInput["Chemical Perturbations\nSMILES | InChI\nDrug Name | PubChem CID"]
        CellInput["Single-Cell Data\nAnnData | Raw Counts\nLog-normalized"]
    end

    subgraph resolvers [Sequence Resolution]
        direction TB
        GeneRes["GeneResolver\nEnsembl REST | pyensembl\nMyGene.info"]
        ProtRes["ProteinResolver\nUniProt REST\nMyGene.info"]
        DrugRes["DrugResolver\nPubChem | NIH Cactus\nCIRpy"]
    end

    subgraph annotators [Annotation Sources]
        direction TB
        MolAnn["MoleculeAnnotator\nRDKit | ChEMBL | ChEBI\nKEGG | PubChem | UniChem"]
        GeneAnn["GeneAnnotator\nMyGene | GTEx | STRING-DB\nOpen Targets | GWAS Catalog\nDoRothEA"]
        ProtAnn["ProteinAnnotator\nUniProt JSON | InterPro\nGO Terms"]
    end

    subgraph dna_models [DNA Models]
        Enformer["Enformer 250M"]
        Borzoi["Borzoi v0-v3 200M"]
        Flashzoi["Flashzoi v0-v3 200M"]
        Evo["Evo 1/1.5/2 7B-40B"]
        NT["Nucleotide Transformer\nv1/v2/v3 50M-2.5B"]
        HyenaDNA["HyenaDNA 1.6M-6.6M"]
        GENALM["GENA-LM 110M-336M"]
        Caduceus2["Caduceus 16M"]
    end

    subgraph prot_models [Protein Models]
        ESM1["ESM-1b/1v 650M"]
        ESM2["ESM-2 8M-15B"]
        ESMC["ESM-C 300M-6B"]
        ESM3M["ESM3 1.4B-98B"]
        ProtT5["ProtT5 3B"]
    end

    subgraph mol_models [Molecule Models]
        ChemBERTa["ChemBERTa 77M-100M"]
        MolFormer["MolFormer XL"]
        RDKitFP["RDKit Fingerprints\nMorgan | MACCS | Torsion"]
        MiniMolM["MiniMol 10M"]
        MHGGNN["MHG-GNN"]
        MolEM["MolE"]
    end

    subgraph sc_models [Single-Cell Models]
        scGPTM["scGPT 51M"]
        GeneformerM["Geneformer v1/v2\n10M-316M"]
        UCEM["UCE 1.3B"]
        TFM["TranscriptFormer\n368M-542M"]
        TahoeM["Tahoe-x1 70M-3B"]
        C2SM["Cell2Sentence 2B-27B"]
        PCAM["PCA"]
        scVIM["scVI | scANVI | totalVI"]
    end

    subgraph strategies [Embedding Strategies]
        StdPool["Standard: mean | max | cls"]
        TPMWeight["TPM-Weighted Isoform Average"]
        AnnWeight["Annotation-Weighted\nResidue Pooling"]
        ExprCtx["Expression-Context\nConcatenation"]
        RegionEmb["Region-Specific:\nfull | exons | introns"]
    end

    subgraph tools [Analysis Tools]
        direction TB
        Preproc["Preprocessing\nNormalize | Log1p | HVG | Scale\nCPU or GPU via rapids"]
        Sim["Similarity\nCosine | Pearson | Spearman\nWasserstein"]
        Cluster["Clustering\nLeiden | K-means | Spectral"]
        DimRed["Dim Reduction\nUMAP | t-SNE | PCA"]
        Bench["Benchmarking\nKNN Overlap | Ranking\nMetrics"]
    end

    subgraph outputBlock [Output]
        ObsmOut[".obsm embeddings"]
        ObsOut[".obs annotations"]
        UnsOut[".uns metadata"]
        NpzOut[".npz matrices"]
    end

    GeneInput --> GeneRes
    GeneInput --> ProtRes
    MolInput --> DrugRes
    CellInput --> sc_models

    GeneRes --> dna_models
    ProtRes --> prot_models
    DrugRes --> mol_models

    GeneInput --> GeneAnn
    MolInput --> MolAnn
    GeneInput --> ProtAnn

    dna_models --> strategies
    prot_models --> strategies
    mol_models --> strategies
    sc_models --> strategies

    strategies --> outputBlock
    annotators --> ObsOut
    outputBlock --> tools

Key Features

60+ foundation models across DNA, protein, molecule, single-cell, and text modalities
Unified BioEmbedder interface -- one class to access all models with automatic sequence resolution
embed_adata() -- embed cells and perturbations together in a single call
Weighted protein embeddings -- TPM-weighted isoform averaging, annotation-weighted residue pooling, expression-context concatenation
Multi-source annotation -- MoleculeAnnotator (RDKit, ChEMBL, ChEBI, KEGG, PubChem), GeneAnnotator (MyGene, GTEx, STRING-DB, Open Targets, GWAS Catalog), ProteinAnnotator (UniProt functional metadata, InterPro domains)
GPU acceleration via rapids_singlecell for preprocessing, PCA, UMAP, neighbors, and Leiden
Batch processing with SLURM array job scripts for full-genome embedding
scverse integration -- AnnData-native throughout, compatible with scanpy/scvi-tools/pertpy

Quick Start

Embed a gene with a DNA model

from embpy.embedder import BioEmbedder

embedder = BioEmbedder(device="auto")

# DNA embedding (resolves gene -> genomic sequence -> model)
emb = embedder.embed_gene("TP53", model="enformer_human_rough", pooling_strategy="mean")
print(emb.shape)  # (3072,)

Embed a protein with ESM-2

# Protein embedding (resolves gene -> UniProt sequence -> model)
emb = embedder.embed_gene("TP53", model="esm2_650M", pooling_strategy="mean")
print(emb.shape)  # (1280,)

# All isoforms
isoforms = embedder.embed_protein("TP53", model="esm2_650M", isoform="all")
for iso_id, emb in isoforms.items():
    print(f"  {iso_id}: {emb.shape}")

Embed a small molecule

emb = embedder.embed_molecule("CC(=O)OC1=CC=CC=C1C(=O)O", model="chemberta2MTR")
print(emb.shape)  # (768,)

Embed cells from an AnnData

import anndata as ad

adata = ad.read_h5ad("perturbseq.h5ad")

result = embedder.embed_cells(
    adata,
    models=["pca", "scvi", "scgpt"],
    preprocessing="standard",
    n_pca_components=50,
    n_latent=30,
)
# result.obsm["X_pca"]   -> (n_cells, 50)
# result.obsm["X_scvi"]  -> (n_cells, 30)
# result.obsm["X_scgpt"] -> (n_cells, 512)

Combined cell + perturbation embedding

result = embedder.embed_adata(
    adata,
    cell_models=["pca", "scgpt"],
    perturbation_models=["esm2_650M"],
    perturbation_column="perturbation",
    perturbation_type="auto",
)
# Cell embeddings + perturbation embeddings side by side in .obsm

Weighted perturbation embedding

from embpy.tl import WeightedProteinEmbedder

wpe = WeightedProteinEmbedder(embedder)

# TPM-weighted isoform average
emb = wpe.embed_perturbation(
    "TP53", model="esm2_650M", strategy="tpm_weighted",
    tpm_values={"P04637": 45.2, "P04637-2": 12.8},
)

# Annotation-weighted: active/binding sites get 3x weight
emb = wpe.embed_perturbation(
    "TP53", model="esm2_650M", strategy="annotation_weighted",
    site_boost=3.0,
)

Annotate perturbations

from embpy.tl import annotate_molecules, annotate_gene_perturbations, annotate_proteins

# Molecule annotations (physicochemical, bioactivities, pathways, diseases)
adata = annotate_molecules(adata, column="drug_name")

# Gene annotations (pathways, tissue expression, PPI, diseases)
adata = annotate_gene_perturbations(adata, column="gene")

# Protein annotations (UniProt function, domains, PTMs, GO terms)
adata = annotate_proteins(adata, column="gene")

Available Models

DNA Models

Model	Key	Parameters
Enformer	`enformer_human_rough`	250M
Borzoi (4 replicates)	`borzoi_v0` -- `borzoi_v3`	200M
Borzoi Mouse	`borzoi_v0_mouse` -- `borzoi_v3_mouse`	200M
Flashzoi (4 replicates)	`flashzoi_v0` -- `flashzoi_v3`	200M
Evo 1 / 1.5	`evo1_8k`, `evo1_131k`, `evo1.5_8k`	7B
Evo 2	`evo2_7b`, `evo2_40b`	7B / 40B
Nucleotide Transformer v1/v2	`nt_500m_human_ref`, `nt_v2_500m`, ...	50M -- 2.5B
Nucleotide Transformer v3	`ntv3_100m_pre`, `ntv3_650m_pos`, ...	8M -- 650M
HyenaDNA	`hyenadna_tiny_1k` -- `hyenadna_large_1m`	1.6M -- 6.6M
GENA-LM	`gena_lm_bert_base`, `gena_lm_bert_large`, ...	110M -- 336M
Caduceus	`caduceus_ph_131k`, `caduceus_ps_131k`	16M

Protein Models

Model	Key	Parameters
ESM-1b	`esm1b`	650M
ESM-1v (5 seeds)	`esm1v_1` -- `esm1v_5`	650M
ESM-2	`esm2_8M` -- `esm2_15B`	8M -- 15B
ESM-C	`esmc_300m`, `esmc_600m`, `esmc_6b`	300M -- 6B
ESM3	`esm3_small`, `esm3_medium`, `esm3_large`	1.4B -- 98B
ProtT5	`prot_t5_xl`, `prot_t5_xl_half`	3B

Molecule Models

Model	Key	Type
ChemBERTa	`chemberta2MTR`, `chemberta2MLM`	Transformer
MolFormer	`molformer_base`	Transformer
RDKit Fingerprints	`rdkit_fp`, `morgan_fp`, `maccs_fp`, ...	Classical
MiniMol	`minimol`	GNN
MHG-GNN	`mhg_gnn`	Hypergraph GNN
MolE	`mole`	Graph Transformer

Single-Cell Foundation Models

Model	Key	Parameters
scGPT	`scgpt`	51M
Geneformer v1/v2	`geneformer_v1_6L` -- `geneformer_v2_18L`	10M -- 316M
UCE	`uce`	1.3B
TranscriptFormer	`transcriptformer_metazoa`, `transcriptformer_sapiens`	368M -- 542M
Tahoe-x1	`tahoe_70m`, `tahoe_1b`, `tahoe_3b`	70M -- 3B
Cell2Sentence-Scale	`cell2sentence_2b`, `cell2sentence_27b`	2B -- 27B
PCA	`pca`	--
scVI / scANVI / totalVI	`scvi`, `scanvi`, `totalvi`	--

Text Models

Model	Key
MiniLM	`minilm_l6_v2`
BERT	`bert_base_uncased`

Installation

Requires Python 3.11+. Choose the method that best fits your workflow.

Option 1: Mamba / Conda (recommended for HPC)

The fastest way to get a fully working environment with GPU support, RDKit, and all compiled dependencies resolved automatically.

# GPU environment (CUDA 12.4)
git clone https://github.com/theislab/embpy.git
cd embpy
mamba env create -f environment.yml
mamba activate embpy

# CPU-only environment
mamba env create -f environment-cpu.yml
mamba activate embpy-cpu

The environment files install PyTorch, RDKit, and the core scientific stack via conda-forge, then install embpy and its Python-only dependencies via pip.

To add optional extras after activation:

pip install ".[evo2,helical]"    # add Evo2 + single-cell models
pip install ".[all]"             # everything

Option 2: uv (fastest pip alternative)

uv is a drop-in pip replacement that resolves and installs packages 10-100x faster.

# Install uv if you don't have it
pip install uv

# CPU install
uv pip install git+https://github.com/theislab/embpy.git@main

# GPU install (CUDA 12.4)
uv pip install "embpy[torch-cu124]" --extra-index-url https://download.pytorch.org/whl/cu124

# Full GPU install with all extras
uv pip install "embpy[all-cu124]" --extra-index-url https://download.pytorch.org/whl/cu124

# Development install (editable)
git clone https://github.com/theislab/embpy.git
cd embpy
uv pip install -e ".[dev,test]"

Option 3: pip

# Base install (CPU) -- includes all HuggingFace models
pip install git+https://github.com/theislab/embpy.git@main

# GPU install -- pick your CUDA version
pip install "embpy[torch-cu121]" --extra-index-url https://download.pytorch.org/whl/cu121
pip install "embpy[torch-cu124]" --extra-index-url https://download.pytorch.org/whl/cu124
pip install "embpy[torch-cu128]" --extra-index-url https://download.pytorch.org/whl/cu128

# Full GPU install (CUDA 12.4)
pip install "embpy[all-cu124]" --extra-index-url https://download.pytorch.org/whl/cu124

Optional extras

Install only what you need:

Extra	What it enables
(base)	GENA-LM, NT v1/v2/v3, HyenaDNA, ESM-2/C, ProtT5, ChemBERTa, MolFormer, RDKit
`caduceus`	Caduceus (SSM/Mamba DNA model, requires CUDA)
`evo`	Evo v1/v1.5
`evo2`	Evo 2
`helical`	Single-cell foundation models (scGPT, Geneformer, UCE, TranscriptFormer, Tahoe, Cell2Sentence)
`ppi`	PPI GNN encoder
`pertpy`	pertpy metadata annotation
`scanpy`	scanpy integration
`all`	Everything above

# Mix and match
pip install "embpy[torch-cu124,helical,evo2,pertpy]" \
  --extra-index-url https://download.pytorch.org/whl/cu124

Verifying the installation

from embpy.embedder import BioEmbedder

embedder = BioEmbedder(device="auto")
print(f"Device: {embedder.device}")
print(f"Models: {len(embedder.list_available_models())} available")

Tutorials

#	Topic	Notebook
01	Identifiers and Preprocessing	01_identifiers_and_preprocessing.ipynb
02	Gene (DNA) Embeddings	02_gene_embeddings.ipynb
03	Protein Embeddings + Isoforms	03_protein_embeddings.ipynb
04	Molecule Embeddings	04_molecule_embeddings.ipynb
05	Text Embeddings	05_text_embeddings.ipynb
06	PPI Embeddings	06_ppi_embeddings.ipynb
07	Combined Analysis	07_combined_analysis.ipynb
08	Visualization and Analysis	08_visualization_and_analysis.ipynb
09	Embedding Benchmark	09_embedding_benchmark.ipynb
10	DNA Embeddings (Advanced)	10_dna_embeddings.ipynb
11	DepMap Analysis	11_depmap_analysis.ipynb
12	Single-Cell Foundation Models	12_singlecell_foundation_models.ipynb
13	JUMP Cell Painting	13_jump_cell_painting.ipynb
14	Unified Embedding (embed_adata)	14_unified_embedding.ipynb
15	Molecule Annotation	15_molecule_annotation.ipynb
16	Gene/Protein Annotation + Weighted Embeddings	16_gene_protein_annotation.ipynb

Package Structure

embpy/
    embedder.py          # BioEmbedder -- unified embedding interface
    models/
        dna_models.py    # Enformer, Borzoi, Evo, NT, HyenaDNA, Caduceus, GENA-LM
        protein_models.py # ESM-2, ESM-C, ESM3, ProtT5
        molecule_models.py # ChemBERTa, MolFormer, RDKit, MiniMol, MHG-GNN, MolE
        singlecell_models.py # scGPT, Geneformer, UCE, PCA, scVI
    resources/
        gene_resolver.py      # Gene symbol <-> Ensembl ID <-> DNA sequence
        protein_resolver.py   # Gene -> UniProt canonical/isoform sequences
        molecule_annotator.py # Small molecule annotations (6 sources)
        gene_annotator.py     # Gene annotations (pathways, expression, PPI, diseases)
        protein_annotator.py  # Protein annotations (UniProt, InterPro)
        drug_resolver.py      # Drug name <-> SMILES resolution
    pp/
        sc_preprocessing.py   # Single-cell preprocessing (raw/standard pipelines)
        basic.py              # Perturbation embedding matrix construction
    tl/
        similarity.py         # Cosine/Pearson/Spearman similarity, KNN overlap
        dimred.py             # UMAP, t-SNE (CPU/GPU)
        clustering.py         # Leiden, k-means, spectral (CPU/GPU)
        weighted_protein_embedding.py # TPM-weighted, annotation-weighted, expression-context
        metrics.py            # Benchmarking metrics
        pipeline.py           # Automated evaluation pipelines
        metadata.py           # pertpy-based metadata annotation

Release Notes

See the changelog.

Contact

For questions and help requests, you can reach out in the scverse discourse. If you found a bug, please use the issue tracker.

Citation

t.b.a

Name		Name	Last commit message	Last commit date
Latest commit History 103 Commits
.github		.github
.vscode		.vscode
docs		docs
err		err
src/embpy		src/embpy
submission_scripts		submission_scripts
tests		tests
.codecov.yaml		.codecov.yaml
.cruft.json		.cruft.json
.editorconfig		.editorconfig
.gitignore		.gitignore
.pre-commit-config.yaml		.pre-commit-config.yaml
.prettierrc.yaml		.prettierrc.yaml
.readthedocs.yaml		.readthedocs.yaml
CHANGELOG.md		CHANGELOG.md
LICENSE		LICENSE
README.md		README.md
embedding_job.log		embedding_job.log
environment-cpu.yml		environment-cpu.yml
environment.yml		environment.yml
pyproject.toml		pyproject.toml

Folders and files

Latest commit

History

Repository files navigation

embpy

Workflow

Architecture

Key Features

Quick Start

Embed a gene with a DNA model

Embed a protein with ESM-2

Embed a small molecule

Embed cells from an AnnData

Combined cell + perturbation embedding

Weighted perturbation embedding

Annotate perturbations

Available Models

DNA Models

Protein Models

Molecule Models

Single-Cell Foundation Models

Text Models

Installation

Option 1: Mamba / Conda (recommended for HPC)

Option 2: uv (fastest pip alternative)

Option 3: pip

Optional extras

Verifying the installation

Tutorials

Package Structure

Release Notes

Contact

Citation

About

Resources

License

Contributing

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages