# Train an scVI model using Census data

This notebook demonstrates a scalable approach to training an [scVI](https://docs.scvi-tools.org/en/latest/user_guide/models/scvi.html) model on Census data. The [scvi-tools](https://scvi-tools.org/) library is built around [PyTorch Lightning](https://lightning.ai/docs/pytorch/stable/). [TileDB-SOMA-ML](https://github.com/single-cell-data/TileDB-SOMA-ML) assists with streaming Census query results to PyTorch in batches, allowing for training datasets larger than available RAM.
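
To make the streaming idea concrete, here is a minimal, library-free sketch of how a loader can feed mini-batches to a training loop while holding only one chunk of rows in memory at a time. It is a conceptual illustration only and does not use the TileDB-SOMA-ML API; `iter_minibatches` and `fake_load_chunk` are hypothetical names, and the chunk and batch sizes are arbitrary.

```python
from typing import Callable, Iterator, List

Row = List[float]


def iter_minibatches(
    n_rows: int,
    chunk_size: int,
    batch_size: int,
    load_chunk: Callable[[int, int], List[Row]],
) -> Iterator[List[Row]]:
    """Yield mini-batches while keeping at most one chunk in memory."""
    for start in range(0, n_rows, chunk_size):
        stop = min(start + chunk_size, n_rows)
        chunk = load_chunk(start, stop)  # only rows [start, stop) are materialized
        for i in range(0, len(chunk), batch_size):
            yield chunk[i : i + batch_size]


def fake_load_chunk(start: int, stop: int) -> List[Row]:
    """Hypothetical stand-in for a remote query: row r is [r, r, r]."""
    return [[float(r)] * 3 for r in range(start, stop)]


# Stream 10 toy rows in chunks of 4, split into batches of up to 3 rows.
batch_sizes = [
    len(batch)
    for batch in iter_minibatches(
        n_rows=10, chunk_size=4, batch_size=3, load_chunk=fake_load_chunk
    )
]
print(batch_sizes)
```

In this sketch a batch never spans two chunks, so the last batch of each chunk can be smaller than `batch_size`. Peak memory is bounded by `chunk_size` rather than by the size of the full dataset.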

## Contents

1. Training the model
2. Generate cell embeddings
3. Analyzing the results

## Training the model 

Let's start by importing the necessary dependencies and writing a local SOMA subset of the Census (mouse pancreas and kidney cells) to disk.


In [7]:
# build_subset.ipynb  (run once)
from pathlib import Path

import cellxgene_census as cxc
from cellxgene_census.experimental.pp import get_highly_variable_genes
from tiledbsoma.io import from_anndata

# --- parameters ------------------------------------------------------------
CENSUS_VERSION = "2025-01-30"
OBS_FILTER = (
    "is_primary_data == True and "
    "tissue_general in ['pancreas', 'kidney'] and "
    "nnz >= 300"
)
SUBSET_URI = Path("/home/ec2-user/mm_pan_kidney_subset_soma").as_posix()
# ---------------------------------------------------------------------------

with cxc.open_soma(census_version=CENSUS_VERSION) as census:  # remote read
    # Rank genes by variability across the filtered cells and flag the top 8,000.
    hvgs_df = get_highly_variable_genes(
        census=census,
        organism="mus_musculus",
        obs_value_filter=OBS_FILTER,
        n_top_genes=8000,
    )
    hvg_idx = hvgs_df.query("highly_variable").index  # pandas Index (var soma_joinid values)

    # Download the raw counts for every cell matching the filter.
    adata = cxc.get_anndata(
        census=census,
        organism="mus_musculus",
        measurement_name="RNA",
        obs_value_filter=OBS_FILTER,
        # var_coords=hvg_idx.to_list(),  # optional: fetch only the highly variable genes
        X_name="raw",
    )

# Write the AnnData object to a local SOMA experiment (about 1 GB on disk).
from_anndata(SUBSET_URI, anndata=adata, measurement_name="RNA")
print("✅  Local SOMA written to", SUBSET_URI)


✅  Local SOMA written to /home/ec2-user/mm_pan_kidney_subset_soma
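
The `OBS_FILTER` string used above is an ordinary value-filter expression, so it can be assembled from parameters when you want to vary the tissues or the `nnz` threshold. The helper below is a hedged sketch: `build_obs_filter` is a hypothetical name, not part of `cellxgene_census`, and it only covers the three clauses used in this notebook.

```python
from typing import Sequence


def build_obs_filter(
    tissues: Sequence[str], min_nnz: int, primary_only: bool = True
) -> str:
    """Assemble an obs value filter like the one used in this notebook."""
    clauses = []
    if primary_only:
        # Keep primary data only, so duplicated cells are excluded.
        clauses.append("is_primary_data == True")
    # repr() of a list of plain strings renders as ['a', 'b'].
    clauses.append(f"tissue_general in {list(tissues)!r}")
    clauses.append(f"nnz >= {min_nnz}")
    return " and ".join(clauses)


obs_filter = build_obs_filter(["pancreas", "kidney"], min_nnz=300)
print(obs_filter)
```

Called with these arguments, the helper reproduces the `OBS_FILTER` string from the cell above. It assumes tissue names contain no quote characters.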


We'll now prepare the necessary parameters for running a training pass of the model.

For this notebook, we'll use a stable version of the Census: