# Quick Start: AIDO.Cell

**Estimated time to complete**: under 10 minutes (A100 GPU system)

**Google Colab Note:** This notebook requires an A100 GPU, which is only included with the paid Google Colab Pro or Enterprise services.
Alternatively, a "pay as you go" option is available to purchase premium GPUs. See [Colab Service Plans](https://colab.research.google.com/signup?utm_source=notebook_settings&utm_medium=link&utm_campaign=premium_gpu_selector) for details.

## Learning Goals

*   Install ModelGenerator, a plug-and-play framework for using AIDO.Cell models
*   Download a single-cell RNA dataset from the Gene Expression Omnibus (GEO) repository
*   Preprocess data
*   Generate embeddings using the pre-trained AIDO.Cell-3M model

## Pre-requisites

*   A100 GPU or equivalent
*   Python 3.10 or Python 3.11
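
The Python requirement above can be checked before installing anything. The following is a minimal sketch using only the standard library; the helper name `python_is_supported` is hypothetical and not part of ModelGenerator.

```python
import sys

# Minor versions listed in the prerequisites above.
SUPPORTED_VERSIONS = {(3, 10), (3, 11)}


def python_is_supported(version_info=sys.version_info):
    """Return True if the interpreter is Python 3.10 or 3.11."""
    return (version_info[0], version_info[1]) in SUPPORTED_VERSIONS


if not python_is_supported():
    major, minor = sys.version_info[0], sys.version_info[1]
    print(f"Warning: Python {major}.{minor} is not among the listed prerequisites.")
```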

## Introduction

### Model
The AIDO.Cell models are a family of scalable transformer-based models that were trained on 50 million cells spanning a diverse set of human tissues and organs. The models aim to learn accurate and general representations of the human cell's entire transcriptional context and can be used for various tasks including zero-shot clustering, cell type classification, and perturbation modeling. This quickstart uses AIDO.Cell-3M, the smallest variant of the AIDO.Cell models, to embed single-cell RNA data.

AIDO.Cell was designed for use with the ModelGenerator CLI. It is strongly recommended to use ModelGenerator for running AIDO.Cell models. For more information, check out:

*   [Using ModelGenerator to finetune AIDO.Cell](https://github.com/genbio-ai/ModelGenerator/blob/6ad2e776749e506525d5a4c3d8ef0dfdb87d2664/experiments/AIDO.Cell/tutorial_cell_classification.ipynb)
*   [ModelGenerator Docs](https://genbio-ai.github.io/ModelGenerator/)


### Example Dataset
The GEO dataset used in this quickstart includes single-cell RNA data obtained from colon biopsies collected from patients with ulcerative colitis (UC) and Crohn's disease (CD). The dataset also includes samples from a healthy control (HC).

## Setup

The steps below install ModelGenerator and its dependencies, then download the example dataset and model checkpoint. It may take a few minutes to download all the files.


### Setup Google Colab

To run this quickstart using Google Colab, you will need to choose the 'A100' GPU runtime from the "Connect" dropdown menu in the upper-right corner of this notebook. Note that this runtime configuration is not available in the free Colab version. To access premium GPUs, you will need to purchase additional compute units. The current quickstart was tested in Colab Enterprise using the following runtime configuration:

*   Machine type: a2-highgpu-1g
*   GPU type: NVIDIA_TESLA_A100 x 1
*   Data disk type: 100 GB Standard Disk (pd-standard)


### Setup Local Environment

ModelGenerator is an open-source, plug-and-play software stack for running AIDO.Cell models. It automatically interfaces with Hugging Face and allows easy one-command embedding and adaptation of the models for a wide variety of fine-tuning tasks. To run ModelGenerator, the GPU must be Ampere-generation or later (e.g., A100, H100) to support flash attention.
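
One way to check the GPU requirement is to look at the CUDA compute capability, since Ampere GPUs such as the A100 report 8.0 and later generations report higher values (the H100 reports 9.0). The sketch below is an illustration, not part of ModelGenerator: the helper names are hypothetical, and it assumes PyTorch may or may not be installed yet.

```python
def is_ampere_or_later(major, minor=0):
    """Ampere and newer GPUs report CUDA compute capability 8.0 or higher."""
    return (major, minor) >= (8, 0)


def current_gpu_capability():
    """Return (major, minor) for GPU 0, or None if PyTorch or CUDA is unavailable."""
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    return torch.cuda.get_device_capability(0)


capability = current_gpu_capability()
if capability is None:
    print("No CUDA GPU detected (or PyTorch is not installed).")
elif is_ampere_or_later(*capability):
    print(f"Compute capability {capability} is Ampere-generation or later.")
else:
    print(f"Compute capability {capability} predates Ampere.")
```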

### Step 1: Install ModelGenerator and required dependencies

In [None]:
!git clone https://github.com/genbio-ai/ModelGenerator.git
%cd ModelGenerator
!pip install -e ".[flash_attn]"
!pip install -r requirements.txt
%cd experiments/AIDO.Cell

In [None]:
# Restart the session after installing

# Then navigate back to the AIDO.Cell directory
%cd ModelGenerator/experiments/AIDO.Cell

### Step 2: Download example dataset from GEO and load into anndata

In [None]:
%%bash
mkdir -p data
cd data
wget -nv -O GSE214695.tar 'http://www.ncbi.nlm.nih.gov/geo/download/?acc=GSE214695&format=file'
tar -xvf GSE214695.tar
cd ..
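
The archive contains files for several samples, and the loading step below selects one of them by filename prefix. As a rough sketch, the available prefixes could be listed as follows. This assumes each sample's matrix file ends in `matrix.mtx.gz` or `matrix.mtx`, which is a guess about the archive layout; the helper `sample_prefixes` is hypothetical.

```python
from pathlib import Path

# Assumption: each sample's matrix file ends in one of these 10x-style names.
MATRIX_SUFFIXES = ("matrix.mtx.gz", "matrix.mtx")


def sample_prefixes(filenames):
    """Return sorted, de-duplicated prefixes of files that look like 10x matrices."""
    prefixes = set()
    for name in filenames:
        for suffix in MATRIX_SUFFIXES:
            if name.endswith(suffix):
                prefixes.add(name[: -len(suffix)])
                break
    return sorted(prefixes)


data_dir = Path("data")
if data_dir.is_dir():
    # Print one candidate prefix per sample found in the download directory.
    for prefix in sample_prefixes(p.name for p in data_dir.iterdir()):
        print(prefix)
```

Any printed prefix could then be passed as the `prefix` argument when reading a sample.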

In [None]:
import anndata as ad
import scanpy as sc

adata = sc.read_10x_mtx('data', prefix='GSM6614348_HC-1_')
sc.pp.filter_cells(adata, min_genes=500)
sc.pp.filter_genes(adata, min_cells=3)
# No further normalization is needed; AIDO.Cell uses raw counts
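
To make the two filters concrete, here is a small pure-Python sketch of the same idea on a toy matrix: first drop cells that express too few genes, then drop genes expressed in too few of the remaining cells. It is an illustration only, not scanpy's implementation, and the function name `filter_counts` is hypothetical.

```python
def filter_counts(counts, min_genes, min_cells):
    """Apply both thresholds, in order, to a cells x genes list of count rows.

    Returns (kept_cell_indices, kept_gene_indices).
    """
    # Keep cells that express (count > 0) at least `min_genes` genes.
    kept_cells = [
        i for i, row in enumerate(counts)
        if sum(1 for c in row if c > 0) >= min_genes
    ]
    # Among surviving cells, keep genes expressed in at least `min_cells` cells.
    n_genes = len(counts[0]) if counts else 0
    kept_genes = [
        j for j in range(n_genes)
        if sum(1 for i in kept_cells if counts[i][j] > 0) >= min_cells
    ]
    return kept_cells, kept_genes


# Toy matrix: 4 cells x 3 genes, with far smaller thresholds than the notebook uses.
toy = [
    [5, 0, 1],
    [0, 0, 0],
    [2, 3, 0],
    [1, 0, 4],
]
cells, genes = filter_counts(toy, min_genes=2, min_cells=2)
print(cells, genes)  # [0, 2, 3] [0, 2]
```

The empty second cell is removed by the cell filter, and the middle gene is removed because only one remaining cell expresses it.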

### Step 3: Preprocess the anndata for AIDO.Cell

In [None]:
import cell_utils
aligned_adata, attention_mask = cell_utils.align_adata(adata)

###########  Aligning data to AIDO.Cell  ###########
AIDO.Cell was pretrained on a fixed set of 19264 genes.
Aligning your data to the AIDO.Cell gene set...
2428 in your data that cannot be used by AIDO.Cell. Removing these.
['A1BG-AS1' 'A2M-AS1' 'AAED1' ... 'ZNRD1' 'ZNRF3-AS1' 'ZSCAN16-AS1']
5837 genes in the AIDO.Cell pretraining set missing in your data.
AIDO.Cell is trained with zero-masking. Setting these to zero for AIDO.Cell to ignore.
['A2ML1' 'A3GALT2' 'A4GNT' ... 'ZSWIM5' 'ZYG11A' 'ZZZ3']
13427 non-zero genes remaining.
Reordering genes to match AIDO.Cell gene ordering
Gathering attention mask for nonzero genes
####################  Finished  ####################
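
The log above describes four operations: dropping genes outside the pretraining set, zero-filling pretraining genes that are missing from the data, reordering to the model's gene order, and building a mask. The sketch below illustrates that idea on toy data in pure Python. It is not the `cell_utils.align_adata` implementation; the function name, gene names, and the choice to mark "present in the input" in the mask are assumptions based on one reading of the log.

```python
def align_to_reference(counts, genes, reference_genes):
    """Align a cells x genes count matrix (list of rows) to a fixed gene order.

    Returns (aligned_rows, mask), where mask[j] is 1 if reference gene j was
    present in the input and 0 if its column was zero-filled.
    """
    column_of = {gene: j for j, gene in enumerate(genes)}
    mask = [1 if gene in column_of else 0 for gene in reference_genes]
    aligned_rows = [
        [row[column_of[gene]] if gene in column_of else 0 for gene in reference_genes]
        for row in counts
    ]
    return aligned_rows, mask


# Hypothetical toy data: gene names are placeholders, not the real gene set.
reference = ["G1", "G2", "G3", "G4"]
genes = ["G3", "X9", "G1"]  # "X9" is not in the reference, so it is dropped
counts = [[7, 1, 2], [0, 4, 5]]

aligned, mask = align_to_reference(counts, genes, reference)
print(aligned)  # [[2, 0, 7, 0], [5, 0, 0, 0]]
print(mask)     # [1, 0, 1, 0]
```

The numbers in the log are consistent with this picture: 19264 pretraining genes minus 5837 missing genes leaves the 13427 genes reported as remaining.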


### Step 4: Generate AIDO.Cell embeddings

In [None]:
# Embed
import anndata as ad
import numpy as np
import torch
import sys
from modelgenerator.tasks import Embed

# The following is equivalent to the ModelGenerator CLI command:
# mgen predict --model Embed --model.backbone aido_cell_3m \
#   --data CellClassificationDataModule --data.test_split_files <your_anndata>.h5ad

# If not using mgen, this should be configured manually.
device = 'cuda'
batch_size = 2

model = Embed.from_config({
        "model.backbone": "aido_cell_3m",
        "model.batch_size": batch_size
    }).eval()
model = model.to(device).to(torch.bfloat16)

# All data must be in bfloat16
batch_np = aligned_adata[:batch_size].X.toarray()
batch_tensor = torch.from_numpy(batch_np).to(torch.bfloat16).to(device)
# Call transform and embed.
batch_transformed = model.transform({'sequences': batch_tensor})
embs = model(batch_transformed)

# Full Embeddings
print('FULL EMBEDDING')
print('(batch_size, genes, embedding_dim)')
print(embs.shape)
print(embs)
print('-------------------------------------')

# Non-Zero Genes Embeddings
print('NON-ZERO GENES EMBEDDING')
embs = embs[:, attention_mask.astype(bool), :]
print('(batch_size, genes, embedding_dim)')
print(embs.shape)
print(embs)


FULL EMBEDDING
(batch_size, genes, embedding_dim)
torch.Size([2, 19264, 128])
tensor([[[-2.0469,  0.4199, -1.6719,  ..., -0.9258,  0.3730,  1.5938],
         [-0.6445, -1.9062, -2.7969,  ..., -1.5391,  0.9414, -0.5273],
         [-1.0703, -1.5234, -0.9648,  ..., -0.6445,  0.6406,  0.8867],
         ...,
         [ 0.5586, -1.8672, -2.6562,  ..., -0.3438, -0.2100,  0.9297],
         [ 0.0037,  0.0347,  0.2969,  ..., -0.4258,  1.3438, -0.4121],
         [-1.1172, -1.5156, -1.0781,  ..., -1.0781,  1.4531, -0.9727]],

        [[-2.3125,  1.0391, -2.3125,  ..., -0.2471,  0.5312,  0.1572],
         [-0.8008, -2.0000, -2.7344,  ..., -1.4688,  0.6328, -0.7422],
         [-0.0918, -2.2188, -0.0815,  ..., -1.4453,  0.0179,  0.8438],
         ...,
         [ 0.0698, -1.3359, -2.4375,  ..., -0.0195,  0.0396,  1.0547],
         [ 0.1777,  0.0664,  0.3223,  ..., -0.1631,  1.0938, -0.3145],
         [-0.6602, -1.0000, -1.5469,  ..., -1.0312,  0.9883, -0.7266]]],
       device='cuda:0', dtype=torch.bf
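
The cell above embeds only the first `batch_size` cells. To cover a whole dataset, one option is to walk over it in fixed-size slices. The helper below is a hypothetical sketch of that bookkeeping using only the standard library; it does not call the model.

```python
def batch_slices(n_items, batch_size):
    """Yield slice objects that cover range(n_items) in chunks of batch_size."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, n_items, batch_size):
        yield slice(start, min(start + batch_size, n_items))


# Example with 5 items and the notebook's batch size of 2.
slices = list(batch_slices(5, 2))
print(slices)  # [slice(0, 2, None), slice(2, 4, None), slice(4, 5, None)]
```

Inside such a loop, each slice `s` could replace `[:batch_size]` in the embedding cell (for example `aligned_adata[s].X.toarray()`), followed by the same `transform` and model calls. Moving each result to the CPU before collecting it would likely help keep GPU memory use bounded, though that is a suggestion rather than something this notebook tests.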

# Contacts and Acknowledgements

For issues with this tutorial, please contact virtualcellmodels@chanzuckerberg.com or Caleb Ellington at caleb.ellington@genbio.ai.

Thanks to Caleb Ellington, all the AIDO.Cell model developers, and the [GenBio AI](https://genbio.ai/) team for creating and supporting this resource.


# Responsible Use

We are committed to advancing the responsible development and use of artificial intelligence. Please follow our [Acceptable Use Policy](https://virtualcellmodels.cziscience.com/acceptable-use-policy) when engaging with our services.