# Cell Classification with AIDO.Cell

This tutorial runs the cell type classification task from the AIDO.Cell paper on three datasets:
- [Zheng68K (Zheng et al. 2017)](https://www.nature.com/articles/ncomms14049)
- [Segerstolpe et al. 2016](https://www.cell.com/cell-metabolism/fulltext/S1550-4131(16)30436-3)
- [scTab (Fischer et al. 2024)](https://www.nature.com/articles/s41467-024-51059-5)

Both Zheng68K and Segerstolpe are preprocessed and available for download from the [GenBio AI HuggingFace](https://huggingface.co/datasets/genbio-ai/cell-downstream-tasks/tree/main).

Because the scTab dataset is large, we use TileDB to load it in chunks during training and testing. To run experiments with scTab, download the data files from the official [scTab repo](https://github.com/theislab/scTab) and convert the `.parquet` data files to TileDB format with this [script](./sctab_conversion.py). (Note: the conversion can take a few hours.)

We also provide a minimal version of scTab in TileDB format, with a subset of ~32k observations in each split (train, val, test), as an example. It can be downloaded [here](https://huggingface.co/datasets/genbio-ai/cell-downstream-tasks/blob/main/sctab/soma-exp-scTab-minimal.tar.gz). Once you have a TileDB data folder, add its root path (either local or in the Docker workspace) to `config.data.init_args.path` (see [sctab_classification.yaml](./sctab_classification.yaml) for an example). After the conversion, the train/validation/test split subfolders should already be present under the root path.
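
Before launching a run, it can help to confirm that the TileDB root really contains the split subfolders. The sketch below is only a sanity check: the helper `missing_splits`, the folder names `train`, `val`, and `test`, and the example root path are assumptions for illustration, so adjust them to match what the conversion script or the extracted archive actually produced.

```python
from pathlib import Path


def missing_splits(data_root, splits=("train", "val", "test")):
    """Return the split names that have no subfolder under data_root.

    The default split names are an assumption; pass the names that your
    converted TileDB folder actually uses.
    """
    root = Path(data_root)
    return [name for name in splits if not (root / name).is_dir()]


# Hypothetical location of the extracted minimal scTab archive.
root = "data/sctab/soma-exp-scTab-minimal"
missing = missing_splits(root)
if missing:
    print(f"Missing split folders under {root}: {missing}")
else:
    print(f"All split folders found under {root}")
```

If the check reports missing folders, the path given to `config.data.init_args.path` is probably one level too high or too low.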

For installation, see the quickstart tutorial.

__Requirements__:
- A100 GPU or equivalent
- [ModelGenerator](https://genbio-ai.github.io/ModelGenerator/) installed
- [HuggingFace CLI](https://huggingface.co/docs/huggingface_hub/en/guides/cli) installed

In [None]:
!huggingface-cli download genbio-ai/cell-downstream-tasks \
  --repo-type dataset \
  --local-dir data/genbio-ai/cell-downstream-tasks

## ModelGenerator

Using large models like AIDO.Cell can be a headache due to their size.
To make it easier to work with large models, we developed [ModelGenerator](https://genbio-ai.github.io/ModelGenerator/), a research framework for cross-disciplinary teams in ML & Bio.
ModelGenerator is designed to automatically take advantage of distributed training/inference workflows and scale with the available hardware.
It also provides reproducible configs for every training run, and a simple CLI to run training and inference.

In this example we run cell type classification with AIDO.Cell using the ModelGenerator CLI.


### Data Alignment

Normally the dataset must be aligned to AIDO.Cell's pretraining gene set. An example is below.

The Zheng68K and Segerstolpe datasets used here are already aligned, so the code is shown only as an example.

In [None]:
# import scanpy as sc
# import cell_utils

# my_data = sc.read_h5ad('my_data.h5ad')
# adata_aligned = cell_utils.align_adata(my_data)
# adata_aligned.write_h5ad('my_data_aligned.h5ad')
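
To make the idea of alignment concrete, here is a dependency-free sketch of what aligning a cells x genes matrix to a reference gene order can look like. It is a conceptual illustration, not the implementation of `cell_utils.align_adata`: the function `align_to_gene_set`, the toy gene lists, and the choice to zero-fill genes missing from the input are all assumptions.

```python
def align_to_gene_set(matrix, genes, reference_genes, fill_value=0.0):
    """Reorder the columns of a cells x genes matrix to match reference_genes.

    Genes absent from the input are filled with fill_value, and input genes
    that are not in the reference are dropped. This is a conceptual sketch,
    not the implementation of cell_utils.align_adata.
    """
    column_of = {gene: j for j, gene in enumerate(genes)}
    aligned = []
    for row in matrix:
        aligned.append([
            row[column_of[gene]] if gene in column_of else fill_value
            for gene in reference_genes
        ])
    return aligned


# Toy example: two cells measured on three genes.
genes = ["CD3E", "MS4A1", "GNLY"]
counts = [
    [5.0, 0.0, 2.0],
    [0.0, 7.0, 1.0],
]
# Hypothetical reference gene order; "NKG7" is missing from the input.
reference = ["GNLY", "NKG7", "CD3E"]

aligned = align_to_gene_set(counts, genes, reference)
print(aligned)  # [[2.0, 0.0, 5.0], [1.0, 0.0, 0.0]]
```

After alignment every cell has the same number of columns, in the same order, which is what lets a pretrained model interpret each position as a fixed gene.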

In [32]:
# Export timestamp for linking train and test runs
import os
import time
timestamp = time.time()
os.environ["TIMESTAMP"] = str(timestamp)

### Finetune AIDO.Cell on Zheng68K dataset

In [None]:
!mgen fit --config cell_type_classification.yaml \
    --model.backbone aido_cell_3m \
    --model.adapter LinearMaxPoolAdapter \
    --data.path data/genbio-ai/cell-downstream-tasks/zheng \
    --trainer.logger lightning.pytorch.loggers.WandbLogger \
    --trainer.logger.version zheng_cell_type_classification_$TIMESTAMP \
    --trainer.val_check_interval 100 \
    --trainer.limit_val_batches 100

### Test the Best Val F1 Checkpoint on the Zheng68K test split

To test, run `mgen test` with the same model and data arguments, and point `--ckpt_path` to the checkpoint saved during training.
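
If you want to check which checkpoint file the glob will pick up before running the test, a small helper like the following can resolve it. This is a sketch under assumptions: `find_best_checkpoint` is a hypothetical helper, and the directory layout `lightning_logs/<run name>/checkpoints/best_val_f1*.ckpt` simply mirrors the `--ckpt_path` pattern used in this notebook, with the run name taken to be the logger version passed to `mgen fit`.

```python
import glob
import os


def find_best_checkpoint(run_name, log_root="lightning_logs"):
    """Return the newest file matching best_val_f1*.ckpt for a run, or None."""
    pattern = os.path.join(log_root, run_name, "checkpoints", "best_val_f1*.ckpt")
    matches = glob.glob(pattern)
    if not matches:
        return None
    # If several files match, prefer the most recently modified one.
    return max(matches, key=os.path.getmtime)


# TIMESTAMP is the environment variable exported earlier in this notebook.
timestamp = os.environ.get("TIMESTAMP", "")
run_name = f"zheng_cell_type_classification_{timestamp}"
print(find_best_checkpoint(run_name))
```

A result of `None` usually means the run name or timestamp does not match the training run, which is worth checking before starting `mgen test`.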

In [None]:
!mgen test --config cell_type_classification.yaml \
    --model.backbone aido_cell_3m \
    --model.adapter LinearMaxPoolAdapter \
    --data.path data/genbio-ai/cell-downstream-tasks/zheng \
    --trainer.default_root_dir logs \
    --trainer.callbacks.dirpath logs/zheng_cell_type_classification/ckpts \
    --ckpt_path lightning_logs/zheng_cell_type_classification_$TIMESTAMP/checkpoints/best_val_f1*.ckpt

### Fit and Test on Segerstolpe dataset

In [None]:
!mgen fit --config cell_type_classification.yaml \
    --model.backbone aido_cell_3m \
    --model.adapter LinearMaxPoolAdapter \
    --data.path data/genbio-ai/cell-downstream-tasks/Stegerstolpe \
    --trainer.logger lightning.pytorch.loggers.WandbLogger \
    --trainer.logger.version stegerstolpe_cell_type_classification_$TIMESTAMP \
    --trainer.val_check_interval 100 \
    --trainer.limit_val_batches 100

In [None]:
!mgen test --config cell_type_classification.yaml \
    --model.backbone aido_cell_3m \
    --model.adapter LinearMaxPoolAdapter \
    --data.path data/genbio-ai/cell-downstream-tasks/Stegerstolpe \
    --ckpt_path lightning_logs/stegerstolpe_cell_type_classification_$TIMESTAMP/checkpoints/best_val_f1*.ckpt

### Finetune AIDO.Cell on scTab dataset

In [None]:
!mgen fit --config sctab_classification.yaml \
    --model.backbone aido_cell_3m \
    --model.adapter LinearMaxPoolAdapter \
    --data.path TODO \
    --trainer.logger lightning.pytorch.loggers.WandbLogger \
    --trainer.logger.version sctab_cell_type_classification_$TIMESTAMP \
    --trainer.val_check_interval 100 \
    --trainer.limit_val_batches 100