# Tutorial on preparing single-cell data for MORPH transfer learning


### Step 1. Standard data preprocessing
MORPH expects input in the form of a `Scanpy` AnnData object (`adata`) that meets the following criteria:

1. **Perturbation Labels** (`adata.obs['gene']`)
   - The `adata.obs` DataFrame must include a column named `gene`, which specifies the perturbation applied to each cell.
   - Use the following labeling format:
     - Control cells: `non-targeting`
     - Single perturbations: e.g., `GeneA`
     - Double perturbations: e.g., `GeneA+GeneB` (gene names joined by `+`)

2. **Gene Expression Matrix** (`adata.X`)
   - The `.X` matrix should contain **either raw gene expression counts** or **imaging features** for each cell, depending on the modality of your dataset.
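
The labeling format above can be checked before training. The following is a minimal sketch, not part of MORPH itself: the helper names `classify_label` and `summarize_labels` are hypothetical, and the sketch assumes that any label other than the control is one or two gene names joined by `+`, with no surrounding whitespace.

```python
from collections import Counter

CONTROL_LABEL = "non-targeting"  # control label named in this tutorial


def classify_label(label: str) -> str:
    """Return 'control', 'single', 'double', or 'invalid' for one perturbation label."""
    if label == CONTROL_LABEL:
        return "control"
    parts = label.split("+")
    # Every part must be a non-empty gene name with no surrounding whitespace
    if any(part == "" or part != part.strip() for part in parts):
        return "invalid"
    if len(parts) == 1:
        return "single"
    if len(parts) == 2:
        return "double"
    return "invalid"  # more than two genes is outside the format described above


def summarize_labels(labels) -> Counter:
    """Count how many labels fall into each category."""
    return Counter(classify_label(str(label)) for label in labels)


# Toy example; with real data, pass adata.obs['gene'] instead of this list
example_labels = ["non-targeting", "GeneA", "GeneA+GeneB", "GeneA+", "GeneB"]
summary = summarize_labels(example_labels)
print(summary)
```

A non-zero `invalid` count suggests that some labels need cleaning (for example, a trailing `+` or stray spaces) before the data is passed to the model.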
 
### Step 2. Data Alignment (specific to transfer learning)
Since the released model was trained on the genome-wide Perturb-seq dataset from [Replogle et al. (2022)](https://www.sciencedirect.com/science/article/pii/S0092867422005979?via%3Dihub), your input data must be aligned with the training data. Specifically, the gene names in your dataset must match those of the training data in both content and order. The code cells at the end of this tutorial perform this alignment.


### Step 3. Preprocessing
If you are using raw gene expression counts, please follow the steps in [`data_tutorial.ipynb`](data_tutorial.ipynb) to preprocess the data.

**Note: preprocess the data *only* after it has been aligned with the training data.**

In [1]:
import numpy as np
import scanpy as sc
from tqdm import tqdm
import anndata
import pandas as pd

In [None]:
# Load training and target datasets
adata_train = anndata.read_h5ad('path/to/your/train_data.h5ad')  # Replace with the actual path to the training data
adata_target = anndata.read_h5ad('path/to/your/target_data.h5ad')  # Replace with the actual path to your target data

# Extract the list of genes measured in the training data
train_genes = adata_train.var.index.tolist()

# Check that every training gene is present in the target data before indexing;
# selecting a missing gene by name would otherwise raise a KeyError
target_genes = set(adata_target.var_names)
missing_genes = [g for g in train_genes if g not in target_genes]
assert not missing_genes, (
    f"Mismatch in gene sets: {len(missing_genes)} genes in the training data are missing "
    f"from the target data (e.g., {missing_genes[:5]}). "
    "Please ensure gene identifiers are consistent and complete before proceeding."
)

# Subset the target data to the training genes and reorder them to match the training data
adata_target = adata_target[:, train_genes].copy()

# Confirm that the gene names now match in exact order and content
assert adata_target.var_names.tolist() == train_genes

# Note: In some cases, genes present in the training dataset may not be expressed
# (or even present) in the target dataset. The trained model cannot then be applied
# directly, because it expects input features (genes) that exactly match those it was
# trained on. If this happens, you will need to retrain the model using only the set
# of genes common to both the training and target datasets.
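
If the gene sets do not match, it helps to see which training genes are missing and which are shared before deciding to retrain. The following is a minimal sketch on toy gene lists; the helper name `split_genes` is hypothetical and is not part of MORPH or AnnData.

```python
def split_genes(train_genes, target_genes):
    """Return (shared, missing) training genes, both kept in training order."""
    target_set = set(target_genes)
    shared = [g for g in train_genes if g in target_set]
    missing = [g for g in train_genes if g not in target_set]
    return shared, missing


# Toy gene lists; with real data use adata_train.var_names and adata_target.var_names
train_genes_demo = ["GeneA", "GeneB", "GeneC", "GeneD"]
target_genes_demo = ["GeneD", "GeneX", "GeneA", "GeneC"]

shared, missing = split_genes(train_genes_demo, target_genes_demo)
print(f"{len(shared)} shared genes, {len(missing)} missing: {missing}")

# With AnnData objects, both datasets could then be restricted to the shared genes
# (in the same order) before retraining:
# adata_train = adata_train[:, shared].copy()
# adata_target = adata_target[:, shared].copy()
```

Keeping the shared genes in training order means both datasets end up with identical gene names in identical positions, which is the alignment requirement described in Step 2.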

Now you can preprocess the aligned data (normalization, log transformation, and highly variable gene selection) as in [`data_tutorial.ipynb`](data_tutorial.ipynb).