# Tutorial on preparing Perturb-seq data for MORPH

MORPH accepts an AnnData object (`adata`, as used by Scanpy) that meets the following requirements:

1. The `adata.obs` data must contain a `gene` column, where `gene` represents the perturbation name for each cell. Control cells are labeled as `non-targeting`, single perturbations follow the format `PerturbationA`, and combination perturbations follow the format `PerturbationA+PerturbationB`.

2. The `adata.X` matrix stores post-perturbation gene expression data.
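
The labelling convention in requirement 1 can be checked before training. The following is a minimal sketch in plain Python, not part of MORPH: the helper name `classify_perturbation` and the returned category names are hypothetical, and the only assumptions taken from the text above are the `non-targeting` control label and the `+` separator for combinations.

```python
CONTROL_LABEL = "non-targeting"  # control label stated in requirement 1


def classify_perturbation(label):
    """Classify one `gene` label as 'control', 'single', or 'combination'.

    Hypothetical helper for sanity-checking labels; raises ValueError on
    labels that do not follow the documented format.
    """
    if not isinstance(label, str) or not label:
        raise ValueError(f"label must be a non-empty string, got {label!r}")
    if label == CONTROL_LABEL:
        return "control"
    parts = label.split("+")
    # An empty part means a dangling or doubled '+', e.g. 'KLF1+' or 'A++B'
    if any(not part for part in parts):
        raise ValueError(f"malformed perturbation label: {label!r}")
    return "single" if len(parts) == 1 else "combination"


# Labels taken from the example dataset shown in this tutorial
example_labels = ["non-targeting", "KLF1", "CEBPE+RUNX1T1"]
example_kinds = [classify_perturbation(label) for label in example_labels]
```

On a loaded dataset, something like `adata.obs['gene'].astype(str).map(classify_perturbation).value_counts()` should summarize how many cells fall into each category and fail loudly on a malformed label.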

Here is an example using the Norman K562 dataset.

In [1]:
import numpy as np
import scanpy as sc
from tqdm import tqdm

In [2]:
# read in raw data
raw_adata = sc.read_h5ad('./data/sample_data.h5ad')
adata = raw_adata.copy()
adata



AnnData object with n_obs × n_vars = 111122 × 5044
    obs: 'guide_id', 'read_count', 'UMI_count', 'coverage', 'gemgroup', 'good_coverage', 'number_of_cells', 'tissue_type', 'cell_line', 'cancer', 'disease', 'perturbation_type', 'celltype', 'organism', 'perturbation', 'nperts', 'ngenes', 'ncounts', 'percent_mito', 'percent_ribo', 'n_counts', 'condition', 'pert_type', 'cell_type', 'source', 'condition_ID', 'control', 'dose_value', 'pathway_old', 'cov_cond', 'pert', 'split_hardest', 'split_1', 'split_2', 'split_3', 'split_4', 'split_5', 'split_6', 'cond_harm', 'gene', 'pathway'
    var: 'ensemble_id', 'ncounts', 'ncells', 'symbol', 'highly_variable', 'means', 'dispersions', 'dispersions_norm'
    uns: 'cell_type_colors', 'gene_embedding_path', 'hvg', 'log1p', 'neighbors', 'rank_genes_groups_cov', 'source_colors', 'split_1_colors', 'split_2_colors', 'split_3_colors', 'split_4_colors', 'split_5_colors', 'split_hardest_colors', 'umap'
    obsm: 'X_pca', 'X_umap'
    layers: 'counts'

In [7]:
adata.obs['gene'].value_counts()

gene
non-targeting    11855
KLF1              1960
BAK1              1457
CEBPE             1233
CEBPE+RUNX1T1     1219
                 ...  
FOSB+CEBPB          71
CBL+UBASH3A         64
CEBPB+CEBPA         64
JUN+CEBPB           59
JUN+CEBPA           54
Name: count, Length: 235, dtype: int64

### Suggested pre-processing steps

In [None]:
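# filter out cells with fewer than 100 total counts
sc.pp.filter_cells(adata, min_counts=100)
# filter out genes with fewer than 5 total counts across all cells
# (min_counts thresholds total counts; use min_cells=5 instead to require detection in at least 5 cells)
sc.pp.filter_genes(adata, min_counts=5)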

# filter out perturbation groups with fewer than 32 cells
gene_counts = adata.obs['gene'].value_counts()
gene_counts = gene_counts[gene_counts >= 32]
# .copy() materializes the subset so later in-place steps do not operate on a view
adata = adata[adata.obs['gene'].isin(gene_counts.index)].copy()
adata

In [None]:
# scale each cell to 10,000 total counts, then apply log(1 + x)
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
# subset to the top 5000 highly variable genes
sc.pp.highly_variable_genes(adata, n_top_genes=5000, subset=True)
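
To make the arithmetic of the normalization step concrete, here is a small sketch for a single cell in plain Python. It is an illustration only, not Scanpy's implementation: the function name `normalize_and_log` and the toy counts are hypothetical, and it assumes the default behaviour where every gene contributes to the cell's total.

```python
import math


def normalize_and_log(counts, target_sum=1e4):
    """Scale one cell's counts to `target_sum`, then apply log(1 + x)."""
    total = sum(counts)
    if total == 0:
        # An empty cell stays all-zero; nothing to rescale
        return [0.0 for _ in counts]
    return [math.log1p(count * target_sum / total) for count in counts]


# Toy cell with four genes and 10 counts in total (made-up numbers)
toy_counts = [1, 3, 0, 6]
toy_normalized = normalize_and_log(toy_counts)

# Undoing the log recovers scaled counts that sum to target_sum
recovered_total = sum(math.expm1(value) for value in toy_normalized)
```

Because each cell is rescaled to the same total before the log transform, expression values become comparable between cells that were sequenced at different depths.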

### Specify the path to adata in `../data/scdata_file_path.csv`

Specify the dataset name in column 'dataset' and the path in column 'path'.

For an example, see `../data/scdata_file_path.csv`.
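
A registry file with these two columns could be produced as in the sketch below. This is an assumption-laden illustration: only the column names `dataset` and `path` come from the text above, while the helper names, the dataset name `norman_k562_example`, and the scratch output location are hypothetical. The demo writes to a temporary directory so it does not overwrite the file shipped with the repository; compare against that file for the exact layout MORPH expects.

```python
import csv
import os
import tempfile

FIELDNAMES = ["dataset", "path"]  # column names stated in this tutorial


def write_dataset_registry(csv_path, entries):
    """Write a mapping of dataset name -> h5ad path as a two-column CSV."""
    with open(csv_path, "w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=FIELDNAMES)
        writer.writeheader()
        for name, path in entries.items():
            writer.writerow({"dataset": name, "path": path})


def read_dataset_registry(csv_path):
    """Read the registry back as a list of {'dataset': ..., 'path': ...} rows."""
    with open(csv_path, newline="") as handle:
        return list(csv.DictReader(handle))


# Demo in a scratch directory; the dataset name is a made-up placeholder
scratch_path = os.path.join(tempfile.mkdtemp(), "scdata_file_path_example.csv")
write_dataset_registry(scratch_path, {"norman_k562_example": "./data/sample_data.h5ad"})
registry_rows = read_dataset_registry(scratch_path)
```

Reading the file back immediately after writing it is a cheap way to confirm that the header and paths were stored as intended.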