## (1) Load the processed data used in the paper

Currently, three datasets are supported: `norman`, `adamson`, and `dixit`.

In [1]:
import sys
sys.path.append('../')

from gears import PertData

pert_data = PertData('./data') # specify the folder where data is saved
pert_data.load(data_name = 'norman') # specify the dataset name
pert_data.prepare_split(split = 'simulation', seed = 1) # create the data split with the given seed
pert_data.get_dataloader(batch_size = 32, test_batch_size = 128) # prepare the data loaders

Downloading...
100%|█████████████████████████████████████| 9.46M/9.46M [00:01<00:00, 8.04MiB/s]
Downloading...
100%|███████████████████████████████████████| 559k/559k [00:00<00:00, 1.54MiB/s]
Downloading...
100%|███████████████████████████████████████| 169M/169M [00:35<00:00, 4.75MiB/s]
Extracting zip file...
Done!
These perturbations are not in the GO graph and their perturbation can thus not be predicted
['RHOXF2BB+ctrl' 'LYL1+IER5L' 'ctrl+IER5L' 'KIAA1804+ctrl' 'IER5L+ctrl'
 'RHOXF2BB+ZBTB25' 'RHOXF2BB+SET']
Creating pyg object for each cell in the data...
100%|█████████████████████████████████████████| 277/277 [04:15<00:00,  1.08it/s]
Saving new dataset pyg object at ./data/norman/data_pyg/cell_graphs.pkl
Done!
Creating new splits....
Saving new splits at ./data/norman/splits/norman_simulation_1_0.75.pkl
Simulation split test composition:
combo_seen0:9
combo_seen1:43
combo_seen2:19
unseen_single:36
Done!
Creating dataloaders....
Done!
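
The warning in the output above lists conditions that cannot be predicted because at least one perturbed gene is missing from the GO graph. The check behind such a warning can be sketched as follows. This is a minimal illustration rather than the GEARS implementation: the function name `is_predictable` and the gene set `go_genes` are hypothetical.

```python
def is_predictable(condition, go_genes):
    """Return True if every perturbed gene in `condition` is in `go_genes`.

    `condition` follows the naming used in this tutorial: 'ctrl',
    'A+ctrl', 'ctrl+A', or 'A+B'.
    """
    # Drop the 'ctrl' placeholder so that only perturbed genes remain.
    genes = [g for g in condition.split('+') if g != 'ctrl']
    return all(g in go_genes for g in genes)


# Hypothetical gene set standing in for the genes covered by the GO graph.
go_genes = {'LYL1', 'SET', 'ZBTB25'}

conditions = ['ctrl', 'LYL1+ctrl', 'LYL1+IER5L', 'RHOXF2BB+SET', 'SET+ZBTB25']
kept = [c for c in conditions if is_predictable(c, go_genes)]
dropped = [c for c in conditions if not is_predictable(c, go_genes)]
```

A combination is dropped as soon as one of its two genes is missing, which is why a condition such as `LYL1+IER5L` can appear in the warning even if one of its genes is covered.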


## (2) Create your own Perturb-Seq data
Prepare a scanpy `AnnData` object (`adata`) that meets the following requirements:
1. The `adata.obs` dataframe has `condition` and `cell_type` columns, where `condition` is the perturbation name for each cell. Control cells use the condition `ctrl`, a single perturbation uses `A+ctrl` or `ctrl+A`, and a combination perturbation uses `A+B`.
2. The `adata.var` dataframe has a `gene_name` column containing the gene symbol of each gene.
3. `adata.X` stores the post-perturbation gene expression.
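
Before handing the object to GEARS, it can help to check these requirements yourself. The sketch below uses plain Python only. The helper names `classify_condition` and `missing_columns` are hypothetical and are not part of GEARS or scanpy.

```python
def classify_condition(condition):
    """Classify a condition string as 'control', 'single', or 'combo'."""
    parts = condition.split('+')
    if parts == ['ctrl']:
        return 'control'
    # Anything else must be exactly two non-empty names joined by '+'.
    if len(parts) != 2 or '' in parts:
        raise ValueError(f'unexpected condition format: {condition!r}')
    n_ctrl = parts.count('ctrl')
    if n_ctrl == 2:
        raise ValueError(f'control cells should be labeled "ctrl": {condition!r}')
    return 'single' if n_ctrl == 1 else 'combo'


def missing_columns(obs_columns, var_columns):
    """List the required columns that are absent, prefixed by their dataframe."""
    missing = [f'obs.{c}' for c in ('condition', 'cell_type') if c not in obs_columns]
    missing += [f'var.{c}' for c in ('gene_name',) if c not in var_columns]
    return missing


labels = [classify_condition(c) for c in ('ctrl', 'CREB1+ctrl', 'ctrl+CREB1', 'A+B')]
problems = missing_columns(['condition'], ['gene_name'])
```

With a real object, one would pass `adata.obs.columns` and `adata.var.columns` to `missing_columns` and apply `classify_condition` to each unique value of `adata.obs['condition']`.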

Here is an example using the Dixit 2016 dataset.

In [2]:
import scanpy as sc
adata = sc.read_h5ad('sample_adata.h5ad')
adata

AnnData object with n_obs × n_vars = 44735 × 5012
    obs: 'condition', 'cell_type'
    var: 'gene_name'

In [3]:
adata.var.head(2)

| | gene_name |
|---|---|
| 2 | A1BG-AS1 |
| 7 | AAK1 |


In [4]:
adata.obs.head(2)

| | condition | cell_type |
|---|---|---|
| TAACATGAAAAGTG_p7d_F3 | CREB1+ctrl | K562 |
| CTATCCCTTGTGCA_p7d_G3 | GABPA+ctrl | K562 |


### Suggested normalization

For raw count data, we recommend the following normalization, followed by subsetting to the top 5,000 most variable genes:

In [None]:
sc.pp.normalize_total(adata)
sc.pp.log1p(adata)
sc.pp.highly_variable_genes(adata, n_top_genes=5000, subset=True)
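
To make the first two steps concrete, here is a small pure-Python sketch of what they compute on a toy count matrix. It assumes the default behavior of `sc.pp.normalize_total`, which scales each cell so that its total equals the median total count across cells, followed by `log(1 + x)`. The function name `normalize_and_log` is hypothetical, and the highly variable gene selection is not reproduced here.

```python
import math
from statistics import median


def normalize_and_log(counts):
    """Scale each cell (row) to the median total count, then apply log(1 + x)."""
    totals = [sum(row) for row in counts]
    # Use the median over cells that have at least one count.
    target = median([t for t in totals if t > 0])
    result = []
    for row, total in zip(counts, totals):
        # Leave all-zero cells untouched to avoid dividing by zero.
        scale = target / total if total > 0 else 1.0
        result.append([math.log1p(value * scale) for value in row])
    return result


raw = [
    [1, 1],  # total count 2
    [3, 3],  # total count 6, the median
    [4, 6],  # total count 10
]
normalized = normalize_and_log(raw)
```

After scaling, every cell has the same total count before the log transform, so differences in sequencing depth no longer dominate the expression values.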

### Create dataloader

GEARS will take it from here. Processing a new dataset takes around 15 minutes for 5K genes and 100K cells.

In [5]:
import sys
sys.path.append('../')

from gears import PertData

pert_data = PertData('./data') # specify the folder where data is saved
pert_data.new_data_process(dataset_name = 'dixit', adata = adata) # specify the dataset name and the adata object
pert_data.load(data_path = './data/dixit') # load the processed data; the path is the save folder + dataset_name
pert_data.prepare_split(split = 'simulation', seed = 1) # create the data split with the given seed
pert_data.get_dataloader(batch_size = 32, test_batch_size = 128) # prepare the data loaders

Trying to set attribute `.obs` of view, copying.
... storing 'dose_val' as categorical
Trying to set attribute `.obs` of view, copying.
... storing 'condition_name' as categorical
... storing 'dose_val' as categorical
... storing 'condition_name' as categorical
Creating pyg object for each cell in the data...
100%|█████████████████████████████████████████████████████████████████████████| 20/20 [00:51<00:00,  2.56s/it]
Saving new dataset pyg object at ./data/dixit/data_pyg/cell_graphs.pkl
Done!
Local copy of pyg dataset is detected. Loading...
Done!
Creating new splits....
Saving new splits at ./data/dixit/splits/dixit_simulation_1_0.75.pkl
Simulation split test composition:
combo_seen0:0
combo_seen1:0
combo_seen2:0
unseen_single:5
Done!
Creating dataloaders....
Done!
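
The "Simulation split test composition" lines in the output count test perturbations by category. The tutorial does not define these categories, so the following reading is an assumption: `combo_seenK` counts two-gene combinations in which K of the genes also appear in the training perturbations, and `unseen_single` counts single-gene perturbations whose gene does not. Under that assumption, the tally can be sketched as below. The function name `split_category` and the toy gene names are hypothetical.

```python
from collections import Counter


def split_category(condition, train_genes):
    """Assign a test condition to a category (assumed semantics, see above)."""
    genes = [g for g in condition.split('+') if g != 'ctrl']
    if not genes:
        return 'control'
    if len(genes) == 1:
        return 'seen_single' if genes[0] in train_genes else 'unseen_single'
    # For a combination, count how many of its genes appear in training.
    n_seen = sum(g in train_genes for g in genes)
    return f'combo_seen{n_seen}'


# Toy example: genes A and B are perturbed in training, C and D are not.
train_genes = {'A', 'B'}
test_conditions = ['C+ctrl', 'A+B', 'A+C', 'C+D', 'ctrl+D']
composition = Counter(split_category(c, train_genes) for c in test_conditions)
```

This reading is consistent with the Dixit output above, where the dataset has no combination perturbations and all three `combo_seen` counts are zero.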
