**Requires**
* `'lincs_full_smiles.h5ad'`
* `'sciplex_raw_chunk_{i}.h5ad'` with $i \in \{0,1,2,3,4\}$

**Output**
* `'sciplex3_matched_genes_lincs.h5ad'`
* Only with genes that are shared with `lincs`: `'sciplex3_lincs_genes.h5ad'`
* Only with genes that are shared with `sciplex`: `'lincs_full_smiles_sciplex_genes.h5ad'`

## Imports

In [1]:
import os 
import scanpy as sc
import matplotlib.pyplot as plt
import numpy as np 
import pandas as pd
import sfaira

sc.set_figure_params(dpi=80, frameon=False)
sc.logging.print_header()
os.getcwd()

from compert.paths import DATA_DIR, PROJECT_DIR

scanpy==1.9.0.dev41+g58f4904c anndata==0.7.6 umap==0.5.1 numpy==1.19.2 scipy==1.6.2 pandas==1.2.4 scikit-learn==0.24.2 statsmodels==0.12.2 python-igraph==0.9.1 louvain==0.7.0 pynndescent==0.5.2


In [2]:
%load_ext autoreload
%autoreload 2

## Load data

Load lincs

In [4]:
adata_lincs = sc.read(PROJECT_DIR/'datasets'/'lincs_full_smiles.h5ad' )

Load trapnell (the sciplex data, stored in five chunks that are concatenated)

In [5]:
adatas = []
for i in range(5):
    adatas.append(sc.read(PROJECT_DIR/'datasets'/f'sciplex_raw_chunk_{i}.h5ad'))
adata = adatas[0].concatenate(adatas[1:])

Add `gene_id` to trapnell: the part of `id` before the first `.`

In [6]:
adata.var['gene_id'] = adata.var.id.str.split('.').str[0]
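A minimal sketch of what the split does, assuming the `id` column holds Ensembl IDs with a trailing version suffix. The IDs and suffixes below are toy values chosen for illustration, not taken from the dataset.

```python
import pandas as pd

# Toy Ensembl-style IDs; the version suffixes are invented for illustration.
ids = pd.Series(["ENSG00000187838.12", "ENSG00000141510.16", "ENSG00000012048"])

# Keep only the part before the first dot; IDs without a dot pass through unchanged.
gene_ids = ids.str.split(".").str[0]
print(gene_ids.tolist())
```

IDs without a suffix are left untouched, so the step is safe to apply to a mixed column.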

### Get gene ids from symbols via sfaira

Load genome container with sfaira

In [7]:
genome_container = sfaira.versions.genomes.GenomeContainer(organism="homo_sapiens", release="82")

Extend the symbols dict with a symbol that is missing from the genome container (`PLSCR3`)

In [8]:
symbols_dict = genome_container.symbol_to_id_dict
symbols_dict.update({'PLSCR3':'ENSG00000187838'})

Identify genes that are shared between lincs and trapnell

In [9]:
# For lincs
adata_lincs.var['gene_id'] = adata_lincs.var_names.map(symbols_dict)
adata_lincs.var['in_sciplex'] = adata_lincs.var.gene_id.isin(adata.var.gene_id)

In [10]:
# For trapnell
adata.var['in_lincs'] = adata.var.gene_id.isin(adata_lincs.var.gene_id)
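The matching logic of the two cells above can be sketched on toy `var` tables. The small mapping and the gene lists here are hypothetical stand-ins for the sfaira dictionary and the real datasets.

```python
import pandas as pd

# Hypothetical symbol -> Ensembl ID mapping (stand-in for the sfaira dictionary).
symbols_dict = {"TP53": "ENSG00000141510", "BRCA1": "ENSG00000012048"}
symbols_dict.update({"PLSCR3": "ENSG00000187838"})  # manual addition, as in the notebook

# Toy var tables: lincs is indexed by symbol, sciplex already carries gene_id.
lincs_var = pd.DataFrame(index=["TP53", "PLSCR3", "UNKNOWN1"])
sciplex_var = pd.DataFrame(
    {"gene_id": ["ENSG00000141510", "ENSG00000187838", "ENSG00000000003"]}
)

# Symbols without an entry in the dictionary end up as NaN.
lincs_var["gene_id"] = lincs_var.index.map(symbols_dict)

# Flag genes that are present in the other dataset.
lincs_var["in_sciplex"] = lincs_var["gene_id"].isin(sciplex_var["gene_id"])
sciplex_var["in_lincs"] = sciplex_var["gene_id"].isin(lincs_var["gene_id"])
print(lincs_var)
```

Symbols that cannot be mapped get no `gene_id` and are therefore never flagged as shared in this toy case.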

## Preprocess trapnell dataset

See `sciplex3.ipynb`

In [11]:
sc.pp.subsample(adata, fraction=0.5)
sc.pp.normalize_per_cell(adata)

In [12]:
sc.pp.log1p(adata)
sc.pp.highly_variable_genes(adata, n_top_genes=5000, subset=False)
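For intuition, the normalisation and log transform can be approximated with plain NumPy. This is a sketch under the assumption that `sc.pp.normalize_per_cell` runs with its defaults, in which case each cell is scaled to the median total count. The count matrix is invented, and the cell filtering that scanpy may apply is not reproduced.

```python
import numpy as np

# Toy count matrix (cells x genes); values are invented for illustration.
counts = np.array([[1.0, 3.0], [2.0, 2.0], [10.0, 30.0]])

totals = counts.sum(axis=1)      # total counts per cell
target = np.median(totals)       # assumed default target: median of the totals

# Scale every cell so that its counts sum to the target, then log-transform.
normalized = counts / totals[:, None] * target
logged = np.log1p(normalized)
print(normalized.sum(axis=1))
```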

### Combine HVG with lincs genes

Union of genes that are considered highly variable and those that are shared with lincs

In [13]:
((adata.var.in_lincs) | (adata.var.highly_variable)).sum()

5894

Subset to that union of genes

In [14]:
adata = adata[:, (adata.var.in_lincs) | (adata.var.highly_variable)].copy()
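The union is a plain element-wise OR of two boolean columns. A toy illustration with four hypothetical genes:

```python
import pandas as pd

# Toy var table; gene names and flags are invented for illustration.
var = pd.DataFrame(
    {
        "in_lincs": [True, False, False, True],
        "highly_variable": [False, True, False, True],
    },
    index=["g1", "g2", "g3", "g4"],
)

# A gene is kept if it is shared with lincs OR highly variable.
keep = var["in_lincs"] | var["highly_variable"]
n_keep = int(keep.sum())
kept_genes = var.index[keep].tolist()
print(n_keep, kept_genes)
```

Genes flagged in both columns are counted once, which is why the union can be smaller than the sum of the two sets.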

### Create additional metadata

Normalise dose values by the maximum dose; vehicle cells are assigned `dose_val = 1.0`

In [15]:
adata.obs['dose_val'] = adata.obs.dose.astype(float) / np.max(adata.obs.dose.astype(float))
adata.obs.loc[adata.obs['product_name'].str.contains('Vehicle'), 'dose_val'] = 1.0

In [16]:
adata.obs['dose_val'].value_counts()

0.001    76760
0.010    73754
0.100    70910
1.000    69464
Name: dose_val, dtype: int64
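A small worked example of the dose normalisation, assuming raw doses of 10, 100, 1000 and 10000 and a vehicle dose of 0. These raw values and the drug names are assumptions for illustration; only the resulting `dose_val` levels appear in the output above.

```python
import numpy as np
import pandas as pd

# Toy obs table; drug names and raw doses are assumed for illustration.
obs = pd.DataFrame(
    {
        "product_name": ["Vehicle", "DrugA", "DrugA", "DrugB", "DrugB"],
        "dose": [0, 10, 100, 1000, 10000],
    }
)

# Divide by the maximum dose so that values lie in [0, 1].
dose = obs["dose"].astype(float)
obs["dose_val"] = dose / dose.max()

# Vehicle cells would end up at 0.0; they are set to 1.0 instead.
obs.loc[obs["product_name"].str.contains("Vehicle"), "dose_val"] = 1.0
print(obs)
```

Because vehicle cells share `dose_val = 1.0` with the highest dose, they can only be told apart from treated cells through the drug name, not through the dose.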

Shorten `product_name` to its first word and rename the vehicle to `control`

In [17]:
adata.obs['product_name'] = [x.split(' ')[0] for x in adata.obs['product_name']]
adata.obs.loc[adata.obs['product_name'].str.contains('Vehicle'), 'product_name'] = 'control'

Create a copy of `product_name` with column name `condition`

In [18]:
adata.obs['condition'] = adata.obs.product_name.copy()

Add combinations of drug (`condition`), dose (`dose_val`), and cell_type (`cell_type`)

In [19]:
adata.obs['drug_dose_name'] = adata.obs.condition.astype(str) + '_' + adata.obs.dose_val.astype(str)
adata.obs['cov_drug_dose_name'] = adata.obs.cell_type.astype(str) + '_' + adata.obs.drug_dose_name.astype(str)
adata.obs['cov_drug'] = adata.obs.cell_type.astype(str) + '_' + adata.obs.condition.astype(str)

Add a `control` column with value `1` where only the vehicle was used

In [20]:
adata.obs['control'] = [1 if x == 'control_1.0' else 0 for x in adata.obs.drug_dose_name.values]
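The naming scheme of the combined columns and the `control` flag, sketched on a toy `obs` table with three hypothetical cells:

```python
import pandas as pd

# Toy obs table; rows are invented for illustration.
obs = pd.DataFrame(
    {
        "cell_type": ["A549", "A549", "MCF7"],
        "condition": ["control", "Fasudil", "control"],
        "dose_val": [1.0, 0.1, 1.0],
    }
)

# <drug>_<dose>, <cell_type>_<drug>_<dose>, and <cell_type>_<drug>
obs["drug_dose_name"] = obs["condition"].astype(str) + "_" + obs["dose_val"].astype(str)
obs["cov_drug_dose_name"] = obs["cell_type"].astype(str) + "_" + obs["drug_dose_name"]
obs["cov_drug"] = obs["cell_type"].astype(str) + "_" + obs["condition"].astype(str)

# Vehicle cells carry the name 'control' and dose_val 1.0.
obs["control"] = (obs["drug_dose_name"] == "control_1.0").astype(int)
print(obs)
```

The comparison against `'control_1.0'` relies on the float `1.0` being rendered as the string `'1.0'`.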

In [21]:
from compert.helper import rank_genes_groups_by_cov
rank_genes_groups_by_cov(adata, groupby='cov_drug', covariate='cell_type', control_group='control')

Using backend: pytorch


A549


Trying to set attribute `.obs` of view, copying.
... storing 'cell_type' as categorical
Trying to set attribute `.obs` of view, copying.
... storing 'pathway' as categorical
Trying to set attribute `.obs` of view, copying.
... storing 'product_name' as categorical
Trying to set attribute `.obs` of view, copying.
... storing 'target' as categorical
Trying to set attribute `.obs` of view, copying.
... storing 'condition' as categorical
Trying to set attribute `.obs` of view, copying.
... storing 'drug_dose_name' as categorical
Trying to set attribute `.obs` of view, copying.
... storing 'cov_drug_dose_name' as categorical
Trying to set attribute `.obs` of view, copying.
... storing 'cov_drug' as categorical


MCF7


Trying to set attribute `.obs` of view, copying.
... storing 'cell_type' as categorical
Trying to set attribute `.obs` of view, copying.
... storing 'pathway' as categorical
Trying to set attribute `.obs` of view, copying.
... storing 'product_name' as categorical
Trying to set attribute `.obs` of view, copying.
... storing 'target' as categorical
Trying to set attribute `.obs` of view, copying.
... storing 'condition' as categorical
Trying to set attribute `.obs` of view, copying.
... storing 'drug_dose_name' as categorical
Trying to set attribute `.obs` of view, copying.
... storing 'cov_drug_dose_name' as categorical
Trying to set attribute `.obs` of view, copying.
... storing 'cov_drug' as categorical


K562


Trying to set attribute `.obs` of view, copying.
... storing 'cell_type' as categorical
Trying to set attribute `.obs` of view, copying.
... storing 'pathway' as categorical
Trying to set attribute `.obs` of view, copying.
... storing 'product_name' as categorical
Trying to set attribute `.obs` of view, copying.
... storing 'target' as categorical
Trying to set attribute `.obs` of view, copying.
... storing 'condition' as categorical
Trying to set attribute `.obs` of view, copying.
... storing 'drug_dose_name' as categorical
Trying to set attribute `.obs` of view, copying.
... storing 'cov_drug_dose_name' as categorical
Trying to set attribute `.obs` of view, copying.
... storing 'cov_drug' as categorical


In [22]:
new_genes_dict = {}
for cat in adata.obs.cov_drug_dose_name.unique():
    if 'control' not in cat:
        rank_keys = np.array(list(adata.uns['rank_genes_groups_cov'].keys()))
        bool_idx = [x in cat for x in rank_keys]
        genes = adata.uns['rank_genes_groups_cov'][rank_keys[bool_idx][0]]
        new_genes_dict[cat] = genes

In [23]:
adata.uns['rank_genes_groups_cov'] = new_genes_dict
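The two cells above re-key the gene lists from `cov_drug` to `cov_drug_dose_name`, so that every dose of a drug in a cell line points to the same genes. A sketch in plain Python, assuming the original keys have the form `<cell_type>_<drug>`; the gene names are placeholders.

```python
# Toy stand-in for adata.uns['rank_genes_groups_cov'], keyed by '<cell_type>_<drug>'.
rank_genes = {
    "A549_Fasudil": ["GENE1", "GENE2"],
    "MCF7_Fasudil": ["GENE3", "GENE4"],
}
categories = [
    "A549_Fasudil_0.1",
    "A549_Fasudil_1.0",
    "MCF7_Fasudil_0.1",
    "A549_control_1.0",
]

new_genes = {}
for cat in categories:
    if "control" in cat:
        continue  # controls have no gene list of their own
    # Take the first key that occurs as a substring of the category name.
    key = next(k for k in rank_genes if k in cat)
    new_genes[cat] = rank_genes[key]

print(new_genes)
```

Substring matching can pick the wrong key if one key is a prefix of another. Under the naming assumption above, `cat.rsplit('_', 1)[0]` would recover the key exactly.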

## Split

This is not the right configuration for the experiments we want, but for the moment it is okay.

In [24]:
adata.obs['split'] = 'train'  # reset
ho_drugs = [
    # selection of drugs from various pathways
    "Azacitidine",
    "Carmofur",
    "Pracinostat",
    "Cediranib",
    "Luminespib",
    "Crizotinib",
    "SNS-314",
    "Obatoclax",
    "Momelotinib",
    "AG-14361",
    "Entacapone",
    "Fulvestrant",
    "Mesna",
    "Zileuton",
    "Enzastaurin",
    "IOX2",
    "Alvespimycin",
    "XAV-939",
    "Fasudil"
]
ood = adata.obs['condition'].isin(ho_drugs)
len(ho_drugs)

19

In [25]:
adata.obs.loc[ood & (adata.obs['dose_val'] == 1.0), 'split'] = 'ood'
test_idx = sc.pp.subsample(adata[adata.obs['split'] != 'ood'], .10, copy=True).obs.index
adata.obs.loc[test_idx, 'split'] = 'test'
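The split logic can be sketched without scanpy: held-out drugs at the highest dose become `ood`, and 10% of the remaining cells are drawn at random as `test`. The sketch uses NumPy sampling in place of `sc.pp.subsample`, and the toy table and seed are assumptions.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Toy obs table with 100 cells; contents are invented for illustration.
obs = pd.DataFrame(
    {
        "condition": ["Fasudil", "Fasudil", "Mesna", "control"] * 25,
        "dose_val": [1.0, 0.1, 1.0, 1.0] * 25,
    }
)
ho_drugs = ["Fasudil"]

# Held-out drugs are only out-of-distribution at the highest dose.
obs["split"] = "train"
ood = obs["condition"].isin(ho_drugs) & (obs["dose_val"] == 1.0)
obs.loc[ood, "split"] = "ood"

# Draw 10% of the non-ood cells as the test set.
candidates = obs.index[obs["split"] != "ood"].to_numpy()
test_idx = rng.choice(candidates, size=int(0.10 * len(candidates)), replace=False)
obs.loc[test_idx, "split"] = "test"

print(obs["split"].value_counts())
```

Lower doses of the held-out drugs stay in `train` or `test`, so the model still sees those compounds during training.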

In [26]:
pd.crosstab(adata.obs['split'], adata.obs['condition'])

condition  (+)-JQ1  2-Methoxyestradiol  A-366  ABT-737  AC480  AG-14361  AG-490  AICAR  AMG-900  AR-42  ...  Valproic  Vandetanib  Veliparib  WHI-P154  WP1066  XAV-939  YM155    ZM  Zileuton  control
split
ood              0                   0      0        0      0       355       0      0        0      0  ...         0           0          0         0       0      249      0     0       403        0
test           168                 141    170      133    165       137     189    173      149    156  ...       192         142        152       180     190      120     40   134       127      664
train         1338                1346   1511     1261   1490      1162    1559   1650     1187   1250  ...      1612        1225       1445      1584    1661     1057    354  1260      1198     5800


In [27]:
adata.obs['split'].value_counts()

train    256214
test      28468
ood        6206
Name: split, dtype: int64

In [28]:
adata[adata.obs.split == 'ood'].obs.condition.value_counts()

Fasudil         474
Mesna           464
IOX2            444
Entacapone      433
Fulvestrant     417
Zileuton        403
Azacitidine     385
Carmofur        379
Enzastaurin     366
AG-14361        355
Pracinostat     318
SNS-314         280
Crizotinib      256
XAV-939         249
Momelotinib     249
Cediranib       248
Obatoclax       195
Luminespib      194
Alvespimycin     97
Name: condition, dtype: int64

In [29]:
adata[adata.obs.split == 'test'].obs.condition.value_counts()

control         664
ENMD-2076       280
MK-0752         202
RG108           196
Ramelteon       195
               ... 
Flavopiridol     68
Luminespib       68
Patupilone       65
Epothilone       56
YM155            40
Name: condition, Length: 188, dtype: int64

Also add a split (`split_all`) that sees all data, i.e. without an `ood` group:

In [30]:
adata.obs['split_all'] = 'train'
test_idx = sc.pp.subsample(adata, .10, copy=True).obs.index
adata.obs.loc[test_idx, 'split_all'] = 'test'

In [31]:
adata.obs['ct_dose'] = adata.obs.cell_type.astype('str') + '_' + adata.obs.dose_val.astype('str')

Round robin splits: dose and cell line combinations will be held out in turn.

In [32]:
i = 0
split_dict = {}

In [33]:
# single ct holdout
for ct in adata.obs.cell_type.unique():
    for dose in adata.obs.dose_val.unique():
        i += 1
        split_name = f'split{i}'
        split_dict[split_name] = f'{ct}_{dose}'
        
        adata.obs[split_name] = 'train'
        adata.obs.loc[adata.obs.ct_dose == f'{ct}_{dose}', split_name] = 'ood'
        
        test_idx = sc.pp.subsample(adata[adata.obs[split_name] != 'ood'], .16, copy=True).obs.index
        adata.obs.loc[test_idx, split_name] = 'test'
        
        display(adata.obs[split_name].value_counts())

train    229595
test      43732
ood       17561
Name: split1, dtype: int64

train    228515
test      43526
ood       18847
Name: split2, dtype: int64

train    229387
test      43692
ood       17809
Name: split3, dtype: int64

train    229731
test      43758
ood       17399
Name: split4, dtype: int64

train    214593
test      40874
ood       35421
Name: split5, dtype: int64

train    212368
test      40450
ood       38070
Name: split6, dtype: int64

train    213013
test      40573
ood       37302
Name: split7, dtype: int64

train    214820
test      40917
ood       35151
Name: split8, dtype: int64

train    229287
test      43673
ood       17928
Name: split9, dtype: int64

train    227678
test      43367
ood       19843
Name: split10, dtype: int64

train    228686
test      43559
ood       18643
Name: split11, dtype: int64

train    230139
test      43835
ood       16914
Name: split12, dtype: int64

In [34]:
# double ct holdout
for cts in [('A549', 'MCF7'), ('A549', 'K562'), ('MCF7', 'K562')]:
    for dose in adata.obs.dose_val.unique():
        i += 1
        split_name = f'split{i}'
        split_dict[split_name] = f'{cts[0]}+{cts[1]}_{dose}'
        
        adata.obs[split_name] = 'train'
        adata.obs.loc[adata.obs.ct_dose == f'{cts[0]}_{dose}', split_name] = 'ood'
        adata.obs.loc[adata.obs.ct_dose == f'{cts[1]}_{dose}', split_name] = 'ood'
        
        test_idx = sc.pp.subsample(adata[adata.obs[split_name] != 'ood'], .16, copy=True).obs.index
        adata.obs.loc[test_idx, split_name] = 'test'
        
        display(adata.obs[split_name].value_counts())

train    199842
ood       52982
test      38064
Name: split13, dtype: int64

train    196536
ood       56917
test      37435
Name: split14, dtype: int64

train    198053
ood       55111
test      37724
Name: split15, dtype: int64

train    200204
ood       52550
test      38134
Name: split16, dtype: int64

train    214536
test      40863
ood       35489
Name: split17, dtype: int64

train    211847
test      40351
ood       38690
Name: split18, dtype: int64

train    213727
test      40709
ood       36452
Name: split19, dtype: int64

train    215523
test      41052
ood       34313
Name: split20, dtype: int64

train    199533
ood       53349
test      38006
Name: split21, dtype: int64

train    195699
ood       57913
test      37276
Name: split22, dtype: int64

train    197353
ood       55945
test      37590
Name: split23, dtype: int64

train    200612
ood       52065
test      38211
Name: split24, dtype: int64

In [35]:
# triple ct holdout
for dose in adata.obs.dose_val.unique():
    i += 1
    split_name = f'split{i}'

    split_dict[split_name] = f'all_{dose}'
    adata.obs[split_name] = 'train'
    adata.obs.loc[adata.obs.dose_val == dose, split_name] = 'ood'

    test_idx = sc.pp.subsample(adata[adata.obs[split_name] != 'ood'], .16, copy=True).obs.index
    adata.obs.loc[test_idx, split_name] = 'test'

    display(adata.obs[split_name].value_counts())

train    184782
ood       70910
test      35196
Name: split25, dtype: int64

train    179868
ood       76760
test      34260
Name: split26, dtype: int64

train    182393
ood       73754
test      34741
Name: split27, dtype: int64

train    185997
ood       69464
test      35427
Name: split28, dtype: int64

In [36]:
adata.uns['splits'] = split_dict
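The bookkeeping behind the 28 round-robin splits can be reproduced in a few lines: 3 cell lines × 4 doses, 3 pairs of cell lines × 4 doses, and 4 doses across all cell lines. The ordering of cell lines and doses below is an assumption, since the notebook takes it from `.unique()`, so the actual numbering may differ.

```python
from itertools import combinations

cell_types = ["A549", "MCF7", "K562"]  # order assumed for illustration
doses = [0.001, 0.01, 0.1, 1.0]        # order assumed for illustration

held_out = []
# Single and double cell-line holdouts, one split per dose.
for k in (1, 2):
    for cts in combinations(cell_types, k):
        for dose in doses:
            held_out.append(f"{'+'.join(cts)}_{dose}")
# Triple holdout: one dose is held out in all cell lines.
held_out += [f"all_{dose}" for dose in doses]

split_dict = {f"split{i}": name for i, name in enumerate(held_out, start=1)}
print(len(split_dict), split_dict["split1"], split_dict["split13"], split_dict["split28"])
```

Storing this mapping in `adata.uns['splits']` makes it possible to look up which combination a given `split<i>` column holds out.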

## Save adata

In [41]:
fname = PROJECT_DIR/'datasets'/'sciplex3_matched_genes_lincs.h5ad'

sc.write(fname, adata)

... storing 'cell_type' as categorical
... storing 'pathway' as categorical
... storing 'product_name' as categorical
... storing 'target' as categorical
... storing 'condition' as categorical
... storing 'drug_dose_name' as categorical
... storing 'cov_drug_dose_name' as categorical
... storing 'cov_drug' as categorical
... storing 'split' as categorical
... storing 'split_all' as categorical
... storing 'ct_dose' as categorical
... storing 'split1' as categorical
... storing 'split2' as categorical
... storing 'split3' as categorical
... storing 'split4' as categorical
... storing 'split5' as categorical
... storing 'split6' as categorical
... storing 'split7' as categorical
... storing 'split8' as categorical
... storing 'split9' as categorical
... storing 'split10' as categorical
... storing 'split11' as categorical
... storing 'split12' as categorical
... storing 'split13' as categorical
... storing 'split14' as categorical
... storing 'split15' as categorical
... storing 'split16

Check that it worked

In [42]:
sc.read(fname)

AnnData object with n_obs × n_vars = 290888 × 5894
    obs: 'cell_type', 'dose', 'dose_character', 'dose_pattern', 'g1s_score', 'g2m_score', 'pathway', 'pathway_level_1', 'pathway_level_2', 'product_dose', 'product_name', 'proliferation_index', 'replicate', 'size_factor', 'target', 'vehicle', 'batch', 'n_counts', 'dose_val', 'condition', 'drug_dose_name', 'cov_drug_dose_name', 'cov_drug', 'control', 'split', 'split_all', 'ct_dose', 'split1', 'split2', 'split3', 'split4', 'split5', 'split6', 'split7', 'split8', 'split9', 'split10', 'split11', 'split12', 'split13', 'split14', 'split15', 'split16', 'split17', 'split18', 'split19', 'split20', 'split21', 'split22', 'split23', 'split24', 'split25', 'split26', 'split27', 'split28'
    var: 'id', 'num_cells_expressed-0-0', 'num_cells_expressed-1-0', 'num_cells_expressed-1', 'gene_id', 'in_lincs', 'highly_variable', 'means', 'dispersions', 'dispersions_norm'
    uns: 'hvg', 'rank_genes_groups_cov', 'splits'

## Subselect to shared genes only

Subset to shared genes

In [39]:
adata_lincs = adata_lincs[:, adata_lincs.var.in_sciplex].copy() 

In [44]:
adata = adata[:, adata.var.in_lincs].copy()

Reindex the lincs dataset

In [58]:
lincs_ids = pd.Index(adata_lincs.var.gene_id)
new_idx = [lincs_ids.get_loc(gene_id) for gene_id in adata.var.gene_id] 

In [64]:
adata_lincs = adata_lincs[:, new_idx].copy()
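The reindexing looks up, for each sciplex gene, its position in the lincs gene list and then reorders lincs by those positions. A toy sketch with placeholder IDs:

```python
import pandas as pd

# Placeholder gene IDs; the real data uses Ensembl IDs.
lincs_gene_ids = pd.Index(["ENSG_C", "ENSG_A", "ENSG_B"])  # current lincs order
sciplex_gene_ids = ["ENSG_A", "ENSG_B", "ENSG_C"]          # target order

# Position of every sciplex gene within the lincs index.
new_idx = [lincs_gene_ids.get_loc(g) for g in sciplex_gene_ids]

# Indexing with these positions puts lincs into sciplex order.
reordered = lincs_gene_ids[new_idx]
print(new_idx, reordered.tolist())
```

`get_loc` only returns a single integer when the ID is unique in the index, so this approach assumes there are no duplicated `gene_id` values in lincs.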

## Save adata objects with shared genes only

The lincs genes have also been reordered to match the gene order of the sciplex data.

In [65]:
fname = PROJECT_DIR/'datasets'/'sciplex3_lincs_genes.h5ad'

sc.write(fname, adata)

In [66]:
fname_lincs = PROJECT_DIR/'datasets'/'lincs_full_smiles_sciplex_genes.h5ad'

sc.write(fname_lincs, adata_lincs)