**Requires**
* `'lincs_full_smiles.h5ad'`
* `'sciplex_raw_chunk_{i}.h5ad'` with $i \in \{0,1,2,3,4\}$

**Output**
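* `'sciplex3_matched_genes_lincs.h5ad'`: `sciplex` with the union of highly variable genes and genes shared with `lincs`
* `'sciplex3_lincs_genes.h5ad'`: `sciplex` restricted to genes that are shared with `lincs`
* `'lincs_full_smiles_sciplex_genes.h5ad'`: `lincs` restricted to genes that are shared with `sciplex`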

## Imports

In [1]:
import os 
import scanpy as sc
import matplotlib.pyplot as plt
import numpy as np 
import pandas as pd
import sfaira

sc.set_figure_params(dpi=80, frameon=False)
sc.logging.print_header()
os.getcwd()

from compert.paths import DATA_DIR, PROJECT_DIR

scanpy==1.9.0.dev41+g58f4904c anndata==0.7.6 umap==0.5.1 numpy==1.19.2 scipy==1.6.2 pandas==1.2.4 scikit-learn==0.24.2 statsmodels==0.12.2 python-igraph==0.9.1 louvain==0.7.0 pynndescent==0.5.2


In [2]:
%load_ext autoreload
%autoreload 2

## Load data

Load lincs

In [3]:
adata_lincs = sc.read(PROJECT_DIR/'datasets'/'lincs_full_smiles.h5ad' )

Load trapnell

In [4]:
adatas = []
for i in range(5):
    adatas.append(sc.read(PROJECT_DIR/'datasets'/f'sciplex_raw_chunk_{i}.h5ad'))
adata = adatas[0].concatenate(adatas[1:])

Add gene_id to trapnell

In [5]:
adata.var['gene_id'] = adata.var.id.str.split('.').str[0]
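
The cell above keeps only the part of each `id` before the first `.`, which drops the version suffix of a versioned Ensembl ID. A minimal plain-Python sketch of the same string operation follows; the version numbers are made up for illustration and `strip_version` is a hypothetical helper name.

```python
# Hypothetical versioned Ensembl IDs; the version suffixes are invented for this sketch.
versioned_ids = ["ENSG00000187838.12", "ENSG00000141510.11", "ENSG00000012048"]


def strip_version(ensembl_id: str) -> str:
    """Return the ID without its '.<version>' suffix (unchanged if there is none)."""
    return ensembl_id.split(".")[0]


gene_ids = [strip_version(x) for x in versioned_ids]
print(gene_ids)
```

IDs without a suffix pass through unchanged, so the operation is safe to apply to a mixed column.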

### Get gene ids from symbols via sfaira

Load genome container with sfaira

In [6]:
genome_container = sfaira.versions.genomes.GenomeContainer(organism="homo_sapiens", release="82")

Extend the symbols dict with a symbol that is missing from the genome container

In [7]:
symbols_dict = genome_container.symbol_to_id_dict
symbols_dict.update({'PLSCR3':'ENSG00000187838'})

Identify genes that are shared between lincs and trapnell

In [8]:
# For lincs
adata_lincs.var['gene_id'] = adata_lincs.var_names.map(symbols_dict)
adata_lincs.var['in_sciplex'] = adata_lincs.var.gene_id.isin(adata.var.gene_id)

In [9]:
# For trapnell
adata.var['in_lincs'] = adata.var.gene_id.isin(adata_lincs.var.gene_id)
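
The two cells above boil down to a symbol-to-ID lookup followed by a membership test in each direction. The sketch below mimics that logic with plain dicts and sets instead of pandas; the gene lists are toy stand-ins, and `UNKNOWN1` is a made-up symbol used to show what happens when a symbol has no ID.

```python
# Toy stand-in for `symbols_dict`; only a couple of entries, chosen for illustration.
symbols_dict = {"PLSCR3": "ENSG00000187838", "TP53": "ENSG00000141510"}

lincs_symbols = ["PLSCR3", "TP53", "UNKNOWN1"]  # stand-in for adata_lincs.var_names
sciplex_gene_ids = {"ENSG00000141510", "ENSG00000012048"}  # stand-in for adata.var.gene_id

# Symbols without an entry map to None here (pandas `.map` would give NaN instead).
lincs_gene_ids = [symbols_dict.get(symbol) for symbol in lincs_symbols]

# Flag lincs genes that also occur in sciplex.
in_sciplex = [gene_id in sciplex_gene_ids for gene_id in lincs_gene_ids]

# Flag sciplex genes that also occur in lincs.
lincs_id_set = {gene_id for gene_id in lincs_gene_ids if gene_id is not None}
in_lincs = {gene_id: gene_id in lincs_id_set for gene_id in sorted(sciplex_gene_ids)}

print(in_sciplex)
print(in_lincs)
```

Unmapped symbols never count as shared, so they drop out when the datasets are later subset to the common genes.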

## Preprocess trapnell dataset

See `sciplex3.ipynb`

The original CPA implementation required subsetting the data due to scaling limitations.   
In this version we expect to be able to handle the full sciplex dataset.

In [10]:
SUBSET = False

if SUBSET: 
    sc.pp.subsample(adata, fraction=0.5)

In [11]:
sc.pp.normalize_per_cell(adata)

In [12]:
sc.pp.log1p(adata)
sc.pp.highly_variable_genes(adata, n_top_genes=5000, subset=False)

### Combine HVG with lincs genes

Union of genes that are considered highly variable and those that are shared with lincs

In [13]:
((adata.var.in_lincs) | (adata.var.highly_variable)).sum()

5893

Subset to that union of genes

In [14]:
adata = adata[:, (adata.var.in_lincs) | (adata.var.highly_variable)].copy()
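
The gene selection is an element-wise OR of two boolean masks. A small sketch with made-up gene names and flags shows the effect:

```python
# Toy gene list with made-up flags; names are placeholders, not real genes.
genes = ["g1", "g2", "g3", "g4"]
in_lincs = [True, False, False, True]
highly_variable = [True, True, False, False]

# A gene is kept if it is shared with lincs OR highly variable.
keep = [shared or hvg for shared, hvg in zip(in_lincs, highly_variable)]
kept_genes = [gene for gene, flag in zip(genes, keep) if flag]

print(sum(keep), kept_genes)
```

Only `g3`, which is neither shared nor highly variable, is dropped.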

### Create additional metadata

Normalise dose values

In [15]:
adata.obs['dose_val'] = adata.obs.dose.astype(float) / np.max(adata.obs.dose.astype(float))
adata.obs.loc[adata.obs['product_name'].str.contains('Vehicle'), 'dose_val'] = 1.0
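
To make the normalisation concrete, here is a plain-Python sketch of the same two steps. The raw dose values (0 for vehicle, then 10 to 10000) and the drug name are assumptions for illustration only.

```python
# Assumed raw doses; the vehicle row is assumed to carry a dose of 0.
records = [
    {"product_name": "Vehicle", "dose": 0.0},
    {"product_name": "DrugA", "dose": 10.0},
    {"product_name": "DrugA", "dose": 100.0},
    {"product_name": "DrugA", "dose": 1000.0},
    {"product_name": "DrugA", "dose": 10000.0},
]

max_dose = max(r["dose"] for r in records)
for r in records:
    # Step 1: scale every dose by the maximum dose.
    r["dose_val"] = r["dose"] / max_dose
    # Step 2: overwrite vehicle rows with 1.0, as in the cell above.
    if "Vehicle" in r["product_name"]:
        r["dose_val"] = 1.0

dose_vals = [r["dose_val"] for r in records]
print(dose_vals)
```

Because vehicle rows are set to `1.0`, they share a `dose_val` with the highest drug dose and do not form a level of their own.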

In [16]:
adata.obs['dose_val'].value_counts()

0.001    153013
0.010    147670
0.100    141828
1.000    139266
Name: dose_val, dtype: int64

Shorten `product_name` to its first word and rename vehicle entries to `control`

In [17]:
adata.obs['product_name'] = [x.split(' ')[0] for x in adata.obs['product_name']]
adata.obs.loc[adata.obs['product_name'].str.contains('Vehicle'), 'product_name'] = 'control'

Create copy of `product_name` with column name `condition`

In [18]:
adata.obs['condition'] = adata.obs.product_name.copy()

Add combinations of drug (`condition`), dose (`dose_val`), and cell_type (`cell_type`)

In [19]:
adata.obs['drug_dose_name'] = adata.obs.condition.astype(str) + '_' + adata.obs.dose_val.astype(str)
adata.obs['cov_drug_dose_name'] = adata.obs.cell_type.astype(str) + '_' + adata.obs.drug_dose_name.astype(str)
adata.obs['cov_drug'] = adata.obs.cell_type.astype(str) + '_' + adata.obs.condition.astype(str)

Add a `control` column with value `1` where only the vehicle was used

In [20]:
adata.obs['control'] = [1 if x == 'control_1.0' else 0 for x in adata.obs.drug_dose_name.values]
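
The combined keys are plain string concatenations, and the `control` flag relies on the float `1.0` being rendered as the string `'1.0'`. A sketch with three made-up cells:

```python
# Three made-up cells; only the fields needed for the keys are included.
cells = [
    {"cell_type": "A549", "condition": "control", "dose_val": 1.0},
    {"cell_type": "MCF7", "condition": "Azacitidine", "dose_val": 0.1},
    {"cell_type": "K562", "condition": "Azacitidine", "dose_val": 1.0},
]

for c in cells:
    c["drug_dose_name"] = f'{c["condition"]}_{c["dose_val"]}'
    c["cov_drug_dose_name"] = f'{c["cell_type"]}_{c["drug_dose_name"]}'
    c["cov_drug"] = f'{c["cell_type"]}_{c["condition"]}'
    # Vehicle cells are the only ones whose key is exactly 'control_1.0'.
    c["control"] = 1 if c["drug_dose_name"] == "control_1.0" else 0

for c in cells:
    print(c["cov_drug_dose_name"], c["cov_drug"], c["control"])
```

A drug at the highest dose also has `dose_val == 1.0`, but its key starts with the drug name, so it is not flagged as control.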

## Compute DE genes

In [21]:
from compert.helper import rank_genes_groups_by_cov
rank_genes_groups_by_cov(adata, groupby='cov_drug', covariate='cell_type', control_group='control', key_added='all_DEGs')

Using backend: pytorch


A549


Trying to set attribute `.obs` of view, copying.
... storing 'cell_type' as categorical
Trying to set attribute `.obs` of view, copying.
... storing 'pathway' as categorical
Trying to set attribute `.obs` of view, copying.
... storing 'product_name' as categorical
Trying to set attribute `.obs` of view, copying.
... storing 'target' as categorical
Trying to set attribute `.obs` of view, copying.
... storing 'condition' as categorical
Trying to set attribute `.obs` of view, copying.
... storing 'drug_dose_name' as categorical
Trying to set attribute `.obs` of view, copying.
... storing 'cov_drug_dose_name' as categorical
Trying to set attribute `.obs` of view, copying.
... storing 'cov_drug' as categorical


MCF7


Trying to set attribute `.obs` of view, copying.
... storing 'cell_type' as categorical
Trying to set attribute `.obs` of view, copying.
... storing 'pathway' as categorical
Trying to set attribute `.obs` of view, copying.
... storing 'product_name' as categorical
Trying to set attribute `.obs` of view, copying.
... storing 'target' as categorical
Trying to set attribute `.obs` of view, copying.
... storing 'condition' as categorical
Trying to set attribute `.obs` of view, copying.
... storing 'drug_dose_name' as categorical
Trying to set attribute `.obs` of view, copying.
... storing 'cov_drug_dose_name' as categorical
Trying to set attribute `.obs` of view, copying.
... storing 'cov_drug' as categorical


K562


Trying to set attribute `.obs` of view, copying.
... storing 'cell_type' as categorical
Trying to set attribute `.obs` of view, copying.
... storing 'pathway' as categorical
Trying to set attribute `.obs` of view, copying.
... storing 'product_name' as categorical
Trying to set attribute `.obs` of view, copying.
... storing 'target' as categorical
Trying to set attribute `.obs` of view, copying.
... storing 'condition' as categorical
Trying to set attribute `.obs` of view, copying.
... storing 'drug_dose_name' as categorical
Trying to set attribute `.obs` of view, copying.
... storing 'cov_drug_dose_name' as categorical
Trying to set attribute `.obs` of view, copying.
... storing 'cov_drug' as categorical


In [22]:
adata_subset = adata[:, adata.var.in_lincs].copy()
rank_genes_groups_by_cov(adata_subset, groupby='cov_drug', covariate='cell_type', control_group='control', key_added='lincs_DEGs')
adata.uns['lincs_DEGs'] = adata_subset.uns['lincs_DEGs']

A549


Trying to set attribute `.obs` of view, copying.
... storing 'cell_type' as categorical
Trying to set attribute `.obs` of view, copying.
... storing 'pathway' as categorical
Trying to set attribute `.obs` of view, copying.
... storing 'product_name' as categorical
Trying to set attribute `.obs` of view, copying.
... storing 'target' as categorical
Trying to set attribute `.obs` of view, copying.
... storing 'condition' as categorical
Trying to set attribute `.obs` of view, copying.
... storing 'drug_dose_name' as categorical
Trying to set attribute `.obs` of view, copying.
... storing 'cov_drug_dose_name' as categorical
Trying to set attribute `.obs` of view, copying.
... storing 'cov_drug' as categorical


MCF7


Trying to set attribute `.obs` of view, copying.
... storing 'cell_type' as categorical
Trying to set attribute `.obs` of view, copying.
... storing 'pathway' as categorical
Trying to set attribute `.obs` of view, copying.
... storing 'product_name' as categorical
Trying to set attribute `.obs` of view, copying.
... storing 'target' as categorical
Trying to set attribute `.obs` of view, copying.
... storing 'condition' as categorical
Trying to set attribute `.obs` of view, copying.
... storing 'drug_dose_name' as categorical
Trying to set attribute `.obs` of view, copying.
... storing 'cov_drug_dose_name' as categorical
Trying to set attribute `.obs` of view, copying.
... storing 'cov_drug' as categorical


K562


Trying to set attribute `.obs` of view, copying.
... storing 'cell_type' as categorical
Trying to set attribute `.obs` of view, copying.
... storing 'pathway' as categorical
Trying to set attribute `.obs` of view, copying.
... storing 'product_name' as categorical
Trying to set attribute `.obs` of view, copying.
... storing 'target' as categorical
Trying to set attribute `.obs` of view, copying.
... storing 'condition' as categorical
Trying to set attribute `.obs` of view, copying.
... storing 'drug_dose_name' as categorical
Trying to set attribute `.obs` of view, copying.
... storing 'cov_drug_dose_name' as categorical
Trying to set attribute `.obs` of view, copying.
... storing 'cov_drug' as categorical


### Map all unique `cov_drug_dose_name` to the computed DEGs, independent of the dose value

Create mapping between names with dose and without dose

In [23]:
cov_drug_dose_unique = adata.obs.cov_drug_dose_name.unique()

In [24]:
remove_dose = lambda s: '_'.join(s.split('_')[:-1])
cov_drug = pd.Series(cov_drug_dose_unique).apply(remove_dose)
dose_no_dose_dict = dict(zip(cov_drug_dose_unique, cov_drug))
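
As a quick illustration of what `remove_dose` does, the sketch below applies the same expression to a few example keys (the keys themselves are made up). Only the last `_`-separated field is removed, so cell line and drug name stay intact.

```python
def remove_dose(name: str) -> str:
    """Drop the trailing '_<dose>' field from a 'celltype_drug_dose' key."""
    return "_".join(name.split("_")[:-1])


# Made-up keys in the 'cov_drug_dose_name' format.
example_keys = [
    "A549_Azacitidine_0.1",
    "A549_Azacitidine_1.0",
    "MCF7_SNS-314_0.01",
    "K562_control_1.0",
]

example_mapping = {key: remove_dose(key) for key in example_keys}
for key, value in example_mapping.items():
    print(key, "->", value)
```

Several dose-specific keys map to the same dose-free key, which is what allows one DEG list to be reused across all doses of a drug.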

### Compute new dicts for DEGs

In [25]:
uns_keys = ['all_DEGs', 'lincs_DEGs']

In [26]:
for uns_key in uns_keys:
    new_DEGs_dict = {}

    df_DEGs = pd.Series(adata.uns[uns_key])

    for key, value in dose_no_dose_dict.items():
        if 'control' in key:
            continue
        new_DEGs_dict[key] = df_DEGs.loc[value]
    adata.uns[uns_key] = new_DEGs_dict
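
The loop above re-keys each DEG dictionary from `cov_drug` to `cov_drug_dose_name` and skips control entries. A self-contained sketch with placeholder gene lists (the names `GENE1` etc. are invented) shows the resulting structure:

```python
# DEGs keyed without dose; the gene names are placeholders.
degs_by_cov_drug = {
    "A549_Azacitidine": ["GENE1", "GENE2"],
    "MCF7_Azacitidine": ["GENE3"],
}

# Mapping from dose-specific key to dose-free key, as built in the previous step.
dose_no_dose = {
    "A549_Azacitidine_0.1": "A549_Azacitidine",
    "A549_Azacitidine_1.0": "A549_Azacitidine",
    "MCF7_Azacitidine_0.01": "MCF7_Azacitidine",
    "A549_control_1.0": "A549_control",
}

# Every dose-specific key receives the DEG list of its dose-free key; controls are skipped.
degs_by_cov_drug_dose = {
    key: degs_by_cov_drug[value]
    for key, value in dose_no_dose.items()
    if "control" not in key
}

print(sorted(degs_by_cov_drug_dose))
```

All doses of the same drug and cell line end up pointing at the same gene list.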

In [27]:
adata

AnnData object with n_obs × n_vars = 581777 × 5893
    obs: 'cell_type', 'dose', 'dose_character', 'dose_pattern', 'g1s_score', 'g2m_score', 'pathway', 'pathway_level_1', 'pathway_level_2', 'product_dose', 'product_name', 'proliferation_index', 'replicate', 'size_factor', 'target', 'vehicle', 'batch', 'n_counts', 'dose_val', 'condition', 'drug_dose_name', 'cov_drug_dose_name', 'cov_drug', 'control'
    var: 'id', 'num_cells_expressed-0-0', 'num_cells_expressed-1-0', 'num_cells_expressed-1', 'gene_id', 'in_lincs', 'highly_variable', 'means', 'dispersions', 'dispersions_norm'
    uns: 'log1p', 'hvg', 'all_DEGs', 'lincs_DEGs'

## Split

**We omit these splits because we design our own; for reference, the original code is kept below, commented out for the moment.**

This is not the right configuration for the experiments we want, but for the moment this is okay.

In [28]:
# adata.obs['split'] = 'train'  # reset
# ho_drugs = [
#     # selection of drugs from various pathways
#     "Azacitidine",
#     "Carmofur",
#     "Pracinostat",
#     "Cediranib",
#     "Luminespib",
#     "Crizotinib",
#     "SNS-314",
#     "Obatoclax",
#     "Momelotinib",
#     "AG-14361",
#     "Entacapone",
#     "Fulvestrant",
#     "Mesna",
#     "Zileuton",
#     "Enzastaurin",
#     "IOX2",
#     "Alvespimycin",
#     "XAV-939",
#     "Fasudil"
# ]
# ood = adata.obs['condition'].isin(ho_drugs)
# len(ho_drugs)

In [29]:
# adata.obs.loc[ood & (adata.obs['dose_val'] == 1.0), 'split'] = 'ood'
# test_idx = sc.pp.subsample(adata[adata.obs['split'] != 'ood'], .10, copy=True).obs.index
# adata.obs.loc[test_idx, 'split'] = 'test'

In [30]:
# pd.crosstab(adata.obs['split'], adata.obs['condition'])

In [31]:
# adata.obs['split'].value_counts()

In [32]:
# adata[adata.obs.split == 'ood'].obs.condition.value_counts()

In [33]:
# adata[adata.obs.split == 'test'].obs.condition.value_counts()

Also a split which sees all data:

In [34]:
# adata.obs['split_all'] = 'train'
# test_idx = sc.pp.subsample(adata, .10, copy=True).obs.index
# adata.obs.loc[test_idx, 'split_all'] = 'test'

In [35]:
# adata.obs['ct_dose'] = adata.obs.cell_type.astype('str') + '_' + adata.obs.dose_val.astype('str')

Round robin splits: dose and cell line combinations will be held out in turn.

In [36]:
# i = 0
# split_dict = {}

In [37]:
# # single ct holdout
# for ct in adata.obs.cell_type.unique():
#     for dose in adata.obs.dose_val.unique():
#         i += 1
#         split_name = f'split{i}'
#         split_dict[split_name] = f'{ct}_{dose}'
        
#         adata.obs[split_name] = 'train'
#         adata.obs.loc[adata.obs.ct_dose == f'{ct}_{dose}', split_name] = 'ood'
        
#         test_idx = sc.pp.subsample(adata[adata.obs[split_name] != 'ood'], .16, copy=True).obs.index
#         adata.obs.loc[test_idx, split_name] = 'test'
        
#         display(adata.obs[split_name].value_counts())

In [38]:
# # double ct holdout
# for cts in [('A549', 'MCF7'), ('A549', 'K562'), ('MCF7', 'K562')]:
#     for dose in adata.obs.dose_val.unique():
#         i += 1
#         split_name = f'split{i}'
#         split_dict[split_name] = f'{cts[0]}+{cts[1]}_{dose}'
        
#         adata.obs[split_name] = 'train'
#         adata.obs.loc[adata.obs.ct_dose == f'{cts[0]}_{dose}', split_name] = 'ood'
#         adata.obs.loc[adata.obs.ct_dose == f'{cts[1]}_{dose}', split_name] = 'ood'
        
#         test_idx = sc.pp.subsample(adata[adata.obs[split_name] != 'ood'], .16, copy=True).obs.index
#         adata.obs.loc[test_idx, split_name] = 'test'
        
#         display(adata.obs[split_name].value_counts())

In [39]:
# # triple ct holdout
# for dose in adata.obs.dose_val.unique():
#     i += 1
#     split_name = f'split{i}'

#     split_dict[split_name] = f'all_{dose}'
#     adata.obs[split_name] = 'train'
#     adata.obs.loc[adata.obs.dose_val == dose, split_name] = 'ood'

#     test_idx = sc.pp.subsample(adata[adata.obs[split_name] != 'ood'], .16, copy=True).obs.index
#     adata.obs.loc[test_idx, split_name] = 'test'

#     display(adata.obs[split_name].value_counts())

In [40]:
# adata.uns['splits'] = split_dict
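
For orientation, the commented-out round-robin scheme can be enumerated without touching the data. The sketch below only builds the split names and their held-out descriptions; the order of cell lines and dose values is an assumption, since in the notebook it would come from `unique()`.

```python
from itertools import combinations

# Assumed orderings; in the notebook these would come from adata.obs.
cell_types = ["A549", "MCF7", "K562"]
dose_vals = [0.001, 0.01, 0.1, 1.0]

split_dict = {}
i = 0

# Single and double cell-line holdouts, one split per dose.
for n_held_out in (1, 2):
    for cts in combinations(cell_types, n_held_out):
        for dose in dose_vals:
            i += 1
            split_dict[f"split{i}"] = f'{"+".join(cts)}_{dose}'

# Triple holdout: one dose is held out across all cell lines.
for dose in dose_vals:
    i += 1
    split_dict[f"split{i}"] = f"all_{dose}"

print(len(split_dict))
```

With three cell lines and four dose levels this gives 12 single, 12 double, and 4 triple holdouts.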

## Save adata

In [41]:
fname = PROJECT_DIR/'datasets'/'sciplex3_matched_genes_lincs.h5ad'

sc.write(fname, adata)

... storing 'cell_type' as categorical
... storing 'pathway' as categorical
... storing 'product_name' as categorical
... storing 'target' as categorical
... storing 'condition' as categorical
... storing 'drug_dose_name' as categorical
... storing 'cov_drug_dose_name' as categorical
... storing 'cov_drug' as categorical


Check that it worked

In [42]:
sc.read(fname)

AnnData object with n_obs × n_vars = 581777 × 5893
    obs: 'cell_type', 'dose', 'dose_character', 'dose_pattern', 'g1s_score', 'g2m_score', 'pathway', 'pathway_level_1', 'pathway_level_2', 'product_dose', 'product_name', 'proliferation_index', 'replicate', 'size_factor', 'target', 'vehicle', 'batch', 'n_counts', 'dose_val', 'condition', 'drug_dose_name', 'cov_drug_dose_name', 'cov_drug', 'control'
    var: 'id', 'num_cells_expressed-0-0', 'num_cells_expressed-1-0', 'num_cells_expressed-1', 'gene_id', 'in_lincs', 'highly_variable', 'means', 'dispersions', 'dispersions_norm'
    uns: 'all_DEGs', 'hvg', 'lincs_DEGs'

## Subselect to shared genes only

Subset to shared genes

In [45]:
adata_lincs = adata_lincs[:, adata_lincs.var.in_sciplex].copy() 

In [44]:
adata = adata[:, adata.var.in_lincs].copy()

Reindex the lincs dataset

In [46]:
lincs_ids = pd.Index(adata_lincs.var.gene_id)
new_idx = [lincs_ids.get_loc(gene_id) for gene_id in adata.var.gene_id] 

In [47]:
adata_lincs = adata_lincs[:, new_idx].copy()
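
The reindexing step looks up, for every sciplex gene, the position of the same gene in lincs and then reorders the lincs columns by those positions. A plain-Python sketch with placeholder IDs follows; it assumes gene IDs are unique within lincs (with duplicates, `pd.Index.get_loc` returns a slice or boolean mask rather than a single integer).

```python
# Placeholder gene IDs; both lists hold the same genes in different order.
sciplex_ids = ["ENSG_C", "ENSG_A", "ENSG_B"]
lincs_ids = ["ENSG_A", "ENSG_B", "ENSG_C"]

# Position of each gene in lincs (the role of `lincs_ids.get_loc` in the notebook).
position = {gene_id: idx for idx, gene_id in enumerate(lincs_ids)}
new_idx = [position[gene_id] for gene_id in sciplex_ids]

# Reordering lincs by `new_idx` reproduces the sciplex gene order.
reordered_lincs_ids = [lincs_ids[idx] for idx in new_idx]

print(new_idx, reordered_lincs_ids)
```

After this step, column `j` refers to the same gene in both datasets.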

## Save adata objects with shared genes only
Index of lincs has also been reordered accordingly

In [48]:
fname = PROJECT_DIR/'datasets'/'sciplex3_lincs_genes.h5ad'

sc.write(fname, adata)

In [49]:
fname_lincs = PROJECT_DIR/'datasets'/'lincs_full_smiles_sciplex_genes.h5ad'

sc.write(fname_lincs, adata_lincs)