In [1]:
import os
import scanpy as sc

How do we select the genes for which we calculate z-scores?

Possible options
- z-scores for complete gene space (8000 genes)
    - for all perturbations (~200)
    - for each cell line: difference between z-scores (or log-fold changes) for prediction and each cell line
    - harder tasks: TGFB1 & INS - should we select cell lines here or use the randomly selected combinations?
- z-scores for all combinations in Figure 3
    - includes conserved perturbation programs (IFNG & IFNB) and less conserved programs (TGFB1 & INS)
    - pre-selected/no further analysis necessary and somewhat motivated, but could be criticised?
- perturbation programs for each pathway returned by MultiCCA
    - MultiCCA can return multiple perturbation programs for each pathway
    - but: identifies conserved programs across cell lines, so not really what we want to analyse
    - could be an interesting point: for INS pathway, MultiCCA failed to return clear perturbation programs due to extensive heterogeneity and minimal conservation in cell-type-specific response
        - for this, cell-line-specific perturbation responses were learned! --> see Supplementary Table 4
- alternatively, we could try to distinguish pathway responses --> z-scores between all pathways and predictions
    - Figure 4 contains other examples from the paper we could use here, especially 4 d+e
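
To make the first option concrete, here is a minimal sketch of per-gene z-scores against control cells, and of the difference between predicted and observed z-scores. The z-score definition used here (shift of the perturbed mean, scaled by the control standard deviation) is an assumption and not necessarily the one used in previous results. The function name `zscores_vs_control` and all arrays are hypothetical toy data.

```python
import numpy as np


def zscores_vs_control(perturbed, control, eps=1e-8):
    """Per-gene z-score of the perturbed mean relative to control cells.

    Assumed definition: (mean_perturbed - mean_control) / std_control.
    Both inputs are (cells x genes) arrays.
    """
    mu = control.mean(axis=0)
    sd = control.std(axis=0)
    return (perturbed.mean(axis=0) - mu) / (sd + eps)


# Toy data: 8 genes, the first two respond to the perturbation.
rng = np.random.default_rng(0)
n_genes = 8
shift = np.zeros(n_genes)
shift[:2] = 2.0

control = rng.normal(0.0, 1.0, size=(500, n_genes))
observed = rng.normal(0.0, 1.0, size=(300, n_genes)) + shift
# Hypothetical prediction that underestimates the true effect.
predicted = rng.normal(0.0, 1.0, size=(300, n_genes)) + 0.5 * shift

z_obs = zscores_vs_control(observed, control)
z_pred = zscores_vs_control(predicted, control)

# Per-gene difference between prediction and observation.
z_diff = z_pred - z_obs
print(np.round(z_diff, 2))
```

The same comparison could be run with log-fold changes instead of z-scores by swapping out the function. Restricting the gene space (Figure 3 combinations, pathway programs) would then only change which columns are passed in.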


--> first step: compare results for whole dataset BXPC3/IFNG to previous results
--> then continue & discuss 

In [3]:
filtered_dataset = "/lustre/groups/ml01/workspace/ot_perturbation/data/satija/datasets/ood_cell_type/satija_merged/merged_05.h5ad"
adata = sc.read_h5ad(filtered_dataset)

  utils.warn_names_duplicates("obs")


In [11]:
# Divisor for the mean; assumes five pathways per cell line (hard-coded in the original)
N_PATHWAYS = 5

for ct in adata.obs['cell_type'].unique():
    selected = adata[adata.obs['cell_type'] == ct]
    # Sum the number of distinct knockouts (genes) over all pathways of this cell line
    conditions_for_evaluation = 0
    for pt in selected.obs['pathway'].unique():
        conditions = selected.obs[selected.obs['pathway'] == pt]['gene'].unique().shape[0]
        conditions_for_evaluation += conditions
    print(f"{ct} - mean nr of knockouts: {conditions_for_evaluation/N_PATHWAYS}")

A549 - mean nr of knockouts: 22.0
BXPC3 - mean nr of knockouts: 49.6
HAP1 - mean nr of knockouts: 28.2
HT29 - mean nr of knockouts: 29.0
K562 - mean nr of knockouts: 31.0
MCF7 - mean nr of knockouts: 40.2
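
The same count can be sketched with a pandas `groupby` instead of nested loops. This is a toy example, not the cell above: the `obs` table and its gene names are made up, and the mean here divides by the number of pathways actually present for each cell line rather than by a fixed 5, so the two only agree when every cell line has all five pathways.

```python
import pandas as pd

# Toy stand-in for adata.obs (hypothetical values).
obs = pd.DataFrame({
    "cell_type": ["A549"] * 5 + ["K562"] * 4,
    "pathway": ["IFNG", "IFNG", "IFNG", "INS", "INS",
                "IFNG", "IFNG", "INS", "INS"],
    "gene": ["STAT1", "JAK1", "STAT1", "INSR", "IRS1",
             "STAT1", "JAK2", "INSR", "INSR"],
})

# Number of distinct knockouts per (cell line, pathway).
knockouts_per_pathway = obs.groupby(["cell_type", "pathway"])["gene"].nunique()

# Mean over the pathways present for each cell line.
mean_knockouts = knockouts_per_pathway.groupby(level="cell_type").mean()
print(mean_knockouts)
```

If the columns are categorical, as is common in `adata.obs`, passing `observed=True` to `groupby` avoids counting category combinations that never occur.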
