Benchmark STEMNET runtime
--

In this notebook, we benchmark the runtime of STEMNET's `runSTEMNET` function. The terminal-state
clusters are supplied by CellRank.

# Preliminaries

## Dependency notebooks

1. [MK_2020-10-16_gpcca.ipynb](MK_2020-10-16_gpcca.ipynb) - to extract the terminal states

## Import packages

In [1]:
# import standard packages
from pathlib import Path
from collections import defaultdict
import sys

import pandas as pd
import pickle

import rpy2.robjects as ro
from rpy2.robjects import r

# import single-cell packages
import scanpy as sc
import anndata2ri
from anndata import AnnData

# import utilities
import utils.utilities as ul

anndata2ri.activate()

## Print package versions for reproducibility

In [2]:
sc.logging.print_header()
_ = r("""
library(R.utils)
library(STEMNET)
print(packageVersion("STEMNET"))
""")

R[write to console]: Loading required package: R.oo

R[write to console]: Loading required package: R.methodsS3

R[write to console]: R.methodsS3 v1.8.0 (2020-02-14 07:10:20 UTC) successfully loaded. See ?R.methodsS3 for help.

R[write to console]: R.oo v1.23.0 successfully loaded. See ?R.oo for help.

R[write to console]: 
Attaching package: ‘R.oo’


R[write to console]: The following object is masked from ‘package:R.methodsS3’:

    throw


R[write to console]: The following objects are masked from ‘package:methods’:

    getClasses, getMethods


R[write to console]: The following objects are masked from ‘package:base’:

    attach, detach, load, save


R[write to console]: R.utils v2.10.1 successfully loaded. See ?R.utils for help.

R[write to console]: 
Attaching package: ‘R.utils’


R[write to console]: The following object is masked from ‘package:utils’:

    timestamp


R[write to console]: The following objects are masked from ‘package:base’:

    cat, commandArgs, getOption, i

scanpy==1.6.0 anndata==0.7.4 umap==0.4.6 numpy==1.19.2 scipy==1.5.2 pandas==1.1.3 scikit-learn==0.23.2 statsmodels==0.12.0 python-igraph==0.8.2 louvain==0.7.0 leidenalg==0.8.2
[1] ‘0.1’


## Set up paths

In [3]:
sys.path.insert(0, "../../..")  # this depends on the notebook depth and must be adapted per notebook

from paths import DATA_DIR

## Load the data

Load the preprocessed (i.e. filtered and normalized) data.

In [4]:
adata = sc.read(DATA_DIR / "morris_data" / "adata_preprocessed.h5ad")

Remove all unnecessary annotations so that anndata2ri is faster during conversion.

In [5]:
obs_names, var_names = adata.obs_names, adata.var_names
adata = AnnData(adata.X)
adata.obs_names = obs_names
adata.var_names = var_names
adata

AnnData object with n_obs × n_vars = 104679 × 1500

## Load the subsets and splits

In [6]:
dfs = ul.get_split(DATA_DIR / "morris_data" / "splits")
list(dfs.keys())

[10000, 20000, 30000, 40000, 50000, 60000, 70000, 80000, 90000, 100000]

## Define utility functions

In [7]:
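def benchmark_stemnet(dfs, path):
    """Time `runSTEMNET` on every subset size and split in `dfs`.

    `path` points to the pickled GPCCA results that supply the terminal states.
    Runtimes (in seconds) are stored as `res[size][split]['stem_time']`.
    """
    res = defaultdict(lambda: defaultdict(lambda: defaultdict(list)))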
    
    with open(path, 'rb') as fin:
        data = pickle.load(fin)
    
    for size, split in dfs.items():
        for col in split.columns:
            try:
                print(f"Subsetting data to `{size}`, split `{col}`.")
                ixs = split[col].values
                bdata = adata[ixs].copy()
                
                assert bdata.n_obs == size
                
                # old name: main_states
                cluster_annot = data[size][col]["terminal_states"]
                clusters = cluster_annot.cat.categories
                cluster_pop = pd.DataFrame(dict(zip(clusters, [cluster_annot.isin([c]) for c in clusters])))
                                
                ro.globalenv["adata"] = bdata
                ro.globalenv["cluster_pop"] = cluster_pop
                
                print("Running STEMNET")
                
                stem_time = r("""
                    pop <- booleanTable2Character(cluster_pop, other_value=NA)
                    expression <- t(as.matrix(adata@assays@data[['X']]))  # cells x genes
                    
                    runtime <- withTimeout({
                        start_time <- Sys.time()
                        result <- runSTEMNET(expression, pop)
                        end_time <- difftime(Sys.time(), start_time, units="secs")
                        end_time
                        },
                        timeout=60 * 60 * 3,  # 3 hours threshold
                        onTimeout="silent"
                    )
                    runtime
                """)
                
                res[size][col]['stem_time'] = float(stem_time)
                ul.save_results(res, DATA_DIR / "benchmarking" / "runtime_analysis" / "stemnet.pickle")
                
            except Exception as e:
                print(f"Unable to run `STEMNET` with size `{size}` on split `{col}`. Reason: `{e}`.")
                continue
    
    return res
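
The boolean table passed to `booleanTable2Character` has one column per terminal state and one row per cell. Below is a minimal sketch of that construction on toy data. It assumes, as the code above suggests, that the terminal-state annotation is a categorical `pandas.Series` with `NaN` for unassigned cells. The cell names and labels are made up for illustration.

```python
import pandas as pd

# Hypothetical terminal-state annotation: one categorical label per cell,
# missing (NaN) for cells that are not assigned to any terminal state.
cluster_annot = pd.Series(
    ["A", None, "B", "A", None],
    index=[f"cell_{i}" for i in range(5)],
    dtype="category",
)

clusters = cluster_annot.cat.categories

# One boolean column per terminal state, mirroring `benchmark_stemnet`.
cluster_pop = pd.DataFrame(
    dict(zip(clusters, [cluster_annot.isin([c]) for c in clusters]))
)

print(cluster_pop)
```

Rows that are `False` in every column correspond to cells without a terminal-state label. In the R call above, these are presumably the cells that receive `other_value=NA`.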

# Run the benchmarks

In [None]:
# we provide the terminal state clusters from GPCCA
res_stemnet = benchmark_stemnet(dfs, DATA_DIR / "benchmarking" / "runtime_analysis" / "gpcca.pickle")

## Save the results

In [10]:
ul.save_results(res_stemnet, DATA_DIR / "benchmarking" / "runtime_analysis" / "stemnet.pickle")
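
How `ul.save_results` serializes the results is not shown in this notebook. Note that the standard `pickle` module cannot serialize a `defaultdict` whose default factory is a lambda, so nested results like `res` need to be converted first. The following is a minimal sketch of one way to do that. The helper `to_plain_dict` and the runtime value are hypothetical and not part of `utils.utilities`.

```python
import pickle
from collections import defaultdict


def to_plain_dict(obj):
    """Recursively convert (default)dicts into plain dicts (hypothetical helper)."""
    if isinstance(obj, dict):
        return {key: to_plain_dict(value) for key, value in obj.items()}
    return obj


# Same nested structure as in `benchmark_stemnet`; the runtime is made up.
res = defaultdict(lambda: defaultdict(lambda: defaultdict(list)))
res[10000]["0"]["stem_time"] = 12.5

# Plain dicts carry no lambda factories, so they round-trip through pickle.
payload = pickle.dumps(to_plain_dict(res))
restored = pickle.loads(payload)
```

After the round trip, `restored[10000]["0"]["stem_time"]` holds the stored runtime, and missing keys raise a `KeyError` instead of silently creating new entries.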