Benchmark STEMNET runtime
---

In this notebook, we benchmark the runtime of STEMNET's `runSTEMNET` function. We obtain the terminal
clusters it requires by first running CellRank's GPCCA estimator.

# Preliminaries

## Dependency notebooks

1. [MK_2020-10-16_gpcca.ipynb](MK_2020-10-16_gpcca.ipynb)

## Import packages

In [6]:
# import standard packages
from pathlib import Path
from collections import defaultdict
import sys

import pandas as pd
import pickle

import rpy2.robjects as ro
from rpy2.robjects import r

# import single-cell packages
import scanpy as sc
import anndata2ri
from anndata import AnnData

# import utilities
import utils.utilities as ul

anndata2ri.activate()

## Print package versions for reproducibility

In [8]:
sc.logging.print_header()
_ = r("""
library(R.utils)
library(STEMNET)
print(packageVersion("STEMNET"))
""")

scanpy==1.6.0 anndata==0.7.4 umap==0.4.6 numpy==1.19.2 scipy==1.5.2 pandas==1.1.3 scikit-learn==0.23.2 statsmodels==0.12.0 python-igraph==0.8.2 louvain==0.7.0 leidenalg==0.8.2
[1] ‘0.1’


## Set up paths

In [9]:
sys.path.insert(0, "../../..")  # this depends on the notebook depth and must be adapted per notebook

from paths import DATA_DIR

## Load the data

Load the preprocessed (i.e. filtered and normalized) data.

In [3]:
adata = sc.read(DATA_DIR / "morris_data" / "adata_preprocessed.h5ad")

AnnData object with n_obs × n_vars = 104679 × 1500

Remove all unnecessary annotations so that anndata2ri is faster during conversion.

obs_names, var_names = adata.obs_names, adata.var_names
adata = AnnData(adata.X)
adata.obs_names = obs_names
adata.var_names = var_names
adata

## Load the subsets and splits

In [None]:
dfs = ul.get_split(DATA_DIR / "morris_data" / "splits")
list(dfs.keys())

## Define utility functions

In [14]:
def benchmark_stemnet(dfs, path):
    """Time `runSTEMNET` on every subset size and split in `dfs`.

    `path` points to the pickled GPCCA results that supply the terminal clusters.
    Relies on the global `adata` loaded above.
    """
    res = defaultdict(lambda: defaultdict(lambda: defaultdict(list)))
    
    # load the GPCCA results, indexed as data[size][split]
    with open(path, 'rb') as fin:
        data = pickle.load(fin)
    
    for size, split in dfs.items():
        for col in split.columns:
            try:
                print(f"Subsetting data to `{size}`, split `{col}`.")
                ixs = split[col].values
                bdata = adata[ixs].copy()
                
                assert bdata.n_obs == size
                
                # build one boolean column per terminal cluster
                cluster_annot = data[size][col]["main_states"]
                clusters = cluster_annot.cat.categories
                cluster_pop = pd.DataFrame(dict(zip(clusters, [cluster_annot.isin([c]) for c in clusters])))
                
                # expose both objects to R (anndata2ri converts `bdata`)
                ro.globalenv["adata"] = bdata
                ro.globalenv["cluster_pop"] = cluster_pop
                
                print("Running STEMNET")
                
                # plain (non-f) string, so R braces need no doubling
                stem_time = r("""
                    pop <- booleanTable2Character(cluster_pop, other_value=NA)
                    expression <- t(as.matrix(adata@assays@data[['X']]))  # cells x genes
                    
                    runtime <- withTimeout({
                        start_time <- Sys.time()
                        result <- runSTEMNET(expression, pop)
                        end_time <- difftime(Sys.time(), start_time, units="secs")
                        end_time
                        },
                        timeout=60 * 60 * 3,  # 3-hour threshold
                        onTimeout="silent"
                    )
                    runtime
                """)
                
                # on timeout, `withTimeout` returns NULL, the conversion below fails,
                # and the split is reported in the `except` branch
                res[size][col]['stem_time'] = float(stem_time)
                # save intermediate results after every successful run
                ul.save_results(res, DATA_DIR / "benchmarking" / "runtime_analysis" / "stemnet.pickle")
                
            except Exception as e:
                print(f"Unable to run `STEMNET` with size `{size}` on split `{col}`. Reason: `{e}`.")
                continue
    
    return res
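
To make the `cluster_pop` construction above concrete, here is a minimal sketch on a toy annotation. The cell names and cluster labels are hypothetical; only the dictionary-of-`isin` pattern mirrors `benchmark_stemnet`. It shows that cells without a terminal-state label end up `False` in every column.

```python
import pandas as pd

# Hypothetical terminal-state annotation: one label per cell, NaN for unassigned cells.
cluster_annot = pd.Series(
    pd.Categorical(["A", None, "B", "A", None], categories=["A", "B"]),
    index=[f"cell_{i}" for i in range(5)],
)

clusters = cluster_annot.cat.categories
# One boolean column per cluster, built the same way as in `benchmark_stemnet`.
cluster_pop = pd.DataFrame({c: cluster_annot.isin([c]) for c in clusters})

# Cells that belong to no terminal cluster are False in every column.
unassigned = ~cluster_pop.any(axis=1)
print(cluster_pop)
print(unassigned)
```

In the benchmark, this table is what the R call hands to `booleanTable2Character` with `other_value=NA`.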

# Run the benchmarks

In [13]:
# we provide the terminal state clusters from GPCCA
res_stemnet = benchmark_stemnet(dfs, DATA_DIR / "benchmarking" / "runtime_analysis" / "gpcca.pickle")

Subsetting data to `10000`, split `0`.
Running STEMNET


R[write to console]: At an optimal value of lambda, the misclassification rate for mature populations is 0%.



Subsetting data to `10000`, split `1`.
Unable to run `StemNET` with size `10000` on split `1`. Reason: `'1'`.
Subsetting data to `10000`, split `2`.
Unable to run `StemNET` with size `10000` on split `2`. Reason: `'2'`.
Subsetting data to `10000`, split `3`.
Unable to run `StemNET` with size `10000` on split `3`. Reason: `'3'`.
Subsetting data to `10000`, split `4`.
Unable to run `StemNET` with size `10000` on split `4`. Reason: `'4'`.
Subsetting data to `10000`, split `5`.
Unable to run `StemNET` with size `10000` on split `5`. Reason: `'5'`.
Subsetting data to `10000`, split `6`.
Unable to run `StemNET` with size `10000` on split `6`. Reason: `'6'`.
Subsetting data to `10000`, split `7`.
Unable to run `StemNET` with size `10000` on split `7`. Reason: `'7'`.
Subsetting data to `10000`, split `8`.
Unable to run `StemNET` with size `10000` on split `8`. Reason: `'8'`.
Subsetting data to `10000`, split `9`.
Unable to run `StemNET` with size `10000` on split `9`. Reason: `'9'`.
Subsetting

Unable to run `StemNET` with size `80000` on split `4`. Reason: `80000`.
Subsetting data to `80000`, split `5`.
Unable to run `StemNET` with size `80000` on split `5`. Reason: `80000`.
Subsetting data to `80000`, split `6`.
Unable to run `StemNET` with size `80000` on split `6`. Reason: `80000`.
Subsetting data to `80000`, split `7`.
Unable to run `StemNET` with size `80000` on split `7`. Reason: `80000`.
Subsetting data to `80000`, split `8`.
Unable to run `StemNET` with size `80000` on split `8`. Reason: `80000`.
Subsetting data to `80000`, split `9`.
Unable to run `StemNET` with size `80000` on split `9`. Reason: `80000`.
Subsetting data to `90000`, split `0`.
Unable to run `StemNET` with size `90000` on split `0`. Reason: `90000`.
Subsetting data to `90000`, split `1`.
Unable to run `StemNET` with size `90000` on split `1`. Reason: `90000`.
Subsetting data to `90000`, split `2`.
Unable to run `StemNET` with size `90000` on split `2`. Reason: `90000`.
Subsetting data to `90000`, spl
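
The reasons printed above (`'1'`, `80000`) look like the string form of a `KeyError`, which would mean the GPCCA pickle has no entry for those size/split combinations. The sketch below is one way to list such gaps before starting a long run. `missing_entries` is a hypothetical helper, not part of `utils.utilities`, and the toy inputs only imitate the assumed `data[size][split]["main_states"]` layout.

```python
import pandas as pd


def missing_entries(dfs, data, key="main_states"):
    """Return (size, split) pairs that have no `key` entry in the nested dict `data`."""
    missing = []
    for size, split in dfs.items():
        for col in split.columns:
            try:
                data[size][col][key]
            except KeyError:
                missing.append((size, col))
    return missing


# Toy stand-ins for `dfs` and the unpickled GPCCA results (hypothetical values).
toy_dfs = {
    10000: pd.DataFrame(columns=["0", "1"]),
    80000: pd.DataFrame(columns=["0"]),
}
toy_data = {10000: {"0": {"main_states": "placeholder"}}}

print(missing_entries(toy_dfs, toy_data))
```

Note that this check assumes plain dictionaries; a `defaultdict` would silently create the missing keys instead of raising.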

## Save the results

In [10]:
ul.save_results(res_stemnet, DATA_DIR / "benchmarking" / "runtime_analysis" / "stemnet.pickle")
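
For downstream analysis, the nested `res[size][split]['stem_time']` structure can be flattened into a table. The following is a minimal sketch with made-up runtimes; `runtimes_to_frame` is a hypothetical helper and the numbers are illustrative only, not results from this benchmark.

```python
import pandas as pd


def runtimes_to_frame(res):
    """Flatten res[size][split]['stem_time'] into a long-format DataFrame."""
    rows = [
        {"size": size, "split": col, "stem_time": vals["stem_time"]}
        for size, splits in res.items()
        for col, vals in splits.items()
        if "stem_time" in vals  # skip splits that failed or timed out
    ]
    return pd.DataFrame(rows, columns=["size", "split", "stem_time"])


# Illustrative values only.
toy_res = {
    10000: {"0": {"stem_time": 12.5}, "1": {}},
    20000: {"0": {"stem_time": 40.0}},
}

df = runtimes_to_frame(toy_res)
print(df.groupby("size")["stem_time"].mean())
```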