Benchmark GPCCA runtime
---

In this notebook, we benchmark the runtime of CellRank's GPCCA estimator. We also save the lineage
drivers and the terminal states, which are then passed to STEMNET and Palantir, respectively.

# Preliminaries

## Dependency notebooks

1. [../preprocessing_notebooks/MK_2020-10-16_preprocess_data.ipynb](../preprocessing_notebooks/MK_2020-10-16_preprocess_data.ipynb)

## Import packages

In [1]:
# import standard packages
from pathlib import Path
import sys

# import single-cell packages
import cellrank as cr
import scanpy as sc

# import utilities
import utils.utilities as utilities

## Print package versions for reproducibility

In [2]:
cr.logging.print_versions()

cellrank==1.0.0-rc.12 scanpy==1.6.0 anndata==0.7.4 numpy==1.19.2 numba==0.51.2 scipy==1.5.2 pandas==1.1.3 scikit-learn==0.23.2 statsmodels==0.12.0 python-igraph==0.8.2 scvelo==0.2.2 pygam==0.8.0 matplotlib==3.3.2 seaborn==0.11.0


## Set up paths

In [3]:
sys.path.insert(0, "../../..")  # this depends on the notebook depth and must be adapted per notebook

from paths import DATA_DIR

## Load the data

### Load the AnnData object

In [4]:
adata = sc.read(DATA_DIR / "morris_data" / "adata_preprocessed.h5ad")
adata

AnnData object with n_obs × n_vars = 104679 × 1500
    obs: 'batch', 'initial_size_spliced', 'initial_size_unspliced', 'initial_size', 'n_counts', 'velocity_self_transition'
    var: 'Accession', 'Chromosome', 'End', 'Start', 'Strand', 'gene_count_corr', 'means', 'dispersions', 'dispersions_norm', 'highly_variable', 'fit_r2', 'fit_alpha', 'fit_beta', 'fit_gamma', 'fit_t_', 'fit_scaling', 'fit_std_u', 'fit_std_s', 'fit_likelihood', 'fit_u0', 'fit_s0', 'fit_pval_steady', 'fit_steady_u', 'fit_steady_s', 'fit_variance', 'fit_alignment_scaling', 'velocity_genes'
    uns: 'neighbors', 'pca', 'recover_dynamics', 'velocity_graph', 'velocity_graph_neg', 'velocity_params'
    obsm: 'X_pca'
    varm: 'PCs', 'loss'
    layers: 'Ms', 'Mu', 'ambiguous', 'fit_t', 'fit_tau', 'fit_tau_', 'matrix', 'spliced', 'unspliced', 'velocity', 'velocity_u'
    obsp: 'connectivities', 'distances'

### Load the subsets and splits

In [5]:
dfs = utilities.get_split(DATA_DIR / "morris_data" / "splits")
dfs.keys()

dict_keys([10000, 20000, 30000, 40000, 50000, 60000, 70000, 80000, 90000, 100000])
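The keys above are the subset sizes (number of cells), and the benchmark output further down refers to numbered splits within each size. The implementation of `utilities.get_split` and the on-disk format of the splits are not shown in this notebook, so the following is only a minimal sketch of how reproducible random subsets of this shape could be generated. The function name `make_splits`, its parameters, and the toy sizes are all hypothetical.

```python
import random
from typing import Dict, List, Sequence


def make_splits(
    n_obs: int, sizes: Sequence[int], n_splits: int, seed: int = 0
) -> Dict[int, List[List[int]]]:
    """Hypothetical helper: draw `n_splits` random subsets of cell indices per size.

    Returns a dict mapping each subset size to a list of sorted index lists.
    This is an assumption about the structure, not the format used by
    `utilities.get_split`.
    """
    rng = random.Random(seed)  # fixed seed so the subsets are reproducible
    splits: Dict[int, List[List[int]]] = {}
    for size in sizes:
        # sample without replacement, so each subset contains unique cells
        splits[size] = [
            sorted(rng.sample(range(n_obs), size)) for _ in range(n_splits)
        ]
    return splits


# toy example with far fewer cells than the real data set
example_splits = make_splits(n_obs=1000, sizes=[100, 200], n_splits=3)
```

Fixing the seed means that every method being benchmarked can be run on exactly the same cells for a given size and split.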

# Run the benchmarks

In [None]:
utilities.benchmark_gpcca(
    adata,
    dfs,
    path=DATA_DIR / "benchmarking" / "runtime_analysis" / "gpcca.pickle",
)

Subsetting data to `10000`, split `0`.
Recomputing neighbors
computing neighbors
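The body of `utilities.benchmark_gpcca` is not shown here. Judging only from the call and the log output above, it iterates over subset sizes and splits, runs the estimator on each subset, and writes the results to a pickle file. A generic timing harness with that shape might look like the sketch below. The name `benchmark_runtime`, its signature, and the toy workload are hypothetical and stand in for the actual GPCCA computation.

```python
import pickle
import tempfile
import time
from pathlib import Path
from typing import Callable, Dict, List, Sequence


def benchmark_runtime(
    func: Callable[[Sequence[int]], object],
    splits: Dict[int, List[Sequence[int]]],
    path: Path,
) -> Dict[int, List[float]]:
    """Hypothetical harness: time `func` on every split and pickle the timings."""
    results: Dict[int, List[float]] = {}
    for size, size_splits in splits.items():
        results[size] = []
        for split_id, indices in enumerate(size_splits):
            print(f"Subsetting data to `{size}`, split `{split_id}`.")
            start = time.perf_counter()  # monotonic, high-resolution clock
            func(indices)  # placeholder for the real workload on this subset
            results[size].append(time.perf_counter() - start)

    # create the output directory if needed, then save the wall-clock times
    path.parent.mkdir(parents=True, exist_ok=True)
    with open(path, "wb") as fout:
        pickle.dump(results, fout)
    return results


# toy usage: the workload simply sums the indices of each subset
toy_splits = {10: [range(10), range(10, 20)], 20: [range(20)]}
out_path = Path(tempfile.mkdtemp()) / "runtime_analysis" / "toy.pickle"
timings = benchmark_runtime(lambda idx: sum(idx), toy_splits, out_path)
```

`time.perf_counter` measures wall-clock time only. Tracking peak memory as well would need a separate tool.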
