Fig. 5: Benchmark STEMNET
---

In this notebook, we run STEMNET on the pancreas data and extract its fate probabilities.

# Preliminaries

## Import packages

In [1]:
# import standard packages
import os
import sys

import pandas as pd

# import single-cell packages
import cellrank as cr
import scanpy as sc
import scvelo as scv
import anndata2ri

anndata2ri.activate()
%load_ext rpy2.ipython

In [2]:
%%R
library(STEMNET)

## Print package versions for reproducibility

In [3]:
cr.logging.print_versions()

cellrank==1.5.0+gc8c2b9f6 scanpy==1.8.1 anndata==0.7.6 numpy==1.20.0 numba==0.54.0 scipy==1.7.1 pandas==1.3.2 pygpcca==1.0.2 scikit-learn==0.24.2 statsmodels==0.12.2 python-igraph==0.9.6 scvelo==0.2.4 pygam==0.8.0 matplotlib==3.4.3 seaborn==0.11.2


In [4]:
%%R
packageVersion("STEMNET")

[1] ‘0.1’


## Set up paths

In [5]:
sys.path.insert(0, "../../../../")  # this depends on the notebook depth and must be adapted per notebook

from paths import DATA_DIR

## Load the data

In [6]:
adata = cr.datasets.pancreas(DATA_DIR / "pancreas" / "pancreas.h5ad")
adata.X = adata.X.toarray()  # densify X: the sparse matrix crashed anndata2ri, and sparsity is not needed here
adata

AnnData object with n_obs × n_vars = 2531 × 27998
    obs: 'day', 'proliferation', 'G2M_score', 'S_score', 'phase', 'clusters_coarse', 'clusters', 'clusters_fine', 'louvain_Alpha', 'louvain_Beta', 'palantir_pseudotime'
    var: 'highly_variable_genes'
    uns: 'clusters_colors', 'clusters_fine_colors', 'day_colors', 'louvain_Alpha_colors', 'louvain_Beta_colors', 'neighbors', 'pca'
    obsm: 'X_pca', 'X_umap'
    layers: 'spliced', 'unspliced'
    obsp: 'connectivities', 'distances'

### Preprocess the data

In [7]:
scv.pp.filter_and_normalize(adata, min_shared_counts=10, n_top_genes=3000)

Filtered out 20788 genes that are detected 10 counts (shared).
Normalized count data: X, spliced, unspliced.
Extracted 3000 highly variable genes.
Logarithmized X.


### Extract cluster information for STEMNET

In [8]:
clusters = ['Alpha', 'Beta', 'Epsilon', 'Delta']
# one boolean column per terminal cluster: True where a cell carries that label
cluster_pop = pd.DataFrame({c: adata.obs['clusters_fine'] == c for c in clusters})
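To make the shape of this table concrete, here is a minimal sketch on toy data. The labels and the names `toy_labels`, `toy_pop` and `unlabeled` are hypothetical stand-ins, not values from the pancreas dataset. Cells that are `False` in every column carry none of the four terminal labels.

```python
import pandas as pd

# Hypothetical stand-in for adata.obs['clusters_fine']
toy_labels = pd.Series(["Alpha", "Beta", "Ductal", "Delta", "Alpha"], name="clusters_fine")
terminal = ["Alpha", "Beta", "Epsilon", "Delta"]

# One boolean column per terminal cluster
toy_pop = pd.DataFrame({c: toy_labels == c for c in terminal})

# Cells that belong to none of the terminal clusters
unlabeled = ~toy_pop.any(axis=1)
print(toy_pop)
print(unlabeled)
```

In this toy table, only the third cell ("Ductal") is unlabeled, and the `Epsilon` column is `False` throughout because no toy cell has that label.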

# Analysis

## Convert the data

In [9]:
# drop fields that STEMNET does not use, so that anndata2ri has less to convert
del adata.uns['neighbors']
del adata.layers['spliced']
del adata.layers['unspliced']

In [10]:
%%R -i cluster_pop -i adata
pop <- booleanTable2Character(cluster_pop, other_value=NA)
expression <- t(as.matrix(adata@assays@data[['X']]))  # cells x genes
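Judging by its name and arguments, `booleanTable2Character` presumably collapses the boolean table into one label per cell, with `NA` for cells that belong to no terminal cluster. The following pandas sketch mimics that assumed behaviour on toy data. It is not STEMNET's implementation, and `toy_pop` and `toy_labels` are hypothetical names.

```python
import pandas as pd

# Hypothetical boolean table: three cells, two terminal populations
toy_pop = pd.DataFrame(
    {"Alpha": [True, False, False], "Beta": [False, False, True]}
)

# Column name of the True entry per row; rows without any True become missing
toy_labels = toy_pop.astype(int).idxmax(axis=1).where(toy_pop.any(axis=1))
print(toy_labels)
```

The middle cell has no `True` entry, so its label is missing. This mirrors the role that `other_value=NA` appears to play in the R call.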

## Run STEMNET

In [11]:
%%R
result <- runSTEMNET(expression, pop)

R[write to console]: At an optimal value of lambda, the misclassification rate for mature populations is 2.32%.



## Print the results

In [12]:
%%R
print(result)

Object of class stemnet with 1755 stem cells and 776 mature cells assigned to one of 4 target populations of the following sizes:

  Alpha    Beta   Delta Epsilon 
    259     308      70     139 
At an optimal value of lambda, the misclassification rate for mature populations is  2.32 %.
Posterior probability matrix (truncated):
          Alpha       Beta      Delta    Epsilon
[1,] 0.30317917 0.32638692 0.17695631 0.19347760
[2,] 0.44779088 0.22694945 0.08699444 0.23826523
[3,] 0.17926642 0.37152850 0.06225988 0.38694519
[4,] 0.89597540 0.02697837 0.01959861 0.05744762
[5,] 0.04244407 0.13833241 0.78069503 0.03852849
[6,] 0.10695675 0.79952606 0.04265684 0.05086034


## Extract the probabilities

In [13]:
%%R -o probs
probs <- result@posteriors

Wrap the posterior probabilities in a CellRank `Lineage` object and store it in `adata.obsm`.

In [14]:
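# STEMNET orders the posterior columns Alpha, Beta, Delta, Epsilon (see the printed matrix), unlike `clusters`
slin = cr.tl.Lineage(probs, names=sorted(clusters))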
adata.obsm['stemnet_terminal_states'] = slin
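As a quick sanity check on the posteriors, the sketch below uses the six rows printed in the truncated matrix above. It verifies that each row sums to one and reads off the most likely fate per cell. The column order follows the printed matrix; `toy_probs`, `stemnet_names` and `top_fate` are illustrative names, and the same check could be applied to the full `probs` array.

```python
import numpy as np

# Column order as printed by STEMNET
stemnet_names = ["Alpha", "Beta", "Delta", "Epsilon"]

# First six rows of the printed posterior matrix
toy_probs = np.array([
    [0.30317917, 0.32638692, 0.17695631, 0.19347760],
    [0.44779088, 0.22694945, 0.08699444, 0.23826523],
    [0.17926642, 0.37152850, 0.06225988, 0.38694519],
    [0.89597540, 0.02697837, 0.01959861, 0.05744762],
    [0.04244407, 0.13833241, 0.78069503, 0.03852849],
    [0.10695675, 0.79952606, 0.04265684, 0.05086034],
])

# Each row should be a probability distribution over the four fates
rows_sum_to_one = np.allclose(toy_probs.sum(axis=1), 1.0, atol=1e-6)

# Most likely fate per cell
top_fate = [stemnet_names[i] for i in toy_probs.argmax(axis=1)]
print(rows_sum_to_one, top_fate)
```

Reading the labels off by column position is why the names passed to `Lineage` must follow the column order of the posterior matrix.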

## Save the results

In [15]:
sc.write(DATA_DIR / "benchmarking" / "stemnet" / "adata.h5ad", adata)