# Benchmarking of integration methods
This notebook gives a short overview of how to use the scIB module and runs a short analysis of the Tabula Muris thymus and bone marrow data.

In [1]:
import scanpy as sc
import scIB
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

In [2]:
import scipy

In [3]:
import warnings
warnings.filterwarnings('ignore')

In [4]:
import copy

In [5]:
%matplotlib inline

In [6]:
batch = 'method'
hvg = True

In [7]:
file="/home/st/strobld/Masterinternship_2019/group_malte/tabula_muris/data/processed/merged_adata.h5ad"

## Read the data

In [8]:
adata = sc.read(file)

In [9]:
methods = {}
adatas = {}

## Select Highly variable genes
Set `hvg` to `True` if you want to preselect fewer than 4000 HVGs before running the integration methods. This is probably preferable, as many methods work better with a reduced dataset.

In [10]:
hvg = True
if hvg:
    # Subset only when HVG selection is enabled; otherwise `hvgs` would be undefined
    hvgs = scIB.preprocessing.hvg_intersect(adata, batch)
    adata = adata[:, hvgs]
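
The name `hvg_intersect` suggests that highly variable genes are selected per batch and then intersected. The following is a minimal sketch of that idea using only the standard library. The helpers `top_variable_genes` and `hvg_intersection`, the toy matrices, and ranking by plain variance are all assumptions made for illustration and do not reproduce the scIB implementation.

```python
from statistics import pvariance


def top_variable_genes(matrix, gene_names, n_top):
    """Return the n_top genes with the highest variance across cells (rows)."""
    variances = [pvariance(column) for column in zip(*matrix)]
    ranked = sorted(zip(variances, gene_names), reverse=True)
    return {gene for _, gene in ranked[:n_top]}


def hvg_intersection(batches, gene_names, n_top):
    """Keep only the genes that are highly variable in every batch."""
    per_batch = [
        top_variable_genes(matrix, gene_names, n_top)
        for matrix in batches.values()
    ]
    return set.intersection(*per_batch)


# Toy data (hypothetical): rows are cells, columns are genes g1..g3.
genes = ["g1", "g2", "g3"]
batches = {
    "batch_a": [[0, 5, 1], [9, 5, 1], [0, 5, 2]],
    "batch_b": [[1, 2, 0], [8, 2, 7], [1, 3, 0]],
}
shared = hvg_intersection(batches, genes, n_top=2)
print(sorted(shared))
```

Intersecting across batches keeps only genes that vary within every batch, so the resulting gene set can be much smaller than the per-batch limit.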

## Run the integration method
The functions for the integration methods are in `scIB.integration`. Generally, the methods expect an AnnData object and the batch key as input. The runtime and memory usage of the functions are measured using `scIB.metrics.measureTM`. This function returns the memory usage in MB, the runtime in s, and the output of the tested function.
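
To make the return convention (memory, runtime, output) concrete, here is a small stand-in built on the standard library. The helper `measure_time_memory` is hypothetical and is not how `scIB.metrics.measureTM` is implemented. Note that `tracemalloc` only tracks allocations made through Python's allocator, so its numbers can differ from process-level memory measurements.

```python
import time
import tracemalloc


def measure_time_memory(func, *args, **kwargs):
    """Run func and return (peak memory in MB, runtime in s, func output)."""
    tracemalloc.start()
    start = time.perf_counter()
    try:
        output = func(*args, **kwargs)
        runtime_s = time.perf_counter() - start
        _, peak_bytes = tracemalloc.get_traced_memory()
    finally:
        # Always stop tracing, even if func raises
        tracemalloc.stop()
    return peak_bytes / 1e6, runtime_s, output


# The third element is the return value of the wrapped function.
memory_mb, runtime_s, result = measure_time_memory(sorted, [3, 1, 2])
print(result)
```

Returning the wrapped function's output as the last element is what makes indexing such as `methods['scanorama'][2]` necessary in the cells that follow.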

In [11]:
methods['scanorama'] = scIB.metrics.measureTM(scIB.integration.runScanorama, adata, batch)

Found 1690 genes among all datasets
[[0.        0.5060898]
 [0.        0.       ]]
Processing datasets (0, 1)
memory usage:1052.0 MB
runtime: 16.0 s


In [12]:
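# measureTM returns (memory, runtime, output); index 2 is the output of runScanorama
result = methods['scanorama'][2][0]
adatas['scanorama'] = result[1]
# Store the first element of the result under 'X_pca' so that sc.pp.neighbors can use it
adatas['scanorama'].obsm['X_pca'] = result[0]
#sc.pp.pca(adatas['scanorama'], svd_solver='arpack')
sc.pp.neighbors(adatas['scanorama'])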

## Runtime analysis
Here, we compare the runtime and memory usage of all tested methods.

In [13]:
mem = methods['scanorama'][0]
time = methods['scanorama'][1]
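
When more than one method has been run, the same two fields can be collected for every entry of `methods`. The sketch below assumes each value is a `(memory, runtime, output)` tuple as described for `measureTM`. The helper `summarize_resources` is hypothetical, and the example dictionary reuses the Scanorama numbers printed above with the integration output replaced by `None`.

```python
def summarize_resources(methods):
    """Map each method name to its memory usage (MB) and runtime (s)."""
    return {
        name: {"memory_mb": result[0], "runtime_s": result[1]}
        for name, result in methods.items()
    }


# Stand-in for the notebook's `methods` dict (third tuple element omitted).
example = {"scanorama": (1052.0, 16.0, None)}
summary = summarize_resources(example)

# Print one row per method, fastest first.
for name, row in sorted(summary.items(), key=lambda item: item[1]["runtime_s"]):
    print(f"{name:<12}{row['memory_mb']:>10.1f} MB{row['runtime_s']:>8.1f} s")
```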

## Quantifying the quality of integration

### Silhouette score

In [14]:
sil = scIB.metrics.silhouette_score(adatas['scanorama'], batch, 'cell_ontology_class')[0]

mean silhouette over label means: 0.1603946515057824
mean silhouette per cell: 0.19777693774715072
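
The two numbers printed above are two ways of averaging silhouette widths. As a rough illustration, the sketch below computes the standard silhouette width `s = (b - a) / max(a, b)` for each point and then aggregates it per cell and over label means. The function `silhouette_per_cell`, the toy coordinates, and the Euclidean distance are assumptions. How `scIB.metrics.silhouette_score` uses the batch key is not shown in this notebook and is not reproduced here.

```python
import math


def silhouette_per_cell(points, labels):
    """Silhouette width s = (b - a) / max(a, b) for every point."""
    scores = []
    for i, (p, own) in enumerate(zip(points, labels)):
        # Collect distances to all other points, grouped by their label
        dists = {}
        for j, (q, lab) in enumerate(zip(points, labels)):
            if i != j:
                dists.setdefault(lab, []).append(math.dist(p, q))
        if own not in dists or len(dists) < 2:
            scores.append(0.0)  # singleton cluster or single label: use 0
            continue
        a = sum(dists[own]) / len(dists[own])  # mean distance within own cluster
        b = min(sum(d) / len(d) for lab, d in dists.items() if lab != own)
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return scores


def mean(values):
    values = list(values)
    return sum(values) / len(values)


# Toy embedding (hypothetical): two well-separated groups of two cells each.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 0.0), (5.0, 1.0)]
labels = ["T cell", "T cell", "B cell", "B cell"]

scores = silhouette_per_cell(points, labels)
per_cell = mean(scores)

by_label = {}
for score, lab in zip(scores, labels):
    by_label.setdefault(lab, []).append(score)
over_label_means = mean(mean(v) for v in by_label.values())

print(per_cell, over_label_means)
```

The two averages coincide here because both labels have the same number of cells. With unbalanced labels, the mean over label means gives small cell types the same weight as large ones.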


### NMI

In [15]:
# Cluster the integrated data, then compare the clusters with the cell type labels
sc.tl.louvain(adatas['scanorama'], key_added='louvain_post')
nmi = scIB.metrics.nmi(adatas['scanorama'], 'cell_ontology_class', 'louvain_post')

In [16]:
nmi

0.7693088352890597
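
Normalized mutual information compares two labelings of the same cells, here the annotated cell types and the Louvain clusters. The sketch below is a plain-Python version that normalizes the mutual information by the arithmetic mean of the two entropies. That normalization is an assumption, since other variants exist and the notebook does not state which one `scIB.metrics.nmi` uses. The function names and toy labels are illustrative.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (natural log) of a labeling."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def mutual_information(a, b):
    """Mutual information between two labelings of the same items."""
    n = len(a)
    count_a, count_b, joint = Counter(a), Counter(b), Counter(zip(a, b))
    return sum(
        (c / n) * math.log(c * n / (count_a[x] * count_b[y]))
        for (x, y), c in joint.items()
    )


def nmi(a, b):
    """Mutual information divided by the mean of the two entropies."""
    h_a, h_b = entropy(a), entropy(b)
    if h_a == 0 or h_b == 0:
        # Degenerate case: at least one labeling has a single group
        return 1.0 if h_a == h_b else 0.0
    return mutual_information(a, b) / ((h_a + h_b) / 2)


cell_types = ["T", "T", "B", "B"]
matching_clusters = ["0", "0", "1", "1"]   # same partition, different names
unrelated_clusters = ["0", "1", "0", "1"]  # independent of the cell types

nmi_matching = nmi(cell_types, matching_clusters)
nmi_unrelated = nmi(cell_types, unrelated_clusters)
print(nmi_matching, nmi_unrelated)
```

The score depends only on how the cells are grouped, not on the cluster names, which is why Louvain cluster IDs can be compared directly with cell type annotations.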

### PCR on HVGs

In [17]:
pcr_hvg = scIB.metrics.pcr_hvg(adata, adatas['scanorama'], 10, batch)



In [18]:
pcr_hvg

1.7135810845151214
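
Principal component regression (PCR) is commonly used to estimate how much of the variance in a dataset is associated with a covariate such as the batch. One common formulation regresses each principal component on the covariate and sums the resulting R² values weighted by the fraction of variance each component explains. The sketch below implements that formulation with NumPy for a categorical covariate. The function `pc_regression` and the toy data are assumptions. How `scIB.metrics.pcr_hvg` combines the values before and after integration is not shown in this notebook, so it is not reproduced here.

```python
import numpy as np


def pc_regression(X, batch, n_comps=10):
    """Variance-weighted R^2 of a categorical covariate over the top PCs."""
    X = np.asarray(X, dtype=float)
    batch = np.asarray(batch)
    Xc = X - X.mean(axis=0)  # center each gene
    U, S, _ = np.linalg.svd(Xc, full_matrices=False)
    total_var = np.sum(S ** 2)
    if total_var == 0:
        return 0.0
    n_comps = min(n_comps, len(S))
    scores = U[:, :n_comps] * S[:n_comps]           # PC scores per cell
    var_explained = S[:n_comps] ** 2 / total_var    # fraction per component

    r2 = np.zeros(n_comps)
    for k in range(n_comps):
        pc = scores[:, k]
        total_ss = np.sum((pc - pc.mean()) ** 2)
        if total_ss == 0:
            continue
        # For a categorical covariate, R^2 is between-group SS over total SS
        between_ss = sum(
            np.sum(batch == b) * (pc[batch == b].mean() - pc.mean()) ** 2
            for b in np.unique(batch)
        )
        r2[k] = between_ss / total_ss
    return float(np.sum(r2 * var_explained))


# Toy data (hypothetical): all variation follows the first batch assignment.
X = [[0, 0], [0, 0], [10, 10], [10, 10]]
pcr_confounded = pc_regression(X, ["a", "a", "b", "b"])
pcr_unrelated = pc_regression(X, ["a", "b", "a", "b"])
print(pcr_confounded, pcr_unrelated)
```

A value near 1 means that the leading components are almost entirely explained by the batch, while a value near 0 means the batch explains little of the captured variance.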