# Joint definition of cell types from multiple scRNA-seq datasets (python version)

This notebook demonstrates how to use the pyliger package.

In [1]:
# Load the following modules
from pyliger import *
from anndata import read_h5ad

# Stage I: Preprocessing and Normalization (3 - 5 seconds)

0. (Unfinished) Example of reading 10X Cell Ranger output with the read10X function.

In [None]:
sample_dirs = ['10x_ctrl_outs', '10x_stim_outs']
sample_names = ['ctrl', 'stim']

adata_list = read10X(sample_dirs, sample_names, merge=False)

1. For the first portion of this protocol, we will integrate data from control and interferon-stimulated PBMCs from Kang et al., 2017. The data can be found in the Gene Expression Omnibus, Series GSE96583. This dataset was originally output from the 10X Cell Ranger pipeline, but here we directly load downsampled versions of the control and stimulated digital gene expression matrices (DGEs).

    For convenience, we have prepared pre-processed data that are ready to use. There are two datasets, “PBMC_control.h5ad.gz” and “PBMC_interferon-stimulated.h5ad.gz”, which correspond to control and interferon-stimulated PBMCs, respectively.

In [2]:
ctrl_dge = read_h5ad('./PBMC_control.h5ad.gz')
stim_dge = read_h5ad('./PBMC_interferon-stimulated.h5ad.gz')

2. With the digital gene expression matrices for both datasets, we can initialize a pyliger object using the createLiger function.

In [3]:
adata_list = [ctrl_dge, stim_dge]
ifnb_liger = createLiger(adata_list)

Removing 20756 genes not expressing in ctrl.
Removing 21057 genes not expressing in stim.


3. Before we can run iNMF on our datasets, we must run several preprocessing steps: normalize the expression data to account for differences in sequencing depth and efficiency between cells, identify variably expressed genes, and scale the data so that each gene has the same variance. Because nonnegative matrix factorization requires nonnegative values, we do not center the data by subtracting the mean. We also do not log-transform the data.

In [4]:
ifnb_liger = normalize(ifnb_liger)
ifnb_liger = selectGenes(ifnb_liger)
ifnb_liger = scaleNotCenter(ifnb_liger)
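The normalization and scaling steps described above can be sketched on a toy matrix with NumPy. This is a minimal illustration, not pyliger's implementation: the helper names `normalize_cells` and `scale_without_centering`, the toy counts, and the cells-by-genes orientation are assumptions made for this example, and the exact scaling factor pyliger uses may differ. Variable gene selection is omitted.

```python
import numpy as np


def normalize_cells(counts):
    """Divide each cell's counts by that cell's total count.

    Assumes rows are cells and columns are genes (hypothetical layout).
    """
    totals = counts.sum(axis=1, keepdims=True)
    return counts / totals


def scale_without_centering(norm):
    """Divide each gene by its standard deviation without subtracting the mean.

    Every gene ends up with unit variance, and all values stay nonnegative.
    """
    std = norm.std(axis=0, ddof=1)
    std[std == 0] = 1.0  # leave constant genes unchanged instead of dividing by zero
    return norm / std


# Toy raw counts: 4 cells x 3 genes (made-up numbers for illustration only)
counts = np.array(
    [
        [1.0, 0.0, 3.0],
        [2.0, 2.0, 0.0],
        [0.0, 4.0, 4.0],
        [5.0, 1.0, 2.0],
    ]
)

normalized = normalize_cells(counts)
scaled = scale_without_centering(normalized)

print(normalized.sum(axis=1))        # each cell now sums to 1
print(scaled.var(axis=0, ddof=1))    # each gene now has unit variance
print(scaled.min() >= 0)             # no negative values were introduced
```

Because the mean is never subtracted, zeros in the input remain zeros and no entry becomes negative, which is what a nonnegative factorization needs.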

4. 

# Stage II: Joint Matrix Factorization (3 - 10 minutes)

In [None]:
# Run iNMF with k = 20 factors
ifnb_liger = optimizeALS(ifnb_liger, k=20)

5. 

# Stage III: Quantile Normalization and Joint Clustering (1 minute)

In [None]:
ifnb_liger = quantile_norm(ifnb_liger)

6.

In [None]:
# Cluster cells jointly with the Louvain algorithm
ifnb_liger = louvainCluster(ifnb_liger, resolution=0.25)

7. 

# Stage IV: Visualization (2 - 3 minutes) and Downstream Analysis (25 - 40 seconds)

In [None]:
# Compute a UMAP embedding using cosine distance
ifnb_liger = runUMAP(ifnb_liger, distance='cosine', n_neighbors=30, min_dist=0.3)

8.