# Joint definition of cell types from multiple scRNA-seq datasets (Python version)

This notebook demonstrates the usage of the pyliger package.

In [1]:
# Load the following modules
import pyliger
from anndata import read_h5ad



# Stage I: Preprocessing and Normalization (3 - 5 seconds)

0. (unfinished) Example of reading in 10X Cell Ranger output using the read_10x function.

In [None]:
sample_dirs = ['10x_ctrl_outs', '10x_stim_outs']
sample_names = ['ctrl', 'stim']

adata_list = pyliger.read_10x(sample_dirs, sample_names, merge = False)

1. For the first portion of this protocol, we will integrate data from control and interferon-stimulated PBMCs from Kang et al., 2017. The data can be found in the Gene Expression Omnibus, Series GSE96583. This dataset was originally in the form of output from the 10X Cell Ranger pipeline, though we will directly load downsampled versions of the control and stimulated DGEs here.

    For convenience, we have prepared pre-processed data that are ready to use. There are two datasets: “PBMC_control.h5ad.gz” and “PBMC_interferon-stimulated.h5ad.gz”, which correspond to control and interferon-stimulated PBMCs, respectively.

In [2]:
ctrl_dge = read_h5ad('./pyliger/datasets/PBMC_control.h5ad')
stim_dge = read_h5ad('./pyliger/datasets/PBMC_interferon-stimulated.h5ad')

2. With the digital gene expression matrices for both datasets, we can initialize a pyliger object using the create_liger function.

In [3]:
adata_list = [ctrl_dge, stim_dge]
ifnb_liger = pyliger.create_liger(adata_list)

Removing 20756 genes not expressing in ctrl.
Removing 21057 genes not expressing in stim.


3. Before we can run iNMF on our datasets, we must run several preprocessing steps: normalize the expression data to account for differences in sequencing depth and efficiency between cells, identify variably expressed genes, and scale the data so that each gene has the same variance. Note that because non-negative matrix factorization requires non-negative values, we do not center the data by subtracting the mean. We also do not log-transform the data.

In [4]:
%matplotlib notebook
ifnb_liger = pyliger.normalize(ifnb_liger)
ifnb_liger = pyliger.select_genes(ifnb_liger)
ifnb_liger = pyliger.scale_not_center(ifnb_liger)
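
The normalization and scaling steps described above can be sketched on a toy matrix. This is a dependency-free illustration, not pyliger's implementation: the helper names `normalize_cells` and `scale_without_centering` are hypothetical, variable gene selection is omitted, and the root-mean-square scaling with an n - 1 denominator is an assumption modeled on R's `scale(center=FALSE)`.

```python
import math


def normalize_cells(counts):
    """Divide each cell (row) by its total count so every row sums to 1."""
    normalized = []
    for row in counts:
        total = sum(row)
        normalized.append([x / total if total > 0 else 0.0 for x in row])
    return normalized


def scale_without_centering(matrix):
    """Divide each gene (column) by its root-mean-square across cells.

    Assumption: the n - 1 denominator mirrors R's scale(center=FALSE).
    The mean is never subtracted, so non-negative input stays non-negative.
    """
    n_cells, n_genes = len(matrix), len(matrix[0])
    scaled = [row[:] for row in matrix]
    for g in range(n_genes):
        rms = math.sqrt(sum(row[g] ** 2 for row in matrix) / (n_cells - 1))
        for i in range(n_cells):
            scaled[i][g] = matrix[i][g] / rms if rms > 0 else 0.0
    return scaled


# Toy counts: 3 cells (rows) x 2 genes (columns).
toy_counts = [[1, 3], [2, 2], [0, 4]]
toy_normalized = normalize_cells(toy_counts)
toy_scaled = scale_without_centering(toy_normalized)
```

Because the mean is not subtracted, every entry of `toy_scaled` remains non-negative, which is the property the factorization step relies on.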

4. We are now able to run integrative non-negative matrix factorization on the normalized and scaled datasets. The key parameter for this analysis is k, the number of matrix factors (analogous to the number of principal components in PCA). In general, we find that a value of k between 20 and 40 is suitable for most analyses and that results are robust to the choice of k. Because LIGER is an unsupervised, exploratory approach, there is no single “right” value for k, and in practice, users choose k from a combination of biological prior knowledge and other information.

# Stage II: Joint Matrix Factorization (3 - 10 minutes)

In [None]:
#ifnb_liger = pyliger.optimizeALS(ifnb_liger, k = 20)

In [5]:
ifnb_liger = pyliger.iNMF_HALS(ifnb_liger, k = 20)

Initial Training Obj: 16203974.323134024
Iter: 1, Total time: 0.07978701591491699, Obj Delta: 0.1971392539179789
Iter: 2, Total time: 0.16704010963439941, Obj Delta: 0.04950030135821926
Iter: 3, Total time: 0.23955202102661133, Obj Delta: 0.01543513038237353
Iter: 4, Total time: 0.30680203437805176, Obj Delta: 0.007598938077226443
Iter: 5, Total time: 0.37578773498535156, Obj Delta: 0.0037406208245092418
Iter: 6, Total time: 0.45340776443481445, Obj Delta: 0.0022305599984215312
Iter: 7, Total time: 0.5306367874145508, Obj Delta: 0.0014441738163870508
Iter: 8, Total time: 0.6054568290710449, Obj Delta: 0.0010015129834642458
Iter: 9, Total time: 0.682945728302002, Obj Delta: 0.0007372005933361657
Iter: 10, Total time: 0.7512757778167725, Obj Delta: 0.0005766610587011725
Iter: 11, Total time: 0.8203978538513184, Obj Delta: 0.0004935486795893421
Iter: 12, Total time: 0.8899526596069336, Obj Delta: 0.00046948536536427053
Iter: 13, Total time: 0.964165449142456, Obj Delta: 0.0004885263625119

Important parameters are as follows:

    -k. Integer value specifying the inner dimension of factorization, or number of factors. Higher k is recommended for datasets with more substructure. We find that a value of k in the range 20 - 40 works well for most datasets. Because this is an unsupervised, exploratory analysis, there is no single “right” value for k, and in practice, users choose k from a combination of biological prior knowledge and other information.
    -lambda. This is a regularization parameter. Larger values penalize dataset-specific effects more strongly, causing the datasets to be better aligned, but possibly at the cost of higher reconstruction error. The default value is 5. We recommend using this value for most analyses, but find that it can be lowered to 1 in cases where the dataset differences are expected to be relatively small, such as scRNA-seq data from the same tissue but different individuals.
    -thresh. This sets the convergence threshold. Lower values cause the algorithm to run longer. The default is 1e-6.
    -max.iters. This variable sets the maximum number of iterations to perform. The default value is 30.

The optimization yields several lower-dimensional matrices, including the H matrix of metagene loadings for each cell, the W matrix of shared factor loadings, and the V matrices of dataset-specific factor loadings.
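
To make the roles of H, W, V, and lambda concrete, here is a minimal sketch of the iNMF objective: for each dataset i, the squared reconstruction error ||E_i - H_i (W + V_i)||_F^2 plus lambda times the penalty ||H_i V_i||_F^2 on the dataset-specific part. The function names are hypothetical and the matrix layout (cells x genes for E, cells x k for H, k x genes for W and V) is an assumption for illustration; pyliger's internal layout and solver may differ.

```python
def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def frobenius_sq(m):
    """Squared Frobenius norm: the sum of squared entries."""
    return sum(x * x for row in m for x in row)


def inmf_objective(E_list, H_list, W, V_list, lam):
    """Sum over datasets of reconstruction error plus lam * dataset-specific penalty."""
    total = 0.0
    for E, H, V in zip(E_list, H_list, V_list):
        # Shared plus dataset-specific factor loadings.
        w_plus_v = [[w + v for w, v in zip(w_row, v_row)] for w_row, v_row in zip(W, V)]
        recon = matmul(H, w_plus_v)
        residual = [[e - r for e, r in zip(e_row, r_row)] for e_row, r_row in zip(E, recon)]
        total += frobenius_sq(residual) + lam * frobenius_sq(matmul(H, V))
    return total


# Toy example with one dataset, two cells, two genes, and k = 1.
H = [[1.0], [2.0]]
W = [[1.0, 0.0]]
V = [[0.0, 1.0]]
E = [[1.0, 1.0], [2.0, 2.0]]  # exactly H (W + V), so the reconstruction error is zero
objective_no_penalty = inmf_objective([E], [H], W, [V], lam=0.0)
objective_default_lambda = inmf_objective([E], [H], W, [V], lam=5.0)
```

In this toy case the data are reconstructed exactly, so the whole objective comes from the penalty term. Raising lambda makes the dataset-specific loadings V more expensive, which is why larger values push the datasets toward the shared factors W.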

Please note that the time required for this step is highly dependent on the size of the datasets being used. In most cases, this step should not take much longer than 30 minutes.

5. We can now use the resulting factors to jointly cluster cells and perform quantile normalization by dataset, factor, and cluster to fully integrate the datasets. All of this functionality is encapsulated within the quantile_norm function, which uses max factor assignment followed by refinement using a k-nearest neighbors graph.

# Stage III: Quantile Normalization and Joint Clustering (1 minute)

In [6]:
ifnb_liger = pyliger.quantile_norm(ifnb_liger)
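
The two ideas behind this step, max factor assignment and quantile normalization, can be sketched as follows. This is a simplified illustration with hypothetical helper names: it matches a single vector of factor loadings to a reference by rank, whereas the library applies the idea per dataset, factor, and cluster and also refines the assignments with a k-nearest neighbors graph, which is not shown here.

```python
def max_factor_assignment(H):
    """Assign each cell (row of H) to the factor with the largest loading."""
    return [max(range(len(row)), key=lambda j: row[j]) for row in H]


def quantile_match(values, reference):
    """Map each value to the reference value at the same quantile.

    Ranks are converted to quantiles in [0, 1]; the reference quantile is
    found by linear interpolation between sorted reference values.
    """
    ref_sorted = sorted(reference)
    order = sorted(range(len(values)), key=lambda i: values[i])
    n, m = len(values), len(ref_sorted)
    matched = [0.0] * n
    for rank, idx in enumerate(order):
        q = rank / (n - 1) if n > 1 else 0.0
        pos = q * (m - 1)
        lo = int(pos)
        hi = min(lo + 1, m - 1)
        frac = pos - lo
        matched[idx] = ref_sorted[lo] * (1 - frac) + ref_sorted[hi] * frac
    return matched


# Two cells, two factors: the first cell loads most on factor 1, the second on factor 0.
toy_clusters = max_factor_assignment([[0.1, 0.9], [0.7, 0.2]])

# Loadings from one dataset are mapped onto the distribution of a reference dataset.
toy_matched = quantile_match([10.0, 30.0, 20.0], [1.0, 2.0, 3.0])
```

After matching, the loadings keep their original rank order but follow the reference distribution, which is what makes factor loadings comparable across datasets.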

6.

In [None]:
ifnb_liger = pyliger.louvain_cluster(ifnb_liger, resolution = 0.25)

7. 

# Stage IV: Visualization (2 - 3 minutes) and Downstream Analysis (25 - 40 seconds)

In [None]:
ifnb_liger = pyliger.runUMAP(ifnb_liger, distance = 'cosine', n_neighbors = 30, min_dist = 0.3)

8.