# Iterative single-cell multi-omic integration using online learning

In [1]:
# Load the required modules
import pyliger
from anndata import read_h5ad

# Scenario 1: sampling minibatches from fully observed datasets

We first create a Liger object by passing the filenames of HDF5 files containing the raw count data. The data can be downloaded here. Liger assumes by default that the HDF5 files are formatted by the 10X CellRanger pipeline. Large datasets are often generated over multiple 10X runs (for example, multiple biological replicates). In such cases it may be necessary to merge the HDF5 files from each run into a single HDF5 file. We provide the mergeH5 function for this purpose (see below for details).

In [11]:
ctrl_dge = read_h5ad('./pyliger/datasets/PBMC_control.h5ad', backed='r+')
stim_dge = read_h5ad('./pyliger/datasets/PBMC_interferon-stimulated.h5ad', backed='r+')

Variable names are not unique. To make them unique, call `.var_names_make_unique`.
Variable names are not unique. To make them unique, call `.var_names_make_unique`.


In [5]:
adata_list = [ctrl_dge, stim_dge]
ifnb_liger = pyliger.create_liger(adata_list)

<HDF5 sparse dataset: format 'csc', shape (3000, 35635), type '<f4'>

We then perform the normalization, gene selection, and gene scaling in an online fashion, reading the data from disk in small batches.

In [None]:
pyliger.normalize(ifnb_liger)
pyliger.select_genes(ifnb_liger)
pyliger.scale_not_center(ifnb_liger)
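
To make the idea of "online" preprocessing concrete, here is a minimal sketch of per-cell normalization done one chunk of rows at a time. It is not pyliger's implementation: it uses only NumPy on an in-memory toy matrix, and the function name `normalize_in_chunks`, the `chunk_size` value, and the cells-as-rows layout are assumptions made for illustration. With a disk-backed dataset, each slice would be read from the file instead of from memory.

```python
import numpy as np

def normalize_in_chunks(counts, chunk_size=1000):
    """Divide each row (assumed to be one cell) by its total count, chunk by chunk."""
    out = np.empty(counts.shape, dtype=np.float64)
    n_rows = counts.shape[0]
    for start in range(0, n_rows, chunk_size):
        stop = min(start + chunk_size, n_rows)
        # Only this slice needs to be held in memory at any one time.
        block = counts[start:stop].astype(np.float64)
        totals = block.sum(axis=1, keepdims=True)
        totals[totals == 0] = 1.0  # leave all-zero rows as zeros
        out[start:stop] = block / totals
    return out

# Toy data (hypothetical): 2500 cells x 50 genes of Poisson counts.
rng = np.random.default_rng(0)
toy_counts = rng.poisson(1.0, size=(2500, 50))
normalized = normalize_in_chunks(toy_counts, chunk_size=1000)
```

Because each row is normalized independently, the result does not depend on the chunk size, which is why this step can be streamed from disk.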

# Online Integrative Nonnegative Matrix Factorization

Now we can use online iNMF to factorize the data, again using only minibatches that we read from the HDF5 files on demand (default minibatch size = 5000). A sufficient number of iterations is crucial for obtaining a good factorization. If the minibatch size is set close to the size of the whole dataset (i.e. an epoch contains only one iteration), max.epochs needs to be increased accordingly to allow more iterations.

In [None]:
pyliger.quantile_norm(ifnb_liger)
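
The relationship between minibatch size, epochs, and the number of iterations described above can be worked through with a little arithmetic. The sketch below assumes that one epoch is one pass over all cells and that a partial final minibatch counts as an iteration; the helper `minibatch_iterations` and the cell counts are hypothetical and are not part of pyliger.

```python
import math

def minibatch_iterations(n_cells, minibatch_size, max_epochs):
    """Return (iterations per epoch, total iterations) under the stated assumptions."""
    iters_per_epoch = max(1, math.ceil(n_cells / minibatch_size))
    return iters_per_epoch, iters_per_epoch * max_epochs

# Hypothetical dataset of 30,000 cells.
small_batches = minibatch_iterations(30000, minibatch_size=5000, max_epochs=5)
one_big_batch = minibatch_iterations(30000, minibatch_size=30000, max_epochs=5)
print(small_batches)  # (6, 30)
print(one_big_batch)  # (1, 5)
```

In this toy case, a minibatch as large as the dataset yields only 5 updates instead of 30, so the number of epochs would have to rise to 30 to perform the same number of iterations.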