# Identifying zero-inflated genes with AutoZI

AutoZI is a deep generative model adapted from scVI allowing a gene-specific treatment of zero-inflation. For each gene $g$, AutoZI notably learns the distribution of a random variable $\delta_g$ which denotes the probability that gene $g$ is not zero-inflated. In this notebook, we present the use of the model on a PBMC dataset.

More details about AutoZI can be found in : https://www.biorxiv.org/content/10.1101/794875v2

In [1]:
# The next cell is some code we use to keep the notebooks tested.
# Feel free to ignore!

In [2]:
def allow_notebook_for_test():
    print("Testing the totalVI notebook")

show_plot = True
test_mode = False
n_epochs_all = None
save_path = "data/"

if not test_mode:
    save_path = "../../data"

## Imports and data loading

In [3]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import torch
import anndata
import os

import scvi
from scvi.models import AutoZIVAE
from scvi.inference import UnsupervisedTrainer

  data = yaml.load(f.read()) or {}


In [4]:
pbmc = scvi.dataset.pbmc_dataset(save_path=os.path.join(save_path, "10X"), run_setup_anndata=False)
scvi.dataset.highly_variable_genes_seurat_v3(pbmc, n_top_genes=1000, subset=True)
scvi.dataset.setup_anndata(pbmc, labels_key="str_labels", batch_key="batch")

[2020-07-21 22:30:35,217] INFO - scvi.dataset._built_in_data._utils | File ../../data/10X/gene_info_pbmc.csv already downloaded
[2020-07-21 22:30:35,220] INFO - scvi.dataset._built_in_data._utils | File ../../data/10X/pbmc_metadata.pickle already downloaded
[2020-07-21 22:30:35,258] INFO - scvi.dataset._built_in_data._utils | File ../../data/10X/pbmc8k/filtered_gene_bc_matrices.tar.gz already downloaded
../../data/10X/pbmc8k/filtered_gene_bc_matrices/GRCh38
/Users/galen/scVI/tests/notebooks
[2020-07-21 22:30:54,176] INFO - scvi.dataset._built_in_data._utils | File ../../data/10X/pbmc4k/filtered_gene_bc_matrices.tar.gz already downloaded
../../data/10X/pbmc4k/filtered_gene_bc_matrices/GRCh38
/Users/galen/scVI/tests/notebooks
[2020-07-21 22:31:04,585] INFO - scvi.dataset._preprocessing | added
    'highly_variable', boolean vector (adata.var)
    'highly_variable_rank', float vector (adata.var)
    'means', float vector (adata.var)
    'variances', float vector (adata.var)
    'variances

## Analyze gene-specific ZI

In AutoZI, all $\delta_g$'s follow a common $\text{Beta}(\alpha,\beta)$ prior distribution where $\alpha,\beta \in (0,1)$ and the zero-inflation probability in the ZINB component is bounded below by $\tau_{\text{dropout}} \in (0,1)$. AutoZI is encoded by the `AutoZIVAE` class whose inputs, besides the size of the dataset, are $\alpha$ (`alpha_prior`), $\beta$ (`beta_prior`), $\tau_{\text{dropout}}$ (`minimal_dropout`). By default, we set $\alpha = 0.5, \beta = 0.5, \tau_{\text{dropout}} = 0.01$.

Note : we can learn $\alpha,\beta$ in an Empirical Bayes fashion, which is possible by setting `alpha_prior = None` and `beta_prior = None`

In [5]:
n_genes = pbmc.uns["scvi_summary_stats"]["n_genes"]
n_labels = pbmc.uns["scvi_summary_stats"]["n_labels"]
autozivae = AutoZIVAE(n_input=n_genes, alpha_prior=0.5, beta_prior=0.5, minimal_dropout=0.01)
autozitrainer = UnsupervisedTrainer(autozivae, pbmc)

We fit, for each gene $g$, an approximate posterior distribution $q(\delta_g) = \text{Beta}(\alpha^g,\beta^g)$ (with $\alpha^g,\beta^g \in (0,1)$) on which we rely. We retrieve $\alpha^g,\beta^g$ for all genes $g$ (and $\alpha,\beta$, if learned) as numpy arrays using the method `get_alphas_betas` of `AutoZIVAE`.

In [6]:
n_epochs = 200 if n_epochs_all is None else n_epochs_all
autozitrainer.train(n_epochs=n_epochs, lr=1e-2)

[2020-07-21 22:31:04,848] INFO - scvi.inference.inference | KL warmup phase exceeds overall training phaseIf your applications rely on the posterior quality, consider training for more epochs or reducing the kl warmup.
[2020-07-21 22:31:04,850] INFO - scvi.inference.inference | KL warmup for 400 epochs
training: 100%|██████████| 200/200 [24:13<00:00,  6.53s/it]
[2020-07-21 22:55:18,825] INFO - scvi.inference.inference | Training is still in warming up phase. If your applications rely on the posterior quality, consider training for more epochs or reducing the kl warmup.


In [7]:
outputs = autozivae.get_alphas_betas()
alpha_posterior = outputs['alpha_posterior']
beta_posterior = outputs['beta_posterior']

Now that we obtained fitted $\alpha^g,\beta^g$, different metrics are possible. Bayesian decision theory suggests us the posterior probability of the zero-inflation hypothesis $q(\delta_g < 0.5)$, but also other metrics such as the mean wrt $q$ of $\delta_g$ are possible. We focus on the former. We decide that gene $g$ is ZI if and only if $q(\delta_g < 0.5)$ is greater than a given threshold, say $0.5$. We may note that it is equivalent to $\alpha^g < \beta^g$. From this we can deduce the fraction of predicted ZI genes in the dataset.

In [8]:
from scipy.stats import beta

# Threshold (or Kzinb/Knb+Kzinb in paper)
threshold = 0.5

# q(delta_g < 0.5) probabilities
zi_probs = beta.cdf(0.5,alpha_posterior,beta_posterior)

# ZI genes
is_zi_pred = (zi_probs > threshold)

print('Fraction of predicted ZI genes :', is_zi_pred.mean())

Fraction of predicted ZI genes : 0.978


We noted that predictions were less accurate for genes $g$ whose average expressions - or predicted NB means, equivalently - were low. Indeed, genes assumed not to be ZI were more often predicted as ZI for such low average expressions. A threshold of 1 proved reasonable to separate genes predicted with more or less accuracy. Hence we may want to focus on predictions for genes with average expression above 1.

In [9]:
mask_sufficient_expression = (np.array(pbmc.X.mean(axis=0)) > 1.).reshape(-1)
print('Fraction of genes with avg expression > 1 :', mask_sufficient_expression.mean())
print('Fraction of predicted ZI genes with avg expression > 1 :', is_zi_pred[mask_sufficient_expression].mean())

Fraction of genes with avg expression > 1 : 0.059
Fraction of predicted ZI genes with avg expression > 1 : 0.6271186440677966


## Analyze gene-cell-type-specific ZI

One may argue that zero-inflation should also be treated on the cell-type (or 'label') level, in addition to the gene level. AutoZI can be extended by assuming a random variable $\delta_{gc}$ for each gene $g$ and cell type $c$ which denotes the probability that gene $g$ is not zero-inflated in cell-type $c$. The analysis above can be extended to this new scale.

In [10]:
# Model definition
autozivae_genelabel = AutoZIVAE(n_input=n_genes, alpha_prior=0.5, beta_prior=0.5, minimal_dropout=0.01,
                         dispersion='gene-label', zero_inflation='gene-label', n_labels=n_labels)
autozitrainer_genelabel = UnsupervisedTrainer(autozivae_genelabel, pbmc)

# Training
n_epochs = 200 if n_epochs_all is None else n_epochs_all
autozitrainer_genelabel.train(n_epochs=n_epochs, lr=1e-2)

# Retrieve posterior distribution parameters
outputs_genelabel = autozivae_genelabel.get_alphas_betas()
alpha_posterior_genelabel = outputs_genelabel['alpha_posterior']
beta_posterior_genelabel = outputs_genelabel['beta_posterior']

[2020-07-21 22:55:19,193] INFO - scvi.inference.inference | KL warmup phase exceeds overall training phaseIf your applications rely on the posterior quality, consider training for more epochs or reducing the kl warmup.
[2020-07-21 22:55:19,194] INFO - scvi.inference.inference | KL warmup for 400 epochs
training: 100%|██████████| 200/200 [25:21<00:00,  7.17s/it]
[2020-07-21 23:20:40,678] INFO - scvi.inference.inference | Training is still in warming up phase. If your applications rely on the posterior quality, consider training for more epochs or reducing the kl warmup.


In [11]:
# q(delta_g < 0.5) probabilities
zi_probs_genelabel = beta.cdf(0.5,alpha_posterior_genelabel,beta_posterior_genelabel)

# ZI gene-cell-types
is_zi_pred_genelabel = (zi_probs_genelabel > threshold)

ct = pbmc.obs.str_labels.astype("category")
for ind_cell_type, cell_type in zip(ct.cat.codes, ct.cat.categories):
    
    is_zi_pred_genelabel_here = is_zi_pred_genelabel[:,ind_cell_type]
    print('Fraction of predicted ZI genes for cell type {} :'.format(cell_type),
          is_zi_pred_genelabel_here.mean(),'\n')

Fraction of predicted ZI genes for cell type B cells : 0.913 

Fraction of predicted ZI genes for cell type CD14+ Monocytes : 0.913 

Fraction of predicted ZI genes for cell type CD4 T cells : 0.733 

Fraction of predicted ZI genes for cell type CD8 T cells : 0.733 

Fraction of predicted ZI genes for cell type Dendritic Cells : 0.783 

Fraction of predicted ZI genes for cell type FCGR3A+ Monocytes : 0.913 

Fraction of predicted ZI genes for cell type Megakaryocytes : 0.733 

Fraction of predicted ZI genes for cell type NK cells : 0.913 

Fraction of predicted ZI genes for cell type Other : 0.913 



In [12]:
# With avg expressions > 1
for ind_cell_type, cell_type in zip(ct.cat.codes, ct.cat.categories):
    mask_sufficient_expression = (np.array(pbmc.X[pbmc.obs.str_labels.values.reshape(-1) == cell_type,:].mean(axis=0)) > 1.).reshape(-1)
    print('Fraction of genes with avg expression > 1 for cell type {} :'.format(cell_type),
          mask_sufficient_expression.mean())
    is_zi_pred_genelabel_here = is_zi_pred_genelabel[mask_sufficient_expression,ind_cell_type]
    print('Fraction of predicted ZI genes with avg expression > 1 for cell type {} :'.format(cell_type),
          is_zi_pred_genelabel_here.mean(), '\n')

Fraction of genes with avg expression > 1 for cell type B cells : 0.032
Fraction of predicted ZI genes with avg expression > 1 for cell type B cells : 0.46875 

Fraction of genes with avg expression > 1 for cell type CD14+ Monocytes : 0.078
Fraction of predicted ZI genes with avg expression > 1 for cell type CD14+ Monocytes : 0.7435897435897436 

Fraction of genes with avg expression > 1 for cell type CD4 T cells : 0.035
Fraction of predicted ZI genes with avg expression > 1 for cell type CD4 T cells : 0.4 

Fraction of genes with avg expression > 1 for cell type CD8 T cells : 0.049
Fraction of predicted ZI genes with avg expression > 1 for cell type CD8 T cells : 0.5102040816326531 

Fraction of genes with avg expression > 1 for cell type Dendritic Cells : 0.177
Fraction of predicted ZI genes with avg expression > 1 for cell type Dendritic Cells : 0.711864406779661 

Fraction of genes with avg expression > 1 for cell type FCGR3A+ Monocytes : 0.124
Fraction of predicted ZI genes with a