In [1]:
%load_ext autoreload
%autoreload 2

import anndata
import matplotlib.pyplot as plt
import seaborn as sns
import logging
import numpy as np
import pandas as pd
import scipy.stats
import scanpy as sc

In [2]:
import batchglm.api as glm
import diffxpy.api as de

print("batchglm version " + glm.__version__)
print("diffxpy version " + de.__version__)

batchglm version v0.7.1
diffxpy version v0.7.1


# Introduction

This notebook is based on the concepts presented in the notebook `introduction_differential_testing`.

Here, we present scenarios in which differential expression analysis results in multiple tests for each gene. We illustrate these scenarios with example use cases and introduce the diffxpy API for each of them. The scenarios are:

- pairwise tests between groups
- groupwise tests versus all other groups
- differential tests of the same form in each member of a set of groups: deploying tests to data set partitions

All of these problems are based on data sets which are grouped into discrete sets of cells, such as clusters (cell types) or conditions.

The returned differential expression test objects hold multiple tests for each gene and therefore have special methods and attributes.

In [6]:
from batchglm.api.models.tf1.glm_nb import Simulator

sim = Simulator(num_observations=2000, num_features=100)
sim.generate_sample_description(num_batches=4, num_conditions=0)
sim.generate_params()
sim.generate_data()
adata = sc.AnnData(X=np.asarray(sim.x), obs=sim.sample_description)

Transforming to str index.


# Pairwise tests between groups

diffxpy makes this type of analysis easy through the `de.test.pairwise` function, which performs pairwise comparisons between a set of groups. For each gene, these tests answer the question of whether a given pair of groups shows differential expression. For $n$ groups, this results in $\frac{n(n-1)}{2}$ possible tests per gene.
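
As a quick check of the count stated above, the unordered pairs of groups can be enumerated with the standard library. The group labels are hypothetical placeholders and are not taken from the simulated data.

```python
from itertools import combinations

# Hypothetical group labels; any hashable labels work.
groups = ["0", "1", "2", "3"]

# All unordered pairs of distinct groups.
pairs = list(combinations(groups, 2))

n = len(groups)
expected = n * (n - 1) // 2  # closed-form count from the text

print(pairs)
print(len(pairs), expected)
```

With four groups this yields six pairs, so six tests are run for each gene.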

The central argument is `grouping`, which defines how the data set is partitioned into groups that can be compared in a pairwise manner. `grouping` refers to a column in the sample description and can, for example, be the column that contains the cell type labels. Alternatively, `grouping` can be a vector of length `num_observations` that directly specifies the group assignments.

The parameter `test` specifies which kind of statistical test will be performed for each pair of groups.
Possible arguments are all `two_sample` tests (e.g. 'wald', 't-test', 'rank_sum') and 'z-test'.
The 'z-test' is a special kind of test which treats each group as a coefficient in a single linear model and therefore requires fitting only one GLM for all tests. This significantly reduces the runtime compared to the 'wald' test. In addition, all of these tests can optionally be performed in a `lazy` fashion: the tests are only evaluated once the user requests p-values or coefficients for a specific pair of groups. This greatly reduces the memory footprint and run time of the test if many groups are handled. Note that the lazy option is slow for Wald tests, as it requires a new model fit upon each request of test results.
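
To illustrate the idea behind the 'z-test', the following sketch compares two group coefficients taken from a single fitted model. The coefficient estimates and standard errors are made-up numbers, the function name `z_test_pair` is hypothetical, and the sketch assumes the two estimates are approximately normal and uncorrelated. It is not diffxpy's implementation.

```python
import math

def z_test_pair(coef_a, se_a, coef_b, se_b):
    """Two-sided z-test for the difference of two coefficient estimates.

    Assumes the estimates are approximately normal and uncorrelated.
    """
    z = (coef_a - coef_b) / math.sqrt(se_a ** 2 + se_b ** 2)
    # Two-sided p-value under the standard normal: 2 * (1 - Phi(|z|)).
    pval = math.erfc(abs(z) / math.sqrt(2.0))
    return z, pval

# Hypothetical coefficients and standard errors for one gene in two groups.
z, pval = z_test_pair(coef_a=1.2, se_a=0.1, coef_b=0.9, se_b=0.1)
print(z, pval)

# Identical coefficients give z = 0 and a p-value of 1.
z_same, pval_same = z_test_pair(1.0, 0.2, 1.0, 0.2)
print(z_same, pval_same)
```

Because every pair reuses the same coefficient estimates, no additional model fit is needed per pair, which is where the runtime saving comes from.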

## Running the test

In [9]:
test = de.test.pairwise(
    data=adata,
    grouping="batch",
    test="z-test",
    lazy=False,
    noise_model="nb"
)

AssertionError: 

## Accessing results

The results can be accessed in a pairwise fashion. Moreover, `.summary_pairs()` can summarize the results from multiple tests by showing only the test with the lowest p-value for each gene.
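
The per-gene reduction described above can be sketched in plain Python. The p-values are invented for illustration, and the dictionary layout is an assumption of this sketch rather than the structure diffxpy uses internally.

```python
# Hypothetical p-values: one list per group pair, one entry per gene.
pvals_by_pair = {
    ("0", "1"): [0.50, 0.01, 0.20],
    ("0", "2"): [0.03, 0.40, 0.90],
    ("1", "2"): [0.70, 0.05, 0.10],
}
n_genes = 3

best_pair = []
min_pval = []
for g in range(n_genes):
    # Pick the pair with the smallest p-value for gene g.
    pair = min(pvals_by_pair, key=lambda p: pvals_by_pair[p][g])
    best_pair.append(pair)
    min_pval.append(pvals_by_pair[pair][g])

print(min_pval)
print(best_pair)
```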

In [None]:
np.set_printoptions(precision=3)
print("shape of p-values: %s" % str(test.pval.shape))

P-values of the first gene:

In [None]:
test.pval[:,:,0]

`test.summary()` returns a pandas DataFrame with a quick overview of the test results:

In [None]:
test.summary().iloc[:10,:]

- `gene`: gene name / identifier
- `pval`: minimal p-value of the tests
- `qval`: minimal multiple-testing-corrected p-value of the tests
- `log2fc`: maximal $\log_2$ fold change of the tests

`test.plot_volcano()` creates a volcano plot of p-values vs. fold-change:

In [None]:
test.plot_volcano()

## Results specific for one test

One may be specifically interested in the comparison of a specific pair of groups. Several of the methods presented above have counterparts for this scenario, such as `test.pval_pair` and `test.summary_pair`.

The group identifiers are:

In [None]:
print(np.unique(sim.sample_description['batch'].values))

The results for the comparison of groups '2' and '3' are:

In [None]:
test.plot_diagnostics()

In [None]:
test.pval_pair(group1='2', group2='3')[:10]

In [None]:
test.summary_pair(group1='2', group2='3').iloc[:10,:]

# Groupwise tests versus all other groups

The versus-rest test compares each group of samples against all remaining samples.

It needs a parameter `grouping` which assigns a group to each sample.
This `grouping` can either be a vector of length `num_observations` or a string specifying a column in the sample description.
Since we simulated the data with `num_batches=4` different groups, the versus-rest test will perform 4 different tests.

The parameter `test` specifies which kind of statistical test will be performed for each group versus the remaining samples.
Possible arguments are all `two_sample` tests (e.g. 'wald', 't-test', 'wilcoxon', ...).
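
A minimal sketch of how a versus-rest comparison partitions the observations: for each group, the observations are split into members and non-members, which gives one two-sample comparison per group. The labels below are hypothetical and the code does not use diffxpy.

```python
# Hypothetical group label for each observation.
grouping = ["0", "1", "2", "3", "0", "1", "2", "3"]

splits = {}
for group in sorted(set(grouping)):
    # Indices of observations inside the group and in the rest.
    in_group = [i for i, g in enumerate(grouping) if g == group]
    rest = [i for i, g in enumerate(grouping) if g != group]
    splits[group] = (in_group, rest)

print(len(splits))
print(splits["2"])
```

With four distinct labels there are four splits, matching the four tests mentioned above.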

## Running the test

In [None]:
test = de.test.versus_rest(
    data=adata,
    grouping="batch",  # the simulated groups are stored in the "batch" column
    test="wald",
    noise_model="nb"
)

# Running tests on data set partitions

The partition API runs multiple tests for each gene (one per partition) and is introduced in the basic tutorial `introduction_differential_testing`. Please refer to that tutorial for further information.