In [None]:
!pip install scanpy==1.4.6 umap-learn==0.4.0 anndata==0.7.1 numpy==1.18.2 scipy==1.4.1 pandas matplotlib scrublet seaborn python-igraph==0.8.0 louvain==0.6.1 gprofiler

Install `tensorflow` and the `diffxpy` package.

In [None]:
!pip install tf-nightly
!pip install tfp-nightly

In [None]:
!pip install -U diffxpy

Load all required packages.

In [None]:
import scanpy as sc
import anndata as ann
import numpy as np
import scipy as sp
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib import rcParams
from matplotlib import colors

import os 

#pretty plotting
import seaborn as sb

import logging

In [None]:

plt.rcParams['figure.figsize']=(8,8) #rescale figures
sc.settings.verbosity = 3
#sc.set_figure_params(dpi=200, dpi_save=300)
sc.logging.print_versions()


This notebook was created for a workshop, so we use extra-large text in all `seaborn` plots. You can also set the context to `'talk'` or `'paper'`.

In [None]:
sb.set_context(context='poster')


Load differential expression package 'diffxpy'.

In [1]:
import batchglm.api as glm
import diffxpy.api as de
from batchglm.pkg_constants import TF_CONFIG_PROTO

print("batchglm version " + glm.__version__)
print("diffxpy version " + de.__version__)


batchglm version v0.7.4
diffxpy version v0.7.4


In [None]:
#Set number of threads
TF_CONFIG_PROTO.inter_op_parallelism_threads = 1
TF_CONFIG_PROTO.intra_op_parallelism_threads = 12

# Introduction

The dataset consists of 12 samples from 3 mm² blocks that were manually dissected from the substantia nigra (SN) and cortex of 5 de-identified post-mortem human control donors (4 males and 1 female), including 2 SN replicates. The samples were sequenced using the 10X Chromium system (V2) (GEO accession ID: [GSE140231](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE140231)).

In this notebook, we perform differential expression tests using the tool `diffxpy`. Further reading on `diffxpy` and its usage can be found on the [diffxpy GitHub page](https://github.com/theislab/diffxpy/) and the [tutorial page](https://github.com/theislab/diffxpy_tutorials/). This notebook builds upon these tutorials.

Differential expression analysis is a group of statistical tests that are used to establish whether there exists a significant variation across a set of tested conditions for each gene. In its simplest form, such a test checks for a difference between two distinct groups. This scenario can be handled with (Welch's) t-tests, rank sum tests, or Wald and likelihood ratio tests (LRT). Wald tests and LRTs allow for more flexible assumptions on the noise model and can therefore be statistically more appropriate. Moreover, they also allow testing of more complex effects, e.g. the variation across many groups (a single p-value for: Is there any difference between four conditions?) or across continuous covariates (a single p-value for: Is a gene expression trajectory in time non-constant?). Below, we introduce these and similar scenarios. Separate tutorials are dedicated to a selection of scenarios that require a longer introduction.

Importantly, we assume that the groups we compare differ only in the condition of interest. For example, we test whether oligodendrocytes, oligodendrocyte precursor cells and astrocytes differ between substantia nigra and cortex. This is a borderline case: the cells come from different brain regions, so the assumption that the two groups differ in nothing but the tested condition may not hold.

Further, we test for differences in dopaminergic neurons across donors.

## Set project file paths

Let us set up the connection with Google Drive.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

We set up the file paths to the respective directories.

In [None]:
file_path = '/content/drive/My Drive/' #this is the file path to your google drive (main directory)

In [None]:
os.listdir(file_path) #shows all files in file_path

The data directory contains all processed data and `anndata` files. 

In [None]:
data_dir = file_path + 'day2/data/' 

The tables directory contains all tabular data output, e.g. in `.csv` or `.xls` file format. That applies to differential expression test results or overview tables such as the number of cells per cell type.

In [None]:
table_dir = file_path + 'day2/tables/'

The default figure path is a POSIX path called 'figures'. If you don't change the default figure directory, scanpy creates a `figures` subdirectory in the directory where this notebook is located.

In [None]:
sc.settings.figdir = file_path + 'day2/figures/'

**Comment:** When you repeat certain analyses, it might be helpful to set a `date` variable and add it to every figure and table (see `datetime` Python package).
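
A minimal sketch of such a date stamp, using only the standard library. The helper name `dated_filename` and the `YYMMDD` prefix format are assumptions for illustration, not part of the original workflow.

```python
import os
from datetime import date

def dated_filename(directory, name, today=None):
    """Prefix a file name with a YYMMDD date stamp (hypothetical helper)."""
    today = today or date.today()
    return os.path.join(directory, today.strftime('%y%m%d') + '_' + name)

# A fixed date makes the example reproducible; omit it to use today's date.
print(dated_filename('tables', 'test_astrocytes.csv', date(2020, 4, 1)))
```

In this notebook, the same idea would apply to paths built from `table_dir` or `sc.settings.figdir`.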

# Read data

We read in the annotated dataset. As a reminder, the `anndata` object contains (amongst others):
1. The raw counts as 'counts' layer. 
2. Normalised gene expression values (log-scran normalised) as `X` matrix
3. Cell type annotation
4. Size factors 

In [None]:
adata = sc.read(data_dir + 'data_processed.h5ad')

In [None]:
adata

In [None]:
pd.crosstab(adata.obs['annotated'], adata.obs['location'])

# Test a single coefficient

Testing a single coefficient is the simplest differential expression test; the comparison of two groups is a special case of it.

In our case, testing differences of astrocytes from **cortex** and **substantia nigra** falls into this category.

## Run differential expression test for two groups

We first tackle this scenario with a Wald test. The Wald test checks if a certain coefficient introduces a significant difference in the expression of a gene.

The test needs a formula (`formula_loc`) that describes the setup of the model, and the factor of that formula that should be tested (`factor_loc_totest`).

Usually, this factor divides the samples into two groups, e.g. `condition 0` and `condition 1`. In this case, `diffxpy` automatically chooses the coefficient to test. If the factor specifies more than two groups, the coefficient to be tested has to be set manually via `coef_to_test`. This coefficient should refer to one of the groups specified by `factor_loc_totest`, e.g. `condition 1`.
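
To see why a two-level factor yields exactly one testable coefficient while a factor with more levels yields several, it can help to look at how a formula expands into a design matrix. The sketch below mimics treatment (dummy) coding with plain `pandas`. `treatment_coding` is a hypothetical helper, the level names are made up, and the `factor[T.level]` naming is an assumption based on the coefficient name `donor[T.5]` used later in this notebook.

```python
import pandas as pd

def treatment_coding(values, factor):
    """Intercept plus one 0/1 column per non-reference level (illustrative only)."""
    cat = pd.Categorical(values)
    design = pd.DataFrame({'Intercept': 1}, index=range(len(cat)))
    # The first (sorted) level is the reference and is absorbed by the intercept.
    for level in cat.categories[1:]:
        design[f'{factor}[T.{level}]'] = (cat == level).astype(int)
    return design

two_groups = treatment_coding(['Cortex', 'SN', 'SN', 'Cortex'], 'location')
three_groups = treatment_coding(['1', '2', '3', '2'], 'donor')

print(list(two_groups.columns))    # one non-intercept coefficient
print(list(three_groups.columns))  # two non-intercept coefficients
```

With two groups there is only one non-intercept column, so the choice of coefficient is unambiguous. With three or more groups, one of the columns has to be named explicitly.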

In [None]:
sc.pl.umap(adata, color=['annotated', 'location'])

Prepare test data.

In [None]:
#select astrocytes
adata_astro = adata[adata.obs['annotated'] == 'Astrocyte'].copy() 

In [None]:
adata_astro.obs['location'].value_counts()

In [None]:
adata_astro

Run the test on the count data. Note that the `counts` layer is a plain matrix that contains neither the sample description nor the gene names, so we pass both explicitly.

In [None]:
test = de.test.wald(
    data=adata_astro.layers['counts'],
    formula_loc="~ 1 + location",
    factor_loc_totest="location",
    gene_names=adata_astro.var_names,
    sample_description=adata_astro.obs
)

Obtain results.


`test.summary()` returns a `pandas` `DataFrame` with a quick overview of the test results.

In [None]:
#view first 10 results
test.summary().iloc[:10,:]

Table column description:
* gene: gene name / identifier
* pval: p-value of the gene
* qval: multiple testing - corrected p-value of the gene
* log2fc: log2 fold change between the reference group (no coefficient) and the tested group (coefficient)
* grad: the gradient of the gene's log-likelihood
* coef_mle: the maximum-likelihood estimate of the coefficient in linker space
* coef_sd: the standard deviation of the coefficient in linker space
* ll: the log-likelihood of the estimation
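
For a single tested coefficient, the textbook Wald test relates `coef_mle` and `coef_sd` to the p-value through a z-statistic. The sketch below shows that relationship using only the standard library. It assumes the standard two-sided Wald test and a log link for the fold change; the numerical details inside `diffxpy` may differ, and both function names are hypothetical.

```python
import math

def wald_pvalue(coef_mle, coef_sd):
    """Two-sided p-value of z = coef_mle / coef_sd under a standard normal."""
    z = coef_mle / coef_sd
    # 2 * (1 - Phi(|z|)) written with the complementary error function
    return math.erfc(abs(z) / math.sqrt(2))

def log2fc_from_coef(coef_mle):
    """Convert a natural-log coefficient to a log2 fold change (assumes a log link)."""
    return coef_mle / math.log(2)

print(wald_pvalue(1.96, 1.0))           # close to 0.05
print(log2fc_from_coef(math.log(2.0)))  # a doubling corresponds to log2fc = 1
```

A large coefficient with a large standard deviation can therefore still give an unremarkable p-value, which is worth keeping in mind when reading the table.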

Order test results by q-value:

In [None]:
test.summary().sort_values('qval').iloc[:10,:]
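
The `qval` column holds p-values corrected for multiple testing. As an illustration of what such a correction does, here is a small Benjamini-Hochberg implementation in `numpy`. This is a sketch that assumes the Benjamini-Hochberg procedure; check the `diffxpy` documentation for the method it actually applies. The function name is hypothetical and missing values are not handled.

```python
import numpy as np

def benjamini_hochberg(pvals):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    pvals = np.asarray(pvals, dtype=float)
    n = pvals.size
    order = np.argsort(pvals)
    # Scale the i-th smallest p-value by n / i.
    scaled = pvals[order] * n / np.arange(1, n + 1)
    # Enforce monotonicity, starting from the largest p-value.
    scaled = np.minimum.accumulate(scaled[::-1])[::-1]
    qvals = np.empty(n)
    qvals[order] = np.minimum(scaled, 1.0)
    return qvals

print(benjamini_hochberg([0.01, 0.04, 0.03, 0.005]))
```

Because the adjustment depends on the number of tests, filtering genes before testing changes the resulting q-values.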

`test.plot_volcano()` creates a volcano plot of p-values vs. fold-change:

In [None]:
test.plot_volcano(corrected_pval=True, min_fc=1.05, alpha=10e-5, size=20)
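
A short note on the thresholds passed above: in Python, `10e-5` means 10 × 10⁻⁵, i.e. `1e-4`, not `1e-5`. The sketch below converts both thresholds to the scales of a volcano plot. How `diffxpy` applies them internally is an assumption here, and `passes_thresholds` is a hypothetical helper.

```python
import math

alpha = 10e-5    # equals 1e-4
min_fc = 1.05    # minimum fold change on the linear scale

neg_log10_cutoff = -math.log10(alpha)   # position on the -log10(q-value) axis
log2fc_cutoff = math.log2(min_fc)       # position on the log2 fold change axis

def passes_thresholds(qval, log2fc):
    """True if a gene is below alpha and at least min_fc away from no change."""
    return qval <= alpha and abs(log2fc) >= log2fc_cutoff

print(alpha == 1e-4, neg_log10_cutoff, round(log2fc_cutoff, 3))
```

A `min_fc` of 1.05 is a very permissive fold-change bound (about 0.07 on the log2 scale), so most of the filtering is done by `alpha`.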

Save results to file.

In [None]:
test.summary().to_csv(table_dir + 'test_astrocytes.csv')

**Comment:** Apart from Wald tests, `diffxpy` provides the following hypothesis tests: 
* Welch's t-test (see `de.test.t_test()`) 
* Rank sum test (see `de.test.rank_test()`)
* Likelihood-ratio test (LRT) (see `de.test.lrt()`)

**Tasks:** 
* Extract the significant differentially expressed genes from the `test.summary()` table and split the list into higher expressed in cortex and higher expressed in substantia nigra. 
* Filter for a minimum mean expression of `0.05` (or choose your own threshold). 
* Visualise your top 10 DE genes in a heatmap/matrixplot/dotplot. 
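
As a starting point for these tasks, the filtering steps can be sketched on a toy table. The values below are made up and the table merely stands in for `test.summary()`; the column names `qval`, `log2fc` and `mean` are assumed to match the real output. Which brain region a positive `log2fc` corresponds to depends on the reference level of `location`, so check the coefficient names before labelling the two lists.

```python
import pandas as pd

# Toy stand-in for test.summary(); all values are invented for illustration.
summary = pd.DataFrame({
    'gene':   ['g1', 'g2', 'g3', 'g4'],
    'qval':   [1e-6, 0.2, 1e-3, 1e-8],
    'log2fc': [1.5, 2.0, -0.8, 3.0],
    'mean':   [0.50, 0.30, 0.10, 0.01],
})

alpha = 0.05      # significance threshold on the q-value
min_mean = 0.05   # minimum mean expression

significant = summary[(summary['qval'] < alpha) & (summary['mean'] >= min_mean)]
up = significant[significant['log2fc'] > 0].sort_values('qval')
down = significant[significant['log2fc'] < 0].sort_values('qval')

print(up['gene'].tolist(), down['gene'].tolist())
```

Here `g2` is dropped by the q-value filter and `g4` by the mean expression filter, despite its very small q-value.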

**Task:** Save your filtered tables to file.

## Include continuous covariates

In the previous test, we did not consider cell-specific effects.
However, the count data are not normalised, and the size factors capture cell-specific differences in e.g. cell size and sequencing depth. Therefore, we add them as an additional numeric covariate to regress out the effect they describe.

To supply a continuous effect, you have to declare it explicitly via `as_numeric`. Otherwise, `diffxpy` turns the covariate into a categorical effect, which does not produce the desired results. This explicit declaration guards against errors arising from mixing numeric and categorical columns in `pandas` `DataFrames`.
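
The difference between the two encodings can be made concrete by counting coefficients. The sketch below uses made-up size factors and assumes treatment coding for categorical covariates, in which every distinct value except the reference level gets its own coefficient.

```python
import pandas as pd

# Invented size factors for five cells
size_factors = pd.Series([0.8, 1.1, 0.95, 1.3, 1.1])

# Categorical encoding: one coefficient per distinct value, minus the reference level
n_coef_categorical = size_factors.nunique() - 1

# Numeric encoding: a single slope, regardless of the number of distinct values
n_coef_numeric = 1

print(n_coef_categorical, n_coef_numeric)
```

With thousands of cells and nearly as many distinct size factors, the categorical encoding would add almost one coefficient per cell, which is why the covariate must be declared as numeric.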

**Please note** that the following differential expression test **takes considerably longer than the simple test above** because it optimizes more parameters.

In [None]:
test_sf = de.test.wald(
    data=adata_astro.layers['counts'],
    formula_loc="~ 1 + location + size_factors",
    factor_loc_totest="location",
    as_numeric=['size_factors'],
    gene_names=adata_astro.var_names,
    sample_description=adata_astro.obs
)


The results can be retrieved as before. Note that the results now differ because we added the size factors to the model without changing the data:

In [None]:
sb.scatterplot(
    x=test.log10_pval_clean(),
    y=test_sf.log10_pval_clean()
)
plt.show()

Order test results by q-value:

In [None]:
test_sf.summary().sort_values('qval', ascending=True).iloc[:10,:]

In [None]:
test_sf.plot_volcano(corrected_pval=True, min_fc=1.05, alpha=10e-5, size=20)

Save results to file.

In [None]:
test_sf.summary().to_csv(table_dir + 'test_sf.csv')

# Test multiple coefficients with a Wald test

We now turn to tests that cannot be performed with t-tests or rank sum tests because they involve more than two groups (or, more generally, multiple coefficients).
Here, we cover two test scenarios: first, we test for donor-specific differences in general; second, we test a specific donor.
As a test case, we look at donor-specific differences in dopaminergic neurons in the substantia nigra.
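
For reference, a Wald test on several coefficients at once is usually written in the following textbook form. The symbols are introduced here for illustration, and we assume rather than confirm that `diffxpy` follows this form:

$$
W = \hat{\theta}^{\top} \hat{\Sigma}^{-1} \hat{\theta} \;\sim\; \chi^2_k \quad \text{under } H_0\colon \theta = 0
$$

Here $\hat{\theta}$ is the vector of the $k$ tested coefficient estimates (e.g. all donor coefficients), and $\hat{\Sigma}$ is their estimated covariance matrix. For $k = 1$, the statistic reduces to the squared ratio of the coefficient estimate to its standard deviation, which is the single-coefficient case from the previous section.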

In [None]:
adata_DN = adata[np.logical_and(adata.obs['annotated']=='Dopaminergic neuron',
                                adata.obs['location']=='SN')].copy()

In [None]:
adata_DN.obs['donor'].value_counts()

## Test a whole factor

In [None]:
test_fac = de.test.wald(
    data=adata_DN.layers['counts'],
    formula_loc="~ 1 + donor",
    factor_loc_totest="donor",
    gene_names=adata_DN.var_names,
    sample_description=adata_DN.obs
)

Look at the top 10 results.

In [None]:
test_fac.summary().sort_values('qval').iloc[:10, :]

## Test selected coefficients

First, we preview the coefficient names; then we pass the desired list to `diffxpy`.

In [None]:
de.utils.preview_coef_names(
    sample_description=adata_DN.obs,
    formula="~ 1 + donor"
)

Second, set up the Wald test with the coefficient(s) of interest.

In [None]:
test_coef = de.test.wald(
    data=adata_DN.layers['counts'],
    formula_loc="~ 1 + donor",
    coef_to_test=['donor[T.5]'],
    gene_names=adata_DN.var_names,
    sample_description=adata_DN.obs
)

Look at the top 10 results.

In [None]:
test_coef.summary().sort_values('qval').iloc[:10, :]

Save results to file.

In [None]:
test_coef.summary().to_csv(table_dir + 'test_coef.csv')

# Further scenarios

Was your scenario not captured by any of these classes of tests? `diffxpy` wraps a number of further advanced tests, to which separate tutorials of the `diffxpy` package are dedicated. These are:

* pairwise tests between groups ("multiple_tests_per_gene")
* groupwise tests versus all other groups ("multiple_tests_per_gene")
* modelling continuous covariates such as total counts, time, pseudotime, space, concentration ("modelling_continuous_covariates")
* modelling equality constraints, relevant for scenarios with perfect confounding ("modelling_constraints")