In [1]:
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
import logging
import scipy.stats
import numpy as np
import scanpy as sc



In [2]:
import batchglm.api as glm
import diffxpy.api as de

print("batchglm version "+glm.__version__)
print("diffxpy version "+de.__version__)

batchglm version v0.6.8
diffxpy version v0.6.13+2.g12c6667


# Introduction

Gene set enrichment is a common downstream task after differential expression analysis, and dedicated packages and algorithms exist for this purpose. diffxpy provides a simple hypergeometric test for enrichment of differentially expressed genes in annotated gene sets. This workflow is included so that the analyst can run enrichment tests directly on diffxpy output objects and obtain a first enrichment analysis with very little effort.

The annotated reference gene sets can be manually curated or loaded from publicly available files, such as those provided by MSigDB. The enrichment API of diffxpy offers a simple interface for both cases.

# Generate some data

In [3]:
from batchglm.api.models.glm_nb import Simulator

sim = Simulator(num_observations=2000, num_features=100)
sim.generate_sample_description(num_batches=0, num_conditions=2)
sim.generate_params()
sim.generate_data()

Create anndata object:

In [4]:
adata = sc.AnnData(X=np.asarray(sim.x), obs=sim.sample_description)

Transforming to str index.


From here on, the AnnData object serves as a container for the count matrix, the sample description and the gene names, so we only need to pass `adata` to the diffxpy functions.

# Create annotated reference set

If you do not want to curate gene sets manually, you can also load sets via `rs.read_from_file()`. Here, we make up two annotated sets (`setA`, `setB`) from the gene names of the simulated data:

In [5]:
adata.var_names

Index(['0', '1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12',
       '13', '14', '15', '16', '17', '18', '19', '20', '21', '22', '23', '24',
       '25', '26', '27', '28', '29', '30', '31', '32', '33', '34', '35', '36',
       '37', '38', '39', '40', '41', '42', '43', '44', '45', '46', '47', '48',
       '49', '50', '51', '52', '53', '54', '55', '56', '57', '58', '59', '60',
       '61', '62', '63', '64', '65', '66', '67', '68', '69', '70', '71', '72',
       '73', '74', '75', '76', '77', '78', '79', '80', '81', '82', '83', '84',
       '85', '86', '87', '88', '89', '90', '91', '92', '93', '94', '95', '96',
       '97', '98', '99'],
      dtype='object')

In [6]:
rs = de.enrich.RefSets()
rs.add(id="setA", source="made_up", gene_ids=np.array(['2', '5', '22', '23']))
rs.add(id="setB", source="made_up", gene_ids=np.array(['22', '15', '16', '44', '55', '98', '99']))
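
As an alternative to hand-made sets, MSigDB distributes its collections as GMT files: one gene set per line, tab-separated, with the set name first, a description second and the gene identifiers after that. The sketch below parses that layout with plain Python. The helper `parse_gmt` and the two example lines are hypothetical and are not part of diffxpy, which offers `rs.read_from_file()` for this job.

```python
def parse_gmt(lines):
    """Parse GMT-formatted lines into a dict {set_name: [gene ids]}.

    Hypothetical helper for illustration; not part of diffxpy.
    """
    gene_sets = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip blank lines and lines without any gene
        name, _description, *genes = fields
        gene_sets[name] = [g for g in genes if g]  # drop empty trailing fields
    return gene_sets


# Two made-up lines that mirror setA and setB from this notebook.
example_lines = [
    "setA\tmade_up\t2\t5\t22\t23\n",
    "setB\tmade_up\t22\t15\t16\t44\t55\t98\t99\n",
]
gene_sets = parse_gmt(example_lines)
print(gene_sets)
```

Each entry of `gene_sets` could then be registered in the same way as in the cell above, for example with `rs.add(id=name, source="made_up", gene_ids=np.array(genes))`.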

# Run differential expression test

For each gene, the t-test checks whether the mean expression differs significantly between two groups of samples.

It therefore requires a `grouping` parameter that specifies the group membership of each sample.
This can be either the name of a column in the sample description (here `adata.obs`) or a vector of length `num_observations`.


In [7]:
det = de.test.t_test(
    data=adata,
    grouping="condition"
)
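
To make the role of `grouping` concrete, the following sketch computes a two-group t statistic for a single gene from a vector of group labels. It is an illustration only and not diffxpy's implementation: the helper `welch_t` and the simulated values are hypothetical, and using the Welch (unequal variance) form is a choice made for this sketch.

```python
import random
from statistics import mean, variance


def welch_t(values, grouping):
    """Return the Welch t statistic and degrees of freedom for one gene.

    `grouping` holds one label per observation and must contain exactly
    two distinct labels. Hypothetical helper, not part of diffxpy.
    """
    levels = sorted(set(grouping))
    if len(levels) != 2:
        raise ValueError("grouping must contain exactly two groups")
    a = [v for v, g in zip(values, grouping) if g == levels[0]]
    b = [v for v, g in zip(values, grouping) if g == levels[1]]
    # Squared standard errors of the two group means.
    va, vb = variance(a) / len(a), variance(b) / len(b)
    t = (mean(a) - mean(b)) / (va + vb) ** 0.5
    # Welch-Satterthwaite approximation of the degrees of freedom.
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


random.seed(0)
grouping = ["0"] * 50 + ["1"] * 50  # one label per observation
# One gene whose mean differs between the groups, and one that does not.
gene_shifted = [random.gauss(5.0, 1.0) for _ in range(50)] + [
    random.gauss(8.0, 1.0) for _ in range(50)
]
gene_null = [random.gauss(5.0, 1.0) for _ in range(100)]

t_shifted, df_shifted = welch_t(gene_shifted, grouping)
t_null, df_null = welch_t(gene_null, grouping)
print(t_shifted, t_null)
```

A p-value follows from comparing the statistic with a t distribution with `df` degrees of freedom, for example via `scipy.stats.t.sf`.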

# Perform enrichment

In [8]:
enr = de.enrich.test(det=det, ref=rs, threshold=0.005, clean_ref=False)

INFO:diffxpy: 10 overlaps found between refset (10) and provided gene list (100).




In [9]:
enr.summary()

| | set | pval | qval | intersection | reference | enquiry | background |
|---|---|---|---|---|---|---|---|
| 0 | setA | 1.0 | 1.0 | 4 | 4 | 100 | 100 |
| 1 | setB | 1.0 | 1.0 | 7 | 7 | 100 | 100 |
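
The p-values of 1.0 can be checked by hand with a one-sided hypergeometric test. The sketch below assumes that `enquiry` is the number of genes passing the threshold, `background` the number of genes tested, `reference` the size of the gene set and `intersection` their overlap. This reading of the columns, the helper `hypergeom_enrichment_pval`, and the claim that diffxpy computes the same tail probability are all assumptions.

```python
from math import comb


def hypergeom_enrichment_pval(intersection, reference, enquiry, background):
    """Upper-tail probability P(X >= intersection).

    X counts reference-set genes in a random draw of `enquiry` genes
    from `background` genes. Hypothetical helper, not part of diffxpy.
    """
    total = comb(background, enquiry)
    upper = min(reference, enquiry)
    tail = sum(
        comb(reference, k) * comb(background - reference, enquiry - k)
        for k in range(intersection, upper + 1)
    )
    return tail / total


# Values taken from the summary table.
pval_setA = hypergeom_enrichment_pval(4, 4, 100, 100)
pval_setB = hypergeom_enrichment_pval(7, 7, 100, 100)

# Hypothetical contrast: only 10 of the 100 genes are significant,
# and all 4 genes of setA are among them.
pval_contrast = hypergeom_enrichment_pval(4, 4, 10, 100)
print(pval_setA, pval_setB, pval_contrast)
```

Under these assumptions, an enquiry list that covers the whole background necessarily contains every reference gene, so the overlap carries no information and the p-value is 1. Enrichment can only be detected when the significant genes are a proper subset of the background, as in the contrast case.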
