In [1]:
%matplotlib inline
import matplotlib.pyplot as plt
import logging

# Generate some data:

In [2]:
from batchglm.api.models.nb_glm import Simulator

sim = Simulator(num_observations=2000, num_features=100)
sim.generate_sample_description(num_batches=0)
sim.generate()

# count data
X = sim.X
# sample description
sample_description = sim.sample_description

The sample description should be a pandas DataFrame with `num_observations` rows.
Each column should represent a property of the dataset.

The module `batchglm.api.data` contains some helper functions which can be useful to create this sample description:

- `sample_description_from_anndata()`
- `sample_description_from_xarray()`
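
When the data does not come from the simulator, the sample description can also be built by hand with pandas. The following is a minimal sketch under assumed inputs: the name `manual_description`, the `batch` column and the alternating design are hypothetical and only illustrate the expected shape (one row per observation, one column per property).

```python
import numpy as np
import pandas as pd

num_observations = 2000  # must match the number of rows in the count matrix

# Hypothetical design: one row per observation, one column per property.
manual_description = pd.DataFrame(
    {
        # alternating group labels 0, 1, 0, 1, ...
        "condition": np.arange(num_observations) % 2,
        # illustrative second property: first half "a", second half "b"
        "batch": np.repeat(["a", "b"], num_observations // 2),
    }
)
manual_description.index.name = "observations"

# Sanity check: the number of rows must equal the number of observations.
assert manual_description.shape[0] == num_observations
```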

In [3]:
sample_description

observations,condition
0,0
1,1
2,0
3,1
4,0
5,1
6,0
7,1
8,0
9,1


# Run differential expression test:

The Wilcoxon test checks, for each gene, whether its expression differs significantly between two groups of samples.

It therefore requires a `grouping` parameter that specifies the group membership of each sample.
`grouping` can be either the name of a column in `sample_description` or a vector of length `num_observations`.
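
To make the role of the grouping concrete, here is a minimal sketch of a per-gene two-group rank test on toy data. It uses `scipy.stats.mannwhitneyu` (the Wilcoxon rank-sum test) purely for illustration; that diffxpy computes its statistic in the same way is an assumption, and the toy counts and variable names are hypothetical.

```python
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(0)

# Toy data (hypothetical): 200 observations x 3 genes.
grouping = np.arange(200) % 2                 # one group label per observation
counts = rng.poisson(lam=5.0, size=(200, 3))
counts[grouping == 1, 0] += 3                 # shift gene 0 in group 1 only

# One two-sided rank test per gene, comparing group 0 against group 1.
pvals = np.array([
    mannwhitneyu(
        counts[grouping == 0, g],
        counts[grouping == 1, g],
        alternative="two-sided",
    ).pvalue
    for g in range(counts.shape[1])
])
```

Only gene 0 was shifted between the groups, so its p-value should be far smaller than those of the other two genes. The label vector used here has the same form as the vector variant of `grouping` described above.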


In [4]:
logging.getLogger("tensorflow").setLevel(logging.ERROR)
logging.getLogger("batchglm").setLevel(logging.INFO)
logging.getLogger("diffxpy").setLevel(logging.INFO)

import diffxpy as de

test = de.test_wilcoxon(
    data=X,
    grouping="condition",
    sample_description=sample_description
)


# Obtaining the results

The p-values and q-values are available as the attributes `test.pval` and `test.qval`:

In [5]:
test.qval


array([0.00000000e+000, 1.93565733e-247, 1.38836928e-106, 0.00000000e+000,
       0.00000000e+000, 2.48434967e-070, 3.89332406e-165, 8.64068742e-030,
       6.48013768e-254, 0.00000000e+000, 0.00000000e+000, 1.52063167e-222,
       7.28927009e-016, 4.46116483e-004, 6.22803264e-212, 0.00000000e+000,
       0.00000000e+000, 3.42805435e-217, 0.00000000e+000, 0.00000000e+000,
       3.18934030e-096, 5.32893681e-266, 0.00000000e+000, 2.11773391e-294,
       0.00000000e+000, 6.11158580e-023, 0.00000000e+000, 6.74417605e-279,
       0.00000000e+000, 4.90200298e-240, 1.02524034e-218, 1.10889412e-195,
       1.14981416e-087, 4.09138744e-264, 1.89174370e-062, 0.00000000e+000,
       8.16686596e-248, 0.00000000e+000, 4.72372089e-185, 0.00000000e+000,
       2.67415295e-252, 5.02544323e-072, 0.00000000e+000, 0.00000000e+000,
       7.11436175e-299, 1.78792851e-054, 0.00000000e+000, 1.27110425e-254,
       8.97435115e-003, 9.46012486e-235, 0.00000000e+000, 0.00000000e+000,
       8.07725257e-284, 0

`test.summary()` returns a pandas DataFrame with a quick overview of the test results:

In [6]:
test.summary()

,gene,pval,qval,log2fc
0,0,0.000000e+00,0.000000e+00,0.794094
1,1,1.258177e-247,1.935657e-247,0.508358
2,2,1.193998e-106,1.388369e-106,0.245447
3,3,0.000000e+00,0.000000e+00,0.583048
4,4,0.000000e+00,0.000000e+00,0.714606
5,5,2.285602e-70,2.484350e-70,0.155400
6,6,3.114659e-165,3.893324e-165,0.555475
7,7,8.295060e-30,8.640687e-30,-0.101883
8,8,3.952884e-254,6.480138e-254,-0.431965
9,9,0.000000e+00,0.000000e+00,0.766455


- `gene`: gene name / identifier
- `pval`: p-value of the gene
- `qval`: p-value of the gene, corrected for multiple testing
- `log2fc`: log2 fold change of the gene between the two groups defined by `grouping`
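
Because the summary is an ordinary DataFrame, the usual pandas operations apply to it. The sketch below selects significant genes with a sizeable effect, assuming the column names listed above. The rows are retyped from the table shown earlier so that the example is self-contained; in practice `summary` would come from `test.summary()`. The two thresholds are arbitrary illustrative values, not recommendations.

```python
import pandas as pd

# A few rows of the summary table shown above, retyped for illustration.
summary = pd.DataFrame({
    "gene": [0, 1, 2, 5, 7, 8],
    "pval": [0.0, 1.258177e-247, 1.193998e-106,
             2.285602e-70, 8.295060e-30, 3.952884e-254],
    "qval": [0.0, 1.935657e-247, 1.388369e-106,
             2.484350e-70, 8.640687e-30, 6.480138e-254],
    "log2fc": [0.794094, 0.508358, 0.245447,
               0.155400, -0.101883, -0.431965],
})

# Hypothetical cut-offs; choose values appropriate for the study at hand.
max_qval = 0.05
min_abs_log2fc = 0.4

# Keep genes that pass both cut-offs, largest fold change first.
hits = summary[
    (summary["qval"] < max_qval)
    & (summary["log2fc"].abs() > min_abs_log2fc)
].sort_values("log2fc", ascending=False)
```

Filtering on the absolute value of `log2fc` keeps genes that change in either direction.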