In [1]:
%matplotlib inline
import matplotlib.pyplot as plt
import logging

# Generate some data:

In [2]:
from batchglm.api.models.nb_glm import Simulator

sim = Simulator(num_observations=2000, num_features=100)
sim.generate_sample_description(num_batches=0)
sim.generate()

# count data
X = sim.X
# sample description
sample_description = sim.sample_description

The sample description should be a pandas DataFrame with `num_observations` rows, one per observation.
Each column describes one property of the observations, such as `condition`.

The module `batchglm.api.data` provides helper functions for creating this sample description:

- `sample_description_from_anndata()`
- `sample_description_from_xarray()`
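
If neither helper fits, the sample description can also be assembled by hand. The following is a minimal sketch using only pandas and NumPy; the six observations and the alternating `condition` values are hypothetical and chosen to mirror the simulated output shown in this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical example: 6 observations alternating between two conditions.
num_observations = 6
sample_description = pd.DataFrame(
    {"condition": np.arange(num_observations) % 2},
    index=pd.RangeIndex(num_observations, name="observations"),
)

# One row per observation, one column per property.
print(sample_description)
```

The only structural requirement stated here is that the number of rows matches the number of observations in the count matrix.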

In [3]:
sample_description

observations,condition
0,0
1,1
2,0
3,1
4,0
5,1
6,0
7,1
8,0
9,1


# Run differential expression test:

For each gene, the t-test checks whether expression differs significantly between two groups of samples.

It therefore requires a `grouping` parameter that specifies the group membership of each sample.
This can be either the name of a column in `sample_description` or a vector of length `num_observations`.
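
To illustrate what a per-gene two-group t-test computes, here is a rough sketch using `scipy.stats.ttest_ind` on toy data. It is not diffxpy's implementation: the Welch variant (`equal_var=False`), the Poisson toy counts, and all variable names are assumptions made for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical toy data: 200 observations x 3 genes of Poisson counts.
grouping = np.arange(200) % 2  # group membership of each observation
counts = rng.poisson(lam=5.0, size=(200, 3)).astype(float)
# Shift gene 0 in group 1 so that it differs strongly between the groups.
counts[grouping == 1, 0] += 10.0

# Welch's t-test for every gene (column) at once.
t_stat, pval = stats.ttest_ind(
    counts[grouping == 0],
    counts[grouping == 1],
    axis=0,
    equal_var=False,  # assumption: do not require equal group variances
)
print(pval)
```

Here `grouping` plays the role of the vector of length `num_observations` described above: it splits the rows of the count matrix into the two groups being compared.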


In [4]:
logging.getLogger("tensorflow").setLevel(logging.ERROR)
logging.getLogger("batchglm").setLevel(logging.INFO)
logging.getLogger("diffxpy").setLevel(logging.INFO)

import diffxpy as de

test = de.test_t_test(
    data=X,
    grouping="condition",
    sample_description=sample_description
)


# Obtaining the results

The p-values and q-values are available as `test.pval` and `test.qval`:

In [5]:
test.qval

array([1.00000000e+00, 1.00000000e+00, 1.00000000e+00, 1.00000000e+00,
       1.00000000e+00, 1.00000000e+00, 1.00000000e+00, 1.60547221e-03,
       1.00000000e+00, 1.00000000e+00, 1.00000000e+00, 1.00000000e+00,
       0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
       0.00000000e+00, 2.54744535e-03, 1.00000000e+00, 1.00000000e+00,
       1.00000000e+00, 1.00000000e+00, 0.00000000e+00, 1.00000000e+00,
       1.00000000e+00, 1.00000000e+00, 1.37839936e-03, 0.00000000e+00,
       0.00000000e+00, 1.00000000e+00, 1.00000000e+00, 4.50216277e-08,
       1.00000000e+00, 1.00000000e+00, 0.00000000e+00, 1.00000000e+00,
       1.00000000e+00, 0.00000000e+00, 1.00000000e+00, 1.00000000e+00,
       0.00000000e+00, 1.00000000e+00, 0.00000000e+00, 0.00000000e+00,
       1.00000000e+00, 1.00000000e+00, 0.00000000e+00, 0.00000000e+00,
       1.00000000e+00, 4.66907430e-01, 1.00000000e+00, 1.00000000e+00,
       1.00000000e+00, 0.00000000e+00, 1.00000000e+00, 0.00000000e+00,
      

`test.summary()` returns a pandas DataFrame with a quick overview of the test results:

In [6]:
test.summary()

,gene,pval,qval,log2fc
0,0,1.000000,1.000000,0.460219
1,1,1.000000,1.000000,0.857655
2,2,1.000000,1.000000,0.626951
3,3,1.000000,1.000000,0.944438
4,4,1.000000,1.000000,0.586649
5,5,1.000000,1.000000,0.897808
6,6,1.000000,1.000000,0.747291
7,7,0.000482,0.001605,-0.033741
8,8,1.000000,1.000000,0.386023
9,9,1.000000,1.000000,0.699942


- `gene`: gene name / identifier
- `pval`: p-value of the gene
- `qval`: p-value of the gene after multiple-testing correction
- `log2fc`: log2 fold change between the two groups (`no coefficient` and `coefficient`)
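
Because the summary is an ordinary DataFrame, significant genes can be selected with standard pandas filtering. The sketch below rebuilds the ten rows displayed above as a stand-in for `test.summary()`; the threshold `alpha = 0.05` is a hypothetical choice, not a value prescribed by diffxpy.

```python
import pandas as pd

# Stand-in for test.summary(): the ten rows displayed above.
summary = pd.DataFrame({
    "gene": list(range(10)),
    "pval": [1.0] * 7 + [0.000482, 1.0, 1.0],
    "qval": [1.0] * 7 + [0.001605, 1.0, 1.0],
    "log2fc": [0.460219, 0.857655, 0.626951, 0.944438, 0.586649,
               0.897808, 0.747291, -0.033741, 0.386023, 0.699942],
})

alpha = 0.05  # hypothetical significance threshold
# Keep genes whose corrected p-value is below the threshold, smallest first.
significant = summary[summary["qval"] < alpha].sort_values("qval")
print(significant)
```

Among these ten rows, only gene 7 passes the threshold.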