In [1]:
%matplotlib inline
import matplotlib.pyplot as plt
import logging

# Generate some data:

In [2]:
from batchglm.api.models.nb_glm import Simulator

sim = Simulator(num_observations=2000, num_features=100)
sim.generate_sample_description(num_batches=0)
sim.generate()

# count data
X = sim.X
# sample description
sample_description = sim.sample_description

The sample description should be a pandas DataFrame with `num_observations` rows.
Each column should represent a property of the dataset.

The module `batchglm.api.data` provides helper functions for creating this sample description:

- `sample_description_from_anndata()`
- `sample_description_from_xarray()`
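
If neither helper fits, a sample description can also be built by hand. The following is a minimal sketch using plain pandas rather than the batchglm helpers. The alternating `condition` values and the index name `observations` are illustrative assumptions chosen to resemble the simulated data, not requirements stated by the library.

```python
import pandas as pd

# Assumed number of observations; it must match the number of rows in the count matrix.
num_observations = 10

# One row per observation, one column per property of the dataset.
# The alternating 0/1 assignment is purely illustrative.
sample_description = pd.DataFrame(
    {"condition": [i % 2 for i in range(num_observations)]},
    index=pd.RangeIndex(num_observations, name="observations"),
)

print(sample_description.head())
```

Additional properties, such as a batch label, would be added as further columns of the same DataFrame.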

In [3]:
sample_description

observations,condition
0,0
1,1
2,0
3,1
4,0
5,1
6,0
7,1
8,0
9,1


# Run differential expression test:

The Wald test checks whether a given coefficient has a significant effect on the expression of a gene.

It requires a model formula `formula` that describes the design, and `factor_loc_totest`, the factor of that formula to be tested.

Usually, this factor divides the samples into two groups, e.g. `condition 0` and `condition 1`.
In this case, diffxpy chooses the coefficient to test automatically.
If the factor specifies more than two groups, the coefficient to test has to be set manually via `coef_totest`. This coefficient should refer to one of the groups specified by `factor_loc_totest`, e.g. `condition 1`.
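
To make the idea concrete, here is a minimal sketch of the textbook two-sided Wald test for a single coefficient. It is not diffxpy's implementation: the function name `wald_test` and the input values are hypothetical, and the sketch assumes the estimate is approximately normally distributed around zero under the null hypothesis.

```python
import math


def wald_test(coef_mle, coef_sd):
    """Textbook two-sided Wald test for one coefficient (illustrative only).

    coef_mle: maximum-likelihood estimate of the coefficient
    coef_sd:  estimated standard deviation of that estimate
    """
    # Wald statistic: the estimate measured in units of its standard deviation.
    z = coef_mle / coef_sd
    # Two-sided p-value under a standard normal null: 2 * (1 - Phi(|z|)),
    # which equals erfc(|z| / sqrt(2)).
    pval = math.erfc(abs(z) / math.sqrt(2.0))
    return z, pval


# Hypothetical values, not taken from the output in this notebook.
z, pval = wald_test(coef_mle=0.5, coef_sd=0.25)
print(z, pval)
```

A large absolute statistic, meaning an estimate far from zero relative to its uncertainty, yields a small p-value.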


In [4]:
logging.getLogger("tensorflow").setLevel(logging.ERROR)
logging.getLogger("batchglm").setLevel(logging.INFO)
logging.getLogger("diffxpy").setLevel(logging.INFO)

import diffxpy as de

test = de.test_wald_loc(
    data=X,
    formula="~ 1 + condition",
    factor_loc_totest="condition",
    sample_description=sample_description
)


Estimating model...
Using closed-form MLE initialization for mean
Using closed-form MME initialization for dispersion
training strategy: [{'learning_rate': 0.01, 'convergence_criteria': 't_test', 'stop_at_loss_change': 0.25, 'loss_window_size': 10, 'use_batching': False, 'optim_algo': 'GD'}]
Beginning with training sequence #1
Training sequence #1 complete
Estimating model ready


# Obtaining the results

The p-/q-values can be obtained by accessing `test.pval` / `test.qval`:

In [5]:
test.qval


array([0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
       0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
       1.00000000e+00, 0.00000000e+00, 5.71299306e-05, 0.00000000e+00,
       0.00000000e+00, 0.00000000e+00, 1.00000000e+00, 0.00000000e+00,
       0.00000000e+00, 0.00000000e+00, 1.00000000e+00, 0.00000000e+00,
       1.00000000e+00, 0.00000000e+00, 1.00000000e+00, 1.00000000e+00,
       0.00000000e+00, 0.00000000e+00, 3.29597460e-15, 1.00000000e+00,
       0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
       0.00000000e+00, 1.00000000e+00, 0.00000000e+00, 0.00000000e+00,
       1.00000000e+00, 0.00000000e+00, 1.00000000e+00, 0.00000000e+00,
       1.00000000e+00, 1.00000000e+00, 0.00000000e+00, 0.00000000e+00,
       1.00000000e+00, 0.00000000e+00, 0.00000000e+00, 1.00000000e+00,
       0.00000000e+00, 4.10050451e-02, 1.00000000e+00, 1.00000000e+00,
       1.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
      

`test.summary()` returns a pandas DataFrame with a quick overview of the test results:

In [6]:
test.summary()

,gene,pval,qval,log2fc,grad,coef_mle,coef_sd,ll
0,0,0.000000e+00,0.000000e+00,0.334296,9.909659,0.334296,0.005385,-18640.732422
1,1,0.000000e+00,0.000000e+00,0.626702,9.909659,0.626702,0.012047,-20152.041016
2,2,0.000000e+00,0.000000e+00,0.462025,9.909659,0.462025,0.007524,-16083.605469
3,3,0.000000e+00,0.000000e+00,0.186110,9.909659,0.186110,0.005675,-19155.337891
4,4,0.000000e+00,0.000000e+00,0.339391,9.909659,0.339391,0.007353,-19271.341797
5,5,0.000000e+00,0.000000e+00,0.675572,9.909659,0.675572,0.015328,-20158.681641
6,6,0.000000e+00,0.000000e+00,0.293412,9.909659,0.293412,0.007587,-21650.812500
7,7,0.000000e+00,0.000000e+00,0.430906,9.909659,0.430906,0.015795,-23195.169922
8,8,1.000000e+00,1.000000e+00,-0.434504,9.909659,-0.434504,0.009644,-21810.517578
9,9,0.000000e+00,0.000000e+00,0.597703,9.909659,0.597703,0.006382,-21652.611328


- `gene`: gene name / identifier
- `pval`: p-value of the gene
- `qval`: multiple-testing-corrected p-value of the gene
- `log2fc`: log2 fold change between `no coefficient` and `coefficient`, i.e. between samples without and with the tested effect
- `grad`: the gradient of the gene's log-likelihood
- `coef_mle`: the maximum-likelihood estimate of the coefficient in linker space
- `coef_sd`: the standard deviation of the coefficient in linker space
- `ll`: the log-likelihood of the estimation
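
As an illustration of how a multiple-testing-corrected value such as `qval` can be derived from p-values, the sketch below implements the Benjamini-Hochberg procedure with NumPy. Which correction diffxpy applies is not stated here, so treat this as one common choice rather than a description of the library. The function name `bh_adjust` and the example p-values are hypothetical.

```python
import numpy as np


def bh_adjust(pvals):
    """Benjamini-Hochberg adjusted p-values (illustrative sketch)."""
    p = np.asarray(pvals, dtype=float)
    n = p.size
    # Sort p-values in ascending order and scale each by n / rank.
    order = np.argsort(p)
    scaled = p[order] * n / np.arange(1, n + 1)
    # Enforce monotonicity: each adjusted value is the minimum of the
    # scaled values at its own rank and all higher ranks.
    scaled = np.minimum.accumulate(scaled[::-1])[::-1]
    # Write the results back in the original order, capped at 1.
    q = np.empty(n)
    q[order] = np.clip(scaled, 0.0, 1.0)
    return q


# Hypothetical p-values, not taken from the output above.
pvals = [0.01, 0.04, 0.03, 0.5]
qvals = bh_adjust(pvals)
print(qvals)
```

Adjusted values are never smaller than the raw p-values, so a gene that passes a threshold on `qval` also passes the same threshold on `pval`.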