In [1]:
%load_ext autoreload
%autoreload 2

import anndata
import matplotlib.pyplot as plt
import seaborn as sns
import logging
import numpy as np
import pandas as pd
import scipy.stats

In [2]:
import diffxpy.api as de

# Introduction

Perfect confounding occurs frequently in differential expression assays, typically when biological replicates cannot be spread across conditions, as is often the case with animals or patients. Perfect confounding implies that the corresponding design matrix is not full rank and that the model is underdetermined. This can be circumvented by certain tricks (where replicates are modeled as the interaction of condition and a replicate index per condition), which essentially regress replicates onto reference replicates. This may be undesirable because the condition coefficients then depend on the identity of the reference replicates and accordingly on the ordering of the replicates, which has no experimental meaning and is purely a result of sample labels. Moreover, such tricks may be hard to come up with for complex designs. Here, we show how one can solve both problems by constraining parameters in the model.
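To make the rank deficiency concrete, here is a minimal NumPy sketch. It assumes a toy layout that is not part of the analysis in this notebook: four replicates, two per condition, with three observations each. The design matrix has six coefficients but only rank four, because the replicate indicators already span both the intercept and the treatment column.

```python
import numpy as np

# Hypothetical toy layout: replicates 0 and 1 are untreated,
# replicates 2 and 3 are treated, 3 observations per replicate.
replicate = np.repeat(np.arange(4), 3)
treated = (replicate >= 2).astype(float)

# Columns: intercept, treatment, one indicator column per replicate.
design = np.column_stack(
    [np.ones(replicate.size), treated]
    + [(replicate == r).astype(float) for r in range(4)]
)

n_coefficients = design.shape[1]
rank = np.linalg.matrix_rank(design)

# The replicate columns sum to the intercept, and the columns of
# replicates 2 and 3 sum to the treatment column, so two coefficients
# cannot be identified without further constraints.
print(n_coefficients, rank)
```

Dropping one replicate column as a reference level removes only one of the two linear dependencies, which is why a plain reference coding is not sufficient in this setting.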

# Generate data:

In this example, we simulate a treatment experiment with 4 conditions and assign each observation to one of 3 biological replicates (animals, patients, cell culture replicates etc.) within its condition, giving 12 individuals overall. Each individual occurs in exactly one condition, so there is perfect confounding at this level. We circumvent this by constraining the biological replicate coefficients.

In [3]:
from batchglm.api.models.glm_nb import Simulator

sim = Simulator(num_observations=700, num_features=100)
sim.generate_sample_description(num_batches=0, num_conditions=4)
sim.generate_params(
    rand_fn_loc=lambda shape: np.random.uniform(-0.1, 0.1, shape),
    rand_fn_scale=lambda shape: np.random.uniform(0.1, 2, shape)
)
sim.generate_data()

data = anndata.AnnData(
    X=sim.x,
    var=pd.DataFrame(index=["gene" + str(i) for i in range(sim.x.shape[1])]),
    obs=sim.sample_description
)

Transforming to str index.


In [4]:
data.obs["individual"] = [str(np.random.randint(0, 3)) + "_" + str(x) for x in data.obs["condition"].values]

# Dictionary encoding of constraint

We now run a Wald test for the condition effect while accounting for inter-individual differences. Note that the experimental setup is perfectly confounded, so we cannot simply fit an unconstrained model of the form `~ 1 + condition + individual`. Instead, we can constrain the individual effects within each condition to sum to zero, so that we account for the added variance in the model but restrict the condition coefficients to model the mean of the individuals in each condition.

Diffxpy allows this type of model through the constraints interface. Here, we can supply the constraint as a dictionary that maps the nested confounder (individual) to the covariate that it is nested in (condition): `constraints_loc={"individual": "condition"}`. Note that this constraint is only enforced on the location model (`constraints_loc`) because the scale model is simply an intercept here.
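The dictionary form is shorthand for one sum-to-zero equation per level of the grouping covariate. The following pure-Python sketch shows how such equations could be derived from a sample annotation. The helper `sum_to_zero_constraints` is hypothetical and written for illustration only; it is not diffxpy's implementation, and the labels are made up to resemble the `individual` column built above.

```python
def sum_to_zero_constraints(nested, grouping, nested_name="individual"):
    """Build one sum-to-zero equation per group (hypothetical helper)."""
    groups = {}
    for level, group in zip(nested, grouping):
        members = groups.setdefault(group, [])
        if level not in members:
            members.append(level)
    # One equation per group, with the nested levels in sorted order.
    return [
        "+".join(f"{nested_name}[{level}]" for level in sorted(members)) + "=0"
        for _, members in sorted(groups.items())
    ]


# Made-up annotation: each individual label occurs in exactly one group.
individuals = ["0_0", "1_0", "2_0", "1_0", "0_1", "2_1", "1_1"]
conditions = ["0", "0", "0", "0", "1", "1", "1"]

constraints = sum_to_zero_constraints(individuals, conditions)
for equation in constraints:
    print(equation)
```

Each resulting string ties together exactly the individuals that share a group, which is the structure needed to remove the confounding between the two factors.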

In [5]:
det_constr = de.test.wald(
    data=data.X, 
    sample_description=data.obs,
    gene_names=data.var_names,
    formula_loc="~ 0 + condition + individual",
    factor_loc_totest="condition",
    constraints_loc={"individual": "condition"},
    quick_scale=False
)

W0903 19:28:27.676522 4556338624 data.py:398] Built constraints: individual[0_0]+individual[1_0]+individual[2_0]=0, individual[0_1]+individual[1_1]+individual[2_1]=0, individual[0_2]+individual[1_2]+individual[2_2]=0, individual[0_3]+individual[1_3]+individual[2_3]=0

# Building constraints from scratch

In [6]:
ncells = 2000
dmat_est_loc = np.zeros([ncells, 6])
dmat_est_loc[:,0] = 1
dmat_est_loc[:500,1] = 1 # bio rep 1
dmat_est_loc[500:1000,2] = 1 # bio rep 2
dmat_est_loc[1000:1500,3] = 1 # bio rep 3
dmat_est_loc[1500:2000,4] = 1 # bio rep 4
dmat_est_loc[1000:2000,5] = 1 # condition effect

coefficient_names = ['intercept', 'bio1', 'bio2', 'bio3', 'bio4', 'treatment']
dmat_est_loc = pd.DataFrame(data=dmat_est_loc, columns=coefficient_names)
dmat_est_loc = de.test.design_matrix(dmat=dmat_est_loc)

AttributeError: module 'diffxpy.api.test' has no attribute 'design_matrix'

In [None]:
dmat_est_scale = np.ones([ncells, 1])
dmat_est_scale = pd.DataFrame(data=dmat_est_scale, columns=['intercept'])
dmat_est_scale = de.test.design_matrix(dmat=dmat_est_scale)

In [None]:
print(np.unique(dmat_est_loc.data_vars["design"], axis=0))

Define equality constraints that force groups of parameters of the confounding variable to sum to zero. These constraints make the perfectly confounded experimental design identifiable. Here, we have two groups of biological replicates, such as individuals, one group per condition. Each of these groups is forced to sum to zero, so that the treatment effect is the difference between the means of the sample values in log space (inverse linker space). These constraints are encoded as strings. Note that the coefficient names have to be exactly as defined in the design matrix.

In [None]:
constraints_loc = de.utils.data_utils.build_equality_constraints_string(
    dmat=dmat_est_loc,
    constraints=["bio1+bio2=0", "bio3+bio4=0"],
    dims=["design_loc_params", "loc_params"]
)
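The effect of the two equations `bio1+bio2=0` and `bio3+bio4=0` can be sketched with plain NumPy. This is an illustration of the idea under stated assumptions, not a description of how diffxpy or batchglm implement it: the six coefficients are expressed through four free parameters via a hand-written matrix (`constraint_matrix`, a name chosen here), and a noise-free least-squares fit on made-up log-means stands in for the GLM fit.

```python
import numpy as np

# Replicates bio1, bio2 are untreated; bio3, bio4 are treated.
replicate = np.repeat(np.arange(4), 5)
design = np.column_stack(
    [np.ones(replicate.size)]                              # intercept
    + [(replicate == r).astype(float) for r in range(4)]   # bio1..bio4
    + [(replicate >= 2).astype(float)]                     # treatment
)

# Map 4 free parameters (intercept, bio1, bio3, treatment) to all 6
# coefficients so that bio2 = -bio1 and bio4 = -bio3.
constraint_matrix = np.array([
    [1, 0, 0, 0],    # intercept
    [0, 1, 0, 0],    # bio1
    [0, -1, 0, 0],   # bio2 = -bio1
    [0, 0, 1, 0],    # bio3
    [0, 0, -1, 0],   # bio4 = -bio3
    [0, 0, 0, 1],    # treatment
], dtype=float)

constrained_design = design @ constraint_matrix
rank = np.linalg.matrix_rank(constrained_design)

# Made-up, noise-free log-means per replicate.
log_mean = np.array([1.0, 1.4, 2.0, 2.6])
y = log_mean[replicate]

free, *_ = np.linalg.lstsq(constrained_design, y, rcond=None)
coefficients = constraint_matrix @ free
print(rank, np.round(coefficients, 3))
```

With these numbers the constrained design has full column rank, and the treatment coefficient equals the difference between the per-condition means of the replicate log-means, (2.0 + 2.6) / 2 - (1.0 + 1.4) / 2 = 1.1, independent of how the replicates are ordered.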

The scale model only has an intercept and no perfect confounding. Accordingly, no constraints are necessary:

In [None]:
constraints_scale = None