In [1]:
import os
import datetime
import numpy as np
import xarray as xa
import pprint

import logging
import warnings

logging.getLogger("tensorflow").setLevel(logging.INFO)
logging.getLogger("batchglm").setLevel(logging.INFO)

## Import batchglm

In [2]:
import batchglm.api as glm

In [3]:
# Suppress TensorFlow deprecation warnings; this line can be ignored.
warnings.filterwarnings("ignore", category=DeprecationWarning, module="tensorflow")

# Introduction

Perfect confounding occurs frequently in differential expression assays, often because biological replicates cannot be spread across conditions: this is typically the case with animals or patients. Perfect confounding implies that the corresponding design matrix is not full rank and that the model is underdetermined. This can be circumvented by certain tricks which essentially regress replicates onto reference replicates. We believe that this is undesirable, firstly because the condition coefficients then depend on the identity of the reference replicates and accordingly on the ordering of the replicates, which has no experimental meaning and is purely a result of sample labels. Secondly, such tricks may be hard to come up with in hard cases such as the one presented in example 2. Here, we show how one can solve both problems by constraining parameters in the model.
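To make the dependence on the reference replicate concrete, here is a minimal sketch with made-up log-scale means for four replicates (two untreated, two treated). It solves a full-rank "reference replicate" design for two different choices of reference and compares the result with a sum-to-zero fit. The values in `means` and the helper `condition_effect` are illustrative assumptions and not part of batchglm.

```python
import numpy as np

# Hypothetical log-scale means of replicates 1-4 (1, 2 untreated; 3, 4 treated).
means = np.array([1.0, 2.0, 4.0, 7.0])


def condition_effect(ref_untreated, ref_treated):
    """Condition coefficient of a full-rank design that uses one reference
    replicate per condition (arguments are row indices into `means`)."""
    design = np.zeros((4, 4))
    design[:, 0] = 1.0                  # intercept
    design[2:, 1] = 1.0                 # condition effect (treated)
    design[1 - ref_untreated, 2] = 1.0  # non-reference untreated replicate
    design[5 - ref_treated, 3] = 1.0    # non-reference treated replicate
    return float(np.linalg.solve(design, means)[1])


# Sum-to-zero alternative: over-parameterised design plus two constraint rows.
# Columns: intercept, replicate 1-4, condition.
design_full = np.array([
    [1, 1, 0, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [1, 0, 0, 1, 0, 1],
    [1, 0, 0, 0, 1, 1],
], dtype=float)
sum_rows = np.array([
    [0, 0, 0, 1, 1, 0],  # treated replicate coefficients sum to zero
    [0, 1, 1, 1, 1, 0],  # all replicate coefficients sum to zero
], dtype=float)
coef = np.linalg.solve(np.vstack([design_full, sum_rows]),
                       np.concatenate([means, np.zeros(2)]))

effect_ref_first = condition_effect(0, 2)   # references: replicates 1 and 3
effect_ref_second = condition_effect(1, 3)  # references: replicates 2 and 4
effect_constrained = float(coef[5])
print(effect_ref_first, effect_ref_second, effect_constrained)
```

With these numbers the two reference choices give condition coefficients of about 3 and 5, whereas the constrained fit gives about 4, which is the difference between the two condition averages and does not depend on replicate ordering.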

# Example 1: easy

## Simulate data

In this example, we have 4 biological replicates (animals, patients, cell culture replicates etc.) in a treatment experiment: 2 in each condition (treated, untreated). Accordingly, there is perfect confounding at this level. We circumvent this by constraining the biological replicate coefficients to not model mean trends. 

### Define design matrices

In [4]:
ncells = 2000
dmat = np.zeros([ncells, 6])
dmat[:,0] = 1
dmat[:500,1] = 1 # bio rep 1
dmat[500:1000,2] = 1 # bio rep 2
dmat[1000:1500,3] = 1 # bio rep 3
dmat[1500:2000,4] = 1 # bio rep 4
dmat[1000:2000,5] = 1 # condition effect
print(np.unique(dmat, axis=0))

[[1. 0. 0. 0. 1. 1.]
 [1. 0. 0. 1. 0. 1.]
 [1. 0. 1. 0. 0. 0.]
 [1. 1. 0. 0. 0. 0.]]


In [5]:
sim = glm.models.nb_glm.Simulator(num_features=100)

In [6]:
sim.parse_dmat_loc(dmat = dmat)
sim.parse_dmat_scale(dmat = dmat)
sim.generate_params()
sim.generate_data()

### Simulated model data:

In [7]:
sim.X

<xarray.DataArray 'X' (observations: 2000, features: 100)>
array([[  865, 12768,   917, ..., 12761,  2869,  2481],
       [ 2751, 10937,   674, ...,  7563,  8906,  1191],
       [  427, 15638,   702, ...,  4856,  3458,  4536],
       ...,
       [ 4843, 15996,   865, ...,  7491, 15674,  5349],
       [ 1620, 17567,   811, ...,  6933, 19805,  4601],
       [14987, 26191,   723, ...,  7653, 16142,  3981]])
Dimensions without coordinates: observations, features

In [8]:
np.unique(sim.design_loc, axis=0)

array([[1., 0., 0., 0., 1., 1.],
       [1., 0., 0., 1., 0., 1.],
       [1., 0., 1., 0., 0., 0.],
       [1., 1., 0., 0., 0., 0.]])

### The parameters used to generate this data:

In [9]:
sim.par_link_loc

<xarray.DataArray 'a' (design_loc_params: 6, features: 100)>
array([[ 7.569248,  8.890025,  7.028238, ...,  8.353432,  8.723765,  8.625727],
       [-0.317375,  0.478504, -0.357971, ...,  0.475232, -0.101983, -0.551926],
       [ 0.390059,  0.332053,  0.379956, ...,  0.118794,  0.201195,  0.275354],
       [ 0.164272, -0.255698,  0.552759, ..., -0.483279,  0.479711,  0.477822],
       [ 0.180245,  0.492293, -0.109394, ...,  0.569809,  0.527508, -0.414742],
       [ 0.487067,  0.185528,  0.087079, ..., -0.095335,  0.57964 ,  0.300389]])
Coordinates:
  * design_loc_params  (design_loc_params) <U2 'p0' 'p1' 'p2' 'p3' 'p4' 'p5'
Dimensions without coordinates: features

In [10]:
sim.par_link_scale

<xarray.DataArray 'b' (design_scale_params: 6, features: 100)>
array([[ 0.693147,  2.079442,  1.791759, ...,  1.94591 ,  2.197225,  1.098612],
       [ 0.184218,  0.425898,  0.490049, ...,  0.472858, -0.50144 ,  0.524885],
       [ 0.225669,  0.028211,  0.24437 , ...,  0.658131,  0.679489, -0.264468],
       [ 0.367617,  0.405594,  0.617539, ...,  0.520513, -0.401469,  0.507148],
       [ 0.103364, -0.124424,  0.011604, ...,  0.630569,  0.571592,  0.664485],
       [-0.0635  , -0.031324,  0.36708 , ...,  0.396978, -0.139396,  0.60142 ]])
Coordinates:
  * design_scale_params  (design_scale_params) <U2 'p0' 'p1' 'p2' 'p3' 'p4' 'p5'
Dimensions without coordinates: features

## Constraints for model

In [11]:
dmat_est_loc = sim.design_loc

In [12]:
dmat_est_scale = sim.design_scale

Build constraints from sets of parameters that have to sum to zero. Each constraint is enforced by binding one parameter of the set to the remaining ones. Such a constraint is encoded by assigning a 1 to each free parameter in the set and a -1 to the dependent parameter.
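The encoding described above can be read as a linear map from the free parameters to the full parameter vector. The following is a minimal sketch of that reading, not batchglm's internal implementation: `free_to_full_map` is a hypothetical helper, and it assumes that every constraint row contains exactly one -1 and that no parameter is the dependent one in two rows.

```python
import numpy as np


def free_to_full_map(constraints):
    """Return (M, free_idx) such that full_params = M @ free_params
    satisfies every sum-to-zero constraint."""
    constraints = np.asarray(constraints, dtype=float)
    n_params = constraints.shape[1]
    # Dependent parameter of each constraint: the column holding the -1.
    dep_idx = np.array([int(np.flatnonzero(row == -1)[0]) for row in constraints])
    free_idx = np.array([j for j in range(n_params) if j not in dep_idx])
    sums = np.abs(constraints)  # indicator of each sum-to-zero set
    # sums[:, dep] @ p_dep + sums[:, free] @ p_free = 0, solved for p_dep
    dep_from_free = -np.linalg.solve(sums[:, dep_idx], sums[:, free_idx])
    M = np.zeros((n_params, len(free_idx)))
    M[free_idx, np.arange(len(free_idx))] = 1.0  # free parameters pass through
    M[dep_idx, :] = dep_from_free                # dependent ones are bound
    return M, free_idx


# Same encoding as used for example 1 (6 parameters, 2 constraints).
constraints = np.zeros((2, 6))
constraints[0, 3] = -1
constraints[0, 4:5] = 1
constraints[1, 1] = -1
constraints[1, 2:5] = 1

M, free_idx = free_to_full_map(constraints)
free_params = np.array([7.6, 0.4, 0.1, 0.7])  # illustrative values for p0, p2, p4, p5
full_params = M @ free_params
print(free_idx)
print(full_params)
```

For this constraint matrix the map binds p3 to -p4 and p1 to -p2, so only four of the six parameters remain free.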

In [13]:
constraints_loc = np.zeros([2, dmat_est_loc.shape[1]])
# Constraint 0: Account for perfect confounding at biological replicate and treatment level
# by constraining biological replicate coefficients not to produce mean effects across conditions.
constraints_loc[0,3] = -1
constraints_loc[0,4:5] = 1
# Constraint 1: Account for the fact that the first level of biological replicates was not absorbed into the offset.
constraints_loc[1,1] = -1
constraints_loc[1,2:5] = 1
constraints_loc

array([[ 0.,  0.,  0., -1.,  1.,  0.],
       [ 0., -1.,  1.,  1.,  1.,  0.]])

In [14]:
constraints_scale = constraints_loc.copy()

In [15]:
from numpy.linalg import matrix_rank
constraints_loc_mod = constraints_loc.copy()
constraints_loc_mod[constraints_loc_mod==-1] = 1
print(np.vstack([np.unique(dmat_est_loc, axis=0), constraints_loc_mod]))
print("rank deficiency without constraints: "+ str(dmat_est_loc.shape[1] - matrix_rank(np.vstack([np.unique(dmat_est_loc, axis=0)]))))
print("rank deficiency with constraints: "+ str(dmat_est_loc.shape[1] - matrix_rank(np.vstack([np.unique(dmat_est_loc, axis=0), constraints_loc_mod]))))

[[1. 0. 0. 0. 1. 1.]
 [1. 0. 0. 1. 0. 1.]
 [1. 0. 1. 0. 0. 0.]
 [1. 1. 0. 0. 0. 0.]
 [0. 0. 0. 1. 1. 0.]
 [0. 1. 1. 1. 1. 0.]]
rank deficiency without constraints: 2
rank deficiency with constraints: 0
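One way to see why the stacked rank check works is to look at the null space of the design matrix, i.e. the directions in parameter space that leave the linear predictor unchanged. The sketch below recomputes this for the example 1 design with plain NumPy; the variable names are illustrative, and the tolerance of `1e-10` is an arbitrary assumption.

```python
import numpy as np

# Unique rows of the example 1 design matrix.
design = np.array([
    [1, 0, 0, 0, 1, 1],
    [1, 0, 0, 1, 0, 1],
    [1, 0, 1, 0, 0, 0],
    [1, 1, 0, 0, 0, 0],
], dtype=float)
# Constraint rows with -1 replaced by 1 (each row is a sum that must be zero).
sum_rows = np.array([
    [0, 0, 0, 1, 1, 0],
    [0, 1, 1, 1, 1, 0],
], dtype=float)

# Right singular vectors beyond the numerical rank span the null space.
_, singular_values, vt = np.linalg.svd(design)
rank = int(np.sum(singular_values > 1e-10))
null_basis = vt[rank:].T  # columns: directions the data cannot distinguish

print("null space dimension:", null_basis.shape[1])
print("max |design @ null_basis|:", np.abs(design @ null_basis).max())
# The constraints remove the ambiguity if they act invertibly on the null space.
print("rank of constraints on null space:",
      np.linalg.matrix_rank(sum_rows @ null_basis))
```

The null space has as many dimensions as the reported rank deficiency, and the constraints resolve it because no non-zero null direction satisfies both sums at once.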


## Estimate the model

In [17]:
X = sim.X
design_loc = dmat_est_loc
design_scale = dmat_est_scale

# input data
input_data = glm.models.nb_glm.InputData.new(
    data=X, 
    design_loc=design_loc,
    design_scale=design_scale,
    constraints_loc=constraints_loc,
    constraints_scale=constraints_scale)

### Set up estimator:

In [18]:
estimator = glm.models.nb_glm.Estimator(input_data, quick_scale=False)
estimator.initialize()

Using closed-form MLE initialization for mean
Should train mu: False
Using closed-form MME initialization for dispersion
Should train r: True


  "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "


Graph was finalized.
Running local_init_op.
Done running local_init_op.


### Train

Now start the training sequence and let the estimator automatically choose the training strategy:

In [19]:
estimator.train_sequence('QUICK')

training strategy:
[{'convergence_criteria': 't_test',
  'learning_rate': 0.1,
  'loss_window_size': 100,
  'optim_algo': 'ADAM',
  'stop_at_loss_change': 0.05,
  'use_batching': True}]
Beginning with training sequence #1
Step: 1	loss: 882.075268
Step: 2	loss: 886.603730
Step: 3	loss: 888.413564
Step: 4	loss: 889.325455
Step: 5	loss: 881.437230
Step: 6	loss: 886.622384
Step: 7	loss: 889.451865
Step: 8	loss: 888.701746
Step: 9	loss: 882.119454
Step: 10	loss: 886.732701
Step: 11	loss: 888.263349
Step: 12	loss: 888.789886
Step: 13	loss: 881.556806
Step: 14	loss: 886.671547
Step: 15	loss: 888.566374
Step: 16	loss: 889.120927
Step: 17	loss: 881.184785
Step: 18	loss: 885.914453
Step: 19	loss: 889.513262
Step: 20	loss: 889.190920
Step: 21	loss: 881.919622
Step: 22	loss: 885.329601
Step: 23	loss: 889.437051
Step: 24	loss: 889.126494
Step: 25	loss: 881.133643
Step: 26	loss: 886.371602
Step: 27	loss: 889.085185
Step: 28	loss: 889.185028
Step: 29	loss: 881.564541
Step: 30	loss: 886.754848
Step: 3

## Obtaining the results

The fitted parameters can be retrieved by accessing the corresponding attributes of `estimator`:

In [20]:
estimator.par_link_loc

<xarray.DataArray (design_loc_params: 6, features: 100)>
array([[ 7.60433 ,  9.299009,  7.036924, ...,  8.647857,  8.786441,  8.500951],
       [-0.378287,  0.078669, -0.382539, ...,  0.172109, -0.150278, -0.438347],
       [ 0.378287, -0.078669,  0.382539, ..., -0.172109,  0.150278,  0.438347],
       [-0.040116, -0.394487,  0.344727, ..., -0.52132 , -0.015463,  0.463282],
       [ 0.040116,  0.394487, -0.344727, ...,  0.52132 ,  0.015463, -0.463282],
       [ 0.656642, -0.109476,  0.303267, ..., -0.333953,  1.029455,  0.465089]])
Coordinates:
  * design_loc_params  (design_loc_params) <U2 'p0' 'p1' 'p2' 'p3' 'p4' 'p5'
    feature_allzero    (features) bool False False False False False False ...
  * features           (features) int64 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 ...

In [21]:
estimator.par_link_scale

<xarray.DataArray (design_scale_params: 6, features: 100)>
array([[ 0.937191,  2.361788,  2.112912, ...,  2.506849,  2.256079,  1.175205],
       [-0.08379 ,  0.259864,  0.110316, ..., -0.11203 , -0.653096,  0.461834],
       [ 0.08379 , -0.259864, -0.110316, ...,  0.11203 ,  0.653096, -0.461834],
       [ 0.152664,  0.211096,  0.385954, ..., -0.096363, -0.514341, -0.082485],
       [-0.152664, -0.211096, -0.385954, ...,  0.096363,  0.514341,  0.082485],
       [-0.071221, -0.145986,  0.429112, ...,  0.447067, -0.091087,  1.154632]])
Coordinates:
  * design_scale_params  (design_scale_params) <U2 'p0' 'p1' 'p2' 'p3' 'p4' 'p5'
    feature_allzero      (features) bool False False False False False False ...
  * features             (features) int64 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 ...

### Check that constraints were met

These parameter sets should sum to zero for each gene.

In [22]:
np.max(estimator.par_link_loc[1,:]+np.sum(estimator.par_link_loc[2:5,:], axis=0))

<xarray.DataArray ()>
array(5.551115e-17)
Coordinates:
    design_loc_params  <U2 'p1'

In [23]:
np.max(np.sum(estimator.par_link_loc[1:3,:], axis=0)+np.sum(estimator.par_link_loc[3:5,:], axis=0))

<xarray.DataArray ()>
array(5.551115e-17)
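Both cells above evaluate the same sum (p1 + p2 + p3 + p4), and `np.max` without an absolute value would not flag large negative violations. A check that covers every constraint row separately could look like the following sketch. Here `params` is a stand-in for `estimator.par_link_loc` filled with made-up values, and `max_constraint_violation` is a hypothetical helper rather than a batchglm function.

```python
import numpy as np


def max_constraint_violation(constraints, params):
    """Largest absolute per-feature sum over each constraint's parameter set.

    constraints: (n_constraints, n_params) with entries in {-1, 0, 1}
    params:      (n_params, n_features)
    """
    sums = np.abs(np.asarray(constraints, dtype=float)) @ np.asarray(params)
    return np.abs(sums).max(axis=1)  # one value per constraint row


constraints = np.array([
    [0,  0, 0, -1, 1, 0],
    [0, -1, 1,  1, 1, 0],
], dtype=float)

# Made-up parameters for three features that satisfy both constraints.
params = np.array([
    [ 7.6,  9.3,  7.0],
    [-0.4,  0.1, -0.4],
    [ 0.4, -0.1,  0.4],
    [-0.1, -0.4,  0.3],
    [ 0.1,  0.4, -0.3],
    [ 0.7, -0.1,  0.3],
])
violations = max_constraint_violation(constraints, params)
print(violations)
```

Because the helper works row by row on the constraint matrix, the same call applies unchanged to models with more constraints, such as one with additional technical replicate constraints.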

## Comparing the results with the simulated data:

Linear model output:

In [24]:
locdiff = glm.utils.stats.rmsd(np.matmul(estimator.design_loc, estimator.par_link_loc), 
                               np.matmul(sim.design_loc, sim.par_link_loc))
print("Root mean squared deviation of location: %.2f" % locdiff)

scalediff = glm.utils.stats.rmsd(np.matmul(estimator.design_scale, estimator.par_link_scale), 
                                 np.matmul(sim.design_scale, sim.par_link_scale))
print("Root mean squared deviation of scale:    %.2f" % scalediff)

Root mean squared deviation of location: 0.02
Root mean squared deviation of scale:    0.07
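The comparison is made on the linear predictor (design matrix times coefficients) rather than on the coefficients themselves: the simulated coefficients are not constrained, so they need not match the constrained estimates even for a perfect fit. The sketch below illustrates this with made-up coefficients for the example 1 design. The local `rmsd` function is an assumption about what `glm.utils.stats.rmsd` computes (root of the mean squared difference), not its actual source.

```python
import numpy as np


def rmsd(a, b):
    """Root mean squared deviation between two arrays of equal shape."""
    return float(np.sqrt(np.mean((np.asarray(a) - np.asarray(b)) ** 2)))


design = np.array([
    [1, 0, 0, 0, 1, 1],
    [1, 0, 0, 1, 0, 1],
    [1, 0, 1, 0, 0, 0],
    [1, 1, 0, 0, 0, 0],
], dtype=float)
sum_rows = np.array([
    [0, 0, 0, 1, 1, 0],
    [0, 1, 1, 1, 1, 0],
], dtype=float)

# Made-up unconstrained coefficients, standing in for simulated parameters.
coef_unconstrained = np.array([7.5, -0.3, 0.4, 0.2, 0.2, 0.5])
eta = design @ coef_unconstrained

# Constrained coefficients that reproduce the same linear predictor exactly.
coef_constrained = np.linalg.solve(np.vstack([design, sum_rows]),
                                   np.concatenate([eta, np.zeros(2)]))

print("RMSD of coefficients:     %.3f" % rmsd(coef_constrained, coef_unconstrained))
print("RMSD of linear predictor: %.3f" % rmsd(design @ coef_constrained, eta))
```

The two coefficient vectors differ, yet they describe the same fitted means, which is why only the linear predictor is a meaningful basis for comparison here.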


# Example 2: advanced

## Simulate some data

In this example, we have 4 biological replicates (animals, patients, cell culture replicates etc.) in a treatment experiment: 2 in each condition (treated, untreated). Accordingly, there is already perfect confounding at this level. We circumvent this by constraining the biological replicate coefficients to not model mean trends (constraints 0 and 1). Secondly, there are technical replicates which contain cells from one biological replicate from each condition. Each biological replicate was assigned to one treated-untreated sample pair and each pair was split into two technical replicates. Again, we correct for perfect confounding by constraining the technical replicate coefficients not to model mean effects (constraints 2 and 3).

### Define design matrices

In [25]:
ncells = 2000
dmat = np.zeros([ncells, 10])
dmat[:,0] = 1
dmat[:500,1] = 1 # bio rep 1
dmat[500:1000,2] = 1 # bio rep 2
dmat[1000:1500,3] = 1 # bio rep 3
dmat[1500:2000,4] = 1 # bio rep 4
dmat[0:250,5] = 1 # tech rep 1
dmat[1000:1250,5] = 1 # tech rep 1
dmat[250:500,6] = 1 # tech rep 2
dmat[1250:1500,6] = 1 # tech rep 2
dmat[500:750,7] = 1 # tech rep 3
dmat[1500:1750,7] = 1 # tech rep 3
dmat[750:1000,8] = 1 # tech rep 4
dmat[1750:2000,8] = 1 # tech rep 4
dmat[1000:2000,9] = 1 # condition effect
print(np.unique(dmat, axis=0))

[[1. 0. 0. 0. 1. 0. 0. 0. 1. 1.]
 [1. 0. 0. 0. 1. 0. 0. 1. 0. 1.]
 [1. 0. 0. 1. 0. 0. 1. 0. 0. 1.]
 [1. 0. 0. 1. 0. 1. 0. 0. 0. 1.]
 [1. 0. 1. 0. 0. 0. 0. 0. 1. 0.]
 [1. 0. 1. 0. 0. 0. 0. 1. 0. 0.]
 [1. 1. 0. 0. 0. 0. 1. 0. 0. 0.]
 [1. 1. 0. 0. 0. 1. 0. 0. 0. 0.]]
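The slice-based construction above hard-codes the cell ordering. As an alternative sketch, the same 10-column design can be assembled from per-cell label vectors, which may be closer to how annotations arrive in practice. The helper `one_hot` and the label vectors `bio`, `tech` and `condition` are illustrative names, and the layout (500 cells per biological replicate, split in halves) is taken from the cell above.

```python
import numpy as np


def one_hot(labels, n_levels):
    """Indicator matrix with one column per level of an integer label vector."""
    out = np.zeros((len(labels), n_levels))
    out[np.arange(len(labels)), labels] = 1.0
    return out


ncells = 2000
bio = np.repeat(np.arange(4), 500)         # biological replicate index 0..3 per cell
half = np.tile(np.repeat([0, 1], 250), 4)  # first or second half within each replicate
pair = bio % 2                             # replicates (1, 3) and (2, 4) form sample pairs
tech = 2 * pair + half                     # technical replicate index 0..3
condition = (bio >= 2).astype(float)       # replicates 3 and 4 are treated

dmat_from_labels = np.column_stack([
    np.ones(ncells),   # intercept
    one_hot(bio, 4),   # biological replicates
    one_hot(tech, 4),  # technical replicates
    condition,         # condition effect
])
print(np.unique(dmat_from_labels, axis=0))
```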


In [26]:
sim = glm.models.nb_glm.Simulator(num_features=100)

In [27]:
sim.parse_dmat_loc(dmat = dmat)
sim.parse_dmat_scale(dmat = dmat)
sim.generate_params()
sim.generate_data()

### Simulated model data:

In [28]:
sim.X

<xarray.DataArray 'X' (observations: 2000, features: 100)>
array([[  642,   249, 12487, ..., 20348,  2165,  7319],
       [ 1576, 14048, 21200, ..., 21197,  3093,  2970],
       [ 2671,   196, 25641, ...,  7072,  2239,  5689],
       ...,
       [ 7340,  2571, 10871, ..., 23945,   784, 32510],
       [ 7164,  6329,  9987, ...,  9381,  3204, 17760],
       [ 6140,  1068, 10479, ..., 10916,  4179, 28858]])
Dimensions without coordinates: observations, features

## Constraints for model

In [29]:
dmat_est_loc = sim.design_loc

In [30]:
dmat_est_scale = sim.design_scale

Build constraints from sets of parameters that have to sum to zero. Each constraint is enforced by binding one parameter of the set to the remaining ones. Such a constraint is encoded by assigning a 1 to each free parameter in the set and a -1 to the dependent parameter.

In [31]:
np.unique(dmat_est_loc, axis=0)

array([[1., 0., 0., 0., 1., 0., 0., 0., 1., 1.],
       [1., 0., 0., 0., 1., 0., 0., 1., 0., 1.],
       [1., 0., 0., 1., 0., 0., 1., 0., 0., 1.],
       [1., 0., 0., 1., 0., 1., 0., 0., 0., 1.],
       [1., 0., 1., 0., 0., 0., 0., 0., 1., 0.],
       [1., 0., 1., 0., 0., 0., 0., 1., 0., 0.],
       [1., 1., 0., 0., 0., 0., 1., 0., 0., 0.],
       [1., 1., 0., 0., 0., 1., 0., 0., 0., 0.]])

In [32]:
constraints_loc = np.zeros([4, dmat_est_loc.shape[1]])
# Constraint 0: Account for perfect confounding at biological replicate and treatment level
# by constraining biological replicate coefficients not to produce mean effects across conditions.
constraints_loc[0,3] = -1
constraints_loc[0,4:5] = 1
# Constraint 1: Account for the fact that the first level of biological replicates was not absorbed into the offset.
constraints_loc[1,1] = -1
constraints_loc[1,2:5] = 1
# Constraint 2: Account for perfect confounding at biological replicate and technical replicate level
# by constraining technical replicate coefficients not to produce mean effects across biological replicates.
constraints_loc[2,7] = -1
constraints_loc[2,8:9] = 1
# Constraint 3: Account for the fact that the first level of technical replicates was not absorbed into the offset.
constraints_loc[3,5] = -1
constraints_loc[3,6:9] = 1

constraints_loc

array([[ 0.,  0.,  0., -1.,  1.,  0.,  0.,  0.,  0.,  0.],
       [ 0., -1.,  1.,  1.,  1.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  0., -1.,  1.,  0.],
       [ 0.,  0.,  0.,  0.,  0., -1.,  1.,  1.,  1.,  0.]])

In [33]:
constraints_scale = constraints_loc.copy()

In [34]:
from numpy.linalg import matrix_rank
constraints_loc_mod = constraints_loc.copy()
constraints_loc_mod[constraints_loc_mod==-1] = 1
print(np.vstack([np.unique(dmat_est_loc, axis=0), constraints_loc_mod]))
print("rank deficiency without constraints: "+ str(dmat_est_loc.shape[1] - matrix_rank(np.vstack([np.unique(dmat_est_loc, axis=0)]))))
print("rank deficiency with constraints: "+ str(dmat_est_loc.shape[1] - matrix_rank(np.vstack([np.unique(dmat_est_loc, axis=0), constraints_loc_mod]))))

[[1. 0. 0. 0. 1. 0. 0. 0. 1. 1.]
 [1. 0. 0. 0. 1. 0. 0. 1. 0. 1.]
 [1. 0. 0. 1. 0. 0. 1. 0. 0. 1.]
 [1. 0. 0. 1. 0. 1. 0. 0. 0. 1.]
 [1. 0. 1. 0. 0. 0. 0. 0. 1. 0.]
 [1. 0. 1. 0. 0. 0. 0. 1. 0. 0.]
 [1. 1. 0. 0. 0. 0. 1. 0. 0. 0.]
 [1. 1. 0. 0. 0. 1. 0. 0. 0. 0.]
 [0. 0. 0. 1. 1. 0. 0. 0. 0. 0.]
 [0. 1. 1. 1. 1. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 1. 1. 0.]
 [0. 0. 0. 0. 0. 1. 1. 1. 1. 0.]]
rank deficiency without constraints: 4
rank deficiency with constraints: 0


## Estimate the model

In [35]:
X = sim.X
design_loc = dmat_est_loc
design_scale = dmat_est_scale

# input data
input_data = glm.models.nb_glm.InputData.new(
    data=X, 
    design_loc=design_loc,
    design_scale=design_scale,
    constraints_loc=constraints_loc,
    constraints_scale=constraints_scale)

### Set up estimator:

Note that there is no closed form estimator for the mean model here due to the confounding. The model is initialised with least squares but the mean model is also trained.

In [36]:
estimator = glm.models.nb_glm.Estimator(input_data, quick_scale=False)
estimator.initialize()

Using closed-form MLE initialization for mean
Should train mu: True
Using closed-form MME initialization for dispersion
Should train r: True


  "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "


Graph was finalized.
Running local_init_op.
Done running local_init_op.


### Train

Now start the training sequence and let the estimator automatically choose the training strategy:

In [37]:
estimator.train_sequence('QUICK')

training strategy:
[{'convergence_criteria': 't_test',
  'learning_rate': 0.1,
  'loss_window_size': 100,
  'optim_algo': 'ADAM',
  'stop_at_loss_change': 0.05,
  'use_batching': True}]
Beginning with training sequence #1
Step: 1	loss: 885.917835
Step: 2	loss: 911.328084
Step: 3	loss: 892.516976
Step: 4	loss: 892.208725
Step: 5	loss: 894.579411
Step: 6	loss: 899.264257
Step: 7	loss: 896.262826
Step: 8	loss: 893.874717
Step: 9	loss: 890.636066
Step: 10	loss: 892.263851
Step: 11	loss: 894.195462
Step: 12	loss: 892.248708
Step: 13	loss: 890.761843
Step: 14	loss: 891.491649
Step: 15	loss: 891.946005
Step: 16	loss: 891.261989
Step: 17	loss: 887.718970
Step: 18	loss: 890.695663
Step: 19	loss: 889.861702
Step: 20	loss: 890.756509
Step: 21	loss: 887.416082
Step: 22	loss: 890.205908
Step: 23	loss: 889.012561
Step: 24	loss: 889.114767
Step: 25	loss: 887.146485
Step: 26	loss: 888.758632
Step: 27	loss: 888.741073
Step: 28	loss: 889.252259
Step: 29	loss: 886.806134
Step: 30	loss: 888.869114
Step: 3

Step: 300	loss: 887.954476
pval: 0.811604
Training sequence #1 complete


## Obtaining the results

### Check that constraints were met

These parameter sets should sum to zero for each gene.

In [38]:
np.max(estimator.par_link_loc[1,:]+np.sum(estimator.par_link_loc[2:5,:], axis=0))

<xarray.DataArray ()>
array(1.110223e-16)
Coordinates:
    design_loc_params  <U2 'p1'

In [39]:
np.max(np.sum(estimator.par_link_loc[1:3,:], axis=0)+np.sum(estimator.par_link_loc[3:5,:], axis=0))

<xarray.DataArray ()>
array(1.110223e-16)

## Comparing the results with the simulated data:

Linear model output:

In [40]:
locdiff = glm.utils.stats.rmsd(np.matmul(estimator.design_loc, estimator.par_link_loc), 
                               np.matmul(sim.design_loc, sim.par_link_loc))
print("Root mean squared deviation of location: %.2f" % locdiff)

scalediff = glm.utils.stats.rmsd(np.matmul(estimator.design_scale, estimator.par_link_scale), 
                                 np.matmul(sim.design_scale, sim.par_link_scale))
print("Root mean squared deviation of scale:    %.2f" % scalediff)

Root mean squared deviation of location: 0.04
Root mean squared deviation of scale:    0.10


# Example 3: advanced

## Simulate some data

In this example, we have the same scenario as in example 2, but the technical replicate split is missing for one of the two sample pairs (biological replicates 2 and 4). We have to drop the corresponding constraint and remove the two parameters belonging to this pair of technical replicates.
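The step from the example 2 constraints to the reduced set can be written down mechanically. The following sketch starts from the example 2 constraint matrix, removes the columns of the two absent technical replicate parameters, and drops any constraint that no longer involves a parameter. The names `constraints_ex2`, `missing_params` and `constraints_ex3` are illustrative.

```python
import numpy as np

# Constraint matrix of example 2
# (columns: intercept, 4 biological replicates, 4 technical replicates, condition).
constraints_ex2 = np.zeros((4, 10))
constraints_ex2[0, 3] = -1
constraints_ex2[0, 4:5] = 1
constraints_ex2[1, 1] = -1
constraints_ex2[1, 2:5] = 1
constraints_ex2[2, 7] = -1
constraints_ex2[2, 8:9] = 1
constraints_ex2[3, 5] = -1
constraints_ex2[3, 6:9] = 1

missing_params = [7, 8]  # parameters of technical replicates 3 and 4
reduced = np.delete(constraints_ex2, missing_params, axis=1)
# A constraint that only involved removed parameters is now an all-zero row.
constraints_ex3 = reduced[np.any(reduced != 0, axis=1)]
print(constraints_ex3)
```

The remaining technical replicate constraint then only ties the two surviving technical replicate parameters to each other.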

### Define design matrices

In [41]:
ncells = 2000
dmat = np.zeros([ncells, 8])
dmat[:,0] = 1
dmat[:500,1] = 1 # bio rep 1
dmat[500:1000,2] = 1 # bio rep 2
dmat[1000:1500,3] = 1 # bio rep 3
dmat[1500:2000,4] = 1 # bio rep 4
dmat[0:250,5] = 1 # tech rep 1
dmat[1000:1250,5] = 1 # tech rep 1
dmat[250:500,6] = 1 # tech rep 2
dmat[1250:1500,6] = 1 # tech rep 2
dmat[1000:2000,7] = 1 # condition effect
print(np.unique(dmat, axis=0))

[[1. 0. 0. 0. 1. 0. 0. 1.]
 [1. 0. 0. 1. 0. 0. 1. 1.]
 [1. 0. 0. 1. 0. 1. 0. 1.]
 [1. 0. 1. 0. 0. 0. 0. 0.]
 [1. 1. 0. 0. 0. 0. 1. 0.]
 [1. 1. 0. 0. 0. 1. 0. 0.]]


In [42]:
sim = glm.models.nb_glm.Simulator(num_features=100)

In [43]:
sim.parse_dmat_loc(dmat = dmat)
sim.parse_dmat_scale(dmat = dmat)
sim.generate_params()
sim.generate_data()

### Simulated model data:

In [44]:
sim.X

<xarray.DataArray 'X' (observations: 2000, features: 100)>
array([[ 7162,  7387,  3382, ...,  4642,  3427,  2539],
       [11531, 28312,  4398, ...,  6501,  2815,  3052],
       [11137, 21964,  8779, ...,  2818,  2458,  2243],
       ...,
       [11197, 14933,  2099, ...,  4123,  6367,  4362],
       [ 8916, 17550,  2223, ...,  2077,  5256,  4591],
       [ 7343, 11708,  2448, ...,  5147,  6220,  6653]])
Dimensions without coordinates: observations, features

## Constraints for model

In [45]:
dmat_est_loc = sim.design_loc

In [46]:
dmat_est_scale = sim.design_scale

Build constraints from sets of parameters that have to sum to zero. Each constraint is enforced by binding one parameter of the set to the remaining ones. Such a constraint is encoded by assigning a 1 to each free parameter in the set and a -1 to the dependent parameter.

In [47]:
np.unique(dmat_est_loc, axis=0)

array([[1., 0., 0., 0., 1., 0., 0., 1.],
       [1., 0., 0., 1., 0., 0., 1., 1.],
       [1., 0., 0., 1., 0., 1., 0., 1.],
       [1., 0., 1., 0., 0., 0., 0., 0.],
       [1., 1., 0., 0., 0., 0., 1., 0.],
       [1., 1., 0., 0., 0., 1., 0., 0.]])

In [48]:
constraints_loc = np.zeros([3, dmat_est_loc.shape[1]])
# Constraint 0: Account for perfect confounding at biological replicate and treatment level
# by constraining biological replicate coefficients not to produce mean effects across conditions.
constraints_loc[0,3] = -1
constraints_loc[0,4:5] = 1
# Constraint 1: Account for the fact that the first level of biological replicates was not absorbed into the offset.
constraints_loc[1,1] = -1
constraints_loc[1,2:5] = 1
# Constraint 2: Account for the fact that the first level of technical replicates was not absorbed into the offset.
constraints_loc[2,5] = -1
constraints_loc[2,6:7] = 1

constraints_loc

array([[ 0.,  0.,  0., -1.,  1.,  0.,  0.,  0.],
       [ 0., -1.,  1.,  1.,  1.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0., -1.,  1.,  0.]])

In [49]:
constraints_scale = constraints_loc.copy()

In [50]:
from numpy.linalg import matrix_rank
constraints_loc_mod = constraints_loc.copy()
constraints_loc_mod[constraints_loc_mod==-1] = 1
print(np.vstack([np.unique(dmat_est_loc, axis=0), constraints_loc_mod]))
print("rank deficiency without constraints: "+ str(dmat_est_loc.shape[1] - matrix_rank(np.vstack([np.unique(dmat_est_loc, axis=0)]))))
print("rank deficiency with constraints: "+ str(dmat_est_loc.shape[1] - matrix_rank(np.vstack([np.unique(dmat_est_loc, axis=0), constraints_loc_mod]))))

[[1. 0. 0. 0. 1. 0. 0. 1.]
 [1. 0. 0. 1. 0. 0. 1. 1.]
 [1. 0. 0. 1. 0. 1. 0. 1.]
 [1. 0. 1. 0. 0. 0. 0. 0.]
 [1. 1. 0. 0. 0. 0. 1. 0.]
 [1. 1. 0. 0. 0. 1. 0. 0.]
 [0. 0. 0. 1. 1. 0. 0. 0.]
 [0. 1. 1. 1. 1. 0. 0. 0.]
 [0. 0. 0. 0. 0. 1. 1. 0.]]
rank deficiency without constraints: 3
rank deficiency with constraints: 0


## Estimate the model

In [51]:
X = sim.X
design_loc = dmat_est_loc
design_scale = dmat_est_scale

# input data
input_data = glm.models.nb_glm.InputData.new(
    data=X, 
    design_loc=design_loc,
    design_scale=design_scale,
    constraints_loc=constraints_loc,
    constraints_scale=constraints_scale)

### Set up estimator:

Note that there is no closed form estimator for the mean model here due to the confounding. The model is initialised with least squares but the mean model is also trained.

In [None]:
estimator = glm.models.nb_glm.Estimator(input_data, quick_scale=False)
estimator.initialize()

Using closed-form MLE initialization for mean
Should train mu: True
Using closed-form MME initialization for dispersion
Should train r: True


  "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "


### Train

Now start the training sequence and let the estimator automatically choose the training strategy:

In [None]:
estimator.train_sequence('QUICK')

## Obtaining the results

### Check that constraints were met

These parameter sets should sum to zero for each gene.

In [None]:
np.max(estimator.par_link_loc[1,:]+np.sum(estimator.par_link_loc[2:5,:], axis=0))

In [None]:
np.max(np.sum(estimator.par_link_loc[1:3,:], axis=0)+np.sum(estimator.par_link_loc[3:5,:], axis=0))

## Comparing the results with the simulated data:

Linear model output:

In [None]:
locdiff = glm.utils.stats.rmsd(np.matmul(estimator.design_loc, estimator.par_link_loc), 
                               np.matmul(sim.design_loc, sim.par_link_loc))
print("Root mean squared deviation of location: %.2f" % locdiff)

scalediff = glm.utils.stats.rmsd(np.matmul(estimator.design_scale, estimator.par_link_scale), 
                                 np.matmul(sim.design_scale, sim.par_link_scale))
print("Root mean squared deviation of scale:    %.2f" % scalediff)