In [1]:
import os
import datetime
import numpy as np
import xarray as xa
import pprint

import logging
import warnings

logging.getLogger("tensorflow").setLevel(logging.INFO)
logging.getLogger("batchglm").setLevel(logging.INFO)

## Import batchglm

In [2]:
import batchglm.api as glm

In [3]:
# Suppress TensorFlow deprecation warnings; this line can be ignored.
warnings.filterwarnings("ignore", category=DeprecationWarning, module="tensorflow")

# Introduction

Perfect confounding occurs frequently in differential expression assays, typically when biological replicates cannot be spread across conditions, as is often the case with animals or patients. Perfect confounding implies that the corresponding design matrix is not full rank and that the model is underdetermined. This can be circumvented by certain tricks (where replicates are modeled as the interaction of condition and a replicate index per condition), which essentially regress replicates onto reference replicates. We believe that this is undesirable for two reasons. Firstly, the condition coefficients depend on the identity of the reference replicates and accordingly on the ordering of the replicates, which has no experimental meaning and is purely a result of sample labels. Secondly, such tricks may be hard to come up with in hard cases such as those presented in examples 2 and 3 below. Here, we show how one can solve both problems by constraining parameters in the model.

# Example 1: easy

## Simulate data

In this example, we have 4 biological replicates (animals, patients, cell culture replicates etc.) in a treatment experiment: 2 in each condition (treated, untreated). Accordingly, there is perfect confounding at this level. We circumvent this by constraining the biological replicate coefficients. 
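To make the confounding concrete, the following minimal sketch builds one design row per biological replicate and verifies the two linear dependencies among the columns. The name `design_demo` is introduced here for illustration only and is not part of batchglm.

```python
import numpy as np
from numpy.linalg import matrix_rank

# One row per biological replicate.
# Columns: intercept, bio rep 1-4, condition (treated = 1).
design_demo = np.array([
    [1., 1., 0., 0., 0., 0.],  # bio rep 1, untreated
    [1., 0., 1., 0., 0., 0.],  # bio rep 2, untreated
    [1., 0., 0., 1., 0., 1.],  # bio rep 3, treated
    [1., 0., 0., 0., 1., 1.],  # bio rep 4, treated
])

# Dependency 1: the replicate indicators sum to the intercept column.
reps_equal_intercept = np.array_equal(design_demo[:, 1:5].sum(axis=1), design_demo[:, 0])
# Dependency 2: the treated replicates sum to the condition column.
treated_equal_condition = np.array_equal(design_demo[:, 3:5].sum(axis=1), design_demo[:, 5])
# Number of parameters that cannot be identified without constraints.
deficiency_demo = design_demo.shape[1] - matrix_rank(design_demo)

print(reps_equal_intercept, treated_equal_condition, deficiency_demo)
```

Each dependency removes one identifiable direction from the parameter space, which is why two constraints are needed in this example.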

### Define design matrices

In [4]:
ncells = 2000
dmat = np.zeros([ncells, 6])
dmat[:,0] = 1
dmat[:500,1] = 1 # bio rep 1
dmat[500:1000,2] = 1 # bio rep 2
dmat[1000:1500,3] = 1 # bio rep 3
dmat[1500:2000,4] = 1 # bio rep 4
dmat[1000:2000,5] = 1 # condition effect
print(np.unique(dmat, axis=0))

[[1. 0. 0. 0. 1. 1.]
 [1. 0. 0. 1. 0. 1.]
 [1. 0. 1. 0. 0. 0.]
 [1. 1. 0. 0. 0. 0.]]


In [5]:
sim = glm.models.nb_glm.Simulator(num_features=100)

In [6]:
sim.parse_dmat_loc(dmat = dmat)
sim.parse_dmat_scale(dmat = dmat)
sim.generate_params()
sim.generate_data()

### Simulated model data:

In [7]:
sim.X

<xarray.DataArray 'X' (observations: 2000, features: 100)>
array([[  694,  1738,  3272, ...,  2933,   922,  4066],
       [  488,  1892,  3034, ...,  2933,   328,  2436],
       [  655,  2255,  2069, ...,  2860,  1395,  2708],
       ...,
       [  404,  7535, 13393, ...,  3165,  1298,  4830],
       [  410, 13540,  7661, ...,  9533,  3017,  4518],
       [  283,  9925,  6575, ...,  3934,  1855,  2788]])
Dimensions without coordinates: observations, features

In [8]:
np.unique(sim.design_loc, axis=0)

array([[1., 0., 0., 0., 1., 1.],
       [1., 0., 0., 1., 0., 1.],
       [1., 0., 1., 0., 0., 0.],
       [1., 1., 0., 0., 0., 0.]])

### The parameters used to generate this data:

In [9]:
sim.par_link_loc

<xarray.DataArray 'a' (design_loc_params: 6, features: 100)>
array([[ 6.169171,  7.84706 ,  8.1824  , ...,  7.512092,  7.303938,  7.954016],
       [ 0.616672,  0.049985,  0.156191, ...,  0.674974, -0.342042, -0.091577],
       [-0.292955,  0.572764,  0.541457, ...,  0.519858,  0.231112, -0.456774],
       [-0.552061,  0.559411,  0.667111, ...,  0.109105,  0.370208, -0.202652],
       [-0.321391,  0.641852,  0.666109, ...,  0.34856 ,  0.521367,  0.376613],
       [ 0.148241,  0.559405,  0.273667, ...,  0.579711, -0.485237, -0.270124]])
Coordinates:
  * design_loc_params  (design_loc_params) <U2 'p0' 'p1' 'p2' 'p3' 'p4' 'p5'
Dimensions without coordinates: features

In [10]:
sim.par_link_scale

<xarray.DataArray 'b' (design_scale_params: 6, features: 100)>
array([[ 1.609438,  1.609438,  1.94591 , ...,  1.791759,  1.098612,  1.609438],
       [ 0.233637,  0.625092,  0.468428, ...,  0.514288,  0.196374, -0.171865],
       [-0.264192,  0.385856, -0.150861, ..., -0.177253,  0.180284,  0.321248],
       [ 0.19759 ,  0.467183, -0.311065, ..., -0.153057,  0.245684,  0.05796 ],
       [ 0.141119, -0.141433, -0.406333, ...,  0.600302,  0.394482,  0.623357],
       [-0.129143,  0.227628,  0.252282, ..., -0.357212,  0.639319,  0.175286]])
Coordinates:
  * design_scale_params  (design_scale_params) <U2 'p0' 'p1' 'p2' 'p3' 'p4' 'p5'
Dimensions without coordinates: features

## Constraints for model

In [11]:
dmat_est_loc = sim.design_loc

In [12]:
dmat_est_scale = sim.design_scale

Build constraints based on sets of parameters that have to sum to zero. Each of these constraints is enforced by binding one of these parameters to the rest of the set. Such a constraint is encoded by assigning a 1 to each independent parameter in the set and a -1 to the dependent parameter. The constraints have to be ordered so that they can be applied iteratively from top to bottom and so that all independent parameters (1s) are defined at each stage: a dependent parameter may depend on another dependent parameter only if the other dependent parameter was defined in a constraint that lies before the current constraint.
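The top-to-bottom application described here can be sketched in plain NumPy. This only illustrates the encoding and is not batchglm's internal implementation; `apply_constraints`, `constraints_demo` and the parameter values are hypothetical.

```python
import numpy as np

def apply_constraints(constraints, theta):
    """Fill in dependent parameters row by row (top to bottom).

    Each row marks one dependent parameter with -1 and the parameters
    it is bound to with 1; the whole set has to sum to zero.
    """
    theta = np.array(theta, dtype=float)
    for row in constraints:
        dependent = np.flatnonzero(row == -1)
        independent = np.flatnonzero(row == 1)
        if dependent.size != 1:
            raise ValueError("each constraint must bind exactly one parameter")
        theta[dependent[0]] = -theta[independent].sum()
    return theta

constraints_demo = np.array([
    [0.,  0., 0., -1., 1., 0.],  # p3 is bound to p4
    [0., -1., 1.,  1., 1., 0.],  # p1 is bound to p2, p3, p4
])
# NaN marks dependent parameters that are not free to vary.
theta_free = np.array([6.0, np.nan, 0.3, np.nan, -0.2, 0.5])
theta_full = apply_constraints(constraints_demo, theta_free)
print(theta_full)
```

Row 1 uses `p3`, which is itself dependent, so it has to come after row 0: swapping the two rows would read `p3` before it has been defined.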

In [13]:
constraints_loc = np.zeros([2, dmat_est_loc.shape[1]])
# Constraint 0: Account for perfect confounding at biological replicate and treatment level
# by constraining biological replicate coefficients not to produce mean effects across conditions.
constraints_loc[0,3] = -1
constraints_loc[0,4:5] = 1
# Constraint 1: Account for the fact that the first level of biological replicates was not absorbed into the offset.
constraints_loc[1,1] = -1
constraints_loc[1,2:5] = 1

constraints_loc

array([[ 0.,  0.,  0., -1.,  1.,  0.],
       [ 0., -1.,  1.,  1.,  1.,  0.]])

In [14]:
constraints_scale = constraints_loc.copy()

In [15]:
from numpy.linalg import matrix_rank
constraints_loc_mod = constraints_loc.copy()
constraints_loc_mod[constraints_loc_mod==-1] = 1
print(np.vstack([np.unique(dmat_est_loc, axis=0), constraints_loc_mod]))
print("rank deficiency without constraints: "+ str(dmat_est_loc.shape[1] - matrix_rank(np.vstack([np.unique(dmat_est_loc, axis=0)]))))
print("rank deficiency with constraints: "+ str(dmat_est_loc.shape[1] - matrix_rank(np.vstack([np.unique(dmat_est_loc, axis=0), constraints_loc_mod]))))

[[1. 0. 0. 0. 1. 1.]
 [1. 0. 0. 1. 0. 1.]
 [1. 0. 1. 0. 0. 0.]
 [1. 1. 0. 0. 0. 0.]
 [0. 0. 0. 1. 1. 0.]
 [0. 1. 1. 1. 1. 0.]]
rank deficiency without constraints: 2
rank deficiency with constraints: 0


## Estimate the model

In [16]:
X = sim.X
design_loc = dmat_est_loc
design_scale = dmat_est_scale

# input data
input_data = glm.models.nb_glm.InputData.new(
    data=X, 
    design_loc=design_loc,
    design_scale=design_scale,
    constraints_loc=constraints_loc,
    constraints_scale=constraints_scale)

### Set up estimator:

In [17]:
estimator = glm.models.nb_glm.Estimator(input_data, quick_scale=False)
estimator.initialize()

Using closed-form MLE initialization for mean
Should train mu: False
Using closed-form MME initialization for dispersion
Should train r: True


  "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "
  "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "
  "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "
  "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "


Graph was finalized.
Running local_init_op.
Done running local_init_op.


### Train

Now start the training sequence and let the estimator automatically choose the best training strategy:

In [18]:
estimator.train_sequence('QUICK')

training strategy:
[{'convergence_criteria': 't_test',
  'learning_rate': 0.1,
  'loss_window_size': 100,
  'optim_algo': 'ADAM',
  'stop_at_loss_change': 0.05,
  'use_batching': True}]
Beginning with training sequence #1
Step: 1	loss: 884.155295
Step: 2	loss: 885.949234
Step: 3	loss: 886.515063
Step: 4	loss: 886.121214
Step: 5	loss: 884.298366
Step: 6	loss: 886.097554
Step: 7	loss: 885.781682
Step: 8	loss: 886.387190
Step: 9	loss: 884.287266
Step: 10	loss: 885.533888
Step: 11	loss: 886.013124
Step: 12	loss: 886.450422
Step: 13	loss: 884.066419
Step: 14	loss: 885.499019
Step: 15	loss: 886.634011
Step: 16	loss: 886.072353
Step: 17	loss: 884.011284
Step: 18	loss: 885.591750
Step: 19	loss: 885.806826
Step: 20	loss: 886.775648
Step: 21	loss: 884.142276
Step: 22	loss: 885.068036
Step: 23	loss: 886.356100
Step: 24	loss: 886.549458
Step: 25	loss: 883.926797
Step: 26	loss: 885.963304
Step: 27	loss: 886.032688
Step: 28	loss: 886.206786
Step: 29	loss: 884.735220
Step: 30	loss: 885.884261
Step: 3

## Obtaining the results

The fitted parameters can be retrieved by calling the corresponding parameters of `estimator`:

In [19]:
estimator.par_link_loc

<xarray.DataArray (design_loc_params: 6, features: 100)>
array([[ 6.348189,  8.164046,  8.538716, ...,  8.103653,  7.265792,  7.67668 ],
       [ 0.459377, -0.266948, -0.208716, ...,  0.07624 , -0.239818,  0.184094],
       [-0.459377,  0.266948,  0.208716, ..., -0.07624 ,  0.239818, -0.184094],
       [-0.108924, -0.032833, -0.013623, ..., -0.13636 , -0.076002, -0.288177],
       [ 0.108924,  0.032833,  0.013623, ...,  0.13636 ,  0.076002,  0.288177],
       [-0.479316,  0.834696,  0.575472, ...,  0.190039,  0.016902,  0.082734]])
Coordinates:
  * design_loc_params  (design_loc_params) <U2 'p0' 'p1' 'p2' 'p3' 'p4' 'p5'
    feature_allzero    (features) bool False False False False False False ...
  * features           (features) int64 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 ...

In [20]:
estimator.par_link_scale

<xarray.DataArray (design_scale_params: 6, features: 100)>
array([[ 1.615262,  2.038328,  2.0574  , ...,  2.063593,  1.261082,  1.732057],
       [ 0.24935 ,  0.119381,  0.294329, ...,  0.33788 , -0.069746, -0.198959],
       [-0.24935 , -0.119381, -0.294329, ..., -0.33788 ,  0.069746,  0.198959],
       [ 0.009784,  0.353081,  0.02172 , ..., -0.423438, -0.0333  , -0.310581],
       [-0.009784, -0.353081, -0.02172 , ...,  0.423438,  0.0333  ,  0.310581],
       [-0.007874,  0.004785, -0.255241, ..., -0.32047 ,  0.780985,  0.420808]])
Coordinates:
  * design_scale_params  (design_scale_params) <U2 'p0' 'p1' 'p2' 'p3' 'p4' 'p5'
    feature_allzero      (features) bool False False False False False False ...
  * features             (features) int64 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 ...

### Check that constraints were met

These parameter sets should sum to zero for each gene.

In [21]:
np.max(estimator.par_link_loc[1,:]+np.sum(estimator.par_link_loc[2:5,:], axis=0))

<xarray.DataArray ()>
array(1.110223e-16)
Coordinates:
    design_loc_params  <U2 'p1'

In [22]:
np.max(np.sum(estimator.par_link_loc[1:3,:], axis=0)+np.sum(estimator.par_link_loc[3:5,:], axis=0))

<xarray.DataArray ()>
array(5.551115e-17)
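The same check can be expressed against the constraint matrix itself, so that every row is verified separately (for instance the pair `p3`, `p4` bound by constraint 0). The following is a minimal sketch on toy values; `max_constraint_violation` and `params_demo` are hypothetical names introduced for illustration and are not part of batchglm.

```python
import numpy as np

def max_constraint_violation(constraints, params):
    """Largest absolute deviation from zero per constraint row.

    constraints: (n_constraints, n_params) array with entries -1, 0, 1.
    params: (n_params, n_features) array of fitted coefficients.
    """
    # All members of a set (dependent and independent) enter the sum with weight 1.
    membership = np.abs(constraints)
    return np.max(np.abs(membership @ params), axis=1)

constraints_demo = np.array([
    [0.,  0., 0., -1., 1., 0.],
    [0., -1., 1.,  1., 1., 0.],
])

# Toy parameters for 3 features that satisfy both constraints by construction.
rng = np.random.default_rng(0)
params_demo = rng.normal(size=(6, 3))
params_demo[3] = -params_demo[4]  # constraint 0
params_demo[1] = -params_demo[2]  # constraint 1, given that p3 + p4 = 0

violations = max_constraint_violation(constraints_demo, params_demo)
print(violations)
```

With a fitted model, one would pass the constraint matrix and the coefficients as a plain array, for example `estimator.par_link_loc.values`, assuming the estimate is an xarray `DataArray` as the output above suggests. Taking the absolute value before the maximum also catches negative deviations.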

## Comparing the results with the simulated data:

Linear model output:

In [23]:
locdiff = glm.utils.stats.rmsd(np.matmul(estimator.design_loc, estimator.par_link_loc), 
                               np.matmul(sim.design_loc, sim.par_link_loc))
print("Root mean squared deviation of location: %.2f" % locdiff)

scalediff = glm.utils.stats.rmsd(np.matmul(estimator.design_scale, estimator.par_link_scale), 
                                 np.matmul(sim.design_scale, sim.par_link_scale))
print("Root mean squared deviation of scale:    %.2f" % scalediff)

Root mean squared deviation of location: 0.02
Root mean squared deviation of scale:    0.07
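The comparison above is made on the linear predictor (design matrix times parameters) rather than on the coefficients themselves. One plausible reason is that the simulated coefficients are not constrained, so they can differ from the constrained estimates while describing exactly the same means. The sketch below illustrates this with toy numbers, assuming RMSD follows its conventional definition; `rmsd_demo` and all values are hypothetical.

```python
import numpy as np

def rmsd_demo(a, b):
    """Root mean squared deviation between two arrays of equal shape."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return np.sqrt(np.mean((a - b) ** 2))

# One row per biological replicate; columns: intercept, bio rep 1-4, condition.
design_demo = np.array([
    [1., 1., 0., 0., 0., 0.],
    [1., 0., 1., 0., 0., 0.],
    [1., 0., 0., 1., 0., 1.],
    [1., 0., 0., 0., 1., 1.],
])

theta_a = np.array([6.0, 0.5, -0.1, 0.2, 0.4, 0.3])
# Directions that the rank-deficient design cannot see (design_demo @ null = 0).
null_1 = np.array([1., -1., -1., -1., -1., 0.])  # intercept vs. all replicates
null_2 = np.array([0., 0., 0., 1., 1., -1.])     # treated replicates vs. condition
theta_b = theta_a + 0.7 * null_1 - 0.3 * null_2

rmsd_params = rmsd_demo(theta_a, theta_b)
rmsd_predictor = rmsd_demo(design_demo @ theta_a, design_demo @ theta_b)
print(rmsd_params, rmsd_predictor)
```

The coefficient vectors differ clearly, while their linear predictors agree up to floating point error, so only the latter is a meaningful basis for comparing a constrained fit with unconstrained simulation parameters.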


# Example 2: advanced

## Simulate some data

In this example, we have 4 biological replicates (animals, patients, cell culture replicates etc.) in a treatment experiment: 2 in each condition (treated, untreated). Accordingly, there is already perfect confounding at this level. We circumvent this by constraining the biological replicate coefficients not to model mean trends (constraints 0 and 1). Secondly, there are technical replicates, each of which contains cells from one biological replicate from each condition. Each biological replicate was assigned to one treated-untreated sample pair and each pair was split into two technical replicates. Again, we correct for perfect confounding by constraining the technical replicate coefficients not to model mean effects (constraints 2 and 3).
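The additional dependencies introduced by the technical replicates can be checked on a reduced design with one row per (biological replicate, technical replicate) group. This is a minimal sketch; `groups` and `design_demo` are names introduced for illustration only.

```python
import numpy as np
from numpy.linalg import matrix_rank

# (bio rep, tech rep, condition), zero-based indices.
groups = [
    (0, 0, 0), (0, 1, 0), (1, 2, 0), (1, 3, 0),  # untreated
    (2, 0, 1), (2, 1, 1), (3, 2, 1), (3, 3, 1),  # treated
]

# Columns: intercept, bio rep 1-4, tech rep 1-4, condition.
design_demo = np.zeros((len(groups), 10))
design_demo[:, 0] = 1
for i, (bio, tech, cond) in enumerate(groups):
    design_demo[i, 1 + bio] = 1
    design_demo[i, 5 + tech] = 1
    design_demo[i, 9] = cond

bio = design_demo[:, 1:5]
tech = design_demo[:, 5:9]
# Tech reps 1 and 2 cover exactly the cells of bio reps 1 and 3.
pair_a = np.array_equal(tech[:, 0] + tech[:, 1], bio[:, 0] + bio[:, 2])
# Tech reps 3 and 4 cover exactly the cells of bio reps 2 and 4.
pair_b = np.array_equal(tech[:, 2] + tech[:, 3], bio[:, 1] + bio[:, 3])
deficiency_demo = design_demo.shape[1] - matrix_rank(design_demo)

print(pair_a, pair_b, deficiency_demo)
```

Together with the two dependencies among intercept, biological replicates and condition, these two pairs account for a rank deficiency of 4, hence four constraints.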

### Define design matrices

In [24]:
ncells = 2000
dmat = np.zeros([ncells, 10])
dmat[:,0] = 1
dmat[:500,1] = 1 # bio rep 1
dmat[500:1000,2] = 1 # bio rep 2
dmat[1000:1500,3] = 1 # bio rep 3
dmat[1500:2000,4] = 1 # bio rep 4
dmat[0:250,5] = 1 # tech rep 1
dmat[1000:1250,5] = 1 # tech rep 1
dmat[250:500,6] = 1 # tech rep 2
dmat[1250:1500,6] = 1 # tech rep 2
dmat[500:750,7] = 1 # tech rep 3
dmat[1500:1750,7] = 1 # tech rep 3
dmat[750:1000,8] = 1 # tech rep 4
dmat[1750:2000,8] = 1 # tech rep 4
dmat[1000:2000,9] = 1 # condition effect
print(np.unique(dmat, axis=0))

[[1. 0. 0. 0. 1. 0. 0. 0. 1. 1.]
 [1. 0. 0. 0. 1. 0. 0. 1. 0. 1.]
 [1. 0. 0. 1. 0. 0. 1. 0. 0. 1.]
 [1. 0. 0. 1. 0. 1. 0. 0. 0. 1.]
 [1. 0. 1. 0. 0. 0. 0. 0. 1. 0.]
 [1. 0. 1. 0. 0. 0. 0. 1. 0. 0.]
 [1. 1. 0. 0. 0. 0. 1. 0. 0. 0.]
 [1. 1. 0. 0. 0. 1. 0. 0. 0. 0.]]


In [25]:
sim = glm.models.nb_glm.Simulator(num_features=100)

In [26]:
sim.parse_dmat_loc(dmat = dmat)
sim.parse_dmat_scale(dmat = dmat)
sim.generate_params()
sim.generate_data()

### Simulated model data:

In [27]:
sim.X

<xarray.DataArray 'X' (observations: 2000, features: 100)>
array([[ 6791,   476,   645, ...,  6876,  7486, 14190],
       [12923,  2740,  1093, ...,  4252,  6449, 12855],
       [ 9694,  3576,  2441, ...,  4385,  7285, 15416],
       ...,
       [ 9782,  9281,   469, ..., 21728,  9560, 12528],
       [ 7251,  9682,   676, ..., 21535, 11451,  9472],
       [ 8910,  6861,  1934, ..., 30346,  5248, 20390]])
Dimensions without coordinates: observations, features

## Constraints for model

In [28]:
dmat_est_loc = sim.design_loc

In [29]:
dmat_est_scale = sim.design_scale

In [30]:
np.unique(dmat_est_loc, axis=0)

array([[1., 0., 0., 0., 1., 0., 0., 0., 1., 1.],
       [1., 0., 0., 0., 1., 0., 0., 1., 0., 1.],
       [1., 0., 0., 1., 0., 0., 1., 0., 0., 1.],
       [1., 0., 0., 1., 0., 1., 0., 0., 0., 1.],
       [1., 0., 1., 0., 0., 0., 0., 0., 1., 0.],
       [1., 0., 1., 0., 0., 0., 0., 1., 0., 0.],
       [1., 1., 0., 0., 0., 0., 1., 0., 0., 0.],
       [1., 1., 0., 0., 0., 1., 0., 0., 0., 0.]])

In [31]:
constraints_loc = np.zeros([4, dmat_est_loc.shape[1]])
# Constraint 0: Account for perfect confounding at biological replicate and treatment level
# by constraining biological replicate coefficients not to produce mean effects across conditions.
constraints_loc[0,3] = -1
constraints_loc[0,4:5] = 1
# Constraint 1: Account for the fact that the first level of biological replicates was not absorbed into the offset.
constraints_loc[1,1] = -1
constraints_loc[1,2:5] = 1
# Constraint 2: Account for perfect confounding at biological replicate and technical replicate level
# by constraining technical replicate coefficients not to produce mean effects across biological replicates.
constraints_loc[2,7] = -1
constraints_loc[2,8:9] = 1
# Constraint 3: Account for the fact that the first level of technical replicates was not absorbed into the offset.
constraints_loc[3,5] = -1
constraints_loc[3,6:9] = 1

constraints_loc

array([[ 0.,  0.,  0., -1.,  1.,  0.,  0.,  0.,  0.,  0.],
       [ 0., -1.,  1.,  1.,  1.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  0., -1.,  1.,  0.],
       [ 0.,  0.,  0.,  0.,  0., -1.,  1.,  1.,  1.,  0.]])
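As the number of constraints grows, it can help to build the matrix from a list of (dependent, independent) index pairs and to validate the ordering rule at the same time. The helper below is a hypothetical sketch, not a batchglm function; `build_constraints` and `specs` are names introduced here.

```python
import numpy as np

def build_constraints(n_params, specs):
    """Build a constraint matrix from (dependent, independents) index pairs.

    Raises ValueError if a row uses a dependent parameter that has not
    been defined by an earlier row.
    """
    all_dependent = {dep for dep, _ in specs}
    defined = set()
    constraints = np.zeros((len(specs), n_params))
    for row, (dep, indep) in enumerate(specs):
        for idx in indep:
            if idx in all_dependent and idx not in defined:
                raise ValueError(
                    f"row {row}: parameter {idx} is dependent but not yet defined"
                )
        constraints[row, list(indep)] = 1
        constraints[row, dep] = -1
        defined.add(dep)
    return constraints

# Same four constraints as in the cell above.
specs = [
    (3, [4]),        # bio rep 3 bound to bio rep 4
    (1, [2, 3, 4]),  # bio rep 1 bound to the remaining bio reps
    (7, [8]),        # tech rep 3 bound to tech rep 4
    (5, [6, 7, 8]),  # tech rep 1 bound to the remaining tech reps
]
constraints_demo = build_constraints(10, specs)
print(constraints_demo)
```

Passing the same pairs in reverse order raises a `ValueError`, because the first row would then refer to parameter 7 before it has been bound.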

In [32]:
constraints_scale = constraints_loc.copy()

In [33]:
from numpy.linalg import matrix_rank
constraints_loc_mod = constraints_loc.copy()
constraints_loc_mod[constraints_loc_mod==-1] = 1
print(np.vstack([np.unique(dmat_est_loc, axis=0), constraints_loc_mod]))
print("rank deficiency without constraints: "+ str(dmat_est_loc.shape[1] - matrix_rank(np.vstack([np.unique(dmat_est_loc, axis=0)]))))
print("rank deficiency with constraints: "+ str(dmat_est_loc.shape[1] - matrix_rank(np.vstack([np.unique(dmat_est_loc, axis=0), constraints_loc_mod]))))

[[1. 0. 0. 0. 1. 0. 0. 0. 1. 1.]
 [1. 0. 0. 0. 1. 0. 0. 1. 0. 1.]
 [1. 0. 0. 1. 0. 0. 1. 0. 0. 1.]
 [1. 0. 0. 1. 0. 1. 0. 0. 0. 1.]
 [1. 0. 1. 0. 0. 0. 0. 0. 1. 0.]
 [1. 0. 1. 0. 0. 0. 0. 1. 0. 0.]
 [1. 1. 0. 0. 0. 0. 1. 0. 0. 0.]
 [1. 1. 0. 0. 0. 1. 0. 0. 0. 0.]
 [0. 0. 0. 1. 1. 0. 0. 0. 0. 0.]
 [0. 1. 1. 1. 1. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 1. 1. 0.]
 [0. 0. 0. 0. 0. 1. 1. 1. 1. 0.]]
rank deficiency without constraints: 4
rank deficiency with constraints: 0


## Estimate the model

In [34]:
X = sim.X
design_loc = dmat_est_loc
design_scale = dmat_est_scale

# input data
input_data = glm.models.nb_glm.InputData.new(
    data=X, 
    design_loc=design_loc,
    design_scale=design_scale,
    constraints_loc=constraints_loc,
    constraints_scale=constraints_scale)

### Set up estimator:

Note that there is no closed form estimator for the mean model here due to the confounding. The model is initialised with least squares but the mean model is also trained.

In [35]:
estimator = glm.models.nb_glm.Estimator(input_data, quick_scale=False)
estimator.initialize()

Using closed-form MLE initialization for mean
Should train mu: True
Using closed-form MME initialization for dispersion
Should train r: True


  "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "
  "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "
  "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "
  "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "
  "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "
  "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "
  "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "
  "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "
  "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "
  "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "
  "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "
  "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "


Graph was finalized.
Running local_init_op.
Done running local_init_op.


### Train

Now start the training sequence and let the estimator automatically choose the best training strategy:

In [36]:
estimator.train_sequence('QUICK')

training strategy:
[{'convergence_criteria': 't_test',
  'learning_rate': 0.1,
  'loss_window_size': 100,
  'optim_algo': 'ADAM',
  'stop_at_loss_change': 0.05,
  'use_batching': True}]
Beginning with training sequence #1
Step: 1	loss: 897.701602
Step: 2	loss: 923.520278
Step: 3	loss: 908.112978
Step: 4	loss: 906.723808
Step: 5	loss: 905.482292
Step: 6	loss: 912.208433
Step: 7	loss: 912.437294
Step: 8	loss: 910.213020
Step: 9	loss: 903.013620
Step: 10	loss: 905.645439
Step: 11	loss: 908.425297
Step: 12	loss: 908.020575
Step: 13	loss: 901.502998
Step: 14	loss: 904.960904
Step: 15	loss: 907.300517
Step: 16	loss: 907.049878
Step: 17	loss: 900.179336
Step: 18	loss: 904.038785
Step: 19	loss: 905.077502
Step: 20	loss: 904.887815
Step: 21	loss: 899.667640
Step: 22	loss: 902.370722
Step: 23	loss: 904.283863
Step: 24	loss: 904.310988
Step: 25	loss: 898.702699
Step: 26	loss: 902.409750
Step: 27	loss: 904.419129
Step: 28	loss: 903.415020
Step: 29	loss: 898.090837
Step: 30	loss: 901.887437
Step: 3

Step: 300	loss: 903.479272
pval: 0.600743
Training sequence #1 complete


## Obtaining the results

### Check that constraints were met

These parameter sets should sum to zero for each gene.

In [37]:
np.max(estimator.par_link_loc[1,:]+np.sum(estimator.par_link_loc[2:5,:], axis=0))

<xarray.DataArray ()>
array(1.110223e-16)
Coordinates:
    design_loc_params  <U2 'p1'

In [38]:
np.max(np.sum(estimator.par_link_loc[1:3,:], axis=0)+np.sum(estimator.par_link_loc[3:5,:], axis=0))

<xarray.DataArray ()>
array(1.110223e-16)

## Comparing the results with the simulated data:

Linear model output:

In [39]:
locdiff = glm.utils.stats.rmsd(np.matmul(estimator.design_loc, estimator.par_link_loc), 
                               np.matmul(sim.design_loc, sim.par_link_loc))
print("Root mean squared deviation of location: %.2f" % locdiff)

scalediff = glm.utils.stats.rmsd(np.matmul(estimator.design_scale, estimator.par_link_scale), 
                                 np.matmul(sim.design_scale, sim.par_link_scale))
print("Root mean squared deviation of scale:    %.2f" % scalediff)

Root mean squared deviation of location: 0.04
Root mean squared deviation of scale:    0.10


# Example 3: advanced

## Simulate some data

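In this example, we have the same scenario as in example 2 but one technical replicate is missing. We have to drop the corresponding constraint and remove the two parameters belonging to this pair of technical replicates: the missing replicate has no column in the design matrix, and the coefficient of its remaining partner is removed before estimation.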

### Define design matrices

In [40]:
ncells = 1500
dmat = np.zeros([ncells, 9])
dmat[:,0] = 1
dmat[:500,1] = 1 # bio rep 1
dmat[500:750,2] = 1 # bio rep 2: only 250 cells because one tech rep is missing
dmat[750:1250,3] = 1 # bio rep 3
dmat[1250:1500,4] = 1 # bio rep 4: only 250 cells because one tech rep is missing

dmat[0:250,5] = 1 # tech rep 1 in bio rep 1
dmat[750:1000,5] = 1 # tech rep 1 in bio rep 3
dmat[250:500,6] = 1 # tech rep 2 in bio rep 1
dmat[1000:1250,6] = 1 # tech rep 2 in bio rep 3
# tech rep 3 is missing in bio rep 2,4
dmat[500:750,7] = 1 # tech rep 4 in bio rep 2
dmat[1250:1500,7] = 1 # tech rep 4 in bio rep 4

dmat[1000:2000,8] = 1 # condition effect
print(np.unique(dmat, axis=0))

[[1. 0. 0. 0. 1. 0. 0. 1. 1.]
 [1. 0. 0. 1. 0. 0. 1. 0. 1.]
 [1. 0. 0. 1. 0. 1. 0. 0. 0.]
 [1. 0. 1. 0. 0. 0. 0. 1. 0.]
 [1. 1. 0. 0. 0. 0. 1. 0. 0.]
 [1. 1. 0. 0. 0. 1. 0. 0. 0.]]


In [41]:
sim = glm.models.nb_glm.Simulator(num_features=100)

In [42]:
sim.parse_dmat_loc(dmat = dmat)
sim.parse_dmat_scale(dmat = dmat)
sim.generate_params()
sim.generate_data()

## Constraints for model

Remove the coefficient for the unpaired technical replicate 4 from the location and scale models:

In [43]:
dmat_est_loc = sim.design_loc[:,np.array([0,1,2,3,4,5,6,8])]

In [44]:
dmat_est_scale = sim.design_scale[:,np.array([0,1,2,3,4,5,6,8])]

In [45]:
np.unique(dmat_est_loc, axis=0)

array([[1., 0., 0., 0., 1., 0., 0., 1.],
       [1., 0., 0., 1., 0., 0., 1., 1.],
       [1., 0., 0., 1., 0., 1., 0., 0.],
       [1., 0., 1., 0., 0., 0., 0., 0.],
       [1., 1., 0., 0., 0., 0., 1., 0.],
       [1., 1., 0., 0., 0., 1., 0., 0.]])

In [46]:
constraints_loc = np.zeros([3, dmat_est_loc.shape[1]])
# Constraint 0: Account for perfect confounding at biological replicate and treatment level
# by constraining biological replicate coefficients not to produce mean effects across conditions.
constraints_loc[0,3] = -1
constraints_loc[0,4:5] = 1
# Constraint 1: Account for the fact that the first level of biological replicates was not absorbed into the offset.
constraints_loc[1,1] = -1
constraints_loc[1,2:5] = 1
# Constraint 2: Account for the fact that the first level of technical replicates was not absorbed into the offset.
constraints_loc[2,5] = -1
constraints_loc[2,6:7] = 1

constraints_loc

array([[ 0.,  0.,  0., -1.,  1.,  0.,  0.,  0.],
       [ 0., -1.,  1.,  1.,  1.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0., -1.,  1.,  0.]])

In [47]:
constraints_scale = constraints_loc.copy()

In [48]:
from numpy.linalg import matrix_rank
constraints_loc_mod = constraints_loc.copy()
constraints_loc_mod[constraints_loc_mod==-1] = 1
print(np.vstack([np.unique(dmat_est_loc, axis=0), constraints_loc_mod]))
print("rank deficiency without constraints: "+ str(dmat_est_loc.shape[1] - matrix_rank(np.vstack([np.unique(dmat_est_loc, axis=0)]))))
print("rank deficiency with constraints: "+ str(dmat_est_loc.shape[1] - matrix_rank(np.vstack([np.unique(dmat_est_loc, axis=0), constraints_loc_mod]))))

[[1. 0. 0. 0. 1. 0. 0. 1.]
 [1. 0. 0. 1. 0. 0. 1. 1.]
 [1. 0. 0. 1. 0. 1. 0. 0.]
 [1. 0. 1. 0. 0. 0. 0. 0.]
 [1. 1. 0. 0. 0. 0. 1. 0.]
 [1. 1. 0. 0. 0. 1. 0. 0.]
 [0. 0. 0. 1. 1. 0. 0. 0.]
 [0. 1. 1. 1. 1. 0. 0. 0.]
 [0. 0. 0. 0. 0. 1. 1. 0.]]
rank deficiency without constraints: 2
rank deficiency with constraints: 0


## Estimate the model

In [49]:
X = sim.X
design_loc = dmat_est_loc
design_scale = dmat_est_scale

# input data
input_data = glm.models.nb_glm.InputData.new(
    data=X, 
    design_loc=design_loc,
    design_scale=design_scale,
    constraints_loc=constraints_loc,
    constraints_scale=constraints_scale)

### Set up estimator:

Note that there is no closed form estimator for the mean model here due to the confounding. The model is initialised with least squares but the mean model is also trained.

In [50]:
estimator = glm.models.nb_glm.Estimator(input_data, quick_scale=False)
estimator.initialize()

Using closed-form MLE initialization for mean
Should train mu: True
Using closed-form MME initialization for dispersion
Should train r: True


  "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "
  "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "
  "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "
  "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "
  "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "
  "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "
  "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "
  "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "


Graph was finalized.
Running local_init_op.
Done running local_init_op.


### Train

Now start the training sequence and let the estimator automatically choose the best training strategy:

In [51]:
estimator.train_sequence('QUICK')

training strategy:
[{'convergence_criteria': 't_test',
  'learning_rate': 0.1,
  'loss_window_size': 100,
  'optim_algo': 'ADAM',
  'stop_at_loss_change': 0.05,
  'use_batching': True}]
Beginning with training sequence #1
Step: 1	loss: 907.310813
Step: 2	loss: 917.416733
Step: 3	loss: 909.281433
Step: 4	loss: 908.050036
Step: 5	loss: 910.225890
Step: 6	loss: 909.918557
Step: 7	loss: 907.845170
Step: 8	loss: 907.651723
Step: 9	loss: 907.515244
Step: 10	loss: 905.854563
Step: 11	loss: 907.409702
Step: 12	loss: 906.256109
Step: 13	loss: 905.232032
Step: 14	loss: 905.509946
Step: 15	loss: 906.438927
Step: 16	loss: 904.972781
Step: 17	loss: 904.929343
Step: 18	loss: 905.555354
Step: 19	loss: 904.464431
Step: 20	loss: 904.745232
Step: 21	loss: 904.869783
Step: 22	loss: 903.304794
Step: 23	loss: 904.576924
Step: 24	loss: 905.193506
Step: 25	loss: 903.376691
Step: 26	loss: 904.364825
Step: 27	loss: 904.811288
Step: 28	loss: 903.099013
Step: 29	loss: 903.477355
Step: 30	loss: 905.131594
Step: 3

Step: 300	loss: 904.126106
pval: 0.676358
Training sequence #1 complete


## Obtaining the results

### Check that constraints were met

These parameter sets should sum to zero for each gene.

In [52]:
np.max(estimator.par_link_loc[1,:]+np.sum(estimator.par_link_loc[2:5,:], axis=0))

<xarray.DataArray ()>
array(2.220446e-16)
Coordinates:
    design_loc_params  <U2 'p1'

In [53]:
np.max(np.sum(estimator.par_link_loc[1:3,:], axis=0)+np.sum(estimator.par_link_loc[3:5,:], axis=0))

<xarray.DataArray ()>
array(1.110223e-16)

## Comparing the results with the simulated data:

Linear model output:

In [54]:
locdiff = glm.utils.stats.rmsd(np.matmul(estimator.design_loc, estimator.par_link_loc), 
                               np.matmul(sim.design_loc, sim.par_link_loc))
print("Root mean squared deviation of location: %.2f" % locdiff)

scalediff = glm.utils.stats.rmsd(np.matmul(estimator.design_scale, estimator.par_link_scale), 
                                 np.matmul(sim.design_scale, sim.par_link_scale))
print("Root mean squared deviation of scale:    %.2f" % scalediff)

Root mean squared deviation of location: 0.11
Root mean squared deviation of scale:    0.19
