# Supplementary Information: Holmes *et al.* 2017

# 4. Varying base expression level

In [2]:
%pylab inline

import numpy as np
import pandas as pd
import pystan
import scipy
import seaborn as sns

import warnings
warnings.filterwarnings('ignore')

sns.set_context('notebook')

Populating the interactive namespace from numpy and matplotlib


## Building the model

In the unpooled model, we treated each probe (across both conditions) as if its measurements before and after the experiment were drawn from a distribution specific to that probe, where the output value is (still) some linear function of the input value, mediated by the experiment. We still treated the intercept $\alpha$ as though it were a single pooled value for all probes. Here, we change that assumption and consider each probe to have its own baseline intensity.

We construct the following model of the experiment:

$$y_i = \alpha_{j[i]} + \beta_{j[i]} x_i + \epsilon_i$$

* $y_i$: measured log(intensity) output on the array for probe measurement $i$ (specific to each replicate)
* $x_i$: measured log(intensity) input on the array for probe measurement $i$ (specific to each replicate)
* $\alpha_{j[i]}$: expected log(intensity) level for the output array if the input log(intensity) is zero, for a particular probe ID $j[i]$
* $\beta_{j[i]}$: the effect of the experiment (i.e. the difference between input and output log(intensity)) on the measured log(intensity) for probe ID $j[i]$
* $\epsilon_i$: error in the model prediction for probe measurement $i$
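To make the indexing $j[i]$ concrete, the following is a minimal sketch that simulates data from a model of this form. All numerical values (the number of probes, the intercepts, slopes and noise scale) are made up for illustration and are not taken from the experiment.

```python
import numpy as np

rng = np.random.default_rng(0)

J = 4       # hypothetical number of unique probe IDs
reps = 6    # six measurements per probe (three control, three treatment)

# j[i]: zero-based probe index for each of the N measurements
probe = np.repeat(np.arange(J), reps)
N = probe.size

# illustrative (made-up) per-probe parameters and noise scale
alpha = rng.normal(0.0, 1.0, size=J)   # probe-specific intercepts
beta = rng.normal(1.0, 0.1, size=J)    # probe-specific slopes
sigma = 0.2

# simulated input, error, and output log(intensity) values
x = rng.normal(5.0, 1.0, size=N)
epsilon = rng.normal(0.0, sigma, size=N)
y = alpha[probe] + beta[probe] * x + epsilon
```

Indexing `alpha` and `beta` with `probe` gives every measurement the intercept and slope of its own probe ID, which is the role of $\alpha_{j[i]}$ and $\beta_{j[i]}$ in the equation.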

### Stan model construction and fit

We need to define `data`, `parameters` and our `model` for `Stan`.

In the `data` block, we have:

* `N`: `int`, the number of data points
* `J`: `int`, the number of unique probe IDs (`J` < `N`)
* `probe`: `int[N]`, an index list of probe identities - one index representing six probe measurements (i.e. three control, three treatment) - there are `J` probes
* `x`: `vector[N]`, the input log(intensity) values
* `y`: `vector[N]`, the output log(intensity) values

In the `parameters` block, we have:

* `a`: `vector[J]`, baseline log(intensity) specific to a probe ID, i.e. the expected output log(intensity) when the input log(intensity) is zero
* `b`: `vector[J]`, effect on log(intensity) of passing through the experiment, specific to a probe ID
* `sigma`: `real<lower=0>`, the error in the prediction

We also define a `transformed parameters` block:

* `y_hat[i] = a[probe[i]] + b[probe[i]] * x[i]`: the linear relationship describing $\hat{y}$, our estimate of experimental output intensity, which is subject to error with standard deviation `sigma`.

We define the model as $y \sim N(\hat{y}, \sigma^2)$.
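As an illustration of what the transformed parameter and the sampling statement compute, here is a minimal NumPy sketch. The helper names `y_hat` and `normal_loglik` and the toy values are assumptions made for this example; Stan carries out the equivalent calculation internally.

```python
import numpy as np

def y_hat(a, b, probe, x):
    """Linear predictor a[probe[i]] + b[probe[i]] * x[i].

    `probe` holds 1-based indices, as in the Stan data block.
    """
    idx = np.asarray(probe) - 1   # convert to zero-based indices for NumPy
    return np.asarray(a)[idx] + np.asarray(b)[idx] * np.asarray(x)

def normal_loglik(y, mu, sigma):
    """Sum of log N(y_i | mu_i, sigma^2) over all measurements."""
    resid = np.asarray(y) - mu
    return float(np.sum(-0.5 * np.log(2 * np.pi * sigma ** 2)
                        - 0.5 * (resid / sigma) ** 2))

# toy example: two probes, two measurements each (made-up values)
a = [0.0, 1.0]
b = [1.0, 2.0]
probe = [1, 1, 2, 2]
x = [1.0, 2.0, 1.0, 2.0]

mu = y_hat(a, b, probe, x)
loglik = normal_loglik(mu, mu, sigma=1.0)   # log-likelihood when y equals the prediction
```

Note that the second argument of Stan's `normal` is the standard deviation `sigma`, not the variance $\sigma^2$.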

In [3]:
# load clean, normalised data
data = pd.read_csv("output/normalised_array_data.tab", sep="\t")

# create indices and values for probes
probe_ids = data['probe'].unique()
nprobes = len(probe_ids)
probe_lookup = dict(zip(probe_ids, range(nprobes)))

# add data column with probe index from probe_lookup
data['probe_index'] = data['probe'].map(probe_lookup).values

In [4]:
# define unpooled stan model
unpooled_model = """
data {
  int<lower=0> N;
  int<lower=0> J;
  int<lower=1, upper=J> probe[N];
  vector[N] x;
  vector[N] y;
}
parameters {
  vector[J] a;
  vector[J] b;
  real<lower=0> sigma;
}
transformed parameters{
  vector[N] y_hat;

  for (i in 1:N)
    y_hat[i] = a[probe[i]] + b[probe[i]] * x[i];
}
model {
  y ~ normal(y_hat, sigma);
}
"""

In [5]:
# relate python variables to stan variables
unpooled_data_dict = {'N': len(data),
                      'J': nprobes,
                      'probe': data['probe_index'] + 1,
                      'x': data['input'],
                      'y': data['output']}
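Before fitting, it can be useful to confirm that a data dictionary satisfies the constraints declared in the Stan `data` block, in particular that the probe indices are 1-based and no larger than `J`. The sketch below uses a hypothetical helper, `check_stan_data`, and a toy dictionary. Neither is part of PyStan or of the original analysis.

```python
def check_stan_data(d):
    """Raise ValueError if `d` breaks the constraints of the Stan data block."""
    n, j = d['N'], d['J']
    # probe, x and y must each hold exactly N values
    for name in ('probe', 'x', 'y'):
        if len(d[name]) != n:
            raise ValueError("%s has length %d, expected %d"
                             % (name, len(d[name]), n))
    # Stan declares int<lower=1, upper=J> probe[N]
    if min(d['probe']) < 1 or max(d['probe']) > j:
        raise ValueError("probe indices must lie in 1..J")
    return True

# toy dictionary with made-up values: two probes, two measurements each
toy = {'N': 4,
       'J': 2,
       'probe': [1, 1, 2, 2],
       'x': [5.1, 4.9, 6.0, 6.2],
       'y': [5.3, 5.0, 6.4, 6.1]}

check_stan_data(toy)
```

Because the helper relies only on `len`, `min` and `max`, it should also accept `pandas` Series such as those stored in `unpooled_data_dict`.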

In [None]:
# run stan fit
unpooled_fit = pystan.stan(model_code=unpooled_model,
                           data=unpooled_data_dict,
                           iter=1000, chains=2)