# Introduction

We have two learning tasks that take the same _kind_ of input data but whose samples are not exactly aligned. The first task has 100 iid samples; the second has 80. In our data matrix, the second task's 80 samples do not necessarily line up with the first task's 100. One other assumption baked into this model is that each task has its own set of weights, but those weights are drawn from a shared parental prior. Parameters are therefore shared amongst the learning tasks, though not in our usual "classical" sense.

By zero-padding the data, we should be able to generalize this to multi-task neural network learning with non-overlapping samples. [Thomas Wiecki](https://twiecki.io/blog/2018/08/13/hierarchical_bayesian_neural_network/) has a great blog post on hierarchical Bayesian neural networks, though he didn't deal with the "number of samples" issue, which I have tried to add here.
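To make the zero-padding idea concrete, here is a minimal NumPy sketch for the two-task case described above. The names `X_a`, `X_b`, and `pad_to` are hypothetical and only serve the illustration; the simulation later in this post uses a mask instead of explicit padding.

```python
import numpy as np

rng = np.random.default_rng(0)
n_features = 4

# Two hypothetical tasks with different numbers of samples.
X_a = rng.normal(size=(100, n_features))
X_b = rng.normal(size=(80, n_features))


def pad_to(X, n_rows):
    """Append all-zero rows so that X ends up with n_rows rows."""
    n_missing = n_rows - X.shape[0]
    return np.pad(X, ((0, n_missing), (0, 0)), mode="constant")


# Pad both tasks to the larger sample count, then stack along a task axis.
n_max = max(X_a.shape[0], X_b.shape[0])
X_stacked = np.stack([pad_to(X_a, n_max), pad_to(X_b, n_max)])
print(X_stacked.shape)
```

The result is a single `(n_tasks, n_samps, n_features)` array, which is the layout used for `data` in the rest of this post.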

In [91]:
import pymc3 as pm
import theano.tensor as tt
import numpy as np

In [92]:
# We start with parental weights of length 4, one for each feature.
parental_weights = np.random.normal(loc=10, scale=3, size=(4,))
parental_weights

array([12.32545447, 13.93699523,  7.15949499,  8.26135873])

In [93]:
# We'll now generate new weights based on location=parental_weights
# and scale=3, one row for each learning task (here there are 15 in total).

child_weights = np.random.normal(loc=parental_weights, scale=3, size=(15, 4))
child_weights

array([[14.68326928, 12.92295184,  3.52881896,  9.6505052 ],
       [ 9.63723985, 11.76671028,  6.73455871,  3.97607313],
       [11.04095371, 15.44631967,  9.76702665,  7.38798696],
       [14.12818348,  9.09436489,  9.62631285,  6.73982072],
       [14.64622042, 18.717324  ,  4.64819761,  8.60506451],
       [13.45981014, 14.04655425,  6.71800397,  5.48751886],
       [11.58397585, 16.15434869,  7.45054838,  4.4286952 ],
       [17.26905786, 14.82248315,  9.5490677 ,  4.77344471],
       [11.65916625, 14.39400437, 10.08578205,  5.88794316],
       [13.76733116, 15.73665656,  6.56426226, 12.54160592],
       [10.11841595,  8.9355318 ,  5.83729823,  4.20686629],
       [12.59941373, 10.74817324,  5.96226451,  4.98209562],
       [14.10718385, 13.27177338,  3.41672407, 10.64570369],
       [15.46324969, 11.12735008,  8.32406899, 11.81155349],
       [11.91743787, 15.99971291,  8.72581055,  7.55214293]])

These are the true weights of the system.

We are now going to attempt to learn them in a Bayesian fashion.

In [118]:
n_samps = 100
n_tasks = 15
n_weights = 4
data = np.random.normal(loc=3, scale=4, size=(n_tasks, n_samps, n_weights))

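# We are going to apply a mask that nulls out about 70% of the samples (whole rows)
# in the data matrix and replaces them with zeros (p=0.3 is the probability of keeping a row).
# Note that the same mask is broadcast across all tasks.
null_mask = np.repeat(np.random.binomial(n=1, p=0.3, size=(n_samps, 1)), n_weights, axis=1)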
data = data * null_mask

In [98]:
data.shape, child_weights.shape

((15, 100, 4), (15, 4))

Let's now generate the $y$s. As long as they are zero wherever the inputs are also zero, we should be in an OK regime.

Before any noise is added, this holds by construction: a zeroed row of $X$ carries no information to propagate forward, so its dot product with the weights is exactly zero. Note that the Gaussian noise added in the next cell makes those entries small but no longer exactly zero.
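As a quick check of that claim, the following sketch uses smaller, made-up shapes and verifies that every zeroed sample produces an exactly zero noiseless output. The variable names here (`X`, `W`, `keep`, `dropped`) are my own and are not used elsewhere in the post.

```python
import numpy as np

rng = np.random.default_rng(1)
n_tasks, n_samps, n_weights = 3, 10, 4

X = rng.normal(loc=3, scale=4, size=(n_tasks, n_samps, n_weights))
W = rng.normal(loc=10, scale=3, size=(n_tasks, n_weights))

# Zero out whole samples (rows); the (n_samps, 1) mask broadcasts over
# both the task axis and the feature axis.
keep = rng.binomial(n=1, p=0.3, size=(n_samps, 1))
X_masked = X * keep

# Noiseless outputs: one dot product per (task, sample) pair.
y_clean = np.einsum("ijk,ik->ij", X_masked, W)

# Every dropped sample has an exactly zero output in every task.
dropped = keep[:, 0] == 0
print(np.all(y_clean[:, dropped] == 0))
```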

In [134]:
y = np.einsum("ijk, ik -> ij", data, child_weights)
y = y + np.random.normal(loc=0, scale=3, size=y.shape)

In [135]:
y.shape

(15, 100)
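For reference, the simulation so far can be summarized as the following generative process. The symbols are my own notation, not something defined in the code: $t$ indexes tasks, $i$ indexes samples, $k$ indexes features, $m_i$ is the keep/drop mask, and $z$ is the unmasked data.

$$
\begin{aligned}
w^{\text{parent}}_k &\sim \mathcal{N}(10,\ 3^2), & k &= 1, \dots, 4 \\
w^{(t)}_k &\sim \mathcal{N}\big(w^{\text{parent}}_k,\ 3^2\big), & t &= 1, \dots, 15 \\
z^{(t)}_{ik} &\sim \mathcal{N}(3,\ 4^2), \quad m_i \sim \text{Bernoulli}(0.3), & i &= 1, \dots, 100 \\
x^{(t)}_{ik} &= m_i \, z^{(t)}_{ik} \\
y^{(t)}_i &= \sum_{k} x^{(t)}_{ik} \, w^{(t)}_k + \varepsilon^{(t)}_i, & \varepsilon^{(t)}_i &\sim \mathcal{N}(0,\ 3^2)
\end{aligned}
$$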

We are now going to write a hierarchical linear regression model that handles this particular case of an imbalanced number of samples.

If we are able to recover the original weights, then zero-padding could be a very powerful technique for dealing with multiple learning tasks that have unequal numbers of samples and non-overlapping sample indices.

In [136]:
data.shape

(15, 100, 4)

In [137]:
child_weights.shape

(15, 4)

In [138]:
tt.batched_dot(data, child_weights)

BatchedDot.0
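The symbolic output above does not show what is being computed. Assuming `tt.batched_dot(a, b)` evaluates `dot(a[i], b[i])` for each index `i` along the first axis, which is how Theano describes it, the same contraction can be written in plain NumPy. The sketch below uses random stand-ins for `data` and `child_weights` and checks three equivalent formulations against each other.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(15, 100, 4))  # stand-in for `data`
W = rng.normal(size=(15, 4))       # stand-in for `child_weights`

# 1. The einsum used earlier to generate y.
via_einsum = np.einsum("ijk,ik->ij", X, W)

# 2. Batched matrix-vector product via matmul.
via_matmul = np.matmul(X, W[:, :, None])[:, :, 0]

# 3. An explicit loop over tasks, one dot product per task.
via_loop = np.stack([X[t] @ W[t] for t in range(X.shape[0])])

print(via_einsum.shape)
print(np.allclose(via_einsum, via_matmul), np.allclose(via_einsum, via_loop))
```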

In [139]:
data.shape

(15, 100, 4)

In [140]:
y

array([[-5.40511021e-01,  7.96100110e+01,  3.80960069e+00, ...,
        -4.50295161e+00, -2.13821128e+00,  4.78772003e+00],
       [-2.15359977e+00,  3.27940463e+01,  1.63238797e+00, ...,
         9.45026428e+00,  3.90900484e-01, -7.86344845e-01],
       [ 5.83578198e-01,  1.82021934e+02,  1.44075232e+00, ...,
        -3.43508082e-01,  1.01018699e+00,  4.99343430e+00],
       ...,
       [-3.93427280e+00,  6.00263849e+00, -1.91082659e+00, ...,
        -2.11004124e-01,  1.64443814e+00, -1.17448507e+00],
       [-3.01122125e+00,  1.95075112e+02,  9.13296097e-01, ...,
        -3.94708935e+00,  2.45555610e+00,  4.79875269e+00],
       [ 5.98406463e+00,  2.75020592e+02, -1.83502781e+00, ...,
        -9.18397459e-01, -3.44907888e-01,  2.08267055e-01]])

In [141]:
with pm.Model() as hierarchical_linear_model:
    w_parent = pm.Normal("w_parent", mu=0, sd=1, shape=(n_weights,))

    # Broadcasting w_parent (shape (n_weights,)) against shape (n_tasks, n_weights)
    # should give each task its own row of child weights centered on w_parent.
    w_child = pm.Normal("w_child", mu=w_parent, sd=1, shape=(n_tasks, n_weights))

    sd = pm.HalfCauchy("sd", beta=10)

    # Equivalent contraction in einsum notation, for reference: 'ijk, ik -> ij'
    # mu = pm.Deterministic("mu", np.einsum('ijk, ik -> ij', data, w_child))
    mu = pm.Deterministic("mu", tt.batched_dot(data, w_child))
    like = pm.Normal("like", mu=mu, sd=sd, observed=y)

In [142]:
with hierarchical_linear_model:
    # trace = pm.sample(2000, cores=1)
    approx = pm.fit(100000)
    trace = approx.sample(2000)

Average Loss = 4,262.1: 100%|██████████| 100000/100000 [00:19<00:00, 5258.22it/s]
Finished [100%]: Average Loss = 4,262


In [143]:
trace["w_child"]

array([[[14.547934 , 12.748727 ,  3.6341069,  9.855609 ],
        [ 9.364235 , 11.742509 ,  6.8934946,  4.23233  ],
        [11.151909 , 15.157989 ,  9.767109 ,  7.5370646],
        ...,
        [14.038198 , 13.069905 ,  3.7092342, 10.547483 ],
        [15.044951 , 11.061436 ,  8.386472 , 11.63056  ],
        [11.828579 , 16.026001 ,  8.609211 ,  7.935846 ]],

       [[14.761777 , 12.6763315,  3.7792463,  9.946475 ],
        [ 9.437651 , 11.69514  ,  6.970719 ,  3.9804826],
        [11.094664 , 15.1843815,  9.465349 ,  7.516925 ],
        ...,
        [13.965954 , 13.255589 ,  3.6395712, 10.4506445],
        [15.3331995, 11.114253 ,  8.444947 , 11.575428 ],
        [11.697102 , 15.980113 ,  8.447931 ,  7.613655 ]],

       [[14.70629  , 12.796245 ,  3.853656 ,  9.72828  ],
        [ 9.749298 , 11.824504 ,  6.8428125,  4.049762 ],
        [11.287544 , 15.608483 ,  9.82605  ,  7.453506 ],
        ...,
        [13.97867  , 13.163752 ,  3.6965578, 10.434882 ],
        [15.110666 , 11.29303

In [144]:
trace["w_parent"].mean(axis=0)

array([12.220205 , 12.63951  ,  6.7088165,  6.823576 ], dtype=float32)

In [145]:
parental_weights

array([12.32545447, 13.93699523,  7.15949499,  8.26135873])

We're close!
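To put a number on "close", here is a small sketch that compares the two arrays printed above. The values are copied from those outputs; `abs_err` and `rel_err` are names I made up for this check.

```python
import numpy as np

# Values copied from the two cell outputs above.
true_parent = np.array([12.32545447, 13.93699523, 7.15949499, 8.26135873])
est_parent = np.array([12.220205, 12.63951, 6.7088165, 6.823576])

abs_err = np.abs(est_parent - true_parent)
rel_err = abs_err / np.abs(true_parent)

print(abs_err.round(2))
print(rel_err.round(3))
```

The largest absolute gap is about 1.4, on the fourth weight; the first weight is recovered to within roughly 0.1.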

In [146]:
trace["w_child"].mean(axis=0)

array([[14.688177 , 12.674288 ,  3.782937 ,  9.694809 ],
       [ 9.632241 , 11.64214  ,  6.841324 ,  4.209931 ],
       [11.201812 , 15.353562 ,  9.551869 ,  7.4665   ],
       [14.221379 ,  9.042535 ,  9.925585 ,  6.7520785],
       [14.51784  , 18.521345 ,  4.7933145,  8.713857 ],
       [13.585577 , 13.839578 ,  6.524608 ,  5.616619 ],
       [11.416423 , 16.004206 ,  7.634398 ,  4.5760627],
       [17.001846 , 14.900782 ,  9.529501 ,  4.553543 ],
       [11.590321 , 14.249942 , 10.119506 ,  6.0153227],
       [13.678913 , 15.5588455,  6.426305 , 12.679569 ],
       [10.190732 ,  9.045437 ,  5.7470784,  4.223966 ],
       [12.530115 , 10.814122 ,  5.8741508,  4.838204 ],
       [14.025205 , 13.353983 ,  3.6537082, 10.414098 ],
       [15.293123 , 11.174624 ,  8.425864 , 11.749094 ],
       [11.816928 , 15.980511 ,  8.585058 ,  7.760424 ]], dtype=float32)

In [147]:
child_weights

array([[14.68326928, 12.92295184,  3.52881896,  9.6505052 ],
       [ 9.63723985, 11.76671028,  6.73455871,  3.97607313],
       [11.04095371, 15.44631967,  9.76702665,  7.38798696],
       [14.12818348,  9.09436489,  9.62631285,  6.73982072],
       [14.64622042, 18.717324  ,  4.64819761,  8.60506451],
       [13.45981014, 14.04655425,  6.71800397,  5.48751886],
       [11.58397585, 16.15434869,  7.45054838,  4.4286952 ],
       [17.26905786, 14.82248315,  9.5490677 ,  4.77344471],
       [11.65916625, 14.39400437, 10.08578205,  5.88794316],
       [13.76733116, 15.73665656,  6.56426226, 12.54160592],
       [10.11841595,  8.9355318 ,  5.83729823,  4.20686629],
       [12.59941373, 10.74817324,  5.96226451,  4.98209562],
       [14.10718385, 13.27177338,  3.41672407, 10.64570369],
       [15.46324969, 11.12735008,  8.32406899, 11.81155349],
       [11.91743787, 15.99971291,  8.72581055,  7.55214293]])

In [148]:
trace["w_child"].std(axis=0)

array([[0.1336653 , 0.10619411, 0.16646177, 0.12158728],
       [0.16896625, 0.12257322, 0.13656092, 0.15733702],
       [0.10035358, 0.14588821, 0.11639755, 0.13881133],
       [0.2039019 , 0.10889905, 0.13775828, 0.14201835],
       [0.12482826, 0.12572515, 0.11699717, 0.11192081],
       [0.12796696, 0.13269228, 0.1537989 , 0.11189639],
       [0.1375776 , 0.14793952, 0.15032995, 0.1095657 ],
       [0.14512703, 0.10769934, 0.1058685 , 0.12129825],
       [0.13897239, 0.14762425, 0.15589072, 0.11110373],
       [0.12060434, 0.12413985, 0.1357903 , 0.12382214],
       [0.1522342 , 0.15977028, 0.10111415, 0.16021541],
       [0.12382656, 0.14558436, 0.14306618, 0.1175691 ],
       [0.12064637, 0.14835243, 0.11171525, 0.14321892],
       [0.11482354, 0.12465627, 0.13243239, 0.13184893],
       [0.1452353 , 0.13790777, 0.12258571, 0.11603751]], dtype=float32)
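One way to read these standard deviations is to ask how many posterior standard deviations separate each true weight from its posterior mean. The sketch below does this for the first task only, using values copied from the outputs above. Keep in mind that `pm.fit` defaults to mean-field ADVI, which often underestimates posterior variance, so these z-scores may be somewhat inflated.

```python
import numpy as np

# First row of child_weights, of the posterior mean, and of the posterior std.
true_w = np.array([14.68326928, 12.92295184, 3.52881896, 9.6505052])
post_mean = np.array([14.688177, 12.674288, 3.782937, 9.694809])
post_std = np.array([0.1336653, 0.10619411, 0.16646177, 0.12158728])

# Distance between truth and posterior mean, in units of posterior std.
z = (true_w - post_mean) / post_std
within_2sd = np.abs(z) < 2

print(z.round(2))
print(within_2sd)
```

For this task, three of the four true weights lie within two posterior standard deviations of the mean; the second weight is a little over two standard deviations away.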

OK! I think this works. The one thing that is not right is NUTS, which runs into gradient issues (we get zeros on the diagonal of the mass matrix). I'm going to show this experiment to the PyMC devs to see what I might be doing wrong.