# Introduction

TBD

In [1]:
import pymc3 as pm
import theano.tensor as tt
import numpy as np

In [2]:
# We start with parental weights of length 4, one for each feature.
parental_weights = np.random.normal(loc=10, scale=3, size=(4,))
parental_weights

array([ 5.95203   ,  5.32704166, 15.26745499, 14.27567973])

In [3]:
# We'll now generate new weights based on location=parental_weights,
# and scale=1

child_weights = np.random.normal(loc=parental_weights, scale=1, size=(2, 4))
child_weights

array([[ 6.58715506,  5.32683857, 16.18669375, 13.94170117],
       [ 5.72214364,  6.24181095, 16.51829553, 15.62049623]])

These are the true weights of the system.

We are now going to attempt to learn them in a Bayesian fashion.

In [4]:
n_samps = 100
n_tasks = 2
n_weights = 4
data = np.random.normal(loc=3, scale=4, size=(n_tasks, n_samps, n_weights))

# Now zero out the last 20 samples of the 2nd task
# to indicate that it has no data there.
# Axis order is (task, sample, feature), so the task index comes first.
data[1, 80:] = 0

In [5]:
data.shape, child_weights.shape

((2, 100, 4), (2, 4))

In [6]:
y = np.einsum("ijk, ik -> ij", data, child_weights)

In [7]:
y.shape

(2, 100)
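The `einsum` call above can be read as one matrix-vector product per task. The following is a minimal sketch of that equivalence; the names `X` and `W` and the seed are placeholders chosen for illustration and are not part of the notebook.

```python
import numpy as np

rng = np.random.default_rng(0)  # seed chosen only for reproducibility
n_tasks, n_samps, n_weights = 2, 100, 4

# Stand-ins for `data` and `child_weights` with the same shapes.
X = rng.normal(loc=3, scale=4, size=(n_tasks, n_samps, n_weights))
W = rng.normal(loc=10, scale=3, size=(n_tasks, n_weights))

# einsum form: sum over the feature axis k, keep task i and sample j.
y_einsum = np.einsum("ijk,ik->ij", X, W)

# Explicit form: for each task, multiply its (100, 4) matrix by its (4,) weights.
y_loop = np.stack([X[i] @ W[i] for i in range(n_tasks)])

print(y_einsum.shape)
print(np.allclose(y_einsum, y_loop))
```

Each task only ever sees its own weight vector, which is why the task index `i` appears in both inputs and in the output.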

We will now write a hierarchical linear regression model that handles this case, in which the tasks have unequal numbers of samples.

If we can recover the original weights, then zero-padding could be a very powerful technique for handling multiple learning tasks that have unequal numbers of samples and non-overlapping sample indices.
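One way to build intuition for why zero-padding might be harmless for the weights: a row of all zeros paired with a target of zero contributes nothing to a least-squares fit. The sketch below checks this with ordinary least squares rather than the Bayesian model; `true_w` and the seed are illustrative assumptions, not values from this notebook.

```python
import numpy as np

rng = np.random.default_rng(1)  # seed chosen only for reproducibility
true_w = np.array([6.0, 5.0, 16.0, 14.0])  # illustrative weights (assumption)

# A task with only 80 real samples.
X_short = rng.normal(loc=3, scale=4, size=(80, 4))
y_short = X_short @ true_w

# Pad to 100 rows with zeros; the padded targets are then zero as well.
X_padded = np.vstack([X_short, np.zeros((20, 4))])
y_padded = X_padded @ true_w

# Least-squares fits with and without the padded rows.
w_short, *_ = np.linalg.lstsq(X_short, y_short, rcond=None)
w_padded, *_ = np.linalg.lstsq(X_padded, y_padded, rcond=None)

print(np.allclose(w_short, w_padded))
```

This only speaks to the weights. In the PyMC3 model in this notebook, padded entries still appear in `observed=y`, so they would still count as observations when inferring `sd`; that is an observation about the model as written, not something this sketch tests.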

In [8]:
data.shape

(2, 100, 4)

In [9]:
child_weights.shape

(2, 4)

In [10]:
tt.batched_dot(data, child_weights)

BatchedDot.0

In [11]:
data.shape

(2, 100, 4)

In [12]:
y.shape

(2, 100)

In [13]:
with pm.Model() as hierarchical_linear_model:
    w_parent = pm.Normal("w_parent", mu=0, sd=1, shape=(4,))

    # Broadcasting w_parent (shape (4,)) against shape (2, 4) gives each of
    # the 2 tasks its own set of 4 child weights centered on w_parent.
    w_child = pm.Normal("w_child", mu=w_parent, sd=1, shape=(2, 4))

    sd = pm.HalfCauchy("sd", beta=10)

    # mu = pm.Deterministic("mu", np.einsum('ijk, ik -> ij', data, w_child))
    mu = pm.Deterministic("mu", tt.batched_dot(data, w_child))
    like = pm.Normal("like", mu=mu, sd=sd, observed=y)

In [20]:
with hierarchical_linear_model:
    # trace = pm.sample(2000, cores=1)
    approx = pm.fit(100000)
    trace = approx.sample(2000)

In [21]:
trace["w_child"]

array([[[ 6.587257 ,  5.3264933, 16.186497 , 13.941768 ],
        [ 5.7225223,  6.241286 , 16.518278 , 15.620812 ]],

       [[ 6.5872164,  5.326681 , 16.187141 , 13.941956 ],
        [ 5.7223797,  6.241532 , 16.518642 , 15.620879 ]],

       [[ 6.5872602,  5.326886 , 16.186613 , 13.941912 ],
        [ 5.7223163,  6.2413754, 16.518312 , 15.620644 ]],

       ...,

       [[ 6.586886 ,  5.326661 , 16.186844 , 13.941857 ],
        [ 5.722209 ,  6.241243 , 16.51849  , 15.620339 ]],

       [[ 6.587067 ,  5.3271995, 16.186884 , 13.941792 ],
        [ 5.7223306,  6.2415323, 16.518023 , 15.620553 ]],

       [[ 6.58718  ,  5.3270683, 16.186546 , 13.941807 ],
        [ 5.7223587,  6.2414646, 16.518484 , 15.620249 ]]], dtype=float32)

In [27]:
trace["w_parent"].mean(axis=0)

array([ 4.116142 ,  3.8681304, 10.926062 ,  9.860763 ], dtype=float32)

In [29]:
parental_weights

array([ 5.95203   ,  5.32704166, 15.26745499, 14.27567973])

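The estimates follow the same pattern as the true parental weights, but they are pulled toward the prior mean of 0: `w_parent` has a `Normal(0, 1)` prior, and only two child weight vectors inform it.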
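The gap between the `w_parent` estimates and the true parental weights can be checked against the standard conjugate-normal update. The sketch below assumes the child weights are effectively pinned down by the data (their posterior standard deviations are tiny in this notebook), so `w_parent` sees two "observations" with unit variance on top of its `Normal(0, 1)` prior. The arrays are copied from the outputs in this notebook; the variable names are illustrative.

```python
import numpy as np

# Posterior means reported in this notebook.
w_child_mean = np.array([
    [6.587023, 5.3268237, 16.18679, 13.941514],
    [5.7222967, 6.2413874, 16.518183, 15.620741],
])
w_parent_reported = np.array([4.116142, 3.8681304, 10.926062, 9.860763])

# Model settings: w_parent ~ Normal(0, 1), w_child ~ Normal(w_parent, 1).
prior_mu, prior_sd = 0.0, 1.0
child_sd = 1.0
n_tasks = w_child_mean.shape[0]

# Conjugate normal update: precisions add, means are precision-weighted.
prior_prec = 1.0 / prior_sd**2
child_prec = 1.0 / child_sd**2
post_prec = prior_prec + n_tasks * child_prec
w_parent_expected = (
    prior_prec * prior_mu + child_prec * w_child_mean.sum(axis=0)
) / post_prec

print(w_parent_expected)
print(np.abs(w_parent_expected - w_parent_reported).max())
```

Under these assumptions the expected posterior mean is the sum of the two child weight vectors divided by 3, roughly 4.10, 3.86, 10.90 and 9.85, which is within a few hundredths of the reported `w_parent` means. A wider prior on `w_parent` would shrink the estimates less.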

In [24]:
trace["w_child"].mean(axis=0)

array([[ 6.587023 ,  5.3268237, 16.18679  , 13.941514 ],
       [ 5.7222967,  6.2413874, 16.518183 , 15.620741 ]], dtype=float32)

In [30]:
trace["w_child"].std(axis=0)

array([[0.00026634, 0.00028596, 0.00030754, 0.00029132],
       [0.00021608, 0.00024131, 0.00029862, 0.00034922]], dtype=float32)

In [25]:
child_weights

array([[ 6.58715506,  5.32683857, 16.18669375, 13.94170117],
       [ 5.72214364,  6.24181095, 16.51829553, 15.62049623]])

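OK! I think this works: the posterior means of `w_child` match the true child weights to about three decimal places. The fit above uses `pm.fit` (variational inference) rather than NUTS, because something is not right with NUTS due to gradient issues (we get zeros on the diagonal of the mass matrix). I'm going to show this experiment to the PyMC devs to see what I might be doing wrong.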

# Recap

Just to recap what we've done here.

We have two learning tasks that involve the same _kind_ of input data but whose samples are not exactly aligned. The first task has 100 iid samples; the second has 80. In our data array, the second task's 80 samples are not necessarily aligned with the first task's 100. One other assumption baked into this model is that each task gets its own set of weights, but those weights are drawn from a shared parental prior. Hence there is parameter sharing among the learning tasks, though not in the usual "classical" sense.

With zero-padding, we should be able to generalize this to multi-task neural network learning with non-overlapping samples. [Thomas Wiecki](https://twiecki.io/blog/2018/08/13/hierarchical_bayesian_neural_network/) has a great blog post on hierarchical Bayesian neural networks, though he didn't deal with the "number of samples" issue, which I tried to add here.
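The padded layout described in this recap could be built from per-task arrays of different lengths roughly as follows. `pad_tasks` is a hypothetical helper written for this sketch, not a function from NumPy or PyMC3, and the boolean mask it returns is an extra that the notebook itself does not use.

```python
import numpy as np

def pad_tasks(task_arrays):
    """Stack (n_i, n_features) arrays into one zero-padded (n_tasks, n_max, n_features) array.

    Also returns a boolean mask of shape (n_tasks, n_max) marking real samples.
    """
    n_max = max(a.shape[0] for a in task_arrays)
    n_features = task_arrays[0].shape[1]
    padded = np.zeros((len(task_arrays), n_max, n_features))
    mask = np.zeros((len(task_arrays), n_max), dtype=bool)
    for i, a in enumerate(task_arrays):
        padded[i, : a.shape[0]] = a   # real samples go first
        mask[i, : a.shape[0]] = True  # remaining rows stay zero / False
    return padded, mask

rng = np.random.default_rng(2)  # seed chosen only for reproducibility
task_1 = rng.normal(loc=3, scale=4, size=(100, 4))
task_2 = rng.normal(loc=3, scale=4, size=(80, 4))

data, mask = pad_tasks([task_1, task_2])
print(data.shape, mask.sum(axis=1))
```

Keeping the mask around leaves the option of excluding padded entries from a likelihood later, should they turn out to matter.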