# MLSS2019: Bayesian Deep Learning

In this tutorial we will learn what basic building blocks are needed
to endow (deep) neural networks with uncertainty estimates.

The plan of the tutorial
1. [Setup and imports](#Setup-and-imports)
2. [Easy uncertainty in networks](#Easy-uncertainty-in-networks)
   1. [Bayesification via dropout and weight decay](#Bayesification-via-dropout-and-weight-decay)
   2. [Implementing function sampling with the DropoutLinear Layer](#Implementing-function-sampling-with-the-DropoutLinear-Layer)
   3. [Implementing DropoutLinear](#Implementing-DropoutLinear)
   4. [Comparing sample functions to point-estimates](#Comparing-sample-functions-to-point-estimates)
3. [(optional) Dropout $2$-d Convolutional layer](#(optional)-Dropout-$2$-d-Convolutional-layer)
4. [(optional) A brief reminder on Bayesian and Variational Inference](#(optional)-A-brief-reminder-on-Bayesian-and-Variational-Inference)

**(note)**
* to view the documentation of something, type in `something?` (with one question mark)
* to view the source code of something, type in `something??` (with two question marks).

<br>

## Setup and imports

In this section we import necessary modules and functions and
define the computational device.

First, we install some boilerplate service code for this tutorial.

In [None]:
!pip install -q --upgrade git+https://github.com/ivannz/mlss2019-bayesian-deep-learning.git

Next, numpy for computing, matplotlib for plotting and tqdm for progress bars.

In [None]:
import tqdm
import numpy as np

%matplotlib inline
import matplotlib.pyplot as plt

For deep learning we will be using [pytorch](https://pytorch.org/).

If you are unfamiliar with it, it is basically like `numpy` with autograd,
stricter data type enforcement, native GPU support, and tools for building,
training, and serializing models.
<!-- (and with `axis` argument replaced with `dim` :) -->

There are good introductory tutorials on `pytorch`, like this
[one](https://pytorch.org/tutorials/beginner/deep_learning_60min_blitz.html).

In [None]:
import torch
import torch.nn.functional as F

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

We will need some functionality from scikit-learn.

In [None]:
from sklearn.metrics import confusion_matrix

Next we import the boilerplate code.

* a procedure that implements a minibatch SGD **fit** loop
* a function, that **evaluates** the model on the provided dataset

In [None]:
from mlss2019bdl import fit

```python
# pseudocode
def fit(model, dataset, criterion, ...):
    for epoch in epochs:
        for batch in dataset:
            loss = criterion(model, batch)  # forward pass

            grad = loss.backward()          # gradient via back propagation

            adam_step(grad)
```

In [None]:
from mlss2019bdl import predict

```python
# pseudocode
def predict(model, dataset, ...):
    for input_batch in dataset:
        output.append(model(input_batch))  # forward pass
    
    return concatenate(output)
```

<br>

## Easy uncertainty in networks

Generate the initial small dataset $S_0 = (x_i, y_i)_{i=1}^{m_0}$
with $y_i = g(x_i)$, $x_i$ on a regular-spaced grid, and $
g
    \colon \mathbb{R} \to \mathbb{R}
    \colon x \mapsto \tfrac{x^2}4 + \sin \frac\pi2 x
$.
<!--
`dataset_from_numpy` **converts** numpy arrays into torch tensors,
**places** them on the specified compute device, **and packages**
into a dataset
-->

In [None]:
from mlss2019bdl import dataset_from_numpy

X_train = np.linspace(-6.0, +6.0, num=20)[:, np.newaxis]
y_train = np.sin(X_train * np.pi / 2) + 0.25 * X_train**2

train = dataset_from_numpy(X_train, y_train, device=device)

In [None]:
X_domain = np.linspace(-10., +10., num=251)[:, np.newaxis]

domain = dataset_from_numpy(X_domain, device=device)

Suppose we have the following model: a 3-layer fully connected
network with LeakyReLU activations.

In [None]:
from torch.nn import Linear, Sequential
from torch.nn import LeakyReLU


model = Sequential(
    Linear(1, 512, bias=True),
    LeakyReLU(),

    Linear(512, 512, bias=True),
    LeakyReLU(),

    Linear(512, 1, bias=True),
)

model.to(device)

<br>

We fit our model on `train` using MSE loss and $\ell_2$ penalty on
weights (`weight_decay`):
$$
    \tfrac1{2 m} \|f_\omega(x) - y\|_2^2 + \lambda \|\omega\|_2^2
    \,, $$
where $\omega$ are all the learnable parameters of the network $f_\omega$.
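To make the two terms of this objective concrete, here is a minimal plain-Python sketch that evaluates it for a toy linear model. The helper name `penalized_mse` and all the numbers are assumptions made up for illustration; in the tutorial itself the objective is handled by `fit` through `criterion="mse"` and `weight_decay`.

```python
def penalized_mse(predictions, targets, params, lam):
    """Return 1/(2m) * ||f(x) - y||^2 + lam * ||omega||^2."""
    m = len(targets)
    data_term = sum((p - t) ** 2 for p, t in zip(predictions, targets)) / (2 * m)
    penalty = lam * sum(w ** 2 for w in params)
    return data_term + penalty


# toy linear model f(x) = w * x + b with hypothetical parameter values
w, b = 2.0, 1.0
xs = [0.0, 1.0, 2.0]
ys = [1.0, 2.0, 7.0]
preds = [w * x + b for x in xs]

loss = penalized_mse(preds, ys, params=[w, b], lam=1e-3)
```

A larger `lam` pulls the parameters towards zero at the expense of the data fit.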

<br>

Fit, ...

In [None]:
fit(model, train, criterion="mse", n_epochs=2000, verbose=True, weight_decay=1e-3)

..., compute the predictions, ...

In [None]:
y_pred = predict(model, domain)

..., and plot them.

In [None]:
fig, ax = plt.subplots(1, 1, figsize=(12, 5))

ax.scatter(X_train, y_train, c="black", s=40, label="train")

ax.plot(X_domain, y_pred.numpy(), c="C0", lw=2, label="prediction")

plt.legend();

This model seems to fit the train set adequately well. However, there is no
way to assess how confident this model is with respect to its predictions.
Indeed, the prediction $\hat{y}_x = f_\omega(x)$ is a deterministic function
of the input $x$ and the learnt parameters $\omega$.

<br>

### `Bayesification` via dropout and weight decay

One inexpensive way to make any network into a stochastic function of its
input is to add dropout before any parameterized layer like `linear`
or `convolutional`, [Hinton et al. 2012](https://arxiv.org/abs/1207.0580).
Essentially, dropout applies a Bernoulli mask to the features of the input.

In [Gal, Y. (2016)](http://www.cs.ox.ac.uk/people/yarin.gal/website/thesis/thesis.pdf)
it has been shown that a simple, somewhat ad-hoc approach of
adding uncertainty quantification to networks through dropout,
coupled with $\ell_2$ weight penalty, is a special case of Variational Inference.

For input
$
    x\in \mathbb{R}^{[\mathrm{in}]}
$ the dropout layer acts like this:

$$
    y_j = x_j \, m_j
    \,, $$

where $m\in \mathbb{R}^{[\mathrm{in}]}$ with $
m_j \sim \pi_p(m_j)
    = \mathcal{Ber}\bigl(\bigl\{0, \tfrac1{1-p}\bigr\}, 1-p\bigr)
$,
i.e. equals $\tfrac1{1-p}$ with probability $1-p$ and $0$ otherwise.
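The mask distribution above can be sketched without any deep learning library. The snippet below is only an illustration in plain Python (the helper names `dropout_mask` and `apply_dropout` are hypothetical); it draws the mask and checks empirically that the $\tfrac1{1-p}$ rescaling keeps the mean of the mask close to one, so that $\mathbb{E}\, y_j = x_j$.

```python
import random


def dropout_mask(n, p, rng):
    """Draw n i.i.d. mask entries: 1/(1-p) with prob. 1-p, else 0."""
    keep = 1.0 - p
    return [1.0 / keep if rng.random() < keep else 0.0 for _ in range(n)]


def apply_dropout(x, p, rng):
    """Elementwise product y_j = x_j * m_j."""
    mask = dropout_mask(len(x), p, rng)
    return [x_j * m_j for x_j, m_j in zip(x, mask)]


rng = random.Random(0)   # fixed seed, only for reproducibility
p = 0.25
mask = dropout_mask(100_000, p, rng)
mask_mean = sum(mask) / len(mask)   # close to 1 thanks to the rescaling
y = apply_dropout([1.0, 2.0, 3.0], p, rng)
```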

#### (task) Always Active Dropout

Useful methods:
* `torch.rand(d1, ..., dn)` -- draw $d_1\times \ldots \times d_n$ tensor of uniform rv-s
* `torch.rand_like(other)` -- draw a tensor of uniform rv-s with the shape, data type and device as `other`


* `torch.bernoulli(pi)` -- draw tensor $t$ with independent $
t_\alpha \sim \mathcal{Ber}\bigl(\{0, 1\}, \pi_\alpha\bigr)
$ for each index $\alpha$
* `torch.full((d1, ..., dn), v)` -- a $d_1\times \ldots \times d_n$ tensor with the same value $v$


* `Tensor.to(other)` -- move `Tensor` to the device of `other` and cast it to the data type of `other`.

In [None]:
from torch.nn import Module

class ActiveDropout(Module):
    # all building blocks of networks are inherited from Module!

    def __init__(self, p=0.5):
        super().__init__()  # init the base class

        self.p = p

    def forward(self, input):
        ## Exercise: implement feature dropout on input
        #  self.p - contains the specified dropout rate
        
        mask = torch.rand_like(input) > self.p
        return input * mask.to(input) / (1 - self.p)

        # prob = torch.full_like(input, 1 - self.p)
        # return input * torch.bernoulli(prob) / prob

        # return F.dropout(input, self.p, True)

        pass

<br>

#### (task) Rebuilding the model

Let's recreate the model above with this freshly minted dropout layer.
Then fit it and plot its prediction uncertainty due to forward pass stochasticity.

In [None]:
def build_model(p=0.5):
    """Build a model with dropout layers' rate set to `p`."""

    return Sequential(
        ## Exercise: Use `ActiveDropout` before linear layers of our
        #  first network. Note that dropping out inputs is not a good idea

        Linear(1, 512, bias=True),
        LeakyReLU(),

        ActiveDropout(p),
        Linear(512, 512, bias=True),
        LeakyReLU(),

        ActiveDropout(p),
        Linear(512, 1, bias=True),

        # pass
    )

<br>

In [None]:
model = build_model(p=0.5)

model.to(device)

fit(model, train, criterion="mse", n_epochs=2000, verbose=True,
    weight_decay=1e-3)

<br>

#### Sampling the random output

Let's take the test sample $\tilde{S} = (\tilde{x}_i)_{i=1}^m \subset \mathcal{X}$
and repeat the stochastic forward pass $B$ times at each $x\in \tilde{S}$:

* for $b = 1 .. B$ do:

  1. draw $y_{bi} \sim f_\omega(\tilde{x}_i)$ for $i = 1 .. m$.

In [None]:
def point_estimate(model, dataset, n_samples=1, verbose=False):
    """Draw pointwise samples with stochastic forward pass."""

    outputs = []
    for sample in tqdm.tqdm(range(n_samples), disable=not verbose):

        outputs.append(predict(model, dataset))

    return torch.stack(outputs, dim=0)


samples = point_estimate(model, domain, n_samples=101, verbose=True)

<br>

The approximate $95\%$ confidence band of predictions is...

In [None]:
fig, ax = plt.subplots(1, 1, figsize=(12, 5))
ax.scatter(X_train, y_train, c="black", s=40, label="train")

mean, std = samples.mean(dim=0).numpy(), samples.std(dim=0).numpy()
ax.plot(X_domain, mean + 1.96 * std, c="k")
ax.plot(X_domain, mean - 1.96 * std, c="k");

<br>

### Implementing function sampling with the DropoutLinear Layer

Let's inspect the draws $y_{bi}$ as $B$ functional samples:
$(x_i, y_{bi})_{i=1}^m$ - the $b$-th sample path. Below we
plot $5$ random paths.

In [None]:
samples = point_estimate(model, domain, n_samples=101, verbose=True)

fig, ax = plt.subplots(1, 1, figsize=(12, 5))

ax.scatter(X_train, y_train, c="black", s=40, label="train")
ax.plot(X_domain[:, 0], samples[:5, :, 0].numpy().T, c="C0", lw=1, alpha=0.25);

It is clear that they are very erratic!

Computing stochastic forward passes with a new mask each time is equivalent
to drawing a new **independent** prediction for each point $x\in \tilde{S}$,
without considering that, in fact, at adjacent points the predictions should
be correlated. If we were interested in uncertainty at some particular point,
this would be okay: **fast and simple**.

However, if we are interested in the uncertainty of an integral **path-dependent**
measure of the whole estimated function, or are doing **optimization** of
the unknown true function taking estimation uncertainty into account, then
this clearly erratic behaviour of paths is undesirable. For an example, see
[blog: Gal, Y. 2016](http://www.cs.ox.ac.uk/people/yarin.gal/website/blog_2248.html)

<br>

We need to implement some extra functionality on top of `pytorch`,
in order to draw realizations from the induced distribution over
functions, defined by a network, i.e. $
\bigl\{
    f_\omega\colon \mathcal{X}\to\mathcal{Y}
\bigr\}_{\omega \sim q(\omega)}
$
where $q(\omega)$ is a distribution over the parameters.

One of the design approaches is to allow layers
to cache random draws of their parameters for reuse
in all subsequent forward passes, until this is no
longer needed.

#### Freeze/unfreeze interface

This is a base **trait-class** `FreezableWeight` that adds an interface
for freezing and unfreezing a layer's random **weight** parameter.

In [None]:
class FreezableWeight(Module):
    def __init__(self):
        super().__init__()
        self.unfreeze()

    def unfreeze(self):
        self.register_buffer("frozen_weight", None)

    def is_frozen(self):
        """Check if a frozen weight is available."""
        return isinstance(self.frozen_weight, torch.Tensor)

    def freeze(self):
        """Sample from the parameter distribution and freeze."""
        raise NotImplementedError()

Next, we declare a pair of functions:
* `freeze()` instructs each compatible layer of the model to **sample and freeze** its randomness
* `unfreeze()` requests the layers to **undo** this

In [None]:
def freeze(model):
    for layer in model.modules():
        if isinstance(layer, FreezableWeight):
            layer.freeze()

    return model

In [None]:
def unfreeze(model):
    for layer in model.modules():
        if isinstance(layer, FreezableWeight):
            layer.unfreeze()

    return model

<br>

#### (task) Sampling realizations

The algorithm to sample a random function is:
* for $b = 1... B$ do:

  1. draw an independent realization $f_b\colon \mathcal{X} \to \mathcal{Y}$
  from the process $\{f_\omega\}_{\omega \sim q(\omega)}$
  2. get $\hat{y}_{bi} = f_b(\tilde{x}_i)$ for $i=1 .. m$


In [None]:
def sample_function(model, dataset, n_samples=1, verbose=False):
    """Draw a realization of a random function."""

    ## Exercise: code a function similar to `point_estimate()`,
    ##  that collects the predictions from `frozen` models. Don't
    ##  forget to unfreeze before returning.

    outputs = []
    for _ in tqdm.tqdm(range(n_samples), disable=not verbose):
        freeze(model)

        outputs.append(predict(model, dataset))

    unfreeze(model)

    return torch.stack(outputs, dim=0)

    pass

**(note)** although the internal loops of the two functions look
similar, conceptually the functions differ:
<strong>
```python
# pseudocode
def point_estimate(f, S):
    for x in S:
        for w from f.q:  # different w for different x
            yield f(x, w)


def sample_function(f, S):
    for w from f.q:
        for x in S:      # same w for different x (thanks to freeze)
            yield f(x, w)
```
</strong>
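The difference between the two loop orders can be seen on a toy example. The sketch below is a plain-Python illustration with a hypothetical random one-hidden-layer ReLU network (none of these helpers belong to the tutorial's code): it compares the roughness of a path built with a fresh mask per input against a path built with a single shared mask.

```python
import random


def make_network(n_hidden, rng):
    """Random one-hidden-layer ReLU network with hypothetical weights."""
    return [(rng.gauss(0, 1), rng.gauss(0, 1), rng.gauss(0, 1) / n_hidden ** 0.5)
            for _ in range(n_hidden)]


def draw_mask(n_hidden, p, rng):
    keep = 1.0 - p
    return [1.0 / keep if rng.random() < keep else 0.0 for _ in range(n_hidden)]


def forward(net, mask, x):
    """Output with the dropout mask applied to the hidden features."""
    return sum(m * v * max(0.0, a * x + c) for m, (a, c, v) in zip(mask, net))


def roughness(path):
    """Sum of squared increments between neighbouring grid points."""
    return sum((b - a) ** 2 for a, b in zip(path, path[1:]))


rng = random.Random(0)
net = make_network(64, rng)
grid = [-1.0 + 0.02 * i for i in range(101)]

# point estimate: a fresh mask for every input x
pointwise = [forward(net, draw_mask(64, 0.5, rng), x) for x in grid]

# function sample: one mask shared by all inputs (the "frozen" case)
mask = draw_mask(64, 0.5, rng)
frozen = [forward(net, mask, x) for x in grid]

rough_pointwise, rough_frozen = roughness(pointwise), roughness(frozen)
```

With a shared mask the path is a continuous piecewise-linear function of $x$, so `rough_frozen` comes out much smaller than `rough_pointwise`.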

<br>

### Implementing `DropoutLinear`

Now we will merge `ActiveDropout` and `Linear` layers into one, which

1. (on forward pass) **drops out** the inputs, if necessary, and **applies** the linear (affine) transform
2. (on freeze) **randomly zeros** columns in a copy of the weight matrix $W$

Preferably, we will try to preserve the interface, so that the resulting
object is backwards compatible with `Linear`.

This way we would be able to draw realizations from the induced
distribution over functions defined by the network $
\bigl\{
    f_\omega\colon \mathcal{X}\to\mathcal{Y}
\bigr\}_{\omega \sim q(\omega)}
$
where $q(\omega)$ is a distribution over the network parameters.

<br>

#### (task) Fused dropout-linear operation

On the inputs into a linear layer dropout acts like this: for input
$
    x\in \mathbb{R}^{[\mathrm{in}]}
$ and layer weights $
    W\in \mathbb{R}^{[\mathrm{out}] \times [\mathrm{in}]}
$
and bias $
    b\in \mathbb{R}^{[\mathrm{out}]}
$ the resulting effect is

$$
    \tilde{x} = x \odot m
    \,, \\
    y = \tilde{x} W^\top + b
%     = b + \sum_i x_i m_i W_i
    \,, $$

where $\odot$ is the elementwise product and $m\in \mathbb{R}^{[\mathrm{in}]}$
with $m_j \sim \pi_p(m_j) = \mathcal{Ber}\bigl(\bigl\{0, \tfrac1{1-p}\bigr\}, 1-p\bigr)$,
i.e. equals $\tfrac1{1-p}$ with probability $1-p$ and $0$ otherwise.

Let
$
    x\in \mathbb{R}^{[\mathrm{in}]}
$, $
    W\in \mathbb{R}^{[\mathrm{out}] \times [\mathrm{in}]}
$
and $
    b\in \mathbb{R}^{[\mathrm{out}]}
$. Let's use the following `torch`'s functions:

* `F.dropout(x, p, on/off)` -- independent Bernoulli dropout $x\mapsto x\odot m$
  for $m\sim \mathcal{Ber}\bigl(\bigl\{0, \tfrac1{1-p}\bigr\}, 1-p\bigr)$

* `F.linear(x, W, b)` -- affine transformation $x \mapsto x W^\top + b$

**(note)** the `.weight` of a linear layer in `pytorch` is an $
{
    [\mathrm{out}]
    \times [\mathrm{in}]
}
$ matrix.

<!-- `pytorch` has a function for this `F.dropout(input, p, training)`. It multiplies
each element of the `input` tensor by an independent Bernoulli rv. The argument
`p` has the same meaning as above. The boolean argument `training` toggles the
effect: if `False` then the input is returned as-is, otherwise the mask is applied. -->

In [None]:
def DropoutLinear_forward(self, input):
    ## Exercise: If not frozen, then apply always active dropout,
    #  then linear transformation. If frozen, apply the transform
    #  using the frozen weight

    # linear with frozen weight
    if self.is_frozen():
        return F.linear(input, self.frozen_weight, self.bias)

    # stochastic pass as in `ActiveDropout` + Linear
    input = F.dropout(input, self.p, True)

    return F.linear(input, self.weight, self.bias)
    # return super().forward(F.dropout(input, self.p, True))

    pass

<br>

#### Parameter freezer for our custom layer

For input
$
    x\in \mathbb{R}^{[\mathrm{in}]}
$ and layer parameters $
    W\in \mathbb{R}^{[\mathrm{out}] \times [\mathrm{in}]}
$
and $
    b\in \mathbb{R}^{[\mathrm{out}]}
$ the effect in `DropoutLinear` is

$$
    y_j
        = \bigl[(x \odot m) W^\top + b\bigr]_j
        = b_j + \sum_i x_i m_i W_{ji}
        = b_j + \sum_i x_i \breve{W}_{ji}
    \,, $$

where each column $\breve{W}_i$ of $\breve{W}$ is, independently, either
$\mathbf{0} \in \mathbb{R}^{[\mathrm{out}]}$ with probability $p$ or
a rescaled (learnable) vector $M_i \in \mathbb{R}^{[\mathrm{out}]}$, which here is the $i$-th column $W_i$ of $W$:

$$
    \breve{W}_i \sim
\begin{cases}
    \mathbf{0}
        & \text{ w. prob } p \,, \\
    \tfrac1{1-p} M_i
        & \text{ w. prob } 1-p \,.
\end{cases}
$$

Thus the multiplicative effect of the random mask $m$ on $x$ can be
equivalently seen as a random **on/off** switch effect on the
**columns** of the matrix $W$.
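This equivalence is easy to verify numerically. The following plain-Python sketch (with a hypothetical helper `linear` and random toy values) applies the same mask once to the input and once to the columns of the weight matrix, and compares the outputs.

```python
import random


def linear(x, W, b):
    """Affine map y_j = b_j + sum_i x_i W[j][i], with W of shape [out][in]."""
    return [b_j + sum(x_i * w_ji for x_i, w_ji in zip(x, row))
            for row, b_j in zip(W, b)]


rng = random.Random(0)
n_in, n_out, p = 4, 3, 0.5

x = [rng.gauss(0, 1) for _ in range(n_in)]
W = [[rng.gauss(0, 1) for _ in range(n_in)] for _ in range(n_out)]
b = [rng.gauss(0, 1) for _ in range(n_out)]
m = [1.0 / (1 - p) if rng.random() < 1 - p else 0.0 for _ in range(n_in)]

# (a) dropout on the input, then the original weight
y_input = linear([x_i * m_i for x_i, m_i in zip(x, m)], W, b)

# (b) original input, weight with columns switched off or rescaled
W_breve = [[w_ji * m_i for w_ji, m_i in zip(row, m)] for row in W]
y_weight = linear(x, W_breve, b)

max_gap = max(abs(u - v) for u, v in zip(y_input, y_weight))
```

The one-row mask is multiplied into every row of `W`, which is the same broadcasting pattern that a frozen weight relies on.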

In [None]:
def DropoutLinear_freeze(self):
    """Apply dropout with rate `p` to columns of `weight` and freeze it."""
    # we leverage torch's broadcasting semantics and draw a one-row
    #  binary mask, that we later multiply the weight by.

    # let's draw the new weight
    with torch.no_grad():
        prob = torch.full_like(self.weight[:1, :], 1 - self.p)
        feature_mask = torch.bernoulli(prob) / prob

        frozen_weight = self.weight * feature_mask

    # and store it
    self.register_buffer("frozen_weight", frozen_weight)

<br>

Assemble the blocks into a layer

In [None]:
class DropoutLinear(Linear, FreezableWeight):
    """Linear layer with dropout on inputs."""
    def __init__(self, in_features, out_features, bias=True, p=0.5):
        super().__init__(in_features, out_features, bias=bias)

        self.p = p

    forward = DropoutLinear_forward

    freeze = DropoutLinear_freeze

<br>

### Comparing sample functions to point-estimates 

Let's rewrite the model builder function:

In [None]:
def build_model(p=0.5):
    """Build a model with the custom layer and dropout rate set to `p`."""

    return Sequential(
        ## Exercise: Plug-in `DropoutLinear` layer into our second network.

        Linear(1, 512, bias=True),
        LeakyReLU(),

        DropoutLinear(512, 512, bias=True, p=p),
        LeakyReLU(),

        DropoutLinear(512, 1, bias=True, p=p),

        # pass
    )

Let's create a new instance and retrain the model.

In [None]:
model = build_model(p=0.5)
model.to(device)

fit(model, train, criterion="mse", n_epochs=2000, verbose=True, weight_decay=1e-3)

... and obtain two estimates: pointwise and functional.

In [None]:
samples_pe = point_estimate(model, domain, n_samples=51, verbose=True)
samples_sf = sample_function(model, domain, n_samples=51, verbose=True)

samples_pe.shape, samples_sf.shape

```python
(torch.Size([51, 251, 1]), torch.Size([51, 251, 1]))
```

<br>

Let's compare <span style="color:#1f77b4">**point estimates**</span>
with <span style="color:#ff7f0e">**function sampling**</span>.

In [None]:
fig, ax = plt.subplots(1, 1, figsize=(12, 5))

ax.plot(X_domain[:, 0], samples_pe[:10, :, 0].numpy().T,
        c="C1", lw=1, alpha=0.5)

ax.plot(X_domain[:, 0], samples_sf[:10, :, 0].numpy().T,
        c="C0", lw=2, alpha=0.5)

ax.scatter(X_train, y_train, c="black", s=40,
           label="train", zorder=+10);

In [None]:
fig, ax = plt.subplots(1, 1, figsize=(12, 5))

ax.scatter(X_train, y_train, c="black", s=40, label="train")

mean, std = samples_sf.mean(dim=0).numpy(), samples_sf.std(dim=0).numpy()
ax.plot(X_domain, mean + 1.96 * std, c="C0")
ax.plot(X_domain, mean - 1.96 * std, c="C0");

mean, std = samples_pe.mean(dim=0).numpy(), samples_pe.std(dim=0).numpy()
ax.plot(X_domain, mean + 1.96 * std, c="C1")
ax.plot(X_domain, mean - 1.96 * std, c="C1");

Pros of `point-estimate`:
* uses stochastic forward passes -- no need for extra code and classes

Cons of `point-estimate`:
* samples from the predictive distribution at adjacent inputs are independent

<br>

**(note)**
The parameter distribution of the layer we've built is

$$
    q(\omega\mid \theta)
        = \prod_i q(\omega_i\mid \theta_i)
        = \prod_i \bigl\{
            p \delta_{\mathbf{0}} (\omega_i)
            + (1 - p) \delta_{\tfrac1{1-p} \theta_i}(\omega_i)
        \bigr\}
    \,, $$

where $\omega_i$ is the $i$-th column of $\omega$, $\delta_a$ is a
**point-mass** distribution at $a$, and $\theta$ is the learnt
approximate posterior mean of $\omega$.

Under benign assumptions and certain relaxations
[Gal, Y. 2016 (eq. (6.3) p.109, Prop. 4 p.149)](http://www.cs.ox.ac.uk/people/yarin.gal/website/thesis/thesis.pdf)
has shown that a deep network with dropout rate $p$
and $\ell_2$ weight penalty (`weight_decay`) performs (doubly)
**stochastic variational inference** with the following stochastic
approximate **evidence lower bound**: for the dataset $D = (x_i, y_i)_i$
of size $N = \lvert D \rvert$ and random batches $B$ of size
$\lvert B \rvert = m$

$$
    \frac1{N} \Bigl( \underbrace{
        \mathbb{E}_{\omega\sim q(\omega\mid \theta)} \log p(D \mid \omega)
        - KL\bigl(q(\omega\mid \theta) \big\| \pi(\omega) \bigr)
    }_{ELBO(\theta)} \Bigr)
    \approx \frac1{\lvert B \rvert}
        \sum_{i\in B} \log p(y_i \mid x_i, \omega^{(1)}_i, \ldots, \omega^{(L)}_i)
        - \sum_{l=1}^L
            \frac{1-p^{(l)}}{2 s^2 N} \|\theta^{(l)}\|_2^2
%             - [\mathrm{in}_{(l)}] \, \mathbb{H}(\mathcal{Ber}(p^{(l)}))
%         + \mathrm{const}
\,, $$
where $\omega_i^{(l)}$ are independently drawn from $q(\omega \mid \theta)$
(one random draw per element in $B$) and $s^2$ is the prior variance.

Thus `weight_decay` should be decreasing with $p$ and $N$:
$$ \lambda = \frac{1-p}{2 s^2 N} \,. $$
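As a small worked example of this relation, the sketch below computes $\lambda$ for two dropout rates. The function name `dropout_weight_decay` and the choice of a unit prior variance are assumptions for illustration only; $N = 20$ matches the size of the toy training set used above.

```python
def dropout_weight_decay(p, prior_var, n_samples):
    """lambda = (1 - p) / (2 * s^2 * N) from the relation above."""
    return (1.0 - p) / (2.0 * prior_var * n_samples)


# hypothetical setting: unit prior variance s^2 = 1 and N = 20 observations
lam_low_p = dropout_weight_decay(p=0.1, prior_var=1.0, n_samples=20)
lam_high_p = dropout_weight_decay(p=0.9, prior_var=1.0, n_samples=20)
```

A higher dropout rate or a larger dataset calls for a smaller penalty, so `lam_high_p` is smaller than `lam_low_p`.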

<br>

#### Question(s) (to ponder in your spare time)

* what happens to the confidence bands, when you increase the number
  of path-wise and pointwise samples?

* what will happen if you change the dropout rate $p$ and keep `n_epochs` at 2000?

* what happens if for $p=\tfrac12$ we use far fewer `n_epochs`?

* how do different settings of `weight_decay` affect the bands?

Try to rebuild the model with different $p \in (0, 1)$ using `build_model(p)`, use
`fit(..., n_epochs=...)`, and then plot the predictive bands.

In [None]:
from mlss2019bdl.plotting import plot1d_bands

# model = fit(build_model(p=...), train, n_epochs=..., weight_decay=..., criterion="mse")
# plot1d_bands(X_domain, sample_function(model, domain, n_samples=101).transpose(0, 2), c="C0")

<br>

In [None]:
model_a = fit(build_model(p=0.15), train, criterion="mse", n_epochs=2000, weight_decay=1e-3)

model_z = fit(build_model(p=0.75), train, criterion="mse", n_epochs=2000, weight_decay=1e-3)

In [None]:
fig = plt.figure(figsize=(12, 5))

samples_a = sample_function(model_a, domain, n_samples=101)
samples_z = sample_function(model_z, domain, n_samples=101)

plot1d_bands(X_domain, samples_a.transpose(0, 2), c="r")
plot1d_bands(X_domain, samples_z.transpose(0, 2), c="b")

In [None]:
model_a = fit(build_model(p=0.50), train, criterion="mse", n_epochs=20, weight_decay=1e-3)

model_z = fit(build_model(p=0.50), train, criterion="mse", n_epochs=200, weight_decay=1e-3)

In [None]:
fig = plt.figure(figsize=(12, 5))

samples_a = sample_function(model_a, domain, n_samples=101)
samples_z = sample_function(model_z, domain, n_samples=101)

plot1d_bands(X_domain, samples_a.transpose(0, 2), c="r")
plot1d_bands(X_domain, samples_z.transpose(0, 2), c="b")

In [None]:
model_a = fit(build_model(p=0.50), train, criterion="mse", n_epochs=2000, weight_decay=1e-5)

model_z = fit(build_model(p=0.50), train, criterion="mse", n_epochs=2000, weight_decay=1e-1)

In [None]:
fig = plt.figure(figsize=(12, 5))

samples_a = sample_function(model_a, domain, n_samples=101)

samples_z = sample_function(model_z, domain, n_samples=101)

plot1d_bands(X_domain, samples_a.transpose(0, 2), c="r")
plot1d_bands(X_domain, samples_z.transpose(0, 2), c="b")

In [None]:
model_a = fit(build_model(p=0.10), train, criterion="mse", n_epochs=2000, weight_decay=1e-3)

model_z = fit(build_model(p=0.90), train, criterion="mse", n_epochs=2000, weight_decay=1e-4)

In [None]:
fig = plt.figure(figsize=(12, 5))

samples_a = sample_function(model_a, domain, n_samples=101)

samples_z = sample_function(model_z, domain, n_samples=101)

plot1d_bands(X_domain, samples_a.transpose(0, 2), c="r")
plot1d_bands(X_domain, samples_z.transpose(0, 2), c="b")

<br>

## (optional) Dropout $2$-d Convolutional layer

Typically, in convolutional neural networks the dropout acts upon the feature
(channel) information and not on the spatial dimensions. Thus entire channels
are dropped out and for $
    x \in \mathbb{R}^{
        [\mathrm{in}]
        \times h
        \times w}
$ and $
    y \in \mathbb{R}^{
        [\mathrm{out}]
        \times h'
        \times w'}
$ the full effect of the `Dropout+Conv2d` layer is

$$
    y_{lij} = ((x \odot m) \ast W_l)_{ij} + b_l
        = b_l + \sum_k \sum_{pq} x_{k i_p j_q} m_k W_{lkpq}
    \,, \tag{conv-2d} $$
    
where i.i.d $m_k \sim \mathcal{Ber}\bigl(\bigl\{0, \tfrac1{1-p}\bigr\}, 1-p\bigr)$,
and indices $i_p$ and $j_q$ represent the spatial location in $x$ that correspond
to the $p$ and $q$ elements in the kernel $
    W\in \mathbb{R}^{
        [\mathrm{out}]
        \times [\mathrm{in}]
        \times h
        \times w}
$ relative to $(i, j)$ coordinates in $y$.
The exact values of $i_p$ and $j_q$ depend on the configuration of the
convolutional layer, e.g. stride, kernel size and dilation.

**(note)** Informative illustrations on the effects of convolution
parameters can be found in [Convolution arithmetic](https://github.com/vdumoulin/conv_arithmetic) 
repo.
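The channel-wise action of the mask $m$ can be sketched in plain Python on nested lists. The helper `channel_dropout` below is hypothetical and only illustrates the masking step, not the convolution; in `pytorch` a comparable channel-wise masking is available as `F.dropout2d(input, p, training)`.

```python
import random


def channel_dropout(x, p, rng):
    """Drop whole channels of x, given as a [channels][height][width] list."""
    keep = 1.0 - p
    out = []
    for channel in x:
        m_k = 1.0 / keep if rng.random() < keep else 0.0  # one draw per channel
        out.append([[value * m_k for value in row] for row in channel])
    return out


rng = random.Random(0)
x = [[[1.0, 2.0], [3.0, 4.0]] for _ in range(8)]  # 8 identical 2x2 channels
y = channel_dropout(x, p=0.5, rng=rng)

# every channel is either all zeros or the input scaled by 1 / (1 - p) = 2
zeros, scaled = [[0.0, 0.0], [0.0, 0.0]], [[2.0, 4.0], [6.0, 8.0]]
channel_ok = [ch == zeros or ch == scaled for ch in y]
```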

<br>

## (optional) A brief reminder on Bayesian and Variational Inference

Bayesian Inference is a principled framework of reasoning about uncertainty.

In Bayesian Inference (**BI**) we *assume* that the observation
data $D$ follows a *model* $m$ with data generating distribution
$p(D\mid m, \omega)$ *governed by unknown parameters* $\omega$.
The goal of **BI** is to reason about the model and/or its parameters,
and new data given the observed data $D$ and our assumptions, i.e.
to seek the **posterior** parameter and predictive distributions:

$$\begin{align}
    p(d \mid D, m)
        % &= \mathbb{E}_{
        %     \omega \sim p(\omega \mid D, m)
        % } p(d \mid D, \omega, m)
        &= \int p(d \mid D, \omega, m) p(\omega \mid D, m) d\omega
    \,, \\
    p(\omega \mid D, m)
        &= \frac{p(D\mid \omega, m) \, \pi(\omega \mid m)}{p(D\mid m)}
    \,.
\end{align}
$$

* the **prior** distribution $\pi(\omega \mid m)$ reflects our belief
  before having made the observations

* the data distribution $p(D \mid \omega, m)$ reflects our assumptions
  about the data generating process, and determines the parameter
  **likelihood** (Gaussian, Categorical, Poisson)

Unless the distributions and likelihoods are conjugate, the posterior in
Bayesian inference is typically intractable and it is common to resort
to **Variational Inference** or **Monte Carlo** approximations.

The key idea of Variational Inference is to seek an approximation $q(\omega)$
to the intractable posterior $p(\omega \mid D, m)$, via a variational
optimization problem over some tractable family of distributions $\mathcal{Q}$:

$$
    q^*(\omega)
        \in \arg \min_{q\in \mathcal{Q}} \mathrm{KL}(q(\omega) \| p(\omega \mid D, m))
    \,, $$

where the Kullback-Leibler divergence of $Q$ from $P$ ($Q\ll P$)
with densities $q$ and $p$, respectively, is given by

$$
    \mathrm{KL}(q(\omega) \| p(\omega))
%         = \mathbb{E}_{\omega \sim Q} \log \tfrac{dQ}{dP}(\omega)
        = \mathbb{E}_{\omega \sim q(\omega)}
            \log \tfrac{q(\omega)}{p(\omega)}
    \,. \tag{kl-div} $$
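For discrete distributions the expectation in (kl-div) is a finite sum, which makes for a compact illustration. The sketch below is plain Python with made-up probability vectors; terms with $q(\omega) = 0$ contribute nothing and are skipped.

```python
import math


def kl_divergence(q, p):
    """KL(q || p) = E_q log(q / p) for discrete distributions given as lists."""
    return sum(q_i * math.log(q_i / p_i) for q_i, p_i in zip(q, p) if q_i > 0)


q = [0.5, 0.5, 0.0]
p = [0.25, 0.25, 0.5]

kl_qp = kl_divergence(q, p)   # finite: q puts no mass where p is zero
kl_qq = kl_divergence(q, q)   # zero: a distribution does not diverge from itself
```

Swapping the arguments would fail here, since `p` puts mass on an outcome that `q` excludes, which reflects the asymmetry of the divergence.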


Note that the family of variational approximations $\mathcal{Q}$ can be
structured **arbitrarily**: point-mass, products, mixture, dependent on
input, having mixed hierarchical structure, -- any valid distribution.

Although computing the divergence w.r.t. the unknown posterior
is still hard and intractable, it is possible to do away with it
through the following identity, which is based on the Bayes rule.

For **any** $q(\omega) \ll p(\omega \mid D, m)$ and any model $m$

$$
\begin{align}
    \overbrace{
        \log p(D \mid m)
    }^{\text{evidence}}
        &= \overbrace{
            \mathbb{E}_{\omega \sim q} \log p(D\mid \omega, m)
        }^{\text{expected conditional likelihood}}
        - \overbrace{
            \mathrm{KL}(q(\omega)\| \pi(\omega \mid m))
        }^{\text{proximity to prior belief}}
        \\
        &+ \underbrace{
            \mathrm{KL}(q(\omega)\| p(\omega \mid D, m))
        }_{\text{posterior approximation}}
\end{align}
    \,. \tag{master-identity}
$$

Instead of minimizing the divergence of the approximation from the posterior,
we maximize the **Evidence Lower Bound** with respect to $q(\omega)$:

$$
    q^* \in
    \arg\max_{q\in Q}
        \mathcal{L}(q) = 
            \mathbb{E}_{\omega \sim q} \log p(D\mid \omega, m)
            - \mathrm{KL}(q(\omega)\| \pi(\omega \mid m))
    \,. \tag{max-ELBO} $$

* the expected $\log$-likelihood favours $q$ that place their mass on
parameters $\omega$ that explain $D$ under the specified model $m$.

* the negative KL-divergence discourages the approximation $q$
from straying too far away from the prior belief $\pi$ under $m$.
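Both the identity (master-identity) and the resulting lower bound can be checked numerically on a toy model where the parameter takes only three values, so that the posterior is available in closed form. All numbers below are hypothetical and serve only as an illustration.

```python
import math


def kl_divergence(q, p):
    return sum(q_i * math.log(q_i / p_i) for q_i, p_i in zip(q, p) if q_i > 0)


# hypothetical toy model with three possible parameter values
prior = [0.5, 0.3, 0.2]          # pi(omega)
likelihood = [0.10, 0.40, 0.70]  # p(D | omega) for the observed D
q = [0.2, 0.3, 0.5]              # an arbitrary approximation q(omega)

evidence = sum(l * pr for l, pr in zip(likelihood, prior))            # p(D)
posterior = [l * pr / evidence for l, pr in zip(likelihood, prior)]   # Bayes rule

expected_loglik = sum(q_i * math.log(l) for q_i, l in zip(q, likelihood))
elbo = expected_loglik - kl_divergence(q, prior)
gap = kl_divergence(q, posterior)

# log p(D) = ELBO + KL(q || posterior), hence ELBO <= log p(D)
identity_error = abs(math.log(evidence) - (elbo + gap))
```

Since `gap` is non-negative, pushing `elbo` up for a fixed model is the same as pulling `q` closer to the posterior.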

We usually consider the following setup (conditioning on model $m$ is omitted):
* the likelihood factorizes $
p(D \mid \omega)
    = \prod_i p(y_i, x_i \mid \omega)
    \propto \prod_i p(y_i \mid x_i, \omega)
$
for $D = (x_i, y_i)_{i=1}^N$

* the approximation is parameterized by $\theta$: $q(\omega\mid \theta)$

* the prior on $\omega$ itself depends on hyper-parameters $\lambda$, that
  can be fixed, or variable ($\pi(\omega \mid \lambda)$).

In this case the variational objective (evidence lower bound)

$$
    \log p(D\mid \lambda )
        \geq \mathcal{L}(\theta, \lambda)
            = \mathbb{E}_{\omega \sim q(\omega \mid \theta)}
                \sum_i \log p(y_i \mid x_i, \omega)
            - KL(q(\omega \mid \theta) \| \pi(\omega \mid \lambda))
    $$

is maximized with respect to $\theta$ (to approximate the posterior).

Priors can be
* *subjective*, i.e. reflecting prior beliefs (but not arbitrary),
* *objective*, i.e. reflecting our lack of knowledge,
* *empirical*, i.e. learnt from data (we also optimize over hyper-parameters $\lambda$)

The stochastic variant of ELBO is formed by randomly batching
the dataset $D$:

$$
    \mathcal{L}(\theta, \lambda)
        \approx \mathcal{L}_\mathrm{SGVB}(\theta, \lambda)
        = \lvert D \rvert \biggl(
            \tfrac1{\lvert B \rvert}
                \sum_{b \in B} \mathbb{E}_{\omega \sim q(\omega \mid \theta)}
                    \log p(y_b \mid x_b, \omega)
        \biggr)
        - KL(q(\omega \mid \theta) \| \pi(\omega \mid \lambda))
    \,. $$
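The rescaling by $\lvert D \rvert / \lvert B \rvert$ is what makes the batch estimate of the likelihood term unbiased. A minimal plain-Python sketch, assuming batches are drawn uniformly without replacement and using made-up per-example terms, verifies this by averaging over every possible batch:

```python
from itertools import combinations


def batch_estimate(values, batch):
    """|D| times the batch average of per-example terms."""
    return len(values) * sum(values[i] for i in batch) / len(batch)


# hypothetical per-example log-likelihood terms log p(y_i | x_i, omega)
log_liks = [-0.3, -1.2, -0.7, -2.5, -0.1]
full_sum = sum(log_liks)

# average the estimate over every possible batch of size 2
batches = list(combinations(range(len(log_liks)), 2))
mean_estimate = sum(batch_estimate(log_liks, b) for b in batches) / len(batches)
```

Individual batch estimates differ from `full_sum`, but their average over all batches matches it up to rounding.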

* Stochastic optimization follows noisy unbiased gradient estimates, which are
usually cheap, allow escaping from local optima, and optimize the objective in
expectation.

In order to get a gradient of $
    F_\theta = \mathbb{E}_{\omega \sim q(\omega \mid \theta)} f(\omega)
$ w.r.t $\theta$ we use either:

###### (REINFORCE)
$
\nabla_\theta F_\theta
    = \mathbb{E}_{\omega \sim q(\omega \mid \theta)}
        (f(\omega) - b_\theta) \nabla_\theta \log q(\omega \mid \theta)
$
* for some $b_\theta$ that is used to control variance

###### (reparameterization)
$
\nabla_\theta F_\theta
    = \nabla_\theta \mathbb{E}_{\varepsilon \sim q(\varepsilon)}
        f(g(\theta; \varepsilon))
    = \mathbb{E}_{\varepsilon \sim q(\varepsilon)}
        \nabla_\theta g(\theta; \varepsilon)
            \nabla_\omega f(\omega) \big\vert_{\omega = g(\theta; \varepsilon)}
$
* when there exist a distribution $q(\varepsilon)$ and a differentiable $g$ such that sampling from
$q(\omega \mid \theta)$ is equivalent to $\omega = g(\theta; \varepsilon)$
with $\varepsilon \sim q(\varepsilon)$.
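Both estimators can be compared on a case with a known answer. The sketch below is a plain-Python illustration under the assumption $q(\omega \mid \theta) = \mathcal{N}(\theta, 1)$ and $f(\omega) = \omega^2$, for which $F_\theta = \theta^2 + 1$ and the exact gradient is $2\theta$; the function names are hypothetical.

```python
import random


def f(omega):
    return omega ** 2


def reinforce_grad(theta, n, rng):
    """Score-function estimate for omega ~ N(theta, 1)."""
    baseline = theta ** 2 + 1.0            # E f(omega), used as b_theta
    total = 0.0
    for _ in range(n):
        omega = rng.gauss(theta, 1.0)
        score = omega - theta              # d/dtheta log N(omega; theta, 1)
        total += (f(omega) - baseline) * score
    return total / n


def reparam_grad(theta, n, rng):
    """Pathwise estimate with omega = g(theta; eps) = theta + eps."""
    total = 0.0
    for _ in range(n):
        eps = rng.gauss(0.0, 1.0)
        omega = theta + eps
        total += 1.0 * 2.0 * omega         # dg/dtheta * df/domega
    return total / n


rng = random.Random(0)
theta = 1.5
exact = 2.0 * theta                        # d/dtheta (theta^2 + 1)
grad_rf = reinforce_grad(theta, 50_000, rng)
grad_rp = reparam_grad(theta, 50_000, rng)
```

Both estimates land near `exact`; in this example the reparameterization estimate has the smaller variance of the two.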

The variational approximation might yield high dimensional integrals,
which are slow/prohibitive to compute. To make the computations faster
without foregoing much of the precision, we may use Monte Carlo methods:

$$
    \mathbb{E}_{\omega \sim q(\omega\mid \theta)} \, f(\omega)
        \overset{\text{MC}}{\approx}
            \frac1{\lvert \mathcal{W}\rvert}
                \sum_{\omega \in \mathcal{W}} f(\omega)
    \,,
$$

where $\mathcal{W} = (\omega_b)_{b=1}^B$ is a sample of independent draws
from $q(\omega\mid \theta)$.

If we also approximate the expectation in the gradient of the ELBO
via Monte Carlo we get the **doubly stochastic variational objective**:

$$
    \nabla_\theta \mathcal{L}_\mathrm{DSVB}(\theta, \lambda)
        \approx
            \lvert D \rvert \biggl(
                \tfrac1{\lvert B \rvert}
                    \sum_{b \in B}
                        \mathop{gradient}(x_b, y_b)
            \biggr)
            - \nabla_\theta KL(q(\omega \mid \theta) \| \pi(\omega \mid \lambda))
    \,, $$

where `gradient` is $
    \nabla_\theta
        \mathbb{E}_{\omega \sim q(\omega \mid \theta)}
            \log p(y \mid x, \omega)
$ using one of the approaches above, typically approximated using
one independent draw of $\omega$ per $b\in B$.

We can use a similar sampling approach to compute the gradient of the divergence term.

A good overview of Bayesian Inference can be found at [bdl101.ml](http://bdl101.ml/),
in [this lecture](http://mlg.eng.cam.ac.uk/zoubin/talks/lect1bayes.pdf),
[this paper](https://arxiv.org/abs/1206.7051.pdf), or
[this review](https://arxiv.org/abs/1601.00670.pdf),
among other great resources. It is also possible to consult
the references at [wiki](https://en.wikipedia.org/wiki/Bayesian_inference).

Monte Carlo is useful beyond the gradient of the ELBO: we can use it
to estimate the divergence term in the ELBO or, for example,
expectations under the predictive distribution:

$$
\begin{align}
    \mathbb{E}_{y\sim p(y\mid x, D, m)} \, g(y)
        &\overset{\text{BI}}{=}
            \mathbb{E}_{\omega\sim p(\omega \mid D, m)}
                \mathbb{E}_{y\sim p(y\mid x, \omega, D, m)} \, g(y) 
        \\
        &\overset{\text{VI}}{\approx}
            \mathbb{E}_{\omega\sim q(\omega)}
                \mathbb{E}_{y\sim p(y\mid x, \omega, D, m)} \, g(y)
        \\
        &\overset{\text{MC}}{\approx}
%             \hat{\mathbb{E}}_{\omega \sim \mathcal{W}}
%                 \mathbb{E}_{y\sim p(y\mid x, \omega, D, m)} \, g(y)
            \frac1{\lvert \mathcal{W}\rvert} \sum_{\omega \in \mathcal{W}}
                \mathbb{E}_{y\sim p(y\mid x, \omega, D, m)} \, g(y)
    \,,
\end{align}
$$

where $\mathcal{W} = (\omega_b)_{b=1}^B \sim q(\omega)$
-- iid samples from the variational approximation.

<br>