# Bayesian Regression

Regression is one of the most common and basic supervised learning tasks in machine learning. It is used to fit a function to data. Linear regression generally takes the form:
\begin{equation}
y = \beta_1 x + \beta_0 + \epsilon
\end{equation}
where $\epsilon$ is observation noise and we would like to learn $\beta_0$ and $\beta_1$. Let's first write an ordinary regression as you would in PyTorch, learning point estimates for the parameters. Then we'll see how to capture uncertainty by doing a Bayesian regression over the same parameters.
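
For a single feature, the point estimates have a closed form, which gives a useful reference for what a gradient-based fit should recover. The sketch below uses ordinary least squares in plain Python. The function name `fit_line` and the toy points are illustrative and not part of the tutorial code.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = beta_1 * x + beta_0 (single feature)."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    # slope = covariance(x, y) / variance(x)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    beta_1 = sxy / sxx
    beta_0 = y_mean - beta_1 * x_mean
    return beta_1, beta_0


# noise-free points on the line y = 3x + 1 (illustrative data)
xs = [-2.0, -1.0, 0.0, 1.0, 2.0]
ys = [3 * x + 1 for x in xs]
beta_1, beta_0 = fit_line(xs, ys)
```

On noise-free data the fit recovers the slope and intercept exactly. With noise it returns a single best guess and says nothing about how far off that guess might be.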


## Setup
As always, let's begin by importing the modules we'll need.

In [1]:
import numpy as np
import argparse
import torch
import torch.nn as nn
from torch.nn.functional import normalize  # noqa: F401

from torch.autograd import Variable

import pyro
from pyro import random_module
from pyro.distributions import DiagNormal, Bernoulli  # noqa: F401
from pyro.infer import SVI
from pyro.optim import Adam

## Data
We'll generate a linear toy dataset with $\beta_1 = 3$ and $\beta_0 = 1$ as follows:

In [8]:
def build_linear_dataset(N, noise_std=0.1):
    X = np.linspace(-6, 6, num=N)
    y = 3 * X + 1 + np.random.normal(0, noise_std, size=N)
    X = X.reshape((N, 1))
    y = y.reshape((N, 1))
    X, y = Variable(torch.Tensor(X)), Variable(torch.Tensor(y))
    return torch.cat((X, y), 1)

Now let's define our regression model in the form of a neural net. We'll use PyTorch's `nn` module for this. Our input $X$ is a matrix of size $N \times p$ and our output $y$ is a vector of size $N \times 1$. `nn.Linear(1, 1)` defines a linear module of the form $Xw + b$ where $w$ is the weight matrix (of size $p \times 1$, here with $p = 1$) and $b$ is the additive bias.

In [10]:
class RegressionModel(nn.Module):
    def __init__(self):
        super(RegressionModel, self).__init__()
        self.linear = nn.Linear(1, 1)

    def forward(self, x):
        return self.linear(x)

regression = RegressionModel()
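
To make the shapes concrete, the map $Xw + b$ that the linear module computes can be sketched without PyTorch. This is a plain-Python illustration using nested lists. The helper `linear_forward` and the values of `X`, `w` and `b` are hypothetical.

```python
def linear_forward(X, w, b):
    """Compute X w + b for X of shape N x p, w of shape p x 1, and a scalar bias b."""
    return [[sum(x_ij * w_j[0] for x_ij, w_j in zip(row, w)) + b] for row in X]


X = [[-1.0], [0.0], [2.0]]       # N = 3 rows, p = 1 feature
w = [[3.0]]                      # p x 1 weight matrix
b = 1.0                          # additive bias
y_hat = linear_forward(X, w, b)  # N x 1 output, one prediction per row of X
```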

## Training
We will use the squared error as our loss (summed over the data points rather than averaged, since we pass `size_average=False`) and Adam as our optimizer. We would like to optimize the parameters of the `regression` neural net above. Since our toy dataset does not have a lot of noise, we will use a larger learning rate of `0.01` and run for 1000 epochs.
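
Before running the PyTorch version, it can help to see what one Adam update does. The following is a minimal plain-Python sketch of the standard Adam rule applied to the summed squared error of a one-feature linear model. The function name `adam_fit`, the zero initialization, the step count and the noise-free data are assumptions made for illustration; this is not the code path `torch.optim.Adam` actually runs.

```python
import math


def adam_fit(xs, ys, lr=0.01, steps=2000, beta1=0.9, beta2=0.999, eps=1e-8):
    """Fit y = w * x + b by minimizing the summed squared error with Adam."""
    params = [0.0, 0.0]  # [w, b]
    m = [0.0, 0.0]       # first-moment (mean) estimates of the gradients
    v = [0.0, 0.0]       # second-moment (uncentered variance) estimates
    for t in range(1, steps + 1):
        w, b = params
        residuals = [w * x + b - y for x, y in zip(xs, ys)]
        grads = [
            sum(2 * r * x for r, x in zip(residuals, xs)),  # d loss / d w
            sum(2 * r for r in residuals),                  # d loss / d b
        ]
        for i, g in enumerate(grads):
            m[i] = beta1 * m[i] + (1 - beta1) * g
            v[i] = beta2 * v[i] + (1 - beta2) * g * g
            m_hat = m[i] / (1 - beta1 ** t)  # bias-corrected moments
            v_hat = v[i] / (1 - beta2 ** t)
            params[i] -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return params


N = 100
xs = [-6 + 12 * i / (N - 1) for i in range(N)]
ys = [3 * x + 1 for x in xs]  # noise-free version of the toy dataset
w, b = adam_fit(xs, ys)
```

Because each step is normalized by the running gradient magnitude, the early updates move each parameter by roughly `lr` per step regardless of the scale of the loss.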

In [14]:
N = 100  # size of toy data
p = 1  # number of features
loss_fn = torch.nn.MSELoss(size_average=False)
optimizer = torch.optim.Adam(regression.parameters(), lr=0.01)
num_epochs = 1000

def main():
    data = build_linear_dataset(N)
    # split the N x 2 tensor into inputs (first column) and targets (last column)
    x, y = data[:, :-1], data[:, -1:]
    for j in range(num_epochs):
        y_pred = regression(x)
        loss = loss_fn(y_pred, y)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        if j % 100 == 0:
            print(loss.data[0])
    # Inspect learned parameters
    print(list(regression.named_parameters()))

if __name__ == '__main__':
    main()

1.420830369
0.917385041714
0.917374432087
0.917373895645
0.91737383604
0.91737395525
0.917375266552
0.917375266552
0.91737395525
0.91737383604
[('linear.weight', Parameter containing:
 3.0021
[torch.FloatTensor of size 1x1]
), ('linear.bias', Parameter containing:
 1.0078
[torch.FloatTensor of size 1]
)]


Not too bad - you can see that the neural net learned parameters that were pretty close to the ground truth $w = 3, b = 1$.  However, what if our data were noisy? How confident are we that the learned parameters reflect the true values?

Point estimates alone cannot answer this question. It is a fundamental limitation of deep learning that we can address with probabilistic modeling.  Instead of only learning the point estimates, we learn a _distribution_ over the possible parameters.  In other words, we'll learn two values for each parameter: $\mu$, the mean (i.e. our best estimate of the value), and $\sigma$, our uncertainty about that estimate.
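
As a small illustration of what $\mu$ and $\sigma$ mean, consider a slope-only model $y = wx + \epsilon$ with a normal prior on $w$ and known noise. In that special case the posterior is available exactly, so no variational inference is needed. The sketch below uses this conjugate shortcut rather than the method of this tutorial. The names `slope_posterior` and `make_data`, the noise level and the sample sizes are assumptions for the example.

```python
import math
import random


def slope_posterior(xs, ys, noise_std, prior_mu=0.0, prior_sigma=1.0):
    """Exact Normal posterior over w for y = w * x + noise, w ~ N(prior_mu, prior_sigma)."""
    precision = 1 / prior_sigma ** 2 + sum(x * x for x in xs) / noise_std ** 2
    mu = (prior_mu / prior_sigma ** 2
          + sum(x * y for x, y in zip(xs, ys)) / noise_std ** 2) / precision
    return mu, math.sqrt(1 / precision)


random.seed(0)
true_w, noise_std = 3.0, 1.0


def make_data(n):
    xs = [random.uniform(-1, 1) for _ in range(n)]
    ys = [true_w * x + random.gauss(0, noise_std) for x in xs]
    return xs, ys


mu_small, sigma_small = slope_posterior(*make_data(5), noise_std=noise_std)
mu_large, sigma_large = slope_posterior(*make_data(500), noise_std=noise_std)
```

With more observations the posterior $\sigma$ shrinks, which is the kind of confidence statement a point estimate cannot provide.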

Instead of learning these parameters directly, we'll put a prior over them and learn a posterior distribution given our observed data.  To do this, we'll use Pyro's `random_module()` to lift the parameters we would like to learn.  `random_module()` replaces the original parameters of the neural net with random variables sampled from our prior.

In [16]:
mu = Variable(torch.zeros(1, 1))
sigma = Variable(torch.ones(1, 1))
# define a prior we want to sample from
prior = DiagNormal(mu, sigma)
# overload the parameters in the regression nn with samples from the prior
lifted_module = pyro.random_module("regression_module", regression, prior)
# sample a nn from the prior
sampled_nn = lifted_module()

Regression (
  (linear): Linear (1 -> 1)
)
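
The idea of lifting can be mimicked in a few lines of plain Python: each call to the lifted object draws fresh parameters from the prior and returns a regressor that uses them. This is only a conceptual sketch. The `lift` helper, the `params` attribute and the parameter names are hypothetical and do not reflect how Pyro implements `random_module()`.

```python
import random


def lift(param_names, prior_mu=0.0, prior_sigma=1.0):
    """Return a factory that draws a fresh linear model from the prior on each call."""
    def sample_model():
        # every named parameter is replaced by a draw from N(prior_mu, prior_sigma)
        params = {name: random.gauss(prior_mu, prior_sigma) for name in param_names}

        def forward(x):
            return params["weight"] * x + params["bias"]

        forward.params = params  # expose the sampled values for inspection
        return forward
    return sample_model


random.seed(0)
lifted = lift(["weight", "bias"])
model_a, model_b = lifted(), lifted()  # two different regressors from the same prior
```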

## Model
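Now we want to define our model and guide to do inference. The model places a unit normal prior $\mathcal{N}(\mu=0, \sigma=1)$ on the weight and the bias, lifts the `regression` net with `random_module()`, and samples a regressor from that prior. Running the sampled regressor on the inputs gives the predicted mean of $y$. We then score the observed $y$ against a Gaussian centered on that prediction. This is done via an `observe()` statement.

The guide is the approximating distribution which will be used to sample when inference is run. It samples the weight and bias from `DiagNormal()` distributions whose $\mu$ and $\sigma$ are learnable parameters registered with `pyro.param()`.

The cell below is a sketch under a few assumptions: `random_module()` is taken to accept a dictionary that maps parameter names to distributions, the observation noise is fixed at 0.1 to match `noise_std`, and the parameter names and initial values in the guide are illustrative choices.

In [2]:
from torch.nn.functional import softplus


def model(data):
    # data is the N x 2 tensor from build_linear_dataset: inputs first, targets last
    x_data, y_data = data[:, :-1], data[:, -1:]

    # unit normal priors over the weight and the bias
    w_prior = DiagNormal(Variable(torch.zeros(1, 1)), Variable(torch.ones(1, 1)))
    b_prior = DiagNormal(Variable(torch.zeros(1)), Variable(torch.ones(1)))
    # assumption: random_module accepts a dict keyed by parameter name
    priors = {'linear.weight': w_prior, 'linear.bias': b_prior}
    lifted_module = pyro.random_module("regression_module", regression, priors)

    # sample a regressor from the prior and run it on the inputs
    lifted_regression = lifted_module()
    y_mu = lifted_regression(x_data)
    y_sigma = Variable(0.1 * torch.ones(y_mu.size()))

    # score the predictions against the observed targets
    pyro.observe("obs", DiagNormal(y_mu, y_sigma), y_data)


def guide(data):
    # learnable variational parameters (names and initial values are illustrative)
    w_mu = pyro.param("guide_w_mu", Variable(torch.randn(1, 1), requires_grad=True))
    w_raw = pyro.param("guide_w_raw_sigma", Variable(-3.0 * torch.ones(1, 1), requires_grad=True))
    b_mu = pyro.param("guide_b_mu", Variable(torch.randn(1), requires_grad=True))
    b_raw = pyro.param("guide_b_raw_sigma", Variable(-3.0 * torch.ones(1), requires_grad=True))

    # softplus keeps the standard deviations positive
    dists = {'linear.weight': DiagNormal(w_mu, softplus(w_raw)),
             'linear.bias': DiagNormal(b_mu, softplus(b_raw))}
    # same name as in the model so the sample sites line up
    lifted_module = pyro.random_module("regression_module", regression, dists)
    return lifted_module()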

## Inference
Pyro makes it easy to do variational inference with the objective and estimator of your choice. For this example, we'll optimize the ELBO using the Adam optimizer with a learning rate of 0.01.

In [2]:
adam = Adam({"lr": 0.01})
svi = SVI(model, guide, adam, loss="ELBO")
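
To make the objective concrete, the ELBO is the expectation under the guide of $\log p(y, w) - \log q(w)$, and it can be estimated by averaging over samples from the guide. The plain-Python sketch below does this for a slope-only model. The helper names, the toy data, the fixed guide values and the sample count are assumptions for illustration and are not how Pyro computes its estimator internally.

```python
import math
import random


def log_normal(value, mu, sigma):
    """Log density of N(mu, sigma) evaluated at value."""
    return -0.5 * math.log(2 * math.pi * sigma ** 2) - (value - mu) ** 2 / (2 * sigma ** 2)


def elbo_estimate(xs, ys, q_mu, q_sigma, noise_std=0.1, num_samples=200):
    """Monte Carlo estimate of E_q[log p(y, w) - log q(w)] for y = w * x + noise."""
    total = 0.0
    for _ in range(num_samples):
        w = random.gauss(q_mu, q_sigma)      # sample from the guide
        log_prior = log_normal(w, 0.0, 1.0)  # prior w ~ N(0, 1)
        log_lik = sum(log_normal(y, w * x, noise_std) for x, y in zip(xs, ys))
        total += log_prior + log_lik - log_normal(w, q_mu, q_sigma)
    return total / num_samples


random.seed(0)
xs = [-1.0, -0.5, 0.0, 0.5, 1.0]
ys = [3 * x for x in xs]
elbo_good = elbo_estimate(xs, ys, q_mu=3.0, q_sigma=0.05)  # guide near the true slope
elbo_bad = elbo_estimate(xs, ys, q_mu=0.0, q_sigma=0.05)   # guide far from it
```

A guide concentrated near the true slope scores a much higher ELBO than one centered far away, which is the signal the optimizer follows.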


## Training
To train, we will iterate over the number of epochs and feed the data to our SVI object. We'll print the loss every 100 epochs, and plot the ELBO during optimization.
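
In Pyro, each iteration of this loop would be a call to `svi.step(data)`. To show what such a loop is doing, here is a hand-written stand-in in plain Python: stochastic gradient ascent on the ELBO with reparameterized samples, for a slope-only model with prior $w \sim \mathcal{N}(0, 1)$. The function `svi_fit`, the noise level of 1.0 (chosen so plain gradient ascent stays stable), the learning rate and the toy data are all assumptions for the sketch.

```python
import math
import random


def svi_fit(xs, ys, noise_std=1.0, lr=0.01, num_epochs=2000, num_samples=10):
    """Fit a guide q(w) = N(q_mu, exp(q_rho)) by stochastic gradient ascent on the ELBO."""
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    q_mu, q_rho = 0.0, 0.0  # q_rho is the log of the guide's standard deviation
    for epoch in range(num_epochs):
        q_sigma = math.exp(q_rho)
        grad_mu, grad_rho = 0.0, 0.0
        for _ in range(num_samples):
            eps = random.gauss(0, 1)
            w = q_mu + q_sigma * eps  # reparameterized sample from the guide
            # derivative of log p(y, w) with respect to w, with prior w ~ N(0, 1)
            dlogp_dw = -w + (sxy - w * sxx) / noise_std ** 2
            grad_mu += dlogp_dw
            # chain rule through w, plus 1.0 from the entropy term log(q_sigma) = q_rho
            grad_rho += dlogp_dw * eps * q_sigma + 1.0
        q_mu += lr * grad_mu / num_samples
        q_rho += lr * grad_rho / num_samples
    return q_mu, math.exp(q_rho)


random.seed(0)
xs = [-1.0, -0.5, 0.0, 0.5, 1.0]
ys = [3 * x for x in xs]
q_mu, q_sigma = svi_fit(xs, ys)

# exact posterior for comparison: precision = 1 + sum(x^2) / noise_std^2
exact_mu = 7.5 / 3.5
exact_sigma = 1 / math.sqrt(3.5)
```

In this conjugate setting the learned guide should land close to the exact posterior, which makes it a convenient sanity check before moving to models where no closed form exists.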


We can evaluate the results by inspecting the parameters of the learned approximate posterior. Let's take a look!

In [None]:
print(pyro.get_param_store()._params)
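
Once a mean and standard deviation have been read off for a parameter, they can be turned into an interval. The sketch below assumes the approximate posterior is Normal and uses made-up values. The helper `credible_interval` and the numbers passed to it are illustrative and are not results from this tutorial.

```python
def credible_interval(mu, sigma, z=1.96):
    """Central interval of a Normal(mu, sigma) posterior; z = 1.96 covers about 95%."""
    return mu - z * sigma, mu + z * sigma


# made-up posterior parameters, for illustration only
low, high = credible_interval(mu=3.0, sigma=0.1)
```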


See the full code on [GitHub](https://github.com/uber/pyro/blob/dev/examples/bayesian_regression.py).