# Bayesian Regression

Regression is one of the most common and basic supervised learning tasks in machine learning: it fits a function to data. Simple linear regression takes the form:
\begin{equation}
y = \beta_1 x + \beta_0 + \epsilon
\end{equation}
where $\epsilon$ is a noise term and we would like to learn the intercept $\beta_0$ and the slope $\beta_1$. Let's first write an ordinary regression as you would in PyTorch and learn point estimates for the parameters. Then we'll see how to capture uncertainty by doing a Bayesian regression over the same parameters.
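
Before reaching for a neural net, it can help to see the point-estimate problem solved directly. The following is a minimal NumPy sketch, not part of the Pyro example: it generates data from the same line used later in this tutorial and recovers the slope and intercept by ordinary least squares. The seed and the names `design`, `beta_1_hat`, and `beta_0_hat` are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)  # seed chosen only for reproducibility

# Toy data from y = 3x + 1 plus Gaussian noise.
x = np.linspace(-6, 6, num=100)
y = 3.0 * x + 1.0 + rng.normal(0.0, 0.1, size=x.shape)

# Design matrix with a column of ones so the intercept is estimated too.
design = np.column_stack([x, np.ones_like(x)])

# Ordinary least squares: minimise ||design @ beta - y||^2.
beta, *_ = np.linalg.lstsq(design, y, rcond=None)
beta_1_hat, beta_0_hat = beta

print("slope:", beta_1_hat, "intercept:", beta_0_hat)
```

This gives a single best-fit value per parameter and says nothing about how uncertain those values are, which is the gap the Bayesian treatment is meant to fill.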


## Setup
As always, let's begin by importing the modules we'll need.

In [3]:
import numpy as np
import argparse
import torch
import torch.nn as nn
from torch.nn.functional import normalize, softplus  # noqa: F401

from torch.autograd import Variable

import pyro
from pyro import random_module
from pyro.distributions import DiagNormal, Bernoulli  # noqa: F401
from pyro.infer import SVI
from pyro.optim import Adam

## Data
We'll generate a linear toy dataset with $\beta_1 = 3$ and $\beta_0 = 1$ as follows:

In [8]:
def build_linear_dataset(N, noise_std=0.1):
    X = np.linspace(-6, 6, num=N)
    y = 3 * X + 1 + np.random.normal(0, noise_std, size=N)
    X = X.reshape((N, 1))
    y = y.reshape((N, 1))
    X, y = Variable(torch.Tensor(X)), Variable(torch.Tensor(y))
    return torch.cat((X, y), 1)
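
Note that `build_linear_dataset` returns a single $N \times 2$ tensor with the features in the first column and the targets in the last one. As a rough illustration of that layout, here is a NumPy analogue (an assumption made for readability, since the tutorial itself uses `torch.cat`) showing how the two parts can be sliced back out:

```python
import numpy as np

N = 5
X = np.linspace(-6, 6, num=N).reshape((N, 1))
y = 3 * X + 1  # noise omitted so the values are easy to read

# Concatenate along axis 1, mirroring torch.cat((X, y), 1).
data = np.concatenate((X, y), axis=1)

x_data = data[:, :-1]  # all columns except the last: features, shape (N, 1)
y_data = data[:, -1]   # last column: targets, shape (N,)

print(data.shape, x_data.shape, y_data.shape)
```

The same `[:, :-1]` and `[:, -1]` slices appear later in the model, so it is worth keeping this layout in mind.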

# Regression
Now let's define our regression model in the form of a neural net. We'll use PyTorch's `nn` module for this. Our input $X$ is a matrix of size $N \times p$ and our output $y$ is a vector of size $N \times 1$. `nn.Linear(1, 1)` defines a linear module of the form $Xw + b$ for a single input feature and a single output, where $w$ is the weight matrix and $b$ is the additive bias.

In [10]:
class RegressionModel(nn.Module):
    def __init__(self):
        super(RegressionModel, self).__init__()
        self.linear = nn.Linear(1, 1)

    def forward(self, x):
        return self.linear(x)

regression_model = RegressionModel()
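
To make the shapes concrete, here is a small NumPy sketch of the affine map $Xw + b$ that the linear module computes. The weight and bias values are made up for illustration. One detail worth knowing is that PyTorch's `nn.Linear` stores its weight with shape `(out_features, in_features)` and applies its transpose, which for a single feature and a single output is the same $1 \times 1$ matrix.

```python
import numpy as np

N, p = 4, 1
X = np.arange(N * p, dtype=float).reshape(N, p)  # inputs, shape (N, p)
w = np.array([[3.0]])                            # weight, shape (p, 1)
b = np.array([1.0])                              # bias, broadcast over rows

y_hat = X @ w + b                                # predictions, shape (N, 1)
print(y_hat.ravel())                             # values 1, 4, 7, 10
```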

## Training
We will use MSE as our loss (summed over the data points rather than averaged, since we pass `size_average=False`) and Adam as our optimizer. We would like to optimize the parameters of the `regression_model` neural net above. Since our toy dataset does not have a lot of noise, we will use a relatively large learning rate of `0.01` and run for 1000 epochs.
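
If you are curious what the optimizer is doing under the hood, the following is a hand-written NumPy sketch of the standard Adam update applied to the summed squared error of a one-feature linear model. It is only an approximation of the PyTorch setup: the zero initialisation, the seed, and the variable names are assumptions (`nn.Linear` initialises its parameters randomly), and the gradients are written out by hand instead of using autograd.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-6, 6, num=100)
y = 3.0 * x + 1.0 + rng.normal(0.0, 0.1, size=x.shape)

params = np.zeros(2)            # [w, b], initialised at zero for reproducibility
m = np.zeros(2)                 # running mean of gradients
v = np.zeros(2)                 # running mean of squared gradients
lr, beta1, beta2, eps = 0.01, 0.9, 0.999, 1e-8

for t in range(1, 1001):
    w, b = params
    residual = w * x + b - y
    # Gradient of the summed squared error with respect to [w, b].
    grad = np.array([2.0 * np.sum(residual * x), 2.0 * np.sum(residual)])
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)    # bias-corrected first moment
    v_hat = v / (1 - beta2 ** t)    # bias-corrected second moment
    params = params - lr * m_hat / (np.sqrt(v_hat) + eps)

w_fit, b_fit = params
print("w:", w_fit, "b:", b_fit)
```

Because Adam normalises each step by the running gradient magnitude, early steps move each parameter by roughly the learning rate, which is why a rate of `0.01` needs a few hundred epochs to carry the weight from 0 to about 3.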

In [1]:
N = 100  # size of toy data
p = 1  # number of features
loss_fn = torch.nn.MSELoss(size_average=False)
optimizer = torch.optim.Adam(regression_model.parameters(), lr=0.01)
num_epochs = 1000

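def main():
    # the dataset is a single N x 2 tensor: features first, targets in the last column
    data = build_linear_dataset(N)
    x, y = data[:, :-1], data[:, -1:]
    for j in range(num_epochs):
        # forward pass and loss
        y_pred = regression_model(x)
        loss = loss_fn(y_pred, y)
        # backpropagate and take an optimizer step
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        if j % 100 == 0:
            print(loss.data[0])
    # Inspect learned parameters
    print(list(regression_model.named_parameters()))

if __name__ == '__main__':
    main()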

Not too bad: the neural net learned parameters that are pretty close to the ground truth $w = 3, b = 1$. However, what if our data were noisy? How confident are we that the learned parameters reflect the true values?

This is a limitation of learning only point estimates, and one that we can address with probabilistic modeling. Instead of only learning the point estimates, we learn a _distribution_ over the possible parameters. In other words, we'll learn two values for each parameter: $\mu$, the mean (i.e. our best estimate of the value), and $\sigma$, our uncertainty about that estimate.

# Bayesian Regression
Instead of learning these parameters directly, we'll put a prior over them and learn a posterior distribution given our observed data. To do this, we'll use Pyro's `random_module()` to lift the parameters we would like to learn. `random_module()` replaces the original parameters of the neural net with random variables sampled from our prior.

In [16]:
mu = Variable(torch.zeros(1, 1))
sigma = Variable(torch.ones(1, 1))
# define a prior we want to sample from
prior = DiagNormal(mu, sigma)
# overload the parameters in the regression nn with samples from the prior
lifted_module = pyro.random_module("regression_module", regression_model, prior)
# sample a nn from the prior
sampled_nn = lifted_module()

Regression (
  (linear): Linear (1 -> 1)
)

# Model
Now we want to define our model and guide to do inference. The model places unit normal priors $\mathcal{N}(\mu=0, \sigma=1)$ on the weight and bias, samples a regression network from those priors using `random_module()`, and runs it on the inputs to get a predicted mean for each data point. We then score the observed targets against a Gaussian centered on that prediction with unit standard deviation. This is done via an `observe()` statement.

The guide is the approximating distribution which will be used to sample when inference is run. It samples the weight and bias from `DiagNormal()` distributions parameterized by learnable $\mu$ and $\sigma$ values registered with `pyro.param()`; the $\sigma$ parameters are passed through `softplus` so that they stay positive.

In [5]:
def model(data):
    # split the data into features and targets
    x_data = data[:, :-1]
    y_data = data[:, -1]
    # Create unit normal priors over the parameters
    mu = Variable(torch.zeros(p, 1))
    sigma = Variable(torch.ones(p, 1))
    bias_mu = Variable(torch.zeros(1))
    bias_sigma = Variable(torch.ones(1))
    w_prior = DiagNormal(mu, sigma)
    b_prior = DiagNormal(bias_mu, bias_sigma)
    priors = {'linear.weight': w_prior, 'linear.bias': b_prior}
    # wrap regression model that lifts module parameters to random variables
    # sampled from the priors
    lifted_module = pyro.random_module("module", regression_model, priors)
    # sample a nn
    lifted_nn = lifted_module()
    latent = lifted_nn(x_data).squeeze()
    pyro.observe("obs", DiagNormal(latent, Variable(torch.ones(data.size(0)))), y_data.squeeze())


def guide(data):
    w_mu = Variable(torch.randn(p, 1), requires_grad=True)
    w_log_sig = Variable(-3.0 * torch.ones(p, 1) + 0.05 * torch.randn(p, 1), requires_grad=True)
    b_mu = Variable(torch.randn(1), requires_grad=True)
    b_log_sig = Variable(-3.0 * torch.ones(1) + 0.05 * torch.randn(1), requires_grad=True)
    # register learnable params in the param store
    mw_param = pyro.param("guide_mean_weight", w_mu)
    sw_param = softplus(pyro.param("guide_sigma_weight", w_log_sig))
    mb_param = pyro.param("guide_mean_bias", b_mu)
    sb_param = softplus(pyro.param("guide_sigma_bias", b_log_sig))
    # gaussian guide distributions for w and b
    w_prior, b_prior = DiagNormal(mw_param, sw_param), DiagNormal(mb_param, sb_param)
    priors = {'linear.weight': w_prior, 'linear.bias': b_prior}
    # overloading the parameters in the module with random samples from the guide distributions
    lifted_module = pyro.random_module("module", regression_model, priors)
    # sample a nn
    lifted_module()
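
For this particular model (unit normal priors on the weight and bias, and a unit-variance Gaussian likelihood) the exact posterior happens to be Gaussian and can be computed in closed form. The sketch below does that in NumPy as a reference for what variational inference should roughly recover. It is not part of the Pyro example; the seed and the names `design`, `posterior_mean`, and `posterior_std` are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-6, 6, num=100)
y = 3.0 * x + 1.0 + rng.normal(0.0, 0.1, size=x.shape)

# Columns correspond to [w, b]; the column of ones carries the bias.
design = np.column_stack([x, np.ones_like(x)])

prior_precision = np.eye(2)     # unit normal priors on w and b
noise_variance = 1.0            # the model scores y against a unit-variance Gaussian

# Conjugate Gaussian update: precisions add, mean is the precision-weighted solution.
posterior_precision = prior_precision + design.T @ design / noise_variance
posterior_cov = np.linalg.inv(posterior_precision)
posterior_mean = posterior_cov @ (design.T @ y) / noise_variance
posterior_std = np.sqrt(np.diag(posterior_cov))

print("posterior mean [w, b]:", posterior_mean)
print("posterior std  [w, b]:", posterior_std)
```

The guide above uses independent Gaussians for the weight and bias, so in general it cannot represent correlations between them. Here the inputs are symmetric around zero, so the exact posterior is nearly uncorrelated and a factorized guide can approximate it closely.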


# Inference
Pyro makes it easy to do variational inference with the objective and estimator of your choice. We'll still use the Adam optimizer with a learning rate of 0.01, but this time we're going to optimize the evidence lower bound (ELBO). For more information on the ELBO see [LINK]. To train, we will iterate over the number of epochs and feed the data to our SVI object. We'll print the average loss every 100 epochs to track the ELBO during optimization.

In [6]:
# Pyro's Adam wrapper takes a dictionary of optimizer arguments
optim = Adam({"lr": 0.01})
svi = SVI(model, guide, optim, loss="ELBO")

def main():
    data = build_linear_dataset(N)
    for j in range(num_epochs):
        # use the entire data set in each step
        epoch_loss = svi.step(data)
        if j % 100 == 0:
            print("epoch avg loss {}".format(epoch_loss/float(N)))

if __name__ == '__main__':
    main()
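
To make the objective more concrete, here is a rough NumPy sketch of a Monte Carlo ELBO estimate for this model: the average over guide samples of $\log p(y \mid w, b) + \log p(w, b) - \log q(w, b)$. It is an illustration under stated assumptions, not Pyro's estimator; the helper names `log_normal` and `elbo_estimate`, the seed, and the guide values passed in are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-6, 6, num=100)
y = 3.0 * x + 1.0 + rng.normal(0.0, 0.1, size=x.shape)


def log_normal(value, mean, std):
    """Elementwise log density of a Gaussian."""
    return -0.5 * np.log(2 * np.pi) - np.log(std) - 0.5 * ((value - mean) / std) ** 2


def elbo_estimate(guide_mean, guide_std, num_samples=2000):
    """Monte Carlo ELBO for the guide q(w, b) = N(guide_mean, diag(guide_std**2))."""
    # Draw (w, b) pairs from the guide; shape (num_samples, 2).
    samples = guide_mean + guide_std * rng.standard_normal((num_samples, 2))
    w, b = samples[:, :1], samples[:, 1:]
    log_lik = log_normal(y, w * x + b, 1.0).sum(axis=1)                 # log p(y | w, b)
    log_prior = log_normal(samples, 0.0, 1.0).sum(axis=1)               # log p(w, b)
    log_guide = log_normal(samples, guide_mean, guide_std).sum(axis=1)  # log q(w, b)
    return np.mean(log_lik + log_prior - log_guide)


good = elbo_estimate(np.array([3.0, 1.0]), np.array([0.03, 0.1]))
poor = elbo_estimate(np.array([0.0, 0.0]), np.array([0.03, 0.1]))
print("ELBO near the data-generating values:", good)
print("ELBO far from them:", poor)
```

A guide centered near the values that generated the data should score a much higher ELBO than one centered far away. The loss reported by `svi.step` is the negative of this quantity, so it goes down as the ELBO goes up.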

# Model Criticism
Let's evaluate our model by checking its predictive accuracy on new test data. This is known as a _posterior predictive check_.
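
One way such a check could look is sketched below in NumPy. The guide means and standard deviations are hypothetical stand-ins for values learned by SVI, and the test set is generated from the same line as the toy data. We sample parameters, simulate observations, and compare held-out targets against the predictive interval.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical guide parameters standing in for values learned by SVI.
w_mean, w_std = 3.0, 0.03
b_mean, b_std = 1.0, 0.1
obs_std = 1.0  # the model's observation noise

# Fresh test data from the same generating process as the toy dataset.
x_test = np.linspace(-6, 6, num=50)
y_test = 3.0 * x_test + 1.0 + rng.normal(0.0, 0.1, size=x_test.shape)

# Sample parameters, then sample observations given those parameters.
num_samples = 1000
w = rng.normal(w_mean, w_std, size=(num_samples, 1))
b = rng.normal(b_mean, b_std, size=(num_samples, 1))
y_sim = w * x_test + b + rng.normal(0.0, obs_std, size=(num_samples, x_test.size))

# Compare held-out targets with the central 90% predictive interval.
lower, upper = np.percentile(y_sim, [5, 95], axis=0)
coverage = np.mean((y_test >= lower) & (y_test <= upper))
rmse = np.sqrt(np.mean((y_sim.mean(axis=0) - y_test) ** 2))
print("coverage of 90% interval:", coverage, "RMSE of predictive mean:", rmse)
```

Because the model assumes unit observation noise while the toy data uses a much smaller noise level, the interval should cover nearly every test point. That kind of over-coverage is exactly what a check like this is meant to surface.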

Let's compare our output to our previous result:

In [None]:
print(pyro.get_param_store()._params)


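As you can see, the means are pretty close to the values we previously learned; however, instead of a point estimate, we learned a _distribution over possible values_. Note that the stored $\sigma$ parameters are unconstrained values that the guide passes through `softplus` to obtain $\sigma$. For strongly negative values, $\mathrm{softplus}(x) \approx e^x$, so they behave roughly like $\log \sigma$: the more negative the value, the narrower the distribution.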
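
To read the stored scale parameters as actual standard deviations, apply the same `softplus` transform the guide uses. Here is a small NumPy sketch; the raw values are made up for illustration, and `softplus` is re-implemented here so the snippet is self-contained.

```python
import numpy as np


def softplus(value):
    """Numerically stable log(1 + exp(value))."""
    return np.logaddexp(0.0, value)


raw = np.array([-3.0, -5.0, 0.0, 2.0])   # hypothetical stored parameter values
sigma = softplus(raw)

for r, s in zip(raw, sigma):
    print("raw", r, "-> sigma", s, "(exp(raw) =", np.exp(r), ")")
```

For the guide's initial value of about $-3$ this gives $\sigma \approx 0.049$, close to $e^{-3} \approx 0.050$. For values near or above zero the two diverge, so the log interpretation only holds for strongly negative parameters.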

See the full code on [GitHub](https://github.com/uber/pyro/blob/dev/examples/bayesian_regression.py).