# Bayesian Regression

Regression is one of the most common and basic supervised learning tasks in machine learning. It is used to fit a function to observed data. Linear regression generally takes the form:
\begin{equation}
y = \beta_1 X + \beta_0 + \epsilon
\end{equation}
where $\epsilon$ is observation noise and we would like to learn the intercept $\beta_0$ and the slope $\beta_1$. Let's first write an ordinary regression as you would in PyTorch and learn point estimates for the parameters. Then we'll see how to quantify uncertainty by doing Bayesian inference over the same parameters.
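
For a sense of what a point estimate looks like before bringing in PyTorch, here is a minimal NumPy sketch that fits the two coefficients by ordinary least squares. The synthetic data, the seed, and the variable names are assumptions made for illustration; the tutorial itself fits the parameters by gradient descent instead.

```python
import numpy as np

rng = np.random.RandomState(0)  # fixed seed, chosen only for reproducibility

# synthetic data from y = 3x + 1 plus a little Gaussian noise
x = np.linspace(-6, 6, num=100)
y = 3 * x + 1 + rng.normal(0, 0.1, size=x.shape)

# design matrix with a column of ones for the intercept beta_0
A = np.column_stack([x, np.ones_like(x)])

# least-squares solution of A @ [beta_1, beta_0] ~= y
(beta_1, beta_0), _, _, _ = np.linalg.lstsq(A, y, rcond=None)
print(beta_1, beta_0)
```

The result is a single number per coefficient, with no indication of how much either could vary given the noise in the data.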


## Setup
As always, let's begin by importing the modules we'll need.

In [None]:
import numpy as np
import torch
import torch.nn as nn

from torch.autograd import Variable

import pyro
from pyro.distributions import Normal
from pyro.infer import SVI
from pyro.optim import Adam

## Data
We'll generate a linear toy dataset with one feature and $\beta_1 = 3$ and $\beta_0 = 1$ as follows:

In [None]:
N = 100  # size of toy data
p = 1  # number of features

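def build_linear_dataset(N, noise_std=0.1):
    # inputs evenly spaced on [-6, 6]
    X = np.linspace(-6, 6, num=N)
    # targets from y = 3x + 1 plus Gaussian noise
    y = 3 * X + 1 + np.random.normal(0, noise_std, size=N)
    X, y = X.reshape((N, 1)), y.reshape((N, 1))
    X, y = Variable(torch.Tensor(X)), Variable(torch.Tensor(y))
    # stack into an (N, 2) matrix: feature first, target in the last column
    return torch.cat((X, y), 1)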

## Regression
Now let's define our regression model in the form of a neural network. We'll use PyTorch's `nn.Module` for this. Our input $X$ is a matrix of size $N \times p$ and our output $y$ is a vector of size $N \times 1$. The module `nn.Linear(p, 1)` defines a linear map of the form $Xw + b$, where $w$ is the weight matrix and $b$ is the additive bias.

In [None]:
class RegressionModel(nn.Module):
    def __init__(self, p):
        super(RegressionModel, self).__init__()
        self.linear = nn.Linear(p, 1)

    def forward(self, x):
        return self.linear(x)

regression_model = RegressionModel(p)
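
To make the shapes concrete, the following NumPy sketch mimics what a linear layer with `p` inputs and one output computes. `nn.Linear` stores its weight with shape `(out_features, in_features)` and applies `x @ W.T + b`; the numeric values below are made up for illustration.

```python
import numpy as np

N, p = 5, 1
X = np.arange(N * p, dtype=float).reshape(N, p)  # inputs, shape (N, p)
W = np.full((1, p), 3.0)                          # weight, shape (1, p)
b = np.array([1.0])                               # bias, shape (1,)

# (N, p) @ (p, 1) + (1,) -> (N, 1)
y = X @ W.T + b
print(y.shape)
print(y.ravel())
```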

## Training
We will use the squared-error loss, summed rather than averaged over the data points (`size_average=False`), and Adam as our optimizer. We would like to optimize the parameters of the `regression_model` neural net above. Since our toy dataset does not have a lot of noise, we will use a learning rate of `0.01`, which is larger than Adam's default of `0.001`, and run for 1000 epochs.

In [None]:
loss_fn = torch.nn.MSELoss(size_average=False)
optim = torch.optim.Adam(regression_model.parameters(), lr=0.01)
num_epochs = 1000

def main():
    # generate the toy dataset with the default noise level
    data = build_linear_dataset(N)
    x_data = data[:, :-1]
    y_data = data[:, -1]
    for j in range(num_epochs):
        # run the model forward on the data; squeeze to shape (N,) to match y_data
        y_pred = regression_model(x_data).squeeze(-1)
        # calculate the summed squared-error loss
        loss = loss_fn(y_pred, y_data)
        # reset the gradients to zero
        optim.zero_grad()
        # backpropagate
        loss.backward()
        # take a gradient step
        optim.step()
        if j % 100 == 0:
            print(loss.data[0])
    # Inspect learned parameters
    print("Parameters:", list(regression_model.named_parameters()))

if __name__ == '__main__':
    main()
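
The loss used above is the squared error summed over data points. The NumPy sketch below writes it out by hand and also shows why the shapes of predictions and targets should agree: subtracting an `(N,)` array from an `(N, 1)` array broadcasts to `(N, N)` instead of raising an error. The arrays are made-up values.

```python
import numpy as np

y_pred = np.array([[1.0], [2.0], [3.0]])   # shape (3, 1), like the output of a linear layer
y_true = np.array([1.5, 2.0, 2.0])         # shape (3,)

# matching shapes: elementwise residuals, then the summed squared error
sse = np.sum((y_pred.squeeze(-1) - y_true) ** 2)

# mismatched shapes: broadcasting builds a full (3, 3) grid of differences
grid = y_pred - y_true
print(sse, grid.shape)
```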

Not too bad - you can see that the neural net learned parameters that were pretty close to the ground truth $w = 3, b = 1$. However, what if our data were noisy? How confident are we that the learned parameters reflect the true values?

This is a fundamental limitation of deep learning that we can address with probabilistic modeling. Instead of only learning point estimates, we learn a _distribution_ over the possible parameters. In other words, we'll learn two values for each parameter: $\mu$, the mean (i.e., our best estimate of the value), and $\sigma$, our uncertainty about that estimate.
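
For linear regression with a Gaussian prior and a known noise level, the distribution over parameters is available in closed form, which gives a concrete picture of what $\mu$ and $\sigma$ mean. The sketch below assumes a unit Gaussian prior on both coefficients and a noise standard deviation of 1; the data and names are illustrative, and the tutorial itself approximates the posterior with variational inference rather than using this formula.

```python
import numpy as np

rng = np.random.RandomState(0)
noise_std = 1.0   # assumed known observation noise

x = np.linspace(-6, 6, num=100)
y = 3 * x + 1 + rng.normal(0, noise_std, size=x.shape)
A = np.column_stack([x, np.ones_like(x)])   # columns: slope, intercept

# prior N(0, I); posterior precision = I + A^T A / noise_std^2
precision = np.eye(2) + A.T @ A / noise_std ** 2
cov = np.linalg.inv(precision)
mean = cov @ A.T @ y / noise_std ** 2       # posterior mean (mu) per coefficient
std = np.sqrt(np.diag(cov))                 # posterior std (sigma) per coefficient

print("mu:", mean)
print("sigma:", std)
```

In this setup the slope is pinned down more tightly than the intercept, because the inputs span a wide range around zero.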

## Bayesian Regression
Instead of learning these parameters directly, we'll put a prior over them and learn a posterior distribution given our observed data. To do this, we'll use Pyro's `random_module()` to lift the parameters we would like to learn. `random_module()` replaces the original parameters of the neural net with random variables sampled from our prior. For example:

In [None]:
mu = Variable(torch.zeros(1, 1))
sigma = Variable(torch.ones(1, 1))
# define a prior we want to sample from
prior = Normal(mu, sigma)
# overload the parameters in the regression nn with samples from the prior
lifted_module = pyro.random_module("regression_module", regression_model, prior)
# sample a nn from the prior
sampled_nn = lifted_module()
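
Conceptually, sampling a neural net from the prior means drawing one concrete set of parameters and using them as an ordinary network. The NumPy sketch below imitates this for a single linear layer; `sample_linear_from_prior` is a hypothetical helper written for this illustration and is not part of Pyro.

```python
import numpy as np

def sample_linear_from_prior(p, rng):
    """Draw w ~ N(0, 1) and b ~ N(0, 1); return them with the function x -> x @ w.T + b."""
    w = rng.normal(0.0, 1.0, size=(1, p))
    b = rng.normal(0.0, 1.0, size=(1,))
    return w, b, (lambda x: x @ w.T + b)

rng = np.random.RandomState(0)
x = np.linspace(-1, 1, num=4).reshape(4, 1)

# every call yields a different regression function
w1, b1, f1 = sample_linear_from_prior(1, rng)
w2, b2, f2 = sample_linear_from_prior(1, rng)
print(f1(x).ravel())
print(f2(x).ravel())
```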

## Model
Our model places unit Gaussian priors on both the weight and the bias, samples a neural net from those priors, and runs it forward on the data. We then score the predicted value against the observed value using a Gaussian likelihood with a fixed unit variance.

The guide defines the variational distributions over the weight and the bias, i.e., our approximation to the posterior. The parameters we want to learn are registered in the param store via `pyro.param()`. Note that we pass the scale parameters (named `log_sigma` in the code) through a `softplus()` to ensure positivity. We then define Gaussian distributions with these parameters and wrap the `regression_model` with `pyro.random_module()`. `lifted_module` is a distribution over neural nets, and calling it samples a neural net.

In [None]:
def model(data):
    x_data = data[:, :-1]
    y_data = data[:, -1]
    # create unit normal priors over the parameters;
    # nn.Linear(p, 1) stores its weight with shape (1, p)
    mu, sigma = Variable(torch.zeros(1, p)), Variable(torch.ones(1, p))
    bias_mu, bias_sigma = Variable(torch.zeros(1)), Variable(torch.ones(1))
    w_prior, b_prior = Normal(mu, sigma), Normal(bias_mu, bias_sigma)
    priors = {'linear.weight': w_prior, 'linear.bias': b_prior}
    # lift the module parameters to random variables sampled from the priors
    lifted_module = pyro.random_module("module", regression_model, priors)
    # sample a nn
    lifted_nn = lifted_module()
    # run the nn forward
    latent = lifted_nn(x_data).squeeze()
    # condition on the observed data, with a fixed observation noise of sigma = 1
    obs_sigma = Variable(torch.ones(data.size(0)))
    pyro.observe("obs", Normal(latent, obs_sigma), y_data.squeeze())

softplus = torch.nn.Softplus()

def guide(data):
    # initial values of the variational parameters;
    # the weight has shape (1, p) to match nn.Linear(p, 1)
    w_mu = Variable(torch.randn(1, p), requires_grad=True)
    w_log_sig = Variable(-3.0 * torch.ones(1, p) + 0.05 * torch.randn(1, p), requires_grad=True)
    b_mu = Variable(torch.randn(1), requires_grad=True)
    b_log_sig = Variable(-3.0 * torch.ones(1) + 0.05 * torch.randn(1), requires_grad=True)
    # register learnable params in the param store;
    # softplus maps each unconstrained scale parameter to a positive sigma
    mw_param = pyro.param("guide_mean_weight", w_mu)
    sw_param = softplus(pyro.param("guide_log_sigma_weight", w_log_sig))
    mb_param = pyro.param("guide_mean_bias", b_mu)
    sb_param = softplus(pyro.param("guide_log_sigma_bias", b_log_sig))
    # Gaussian variational distributions for w and b
    w_dist, b_dist = Normal(mw_param, sw_param), Normal(mb_param, sb_param)
    dists = {'linear.weight': w_dist, 'linear.bias': b_dist}
    # overload the parameters in the module with random samples from these distributions
    lifted_module = pyro.random_module("module", regression_model, dists)
    # sample a nn
    lifted_module()
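
The softplus function, $\log(1 + e^x)$, maps any real number to a positive one, which is why the guide can optimize an unconstrained parameter and still end up with a valid standard deviation. A small standard-library sketch (the helper name `softplus_scalar` is ours, not a PyTorch or Pyro function) shows the effect at the initial value of -3 used in the guide:

```python
import math

def softplus_scalar(x):
    """Numerically stable log(1 + exp(x)) for a single float."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

for raw in (-3.0, 0.0, 3.0):
    print(raw, softplus_scalar(raw))
```

So a raw value near -3 corresponds to a starting standard deviation of roughly 0.05.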


## Inference
For inference, we'll still use the Adam optimizer with a learning rate of 0.01, but this time we're going to optimize the evidence lower bound (ELBO).  For more information on the ELBO and SVI, see the [SVI Tutorial](svi_part_i).  To train, we will iterate over the number of epochs and feed the data to our SVI object. We'll print the loss every 100 epochs.

In [None]:
optim = Adam({"lr": 0.01})
svi = SVI(model, guide, optim, loss="ELBO")

def main():
    data = build_linear_dataset(N)
    for j in range(num_epochs):
        # calculate the loss and take a gradient step
        epoch_loss = svi.step(data)
        if j % 100 == 0:
            print("epoch avg loss {}".format(epoch_loss/float(N)))
            
if __name__ == '__main__':
    main()
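
To give some intuition for the quantity being optimized, the sketch below estimates the ELBO by Monte Carlo for a one-dimensional toy problem where the exact answer is known: a prior $z \sim N(0, 1)$, a likelihood $x \sim N(z, 1)$, and a Gaussian guide $q(z) = N(m, s)$. This toy model and the helper names are assumptions for illustration; they are not the regression model above, and Pyro's own ELBO estimator is more general. The ELBO is $E_q[\log p(x, z) - \log q(z)]$, and it never exceeds the log evidence $\log p(x)$.

```python
import math
import random

def log_normal(v, mean, std):
    """Log density of a Gaussian with the given mean and standard deviation, evaluated at v."""
    return -0.5 * math.log(2 * math.pi) - math.log(std) - 0.5 * ((v - mean) / std) ** 2

def elbo_estimate(x, m, s, num_samples, rng):
    """Monte Carlo estimate of E_q[log p(x, z) - log q(z)] with q = N(m, s)."""
    total = 0.0
    for _ in range(num_samples):
        z = rng.gauss(m, s)
        log_joint = log_normal(z, 0.0, 1.0) + log_normal(x, z, 1.0)
        total += log_joint - log_normal(z, m, s)
    return total / num_samples

rng = random.Random(0)
x = 2.0
# for this toy model the evidence is N(x; 0, sqrt(2)) and the posterior is N(x / 2, sqrt(1 / 2))
log_evidence = log_normal(x, 0.0, math.sqrt(2.0))
elbo_prior_guide = elbo_estimate(x, 0.0, 1.0, 5000, rng)               # guide equal to the prior
elbo_exact_guide = elbo_estimate(x, x / 2, math.sqrt(0.5), 5000, rng)  # guide equal to the posterior

print(log_evidence, elbo_prior_guide, elbo_exact_guide)
```

When the guide matches the true posterior the bound is tight; a poorer guide gives a lower ELBO, and the gap is the KL divergence that SVI tries to shrink.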

## Model Criticism
Let's compare our output to our previous result:

In [None]:
print(pyro.get_param_store()._params)

As you can see, the means are pretty close to the values we previously learned; however, instead of a point estimate, we learned a _distribution over possible values_ of $w, b$. (Note that the stored scale parameters are unconstrained: the guide passes them through `softplus()` to obtain $\sigma$, so the more negative the value is, the narrower the distribution.)

Let's evaluate our model by checking its predictive accuracy on new test data. This is known as _point evaluation_. We'll calculate the MSE of our predictions compared to the ground truth.

In [None]:
X = np.linspace(8, 12, num=20)
y = 3 * X + 1
X, y = X.reshape((20, 1)), y.reshape((20, 1))
x_data, y_data = Variable(torch.Tensor(X)), Variable(torch.Tensor(y))
# predict with the posterior means learned by the guide
w_mean = pyro.param("guide_mean_weight")
b_mean = pyro.param("guide_mean_bias")
y_pred = x_data.mm(w_mean.view(-1, 1)) + b_mean
loss = nn.MSELoss()
# compare the MSE between the ground truth and the predictions from the posterior means
print(loss(y_pred, y_data))
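
A point evaluation ignores the uncertainty we worked to obtain. One way to use it is to draw many parameter samples from the learned Gaussians and look at the spread of the resulting predictions. The sketch below does this in NumPy; the means and standard deviations are placeholder values standing in for the learned ones, not results from the run above.

```python
import numpy as np

rng = np.random.RandomState(0)

# placeholder posterior parameters (assumed values, for illustration only)
w_mu, w_sigma = 3.0, 0.03
b_mu, b_sigma = 1.0, 0.10

x = np.linspace(8, 12, num=20)
num_samples = 2000

# one (w, b) pair per row, broadcast against the test inputs
w = rng.normal(w_mu, w_sigma, size=(num_samples, 1))
b = rng.normal(b_mu, b_sigma, size=(num_samples, 1))
preds = w * x + b                  # shape (num_samples, 20)

pred_mean = preds.mean(axis=0)     # average prediction at each test input
pred_std = preds.std(axis=0)       # spread induced by parameter uncertainty

mse = np.mean((pred_mean - (3 * x + 1)) ** 2)
print(mse, pred_std[0], pred_std[-1])
```

The predictive spread grows with $|x|$ because uncertainty in the slope is multiplied by the input, so predictions far from the origin are less certain.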

See the full code on [GitHub](https://github.com/uber/pyro/blob/dev/examples/bayesian_regression.py).