# Exercise - Pyro Polynomial Regression

## Table of Contents
* [Introduction](#Introduction)
* [Requirements](#Requirements) 
  * [Knowledge](#Knowledge)
  * [Modules](#Python-Modules)
* [Data](#Data)
* [Exercises](#Exercises)
  * [Pytorch Regression Model](#Pytorch-Regression-Model)
  * [Probabilistic Model](#Probabilistic-Model)
  * [Evaluation / Visualization](#Evaluation-/-Visualization)
  * [Additional Exercise](#Additional-Exercise)
* [Licenses](#Licenses)

## Introduction



Remark: To detect errors in your own code, execute the notebook cells containing `assert` or `assert_almost_equal`. These statements raise exceptions as long as the calculated result is not yet correct.

## Requirements

### Knowledge

#### Theory

All *Pyro* exercises are intended as part of the course [Bayesian Learning](https://dev.deep-teaching.org/courses/bayesian-learning). Therefore, work through the course up to and including the chapter [Probabilistic Programming](https://dev.deep-teaching.org/courses/bayesian-learning#probabilistic-programming).


#### Pyro

* The official Tutorial:
    * https://pyro.ai/examples/intro_part_i.html
    * https://pyro.ai/examples/intro_part_ii.html
    * https://pyro.ai/examples/svi_part_i.html

### Python Modules

In [None]:
import numpy as np

import torch
import torch.nn as nn

import pyro
from pyro.distributions import Normal
from pyro.infer import SVI, JitTrace_ELBO, Trace_ELBO
from pyro.optim import Adam

import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
### for debugging
pyro.enable_validation(True)

## Data

First we generate some data, e.g. 10 data points for training and 10 for testing. We are going to store the data as follows:

* `x_train`:
$
\vec x_{train} = (x^{(1)},x^{(2)}, \dots x^{(10)})
$
* `x_test`:
$
\vec x_{test} = (\hat x^{(1)},\hat x^{(2)}, \dots \hat x^{(10)})
$
* `y_train`:
$
\vec y_{train} = (y^{(1)},y^{(2)}, \dots y^{(10)})
$
* `y_test`:
$
\vec y_{test} = (\hat y^{(1)},\hat y^{(2)}, \dots \hat y^{(10)})
$

<!--
As well seperately:

* `f_train`:
$$
f_{train} = \begin{bmatrix}
x^{(1)} & x^{(1)2} & x^{(1)3} \\
x^{(2)} & x^{(3)2} & x^{(1)3} \\
\ldots & \ldots & \ldots \\
x^{(10)} & x^{(10)2} & x^{(10)3}
\end{bmatrix}
$$
* `f_test`:
$$
f_{test} = \begin{bmatrix}
\hat x^{(1)} & \hat x^{(1)2} & \hat x^{(1)3} \\
\hat x^{(2)} & \hat x^{(3)2} & \hat x^{(1)3} \\
\ldots & \ldots & \ldots \\
\hat x^{(10)} & \hat x^{(10)2} & \hat x^{(10)3}
\end{bmatrix}
$$
-->

To generate the $y$ values, we use the following variable, whose values are treated as unknown when we fit the model later:
* `w`:
$
\vec w = (w_0, w_1, w_2, w_3)
$, including the bias $b=w_0$

We then can generate our $y$ values (train and test) with:
$$
y^{(i)} = b + w_1 x^{(i)} + w_2 x^{(i)2} + w_3 x^{(i)3} + \epsilon^{(i)}, \quad \epsilon^{(i)} \sim \mathcal N(\mu=0, \sigma=2.5)
$$
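
As a quick sanity check of the formula, the following sketch evaluates the noise-free part of the polynomial for a single input. The coefficients are the ones assigned to `w` in the code cells further below; the input value `x = 2.0` is an arbitrary choice for illustration only.

```python
import numpy as np

# coefficients (b, w_1, w_2, w_3) as set later in this notebook
w = np.array([9., 3., 0.1, -0.08])

# arbitrary example input (assumption, not part of the data set)
x = 2.0

# powers x^0, x^1, x^2, x^3
powers = np.array([x**i for i in range(len(w))])

# noise-free target: b + w_1 x + w_2 x^2 + w_3 x^3
y_clean = np.dot(powers, w)

# 9 + 6 + 0.4 - 0.64 = 14.76 (up to floating point rounding)
print(y_clean)
```

The observed $y^{(i)}$ then differs from this value by one draw of the Gaussian noise term.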


In [None]:
np.random.seed(4200)

In [None]:
noise_mean = 0.
noise_scale = 2.5
num_examples = 10
N = num_examples  # number of data points

In [None]:
def polynom(x, w):
    order = w.shape[0]
    x_poly = np.array([x**i for i in range(order)]).T
    y = np.dot(x_poly, w) 
    return x_poly, y

In [None]:
def get_toy_data(nb_data, w, noise_mean, noise_scale):
    x = np.random.uniform(-8.0, 9.0, size=nb_data)
    x_poly, y_ = polynom(x, w)
    y = y_ + np.random.normal(noise_mean, noise_scale, size=nb_data)
    return np.float32(x), np.float32(y)

In [None]:
### w = (bias, w1, w2, w3)
w = np.array([9., 3., 0.1, -0.08])
x_train, y_train = get_toy_data(N, w, noise_mean, noise_scale)
x_test, y_test = get_toy_data(N, w, noise_mean, noise_scale)

In [None]:
### Plot train / test data
plt.plot(x_train, y_train, "bo", label="training data")
plt.plot(x_test, y_test, "ro", label="test data")

### Plot ground truth
x_ = np.arange(-8.0, 9.0, 0.01)
_, y_ = polynom(x_, w)
plt.plot(x_, y_, label='ground truth without noise')

plt.xlabel('x')
plt.ylabel('y')
plt.title('Train and Test data')
plt.legend()

In [None]:
grad_of_fit_polynom = 3

In [None]:
num_features = grad_of_fit_polynom  # number of features

## Exercises

### Pytorch Regression Model

**Task:**

Our goal is to learn the values for $\vec w = (w_1, w_2, w_3)$ and the bias $b$. 

For this purpose, your first task is to implement the `RegressionModel` class, which extends `torch.nn.Module` and calculates:

$$
\vec y = \vec w X^T + b
$$

with

* $\vec w = (w_1, w_2, w_3)$
* and $X = $

$$\begin{bmatrix}
x^{(1)} & x^{(1)2} & x^{(1)3} \\
x^{(2)} & x^{(2)2} & x^{(2)3} \\
\ldots & \ldots & \ldots \\
x^{(10)} & x^{(10)2} & x^{(10)3}
\end{bmatrix}
$$

**Hint:**

You can use PyTorch's `nn.Linear` class, which initializes weights and bias internally.
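
To see what `nn.Linear` sets up internally, here is a small sketch on a standalone layer. The sizes (2 input features, 1 output) and the names `layer`, `x` and `manual` are chosen only for illustration and deliberately differ from the exercise.

```python
import torch
import torch.nn as nn

# a standalone linear layer: 2 input features, 1 output (illustrative sizes)
layer = nn.Linear(2, 1)

# weight and bias are created and randomly initialized by the constructor
print(layer.weight.shape)  # torch.Size([1, 2])
print(layer.bias.shape)    # torch.Size([1])

# calling the layer computes x @ weight^T + bias for each row of x
x = torch.tensor([[1.0, 2.0], [3.0, 4.0]])
y = layer(x)
manual = x @ layer.weight.t() + layer.bias
print(torch.allclose(y, manual))  # True
```

The parameter names `weight` and `bias` are also what shows up (prefixed with the attribute name of the layer) in `named_parameters()`.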

In [None]:
### NN with one linear layer
class RegressionModel(nn.Module):
    def __init__(self, num_features):
        super(RegressionModel, self).__init__()
        ##################################
        ### TODO: Initialize nn.Linear ###
        ##################################

    def forward(self, x):
        ##########################################
        ### TODO: Calc and return forward pass ###
        ##########################################
        
        return

In [None]:
regression_model = RegressionModel(grad_of_fit_polynom)

The following cell should output something similar to the following (though with different values of course, since they are initialized randomly):
```
linear.weight tensor([[ 0.1868, -0.1874,  0.3968]])
linear.bias tensor([-0.0651])
```

In [None]:
for name, param in regression_model.named_parameters():
    if param.requires_grad:
        print(name, param.data)

### Probabilistic Model

#### Pyro model

In this section we define the probabilistic model with PyTorch and Pyro. Read the cells carefully and try to understand the code.

In [None]:
# softplus maps any real value to a positive one; we apply it to the
# sampled noise level (~Normal) so that it is always positive
softplus = torch.nn.Softplus()

In [None]:
def model(data):
    # Create normal priors over the parameters with high variance (10.)
    w_loc = torch.zeros((1, grad_of_fit_polynom))
    w_scale = torch.ones((1, grad_of_fit_polynom)) * 10.
    b_loc = torch.zeros(1)
    b_scale = torch.ones(1) * 10.
    nl_loc = torch.zeros(1)
    nl_scale = torch.ones(1) * 10.
    
    w_prior = Normal(w_loc, w_scale).independent(2)
    b_prior = Normal(b_loc, b_scale).independent(1) 
    noise_level_prior = softplus(pyro.sample("noise_level", Normal(nl_loc, nl_scale)))
    
    # the keys must match the parameter names of regression_model
    priors = {'linear.weight': w_prior, 'linear.bias': b_prior}
    # lift module parameters to random variables sampled from the priors
    lifted_module = pyro.random_module("module", regression_model, priors)
    # sample a regressor (which also samples w and b)
    lifted_reg_model = lifted_module()
    
    with pyro.iarange("map", N):
        # all columns except the last are x, x^2, x^3
        x_data = data[:, :-1]
        # last column is y
        y_data = data[:, -1]

        # run the regressor forward conditioned on data
        prediction_mean = lifted_reg_model(x_data).squeeze(-1)
        
        pyro.sample("obs", Normal(prediction_mean, 
                                  noise_level_prior * torch.ones(data.size(0))), obs=y_data)  


#### Pyro guide

**Task:**

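Implement the `guide` so that it matches the `model`: it needs a sample site with the same name for every unobserved random variable of the model (the noise level as well as the weight and bias of the lifted module). The evaluation cells below read the learned parameters via `pyro.param`, so register them under the names `guide_mean_weight`, `guide_log_scale_weight`, `guide_mean_bias`, `guide_log_scale_bias`, `guide_log_mean_noise_level` and `guide_log_sigma_noise_level`.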

**Hint:**

Use `softplus` on parameters that must not become negative, e.g. the scale of a normal distribution.
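
Softplus is the function $\log(1 + e^x)$, which is positive for every real input. A minimal numerical sketch using the same `torch.nn.Softplus` module as defined earlier in this notebook (the input values are arbitrary):

```python
import torch

softplus = torch.nn.Softplus()

# unconstrained values, including a negative one (arbitrary examples)
raw = torch.tensor([-5.0, 0.0, 5.0])

# softplus(x) = log(1 + exp(x)) is strictly positive
positive = softplus(raw)
print(positive)

# the same values computed directly from the formula
manual = torch.log1p(torch.exp(raw))
print(torch.allclose(positive, manual))  # True
```

An optimizer can therefore update the unconstrained value freely, while the transformed value stays usable as a scale.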


In [None]:
def guide(data):
    ############
    ### TODO ###
    ############
    
    return

#### Perform Inference

Now that our `model` and `guide` are defined, we can perform stochastic variational inference.

In [None]:
jit = False

pyro.clear_param_store()

### enhance feature x with polynomial
x_poly = np.array([x_train**i for i in range(1,grad_of_fit_polynom+1)], np.float32)
print(x_poly.T.shape)
data_ = np.concatenate([x_poly.T, y_train.reshape(-1,1)],  axis=1)
print(data_.shape)

data = torch.tensor(data_, dtype=torch.float32)

### make tensors and modules CUDA
#data = data.cuda()
#softplus.cuda()
#regression_model.cuda()
        
adam_params = {"lr": 0.05, "betas": (0.95, 0.999)}
optim = Adam(adam_params)
elbo = JitTrace_ELBO() if jit else Trace_ELBO()
svi = SVI(model, guide, optim, loss=elbo)

In [None]:
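num_epochs = 5000
losses = []
for j in range(num_epochs):
    # one SVI step on the full data set; returns an estimate of the loss
    epoch_loss = svi.step(data)
    losses.append(epoch_loss)
    if j % 100 == 0:
        # report the loss normalized by the number of data points
        print("epoch {} avg loss {}".format(j, epoch_loss / float(N)))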

### Evaluation / Visualization

#### Costs

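The easiest thing to do is to plot the loss over the optimization steps. `svi.step` returns an estimate of the loss, i.e. the negative ELBO.

In [None]:
plt.plot(losses)
plt.title("Loss (negative ELBO)")
plt.xlabel("step")
plt.ylabel("loss");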

Next we can compare the values we found for the mean $\mu$ of our model parameters $\vec w \sim \mathcal N(\mu_w,\sigma_w)$ and the bias $b \sim \mathcal N(\mu_b,\sigma_b)$ with the values we used in our data generation process.

We can of course do the same for the $\mu$ and $\sigma$ of our $noise \sim \mathcal N(\mu_{noise},\sigma_{noise})$.

In [None]:
mw_param = pyro.param("guide_mean_weight")
sw_param = pyro.param("guide_log_scale_weight")
print(mw_param.detach().numpy())
print(w[1:])

In [None]:
mb_param = pyro.param("guide_mean_bias")
sb_param = pyro.param("guide_log_scale_bias")
print(mb_param.detach().numpy())
print(w[0])

In [None]:
noise_mean_param = softplus(pyro.param("guide_log_mean_noise_level"))
noise_sigma_param = softplus(pyro.param("guide_log_sigma_noise_level"))
print(noise_mean_param.detach().numpy())
print(noise_sigma_param.detach().numpy())
print(noise_mean)
print(noise_scale)

#### Possible Models

Using $\mu_w$ and $\mu_b$ we could easily plot the most likely model. But for this purpose we would not have needed Bayesian inference; the frequentist approach would have sufficed.

**Task:**

Use $\mu_w$ and $\mu_b$ in conjunction with $\sigma_w$ and $\sigma_b$ to sample, let's say 50, possible $\vec w$s and $b$s and plot the corresponding models together with the ground truth and the most likely model.

The result should look somewhat similar to the following:

<img src="https://gitlab.com/deep.TEACHING/educational-materials/raw/master/media/klaus/pyro_poly_regression.png" alt="internet connection needed">

**Hint:**

You do not have to use Pyro or PyTorch to sample. Just use the corresponding NumPy function to draw samples from a normal distribution.
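
A minimal sketch of drawing such samples with `np.random.normal`. The values of `mu_w`, `sigma_w`, `mu_b` and `sigma_b` are made-up placeholders; in the exercise they come from the fitted guide parameters, and how a standard deviation is recovered from a stored parameter depends on how your guide defines it.

```python
import numpy as np

np.random.seed(0)

# placeholder posterior means and standard deviations (assumptions)
mu_w = np.array([3.0, 0.1, -0.08])
sigma_w = np.array([0.3, 0.05, 0.01])
mu_b, sigma_b = 9.0, 1.0

num_samples = 50

# one row per sampled model; broadcasting pairs each column
# with its own mean and standard deviation
w_samples = np.random.normal(mu_w, sigma_w, size=(num_samples, 3))
b_samples = np.random.normal(mu_b, sigma_b, size=num_samples)

# evaluate every sampled model on a grid of x values
x_grid = np.linspace(-8.0, 9.0, 200)
x_feat = np.stack([x_grid, x_grid**2, x_grid**3], axis=1)  # shape (200, 3)
y_samples = x_feat @ w_samples.T + b_samples               # shape (200, 50)
print(y_samples.shape)
```

Each column of `y_samples` is one possible model evaluated on `x_grid` and can be drawn as a separate curve.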




In [None]:
######################
### Your code here ###
######################

### Additional Exercise

Which of the models works best with the test data `x_test` and `y_test`? Is it the most likely one?
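
One possible way to make "works best" concrete is the mean squared error on the test set. The sketch below uses made-up test points and two made-up candidate models; the helper `mse_on_test` is a hypothetical name and not part of the notebook.

```python
import numpy as np

def mse_on_test(w, b, x, y):
    """Mean squared error of the cubic model y = b + w_1 x + w_2 x^2 + w_3 x^3."""
    x_feat = np.stack([x, x**2, x**3], axis=1)
    pred = x_feat @ w + b
    return np.mean((pred - y) ** 2)

# made-up stand-ins for x_test / y_test
x_demo = np.array([-1.0, 0.0, 2.0])
y_demo = np.array([1.0, 1.0, 1.0])

# two made-up candidate models (w, b)
candidates = [
    (np.array([0.0, 0.0, 0.0]), 1.0),  # constant model y = 1
    (np.array([1.0, 0.0, 0.0]), 1.0),  # linear model y = 1 + x
]

errors = [mse_on_test(w, b, x_demo, y_demo) for w, b in candidates]
best = int(np.argmin(errors))  # index of the candidate with the lowest error
print(errors, best)
```

Applied to the sampled models and to the model given by $\mu_w$ and $\mu_b$, the same comparison shows whether the most likely model is also the one with the lowest test error.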

## Licenses

### Notebook License (CC-BY-SA 4.0)

*The following license applies to the complete notebook, including code cells. It does however not apply to any referenced external media (e.g., images).*

Exercise - Pyro Polynomial Regression <br/>
by Christian Herta<br/>
is licensed under a [Creative Commons Attribution-ShareAlike 4.0 International License](http://creativecommons.org/licenses/by-sa/4.0/).<br/>
Based on a work at https://gitlab.com/deep.TEACHING.


### Code License (MIT)

*The following license only applies to code cells of the notebook.*

Copyright 2019 Christian Herta

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.