## Students
Please fill in your names and S/U-numbers:
* Student 1 name, S/U-number:
* Student 2 name, S/U-number:
* Student 3 name, S/U-number:

# Statistical Machine Learning 2020
# Assignment 4
# Deadline: 23 December 2020
## Instructions
* You can __work in groups__ (= max 3 people). __Write the full name and S/U-number of all team members in the header above.__
* Make sure you __fill in any place that says__ `YOUR CODE HERE` or "YOUR ANSWER HERE" __including comments, derivations, explanations, graphs, etc.__ This means that the elements and/or intermediate steps required to derive the answer have to be in the report. (Answers like 'No' or 'x=27.2' by themselves are not sufficient, even when they are the result of running your code.) If an exercise requires coding, explain briefly what the code does (in comments). All figures should have titles (descriptions), axis labels, and legends (if applicable).
* Please do not add new cells unless necessary; try to write the answers only in the provided cells. Before you turn this assignment in, __make sure everything runs as expected__. First, *restart the kernel* (in the menubar, select Kernel$\rightarrow$Restart) and then *run all cells* (in the menubar, select Cell$\rightarrow$Run All). The assignment was written in Python 3, and we strongly recommend using the corresponding Python 3 kernel for Jupyter.
* The assignment includes certain cells that contain tests. Most of the tests are marked as *hidden* and are used for automatic grading. NB: These hidden tests do not provide any feedback! There are also a couple of tests / checks that are visible, which are meant to help you avoid basic coding errors.
* __Upload the exercises to Brightspace as a single .zip file containing the submitter's S/U-number: 'SML20_as04_&lt;S/U-number&gt;.zip'__, for example 'SML20_as04_S123456.zip'. For those working in groups, it is sufficient if one team member uploads the solutions.
* For any problems or questions, send us an email, or just ask. Email addresses: G.Bucur@cs.ru.nl, Yuliya.Shapovalova@ru.nl, and tomc@cs.ru.nl.

## Introduction
Assignment 4 consists of:
1. Gaussian processes (50 points);
2. EM and doping (50 points);
3. Gibbs sampling and Metropolis-Hastings (50 points);
4. __Variational inference for Bayesian linear regression (50 points)__.

## Libraries
First, we import the basic libraries needed for this assignment. You are free to import further libraries, if required, in the allotted cells.

In [None]:
import IPython
assert IPython.version_info[0] >= 3, "Your version of IPython is too old, please update it to at least version 3."

import numpy as np
import matplotlib.pyplot as plt
import itertools
import functools

# Set fixed random seed for reproducibility
np.random.seed(2020)

## Variational inference for Bayesian linear regression
In this assignment we will consider variational inference for Bayesian linear regression. Using variational inference we can find an approximate posterior distribution of the parameters of interest. While generally there is no need for variational inference in linear regression problems, it is useful to derive it and implement it for a better understanding.

Recall the likelihood function for $\mathbf{w}$
\begin{equation}
p(\mathbf{t}|\mathbf{w}) = \prod_{n=1}^{N}\mathcal{N}(t_{n}|\mathbf{w}^{T}\phi_{n}, \beta^{-1})
\end{equation}
and the prior over $\mathbf{w}$
\begin{equation}
p(\mathbf{w}|\alpha) = \mathcal{N}(\mathbf{w}|\mathbf{0}, \alpha^{-1}\mathbf{I}),
\end{equation}
where $\phi_{n}=\phi(\mathbf{x}_{n})$. <br>
We take a gamma prior distribution for $\alpha$
\begin{equation}
p(\alpha)=\mathrm{Gam}(\alpha|a_{0}, b_{0}).
\end{equation}
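Assume the noise precision $\beta$ is known and fix it at its true value, $\beta = 0.01$ (the data below are generated with noise standard deviation 10, so $\beta = 1/10^{2}$).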
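To make the generative model concrete, the following minimal sketch draws one data set by ancestral sampling: first $\alpha$ from its gamma prior, then $\mathbf{w}$ from its Gaussian prior, then $\mathbf{t}$ from the likelihood. The hyperparameter values `a0_demo = 2.0` and `b0_demo = 1.0`, the input grid, and the cubic polynomial basis are assumptions chosen only for illustration. Note that NumPy parameterises the gamma distribution by shape and scale, whereas Bishop's $\mathrm{Gam}(\alpha|a, b)$ uses a rate $b$, so the scale is $1/b$.

```python
import numpy as np

rng_demo = np.random.default_rng(0)

# Illustrative hyperparameters (assumed values, not prescribed by the assignment)
a0_demo, b0_demo = 2.0, 1.0
beta_demo = 0.01  # known noise precision, as stated in the text

# alpha ~ Gam(a0, b0); NumPy expects scale = 1 / rate
alpha_demo = rng_demo.gamma(shape=a0_demo, scale=1.0 / b0_demo)

# Design matrix with columns 1, x, x^2, x^3 (assumed cubic polynomial basis)
x_grid = np.linspace(-5, 5, 10)
Phi_demo = np.vander(x_grid, N=4, increasing=True)

# w ~ N(0, alpha^{-1} I)
w_demo = rng_demo.normal(0.0, alpha_demo ** -0.5, size=Phi_demo.shape[1])

# t_n ~ N(w^T phi_n, beta^{-1})
t_demo = Phi_demo @ w_demo + rng_demo.normal(0.0, beta_demo ** -0.5, size=x_grid.shape)
```

Each run with a different seed gives a different curve, which shows how strongly the prior on $\alpha$ controls the scale of the weights.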

### Let us create a simulated data set.
We generate the data from the cubic function $x(x-5)(x+5)$ on $[-5, 5]$, adding Gaussian noise with standard deviation 10.

In [None]:
def create_toy_data(func, sample_size, std, domain=[0, 1]):
    x = np.linspace(domain[0], domain[1], sample_size)
    np.random.shuffle(x)
    t = func(x) + np.random.normal(scale=std, size=x.shape)
    return x, t

def cubic(x):
    return x * (x - 5) * (x + 5)

def PolynomialFeature(x, degree):
    """Return the polynomial design matrix of x up to the given degree (bias column first)."""
    if x.ndim == 1:
        x = x[:, None]
    x_t = x.transpose()
    features = [np.ones(len(x))]
    # Use a separate loop variable so the `degree` argument is not shadowed
    for d in range(1, degree + 1):
        for items in itertools.combinations_with_replacement(x_t, d):
            features.append(functools.reduce(lambda a, b: a * b, items))
    return np.asarray(features).transpose()

x_train, y_train = create_toy_data(cubic, 10, 10., [-5, 5])
x = np.linspace(-5, 5, 100)
y = cubic(x)

feature = PolynomialFeature(x,degree=3)
X_train = PolynomialFeature(x_train, degree=3)
X = PolynomialFeature(x, degree=3)

plt.scatter(x_train, y_train, s=100, facecolor="none", edgecolor="b", label="training data")
plt.plot(x, y, c="g", label=r"$x(x-5)(x+5)$")
plt.xlabel("$x$")
plt.ylabel("$t$")
plt.legend()
plt.show()

1. Write down the joint distribution of all of the variables $p(\mathbf{t}, \mathbf{w}, \alpha)$.

YOUR ANSWER HERE

2. Using the variational framework we would like to find an approximation for the posterior distribution $p(\mathbf{w}, \alpha|\mathbf{t})$. Write down the steps (parameter updates at every iteration) for the variational inference algorithm in the case of Bayesian linear regression.

YOUR ANSWER HERE

3. Implement the variational inference algorithm for Bayesian linear regression.

In [None]:
def VariationalRegression(X, t, beta, a0, b0, iter_max:int=100):
    """
    Variational Bayesian estimation for linear regression.

    Parameters
    ----------
    X : (N, M) np.ndarray
        training independent variable
    t : (N,) np.ndarray
        training dependent variable
    beta : float
        precision of observation noise (assumed to be known)
    a0 : float
        a parameter of prior gamma distribution Gamma(alpha|a0,b0)
    b0 : float
        another parameter of prior gamma distribution Gamma(alpha|a0,b0)    
    iter_max : int, optional
        maximum number of iterations (the default is 100)
    
    Returns
    -------
    w_mean : mean of the variational posterior of w
    w_variance: covariance of the variational posterior of w
    a: parameter of the variational posterior of alpha
    b: parameter of the variational posterior of alpha
    """
    # YOUR CODE HERE
    raise NotImplementedError()

4. Implement the predictive distribution over $t$ given a new data point $x$ (see equation (10.105) in Bishop).

In [None]:
def predict(X, w_mean, w_variance):
    """
    Make a prediction based on the input.

    Parameters
    ----------
    X : (N, D) np.ndarray
        independent variable
    w_mean : (D,) np.ndarray
        mean of the variational posterior of w
    w_variance : (D, D) np.ndarray
        covariance of the variational posterior of w

    Returns
    -------
    y : (N,) np.ndarray
        mean of predictive distribution
    y_std : (N,) np.ndarray
        standard deviation of predictive distribution
    """
    # YOUR CODE HERE
    raise NotImplementedError()

5. Plot the predictive distribution (both mean and standard deviation) for the following situations: maximum iterations in the variational inference algorithm set to 1, 2, 3, 5, 10.
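As a starting point for the figure layout only, here is a hedged sketch of a helper that draws a mean curve with a one-standard-deviation band, including the title, axis labels, and legend required by the instructions. The helper name `plot_predictive` is hypothetical, and the mean and standard deviation below are placeholder arrays, not the output of a fitted model.

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_predictive(ax, x, mean, std, title):
    """Draw a predictive mean with a +/- one standard deviation band on ax."""
    ax.plot(x, mean, c="r", label="predictive mean")
    ax.fill_between(x, mean - std, mean + std, color="r", alpha=0.2,
                    label=r"mean $\pm$ 1 std")
    ax.set_title(title)
    ax.set_xlabel("$x$")
    ax.set_ylabel("$t$")
    ax.legend()
    return ax

# Placeholder values (NOT a fitted model), used only to exercise the helper
x_demo = np.linspace(-5, 5, 100)
mean_demo = x_demo * (x_demo - 5) * (x_demo + 5)
std_demo = np.full_like(x_demo, 10.0)

# One panel per iteration budget requested in the exercise
iters = [1, 2, 3, 5, 10]
fig, axes = plt.subplots(1, len(iters), figsize=(20, 4), sharey=True)
for ax, n_iter in zip(axes, iters):
    plot_predictive(ax, x_demo, mean_demo, std_demo, f"iter_max = {n_iter}")
```

In an actual solution the placeholder arrays would be replaced by the outputs of `predict` for each value of `iter_max`, followed by `plt.show()`.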

In [None]:
"""
Plot predictive distributions.
"""
# YOUR CODE HERE
raise NotImplementedError()

Comment on the results, in particular on the speed of convergence.

YOUR ANSWER HERE

6. One of the quantities of interest in variational inference is the variational lower bound. The lower bound for Bayesian linear regression is given in Bishop in equations (10.107) - (10.112). Assume $a_{0}=0$ and $b_{0}=0$. Implement the function that computes the variational lower bound.  
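The lower-bound terms involve expectations of $\alpha$ and $\ln\alpha$ under a gamma distribution. As a small building block, the sketch below evaluates $\mathbb{E}[\alpha] = a/b$ and $\mathbb{E}[\ln\alpha] = \psi(a) - \ln b$ for $\mathrm{Gam}(\alpha|a, b)$ and compares them against Monte Carlo estimates. It assumes SciPy is installed; the helper name `gamma_expectations` and the values `a_demo = 3.0`, `b_demo = 2.0` are arbitrary choices for illustration.

```python
import numpy as np
from scipy.special import digamma

def gamma_expectations(a, b):
    """Return E[alpha] and E[ln alpha] under Gam(alpha | a, b) with rate b."""
    return a / b, digamma(a) - np.log(b)

# Arbitrary illustrative parameters (assumed, not taken from the assignment)
a_demo, b_demo = 3.0, 2.0
e_alpha, e_log_alpha = gamma_expectations(a_demo, b_demo)

# Monte Carlo check; NumPy uses scale = 1 / rate
rng_check = np.random.default_rng(0)
samples = rng_check.gamma(shape=a_demo, scale=1.0 / b_demo, size=200_000)
mc_alpha = samples.mean()
mc_log_alpha = np.log(samples).mean()
```

The closed-form and sampled values should agree to about two decimal places, which is a cheap sanity check before assembling the full bound.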

In [None]:
def LB(X, t, M, beta, a0, b0, a, b, w_mean, w_var):
    """
    Compute the variational lower bound.

    Parameters
    ----------
    X : (N, M) np.ndarray
        training independent variables
    t : (N,) np.ndarray
        training dependent variable
    M : integer
        order of the polynomial
    beta : float
        precision of observation noise (assumed to be known)
    a0 : float
        a parameter of the prior gamma distribution Gamma(alpha|a0,b0)
    b0 : float
        another parameter of the prior gamma distribution Gamma(alpha|a0,b0)
    a : float
        a parameter of the posterior gamma distribution Gamma(alpha|a,b)
    b : float
        another parameter of the posterior gamma distribution Gamma(alpha|a,b)
    w_mean : np.ndarray
        the mean vector of the variational posterior distribution of the weights
    w_var : np.ndarray
        the covariance matrix of the variational posterior distribution of the weights

    Returns
    -------
    LB : lower bound
    """
    # YOUR CODE HERE
    raise NotImplementedError()

7. Produce a plot of the lower bound against different orders of the polynomial (see fig. 10.9 in Bishop as an example).

In [None]:
"""
Plot the lower bound versus different orders of polynomials.
"""
# YOUR CODE HERE
raise NotImplementedError()

How can you use the variational lower bound for model selection in a regression problem? How would it compare to the maximum likelihood solution?

YOUR ANSWER HERE

8. What would we gain/lose by using MCMC (for example Gibbs sampling) for this problem instead of variational inference? For which types of problems is MCMC more suitable than VI? For which types of problems is VI more suitable than MCMC?

YOUR ANSWER HERE
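For reference when thinking about Question 8, the sketch below shows one possible Gibbs sampler for the same model with fixed noise precision $\beta$. It alternates between the Gaussian conditional of $\mathbf{w}$ given $\alpha$ and the gamma conditional of $\alpha$ given $\mathbf{w}$. It is an illustration under stated assumptions, not a reference solution: the function name `gibbs_blr`, the burn-in length, the starting value of $\alpha$, and the straight-line demo data are all hypothetical choices.

```python
import numpy as np

def gibbs_blr(Phi, t, beta, a0, b0, n_samples=2000, burn_in=500, seed=0):
    """Gibbs sampler for p(w, alpha | t); returns post-burn-in samples."""
    rng = np.random.default_rng(seed)
    n_features = Phi.shape[1]
    PtP, Ptt = Phi.T @ Phi, Phi.T @ t
    alpha = 1.0  # arbitrary starting value
    w_samples, alpha_samples = [], []
    for i in range(n_samples + burn_in):
        # w | alpha, t is Gaussian with precision alpha*I + beta*Phi^T Phi
        S = np.linalg.inv(alpha * np.eye(n_features) + beta * PtP)
        S = 0.5 * (S + S.T)  # symmetrise against round-off
        w = rng.multivariate_normal(beta * S @ Ptt, S)
        # alpha | w is Gam(a0 + M/2, b0 + w^T w / 2) in the rate parameterisation
        alpha = rng.gamma(shape=a0 + 0.5 * n_features,
                          scale=1.0 / (b0 + 0.5 * w @ w))
        if i >= burn_in:
            w_samples.append(w)
            alpha_samples.append(alpha)
    return np.array(w_samples), np.array(alpha_samples)

# Hypothetical demo data: a noisy straight line t = 1 + 2x
rng_data = np.random.default_rng(1)
x_line = np.linspace(-1, 1, 50)
Phi_line = np.column_stack([np.ones_like(x_line), x_line])
t_line = 1.0 + 2.0 * x_line + rng_data.normal(scale=0.1, size=x_line.shape)

w_s, alpha_s = gibbs_blr(Phi_line, t_line, beta=100.0, a0=1e-2, b0=1e-2)
```

The sampler returns draws rather than a parametric approximation, so posterior summaries such as `w_s.mean(axis=0)` carry Monte Carlo error that shrinks only as more samples are drawn.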