In [None]:
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import multivariate_normal, norm
%matplotlib inline

# Session 1

# Linear Models

We have a dataset $D = \{(x_n,y_n)\}$, $n = 1,...,N$, of corresponding input observations $x$ and output observations $y$ for $N$ independent observations. Let us model the function $f(x)$ with the linear expression:

$$ f(x) = w_0+w_1 x $$

The likelihood is normal and centered on the function $f(x)$:

$$
p(y_n | w_0,w_1) = \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left( - \frac{\big(y_n - (w_0+w_1x_n)\big)^2}{2\sigma^2}\right) 
$$

The total likelihood is the product of the likelihood for each data point:

$$ p(D|w_0,w_1) = \prod_{n=1}^N p(y_n|w_0,w_1) $$

Instead of computing the likelihood directly, it is often beneficial to work with the logarithm of the likelihood. When the goal is to maximize the likelihood this is acceptable, as 
the logarithm is monotonically increasing, so maximizing the likelihood is equivalent to maximizing the log-likelihood.

If we use the log-likelihood, we end up with a sum instead of the product:

$$ \ln p(D|w_0,w_1) = \sum_{n=1}^N \ln p(y_n|w_0,w_1).    $$
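One practical reason for preferring the sum can be illustrated numerically. The following is a minimal sketch, assuming synthetic data with $w_0 = 7$, $w_1 = 3$, $\sigma^2 = 1$ and $N = 1000$ (the seed and variable names are arbitrary choices for illustration): the product of many small densities underflows in floating point, while the sum of their logarithms stays finite.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, for reproducibility only
N, w0, w1, sigma = 1000, 7.0, 3.0, 1.0  # illustrative values (assumption)

x = rng.uniform(-1.0, 1.0, size=N)
y = w0 + w1 * x + rng.normal(0.0, sigma, size=N)

# Per-point normal densities p(y_n | w_0, w_1), evaluated at the true weights.
resid = y - (w0 + w1 * x)
dens = np.exp(-resid**2 / (2 * sigma**2)) / np.sqrt(2 * np.pi * sigma**2)

likelihood = np.prod(dens)             # product of N values that are each below 0.4
log_likelihood = np.sum(np.log(dens))  # sum of N moderate negative numbers

print("product of densities:", likelihood)
print("sum of log-densities:", log_likelihood)
```

The product cannot be compared across different weights once it has underflowed, whereas the log-likelihood can.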

And we can then maximize the likelihood with respect to $w_0$ and $w_1$ by, equivalently, minimizing the negative log-likelihood:

$$ -\ln L(w_0,w_1)=-\ln p(D|w_0,w_1) = -\sum_{n=1}^N \ln p(y_n|w_0,w_1) = \frac{1}{2\sigma^2}\sum_{n=1}^N \big(y_n - (w_0+w_1 x_n)\big)^2-N\ln\left(\frac{1}{\sqrt{2\pi\sigma^2}}\right).
$$
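
As a brief check of the last equality (a worked step that uses only the normal likelihood defined above), the logarithm of a single term is

$$ \ln p(y_n|w_0,w_1) = \ln\left(\frac{1}{\sqrt{2\pi\sigma^2}}\right) - \frac{\big(y_n - (w_0+w_1x_n)\big)^2}{2\sigma^2}. $$

Summing over the $N$ data points and changing the sign gives the squared-error sum scaled by $\frac{1}{2\sigma^2}$, plus the constant $-N\ln\left(\frac{1}{\sqrt{2\pi\sigma^2}}\right)$, which does not depend on the weights.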



### Questions

$\star$ Find the derivatives of this negative log-likelihood $-\ln L$ with respect to the two weights, $\frac{d}{dw_0}(-\ln L)$ and $\frac{d}{dw_1}(-\ln L)$. This gives you two equations in two unknowns, which we will call $\hat{w}_0$ and $\hat{w}_1$ so as not to confuse them with the true $w_0$ and $w_1$. Show that this system of equations is identical to the following (see hints for definitions):

$$ 
\begin{bmatrix}1 & \overline{x}\\ \overline{x} & \overline{xx}\end{bmatrix}
\begin{bmatrix} \hat{w}_0 \\ \hat{w}_1\end{bmatrix}
=
\begin{bmatrix} \overline{y} \\ \overline{xy}\end{bmatrix}
$$
    


In the following exercises you are going to construct a synthetic data-set with a one-dimensional input variable and a one-dimensional output variable. Based on this you are going to compare the estimated weight vector with the true weights (used to generate the data), and examine the influence of the number of points in the data-set.

$\star$ Construct a synthetic data-set $D$ with $N = 1000$ data points, using the following procedure: 
  - For each data point $n$, draw $x_n$ uniformly and independently such that $x_n\in[-1,1]$.
  - Assume that the corresponding target value $y_n$ is given by the deterministic function $f(x_n)$ with added normal-distributed noise $\epsilon_n\sim\mathcal{N}(0,\sigma^2)$.
  - Let $w_0 = 7$, $w_1 = 3$, $\sigma^2 = 1$. For each data point $n$ generate the target value $y_n$, where a random noise value is generated and added independently for each $y$, such that:  
  
  $$ y_n = w_0 + w_1x_n + \epsilon_n$$

$\star$ Estimate the weights $w_0$ and $w_1$ for the generated data by solving the linear system above. You can use the `np.linalg.solve` function to solve it.

$\star$ For all values $n = 1,...,N$, estimate the weights as above for the reduced data set containing the first $n$ data points. Plot the number of samples against the distance between the true and estimated weights $\sqrt{(\hat{w}_0-w_0)^2+(\hat{w}_1 - w_1)^2}$. How does the number of data points influence this distance?

$\star$ Repeat the experiment for different variances of the normal noise, $\sigma^2$. How does this affect the estimated weights?
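
The data construction and the solve step described above can be sketched as follows. This is one possible implementation, not the reference solution: the seed and the variable names (`A`, `b`, `w_hat`) are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)  # arbitrary seed
N, w0, w1, sigma2 = 1000, 7.0, 3.0, 1.0

# Inputs drawn uniformly on [-1, 1], targets from the linear model plus noise.
x = rng.uniform(-1.0, 1.0, size=N)
eps = rng.normal(0.0, np.sqrt(sigma2), size=N)  # normal() takes a standard deviation
y = w0 + w1 * x + eps

# Build the 2x2 system from the sample averages and solve for the weights.
A = np.array([[1.0, x.mean()],
              [x.mean(), (x * x).mean()]])
b = np.array([y.mean(), (x * y).mean()])
w_hat = np.linalg.solve(A, b)

print("estimated (w0, w1):", w_hat)
```

Changing `sigma2` and rerunning is one way to explore the last question.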
    
### Hints

$\bullet$ When taking the derivative, you can remove all terms that do not depend on the weights.

$\bullet$ You may find the following definitions useful for simplifying your derivations
$$ 
    \overline{x} = \frac{1}{N}\sum_{n=1}^N x_n\;, \qquad \overline{y} = \frac{1}{N}\sum_{n=1}^N y_n\;, \qquad \overline{xx} = \frac{1}{N}\sum_{n=1}^N x_n^2\;, \qquad \overline{xy} = \frac{1}{N}\sum_{n=1}^N x_ny_n\;.
$$
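
The averages defined above can be computed for every prefix of the data at once with cumulative sums. The sketch below is a hedged illustration (seed and variable names are assumptions): it solves the system for the first $n$ points and records the distance to the true weights. It starts at $n = 2$, since with a single point $\overline{xx} = \overline{x}^2$ and the matrix is singular.

```python
import numpy as np

rng = np.random.default_rng(2)  # arbitrary seed
N, w0, w1, sigma2 = 1000, 7.0, 3.0, 1.0
x = rng.uniform(-1.0, 1.0, size=N)
y = w0 + w1 * x + rng.normal(0.0, np.sqrt(sigma2), size=N)

# Running averages over the first n points, for n = 1, ..., N.
n = np.arange(1, N + 1)
x_bar = np.cumsum(x) / n
y_bar = np.cumsum(y) / n
xx_bar = np.cumsum(x * x) / n
xy_bar = np.cumsum(x * y) / n

errors = np.full(N, np.nan)  # errors[k] belongs to the first k + 1 points
for k in range(1, N):        # skip k = 0 (one point): the system is singular
    A = np.array([[1.0, x_bar[k]], [x_bar[k], xx_bar[k]]])
    b = np.array([y_bar[k], xy_bar[k]])
    w_hat = np.linalg.solve(A, b)
    errors[k] = np.hypot(w_hat[0] - w0, w_hat[1] - w1)

print("distance with all N points:", errors[-1])
```

A call such as `plt.loglog(n[1:], errors[1:])` would then show the trend; logarithmic axes tend to make it easier to read.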

In [1]:
#code

# Session 2

## The Generative Story of Linear Regression 
The likelihood of observing $y$ given input $x$ in a probabilistic linear regression model is given by

$$ p(y_n|x_n,w_0,w_1) = \mathcal{N}(y_n;w_0+w_1 x_n,\sigma^2)$$

So $y$ is assumed to be equal to $w_0+w_1 x$ plus some normal-distributed noise $\epsilon\sim\mathcal{N}(0,\sigma^2)$.

To form a complete generative story, we also assume a prior distribution. Let us take it to be 

$$ p(w_0,w_1)=\mathcal{N}(w_0;0,\sigma_0^2)\mathcal{N}(w_1;0,\sigma^2_1)$$

and we sample the inputs $x$ as

$$p(x_n)=\operatorname{Uniform}_{[0,1]}(x_n).$$

* Construct a sampler that follows the generative story and returns samples of $w_0$, $w_1$, and a dataset of size $N$. Note that the $N$ members of the dataset share the same $w_0$ and $w_1$. Illustrate the effect of $\sigma_0^2$, $\sigma_1^2$, and $\sigma^2$.
* Draw a final dataset of size $N=5$ with $\sigma^2=(0.2)^2$, $\sigma_0^2=\sigma_1^2=1$ to use in the following parts.
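
A sampler following the generative story above might look like the sketch below. The function name `sample_generative_story` and its argument names are hypothetical, chosen only for this illustration; all arguments are variances, so square roots are taken before calling NumPy's `normal`, which expects a standard deviation.

```python
import numpy as np

def sample_generative_story(N, sigma2, sigma2_0, sigma2_1, rng):
    """Draw weights from the prior, then a dataset of size N that shares them."""
    w0 = rng.normal(0.0, np.sqrt(sigma2_0))  # prior on the intercept
    w1 = rng.normal(0.0, np.sqrt(sigma2_1))  # prior on the slope
    x = rng.uniform(0.0, 1.0, size=N)        # inputs, uniform on [0, 1]
    y = w0 + w1 * x + rng.normal(0.0, np.sqrt(sigma2), size=N)
    return w0, w1, x, y

rng = np.random.default_rng(3)  # arbitrary seed
w0, w1, x, y = sample_generative_story(
    N=5, sigma2=0.2**2, sigma2_0=1.0, sigma2_1=1.0, rng=rng
)
print("weights:", w0, w1)
print("x:", x)
print("y:", y)
```

Calling the function repeatedly with different variances, and plotting the resulting lines and points, is one way to illustrate their effect.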

In [None]:
# code

## Composition of Likelihoods
We now want to see how the observation likelihoods look on their own, how they combine into the data likelihood, and how they combine with the prior. The most probable parameter is going to be where the posterior is highest.

* Plot a contour of the prior.
* Plot a contour of the likelihood for the first datapoint in your dataset. Pick three parameter pairs where the likelihood is high (by inspecting the contour plot; you can plot them on top of the likelihood if it helps) and plot their corresponding linear functions along with the datapoint. How do the lines relate to the datapoint?
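
The two surfaces in the list above can be evaluated on a weight grid before plotting. This is a minimal sketch under stated assumptions: the helper `normal_pdf`, the grid resolution of 101 points, and the datapoint `(x1, y1)` are all hypothetical stand-ins, so substitute the first point of your own dataset.

```python
import numpy as np

def normal_pdf(z, mean, var):
    """Density of a univariate normal with the given mean and variance."""
    return np.exp(-(z - mean)**2 / (2 * var)) / np.sqrt(2 * np.pi * var)

span = np.linspace(-3, 3, 101)  # grid resolution is an arbitrary choice
W0, W1 = np.meshgrid(span, span)

sigma2, sigma2_0, sigma2_1 = 0.2**2, 1.0, 1.0
x1, y1 = 0.5, 1.0  # hypothetical first datapoint

# Prior over (w0, w1): a product of two independent zero-mean normals.
prior = normal_pdf(W0, 0.0, sigma2_0) * normal_pdf(W1, 0.0, sigma2_1)

# Likelihood of the single datapoint, as a function of the weights.
lik1 = normal_pdf(y1, W0 + W1 * x1, sigma2)

# Locate one grid point where the likelihood is highest.
i, j = np.unravel_index(np.argmax(lik1), lik1.shape)
print("a high-likelihood pair:", W0[i, j], W1[i, j])
```

Either array can be passed to `plt.contourf(W0, W1, ...)` as in the grid cell of this notebook.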

In [None]:
# grid 
span = np.linspace(-3,3)
W0, W1 = np.meshgrid(span,span)
# contour plotting: 
# plt.contourf(W0,W1,f(W0,W1),cmap='RdBu_r') for filled contours (or plt.contour(...) for normal)

In [None]:
# plot of prior

In [None]:
# plot of likelihood contour for the first datapoint

In [None]:
# plot of high likelihood lines

Finally, make a grid of $5\times 3$ plots using `plt.subplot`. Let each row correspond to a datapoint.

* In the first column, plot the contour of the likelihood function of the corresponding datapoint.
* In the second column, plot the contour of the product of the likelihood function of the current datapoint and all the previous ones.
* In the third column, plot the contour of the product of the prior and the joint likelihood visualized in the second column.

Add a dot (e.g. using `plt.scatter`) illustrating the true solution you used to generate the dataset to the plots in the third column.
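
The three columns can be computed in log space, where products become sums and a cumulative sum gives the running joint likelihood. The sketch below is one possible approach under stated assumptions: it draws its own small dataset with an arbitrary seed, and the helper `log_normal_pdf` is a hypothetical name. Plotting is left out.

```python
import numpy as np

def log_normal_pdf(z, mean, var):
    """Log-density of a univariate normal with the given mean and variance."""
    return -0.5 * np.log(2 * np.pi * var) - (z - mean)**2 / (2 * var)

rng = np.random.default_rng(4)  # arbitrary seed
sigma2, sigma2_0, sigma2_1 = 0.2**2, 1.0, 1.0
w0_true = rng.normal(0.0, np.sqrt(sigma2_0))
w1_true = rng.normal(0.0, np.sqrt(sigma2_1))
x = rng.uniform(0.0, 1.0, size=5)
y = w0_true + w1_true * x + rng.normal(0.0, np.sqrt(sigma2), size=5)

span = np.linspace(-3, 3, 101)
W0, W1 = np.meshgrid(span, span)

log_prior = log_normal_pdf(W0, 0.0, sigma2_0) + log_normal_pdf(W1, 0.0, sigma2_1)

# Column 1: one log-likelihood surface per datapoint, shape (5, 101, 101).
log_lik = np.stack([log_normal_pdf(yn, W0 + W1 * xn, sigma2)
                    for xn, yn in zip(x, y)])
# Column 2: running joint log-likelihood of datapoints 1..i.
log_joint = np.cumsum(log_lik, axis=0)
# Column 3: unnormalized log-posterior (the prior broadcasts over the first axis).
log_post = log_joint + log_prior
```

For row `i`, a panel could then be drawn with `plt.contourf(W0, W1, np.exp(log_post[i] - log_post[i].max()))`; subtracting the maximum before exponentiating avoids underflow without changing the contour shape.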

In [None]:
# grid of plots

## MAP
Plot the 5 MAP solutions corresponding to the maxima of the five plots in column 3 along with each datapoint. Color the MAP solution and the datapoint according to the row (number of datapoints observed). Plot the true solution as well.

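For this example, you can use `np.argmax` to find the position of the highest value in each array. It returns an index into the flattened array, which `np.unravel_index` converts to a row and column of the grid. This is a grid approximation to the true MAP.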
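
Reading a MAP estimate off a grid can be sketched as follows. The surface `log_post` here is a hypothetical stand-in with a known peak at $(0.5, -1.0)$, used only so the example is self-contained; in the exercise it would be one of the posterior arrays from the third column.

```python
import numpy as np

span = np.linspace(-3, 3, 101)
W0, W1 = np.meshgrid(span, span)

# Stand-in surface (assumption): a single peak at w0 = 0.5, w1 = -1.0.
log_post = -((W0 - 0.5)**2 + (W1 + 1.0)**2)

flat_index = np.argmax(log_post)                        # index into the flattened array
i, j = np.unravel_index(flat_index, log_post.shape)     # back to (row, column)
w0_map, w1_map = W0[i, j], W1[i, j]

print("grid MAP:", w0_map, w1_map)
```

The result can only be as accurate as the grid spacing, which is why it approximates the true MAP.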

In [None]:
# code

## Priors
The posterior balances the prior and the data. If there is not a lot of data, the prior will have a bigger impact.

Take the likelihood function of a single datapoint, like in the first column in the grid of plots you did before. Then make four plots: 

1. with the likelihood alone.
2. with the likelihood multiplied by a normal prior with a large variance.
3. with the likelihood multiplied by the normal prior that we used to generate the data.
4. with the likelihood multiplied by a normal prior with a very small variance.

Note that with enough data, even a strong prior will eventually be dominated by the likelihood.
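
The pull of the prior on a single datapoint can also be checked numerically before plotting. This is a hedged sketch: the datapoint `(x1, y1)`, the three prior variances, and the helper names `log_normal_pdf` and `grid_map` are all assumptions made for illustration.

```python
import numpy as np

def log_normal_pdf(z, mean, var):
    """Log-density of a univariate normal with the given mean and variance."""
    return -0.5 * np.log(2 * np.pi * var) - (z - mean)**2 / (2 * var)

span = np.linspace(-3, 3, 201)
W0, W1 = np.meshgrid(span, span)

sigma2 = 0.2**2
x1, y1 = 0.5, 1.0  # hypothetical single datapoint
log_lik = log_normal_pdf(y1, W0 + W1 * x1, sigma2)

def grid_map(prior_var):
    """Grid MAP under a zero-mean normal prior with equal variance on both weights."""
    log_post = (log_lik
                + log_normal_pdf(W0, 0.0, prior_var)
                + log_normal_pdf(W1, 0.0, prior_var))
    i, j = np.unravel_index(np.argmax(log_post), log_post.shape)
    return W0[i, j], W1[i, j]

for prior_var in (100.0, 1.0, 0.01):  # wide, generating, and narrow priors
    print(prior_var, grid_map(prior_var))
```

With the narrow prior the grid MAP sits much closer to the origin than with the other two, which is the effect the four plots are meant to show.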

In [None]:
# code