$$ \LaTeX \text{ command declarations here.}
\newcommand{\R}{\mathbb{R}}
\renewcommand{\vec}[1]{\mathbf{#1}}
$$

In [115]:
# math (a __future__ import must come before any other statement)
from __future__ import division

# plotting
%matplotlib inline
from matplotlib import pyplot as plt

# scientific
import numpy as np
from sklearn import svm
from sklearn.linear_model import Ridge
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline
# GridSearchCV and learning_curve live in sklearn.model_selection;
# the old sklearn.grid_search and sklearn.learning_curve modules were removed.
from sklearn.model_selection import GridSearchCV, learning_curve
from sklearn.kernel_ridge import KernelRidge


# EECS 545:  Machine Learning
## Lecture 10:  Bayesian Linear Regression and Gaussian Processes
* Instructor:  **Jacob Abernethy**
* Date:  February 11, 2016


## Linear Gaussian Distributions

* A linear combination of Gaussian random variables is itself Gaussian.


    - Its marginal distribution is also Gaussian.
    - Its conditional distribution is also Gaussian.

## Bayes’ Theorem for Gaussian Variables

* Given:
$$ p(x) = \mathcal{N}(x|\mu , \Lambda^{-1}) $$
$$ p(y|x) = \mathcal{N}(y|Ax+b , L^{-1}) $$

* we have:
$$ p(y) = \mathcal{N}(y|A\mu + b , L^{-1} + A\Lambda^{-1}A^T) $$
$$ p(x|y) = \mathcal{N}(x|\Sigma\{A^TL(y-b) + \Lambda\mu\} , \Sigma) $$

where $ \Sigma = (\Lambda + A^TLA)^{-1}$
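
As a quick numerical sanity check of these formulas, here is a minimal NumPy sketch. The function name `linear_gaussian` and the toy numbers are illustrative assumptions, not part of the lecture.

```python
import numpy as np

def linear_gaussian(mu, Lam, A, b, L, y):
    """Return (mean, cov) of p(y) and (mean, cov) of p(x|y).

    Model: p(x) = N(mu, Lam^{-1}), p(y|x) = N(A x + b, L^{-1}).
    """
    Lam_inv = np.linalg.inv(Lam)
    L_inv = np.linalg.inv(L)
    # Marginal: p(y) = N(A mu + b, L^{-1} + A Lam^{-1} A^T)
    y_mean = A @ mu + b
    y_cov = L_inv + A @ Lam_inv @ A.T
    # Conditional: p(x|y) = N(Sigma {A^T L (y - b) + Lam mu}, Sigma)
    Sigma = np.linalg.inv(Lam + A.T @ L @ A)
    x_mean = Sigma @ (A.T @ L @ (y - b) + Lam @ mu)
    return (y_mean, y_cov), (x_mean, Sigma)

# Toy problem (hypothetical values): 2-d x, 1-d y.
mu = np.array([0.0, 1.0])
Lam = np.array([[2.0, 0.0], [0.0, 1.0]])   # prior precision of x
A = np.array([[1.0, -1.0]])
b = np.array([0.5])
L = np.array([[4.0]])                      # noise precision of y given x
y_obs = np.array([0.2])

(y_mean, y_cov), (x_mean, x_cov) = linear_gaussian(mu, Lam, A, b, L, y_obs)
```

Note that `Lam` and `L` are precisions, so the covariances that appear in the result are their inverses.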

# Bayesian Linear Regression

## What is Regression?

* Given a set of observations $ x = \{ x_1, \ldots, x_N \}$ and the corresponding target values $ t = \{ t_1, \ldots, t_N \} $



* We want to learn a function $y(x) \approx t$ to predict future values. 
    - We have just learned to find the (regularized) maximum likelihood weights $w_{ML}$ and to predict with $y(x,w_{ML})$.
    
    $$ w_{ML} = (\lambda I + \Phi^T\Phi)^{-1} \Phi^T t $$
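
A minimal NumPy sketch of the closed-form solution above. The polynomial basis, the helper names `design_matrix` and `fit_regularized`, and the synthetic data are assumptions made for illustration.

```python
import numpy as np

def design_matrix(x, degree):
    """Polynomial features phi_j(x) = x**j for j = 0..degree (an assumed basis)."""
    return np.vander(x, degree + 1, increasing=True)

def fit_regularized(Phi, t, lam):
    """Solve (lam I + Phi^T Phi) w = Phi^T t without forming an explicit inverse."""
    M = Phi.shape[1]
    return np.linalg.solve(lam * np.eye(M) + Phi.T @ Phi, Phi.T @ t)

# Synthetic data from a noisy line t = 1 - 2x (hypothetical).
rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 20)
t = 1.0 - 2.0 * x + 0.1 * rng.standard_normal(x.size)

Phi = design_matrix(x, degree=1)
w = fit_regularized(Phi, t, lam=1e-3)
```

Using `np.linalg.solve` rather than an explicit matrix inverse is a common choice for numerical stability; the result is the same weight vector.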

## Overview: Bayesian Linear Regression

* With a likelihood: $ p(t|X,w,\beta) = \prod_{n=1}^{N} \mathcal{N}(t_n|w^T \phi(x_n), \beta^{-1}) $ 

* and a prior: $ p(w) = \mathcal{N}(w|m_0, S_0) $


* We can get a posterior as:
$$ p(w|t) = \mathcal{N}(w|m_N, S_N) $$
where: $m_N = S_N(S_0^{-1}m_0 + \beta \Phi^Tt)$ and $S_N^{-1} = S_0^{-1} + \beta\Phi^T\Phi $

## Simplifying the Prior

* Zero-mean isotropic Gaussian:
$$ p(w|\alpha) = \mathcal{N}(w|0, \alpha^{-1}I) $$


* The corresponding posterior is:
$$ p(w|t) = \mathcal{N}(w|m_N, S_N) $$



* where $m_N = \beta S_N\Phi^Tt$ and $S_N^{-1} = \alpha I + \beta \Phi^T\Phi $ 

## Derivation of posterior distribution

* Use linear Gaussian distributions:
$$ p(w) = \mathcal{N}(w|0, \alpha^{-1}I)$$
$$ p(t|w,x) = \mathcal{N}(t|\Phi w, \beta^{-1}I) $$


* Conditional distribution:
$$ p(w|t,x) = \mathcal{N}(w|\Sigma(\beta\Phi^Tt), \Sigma) $$

where $ \Sigma = (\alpha I + \beta \Phi^T \Phi)^{-1} $
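
One way to see where this comes from (a sketch that matches symbols against Bayes' theorem for Gaussian variables): take $x \to w$, $y \to t$, $\mu = 0$, $\Lambda = \alpha I$, $A = \Phi$, $b = 0$ and $L = \beta I$. Then

$$ \Sigma = (\Lambda + A^TLA)^{-1} = (\alpha I + \beta\Phi^T\Phi)^{-1} $$
$$ \Sigma\{A^TL(y-b) + \Lambda\mu\} = \Sigma(\beta\Phi^Tt) $$

so the posterior mean is $m_N = \beta S_N\Phi^Tt$ and the posterior covariance is $S_N = \Sigma$.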



## Recall

$$ 
\begin{array}{rcl}
p(x) &= &\mathcal{N}(x|\mu, \Lambda^{-1})\\
p(y|x) &= &\mathcal{N}(y|Ax+b, L^{-1})\\
p(y) &= &\mathcal{N}(y|A\mu + b , L^{-1} + A\Lambda^{-1}A^T)\\
p(x|y) &= &\mathcal{N}(x|\Sigma\{A^TL(y-b) + \Lambda\mu\} , \Sigma)\\
\Sigma &= &(\Lambda + A^TLA)^{-1}
\end{array}
$$

## Recall: Maximum Likelihood

* Maximizing the log posterior is the same as minimizing the sum-of-squares error with a regularization term:
$$ \ln p(w|t) = - \frac{\beta}{2}\sum_{n=1}^{N}\{ t_n - w^T\phi(x_n)\}^2 - \frac{\alpha}{2}w^Tw + \text{constant} $$


* This is the same objective that defines $w_{ML}$, with $ \lambda = \frac{\alpha}{\beta}$


* So the solution is (same as $w_{ML}$):
$$ w_{MAP} = (\lambda I + \Phi^T \Phi)^{-1}\Phi^Tt $$

* But now we have the variance on $w$ as well!
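
To see why $\lambda = \frac{\alpha}{\beta}$, here is a short worked step that starts from the posterior mean $m_N = \beta S_N\Phi^Tt$ with $S_N^{-1} = \alpha I + \beta\Phi^T\Phi$:

$$ m_N = \beta(\alpha I + \beta\Phi^T\Phi)^{-1}\Phi^Tt = \left(\frac{\alpha}{\beta}I + \Phi^T\Phi\right)^{-1}\Phi^Tt = w_{MAP} $$

The posterior is Gaussian, so its mode and its mean coincide.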

## Sequential Bayesian Learning

* Simple model: $y(x,w) = w_0 + w_1 x$

**Posterior $\propto$ Prior $\times$ Likelihood**
<img src = "sbl.png">

## Sequential Bayesian Learning

* Samples drawn from posterior

$\hspace{5em}$likelihood (for a given example) $\hspace{8em}$ posterior $\hspace{12em}$ data
<img src = "sbl2.png">

## Sequential Bayesian Learning

* Sample lines drawn from posterior.

$\hspace{5em}$likelihood (for a given example) $\hspace{8em}$ posterior $\hspace{12em}$ data
<img src = "sbl3.png">

## Sequential Bayesian Learning

* Sample lines drawn from posterior

$\hspace{5em}$likelihood (for a given example) $\hspace{8em}$ posterior $\hspace{12em}$ data
<img src = "sbl4.png">
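
The sequential picture can be sketched in a few lines of NumPy: each observation turns the current posterior into the prior for the next one, using the general update $S_N^{-1} = S_0^{-1} + \beta\Phi^T\Phi$, $m_N = S_N(S_0^{-1}m_0 + \beta\Phi^Tt)$ with a single row of $\Phi$. The function name `update`, the true weights and the hyperparameter values are assumptions for illustration.

```python
import numpy as np

def update(m0, S0, phi, t, beta):
    """One Bayesian update of N(m0, S0) with a single observation (phi, t)."""
    S0_inv = np.linalg.inv(S0)
    S_new = np.linalg.inv(S0_inv + beta * np.outer(phi, phi))
    m_new = S_new @ (S0_inv @ m0 + beta * phi * t)
    return m_new, S_new

rng = np.random.default_rng(1)
alpha, beta = 2.0, 25.0
w_true = np.array([-0.3, 0.5])          # hypothetical "true" line
x = rng.uniform(-1, 1, size=20)
t = w_true[0] + w_true[1] * x + rng.normal(0, beta ** -0.5, size=x.size)
Phi = np.column_stack([np.ones_like(x), x])   # features for y = w0 + w1 x

# Start from the prior N(0, alpha^{-1} I) and absorb one point at a time.
m, S = np.zeros(2), np.eye(2) / alpha
for phi_n, t_n in zip(Phi, t):
    m, S = update(m, S, phi_n, t_n, beta)

# Batch posterior for comparison.
S_batch = np.linalg.inv(alpha * np.eye(2) + beta * Phi.T @ Phi)
m_batch = beta * S_batch @ Phi.T @ t
```

Processing the points one at a time or all at once should give the same posterior up to floating-point error, which is what the comparison with `m_batch` and `S_batch` is meant to show.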

## Predictive Distribution

* Our real goal is to predict $t$ given a new $x$, so we evaluate the predictive distribution:
$$ p(t|x,{\bf t}, \alpha, \beta) = \int p(t|x,w,\beta)\, p(w|{\bf t}, \alpha, \beta)\, dw $$


* where $ p(t|x,w,\beta) = \mathcal{N}(t|y(x,w), \beta^{-1}) $ and we just derived $ p(w|{\bf t}) = \mathcal{N}(w|m_N, S_N)$, with $m_N = \beta S_N \Phi^T{\bf t}$ and $S_N^{-1} = \alpha I + \beta \Phi^T \Phi$


* Note: $m_N$ is equal to $w_{MAP}$, the optimal solution of regularized linear regression.

## Predictive Distribution

* The integral is a convolution of Gaussians, so the result is:
$$ p(t|x,{\bf t}, \alpha, \beta) = \mathcal{N}(t|m_N^T\phi(x), \sigma_N^2(x)) $$ 


* where $ \sigma_N^2(x) = \frac{1}{\beta}+\phi(x)^TS_N\phi(x)$
    - Intuitively, this corresponds to noise in the data + uncertainty in $w$
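
A minimal sketch of the predictive mean and variance, assuming Gaussian basis functions and a sine-shaped toy target. The helper names, the hyperparameter values and the data are hypothetical.

```python
import numpy as np

def gaussian_basis(x, centers, s):
    """phi_j(x) = exp(-(x - mu_j)^2 / (2 s^2)), one column per center mu_j."""
    return np.exp(-(x[:, None] - centers[None, :]) ** 2 / (2 * s ** 2))

def posterior(Phi, t, alpha, beta):
    """Posterior N(m_N, S_N) over w under the prior N(0, alpha^{-1} I)."""
    S_N = np.linalg.inv(alpha * np.eye(Phi.shape[1]) + beta * Phi.T @ Phi)
    m_N = beta * S_N @ Phi.T @ t
    return m_N, S_N

def predictive(Phi_new, m_N, S_N, beta):
    """Mean m_N^T phi(x) and variance 1/beta + phi(x)^T S_N phi(x) per row."""
    mean = Phi_new @ m_N
    var = 1.0 / beta + np.sum((Phi_new @ S_N) * Phi_new, axis=1)
    return mean, var

rng = np.random.default_rng(0)
alpha, beta = 2.0, 25.0
centers = np.linspace(0, 1, 9)

# Training inputs only cover [0, 0.3], so the right part of the axis is unseen.
x_train = rng.uniform(0.0, 0.3, size=10)
t_train = np.sin(2 * np.pi * x_train) + rng.normal(0, beta ** -0.5, size=10)
m_N, S_N = posterior(gaussian_basis(x_train, centers, 0.1), t_train, alpha, beta)

# Query one training input and one point far from the data.
x_query = np.array([x_train[0], 0.9])
mean, var = predictive(gaussian_basis(x_query, centers, 0.1), m_N, S_N, beta)
```

The variance never drops below the noise floor $1/\beta$, and it is larger at the query point far from the training inputs, where the weights of the nearby basis functions are still essentially at their prior.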

## Predictive Distribution & Samples

* Using 9 Gaussian basis functions $ \phi_j(x) = \exp\left\{ -\frac{(x- \mu_j)^2}{2s^2} \right\}$

<img src = "n1.png" height = 200px, width = 300px> <img src = "n1b.png" height = 200px, width = 300px>
$\hspace{20em}N=1$ observed point



<img src = "n2.png" height = 200px, width = 300px> <img src = "n2b.png" height = 200px, width = 300px>
$\hspace{20em}N=2$ observed points


<img src = "n4.png" height = 200px, width = 300px> <img src = "n4b.png" height = 200px, width = 300px>
$\hspace{20em}N=4$ observed points



<img src = "n25.png" height = 200px, width = 300px> <img src = "n25b.png" height = 200px, width = 300px>
$\hspace{20em}N=25$ observed points

## Gaussian Processes

### Why?
* Here are some data points. What function did they come from?
<img src="gp1.png">


* GPs are a nice way of expressing this “prior on functions” idea.


* Applications: Regression and Classification

## Definition of GP

* A Gaussian process is defined as a **probability distribution over functions y(x)**, such that the set of values of y(x) evaluated at an arbitrary set of points $x_1 , x_2 , \ldots , x_n$ jointly have a **Gaussian distribution**.
    - Any finite subset of indices defines a multivariate Gaussian distribution, i.e., $(y(x_1), y(x_2), \ldots , y(x_n))$ is multivariate Gaussian.
    

* What determines the GP is
    - The mean function $\mu(x) = \mathop{\mathbb{E}}(y(x))$
    - The covariance function (kernel) $k(x,x')=\mathop{\mathbb{E}}\big((y(x)-\mu(x))(y(x')-\mu(x'))\big)$, which reduces to $\mathop{\mathbb{E}}(y(x)y(x'))$ when $\mu(x)=0$
    - In most applications, we take $\mu(x)=0$. Hence the prior is represented by the kernel.

## Covariance function of GP defines prior

* The figure shows samples of functions drawn from Gaussian processes for two different choices of kernel function.


<img src="gpleft.png">
$$k(x, x') = \exp(-\theta ||x - x'||_2^2) \hspace{4em} k(x, x') = \exp(-\theta ||x - x'||_1)$$
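
A sketch of how such samples can be drawn for 1-d inputs: evaluate the kernel on a grid, then sample from the resulting zero-mean multivariate Gaussian. The function names, the value of $\theta$ and the small diagonal "jitter" added for numerical stability are assumptions, not part of the lecture.

```python
import numpy as np

def sq_exp_kernel(x, xp, theta):
    """k(x, x') = exp(-theta * ||x - x'||_2^2) for 1-d inputs."""
    return np.exp(-theta * (x[:, None] - xp[None, :]) ** 2)

def abs_exp_kernel(x, xp, theta):
    """k(x, x') = exp(-theta * ||x - x'||_1) for 1-d inputs."""
    return np.exp(-theta * np.abs(x[:, None] - xp[None, :]))

def sample_gp_prior(K, n_samples, rng, jitter=1e-6):
    """Draw zero-mean samples y ~ N(0, K) using a Cholesky factor of K."""
    L = np.linalg.cholesky(K + jitter * np.eye(K.shape[0]))
    return (L @ rng.standard_normal((K.shape[0], n_samples))).T

rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 50)
smooth = sample_gp_prior(sq_exp_kernel(x, x, theta=4.0), 3, rng)
rough = sample_gp_prior(abs_exp_kernel(x, x, theta=4.0), 3, rng)
```

Each row of `smooth` and `rough` is one sampled function evaluated on the grid. Plotting the rows against `x` should show visibly smoother curves for the squared-exponential kernel than for the absolute-distance kernel.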

## Linear Regression Revisited

* Linear regression model: a linear combination of $M$ fixed basis functions given by $\phi(x)$, so that 
$$ y(x) = w^T \phi(x) $$


* Prior distribution: $ p(w) = \mathcal{N}(w|0, \alpha^{-1}I) $


* Given training data points $(x_1 , x_2 , \ldots , x_n)$, what is the joint distribution of $ y(x_1) , y(x_2) , \ldots, y(x_n)$?
    - ${\bf y}$ is the vector with elements $y_i = y(x_i)$, which is given by $$ {\bf y = \Phi w }$$ 
    - where $\Phi$ is the design matrix with elements $\Phi_{nk} = \phi_k(x_n)$

## Linear Regression Revisited

* $y = \Phi w$: $y$ is a linear combination of the Gaussian-distributed variables $w$, hence is itself Gaussian.

* Mean and covariance:
$$ \mathop{\mathbb{E}}[y] = \Phi\mathop{\mathbb{E}}[w] = 0 $$
$$ Cov(y)  =\mathop{\mathbb{E}}[yy^T] = \Phi\mathop{\mathbb{E}}[ww^T]\Phi^T = \frac{1}{\alpha}\Phi\Phi^T = K $$


* $K$ is the Gram matrix with elements $K_{nm} = \kappa(x_n,x_m) = \frac{1}{\alpha} \phi(x_n)^T\phi(x_m)$, and $\kappa(x,x')$ is the kernel function. 

## Bayesian Linear Regression and GP

* In summary, Bayesian linear regression is a special instance of a Gaussian Process.



* It is defined by the linear regression model $y(x) = w^T\phi(x)$ with a weight prior $p(w)=\mathcal{N}(w|0, \alpha^{-1} I)$


* The kernel function is given by $\kappa(x_n, x_m) = \frac{1}{\alpha}\phi(x_n)^T\phi(x_m)$


* Features in Bayesian LR $\iff$ kernel functions in GP.
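
The correspondence between features and kernel can be checked numerically: sample many weight vectors from the prior, map them through $\Phi$, and compare the empirical covariance of $y = \Phi w$ with $K = \frac{1}{\alpha}\Phi\Phi^T$. The quadratic feature map and the numbers below are assumptions chosen for illustration.

```python
import numpy as np

def features(x):
    """Assumed feature map phi(x) = (1, x, x^2)."""
    return np.column_stack([np.ones_like(x), x, x ** 2])

alpha = 2.0
x = np.array([-1.0, 0.0, 1.0])
Phi = features(x)

# Gram matrix implied by the weight prior N(0, alpha^{-1} I).
K = Phi @ Phi.T / alpha

# Monte Carlo estimate of Cov(y) with y = Phi w, w ~ N(0, alpha^{-1} I).
rng = np.random.default_rng(0)
W = rng.normal(0.0, alpha ** -0.5, size=(200000, Phi.shape[1]))
Y = W @ Phi.T                      # each row is one sampled vector y
K_empirical = Y.T @ Y / W.shape[0]
```

`K_empirical` should agree with `K` up to Monte Carlo error, which shrinks as the number of sampled weight vectors grows.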

## GP for regression

* Consider noise on the observed target values: $t_n = y_n + \epsilon_n$, where $y_n = y(x_n)$ and $\epsilon_n$ is random noise.


* Equivalently, consider a noise process: $p(t_n|y_n) = \mathcal{N}(t_n|y_n, \beta^{-1})$, where $\beta$ is a hyperparameter (the precision of the noise).


* Since the $\epsilon_n$ are independent, this can be represented as a multivariate Gaussian:
$$ p({\bf t|y}) = \mathcal{N}({\bf t}|{\bf y}, \beta^{-1} {\bf I}) $$

## GP for regression

* From the definition of a GP, the marginal distribution $p(y)$ is given by $p(y) = \mathcal{N}(y|0,K)$.
    - Note: $p(y)$ and $p(t|y)$ form a linear Gaussian model.


* Then, the marginal distribution of t is given by 
$$p(t) = \int p(t|y)p(y)dy = \mathcal{N}(t|0, C) $$ 
    - where the covariance matrix C has elements 
$$ C(x_n, x_m) = \kappa(x_n,x_m) + \beta^{-1}\delta_{mn} $$

## Example: sampling data points

* Sample function from GP (blue): $y(x) \sim GP(\mu, K)$


* Sample points from GP (red): $(x_1 , y (x_1)) , (x_2 , y(x_2)) ,\ldots, (x_N , y(x_N))$ with $y \sim GP(\mu, K)$


* Add noise (green): $t_n = y_n + \epsilon_n$ with $\epsilon_n \sim \mathcal{N}(0, \beta^{-1})$

<img src ="gpexample.png"> 
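
The three steps can be sketched as follows, assuming a zero mean function ($\mu = 0$), a squared-exponential kernel and hypothetical values for $\theta$ and $\beta$. The last line builds the covariance $C = K + \beta^{-1}I$ of the noisy targets at the sampled inputs.

```python
import numpy as np

def kernel(xa, xb, theta=5.0):
    """Squared-exponential kernel for 1-d inputs (an assumed choice)."""
    return np.exp(-theta * (xa[:, None] - xb[None, :]) ** 2)

rng = np.random.default_rng(0)
beta = 100.0                       # noise precision, so noise std is 0.1

# 1. Sample a function from the GP on a dense grid (the "blue" curve).
x_grid = np.linspace(0, 1, 200)
K_grid = kernel(x_grid, x_grid)
L = np.linalg.cholesky(K_grid + 1e-6 * np.eye(x_grid.size))  # jitter for stability
y_grid = L @ rng.standard_normal(x_grid.size)

# 2. Pick a few points on that function (the "red" points).
idx = np.sort(rng.choice(x_grid.size, size=10, replace=False))
x_pts, y_pts = x_grid[idx], y_grid[idx]

# 3. Add independent Gaussian noise (the "green" points).
t_pts = y_pts + rng.normal(0, beta ** -0.5, size=x_pts.size)

# Covariance of the noisy targets: C(x_n, x_m) = k(x_n, x_m) + delta_nm / beta.
C = kernel(x_pts, x_pts) + np.eye(x_pts.size) / beta
```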

## GP for regression

* We have used GP to build a model of the joint distribution over sets of data points.


* **Goal**: Given data $(x_1 , ... , x_n)$ and target values $t = (t_1 , ... , t_n)$ , predict $t_{n+1}$ for query point $x_{n+1}$ .


* Idea: GP assumes that $p(t_1 , \ldots , t_n, t_{n+1}) = \mathcal{N}(0, C_{n+1})$ where 
    - $C_{n+1}$ is the $(n+1)\times(n+1)$ matrix with elements $C(x_n,x_m) = \kappa(x_n, x_m) + \beta^{-1}\delta_{nm}$
        
        
* $C_{n+1} = \left[\begin{array}{cc} C_n & k \\ k^T & c \end{array}\right]$ where $C_n$ is an $n \times n$ matrix, $k$ is the vector with elements $\kappa(x_i, x_{n+1})$ for $i = 1, \ldots, n$, and $c = \kappa(x_{n+1} ,x_{n+1})+ \beta^{-1}$

## GP for regression

* The conditional distribution $p(t_{n+1} |t)$ is a Gaussian distribution with mean and variance given by:
$$ m(x_{n+1}) = k^T C_n^{-1}t $$
$$ \sigma^2(x_{n+1}) = c- k^TC_n^{-1} k$$



* These are the key results that define Gaussian process regression.


* The predictive distribution is a Gaussian whose **mean and variance both depend** on $x_{n+1}$
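
A minimal sketch of these two formulas in NumPy, vectorized over several query points. The squared-exponential kernel, the value of $\theta$, the noise precision and the sine-shaped toy data are assumptions; `gp_predict` is a hypothetical helper name.

```python
import numpy as np

def sq_exp_kernel(xa, xb, theta=10.0):
    """Squared-exponential kernel for 1-d inputs (an assumed choice)."""
    return np.exp(-theta * (xa[:, None] - xb[None, :]) ** 2)

def gp_predict(x_train, t, x_new, kernel, beta):
    """Predictive mean k^T C_n^{-1} t and variance c - k^T C_n^{-1} k."""
    C = kernel(x_train, x_train) + np.eye(x_train.size) / beta
    k = kernel(x_train, x_new)                 # column j is the vector k for x_new[j]
    c = np.diag(kernel(x_new, x_new)) + 1.0 / beta
    mean = k.T @ np.linalg.solve(C, t)
    var = c - np.sum(k * np.linalg.solve(C, k), axis=0)
    return mean, var

rng = np.random.default_rng(0)
beta = 100.0
x_train = rng.uniform(0, 1, 15)
t = np.sin(2 * np.pi * x_train) + rng.normal(0, beta ** -0.5, x_train.size)

# Two queries inside the data range and one far outside it.
x_new = np.array([0.25, 0.5, 3.0])
mean, var = gp_predict(x_train, t, x_new, sq_exp_kernel, beta)
```

Far from the data the vector $k$ is essentially zero, so the prediction falls back to the prior: the mean goes to $0$ and the variance to $c$. Near the data the variance is smaller.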


## An Example of GP regression

* Green: underlying function (sine function)
* Blue: samples from GP (with noise)
* Red: prediction from GP regression (with “error bars”)

<img src = "gp_regression.png">

## GP for Regression

* The only restriction on the kernel is that the covariance matrix given by $C(x_n, x_m) = \kappa(x_n, x_m) + \beta^{-1}\delta_{nm}$ must be positive definite.



* GP regression involves inverting a matrix of size $N \times N$, which requires $O(N^3)$ computations. 

## Learning Hyperparameters

* Log likelihood: 
$$ \ln p(t|\theta) = -\frac{1}{2} \ln|C_N| - \frac{1}{2}t^TC_N^{-1}t - \frac{N}{2} \ln(2\pi) $$


* Gradient ascent for parameter $\theta$:
$$ \frac{\partial}{\partial{\theta_i}}\ln p(t|\theta) = -\frac{1}{2}Tr\left(C_N^{-1}\frac{\partial{C_N}}{\partial{\theta_i}}\right) + \frac{1}{2}t^TC_N^{-1}\frac{\partial{C_N}}{\partial{\theta_i}}C_N^{-1}t $$


* where we have used the following properties:
$$\frac{\partial{A^{-1}}}{\partial{x}} = -A^{-1}\frac{\partial{A}}{\partial{x}}A^{-1}$$
$$ \frac{\partial{\ln|A|}}{\partial{x}} = Tr\left( A^{-1} \frac{\partial{A}}{\partial{x}}\right) $$
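
A sketch of the log likelihood and its gradient for a single hyperparameter, assuming the kernel $\kappa(x,x') = \exp(-\theta(x-x')^2)$ with a fixed noise precision $\beta$. The helper names and the toy data are hypothetical.

```python
import numpy as np

def kernel_and_grad(x, theta):
    """K_ij = exp(-theta * d_ij^2) and its derivative dK/dtheta = -d_ij^2 * K_ij."""
    d2 = (x[:, None] - x[None, :]) ** 2
    K = np.exp(-theta * d2)
    return K, -d2 * K

def log_marginal_likelihood(x, t, theta, beta):
    """Return ln p(t | theta) and its derivative with respect to theta."""
    K, dK = kernel_and_grad(x, theta)
    C = K + np.eye(x.size) / beta       # the noise term does not depend on theta
    C_inv_t = np.linalg.solve(C, t)
    _, logdet = np.linalg.slogdet(C)
    ll = -0.5 * logdet - 0.5 * t @ C_inv_t - 0.5 * x.size * np.log(2 * np.pi)
    # -1/2 Tr(C^{-1} dC) + 1/2 t^T C^{-1} dC C^{-1} t, with dC = dK
    grad = -0.5 * np.trace(np.linalg.solve(C, dK)) + 0.5 * C_inv_t @ dK @ C_inv_t
    return ll, grad

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 10)
t = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, x.size)
ll, grad = log_marginal_likelihood(x, t, theta=5.0, beta=25.0)
```

One gradient ascent step would then be `theta = theta + eta * grad` for some small step size `eta`; in practice it is common to optimize $\ln\theta$ instead so that $\theta$ stays positive.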

## Summary of GP

* Distribution over functions

* A GP generates data points that are jointly Gaussian distributed


* The most interesting structure is in $\kappa(x,x')$, the kernel.


* GP can be used for regression to predict the target for a new input example.