# 0. Introduction

## Supervised Learning
+ Training data comprises examples of the input vectors along with their corresponding target vectors
+ Task: **classification, regression**

## Unsupervised Learning 
+ Training data consists only of a set of input vectors, without target values
+ Task: **clustering, density estimation, visualization**

## Reinforcement Learning 
+ Concerned with the problem of finding suitable actions to take in a given situation in order to maximize a reward
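+ Trade-off between exploration (trying new actions to see how effective they are) and exploitation (using actions already known to yield a high reward)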

# 1. Example: Polynomial Curve Fitting

+ Training data set (**x,t**) of size N 
$$\mathbf{x}\equiv(x_1, ..., x_N)^T$$ 
$$\mathbf{t}\equiv(t_1, ..., t_N)^T$$

+ Predictions: Polynomial Function
$$y(x,\mathbf{w})=w_0+w_1x+w_2x^2+...+w_Mx^M=\sum_{j=0}^{M}w_jx^j $$

+ Minimize an error function
$$E(\mathbf{w})=\frac{1}{2}\sum_{n=1}^{N}[y(x_n,\mathbf{w})-t_n]^2$$
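where the factor of 1/2 is included for later convenience. Setting the derivative of $E(\mathbf{w})$ with respect to the coefficients $\mathbf{w}$ to zero gives the optimal coefficients $\mathbf{w}^*$ and the fitted polynomial $y(x,\mathbf{w}^*)$. Because $E(\mathbf{w})$ is quadratic in $\mathbf{w}$, this is a set of linear equations for $\mathbf{w}$ and the solution can be found in closed form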

+ Model selection: choose a suitable M to avoid overfitting  


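+ Root-mean-square (RMS) error:
$$E_{RMS}=\sqrt{2E(\mathbf{w}^*)/N}$$  
Dividing by N allows data sets of different sizes to be compared on an equal footing, and the square root puts $E_{RMS}$ on the same scale (and in the same units) as the target t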


+ The larger the dataset, the more complex the model that we can afford to fit to the data  


+ regularization (weight decay):
$$\widetilde{E}(\mathbf{w}) = \frac{1}{2}\sum_{n=1}^{N}[y(x_n,\mathbf{w})-t_n]^2+\frac{\lambda}{2}||\mathbf{w}||^2$$
where $||\mathbf{w}||^2 = \mathbf{w}^T\mathbf{w}=w_0^2+w_1^2+...+w_M^2$ and the coefficient $\lambda$ governs the relative importance of the regularization term 
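
The steps listed above (least-squares fit, RMS error, model order M, and regularization) can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the helper names `design_matrix`, `fit_polynomial`, and `rms_error`, the noisy sine data, and the value of `lam` are all assumptions made for the example. The regularized problem is solved as an augmented least-squares system, which has the same minimizer as $\widetilde{E}(\mathbf{w})$.

```python
import numpy as np

def design_matrix(x, M):
    """Matrix whose columns are x**0, x**1, ..., x**M."""
    return np.vander(x, M + 1, increasing=True)

def fit_polynomial(x, t, M, lam=0.0):
    """Minimize the sum-of-squares error plus (lam/2)*||w||^2.

    Stacking sqrt(lam)*I under the design matrix turns the regularized
    problem into an ordinary least-squares problem.
    """
    Phi = design_matrix(x, M)
    A = np.vstack([Phi, np.sqrt(lam) * np.eye(M + 1)])
    b = np.concatenate([t, np.zeros(M + 1)])
    w, *_ = np.linalg.lstsq(A, b, rcond=None)
    return w

def rms_error(x, t, w):
    """E_RMS = sqrt(2 E(w) / N)."""
    y = design_matrix(x, len(w) - 1) @ w
    E = 0.5 * np.sum((y - t) ** 2)
    return np.sqrt(2 * E / len(x))

# Hypothetical data: a noisy sine curve (an assumption for illustration).
rng = np.random.default_rng(0)
x_train = np.linspace(0, 1, 10)
t_train = np.sin(2 * np.pi * x_train) + rng.normal(scale=0.3, size=10)
x_test = np.linspace(0, 1, 100)
t_test = np.sin(2 * np.pi * x_test) + rng.normal(scale=0.3, size=100)

errors = {}
for M in (1, 3, 9):
    w = fit_polynomial(x_train, t_train, M)
    errors[M] = (rms_error(x_train, t_train, w), rms_error(x_test, t_test, w))
    print(f"M={M}: train RMS={errors[M][0]:.3f}, test RMS={errors[M][1]:.3f}")

# Same order M=9, with and without weight decay.
w_free = fit_polynomial(x_train, t_train, 9)
w_reg = fit_polynomial(x_train, t_train, 9, lam=1e-3)
print("||w|| without / with regularization:",
      np.linalg.norm(w_free), np.linalg.norm(w_reg))
```

The training error can only go down as M grows, because lower-order polynomials are special cases of higher-order ones. The test error and the size of the coefficients are what reveal overfitting.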

# 2. Probability Theory 

![Capture.JPG](attachment:Capture.JPG)


## joint probability
$$p(X=x_i,Y=y_j)=\frac{n_{ij}}{N}$$
where $n_{ij}$ is the number of trials in which $X=x_i$ and $Y=y_j$, out of N trials in total

## marginal probability 
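$$p(X=x_i)=\sum_{j=1}^{L}p(X=x_i,Y=y_j)=\frac{c_i}{N}$$
where $c_i=\sum_{j}n_{ij}$ is the number of trials in which $X=x_i$, and L is the number of values Y can take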

## conditional probability
$$p(Y=y_j|X=x_i)=\frac{n_{ij}}{c_i}$$
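
As a small numerical check, the three definitions above can be computed from a table of counts. The counts in `n` are made up for illustration; rows index $x_i$ and columns index $y_j$.

```python
import numpy as np

# Hypothetical counts n_ij (rows: values of X, columns: values of Y).
n = np.array([[3, 1, 2],
              [2, 4, 8]])
N = n.sum()            # total number of trials
c = n.sum(axis=1)      # c_i: number of trials with X = x_i

joint = n / N                      # p(X=x_i, Y=y_j) = n_ij / N
marginal_x = joint.sum(axis=1)     # p(X=x_i) = c_i / N
cond_y_given_x = n / c[:, None]    # p(Y=y_j | X=x_i) = n_ij / c_i

print(joint)
print(marginal_x)
print(cond_y_given_x)

# The joint factorizes as p(Y|X) p(X).
print(np.allclose(joint, cond_y_given_x * marginal_x[:, None]))
```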

## The Rules of Probability

**sum rule** $$p(X)=\sum_{Y}{p(X,Y)}$$


**product rule** $$p(X,Y)=p(Y|X)p(X)$$

## Bayes' theorem
$$p(Y|X)=\frac{p(X|Y)p(Y)}{p(X)}$$

Prior probability is $p(Y)$: the probability of Y before X is observed

Posterior probability is $p(Y|X)$: the probability of Y after X has been observed
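
A short sketch of Bayes' theorem on discrete variables: start from a prior $p(Y)$ and a likelihood table $p(X|Y)$, and obtain the posterior $p(Y|X)$. The numbers are illustrative assumptions, and the denominator $p(X)$ is computed with the sum rule.

```python
import numpy as np

# Illustrative numbers (assumptions): Y has two values, X has two values.
prior = np.array([0.4, 0.6])             # p(Y)
likelihood = np.array([[0.25, 0.75],     # p(X | Y = y_0)
                       [0.75, 0.25]])    # p(X | Y = y_1)

joint = likelihood * prior[:, None]      # product rule: p(X, Y), rows index Y
evidence = joint.sum(axis=0)             # sum rule: p(X)
posterior = joint / evidence             # Bayes' theorem: p(Y | X), columns index X

print("p(X)   =", evidence)
print("p(Y|X) =")
print(posterior)
```

Each column of `posterior` sums to one, since it is a distribution over Y for a fixed observed value of X.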

## Probability density (Probability with respect to continuous variables)

If the probability of a real-valued variable x falling in the interval (x, x + δx) is given by p(x)δx for δx → 0, then
p(x) is called the probability density over x

+ The probability that x lies in the interval (a,b)
$$P(x\in (a,b))=\int_{a}^{b}p(x)dx$$


+ Non-negativity and normalization
$$p(x)\geq 0, \qquad \int_{-\infty}^{\infty} p(x)dx=1$$

+ The sum and product rules
$$p(x)=\int p(x,y)dy$$
$$p(x,y)=p(y|x)p(x)$$
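
The interval probability and the normalization condition can be checked numerically. The sketch below uses a standard Gaussian density purely as a convenient example (an assumption, any valid density would do) and a hand-written trapezoidal rule named `integrate`, which is a hypothetical helper rather than a library function.

```python
import numpy as np

def gaussian_pdf(x, mu=0.0, sigma=1.0):
    """Density of a univariate Gaussian, used here only as an example p(x)."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / np.sqrt(2 * np.pi * sigma ** 2)

def integrate(f, a, b, num=100_001):
    """Trapezoidal approximation of the integral of f over (a, b)."""
    x = np.linspace(a, b, num)
    y = f(x)
    return np.sum((y[1:] + y[:-1]) / 2 * np.diff(x))

# Normalization: the tails beyond +-10 are negligible for this density.
total = integrate(gaussian_pdf, -10.0, 10.0)

# P(x in (-1, 1)) as the integral of the density over the interval.
p_interval = integrate(gaussian_pdf, -1.0, 1.0)

print(total, p_interval)
```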

## Expectation and covariances

+ Expectation 
$$E[f] = \sum_{x} p(x)f(x) \tag{discrete}$$
$$E[f] = \int p(x)f(x)dx \tag{continuous}$$
If we are given a finite number N of points drawn from the probability distribution or probability density, the expectation can be approximated by a finite sum over these points
$$E[f]\simeq\frac{1}{N}\sum_{n=1}^{N}f(x_n)$$

+ Variance 
$$var[f]= E[(f(x)-E[f(x)])^2]$$
$$var[x]=E[x^2]-E[x]^2$$


+ Covariance
$$\begin{aligned}cov[x,y]&=E_{x,y}[\{x-E[x]\}\{y-E[y]\}]\\&=E_{x,y}[xy]-E[x]E[y]\end{aligned}$$
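
The sample-based approximation of an expectation extends directly to the variance and covariance. In this sketch the distributions of `x` and `y` are assumptions chosen so that the true values are known: $E[x]=1$, $var[x]=4$, $E[x^2]=5$, and $cov[x,y]=12$.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 200_000

# Assumed example distributions with known moments.
x = rng.normal(loc=1.0, scale=2.0, size=N)   # E[x] = 1, var[x] = 4
y = 3.0 * x + rng.normal(size=N)             # cov[x, y] = 3 * var[x] = 12

# E[f] ~ (1/N) sum_n f(x_n), here with f(x) = x**2.
e_f = np.mean(x ** 2)

# var[x] = E[x^2] - E[x]^2 and cov[x, y] = E[xy] - E[x]E[y].
mean_x = np.mean(x)
var_x = np.mean(x ** 2) - mean_x ** 2
cov_xy = np.mean(x * y) - mean_x * np.mean(y)

print(e_f, mean_x, var_x, cov_xy)
```

The estimates fluctuate around the true values, with an error that shrinks roughly as $1/\sqrt{N}$.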

## Bayesian probabilities

**Comparison between Bayesian and frequentist**


+ From a Bayesian perspective, we use the machinery of probability to describe the uncertainty in model parameters such as w, or indeed in the choice of model itself


+ It introduces a prior probability $p(w)$ over the parameters and converts it into a posterior probability $p(w|D)$ after observing the data D
$$p(w|D)=\frac{p(D|w)p(w)}{p(D)}$$
That is, $posterior\propto likelihood \times prior$: the numerator is the likelihood $p(D|w)$ times the prior $p(w)$, and the denominator $p(D)$ is the normalization constant


+ In a frequentist setting, w is considered to be a fixed parameter, whose value is determined by some form of ‘estimator’, and error bars on this estimate are obtained by considering the distribution of possible data sets D. By contrast, from the Bayesian viewpoint there is only a single data set D (namely the one that is actually observed), and the uncertainty in the parameters is expressed through a probability distribution over w.

## The Gaussian distribution

+ For a single real-valued variable x
$$ \mathcal{N}(x|\mu,\sigma^2)=\frac{1}{(2\pi\sigma^2)^{\frac{1}{2}}}\exp\left\{-\frac{1}{2\sigma^2}(x-\mu)^2\right\} $$
where $\mu$ is the mean and $\sigma^2$ is the variance


+ For a D-dimensional vector $\mathbf{x}$ of continuous variables
$$ \mathcal{N}(\mathbf{x}|\boldsymbol{\mu},\Sigma)=\frac{1}{(2\pi)^{D/2}}\frac{1}{|\Sigma|^{1/2}}\exp\left\{-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^T\Sigma^{-1}(\mathbf{x}-\boldsymbol{\mu})\right\} $$
where the D-dimensional vector $\boldsymbol{\mu}$ is called the mean, the $D \times D$ matrix $\Sigma$ is called the covariance, and $|\Sigma|$ is its determinant.

+ For a data set $\mathbf{x}=(x_1,...,x_N)^T$ drawn independently from a Gaussian distribution with mean $\mu$ and variance $\sigma^2$
$$p(\mathbf{x}|\mu, \sigma^2)=\prod_{n=1}^{N}\mathcal{N}(x_n|\mu, \sigma^2)$$
Then the log likelihood function can be written as 
$$\ln p(\mathbf{x}|\mu, \sigma^2)=-\frac{1}{2\sigma^2}\sum_{n=1}^{N}(x_n-\mu)^2-\frac{N}{2}\ln\sigma^2-\frac{N}{2}\ln(2\pi)$$
The maximum likelihood solution is 
$$\mu_{ML}=\frac{1}{N}\sum_{n=1}^{N}x_n$$
$$\sigma^2_{ML}=\frac{1}{N}\sum_{n=1}^{N}(x_n-\mu_{ML})^2$$
The expectations of these quantities are
$$E[\mu_{ML}]=\mu$$
$$E[\sigma^2_{ML}]=\left(\frac{N-1}{N}\right)\sigma^2$$
so maximum likelihood underestimates the true variance by a factor of $(N-1)/N$; the bias becomes negligible as N grows
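
The bias of $\sigma^2_{ML}$ is easy to see in a simulation: draw many small data sets from a known Gaussian, compute the maximum likelihood estimates for each, and average them. The true parameters, the data set size `N`, and the number of `trials` are assumptions chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma2 = 0.0, 1.0      # assumed true parameters
N = 5                      # points per data set (small, so the bias is visible)
trials = 100_000           # number of independent data sets

samples = rng.normal(mu, np.sqrt(sigma2), size=(trials, N))

# Maximum likelihood estimates, one pair per data set.
mu_ml = samples.mean(axis=1)
sigma2_ml = ((samples - mu_ml[:, None]) ** 2).mean(axis=1)

avg_mu_ml = mu_ml.mean()              # close to mu
avg_sigma2_ml = sigma2_ml.mean()      # close to (N-1)/N * sigma2, not sigma2

# Rescaling by N/(N-1) removes the bias.
avg_unbiased = N / (N - 1) * avg_sigma2_ml

print(avg_mu_ml, avg_sigma2_ml, avg_unbiased)
```

With `N = 5` the averaged $\sigma^2_{ML}$ lands near 0.8 rather than 1, matching the factor $(N-1)/N$.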

## Curve fitting re-visited


The curve fitting problem can be viewed probabilistically: given x, the target t is assumed to follow a Gaussian distribution whose mean is the polynomial $y(x,w)$
$$p(t|x,w,\beta)=\mathcal{N}(t|y(x,w),\beta^{-1})$$
where $\beta$ is a precision parameter corresponding to the inverse variance of the distribution. We then use the maximum likelihood approach on the training data to find the parameters $w_{ML}$ and $\beta_{ML}$ 
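
To make the link with the sum-of-squares error explicit, here is the log likelihood written out, assuming the N training points are drawn independently. It follows from the Gaussian log likelihood given earlier with $\mu$ replaced by $y(x_n,\mathbf{w})$ and $\sigma^2$ by $\beta^{-1}$:

$$\ln p(\mathbf{t}|\mathbf{x},\mathbf{w},\beta)=-\frac{\beta}{2}\sum_{n=1}^{N}[y(x_n,\mathbf{w})-t_n]^2+\frac{N}{2}\ln\beta-\frac{N}{2}\ln(2\pi)$$

Only the first term depends on $\mathbf{w}$, so maximizing the likelihood with respect to $\mathbf{w}$ is the same as minimizing the sum-of-squares error $E(\mathbf{w})$. Maximizing with respect to $\beta$ then gives

$$\frac{1}{\beta_{ML}}=\frac{1}{N}\sum_{n=1}^{N}[y(x_n,\mathbf{w}_{ML})-t_n]^2$$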


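A step towards a more Bayesian approach is to introduce a prior distribution over w and maximize the posterior distribution (**MAP**, maximum a posteriori), which is proportional to the product of the prior distribution and the likelihood function. With a zero-mean Gaussian prior on w, maximizing the posterior is equivalent to minimizing the regularized sum-of-squares error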

## Bayesian curve fitting


# 3. Model Selection


# 4. The Curse of Dimensionality


# 5. Decision Theory





# 6. Information Theory

**Entropy** (in bits when $\log_2$ is used; replacing $\log_2$ with $\ln$ gives the entropy in nats)

$$H[X]=-\sum_{x}p(x)\log_2p(x)$$

**Differential entropy**

$$H[X]=-\int p(x)\ln p(x)dx$$

**Relative entropy** (Kullback-Leibler divergence)

$$\begin{aligned}KL(p||q)&=-\int p(x)\ln q(x)dx-\left(-\int p(x)\ln p(x)dx\right)\\&=-\int p(x)\ln\left\{\frac{q(x)}{p(x)}\right\}dx\end{aligned}$$

**Mutual information**
$$I[x,y]\equiv KL(p(x,y)||p(x)p(y))$$
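
For discrete distributions the integrals above become sums, which makes these quantities easy to compute. The sketch below is a minimal illustration under that assumption; the function names `entropy`, `kl_divergence`, and `mutual_information` and the example distributions are hypothetical. Natural logarithms are used, so results are in nats unless converted.

```python
import numpy as np

def entropy(p):
    """H = -sum p ln p, with the convention 0 ln 0 = 0."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return -np.sum(p * np.log(p))

def kl_divergence(p, q):
    """KL(p||q) = sum p ln(p/q); assumes q > 0 wherever p > 0."""
    p = np.asarray(p, dtype=float).ravel()
    q = np.asarray(q, dtype=float).ravel()
    mask = p > 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

def mutual_information(joint):
    """I[x, y] = KL(p(x, y) || p(x) p(y)) for a joint probability table."""
    joint = np.asarray(joint, dtype=float)
    px = joint.sum(axis=1, keepdims=True)
    py = joint.sum(axis=0, keepdims=True)
    return kl_divergence(joint, px * py)

# A uniform distribution over 8 states carries 3 bits of entropy.
uniform = np.full(8, 1 / 8)
h_bits = entropy(uniform) / np.log(2)

# Independent variables share no information; perfectly coupled ones share ln 2 nats.
independent = np.outer([0.3, 0.7], [0.4, 0.6])
dependent = np.array([[0.5, 0.0],
                      [0.0, 0.5]])

print(h_bits)
print(mutual_information(independent), mutual_information(dependent))
```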