# 1. Introduction

## 1.2 Probability Theory

### 1.2.2 Expectations and covariances

The average value of some function $f(x)$ under a probability distribution $p(x)$ is called the _expectation_ of $f(x)$.

For a discrete distribution
$$
\mathop{\mathbb{E}}[f] = \sum_{x}p(x)f(x) \tag{1.33}
$$
For a continuous distribution
$$
\mathop{\mathbb{E}}[f] = \int p(x)f(x)dx \tag{1.34}
$$

If we are given a finite number $N$ of points drawn from the probability distribution or probability density, the _expectation_ can be approximated as a finite sum over these points. The approximation becomes exact in the limit $N\to\infty$.
$$\mathop{\mathbb{E}}[f] \simeq \frac{1}{N}\sum^{N}_{n=1}f(x_{n}) \tag{1.35}$$
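
The finite-sum approximation of eq(1.35) can be sketched in a few lines. This is only an illustration under assumed choices that are not in the notes: the distribution is a standard Gaussian, $f(x) = x^{2}$ (so the exact expectation is 1), and the helper name `monte_carlo_expectation` is hypothetical.

```python
import random

def monte_carlo_expectation(f, sampler, n_samples):
    """Approximate E[f] by averaging f over n_samples draws from sampler()."""
    return sum(f(sampler()) for _ in range(n_samples)) / n_samples

rng = random.Random(0)  # fixed seed so the run is repeatable

def draw():
    # Assumed distribution for the example: x ~ N(0, 1)
    return rng.gauss(0.0, 1.0)

def square(x):
    # Assumed function for the example: f(x) = x^2, whose exact expectation is 1
    return x * x

estimate_small = monte_carlo_expectation(square, draw, 100)
estimate_large = monte_carlo_expectation(square, draw, 100_000)
print(estimate_small, estimate_large)
```

The estimate from 100,000 points should sit much closer to 1 than the estimate from 100 points, which is the sense in which the approximation improves as $N$ grows.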

We use a subscript to indicate which variable is being averaged over when a function depends on several variables. The _expectation_ of the function $f(x, y)$ with respect to the distribution of $x$ is denoted by
$$\mathop{\mathbb{E}_{x}}[f(x, y)] \tag{1.36}$$
Note that $\mathop{\mathbb{E}_{x}}[f(x, y)]$ is a function of $y$.
We use $\mathop{\mathbb{E}_{x}}[f\mid y]$ to denote a _conditional expectation_ with respect to a conditional distribution. For a discrete variable
$$\mathop{\mathbb{E}_{x}}[f\mid y] = \sum_{x}p(x\mid y)f(x) \tag{1.37}$$
with the sum replaced by the integral $\int p(x\mid y)f(x)dx$ for a continuous variable.

The covariance and variance are defined by
$$
\begin{equation}
\begin{split}
cov(f,g) & = \mathop{\mathbb{E}_{x, y}}\big[(f(x) - \mathop{\mathbb{E}}[f(x)])(g(y) - \mathop{\mathbb{E}}[g(y)])\big]\\
& = \mathop{\mathbb{E}_{x, y}}\big[f(x)g(y)\big] - \mathop{\mathbb{E}_{x, y}}\big[f(x)\mathop{\mathbb{E}}[g(y)]\big] - \mathop{\mathbb{E}_{x, y}}\big[g(y)\mathop{\mathbb{E}}[f(x)]\big] + \mathop{\mathbb{E}_{x, y}}\big[\mathop{\mathbb{E}}[f(x)]\mathop{\mathbb{E}}[g(y)]\big]\\
& = \mathop{\mathbb{E}_{x, y}}[f(x)g(y)] - \mathop{\mathbb{E}}[f(x)]\mathop{\mathbb{E}}[g(y)] - \mathop{\mathbb{E}}[g(y)]\mathop{\mathbb{E}}[f(x)] + \mathop{\mathbb{E}}[f(x)]\mathop{\mathbb{E}}[g(y)]\\
& = \mathop{\mathbb{E}_{x, y}}[f(x)g(y)] - \mathop{\mathbb{E}}[f(x)]\mathop{\mathbb{E}}[g(y)]\\
\end{split}
\end{equation}
$$

$$var(f) = \mathop{\mathbb{E}}\big[\big(f(x) - \mathop{\mathbb{E}}[f(x)]\big)^{2}\big] \tag{1.38}$$
$$var(f) = \mathop{\mathbb{E}}[f(x)^{2}] - \mathop{\mathbb{E}}[f(x)]^{2} \tag{1.39}$$
If $f(x) = x$ and $g(y) = y$, these reduce to
$$var(x) = cov(x, x) = \mathop{\mathbb{E}}[x^{2}] - \mathop{\mathbb{E}}[x]^{2} \tag{1.40}$$
$$cov(x, y) = \mathop{\mathbb{E}_{x, y}}[xy] - \mathop{\mathbb{E}}[x]\mathop{\mathbb{E}}[y] \tag{1.41}$$
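
As a quick numerical check that the definition of the covariance and the shortcut of eq(1.41) agree, the sketch below computes both from samples. The data-generating process ($y = 2x$ plus independent noise, so the true covariance is 2) is an assumption made up for the example.

```python
import random

def mean(values):
    """Sample average, standing in for the expectation."""
    return sum(values) / len(values)

rng = random.Random(1)
xs = [rng.gauss(0.0, 1.0) for _ in range(50_000)]
# Assumed relationship: y = 2x + independent noise, so cov(x, y) = 2 * var(x) = 2.
ys = [2.0 * x + rng.gauss(0.0, 0.5) for x in xs]

mx, my = mean(xs), mean(ys)

# Definition: E[(x - E[x]) (y - E[y])]
cov_definition = mean([(x - mx) * (y - my) for x, y in zip(xs, ys)])

# Shortcut, eq(1.41): E[xy] - E[x] E[y]
cov_identity = mean([x * y for x, y in zip(xs, ys)]) - mx * my

print(cov_definition, cov_identity)
```

The two numbers agree up to floating-point rounding, and both land near 2 only approximately because they are estimated from a finite sample.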

For vectors $\textbf{x}\in R^{m}$ and $\textbf{y}\in R^{n}$, the covariance is an $m \times n$ matrix.
$$
cov(\textbf{x}, \textbf{y}) = \mathop{\mathbb{E}_{\textbf{x}, \textbf{y}}}[\textbf{x}\textbf{y}^{T}] - \mathop{\mathbb{E}}[\textbf{x}]\mathop{\mathbb{E}}[\textbf{y}]^{T} \tag{1.42}
$$

The covariance matrix generalizes the notion of variance to multiple dimensions.
$$
\begin{equation}
\begin{split}
\Sigma(\textbf{x}) & = cov(\textbf{x}, \textbf{x})\\
& = 
\begin{bmatrix}
    \mathop{\mathbb{E}}\bigg[\Big(x_{1} - \mathop{\mathbb{E}}[x_{1}]\Big)\Big(x_{1} - \mathop{\mathbb{E}}[x_{1}]\Big)\bigg] & \mathop{\mathbb{E}}\bigg[\Big(x_{1} - \mathop{\mathbb{E}}[x_{1}]\Big)\Big(x_{2} - \mathop{\mathbb{E}}[x_{2}]\Big)\bigg] & \dots  & \mathop{\mathbb{E}}\bigg[\Big(x_{1} - \mathop{\mathbb{E}}[x_{1}]\Big)\Big(x_{n} - \mathop{\mathbb{E}}[x_{n}]\Big)\bigg]\\
    \mathop{\mathbb{E}}\bigg[\Big(x_{2} - \mathop{\mathbb{E}}[x_{2}]\Big)\Big(x_{1} - \mathop{\mathbb{E}}[x_{1}]\Big)\bigg] & \mathop{\mathbb{E}}\bigg[\Big(x_{2} - \mathop{\mathbb{E}}[x_{2}]\Big)\Big(x_{2} - \mathop{\mathbb{E}}[x_{2}]\Big)\bigg] & \dots  & \mathop{\mathbb{E}}\bigg[\Big(x_{2} - \mathop{\mathbb{E}}[x_{2}]\Big)\Big(x_{n} - \mathop{\mathbb{E}}[x_{n}]\Big)\bigg] \\
    \vdots & \vdots & \ddots & \vdots \\
    \mathop{\mathbb{E}}\bigg[\Big(x_{n} - \mathop{\mathbb{E}}[x_{n}]\Big)\Big(x_{1} - \mathop{\mathbb{E}}[x_{1}]\Big)\bigg] & \mathop{\mathbb{E}}\bigg[\Big(x_{n} - \mathop{\mathbb{E}}[x_{n}]\Big)\Big(x_{2} - \mathop{\mathbb{E}}[x_{2}]\Big)\bigg] & \dots  & \mathop{\mathbb{E}}\bigg[\Big(x_{n} - \mathop{\mathbb{E}}[x_{n}]\Big)\Big(x_{n} - \mathop{\mathbb{E}}[x_{n}]\Big)\bigg]
\end{bmatrix}
\end{split}
\end{equation}
$$
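
The entries of the covariance matrix can be computed directly from the element-wise definition above. The following is a minimal sketch that replaces each expectation with a sample average; the four two-dimensional observations are made up for illustration, and `covariance_matrix` is a hypothetical helper name.

```python
def covariance_matrix(samples):
    """Sigma[i][j] = E[(x_i - E[x_i]) (x_j - E[x_j])], with E taken as a sample average."""
    n = len(samples)
    d = len(samples[0])
    means = [sum(s[i] for s in samples) / n for i in range(d)]
    return [
        [sum((s[i] - means[i]) * (s[j] - means[j]) for s in samples) / n for j in range(d)]
        for i in range(d)
    ]

# Four made-up 2-D observations where the second coordinate is twice the first.
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
sigma = covariance_matrix(data)
print(sigma)  # [[1.25, 2.5], [2.5, 5.0]]
```

The diagonal holds the variances of the individual coordinates, and the matrix is symmetric because the product inside each expectation is symmetric in $i$ and $j$.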

### 1.2.3 Bayesian probabilities

Bayes's theorem allows us to evaluate the uncertainty in $\textbf{w}$ in the form of the posterior probability $p(\textbf{w}\mid \mathcal{D})$ after we have incorporated the evidence provided by the observed data $\mathcal{D}$.
$$p(\textbf{w}\mid \mathcal{D}) = \frac{p(\mathcal{D}\mid\textbf{w})p(\textbf{w})}{p(\mathcal{D})} \tag{1.43}$$
The quantity $p(\mathcal{D}\mid\textbf{w})$ is called the _likelihood function_. It expresses how probable the observed data set is for the specified parameter vector $\textbf{w}$. $p(\textbf{w})$ is the prior probability distribution of $\textbf{w}$ before observing the data.
We can state Bayes's theorem in words:
$$\text{posterior} \propto \text{likelihood} \times \text{prior} \tag{1.44}$$
The denominator in eq(1.43) is the normalization constant, which ensures that the posterior distribution integrates to one.
$$p(\mathcal{D}) = \int p(\mathcal{D}\mid\textbf{w})p(\textbf{w})d\textbf{w} \tag{1.45}$$
In a frequentist setting, $\textbf{w}$ is determined by some form of "estimator". A widely used one is _maximum likelihood_, in which $\textbf{w}$ is set to the value that maximizes the likelihood function $p(\mathcal{D}\mid\textbf{w})$.
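
To make eq(1.43) to eq(1.45) concrete, here is a small sketch on a discrete grid, where the integral in eq(1.45) becomes a sum. Everything specific in it is an assumption for illustration: $w$ is taken to be the heads probability of a coin, the grid of candidate values, the uniform prior, and the observations are all made up.

```python
# Assumed setup: w is the probability of heads, restricted to a small grid.
grid = [0.1, 0.3, 0.5, 0.7, 0.9]
prior = [1.0 / len(grid)] * len(grid)   # uniform prior p(w)
data = [1, 1, 0, 1, 1, 1, 0, 1]         # 1 = heads, 0 = tails (made-up observations)

def likelihood(w, observations):
    """p(D | w) for independent Bernoulli observations."""
    p = 1.0
    for x in observations:
        p *= w if x == 1 else (1.0 - w)
    return p

# likelihood x prior for every candidate w
unnormalized = [likelihood(w, data) * p for w, p in zip(grid, prior)]
# p(D): eq(1.45) with the integral replaced by a sum over the grid
evidence = sum(unnormalized)
# posterior p(w | D), eq(1.43)
posterior = [u / evidence for u in unnormalized]

# Frequentist point estimate: the grid value that maximizes the likelihood
w_ml = max(grid, key=lambda w: likelihood(w, data))
print(posterior, w_ml)
```

With six heads in eight tosses, the grid value 0.7 maximizes the likelihood, while the posterior keeps nonzero probability on the neighbouring values and so expresses the remaining uncertainty in $w$.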

### 1.2.4 The Gaussian distribution

For the case of a single real-valued variable $x$, the Gaussian distribution is defined by
$$\mathcal{N}(x\mid\mu, \sigma^{2}) = \frac{1}{(2\pi\sigma^{2})^{1/2}}exp\left\{-\frac{1}{2\sigma^{2}}(x - \mu)^{2}\right\} \tag{1.46}$$

$$\mathop{\mathbb{E}}[x] = \mu = \text{"mean"}\tag{1.49}$$
$$\mathop{\mathbb{E}}[x^{2}] = \mu^{2} + \sigma^{2} \tag{1.50}$$
$$var[x] = \mathop{\mathbb{E}}[x^{2}] - \mathop{\mathbb{E}}[x]^{2} = \sigma^{2} =\text{"variance"} \tag{1.51}$$
The reciprocal of the variance, written as $\beta = \frac{1}{\sigma^{2}}$, is called the _precision_.
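
The moments in eq(1.49) to eq(1.51) can be checked numerically by integrating the density of eq(1.46). This is a rough sketch using a midpoint rule; the parameter values $\mu = 1.5$ and $\sigma^{2} = 0.64$ and the helper names are assumptions chosen for the example.

```python
import math

def gaussian_pdf(x, mu, sigma2):
    """Univariate Gaussian density N(x | mu, sigma^2), eq(1.46)."""
    return math.exp(-((x - mu) ** 2) / (2.0 * sigma2)) / math.sqrt(2.0 * math.pi * sigma2)

def integrate(g, lo, hi, steps=20_000):
    """Midpoint-rule approximation of the integral of g over [lo, hi]."""
    h = (hi - lo) / steps
    return sum(g(lo + (k + 0.5) * h) for k in range(steps)) * h

mu, sigma2 = 1.5, 0.64          # illustrative parameter values
lo, hi = mu - 10.0, mu + 10.0   # wide enough that the tails are negligible

total = integrate(lambda x: gaussian_pdf(x, mu, sigma2), lo, hi)              # should be ~1
first_moment = integrate(lambda x: x * gaussian_pdf(x, mu, sigma2), lo, hi)   # ~mu
second_moment = integrate(lambda x: x * x * gaussian_pdf(x, mu, sigma2), lo, hi)  # ~mu^2 + sigma^2
variance = second_moment - first_moment ** 2
precision = 1.0 / variance      # beta

print(total, first_moment, second_moment, variance, precision)
```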

A Gaussian distribution defined over a $D$-dimensional vector $x$ of continuous variables, with mean vector $\mu$ and covariance matrix $\Sigma$, is given by
$$\mathcal{N}(x\mid\mu, \Sigma) = \frac{1}{(2\pi)^{D/2}}\frac{1}{\left|\Sigma\right|^{1/2}}exp\left\{-\frac{1}{2}(x - \mu)^{T}\Sigma^{-1}(x - \mu)\right\} \tag{1.52}$$

Suppose that we have a data set $\textbf{X} = (x_{1}, \dots, x_{N})$ of $N$ observations that are _independent and identically distributed_. Using the fact that the _joint probability_ of independent events is given by the product of their marginal probabilities, the probability of the data set given $\mu$ and $\sigma^{2}$ is
$$p(\textbf{X}\mid\mu, \sigma^{2}) = \prod^{N}_{n=1}\mathcal{N}(x_{n}|\mu, \sigma^{2}) \tag{1.53}$$

Taking the log of the likelihood function gives
$$
ln\,p(\textbf{X}\mid\mu, \sigma^{2}) = -\frac{1}{2\sigma^{2}}\sum^{N}_{n=1}(x_{n} - \mu)^{2} - \frac{N}{2} ln\,\sigma^{2} - \frac{N}{2} ln\,(2\pi) \tag{1.54}
$$

Maximizing with respect to $\mu$, we obtain the maximum likelihood solution which is the _sample mean_.
$$
\begin{equation}
\begin{split}
\hat{\mu} & = \arg\max_{\mu}eq(1.54)\\
\implies\frac{\partial}{\partial\mu}eq(1.54) = 0 & \implies\sum^{N}_{n=1}x_{n} = N\mu\\
\end{split}
\end{equation}
$$
$$\hat{\mu} = \frac{1}{N}\sum^{N}_{n=1}x_{n} \tag{1.55}$$

Similarly, maximizing eq(1.54) with respect to $\sigma^{2}$, we obtain the _sample variance_.
$$
\begin{equation}
\begin{split}
\hat{\sigma}^{2} & = \arg\max_{\sigma^{2}}eq(1.54)\\
\implies\frac{\partial}{\partial\sigma^{2}}eq(1.54) = 0 & \implies\frac{1}{2(\sigma^{2})^{2}}\sum^{N}_{n=1}(x_{n} - \mu)^{2} = \frac{N}{2\sigma^{2}}\\
\end{split}
\end{equation}
$$
$$\hat{\sigma}^{2} = \frac{1}{N}\sum^{N}_{n=1}(x_{n} - \hat{\mu})^{2} \tag{1.56}$$
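
The two maximum likelihood solutions, eq(1.55) and eq(1.56), translate directly into code. The sketch below also evaluates the log likelihood of eq(1.54) so that the estimates can be compared against nearby parameter values; the eight observations are made up, and the function names are hypothetical.

```python
import math

def gaussian_mle(xs):
    """Return (mu_hat, sigma2_hat) from eq(1.55) and eq(1.56)."""
    n = len(xs)
    mu_hat = sum(xs) / n
    sigma2_hat = sum((x - mu_hat) ** 2 for x in xs) / n
    return mu_hat, sigma2_hat

def log_likelihood(xs, mu, sigma2):
    """Gaussian log likelihood, eq(1.54)."""
    n = len(xs)
    return (-sum((x - mu) ** 2 for x in xs) / (2.0 * sigma2)
            - 0.5 * n * math.log(sigma2)
            - 0.5 * n * math.log(2.0 * math.pi))

observations = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # made-up data
mu_hat, sigma2_hat = gaussian_mle(observations)
best = log_likelihood(observations, mu_hat, sigma2_hat)
print(mu_hat, sigma2_hat)  # 5.0 4.0
```

Evaluating `log_likelihood` at any other pair, for example `(5.5, 4.0)` or `(5.0, 3.0)`, gives a value no larger than `best`, which is what maximizing eq(1.54) means.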

$$
\mathop{\mathbb{E}}[\hat{\mu}] = \mathop{\mathbb{E}}\bigg[\frac{1}{N}\sum^{N}_{n=1}x_{n}\bigg] = \frac{1}{N}\sum^{N}_{n=1}\mathop{\mathbb{E}}[x_{n}] = \mu \tag{1.57}
$$

The maximum likelihood approach systematically underestimates the variance of the distribution by a factor $(N - 1) / N$.
$$
\begin{equation}
\begin{split}
\mathop{\mathbb{E}}[\hat{\sigma}^{2}] & = \mathop{\mathbb{E}}\bigg[\frac{1}{N}\sum^{N}_{i=1}(x_{i} - \frac{1}{N}\sum^{N}_{j=1}x_{j})^{2}\bigg]\\ 
& = \frac{1}{N}\sum^{N}_{i=1}\mathop{\mathbb{E}}\bigg[x_{i}^{2} - \frac{2}{N}x_{i}\sum^{N}_{j=1}x_{j} + \frac{1}{N^{2}}\sum^{N}_{j=1}x_{j}\sum^{N}_{k=1}x_{k}\bigg]\\
& = \frac{1}{N}\sum^{N}_{i=1}\bigg[\frac{N-2}{N}\mathop{\mathbb{E}}[x_{i}^{2}] - \frac{2}{N}\sum^{N}_{j\neq i}\mathop{\mathbb{E}}[x_{i}x_{j}] + \frac{1}{N^{2}}\sum^{N}_{j=1}\sum^{N}_{k\neq j}\mathop{\mathbb{E}}[x_{j}x_{k}] + \frac{1}{N^{2}}\sum^{N}_{j=1}\mathop{\mathbb{E}}[x_{j}^{2}]\bigg]\\
& = \frac{1}{N}\sum^{N}_{i=1}\bigg[\frac{N-2}{N}(\mu^{2} + \sigma^{2}) - \frac{2}{N}(N - 1)\mu^{2} + \frac{1}{N^{2}}N(N - 1)\mu^{2} + \frac{1}{N}(\mu^{2} + \sigma^{2})\bigg]\\
& = \frac{N-1}{N}\sigma^{2}\\
\end{split}
\end{equation}
$$
$$
\mathop{\mathbb{E}}[\hat{\sigma}^{2}] = \frac{N-1}{N}\sigma^{2} \tag{1.58}
$$

This is an example of a phenomenon called _bias_ and is related to the problem of over-fitting. The bias of $\hat{\theta}$ is defined as
$$
B(\hat{\theta}) = \mathop{\mathbb{E}}[\hat{\theta} - \theta] = \mathop{\mathbb{E}}[\hat{\theta}] - \theta
$$
The estimator $\hat{\theta}$ is an _unbiased estimator_ of $\theta$ if and only if $B(\hat{\theta}) = 0$.

From eq(1.58) it follows that the following estimate for the variance is unbiased
$$
s^{2} = \tilde{\sigma}^{2} = \frac{N}{N-1}\hat{\sigma}^{2} = \frac{1}{N-1}\sum^{N}_{n=1}(x_{n} - \hat{\mu})^{2} \tag{1.59}
$$
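
The factor $(N - 1)/N$ in eq(1.58) can be observed empirically by averaging both estimators over many small data sets. This is a simulation sketch under assumed settings (true variance 4, $N = 5$, 40,000 repetitions), none of which come from the notes.

```python
import random

rng = random.Random(42)
true_mu, true_sigma2 = 0.0, 4.0   # assumed ground truth for the simulation
n, trials = 5, 40_000             # small N makes the bias easy to see

ml_sum = 0.0
unbiased_sum = 0.0
for _ in range(trials):
    xs = [rng.gauss(true_mu, true_sigma2 ** 0.5) for _ in range(n)]
    mu_hat = sum(xs) / n
    squared_deviations = sum((x - mu_hat) ** 2 for x in xs)
    ml_sum += squared_deviations / n              # eq(1.56), maximum likelihood
    unbiased_sum += squared_deviations / (n - 1)  # eq(1.59), corrected estimate

avg_ml = ml_sum / trials
avg_unbiased = unbiased_sum / trials
print(avg_ml, avg_unbiased)
```

With these settings the maximum likelihood average should come out near $\frac{4}{5}\cdot 4 = 3.2$, while the corrected estimate should average close to the true value 4.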

The bias of the maximum likelihood solution becomes less significant as the number $N$ of data points increases, and in the limit $N\to\infty$ the maximum likelihood solution for the variance equals the true variance of the distribution that generated the data.

### 1.2.5 Curve fitting re-visited

We assume that given the value of _x_, the corresponding value of _t_ has a Gaussian distribution with a mean equal to the value $y(x,\textbf{w})$. Thus we can express our uncertainty over the value of the target variable using a probability distribution.
$$
p(t\mid x, \textbf{w}, \beta) = \mathcal{N}(t\mid y(x, \textbf{w}), \beta^{-1}) \tag{1.60}
$$

We use the training data $\{\textbf{x},\textbf{t}\}$ to determine the values of the unknown parameters $\textbf{w}$ and $\beta$ by maximum likelihood. If the data are assumed to be drawn independently from the distribution, then the likelihood function is given by
$$
p(\textbf{t}\mid\textbf{x},\textbf{w},\beta) = \prod^{N}_{n=1}\mathcal{N}(t_{n}\mid y(x_{n}, \textbf{w}), \beta^{-1}) \tag{1.61}
$$

Substituting the form of the Gaussian distribution and taking the logarithm gives
$$
ln\,p(\textbf{t}\mid\textbf{x},\textbf{w},\beta) = -\frac{\beta}{2}\sum^{N}_{n=1}\{y(x_{n}, \textbf{w}) - t_{n}\}^{2} + \frac{N}{2}ln\,\beta - \frac{N}{2}ln\,(2\pi) \tag{1.62}
$$

Maximizing the likelihood with respect to $\textbf{w}$ is equivalent to minimizing the _sum-of-squares error function_ defined by eq(1.2), because only the first term of eq(1.62) depends on $\textbf{w}$. Maximizing the likelihood with respect to $\beta$ gives
$$
\frac{1}{\hat{\beta}} = \frac{1}{N}\sum^{N}_{n=1}\{y(x_{n}, \hat{\textbf{w}}) - t_{n}\}^{2} \tag{1.63}
$$
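
The two-step procedure (least squares for $\textbf{w}$, then eq(1.63) for $\beta$) can be sketched as follows. This is one possible implementation, not the method prescribed by the notes: it solves the normal equations with a small hand-written Gaussian elimination, and the synthetic quadratic ground truth, noise level, and all function names are assumptions for the example.

```python
import random

def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x

def poly(x, w):
    """y(x, w) = sum_j w_j x^j."""
    return sum(wj * x ** j for j, wj in enumerate(w))

def fit_ml(xs, ts, order):
    """Least-squares w (which maximizes the likelihood) and the ML precision beta."""
    k = order + 1
    phi = [[x ** j for j in range(k)] for x in xs]  # powers of each input
    gram = [[sum(row[i] * row[j] for row in phi) for j in range(k)] for i in range(k)]
    rhs = [sum(row[i] * t for row, t in zip(phi, ts)) for i in range(k)]
    w_hat = solve_linear_system(gram, rhs)          # normal equations
    inv_beta = sum((poly(x, w_hat) - t) ** 2 for x, t in zip(xs, ts)) / len(xs)  # eq(1.63)
    return w_hat, 1.0 / inv_beta

rng = random.Random(0)
true_w, noise_std, n = [1.0, 2.0, -3.0], 0.1, 1000  # synthetic, made-up ground truth
xs = [i / (n - 1) for i in range(n)]
ts = [poly(x, true_w) + rng.gauss(0.0, noise_std) for x in xs]

w_hat, beta_hat = fit_ml(xs, ts, order=2)
print(w_hat, 1.0 / beta_hat)
```

The recovered coefficients should be close to the assumed ground truth, and $1/\hat{\beta}$ should be close to the noise variance $0.1^{2} = 0.01$ used to generate the targets.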

Having determined the parameters $\textbf{w}$ and $\beta$, we can express our probabilistic model in terms of the _predictive distribution_ that gives the probability distribution over _t_.
$$p(t\mid x,\hat{\textbf{w}},\hat{\beta}) = \mathcal{N}(t\mid y(x, \hat{\textbf{w}}), \hat{\beta}^{-1}) \tag{1.64}$$

We introduce a prior distribution over the polynomial coefficients $\textbf{w}$. Here we use a Gaussian distribution for simplicity.
$$
p(\textbf{w}\mid\alpha) = \mathcal{N}(\textbf{w}\mid 0, \alpha^{-1}\textbf{I}) = \Big(\frac{\alpha}{2\pi}\Big)^{(M + 1) / 2} exp\big\{-\frac{\alpha}{2}\textbf{w}^{T}\textbf{w}\big\} \tag{1.65}
$$
where $\alpha$ is the precision of the distribution, and $M + 1$ is the total number of elements in the vector $\textbf{w}$ for an $M^{th}$ order polynomial.

Variables such as $\alpha$, which control the distribution of model parameters, are called _hyperparameters_. Using Bayes's theorem
$$p(\textbf{w}\mid\textbf{x}, \textbf{t}, \alpha, \beta) \propto p(\textbf{t}\mid\textbf{x}, \textbf{w}, \beta)p(\textbf{w}\mid\alpha)\tag{1.66}$$

We can now determine $\textbf{w}$ by finding its most probable value given the data, that is, by maximizing the posterior distribution.
Taking the negative logarithm of eq(1.66), we find that the maximum of the posterior is given by the minimum of
$$\frac{\beta}{2}\sum^{N}_{n=1}\{y(x_{n}, \textbf{w}) - t_{n}\}^{2} + \frac{\alpha}{2}\textbf{w}^{T}\textbf{w}\tag{1.67}$$
Minimizing this expression is equivalent to minimizing the regularized sum-of-squares error function of eq(1.4) with regularization parameter $\lambda = \frac{\alpha}{\beta}$.
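
The step from eq(1.66) to eq(1.67) can be written out explicitly. The lines below are a worked sketch that substitutes eq(1.62) and eq(1.65); "const" collects every term that does not depend on $\textbf{w}$, and the final form assumes that eq(1.4), which is not reproduced in this section, is the usual regularized sum-of-squares error.

$$
\begin{equation}
\begin{split}
-ln\,p(\textbf{w}\mid\textbf{x}, \textbf{t}, \alpha, \beta) & = -ln\,p(\textbf{t}\mid\textbf{x}, \textbf{w}, \beta) - ln\,p(\textbf{w}\mid\alpha) + \text{const}\\
& = \frac{\beta}{2}\sum^{N}_{n=1}\{y(x_{n}, \textbf{w}) - t_{n}\}^{2} + \frac{\alpha}{2}\textbf{w}^{T}\textbf{w} + \text{const}\\
\end{split}
\end{equation}
$$

Dividing by $\beta > 0$ does not move the minimizer, so the same $\textbf{w}$ minimizes

$$
\frac{1}{2}\sum^{N}_{n=1}\{y(x_{n}, \textbf{w}) - t_{n}\}^{2} + \frac{\lambda}{2}\textbf{w}^{T}\textbf{w}, \qquad \lambda = \frac{\alpha}{\beta}
$$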


# Machine Learning

## Learning Algorithms

A machine learning algorithm is a computer program that is said to learn from experience **E** with respect to some class of tasks **T** and performance measure **P** if its performance at tasks in **T**, as measured by **P**, improves with experience **E**.

### The Task, T

Machine learning tasks are usually described in terms of how the machine learning system should process an example. An example is a collection of features that have been quantitatively measured from some object or event that we want the machine learning system to process. We typically represent an example as a vector $x \in R^n$ where each entry $x_i$ of the vector is another feature.

#### Classification

The learning algorithm is usually asked to produce a function $f:R^n \to \{1, \dots, k\}$ to specify which of $k$ categories some input belongs to. There are other variants where $f$ outputs a probability distribution over classes.

#### Classification with missing inputs

To solve the classification task when some of the inputs may be missing, the learning algorithm must learn a set of functions. Each function corresponds to classifying $x$ with a different subset of its inputs missing. One way to efficiently define such a large set of functions is to learn a probability distribution over all the relevant variables, then solve the classification task by marginalizing out the missing variables. The computer program needs to learn only a single function describing the joint probability distribution.
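
The marginalization idea can be illustrated with a toy joint distribution stored as a table. The probabilities below are invented for the example, and `classify` is a hypothetical helper; a real system would learn the joint distribution rather than hard-code it.

```python
# Made-up joint distribution p(y, x1, x2) over binary variables, stored as a table.
joint = {
    (0, 0, 0): 0.30, (0, 0, 1): 0.10, (0, 1, 0): 0.05, (0, 1, 1): 0.05,
    (1, 0, 0): 0.05, (1, 0, 1): 0.05, (1, 1, 0): 0.10, (1, 1, 1): 0.30,
}

def classify(observed):
    """Return (most probable class, p(y | observed inputs)).

    `observed` maps an input index (1 or 2) to its value. Inputs that are
    absent from the dict are treated as missing and summed out.
    """
    scores = {0: 0.0, 1: 0.0}
    for (y, x1, x2), p in joint.items():
        values = {1: x1, 2: x2}
        if all(values[i] == v for i, v in observed.items()):
            scores[y] += p  # summing over unobserved inputs marginalizes them out
    total = sum(scores.values())
    posterior = {y: s / total for y, s in scores.items()}
    return max(posterior, key=posterior.get), posterior

label_full, post_full = classify({1: 1, 2: 1})   # both inputs observed
label_partial, post_partial = classify({1: 1})   # x2 missing, marginalized out
print(label_full, post_full, label_partial, post_partial)
```

A single table serves every pattern of missing inputs, which is the point of learning one joint distribution instead of a separate classifier per subset.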

#### Regression

The learning algorithm is asked to output a function $f:R^n \to R$ to predict a numerical value given some input.

#### Transcription

The machine learning system is asked to observe a relatively unstructured representation of some kind of data and transcribe the information into discrete textual form.

#### Machine translation

In a machine translation task, the input already consists of a sequence of symbols in some language, and the computer program must convert this into a sequence of symbols in another language.

#### Structured output

Structured output tasks involve any task where the output is a vector with important relationships between the different elements. This is a broad category and subsumes the transcription and translation tasks described above, as well as many other tasks.

#### Anomaly detection

The computer program sifts through a set of events or objects and flags some of them as being unusual or atypical.

#### Synthesis and sampling

The machine learning algorithm is asked to generate new examples that are similar to those in the training data. In some cases, we want the sampling or synthesis procedure to generate a specific kind of output given the input.

#### Imputation of missing values

In this type of task, the machine learning algorithm is given a new example $x \in R^n$, but with some entries $x_i$ of $x$ missing. The algorithm must provide a prediction of the values of the missing entries.

#### Denoising

The machine learning algorithm is given as input a corrupted example $\tilde{x} \in R^n$ obtained by an unknown corruption process from a clean example $x \in R^n$. The learner must predict the clean example $x$ from its corrupted version $\tilde{x}$, or more generally predict the conditional probability distribution $p(x \mid \tilde{x})$.

#### Density estimation or probability mass function estimation

The machine learning algorithm is asked to learn a function $p_{model}:R^n \to R$, where $p_{model}(x)$ can be interpreted as a probability density function (if $x$ is continuous) or a probability mass function (if $x$ is discrete) on the space that the examples were drawn from.

### The Performance Measure, P

To evaluate the abilities of a machine learning algorithm, we must design a quantitative measure of its performance. Usually this performance measure $P$ is specific to the task T being carried out by the system. For tasks such as classification we often measure the **accuracy** or the **error rate** of the model. We often refer to the error rate as the expected **0-1 loss**. The **0-1 loss** on a particular example is 0 if it is correctly classified and 1 if it is not. For tasks such as density estimation, the most common approach is to report the average log-probability the model assigns to some examples. Because we are usually interested in how well the algorithm performs on data that it has not seen before, we evaluate these performance measures using a **test set** of data that is separate from the data used for training the machine learning system. In other cases, we know what quantity we would ideally like to measure, but measuring it is impractical. In these cases, one must design a good approximation to the desired criterion.
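
The measures named above are short to compute. The sketch below uses made-up test-set labels, predictions, and model probabilities purely for illustration; the function names are hypothetical.

```python
import math

def zero_one_loss(prediction, target):
    """0 if the example is correctly classified, 1 otherwise."""
    return 0 if prediction == target else 1

def error_rate(predictions, targets):
    """Average 0-1 loss over a set of examples."""
    losses = [zero_one_loss(p, t) for p, t in zip(predictions, targets)]
    return sum(losses) / len(losses)

def average_log_probability(probabilities):
    """Mean of log p_model(x) over examples, as used for density estimation."""
    return sum(math.log(p) for p in probabilities) / len(probabilities)

# Made-up test-set labels and predictions for a 3-class problem.
targets = [0, 2, 1, 1, 0, 2, 2, 1]
predictions = [0, 2, 1, 0, 0, 1, 2, 1]

err = error_rate(predictions, targets)
acc = 1.0 - err
print(err, acc)  # 0.25 0.75

# Made-up probabilities that a density model assigns to four test examples.
avg_log_prob = average_log_probability([0.5, 0.25, 0.125, 0.125])
print(avg_log_prob)
```

Accuracy and error rate always sum to one, so reporting either is enough. A higher (less negative) average log-probability means the model assigns more probability to the held-out examples.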

### The Experience, E

Machine learning algorithms can be broadly categorized as unsupervised or supervised. Most of the learning algorithms can be understood as being allowed to experience an entire dataset. A dataset is a collection of many examples. Sometimes we call examples data points.

**Unsupervised learning** algorithms experience a dataset containing many features, then learn useful properties of the structure of this dataset.

**Supervised learning** algorithms experience a dataset containing features, but each example is also associated with a label or target.

Roughly speaking, unsupervised learning involves observing several examples of a random vector $x$ and attempting to implicitly or explicitly learn the probability distribution $p(x)$, or some interesting properties of that distribution; while supervised learning involves observing several examples of a random vector $x$ and an associated value or vector $y$, then learning to predict $y$ from $x$, usually by estimating $p(y \mid x)$.

The chain rule of probability states that for a vector $x \in R^n$, the joint distribution can be decomposed as

$$
p(x) = \prod_{i=1}^{n} p(x_i \mid x_1, \dots, x_{i-1})
$$

This decomposition means that we can solve the ostensibly unsupervised problem of modeling $p(x)$ by splitting it into $n$ supervised learning problems.
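
The decomposition can be verified numerically on a small discrete example. In the sketch below the joint distribution over three binary variables is built from arbitrary made-up weights, and each factor $p(x_i \mid x_1, \dots, x_{i-1})$ is read off the table; in a learning setting each factor would instead be the target of one supervised problem.

```python
from itertools import product

# Made-up joint distribution over three binary variables (the weights are arbitrary).
weights = {x: 1 + x[0] + 2 * x[1] * x[2] + x[0] * x[2] for x in product((0, 1), repeat=3)}
total = sum(weights.values())
joint = {x: w / total for x, w in weights.items()}

def marginal(prefix):
    """p(x_1, ..., x_k) for the given prefix, summing out the remaining variables."""
    k = len(prefix)
    return sum(p for x, p in joint.items() if x[:k] == prefix)

def conditional(i, x):
    """p(x_i | x_1, ..., x_{i-1}) evaluated at the point x (i is 1-based)."""
    return marginal(x[:i]) / marginal(x[:i - 1])

def chain_rule_product(x):
    """Product of the n conditional factors for the point x."""
    result = 1.0
    for i in range(1, len(x) + 1):
        result *= conditional(i, x)
    return result

# Largest discrepancy between the product of conditionals and the joint itself.
max_gap = max(abs(chain_rule_product(x) - joint[x]) for x in joint)
print(max_gap)
```

The gap is zero up to floating-point rounding, since the conditional factors telescope back into the joint probability.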

# Reference

*   [Deep Learning](http://www.deeplearningbook.org/)