## I. Simple Curve Fitting

### A. Version $\alpha$

* Polynomial:
    * $y(x,\mathbf{w}) = w_0 + w_1x + w_2x^2 + \dots + w_Mx^M = \sum_{j=0}^Mw_jx^j$.
* Sum-of-Squares (SS) Error:
    * $E(\mathbf{w}) = \frac{1}{2}\sum_{n=1}^N\{y(x_n,\mathbf{w})-t_n\}^2$.
* Root-Mean-Squared (RMS) Error (for evaluating Generalization):
    * $E_{RMS} = \sqrt{2E(\mathbf{w}^*)/N}$, where $\mathbf{w}^* = \arg\min_{\mathbf{w}}E(\mathbf{w})$.
* L2/Ridge Regularization:
    * $\tilde{E}(\mathbf{w}) = \frac{1}{2}\sum_{n=1}^N\{y(x_n,\mathbf{w})-t_n\}^2 + \frac{\lambda}{2}||\mathbf{w}||^2$.
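
As a quick numeric check of the definitions above, here is a minimal Python sketch that evaluates the polynomial, the SS error, the RMS error, and the ridge-regularized error for a given weight vector. The function names and the toy data are illustrative assumptions, not part of the notes. Note that `rms_error` evaluates the formula at whatever `w` is passed in, whereas the definition uses the minimizer $\mathbf{w}^*$.

```python
import math

def poly(x, w):
    """y(x, w) = sum_j w[j] * x**j, with w = [w_0, ..., w_M]."""
    return sum(w_j * x**j for j, w_j in enumerate(w))

def ss_error(w, xs, ts):
    """E(w) = 1/2 * sum_n (y(x_n, w) - t_n)^2."""
    return 0.5 * sum((poly(x, w) - t) ** 2 for x, t in zip(xs, ts))

def rms_error(w, xs, ts):
    """E_RMS = sqrt(2 E(w) / N), evaluated at the supplied w."""
    return math.sqrt(2.0 * ss_error(w, xs, ts) / len(xs))

def ridge_error(w, xs, ts, lam):
    """E~(w) = E(w) + lam/2 * ||w||^2 (w_0 is penalized too, as in the formula)."""
    return ss_error(w, xs, ts) + 0.5 * lam * sum(w_j**2 for w_j in w)

# Toy data, made up for illustration: the line 1 + 2x misses the last target by 1.
xs, ts = [0.0, 1.0, 2.0], [1.0, 3.0, 6.0]
w = [1.0, 2.0]

print(ss_error(w, xs, ts))          # 0.5
print(rms_error(w, xs, ts))         # sqrt(1/3)
print(ridge_error(w, xs, ts, 2.0))  # 0.5 + (2/2) * (1 + 4) = 5.5
```

Dividing by $N$ and taking the square root puts the error on the same scale as $t$, which is what makes $E_{RMS}$ comparable across data sets of different sizes.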

### B. Version $\beta$

* Gaussian Model:
    * Idea: Given inputs $\mathbf{x}=(x_1,...,x_N)^T$ and their corresponding target values $\mathbf{t}=(t_1,...,t_N)^T$, model the uncertainty over the target value $t$ at an input $x$ with a probability distribution (i.e. $p(t\mid x)$). The distribution is a Gaussian whose mean is the polynomial prediction $y(x,\mathbf{w})$ and whose variance is $\sigma^2$.
    * Model: 
        * Equation: $p(t\mid x,\mathbf{w},\beta) = \mathcal{N}(t\mid y(x,\mathbf{w}),\beta^{-1})$.
        * $\beta$: Precision, $\beta=\frac{1}{\sigma^2}$; $\mathbf{w} = (w_0,...,w_M)^T$.
* MLE:
    * i) Likelihood (of Target Values): $p(\mathbf{t}\mid \mathbf{x},\mathbf{w},\beta) = \prod_{n=1}^N\mathcal{N}(t_n|y(x_n,\mathbf{w}),\beta^{-1})$.
    * ii) Log-Likelihood: $\mathtt{ln}p(\mathbf{t}\mid \mathbf{x},\mathbf{w},\beta) = -\frac{\beta}{2}\sum_{n=1}^N\{y(x_n,\mathbf{w})-t_n\}^2 + \frac{N}{2}\mathtt{ln}\beta - \frac{N}{2}\mathtt{ln}(2\pi)$.
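    * iii) Find the ML Values of the Parameters from the Log-Likelihood: $\mathbf{w}_{ML}$ and $\beta_{ML}$. The first term of the log-likelihood is $-\beta E(\mathbf{w})$ and the remaining terms do not depend on $\mathbf{w}$, so maximizing over $\mathbf{w}$ is equivalent to minimizing the SS error; this is why the SS error is used as the cost function. Then $\beta_{ML}^{-1}=\frac{1}{N}\sum_{n=1}^N\{y(x_n,\mathbf{w}_{ML})-t_n\}^2$.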
    * iv) Formulate Predictive Distribution: $p(t\mid x,\mathbf{w}_{ML},\beta_{ML}) = \mathcal{N}(t\mid y(x,\mathbf{w}_{ML}),\beta_{ML}^{-1})$.
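
The log-likelihood and the precision estimate can be checked numerically. The sketch below is a minimal illustration with assumed function names and made-up toy data; the weights are hypothetical and are not claimed to be $\mathbf{w}_{ML}$. It relies on the fact that, for any fixed $\mathbf{w}$, the $\beta$ that maximizes the log-likelihood is the inverse of the mean squared residual.

```python
import math

def poly(x, w):
    """y(x, w) = sum_j w[j] * x**j."""
    return sum(w_j * x**j for j, w_j in enumerate(w))

def sq_residuals(w, xs, ts):
    """sum_n (y(x_n, w) - t_n)^2."""
    return sum((poly(x, w) - t) ** 2 for x, t in zip(xs, ts))

def log_likelihood(w, beta, xs, ts):
    """ln p(t | x, w, beta) for i.i.d. Gaussian noise with precision beta."""
    n = len(xs)
    return (-0.5 * beta * sq_residuals(w, xs, ts)
            + 0.5 * n * math.log(beta)
            - 0.5 * n * math.log(2.0 * math.pi))

def beta_ml(w, xs, ts):
    """Precision maximizing the log-likelihood at fixed w: N / sum of squared residuals."""
    return len(xs) / sq_residuals(w, xs, ts)

# Toy data and weights, made up for illustration.
xs, ts = [0.0, 1.0, 2.0], [1.0, 3.0, 6.0]
w = [1.0, 2.0]

b = beta_ml(w, xs, ts)
print(b)  # 3.0, since the squared residuals sum to 1 over N = 3 points
# Moving beta away from b in either direction lowers the log-likelihood.
print(log_likelihood(w, b, xs, ts) > log_likelihood(w, 0.9 * b, xs, ts))  # True
print(log_likelihood(w, b, xs, ts) > log_likelihood(w, 1.1 * b, xs, ts))  # True
```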
    
* Half Bayesian Approach to Find $\mathbf{w}$ (MAP, 48:1.65-6, point estimate):
    * Prior Distribution of $\mathbf{w}$: $p(\mathbf{w}\mid\alpha) = \mathcal{N}(\mathbf{w}\mid 0,\alpha^{-1}\mathbf{I})$.
    * Posterior Distribution of $\mathbf{w}$: $p(\mathbf{w}\mid \mathbf{x},\mathbf{t},\alpha,\beta) \propto p(\mathbf{t}\mid \mathbf{x},\mathbf{w},\beta)p(\mathbf{w}\mid\alpha)$.
    * Take the negative logarithm of the posterior and minimize it with respect to $\mathbf{w}$ to obtain the point estimate $\mathbf{w}_{MAP}$.
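
Spelling out the negative log posterior makes the link to ridge regularization explicit. This is a sketch that collects every term not depending on $\mathbf{w}$ into a constant:

$$
-\ln p(\mathbf{w}\mid \mathbf{x},\mathbf{t},\alpha,\beta) = \frac{\beta}{2}\sum_{n=1}^N\{y(x_n,\mathbf{w})-t_n\}^2 + \frac{\alpha}{2}\mathbf{w}^T\mathbf{w} + \text{const}.
$$

Dividing by $\beta$ shows that minimizing this expression is the same as minimizing the regularized error $\tilde{E}(\mathbf{w})$ from Version $\alpha$ with $\lambda = \alpha/\beta$.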

* Full Bayesian Approach:
    * $\alpha,\beta$ are assumed to be known hyperparameters.
    * Expectation of Target Distribution: $p(t\mid x,\mathbf{x},\mathbf{t}) = \int p(t\mid x,\mathbf{w})p(\mathbf{w}\mid\mathbf{x},\mathbf{t})d\mathbf{w}$.
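
For the polynomial model with the Gaussian prior above, this integral has a closed form. The standard result is sketched below using the assumed notation $\boldsymbol{\phi}(x) = (1, x, \dots, x^M)^T$, which the notes do not define:

$$
p(t\mid x,\mathbf{x},\mathbf{t}) = \mathcal{N}(t\mid m(x), s^2(x)),
$$

$$
m(x) = \beta\,\boldsymbol{\phi}(x)^T\mathbf{S}\sum_{n=1}^N\boldsymbol{\phi}(x_n)t_n, \qquad
s^2(x) = \beta^{-1} + \boldsymbol{\phi}(x)^T\mathbf{S}\,\boldsymbol{\phi}(x), \qquad
\mathbf{S}^{-1} = \alpha\mathbf{I} + \beta\sum_{n=1}^N\boldsymbol{\phi}(x_n)\boldsymbol{\phi}(x_n)^T.
$$

The first term of $s^2(x)$ is the noise on the target; the second reflects the remaining uncertainty in $\mathbf{w}$, which a point estimate ignores.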

### C. Concepts

* Bias:
    * E.g. the ML estimate systematically underestimates the variance of a distribution; likewise, a straight line fitted to $\sin(\pi x)$ is systematically off regardless of which data set is drawn.

## II. Basics

### A. Probability

* Basic Rules:
    * Sum Rule: $p(X) = \sum_Yp(X,Y)$.
    * Product Rule: $p(X,Y) = p(Y|X)p(X)$.
* Bayes Rule:
    * $p(Y|X) = \frac{p(X|Y)p(Y)}{p(X)}$, where $p(X) = \sum_Yp(X|Y)p(Y)$.
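
A small numeric example shows how the three rules combine. The sketch below uses a made-up diagnostic-test scenario; the function name and all probabilities are assumptions chosen for illustration.

```python
def posterior(prior, likelihood):
    """Return p(Y | X=x) for every Y, given p(Y) and p(X=x | Y)."""
    joint = {y: likelihood[y] * prior[y] for y in prior}  # product rule: p(X=x, Y)
    evidence = sum(joint.values())                        # sum rule: p(X=x)
    return {y: joint[y] / evidence for y in joint}        # Bayes rule

# Made-up numbers: Y is the health state, X=x is a positive test result.
prior = {"disease": 0.01, "healthy": 0.99}
likelihood = {"disease": 0.90, "healthy": 0.05}  # p(positive | Y)

post = posterior(prior, likelihood)
print(post["disease"])  # about 0.154
```

Even with a fairly accurate test, the posterior stays low because the prior $p(Y)$ is small; the denominator $p(X)$ is dominated by false positives from the healthy group.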

### B. Decision

* Expected Loss (Classification):
    * Let $L_{kj}$ be the cost/loss incurred by classifying a datum $\mathbf{x}$ of true class $k$ as class $j$. This lets us weight *Type I/II* Errors (i.e. false positives / false negatives) differently; $L_{kk}$ is set to $0$.
    * Cost: 
        * Total Cost: $E[L] = \sum_k\sum_j\int_{R_j}L_{kj}p(\mathbf{x},\mathcal{C}_k)d\mathbf{x}$.
        * Cost of Assigning $\mathbf{x}$ to Decision Region $R_j$: $\sum_kL_{kj}p(\mathbf{x},\mathcal{C}_k) = \sum_kL_{kj}p(\mathcal{C}_k\mid \mathbf{x})p(\mathbf{x}) \propto \sum_kL_{kj}p(\mathcal{C}_k\mid \mathbf{x})$. Each $\mathbf{x}$ is assigned to the class $j$ for which this quantity is smallest.
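
The resulting decision rule, assigning $\mathbf{x}$ to the class $j$ with the smallest $\sum_kL_{kj}p(\mathcal{C}_k\mid\mathbf{x})$, can be sketched in a few lines of Python. The loss matrix and posterior below are made-up values, and the function names are assumptions.

```python
def expected_losses(loss, post):
    """Return sum_k loss[k][j] * post[k] for each candidate decision j."""
    n_classes = len(post)
    n_decisions = len(loss[0])
    return [sum(loss[k][j] * post[k] for k in range(n_classes))
            for j in range(n_decisions)]

def decide(loss, post):
    """Pick the decision j with the smallest expected loss."""
    costs = expected_losses(loss, post)
    return min(range(len(costs)), key=costs.__getitem__)

# Rows: true class k (0 = cancer, 1 = normal); columns: decision j.
# Missing a cancer costs 100, a false alarm costs 1 (illustrative values).
loss = [[0, 100],
        [1, 0]]
post = [0.05, 0.95]  # p(C_k | x), made up

print(expected_losses(loss, post))  # roughly [0.95, 5.0]
print(decide(loss, post))           # 0: "cancer", despite the low posterior
```

With an asymmetric loss the chosen class need not be the most probable one, which is the point of weighting the two error types differently.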
        
* Expected Loss (Regression):
    * Cost:
        * Total (Quadratic) Cost: $E[L] = \int\int L(t,y(\mathbf{x}))p(\mathbf{x},t)d\mathbf{x}dt = \int\int\{y(\mathbf{x})-t\}^2p(\mathbf{x},t)d\mathbf{x}dt$.
    * Model $y(\mathbf{x})$ that Minimizes Cost:
        * Differentiate $E[L]$ wrt. $y(\mathbf{x})$ and set the result to $0$: $\frac{\delta E[L]}{\delta y(\mathbf{x})}=2\int\{y(\mathbf{x})-t\}p(\mathbf{x},t)dt = 0$.
        * Solve for $y(\mathbf{x})$: $y(\mathbf{x})=\frac{\int tp(\mathbf{x},t)dt}{p(\mathbf{x})} = \int tp(t\mid\mathbf{x})dt = E_t[t\mid\mathbf{x}]$.  
    * Minimizable & Noise:
        * $E[L] = \int\{y(\mathbf{x})-E[t\mid\mathbf{x}]\}^2p(\mathbf{x})d\mathbf{x} + \int\int\{E[t\mid\mathbf{x}]-t\}^2p(\mathbf{x},t)d\mathbf{x}dt$ (derivation. 65:1.89-90).
        * The first term is the minimizable part: it becomes $0$ when $y(\mathbf{x})=E[t\mid\mathbf{x}]$. The second term is the variance of the distribution of $t$, averaged over $\mathbf{x}$; it is intrinsic to the target data and cannot be reduced.
* Rejection Option:
    * Set a threshold parameter $\theta$ and only make a decision when $\max_k p(\mathcal{C}_k\mid\mathbf{x})>\theta$.
    * Leave the rejected cases to, e.g., human experts to decide.
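
The rejection option can be sketched as a thin wrapper around the most probable class. The function name and the example posteriors are assumptions for illustration.

```python
def classify_with_reject(post, theta):
    """Return the index of the most probable class, or None to reject."""
    k = max(range(len(post)), key=post.__getitem__)
    return k if post[k] > theta else None

print(classify_with_reject([0.90, 0.10], theta=0.8))  # 0
print(classify_with_reject([0.55, 0.45], theta=0.8))  # None: defer to an expert
```

For $K$ classes the largest posterior is at least $1/K$, so $\theta < 1/K$ rejects nothing, while $\theta = 1$ rejects every input; values in between trade automation against error rate.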

### C. Information

* Entropy (Uniformity):
    * $H[x] = -\sum_xp(x)\log_2p(x)$.
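
To see why entropy measures uniformity, compare a uniform distribution over 8 states with a skewed one. This is a minimal sketch; the function name is an assumption, and states with zero probability are skipped, following the convention $0\log_2 0 = 0$.

```python
import math

def entropy_bits(p):
    """H = -sum_x p(x) * log2 p(x), skipping zero-probability states."""
    return -sum(p_i * math.log2(p_i) for p_i in p if p_i > 0)

uniform = [1 / 8] * 8
skewed = [1 / 2, 1 / 4, 1 / 8, 1 / 16, 1 / 64, 1 / 64, 1 / 64, 1 / 64]

print(entropy_bits(uniform))  # 3.0 bits
print(entropy_bits(skewed))   # 2.0 bits
```

The uniform distribution attains the maximum, $\log_2 8 = 3$ bits; concentrating probability on a few states lowers the entropy.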

## III. Probability Distributions

### A. Bernoulli

* PMF: $Bern(x\mid\mu) = \mu^x(1-\mu)^{1-x}$, for $x\in\{0,1\}$.
* Expectation: $\mu$.
* Var: $\mu(1-\mu)$.
* Likelihood: $p(\mathcal{D}\mid\mu) = \prod_{n=1}^Np(x_n\mid\mu) = \prod_{n=1}^N\mu^{x_n}(1-\mu)^{1-x_n}$.
* MLE: $\frac{\partial \mathtt{ln}p(\mathcal{D}\mid\mu)}{\partial \mu} = 0 \Rightarrow \mu_{ML} = \bar{x}$.
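
Setting the derivative of the log-likelihood to zero gives $\sum_n x_n/\mu - \sum_n (1-x_n)/(1-\mu) = 0$, whose solution is the sample mean. The sketch below checks this numerically on a made-up data set; the function name is an assumption.

```python
import math

def bernoulli_log_likelihood(mu, data):
    """ln p(D | mu) = sum_n [x_n ln mu + (1 - x_n) ln(1 - mu)], for 0 < mu < 1."""
    return sum(x * math.log(mu) + (1 - x) * math.log(1 - mu) for x in data)

data = [1, 0, 1, 1, 0, 1, 1, 1]   # made-up coin flips: 6 ones out of 8
mu_ml = sum(data) / len(data)     # sample mean

print(mu_ml)  # 0.75
# Nearby values of mu give a lower log-likelihood.
print(bernoulli_log_likelihood(mu_ml, data) > bernoulli_log_likelihood(0.70, data))  # True
print(bernoulli_log_likelihood(mu_ml, data) > bernoulli_log_likelihood(0.80, data))  # True
```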

### B. Binomial

* PMF: $Bin(m\mid N,\mu) = \binom{N}{m}\mu^m(1-\mu)^{N-m}$, for $m\in\{0,\dots,N\}$.
* Expectation: $N\mu$.
* Var: $N\mu(1-\mu)$.
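
The expectation and variance can be verified by summing over the distribution directly. This is a minimal sketch with an assumed function name and arbitrary parameter values; `math.comb` requires Python 3.8 or later.

```python
import math

def binom_pmf(m, n, mu):
    """Bin(m | n, mu) = C(n, m) * mu^m * (1 - mu)^(n - m)."""
    return math.comb(n, m) * mu**m * (1 - mu) ** (n - m)

n, mu = 10, 0.3  # arbitrary illustrative values
pmf = [binom_pmf(m, n, mu) for m in range(n + 1)]

mean = sum(m * p for m, p in enumerate(pmf))
var = sum((m - mean) ** 2 * p for m, p in enumerate(pmf))

print(sum(pmf))  # approximately 1.0
print(mean)      # approximately n * mu = 3.0
print(var)       # approximately n * mu * (1 - mu) = 2.1
```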

### C. Beta

* PDF: $Beta(\mu\mid a,b) = \frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)}\mu^{a-1}(1-\mu)^{b-1}$.
* Expectation: $\frac{a}{a+b}$.
* Var: $\frac{ab}{(a+b)^2(a+b+1)}$.
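
The normalization, expectation, and variance can be checked by numerical integration. The sketch below uses a simple midpoint rule and assumes $a,b\ge 1$ so that the density stays bounded on $[0,1]$; the function names and parameter values are illustrative assumptions.

```python
import math

def beta_pdf(mu, a, b):
    """Beta(mu | a, b) with the Gamma-function normalizer."""
    norm = math.gamma(a + b) / (math.gamma(a) * math.gamma(b))
    return norm * mu ** (a - 1) * (1 - mu) ** (b - 1)

def integrate(f, steps=10_000):
    """Midpoint rule on [0, 1]."""
    h = 1.0 / steps
    return h * sum(f((i + 0.5) * h) for i in range(steps))

a, b = 2.0, 3.0  # arbitrary illustrative values
total = integrate(lambda m: beta_pdf(m, a, b))
mean = integrate(lambda m: m * beta_pdf(m, a, b))
var = integrate(lambda m: (m - mean) ** 2 * beta_pdf(m, a, b))

print(total)  # approximately 1.0
print(mean)   # approximately a / (a + b) = 0.4
print(var)    # approximately ab / ((a + b)^2 (a + b + 1)) = 0.04
```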
