# Machine Learning Miscellaneous  

## MSE Variance-Bias Decomposition

Treat the target $y$ as fixed and take the expectation over the randomness of the estimator $\hat{y}$. Using $\mathbb{E}[\hat{y}^2] = Var[\hat{y}] + \mathbb{E}[\hat{y}]^2$:

$$\begin{split}
MSE(\hat{y}) &= \mathbb{E}[(y - \hat{y})^2] = \mathbb{E}[y^2 - 2y\hat{y} + \hat{y}^2] = \mathbb{E}[\hat{y}^2] - 2y\mathbb{E}[\hat{y}] + y^2 \\
&= Var[\hat{y}] + (\mathbb{E}[\hat{y}]^2 - 2y\mathbb{E}[\hat{y}] + y^2) = Var[\hat{y}] + (\mathbb{E}[\hat{y}] - y)^2 \\
&= Var[\hat{y}] + \text{Bias}(\hat{y})^2
\end{split}$$
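The decomposition can be checked numerically. The sketch below assumes a fixed target `y` and a made-up estimator (bias of 0.5, Gaussian noise with standard deviation 2); these values are illustrative only. With empirical means and the population variance (`ddof=0`), the identity holds up to floating-point error.

```python
import numpy as np

rng = np.random.default_rng(0)

y = 3.0  # fixed true value (assumed constant)
# Hypothetical estimator: biased by +0.5, with noise of standard deviation 2.
y_hat = y + 0.5 + 2.0 * rng.standard_normal(100_000)

mse = np.mean((y - y_hat) ** 2)
variance = np.var(y_hat)       # population variance (ddof=0)
bias = np.mean(y_hat) - y

# Both numbers agree up to floating-point error.
print(mse, variance + bias ** 2)
```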


## Derivatives with respect to Vectors

Let $x \in \mathbb{R}^n$ (a column vector) and let $f: \mathbb{R}^n \rightarrow \mathbb{R}$.  The derivative of $f$ with respect to $x$ is the row vector:

$$\frac{\partial f}{\partial x} = [\frac{\partial f}{\partial x_1}, \cdots, \frac{\partial f}{\partial x_n}]$$

which is the gradient of $f$ written as a row vector (the numerator-layout convention, used throughout this section).

### Hessian Matrix 

The **Hessian matrix** is the square matrix of second partial derivatives of a scalar-valued function $f$.

$$H(f) =   \begin{bmatrix}
    \frac{\partial^2 f}{\partial x_1^2} & \cdots & \frac{\partial^2 f}{\partial x_1 \partial x_n} \\
    \frac{\partial^2 f}{\partial x_2 \partial x_1} & \cdots & \frac{\partial^2 f}{\partial x_2 \partial x_n} \\
    \vdots & \ddots & \vdots \\
    \frac{\partial^2 f}{\partial x_n \partial x_1} & \cdots & \frac{\partial^2 f}{\partial x_n^2}
  \end{bmatrix}$$

1. If the gradient is $0$ at $x$, then $f$ has a **critical point** at $x$. The determinant of the Hessian at $x$ is called the **discriminant**.
    - If the determinant is $0$, then $x$ is a degenerate critical point and the test is inconclusive.
    - If the determinant is not $0$:
        - If the Hessian is positive definite, $f$ attains a local minimum at $x$.
        - If the Hessian is negative definite, $f$ attains a local maximum at $x$.
        - If the Hessian has both positive and negative eigenvalues, $x$ is a saddle point of $f$.
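The second-derivative test can be sketched in code by inspecting the eigenvalues of the Hessian. The function name `classify_critical_point` and the tolerance `tol` are illustrative choices, and the Hessian is assumed to be symmetric and evaluated at a point where the gradient vanishes.

```python
import numpy as np

def classify_critical_point(hessian, tol=1e-9):
    """Classify a critical point from the eigenvalues of a symmetric Hessian."""
    eigvals = np.linalg.eigvalsh(hessian)  # real eigenvalues, ascending order
    if np.any(np.abs(eigvals) <= tol):
        # A (numerically) zero eigenvalue means the determinant is zero.
        return "degenerate (test inconclusive)"
    if np.all(eigvals > 0):
        return "local minimum"
    if np.all(eigvals < 0):
        return "local maximum"
    return "saddle point"

# f(x1, x2) = x1^2 - x2^2 has a critical point at the origin with Hessian diag(2, -2).
print(classify_critical_point(np.array([[2.0, 0.0], [0.0, -2.0]])))
```

Checking eigenvalue signs rather than the determinant alone is needed because a positive determinant does not distinguish a minimum from a maximum.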
    

Let $x \in \mathbb{R}^n$ (a column vector) and let $f: \mathbb{R}^n \rightarrow \mathbb{R}^m$.  The derivative of $f$ with respect to $x$ is the $m \times n$ matrix:

$$\frac{\partial f}{\partial x} = \begin{bmatrix}
    \frac{\partial f(x)_1}{\partial x_1} & \cdots & \frac{\partial f(x)_1}{\partial x_n} \\
    \vdots & \ddots & \vdots \\
    \frac{\partial f(x)_m}{\partial x_1} & \cdots & \frac{\partial f(x)_m}{\partial x_n}
  \end{bmatrix}$$

which is the **Jacobian matrix** of $f$.
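As a rough illustration of the $m \times n$ layout, the Jacobian can be approximated column by column with central differences. The helper `numerical_jacobian`, the step size `eps`, and the example map `f` are all assumptions made for this sketch.

```python
import numpy as np

def numerical_jacobian(f, x, eps=1e-6):
    """Approximate the m-by-n Jacobian of f at x with central differences."""
    x = np.asarray(x, dtype=float)
    m = np.asarray(f(x)).size
    jac = np.zeros((m, x.size))
    for j in range(x.size):
        step = np.zeros_like(x)
        step[j] = eps
        # Column j holds the partial derivatives of every output w.r.t. x_j.
        jac[:, j] = (np.asarray(f(x + step)) - np.asarray(f(x - step))) / (2 * eps)
    return jac

# Hypothetical example map f: R^2 -> R^3.
def f(x):
    return np.array([x[0] * x[1], np.sin(x[0]), x[1] ** 2])

J = numerical_jacobian(f, [1.0, 2.0])
print(J.shape)  # (3, 2): one row per output, one column per input
```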

### Common Derivative Terms

$$\frac{\partial u^Tx}{\partial x} = u^T$$

$$\frac{\partial x^Tx}{\partial x} = 2x^T$$

$$\frac{\partial Ax}{\partial x} = A$$

$$\frac{\partial x^TAx}{\partial x} = x^T(A + A^T)$$
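The scalar-valued identities above can be sanity-checked against finite differences. This is a minimal sketch with random test data; `row_derivative` is a helper introduced here for illustration, and it returns the derivative as a 1-D array standing in for the row vector.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4
x = rng.standard_normal(n)
u = rng.standard_normal(n)
A = rng.standard_normal((n, n))

def row_derivative(f, x, eps=1e-6):
    """Central-difference derivative of a scalar function f at x."""
    basis = np.eye(x.size)
    return np.array([(f(x + eps * e) - f(x - eps * e)) / (2 * eps) for e in basis])

d_linear = row_derivative(lambda v: u @ v, x)         # should match u^T
d_square = row_derivative(lambda v: v @ v, x)         # should match 2 x^T
d_quadratic = row_derivative(lambda v: v @ A @ v, x)  # should match x^T (A + A^T)

print(np.allclose(d_linear, u, atol=1e-6))
print(np.allclose(d_square, 2 * x, atol=1e-6))
print(np.allclose(d_quadratic, x @ (A + A.T), atol=1e-6))
```

Note that $A$ is deliberately not symmetric here, so the check would fail if the formula were simplified to $2x^TA$.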

## LS with uncertain A matrix (NEED DOUBLE CHECK)

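Let $A(\delta) = A_0 + \sum_{i=1}^P A_i \delta_i$, so that

$A(\delta)x = A_0 x + \sum_{i=1}^P A_i x \delta_i$

Assume the $\delta_i$ are zero-mean and uncorrelated with variances $\sigma_i^2$, i.e. $\mathbb{E}[\delta_i] = 0$, $\mathbb{E}[\delta_i^2] = \sigma_i^2$, and $\mathbb{E}[\delta_i \delta_j] = 0$ for $i \neq j$. Then:

Zero Mean:  $\mathbb{E}[A(\delta)x] = A_0 x$

The cross terms vanish in expectation, and only the $i = j$ terms of the double sum survive:

$$\begin{split} 
\mathbb{E}[x^T A(\delta)^T A(\delta)x] & = \mathbb{E}[x^TA_0^TA_0x + 2 \sum_{i=1}^P x^TA_0^TA_ix\delta_i + \sum_{i=1}^P\sum_{j=1}^P x^TA_i^TA_jx \delta_i \delta_j] \\ &= x^TA_0^TA_0x + \sum_{i=1}^P \sigma_i ^2 x^TA_i^TA_i x\end{split}$$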
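One way to double-check the expectation is a Monte Carlo comparison. The sketch below assumes independent zero-mean Gaussian $\delta_i$ with standard deviations `sigma`; the dimensions and matrices are random placeholders, not values from these notes.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, P = 5, 3, 2
A0 = rng.standard_normal((m, n))
A_pert = rng.standard_normal((P, m, n))  # stacks A_1, ..., A_P
sigma = np.array([0.5, 1.5])             # assumed standard deviations of delta_i
x = rng.standard_normal(n)

# Closed form: x^T A0^T A0 x + sum_i sigma_i^2 x^T A_i^T A_i x
Aix = A_pert @ x                         # shape (P, m); row i is A_i x
closed_form = np.sum((A0 @ x) ** 2) + np.sum(sigma ** 2 * np.sum(Aix ** 2, axis=1))

# Monte Carlo estimate of E[||A(delta) x||^2]
delta = sigma * rng.standard_normal((200_000, P))
outputs = A0 @ x + delta @ Aix           # shape (200_000, m); each row is A(delta) x
monte_carlo = np.mean(np.sum(outputs ** 2, axis=1))

print(closed_form, monte_carlo)          # the two values should be close
```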

### Univariate LASSO

$$\newcommand{\norm}[1]{\left\lVert#1\right\rVert}$$

$$\min_{x \in \mathbb{R}} f(x) = \frac{1}{2} \norm{ax - y}_2^2 + \lambda \lvert x \rvert$$
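A standard result, not derived in these notes, is that this problem is solved by soft-thresholding: with $\rho = a^Ty$, the minimizer is $x^* = \text{sign}(\rho)\max(\lvert \rho \rvert - \lambda, 0) / (a^Ta)$. The sketch below assumes $a \neq 0$ and $\lambda \geq 0$; the function name `univariate_lasso` is illustrative.

```python
import numpy as np

def univariate_lasso(a, y, lam):
    """Minimize 0.5 * ||a*x - y||^2 + lam * |x| over scalar x (assumes a != 0)."""
    a, y = np.asarray(a, dtype=float), np.asarray(y, dtype=float)
    rho = a @ y
    # Soft-threshold rho by lam, then rescale by ||a||^2.
    return np.sign(rho) * max(abs(rho) - lam, 0.0) / (a @ a)

a = np.array([1.0, 2.0])
y = np.array([2.0, 1.0])
print(univariate_lasso(a, y, lam=1.0))  # (4 - 1) / 5 = 0.6
print(univariate_lasso(a, y, lam=5.0))  # 0.0, since |a^T y| <= lam
```

Once $\lambda \geq \lvert a^Ty \rvert$ the solution is exactly zero, which is the sparsity-inducing behavior associated with the $\ell_1$ penalty.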

## Laplace Distribution

$$p(x; \mu, b) = \frac{1}{2b}\exp\left(-\frac{\lvert x - \mu \rvert}{b}\right)$$
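The density can be written directly from the formula. The sketch below, with illustrative parameter values $\mu = 1$ and $b = 2$, uses a simple Riemann sum to check that it integrates to approximately 1; the grid range and resolution are arbitrary choices.

```python
import numpy as np

def laplace_pdf(x, mu=0.0, b=1.0):
    """Density of the Laplace distribution with location mu and scale b > 0."""
    return np.exp(-np.abs(x - mu) / b) / (2.0 * b)

# Riemann-sum check that the density integrates to (approximately) 1.
grid = np.linspace(-50.0, 50.0, 200_001)
dx = grid[1] - grid[0]
total = np.sum(laplace_pdf(grid, mu=1.0, b=2.0)) * dx
print(total)
```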

### Terminologies

- Ridge Regression
- Grid Search
- Bootstrap Method
- Hebb's Rule
- Parzen Windows
- Centroid Method