# Basic Tools for CIL
## Matrix-vector basis
* Symmetric matrix: $A = A^T$
* Orthogonal matrix: $A^{-1} = A^T$, i.e. $A^T A = A^{-1}A = I$ and $\det(A) = \pm 1$
* Transpose of an inverse (for invertible $A$): $(A^T)^{-1} = (A^{-1})^{T}$
* Inner product: $\langle x,y \rangle =\| x\|_2 \cdot \| y\|_2 \cdot \cos(\theta)$, where $\theta$ is the angle between $x$ and $y$. If $y$ is a unit vector, the inner product is the (signed) length of the projection of $x$ onto $y$.
    * $\langle x,y \rangle = x^T y = \sum_i^N x_i y_i$
    * $\langle x+y, x+y \rangle = \langle x, x \rangle + \langle y, y \rangle + 2 \langle x, y \rangle$
    * $\langle x-y, x-y \rangle = \langle x, x \rangle + \langle y, y \rangle - 2 \langle x, y \rangle$
    * $\langle x, y+z \rangle = \langle x,y \rangle + \langle x,z \rangle$
    * $\langle x+z, y \rangle = \langle x,y \rangle + \langle z,y \rangle$
* Outer product: $X = u v^T$ and $X_{i,j} = u_i v_j$
* Orthonormal basis: A set of $N$ vectors in an $N$ dimensional space that fulfill:
    * Each vector is a unit vector (length = 1)
    * Every pair of distinct vectors has an inner product of zero, i.e. the vectors are mutually orthogonal
    * Ex. basis for $R^3$: $\{e_1, e_2, e_3\} = \{(1,0,0),(0,1,0), (0,0,1)\}$
        * Being a basis for $R^3$ means that every vector $v = (x,y,z) \in R^3$ can be written as a scaled sum of the 3 vectors: $v = x \cdot e_1 + y \cdot e_2 + z \cdot e_3$
* Gram-Schmidt orthonormal basis algorithm: Finds an orthonormal basis $e_1 ... e_k$ given a linearly independent set $v = v_1 ... v_k$. First build mutually orthogonal vectors $u = u_1 ... u_k$ by subtracting from each $v_j$ its projections onto the previous $u_i$:
    * $u_1 = v_1$
    * $u_2 = v_2 - \frac{\langle v_2, u_1 \rangle}{\langle u_1, u_1 \rangle} u_1$
    * $u_3 = v_3 - \frac{\langle v_3, u_1 \rangle}{\langle u_1, u_1 \rangle} u_1 - \frac{\langle v_3, u_2 \rangle}{\langle u_2, u_2 \rangle} u_2$
    * ...
    * $u_k = v_k - \sum_{i=1}^{k-1} \frac{\langle v_k, u_i \rangle}{\langle u_i, u_i \rangle} u_i$
    * Finally normalize each vector: $e_i = \frac{u_i}{\| u_i\|_2}$
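
The Gram-Schmidt steps can be sketched in NumPy as follows. This is a minimal illustration, not a numerically robust implementation: it assumes the input vectors are the columns of a matrix and are linearly independent, and the function name `gram_schmidt` and the example matrix `V` are made up for this example.

```python
import numpy as np

def gram_schmidt(V):
    """Return a matrix whose columns are an orthonormal basis for the columns of V.

    Assumes the columns of V are linearly independent.
    """
    V = np.asarray(V, dtype=float)
    U = np.zeros_like(V)
    for k in range(V.shape[1]):
        u = V[:, k].copy()
        for i in range(k):
            u_i = U[:, i]
            # Subtract the projection of v_k onto the earlier vector u_i.
            u -= (V[:, k] @ u_i) / (u_i @ u_i) * u_i
        U[:, k] = u
    # Normalize every column to unit length.
    return U / np.linalg.norm(U, axis=0)

# Hypothetical example: three linearly independent vectors in R^3 (as columns).
V = np.array([[1.0, 1.0, 0.0],
              [1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0]])
Q = gram_schmidt(V)
```

If the result is orthonormal, `Q.T @ Q` is the identity matrix up to rounding.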

## Norms 
### Vector norms
* Zero norm: $\| x\|_0$ is the number of non-zero elements in $x$
    * Formally $|\{i| x_i \neq 0\}|$
* P-norm: $\| x\|_p = (\sum_i^N |x_i|^p) ^{\frac{1}{p}}$
    * Ex Euclidean norm: $\| x\|_2 = (\sum_i^N x_i^2) ^{\frac{1}{2}}$
    * Ex one norm $\| x\|_1 = (\sum_i^N |x_i|)$ 
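
As a quick check of the definitions above, the vector norms can be evaluated with NumPy. The vector `x` and the helper `p_norm` are chosen only for illustration.

```python
import numpy as np

x = np.array([3.0, 0.0, -4.0])

zero_norm = np.count_nonzero(x)   # number of non-zero entries
one_norm = np.linalg.norm(x, 1)   # sum of absolute values
two_norm = np.linalg.norm(x, 2)   # Euclidean length

def p_norm(x, p):
    """General p-norm, written out from the definition."""
    return (np.abs(x) ** p).sum() ** (1.0 / p)
```

For this `x` the three values are 2, 7 and 5, and `p_norm(x, 2)` agrees with `two_norm`.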
    
### Matrix norms
Given $X \in R^{m\times n}$, the i'th singular value of $X$ is denoted $\sigma_i$ or $\sigma_i(X)$
* Frobenius: $\| X\|_F = (\sum_i^m \sum_j^n X_{ij}^2) ^{\frac{1}{2}} = (\sum_i^{min(m,n)} \sigma_i^2) ^{\frac{1}{2}}$
* 1-Norm (entrywise; not the induced norm with $p = 1$): $\| X\|_1 = (\sum_{i,j}^{m,n} |x_{i,j}|)$
* Spectral norm (induced 2-norm): $\| X\|_2 = \sigma_{max}(X)$
* Induced p-norm (operator norm): $\| X\|_p = max_{v \neq 0} \frac{ \| Xv\|_p }{\| v\|_p}$; the spectral norm is the case $p = 2$
* Nuclear norm (star norm): $\| X\|_* = \sum_i^{min(m,n)} \sigma_i$
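
The relation between these matrix norms and the singular values can be checked numerically. The sketch below uses an arbitrary example matrix and standard `numpy.linalg` calls. Note that `np.linalg.norm(X, 1)` is the induced 1-norm (maximum absolute column sum), so the entrywise 1-norm is computed by hand.

```python
import numpy as np

X = np.array([[3.0, 0.0],
              [4.0, 5.0]])

# Singular values of X, returned in descending order.
sigma = np.linalg.svd(X, compute_uv=False)

frobenius = np.linalg.norm(X, "fro")   # sqrt of the sum of squared entries
spectral = np.linalg.norm(X, 2)        # largest singular value
nuclear = np.linalg.norm(X, "nuc")     # sum of singular values
entrywise_1 = np.abs(X).sum()          # sum of absolute entries
```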

## Derivatives

### Vectors
* $\frac{\partial}{\partial x} (b^T x) = \frac{\partial}{\partial x} (x^T b) = b$
* $\frac{\partial}{\partial x} (x^T x) = \frac{\partial}{\partial x} (\| x\|_2^2)= 2x$
* $\frac{\partial}{\partial x} (x^T Ax) = (A + A^T) x$ and if $A$ is symm then $=2Ax$
* $\frac{\partial}{\partial x} (b^T Ax) = A^T b$
* $\frac{\partial}{\partial x} (\| x-b\|_2) = \frac{x-b}{\|x-b\|_2}$

### Matrices
* $\frac{\partial}{\partial X} (c^T Xb) = cb^T$ (the gradient has the same shape as $X$)
* $\frac{\partial}{\partial X} (\| X\|_F^2) = 2X$
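
Identities like these are easy to get wrong, so it can help to compare them against finite differences. The sketch below is only a numerical sanity check on random inputs, not a proof; the helper `numerical_gradient` and all test values are made up for this example. The closed forms it is compared with are $(A + A^T)x$ for $x^T A x$, $cb^T$ for $c^T X b$, and $2X$ for $\| X\|_F^2$.

```python
import numpy as np

def numerical_gradient(f, x, eps=1e-6):
    """Central finite differences of a scalar function f w.r.t. each entry of x."""
    x = np.asarray(x, dtype=float)
    grad = np.zeros_like(x)
    for idx in np.ndindex(x.shape):
        step = np.zeros_like(x)
        step[idx] = eps
        grad[idx] = (f(x + step) - f(x - step)) / (2 * eps)
    return grad

rng = np.random.default_rng(0)
A = rng.normal(size=(3, 3))   # deliberately not symmetric
x = rng.normal(size=3)
b = rng.normal(size=3)
c = rng.normal(size=3)
X = rng.normal(size=(3, 3))

grad_quadratic = numerical_gradient(lambda v: v @ A @ v, x)      # d/dx (x^T A x)
grad_bilinear = numerical_gradient(lambda M: c @ M @ b, X)       # d/dX (c^T X b)
grad_frobenius = numerical_gradient(lambda M: np.sum(M ** 2), X) # d/dX ||X||_F^2
```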


## Eigenvalues and eigenvectors
* $Ax = \lambda x$
* $A \in R^{N\times N}$: square matrix, $x$: column vector, $\lambda$: scalar

### Find eigenvalues
The EV problem: Given a matrix $A$, solve the characteristic equation $\det(A - \lambda I) = 0$ for $\lambda$. The left-hand side is a polynomial of degree $N$ in $\lambda$, and the eigenvalues are the roots of this polynomial.

### Find eigenvectors
For each eigenvalue $\lambda_i$ it holds that $(A-\lambda_i I)x_i = 0$, $x_i, \lambda_i$ being the i'th eigenvector, eigenvalue pair. This is a linear system and can be solved by Gaussian elimination.

Eigenvectors found this way are in general not unit vectors, which is often desired. To normalize, compute $\tilde{x} = \frac{x}{\|x\|_2}$

### Eigen-decomposition
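* A symmetric matrix $A$ can be decomposed as $A = Q \Lambda Q^T$ where $Q$ is an orthogonal matrix ($QQ^T = I$) whose columns are eigenvectors of $A$, and $\Lambda$ is the diagonal matrix of the corresponding eigenvalues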
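
For a symmetric matrix, the decomposition can be computed with `np.linalg.eigh`, which returns the eigenvalues in ascending order and unit-length eigenvectors as columns. The small matrix below is only an illustrative example.

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])   # symmetric example matrix

# eigh is intended for symmetric (Hermitian) matrices.
eigenvalues, Q = np.linalg.eigh(A)
Lambda = np.diag(eigenvalues)

# Rebuild A from its eigen-decomposition.
reconstructed = Q @ Lambda @ Q.T
```

Each column `Q[:, i]` satisfies `A @ Q[:, i] == eigenvalues[i] * Q[:, i]` up to rounding.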

## Probability Theory
* Marginal probability of $X$, from the joint distribution of $X$ and $Y$: $p(x) := Pr[X = x] = \sum_{y \in Y} p(x,y)$
* Conditional probability: $p(x|y) := Pr[X = x | Y = y] := \frac{p(x,y)}{p(y)}$ where $p(y) > 0$
* Necessary property of a conditional distribution: $\forall y\in Y: \sum_{x \in X} p(x|y) = 1$
* Chain rule (product rule): $p(x,y) = p(x|y) p(y)$
* Bayes' Theorem, using the chain rule and conditional probability: $p(x|y) = \frac{p(y|x) p(x)}{p(y)}$
* Independence between stochastic variables: if $p(y|x) = p(y)$ then $p(x|y) = p(x)$ and $p(x,y) = p(x) p(y)$
* Probability of a sequence of IID obs: $p(x_1, x_2 ... x_N) = \prod_{i=1}^N p(x_i)$
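
These rules can be verified on a small discrete example. The joint table below is hypothetical (rows index values of $X$, columns index values of $Y$); the sketch computes marginals and conditionals from it and checks Bayes' theorem numerically.

```python
import numpy as np

# Hypothetical joint table p(x, y): rows index x, columns index y.
joint = np.array([[0.10, 0.20],
                  [0.30, 0.40]])

p_x = joint.sum(axis=1)              # marginal p(x) = sum_y p(x, y)
p_y = joint.sum(axis=0)              # marginal p(y) = sum_x p(x, y)

p_x_given_y = joint / p_y            # column j holds p(x | y = j)
p_y_given_x = joint / p_x[:, None]   # row i holds p(y | x = i)

# Bayes' theorem: p(x | y) = p(y | x) p(x) / p(y)
bayes = p_y_given_x * p_x[:, None] / p_y
```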

# Singular Value Decomposition (SVD)
![](svd.png "SVD illustration")

Using the above illustration, and denoting $\Sigma$ as $D$ instead, SVD builds on the following:

Given a matrix (ex dataset) $X \in R^{N \times M}$, $X$ can be decomposed as
* $X = U D V^T$

* $U^TU = V^T V = I$
* Columns of $U$: Eigenvectors of $X X^T$
* Columns of $V$: Eigenvectors of $X^T X$
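* Diag(D): $\sqrt{\lambda_i(X X^T)}$, the square roots of the eigenvalues of $X X^T$ (also called **singular values**)
    * Note: $X X^T$ and $X^T X$ have the same nonzero eigenvalues
* $U \in R^{N\times N}, D \in R^{N \times M}, V \in R^{M\times M}$
    * We do not require $N = M$. If $N = M$ then $U, D, V \in R^{N \times N}$; $U = V$ holds only in special cases, e.g. when $X$ is symmetric positive semidefinite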
    
### The SVD procedure

#### (1) Compute Eigenvalues
Compute the eigenvalues $\lambda = [\lambda_1 ... \lambda_K]$ for $X^T X$ (the nonzero eigenvalues of $X X^T$ are the same), the singular values are then $\sigma_i(X) = \sqrt{\lambda_i(X^T X)}$.

Find the eigenvalues with the characteristic equation: $\det(X^T X - \lambda I) = 0$

#### (2) Compute V (Eigenvectors)
Solve the linear system: $(X^T X) v_i = \lambda_i v_i $ for all $\lambda_i$. The vector $v_i$ is then the eigenvector for eigenvalue $\lambda_i$.
* On matrix form: $(X^T X) V = V (D^T D)$
* $D^T D$ being the diagonal matrix containing eigenvalues for $X^T X$.

The right singular vectors $V = [v_1 ... v_K]$ are now identified.

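#### (3) Compute U
Solve for $X v_i = \sigma_i u_i$, or on matrix form $X V = U D$ where $U = [u_1 ... u_K]$ are the left singular vectors. Concretely, when $D$ is square and invertible (all singular values nonzero), this is solved as:
* $X V D^{-1} = U D D^{-1} = U$
* In general, $u_i = \frac{1}{\sigma_i} X v_i$ for every $\sigma_i > 0$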

#### (4) Reconstruction
From $X V D^{-1}= U$ let us reconstruct $X$:
* $X V D^{-1} D = U D$ so now $X V = U D$
* $X V V^T = U D V^T$, and since $V V^T = I$, this means $X = U D V^T$.
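
The procedure can be sketched in NumPy as below. This is an illustration under stated assumptions, not how production SVD routines work: the data matrix `X` is random and assumed to have full column rank (so all singular values are nonzero), and the result is the reduced SVD, where `U` has only as many columns as `V`. In practice one would call `np.linalg.svd` directly.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 3))   # hypothetical data matrix, full column rank

# Eigenvalues and eigenvectors of the symmetric matrix X^T X.
eigenvalues, V = np.linalg.eigh(X.T @ X)
order = np.argsort(eigenvalues)[::-1]          # sort in descending order
eigenvalues, V = eigenvalues[order], V[:, order]

# Singular values are the square roots of the eigenvalues.
singular_values = np.sqrt(eigenvalues)

# Left singular vectors: U = X V D^{-1} (column j is divided by sigma_j).
U = X @ V / singular_values

# Reconstruction: X = U D V^T.
X_reconstructed = U @ np.diag(singular_values) @ V.T
```

The singular values obtained this way should match `np.linalg.svd(X, compute_uv=False)`.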

# Dimensionality Reduction with PCA

Given $X \in R^{D \times N}$ (a dataset)
* $D$: The dimension of each observation
* $N$: The number of observations.

PCA builds on the SVD (equivalently, the eigen-decomposition) of the covariance matrix of a dataset $X$. The covariance matrix is a symmetric $D \times D$ matrix holding the covariance between every pair of features, $(\Sigma_X)_{ij} = cov[X_i, X_j]$. Since $\Sigma_X$ is symmetric and positive semidefinite, the resulting decomposition $U D V^T$ has the property that $U = V$ and $U,V \in R^{D \times D}$.

Goal: Reduce dimension $D$ to dimension $K$, s.t. $K \ll D$
* I.e. transform $X \rightarrow \tilde{X} \in R^{K \times N}$

## PCA (Principal Component Analysis)

### PCA Procedure
#### (1) Compute empirical mean observation
Compute the mean over the $N$ observations (the columns of $X$): $\bar{x} = \frac{1}{N} \sum_{i=1}^N x_i$, $\bar{x} \in R^D$

#### (2) Center dataset 
Center wrt. the empirical mean, by subtracting the mean observation from every column: $\bar{X} = X - \bar{x} \mathbf{1}^T$, where $\mathbf{1} \in R^N$ is the all-ones vector

#### (3) Compute the covariance matrix
Covariance matrix of $X$: $\Sigma_X = \frac{1}{N} \sum_i^N (x_i - \bar{x}) (x_i - \bar{x})^T = \frac{1}{N} \bar{X} \bar{X}^T$

#### (4) Perform EV decomposition 
Decompose the cov. matrix: $\Sigma_X = U \Lambda U^T$ (see chapter 1, Eigen-decomposition)
* $U \in R^{D\times D}$, with the eigenvectors of $\Sigma_X$ as columns
* $\Lambda \in R^{D \times D}$ diagonal matrix (since cov matrix is symmetrical of dimension $D \times D$)

It then holds that $diag(\Lambda) = [\lambda_1 ... \lambda_D]$, the eigenvalues of $\Sigma_X$ in descending order. Because $\Sigma_X$ is symmetric positive semidefinite, these equal its singular values $\sigma_i(\Sigma_X)$, and $\lambda_i$ is the variance of the data along $u_i$.

Select the first $K < D$ components for which a substantial share of the variance is preserved. For this it can help to inspect the explained variance: $var = \frac{\sum_{i=1}^K \lambda_i}{\sum_{j=1}^D \lambda_j}$. A good choice for $K$ will preserve more than 90% of the variance, while still being much smaller than $D$.

The $K$ first eigenvectors are then found as the $K$ first columns in $U$, that is, $U_K = [u_1 ... u_K] \in R^{D\times K}$, with eigenvalues $\lambda = [\lambda_1 ... \lambda_K]$
    
#### (5) Compressing the Dataset
Downproject centered dataset to new basis: $\bar{Z} = U_K^T \bar{X}$

#### (6) Reconstructing the Dataset 
To reconstruct dataset: 
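* Up-project to original basis of dimension $D$: $\tilde{\bar{X}} = U_K \bar{Z} = U_K U_K^T \bar{X} \approx \bar{X}$
    * The reconstruction is exact only for $K = D$, where the orthonormality of $U$ gives $U U^T = I$. For $K < D$, $U_K U_K^T$ projects onto the span of the first $K$ eigenvectors, and the variance along the discarded directions is lost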
* Undo centering by adding the mean observation $\bar{x}$ once again: $\tilde{X} = \tilde{\bar{X}} + \bar{x}$
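
The whole PCA procedure can be sketched in a few lines of NumPy. The data here is synthetic and chosen so that almost all variance lies in $K = 2$ directions; the sizes and variable names are assumptions made for this example only.

```python
import numpy as np

rng = np.random.default_rng(0)
D, N, K = 5, 200, 2
# Hypothetical dataset: observations are columns, low-rank signal plus small noise.
X = rng.normal(size=(D, K)) @ rng.normal(size=(K, N)) + 0.01 * rng.normal(size=(D, N))

# Mean observation and centering.
x_mean = X.mean(axis=1, keepdims=True)        # shape (D, 1)
X_centered = X - x_mean

# Covariance matrix, shape (D, D).
cov = X_centered @ X_centered.T / N

# Eigen-decomposition, sorted by descending eigenvalue.
eigenvalues, U = np.linalg.eigh(cov)
order = np.argsort(eigenvalues)[::-1]
eigenvalues, U = eigenvalues[order], U[:, order]

# Fraction of variance kept by the first K components.
explained = eigenvalues[:K].sum() / eigenvalues.sum()

# Compress to K dimensions, then reconstruct and undo the centering.
U_K = U[:, :K]
Z = U_K.T @ X_centered                        # shape (K, N)
X_approx = U_K @ Z + x_mean                   # shape (D, N)
```

With $K < D$ the reconstruction `X_approx` is only an approximation of `X`; using all $D$ components recovers `X` exactly up to rounding.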



Robust PCA
* Corrupted points

Convex optimization
* Convex sets
* Rank of matrix
* Norms and zero norm

Lagrangians
* Lagrange duality
* Lagrangian Dual function

Gradient descent
* Condition for learning rate to guarantee convergence
