# Basic Tools for CIL
## Matrix-vector basics
* Symmetric matrix: $A = A^T$
* Orthogonal matrix: $A^{-1} = A^T$, i.e. $A^T A = A A^T = I$ and $\det(A) = \pm 1$
* Transposed matrix: $(A^T)^{-1} = (A^{-1})^{T}$
* Inner product: $\langle x,y \rangle = \| x\|_2 \cdot \| y\|_2 \cdot \cos(\theta)$, where $\theta$ is the angle between $x$ and $y$. If $y$ is a unit vector, the inner product is the (signed) length of the projection of $x$ onto $y$.
    * $\langle x,y \rangle = x^T y = \sum_i^N x_i y_i$
    * $\langle x+y, x+y \rangle = \langle x, x \rangle + \langle y, y \rangle + 2 \langle x, y \rangle$
    * $\langle x-y, x-y \rangle = \langle x, x \rangle + \langle y, y \rangle - 2 \langle x, y \rangle$
    * $\langle x, y+z \rangle = \langle x,y \rangle + \langle x,z \rangle$
    * $\langle x+z, y \rangle = \langle x,y \rangle + \langle z,y \rangle$
* Outer product: $X = u v^T$ and $X_{i,j} = u_i v_j$
* Orthonormal basis: A set of $N$ vectors in an $N$-dimensional space that fulfill:
    * Each vector is a unit vector (length = 1)
    * Every pair of distinct vectors has an inner product of zero, i.e. the vectors are mutually orthogonal
    * Ex. basis for $R^3$: $\{e_1, e_2, e_3\} = \{(1,0,0),(0,1,0),(0,0,1)\}$
        * Being a basis for $R^3$ means that every vector $v = (x, y, z) \in R^3$ can be written as a scaled sum of the 3 vectors: $v = e_1 \cdot x + e_2 \cdot y + e_3 \cdot z$
* Gram-Schmidt orthonormalization: Finds an orthonormal basis $e_1, ..., e_k$ given a linearly independent set $v_1, ..., v_k$. First compute mutually orthogonal vectors $u_1, ..., u_k$ by subtracting projections:
    * $u_1 = v_1$
    * $u_2 = v_2 - \frac{\langle v_2, u_1 \rangle}{\langle u_1, u_1 \rangle} u_1$
    * $u_3 = v_3 - \frac{\langle v_3, u_1 \rangle}{\langle u_1, u_1 \rangle} u_1 - \frac{\langle v_3, u_2 \rangle}{\langle u_2, u_2 \rangle} u_2$
    * ...
    * $u_k = v_k - \sum_{i=1}^{k-1} \frac{\langle v_k, u_i \rangle}{\langle u_i, u_i \rangle} u_i$
    * Then normalize each vector: $e_i = \frac{u_i}{\| u_i\|_2}$
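
The Gram-Schmidt procedure can be sketched in a few lines of plain Python. This is a minimal illustration, not a numerically robust implementation; the helper names `dot` and `gram_schmidt` and the example vectors are assumptions made for this sketch. It normalizes each vector as soon as it is computed, so the denominator $\langle u_i, u_i \rangle$ equals 1 and drops out.

```python
import math

def dot(x, y):
    """Inner product <x, y> = sum_i x_i * y_i."""
    return sum(a * b for a, b in zip(x, y))

def gram_schmidt(vectors):
    """Return an orthonormal basis for the span of linearly independent `vectors`."""
    basis = []
    for v in vectors:
        u = list(v)
        # Subtract the projection of v onto every basis vector found so far.
        # The basis vectors are already unit length, so <e, e> = 1.
        for e in basis:
            coeff = dot(v, e)
            u = [ui - coeff * ei for ui, ei in zip(u, e)]
        norm = math.sqrt(dot(u, u))
        if norm < 1e-12:
            raise ValueError("input vectors are linearly dependent")
        basis.append([ui / norm for ui in u])
    return basis

# Example input chosen for illustration only.
basis = gram_schmidt([[1.0, 1.0, 0.0], [1.0, 0.0, 1.0], [0.0, 1.0, 1.0]])
```

Every pair of distinct vectors in `basis` should have an inner product of (numerically) zero, and every vector should have length 1.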

## Norms 
### Vector norms
* Zero norm: $\| x\|_0$ is the number of non-zero elements in $x$
    * Formally $|\{i| x_i \neq 0\}|$
* P-norm: $\| x\|_p = (\sum_i^N |x_i|^p) ^{\frac{1}{p}}$
    * Ex Euclidean norm: $\| x\|_2 = (\sum_i^N x_i^2) ^{\frac{1}{2}}$
    * Ex one norm $\| x\|_1 = (\sum_i^N |x_i|)$ 
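
As a quick check of the definitions above, the vector norms can be computed directly. This is a small sketch; the function names and the example vector are assumptions chosen for illustration.

```python
def zero_norm(x):
    """Number of non-zero elements in x."""
    return sum(1 for xi in x if xi != 0)

def p_norm(x, p):
    """(sum_i |x_i|^p)^(1/p) for p >= 1."""
    return sum(abs(xi) ** p for xi in x) ** (1.0 / p)

x = [3.0, 0.0, -4.0]
print(zero_norm(x))   # 2
print(p_norm(x, 1))   # 7.0
print(p_norm(x, 2))   # 5.0
```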
    
### Matrix norms
Given $X \in R^{m\times n}$, the i'th singular value of $X$ is denoted $\sigma_i$ or $\sigma_i(X)$
* Frobenius: $\| X\|_F = (\sum_i^m \sum_j^n X_{ij}^2) ^{\frac{1}{2}} = (\sum_i^{min(m,n)} \sigma_i^2) ^{\frac{1}{2}}$
* 1-Norm (entrywise): $\| X\|_1 = \sum_{i,j}^{m,n} |x_{i,j}|$
* Spectral norm (induced 2-norm): $\| X\|_2 = \sigma_{max}(X)$
* Induced p-norm: $\| X\|_p = \max_{v \neq 0} \frac{ \| Xv\|_p }{\| v\|_p}$
* Nuclear norm (star norm): $\| X\|_* = \sum_i^{min(m,n)} \sigma_i$
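
The relation between these norms and the singular values can be checked on a small example. The sketch below is restricted to $2 \times 2$ matrices, where the singular values have a closed form as the square roots of the eigenvalues of $X^T X$; the helper `singular_values_2x2` and the example matrix are assumptions made for illustration, and a general implementation would use a linear algebra library instead.

```python
import math

def singular_values_2x2(X):
    """Singular values of a 2x2 matrix: square roots of the eigenvalues of X^T X."""
    (a, b), (c, d) = X
    # Entries of the symmetric matrix G = X^T X.
    g11 = a * a + c * c
    g12 = a * b + c * d
    g22 = b * b + d * d
    trace, det = g11 + g22, g11 * g22 - g12 * g12
    # Roots of the characteristic polynomial t^2 - trace * t + det = 0.
    disc = math.sqrt(max(trace * trace - 4.0 * det, 0.0))
    return [math.sqrt((trace + disc) / 2.0),
            math.sqrt(max((trace - disc) / 2.0, 0.0))]

X = [[3.0, 0.0], [4.0, 5.0]]
sigma = singular_values_2x2(X)

frobenius = math.sqrt(sum(v * v for row in X for v in row))  # entrywise definition
spectral = max(sigma)                                        # largest singular value
nuclear = sum(sigma)                                         # sum of singular values
```

For this matrix the singular values are $\sqrt{45}$ and $\sqrt{5}$, so the entrywise Frobenius norm $\sqrt{50}$ agrees with $(\sigma_1^2 + \sigma_2^2)^{\frac{1}{2}}$.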

## Derivatives

### Vectors
* $\frac{\partial}{\partial x} (b^T x) = \frac{\partial}{\partial x} (x^T b) = b$
* $\frac{\partial}{\partial x} (x^T x) = \frac{\partial}{\partial x} (\| x\|_2^2)= 2x$
* $\frac{\partial}{\partial x} (x^T Ax) = (A + A^T) x$ and if $A$ is symmetric then $=2Ax$
* $\frac{\partial}{\partial x} (b^T Ax) = A^T b$
* $\frac{\partial}{\partial x} (\| x-b\|_2) = \frac{x-b}{\|x-b\|_2}$
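
Gradient identities like these are easy to get wrong, so a finite-difference check can help. The sketch below compares the analytic gradient of $x^T A x$, which for a general (not necessarily symmetric) $A$ is $(A + A^T)x$, against central differences. The function names and the example values are assumptions chosen for illustration.

```python
def quad_form(A, x):
    """f(x) = x^T A x."""
    n = len(x)
    return sum(x[i] * A[i][j] * x[j] for i in range(n) for j in range(n))

def numerical_gradient(f, x, eps=1e-6):
    """Central finite differences, one coordinate at a time."""
    grad = []
    for i in range(len(x)):
        hi = x[:i] + [x[i] + eps] + x[i + 1:]
        lo = x[:i] + [x[i] - eps] + x[i + 1:]
        grad.append((f(hi) - f(lo)) / (2.0 * eps))
    return grad

A = [[1.0, 2.0], [0.0, 3.0]]   # deliberately not symmetric
x = [1.0, -2.0]

# Analytic gradient (A + A^T) x.
analytic = [sum((A[i][j] + A[j][i]) * x[j] for j in range(2)) for i in range(2)]
numeric = numerical_gradient(lambda z: quad_form(A, z), x)
```

Both `analytic` and `numeric` should be close to $(-2, -10)$ for this example.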

### Matrices
* $\frac{\partial}{\partial X} (c^T Xb) = cb^T$
* $\frac{\partial}{\partial X} (\| X\|_F^2) = 2X$
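
Both matrix derivatives can be verified entry by entry, using the convention that $\frac{\partial f}{\partial X}$ has the same shape as $X$ with entries $\frac{\partial f}{\partial X_{ij}}$ (this layout convention is an assumption of the sketch):

$$c^T X b = \sum_{i} \sum_{j} c_i X_{ij} b_j \quad \Rightarrow \quad \frac{\partial}{\partial X_{ij}} (c^T X b) = c_i b_j = (c b^T)_{ij}$$

$$\| X\|_F^2 = \sum_{i} \sum_{j} X_{ij}^2 \quad \Rightarrow \quad \frac{\partial}{\partial X_{ij}} \| X\|_F^2 = 2 X_{ij}$$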


## Eigenvalues and eigenvectors
* $Ax = \lambda x$
* $A \in R^{N\times N}$: square matrix, $x$: column vector, $\lambda$: scalar

### Find eigenvalues
The eigenvalue problem: given a matrix $A$, solve the characteristic equation $\det(A - \lambda I) = 0$ for $\lambda$. The left-hand side is a polynomial of degree $N$ in $\lambda$, and the eigenvalues are the roots of this polynomial.

### Find eigenvectors
For each eigenvalue $\lambda_i$ it holds that $(A-\lambda_i I)x_i = 0$, with $x_i, \lambda_i$ being the i'th eigenvector-eigenvalue pair. This is a linear system and can be solved for a non-zero $x_i$ by Gaussian elimination.

Eigenvectors found this way are generally not unit vectors, which is often desired. To fix this, normalize: $\tilde{x} = \frac{x}{\|x\|_2}$
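
For a $2 \times 2$ matrix the whole procedure fits in a short sketch: the characteristic polynomial is quadratic, and the linear system for each eigenvector can be solved by inspection rather than by full Gaussian elimination. The function name `eig_2x2` and the example matrix are assumptions made for illustration; the sketch only handles real eigenvalues.

```python
import math

def eig_2x2(A):
    """Eigenpairs of a 2x2 matrix with real eigenvalues, via det(A - lam*I) = 0."""
    (a, b), (c, d) = A
    trace, det = a + d, a * d - b * c
    # Characteristic polynomial: lam^2 - trace * lam + det = 0.
    disc = trace * trace - 4.0 * det
    if disc < 0:
        raise ValueError("complex eigenvalues are not handled in this sketch")
    root = math.sqrt(disc)
    pairs = []
    for lam in ((trace + root) / 2.0, (trace - root) / 2.0):
        # Solve (A - lam*I) x = 0: x must be orthogonal to each row.
        # The first row is (a - lam, b), so x = (b, lam - a) works if non-zero.
        x = (b, lam - a)
        if abs(x[0]) + abs(x[1]) < 1e-12:
            x = (lam - d, c)      # fall back to the second row (c, d - lam)
        if abs(x[0]) + abs(x[1]) < 1e-12:
            x = (1.0, 0.0)        # A - lam*I is the zero matrix, any vector works
        norm = math.hypot(x[0], x[1])
        pairs.append((lam, (x[0] / norm, x[1] / norm)))  # unit-length eigenvector
    return pairs

A = [[2.0, 1.0], [1.0, 2.0]]
for lam, x in eig_2x2(A):
    print(lam, x)
```

For this matrix the eigenvalues are 3 and 1, with normalized eigenvectors along $(1, 1)$ and $(1, -1)$.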

### Eigen-decomposition
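* A symmetric matrix $A$ can be decomposed as $A = Q \Lambda Q^T$, where $Q$ is an orthogonal matrix ($QQ^T = I$) whose columns are eigenvectors of $A$ and $\Lambda$ is a diagonal matrix holding the corresponding eigenvalues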
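
As a small worked example (the matrix is chosen for illustration), the symmetric matrix below has eigenvalues 3 and 1 with unit eigenvectors $\frac{1}{\sqrt{2}}(1, 1)$ and $\frac{1}{\sqrt{2}}(1, -1)$, which form the columns of $Q$:

$$\begin{pmatrix} 2 & 1 \\ 1 & 2 \end{pmatrix} = \frac{1}{\sqrt{2}} \begin{pmatrix} 1 & 1 \\ 1 & -1 \end{pmatrix} \begin{pmatrix} 3 & 0 \\ 0 & 1 \end{pmatrix} \frac{1}{\sqrt{2}} \begin{pmatrix} 1 & 1 \\ 1 & -1 \end{pmatrix}^T$$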

## Probability Theory
* Marginal probability of $X$, obtained from the joint probability $p(x,y)$ of variables $X$ and $Y$: $P(x) := Pr[X = x] := \sum_{y \in Y} p(x,y)$
* Conditional probability: $P(x|y) := Pr[X = x | Y = y] := \frac{p(x,y)}{p(y)}$ where $P(y) > 0$
* Necessary property of a conditional probability distribution (normalization): $\forall y\in Y: \sum_{x \in X} p(x|y) = 1$
* Joint probability, chain rule: $p(x,y) = p(x|y) p(y)$
* Bayes Theorem using chain rule and conditional probability: $p(x|y) = \frac{p(y|x) p(x)}{p(y)}$
* Independence between stochastic variables: if $p(y|x) = p(y)$ then $p(x|y) = p(x)$, equivalently $p(x,y) = p(x) p(y)$
* Probability of a sequence of IID observations: $p(x_1, x_2 ... x_N) = \prod_{i=1}^N p(x_i)$
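
These rules can be traced on a small discrete example. The joint table below is hypothetical and only serves to illustrate how marginals, conditionals and Bayes' theorem relate.

```python
# Hypothetical joint distribution p(x, y) over X in {0, 1} and Y in {"a", "b"}.
joint = {(0, "a"): 0.1, (0, "b"): 0.3, (1, "a"): 0.2, (1, "b"): 0.4}
xs = sorted({x for x, _ in joint})
ys = sorted({y for _, y in joint})

# Marginals: sum the joint over the other variable.
p_x = {x: sum(joint[(x, y)] for y in ys) for x in xs}
p_y = {y: sum(joint[(x, y)] for x in xs) for y in ys}

# Conditionals from the definition, e.g. p(x|y) = p(x, y) / p(y).
p_x_given_y = {(x, y): joint[(x, y)] / p_y[y] for x in xs for y in ys}
p_y_given_x = {(x, y): joint[(x, y)] / p_x[x] for x in xs for y in ys}

# Bayes' theorem: p(x|y) = p(y|x) p(x) / p(y).
bayes = {(x, y): p_y_given_x[(x, y)] * p_x[x] / p_y[y] for x in xs for y in ys}
```

The values in `bayes` should match `p_x_given_y`, and for each fixed `y` the conditional probabilities `p_x_given_y[(x, y)]` should sum to 1.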

## Lagrange Multipliers
### Constrained Optimization
We are given the following:
* $f(x)$: The objective function to optimize.
* $g_i(x) \leq 0$ for $i=1...n$: Inequality constraint(s)
* $h_j(x) = a_j^T x - b_j = 0$ for $j=1...m$: Equality constraints(s)

We then formulate a Lagrangian
$$L(x, \lambda, v) := f(x) + \sum_{i=1}^n \lambda_i g_i(x) + \sum_{j=1}^m v_j h_j(x)$$

with multipliers $\lambda_i \geq 0$ and $v_j \in R$, which folds the constraints into a single unconstrained objective.

### Dual Function
$$D(\lambda, v) := inf_x L(x, \lambda, v) \in R$$

For any $\lambda \geq 0$ and any $v$, $D(\lambda, v)$ is a lower bound on the optimum $f(x^*)$ of the minimization problem (weak duality).

Lagrange dual problem:
* Maximize $D$
* Subject to $\lambda \geq 0$
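
A one-dimensional worked example may make the dual concrete. The problem below is chosen for illustration: minimize $f(x) = x^2$ subject to $g(x) = 1 - x \leq 0$, whose primal optimum is $x^* = 1$ with $f(x^*) = 1$.

$$L(x, \lambda) = x^2 + \lambda (1 - x)$$

$$\frac{\partial L}{\partial x} = 2x - \lambda = 0 \quad \Rightarrow \quad x = \frac{\lambda}{2} \quad \Rightarrow \quad D(\lambda) = \lambda - \frac{\lambda^2}{4}$$

$$\frac{dD}{d\lambda} = 1 - \frac{\lambda}{2} = 0 \quad \Rightarrow \quad \lambda^* = 2, \quad D(\lambda^*) = 1 = f(x^*)$$

Every $\lambda \geq 0$ gives $D(\lambda) \leq 1$, and in this example the best lower bound is tight.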

### Convex Optimization
![](convex-set.png "Convex illustration")

* Figure 1: Convex; every line segment between two points of the set lies entirely inside the set.
* Figure 2: Not convex; the drawn line segment connects two points of the set but leaves the set.
* Figure 3: Not convex; part of the boundary is excluded from the set, so two points on the included part of the boundary can form a line segment that passes through the excluded part.

A function $f: R^D \rightarrow R$ is convex iff $dom(f)$ is a convex set and for all points $x,y \in dom(f)$ and for all $\theta \in [0, 1]$
$$f(\theta x + (1 - \theta)y) \leq \theta f(x) + (1 - \theta)f(y)$$

![](convex-graph.png "Convex illustration")
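
The inequality can be probed numerically by sampling $x$, $y$ and $\theta$ and looking for a violation. This is only a sketch: finding no violation is evidence of convexity on the sampled interval, not a proof. The function name `violates_convexity`, the intervals and the tolerance are assumptions made for illustration.

```python
import math
import random

def violates_convexity(f, lo, hi, trials=10_000, seed=0):
    """Search for x, y, theta with f(theta*x + (1-theta)*y) > theta*f(x) + (1-theta)*f(y)."""
    rng = random.Random(seed)
    for _ in range(trials):
        x, y, theta = rng.uniform(lo, hi), rng.uniform(lo, hi), rng.random()
        lhs = f(theta * x + (1.0 - theta) * y)
        rhs = theta * f(x) + (1.0 - theta) * f(y)
        if lhs > rhs + 1e-9:   # small tolerance for floating-point rounding
            return True        # counterexample found: f is not convex on [lo, hi]
    return False               # no counterexample found among the samples

# exp is convex, so no violation should be found.
print(violates_convexity(math.exp, -3.0, 3.0))
# sin is concave on [0, pi], so a violation should be found quickly.
print(violates_convexity(math.sin, 0.0, math.pi))
```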

#### Epigraph
Convexity can also be defined from the epigraph of a function, $epi(f)$, which is the set of points lying on or above the graph of $f$.

A function $f$ is convex iff $epi(f)$ is a convex set.

* The graph of $f$: $\{(x, f(x))\quad  | \quad x \in dom(f) \}$
* Epigraph of $f$: $epi(f) = \{(x, t) \quad | \quad x \in dom(f), \quad f(x) \leq t \}$

#### Common convex functions
Linear functions: $f(x) = a^T x$

Affine functions: $f(x) = a^T x + b$

Exponentials: $f(x) = exp(\alpha x)$

Norms in $R^D$ are convex


### Optimization for Matrix Factorizations
Formulation

$$min_{U,Z} \quad f(U,Z)$$

Where $f: R^{D\times K} \times R^{N\times K} \rightarrow R$

Such that:
$$U \in Q_1 \subseteq R ^{D \times K}$$
$$Z \in Q_2 \subseteq R ^{N \times K}$$

#### Example: Matrix Reconstruction
$$f(U,Z) = \frac{1}{2} \|X - UZ^T\|_F ^2$$
$$Q_1 = R^{D\times K}$$
$$Q_2 = R^{N\times K}$$

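Explicit solution: the truncated SVD of $X$ (Eckart-Young theorem). Such a closed-form solution is rarely available for other choices of $f$, $Q_1$ and $Q_2$.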

#### Other examples
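K-Means:
$$f(U, \hat{Z}) = \|X - U\hat{Z}^T\|_F ^2 = \sum_{n=1}^N \sum_{k=1}^K \hat{Z}_{nk} \|x_n - u_k\|_2^2$$
$$Q_1 = R^{D\times K}$$
$$Q_2 = \{\hat{Z} \in \{0,1\}^{N\times K} : \sum_{k=1}^K \hat{Z}_{nk} = 1 \text{ for all } n\}$$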
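
The equality between the Frobenius form and the per-cluster sum can be checked on toy data. The sketch assumes $\hat{Z}$ is a hard assignment matrix with exactly one 1 per row; the data points, centroids and variable names are hypothetical.

```python
# Hypothetical toy data: N = 3 points in D = 2 dimensions, K = 2 centroids.
points = [[0.0, 0.0], [1.0, 0.0], [4.0, 4.0]]   # x_n, the columns of X
centroids = [[0.5, 0.0], [4.0, 4.0]]            # u_k, the columns of U
assign = [[1, 0], [1, 0], [0, 1]]               # Z_hat (N x K), one 1 per row

N, D, K = len(points), len(points[0]), len(centroids)

# Left-hand side: || X - U Z_hat^T ||_F^2, summed entry by entry.
lhs = 0.0
for n in range(N):
    for d in range(D):
        recon = sum(centroids[k][d] * assign[n][k] for k in range(K))
        lhs += (points[n][d] - recon) ** 2

# Right-hand side: sum_n sum_k Z_hat[n][k] * || x_n - u_k ||_2^2.
rhs = sum(
    assign[n][k] * sum((points[n][d] - centroids[k][d]) ** 2 for d in range(D))
    for n in range(N)
    for k in range(K)
)
```

Both `lhs` and `rhs` evaluate to 0.5 for this data.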

NNMF:
$$f(U,Z) = \frac{1}{2} \|X - UZ^T\|_F ^2$$
$$Q_1 = R_{\geq 0}^{D\times K}$$
$$Q_2 = R_{\geq 0}^{N\times K}$$

Collaborative Filtering (Matrix Completion), where $\Omega$ is the set of observed entries of $X$:
$$f(U,Z) = \frac{1}{\vert \Omega \vert } \sum_{(i,j) \in \Omega} \frac{1}{2} [x_{ij} - (UZ^T)_{ij}]^2$$
$$Q_1 = R^{D\times K}$$
$$Q_2 = R^{N\times K}$$
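
The matrix completion objective only sums over observed entries, which a short sketch can make explicit. Plain gradient descent is used here as one possible optimizer, not necessarily the method intended by these notes; the rating matrix, the step size `lr`, the initialization and the function names are all assumptions made for illustration.

```python
def masked_loss(X, U, Z, omega):
    """f(U, Z) = 1/|Omega| * sum over observed (i, j) of 0.5 * (x_ij - (U Z^T)_ij)^2."""
    total = 0.0
    for i, j in omega:
        pred = sum(u * z for u, z in zip(U[i], Z[j]))
        total += 0.5 * (X[i][j] - pred) ** 2
    return total / len(omega)

def gradient_step(X, U, Z, omega, lr=0.1):
    """One gradient-descent step on U and Z (Q_1 and Q_2 are unconstrained)."""
    K = len(U[0])
    grad_U = [[0.0] * K for _ in U]
    grad_Z = [[0.0] * K for _ in Z]
    for i, j in omega:
        err = sum(u * z for u, z in zip(U[i], Z[j])) - X[i][j]
        for k in range(K):
            grad_U[i][k] += err * Z[j][k] / len(omega)
            grad_Z[j][k] += err * U[i][k] / len(omega)
    new_U = [[U[i][k] - lr * grad_U[i][k] for k in range(K)] for i in range(len(U))]
    new_Z = [[Z[j][k] - lr * grad_Z[j][k] for k in range(K)] for j in range(len(Z))]
    return new_U, new_Z

# Hypothetical 2 x 3 rating matrix; None marks an unobserved entry.
X = [[5.0, None, 1.0], [4.0, 2.0, None]]
omega = [(i, j) for i, row in enumerate(X) for j, v in enumerate(row) if v is not None]
U = [[1.0], [1.0]]           # D x K with K = 1
Z = [[1.0], [1.0], [1.0]]    # N x K

losses = [masked_loss(X, U, Z, omega)]
for _ in range(50):
    U, Z = gradient_step(X, U, Z, omega)
    losses.append(masked_loss(X, U, Z, omega))
```

The loss starts at 3.25 for this initialization and should decrease over the iterations; unobserved entries never contribute to the loss or the gradient.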