# PCA

*Dimensionality Reduction using Eigenvalue Decomposition*

---
* [Implementation in Python](../pymlalgo/pca/pca.py)
* [Demo](../demo/pca_demo.ipynb)

---

## Symbols and Conventions
Refer to [Symbols and Conventions](symbols_and_conventions.ipynb) for details. The following list summarizes those conventions and adds the symbols specific to *PCA*:
* $n$ is the number of training examples
* $d$ is the number of features in each training example (a.k.a. the dimension of the training example)
* $X$ is the features matrix of shape $n \times d$
* $Z$ is the centered $X$ matrix, i.e. $z_{ij} = x_{ij} - \bar{x_j}$ where $i \in \{1, 2, ..., n\}$ and $j \in \{1, 2, ..., d\}$
     * $x_j$ is one column of $X$ matrix that is one feature of all training examples
     * $x_{ij}$ is the $j^{th}$ feature of $i^{th}$ training example
     * $\bar{x_j}$ is the mean of $x_j$
* $A$ is the covariance matrix of $Z$ of shape $d \times d$
    * $A = \frac{1}{n - 1}Z^TZ$
* $\lambda_j$ is the $j^{th}$ eigenvalue
* $v_j$ is the $j^{th}$ eigenvector
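
The definitions of $Z$ and $A$ above can be checked numerically. The following is a minimal sketch, assuming NumPy and a small made-up matrix `X` (the values are illustrative only and do not come from the demo notebook):

```python
import numpy as np

# Hypothetical toy data: n = 5 training examples, d = 3 features
X = np.array([
    [2.0, 0.0, 1.0],
    [4.0, 1.0, 3.0],
    [6.0, 1.0, 2.0],
    [8.0, 3.0, 5.0],
    [10.0, 5.0, 4.0],
])
n, d = X.shape

# Center every column: z_ij = x_ij - mean of feature j
Z = X - X.mean(axis=0)

# Covariance matrix A = Z^T Z / (n - 1), shape d x d
A = Z.T @ Z / (n - 1)
```

Under these assumptions, `A` should agree with `np.cov(X, rowvar=False)`, which uses the same $n - 1$ scaling by default.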

## Eigenvalues and Eigenvectors
For a square matrix $A$, a scalar $\lambda$ and a non-zero vector $v$ that satisfy
$$Av = \lambda v$$
are called an eigenvalue and the corresponding eigenvector of $A$.
* $A$ must be a square matrix, let's say of shape $d \times d$, thus
* $v$ must be of shape $d \times 1$
* $\lambda$ is a scalar

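The equation has multiple solutions (up to $d$ distinct eigenvalues). For dimensionality reduction, the eigenvector associated with the largest eigenvalue of the covariance matrix is the one of interest, because it gives the direction along which the data has maximum variance.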
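
As a quick numerical check of $Av = \lambda v$, the sketch below uses NumPy's `numpy.linalg.eigh` on a small symmetric matrix. The matrix is a hypothetical example picked so that the eigenvalues are easy to verify by hand (they are $1$ and $3$).

```python
import numpy as np

# Hypothetical symmetric 2 x 2 matrix, chosen for illustration
A = np.array([[2.0, 1.0],
              [1.0, 2.0]])

# eigh is meant for symmetric matrices and returns eigenvalues in ascending order
eigenvalues, eigenvectors = np.linalg.eigh(A)

lam = eigenvalues[-1]      # largest eigenvalue
v = eigenvectors[:, -1]    # matching eigenvector (stored as a column)

# Both sides of A v = lambda v should agree up to rounding error
residual = np.linalg.norm(A @ v - lam * v)
```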

## Power Iteration
Power iteration is an algorithm to calculate the largest (in absolute value) eigenvalue of a diagonalizable matrix, together with the corresponding eigenvector.

1. Input the matrix $A$
2. Initialize the vector $v_0$ of shape $d \times 1$
3. For $t = 1, 2, 3, ...$ apply the updates until a stopping criterion is satisfied
    1. $v_t = Av_{t-1}$
    2. $v_t = \frac{v_t}{||v_t||_2}$
    3. $\lambda_t = v_t^TAv_t$

### Stopping Criteria
The stopping criterion can be any of the following three, or a combination of them, whichever is satisfied first:
1. Maximum number of iterations
2. $||Av - \lambda v||_2 < \epsilon$ where $\epsilon$ is a very small positive value. This residual measures how far the current pair $(\lambda, v)$ is from satisfying $Av = \lambda v$, so once it reaches a very small value the algorithm has converged.
3. $||v_t - v_{t-1}||_2 < \epsilon$ where $\epsilon$ is a very small positive value, i.e. the vector has stopped changing between iterations
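
The updates and the three stopping criteria can be sketched together as follows. This is an illustrative version, not the code in `pymlalgo`: the function name `power_iteration`, its parameters and the random initialization are assumptions, and vectors are stored as 1-D arrays rather than $d \times 1$ matrices.

```python
import numpy as np

def power_iteration(A, max_iter=1000, eps=1e-10, seed=0):
    """Sketch: estimate the dominant eigenvalue and eigenvector of A."""
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(A.shape[0])      # random start v_0
    v /= np.linalg.norm(v)
    lam = v @ A @ v
    for _ in range(max_iter):                # criterion 1: iteration cap
        v_new = A @ v                        # v_t = A v_{t-1}
        v_new /= np.linalg.norm(v_new)       # normalize to unit length
        lam = v_new @ A @ v_new              # lambda_t = v_t^T A v_t
        residual = np.linalg.norm(A @ v_new - lam * v_new)  # criterion 2
        step = np.linalg.norm(v_new - v)                    # criterion 3
        v = v_new
        if residual < eps or step < eps:
            break
    return lam, v

# Hypothetical symmetric matrix with eigenvalues 1 and 3
A = np.array([[2.0, 1.0],
              [1.0, 2.0]])
lam, v = power_iteration(A)
```

For this matrix the estimate `lam` should approach $3$. The sketch does not guard against $Av = 0$, which would make the normalization fail.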

### Dimensionality Reduction
Once the eigenvector $v$ is computed, the training examples can be reduced from $d$ dimensions to one (a matrix of shape $n \times 1$) by taking the product $Zv$. This product is known as the first principal component.
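
The projection can be illustrated with a short sketch. It assumes randomly generated data and takes the eigenvector from `numpy.linalg.eigh` as a stand-in for the result of power iteration; the variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical data: 50 examples, 3 features with different spreads
X = rng.standard_normal((50, 3)) * np.array([3.0, 1.0, 0.5])

Z = X - X.mean(axis=0)
A = Z.T @ Z / (len(Z) - 1)

# Stand-in for power iteration: take the top eigenpair from eigh
eigenvalues, eigenvectors = np.linalg.eigh(A)
lam, v = eigenvalues[-1], eigenvectors[:, -1]

# First principal component: one value per training example
pc1 = Z @ v
```

Because $Z$ is centered, the sample variance of `pc1` equals $v^TAv = \lambda$, which is the sense in which the largest eigenvalue corresponds to the maximum variance.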

### $2^{nd}, 3^{rd}, ..., d^{th}$ Principal Component
To find the second principal component, the variance explained by the first principal component is subtracted from $A$.
$$A_2 = A_1 - \lambda_1  v_1 v_1^T$$
Here, $A_1 = A$ and the subscript $_1$ denotes the largest eigenvalue and its eigenvector. Power iteration is then applied to the resulting matrix, $A_2$, whose largest eigenvalue is the second largest eigenvalue of $A$.
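
The effect of this subtraction can be checked on a small example. The sketch below assumes a hypothetical symmetric positive definite matrix and uses `numpy.linalg.eigh` to obtain the first eigenpair in place of power iteration.

```python
import numpy as np

# Hypothetical symmetric positive definite matrix
A1 = np.array([[4.0, 1.0, 0.0],
               [1.0, 3.0, 1.0],
               [0.0, 1.0, 2.0]])

eigenvalues, eigenvectors = np.linalg.eigh(A1)   # ascending order
lam1, v1 = eigenvalues[-1], eigenvectors[:, -1]

# Remove the contribution of the first eigenpair
A2 = A1 - lam1 * np.outer(v1, v1)

# Eigenvalues of the deflated matrix, ascending
deflated = np.linalg.eigvalsh(A2)
```

After the update, $v_1$ is mapped to zero by $A_2$, and the largest eigenvalue of $A_2$ matches the second largest eigenvalue of $A_1$.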

### PCA Using Power Iteration
Joining all the moving parts, $X$ can be reduced to $d_{reduced}$ dimensions using the following algorithm
1. Input $X$ and $d_{reduced} < d$
2. Compute centered matrix, $Z = X - \bar{X}$ (Note: $\bar{X}$ is of shape $d \times 1$ and is the mean of each feature across all training examples; it is subtracted from every row of $X$)
3. Compute the covariance matrix $A = \frac{1}{n - 1} Z^TZ$
4. Repeat for $j \in \{1, 2, 3, ..., d_{reduced}\}$, starting with $A_1 = A$
    1. Compute $v_j$ and $\lambda_j$ from $A_j$ using power iteration
    2. Update $A_{j+1} = A_j - \lambda_j v_j v_j^T$
    3. Compute $j^{th}$ principal component $X_{pj} = Zv_j$
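
The whole procedure can be sketched end to end as follows. The function names `power_iteration` and `pca_power_iteration`, the defaults and the random test data are assumptions made for illustration; the implementation in `pymlalgo` may be organized differently.

```python
import numpy as np

def power_iteration(A, max_iter=5000, eps=1e-12, seed=0):
    """Sketch: dominant eigenpair of A, stopping on a small change in v."""
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(A.shape[0])
    v /= np.linalg.norm(v)
    for _ in range(max_iter):
        v_new = A @ v
        v_new /= np.linalg.norm(v_new)
        converged = np.linalg.norm(v_new - v) < eps
        v = v_new
        if converged:
            break
    return v @ A @ v, v

def pca_power_iteration(X, d_reduced):
    """Sketch: reduce X (n x d) to n x d_reduced."""
    n = X.shape[0]
    Z = X - X.mean(axis=0)                 # center the features
    A = Z.T @ Z / (n - 1)                  # covariance matrix
    components = []
    for _ in range(d_reduced):
        lam, v = power_iteration(A)        # top eigenpair of current A
        components.append(Z @ v)           # principal component Z v_j
        A = A - lam * np.outer(v, v)       # deflate before the next pass
    return np.column_stack(components)

# Hypothetical data: 200 examples, 4 features with different spreads
rng = np.random.default_rng(1)
X = rng.standard_normal((200, 4)) * np.array([5.0, 2.0, 1.0, 0.5])
X_reduced = pca_power_iteration(X, d_reduced=2)
```

`X_reduced` has shape $n \times d_{reduced}$, and the sample variance of each column should match the corresponding eigenvalue of the covariance matrix.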


## Normalized Oja Algorithm
The normalized Oja algorithm is another method to find principal components. The algorithm also takes a diagonalizable matrix and computes the eigenvector associated with its largest eigenvalue
1. Input diagonalizable matrix $A$ and learning rate $\eta$ (Note: Here $A$ is $Z^TZ$, without the $\frac{1}{n - 1}$ scaling of the covariance matrix)
2. Initialize vector $v_0$ of shape $d \times 1$
3. Update $v_0 = \frac{v_0}{||v_0||_2}$
4. For $t = 1, 2, 3, ...$ apply the updates until a stopping criterion is satisfied
    1. $v_t = v_{t-1} + \eta Av_{t-1}$
    2. $v_t = \frac{v_t}{||v_t||_2}$

### Stopping Criteria
Stopping criteria $1$ and $3$ from the power iteration section can be used for the Oja algorithm.
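
A minimal sketch of the normalized Oja iteration is shown below. It assumes the additive form of Oja's update, $v_t = v_{t-1} + \eta Av_{t-1}$, followed by normalization, and it uses stopping criteria $1$ and $3$. The function name, defaults and test matrix are hypothetical.

```python
import numpy as np

def normalized_oja(A, eta=0.1, max_iter=10000, eps=1e-10, seed=0):
    """Sketch: eigenvector for the largest eigenvalue of A (assumed update form)."""
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(A.shape[0])
    v /= np.linalg.norm(v)                       # normalize v_0
    for _ in range(max_iter):                    # criterion 1: iteration cap
        v_new = v + eta * (A @ v)                # assumed additive update
        v_new /= np.linalg.norm(v_new)           # keep unit length
        converged = np.linalg.norm(v_new - v) < eps   # criterion 3
        v = v_new
        if converged:
            break
    return v

# Hypothetical symmetric matrix; its top eigenvector is [1, 1] / sqrt(2) up to sign
A = np.array([[2.0, 1.0],
              [1.0, 2.0]])
v = normalized_oja(A)
```

In this form the learning rate affects how quickly the iteration converges: a larger $\eta$ makes each step behave more like a power iteration step.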

### Principal Components
Similar to power iteration, the principal component is computed by projecting the centered data $Z$ onto the eigenvector $v$, which maps each training example to a one-dimensional space:
$$X_{p1} = Zv$$

### $2^{nd}, 3^{rd}, ..., d^{th}$ Principal Component
The first principal component is projected back to $d$ dimensions and then subtracted from the centered features matrix $Z$. The matrix $A$ is then recomputed from the resulting $Z$, and the Oja algorithm is applied to it:
$$Z_2 = Z_1 - Z_1v_1v_1^T$$
$$A_2 = Z_2^TZ_2$$
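
The two updates can be verified numerically. The sketch below assumes random data and uses `numpy.linalg.eigh` as a stand-in for the eigenvector that the Oja algorithm would return.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical data: 100 examples, 3 features with different spreads
X = rng.standard_normal((100, 3)) * np.array([4.0, 2.0, 1.0])
Z1 = X - X.mean(axis=0)

A1 = Z1.T @ Z1
v1 = np.linalg.eigh(A1)[1][:, -1]      # stand-in for the Oja result

# Z_2 = Z_1 - Z_1 v_1 v_1^T removes the part of the data along v_1
Z2 = Z1 - np.outer(Z1 @ v1, v1)
A2 = Z2.T @ Z2
```

After the update, `Z2 @ v1` is zero up to rounding, so the direction $v_1$ no longer carries any variance and the largest eigenvalue of $A_2$ is the second largest eigenvalue of $A_1$.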

### PCA Using Oja Algorithm
Collecting all the steps together, $X$ can be reduced to $d_{reduced}$ dimensions using the following algorithm:
1. Input $X$ and $d_{reduced} < d$
2. Compute centered matrix, $Z = X - \bar{X}$ (Note: $\bar{X}$ is of shape $d \times 1$ and is the mean of each feature across all training examples)
3. Compute $A = Z^TZ$
4. Repeat for $j \in \{1, 2, 3, ..., d_{reduced}\}$, starting with $Z_1 = Z$ and $A_1 = A$
    1. Compute $v_j$ from $A_j$ using the Oja algorithm
    2. Compute $j^{th}$ principal component $X_{pj} = Zv_j$
    3. Update $Z_{j+1} = Z_j - Z_jv_jv_j^T$
    4. Update $A_{j+1} = Z_{j+1}^TZ_{j+1}$
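
Putting these steps together, a possible end-to-end sketch looks as follows. As before, the additive Oja update $v_t = v_{t-1} + \eta Av_{t-1}$ is an assumption, and the names `normalized_oja` and `pca_oja`, the defaults and the test data are illustrative only.

```python
import numpy as np

def normalized_oja(A, eta=0.01, max_iter=10000, eps=1e-12, seed=0):
    """Sketch: top eigenvector of A using the assumed additive update."""
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(A.shape[0])
    v /= np.linalg.norm(v)
    for _ in range(max_iter):
        v_new = v + eta * (A @ v)
        v_new /= np.linalg.norm(v_new)
        converged = np.linalg.norm(v_new - v) < eps
        v = v_new
        if converged:
            break
    return v

def pca_oja(X, d_reduced, eta=0.01):
    """Sketch: reduce X (n x d) to n x d_reduced."""
    Z = X - X.mean(axis=0)                   # centered data, kept for projection
    Z_j = Z.copy()                           # working copy that gets deflated
    components = []
    for _ in range(d_reduced):
        A = Z_j.T @ Z_j                      # unscaled matrix from current Z_j
        v = normalized_oja(A, eta=eta)
        components.append(Z @ v)             # principal component Z v_j
        Z_j = Z_j - np.outer(Z_j @ v, v)     # remove the direction v_j
    return np.column_stack(components)

# Hypothetical data: 200 examples, 4 features with different spreads
rng = np.random.default_rng(1)
X = rng.standard_normal((200, 4)) * np.array([5.0, 2.0, 1.0, 0.5])
X_reduced = pca_oja(X, d_reduced=2)
```

On this data the columns of `X_reduced` should have sample variances equal to the two largest eigenvalues of the covariance matrix, the same result the power iteration approach gives.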