# Principal components analysis

Principal components analysis (PCA) is a technique for reducing the dimension of a data matrix $\mathbf{X} \in \mathbb{R}^{D \times N}$, whose $D$ rows are features and whose $N$ columns are observations.
<!-- The first principal component direction of the data is that along which the observations vary the most. -->
When faced with a large set of correlated variables, principal components allow us to summarize this set with a smaller number of representative variables that collectively explain most of the variability in the original set.

PCA refers to the process by which principal components are computed, and the subsequent use of these components in understanding the data. PCA is an unsupervised approach, since
it involves only a set of features $x_1, \ldots, x_D$ and no associated response $y$. Apart from producing derived variables for use in supervised learning problems, PCA also serves as a tool for data visualization (of either the observations or the variables). It can also be used for data imputation, that is, for filling in missing values in a data matrix.

```{admonition} Video
<iframe width="700" height="394" src="https://www.youtube.com/embed/HMOI_lkzW08?start=10" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>

[Explaining the main ideas behind principal components analysis, by StatQuest](https://www.youtube.com/embed/HMOI_lkzW08?start=10)
```

## What are principal components?

Suppose that we wish to visualize $N$ observations with measurements on a set of $D$ features, $x_1, x_2, \ldots, x_D$, as part of an exploratory data analysis. We could do this by examining two-dimensional scatterplots of the data, each of which contains the $N$ observations’ measurements on two of the features. However, there are $D(D-1)/2$ such scatterplots, and it is difficult to visualize more than a few of them at a time. If $D$ is large, then it will certainly not be possible to look at all of them; moreover, most likely none of them will be informative, since each contains just a small fraction of the total information present in the data set. Clearly, a better method is required to visualize the $N$ observations when $D$ is large. In particular, we would like to find a low-dimensional representation of the data that captures as much of the information as possible. For instance, if we can obtain a two-dimensional representation of the data that captures most of the information, then we can plot the observations in this low-dimensional space.
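To make the growth in the number of scatterplots concrete, the short sketch below evaluates $D(D-1)/2$ for a few values of $D$. The function name `n_pairwise_scatterplots` and the chosen values of $D$ are illustrative assumptions, not part of the original text.

```python
from math import comb


def n_pairwise_scatterplots(n_features: int) -> int:
    """Number of distinct feature pairs, D(D-1)/2 (hypothetical helper)."""
    return comb(n_features, 2)


# Illustrative feature counts; any positive integers would do.
for d in (5, 10, 50, 100):
    print(f"D = {d:3d}: {n_pairwise_scatterplots(d)} scatterplots")
```

For $D = 10$ this gives 45 plots, and for $D = 100$ it gives 4950, which is far more than anyone can inspect by eye.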

PCA provides a tool to do just this. It finds a low-dimensional representation of a data set that contains as much of the variation as possible. The idea is that each of the $N$ observations lives in $D$-dimensional space, but not all of these dimensions are equally interesting. PCA seeks a small number of dimensions that are as interesting as possible, where the concept of interesting is measured by the amount that the observations vary along each dimension. Each of the dimensions found by PCA is a linear combination of the $D$ features. We now explain the manner in which these dimensions, or principal components, are found.
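The idea that some directions are more "interesting" than others can be checked numerically. The following is a minimal sketch on synthetic data, assuming two strongly correlated features stored as a $2 \times N$ matrix; the data, the random seed, and the helper `variance_along` are all made up for illustration. It measures how much the observations vary along each direction in the plane.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 500

# Synthetic data (an assumption for this sketch): two correlated features,
# stored as a D x N matrix with D = 2, one feature per row.
x1 = rng.normal(size=N)
x2 = 0.9 * x1 + 0.1 * rng.normal(size=N)
X = np.vstack([x1, x2])
X = X - X.mean(axis=1, keepdims=True)  # centre each feature (row)


def variance_along(X: np.ndarray, angle: float) -> float:
    """Sample variance of the data projected onto a unit direction."""
    direction = np.array([np.cos(angle), np.sin(angle)])  # unit length
    projected = direction @ X  # one value per observation
    return float(projected.var())


# Scan directions between 0 and 180 degrees in one-degree steps.
angles = np.linspace(0.0, np.pi, 181)
variances = np.array([variance_along(X, a) for a in angles])
best_angle = angles[variances.argmax()]

print(f"largest variance:  {variances.max():.4f}")
print(f"smallest variance: {variances.min():.4f}")
print(f"best direction:    {np.degrees(best_angle):.0f} degrees")
```

In this sketch the variance along the best direction is far larger than along the worst one, so a single well-chosen dimension already summarizes most of the variation in the two features.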

The first principal component of a set of features $x_1, \ldots, x_D$ is the normalized linear combination of the features

$$
z_1 = \phi_{1,1}x_1 + \phi_{2,1}x_2 + \cdots + \phi_{D,1}x_D
$$

that has the largest variance. By normalized, we mean that $\sum_{j=1}^D \phi_{j,1}^2 = 1$. We refer to the elements $\phi_{1,1}, \ldots, \phi_{D,1}$ as the loadings of the first principal component; together, the loadings make up the principal component loading vector, $\boldsymbol{\phi} = [\phi_{1,1}, \ldots, \phi_{D,1}]^\top$. We constrain the loadings so that their sum of squares is equal to one, since otherwise setting these elements to be arbitrarily large in absolute value could result in an arbitrarily large variance.
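The need for the normalization constraint can be seen in a small experiment. The sketch below, which assumes synthetic data and an arbitrary starting vector, scales a unit-norm loading vector by a constant $c$ and shows that the variance of the resulting linear combination grows by a factor of $c^2$. The names `phi_unit` and `score_variance` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
D, N = 3, 1000

# Synthetic D x N data (an assumption for this sketch), centred per feature.
X = rng.normal(size=(D, N))
X = X - X.mean(axis=1, keepdims=True)

# An arbitrary loading vector, rescaled so its squared entries sum to one.
phi = np.array([1.0, 2.0, 2.0])
phi_unit = phi / np.linalg.norm(phi)


def score_variance(phi: np.ndarray, X: np.ndarray) -> float:
    """Sample variance (1/N convention) of the combination phi^T x."""
    z = phi @ X  # one score per observation
    return float(np.mean(z**2))  # the data are centred, so the mean of z is 0


base = score_variance(phi_unit, X)
for c in (1, 10, 100):
    ratio = score_variance(c * phi_unit, X) / base
    print(f"scale c = {c:3d}: variance is {ratio:.0f} times the unit-norm value")
```

Without the constraint, the variance could be made as large as we like simply by increasing $c$, so the maximization would have no solution.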

Given a $D \times N$ data set $\mathbf{X}$, how do we compute the first principal component? Since we are only interested in variance, we assume that each of the variables in $\mathbf{X}$ has been centred to have mean zero (that is, the row means of $\mathbf{X}$ are zero, since each row holds one feature). We then look for the linear combination of the sample feature values of the form

$$
z_{i,1} = \phi_{1,1}x_{1,i} + \phi_{2,1}x_{2,i} + \cdots + \phi_{D,1}x_{D,i}
$$

that has the largest sample variance, subject to the constraint that $\sum_{j=1}^D \phi_{j,1}^2 = 1$. In other words, the first principal component loading vector solves the optimization problem

```{math}
:label: eq:1stpc
\begin{equation}
\arg\max_{\boldsymbol{\phi}} \left\{\frac{1}{N} \sum_{i=1}^N \left(\sum_{j=1}^D \phi_{j,1} x_{j,i} \right)^2\right\} \quad \text{subject to} \quad \sum_{j=1}^D \phi_{j,1}^2 = 1
\end{equation}
```

where the objective can also be written as $\frac{1}{N} \sum_{i=1}^N z_{i,1}^2$. Because the features are centred, the values $z_{i,1}$ have mean zero, so Equation {eq}`eq:1stpc` maximises the sample variance of the $N$ values of $z_{i,1}$. We refer to $z_{1,1}, \ldots, z_{N,1}$ as the first principal component scores. Equation {eq}`eq:1stpc` can be solved via a standard technique in linear algebra, [_eigendecomposition_](https://en.wikipedia.org/wiki/Eigendecomposition_of_a_matrix), the mathematical details of which are beyond the scope of this course.
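Although the mathematical details are not covered here, the computation itself is short. The following is a minimal sketch, assuming synthetic $D \times N$ data and NumPy's `numpy.linalg.eigh` routine for symmetric matrices. It uses the fact that the objective equals $\boldsymbol{\phi}^\top \mathbf{S} \boldsymbol{\phi}$ with $\mathbf{S} = \frac{1}{N}\mathbf{X}\mathbf{X}^\top$ for centred $\mathbf{X}$, which a unit vector maximizes when it is the eigenvector of $\mathbf{S}$ with the largest eigenvalue. Variable names such as `phi_1` and `z_1` are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(42)
D, N = 4, 300

# Synthetic D x N data with correlated features (an assumption for this sketch).
A = rng.normal(size=(D, D))
X = A @ rng.normal(size=(D, N))

# Centre each feature, i.e. each row of the D x N matrix.
Xc = X - X.mean(axis=1, keepdims=True)

# Sample covariance matrix with the 1/N convention used in the text.
S = (Xc @ Xc.T) / N

# eigh handles symmetric matrices and returns eigenvalues in ascending order,
# so the last column holds the eigenvector with the largest eigenvalue.
eigenvalues, eigenvectors = np.linalg.eigh(S)
phi_1 = eigenvectors[:, -1]  # first loading vector (its sign is arbitrary)
z_1 = phi_1 @ Xc  # first principal component scores, one per observation

objective = np.mean(z_1**2)
print(f"sum of squared loadings: {phi_1 @ phi_1:.6f}")
print(f"objective value:         {objective:.6f}")
print(f"largest eigenvalue:      {eigenvalues[-1]:.6f}")

# Sanity check: no random unit vector should achieve a larger objective.
candidates = rng.normal(size=(1000, D))
candidates /= np.linalg.norm(candidates, axis=1, keepdims=True)
candidate_objectives = np.mean((candidates @ Xc) ** 2, axis=1)
print(f"best of 1000 random unit vectors: {candidate_objectives.max():.6f}")
```

The objective value attained by `phi_1` coincides with the largest eigenvalue of `S`, and none of the random unit vectors exceeds it. Multiplying `phi_1` by $-1$ gives an equally valid solution, so the sign of the loadings and scores carries no meaning.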



## Exercises


