# Dimensionality Reduction

High-dimensional data is everywhere in modern machine learning, from genomics datasets with thousands of features to images with millions of pixels. Dimensionality reduction techniques provide powerful tools to compress this data while preserving its most important structure and information. 

By the end of this module, you will be able to define and demonstrate mastery of the following key concepts:

* **Principal Component Analysis (PCA)** - Understanding how to extract the most informative directions from high-dimensional data using the eigendecomposition of the covariance matrix
* **Singular Value Decomposition (SVD)** - Learning how to factorize matrices into orthogonal components and apply this technique for dimensionality reduction and data compression
* **Covariance Matrix Analysis** - Computing and interpreting the empirical covariance matrix to understand relationships between features and identify the principal directions of variation

These mathematical frameworks form the foundation of many machine learning algorithms and data analysis techniques, providing both theoretical insights and practical tools for working with complex datasets. Let's go!

___

## Dimensionality Reduction Problem
Suppose we have a dataset $\mathcal{D} = \left\{\mathbf{x}_{1},\mathbf{x}_{2},\dots,\mathbf{x}_{n}\right\}$ where $\mathbf{x}_{i}\in\mathbb{R}^{m}$ is an $m$-dimensional feature vector that we want to compress into $k$ dimensions: $\mathbf{x}_i \in \R^m \;\rightarrow\; \mathbf{y}_i \in \R^k$ where $k\ll{m}$.

* **Composite features.**  Each lower-dimensional vector $\mathbf{y}_i$ is called a *composite feature*, since it's a linear combination of the original features.  Reducing dimensionality can help us visualize high-dimensional data in 2–3D, reduce the computational complexity of a machine learning algorithm, or give us a more compact representation of the data that retains the most important information.

Imagine we have a _magical transformation matrix_ $\mathbf{P}\in\mathbb{R}^{k\times{m}}$ so that: $\mathbf{y} = \mathbf{P}\;(\mathbf{x} - \bar{\mathbf{x}})$ where $\mathbf{y}\in\mathbb{R}^{k}$ is the new composite feature vector and $\bar{\mathbf{x}}$ is the mean of the data.  If we write $\mathbf{P} = [\,\mathbf{\phi}_1^\top;\dots;\mathbf{\phi}_k^\top]$, then each row $\mathbf{\phi}_i^\top$ extracts one component:
$$
\begin{align*}
y_{i} = \phi_{i}^{\top}\;(\mathbf{x} - \bar{\mathbf{x}})\quad{i=1,2,\dots,k}
\end{align*}
$$

Wow, that sounds great!  What are these magical transformation vectors $\phi_{i}^{\top}$? 
* __TL;DR.__ The $\phi_{i}^{\top}$ vectors are the eigenvectors of the data's covariance matrix associated with its $k$ largest eigenvalues. This reduction procedure is known as *Principal Component Analysis* (PCA).
* **Alternative story.** Equivalently, you can compute the transformation matrix $\mathbf{P}$ via the Singular Value Decomposition (SVD) of the centered data matrix (samples on the rows). In this case, the transformation vectors are the top $k$ right singular vectors, i.e., the first $k$ columns of the $\mathbf{V}$ matrix.
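
To make the projection $\mathbf{y} = \mathbf{P}\,(\mathbf{x} - \bar{\mathbf{x}})$ concrete, here is a minimal pure-Python sketch. The function name `project` and all numbers are made up for illustration; in particular, this `P` is just a matrix with orthonormal rows, not one computed from data.

```python
def project(P, x, x_bar):
    """Return y = P (x - x_bar) for a k x m matrix P given as a list of rows."""
    centered = [xi - mi for xi, mi in zip(x, x_bar)]
    # Each row phi_i^T of P yields one composite feature y_i = phi_i^T (x - x_bar).
    return [sum(p * c for p, c in zip(row, centered)) for row in P]


# Hypothetical example: compress m = 3 features into k = 2 composite features.
P = [
    [0.6, 0.8, 0.0],  # phi_1^T
    [0.0, 0.0, 1.0],  # phi_2^T
]
x = [2.0, 3.0, 5.0]
x_bar = [1.0, 1.0, 1.0]
y = project(P, x, x_bar)
```

Each entry of `y` is one composite feature, so the output has length $k$ regardless of how many original features $m$ the input has.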

__Hmmm__: There are a few items that we need to introduce before we can get to the SVD story, so let's start with a quick review of the covariance matrix.
___

<div>
    <center>
        <img src="figs/Fig-Cov-Schematic.png" width="680"/>
    </center>
</div>

## Empirical Covariance Matrix
The covariance matrix is a key concept in statistics and machine learning that describes the relationships between different features in a dataset. 
Suppose we have a dataset $\mathcal{D} = \left\{\mathbf{x}_{1},\mathbf{x}_{2},\dots,\mathbf{x}_{n}\right\}$ where $\mathbf{x}_{i}\in\mathbb{R}^{m}$. 
The empirical covariance matrix $\hat{\mathbf{\Sigma}}\in\mathbb{R}^{m\times{m}}$ is a square symmetric matrix whose eigenvalues and eigenvectors have some _magical properties_.

Let $x_k^{(i)}$ be the value of feature $i$ on sample $k$ (here $k$ is a sample index, not the reduced dimension from the previous section).  Collect these into the vector $\mathbf{x}^{(i)}=[x_1^{(i)},\dots,x_n^{(i)}]^\top$. Then, the covariance between features $i$ and $j$ is given by: 
$$
\begin{align*}
    \hat{\Sigma}_{ij} &= \frac{1}{n-1}\sum_{k=1}^{n}\bigl(x^{(i)}_k-\bar{x}_i\bigr)\,\bigl(x^{(j)}_k-\bar{x}_j\bigr) \quad\Longrightarrow\quad \hat{\Sigma}_{ij}=\sigma_i\,\sigma_j\,\rho_{ij}
\end{align*}
$$
where $\mathbf{x}^{(\star)}$ denotes the $n$ samples of feature $\star$ in the dataset $\mathcal{D}$, $\bar{x}_\star = \frac{1}{n}\sum_{k=1}^{n}x_k^{(\star)}$ is the mean of feature $\star$ across all samples, $\sigma_{\star}$ is the standard deviation of feature $\star$ computed over the $n$ samples, and $\rho_{ij}\in\left[-1,1\right]$ is the correlation between features $i$ and $j$. But where does this come from? Let's break it down a bit more. Starting with the definition of the covariance between features $i$ and $j$:
$$
\begin{align*}
\hat{\Sigma}_{ij} &= \frac{1}{n-1}\sum_{k=1}^{n}\overbrace{\bigl(x^{(i)}_k-\bar{x}_i\bigr)}^{\text{distance from mean}}\,\bigl(x^{(j)}_k-\bar{x}_j\bigr)\\
\sigma_i^2 & = \hat\Sigma_{ii}=\frac{1}{n-1}\sum_{k=1}^{n}(x_k^{(i)}-\bar x_i)^2,\quad\sigma_j^2=\hat\Sigma_{jj} = \frac{1}{n-1}\sum_{k=1}^{n}(x_k^{(j)}-\bar x_j)^2\\
\rho_{ij} &= \frac{\displaystyle
  \overbrace{\frac{1}{n-1}\sum_{k=1}^n\bigl(x^{(i)}_k-\bar{x}_i\bigr)\,\bigl(x^{(j)}_k-\bar{x}_j\bigr)}^{\hat\Sigma_{ij}}}{
  \underbrace{\sqrt{\frac{1}{n-1}\sum_{k=1}^n\bigl(x^{(i)}_k-\bar{x}_i\bigr)^2}}_{\sigma_i}
  \;\underbrace{\sqrt{\frac{1}{n-1}\sum_{k=1}^n\bigl(x^{(j)}_k-\bar{x}_j\bigr)^2}}_{\sigma_j}
} = \frac{\hat\Sigma_{ij}}{\sigma_i\,\sigma_j}  \quad\Longrightarrow\quad \hat\Sigma_{ij} = \sigma_i\,\sigma_j\,\rho_{ij}\quad\blacksquare
\end{align*}
$$ 
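
As a quick sanity check of this identity, consider a small made-up dataset (the numbers are purely illustrative) with $n=3$ samples of two features, $\mathbf{x}^{(1)}=[1,2,3]^\top$ and $\mathbf{x}^{(2)}=[1,3,8]^\top$, so that $\bar{x}_1 = 2$ and $\bar{x}_2 = 4$:
$$
\begin{align*}
\hat{\Sigma}_{12} &= \frac{1}{2}\bigl[(-1)(-3) + (0)(-1) + (1)(4)\bigr] = \frac{7}{2}\\
\sigma_1^2 &= \frac{1}{2}\bigl[(-1)^2 + 0^2 + 1^2\bigr] = 1,\quad \sigma_2^2 = \frac{1}{2}\bigl[(-3)^2 + (-1)^2 + 4^2\bigr] = 13\\
\rho_{12} &= \frac{7/2}{1\cdot\sqrt{13}} \approx 0.971 \quad\Longrightarrow\quad \sigma_1\,\sigma_2\,\rho_{12} = \sqrt{13}\cdot\frac{7/2}{\sqrt{13}} = \frac{7}{2} = \hat{\Sigma}_{12}
\end{align*}
$$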

However, computing the correlation $\rho_{ij}$ is not necessary to compute the covariance matrix $\hat{\mathbf{\Sigma}}$.  We can compute the covariance matrix directly from the centered feature vectors $\tilde{\mathbf{x}}^{(i)} = \mathbf{x}^{(i)} - \bar{x}_i\mathbf{1}$ where $\bar{x}_i$ is the mean (scalar) of feature $i$ (across all samples) and $\mathbf{1}$ is a vector of ones. The empirical covariance matrix is then given by:
$$
\hat{\mathbf{\Sigma}} = \left(\frac{1}{n-1}\right)\;\tilde{\mathbf{X}}^{\top}\tilde{\mathbf{X}}
$$
where $\tilde{\mathbf{X}} = [\tilde{\mathbf{x}}^{(1)},\dots,\tilde{\mathbf{x}}^{(m)}]\in\mathbb{R}^{n\times m}$ is the matrix of centered feature vectors (samples on the rows, features on the columns), and $\tilde{\mathbf{X}}^\top$ is the transpose of the matrix of centered feature vectors.
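
The matrix formula can be sketched in a few lines of pure Python. This is a minimal illustration under stated assumptions, not a production implementation: the function name `covariance_matrix` and the data are hypothetical, and in practice you would likely rely on a linear algebra library instead of nested lists.

```python
def covariance_matrix(X):
    """Empirical covariance of X (list of n samples, each a list of m features)."""
    n, m = len(X), len(X[0])
    # Mean of every feature (column) across the n samples.
    means = [sum(row[j] for row in X) / n for j in range(m)]
    # Centered data matrix X_tilde: subtract the feature mean from every entry.
    Xc = [[row[j] - means[j] for j in range(m)] for row in X]
    # Sigma = X_tilde^T X_tilde / (n - 1), written out entry by entry.
    return [
        [sum(Xc[k][i] * Xc[k][j] for k in range(n)) / (n - 1) for j in range(m)]
        for i in range(m)
    ]


# Hypothetical data: n = 4 samples (rows), m = 2 features (columns).
X = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, 8.0]]
Sigma = covariance_matrix(X)
```

In this toy dataset the second feature is exactly twice the first, so the two features are perfectly correlated ($\rho_{12} = 1$) and the off-diagonal entry equals $\sigma_1\,\sigma_2$.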

The covariance matrix $\hat{\mathbf{\Sigma}}$ has the following (interesting) properties:

* _Elements_: The diagonal elements $\hat{\Sigma}_{ii}$ of the covariance matrix are the variances of feature $i$ (non-negative), while the off-diagonal elements $\hat{\Sigma}_{ij}$ for $i\neq{j}$ measure the linear relationship between features $i$ and $j$ in the dataset $\mathcal{D}$. The sign of $\hat{\Sigma}_{ij}$ gives the direction of that relationship. Its magnitude depends on the scales of the two features, so the strength of the relationship is better read from the normalized correlation $\rho_{ij} = \hat{\Sigma}_{ij}/(\sigma_i\,\sigma_j)$.
* _Positive, negative, or no relationship?_: If $\hat{\Sigma}_{ij} > 0$, then features $i$ and $j$ are positively correlated, meaning that as one feature increases, the other feature also tends to increase. If $\hat{\Sigma}_{ij} < 0$, then features $i$ and $j$ are negatively correlated, meaning that as one feature increases, the other feature tends to decrease. If $\hat{\Sigma}_{ij} = 0$, then features $i$ and $j$ are uncorrelated, meaning that there is no linear relationship between the two features (a nonlinear relationship may still exist).
* _Symmetry_: The covariance matrix $\hat{\mathbf{\Sigma}}$ is symmetric, meaning that $\hat{\Sigma}_{ij} = \hat{\Sigma}_{ji}$ for all $i$ and $j$. This is because the covariance between features $i$ and $j$ is the same as the covariance between features $j$ and $i$.

### Decomposing the Covariance Matrix
Now to the magical part! The (empirical) covariance matrix $\hat{\mathbf{\Sigma}}$ can be decomposed into its eigenvalues and eigenvectors. But why is this so important? Many reasons, but let's look at a few of the most important ones. Because $\hat{\mathbf{\Sigma}}$ is real, symmetric, and positive semi-definite, its eigenpairs $(\mathbf{v}_i,\lambda_i)$ have several special and useful properties:

1. **Orthonormal eigenvectors**: The eigenvectors $\mathbf{v}_i$ of the covariance matrix $\hat{\mathbf{\Sigma}}$ can be chosen to be orthonormal, meaning that they are mutually orthogonal and have unit length. Thus, we can write the covariance matrix in terms of its eigenvectors and eigenvalues as:

   $$
   \hat{\mathbf{\Sigma}} \;=\; \mathbf{V}\,\mathbf{\Lambda}\,\mathbf{V}^\top,
   $$

   where $\mathbf{V}=[\,\mathbf{v}_1,\dots,\mathbf{v}_m]\in\R^{m\times m}$ is the matrix whose columns are the eigenvectors $\mathbf{v}_i$ of $\hat{\mathbf{\Sigma}}$ (so $\mathbf{V}^\top \mathbf{V} = \mathbf{I}$), and $\mathbf{\Lambda}$ is the diagonal matrix of the corresponding eigenvalues.

2. **Non-negative eigenvalues**: All eigenvalues $\lambda_i$ of the covariance matrix $\hat{\mathbf{\Sigma}}$ are non-negative, i.e., $\lambda_i \ge 0$. This is because the covariance matrix is positive semi-definite. The eigenvalues can be arranged in descending order:

   $$
   \mathbf{\Lambda}=\operatorname{diag}(\lambda_1,\dots,\lambda_m),
   \quad
   \lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_m \geq 0.
   $$

   Each $\lambda_i$ equals the _variance of the data_ along the direction $\mathbf{v}_i$. That seems to be a very important property, so let's unpack it a bit more.

3. **Principal components**: Projecting a centered data vector $\mathbf{x}-\bar{\mathbf{x}}$ onto $\mathbf{v}_i$ gives the $i$-th principal component:

   $$
   y_i \;=\; \mathbf{v}_i^\top(\mathbf{x}-\bar{\mathbf{x}}),
   $$

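   whose variance across the dataset is exactly $\lambda_i$, since $\frac{1}{n-1}\sum_{k=1}^{n}\bigl(\mathbf{v}_i^\top(\mathbf{x}_k-\bar{\mathbf{x}})\bigr)^2 = \mathbf{v}_i^\top\hat{\mathbf{\Sigma}}\,\mathbf{v}_i = \lambda_i\,\mathbf{v}_i^\top\mathbf{v}_i = \lambda_i$. Thus, the eigenvalue $\lambda_i$ is the amount of variance captured along the principal direction $\mathbf{v}_i$. The first principal direction $\mathbf{v}_1$ captures the most variance, the second, $\mathbf{v}_2$, captures the second most, and so on.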

4. **Optimal low-rank approximation**:
   Keeping only the top $k$ eigenvectors (the columns of $\mathbf{V}_k=[\mathbf{v}_1,\dots,\mathbf{v}_k]$) yields the best $k$-dimensional linear reconstruction of the data in the least-squares sense.  In other words, the subspace spanned by $\{\mathbf{v}_1,\dots,\mathbf{v}_k\}$ maximizes retained variance and minimizes reconstruction error. We'll see this in action later in this module!
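
The properties above can be checked numerically on a tiny two-feature dataset. The sketch below is illustrative only: the data and the helper names `eig_sym_2x2` and `sample_variance` are hypothetical, and it relies on the closed-form eigendecomposition of a symmetric $2\times 2$ matrix, which does not generalize to larger $m$ (where you would use a numerical eigensolver).

```python
import math


def eig_sym_2x2(a, b, c):
    """Eigenpairs of the symmetric matrix [[a, b], [b, c]], largest eigenvalue first."""
    mid, radius = (a + c) / 2.0, math.hypot((a - c) / 2.0, b)
    theta = 0.5 * math.atan2(2.0 * b, a - c)  # angle of the leading eigenvector
    v1 = (math.cos(theta), math.sin(theta))
    v2 = (-math.sin(theta), math.cos(theta))  # orthogonal to v1 by construction
    return (mid + radius, v1), (mid - radius, v2)


def sample_variance(values):
    """Unbiased sample variance (divides by n - 1)."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / (len(values) - 1)


# Hypothetical data: n = 5 samples of m = 2 features.
X = [(2.0, 1.0), (3.0, 4.0), (5.0, 4.0), (6.0, 7.0), (9.0, 9.0)]
n = len(X)
x_bar = (sum(p[0] for p in X) / n, sum(p[1] for p in X) / n)
Xc = [(p[0] - x_bar[0], p[1] - x_bar[1]) for p in X]

# Entries of the 2 x 2 empirical covariance matrix.
s11 = sum(p[0] * p[0] for p in Xc) / (n - 1)
s12 = sum(p[0] * p[1] for p in Xc) / (n - 1)
s22 = sum(p[1] * p[1] for p in Xc) / (n - 1)

(lam1, v1), (lam2, v2) = eig_sym_2x2(s11, s12, s22)

# Principal components: y_i = v_i^T (x - x_bar) for every sample.
y1 = [v1[0] * p[0] + v1[1] * p[1] for p in Xc]
y2 = [v2[0] * p[0] + v2[1] * p[1] for p in Xc]
```

Comparing `sample_variance(y1)` with `lam1` (and `sample_variance(y2)` with `lam2`) confirms that each eigenvalue is the variance of the data along its eigenvector, and `lam1 + lam2` matches the total variance `s11 + s22`.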

These properties make the covariance matrix eigenpairs $(\mathbf{v}_i,\lambda_i)$ the foundation of Principal Component Analysis (PCA), where eigenvectors provide maximum variance directions and eigenvalues measure captured variance. Now let's explore an alternative approach using Singular Value Decomposition (SVD).
___

## Singular Value Decomposition
The singular value decomposition _factors_ a matrix into three components. 
Given $\mathbf{A}\in\R^{m\times n}$ of rank $r(\mathbf{A})$, the **full** SVD is given by the factorization:
$$
\mathbf{A} = \mathbf{U}\,\mathbf{S}\,\mathbf{V}^\top,
$$

where
* $\mathbf{U}\in\R^{m\times m}$ is orthogonal ($\mathbf{U}^\top \mathbf{U} = \mathbf{I}_m$),
* $\mathbf{S}\in\R^{m\times n}$ is rectangular diagonal, with the singular values $\sigma_1\ge\cdots\ge\sigma_r>0$ on the first $r = r(\mathbf{A})$ diagonal entries and zeros elsewhere. The singular values are the square roots of the eigenvalues of the matrix $\mathbf{A}^{\top}\mathbf{A}$, i.e., $\sigma_i = \sqrt{\lambda_i(\mathbf{A}^\top\mathbf{A})}$, where $\lambda_i(\mathbf{A}^\top\mathbf{A})$ denotes the $i$-th largest eigenvalue of $\mathbf{A}^\top\mathbf{A}$ (note that $\sigma_i$ here is a singular value, not the standard deviation from the covariance section). The number of non-zero singular values is [the rank](https://en.wikipedia.org/wiki/Rank_(linear_algebra)) of the matrix $\mathbf{A}$, where $r(\mathbf{A}) \leq\min\left(n,m\right)$.
* $\mathbf{V}\in\R^{n\times n}$ is orthogonal ($\mathbf{V}^\top \mathbf{V} = \mathbf{I}_n$).
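
The relationship between singular values and the eigenvalues of $\mathbf{A}^\top\mathbf{A}$ is what links the SVD back to PCA. As a short derivation, assume (as in the covariance section) that the centered data matrix $\tilde{\mathbf{X}}\in\R^{n\times m}$ has samples on its rows, and write its SVD as $\tilde{\mathbf{X}} = \mathbf{U}\,\mathbf{S}\,\mathbf{V}^\top$, so that here $\mathbf{U}\in\R^{n\times n}$ and $\mathbf{V}\in\R^{m\times m}$. Substituting into the empirical covariance matrix gives:
$$
\begin{align*}
\hat{\mathbf{\Sigma}} &= \frac{1}{n-1}\,\tilde{\mathbf{X}}^\top\tilde{\mathbf{X}} = \frac{1}{n-1}\,\mathbf{V}\,\mathbf{S}^\top\mathbf{U}^\top\mathbf{U}\,\mathbf{S}\,\mathbf{V}^\top = \mathbf{V}\left(\frac{\mathbf{S}^\top\mathbf{S}}{n-1}\right)\mathbf{V}^\top \quad\Longrightarrow\quad \lambda_i = \frac{\sigma_i^2}{n-1}
\end{align*}
$$
Thus the right singular vectors of $\tilde{\mathbf{X}}$ are eigenvectors of $\hat{\mathbf{\Sigma}}$, and each covariance eigenvalue $\lambda_i$ is the squared singular value $\sigma_i^2$ scaled by $1/(n-1)$.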

__The full SVD__ stores **all** $m$ left‐singular vectors and all $n$ right‐singular vectors, even though only the first $r(\mathbf{A})$ correspond to nonzero singular values.

* **Pros:** You have complete orthonormal bases for $\R^m$ and $\R^n$, covering the column space, the row space, and both null spaces of $\mathbf{A}$.
* **Cons:** Memory requirement is $\mathcal{O}(m^2 + n^2)$, which can be wasteful if $r(\mathbf{A})\ll\min(m,n)$. When you have a low-rank matrix, you can save memory by storing only the first $r(\mathbf{A})$ left‐ and right‐singular vectors, which is the idea underlying the **thin SVD**.

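__Thin SVD__: The thin factorization keeps only the first $r$ singular triplets, $\mathbf{A} = \mathbf{U}_r\,\mathbf{S}_r\,\mathbf{V}_r^\top$ with $\mathbf{U}_r\in\R^{m\times r}$, $\mathbf{S}_r\in\R^{r\times r}$, and $\mathbf{V}_r\in\R^{n\times r}$, where $r = r(\mathbf{A})$ is the rank of the matrix $\mathbf{A}$ (some references call this rank-$r$ form the *compact* SVD). Use the thin SVD whenever you only care about the nonzero singular values/directions (which is most of the time!).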
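
To get a feel for the memory savings, the sketch below counts how many numbers each factorization stores. The function name `svd_storage` and the matrix sizes are hypothetical, and the count assumes that only the diagonal of $\mathbf{S}$ is stored (a common convention, but an assumption here).

```python
def svd_storage(m, n, r):
    """Count stored numbers for the full and thin SVD of an m x n matrix of rank r."""
    # Full SVD: U is m x m, V is n x n, plus min(m, n) diagonal entries of S.
    full = m * m + n * n + min(m, n)
    # Thin SVD: U_r is m x r, V_r is n x r, plus the r singular values.
    thin = m * r + n * r + r
    return full, thin


# Hypothetical sizes: a 1000 x 500 matrix of rank 10.
full, thin = svd_storage(1000, 500, 10)
```

For these sizes the thin factorization stores 15,010 numbers versus 1,250,500 for the full one, which is the $\mathcal{O}(m^2+n^2)$ versus $\mathcal{O}((m+n)\,r)$ gap described above.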

### Dimensionality Reduction?
The SVD decomposes any rectangular matrix into a weighted sum of rank-1 blocks, making it ideal for dimensionality reduction and data compression.

Let $\mathbf{A}\in\mathbb{R}^{m\times{n}}$ have the singular value decomposition $\mathbf{A} = \mathbf{U}\mathbf{S}\mathbf{V}^{\top}$. Then, the matrix $\mathbf{A}$ can be written as:
$$
\mathbf{A} = \sum_{i=1}^{r(\mathbf{A})}\sigma_{i}\,\mathbf{u}_{i}\mathbf{v}_{i}^{\top}
$$
where $r(\mathbf{A})$ is the rank of matrix $\mathbf{A}$, the vectors $\mathbf{u}_{i}$ and $\mathbf{v}_{i}$ are the $i$-th left and right singular vectors, and $\sigma_{i}$ are the (ordered) singular values. The [outer-product](https://en.wikipedia.org/wiki/Outer_product) $\mathbf{u}_{i}\mathbf{v}_{i}^{\top}$ is a separable rank-1 block of the matrix $\mathbf{A}$. 
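
For a concrete (made-up) instance of this expansion, take $\sigma_1 = 5$, $\sigma_2 = 1$, left singular vectors $\mathbf{u}_1 = [1,0,0]^\top$, $\mathbf{u}_2 = [0,1,0]^\top$, and right singular vectors $\mathbf{v}_1 = [0.6,0.8]^\top$, $\mathbf{v}_2 = [-0.8,0.6]^\top$. The resulting rank-2 matrix is the sum of two rank-1 blocks:
$$
\mathbf{A} = 5\begin{bmatrix}1\\0\\0\end{bmatrix}\begin{bmatrix}0.6 & 0.8\end{bmatrix} + 1\begin{bmatrix}0\\1\\0\end{bmatrix}\begin{bmatrix}-0.8 & 0.6\end{bmatrix} = \begin{bmatrix}3 & 4\\0 & 0\\0 & 0\end{bmatrix} + \begin{bmatrix}0 & 0\\-0.8 & 0.6\\0 & 0\end{bmatrix} = \begin{bmatrix}3 & 4\\-0.8 & 0.6\\0 & 0\end{bmatrix}
$$
The first block, weighted by the larger singular value, already carries most of the matrix; the second block is a small correction.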

By truncating the sum at $k < r(\mathbf{A})$ terms (typically $k\ll{r}(\mathbf{A})$), the SVD yields the best rank-$k$ approximation in both the Frobenius and spectral norms (Eckart–Young theorem). Let
$$
\mathbf{A} = \sum_{i=1}^r \sigma_i\,\mathbf{u}_i\,\mathbf{v}_i^\top
$$
be the SVD of $\mathbf{A}\in\R^{m\times n}$ written as a sum of rank-1 blocks, with singular values $\sigma_1\ge\sigma_2\ge\cdots\ge\sigma_r>0$ and rank $r$.  Define the **truncated** SVD of rank $k$ by
$$
\mathbf{A}_k = \sum_{i=1}^k \sigma_i\,\mathbf{u}_i\,\mathbf{v}_i^\top.
$$
Then for any other matrix $\mathbf{B}$ of rank at most $k$,
$$
\|\mathbf{A} - \mathbf{A}_k\|_F \;\le\;\|\mathbf{A} - \mathbf{B}\|_F,\quad\|\mathbf{A} - \mathbf{A}_k\|_2 \;\le\;\|\mathbf{A} - \mathbf{B}\|_2,
$$
where $\|\cdot\|_F$ is the Frobenius norm and $\|\cdot\|_2$ the operator (spectral) norm.  In other words, $\mathbf{A}_k$ is the best rank-$k$ approximation to $\mathbf{A}$ under both measures of error. Thus, truncating the SVD isn't just a heuristic; no rank-$k$ matrix is *provably* closer to $\mathbf{A}$, whether you measure closeness by total squared error (Frobenius) or by maximum stretch (spectral).
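
The truncation errors are also known in closed form: $\|\mathbf{A}-\mathbf{A}_k\|_2 = \sigma_{k+1}$ and $\|\mathbf{A}-\mathbf{A}_k\|_F = \sqrt{\sum_{i>k}\sigma_i^2}$. The pure-Python sketch below illustrates the Frobenius case on a tiny example built from hand-picked singular triplets. All numbers and helper names (`outer`, `add`, `frobenius_distance`) are hypothetical, and comparing against a single alternative matrix $\mathbf{B}$ only illustrates the theorem; it does not prove it.

```python
import math


def outer(u, v, scale=1.0):
    """Rank-1 block scale * u v^T as a list of rows."""
    return [[scale * ui * vj for vj in v] for ui in u]


def add(A, B):
    """Entrywise sum of two matrices of the same shape."""
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]


def frobenius_distance(A, B):
    """Frobenius norm of A - B."""
    return math.sqrt(sum((a - b) ** 2 for ra, rb in zip(A, B) for a, b in zip(ra, rb)))


# Hypothetical singular triplets (orthonormal u's and v's, sigma_1 >= sigma_2 > 0).
sigmas = [5.0, 1.0]
us = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
vs = [[0.6, 0.8], [-0.8, 0.6]]

# A is the full sum of rank-1 blocks; A_1 keeps only the leading block (k = 1).
A_1 = outer(us[0], vs[0], sigmas[0])
A = add(A_1, outer(us[1], vs[1], sigmas[1]))

# Some other rank-1 matrix B, chosen arbitrarily for comparison.
B = outer([1.0, -0.2, 0.0], [3.0, 4.0])

err_truncated = frobenius_distance(A, A_1)  # sigma_2 = 1 here, up to rounding
err_other = frobenius_distance(A, B)
```

Here `err_truncated` matches the discarded singular value $\sigma_2$, while the arbitrary rank-1 competitor `B` gives a larger error, consistent with the inequality above.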

___