# L7a: Dimensionality Reduction
High-dimensional data is everywhere in modern machine learning, from genomics datasets with thousands of features to images with millions of pixels. Dimensionality reduction techniques provide powerful tools to compress this data while preserving its most important structure and information. 

> __Learning Objectives:__
>
> By the end of this lecture, you will be able to define and demonstrate mastery of the following key concepts:
>
>* **Covariance Matrix Analysis** - Computing and interpreting the empirical covariance matrix to understand relationships between features, including its eigendecomposition and the meaning of eigenvectors (principal directions of variation) and eigenvalues (variance along those directions)
>* **Singular Value Decomposition (SVD)** - Understanding how to factorize matrices into orthogonal components (full and thin SVD), and how to apply truncated SVD for optimal low-rank approximation, dimensionality reduction, and data compression
>* **Mathematical Foundations of PCA** - Understanding how Principal Component Analysis emerges from the eigendecomposition of the covariance matrix and its equivalence to SVD of centered data for dimensionality reduction

These mathematical frameworks form the foundation of many machine learning algorithms and data analysis techniques, providing both theoretical insights and practical tools for working with complex datasets. Let's go!

___

## Examples
Today, we will be using the following example(s) to illustrate key concepts:

> [▶ Fun with Singular Value Decomposition](CHEME-5800-L7a-Example-FunWithSVD-Fall-2025.ipynb). In this example, students will decompose a grayscale image using singular value decomposition and understand what this mathematical technique reveals about the structure of data. You'll learn how SVD breaks down complex information into simpler, ranked components that capture different amounts of detail.

___

## Dimensionality Reduction Problem
Suppose we have a dataset $\mathcal{D} = \left\{\mathbf{x}_{1},\mathbf{x}_{2},\dots,\mathbf{x}_{n}\right\}$ where $\mathbf{x}_{i}\in\mathbb{R}^{m}$ is an $m$-dimensional feature vector that we want to compress into $k$ dimensions: $\mathbf{x}_i \in \mathbb{R}^m \;\rightarrow\; \mathbf{y}_i \in \mathbb{R}^k$ where $k\ll{m}$.

> **Why does this matter?** Dimensionality reduction shows up everywhere in real applications:
> * **Genomics:** Gene expression data might have 20,000+ features (genes) but only hundreds of samples. Reducing to key patterns reveals biological processes.
> * **Image compression:** A megapixel image has millions of dimensions, but most information concentrates in far fewer components (like we'll see with SVD!).
> * **Recommendation systems:** User-item matrices are huge and sparse. Low-dimensional representations capture latent preferences efficiently. The winning solution to the famous \$1M Netflix Prize relied heavily on matrix factorization techniques (closely related to SVD) to improve movie recommendations.
> * **Visualization:** You can't plot 1000-dimensional data, but you can visualize the top 2–3 components to spot patterns and clusters.

**Composite features:**  Each lower-dimensional vector $\mathbf{y}_i$ is called a *composite feature*, since it's a linear combination of the original features.  Reducing dimensionality can help us visualize high-dimensional data in 2–3D, reduce the computational complexity of a machine learning algorithm, or give us a more compact representation of the data that retains the most important information.

Imagine we have a _magical transformation matrix_ $\mathbf{P}\in\mathbb{R}^{k\times{m}}$ so that: $\mathbf{y} = \mathbf{P}\;(\mathbf{x} - \bar{\mathbf{x}})$ where $\mathbf{y}\in\mathbb{R}^{k}$ is the new composite feature vector and $\bar{\mathbf{x}}$ is the mean of the original features.  If we write $\mathbf{P} = [\,\mathbf{\phi}_1^\top;\dots;\mathbf{\phi}_k^\top]$, then each row $\mathbf{\phi}_i^\top$ extracts one component:
$$
\begin{align*}
y_{i} = \phi_{i}^{\top}\;(\mathbf{x} - \bar{\mathbf{x}})\quad{i=1,2,\dots,k}
\end{align*}
$$
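
As a minimal sketch of the shapes involved (not yet the actual PCA directions), the snippet below applies this projection to synthetic data. The matrix `P` is a placeholder with orthonormal rows built from a QR factorization; the data `X`, the sizes `n`, `m`, `k`, and the random seed are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, k = 100, 5, 2                      # samples, original features, reduced dimension (assumed)
X = rng.normal(size=(n, m))              # synthetic data: row i is the feature vector x_i
x_bar = X.mean(axis=0)                   # mean of each feature, shape (m,)

# Placeholder P (k x m) with orthonormal rows; these are NOT the PCA directions
Q, _ = np.linalg.qr(rng.normal(size=(m, k)))   # Q is m x k with orthonormal columns
P = Q.T

y = P @ (X[0] - x_bar)                   # composite features for one sample, shape (k,)
Y = (X - x_bar) @ P.T                    # all samples at once, shape (n, k)
print(y.shape, Y.shape)
```

Each row of `Y` is the composite feature vector for the corresponding row of `X`; only the choice of `P` changes once the principal directions are known.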

Wow, that sounds great!  What are these magical transformation vectors $\phi_{i}^{\top}$? 
* __TL;DR.__ The $\phi_{i}^{\top}$ vectors are the top-$k$ eigenvectors of the data's covariance matrix, and this reduction procedure has a special name: __Principal Component Analysis (PCA)__.
* **Alternative story.** Equivalently, you can compute the transformation matrix $\mathbf{P}$ via another technique called the Singular Value Decomposition (SVD) of the centered data matrix. In this case, the transformation vectors are the top $k$ right singular vectors (the first $k$ columns of the $\mathbf{V}$ matrix).

__Hmmm__: There are a few items that we need to introduce before we can get to the SVD story, so let's start with a quick review of the covariance matrix.
___

<div>
    <center>
        <img src="figs/Fig-Cov-Schematic.png" width="680"/>
    </center>
</div>

## Empirical Covariance Matrix
The covariance matrix is a key concept in statistics and machine learning that describes the relationships between different features in a dataset. 

Suppose we have a dataset $\mathcal{D} = \left\{\mathbf{x}_{1},\mathbf{x}_{2},\dots,\mathbf{x}_{n}\right\}$ of $n$ samples, where $\mathbf{x}_{k}\in\mathbb{R}^{m}$ is the vector of $m$ features for sample $k$. The empirical covariance matrix $\hat{\mathbf{\Sigma}}\in\mathbb{R}^{m\times m}$ is a square symmetric matrix that summarizes the pairwise covariances between the $m$ features in the dataset $\mathcal{D}$.

Let $x_k^{(i)}$ be the $i$-th feature of sample $k$. Collect the values for feature $i$ into the vector $\mathbf{x}^{(i)}=[x_1^{(i)},\dots,x_n^{(i)}]^\top$. Then, the covariance between features $i$ and $j$ is given by: 
$$
\begin{align*}
    \hat{\Sigma}_{ij} &= \frac{1}{n-1}\sum_{k=1}^{n}\bigl(x^{(i)}_k-\bar{x}_i\bigr)\,\bigl(x^{(j)}_k-\bar{x}_j\bigr) \quad\Longrightarrow\quad\boxed{
\hat\Sigma_{ij}=\sigma_i\,\sigma_j\,\rho_{ij}}\\
\end{align*}
$$
where the mean $\bar{x}_i = \frac{1}{n}\sum_{k=1}^{n}x_k^{(i)}$ is the average of feature $i$ across all $n$ samples, the term $\sigma_{i} = \sqrt{\hat{\Sigma}_{ii}}$ denotes the standard deviation for feature $i$, and $\rho_{ij}\in\left[-1,1\right]$ denotes the correlation between features $i$ and $j$ in the dataset $\mathcal{D}$. But where does this come from? Let's break it down a bit more.
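
Start from the definition of the covariance between features $i$ and $j$, and recall that the correlation $\rho_{ij}$ is defined as the covariance normalized by the two standard deviations: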
$$
\begin{align*}
\hat{\Sigma}_{ij} &= \frac{1}{n-1}\sum_{k=1}^{n}\overbrace{\bigl(x^{(i)}_k-\bar{x}_i\bigr)}^{\text{deviation from mean}}\,\bigl(x^{(j)}_k-\bar{x}_j\bigr)\\
\sigma_i^2 & = \hat\Sigma_{ii}=\frac{1}{n-1}\sum_{k=1}^{n}(x_k^{(i)}-\bar x_i)^2,\quad\sigma_j^2=\hat\Sigma_{jj} = \frac{1}{n-1}\sum_{k=1}^{n}(x_k^{(j)}-\bar x_j)^2\\
\rho_{ij} &= \frac{\displaystyle
  \overbrace{\frac{1}{n-1}\sum_{k=1}^n\bigl(x^{(i)}_k-\bar{x}_i\bigr)\,\bigl(x^{(j)}_k-\bar{x}_j\bigr)}^{\hat\Sigma_{ij}}}{
  \underbrace{\sqrt{\frac{1}{n-1}\sum_{k=1}^n\bigl(x^{(i)}_k-\bar{x}_i\bigr)^2}}_{\sigma_i}
  \;\underbrace{\sqrt{\frac{1}{n-1}\sum_{k=1}^n\bigl(x^{(j)}_k-\bar{x}_j\bigr)^2}}_{\sigma_j}
} = \frac{\hat\Sigma_{ij}}{\sigma_i\,\sigma_j}  \quad\Longrightarrow\quad \boxed{\hat\Sigma_{ij} = \sigma_i\,\sigma_j\,\rho_{ij}\quad\blacksquare}
\end{align*}
$$ 
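
The identity above can be checked numerically. The following is a small sketch on synthetic data (the data, the sizes, and the injected correlation between features 0 and 1 are assumptions for illustration); it computes the covariance from its definition and compares it with $\sigma_i\,\sigma_j\,\rho_{ij}$, using NumPy's `np.corrcoef` for the correlation.

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 200, 3                            # samples and features (assumed)
X = rng.normal(size=(n, m))
X[:, 1] += 0.8 * X[:, 0]                 # make features 0 and 1 correlated

i, j = 0, 1
xi, xj = X[:, i], X[:, j]

# Covariance from the definition, with the 1/(n-1) normalization
cov_ij = np.sum((xi - xi.mean()) * (xj - xj.mean())) / (n - 1)

# Standard deviations with the same 1/(n-1) normalization (ddof=1)
sigma_i = np.std(xi, ddof=1)
sigma_j = np.std(xj, ddof=1)

# Pearson correlation between the two features
rho_ij = np.corrcoef(xi, xj)[0, 1]

print(cov_ij, sigma_i * sigma_j * rho_ij)   # the two numbers should agree
```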

However, we don't need the correlations $\rho_{ij}$ to compute the covariance matrix $\hat{\mathbf{\Sigma}}$. We can compute it directly from the data matrix $\mathbf{X} \in\mathbb{R}^{n \times m}$ (rows = samples, columns = features), where row $k$ contains the values of all $m$ features for sample $k$:
$$
\mathbf{X} = \begin{bmatrix}
x_1^{(1)} & x_1^{(2)} & \cdots & x_1^{(m)} \\
x_2^{(1)} & x_2^{(2)} & \cdots & x_2^{(m)} \\
\vdots & \vdots & \ddots & \vdots \\
x_n^{(1)} & x_n^{(2)} & \cdots & x_n^{(m)}
\end{bmatrix}
$$
To center the data, we need to subtract the mean for each feature. Let $\mathbf{m} = [\bar{x}_1, \bar{x}_2, \ldots, \bar{x}_m]^{\top}$ be the vector containing the mean for each feature. The centered data matrix is:
$$
\tilde{\mathbf{X}} = \mathbf{X} - \mathbf{1}\mathbf{m}^{\top}
$$
where $\mathbf{1} \in \mathbb{R}^{n}$ is a vector of ones, and $\mathbf{1}\mathbf{m}^{\top}$ creates an $n \times m$ matrix where each row is identical and contains the means. 
> __Outer product:__ The $\mathbf{1}\mathbf{m}^{\top}$ is an example of an outer product. The [outer product](https://en.wikipedia.org/wiki/Outer_product) of two vectors $\mathbf{a} \in \mathbb{R}^{n}$ and $\mathbf{b} \in \mathbb{R}^{m}$ is the $n \times m$ matrix $\mathbf{a}\mathbf{b}^{\top}$. Each element of the outer product is computed as $(\mathbf{a}\mathbf{b}^{\top})_{ij} = a_i b_j$. 

The empirical covariance matrix is then:
$$
\hat{\mathbf{\Sigma}} = \frac{1}{n-1}\tilde{\mathbf{X}}^{\top}\tilde{\mathbf{X}}
$$
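
A minimal sketch of this matrix computation, assuming synthetic data and NumPy: it centers the data with the outer product $\mathbf{1}\mathbf{m}^{\top}$, forms $\hat{\mathbf{\Sigma}}$, and compares the result with NumPy's built-in estimator `np.cov` (with `rowvar=False` because samples are stored in rows).

```python
import numpy as np

rng = np.random.default_rng(2)
n, m = 50, 4                             # samples and features (assumed)
X = rng.normal(size=(n, m))              # rows = samples, columns = features

m_vec = X.mean(axis=0)                   # feature means, shape (m,)
ones = np.ones(n)
X_tilde = X - np.outer(ones, m_vec)      # centered data: X - 1 m^T
Sigma_hat = (X_tilde.T @ X_tilde) / (n - 1)

Sigma_np = np.cov(X, rowvar=False)       # NumPy's estimator, also normalized by n - 1
print(np.allclose(Sigma_hat, Sigma_np))
```

In practice, NumPy broadcasting makes `X - m_vec` equivalent to the explicit outer-product form; the outer product is spelled out here only to mirror the equation.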

> __Covariance Matrix Properties:__
>
> The covariance matrix $\hat{\mathbf{\Sigma}}$ has the following (important) properties:
> * __Elements__: The diagonal elements $\hat{\Sigma}_{ii}$ of the covariance matrix are the variances of the features (always non-negative), while the off-diagonal elements $\hat{\Sigma}_{ij}$ for $i\neq{j}$ measure the covariance between features $i$ and $j$ in the dataset $\mathcal{D}$. The sign of $\hat{\Sigma}_{ij}$ gives the direction of the linear relationship between features $i$ and $j$; its magnitude depends on the scales of the two features, so the correlation $\rho_{ij}$ is the scale-free measure of strength.
>
> * __Positive, negative or no relationship?__: If $\hat{\Sigma}_{ij} > 0$, then features $i$ and $j$ are positively correlated, meaning that when one feature increases above its mean, the other feature also tends to increase above its mean. If $\hat{\Sigma}_{ij} < 0$, then features $i$ and $j$ are negatively correlated, meaning that when one feature increases above its mean, the other feature tends to decrease below its mean. If $\hat{\Sigma}_{ij} = 0$, then features $i$ and $j$ are uncorrelated, meaning that there is no linear relationship between the two features.
>
> * __Symmetry__: The covariance matrix $\hat{\mathbf{\Sigma}}$ is symmetric, meaning that $\hat{\Sigma}_{ij} = \hat{\Sigma}_{ji}$ for all $i$ and $j$. This follows directly from the definition of covariance.
>
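> * __Positive Semi-Definite__: The covariance matrix $\hat{\mathbf{\Sigma}}$ is positive semi-definite, meaning that for any vector $\mathbf{v} \in \mathbb{R}^m$, we have $\mathbf{v}^{\top}\hat{\mathbf{\Sigma}}\mathbf{v} \geq 0$. This follows from $\mathbf{v}^{\top}\hat{\mathbf{\Sigma}}\mathbf{v} = \frac{1}{n-1}\lVert\tilde{\mathbf{X}}\mathbf{v}\rVert_2^2$, which is the sample variance of the linear combination $\tilde{\mathbf{X}}\mathbf{v}$ of the features. It is also what guarantees that the eigenvalues of $\hat{\mathbf{\Sigma}}$ are non-negative.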

### Decomposing the Covariance Matrix
Now to the magical part! The (empirical) covariance matrix $\hat{\mathbf{\Sigma}}$ can be decomposed into its eigenvalues and eigenvectors. But why is this so important? Many reasons, but let's look at a few of the most important ones. Because $\hat{\mathbf{\Sigma}}$ is real, symmetric, and positive semi-definite, its eigenpairs $(\mathbf{v}_i,\lambda_i)$ have several special and useful properties:

1. **Orthonormal eigenvectors**: The eigenvectors $\mathbf{v}_i$ of the covariance matrix $\hat{\mathbf{\Sigma}}$ are orthonormal, meaning that they are mutually orthogonal and have unit length. This means that $\mathbf{V}^\top \mathbf{V} = \mathbf{I}$, where $\mathbf{V}=[\,\mathbf{v}_1,\dots,\mathbf{v}_m]\in\mathbb{R}^{m\times m}$ is the matrix of eigenvectors. Thus, we can write the covariance matrix in terms of its eigenvectors and eigenvalues as:

   $$
   \hat{\mathbf{\Sigma}} \;=\; \mathbf{V}\,\mathbf{\Lambda}\,\mathbf{V}^\top,
   $$

   where $\mathbf{\Lambda}$ is the diagonal matrix of eigenvalues and the $i$-th column of $\mathbf{V}$ is the eigenvector $\mathbf{v}_i$ associated with the eigenvalue $\lambda_i$, i.e., $\hat{\mathbf{\Sigma}}\mathbf{v}_i = \lambda_i\mathbf{v}_i$.

2. **Non-negative eigenvalues**: All eigenvalues $\lambda_i$ of the covariance matrix $\hat{\mathbf{\Sigma}}$ are non-negative, i.e., $\lambda_i \ge 0$. This is because the covariance matrix is positive semi-definite. The eigenvalues can be arranged in descending order:

   $$
   \mathbf{\Lambda}=\operatorname{diag}(\lambda_1,\dots,\lambda_m),
   \quad
   \lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_m \geq 0.
   $$

   Each $\lambda_i$ equals the _variance of the data_ along the direction $\mathbf{v}_i$. That seems to be a very important property, so let's unpack it a bit more.

3. **Principal components**: Projecting the centered data vector $\mathbf{x}-\bar{\mathbf{x}}$ onto $\mathbf{v}_i$ gives the $i$-th principal component:

   $$
   y_i \;=\; \mathbf{v}_i^\top(\mathbf{x}-\bar{\mathbf{x}}),
   $$

   whose variance across the dataset is exactly $\lambda_i$. Thus, the eigenvalue $\lambda_i$ is the amount of variance captured along the principal direction $\mathbf{v}_i$. The first principal direction $\mathbf{v}_1$ captures the most variance, the second principal direction $\mathbf{v}_2$ captures the second most variance, and so on.

4. **Optimal low-rank approximation**: Keeping only the top $k$ eigenvectors (the columns of $\mathbf{V}_k=[\mathbf{v}_1,\dots,\mathbf{v}_k]$) yields the best $k$-dimensional linear reconstruction of the data in the least-squares sense.  In other words, the subspace spanned by $\{\mathbf{v}_1,\dots,\mathbf{v}_k\}$ maximizes retained variance and minimizes reconstruction error. We'll see this in action later in this module!

These properties make the covariance matrix eigenpairs $(\mathbf{v}_i,\lambda_i)$ the foundation of Principal Component Analysis (PCA), where eigenvectors provide maximum variance directions and eigenvalues measure captured variance. 
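
The four properties can be checked on a small synthetic dataset. This is a sketch under stated assumptions: the data, the hypothetical mixing matrix `A_mix` (used only to make the features correlated), and the choice `k = 2` are illustrative. It uses `np.linalg.eigh`, which is intended for symmetric matrices and returns eigenvalues in ascending order, so they are re-sorted to match the descending convention used here.

```python
import numpy as np

rng = np.random.default_rng(3)
n, m, k = 500, 4, 2                               # samples, features, kept components (assumed)
A_mix = rng.normal(size=(m, m))                   # hypothetical mixing matrix to correlate features
X = rng.normal(size=(n, m)) @ A_mix
x_bar = X.mean(axis=0)
X_tilde = X - x_bar
Sigma_hat = (X_tilde.T @ X_tilde) / (n - 1)

# Eigendecomposition of the symmetric covariance matrix
lam, V = np.linalg.eigh(Sigma_hat)                # ascending eigenvalues
order = np.argsort(lam)[::-1]                     # indices for descending order
lam, V = lam[order], V[:, order]

# Properties 1 and 2: orthonormal eigenvectors, non-negative eigenvalues
print(np.allclose(V.T @ V, np.eye(m)))
print(np.all(lam >= -1e-12))                      # allow for floating-point round-off

# Property 3: the variance of the i-th principal component equals lambda_i
Y = X_tilde @ V                                   # column i holds y_i for every sample
var_Y = Y.var(axis=0, ddof=1)
print(np.allclose(var_Y, lam))

# Property 4: keep the top-k directions and reconstruct
V_k = V[:, :k]
X_hat = (X_tilde @ V_k) @ V_k.T + x_bar           # rank-k reconstruction in the original space
retained = lam[:k].sum() / lam.sum()              # fraction of total variance retained
print(retained)
```

The squared reconstruction error of the centered data, divided by $n-1$, equals the sum of the discarded eigenvalues, which is why retained variance and reconstruction error are two views of the same quantity.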

### From Eigendecomposition to SVD
The covariance matrix approach is elegant, but it has a limitation: it only works for square, symmetric matrices like $\hat{\mathbf{\Sigma}}$. What if we want to directly factor our data matrix $\tilde{\mathbf{X}}\in\mathbb{R}^{n\times m}$, which might be rectangular (more samples than features, or vice versa)?

> **Enter SVD!** The Singular Value Decomposition provides a factorization that works for *any* rectangular matrix, not just symmetric ones. Even better, SVD applied to the centered data matrix $\tilde{\mathbf{X}}$ gives us the same principal directions as the eigendecomposition of $\hat{\mathbf{\Sigma}}$; the two approaches are mathematically equivalent!

This is why SVD has become the workhorse of dimensionality reduction: it's more general, typically more numerically stable (it avoids explicitly forming $\tilde{\mathbf{X}}^{\top}\tilde{\mathbf{X}}$), and works directly on your data. Let's explore this powerful alternative approach.
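
To make the equivalence concrete: if $\tilde{\mathbf{X}} = \mathbf{U}\mathbf{S}\mathbf{V}^{\top}$, then $\hat{\mathbf{\Sigma}} = \frac{1}{n-1}\tilde{\mathbf{X}}^{\top}\tilde{\mathbf{X}} = \mathbf{V}\,\frac{\mathbf{S}^{\top}\mathbf{S}}{n-1}\,\mathbf{V}^{\top}$, so the right singular vectors are eigenvectors of $\hat{\mathbf{\Sigma}}$ and $\lambda_i = s_i^2/(n-1)$, where $s_i$ denotes the $i$-th singular value of $\tilde{\mathbf{X}}$. The sketch below checks this on synthetic data (the data and sizes are assumptions). Eigenvectors are only defined up to sign, so the comparison uses the absolute value of the inner product between matching directions.

```python
import numpy as np

rng = np.random.default_rng(4)
n, m = 300, 5                                     # samples and features (assumed)
X = rng.normal(size=(n, m)) @ rng.normal(size=(m, m))
X_tilde = X - X.mean(axis=0)

# Route 1: eigendecomposition of the covariance matrix
Sigma_hat = (X_tilde.T @ X_tilde) / (n - 1)
lam, V_eig = np.linalg.eigh(Sigma_hat)            # ascending order
lam, V_eig = lam[::-1], V_eig[:, ::-1]            # flip to descending order

# Route 2: SVD of the centered data matrix (singular values come back descending)
U, s, Vt = np.linalg.svd(X_tilde, full_matrices=False)
lam_svd = s**2 / (n - 1)

# Compare eigenvalues, and directions up to a sign flip
alignment = np.abs(np.sum(V_eig * Vt.T, axis=0))  # |v_i^T v_i'| for each i
print(np.allclose(lam, lam_svd), np.allclose(alignment, 1.0))
```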

___

## Singular Value Decomposition (SVD)
The singular value decomposition _factors_ a matrix into three components. 
Given a matrix $\mathbf{A}\in\mathbb{R}^{m\times n}$ of rank $r(\mathbf{A})$, the **full** SVD is given by the factorization:
$$
\mathbf{A} = \mathbf{U}\,\mathbf{S}\,\mathbf{V}^\top,
$$

where:
* $\mathbf{U}\in\mathbb{R}^{m\times m}$ is orthogonal ($\mathbf{U}^\top \mathbf{U} = \mathbf{I}_m$),
* $\mathbf{S}\in\mathbb{R}^{m\times n}$ is rectangular diagonal, with the singular values $\sigma_1\ge\cdots\ge\sigma_r>0$ on the first $r = r(\mathbf{A})$ diagonal entries and zeros elsewhere. The nonzero singular values are the square roots of the nonzero eigenvalues of the matrix $\mathbf{A}^{\top}\mathbf{A}$, i.e., $\sigma_i = \sqrt{\lambda_i(\mathbf{A}^\top\mathbf{A})}$. The number of nonzero singular values is [the rank](https://en.wikipedia.org/wiki/Rank_(linear_algebra)) of the matrix $\mathbf{A}$, where $r(\mathbf{A}) \leq\min\left(n,m\right)$. (Here $\sigma_i$ denotes a singular value, not the standard deviation $\sigma_i$ used in the covariance section.)
* $\mathbf{V}\in\mathbb{R}^{n\times n}$ is orthogonal ($\mathbf{V}^\top \mathbf{V} = \mathbf{I}_n$).

__The full SVD__ stores **all** $m$ left‐singular vectors and all $n$ right‐singular vectors, even though only the first $r(\mathbf{A})$ correspond to nonzero singular values.

* **Pros:** You have the complete orthonormal bases for both row‐ and column‐spaces.
* **Cons:** Memory requirement is $\mathcal{O}(m^2 + n^2)$, which can be wasteful if $r(\mathbf{A})\ll\min(m,n)$. When you have a low rank matrix, you can save memory by storing only the first $r(\mathbf{A})$ left‐ and right‐singular vectors, which is the idea underlying the **thin SVD**.

__Thin SVD__: The thin factorization keeps only the first $r$ singular values and vectors, $\mathbf{A} = \mathbf{U}_r\,\mathbf{S}_r\,\mathbf{V}_r^\top$, where $\mathbf{U}_r\in\mathbb{R}^{m\times r}$, $\mathbf{V}_r\in\mathbb{R}^{n\times r}$, $\mathbf{S}_r\in\mathbb{R}^{r\times r}$, and $r = r(\mathbf{A})$ is the rank of the matrix $\mathbf{A}$. Use the thin SVD whenever you only care about the nonzero singular values/directions (which is most of the time!).
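
The sketch below compares the factor shapes for a low-rank matrix built for illustration (the sizes, the seed, and the rank tolerance `tol` are assumptions). One caveat on naming: NumPy's `full_matrices=False` option returns the economy-size factors with $\min(m,n)$ columns, which many references call the thin SVD, while the rank-$r$ version described above is sometimes called the compact SVD. Here the rank-$r$ factors are obtained by slicing.

```python
import numpy as np

rng = np.random.default_rng(5)
m, n, r = 8, 5, 2
A = rng.normal(size=(m, r)) @ rng.normal(size=(r, n))    # rank 2 by construction

# Full SVD: all m left and all n right singular vectors
U, s, Vt = np.linalg.svd(A, full_matrices=True)
print(U.shape, s.shape, Vt.shape)      # (8, 8) (5,) (5, 5)

# Economy-size SVD: min(m, n) singular vectors on each side
U_e, s_e, Vt_e = np.linalg.svd(A, full_matrices=False)
print(U_e.shape, Vt_e.shape)           # (8, 5) (5, 5)

# Rank-r factors: keep only singular values above a tolerance (assumed threshold)
tol = 1e-10 * s.max()
rank = int(np.sum(s > tol))
U_r, S_r, Vt_r = U[:, :rank], np.diag(s[:rank]), Vt[:rank, :]
print(rank, np.allclose(A, U_r @ S_r @ Vt_r))
```

The rank-$r$ factors store $r(m+n+1)$ numbers instead of the $m^2+n^2$ entries of the full $\mathbf{U}$ and $\mathbf{V}$, which is where the memory savings come from.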

### Dimensionality Reduction?
The SVD decomposes any rectangular matrix into a weighted sum of rank-1 blocks, making it ideal for dimensionality reduction and data compression. Let $\mathbf{A}\in\mathbb{R}^{m\times{n}}$ have the singular value decomposition $\mathbf{A} = \mathbf{U}\mathbf{S}\mathbf{V}^{\top}$. Then, the matrix $\mathbf{A}$ can be written as:
$$
\mathbf{A} = \sum_{i=1}^{r(\mathbf{A})}\sigma_{i}\,\underbrace{\left(\mathbf{u}_{i}\otimes\mathbf{v}_{i}\right)}_{\mathbf{u}_{i}\mathbf{v}_{i}^{\top}}
$$
where $r(\mathbf{A})$ is the rank of matrix $\mathbf{A}$, the vectors $\mathbf{u}_{i}$ and $\mathbf{v}_{i}$ are the $i$-th left and right singular vectors, and $\sigma_{i}$ are the (ordered) singular values. The [outer-product](https://en.wikipedia.org/wiki/Outer_product) $\mathbf{u}_{i}\mathbf{v}_{i}^{\top}$ is a separable rank-1 block of the matrix $\mathbf{A}$. 
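
A short sketch of this expansion, assuming a small random matrix: it rebuilds $\mathbf{A}$ by accumulating the weighted rank-1 blocks $\sigma_i\,\mathbf{u}_i\mathbf{v}_i^{\top}$ one at a time.

```python
import numpy as np

rng = np.random.default_rng(6)
m, n = 6, 4                                   # matrix size (assumed)
A = rng.normal(size=(m, n))
U, s, Vt = np.linalg.svd(A, full_matrices=False)

A_sum = np.zeros_like(A)
for i in range(len(s)):
    block = np.outer(U[:, i], Vt[i, :])       # rank-1 block u_i v_i^T (row i of Vt is v_i^T)
    A_sum += s[i] * block                     # weight the block by its singular value

print(np.allclose(A, A_sum))
```

Because the singular values are ordered, the early terms of the sum carry the largest weights, and stopping the loop early gives a truncated approximation.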

> __Eckart–Young theorem__
>
> By truncating the sum at $k < r(\mathbf{A})$ terms, with singular values $\sigma_1\ge\sigma_2\ge\cdots\ge\sigma_r>0$, the SVD yields the best rank-$k$ approximation in both the Frobenius and spectral norms (Eckart–Young theorem):
>$$
\mathbf{A}_k = \sum_{i=1}^k \sigma_i\,\mathbf{u}_i\,\mathbf{v}_i^\top.
$$
> For any other matrix $\mathbf{B}$ of rank at most $k$,
> $$
\|\mathbf{A} - \mathbf{A}_k\|_F \;\le\;\|\mathbf{A} - \mathbf{B}\|_F,\quad\|\mathbf{A} - \mathbf{A}_k\|_2 \;\le\;\|\mathbf{A} - \mathbf{B}\|_2,
$$
> where $\|\cdot\|_F$ is the Frobenius norm and $\|\cdot\|_2$ the operator (spectral) norm.  In other words, $\mathbf{A}_k$ is the best rank-$k$ approximation to $\mathbf{A}$ under both measures of error. 

Thus, truncating the SVD isn't just a heuristic; it's *provably* the closest rank-$k$ matrix to $\mathbf{A}$, whether you measure closeness by total squared error (Frobenius) or by maximum stretch (spectral).
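
The truncation errors have closed forms: $\|\mathbf{A}-\mathbf{A}_k\|_2 = \sigma_{k+1}$ and $\|\mathbf{A}-\mathbf{A}_k\|_F = \sqrt{\sum_{i>k}\sigma_i^2}$. The sketch below checks both on a random matrix and compares $\mathbf{A}_k$ against one randomly generated rank-$k$ competitor `B`. The matrix, the sizes, and the competitor are assumptions for illustration, and a single comparison illustrates the theorem rather than proving it.

```python
import numpy as np

rng = np.random.default_rng(7)
m, n, k = 20, 15, 3                                  # matrix size and truncation rank (assumed)
A = rng.normal(size=(m, n))
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Rank-k truncation: keep the k largest singular values and their vectors
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

err_2 = np.linalg.norm(A - A_k, ord=2)               # spectral norm of the residual
err_F = np.linalg.norm(A - A_k, ord="fro")           # Frobenius norm of the residual
print(np.isclose(err_2, s[k]))                       # s[k] is sigma_{k+1} in 1-based notation
print(np.isclose(err_F, np.sqrt(np.sum(s[k:]**2))))

# One random rank-k competitor B (an illustration, not a proof)
B = rng.normal(size=(m, k)) @ rng.normal(size=(k, n))
print(err_F <= np.linalg.norm(A - B, ord="fro"),
      err_2 <= np.linalg.norm(A - B, ord=2))
```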


Let's look at an example of each of these methods in action.

> __Example__
>
> [▶ Fun with Singular Value Decomposition](CHEME-5800-L7a-Example-FunWithSVD-Fall-2025.ipynb). In this example, students will decompose a grayscale image using singular value decomposition and understand what this mathematical technique reveals about the structure of data. You'll learn how SVD breaks down complex information into simpler, ranked components that capture different amounts of detail.

___

## Lab
In L7b, we will take a deeper dive into singular value decomposition (SVD). We will explore how to compute the SVD of a matrix, understand its properties, and apply SVD for dimensionality reduction on a real-world dataset.

## Summary
In this lecture, we explored the mathematical foundations of dimensionality reduction, focusing on how matrices can be decomposed to reveal the most important directions and patterns in high-dimensional data.

> __Key Takeaways:__
>
> * **The covariance matrix eigendecomposition reveals the principal directions of variation in data** - By decomposing the empirical covariance matrix $\hat{\mathbf{\Sigma}}$ into its eigenvectors $\mathbf{v}_i$ and eigenvalues $\lambda_i$, we identify orthogonal directions that capture maximum variance. The eigenvectors define the principal components, while their corresponding eigenvalues quantify how much variance each direction explains, providing a natural ranking for dimensionality reduction.
>
> * **SVD provides an optimal low-rank approximation for any matrix** - The singular value decomposition factors any matrix $\mathbf{A}$ into orthogonal components $\mathbf{U}$, $\mathbf{S}$, and $\mathbf{V}$, where truncating to the top-$k$ singular values yields the provably best rank-$k$ approximation under both Frobenius $\|\cdot\|_F$ and spectral $\|\cdot\|_2$ norms (Eckart–Young theorem). This makes SVD ideal for data compression and dimensionality reduction applications.
>
> * **PCA and SVD are mathematically equivalent approaches to dimensionality reduction** - Principal Component Analysis via covariance matrix eigendecomposition and SVD of centered data are two sides of the same coin: the right singular vectors of the centered data matrix $\tilde{\mathbf{X}}$ are the eigenvectors of the covariance matrix $\hat{\mathbf{\Sigma}}$, and each eigenvalue is the corresponding squared singular value of $\tilde{\mathbf{X}}$ divided by $n-1$. This duality provides flexibility in how we approach dimensionality reduction problems.

These mathematical techniques form the backbone of modern data analysis, enabling us to extract meaningful low-dimensional representations from complex, high-dimensional datasets while preserving the most important information.
___