# Singular Value Decomposition (SVD) and Ordinary Least Squares
Singular Value Decomposition (SVD) is a powerful mathematical tool used in various fields, including statistics, machine learning, and data science. 

> __Motivation:__ SVD in the context of Ordinary Least Squares (OLS) regression provides a way to understand the geometry of the data and the solution to the regression problem. It helps in identifying the directions in which the data varies the most, which can be crucial for understanding the underlying structure of the data. It also gives a principled way to construct approximate solutions for the parameter estimates in the presence of multicollinearity or when the design matrix is not of full rank.

> __Learning Objectives__
>
> By the end of this notebook, you will be able to:
> - **See how SVD breaks down regression solutions** into individual mode contributions. You'll understand how each singular vector pair contributes to your parameter estimates and why this perspective is so insightful.
> - **Compare SVD and direct matrix approaches** for linear regression problems. You'll learn about the theoretical advantages of SVD and understand when each method is most appropriate.
> - **Understand how regularization works** through the SVD lens. You'll see how Ridge regression selectively shrinks different modes and why smaller singular values get penalized more heavily.

Wow! That seems interesting. Let's dig in!
___

## SVD solution for Overdetermined Systems
The singular value decomposition (SVD) of the $n\times{p}$ data matrix $\hat{\mathbf{X}}$ is given by:
$$
\begin{equation*}
\hat{\mathbf{X}} = \mathbf{U}\;\mathbf{\Sigma}\;\mathbf{V}^{T}
\end{equation*}
$$
where $\mathbf{U} \in \mathbb{R}^{n \times n}$ is an orthogonal matrix, $\mathbf{\Sigma} \in \mathbb{R}^{n \times p}$ is a rectangular diagonal matrix with the non-negative singular values on its main diagonal (sorted in decreasing order),
and $\mathbf{V} \in \mathbb{R}^{p \times p}$ is an orthogonal matrix. The least-squares estimate of the unknown parameter vector $\mathbf{\theta}$ (the minimum-norm one, if the solution is not unique) is given by:
$$
\begin{equation*}
\hat{\mathbf{\theta}} = \mathbf{V}\;\mathbf{\Sigma}^{\dagger}\;\mathbf{U}^{T}\;\mathbf{y}
\end{equation*}
$$
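where $\mathbf{\Sigma}^{\dagger} \in \mathbb{R}^{p \times n}$ is the Moore-Penrose pseudoinverse of $\mathbf{\Sigma}$, obtained by transposing $\mathbf{\Sigma}$ and replacing each non-zero singular value $\sigma_i$ with $1/\sigma_i$. Expanding the matrix product gives a sum over modes: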
$$
\begin{equation*}
\hat{\mathbf{\theta}} = \sum_{i=1}^{r_{\hat{X}}}\left(\frac{\mathbf{u}_{i}^{T}\mathbf{y}}{\sigma_{i}}\right)\mathbf{v}_{i}
\end{equation*}
$$
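where $r_{\hat{X}} \leq \min(n,p)$ is the rank of the data matrix $\hat{\mathbf{X}}$ (with equality when $\hat{\mathbf{X}}$ has full rank), $\mathbf{u}_{i}$ and $\mathbf{v}_{i}$ are the $i$-th columns of $\mathbf{U}$ and $\mathbf{V}$, respectively, and $\sigma_{i}$ is the $i$-th singular value. The sum runs only over the strictly positive singular values, $\sigma_1 \geq \dots \geq \sigma_{r_{\hat{X}}} > 0$.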
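
The mode-by-mode sum above can be sketched in a few lines of NumPy. This is a minimal illustration on synthetic data, not a reference implementation: the matrix sizes, the random seed, `theta_true`, and the rank tolerance are all assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)            # fixed seed, illustrative data only
n, p = 50, 3
X = rng.normal(size=(n, p))               # hypothetical design matrix
theta_true = np.array([1.0, -2.0, 0.5])   # hypothetical "true" parameters
y = X @ theta_true + 0.1 * rng.normal(size=n)

# Thin SVD: U is n x p, s holds the singular values, Vt is V transposed
U, s, Vt = np.linalg.svd(X, full_matrices=False)

# Numerical rank: count singular values above a small tolerance (an assumption)
tol = max(n, p) * np.finfo(float).eps * s[0]
r = int(np.sum(s > tol))

# Accumulate the contributions (u_i^T y / sigma_i) v_i, one mode at a time
theta_svd = np.zeros(p)
for i in range(r):
    theta_svd += (U[:, i] @ y) / s[i] * Vt[i, :]

# Reference solution from NumPy's least-squares solver
theta_ref = np.linalg.lstsq(X, y, rcond=None)[0]
```

Here `theta_svd` and `theta_ref` should agree to floating-point precision. The explicit loop is kept only to mirror the formula; a vectorized version would be `Vt[:r].T @ ((U[:, :r].T @ y) / s[:r])`.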

### Key Insights:

* **Mode contribution decomposition**: The SVD solution beautifully shows how different modes (singular vectors) contribute to the parameter estimates. Each term $\left(\frac{\mathbf{u}_{i}^{T}\mathbf{y}}{\sigma_{i}}\right)\mathbf{v}_{i}$ represents the contribution of the $i$-th mode: the projection $\mathbf{u}_{i}^{T}\mathbf{y}$ of the response onto the $i$-th data-space direction, scaled by $1/\sigma_{i}$.

* **Geometric interpretation**: The vectors $\mathbf{v}_{i}$ represent orthogonal directions in parameter space, while $\mathbf{u}_{i}$ represent corresponding directions in the data space. The solution is constructed by projecting $\mathbf{y}$ onto each data direction and mapping it to the corresponding parameter direction.

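* **Handling problematic data**: When your data matrix doesn't have full rank (meaning some columns are redundant or linearly dependent), the least-squares solution is no longer unique. The zero singular values flag the redundant directions, and dropping them from the sum yields the minimum-norm solution: among all parameter vectors that minimize the residual, the one with the smallest Euclidean norm $\|\mathbf{\theta}\|_2$. In floating-point arithmetic, "zero" means "below a small tolerance".

* **Better numerical behavior**: SVD is more numerically robust than computing $(\hat{\mathbf{X}}^T\hat{\mathbf{X}})^{-1}\hat{\mathbf{X}}^T\mathbf{y}$ directly, especially when your data matrix is poorly conditioned (meaning small changes in the data can cause large changes in the solution). Forming $\hat{\mathbf{X}}^T\hat{\mathbf{X}}$ squares the condition number of $\hat{\mathbf{X}}$, whereas SVD works on $\hat{\mathbf{X}}$ itself. Think of it as a more stable computational path to the same answer.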

* **Noise amplification insights**: The term $\sigma_i^{-1}$ immediately shows which modes amplify noise the most. When $\sigma_i$ is very small, dividing by it creates large values, meaning small measurement errors in $\mathbf{y}$ get magnified in those directions. This helps you identify which parts of your solution might be unreliable.
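
The noise amplification point can be checked numerically. The sketch below builds a hypothetical design matrix with two chosen singular values (10 and 0.001), perturbs $\mathbf{y}$ by a small amount `delta` along each $\mathbf{u}_i$, and measures how far the solution moves. The sizes, seed, and values are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(1)
Q1, _ = np.linalg.qr(rng.normal(size=(20, 2)))   # orthonormal columns (data space)
Q2, _ = np.linalg.qr(rng.normal(size=(2, 2)))    # orthogonal matrix (parameter space)
sigma = np.array([10.0, 1e-3])                   # one large, one tiny singular value
X = Q1 @ np.diag(sigma) @ Q2.T                   # hypothetical ill-conditioned design
y = rng.normal(size=20)

U, s, Vt = np.linalg.svd(X, full_matrices=False)

def solve(y_vec):
    """Least-squares solution V diag(1/sigma) U^T y via the thin SVD."""
    return Vt.T @ ((U.T @ y_vec) / s)

delta = 1e-6                                     # size of the perturbation in y
theta = solve(y)

# Shift in the solution when y is perturbed along u_1 and along u_2
shifts = [np.linalg.norm(solve(y + delta * U[:, i]) - theta) for i in range(2)]
```

Each entry of `shifts` should be close to `delta / s[i]`, so the same perturbation moves the solution roughly ten thousand times further along the mode with the tiny singular value.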

___

## SVD solution for Regularized Least Squares
Let the singular value decomposition (SVD) of the $n\times{p}$ data matrix $\hat{\mathbf{X}}$ be given by:
$$
\begin{equation*}
\hat{\mathbf{X}} = \mathbf{U}\;\mathbf{\Sigma}\;\mathbf{V}^{T}
\end{equation*}
$$
where $\mathbf{U} \in \mathbb{R}^{n \times n}$ is an orthogonal matrix, $\mathbf{\Sigma} \in \mathbb{R}^{n \times p}$ is a rectangular diagonal matrix with the non-negative singular values on its main diagonal,
and $\mathbf{V} \in \mathbb{R}^{p \times p}$ is an orthogonal matrix. The regularized least-squares estimate (Ridge regression) of the unknown parameter vector $\mathbf{\theta}$ is given by:
$$
\begin{equation*}
\hat{\mathbf{\theta}}_{\lambda} = \mathbf{V}\left(\mathbf{\Sigma}^{T}\mathbf{\Sigma}+\lambda\mathbf{I}\right)^{-1}\mathbf{\Sigma}^{T}\mathbf{U}^{T}\mathbf{y}
\end{equation*}
$$
or equivalently, as a sum over modes (since $\mathbf{\Sigma}^{T}\mathbf{\Sigma}+\lambda\mathbf{I}$ is diagonal, its inverse acts on each mode separately):
$$
\begin{equation*}
\hat{\mathbf{\theta}}_{\lambda} = \sum_{i=1}^{r_{\hat{X}}}\left(\frac{\sigma_{i}}{\sigma_{i}^{2}+\lambda}\right)\left(\mathbf{u}_{i}^{T}\mathbf{y}\right)\mathbf{v}_{i}
\end{equation*}
$$
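where $r_{\hat{X}} \leq \min(n,p)$ is the rank of the data matrix $\hat{\mathbf{X}}$ (with equality when $\hat{\mathbf{X}}$ has full rank), $\mathbf{u}_{i}$ and $\mathbf{v}_{i}$ are the $i$-th columns of $\mathbf{U}$ and $\mathbf{V}$, respectively,
$\sigma_{i}$ is the $i$-th singular value (with $\sigma_i > 0$ for $i \leq r_{\hat{X}}$), and $\lambda \geq 0$ is the regularization parameter. For $\lambda > 0$ the matrix $\mathbf{\Sigma}^{T}\mathbf{\Sigma}+\lambda\mathbf{I}$ is always invertible, and modes with $\sigma_i = 0$ contribute nothing to the sum.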
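
As a quick sanity check, the mode-by-mode Ridge formula can be compared against a direct solve of the regularized normal equations. This is a minimal sketch on synthetic data; the sizes, the seed, and the value `lam = 0.5` are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 40, 4
X = rng.normal(size=(n, p))    # hypothetical design matrix
y = rng.normal(size=n)         # hypothetical response
lam = 0.5                      # regularization parameter (arbitrary choice)

U, s, Vt = np.linalg.svd(X, full_matrices=False)

# Per-mode coefficients: sigma_i / (sigma_i^2 + lambda) * (u_i^T y)
coeffs = s / (s**2 + lam) * (U.T @ y)
theta_ridge_svd = Vt.T @ coeffs            # sum of coeffs[i] * v_i

# Reference: solve (X^T X + lambda I) theta = X^T y directly
theta_ridge_direct = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
```

The two estimates should match to floating-point precision for this well-conditioned example. Once the SVD is available, trying another value of `lam` only requires recomputing `coeffs`, which is convenient when scanning over many regularization strengths.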

### Key Insights:

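* **Shrinkage effect**: Regularization replaces the factor $\frac{1}{\sigma_{i}}$ of the unregularized solution with $\frac{\sigma_{i}}{\sigma_{i}^{2}+\lambda}$. Equivalently, the contribution of each mode is multiplied by the filter factor $\frac{\sigma_{i}^{2}}{\sigma_{i}^{2}+\lambda} \in (0, 1]$, which is close to 1 when $\sigma_{i}^{2} \gg \lambda$ and close to 0 when $\sigma_{i}^{2} \ll \lambda$. Modes with smaller singular values are therefore shrunk more aggressively.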
* **Numerical stability**: This SVD formulation is more numerically stable than directly computing $(\hat{\mathbf{X}}^T\hat{\mathbf{X}} + \lambda\mathbf{I})^{-1}$, especially when $\hat{\mathbf{X}}$ is ill-conditioned.
* **Limiting cases**: When $\lambda = 0$, every mode passes through unchanged and we recover the unregularized (minimum-norm) solution. As $\lambda \to \infty$, every mode is suppressed and all coefficients shrink to zero.
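
To make the shrinkage concrete, the sketch below evaluates the ratio between the Ridge and unregularized coefficients of each mode, $\sigma_i^2/(\sigma_i^2+\lambda)$, for a few values of $\lambda$. The singular values and the helper name `filter_factors` are hypothetical, chosen only for illustration.

```python
import numpy as np

s = np.array([10.0, 1.0, 0.1])   # hypothetical singular values, largest first

def filter_factors(s, lam):
    """Ratio of the Ridge coefficient to the unregularized coefficient, per mode."""
    return s**2 / (s**2 + lam)

# Print the per-mode factors for increasing regularization strength
for lam in [0.0, 0.01, 1.0, 100.0]:
    print(f"lambda = {lam:6.2f}  factors = {filter_factors(s, lam)}")
```

With `lam = 0.0` all factors equal 1. For `lam = 1.0` the mode with $\sigma = 10$ is almost untouched (factor $100/101$), the mode with $\sigma = 1$ is halved, and the mode with $\sigma = 0.1$ is reduced to about 1% of its unregularized value.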

___

## SVD vs. Direct Methods: When and Why?

The SVD approach provides an alternative to the direct matrix inversion methods we studied in the previous notebook. Let's compare the two approaches:

We can solve the linear regression problem using the direct approach:
$$\hat{\mathbf{\theta}} = (\hat{\mathbf{X}}^T\hat{\mathbf{X}})^{-1}\hat{\mathbf{X}}^T\mathbf{y}$$

or using the SVD approach:
$$\hat{\mathbf{\theta}} = \mathbf{V}\mathbf{\Sigma}^{\dagger}\mathbf{U}^T\mathbf{y}$$

> __SVD versus direct method?__
> 
> **Rank-deficient matrices**: When $\hat{\mathbf{X}}$ doesn't have full rank, $\hat{\mathbf{X}}^T\hat{\mathbf{X}}$ becomes non-invertible. SVD handles this gracefully with the pseudoinverse.
>
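> **Ill-conditioned problems**: Forming $\hat{\mathbf{X}}^T\hat{\mathbf{X}}$ squares the condition number, $\kappa(\hat{\mathbf{X}}^T\hat{\mathbf{X}}) = \kappa(\hat{\mathbf{X}})^2$, so when $\hat{\mathbf{X}}$ has very small singular values, direct inversion amplifies rounding errors much more strongly. SVD works on $\hat{\mathbf{X}}$ itself and provides better numerical stability.
> 
> **Understanding data structure**: SVD reveals the principal directions of variation in your data (for column-centered data, the $\mathbf{v}_i$ are the principal component directions), which can be valuable for interpretation and dimensionality reduction.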
>
> **Regularization insights**: The SVD formulation makes it clear how regularization affects different modes, providing intuition about what Ridge regression actually does.
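
The first two points can be illustrated with a small experiment. The sketch below uses a hypothetical design matrix whose second column is exactly twice the first, so the matrix has rank 2, and a separate full-rank matrix to check the condition-number relation. All data, sizes, and seeds are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(3)
x1 = rng.normal(size=30)
X = np.column_stack([x1, 2.0 * x1, rng.normal(size=30)])  # column 2 = 2 * column 1
y = rng.normal(size=30)

rank = np.linalg.matrix_rank(X)        # numerical rank computed from the SVD
theta_pinv = np.linalg.pinv(X) @ y     # minimum-norm least-squares solution

# Adding a null-space vector leaves the fit unchanged but lengthens theta
null_vec = np.array([2.0, -1.0, 0.0])  # X @ null_vec is zero by construction
theta_other = theta_pinv + 0.3 * null_vec

# Condition number of the normal-equations matrix for a full-rank example
X_full = rng.normal(size=(30, 3))
kappa = np.linalg.cond(X_full)
kappa_normal = np.linalg.cond(X_full.T @ X_full)   # approximately kappa**2
```

Both `theta_pinv` and `theta_other` produce the same fitted values, but `theta_pinv` has the smaller norm, which is the solution the pseudoinverse selects. For the full-rank matrix, `kappa_normal` should be close to `kappa**2`.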

### Computational trade-offs

Computing the SVD has complexity $O(\min(np^2, n^2p))$. The direct approach costs $O(np^2)$ to form $\hat{\mathbf{X}}^T\hat{\mathbf{X}}$ plus $O(p^3)$ to invert or factor it. For tall, thin matrices ($n \gg p$), both approaches therefore scale as $O(np^2)$, but the direct approach has a smaller constant factor and is often faster in practice. SVD provides superior numerical properties when stability matters more than speed.
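
One practical detail affects the cost for tall matrices: the full SVD stores an $n \times n$ matrix $\mathbf{U}$, while the sums over modes only need its first $\min(n,p)$ columns. The sketch below compares the two variants using NumPy's `full_matrices` flag; the matrix size and seed are arbitrary assumptions.

```python
import numpy as np

X = np.random.default_rng(4).normal(size=(500, 5))   # hypothetical tall matrix

# Full SVD: U is 500 x 500, most of which is never used in the solution
U_full, s_full, Vt_full = np.linalg.svd(X, full_matrices=True)

# Thin (economy) SVD: U is 500 x 5, enough for the least-squares sums
U_thin, s_thin, Vt_thin = np.linalg.svd(X, full_matrices=False)

# The thin factors still reconstruct X exactly (up to rounding)
X_rebuilt = (U_thin * s_thin) @ Vt_thin
```

Both calls return the same singular values, so for regression problems with $n \gg p$ the thin form is usually the natural choice.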

___

## Summary

In this notebook, we've explored how Singular Value Decomposition provides an elegant and numerically robust approach to solving linear regression problems.


> **Key Takeaways**
>
> **SVD Formulation**: The decomposition $\hat{\mathbf{X}} = \mathbf{U}\mathbf{\Sigma}\mathbf{V}^T$ transforms both unregularized and regularized least squares into mode-by-mode solutions:
> - **Unregularized**: $\hat{\mathbf{\theta}} = \sum_{i=1}^{r_{\hat{X}}}\left(\frac{\mathbf{u}_{i}^{T}\mathbf{y}}{\sigma_{i}}\right)\mathbf{v}_{i}$
> - **Regularized**: $\hat{\mathbf{\theta}}_{\lambda} = \sum_{i=1}^{r_{\hat{X}}}\left(\frac{\sigma_{i}}{\sigma_{i}^{2}+\lambda}\right)\left(\mathbf{u}_{i}^{T}\mathbf{y}\right)\mathbf{v}_{i}$
>
> **Geometric Insight**: SVD reveals how each mode (singular vector pair) contributes to the parameter estimates, providing intuition about which directions in parameter space are most important.
>
> **Numerical Advantages**: SVD automatically handles rank-deficient matrices and provides superior numerical stability compared to direct matrix inversion, especially for ill-conditioned problems.
>
> **Regularization Understanding**: The SVD perspective shows how Ridge regression selectively shrinks contributions from different modes, with smaller singular values (noisier directions) being shrunk more aggressively.

### **What's Next?**

In the following live example notebooks, you'll see these SVD concepts applied to real datasets, where you can observe the numerical stability benefits and gain hands-on experience with implementing these methods in practice.