# L8c: Regularization and Cross-Validation
In this lecture, we will discuss regularization techniques to prevent overfitting in machine learning models. We will also cover cross-validation methods to evaluate model performance and ensure generalizability to unseen data.

> __Learning Objectives:__
> 
> By the end of this lecture, you should be able to:
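> * Formulate the ridge (L2-regularized) least-squares problem and write down its analytical solution.
> * Explain how the regularization parameter $\delta$ shrinks the estimated parameters and why this helps limit overfitting.
> * Compute the ridge estimate from the singular value decomposition of the data matrix and describe how cross-validation is used to choose $\delta$.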
>

Let's get started!
___

## Examples
Today, we will be using the following example(s) to illustrate key concepts:

> [▶ Let's build a linear model of housing prices](CHEME-5800-L7c-Example-HousingPriceModel-Fall-2025.ipynb). In this example, students will build a linear regression model to predict housing prices based on various features. This will help us understand how to apply ordinary and regularized least squares methods in a practical context.

___

## Concept Review: Overdetermined Linear Regression
Suppose there exists a dataset $\mathcal{D} = \left\{(\mathbf{x}_{i},y_{i}) \mid i = 1,2,\dots,n\right\}$ with $n$ training (labeled) examples, where $\mathbf{x}_{i}\in\mathbb{R}^{m}$ is an $m$-dimensional vector of features (independent input variables) and $y_{i}\in\mathbb{R}$ denotes a scalar response variable (dependent variable).

We can rewrite the linear regression model in matrix-vector form as:
$$
\begin{equation*}
\mathbf{y} = \hat{\mathbf{X}}\;\mathbf{\theta} + \mathbf{\epsilon}
\end{equation*}
$$
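where $\hat{\mathbf{X}}$ is an $n\times{p}$ matrix with the augmented features $\hat{\mathbf{x}}_{i}^{\top}$ on the rows (each feature vector $\mathbf{x}_{i}$ extended with a constant 1 to carry the intercept, so that $p = m+1$), $\mathbf{\theta}\in\mathbb{R}^{p}$ is the vector of unknown parameters, the target (output) vector $\mathbf{y}$ is an $n\times{1}$ column vector with entries $y_{i}$, and the error vector $\mathbf{\epsilon}$ is an $n\times{1}$ column vector with entries $\epsilon_{i}$.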

Given an $\texttt{overdetermined}$ data matrix $\hat{\mathbf{X}}\in\mathbb{R}^{n\times{p}}$, i.e., one with more examples than parameters ($n > p$, typically $n \gg p$), and an error model $\mathbf{\epsilon}\sim\mathcal{N}(\mathbf{0},\sigma^{2}\;\mathbf{I})$ that follows [a normal distribution](https://en.wikipedia.org/wiki/Normal_distribution) with a mean of zero and variance $\sigma^{2}$, we estimate the model parameters by minimizing the sum of squared errors between the model's estimated outputs and the observed outputs:
$$
\begin{align*}
\hat{\mathbf{\theta}} = \arg\min_{\mathbf{\theta}} \frac{1}{2}\;\lVert~\mathbf{y} - \hat{\mathbf{X}}\;\mathbf{\theta}~\rVert^{2}_{2}
\end{align*}
$$
where $\lVert\star\rVert^{2}_{2}$ is the square of the [L2 vector norm](https://en.wikipedia.org/wiki/Norm_(mathematics)#Euclidean_norm), and $\hat{\mathbf{\theta}}\in\mathbb{R}^{p}$ is the estimated parameter vector. When $\hat{\mathbf{X}}$ has full column rank (i.e., $\text{rank}(\hat{\mathbf{X}}) = p$ and $\hat{\mathbf{X}}^{\top}\hat{\mathbf{X}}$ is invertible), this problem has the analytical solution:
$$
\begin{align*}
\hat{\mathbf{\theta}} &= \left(\hat{\mathbf{X}}^{\top}\hat{\mathbf{X}}\right)^{-1}\hat{\mathbf{X}}^{\top}\mathbf{y}
\end{align*}
$$
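
To make the analytical solution concrete, here is a minimal sketch in Python with NumPy. The synthetic data, the variable names, and the placement of the constant column at the end of each augmented feature vector are assumptions for illustration; this is not the housing dataset from the example notebook.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic overdetermined problem (hypothetical data): n examples, m features
n, m = 200, 3
X = rng.normal(size=(n, m))
X_hat = np.hstack([X, np.ones((n, 1))])      # augmented features, p = m + 1
theta_true = np.array([2.0, -1.0, 0.5, 4.0])
y = X_hat @ theta_true + rng.normal(scale=0.1, size=n)

# Normal equations: solve (X^T X) theta = X^T y instead of forming the inverse
theta_ne = np.linalg.solve(X_hat.T @ X_hat, X_hat.T @ y)

# Library least-squares solver, used here only as a cross-check
theta_ls, *_ = np.linalg.lstsq(X_hat, y, rcond=None)
```

Solving the linear system with `np.linalg.solve` avoids computing the inverse explicitly, which is usually the safer choice numerically; both routes should agree when $\hat{\mathbf{X}}$ has full column rank.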

The confidence intervals for the estimated parameters can be computed as:
> __Confidence Intervals:__ A $(1-\alpha) \times 100\%$ confidence interval for each parameter $\hat{\theta}_j$ is given by:
> $$
\begin{align*}
\hat{\theta}_j \pm t_{1-\alpha/2,\nu}\; \text{SE}(\hat{\theta}_j)
\end{align*}
$$
> where $t_{1-\alpha/2,\nu}$ is the $(1-\alpha/2)$-quantile of a Student $t$ distribution with $\nu = n - p$ degrees of freedom. For large $\nu$, this quantile approaches the corresponding standard normal quantile: for a 95% confidence interval, $\alpha = 0.05$ and $t_{1-\alpha/2,\nu} \approx 1.96$; for a 99.9% confidence interval, $\alpha = 0.001$ and $t_{1-\alpha/2,\nu} \approx 3.291$. For small $\nu$ the $t$ quantiles are larger than these values. The standard error $\text{SE}(\hat{\theta}_j)$ quantifies the uncertainty in the parameter estimate $\hat{\theta}_j$. It is given by:
> $$
\begin{align*}
\text{SE}(\hat{\theta}_{j}) &= \hat{\sigma}\; \sqrt{\bigl[(\hat{\mathbf X}^\top\hat{\mathbf X})^{-1}\bigr]_{jj}}
\end{align*}
$$
> where $\hat{\sigma}^{2} = \lVert~\mathbf{y} - \hat{\mathbf{X}}\;\hat{\mathbf{\theta}}~\rVert^{2}_{2}/(n-p)$ is the estimated error variance.
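
The interval formula can be sketched in a few lines of Python. This sketch assumes $\nu = n - p$ and the usual residual-based estimate $\hat{\sigma}^{2} = \lVert\mathbf{y} - \hat{\mathbf{X}}\hat{\mathbf{\theta}}\rVert^{2}_{2}/\nu$, and it substitutes the standard normal quantile for the $t$ quantile, which is only reasonable when $\nu$ is large. The data and names are hypothetical.

```python
import numpy as np
from statistics import NormalDist

rng = np.random.default_rng(1)

# Hypothetical data with n much larger than p
n, p = 500, 3
X_hat = np.hstack([rng.normal(size=(n, p - 1)), np.ones((n, 1))])
theta_true = np.array([1.5, -0.7, 3.0])
y = X_hat @ theta_true + rng.normal(scale=0.5, size=n)

# Ordinary least-squares estimate; the inverse is kept because its diagonal is needed
XtX_inv = np.linalg.inv(X_hat.T @ X_hat)
theta_hat = XtX_inv @ X_hat.T @ y

# Residual-based estimate of the error standard deviation
residuals = y - X_hat @ theta_hat
nu = n - p
sigma_hat = np.sqrt(residuals @ residuals / nu)

# Standard error of each parameter estimate
se = sigma_hat * np.sqrt(np.diag(XtX_inv))

# Large-nu approximation: normal quantile in place of the t quantile
alpha = 0.05
z = NormalDist().inv_cdf(1 - alpha / 2)
lower, upper = theta_hat - z * se, theta_hat + z * se
```

For small $\nu$, the normal quantile understates the interval width, so a $t$ quantile from a statistics library would be the better choice there.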

Let's finish the housing price model example from last time. 


> __Example__
>
> [▶ Let's build a linear model of housing prices](CHEME-5800-L7c-Example-HousingPriceModel-Fall-2025.ipynb). In this example, students will build a linear regression model to predict housing prices based on various features. This will help us understand how to apply ordinary and regularized least squares methods in a practical context.

___

## Regularized linear regression
In the __overdetermined case__, we can add a regularization term to the objective function to prevent overfitting and improve generalization. This approach is called __regularized linear regression__.

> __What is overfitting?__ Overfitting occurs when a model learns the noise in the training data instead of the underlying pattern. This leads to poor performance on unseen data, as the model fails to generalize beyond the training set.

There are several types of regularization techniques, but we will focus on __Ridge regression__ (also known as Tikhonov regularization or L2 regularization). The ridge regression problem is given by:
$$
\begin{align*}
\hat{\mathbf{\theta}}_{\delta} = \arg\min_{\mathbf{\theta}}\left( \frac{1}{2}\;\lVert~\mathbf{y} - \hat{\mathbf{X}}\;\mathbf{\theta}~\rVert^{2}_{2} + \frac{\delta}{2}\;\lVert~\mathbf{\theta}~\rVert^{2}_{2}\right)
\end{align*}
$$
where $\delta > 0$ is the regularization parameter, which controls the strength of the penalty. The first term measures the sum of squared errors, while the second term penalizes large parameter values. Setting the gradient of the objective to zero gives the analytical solution for the optimal parameters:
$$
\begin{align*}
\hat{\mathbf{\theta}}_{\delta} &= \left(\hat{\mathbf{X}}^{\top}\hat{\mathbf{X}} + \delta\;\mathbf{I}\right)^{-1}\hat{\mathbf{X}}^{\top}\mathbf{y}
\end{align*}
$$
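
As a minimal sketch of this closed-form solution (the function name `ridge_fit` and the synthetic data are assumptions for illustration), the ridge estimate can be computed by solving one linear system:

```python
import numpy as np

def ridge_fit(X_hat, y, delta):
    """Solve (X^T X + delta I) theta = X^T y for theta."""
    p = X_hat.shape[1]
    A = X_hat.T @ X_hat + delta * np.eye(p)
    return np.linalg.solve(A, X_hat.T @ y)

rng = np.random.default_rng(2)

# Hypothetical data for illustration
n, p = 100, 4
X_hat = rng.normal(size=(n, p))
theta_true = np.array([3.0, -2.0, 0.0, 1.0])
y = X_hat @ theta_true + rng.normal(scale=0.3, size=n)

theta_ols = ridge_fit(X_hat, y, delta=0.0)     # delta = 0 gives ordinary least squares
theta_ridge = ridge_fit(X_hat, y, delta=10.0)  # delta > 0 shrinks the estimate
```

Note that, following the formula as written, the penalty applies to every entry of $\mathbf{\theta}$, including any intercept entry carried by the augmented features.
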
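Substituting the model $\mathbf{y} = \hat{\mathbf{X}}\;\mathbf{\theta} + \mathbf{\epsilon}$, where $\mathbf{\theta}$ now denotes the true (unknown) parameter vector, shows how the ridge estimate relates to the true parameters and the error $\mathbf{\epsilon}$: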
$$
\begin{align*}
\hat{\mathbf{\theta}}_{\delta} &= \left(\hat{\mathbf{X}}^{\top}\hat{\mathbf{X}} + \delta\;\mathbf{I}\right)^{-1}\hat{\mathbf{X}}^{\top}(\hat{\mathbf{X}}\;\mathbf{\theta} + \mathbf{\epsilon}) \\
\hat{\mathbf{\theta}}_{\delta} &= \left(\hat{\mathbf{X}}^{\top}\hat{\mathbf{X}} + \delta\;\mathbf{I}\right)^{-1}\hat{\mathbf{X}}^{\top}\hat{\mathbf{X}}\;\mathbf{\theta} + \left(\hat{\mathbf{X}}^{\top}\hat{\mathbf{X}} + \delta\;\mathbf{I}\right)^{-1}\hat{\mathbf{X}}^{\top}\mathbf{\epsilon} \\
\hat{\mathbf{\theta}}_{\delta} &= \underbrace{\left(\hat{\mathbf{X}}^{\top}\hat{\mathbf{X}} + \delta\;\mathbf{I}\right)^{-1}\hat{\mathbf{X}}^{\top}\hat{\mathbf{X}}}_{\text{Shrinkage}\;\mathbf{P}}\;\mathbf{\theta} + \left(\hat{\mathbf{X}}^{\top}\hat{\mathbf{X}} + \delta\;\mathbf{I}\right)^{-1}\hat{\mathbf{X}}^{\top}\mathbf{\epsilon} \\
\hat{\mathbf{\theta}}_{\delta} &= \mathbf{P}\;\mathbf{\theta} + \left(\hat{\mathbf{X}}^{\top}\hat{\mathbf{X}} + \delta\;\mathbf{I}\right)^{-1}\hat{\mathbf{X}}^{\top}\mathbf{\epsilon}\quad\blacksquare\\
\end{align*}
$$

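__Key insight__: The penalty adds $\delta\;\mathbf{I}$ to $\hat{\mathbf{X}}^{\top}\hat{\mathbf{X}}$, which shrinks the estimated parameters towards zero: when $\hat{\mathbf{X}}$ has full column rank, the shrinkage matrix $\mathbf{P}$ equals the identity for $\delta = 0$ and approaches the zero matrix as $\delta\to\infty$. This helps prevent overfitting by discouraging complex models that fit the training data too closely.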
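
One consequence is worth writing out. Assuming $\mathbf{\theta}$ denotes the true parameter vector and $\mathbb{E}[\mathbf{\epsilon}] = \mathbf{0}$, as in the error model above, taking the expectation of the last line of the derivation gives:
$$
\begin{align*}
\mathbb{E}\left[\hat{\mathbf{\theta}}_{\delta}\right] &= \mathbf{P}\;\mathbf{\theta} + \left(\hat{\mathbf{X}}^{\top}\hat{\mathbf{X}} + \delta\;\mathbf{I}\right)^{-1}\hat{\mathbf{X}}^{\top}\;\mathbb{E}\left[\mathbf{\epsilon}\right] = \mathbf{P}\;\mathbf{\theta} \\
\mathbb{E}\left[\hat{\mathbf{\theta}}_{\delta}\right] - \mathbf{\theta} &= \left(\mathbf{P} - \mathbf{I}\right)\mathbf{\theta} = -\delta\left(\hat{\mathbf{X}}^{\top}\hat{\mathbf{X}} + \delta\;\mathbf{I}\right)^{-1}\mathbf{\theta}
\end{align*}
$$
The second line uses $\mathbf{I} = \left(\hat{\mathbf{X}}^{\top}\hat{\mathbf{X}} + \delta\;\mathbf{I}\right)^{-1}\left(\hat{\mathbf{X}}^{\top}\hat{\mathbf{X}} + \delta\;\mathbf{I}\right)$. So the ridge estimate is biased for any $\delta > 0$, and the bias vanishes as $\delta\to{0}$: regularization trades some bias for reduced sensitivity to the noise.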

Now, the question is: how do we select the regularization parameter $\delta$? The answer is that we can use cross-validation techniques to tune the hyperparameter $\delta$ by evaluating the model's performance on held-out validation data.
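
As a rough illustration of this idea, the sketch below uses $k$-fold cross-validation, one common scheme among several. The helper names (`ridge_fit`, `kfold_cv_error`), the choice $k = 5$, the candidate grid for $\delta$, and the synthetic data are all assumptions made for this example.

```python
import numpy as np

def ridge_fit(X_hat, y, delta):
    """Closed-form ridge estimate: solve (X^T X + delta I) theta = X^T y."""
    p = X_hat.shape[1]
    return np.linalg.solve(X_hat.T @ X_hat + delta * np.eye(p), X_hat.T @ y)

def kfold_cv_error(X_hat, y, delta, k=5, seed=0):
    """Mean squared validation error of ridge regression, averaged over k folds."""
    n = X_hat.shape[0]
    indices = np.random.default_rng(seed).permutation(n)
    folds = np.array_split(indices, k)
    errors = []
    for i in range(k):
        val_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        theta = ridge_fit(X_hat[train_idx], y[train_idx], delta)  # fit on k-1 folds
        residual = y[val_idx] - X_hat[val_idx] @ theta            # score on the held-out fold
        errors.append(np.mean(residual ** 2))
    return float(np.mean(errors))

# Hypothetical data: few examples relative to parameters, so regularization matters
rng = np.random.default_rng(3)
n, p = 40, 20
X_hat = rng.normal(size=(n, p))
theta_true = rng.normal(size=p)
y = X_hat @ theta_true + rng.normal(scale=2.0, size=n)

deltas = np.logspace(-3, 3, 13)  # candidate values of delta (an assumed grid)
cv_errors = [kfold_cv_error(X_hat, y, d) for d in deltas]
best_delta = deltas[int(np.argmin(cv_errors))]
```

Using the same `seed` for every candidate keeps the fold assignment fixed, so differences in validation error reflect the value of $\delta$ rather than a different split of the data.
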
___

## SVD solution for Regularized Least Squares
Let the singular value decomposition (SVD) of the $n\times{p}$ data matrix $\hat{\mathbf{X}}$ be given by:
$$
\begin{equation*}
\hat{\mathbf{X}} = \mathbf{U}\;\mathbf{\Sigma}\;\mathbf{V}^{\top}
\end{equation*}
$$
where $\mathbf{U} \in \mathbb{R}^{n \times n}$ is an orthogonal matrix, $\mathbf{\Sigma} \in \mathbb{R}^{n \times p}$ is a rectangular matrix with singular values on the diagonal,
and $\mathbf{V} \in \mathbb{R}^{p \times p}$ is an orthogonal matrix. The regularized least-squares estimate (Ridge regression) of the unknown parameter vector $\mathbf{\theta}$ is given by:
$$
\begin{equation*}
\hat{\mathbf{\theta}}_{\delta} = \mathbf{V}\left(\mathbf{\Sigma}^{\top}\mathbf{\Sigma}+\delta\mathbf{I}\right)^{-1}\mathbf{\Sigma}^{\top}\mathbf{U}^{\top}\mathbf{y}
\end{equation*}
$$
or equivalently, in the more computationally efficient index notation:
$$
\begin{equation*}
\hat{\mathbf{\theta}}_{\delta} = \sum_{i=1}^{r_{\hat{X}}}\left(\frac{\sigma_{i}}{\sigma_{i}^{2}+\delta}\right)\left(\mathbf{u}_{i}^{\top}\mathbf{y}\right)\mathbf{v}_{i}
\end{equation*}
$$
where $r_{\hat{X}} \leq \min(n,p)$ is the rank of the data matrix $\hat{\mathbf{X}}$ (with $r_{\hat{X}} = p$ when $\hat{\mathbf{X}}$ is overdetermined and has full column rank), $\mathbf{u}_{i}$ and $\mathbf{v}_{i}$ are the $i$-th columns of $\mathbf{U}$ and $\mathbf{V}$, respectively,
$\sigma_{i}$ is the $i$-th singular value (with $\sigma_i > 0$ for $i \leq r_{\hat{X}}$), and $\delta \geq 0$ is the regularization parameter.
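
The index-notation formula translates almost directly into NumPy. The sketch below is an illustration under stated assumptions: the function name `ridge_svd` and the random data are hypothetical, it uses the thin SVD (`full_matrices=False`) since only the first $r_{\hat{X}}$ columns of $\mathbf{U}$ enter the sum, and the tolerance used to discard numerically zero singular values is one conventional choice.

```python
import numpy as np

def ridge_svd(X_hat, y, delta):
    """Ridge estimate: sum_i sigma_i / (sigma_i^2 + delta) * (u_i^T y) * v_i."""
    U, s, Vt = np.linalg.svd(X_hat, full_matrices=False)
    # Keep only singular values that are numerically nonzero (the rank of X_hat)
    keep = s > np.finfo(float).eps * max(X_hat.shape) * s[0]
    weights = s[keep] / (s[keep] ** 2 + delta)
    return Vt[keep].T @ (weights * (U[:, keep].T @ y))

rng = np.random.default_rng(4)

# Hypothetical data for illustration
n, p = 60, 5
X_hat = rng.normal(size=(n, p))
y = rng.normal(size=n)
delta = 0.5

theta_svd = ridge_svd(X_hat, y, delta)

# Cross-check against the normal-equations form of the ridge solution
theta_ne = np.linalg.solve(X_hat.T @ X_hat + delta * np.eye(p), X_hat.T @ y)
```

Once the SVD is available, estimates for many values of $\delta$ only require recomputing `weights`, which is convenient when scanning a grid of candidate regularization parameters.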

### Key Insights:

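* **Shrinkage effect**: Without regularization, the $i$-th direction is weighted by $1/\sigma_{i}$. Ridge regression replaces this weight with $\frac{\sigma_{i}}{\sigma_{i}^{2}+\delta} = \frac{1}{\sigma_{i}}\cdot\frac{\sigma_{i}^{2}}{\sigma_{i}^{2}+\delta}$, so directions with small singular values ($\sigma_{i}^{2} \ll \delta$) are shrunk most aggressively, while directions with $\sigma_{i}^{2} \gg \delta$ are left nearly unchanged.
* **Numerical stability**: This SVD formulation is more numerically stable than directly computing $(\hat{\mathbf{X}}^{\top}\hat{\mathbf{X}} + \delta\mathbf{I})^{-1}$, especially when $\hat{\mathbf{X}}$ is ill-conditioned, because forming $\hat{\mathbf{X}}^{\top}\hat{\mathbf{X}}$ squares the condition number of $\hat{\mathbf{X}}$.
* **Relationship to filtering**: The factors $\frac{\sigma_{i}^{2}}{\sigma_{i}^{2}+\delta}$ act as filter factors on the unregularized solution. When $\delta = 0$, every factor equals one and we recover the unregularized solution; as $\delta \to \infty$, every factor goes to zero and all coefficients shrink to zero.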
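
A small numerical illustration of these points, using made-up singular values for an ill-conditioned matrix (the values and the helper name `ridge_weights` are assumptions for this sketch):

```python
import numpy as np

# Hypothetical singular values spanning four orders of magnitude
sigma = np.array([10.0, 1.0, 0.1, 0.001])

def ridge_weights(sigma, delta):
    """Per-direction weights sigma_i / (sigma_i^2 + delta) from the SVD solution."""
    return sigma / (sigma ** 2 + delta)

w_ols = ridge_weights(sigma, 0.0)      # delta = 0: weights equal 1 / sigma_i
w_ridge = ridge_weights(sigma, 0.01)   # delta = 0.01
ratio = w_ridge / w_ols                # equals sigma_i^2 / (sigma_i^2 + delta)

for s, r in zip(sigma, ratio):
    print(f"sigma = {s:g}: fraction of the unregularized weight kept = {r:.4f}")
```

With $\delta = 0.01$, the direction with $\sigma_{i} = 0.1$ (where $\sigma_{i}^{2} = \delta$) keeps about half of its unregularized weight, the larger singular values are almost untouched, and the smallest one is suppressed almost entirely.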

___