# L7c: Estimating Linear Model Parameters
To estimate the parameters of a linear model, we use **Ordinary Least Squares (OLS)** and explore its mathematical foundations through both direct and SVD-based solutions. 

> __Learning Objectives__
>
> By the end of this lecture, students will be able to:
> - **Understand and apply OLS solutions** for both overdetermined ($n \gg p$) and underdetermined ($n \ll p$) systems, interpreting the mathematical foundations and when each analytical solution applies.
> - **Implement SVD-based solutions** for linear regression, understanding how singular value decomposition provides numerical stability and insights into data structure compared to direct matrix inversion methods.
> - **Quantify parameter uncertainty** through error variance estimation, standard errors, and confidence intervals, enabling rigorous statistical inference and hypothesis testing about model parameters.

Now let's dive into the mathematical foundations and see how these methods work in practice. Let's go!
___

## Examples
Today, we will be using the following example(s) to illustrate key concepts:

> [▶ Let's build a linear model of housing prices](CHEME-5800-L7c-Example-HousingPriceModel-Fall-2025.ipynb). In this example, students will build a linear regression model to predict housing prices based on various features. This will help us understand how to apply ordinary and regularized least squares methods in a practical context.

___

<div>
    <center>
        <img src="figs/Fig-LinearRegressionModel-Schematic.svg" width="680"/>
    </center>
</div>

## Linear models for continuous prediction tasks
Suppose there exists a dataset $\mathcal{D} = \left\{(\mathbf{x}_{i},y_{i}) \mid i = 1,2,\dots,n\right\}$ with $n$ training (labeled) examples, where $\mathbf{x}_{i}\in\mathbb{R}^{m}$ is an $m$-dimensional vector of features (independent input variables) and $y_{i}\in\mathbb{R}$ denotes a scalar response variable (dependent variable). Then, a $\texttt{linear model}$ for the dataset $\mathcal{D}$ is given (in index-form) by:
$$
\begin{equation*}
y_{i} = \hat{\mathbf{x}}_{i}^{\top}\,\mathbf{\theta} + \epsilon_{i}\qquad{i=1,2,\dots,n}
\end{equation*}
$$
where the augmented features are $\hat{\mathbf{x}}_{i}^{\top}=\left(x_{i1},x_{i2},\dots,x_{im},1\right)$ (we've added an extra `1` to each feature vector to account for the intercept (bias) term), 
the unknown parameters are represented by the $\mathbf{\theta}\in\mathbb{R}^{p}$ vector (where $p=m+1$), and $\epsilon_{i}\in\mathbb{R}$ is the unobserved random error for response $i$, i.e., the component of the target that is _not_ explained by the linear model. 

We can rewrite the linear regression model in matrix-vector form as:
$$
\begin{equation*}
\mathbf{y} = \hat{\mathbf{X}}\;\mathbf{\theta} + \mathbf{\epsilon}
\end{equation*}
$$
where $\hat{\mathbf{X}}$ is an $n\times{p}$ matrix with the augmented features $\hat{\mathbf{x}}_{i}^{\top}$ on the rows, the target (output) vector $\mathbf{y}$ is an $n\times{1}$ column vector with entries $y_{i}$, and the error vector $\mathbf{\epsilon}$ is an $n\times{1}$ column vector with entries $\epsilon_{i}$. The challenge of linear regression is to estimate the unknown parameters $\mathbf{\theta}$ from the dataset $\mathcal{D}$ by minimizing an appropriate loss function, typically the sum of squared errors.
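
The matrix-vector form above can be sketched in a few lines of Python with NumPy. This is a minimal illustration, not the course's implementation: the sizes, the parameter values in `theta_true`, and the noise level `sigma` are all hypothetical choices made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)   # fixed seed so the sketch is repeatable

n, m = 100, 2                    # n observations, m raw features (illustrative sizes)
p = m + 1                        # number of parameters, including the intercept

X = rng.normal(size=(n, m))                 # raw feature matrix, one row per observation
X_hat = np.hstack([X, np.ones((n, 1))])     # augmented features: (x_i1, ..., x_im, 1)

theta_true = np.array([2.0, -1.0, 0.5])     # hypothetical parameters; last entry is the intercept
sigma = 0.1                                 # hypothetical noise standard deviation
epsilon = rng.normal(scale=sigma, size=n)   # epsilon ~ N(0, sigma^2 I)

y = X_hat @ theta_true + epsilon            # matrix-vector form of the linear model

print(X_hat.shape, y.shape)                 # (100, 3) (100,)
```

Placing the column of ones last matches the ordering of the augmented feature vector used in these notes, so the intercept is the final entry of the parameter vector.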

> **Key Insight**: A linear model must only be linear in the parameters, not necessarily the features. For example, we could have polynomial features in the data matrix $\hat{\mathbf{X}}$ such as $1,x,x^{2},x^{3},\dots$. This would still be a linear regression problem because the model remains linear in the parameters $\mathbf{\theta}$.
___

## Overdetermined data matrix without regularization
Suppose you have a data matrix $\hat{\mathbf{X}}\in\mathbb{R}^{n\times{p}}$ that is $\texttt{overdetermined}$, i.e., $n \gg p$, and an error model $\mathbf{\epsilon}\sim\mathcal{N}(\mathbf{0},\sigma^{2}\;\mathbf{I})$ that follows [a normal distribution](https://en.wikipedia.org/wiki/Normal_distribution) with a mean of zero and variance $\sigma^{2}$. We estimate the model parameters by minimizing the sum of squared errors between the model's estimated outputs and the observed outputs:
$$
\begin{align*}
\hat{\mathbf{\theta}} = \arg\min_{\mathbf{\theta}} \frac{1}{2}\;\lVert~\mathbf{y} - \hat{\mathbf{X}}\;\mathbf{\theta}~\rVert^{2}_{2}
\end{align*}
$$
where $\lVert\star\rVert^{2}_{2}$ is the square of the [L2 vector norm](https://en.wikipedia.org/wiki/Norm_(mathematics)#Euclidean_norm), and $\hat{\mathbf{\theta}}\in\mathbb{R}^{p}$ is the estimated parameter vector. When $\hat{\mathbf{X}}$ has full column rank (i.e., $\text{rank}(\hat{\mathbf{X}}) = p$ and $\hat{\mathbf{X}}^{\top}\hat{\mathbf{X}}$ is invertible), this problem has the analytical solution:
$$
\begin{align*}
\hat{\mathbf{\theta}} &= \left(\hat{\mathbf{X}}^{\top}\hat{\mathbf{X}}\right)^{-1}\hat{\mathbf{X}}^{\top}\mathbf{y}
\end{align*}
$$
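
For completeness, here is a short sketch of where this solution comes from. It is the standard calculus argument rather than something specific to these notes, and the symbol $J(\mathbf{\theta})$ for the objective is introduced here only for convenience. Setting the gradient of the objective to zero gives the normal equations:
$$
\begin{align*}
J(\mathbf{\theta}) &= \frac{1}{2}\left(\mathbf{y} - \hat{\mathbf{X}}\;\mathbf{\theta}\right)^{\top}\left(\mathbf{y} - \hat{\mathbf{X}}\;\mathbf{\theta}\right) \\
\nabla_{\mathbf{\theta}}J(\mathbf{\theta}) &= -\hat{\mathbf{X}}^{\top}\left(\mathbf{y} - \hat{\mathbf{X}}\;\mathbf{\theta}\right) = \mathbf{0} \\
\hat{\mathbf{X}}^{\top}\hat{\mathbf{X}}\;\hat{\mathbf{\theta}} &= \hat{\mathbf{X}}^{\top}\mathbf{y}
\end{align*}
$$
When $\hat{\mathbf{X}}$ has full column rank, the Hessian $\hat{\mathbf{X}}^{\top}\hat{\mathbf{X}}$ is positive definite, so this stationary point is the unique minimizer and we can solve for $\hat{\mathbf{\theta}}$ by inverting $\hat{\mathbf{X}}^{\top}\hat{\mathbf{X}}$.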

We can also express the estimated parameters in terms of the true parameters and the error model $\mathbf{\epsilon}$:
$$
\begin{align*}
\hat{\mathbf{\theta}} &= \left(\hat{\mathbf{X}}^{\top}\hat{\mathbf{X}}\right)^{-1}\hat{\mathbf{X}}^{\top}\mathbf{y} \\
\hat{\mathbf{\theta}}&= \left(\hat{\mathbf{X}}^{\top}\hat{\mathbf{X}}\right)^{-1}\hat{\mathbf{X}}^{\top}(\hat{\mathbf{X}}\;\mathbf{\theta} + \mathbf{\epsilon}) \\
\hat{\mathbf{\theta}} &= \underbrace{\left[\left(\hat{\mathbf{X}}^{\top}\hat{\mathbf{X}}\right)^{-1}\hat{\mathbf{X}}^{\top}\hat{\mathbf{X}}\right]}_{= \mathbf{I}}\;\mathbf{\theta} + \left(\hat{\mathbf{X}}^{\top}\hat{\mathbf{X}}\right)^{-1}\hat{\mathbf{X}}^{\top}\mathbf{\epsilon}\\
\hat{\mathbf{\theta}} &= \mathbf{\theta} + \left(\hat{\mathbf{X}}^{\top}\hat{\mathbf{X}}\right)^{-1}\hat{\mathbf{X}}^{\top}\mathbf{\epsilon}\quad\blacksquare\\
\end{align*}
$$
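
The decomposition above is easy to check numerically. The following is a minimal sketch, assuming NumPy and a synthetic dataset; `theta_true` and the noise level are hypothetical values chosen for illustration. It solves the normal equations with `np.linalg.solve` (avoiding an explicit inverse) and confirms that the estimate equals the true parameters plus the propagated noise term.

```python
import numpy as np

rng = np.random.default_rng(1)

n, p = 200, 3
X_hat = np.hstack([rng.normal(size=(n, p - 1)), np.ones((n, 1))])  # augmented data matrix
theta_true = np.array([1.5, -2.0, 0.3])      # hypothetical true parameters
epsilon = rng.normal(scale=0.2, size=n)      # hypothetical noise, sigma = 0.2
y = X_hat @ theta_true + epsilon

# Solve the normal equations (X^T X) theta = X^T y without forming an explicit inverse
G = X_hat.T @ X_hat
theta_hat = np.linalg.solve(G, X_hat.T @ y)

# Right-hand side of the decomposition: theta + (X^T X)^{-1} X^T epsilon
theta_decomp = theta_true + np.linalg.solve(G, X_hat.T @ epsilon)

# Cross-check against NumPy's least-squares routine
theta_lstsq, *_ = np.linalg.lstsq(X_hat, y, rcond=None)

print(np.allclose(theta_hat, theta_decomp), np.allclose(theta_hat, theta_lstsq))
```

In a real problem we never observe $\mathbf{\epsilon}$ or $\mathbf{\theta}$, so the decomposition is an analysis tool rather than something we can compute; the simulation only works because we generated the data ourselves.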

### Key Insights:

* **Unbiased estimator**: Because $\mathbb{E}[\mathbf{\epsilon}] = \mathbf{0}$, the expected value of $\hat{\mathbf{\theta}}$ equals the true parameters: $\mathbb{E}[\hat{\mathbf{\theta}}] = \mathbf{\theta}$. OLS is therefore an unbiased estimator; on average, we get the right answer.
* **Parameter uncertainty**: Since $\mathbf{\epsilon}$ is a random vector, $\hat{\mathbf{\theta}}$ is also a random vector with its own distribution. This fundamental uncertainty means parameter estimates have variability that must be quantified through confidence intervals and hypothesis tests.
* **Connection to Bayesian inference**: The decomposition $\hat{\mathbf{\theta}} = \mathbf{\theta} + (\hat{\mathbf{X}}^{\top}\hat{\mathbf{X}})^{-1}\hat{\mathbf{X}}^{\top}\mathbf{\epsilon}$ shows that even in frequentist regression, parameter estimates have distributions, providing a conceptual bridge toward [Bayesian linear regression](https://en.wikipedia.org/wiki/Bayesian_linear_regression) where parameters are explicitly treated as random variables with prior distributions.
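
The unbiasedness and variability of $\hat{\mathbf{\theta}}$ can be illustrated with a small Monte Carlo experiment. This is a sketch under stated assumptions: a fixed synthetic design matrix, hypothetical values for `theta_true` and `sigma`, and an arbitrary number of trials. Each trial draws a fresh noise vector and recomputes the OLS estimate.

```python
import numpy as np

rng = np.random.default_rng(2)

n, p, sigma = 50, 3, 1.0
n_trials = 5000
X_hat = np.hstack([rng.normal(size=(n, p - 1)), np.ones((n, 1))])  # fixed design matrix
theta_true = np.array([1.0, -0.5, 2.0])                            # hypothetical true parameters

# Precompute the matrix that maps y to theta_hat: (X^T X)^{-1} X^T
A = np.linalg.solve(X_hat.T @ X_hat, X_hat.T)

# Each row of E is one independent noise realization epsilon ~ N(0, sigma^2 I)
E = rng.normal(scale=sigma, size=(n_trials, n))
Y = X_hat @ theta_true + E          # broadcasting: one simulated response vector per row
theta_hats = Y @ A.T                # one OLS estimate per row

theta_mean = theta_hats.mean(axis=0)   # Monte Carlo estimate of E[theta_hat]
theta_std = theta_hats.std(axis=0)     # spread of the estimates across trials

print("true parameters:  ", theta_true)
print("mean of estimates:", theta_mean)
print("std of estimates: ", theta_std)
```

The mean of the estimates should sit close to `theta_true`, while the nonzero spread in `theta_std` is the parameter uncertainty that the rest of this lecture quantifies.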

___

## Underdetermined data matrix without regularization
Next, let's assume the data matrix $\hat{\mathbf{X}}$ is $\texttt{underdetermined}$, i.e., $n \ll p$, and the error model $\mathbf{\epsilon}\sim\mathcal{N}(\mathbf{0},\sigma^{2}\mathbf{I})$.
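Because there are more unknowns than equations, many parameter vectors can reproduce the data exactly. The ordinary least squares estimate of the unknown parameters $\mathbf{\theta}$ is then taken to be the _smallest_ $\mathbf{\theta}$ (in the L2 norm) that satisfies the original equations: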
$$
\begin{align*}
\text{minimize}~&  ||\,\mathbf{\theta}\,||^{2}_{2} \\
\text{subject to} & \, \hat{\mathbf{X}}\;\mathbf{\theta} = \mathbf{y}
\end{align*}
$$
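When $\hat{\mathbf{X}}$ has full row rank (i.e., $\text{rank}(\hat{\mathbf{X}}) = n$ and $\hat{\mathbf{X}}\hat{\mathbf{X}}^{\top}$ is invertible), the least-norm solution of the unknown parameter vector is given by: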
$$
\begin{align*}
\hat{\mathbf{\theta}} &= \hat{\mathbf{X}}^{\top}\left(\hat{\mathbf{X}}\hat{\mathbf{X}}^{\top}\right)^{-1}\;\mathbf{y}
\end{align*}
$$
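
A brief sketch of why the least-norm solution takes this form, using Lagrange multipliers. This is the standard argument; the multiplier vector $\mathbf{\lambda}\in\mathbb{R}^{n}$ and the Lagrangian $\mathcal{L}$ are symbols introduced here, and we assume $\hat{\mathbf{X}}\hat{\mathbf{X}}^{\top}$ is invertible:
$$
\begin{align*}
\mathcal{L}(\mathbf{\theta},\mathbf{\lambda}) &= \lVert\,\mathbf{\theta}\,\rVert^{2}_{2} + \mathbf{\lambda}^{\top}\left(\mathbf{y} - \hat{\mathbf{X}}\;\mathbf{\theta}\right) \\
\nabla_{\mathbf{\theta}}\mathcal{L} &= 2\,\mathbf{\theta} - \hat{\mathbf{X}}^{\top}\mathbf{\lambda} = \mathbf{0}
\quad\Longrightarrow\quad \mathbf{\theta} = \frac{1}{2}\,\hat{\mathbf{X}}^{\top}\mathbf{\lambda} \\
\hat{\mathbf{X}}\;\mathbf{\theta} = \mathbf{y} &\quad\Longrightarrow\quad \frac{1}{2}\,\hat{\mathbf{X}}\hat{\mathbf{X}}^{\top}\mathbf{\lambda} = \mathbf{y}
\quad\Longrightarrow\quad \mathbf{\lambda} = 2\left(\hat{\mathbf{X}}\hat{\mathbf{X}}^{\top}\right)^{-1}\mathbf{y} \\
\hat{\mathbf{\theta}} &= \frac{1}{2}\,\hat{\mathbf{X}}^{\top}\mathbf{\lambda} = \hat{\mathbf{X}}^{\top}\left(\hat{\mathbf{X}}\hat{\mathbf{X}}^{\top}\right)^{-1}\mathbf{y}
\end{align*}
$$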
We can express the solution in terms of the error model $\mathbf{\epsilon}$:
$$
\begin{align*}
\hat{\mathbf{\theta}} &= \hat{\mathbf{X}}^{\top}\left(\hat{\mathbf{X}}\hat{\mathbf{X}}^{\top}\right)^{-1}\;\mathbf{y}\\
\hat{\mathbf{\theta}} &= \hat{\mathbf{X}}^{\top}\left(\hat{\mathbf{X}}\hat{\mathbf{X}}^{\top}\right)^{-1}\;(\hat{\mathbf{X}}\;\mathbf{\theta} + \mathbf{\epsilon})\\
\hat{\mathbf{\theta}} &= \underbrace{\hat{\mathbf{X}}^{\top}\left(\hat{\mathbf{X}}\hat{\mathbf{X}}^{\top}\right)^{-1}\;\hat{\mathbf{X}}}_{\text{Projection}\;\mathbf{P}}\;\mathbf{\theta} + \hat{\mathbf{X}}^{\top}\left(\hat{\mathbf{X}}\hat{\mathbf{X}}^{\top}\right)^{-1}\;\mathbf{\epsilon}\\
\hat{\mathbf{\theta}} &= \mathbf{P}\;\mathbf{\theta} +\hat{\mathbf{X}}^{\top}\left(\hat{\mathbf{X}}\hat{\mathbf{X}}^{\top}\right)^{-1}\;\mathbf{\epsilon}\quad\blacksquare\\
\end{align*}
$$

### Key Insights:

* **Non-unique solutions**: Unlike the overdetermined case, infinitely many parameter vectors satisfy $\hat{\mathbf{X}}\mathbf{\theta} = \mathbf{y}$. The least-norm solution provides a principled selection criterion by choosing the smallest $\|\mathbf{\theta}\|_2$, which acts as an implicit form of L2 regularization (it is the limit of the Ridge solution as the penalty strength goes to zero).
* **Projection interpretation**: The matrix $\mathbf{P} = \hat{\mathbf{X}}^{\top}(\hat{\mathbf{X}}\hat{\mathbf{X}}^{\top})^{-1}\hat{\mathbf{X}}$ acts as a projection operator mapping the true parameters onto the row space of $\hat{\mathbf{X}}$, the only part of parameter space the data can "see". Because $\mathbf{P} \neq \mathbf{I}$ when $n < p$, we have $\mathbb{E}[\hat{\mathbf{\theta}}] = \mathbf{P}\,\mathbf{\theta} \neq \mathbf{\theta}$ in general, so, unlike the overdetermined case, the estimate is biased.
* **High-dimensional relevance**: In modern machine learning where $p > n$ is common (e.g., genomics, natural language processing), understanding underdetermined systems is crucial for developing effective modeling strategies and choosing appropriate regularization approaches.
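
The insights above can be checked on a small synthetic problem. The sketch below assumes NumPy, a random $5\times 20$ data matrix (full row rank with probability one), and hypothetical true parameters; the responses are generated without noise so that the effect of $\mathbf{P}$ is isolated. It is an illustration, not a recommended production routine.

```python
import numpy as np

rng = np.random.default_rng(3)

n, p = 5, 20                                   # far fewer observations than parameters
X_hat = rng.normal(size=(n, p))                # full row rank with probability one
theta_true = rng.normal(size=p)                # hypothetical true parameters
y = X_hat @ theta_true                         # noise-free responses, to isolate the effect of P

# Least-norm solution: theta_hat = X^T (X X^T)^{-1} y
theta_hat = X_hat.T @ np.linalg.solve(X_hat @ X_hat.T, y)

# Projection matrix P = X^T (X X^T)^{-1} X onto the row space of X
P = X_hat.T @ np.linalg.solve(X_hat @ X_hat.T, X_hat)

# Any vector in the null space of X can be added without changing the fit
z = rng.normal(size=p)
null_vec = z - P @ z                           # component of z orthogonal to the row space
theta_other = theta_hat + null_vec             # another exact solution, with a larger norm

print("fits the data:      ", np.allclose(X_hat @ theta_hat, y))
print("other solution fits:", np.allclose(X_hat @ theta_other, y))
print("P is idempotent:    ", np.allclose(P @ P, P))
print("theta_hat = P theta:", np.allclose(theta_hat, P @ theta_true))
print("norms:", np.linalg.norm(theta_hat), "<", np.linalg.norm(theta_other))
```

Both `theta_hat` and `theta_other` reproduce `y` exactly, yet neither recovers `theta_true`: with $n < p$ the data only determine the component of $\mathbf{\theta}$ that lies in the row space of $\hat{\mathbf{X}}$.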

**Important questions**: A few key considerations arise here.

* _Why do we select the smallest $\hat{\mathbf{\theta}}$?_ There is no unique solution, i.e., there are infinitely many solutions to the underdetermined system of equations. The least-norm solution is the one that minimizes the L2 norm of the parameter vector $\hat{\mathbf{\theta}}$, which is connected to the concept of regularization. 
* _What is regularization?_ Regularization is a method used to prevent overfitting in machine learning models by adding a penalty to the loss function. In linear regression, regularization helps manage the model's complexity and enhances its ability to generalize to new data. Common techniques include Lasso (L1 regularization) and Ridge (L2 regularization). 

Now let's explore an alternative computational approach using Singular Value Decomposition, which provides superior numerical stability and additional insights into the structure of our data.
___

## SVD solution for Overdetermined Systems
The singular value decomposition (SVD) of the $n\times{p}$ data matrix $\hat{\mathbf{X}}$ is given by:
$$
\begin{equation*}
\hat{\mathbf{X}} = \mathbf{U}\;\mathbf{\Sigma}\;\mathbf{V}^{\top}
\end{equation*}
$$
where $\mathbf{U} \in \mathbb{R}^{n \times n}$ is an orthogonal matrix, $\mathbf{\Sigma} \in \mathbb{R}^{n \times p}$ is a rectangular matrix with singular values on the diagonal, 
and $\mathbf{V} \in \mathbb{R}^{p \times p}$ is an orthogonal matrix. The least-squares estimate of the unknown parameter vector $\mathbf{\theta}$ is given by:
$$
\begin{equation*}
\hat{\mathbf{\theta}} = \mathbf{V}\;\mathbf{\Sigma}^{\dagger}\;\mathbf{U}^{\top}\;\mathbf{y}
\end{equation*}
$$
where $\mathbf{\Sigma}^{\dagger}$ is the Moore-Penrose pseudoinverse of $\mathbf{\Sigma}$. For practical computation, this can be written in index notation as:
$$
\boxed{
\begin{equation*}
\hat{\mathbf{\theta}} = \sum_{i=1}^{r_{\hat{X}}}\left(\frac{\mathbf{u}_{i}^{\top}\mathbf{y}}{\sigma_{i}}\right)\mathbf{v}_{i}\quad\blacksquare
\end{equation*}}
$$
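where $r_{\hat{X}} \leq \min(n,p)$ is the rank of the data matrix $\hat{\mathbf{X}}$, i.e., the number of nonzero singular values (with $r_{\hat{X}} = \min(n,p)$ only when $\hat{\mathbf{X}}$ has full rank), $\mathbf{u}_{i}$ and $\mathbf{v}_{i}$ are the $i$-th columns of $\mathbf{U}$ and $\mathbf{V}$, respectively, and $\sigma_{i}$ is the $i$-th singular value (with $\sigma_i > 0$ for $i \leq r_{\hat{X}}$).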
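
The mode-by-mode sum can be sketched directly with `np.linalg.svd`. This is a minimal illustration on synthetic data with hypothetical parameters. The tolerance used to decide which singular values count as nonzero is an assumption; it follows the same convention as the default in `np.linalg.matrix_rank`.

```python
import numpy as np

rng = np.random.default_rng(4)

n, p = 100, 4
X_hat = np.hstack([rng.normal(size=(n, p - 1)), np.ones((n, 1))])
theta_true = np.array([0.5, -1.0, 2.0, 1.0])   # hypothetical true parameters
y = X_hat @ theta_true + rng.normal(scale=0.1, size=n)

# Thin SVD: U is n x k, s holds the k singular values, Vt is k x p, with k = min(n, p)
U, s, Vt = np.linalg.svd(X_hat, full_matrices=False)

# Numerical rank: count singular values above a tolerance
tol = max(n, p) * np.finfo(float).eps * s[0]
r = int(np.sum(s > tol))

# theta_hat = sum_i (u_i^T y / sigma_i) v_i over the r retained modes
theta_hat = np.zeros(p)
for i in range(r):
    theta_hat += (U[:, i] @ y / s[i]) * Vt[i, :]   # row i of Vt is the vector v_i

theta_lstsq, *_ = np.linalg.lstsq(X_hat, y, rcond=None)
print(r, np.allclose(theta_hat, theta_lstsq))
```

The thin SVD (`full_matrices=False`) is enough here because only the first $r_{\hat{X}}$ columns of $\mathbf{U}$ and $\mathbf{V}$ enter the sum.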

### Key Insights:

* **Mode contribution decomposition**: The solution $\hat{\mathbf{\theta}} = \sum_{i=1}^{r}\left(\frac{\mathbf{u}_{i}^{\top}\mathbf{y}}{\sigma_{i}}\right)\mathbf{v}_{i}$ reveals how each mode contributes to parameter estimates. The vectors $\mathbf{v}_{i}$ represent orthogonal directions in parameter space while $\mathbf{u}_{i}$ represent corresponding data space directions, with $\mathbf{y}$ projected onto each mode and mapped accordingly.
* **Superior numerical stability**: SVD is computationally robust compared to computing $(\hat{\mathbf{X}}^{\top}\hat{\mathbf{X}})^{-1}\hat{\mathbf{X}}^{\top}\mathbf{y}$ directly, especially for rank-deficient or ill-conditioned matrices. It automatically handles redundant features and provides the minimum-norm solution without manual intervention.
* **Noise amplification diagnosis**: The term $\sigma_i^{-1}$ immediately reveals which modes amplify noise. When $\sigma_i$ is very small, measurement errors in $\mathbf{y}$ get magnified in those directions, helping identify unreliable components of the solution and motivating regularization techniques that dampen small singular values.

___

## SVD vs. Direct Methods: When and Why?

The SVD approach provides an alternative to the direct matrix inversion methods we studied in the previous notebook. Let's compare the two approaches. We can solve the linear regression problem using the direct approach:
$$\hat{\mathbf{\theta}} = (\hat{\mathbf{X}}^{\top}\hat{\mathbf{X}})^{-1}\hat{\mathbf{X}}^{\top}\mathbf{y}$$
or using the SVD approach:
$$\hat{\mathbf{\theta}} = \mathbf{V}\mathbf{\Sigma}^{\dagger}\mathbf{U}^{\top}\mathbf{y}$$

Why choose one over the other? Here are some considerations:

> __SVD versus direct method?__
> 
> **Rank-deficient matrices**: When $\hat{\mathbf{X}}$ doesn't have full rank, $\hat{\mathbf{X}}^{\top}\hat{\mathbf{X}}$ becomes non-invertible. SVD handles this gracefully with the pseudoinverse.
>
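> **Ill-conditioned problems**: Forming $\hat{\mathbf{X}}^{\top}\hat{\mathbf{X}}$ squares the condition number of $\hat{\mathbf{X}}$, so when $\hat{\mathbf{X}}$ has very small singular values (equivalently, $\hat{\mathbf{X}}^{\top}\hat{\mathbf{X}}$ has very small eigenvalues), direct inversion amplifies numerical errors. SVD works with $\hat{\mathbf{X}}$ itself and provides better numerical stability.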
> 
> **Understanding data structure**: SVD reveals the principal directions of variation in your data, which can be valuable for interpretation and dimensionality reduction.
>
> **Regularization insights**: The SVD formulation makes it clear how regularization affects different modes, providing intuition about what Ridge regression actually does.

### Computational trade-offs

Computing the SVD has computational complexity $O(\min(np^2, n^2p))$, while the direct method costs $O(np^2)$ to form $\hat{\mathbf{X}}^{\top}\hat{\mathbf{X}}$ plus $O(p^3)$ for the matrix inversion. For tall, thin matrices ($n \gg p$), both approaches scale as $O(np^2)$, but the direct method has a smaller constant factor and is often faster. SVD provides superior numerical properties when stability matters more than speed.
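
To make the stability argument concrete, the sketch below builds a deliberately ill-conditioned data matrix with two nearly collinear features (a hypothetical construction, with the $10^{-6}$ perturbation chosen arbitrarily) and compares the condition number of $\hat{\mathbf{X}}$ with that of $\hat{\mathbf{X}}^{\top}\hat{\mathbf{X}}$.

```python
import numpy as np

rng = np.random.default_rng(5)

n = 200
x1 = rng.normal(size=n)
x2 = x1 + 1e-6 * rng.normal(size=n)      # nearly collinear with x1 (hypothetical ill-conditioned case)
X_hat = np.column_stack([x1, x2, np.ones(n)])

s = np.linalg.svd(X_hat, compute_uv=False)     # singular values, largest first
cond_X = s[0] / s[-1]                          # 2-norm condition number of X
cond_G = np.linalg.cond(X_hat.T @ X_hat)       # condition number of the Gram matrix X^T X

print(f"cond(X)     = {cond_X:.3e}")
print(f"cond(X^T X) = {cond_G:.3e}")
print(f"cond(X)^2   = {cond_X**2:.3e}")
```

The condition number of the Gram matrix is roughly the square of that of $\hat{\mathbf{X}}$, which is why solving the normal equations can lose about twice as many digits of precision as an SVD-based solve on the same data.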

___

## Understanding the error model
The error model $\mathbf{\epsilon}$ captures the randomness in our observations that cannot be explained by the linear relationship in the model. In linear regression, we typically assume:
$$\begin{align*}
\mathbf{\epsilon} &\sim \mathcal{N}(\mathbf{0},\sigma^{2}\;\mathbf{I})
\end{align*}$$
This means each error term $\epsilon_i$ is independent, normally distributed with mean zero and constant variance $\sigma^2$. 

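> __Normality assumption:__ The analytical OLS solution we derived earlier does not itself require normality; it follows from minimizing the sum of squared errors. What normality adds is exact finite-sample inference: it allows us to construct confidence intervals, perform hypothesis tests, and quantify uncertainty using the t-distribution. Separately, by the Gauss-Markov theorem, OLS is the Best Linear Unbiased Estimator (BLUE) whenever the errors have zero mean, constant variance, and are uncorrelated, even without normality.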

> __No error correlation__: Our error model also implies that the errors for different observations are uncorrelated, i.e., $\text{Cov}(\epsilon_i, \epsilon_j) = 0$ for $i \neq j$. We can see this from the covariance matrix $\sigma^{2}\;\mathbf{I}$, which is diagonal with $\sigma^2$ on the diagonal and zeros elsewhere. This is crucial for the validity of many statistical inference techniques: with correlated errors the OLS estimates remain unbiased, but the standard errors computed below are wrong, which can lead to incorrect conclusions.

**Reality check**: While the normality assumption may not always hold in practice, many results remain approximately valid thanks to the Central Limit Theorem: parameter estimates are often approximately normal for large samples, even when individual errors are not.

### Estimating the error variance
Since the __true__ variance $\sigma^2$ is unknown, we estimate it with $\hat{\sigma}^2$, computed from the residuals $\mathbf{r} = \mathbf{y} - \hat{\mathbf{X}}\hat{\mathbf{\theta}}$ as:
$$\begin{align*}
\hat{\sigma}^{2} &= \frac{\lVert~\mathbf{r}~\rVert^{2}_{2}}{n-p} = \frac{1}{n-p}\sum_{i=1}^{n}r_i^2
\end{align*}$$
where $n$ is the number of observations, $p$ is the number of parameters, and $r_i = y_i - \hat{\mathbf{x}}_i^{\top}\hat{\mathbf{\theta}}$ is the $i$-th residual, i.e., the difference between the observed and predicted value for observation $i$.

> **Key insight**: We divide by $(n-p)$ instead of $n$ to account for the degrees of freedom "used up" by estimating $p$ parameters. This correction makes $\hat{\sigma}^2$ an unbiased estimator of the true variance.
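
The variance estimate can be sketched as follows on synthetic data where the true noise level is known. All sizes and parameter values are hypothetical. The block also computes the naive estimate that divides by $n$ for comparison.

```python
import numpy as np

rng = np.random.default_rng(6)

n, p, sigma = 500, 3, 0.5                      # hypothetical problem size and true noise level
X_hat = np.hstack([rng.normal(size=(n, p - 1)), np.ones((n, 1))])
theta_true = np.array([1.0, 2.0, -1.0])        # hypothetical true parameters
y = X_hat @ theta_true + rng.normal(scale=sigma, size=n)

theta_hat, *_ = np.linalg.lstsq(X_hat, y, rcond=None)
r = y - X_hat @ theta_hat                      # residual vector

sigma2_hat = (r @ r) / (n - p)                 # unbiased estimate (divides by n - p)
sigma2_naive = (r @ r) / n                     # divides by n; biased downward

print(f"true sigma^2     = {sigma**2:.4f}")
print(f"sigma2_hat (n-p) = {sigma2_hat:.4f}")
print(f"sigma2_naive (n) = {sigma2_naive:.4f}")
```

With $n = 500$ and $p = 3$ the two estimates are nearly identical; the correction matters most when $p$ is a sizeable fraction of $n$.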

### Parameter uncertainty (overdetermined case without regularization)
With our estimate of the variance $\hat{\sigma}^2$, we can now quantify the uncertainty in our parameter estimates $\text{Var}(\hat{\mathbf{\theta}})$. Let's consider the __overdetermined case without regularization__. In this case, the variance of the estimated parameters is given by:
$$
\begin{align*}
\hat{\mathbf{\theta}} &= \mathbf{\theta} + \left(\hat{\mathbf{X}}^{\top}\hat{\mathbf{X}}\right)^{-1}\hat{\mathbf{X}}^{\top}\mathbf{\epsilon}\\
\text{Var}(\hat{\mathbf{\theta}}) &= \text{Var}\left(\left(\hat{\mathbf{X}}^{\top}\hat{\mathbf{X}}\right)^{-1}\hat{\mathbf{X}}^{\top}\mathbf{\epsilon}\right)\quad\text{(since $\mathbf{\theta}$ is constant)}\\
\end{align*}
$$
For a random vector $\mathbf{A}\mathbf{z}$, where $\mathbf{z}$ is a random vector and $\mathbf{A}$ is a constant matrix, the variance is given by:
$$
\text{Var}(\mathbf{A}\mathbf{z}) = \mathbf{A}\;\text{Var}(\mathbf{z})\;\mathbf{A}^{\top}
$$
We assumed in our error model that $\text{Var}(\mathbf{\epsilon}) = \sigma^{2}\;\mathbf{I}$. Therefore:
$$
\begin{align*}
\text{Var}(\hat{\mathbf{\theta}}) &= \left(\hat{\mathbf{X}}^{\top}\hat{\mathbf{X}}\right)^{-1}\hat{\mathbf{X}}^{\top}\;\text{Var}(\mathbf{\epsilon})\;\hat{\mathbf{X}}\left(\hat{\mathbf{X}}^{\top}\hat{\mathbf{X}}\right)^{-1}\\
&= \left(\hat{\mathbf{X}}^{\top}\hat{\mathbf{X}}\right)^{-1}\hat{\mathbf{X}}^{\top}\;(\sigma^{2}\;\mathbf{I})\;\hat{\mathbf{X}}\left(\hat{\mathbf{X}}^{\top}\hat{\mathbf{X}}\right)^{-1}\\
&= \sigma^{2}\;\underbrace{\left(\hat{\mathbf{X}}^{\top}\hat{\mathbf{X}}\right)^{-1}\hat{\mathbf{X}}^{\top}\hat{\mathbf{X}}}_{=\;\mathbf{I}}\left(\hat{\mathbf{X}}^{\top}\hat{\mathbf{X}}\right)^{-1}\\
&= \sigma^{2}\;\left(\hat{\mathbf{X}}^{\top}\hat{\mathbf{X}}\right)^{-1}\quad\blacksquare\\
\end{align*}
$$
Finally, replacing the unknown $\sigma^2$ with our estimate $\hat{\sigma}^2$, we can compute the standard errors $\mathrm{SE}(\hat{\theta}_j) = \sqrt{\text{Var}(\hat{\theta}_j)}$ of the individual parameter estimates $\hat{\theta}_j$ (the square roots of the diagonal elements of the estimated covariance matrix):
$$
\begin{align*}
  \mathrm{SE}(\hat{\theta}_j) &= \sqrt{\;\hat{\sigma}^2\;\bigl[(\hat{\mathbf{X}}^\top\hat{\mathbf{X}})^{-1}\bigr]_{jj}\,}
\end{align*}
$$
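
Putting the pieces together, the sketch below computes the estimated covariance matrix and the standard errors for a synthetic problem. The data and parameters are hypothetical; in particular, the second entry of `theta_true` is set to zero so that one feature has no real effect. The explicit inverse is used only because the covariance matrix itself is needed and the problem is small and well conditioned.

```python
import numpy as np

rng = np.random.default_rng(7)

n, p, sigma = 200, 3, 0.5
X_hat = np.hstack([rng.normal(size=(n, p - 1)), np.ones((n, 1))])
theta_true = np.array([1.0, 0.0, -2.0])        # hypothetical; the second feature has no effect
y = X_hat @ theta_true + rng.normal(scale=sigma, size=n)

G_inv = np.linalg.inv(X_hat.T @ X_hat)         # (X^T X)^{-1}
theta_hat = G_inv @ X_hat.T @ y
r = y - X_hat @ theta_hat
sigma2_hat = (r @ r) / (n - p)

cov_theta = sigma2_hat * G_inv                 # estimated covariance matrix of theta_hat
se = np.sqrt(np.diag(cov_theta))               # standard errors: sqrt of the diagonal
ratio = theta_hat / se                         # estimate measured in units of its standard error

for j in range(p):
    print(f"theta_{j}: estimate = {theta_hat[j]: .3f}, SE = {se[j]:.3f}, estimate/SE = {ratio[j]: .2f}")
```

An estimate that is many standard errors away from zero is unlikely to be explained by noise alone, while a ratio near zero suggests the data cannot distinguish that parameter from zero.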
So why is this super cool? Because there is a ton of stuff that we can do with standard errors!

> Standard errors are __essential__ for:
> 
> * **Confidence intervals**: For large samples, $\hat{\theta}_j \pm 1.96 \cdot \mathrm{SE}(\hat{\theta}_j)$ gives an approximate 95% confidence interval for parameter $\theta_j$. The 1.96 comes from the standard normal distribution. However, for finite samples, we should use the t-distribution as shown below.
> * **Hypothesis testing**: Testing whether $\theta_j = 0$ (is feature $j$ significant?). We can compute a t-statistic $t_j = \hat{\theta}_j/{\mathrm{SE}(\hat{\theta}_j)}$ and compare it to a t-distribution to get a p-value (the probability of observing a value at least this extreme if the null hypothesis is true, i.e., if $\theta_j$ is actually 0).
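> * **Prediction intervals**: Quantifying uncertainty in new predictions. For a new augmented feature vector $\hat{\mathbf{x}}_0$, the interval combines the parameter uncertainty with the noise in the new observation: $\hat{\mathbf{x}}_0^{\top}\hat{\mathbf{\theta}} \pm t_{1-\alpha/2,\nu}\;\hat{\sigma}\sqrt{1 + \hat{\mathbf{x}}_0^{\top}(\hat{\mathbf{X}}^{\top}\hat{\mathbf{X}})^{-1}\hat{\mathbf{x}}_0}$, where $t_{1-\alpha/2,\nu}$ is a quantile of the t-distribution with $\nu = n-p$ degrees of freedom.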

### A note about confidence intervals
A __confidence interval__ gives a range of values that likely contains the true parameter value (which we don't know). For example, a 95% confidence interval means that if we were to repeat the experiment many times and compute the confidence interval each time, approximately 95% of those intervals would contain the true parameter value.

Let's dig in a bit deeper. Start with the studentized statistic $T_j$ for the __true__ parameter value $\theta_j$ (unknown):
$$
\begin{align*}
T_j \;&=\; \frac{\hat{\theta}_j-\theta_j}{\mathrm{SE}(\hat{\theta}_j)} 
\quad\text{with}\quad 
\mathrm{SE}(\hat{\theta}_j)=\hat{\sigma}\,\sqrt{\bigl[(\hat{\mathbf X}^\top\hat{\mathbf X})^{-1}\bigr]_{jj}}.
\end{align*}
$$
Under the (homoskedastic) normal-error model and $\hat\sigma^2$ estimate we developed above, the distribution of $T_j$ is a Student's $t$ with:
$$
\begin{align*}
T_j & \sim t_{\nu},\qquad \nu=n-p
\end{align*}
$$
where $n$ is the number of observations, $p$ is the number of parameters, and $\nu=n-p$ is the degrees of freedom. 
Let $c=t_{1-\alpha/2,\nu}$ where $\alpha$ is the significance level (e.g., $\alpha=0.05$ for a 95% CI) and $t_{1-\alpha/2,\nu}$ is the $(1-\alpha/2)$ quantile of the $t$-distribution with $\nu$ degrees of freedom. 
Then (given $\mathrm{SE}(\hat\theta_j)>0$):
$$
\Pr\!\big(-c \le T_j \le c\big)=1-\alpha
\quad\Longleftrightarrow\quad
\Pr\!\left(-c \le \frac{\hat\theta_j-\theta_j}{\mathrm{SE}(\hat\theta_j)} \le c\right)=1-\alpha.
$$
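Now we invert the inequality to get bounds on $\theta_j$ instead of $T_j$. Multiplying through by $\mathrm{SE}(\hat\theta_j)>0$, subtracting $\hat\theta_j$, and then multiplying by $-1$ (which flips the direction of both inequalities) gives: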
$$
\Pr\!\big(\ \hat\theta_j - c\;\mathrm{SE}(\hat\theta_j) \;\le\; \theta_j \;\le\; \hat\theta_j + c\;\mathrm{SE}(\hat\theta_j)\ \big)=1-\alpha.
$$
Replace $c = t_{1-\alpha/2,\nu}$ and plug in $\mathrm{SE}(\hat\theta_j)=\hat\sigma\,\sqrt{\bigl[(\hat{\mathbf X}^\top\hat{\mathbf X})^{-1}\bigr]_{jj}}$ to get the familiar (two-sided) CI:
$$
\boxed{
\hat{\theta}_j \pm t_{1-\alpha/2,\nu}\; \hat{\sigma}\; \sqrt{\bigl[(\hat{\mathbf X}^\top\hat{\mathbf X})^{-1}\bigr]_{jj}}\quad\blacksquare
}
$$

__Wow!__ Mind blown! Ok, so what about one-sided intervals? The same approach applies: start from $\Pr(T_j \le t_{1-\alpha,\nu})=1-\alpha$ (or the lower tail) and solve for $\theta_j$. The inversion step just converts bounds on the statistic into bounds on the parameter via a monotone transformation.
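
As a quick worked illustration with made-up numbers (not taken from the housing example), suppose $\hat{\theta}_j = 2.0$, $\mathrm{SE}(\hat{\theta}_j) = 0.5$, $n = 23$, and $p = 3$, so that $\nu = 20$. For $\alpha = 0.05$, the t-table values are $t_{0.975,20} \approx 2.086$ and $t_{0.95,20} \approx 1.725$:
$$
\begin{align*}
\text{two-sided:}\quad & \hat{\theta}_j \pm t_{0.975,20}\;\mathrm{SE}(\hat{\theta}_j) = 2.0 \pm (2.086)(0.5) = 2.0 \pm 1.043
\quad\Longrightarrow\quad 0.957 \le \theta_j \le 3.043 \\
\text{one-sided:}\quad & \theta_j \ge \hat{\theta}_j - t_{0.95,20}\;\mathrm{SE}(\hat{\theta}_j) = 2.0 - (1.725)(0.5) \approx 1.14
\end{align*}
$$
The two-sided interval excludes zero, which agrees with the t-statistic $t_j = 2.0/0.5 = 4.0$ exceeding $2.086$. Using the large-sample value $1.96$ instead would give the slightly narrower interval $2.0 \pm 0.98$, which understates the uncertainty for a sample this small.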

Let's explore these ideas with an example.

> __Example__
>
> [▶ Let's build a linear model of housing prices](CHEME-5800-L7c-Example-HousingPriceModel-Fall-2025.ipynb). In this example, students will build a linear regression model to predict housing prices based on various features. This will help us understand how to apply ordinary and regularized least squares methods in a practical context.

___

## Lab
In lab `L7d`, we will continue with our home price example, where we contrast the direct method of computing the model parameters with using singular value decomposition. 

## Summary

In this notebook, we've explored the mathematical foundations of linear regression parameter estimation:

> __Key takeaways:__
>
> 1. **OLS solutions for different regimes**: For overdetermined systems ($n \gg p$), OLS gives $\hat{\mathbf{\theta}} = (\hat{\mathbf{X}}^{\top}\hat{\mathbf{X}})^{-1}\hat{\mathbf{X}}^{\top}\mathbf{y}$, an unbiased estimator with $\mathbb{E}[\hat{\mathbf{\theta}}] = \mathbf{\theta}$. For underdetermined systems ($n \ll p$), the least-norm solution $\hat{\mathbf{\theta}} = \hat{\mathbf{X}}^{\top}(\hat{\mathbf{X}}\hat{\mathbf{X}}^{\top})^{-1}\mathbf{y}$ selects the smallest parameter vector from infinitely many solutions, inherently performing L2 regularization.
> 2. **SVD for numerical stability**: The singular value decomposition $\hat{\mathbf{X}} = \mathbf{U}\mathbf{\Sigma}\mathbf{V}^{\top}$ yields $\hat{\mathbf{\theta}} = \mathbf{V}\mathbf{\Sigma}^{\dagger}\mathbf{U}^{\top}\mathbf{y} = \sum_{i=1}^{r}(\mathbf{u}_{i}^{\top}\mathbf{y}/\sigma_{i})\mathbf{v}_{i}$, providing superior numerical stability for rank-deficient or ill-conditioned matrices while revealing data structure and noise amplification through singular values.
> 3. **Statistical inference and uncertainty quantification**: Under the error model $\mathbf{\epsilon}\sim\mathcal{N}(\mathbf{0},\sigma^{2}\mathbf{I})$, we estimate variance as $\hat{\sigma}^{2} = \|\mathbf{r}\|^{2}_{2}/(n-p)$ and compute parameter variance $\text{Var}(\hat{\mathbf{\theta}}) = \sigma^{2}(\hat{\mathbf{X}}^{\top}\hat{\mathbf{X}})^{-1}$. Standard errors $\mathrm{SE}(\hat{\theta}_j) = \hat{\sigma}\sqrt{[(\hat{\mathbf{X}}^{\top}\hat{\mathbf{X}})^{-1}]_{jj}}$ enable t-based confidence intervals and hypothesis tests for rigorous statistical inference about model parameters.

### What's Next?

In the following lessons, we'll explore practical implementation considerations, regularization techniques like Ridge regression for preventing overfitting, and model evaluation strategies, building upon these mathematical foundations to develop more robust and generalizable predictive models.

___