## Use the books!

This week deals with various mean values and variances in linear regression methods (here it may be useful to look up chapter 3, equation (3.8) of [Trevor Hastie, Robert Tibshirani, Jerome H. Friedman, The Elements of Statistical Learning, Springer](https://www.springer.com/gp/book/9780387848570)).

For more discussions on Ridge regression and calculation of expectation values, [Wessel van Wieringen's](https://arxiv.org/abs/1509.09169) article is highly recommended.

The exercises this week are also a part of project 1 and can be reused in the theory part of the project.

### Definitions

We assume that there exists a continuous function $f(\boldsymbol{x})$ and a normally distributed error $\boldsymbol{\varepsilon}\sim N(0, \sigma^2)$ which together describe our data


$$
A1: \quad \boldsymbol{y} = f(\boldsymbol{x})+\boldsymbol{\varepsilon}
$$


We further assume that this continuous function can be modeled with a linear model $\mathbf{\tilde{y}}$ of some features $\mathbf{X}$.


$$
A2: \quad \boldsymbol{y} = \boldsymbol{\tilde{y}} + \boldsymbol{\varepsilon} = \boldsymbol{X}\boldsymbol{\beta} +\boldsymbol{\varepsilon}
$$


We therefore get that our data $\boldsymbol{y}$ has an expectation value $\boldsymbol{X}\boldsymbol{\beta}$ and variance $\sigma^2$, that is $\boldsymbol{y}$ follows a normal distribution with mean value $\boldsymbol{X}\boldsymbol{\beta}$ and variance $\sigma^2$.


## Exercise 1: Expectation values for ordinary least squares expressions


**a)** With the expressions for the optimal parameters $\boldsymbol{\hat{\beta}_{OLS}}$ show that


$$
\mathbb{E}(\boldsymbol{\hat{\beta}_{OLS}}) = \boldsymbol{\beta}.
$$


We start with the definition of the OLS estimator

$$
\hat{\boldsymbol{\beta}}_{OLS} = (X^\top X)^{-1} X^\top \boldsymbol{y}.
$$

From the data generating process we have

$$
\boldsymbol{y} = X \boldsymbol{\beta} + \boldsymbol{\epsilon}, \quad 
\boldsymbol{\epsilon} \sim \mathcal{N}(0, \sigma^2 I).
$$

Insert this expression for $\boldsymbol{y}$ into the estimator. This step uses assumption (A2): the design matrix $X$ and the true $\boldsymbol{\beta}$ are fixed, so the noise $\boldsymbol{\epsilon}$ is the only stochastic quantity.

$$
\hat{\boldsymbol{\beta}}_{OLS} = (X^\top X)^{-1} X^\top (X \boldsymbol{\beta} + \boldsymbol{\epsilon}).
$$

Expand the product:

$$
\hat{\boldsymbol{\beta}}_{OLS} 
= (X^\top X)^{-1} X^\top X \boldsymbol{\beta} \;+\; (X^\top X)^{-1} X^\top \boldsymbol{\epsilon}.
$$

Simplify:

$$
\hat{\boldsymbol{\beta}}_{OLS} 
= \boldsymbol{\beta} + (X^\top X)^{-1} X^\top \boldsymbol{\epsilon}.
$$

Now take the expectation of both sides:

$$
\mathbb{E}[\hat{\boldsymbol{\beta}}_{OLS}]
= \mathbb{E}[\boldsymbol{\beta} + (X^\top X)^{-1} X^\top \boldsymbol{\epsilon}].
$$

By linearity of expectation, and since $\boldsymbol{\beta}$ and $X$ are non-stochastic while $\mathbb{E}[\boldsymbol{\epsilon}] = 0$:

$$
\mathbb{E}[\hat{\boldsymbol{\beta}}_{OLS}] 
= \boldsymbol{\beta} + (X^\top X)^{-1} X^\top \,\mathbb{E}[\boldsymbol{\epsilon}],
$$

$$
\mathbb{E}[\hat{\boldsymbol{\beta}}_{OLS}] = \boldsymbol{\beta}.
$$

Hence, the OLS estimator is **unbiased**:

$$
\mathbb{E}[\hat{\boldsymbol{\beta}}_{OLS}] = \boldsymbol{\beta}.
$$
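The result can also be checked numerically. The following is a minimal sketch, not part of the original exercise: the sizes `n`, `p`, the noise level `sigma`, the vector `beta` and the random design `X` are all illustrative assumptions. It holds $X$ fixed, draws many independent noise realizations, and averages the resulting OLS estimates.

```python
import numpy as np

rng = np.random.default_rng(2025)   # fixed seed, arbitrary choice

n, p = 50, 3                        # illustrative sizes (assumed)
sigma = 0.5                         # noise standard deviation (assumed)
X = rng.normal(size=(n, p))         # design matrix, held fixed across trials
beta = np.array([1.0, -2.0, 0.5])   # "true" parameters (assumed)

# (X^T X)^{-1} X^T, computed once because X does not change
ols_map = np.linalg.solve(X.T @ X, X.T)

n_trials = 20_000
# each row of eps is one independent noise realization
eps = rng.normal(scale=sigma, size=(n_trials, n))
Y = X @ beta + eps                  # each row is one data set y = X beta + eps
beta_hats = Y @ ols_map.T           # each row is one OLS estimate

beta_mean = beta_hats.mean(axis=0)
print("true beta:         ", beta)
print("mean of estimates: ", beta_mean)
```

Up to Monte Carlo error, the average of the estimates should agree with `beta`, in line with $\mathbb{E}[\hat{\boldsymbol{\beta}}_{OLS}] = \boldsymbol{\beta}$.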


**b)** Show that the variance of $\boldsymbol{\hat{\beta}_{OLS}}$ is


$$
\mathbf{Var}(\boldsymbol{\hat{\beta}_{OLS}}) = \sigma^2 \, (\mathbf{X}^{T} \mathbf{X})^{-1}.
$$


We start from the expression for the OLS estimator obtained above:

$$
\hat{\boldsymbol{\beta}}_{OLS} 
= \boldsymbol{\beta} + (X^\top X)^{-1} X^\top \boldsymbol{\epsilon}.
$$

The variance operator ignores constants, so only the second term contributes. Thus,

$$
\operatorname{Var}(\hat{\boldsymbol{\beta}}_{OLS}) 
= \operatorname{Var}\!\left( (X^\top X)^{-1} X^\top \boldsymbol{\epsilon} \right).
$$

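Apply the rule $\operatorname{Var}(A\boldsymbol{\epsilon}) = A \, \operatorname{Var}(\boldsymbol{\epsilon}) \, A^\top$ with the non-random matrix $A = (X^\top X)^{-1} X^\top$. Since $(X^\top X)^{-1}$ is symmetric, $A^\top = X (X^\top X)^{-1}$: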

$$
\operatorname{Var}(\hat{\boldsymbol{\beta}}_{OLS}) 
= (X^\top X)^{-1} X^\top \, \operatorname{Var}(\boldsymbol{\epsilon}) \, X (X^\top X)^{-1}.
$$

From the data generating assumption we know

$$
\operatorname{Var}(\boldsymbol{\epsilon}) = \sigma^2 I.
$$

Substitute this:

$$
\operatorname{Var}(\hat{\boldsymbol{\beta}}_{OLS}) 
= (X^\top X)^{-1} X^\top \, (\sigma^2 I) \, X (X^\top X)^{-1}.
$$

Simplify:

$$
\operatorname{Var}(\hat{\boldsymbol{\beta}}_{OLS}) 
= \sigma^2 (X^\top X)^{-1} X^\top X (X^\top X)^{-1}.
$$

Since $X^\top X$ is invertible (under the Gauss-Markov assumption of no perfect multicollinearity, i.e. $X$ has full column rank; Wikipedia contributors, 2025), the product $(X^\top X)^{-1} X^\top X$ reduces to the identity:

$$
\operatorname{Var}(\hat{\boldsymbol{\beta}}_{OLS}) 
= \sigma^2 (X^\top X)^{-1}.
$$

Hence, we have shown that:

$$
\operatorname{Var}(\hat{\boldsymbol{\beta}}_{OLS}) = \sigma^2 (X^\top X)^{-1}.
$$
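As a sanity check, the formula can be compared with the empirical covariance of simulated OLS estimates. This is a sketch under assumed values (`n`, `p`, `sigma`, `beta` and the random `X` are made up for illustration), not a prescribed procedure.

```python
import numpy as np

rng = np.random.default_rng(7)      # fixed seed, arbitrary choice

n, p = 40, 3                        # illustrative sizes (assumed)
sigma = 1.0                         # noise standard deviation (assumed)
X = rng.normal(size=(n, p))         # fixed design matrix
beta = np.array([2.0, 0.0, -1.0])   # "true" parameters (assumed)

XtX_inv = np.linalg.inv(X.T @ X)
cov_theory = sigma**2 * XtX_inv     # sigma^2 (X^T X)^{-1}

n_trials = 50_000
eps = rng.normal(scale=sigma, size=(n_trials, n))
# rows are beta_hat^T = y^T X (X^T X)^{-1}, using symmetry of the inverse
beta_hats = (X @ beta + eps) @ X @ XtX_inv

# empirical covariance matrix of the estimates (columns are variables)
cov_empirical = np.cov(beta_hats, rowvar=False)

print("largest absolute deviation:",
      np.max(np.abs(cov_empirical - cov_theory)))
```

The deviation should shrink roughly like $1/\sqrt{\text{n\_trials}}$ as more noise realizations are drawn.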

Reference:

Wikipedia contributors. (2025, March 24). Gauss–Markov theorem. In Wikipedia. Retrieved September 10, 2025, from https://en.wikipedia.org/w/index.php?title=Gauss%E2%80%93Markov_theorem&oldid=1282157188

## Exercise 2: Expectation values for Ridge regression


**a)** With the expressions for the optimal parameters $\boldsymbol{\hat{\beta}_{Ridge}}$ show that


$$
\mathbb{E} \big[ \hat{\boldsymbol{\beta}}^{\mathrm{Ridge}} \big]=(\mathbf{X}^{T} \mathbf{X} + \lambda \mathbf{I}_{pp})^{-1} (\mathbf{X}^{\top} \mathbf{X})\boldsymbol{\beta}
$$


We start with the definition of the Ridge estimator:

$$
\hat{\boldsymbol{\beta}}_{Ridge} 
= (X^\top X + \lambda I_p)^{-1} X^\top \boldsymbol{y},
$$

where $\lambda > 0$ is the regularization parameter and $I_p$ is the $p \times p$ identity matrix.

From the data generating process we have

$$
\boldsymbol{y} = X \boldsymbol{\beta} + \boldsymbol{\epsilon}, 
\quad \boldsymbol{\epsilon} \sim \mathcal{N}(0, \sigma^2 I).
$$

Insert this into the estimator:

$$
\hat{\boldsymbol{\beta}}_{Ridge} 
= (X^\top X + \lambda I_p)^{-1} X^\top (X \boldsymbol{\beta} + \boldsymbol{\epsilon}).
$$

Expand:

$$
\hat{\boldsymbol{\beta}}_{Ridge} 
= (X^\top X + \lambda I_p)^{-1} X^\top X \boldsymbol{\beta} 
\;+\; (X^\top X + \lambda I_p)^{-1} X^\top \boldsymbol{\epsilon}.
$$

Now take the expectation, using linearity and $\mathbb{E}[\boldsymbol{\epsilon}] = 0$, so that the last term vanishes:

$$
\mathbb{E}[\hat{\boldsymbol{\beta}}_{Ridge}]
= (X^\top X + \lambda I_p)^{-1} X^\top X \boldsymbol{\beta}.
$$

Thus we have shown:

$$
\mathbb{E}[\hat{\boldsymbol{\beta}}_{Ridge}] 
= (X^\top X + \lambda I_p)^{-1} (X^\top X) \boldsymbol{\beta}.
$$

Note: comparing with $\mathbb{E}[\hat{\boldsymbol{\beta}}_{OLS}] = \boldsymbol{\beta}$ and assuming $\boldsymbol{\beta} \neq 0$, we see by inspection that

$$
\mathbb{E}[\hat{\boldsymbol{\beta}}_{Ridge}] = \mathbb{E}[\hat{\boldsymbol{\beta}}_{OLS}] \iff \lambda = 0.
$$

Hence for any $\lambda > 0$ the Ridge estimator is biased.
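The size of this bias can be evaluated directly from the closed-form expectation. The sketch below uses an assumed random design `X`, an assumed `beta` and a hypothetical helper `expected_ridge`; it only evaluates the formula derived above for a few values of $\lambda$.

```python
import numpy as np

rng = np.random.default_rng(0)             # fixed seed, arbitrary choice

n, p = 30, 4                               # illustrative sizes (assumed)
X = rng.normal(size=(n, p))                # fixed design matrix
beta = np.array([1.0, 2.0, -1.0, 0.5])     # "true" parameters (assumed)
XtX = X.T @ X

def expected_ridge(lmbda):
    """Closed-form E[beta_ridge] = (X^T X + lambda I)^{-1} X^T X beta."""
    return np.linalg.solve(XtX + lmbda * np.eye(p), XtX @ beta)

lambdas = [0.0, 0.1, 1.0, 10.0, 100.0]
bias_norms = [np.linalg.norm(expected_ridge(l) - beta) for l in lambdas]

for l, b in zip(lambdas, bias_norms):
    print(f"lambda = {l:6.1f}   ||E[beta_ridge] - beta|| = {b:.4f}")
```

For $\lambda = 0$ the bias is zero up to rounding, and it grows with $\lambda$ as the expected estimate is pulled towards zero.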

**b)** Why do we say that Ridge regression gives a biased estimate? Is this a problem?


From a), $\mathbb{E}[\hat{\boldsymbol{\beta}}_{Ridge}] \neq \boldsymbol{\beta}$ for any $\lambda > 0$ (and $\boldsymbol{\beta} \neq 0$), so the Ridge estimate is biased by definition. According to the bias–variance tradeoff (Hastie, Tibshirani, and Friedman, 2009, p.223), introducing bias can nevertheless be beneficial. Ridge adds bias by shrinking the coefficients, but in turn it reduces variance, notably when predictors are highly correlated and $X^\top X$ is close to singular.

Its strength lies in settings with multicollinearity or high dimensionality, where it trades some bias for lower variance. In practice this is valuable: measurement noise, limited sample sizes, or strongly correlated features can make the OLS estimator overly sensitive and unstable. Ridge stabilizes the solution and can lower the mean squared error even though it is biased.
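To make the tradeoff concrete, the mean squared error of an estimator, $\operatorname{tr}\operatorname{Var}(\hat{\boldsymbol{\beta}}) + \lVert \mathbb{E}[\hat{\boldsymbol{\beta}}] - \boldsymbol{\beta} \rVert^2$, can be evaluated exactly from the closed-form expectation and variance of the Ridge estimator. The sketch below is an illustration under assumed values: two nearly collinear features, `sigma = 1`, `beta = [1, 1]`, and a hypothetical `lmbda = 0.5`.

```python
import numpy as np

rng = np.random.default_rng(1)     # fixed seed, arbitrary choice

n = 100                            # number of samples (assumed)
sigma = 1.0                        # noise standard deviation (assumed)
beta = np.array([1.0, 1.0])        # "true" parameters (assumed)

# two nearly collinear features, so X^T X is close to singular
x1 = rng.normal(size=n)
x2 = x1 + 0.01 * rng.normal(size=n)
X = np.column_stack([x1, x2])
XtX = X.T @ X

def estimator_mse(lmbda):
    """trace(Var) + ||bias||^2 of the ridge estimator; lmbda = 0 gives OLS."""
    W = np.linalg.inv(XtX + lmbda * np.eye(2))
    variance = sigma**2 * np.trace(W @ XtX @ W.T)
    bias = W @ XtX @ beta - beta
    return variance + bias @ bias

mse_ols = estimator_mse(0.0)
mse_ridge = estimator_mse(0.5)     # hypothetical regularization strength

print(f"MSE of OLS estimator:   {mse_ols:.3f}")
print(f"MSE of Ridge estimator: {mse_ridge:.3f}")
```

In this nearly collinear setting the small bias introduced by Ridge is outweighed by the reduction in variance. The effect depends on $\lambda$: a value that is too large would eventually make the bias term dominate.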

Reference:

Trevor Hastie, Robert Tibshirani, and Jerome Friedman. 
The Elements of Statistical Learning: Data Mining, Inference, and Prediction. 
Second Edition. Springer, 2009.



**c)** Show that the variance is


$$
\mathbf{Var}[\hat{\boldsymbol{\beta}}^{\mathrm{Ridge}}]=\sigma^2[  \mathbf{X}^{T} \mathbf{X} + \lambda \mathbf{I} ]^{-1}  \mathbf{X}^{T}\mathbf{X} \{ [  \mathbf{X}^{\top} \mathbf{X} + \lambda \mathbf{I} ]^{-1}\}^{T}
$$


We see that if the parameter $\lambda$ goes to infinity then the variance of the Ridge parameters $\boldsymbol{\beta}$ goes to zero.


The proof follows the same structure as the previous ones.
We start from the expression of the Ridge estimator:

$$
\hat{\boldsymbol{\beta}}_{Ridge} 
= (X^\top X + \lambda I_p)^{-1} X^\top \boldsymbol{y}.
$$

From the data generating process we have

$$
\boldsymbol{y} = X \boldsymbol{\beta} + \boldsymbol{\epsilon}, 
\quad \boldsymbol{\epsilon} \sim \mathcal{N}(0, \sigma^2 I).
$$

Insert this into the estimator:

$$
\hat{\boldsymbol{\beta}}_{Ridge} 
= (X^\top X + \lambda I_p)^{-1} X^\top (X \boldsymbol{\beta} + \boldsymbol{\epsilon}).
$$

Expand:

$$
\hat{\boldsymbol{\beta}}_{Ridge} 
= (X^\top X + \lambda I_p)^{-1} X^\top X \boldsymbol{\beta} 
\;+\; (X^\top X + \lambda I_p)^{-1} X^\top \boldsymbol{\epsilon}.
$$

The first term is deterministic by assumption, hence only the second contributes to the variance:

$$
\operatorname{Var}(\hat{\boldsymbol{\beta}}_{Ridge}) 
= \operatorname{Var}\!\left( (X^\top X + \lambda I_p)^{-1} X^\top \boldsymbol{\epsilon} \right).
$$

Using the rule $\operatorname{Var}(A\boldsymbol{\epsilon}) = A \, \operatorname{Var}(\boldsymbol{\epsilon}) \, A^\top$, valid for any non-random matrix $A$, here with $A = (X^\top X + \lambda I_p)^{-1} X^\top$:

$$
\operatorname{Var}(\hat{\boldsymbol{\beta}}_{Ridge}) 
= (X^\top X + \lambda I_p)^{-1} X^\top \, \operatorname{Var}(\boldsymbol{\epsilon}) \, X \big[(X^\top X + \lambda I_p)^{-1}\big]^\top.
$$

Since $\operatorname{Var}(\boldsymbol{\epsilon}) = \sigma^2 I$:

$$
\operatorname{Var}(\hat{\boldsymbol{\beta}}_{Ridge}) 
= \sigma^2 (X^\top X + \lambda I_p)^{-1} X^\top X \big[(X^\top X + \lambda I_p)^{-1}\big]^\top.
$$


Note:


We can now see that as $\lambda \to \infty$, the shrinkage term $(X^\top X + \lambda I_p)^{-1}$ goes to zero, so

$$
\lim_{\lambda \to \infty} \operatorname{Var}(\hat{\boldsymbol{\beta}}_{Ridge}) = 0.
$$

This illustrates the general bias-variance tradeoff: as $\lambda$ increases the bias grows, while the variance of the estimator shrinks.
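The limiting behaviour can be inspected by evaluating the variance formula for increasing $\lambda$. This is a small sketch with an assumed random design `X`, `sigma = 1` and a hypothetical helper `ridge_cov`; the total variance is summarized by the trace of the covariance matrix.

```python
import numpy as np

rng = np.random.default_rng(3)     # fixed seed, arbitrary choice

n, p = 30, 4                       # illustrative sizes (assumed)
sigma = 1.0                        # noise standard deviation (assumed)
X = rng.normal(size=(n, p))        # fixed design matrix
XtX = X.T @ X

def ridge_cov(lmbda):
    """sigma^2 (X^T X + lambda I)^{-1} X^T X [(X^T X + lambda I)^{-1}]^T."""
    W = np.linalg.inv(XtX + lmbda * np.eye(p))
    return sigma**2 * W @ XtX @ W.T

lambdas = [0.0, 1.0, 10.0, 100.0, 1e4, 1e6]
total_var = [np.trace(ridge_cov(l)) for l in lambdas]

for l, v in zip(lambdas, total_var):
    print(f"lambda = {l:9.1f}   trace of Var = {v:.3e}")
```

At $\lambda = 0$ the expression reduces to the OLS variance $\sigma^2 (X^\top X)^{-1}$, and the trace decreases monotonically towards zero as $\lambda$ grows.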


## Exercise 3: Deriving the expression for the Bias-Variance Trade-off


The aim of this exercise is to derive the equations for the bias-variance tradeoff to be used in project 1.

The parameters $\boldsymbol{\hat{\beta}_{OLS}}$ are found by optimizing the mean squared error via the so-called cost function


$$
C(\boldsymbol{X},\boldsymbol{\beta}) =\frac{1}{n}\sum_{i=0}^{n-1}(y_i-\tilde{y}_i)^2=\mathbb{E}\left[(\boldsymbol{y}-\boldsymbol{\tilde{y}})^2\right]
$$


**a)** Show that you can rewrite this into an expression which contains

- the variance of the model (the variance term)
- the expected deviation of the mean of the model from the true data (the bias term)
- the variance of the noise

In other words, show that:


$$
\mathbb{E}\left[(\boldsymbol{y}-\boldsymbol{\tilde{y}})^2\right]=\mathrm{Bias}[\tilde{y}]+\mathrm{var}[\tilde{y}]+\sigma^2,
$$


with


$$
\mathrm{Bias}[\tilde{y}]=\mathbb{E}\left[\left(\boldsymbol{y}-\mathbb{E}\left[\boldsymbol{\tilde{y}}\right]\right)^2\right],
$$


and


$$
\mathrm{var}[\tilde{y}]=\mathbb{E}\left[\left(\tilde{\boldsymbol{y}}-\mathbb{E}\left[\boldsymbol{\tilde{y}}\right]\right)^2\right]=\frac{1}{n}\sum_i(\tilde{y}_i-\mathbb{E}\left[\boldsymbol{\tilde{y}}\right])^2.
$$


---
We want to decompose the mean squared error, writing $\hat{y}$ for the model prediction $\tilde{y}$:

$$
\mathbb{E}[(y - \hat{y})^2].
$$

Start with the data generating process:

$$
y = f(x) + \epsilon, \quad \epsilon \sim \mathcal{N}(0, \sigma^2).
$$

So,

$$
\mathbb{E}[(y - \hat{y})^2] = \mathbb{E}[(f(x) + \epsilon - \hat{y})^2].
$$

Expand:

$$
\mathbb{E}[(f(x) - \hat{y} + \epsilon)^2] 
= \mathbb{E}[(f(x) - \hat{y})^2] + 2\mathbb{E}[(f(x) - \hat{y})\epsilon] + \mathbb{E}[\epsilon^2].
$$

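Since $\epsilon$ has zero mean and is independent of $\hat{y}$ (the noise in a new observation is independent of the training data used to fit the model), the cross-term vanishes, and $\mathbb{E}[\epsilon^2] = \sigma^2$: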

$$
\mathbb{E}[(y - \hat{y})^2] = \mathbb{E}[(f(x) - \hat{y})^2] + \sigma^2.
$$

---

Now decompose the first term by adding and subtracting $\mathbb{E}[\hat{y}]$ (Hastie, Tibshirani, and Friedman, 2009, p.223):

$$
\mathbb{E}[(f(x) - \hat{y})^2] 
= \mathbb{E}\big[(f(x) - \mathbb{E}[\hat{y}] + \mathbb{E}[\hat{y}] - \hat{y})^2\big].
$$

Expand:

$$
= \mathbb{E}\big[(f(x) - \mathbb{E}[\hat{y}])^2\big] 
+ \mathbb{E}\big[(\hat{y} - \mathbb{E}[\hat{y}])^2\big] 
+ 2\mathbb{E}\big[(f(x) - \mathbb{E}[\hat{y}])(\mathbb{E}[\hat{y}] - \hat{y})\big].
$$

The cross-term vanishes: $(f(x) - \mathbb{E}[\hat{y}])$ is deterministic and can be pulled out of the expectation, leaving the factor $\mathbb{E}\big[\mathbb{E}[\hat{y}] - \hat{y}\big] = 0$. So:

$$
\mathbb{E}[(f(x) - \hat{y})^2] 
= (f(x) - \mathbb{E}[\hat{y}])^2 + \operatorname{Var}(\hat{y}).
$$

---

Thus the full decomposition is

$$
\mathbb{E}[(y - \hat{y})^2] 
= (f(x) - \mathbb{E}[\hat{y}])^2 + \operatorname{Var}(\hat{y}) + \sigma^2.
$$

---

Identify terms:

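- Bias: $\operatorname{Bias}[\hat{y}]^2 = (f(x) - \mathbb{E}[\hat{y}])^2$. It is measured against the noise-free $f(x)$; measuring it against the noisy $y$ instead would add an extra $\sigma^2$, since $\mathbb{E}[(y - \mathbb{E}[\hat{y}])^2] = (f(x) - \mathbb{E}[\hat{y}])^2 + \sigma^2$.
- Variance: $\operatorname{Var}[\hat{y}] = \mathbb{E}[(\hat{y} - \mathbb{E}[\hat{y}])^2]$
- Irreducible noise: $\sigma^2$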

So:

$$
\mathbb{E}[(y - \hat{y})^2] 
= \operatorname{Bias}[\hat{y}]^2 + \operatorname{Var}[\hat{y}] + \sigma^2.
$$
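The decomposition can be verified by simulation. The sketch below is one possible setup, with everything in it assumed for illustration: a hypothetical true function `f`, a cubic polynomial model, a fixed training grid, and fresh noise on the test targets. It estimates the three terms from repeated training sets and compares their sum with the simulated mean squared error.

```python
import numpy as np

rng = np.random.default_rng(42)    # fixed seed, arbitrary choice

def f(x):
    """Hypothetical 'true' function, chosen only for illustration."""
    return np.sin(2 * np.pi * x)

sigma = 0.3                        # noise standard deviation (assumed)
n_train, degree = 30, 3            # training size and polynomial degree (assumed)
n_trials = 5000                    # number of simulated training sets

x_train = np.linspace(0, 1, n_train)
x_test = np.linspace(0.05, 0.95, 20)
X_train = np.vander(x_train, degree + 1)
X_test = np.vander(x_test, degree + 1)

# Fixed design: the map y_train -> beta_hat is the same matrix in every trial
fit_map = np.linalg.pinv(X_train)

# each row of Y_train is one noisy training set
Y_train = f(x_train) + rng.normal(scale=sigma, size=(n_trials, n_train))
preds = Y_train @ fit_map.T @ X_test.T     # predictions, shape (n_trials, n_test)

# test targets carry fresh noise, independent of the training noise
Y_test = f(x_test) + rng.normal(scale=sigma, size=preds.shape)

mse = np.mean((Y_test - preds) ** 2)
bias2 = np.mean((f(x_test) - preds.mean(axis=0)) ** 2)
variance = np.mean(preds.var(axis=0))

print(f"MSE                     = {mse:.4f}")
print(f"bias^2 + var + sigma^2  = {bias2 + variance + sigma**2:.4f}")
```

The two printed numbers should agree up to Monte Carlo error. The split of $\mathbb{E}[(f(x) - \hat{y})^2]$ into squared bias and variance holds exactly for the sample averages, while the $\sigma^2$ term is only matched approximately because the test noise is sampled.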


**b)** Explain what the terms mean and discuss their interpretations.


The decomposition above splits predictive error into three conceptual parts. 
- Irreducible error: the noise variance $\sigma^2$ is a property of the data, not of the model. No matter how well we estimate $f(x)$, this term cannot be reduced; it vanishes only if the data are noise-free ($\sigma^2 = 0$).
- Bias: systematic error, i.e. the squared distance between the expected prediction and the "true function" (the underlying data-generating mechanism). Importantly, if the model class (e.g. a linear model) cannot represent the true function (e.g. a nonlinear function), the bias is high. The more flexible the model is, the more it can reduce bias. For example, for basis expansions (polynomials, splines), the model bias can be made arbitrarily small by letting the number of basis functions $p$ grow, since such bases are dense in the space of continuous functions (Hastie et al., 2009, p. 233).
- Variance: sensitivity of the estimator to sampling fluctuations. It measures how much $\hat{y}$ (determined by the estimator and noise) varies around its mean $\mathbb{E}[\hat{y}]$ when the training data changes. High variance means the model overfits to noise in the training set.

Because it combines all three terms, the total mean squared error is in most cases the main metric for model assessment: it captures the tradeoff between bias and variance in a single number.
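How the terms move against each other as model flexibility grows can be sketched for polynomial OLS fits. The example makes several assumptions that are not in the exercise: a fixed design, predictions evaluated at the training inputs, the Runge function as a hypothetical true function `f`, and `sigma = 0.2`. Under these assumptions $\mathbb{E}[\hat{\boldsymbol{y}}] = H f$ with the projection matrix $H = X (X^\top X)^{-1} X^\top$, and the average variance is $\sigma^2 \operatorname{tr}(H)/n = \sigma^2 p / n$, so both terms can be computed without simulation.

```python
import numpy as np

def f(x):
    """Hypothetical 'true' function (Runge function), for illustration only."""
    return 1.0 / (1.0 + 25.0 * x**2)

sigma = 0.2                        # noise standard deviation (assumed)
n = 60                             # number of training points (assumed)
x = np.linspace(-1, 1, n)

degrees = list(range(0, 9))
bias2, variance = [], []
for d in degrees:
    X = np.vander(x, d + 1)                    # polynomial features up to degree d
    Q, _ = np.linalg.qr(X)                     # orthonormal basis of the column space
    f_proj = Q @ (Q.T @ f(x))                  # E[y_hat] = H f with H = Q Q^T
    bias2.append(np.mean((f(x) - f_proj) ** 2))
    variance.append(sigma**2 * (d + 1) / n)    # sigma^2 * trace(H) / n

for d, b, v in zip(degrees, bias2, variance):
    print(f"degree {d}: bias^2 = {b:.4f}, variance = {v:.4f}, "
          f"total = {b + v + sigma**2:.4f}")
```

The squared bias never increases when the degree is raised, because the model spaces are nested, while the variance grows linearly with the number of parameters. The total error is smallest where the two effects balance.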