# Bias-Variance Decomposition of Mean Squared Error
Let:

* $x$ be a fixed input
* $y = f(x) + \varepsilon$ where $\varepsilon$ is noise with $\mathbb{E}[\varepsilon] = 0$ and $\operatorname{Var}(\varepsilon) = \sigma^2$
* $\hat{f}_D(x)$ be the prediction of a model trained on dataset $D$


We analyze the expected squared error:
$$
\mathbb{E}_{D,\varepsilon}[(y - \hat{f}_D(x))^2]
$$

Substitute $y = f(x) + \varepsilon$:
$$
= \mathbb{E}_{D,\varepsilon}[(f(x) + \varepsilon - \hat{f}_D(x))^2]
$$

Group terms:
$$
= \mathbb{E}_{D,\varepsilon}[(f(x) - \hat{f}_D(x) + \varepsilon)^2]
$$

Expand the square:
$$
= \mathbb{E}_{D,\varepsilon}[(f(x) - \hat{f}_D(x))^2 + 2(f(x) - \hat{f}_D(x))\varepsilon + \varepsilon^2]
$$

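Use linearity of expectation. The first term does not depend on $\varepsilon$ and the last does not depend on $D$, so those expectations reduce to $\mathbb{E}_D$ and $\mathbb{E}_\varepsilon$ respectively: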
$$
= \mathbb{E}_D[(f(x) - \hat{f}_D(x))^2] + 2\mathbb{E}_{D,\varepsilon}[(f(x) - \hat{f}_D(x))\varepsilon] + \mathbb{E}_\varepsilon[\varepsilon^2]
$$

### Vanishing Cross-Term

The cross-term:
$$
\mathbb{E}_{D,\varepsilon}[(f(x) - \hat{f}_D(x))\varepsilon]
$$

Since $\hat{f}_D(x)$ depends only on $D$, the test noise $\varepsilon$ is independent of $D$, and $\mathbb{E}[\varepsilon] = 0$, the expectation factorizes:
$$
= \mathbb{E}_D\left[(f(x) - \hat{f}_D(x)) \cdot \mathbb{E}_\varepsilon[\varepsilon]\right] = 0
$$

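Since $\mathbb{E}[\varepsilon] = 0$, the last term is $\mathbb{E}_\varepsilon[\varepsilon^2] = \operatorname{Var}(\varepsilon) = \sigma^2$. So: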
$$
\mathbb{E}_{D,\varepsilon}[(y - \hat{f}_D(x))^2] = \mathbb{E}_D[(f(x) - \hat{f}_D(x))^2] + \sigma^2
$$
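The vanishing cross-term can be checked numerically. The sketch below is only an illustration under assumed settings that are not part of the derivation: a hypothetical estimator (a shrunken mean of noisy labels observed at $x$), Gaussian noise, and arbitrary values for $f(x)$, $\sigma$, and the sample sizes. The one feature that matters is that the test noise is drawn independently of the training set.

```python
import random


def cross_term_estimate(f_x=1.0, sigma=0.5, n_train=10, n_trials=50_000, seed=0):
    """Monte Carlo estimate of E_{D,eps}[(f(x) - f_hat_D(x)) * eps] at one fixed x."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_trials):
        # Training set D: n_train noisy labels observed at the same input x.
        train_labels = [f_x + rng.gauss(0.0, sigma) for _ in range(n_train)]
        # Hypothetical (deliberately biased) estimator: a shrunken sample mean.
        prediction = 0.8 * sum(train_labels) / n_train
        # Test noise, drawn independently of D.
        eps = rng.gauss(0.0, sigma)
        total += (f_x - prediction) * eps
    return total / n_trials


# The estimate should be close to zero, up to Monte Carlo error.
print(cross_term_estimate())
```

Although this estimator is biased, the estimated cross-term stays near zero, because only the independence of $\varepsilon$ from $D$ and $\mathbb{E}[\varepsilon] = 0$ are used.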

---

To decompose the first term, we use the following identity:

\subsection*{Theorem: Decomposition of Expected Squared Distance from Constant}

Let $Z$ be a random variable and $a \in \mathbb{R}$ a constant. Then:
$$
\mathbb{E}[(a - Z)^2] = (a - \mathbb{E}[Z])^2 + \mathbb{E}[(Z - \mathbb{E}[Z])^2]
$$

\textbf{Proof:}

Let $\mu = \mathbb{E}[Z]$, then:
\begin{align*}
\mathbb{E}[(a - Z)^2] &= \mathbb{E}[(a - \mu + \mu - Z)^2] \\
&= \mathbb{E}[(a - \mu)^2 + 2(a - \mu)(\mu - Z) + (\mu - Z)^2] \\
&= (a - \mu)^2 + 2(a - \mu)\mathbb{E}[\mu - Z] + \mathbb{E}[(\mu - Z)^2] \\
&= (a - \mu)^2 + 0 + \mathbb{E}[(Z - \mu)^2] \\
&= (a - \mathbb{E}[Z])^2 + \operatorname{Var}(Z)
\end{align*}

$\blacksquare$
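As a sanity check, the identity also holds exactly for the empirical distribution of any finite sample, provided the variance is computed with the population formula (dividing by $n$). The function name `check_identity` and the sample values below are chosen purely for illustration.

```python
import random


def check_identity(a, samples):
    """Return both sides of E[(a - Z)^2] = (a - E[Z])^2 + Var(Z) for an empirical sample."""
    n = len(samples)
    mu = sum(samples) / n
    lhs = sum((a - z) ** 2 for z in samples) / n
    # Population variance (divide by n, not n - 1), matching E[(Z - mu)^2].
    var = sum((z - mu) ** 2 for z in samples) / n
    rhs = (a - mu) ** 2 + var
    return lhs, rhs


rng = random.Random(0)
z_samples = [rng.gauss(1.5, 2.0) for _ in range(10_000)]
lhs, rhs = check_identity(a=3.0, samples=z_samples)
# The two sides agree up to floating-point rounding.
print(lhs, rhs)
```

With the unbiased sample variance (dividing by $n - 1$) the two sides would differ slightly, so the choice of divisor matters when reproducing the identity on data.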

---

Apply this identity with $a = f(x)$ and $Z = \hat{f}_D(x)$:
$$
\mathbb{E}_D[(f(x) - \hat{f}_D(x))^2] = (f(x) - \mathbb{E}_D[\hat{f}_D(x)])^2 + \mathbb{E}_D[(\hat{f}_D(x) - \mathbb{E}_D[\hat{f}_D(x)])^2]
$$

So the total error becomes:
$$
\mathbb{E}_{D,\varepsilon}[(y - \hat{f}_D(x))^2] = \underbrace{(f(x) - \mathbb{E}_D[\hat{f}_D(x)])^2}_{\text{Bias}^2} + \underbrace{\mathbb{E}_D[(\hat{f}_D(x) - \mathbb{E}_D[\hat{f}_D(x)])^2]}_{\text{Variance}} + \underbrace{\sigma^2}_{\text{Irreducible noise}}
$$
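The three terms above can be estimated by simulation. The following is a minimal sketch under assumptions that are not in the derivation: a true function $f(x) = x^2$, a straight-line least-squares model, uniform training inputs on $[-1, 1]$, Gaussian noise, and arbitrary sample sizes. Each trial draws a fresh training set $D$ and a fresh test label $y$ at the fixed input `x0`.

```python
import random


def true_f(x):
    # Assumed ground-truth function, for illustration only.
    return x * x


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx


def simulate(x0=0.8, sigma=0.3, n_train=20, n_trials=20_000, seed=0):
    """Estimate MSE, bias^2, variance, and noise at the fixed input x0."""
    rng = random.Random(seed)
    preds, sq_errs = [], []
    for _ in range(n_trials):
        # Draw a training set D and fit the model.
        xs = [rng.uniform(-1.0, 1.0) for _ in range(n_train)]
        ys = [true_f(x) + rng.gauss(0.0, sigma) for x in xs]
        slope, intercept = fit_line(xs, ys)
        pred = slope * x0 + intercept
        # Draw an independent noisy test label at x0.
        y = true_f(x0) + rng.gauss(0.0, sigma)
        preds.append(pred)
        sq_errs.append((y - pred) ** 2)
    mean_pred = sum(preds) / n_trials
    bias_sq = (true_f(x0) - mean_pred) ** 2
    variance = sum((p - mean_pred) ** 2 for p in preds) / n_trials
    mse = sum(sq_errs) / n_trials
    return mse, bias_sq, variance, sigma ** 2


mse, bias_sq, variance, noise = simulate()
# mse should be close to bias_sq + variance + noise, up to Monte Carlo error.
print(mse, bias_sq + variance + noise)
```

Because a straight line cannot represent $x^2$, the bias term is clearly nonzero here. Increasing `n_train` should shrink the variance term while leaving the bias and noise terms essentially unchanged.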

---

\section*{Global (Integrated) Bias-Variance Decomposition}

If $x$ is not fixed but comes from a distribution $p(x)$, the overall expected error is:

$$
\mathbb{E}_x\left[\mathbb{E}_{D,\varepsilon}[(y - \hat{f}_D(x))^2]\right]
= \mathbb{E}_x[(\text{Bias}(x))^2 + \text{Var}(x) + \sigma^2]
$$

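Here $\text{Bias}(x) = f(x) - \mathbb{E}_D[\hat{f}_D(x)]$ and $\text{Var}(x) = \mathbb{E}_D[(\hat{f}_D(x) - \mathbb{E}_D[\hat{f}_D(x)])^2]$ are the pointwise bias and variance of the prediction at $x$ (not the variance of $x$ itself). The outer expectation is an integral against $p(x)$, which is typically approximated in practice by averaging over a large test set of inputs.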
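Averaging over test inputs can be sketched as follows. As before, the setup is assumed for illustration only: $f(x) = x^2$, a straight-line least-squares model, $p(x)$ uniform on $[-1, 1]$, Gaussian noise with constant $\sigma$, and arbitrary sample sizes. The pointwise bias and variance are estimated at each test input and then averaged.

```python
import random


def true_f(x):
    # Assumed ground-truth function, for illustration only.
    return x * x


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx


def integrated_decomposition(sigma=0.3, n_train=20, n_trials=2_000, n_test=50, seed=1):
    """Average pointwise bias^2 and variance over test inputs drawn from p(x)."""
    rng = random.Random(seed)
    test_xs = [rng.uniform(-1.0, 1.0) for _ in range(n_test)]
    # preds[j] collects the predictions at test_xs[j] across training sets.
    preds = [[] for _ in test_xs]
    for _ in range(n_trials):
        xs = [rng.uniform(-1.0, 1.0) for _ in range(n_train)]
        ys = [true_f(x) + rng.gauss(0.0, sigma) for x in xs]
        slope, intercept = fit_line(xs, ys)
        for j, x in enumerate(test_xs):
            preds[j].append(slope * x + intercept)
    bias_sq_sum = 0.0
    var_sum = 0.0
    for x, p in zip(test_xs, preds):
        mean_pred = sum(p) / len(p)
        bias_sq_sum += (true_f(x) - mean_pred) ** 2
        var_sum += sum((v - mean_pred) ** 2 for v in p) / len(p)
    return bias_sq_sum / n_test, var_sum / n_test, sigma ** 2


avg_bias_sq, avg_var, noise = integrated_decomposition()
# Estimated integrated error: average bias^2 + average variance + noise.
print(avg_bias_sq, avg_var, noise, avg_bias_sq + avg_var + noise)
```

The estimate depends on which test inputs are sampled, so a larger `n_test` gives a better approximation of the expectation over $p(x)$.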