# Advanced: A note about confidence intervals
A __confidence interval__ gives a range of values that likely contains the true parameter value (which we don't know). For example, a 95% confidence interval means that if we were to repeat the experiment many times and compute the confidence interval each time, approximately 95% of those intervals would contain the true parameter value.
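
The repeated-experiment interpretation can be checked numerically. The following is a minimal sketch, not a definitive implementation: it assumes a simulated linear model with Gaussian noise, uses the t-based interval derived later in this section, and all names and values (`theta`, `sigma`, `n_trials`, the design matrix) are illustrative choices that do not come from the text.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Illustrative setup (assumed): n observations, p parameters, known true theta
n, p, sigma, alpha = 30, 2, 1.0, 0.05
X = np.column_stack([np.ones(n), np.linspace(0.0, 1.0, n)])
theta = np.array([2.0, -1.0])

XtX_inv = np.linalg.inv(X.T @ X)
c = stats.t.ppf(1.0 - alpha / 2.0, df=n - p)  # t quantile for a two-sided interval

j = 1            # parameter whose interval we track
n_trials = 5000  # number of repeated "experiments"
covered = 0
for _ in range(n_trials):
    # New noisy data from the same true model
    y = X @ theta + sigma * rng.normal(size=n)
    theta_hat = XtX_inv @ X.T @ y
    residuals = y - X @ theta_hat
    sigma2_hat = residuals @ residuals / (n - p)
    se_j = np.sqrt(sigma2_hat * XtX_inv[j, j])
    # Does this experiment's interval contain the true value?
    covered += abs(theta_hat[j] - theta[j]) <= c * se_j

coverage = covered / n_trials
print(f"empirical coverage: {coverage:.3f} (nominal {1 - alpha:.2f})")
```

The printed fraction should land close to the nominal level, which is exactly what "95% of those intervals would contain the true parameter value" means.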

### Parameter uncertainty (overdetermined case without regularization)
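Suppose we have an estimate of the variance $\hat{\sigma}^2$; we can now quantify the uncertainty in our parameter estimates $\text{Var}(\hat{\mathbf{\theta}})$. Let's consider the __overdetermined case without regularization__. In this case, substituting the model (observations equal to $\hat{\mathbf{X}}\mathbf{\theta} + \mathbf{\epsilon}$) into the least-squares solution shows that the estimate is the true parameter vector plus a noise term, so its variance is: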
$$
\begin{align*}
\hat{\mathbf{\theta}} &= \mathbf{\theta} + \left(\hat{\mathbf{X}}^{\top}\hat{\mathbf{X}}\right)^{-1}\hat{\mathbf{X}}^{\top}\mathbf{\epsilon}\\
\text{Var}(\hat{\mathbf{\theta}}) &= \text{Var}\left(\left(\hat{\mathbf{X}}^{\top}\hat{\mathbf{X}}\right)^{-1}\hat{\mathbf{X}}^{\top}\mathbf{\epsilon}\right)\quad\text{(since $\mathbf{\theta}$ is constant)}\\
\end{align*}
$$
For a random vector $\mathbf{A}\mathbf{z}$, where $\mathbf{z}$ is a random vector and $\mathbf{A}$ is a constant matrix, the variance is given by:
$$
\text{Var}(\mathbf{A}\mathbf{z}) = \mathbf{A}\;\text{Var}(\mathbf{z})\;\mathbf{A}^{\top}
$$
We assumed in our error model that $\text{Var}(\mathbf{\epsilon}) = \sigma^{2}\;\mathbf{I}$. Therefore:
$$
\begin{align*}
\text{Var}(\hat{\mathbf{\theta}}) &= \left(\hat{\mathbf{X}}^{\top}\hat{\mathbf{X}}\right)^{-1}\hat{\mathbf{X}}^{\top}\;\text{Var}(\mathbf{\epsilon})\;\hat{\mathbf{X}}\left(\hat{\mathbf{X}}^{\top}\hat{\mathbf{X}}\right)^{-1}\\
&= \left(\hat{\mathbf{X}}^{\top}\hat{\mathbf{X}}\right)^{-1}\hat{\mathbf{X}}^{\top}\;(\sigma^{2}\;\mathbf{I})\;\hat{\mathbf{X}}\left(\hat{\mathbf{X}}^{\top}\hat{\mathbf{X}}\right)^{-1}\\
&= \sigma^{2}\;\underbrace{\left(\hat{\mathbf{X}}^{\top}\hat{\mathbf{X}}\right)^{-1}\hat{\mathbf{X}}^{\top}\hat{\mathbf{X}}}_{=\;\mathbf{I}}\left(\hat{\mathbf{X}}^{\top}\hat{\mathbf{X}}\right)^{-1}\\
&= \sigma^{2}\;\left(\hat{\mathbf{X}}^{\top}\hat{\mathbf{X}}\right)^{-1}\quad\blacksquare\\
\end{align*}
$$
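
As a sanity check, the result $\text{Var}(\hat{\mathbf{\theta}}) = \sigma^{2}(\hat{\mathbf{X}}^{\top}\hat{\mathbf{X}})^{-1}$ can be compared against the empirical covariance of many simulated estimates. This is a sketch under stated assumptions: the design matrix, true parameters, and noise level below are made up for illustration, and the noise is drawn i.i.d. Gaussian so that $\text{Var}(\mathbf{\epsilon}) = \sigma^{2}\mathbf{I}$ holds.

```python
import numpy as np

rng = np.random.default_rng(0)

n, p = 50, 3                         # observations, parameters (illustrative)
sigma = 0.5                          # true noise standard deviation (assumed known here)
X = rng.normal(size=(n, p))          # fixed design matrix, reused in every trial
theta = np.array([1.0, -2.0, 0.5])   # "true" parameters for the simulation

# Theoretical covariance from the derivation: sigma^2 (X^T X)^{-1}
cov_theory = sigma**2 * np.linalg.inv(X.T @ X)

# Simulate many experiments at once: each column of Y is one noisy data set
n_trials = 20_000
noise = sigma * rng.normal(size=(n, n_trials))
Y = (X @ theta)[:, None] + noise

# lstsq accepts multiple right-hand sides; transpose to shape (n_trials, p)
estimates = np.linalg.lstsq(X, Y, rcond=None)[0].T

# Empirical covariance of the estimates across trials
cov_empirical = np.cov(estimates, rowvar=False)

print("max abs difference:", np.abs(cov_empirical - cov_theory).max())
```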
Finally, because the true $\sigma^2$ is unknown, we replace it with our estimate $\hat{\sigma}^2$ in $\text{Var}(\hat{\mathbf{\theta}})$. We can then compute the standard errors $\mathrm{SE}(\hat{\theta}_j) = \sqrt{\text{Var}(\hat{\theta}_j)}$ of the individual parameter estimates $\hat{\theta}_j$ (the square roots of the diagonal elements of the covariance matrix):
$$
\begin{align*}
  \mathrm{SE}(\hat{\theta}_j) &= \sqrt{\;\hat{\sigma}^2\;\bigl[(\hat{\mathbf{X}}^\top\hat{\mathbf{X}})^{-1}\bigr]_{jj}\,}
\end{align*}
$$
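
The standard error formula translates almost line by line into code. The sketch below is illustrative only: the function name `ols_standard_errors` and the tiny data set are hypothetical, and it assumes the variance estimate is the residual sum of squares divided by $n-p$.

```python
import numpy as np

def ols_standard_errors(X, y):
    """Return (theta_hat, sigma2_hat, se) for an overdetermined least-squares fit."""
    n, p = X.shape
    theta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    residuals = y - X @ theta_hat
    # Assumed variance estimate: residual sum of squares over (n - p)
    sigma2_hat = residuals @ residuals / (n - p)
    # Covariance matrix of the estimates, then square roots of its diagonal
    cov = sigma2_hat * np.linalg.inv(X.T @ X)
    se = np.sqrt(np.diag(cov))
    return theta_hat, sigma2_hat, se

# Hypothetical data: intercept column plus one feature, four observations
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([1.0, 3.0, 5.0, 8.0])

theta_hat, sigma2_hat, se = ols_standard_errors(X, y)
print("theta_hat:", theta_hat)
print("sigma2_hat:", sigma2_hat)
print("standard errors:", se)
```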
So why is this super cool? Because there are many things we can do with standard errors!

> Standard errors are __essential__ for:
> 
> * **Confidence intervals**: For large samples, $\hat{\theta}_j \pm 1.96 \cdot \mathrm{SE}(\hat{\theta}_j)$ gives an approximate 95% confidence interval for parameter $\theta_j$. The 1.96 comes from the standard normal distribution. However, for finite samples, we should use the t-distribution as shown below.
> * **Hypothesis testing**: Testing whether $\theta_j = 0$ (is feature $j$ significant?). We can compute a t-statistic $t_j = \hat{\theta}_j/{\mathrm{SE}(\hat{\theta}_j)}$ and compare it to a t-distribution to get a p-value (the probability of observing a value at least as extreme as $t_j$ if the null hypothesis is true, i.e., if $\theta_j$ is actually 0).
> * **Prediction intervals**: Quantifying uncertainty in new predictions. We can use the covariance matrix of the parameter estimates to construct prediction intervals for new observations. How? By adding and subtracting a margin of error from the predicted values, where the margin accounts for both the uncertainty in $\hat{\mathbf{\theta}}$ and the noise variance $\hat{\sigma}^2$ of the new observation itself.
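
The hypothesis-testing bullet above can be sketched in a few lines. This is an illustration under assumptions, not a prescribed workflow: the data set is hypothetical, the variance estimate is taken to be the residual sum of squares over $n-p$, and the test is two-sided.

```python
import numpy as np
from scipy import stats

# Hypothetical data: intercept column plus one feature
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([1.0, 3.0, 5.0, 8.0])
n, p = X.shape
nu = n - p  # degrees of freedom

# Least-squares fit and standard errors
XtX_inv = np.linalg.inv(X.T @ X)
theta_hat = XtX_inv @ X.T @ y
residuals = y - X @ theta_hat
sigma2_hat = residuals @ residuals / nu
se = np.sqrt(sigma2_hat * np.diag(XtX_inv))

# t-statistic for the null hypothesis theta_j = 0
t_stats = theta_hat / se

# Two-sided p-value: probability of |T| at least as large as |t_j| under the null
p_values = 2.0 * stats.t.sf(np.abs(t_stats), df=nu)

for j in range(p):
    print(f"theta_{j}: estimate={theta_hat[j]:.3f}, t={t_stats[j]:.2f}, p={p_values[j]:.4f}")
```

With so few observations the t-distribution has heavy tails, so a t-statistic that would look convincing under the normal approximation can still produce a large p-value.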

### Derivation of confidence intervals
Let's dig in a bit deeper. Start with the studentized statistic $T_j$ for the __true__ parameter value $\theta_j$ (unknown):
$$
\begin{align*}
T_j \;&=\; \frac{\hat{\theta}_j-\theta_j}{\mathrm{SE}(\hat{\theta}_j)} 
\quad\text{with}\quad 
\mathrm{SE}(\hat{\theta}_j)=\hat{\sigma}\,\sqrt{\bigl[(\hat{\mathbf X}^\top\hat{\mathbf X})^{-1}\bigr]_{jj}}.
\end{align*}
$$
Under the (homoskedastic) normal-error model and the $\hat\sigma^2$ estimate we developed above, the distribution of $T_j$ is a Student's $t$ with:
$$
\begin{align*}
T_j & \sim t_{\nu},\qquad \nu=n-p
\end{align*}
$$
where $n$ is the number of observations, $p$ is the number of parameters, and $\nu=n-p$ is the degrees of freedom. 
Let $c=t_{1-\alpha/2,\nu}$, where $\alpha$ is the significance level (e.g., $\alpha=0.05$ for a 95% CI), and $t_{1-\alpha/2,\nu}$ is the $(1-\alpha/2)$ quantile of the $t$-distribution with $\nu$ degrees of freedom. 
Then (given $\mathrm{SE}(\hat\theta_j)>0$):
$$
\Pr\!\big(-c \le T_j \le c\big)=1-\alpha
\quad\Longleftrightarrow\quad
\Pr\!\left(-c \le \frac{\hat\theta_j-\theta_j}{\mathrm{SE}(\hat\theta_j)} \le c\right)=1-\alpha.
$$
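Now we invert the inequality to get bounds on $\theta_j$ instead of $T_j$. Multiplying through by $\mathrm{SE}(\hat\theta_j)>0$, subtracting $\hat\theta_j$, and then multiplying by $-1$ (which reverses the direction of both inequalities) gives: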
$$
\Pr\!\big(\ \hat\theta_j - c\;\mathrm{SE}(\hat\theta_j) \;\le\; \theta_j \;\le\; \hat\theta_j + c\;\mathrm{SE}(\hat\theta_j)\ \big)=1-\alpha.
$$
Replace $c = t_{1-\alpha/2,\nu}$ and plug in $\mathrm{SE}(\hat\theta_j)=\hat\sigma\,\sqrt{\bigl[(\hat{\mathbf X}^\top\hat{\mathbf X})^{-1}\bigr]_{jj}}$ to get the familiar two-sided CI:
$$
\boxed{
\hat{\theta}_j \pm t_{1-\alpha/2,\nu}\; \hat{\sigma}\; \sqrt{\bigl[(\hat{\mathbf X}^\top\hat{\mathbf X})^{-1}\bigr]_{jj}}\quad\blacksquare
}
$$
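
The boxed formula can be evaluated directly. The sketch below is a hedged illustration: the helper name `t_confidence_intervals` and the data are hypothetical, $\hat{\sigma}^2$ is assumed to be the residual sum of squares over $\nu = n-p$, and the quantile comes from `scipy.stats.t.ppf`.

```python
import numpy as np
from scipy import stats

def t_confidence_intervals(X, y, alpha=0.05):
    """Two-sided (1 - alpha) confidence intervals for each least-squares parameter."""
    n, p = X.shape
    nu = n - p                                   # degrees of freedom
    XtX_inv = np.linalg.inv(X.T @ X)
    theta_hat = XtX_inv @ X.T @ y
    residuals = y - X @ theta_hat
    sigma_hat = np.sqrt(residuals @ residuals / nu)
    c = stats.t.ppf(1.0 - alpha / 2.0, df=nu)    # t_{1 - alpha/2, nu}
    half_width = c * sigma_hat * np.sqrt(np.diag(XtX_inv))
    return theta_hat - half_width, theta_hat + half_width

# Hypothetical data: intercept column plus one feature
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([1.0, 3.0, 5.0, 8.0])

lower, upper = t_confidence_intervals(X, y, alpha=0.05)
for j in range(X.shape[1]):
    print(f"theta_{j}: [{lower[j]:.3f}, {upper[j]:.3f}]")
```

With only $\nu = 2$ degrees of freedom the quantile is far larger than 1.96, which is why the large-sample normal approximation would understate the width of these intervals.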

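__Wow!__ Mind blown! Ok, so what about one-sided intervals? The approach is the same: start from $\Pr(T_j \le t_{1-\alpha,\nu})=1-\alpha$ (or the lower tail, $\Pr(T_j \ge -t_{1-\alpha,\nu})=1-\alpha$) and solve for $\theta_j$. The first gives the lower bound $\theta_j \ge \hat\theta_j - t_{1-\alpha,\nu}\,\mathrm{SE}(\hat\theta_j)$ and the second gives the upper bound $\theta_j \le \hat\theta_j + t_{1-\alpha,\nu}\,\mathrm{SE}(\hat\theta_j)$. The inversion step just converts bounds on the statistic into bounds on the parameter via a monotone transformation.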
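
A one-sided bound differs from the two-sided interval only in the quantile used. The following sketch assumes the estimate, its standard error, and the degrees of freedom are already available; the function name `one_sided_bounds` and the input numbers are hypothetical.

```python
from scipy import stats

def one_sided_bounds(theta_hat_j, se_j, nu, alpha=0.05):
    """Return (lower, upper) one-sided (1 - alpha) confidence bounds.

    Each bound holds on its own with probability 1 - alpha; together they
    do NOT form a (1 - alpha) two-sided interval.
    """
    c = stats.t.ppf(1.0 - alpha, df=nu)   # t_{1 - alpha, nu}, not 1 - alpha/2
    lower = theta_hat_j - c * se_j        # from Pr(T_j <= c) = 1 - alpha
    upper = theta_hat_j + c * se_j        # from Pr(T_j >= -c) = 1 - alpha
    return lower, upper

# Hypothetical inputs: an estimate of 2.3 with standard error 0.1732 and nu = 2
lower, upper = one_sided_bounds(2.3, 0.17320508, nu=2, alpha=0.05)
print(f"theta_j >= {lower:.3f} (one-sided lower bound)")
print(f"theta_j <= {upper:.3f} (one-sided upper bound)")
```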

___