## Understanding the OLS Summary Table

- In the previous section, we showed theoretically and practically how we can derive a coefficient vector $\beta$, just from the objective function of minimising the mean squared error (MSE)

- But you should notice something odd about our results. Our matrix algebra gave us only coefficient values

- But the OLS table actually gives us so much more than this! 

- How can we derive every part of the OLS Summary table? Let's find out

In [18]:
import pandas as pd
import numpy as np
from sklearn.datasets import make_regression
# import statsmodels.formula.api as smf
import statsmodels.api as sm

x,y = make_regression(
    n_samples=500, 
    n_features=5, 
    n_informative=2, 
    n_targets=1, 
    noise=5, 
    bias=5,
    random_state=123
)
x = np.append(x, np.ones((500,1)), axis = 1)
print(x.shape)

betas = np.linalg.inv((x.transpose() @ x)) @ x.transpose() @ y
np.set_printoptions(suppress=True)
print(betas)

print('='*50)
res = sm.OLS(exog=x, endog=y, hasconst=True).fit()
res.summary()

(500, 6)
[-0.16521089  0.2381359   0.00976686 60.45175552 26.46640238  4.8924384 ]


| | | | |
|---|---|---|---|
| Dep. Variable: | y | R-squared: | 0.994 |
| Model: | OLS | Adj. R-squared: | 0.994 |
| Method: | Least Squares | F-statistic: | 17750.0 |
| Date: | Mon, 07 Oct 2024 | Prob (F-statistic): | 0.0 |
| Time: | 14:49:09 | Log-Likelihood: | -1508.0 |
| No. Observations: | 500 | AIC: | 3028.0 |
| Df Residuals: | 494 | BIC: | 3053.0 |
| Df Model: | 5 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| x1 | -0.1652 | 0.233 | -0.710 | 0.478 | -0.622 | 0.292 |
| x2 | 0.2381 | 0.236 | 1.008 | 0.314 | -0.226 | 0.702 |
| x3 | 0.0098 | 0.222 | 0.044 | 0.965 | -0.426 | 0.445 |
| x4 | 60.4518 | 0.222 | 272.722 | 0.000 | 60.016 | 60.887 |
| x5 | 26.4664 | 0.227 | 116.601 | 0.000 | 26.020 | 26.912 |
| const | 4.8924 | 0.223 | 21.982 | 0.000 | 4.455 | 5.330 |

| | | | |
|---|---|---|---|
| Omnibus: | 1.207 | Durbin-Watson: | 1.76 |
| Prob(Omnibus): | 0.547 | Jarque-Bera (JB): | 1.202 |
| Skew: | 0.028 | Prob(JB): | 0.548 |
| Kurtosis: | 2.766 | Cond. No. | 1.11 |


### F-Statistic

- The F-statistic is closely related to $R^2$, and is defined using the same components (the explained and residual sums of squares)

- If you are unfamiliar with the terms below, see the `Concept` section of `3. Summary Table: R Square.ipynb` for a quick revision

$$\begin{aligned}
    \text{F-Stat} &= \frac{\frac{ESS}{k-1}}{\frac{RSS}{n - k}} \\ \\
    \text{where } \\
    ESS &= \sum_i (\hat{y}_i - \bar{y})^2 \\
    RSS &= \sum_i (y_i - \hat{y}_i)^2 \\
    k &= \text{Number of estimated coefficients (features plus the intercept)} \\
    n &= \text{Number of observations in the regression}
\end{aligned}$$


- The intuition here is straightforward; if my explained sum of squares is much higher than the residual sum of squares, my F-statistic will be higher

- At very high levels of F-statistic, the model you fit must be "good" in the sense that most of the variance is explained by the model! Hence the conclusion that the variables are "jointly significant", even if the individual variables are not significant

In [29]:
ypred = x @ betas
ess = np.sum((ypred - np.mean(y))**2)
rss = np.sum((y - ypred)**2)
k = x.shape[1]
n = x.shape[0]

fstat = (ess/(k-1)) / (rss/(n-k))
fstat

np.float64(17745.272222202613)
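The link to $R^2$ can be made concrete: dividing the numerator and denominator of the F-statistic by the total sum of squares gives $F = \frac{R^2/(k-1)}{(1-R^2)/(n-k)}$. The sketch below checks this numerically. It assumes a small synthetic dataset built with NumPy; the data, seed, and `_demo` variable names are illustrative and are not the `make_regression` data used in this notebook.

```python
import numpy as np

# Hypothetical synthetic data: 200 observations, 3 features plus an intercept column
rng = np.random.default_rng(0)
n_obs, n_feat = 200, 3
X_demo = np.column_stack([rng.normal(size=(n_obs, n_feat)), np.ones(n_obs)])
y_demo = X_demo @ np.array([2.0, 0.0, -1.0, 5.0]) + rng.normal(scale=3.0, size=n_obs)

# OLS fit via least squares
beta_demo, *_ = np.linalg.lstsq(X_demo, y_demo, rcond=None)
y_hat = X_demo @ beta_demo

ess_demo = np.sum((y_hat - y_demo.mean()) ** 2)   # explained sum of squares
rss_demo = np.sum((y_demo - y_hat) ** 2)          # residual sum of squares
tss_demo = np.sum((y_demo - y_demo.mean()) ** 2)  # total sum of squares
k_demo = X_demo.shape[1]                          # number of coefficients, intercept included

# F-statistic from the sums of squares, and again from R^2
f_from_ss = (ess_demo / (k_demo - 1)) / (rss_demo / (n_obs - k_demo))
r2_demo = 1 - rss_demo / tss_demo
f_from_r2 = (r2_demo / (k_demo - 1)) / ((1 - r2_demo) / (n_obs - k_demo))
print(f_from_ss, f_from_r2)
```

The two values agree because, with an intercept in the model, the total sum of squares splits exactly into ESS + RSS.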

#### Checking p-value of F-statistic

In [31]:
from scipy.stats import f
# Calculate the p-value (1 - CDF of F-distribution at the observed F-statistic)
p_value = 1 - f.cdf(fstat, k-1, n-k)
print("p-value:", p_value) ## Statistically significant if < 0.05

p-value: 1.1102230246251565e-16
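For an F-statistic this large, `1 - f.cdf(...)` is limited by floating-point resolution near 1, which is why the printed p-value sits at roughly $10^{-16}$ rather than something smaller. A possible alternative is the survival function `f.sf`, which computes the upper-tail probability directly. The sketch below hard-codes the F-statistic, `k` and `n` from the cells above; the `_demo` names are illustrative.

```python
from scipy.stats import f

# Values copied from the cells above: F-statistic, k = 6 coefficients, n = 500 rows
fstat_demo = 17745.272222202613
k_demo, n_demo = 6, 500

# Upper-tail probability computed two ways
p_naive = 1 - f.cdf(fstat_demo, k_demo - 1, n_demo - k_demo)  # limited by float precision near 1
p_sf = f.sf(fstat_demo, k_demo - 1, n_demo - k_demo)          # survival function, computed directly
print("1 - cdf:", p_naive)
print("sf     :", p_sf)
```

Either way the conclusion is the same: the p-value is far below 0.05, consistent with the `Prob (F-statistic)` of 0.0 in the summary table.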


### Deep dive: Why is the F-Distribution Valid?

- We've gone through how to compute the F statistic using the ESS and RSS from the regression, each adjusted by its degrees of freedom.

- However, it is not clear why we assumed at the outset that the formula above should follow an F distribution. 

- An F distribution is defined by 
$$\begin{aligned}
    F &= \frac{U_1 / d_1}{U_2 / d_2} \\ \\
    \text{where} \\
    &U_1 \sim \chi^2_{d_1}, \quad U_2 \sim \chi^2_{d_2} \\
    &U_1 \perp U_2 \\
\end{aligned}$$
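This definition can be checked by simulation: draw two independent chi-square samples, form the scaled ratio, and compare it with the theoretical F distribution. The sketch below is a Monte Carlo illustration, not a proof. The degrees of freedom are chosen to match $k-1 = 5$ and $n-k = 494$ from this notebook, while the seed and number of draws are arbitrary assumptions.

```python
import numpy as np
from scipy.stats import f

rng = np.random.default_rng(42)
d1, d2 = 5, 494          # degrees of freedom matching k-1 and n-k in this notebook
n_draws = 200_000

u1 = rng.chisquare(d1, size=n_draws)   # U1 ~ chi-square(d1)
u2 = rng.chisquare(d2, size=n_draws)   # U2 ~ chi-square(d2), drawn independently of U1
ratio = (u1 / d1) / (u2 / d2)          # scaled ratio from the definition

# Compare the simulated 95th percentile with the theoretical F quantile
sim_q95 = np.quantile(ratio, 0.95)
theo_q95 = f.ppf(0.95, d1, d2)
print(sim_q95, theo_q95)
```

The simulated and theoretical quantiles should agree up to Monte Carlo noise.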

- We claim that this expression follows an F distribution:
$$
    \frac{\frac{ESS}{k - 1}}{\frac{RSS}{n - k}}
$$

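- Therefore, for this to be true, we must show that the following two statements hold, and that $ESS$ and $RSS$ are independent (independence follows from $H$ and $I - H$ projecting onto orthogonal subspaces, shown in the final section). Strictly, the $\chi^2$ results apply to $ESS/\sigma^2$ (under the null hypothesis that all slope coefficients are zero) and to $RSS/\sigma^2$; the common factor $\sigma^2$ cancels in the ratio, so we drop it from the notation
$$\begin{aligned}
    ESS &\sim \chi^2_{k-1} \\ 
    RSS &\sim \chi^2_{n-k} \\
\end{aligned}$$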

#### Proof that $ESS \sim \chi^2_{k-1}$

- Let's start from the definition of ESS, written in vector form. If this is unclear, see the `Concepts` section of the R Square notebook
$$\begin{aligned}
    ESS &= (\hat{y} - \bar{y})^T (\hat{y} - \bar{y})
\end{aligned}$$

- We'll rewrite the value of $\hat{y}$ in another way:
$$\begin{aligned}
    \hat{y} &= X \hat{\beta} \\
    &= X (X^TX)^{-1} X^T y \\
    &= Hy & H = X (X^TX)^{-1} X^T
\end{aligned}$$

- Let's assume that the variable $y$ is de-meaned, so $\bar{y} = 0$. Then: 
$$\begin{aligned}
    ESS &= (\hat{y} - \bar{y})^T (\hat{y} - \bar{y}) \\
    &= \hat{y}^T \hat{y} \\
    &= (Hy)^T Hy \\
    &= y^T H^T H y
\end{aligned}$$

- $H$ is symmetric and idempotent (proven in a later section, scroll down). Therefore:
    - Symmetric: $H^T = H$
    - Idempotent: $H \cdot H = H$
$$\begin{aligned}
    ESS &= (\hat{y} - \bar{y})^T (\hat{y} - \bar{y}) \\
    &= \hat{y}^T \hat{y} \\
    &= (Hy)^T Hy \\
    &= y^T H^T H y \\
    &= y^T H H y \\
    &= y^T H y \\
\end{aligned}$$
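The identity $ESS = y^T H y$ for de-meaned $y$ can be verified numerically. The sketch below assumes a small synthetic design matrix with an intercept column; the data, seed, and `_demo` names are illustrative and not taken from the notebook's dataset.

```python
import numpy as np

rng = np.random.default_rng(1)
n_obs = 50
X_demo = np.column_stack([rng.normal(size=(n_obs, 2)), np.ones(n_obs)])  # last column = intercept
y_raw = X_demo @ np.array([1.5, -2.0, 4.0]) + rng.normal(size=n_obs)
y_c = y_raw - y_raw.mean()   # de-meaned y, so its mean is 0

# Hat matrix and fitted values
H_demo = X_demo @ np.linalg.inv(X_demo.T @ X_demo) @ X_demo.T
y_hat_c = H_demo @ y_c

ess_sum = np.sum((y_hat_c - y_c.mean()) ** 2)   # definition of ESS
ess_quad = y_c @ H_demo @ y_c                   # quadratic form y^T H y
print(ess_sum, ess_quad)
```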

- We know from the regular OLS expression that 
$$\begin{aligned}
    y &= X\beta + \epsilon \quad \text{where} \quad \epsilon \sim N(0, \sigma^2 I)
\end{aligned}$$

- Since $X\beta$ is a fixed set of values, and $\epsilon$ is normally distributed, it implies that $y$ must also be normally distributed!   
    - Why?
    - By assumption in the OLS, error $\epsilon$ must follow a normal distribution
    - $X\beta$ is deterministic
    - Therefore, $y \sim N(X\beta, \sigma^2 I)$

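- So $y$ is normally distributed, and $ESS = y^T H y$ is a quadratic form in a normally distributed vector
    - When the matrix of a quadratic form is symmetric and idempotent, as $H$ is, and the vector has covariance $\sigma^2 I$, the quadratic form divided by $\sigma^2$ follows a chi-square distribution (central under the null hypothesis of zero slopes), with degrees of freedom given by the rank of the matrix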

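- The rank of $H$ is $k$, where $k$ is the number of columns in $X$ (the predictors plus the intercept), assuming $X$ has full column rank
    - $X$ is $n \times k$ with $n > k$, so its rank is $k$
    - $(X^TX)^{-1}$ is $k \times k$ and invertible, so its rank is $k$
    - $X^T$ is $k \times n$, so its rank is $k$
    - $H = X (X^TX)^{-1} X^T$, and the rank of a product is at most the smallest rank among its factors. Conversely, $HX = X$, so the column space of $H$ contains the $k$ independent columns of $X$
$$\begin{aligned}
    \text{rank}(H) &\le \min(\text{rank}(X), \text{rank}((X^TX)^{-1}), \text{rank}(X^T)) \\
    \text{rank}(H) &\le\min(k, k, k) = k \\
    \text{rank}(H) &\ge \text{rank}(HX) = \text{rank}(X) = k \\
    \therefore \text{rank}(H) &= k
\end{aligned}$$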

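- Among the 6 columns in $X$, one is the intercept. De-meaning $y$ removes the component of $y$ along this constant column, so the intercept does not contribute to the degrees of freedom for ESS. Formally, the centred quadratic form uses the matrix $H - \frac{1}{n}\mathbf{1}\mathbf{1}^T$, which is also symmetric and idempotent and has rank $k-1$

- Therefore, we have shown that $ESS \sim \chi^2_{k-1}$ 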
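The rank argument can be checked numerically. The sketch below builds a hat matrix from a hypothetical design matrix with $k = 4$ columns (three random features plus an intercept) and confirms that its rank and trace both equal $k$. It also checks that removing the intercept direction, via the centred matrix $H - \frac{1}{n}\mathbf{1}\mathbf{1}^T$, leaves rank $k-1$. The data, the `_demo` names, and the rank tolerance are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
n_obs, k_demo = 40, 4
X_demo = np.column_stack([rng.normal(size=(n_obs, k_demo - 1)), np.ones(n_obs)])

H_demo = X_demo @ np.linalg.inv(X_demo.T @ X_demo) @ X_demo.T
P_const = np.full((n_obs, n_obs), 1.0 / n_obs)   # projection onto the constant column

rank_H = np.linalg.matrix_rank(H_demo, tol=1e-8)
trace_H = np.trace(H_demo)                       # for an idempotent matrix, trace = rank
rank_centred = np.linalg.matrix_rank(H_demo - P_const, tol=1e-8)
print(rank_H, round(trace_H, 6), rank_centred)
```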

#### Proof that $RSS \sim \chi^2_{n-k}$

- Similar to ESS, let's start from definitions:
$$\begin{aligned}
    RSS &= \sum_i (y_i - \hat{y}_i)^2 \\
    &= (y - \hat{y})^T (y - \hat{y})
\end{aligned}$$

- By definition, $y_i - \hat{y}_i$ is simply the residual after OLS, $\hat{e}_i$. Stacking the residuals into the vector $\hat{e}$:
$$\begin{aligned}
    RSS &= \sum_i (y_i - \hat{y}_i)^2 \\
    &= (y - \hat{y})^T (y - \hat{y}) \\
    &= \hat{e}^T \hat{e}
\end{aligned}$$

- In OLS, we always assume that the **true error** $\epsilon \sim N(0, \sigma^2 I)$. Here, we find that $RSS = \hat{e}^T \hat{e}$, and note that the residual $\hat{e}$ is not the same as the true error $\epsilon$

- However, we can still derive some insights about the distribution of $\hat{e}$ by rewriting it in the following way, where $H = X (X^TX)^{-1} X^T$ represents the hat matrix of y (derivation below)
$$\begin{aligned}
    \hat{e} &= y - \hat{y} \\
    &= y - Hy \\
    &= (I - H) y \\ \\

    \hat{y} &= Hy \\
    &= X \hat{\beta} \\
    &= X (X^TX)^{-1} X^Ty
\end{aligned}$$

- Since $\hat{e}$ can be written as a function of $y$, and we know that $y = X \beta + \epsilon$, therefore:
$$y \sim N(X \beta, \sigma^2 I)$$

- A linear transformation $Ay$ of a normal vector $y \sim N(\mu, \Sigma)$ is itself normal, with mean $A\mu$ and covariance $A \Sigma A^T$. Taking $A = I - H$, the mean is $(I - H)X\beta = 0$ because $HX = X$, and so
$$\begin{aligned}
    \hat{e} &= (I - H)y \sim N(0, \sigma^2 (I - H) (I - H)^T)
\end{aligned}$$

- However, we know that $H$ is symmetric (so $H = H^T$) and idempotent (so $H \cdot H = H$). 
    - See section below for symmetry and idempotence test for $H$

- Therefore, it must be true that $I - H$ is also symmetric and idempotent, since $(I - H)^T = I - H^T = I - H$ and $(I - H)(I - H) = I - 2H + H \cdot H = I - H$. Consequently:
$$\begin{aligned}
    \hat{e} &= (I - H)y \sim N(0, \sigma^2 (I - H) (I - H)^T) \\
    \therefore \hat{e} &\sim N(0, \sigma^2 (I - H))
\end{aligned}$$

- Since $\hat{e} \sim N(0, \sigma^2 (I - H))$, the term $\hat{e}^T \hat{e}$ is a quadratic form in a normal random vector, which (after dividing by $\sigma^2$) has a $\chi^2$ distribution!

- Since the covariance matrix is $\sigma^2 (I - H)$, and $I - H$ has rank $n-k$, $RSS / \sigma^2 \sim \chi^2_{n-k}$
    - See section below for why $(I - H)$ has rank $n-k$
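A simulation can illustrate this result. A $\chi^2_{n-k}$ variable has mean $n-k$ and variance $2(n-k)$, so repeatedly drawing errors, computing residuals, and scaling RSS by $\sigma^2$ should reproduce those moments. The sketch below is a Monte Carlo check under assumed values: the design matrix, `beta_true`, `sigma`, the seed, and the number of simulations are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(3)
n_obs, k_demo, sigma = 30, 4, 2.0
X_demo = np.column_stack([rng.normal(size=(n_obs, k_demo - 1)), np.ones(n_obs)])
beta_true = np.array([1.0, -0.5, 2.0, 3.0])   # hypothetical true coefficients

H_demo = X_demo @ np.linalg.inv(X_demo.T @ X_demo) @ X_demo.T
M_demo = np.eye(n_obs) - H_demo               # residual maker I - H

n_sims = 20_000
eps = rng.normal(scale=sigma, size=(n_sims, n_obs))
y_sims = X_demo @ beta_true + eps             # each row is one simulated y
resid = y_sims @ M_demo                       # M is symmetric, so row @ M equals (M @ row)^T
scaled_rss = np.sum(resid ** 2, axis=1) / sigma**2

print(scaled_rss.mean(), n_obs - k_demo)        # chi-square mean = degrees of freedom
print(scaled_rss.var(), 2 * (n_obs - k_demo))   # chi-square variance = 2 * degrees of freedom
```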

#### Proof that $H$ is symmetric and idempotent

- Proof of Symmetry
$$\begin{aligned}
    H^T &= (X (X^TX)^{-1} X^T)^T \\
    &= (X^T)^T ((X^TX)^{-1})^T X^T \\
    &= X ((X^TX)^{-1})^T X^T \\
    &= X ((X^TX)^T)^{-1} X^T & (A^{-1})^T = (A^T)^{-1} \\
    &= X (X^TX)^{-1} X^T & (X^TX)^T = X^TX \\
    &= H
\end{aligned}$$

- Proof of Idempotence
$$\begin{aligned}
    H \cdot H &= (X (X^TX)^{-1} X^T) \cdot (X (X^TX)^{-1} X^T) \\
    &= X (X^TX)^{-1} X^TX (X^TX)^{-1} X^T \\
    &= X (X^TX)^{-1} X^T & (X^TX)^{-1} X^TX = I \\
    &= H
\end{aligned}$$
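Both properties are easy to confirm numerically. The sketch below builds $H$ from a hypothetical random design matrix (the data and `_demo` names are illustrative) and checks symmetry and idempotence up to floating-point tolerance.

```python
import numpy as np

rng = np.random.default_rng(4)
X_demo = np.column_stack([rng.normal(size=(25, 3)), np.ones(25)])  # 3 features + intercept
H_demo = X_demo @ np.linalg.inv(X_demo.T @ X_demo) @ X_demo.T

is_symmetric = np.allclose(H_demo, H_demo.T)            # H^T = H
is_idempotent = np.allclose(H_demo @ H_demo, H_demo)    # H H = H
print(is_symmetric, is_idempotent)
```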


#### Why does $I - H$ have rank $n - k$?

- Remember the definition of $H$ is such that $\hat{y} = Hy$
    - That is, $H$ "puts" a hat on $y$, thus it is called the hat matrix

- Geometrically, $H$ projects observations $y$ onto the space spanned by the columns of the design matrix $X$
    - This space, the set of all linear combinations of the columns of $X$, is known as the **range** (or column space) of $H$
    - Relatedly, the number of linearly independent columns of $X$, which equals the dimension of this space, is known as the **rank** of $H$

- Since $H$ is symmetric and idempotent, it is a **projection matrix**

- $I - H$ is also a **projection matrix**
    - But it projects onto the orthogonal complement to the column space of $X$ / $H$
    - This is known as the **residual space**

- How do we know that $I - H$ and $H$ are complementary projection matrices? That is, show that any vector projected by $H$ lies in a subspace orthogonal to any vector projected by $I - H$
    - Take some arbitrary vector $V$
        - When projected onto the column space of $H$, we have $HV$
        - When projected onto the orthogonal complement of that column space, we have $(I-H)V$

    - To prove that $HV$ and $(I-H)V$ are orthogonal, we need to show that $(HV)^T \cdot (I - H)V = 0$
    $$\begin{aligned}
        (HV)^T \cdot (I - H)V &= (HV)^T \cdot (V - HV) \\
        &= (HV)^T V - (HV)^T HV \\
        &= V^T H^T V - V^T H^T H V \\
        &= V^T H V - V^T H V & \because H \text{ is idempotent + symmetric}\\
        &= 0
    \end{aligned}$$

    - Therefore, $H$ and $I - H$ project onto orthogonal subspaces, because the dot product of $HV$ and $(I - H)V$ is 0 for any arbitrary vector $V$

- Since $H$ and $I - H$ project onto complementary subspaces of the $n$-dimensional space containing $y$, and $H$ has rank $k$, $I - H$ must account for the remaining $n-k$ dimensions. Equivalently, the rank of an idempotent matrix equals its trace, so $\text{rank}(I - H) = \text{tr}(I) - \text{tr}(H) = n - k$
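The orthogonality and the rank of $I - H$ can both be checked numerically. The sketch below assumes a hypothetical design matrix with $n = 30$ rows and $k = 5$ columns; the data, the arbitrary vector `v`, and the rank tolerance are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(5)
n_obs, k_demo = 30, 5
X_demo = np.column_stack([rng.normal(size=(n_obs, k_demo - 1)), np.ones(n_obs)])
H_demo = X_demo @ np.linalg.inv(X_demo.T @ X_demo) @ X_demo.T
M_demo = np.eye(n_obs) - H_demo            # I - H

v = rng.normal(size=n_obs)                 # arbitrary vector V
dot = (H_demo @ v) @ (M_demo @ v)          # (HV)^T (I - H)V, zero up to rounding
rank_M = np.linalg.matrix_rank(M_demo, tol=1e-8)
print(dot, rank_M, n_obs - k_demo)
```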