## Understanding the OLS Summary Table

- In the previous section, we showed, both theoretically and in code, how to derive the coefficient vector $\beta$ purely from the objective of minimising the mean squared error (MSE)

- But notice something odd about our results: the matrix algebra gave us only the coefficient values

- The OLS summary table gives us much more than this!

- How can we derive every part of the OLS summary table? Let's find out

In [3]:
import pandas as pd
import numpy as np
from sklearn.datasets import make_regression
import statsmodels.api as sm

# Synthetic regression problem: 5 features, only 2 of them informative
x, y = make_regression(
    n_samples=500,
    n_features=5,
    n_informative=2,
    n_targets=1,
    noise=5,
    bias=5,
    random_state=123
)
# Append a column of ones so that the last coefficient is the intercept
x = np.append(x, np.ones((500, 1)), axis=1)
print(x.shape)

# Closed-form OLS estimate: (X'X)^{-1} X'y
betas = np.linalg.inv(x.transpose() @ x) @ x.transpose() @ y
np.set_printoptions(suppress=True)
print(betas)

print('='*50)
# Fit the same model with statsmodels for comparison
res = sm.OLS(exog=x, endog=y, hasconst=True).fit()
res.summary()

(500, 6)
[-0.16521089  0.2381359   0.00976686 60.45175552 26.46640238  4.8924384 ]


| | | | |
|---|---|---|---|
| Dep. Variable: | y | R-squared: | 0.994 |
| Model: | OLS | Adj. R-squared: | 0.994 |
| Method: | Least Squares | F-statistic: | 17750.0 |
| Date: | Wed, 09 Oct 2024 | Prob (F-statistic): | 0.0 |
| Time: | 09:41:36 | Log-Likelihood: | -1508.0 |
| No. Observations: | 500 | AIC: | 3028.0 |
| Df Residuals: | 494 | BIC: | 3053.0 |
| Df Model: | 5 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| x1 | -0.1652 | 0.233 | -0.710 | 0.478 | -0.622 | 0.292 |
| x2 | 0.2381 | 0.236 | 1.008 | 0.314 | -0.226 | 0.702 |
| x3 | 0.0098 | 0.222 | 0.044 | 0.965 | -0.426 | 0.445 |
| x4 | 60.4518 | 0.222 | 272.722 | 0.000 | 60.016 | 60.887 |
| x5 | 26.4664 | 0.227 | 116.601 | 0.000 | 26.020 | 26.912 |
| const | 4.8924 | 0.223 | 21.982 | 0.000 | 4.455 | 5.330 |

| | | | |
|---|---|---|---|
| Omnibus: | 1.207 | Durbin-Watson: | 1.76 |
| Prob(Omnibus): | 0.547 | Jarque-Bera (JB): | 1.202 |
| Skew: | 0.028 | Prob(JB): | 0.548 |
| Kurtosis: | 2.766 | Cond. No. | 1.11 |
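
The closed-form estimate in the cell above inverts $X^T X$ explicitly. As a hedged aside, the same coefficients can usually be obtained without forming the inverse, which tends to be more stable when the features are strongly correlated. The sketch below uses synthetic data generated with NumPy only; the names `X_demo`, `y_demo`, and `true_beta` are illustrative assumptions and are not part of the notebook above.

```python
import numpy as np

# Illustrative data (assumption): 3 features plus a column of ones
rng = np.random.default_rng(0)
n_obs, n_feat = 200, 3
X_demo = np.column_stack([rng.normal(size=(n_obs, n_feat)), np.ones(n_obs)])
true_beta = np.array([1.5, -2.0, 0.0, 5.0])
y_demo = X_demo @ true_beta + rng.normal(scale=0.5, size=n_obs)

# Normal equations with an explicit inverse, as in the cell above
beta_inv = np.linalg.inv(X_demo.T @ X_demo) @ X_demo.T @ y_demo

# Same linear system, solved without forming the inverse
beta_solve = np.linalg.solve(X_demo.T @ X_demo, X_demo.T @ y_demo)

# Least-squares solver applied directly to X and y
beta_lstsq, *_ = np.linalg.lstsq(X_demo, y_demo, rcond=None)

# All three routes should agree up to floating-point error
print(np.allclose(beta_inv, beta_solve), np.allclose(beta_inv, beta_lstsq))
```

On well-conditioned data like this, the three estimates agree; the choice only starts to matter when $X^T X$ is close to singular.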


### Log Likelihood

- The log likelihood is computed using the following formula, where $n$ is the number of observations and $k$ is the number of estimated coefficients (including the constant). The derivation is shown in the subsection below
$$\begin{aligned}
    LL &= -\frac{n}{2} \left( \log(2\pi) + \log(\sigma^2) + \frac{1}{n\sigma^2} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 \right) \\
    \text{where} \\
    \sigma^2 &= \frac{1}{n-k} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2
\end{aligned}$$

In [4]:
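import numpy as np

ypred = x @ betas          # fitted values
n = len(y)                 # number of observations
k = x.shape[1]             # number of coefficients, including the constant
rss = (y - ypred)**2       # squared residuals per observation (summed below)
sigma_square = (1 / (n-k)) * np.sum((y - ypred)**2)

ll = (
    -n/2 * (
        np.log(2 * np.pi) +
        np.log(sigma_square) +
        ((1/(n*sigma_square)) * np.sum(rss))
    )
)
ll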

np.float64(-1508.0629969883523)
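
The value above plugs in the unbiased variance estimate $RSS/(n-k)$. As far as I know, statsmodels instead evaluates the log likelihood at the maximum-likelihood estimate $RSS/n$, in which case the last term collapses to $-n/2$. The two versions differ by $-\frac{n}{2}\left[\log\frac{n}{n-k} - \frac{k}{n}\right]$, which is roughly $-0.018$ for $n = 500$ and $k = 6$, so both round to the $-1508.0$ shown in the summary table. The sketch below checks that gap on synthetic data; `gaussian_loglik` and every `*_demo` name are hypothetical helpers introduced only for this illustration.

```python
import numpy as np

def gaussian_loglik(rss, n, sigma2):
    """Gaussian log likelihood of n residuals with sum of squares rss and variance sigma2."""
    return -n / 2 * (np.log(2 * np.pi) + np.log(sigma2)) - rss / (2 * sigma2)

# Illustrative data (assumption): same n and k as the notebook, different values
rng = np.random.default_rng(0)
n_demo, k_demo = 500, 6
X_demo = np.column_stack([rng.normal(size=(n_demo, k_demo - 1)), np.ones(n_demo)])
y_demo = X_demo @ np.arange(1.0, k_demo + 1) + rng.normal(scale=5.0, size=n_demo)

# Fit by least squares and compute the residual sum of squares
beta_demo, *_ = np.linalg.lstsq(X_demo, y_demo, rcond=None)
rss_demo = np.sum((y_demo - X_demo @ beta_demo) ** 2)

# Log likelihood under the two variance estimates
ll_unbiased = gaussian_loglik(rss_demo, n_demo, rss_demo / (n_demo - k_demo))
ll_mle = gaussian_loglik(rss_demo, n_demo, rss_demo / n_demo)

# Closed-form gap, which depends only on n and k
gap = -n_demo / 2 * (np.log(n_demo / (n_demo - k_demo)) - k_demo / n_demo)
print(ll_unbiased - ll_mle, gap)
```

Because the gap depends only on $n$ and $k$, it shrinks relative to the log likelihood as the sample grows.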

#### How do I derive the formula for Log Likelihood?

##### Definition of Likelihood

- The likelihood in OLS is the joint probability density of the observed residuals $\hat{\epsilon}_i$ or, equivalently, of the observed outcomes $y_i$, viewed as a function of the parameters

- In OLS, we assume that residuals $\hat{\epsilon}_i \sim N(0, \sigma^2)$

- We also know that $\hat{\epsilon}_i = y_i - \hat{y}_i$

- Therefore, the PDF of $\hat{\epsilon}_i$ is
$$\begin{aligned}
    f(\hat{\epsilon}_i) &= \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(- \frac{1}{2 \sigma^2} \hat{\epsilon}_i^2\right) \\
    &= \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(- \frac{1}{2 \sigma^2} (y_i - \hat{y}_i)^2\right) \\
    &= f(y_i | X, \beta) & \because \hat{y}_i \text{ is deterministic given } X \text{ and } \beta
\end{aligned}$$

- Suppose we have $n$ independent observations

- Then, the likelihood of observing these $n$ observations must be, by definition:
$$\begin{aligned}
    \prod_i f(\hat{\epsilon}_i) &= \prod_i f(y_i | X, \beta) \\
    &= \prod_i \frac{1}{\sqrt{2\pi\sigma^2}} \exp(- \frac{1}{2 \sigma^2} (y_i - \hat{y}_i)^2) \\
    &= L(\beta, \sigma^2, X)
\end{aligned}$$

- Because the observations are independent, the joint density of the residuals $\hat{\epsilon}_i$ (or equivalently, of the outcomes $y_i$) is simply the product of their individual densities, each drawn from a normal distribution!

- Thus, we have shown how to compute **Likelihood**
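
As a minimal sketch of the definition above, the likelihood can be computed by evaluating the normal density at each residual and multiplying the results. The residuals and variance here are made-up numbers chosen only for illustration, and `normal_pdf` is a hypothetical helper, not something defined elsewhere in this notebook.

```python
import numpy as np

def normal_pdf(resid, sigma2):
    """Density of N(0, sigma2) evaluated at each residual."""
    return np.exp(-resid**2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)

# Hypothetical residuals and variance (assumptions for illustration)
resid_demo = np.array([0.5, -1.2, 0.3, 2.0, -0.7])
sigma2_demo = 1.5

# One density per observation, then their product is the likelihood
densities = normal_pdf(resid_demo, sigma2_demo)
likelihood = np.prod(densities)

print(densities)
print(likelihood)
```

Residuals close to zero receive the highest density, so a better fit (smaller residuals for the same $\sigma^2$) yields a larger likelihood.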

##### Derivation of Log Likelihood

- As you might have deduced from the product above, computing the likelihood directly is awkward

- We have to compute the density of every residual $\hat{\epsilon}_i$ and multiply them all together
    - Multiplying many small numbers can produce exceedingly small values that underflow to zero in floating-point arithmetic

- Thus, we usually work with $\log(L(\beta, \sigma^2, X))$ instead, which gives us the **Log Likelihood**
    - Why?
    - Because the $\log$ of a product is a sum of logs, and sums are MUCH easier to work with. Since $\log$ is monotonic, whatever maximises the likelihood also maximises the log likelihood

- Therefore:
$$\begin{aligned}
    \log(L(\beta, \sigma^2, X)) &= LL(\beta, \sigma^2, X) \\
    &= \log(\prod_i \frac{1}{\sqrt{2\pi\sigma^2}} \exp(- \frac{1}{2 \sigma^2} (y_i - \hat{y}_i)^2)) \\
    &= \sum_i \log(\frac{1}{\sqrt{2\pi\sigma^2}}) + \sum_i \log(\exp(- \frac{1}{2 \sigma^2} (y_i - \hat{y}_i)^2)) \\
    &= \sum_i \log((2\pi\sigma^2)^{-\frac{1}{2}}) + \sum_i -\frac{1}{2 \sigma^2} (y_i - \hat{y}_i)^2 \\
    &= \sum_i -\frac{1}{2} \log(2\pi\sigma^2) + \sum_i -\frac{1}{2 \sigma^2} (y_i - \hat{y}_i)^2 \\
    &= -\frac{n}{2} \log(2\pi\sigma^2) - \frac{1}{2 \sigma^2} \sum_i (y_i - \hat{y}_i)^2 \\
    &= -\frac{n}{2} \log(2\pi) - \frac{n}{2} \log(\sigma^2) - \frac{RSS}{2 \sigma^2}, \quad \text{where } RSS = \sum_i (y_i - \hat{y}_i)^2
\end{aligned}$$
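
The last line of the derivation can be checked numerically. The sketch below draws synthetic residuals (an assumption for illustration, as are the names `resid_demo` and `sigma2_demo`), sums the per-observation log densities, and compares the result with the closed form. It also shows why the log is needed: the raw product of 2000 densities underflows to zero.

```python
import numpy as np

# Synthetic residuals (assumption): 2000 draws from N(0, 4)
rng = np.random.default_rng(1)
sigma2_demo = 4.0
resid_demo = rng.normal(scale=np.sqrt(sigma2_demo), size=2000)
n_demo = resid_demo.size

# Per-observation log densities of N(0, sigma2_demo)
log_dens = -0.5 * np.log(2 * np.pi * sigma2_demo) - resid_demo**2 / (2 * sigma2_demo)

# Route 1: sum the log densities
ll_sum = np.sum(log_dens)

# Route 2: closed form from the last line of the derivation
rss_demo = np.sum(resid_demo**2)
ll_closed = (
    -n_demo / 2 * np.log(2 * np.pi)
    - n_demo / 2 * np.log(sigma2_demo)
    - rss_demo / (2 * sigma2_demo)
)

# Route 3: the raw likelihood, a product of 2000 densities, underflows
likelihood = np.prod(np.exp(log_dens))

print(np.isclose(ll_sum, ll_closed))
print(likelihood)
```

Each density here is at most about 0.2, so the product is far smaller than the smallest representable double, while the log likelihood remains an ordinary finite number.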

### Log-Likelihood Extension: Akaike Information Criterion (AIC) and Bayesian Information Criterion (BIC)

- The AIC is an extension of the log likelihood that adds a penalty for the number of estimated coefficients $k$, and is computed using
$$\begin{aligned}
    AIC &= 2k - 2 \log(L)
\end{aligned}$$

In [6]:
aic = (2*k) - (2*ll)
aic

np.float64(3028.1259939767046)

- The BIC is an extension of the log likelihood whose penalty also grows with the number of observations $n$, and is computed using
$$\begin{aligned}
    BIC &= k \log(n) - 2 \log(L)
\end{aligned}$$

In [7]:
import numpy as np
bic = (-2 * ll) + (k * np.log(n))
bic

np.float64(3053.4136425672377)
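
The two formulas can be wrapped in small helpers to make the penalty terms explicit. This is a minimal sketch: `aic` and `bic` are hypothetical function names, and the inputs are simply the values of `ll`, `k`, and `n` printed earlier in this notebook.

```python
import numpy as np

def aic(ll, k):
    """Akaike Information Criterion: 2k - 2*LL."""
    return 2 * k - 2 * ll

def bic(ll, k, n):
    """Bayesian Information Criterion: k*log(n) - 2*LL."""
    return k * np.log(n) - 2 * ll

# Values taken from the outputs above
ll_doc, k_doc, n_doc = -1508.0629969883523, 6, 500

print(aic(ll_doc, k_doc))
print(bic(ll_doc, k_doc, n_doc))

# Per-coefficient penalty: 2 for AIC versus log(n) for BIC
print(2, np.log(n_doc))
```

For both criteria, lower values indicate a better trade-off between fit and complexity. BIC charges $\log(n)$ per coefficient instead of 2, so it penalises extra coefficients more heavily than AIC whenever $n \geq 8$.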

#### Derivation of AIC and BIC

In [8]:
## TODO