## Understanding the OLS Summary Table

- In the previous section, we showed both theoretically and in code how to derive the coefficient vector $\beta$ purely from the objective of minimising the mean squared error (MSE)

- Notice, however, that the matrix algebra gave us only the coefficient values

- The OLS summary table reports much more than that!

- How can we derive every other part of the OLS summary table? Let's find out

In [1]:
import pandas as pd
import numpy as np
from sklearn.datasets import make_regression
# import statsmodels.formula.api as smf
import statsmodels.api as sm

x,y = make_regression(
    n_samples=500, 
    n_features=5, 
    n_informative=2, 
    n_targets=1, 
    noise=5, 
    bias=5,
    random_state=123
)
x = np.append(x, np.ones((500,1)), axis = 1)
print(x.shape)

betas = np.linalg.inv((x.transpose() @ x)) @ x.transpose() @ y
np.set_printoptions(suppress=True)
print(betas)

print('='*50)
res = sm.OLS(exog=x, endog=y, hasconst=True).fit()
res.summary()

(500, 6)
[-0.16521089  0.2381359   0.00976686 60.45175552 26.46640238  4.8924384 ]


| | | | |
|---|---|---|---|
| Dep. Variable: | y | R-squared: | 0.994 |
| Model: | OLS | Adj. R-squared: | 0.994 |
| Method: | Least Squares | F-statistic: | 17750.0 |
| Date: | Fri, 17 Jan 2025 | Prob (F-statistic): | 0.0 |
| Time: | 17:09:58 | Log-Likelihood: | -1508.0 |
| No. Observations: | 500 | AIC: | 3028.0 |
| Df Residuals: | 494 | BIC: | 3053.0 |
| Df Model: | 5 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| x1 | -0.1652 | 0.233 | -0.710 | 0.478 | -0.622 | 0.292 |
| x2 | 0.2381 | 0.236 | 1.008 | 0.314 | -0.226 | 0.702 |
| x3 | 0.0098 | 0.222 | 0.044 | 0.965 | -0.426 | 0.445 |
| x4 | 60.4518 | 0.222 | 272.722 | 0.000 | 60.016 | 60.887 |
| x5 | 26.4664 | 0.227 | 116.601 | 0.000 | 26.020 | 26.912 |
| const | 4.8924 | 0.223 | 21.982 | 0.000 | 4.455 | 5.330 |

| | | | |
|---|---|---|---|
| Omnibus: | 1.207 | Durbin-Watson: | 1.76 |
| Prob(Omnibus): | 0.547 | Jarque-Bera (JB): | 1.202 |
| Skew: | 0.028 | Prob(JB): | 0.548 |
| Kurtosis: | 2.766 | Cond. No. | 1.11 |


### Other Statistical Tests

In [17]:
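n = x.shape[0]            # number of observations
epsilon = y - x @ betas   # residuals from the fitted model
s = np.std(epsilon)       # residual standard deviation (np.std default, ddof=0)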

#### Skew and Kurtosis

In [34]:
from scipy.stats import skew, kurtosis

_skew = skew(epsilon)
# scipy computes excess kurtosis (a normal distribution scores 0).
# Add 3 for regular kurtosis, which is what the summary table reports (2.766).
_kurtosis = kurtosis(epsilon)
_skew, _kurtosis

(np.float64(0.02770497453847136), np.float64(-0.2336862603260439))
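To make the two numbers less of a black box, here is a minimal sketch that computes them from central moments. It assumes the biased (divide by $n$) moment estimators, which is scipy's default; the function name and the toy data are illustrative only and not part of the notebook above.

```python
import numpy as np

def sample_skew_and_excess_kurtosis(values):
    """Biased sample skewness and excess kurtosis from central moments."""
    values = np.asarray(values, dtype=float)
    centered = values - values.mean()
    m2 = np.mean(centered**2)  # second central moment (variance, ddof=0)
    m3 = np.mean(centered**3)  # third central moment
    m4 = np.mean(centered**4)  # fourth central moment
    skewness = m3 / m2**1.5
    excess_kurtosis = m4 / m2**2 - 3.0  # subtract 3 so a normal distribution scores 0
    return skewness, excess_kurtosis

# Toy data: symmetric, so the skewness is 0; flatter than a normal, so excess kurtosis is negative
demo_skew, demo_kurt = sample_skew_and_excess_kurtosis([1, 2, 3, 4, 5])
print(demo_skew, demo_kurt)
```

Calling the same function on `epsilon` should agree with the `skew` and `kurtosis` values computed above.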

#### Omnibus Normality Test

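- This tests the null hypothesis that the residuals are normally distributed. It is usually used to detect issues with model fit, since a well-specified model should leave no "pattern" in the residuals
- It looks only at the shape of the residual distribution (skewness and kurtosis). The other assumptions on the errors are checked separately:
    - Homoscedasticity: errors are assumed homoscedastic, see 6. Summary Table: Standard Error of Coefficients and t-test
    - Independence: errors are assumed independent, see the Durbin-Watson statistic

- To compute the Omnibus statistic:
    - Compute residuals
    - Compute skewness and kurtosis of residuals
    - Transform each into an approximately standard normal score, $Z_1$ for skewness and $Z_2$ for kurtosis, and add their squares
    $$\begin{aligned}
        K^2 = Z_1(\text{skew})^2 + Z_2(\text{kurtosis})^2
    \end{aligned}$$
    - Under the null hypothesis, $K^2$ approximately follows a $\chi^2$ distribution with 2 degrees of freedom
    - Without the normalising transformations, the same idea gives the simpler statistic
    $$\begin{aligned}
        O = \frac{n}{6} \cdot (\text{skew}^2 + \frac{1}{4} \cdot (\text{kurtosis} - 3)^2)
    \end{aligned}$$
    - $n$ is the sample size, and we use $\text{kurtosis} - 3$ because a normal distribution has a kurtosis of 3. This unadjusted form is the Jarque-Bera statistic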

In [48]:
from scipy.stats import normaltest
normaltest(epsilon)

NormaltestResult(statistic=np.float64(1.2073378518941995), pvalue=np.float64(0.5468017761105753))
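The statistic returned by `normaltest` can be rebuilt from its two ingredients. The sketch below assumes the D'Agostino-Pearson construction that scipy documents for `normaltest`: the squared z-score from `skewtest` plus the squared z-score from `kurtosistest`, compared against a $\chi^2$ distribution with 2 degrees of freedom. It runs on synthetic residuals (`demo_residuals` is a made-up name) so that it is self-contained.

```python
import numpy as np
from scipy.stats import chi2, kurtosistest, normaltest, skewtest

# Synthetic stand-in for the model residuals (illustrative only)
rng = np.random.default_rng(0)
demo_residuals = rng.normal(size=500)

# z-scores for "skewness equals that of a normal" and "kurtosis equals that of a normal"
z_skew = skewtest(demo_residuals).statistic
z_kurt = kurtosistest(demo_residuals).statistic

# Omnibus statistic: sum of the two squared z-scores, chi-square with 2 degrees of freedom
k2_manual = z_skew**2 + z_kurt**2
p_manual = chi2.sf(k2_manual, df=2)

k2_scipy, p_scipy = normaltest(demo_residuals)
print(k2_manual, k2_scipy)
print(p_manual, p_scipy)
```

Swapping `demo_residuals` for `epsilon` should reproduce the Omnibus and Prob(Omnibus) entries of the summary table.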

#### Jarque-Bera Normality Test

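- Very closely related to the Omnibus test: we are again testing the residuals for normality
- The only difference is in how the skew/kurtosis statistics are adjusted. Jarque-Bera uses the raw sample skewness and excess kurtosis, scaled by $n/6$, while the Omnibus test first transforms each into a normal score
- Under the null hypothesis, the JB statistic approximately follows a $\chi^2$ distribution with 2 degrees of freedom, which is where the p-value comes from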

In [None]:
# _kurtosis from scipy is already excess kurtosis, so no "- 3" is needed here
jb_test_statistic = n / 6 * (_skew**2 + 0.25 * _kurtosis**2)
jb_test_statistic

np.float64(1.2016568900391862)

In [None]:
from scipy.stats import chi2
# p-value: upper-tail probability of a chi-square distribution with 2 degrees of freedom
jb_p_value = 1 - chi2.cdf(jb_test_statistic, df=2)
jb_p_value

np.float64(0.5483571641059298)
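As a sanity check on the two cells above, the sketch below wraps the same calculation in a function and compares it with `scipy.stats.jarque_bera`. The function name and the synthetic residuals are illustrative assumptions, not part of the notebook.

```python
import numpy as np
from scipy.stats import chi2, jarque_bera, kurtosis, skew

def jarque_bera_manual(residuals):
    """Jarque-Bera statistic and p-value from raw skewness and excess kurtosis."""
    residuals = np.asarray(residuals, dtype=float)
    n_obs = residuals.shape[0]
    s = skew(residuals)       # biased sample skewness
    k = kurtosis(residuals)   # excess kurtosis (normal distribution = 0)
    statistic = n_obs / 6 * (s**2 + 0.25 * k**2)
    p_value = chi2.sf(statistic, df=2)  # upper tail, 2 degrees of freedom
    return statistic, p_value

# Synthetic stand-in for the model residuals (illustrative only)
rng = np.random.default_rng(1)
demo_residuals = rng.normal(size=500)

jb_manual, p_manual = jarque_bera_manual(demo_residuals)
jb_scipy, p_scipy = jarque_bera(demo_residuals)
print(jb_manual, jb_scipy)
print(p_manual, p_scipy)
```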

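#### Durbin-Watson Statistic

- This is used to detect first-order autocorrelation in the residuals, and is most relevant when the observations are ordered in time (for example, when fitting ARIMA-type models)
- The statistic lies between 0 and 4: values near 2 indicate no autocorrelation, values towards 0 indicate positive autocorrelation, and values towards 4 indicate negative autocorrelation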

$$\begin{aligned}
    DW &= \frac{\sum_{t=2}^{n} (e_t - e_{t-1})^2}{\sum_{t=1}^{n} e_t^2}
\end{aligned}$$

In [52]:
# Denominator: sum of squared residuals
eps_sq = np.sum(epsilon**2)
# Numerator: sum of squared differences between consecutive residuals
eps_diff_sq = np.sum(np.diff(epsilon)**2)
dw_stat = eps_diff_sq / eps_sq
dw_stat

np.float64(1.7599039565117907)
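To see how the statistic reacts to autocorrelation, here is a minimal sketch on synthetic series. It compares white noise with an AR(1) series whose coefficient `phi = 0.8` is an arbitrary choice for illustration. It also checks the approximation $DW \approx 2(1 - r_1)$, where $r_1$ is the lag-1 autocorrelation of the residuals. The function names are made up for this example.

```python
import numpy as np

def durbin_watson_stat(residuals):
    """Sum of squared successive differences over the sum of squared residuals."""
    residuals = np.asarray(residuals, dtype=float)
    return np.sum(np.diff(residuals)**2) / np.sum(residuals**2)

def lag1_autocorrelation(residuals):
    """Lag-1 autocorrelation, treating the residuals as mean-zero."""
    residuals = np.asarray(residuals, dtype=float)
    return np.sum(residuals[1:] * residuals[:-1]) / np.sum(residuals**2)

rng = np.random.default_rng(0)
white_noise = rng.normal(size=2000)

# AR(1) series with positive autocorrelation (phi is an illustrative choice)
phi = 0.8
ar1 = np.zeros_like(white_noise)
for t in range(1, len(ar1)):
    ar1[t] = phi * ar1[t - 1] + white_noise[t]

dw_white = durbin_watson_stat(white_noise)
dw_ar1 = durbin_watson_stat(ar1)
r_white = lag1_autocorrelation(white_noise)
r_ar1 = lag1_autocorrelation(ar1)

print(dw_white, 2 * (1 - r_white))  # expected to sit near 2
print(dw_ar1, 2 * (1 - r_ar1))      # expected to sit well below 2
```

The approximation is not exact because the first and last squared residuals appear only once in the numerator, but the gap shrinks as the sample grows.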

#### Condition Number

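- This reflects the degree of multicollinearity among the columns of the design matrix

- It is computed by dividing the largest singular value of the design matrix by the smallest

- A large condition number implies stronger multicollinearity; a value close to 1 (here 1.11) means the columns are nearly orthogonal and on a similar scale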

In [None]:
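# Only the singular values are needed; a separate name avoids overwriting `s` (residual std) from earlier
_, singular_values, _ = np.linalg.svd(x, full_matrices=False)
condition_number = singular_values.max() / singular_values.min()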
condition_number

np.float64(1.1103349968531293)

- It is useful to think about the intuition of this:

- When taking an SVD of some design matrix $X = U \Sigma V^T$, you get 3 matrices
    - $U$ --> orthonormal "directions" in observation space (a basis for the column space of $X$)
    - $V^T$ --> orthonormal "directions" in feature space (a basis for the row space of $X$)
    - $\Sigma$ --> diagonal matrix containing the stretching and/or shrinking magnitudes along the principal directions

- Ideally, if all your features are uncorrelated (and on a similar scale), the singular values should be of similar magnitude

- But if there is high correlation, intuitively there is one "direction" that captures most of the variation, with the rest being very small

- As such, the ratio of the largest to the smallest singular value will be large!
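The intuition above can be checked on synthetic data. The sketch below builds one design matrix with two independent columns and one where the second column is almost a copy of the first; the noise scale `0.01` and all names are illustrative assumptions. It also shows two equivalent routes to the same number: `np.linalg.cond` and the square root of the eigenvalue ratio of $X^T X$.

```python
import numpy as np

def condition_number(design):
    """Ratio of the largest to the smallest singular value of a design matrix."""
    singular_values = np.linalg.svd(design, compute_uv=False)
    return singular_values.max() / singular_values.min()

rng = np.random.default_rng(0)
n_rows = 500
a = rng.normal(size=n_rows)
b = rng.normal(size=n_rows)

# Two unrelated features versus two nearly identical features
independent = np.column_stack([a, b])
nearly_collinear = np.column_stack([a, a + 0.01 * rng.normal(size=n_rows)])

cond_independent = condition_number(independent)
cond_collinear = condition_number(nearly_collinear)
print(cond_independent)  # close to 1
print(cond_collinear)    # much larger

# Equivalent computations of the same quantity
cond_numpy = np.linalg.cond(nearly_collinear)        # 2-norm condition number
eigvals = np.linalg.eigvalsh(nearly_collinear.T @ nearly_collinear)
cond_eig = np.sqrt(eigvals.max() / eigvals.min())    # singular values are square roots of these
print(cond_numpy, cond_eig)
```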