## Ordinary Least Squares (OLS)

- OLS is a common starting point for machine learning: it is one of the oldest and most widely used estimation methods
- It is extraordinarily useful, and many ML approaches can be framed as, or built on top of, an OLS fit

### Theory

- In linear regression we want to minimise the **Mean Squared Error (MSE)** given some data. Dropping the constant factor $1/n$, which does not change the minimising $\beta$, the MSE in matrix form is
$$\begin{aligned} 
    \text{MSE} &= (Y_i - X_{i} \cdot \beta)^2 \\
    &= (Y_i - X_{i} \cdot \beta)^T \cdot (Y_i - X_{i} \cdot \beta) \\
    &= Y_i^T Y_i - Y_i^T X_i \beta - (X_i \beta)^T Y_i + (X_i \beta)^T (X_i \beta) \\
    &= Y_i^T Y_i - ((X_i \beta)^T Y_i)^T - (X_i \beta)^T Y_i + (X_i \beta)^T (X_i \beta) & \because B^T A^T = (AB)^T \\
    &= Y_i^T Y_i - (X_i \beta)^T Y_i - (X_i \beta)^T Y_i + (X_i \beta)^T (X_i \beta) & \because (X_i \beta)^T Y_i \text{ is scalar}\\
    &= Y_i^T Y_i - 2 (X_i \beta)^T Y_i + (X_i \beta)^T (X_i \beta) \\
    &= Y_i^T Y_i - 2 (X_i \beta)^T Y_i + \beta^T X_i^T X_i \beta 
\end{aligned}$$

- Explaining each term
    - $Y_i$ is an $n \times 1$ vector that represents the dependent variable
    - $X_i$ is an $n \times m$ matrix that represents the independent variables
    - $\beta$ is an $m \times 1$ vector that represents the coefficients of the regression
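
The expansion above can be sanity-checked numerically. The sketch below is only an illustration on random data: the sizes `n` and `m`, the seed, and the variable names are assumptions made here, not part of the original derivation. It compares the squared residual norm with the final expanded form.

```python
import numpy as np

rng = np.random.default_rng(0)     # arbitrary seed; synthetic data for illustration only
n, m = 50, 3                       # assumed sizes: n observations, m features

X = rng.normal(size=(n, m))        # n x m matrix of independent variables
Y = rng.normal(size=n)             # dependent variable (length n)
beta = rng.normal(size=m)          # any coefficient vector, not necessarily optimal

# First line of the derivation: (Y - X beta)^T (Y - X beta)
residual = Y - X @ beta
lhs = residual @ residual

# Last line of the derivation: Y^T Y - 2 (X beta)^T Y + beta^T X^T X beta
rhs = Y @ Y - 2 * (X @ beta) @ Y + beta @ X.T @ X @ beta

print(np.isclose(lhs, rhs))
```

The two quantities agree up to floating-point rounding for any choice of `beta`, which is what the algebra claims.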

### Solving for Coefficients $\beta$

- In OLS, we choose $\beta$ so as to minimise the MSE

- That is, we want to find $\beta^{*}$ such that
$$\begin{aligned}
    \frac{\partial (\text{MSE})}{\partial (\beta)} &= 0
\end{aligned}$$

- Substituting the expanded MSE into this first-order condition:
$$\begin{aligned}
    \frac{\partial (\text{MSE})}{\partial (\beta)} &= \frac{\partial (Y_i^T Y_i - 2 (X_i \beta)^T Y_i + \beta^T X_i^T X_i \beta)}{\partial (\beta)} \\
    &= \frac{\partial}{\partial \beta} (Y_i^T Y_i) - \frac{\partial}{\partial \beta} (2 (X_i \beta)^T Y_i) + \frac{\partial}{\partial \beta} (\beta^T X_i^T X_i \beta) \\
    &= -2 X_i^T Y_i + 2 X_i^T X_i \beta \\
    &= 0 \\
    \therefore -2 X_i^T Y_i + 2 X_i^T X_i \beta &= 0 \\
    X_i^T Y_i &= X_{i}^T X_{i} \beta \\
    \beta &= (X_{i}^T \cdot X_{i})^{-1} \cdot X_i^T Y_i 
\end{aligned}$$

- This closed-form solution, valid when $X_i^T X_i$ is invertible, gives exactly the OLS coefficients $\beta$. Let's test this out in code: the coefficients computed by hand match the ones reported by `statsmodels` to the precision shown.

```python
import pandas as pd
import numpy as np
from sklearn.datasets import make_regression
import statsmodels.api as sm

# Synthetic regression data: 500 samples, 5 features, only 2 of them informative
x, y = make_regression(n_samples=500, n_features=5, n_informative=2, n_targets=1, noise=5, random_state=123)

# Closed-form OLS solution: beta = (X^T X)^{-1} X^T y
betas = np.linalg.inv(x.T @ x) @ x.T @ y
np.set_printoptions(suppress=True)
print(betas)
```

```
[-0.16118975  0.23764784  0.01202546 60.45176247 26.46231774]
```


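```python
# Fit the same model with statsmodels for comparison.
# Note: x has no column of ones here; hasconst=True is kept from the original run.
# It affects summary statistics such as Df Model and R-squared, not the coefficients.
res = sm.OLS(exog=x, endog=y, hasconst=True).fit()
res.summary()
```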

| | | | |
|---|---|---|---|
| Dep. Variable: | y | R-squared: | 0.994 |
| Model: | OLS | Adj. R-squared: | 0.994 |
| Method: | Least Squares | F-statistic: | 22220.0 |
| Date: | Sat, 05 Oct 2024 | Prob (F-statistic): | 0.0 |
| Time: | 14:16:46 | Log-Likelihood: | -1508.2 |
| No. Observations: | 500 | AIC: | 3026.0 |
| Df Residuals: | 495 | BIC: | 3047.0 |
| Df Model: | 4 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| x1 | -0.1612 | 0.232 | -0.694 | 0.488 | -0.618 | 0.295 |
| x2 | 0.2376 | 0.236 | 1.007 | 0.315 | -0.226 | 0.702 |
| x3 | 0.0120 | 0.222 | 0.054 | 0.957 | -0.423 | 0.447 |
| x4 | 60.4518 | 0.221 | 272.934 | 0.000 | 60.017 | 60.887 |
| x5 | 26.4623 | 0.227 | 116.754 | 0.000 | 26.017 | 26.908 |

| | | | |
|---|---|---|---|
| Omnibus: | 1.208 | Durbin-Watson: | 1.759 |
| Prob(Omnibus): | 0.547 | Jarque-Bera (JB): | 1.202 |
| Skew: | 0.028 | Prob(JB): | 0.548 |
| Kurtosis: | 2.766 | Cond. No. | 1.11 |
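
The explicit inverse used above works well on a small, well-conditioned problem like this one. As a hedged aside, the sketch below shows two other ways to solve the same normal equations with NumPy (`np.linalg.solve` and `np.linalg.lstsq`) and checks that the gradient derived earlier vanishes at the solution. To stay dependency-free it uses its own synthetic data: the seed, `true_beta`, and the noise scale are assumptions chosen for illustration, not the `make_regression` data from the cells above.

```python
import numpy as np

rng = np.random.default_rng(123)                     # arbitrary seed (assumption)
n, m = 500, 5
X = rng.normal(size=(n, m))
true_beta = np.array([0.0, 0.0, 0.0, 60.0, 25.0])    # hypothetical coefficients
y = X @ true_beta + rng.normal(scale=5.0, size=n)    # noisy linear response

# 1. Explicit inverse, as in the derivation: beta = (X^T X)^{-1} X^T y
beta_inv = np.linalg.inv(X.T @ X) @ X.T @ y

# 2. Solve the normal equations X^T X beta = X^T y without forming the inverse
beta_solve = np.linalg.solve(X.T @ X, X.T @ y)

# 3. Least-squares solver applied directly to X and y
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

# First-order condition from the derivation: -2 X^T y + 2 X^T X beta = 0
gradient = -2 * X.T @ y + 2 * X.T @ X @ beta_solve

print(np.allclose(beta_inv, beta_solve), np.allclose(beta_inv, beta_lstsq))
print(np.allclose(gradient, 0.0, atol=1e-6))
```

All three routes give the same coefficients here. Avoiding the explicit inverse is generally the safer habit when $X_i^T X_i$ is close to singular, for example when features are highly correlated.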


### Adding in a constant term $\beta_0$

- This is a straightforward extension of the theory above

- We simply add another feature column consisting entirely of `1`s.
    - Its coefficient is the intercept: the predicted value of the dependent variable when all other features are 0

- The solution is the same matrix formula as before, applied to a matrix with 1 extra column

- You can play around with the `bias` argument of the dataset generator to see how well the new coefficient captures the bias
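
Written out, appending the column of ones amounts to the following. The symbols $\tilde{X}_i$, $\tilde{\beta}$ and $\mathbf{1}_n$ (an $n \times 1$ vector of ones) are notation introduced here for illustration, with the constant placed in the last position:

$$\begin{aligned}
    \tilde{X}_i &= \begin{bmatrix} X_i & \mathbf{1}_n \end{bmatrix}, \qquad \tilde{\beta} = \begin{bmatrix} \beta \\ \beta_0 \end{bmatrix} \\
    \tilde{X}_i \tilde{\beta} &= X_i \beta + \beta_0 \mathbf{1}_n \\
    \tilde{\beta} &= (\tilde{X}_i^T \tilde{X}_i)^{-1} \tilde{X}_i^T Y_i
\end{aligned}$$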

```python
import pandas as pd
import numpy as np
from sklearn.datasets import make_regression
import statsmodels.api as sm

# Same synthetic data as before, but generated with an intercept (bias) of 5
x, y = make_regression(n_samples=500, n_features=5, n_informative=2, n_targets=1, bias=5, noise=5, random_state=123)

# Append a column of ones so that the last coefficient estimates the constant term
x = np.append(x, np.ones((500, 1)), axis=1)

# Closed-form OLS solution on the augmented matrix
betas = np.linalg.inv(x.T @ x) @ x.T @ y
np.set_printoptions(suppress=True)
print(betas)
```

```
[-0.16521089  0.2381359   0.00976686 60.45175552 26.46640238  4.8924384 ]
```
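
To try different bias values without rerunning the cell by hand, something like the sketch below could be used. It is a NumPy-only illustration: the helper `estimate_intercept`, the `true_beta` values, and the seed are hypothetical choices made here, and the data comes from a simple random generator rather than `make_regression`.

```python
import numpy as np

def estimate_intercept(bias, n=500, noise=5.0, seed=123):
    """Fit OLS with an appended column of ones and return the constant's coefficient."""
    rng = np.random.default_rng(seed)                    # arbitrary seed (assumption)
    X = rng.normal(size=(n, 5))
    true_beta = np.array([0.0, 0.0, 0.0, 60.0, 25.0])    # hypothetical coefficients
    y = X @ true_beta + bias + rng.normal(scale=noise, size=n)

    # Append the column of ones as the last feature, then solve by least squares
    X_aug = np.column_stack([X, np.ones(n)])
    beta_aug, *_ = np.linalg.lstsq(X_aug, y, rcond=None)
    return beta_aug[-1]                                  # last entry is the intercept

for bias in (0.0, 5.0, 50.0):
    print(f"true bias = {bias:5.1f}, estimated intercept = {estimate_intercept(bias):.3f}")
```

The estimated intercept should track the true bias up to sampling noise, which shrinks as the number of observations grows.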
