## Gradient Descent in OLS

- Gradient descent is quite a basic idea
    - Define a loss function $L$ that is a function of parameters $\beta$; i.e. $L(\beta)$
    - Find the gradient of $L$, i.e. the partial derivative $\frac{\partial L}{\partial \beta_i}$ with respect to each parameter $\beta_i$
        - If $\frac{\partial L}{\partial \beta_i} < 0$, then the gradient of the loss function is negative. That is, loss decreases as $\beta_i$ increases. Thus, we want to increase $\beta_i$
        - If $\frac{\partial L}{\partial \beta_i} > 0$, then the gradient of the loss function is positive. That is, loss increases as $\beta_i$ increases. Thus, we want to decrease $\beta_i$
    - Therefore, we can update $\beta_i$ by taking
        $$\begin{aligned}
            \beta_{i,t+1} &= \beta_{i,t} - \eta \cdot \frac{\partial L}{\partial \beta_i} & \qquad \eta \text{ is a constant representing the learning rate, } t \text{ indexes the iteration}
        \end{aligned}$$

    - This is all gradient descent is, at its most basic. More complicated optimisers (Adam, AdaGrad, etc.) are modifications of this basic idea
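As a minimal sketch of the update rule, consider a one-parameter toy loss $L(\beta) = (\beta - 3)^2$, whose derivative is $2(\beta - 3)$. The loss, the function names `gradient_descent_1d` and `toy_grad`, and the values of `eta` and `n_iter` are all assumptions chosen for illustration, not part of the OLS problem below.

```python
def gradient_descent_1d(grad, beta0, eta=0.1, n_iter=200):
    """Repeatedly step against the derivative of the loss."""
    beta = beta0
    for _ in range(n_iter):
        # Negative derivative -> beta increases; positive derivative -> beta decreases
        beta = beta - eta * grad(beta)
    return beta


def toy_grad(beta):
    # Derivative of the hypothetical toy loss L(beta) = (beta - 3)^2
    return 2.0 * (beta - 3.0)


beta_from_below = gradient_descent_1d(toy_grad, beta0=0.0)
beta_from_above = gradient_descent_1d(toy_grad, beta0=10.0)
print(beta_from_below, beta_from_above)
```

Starting below the minimum the derivative is negative, so $\beta$ is pushed up; starting above it the derivative is positive, so $\beta$ is pushed down. Both runs should end up close to the minimiser $\beta = 3$.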
 

### Implementation

- Let's suppose we have

$$\begin{aligned}
    y &= \beta_0 + \beta_1 x_1 + \beta_2 x_2
\end{aligned}$$

In [81]:
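import numpy as np
from sklearn.linear_model import LinearRegression

# True coefficients (beta_0, beta_1, beta_2) as a column vector
BETA = np.array([0.5, 5, 20]).reshape(-1, 1)

# Design matrix: a column of ones for the intercept, then two Uniform(0, 1) features
X = np.concatenate(
    (
        np.ones((300, 1)),
        np.random.uniform(0, 1, 600).reshape(300, -1)
    ),
    axis=1
)
# Response with standard normal noise added
y = X @ BETA + np.random.normal(0, 1, 300).reshape(-1, 1)
print(X.shape, y.shape)

# Reference fit; X already contains the intercept column, hence fit_intercept=False
lr = LinearRegression(fit_intercept=False)
lr.fit(X, y)
lr.coef_  # estimates are close to the "true" coefficients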

(300, 3) (300, 1)


array([[ 0.6580363 ,  4.8536405 , 19.88349535]])

- Now that we have set up the problem, let's try to use gradient descent to solve it instead

- From the equation's assumed functional form, let's define our loss function $L$ as

$$\begin{aligned}
    L &= \lVert y - \hat{y} \rVert^2 \\
    &= (y - X \hat{\beta})^T (y - X \hat{\beta}) \\ \\
    \frac{\partial L}{\partial \hat{\beta}} &= \frac{\partial}{\partial \hat{\beta}} (y - X \hat{\beta})^T (y - X \hat{\beta}) \\
    &= \frac{\partial}{\partial \hat{\beta}} (y^Ty - 2 y^T X \hat{\beta} + \hat{\beta}^T X^T X \hat{\beta}) \\
    &= - 2 X^T y + 2 X^T X \hat{\beta} \\
    &= 2 X^T (X \hat{\beta} - y) 
\end{aligned}$$
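One way to sanity-check the closed-form gradient $2 X^T (X \hat{\beta} - y)$ is to compare it against central finite differences of the loss. The sketch below uses small synthetic data; the seed, the sample size of 50, the step `eps` and the helper name `loss` are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(50), rng.uniform(0, 1, (50, 2))])
y = rng.normal(size=(50, 1))
beta = rng.normal(size=(3, 1))


def loss(b):
    # Sum of squared residuals, (y - Xb)^T (y - Xb), as a Python float
    r = y - X @ b
    return (r.T @ r).item()


# Closed-form gradient from the derivation
analytic = 2 * X.T @ (X @ beta - y)

# Central finite differences, one coordinate at a time
eps = 1e-6
numeric = np.zeros_like(beta)
for i in range(beta.shape[0]):
    step = np.zeros_like(beta)
    step[i, 0] = eps
    numeric[i, 0] = (loss(beta + step) - loss(beta - step)) / (2 * eps)

print(np.max(np.abs(analytic - numeric)))
```

Because the loss is quadratic in $\hat{\beta}$, the central difference should agree with the analytic gradient up to floating-point error.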

- Then, we iteratively update our estimate of $\beta$ (in vector form) by

$$\begin{aligned}
    \beta_{t+1} &= \beta_{t} - \eta (2 X^T (X \beta_{t} - y) )
\end{aligned}$$

- Assuming we want to find the gradient descent fo

In [114]:
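class OLSWithGradientDescent:
    """OLS fitted by batch gradient descent on the squared-error loss."""

    def __init__(self):
        self.iter = 2000  # number of gradient steps
        self.eta = 5e-4   # learning rate

    def fit(self, X, y):
        assert X.shape[0] == y.shape[0]

        y = y.reshape(-1, 1)
        assert y.shape[1] == 1

        # Start every coefficient at 1
        self.coefs_ = np.ones((X.shape[1], 1))

        for _ in range(self.iter):
            # Gradient of the loss up to the constant factor 2, which is absorbed into eta
            gradient = X.T @ ((X @ self.coefs_) - y)
            # Rounding to 3 decimals each step discards any update smaller than 0.0005,
            # so a coefficient can stall before it reaches the least-squares value
            self.coefs_ = np.round(self.coefs_ - (self.eta * gradient), 3)

    def predict(self, X):
        return X @ self.coefs_


ols_gd = OLSWithGradientDescent()
ols_gd.fit(X, y)
ols_gd.coefs_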

array([[ 0.061],
       [ 4.94 ],
       [19.943]])
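The class above rounds the coefficients to 3 decimals after every step, which may explain why the intercept ends up further from the `LinearRegression` estimate than the two slopes do. As a hedged sketch, the variant below drops the rounding, keeps the factor 2 from the derivation, and compares the result with `np.linalg.lstsq`. The seed, the function name `fit_gd` and `n_iter=5000` are assumptions; the data are freshly simulated, so the numbers will not match the outputs above.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 300
X_sim = np.column_stack([np.ones(n), rng.uniform(0, 1, (n, 2))])
true_beta = np.array([[0.5], [5.0], [20.0]])
y_sim = X_sim @ true_beta + rng.normal(0, 1, (n, 1))


def fit_gd(X, y, eta=5e-4, n_iter=5000):
    """Batch gradient descent on the sum of squared residuals, without rounding."""
    beta = np.ones((X.shape[1], 1))
    for _ in range(n_iter):
        gradient = 2 * X.T @ (X @ beta - y)  # gradient from the derivation
        beta = beta - eta * gradient
    return beta


beta_gd = fit_gd(X_sim, y_sim)
# Direct least-squares solution for comparison
beta_ls, *_ = np.linalg.lstsq(X_sim, y_sim, rcond=None)
print(np.hstack([beta_gd, beta_ls]))
```

With this learning rate and iteration count the two columns should agree to several decimal places, including the intercept. A much larger `eta` would make the iteration diverge, so the step size still needs to be chosen with the scale of $X^T X$ in mind.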
