# MACHINE LEARNING: STOCHASTIC GRADIENT DESCENT

In this notebook we build stochastic gradient descent (SGD) from scratch. We are going to:

- build toy data
- walk through the theory
- build SGD algorithm

## Toy Data

In [1]:
import numpy as np

In [2]:
X = 2 * np.random.rand(100, 1)  # 100 samples, uniform on [0, 2)

In [4]:
theta1 = 2  # true intercept
theta2 = 3  # true slope
noise = np.random.randn(100, 1)  # standard Gaussian noise
y = theta1 + theta2 * X + noise

In [5]:
X_b = np.c_[np.ones((100, 1)), X]  # prepend a bias column of ones

In [6]:
X_new = np.array([[0], [2]])
X_new_b = np.c_[np.ones((2, 1)), X_new]

## Math

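Using the data above, we define the loss function as the mean squared error (MSE):
$$\text{MSE} = \frac{1}{m} \sum_{i=1}^m (\hat{y}_i - y_i)^2$$
where $\hat{y}_i = \theta \cdot x_i$ is the prediction for sample $i$ and $x_i$ is the $i$-th row of `X_b` (a leading 1 for the bias, then the feature).

To perform gradient descent, we compute the gradient of the MSE with respect to $\theta$:
$$\nabla \text{MSE} = \frac{\partial\, \text{MSE}}{\partial \theta} = \frac{2}{m} \sum_{i=1}^m (\theta \cdot x_i - y_i)\, x_i$$

This gives the algorithm. At each iteration, update the weight parameter $\theta$ by
$$\theta = \theta_{\text{old}} - \eta \cdot \nabla \text{MSE}$$
where $\eta > 0$ is the learning rate and $m$ is the sample size. Stochastic gradient descent estimates the gradient from a single randomly chosen sample at each step, $2 (\theta \cdot x_i - y_i)\, x_i$, instead of the full sum.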
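
As a sanity check on the gradient formula, the sketch below compares the analytic gradient with a central finite-difference estimate of the MSE. It rebuilds the toy data with a seeded generator so the check is repeatable; the seed and the helper names `mse`, `mse_gradient`, and `numerical_gradient` are assumptions made for this illustration and are not part of the notebook.

```python
import numpy as np

rng = np.random.default_rng(0)   # fixed seed (assumption, only for repeatability)
m = 100
X = 2 * rng.random((m, 1))
y = 2 + 3 * X + rng.standard_normal((m, 1))
X_b = np.c_[np.ones((m, 1)), X]  # bias column plus feature

def mse(theta):
    """Mean squared error of the linear model on the full data set."""
    residuals = X_b.dot(theta) - y
    return float(np.mean(residuals ** 2))

def mse_gradient(theta):
    """Analytic gradient: (2/m) * X_b^T (X_b theta - y)."""
    return (2 / m) * X_b.T.dot(X_b.dot(theta) - y)

def numerical_gradient(theta, eps=1e-6):
    """Central finite differences, one coordinate of theta at a time."""
    grad = np.zeros_like(theta)
    for j in range(theta.shape[0]):
        step = np.zeros_like(theta)
        step[j] = eps
        grad[j] = (mse(theta + step) - mse(theta - step)) / (2 * eps)
    return grad

theta_test = rng.standard_normal((2, 1))
analytic = mse_gradient(theta_test)
numeric = numerical_gradient(theta_test)
print(np.allclose(analytic, numeric, atol=1e-5))
```

Because the MSE is quadratic in $\theta$, the central difference matches the analytic gradient up to floating-point rounding, so the two arrays should agree closely.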

## SGD Algorithm

In [151]:
n_epochs = 20    # number of passes over the data
t0, t1 = 10, 20  # learning schedule hyperparameters

In [152]:
def learning_rate(t):
    # decaying schedule: equals t0/t1 at step 0 and shrinks as t grows
    return t0 / (t + t1)
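
To see how quickly this schedule decays, the sketch below evaluates it at a few step counts. It assumes `m = 100` (the number of rows in the toy data) and the notebook's `n_epochs = 20`; the chosen steps are only illustrative.

```python
t0, t1 = 10, 20

def learning_rate(t):
    return t0 / (t + t1)

m, n_epochs = 100, 20                    # assumed from the toy data and settings above
steps = [0, 1, m, n_epochs * m - 1]      # first step, second step, start of epoch 2, last step
rates = [learning_rate(t) for t in steps]
for t, eta in zip(steps, rates):
    print(f"step {t:>4}: eta = {eta:.4f}")
```

The rate starts at 0.5 and falls to roughly 0.005 by the last step, so early updates are large and later ones only fine-tune $\theta$.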

In [153]:
theta = np.random.randn(2,1)

In [154]:
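m = len(X_b)  # number of training samples

for epoch in range(n_epochs):
    for i in range(m):
        # pick one training sample uniformly at random (with replacement)
        random_index = np.random.randint(m)
        xi = X_b[random_index:random_index+1]
        yi = y[random_index:random_index+1]
        # gradient of the squared error at this single sample
        gradients = 2 * xi.T.dot(xi.dot(theta) - yi)
        # decay the learning rate with the global step count
        eta = learning_rate(epoch*m+i)
        theta = theta - eta*gradients
        # print("................ theta1 gradient:", np.round(gradients[0], 2), "... theta2 gradient:", np.round(gradients[1], 2))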
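
One way to monitor the loop above is to record the full-data MSE after every epoch. The following is a minimal sketch under stated assumptions, not the notebook's own code: the function name `sgd_with_history`, the seeded generators, and the zero starting point are all introduced here for illustration (the notebook starts from a random `theta`).

```python
import numpy as np

def sgd_with_history(X_b, y, n_epochs=20, t0=10, t1=20, seed=0):
    """Run single-sample SGD and record the full-data MSE after each epoch."""
    rng = np.random.default_rng(seed)    # seeded generator (assumption, for repeatable runs)
    m = len(X_b)
    theta = np.zeros((X_b.shape[1], 1))  # zero start (assumption; the notebook uses a random start)
    history = []
    for epoch in range(n_epochs):
        for i in range(m):
            idx = rng.integers(m)        # one random sample, drawn with replacement
            xi = X_b[idx:idx + 1]
            yi = y[idx:idx + 1]
            gradients = 2 * xi.T.dot(xi.dot(theta) - yi)
            eta = t0 / (epoch * m + i + t1)
            theta = theta - eta * gradients
        history.append(float(np.mean((X_b.dot(theta) - y) ** 2)))
    return theta, history

# Toy data generated the same way as in the notebook, but with a fixed seed.
rng = np.random.default_rng(42)
X = 2 * rng.random((100, 1))
y = 2 + 3 * X + rng.standard_normal((100, 1))
X_b = np.c_[np.ones((100, 1)), X]

initial_mse = float(np.mean(y ** 2))     # MSE of the all-zero starting point
theta_hat, mse_history = sgd_with_history(X_b, y)
print("initial MSE:", round(initial_mse, 3))
print("final MSE:  ", round(mse_history[-1], 3))
```

Since the noise has unit variance, the final MSE cannot be expected to drop much below 1; what matters is that it ends far below the starting value.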

In [155]:
theta

array([[2.06101575],
       [3.07690324]])
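
The fitted parameters can be used to predict at the two points held in `X_new_b`. The sketch below copies the `theta` values from the output above rather than rerunning SGD, and `y_predict` is a name introduced here for illustration.

```python
import numpy as np

# theta copied from the notebook output above
theta = np.array([[2.06101575], [3.07690324]])

X_new = np.array([[0], [2]])
X_new_b = np.c_[np.ones((2, 1)), X_new]  # add the bias column

y_predict = X_new_b.dot(theta)           # predictions at x = 0 and x = 2
print(y_predict)
```

The predictions are about 2.06 at $x = 0$ and 8.21 at $x = 2$, close to the values 2 and 8 given by the noise-free line $y = 2 + 3x$ used to generate the data.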

Investigation ends here.