# Stochastic Gradient Descent

Before discussing SGD, it helps to have a rough idea of how ordinary (batch) gradient descent works. Here is some sample code for gradient descent on a linear regression problem.

In [1]:
import numpy as np

# generate data
np.random.seed(42)
X = 2 * np.random.rand(100, 10)
y = 4 + 3 * X.dot(np.random.randn(10, 1)) + np.random.randn(100, 1)

# init
theta = np.random.randn(11, 1)

# adding intercept
X_b = np.c_[np.ones((100, 1)), X]

learning_rate = 0.1
num_iterations = 1000

for iteration in range(num_iterations):

    y_pred = X_b.dot(theta)
    
    # calculate loss
    loss = np.mean((y_pred - y)**2)
    
    # calculate gradient
    gradient = 2 * X_b.T.dot(y_pred - y) / len(y)
    
    # update parameter
    theta = theta - learning_rate * gradient

# print the final optimized model parameters
print("Optimized parameters:", theta)

Optimized parameters: [[-4.12341880e+75]
 [-3.88876855e+75]
 [-4.76192661e+75]
 [-3.87814045e+75]
 [-4.31970070e+75]
 [-3.81708271e+75]
 [-4.50407650e+75]
 [-3.94387939e+75]
 [-4.28411920e+75]
 [-4.33983404e+75]
 [-3.97085642e+75]]

Note that these values are on the order of 1e75, which means the updates have diverged rather than converged: a learning rate of 0.1 is too large for this data. A smaller value such as 0.01 keeps the updates stable.


Notice that every iteration computes the gradient from all 100 samples. As the number of samples grows, each iteration becomes increasingly expensive. SGD was created to address this kind of problem.

## Basic Idea

The basic idea of SGD is to randomly choose one sample in each iteration and use the gradient of the loss on that sample in place of the gradient over all samples. For n samples, each iteration is therefore roughly n times cheaper to compute, at the cost of a noisier update direction.
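
To make the idea concrete, here is a minimal sketch of single-sample SGD on the same kind of synthetic data as the gradient descent example. The function name `sgd_linear_regression`, the learning rate of 0.01, and the number of epochs are assumptions chosen for illustration, not part of the original example.

```python
import numpy as np


def sgd_linear_regression(X_b, y, learning_rate=0.01, num_epochs=50, seed=0):
    """Single-sample SGD for least-squares linear regression (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    n_samples, n_params = X_b.shape
    theta = rng.standard_normal((n_params, 1))

    for epoch in range(num_epochs):
        # Visit the samples in a fresh random order each epoch
        for i in rng.permutation(n_samples):
            x_i = X_b[i:i + 1]          # shape (1, n_params)
            y_i = y[i:i + 1]            # shape (1, 1)
            # Gradient of the squared error on this one sample only
            gradient = 2 * x_i.T.dot(x_i.dot(theta) - y_i)
            theta = theta - learning_rate * gradient
    return theta


# Same kind of synthetic data as the gradient descent example
np.random.seed(42)
X = 2 * np.random.rand(100, 10)
true_w = np.random.randn(10, 1)
y = 4 + 3 * X.dot(true_w) + np.random.randn(100, 1)
X_b = np.c_[np.ones((100, 1)), X]

theta = sgd_linear_regression(X_b, y)
mse = np.mean((X_b.dot(theta) - y) ** 2)
print("Final MSE:", mse)
```

Each update here touches a single row of `X_b`, so its cost does not depend on the number of samples. Shuffling once per epoch is one common way to pick samples at random; drawing an index independently at every step would also fit the description.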

## Math Proof

Rather than a rigorous proof, this section only offers a qualitative analysis, which is as far as my math ability goes.
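
One statement that can be made precise without a full convergence proof is why a single-sample gradient is a reasonable stand-in for the full gradient. The notation below is introduced here for illustration: $x_i$ is the $i$-th row of `X_b` written as a column vector, $n$ is the number of samples, and the index $i$ is assumed to be drawn uniformly at random. With the mean squared error loss used in the example code:

$$
L(\theta) = \frac{1}{n}\sum_{i=1}^{n} \left(x_i^\top \theta - y_i\right)^2,
\qquad
\nabla L(\theta) = \frac{2}{n}\sum_{i=1}^{n} x_i \left(x_i^\top \theta - y_i\right)
$$

$$
g_i(\theta) = 2\, x_i \left(x_i^\top \theta - y_i\right),
\qquad
\mathbb{E}_i\left[g_i(\theta)\right] = \frac{1}{n}\sum_{i=1}^{n} g_i(\theta) = \nabla L(\theta)
$$

So each SGD step follows the true gradient on average. The deviation of an individual step from that average is the source of the fluctuation described in the rest of this section.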

Here is an illustration of the gradient descent algorithm with different step sizes (learning rates). When the step size is small, the optimization proceeds smoothly. When the step size becomes large, the curve oscillates wildly.

![gds](../../../src/gd_s.png)

![gdl](../../../src/gd_l.png)
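
The same contrast can be reproduced on a toy problem. The sketch below is not the code behind the figures; it assumes the one-dimensional function f(x) = x**2 and two hypothetical step sizes, chosen so that the effect is easy to verify by hand.

```python
import numpy as np


def gd_on_quadratic(step, num_steps=20, x0=5.0):
    """Gradient descent on f(x) = x**2; returns the visited points."""
    xs = [x0]
    x = x0
    for _ in range(num_steps):
        grad = 2 * x              # derivative of x**2
        x = x - step * grad
        xs.append(x)
    return np.array(xs)


small = gd_on_quadratic(step=0.1)   # each update multiplies x by 0.8
large = gd_on_quadratic(step=0.9)   # each update multiplies x by -0.8

# Small step: x shrinks steadily with no sign change
print(small[:5])
# Large step: x still shrinks, but flips sign every iteration (zigzag)
print(large[:5])
```

On this function, a step size above 1.0 makes the iterates grow instead of shrink, which is the one-dimensional version of a diverging run.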

SGD is far more sensitive to the step size than gradient descent. As the step size grows, the fluctuation becomes much larger.

![sgds](../../../src/sgd_s.png)

![sgdl](../../../src/sgd_l.png)
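
A rough way to check this sensitivity numerically is to run single-sample SGD with two step sizes and compare how much the loss keeps moving once both runs have settled. The sketch below is not the code behind the figures; the dataset, the helper `sgd_loss_history`, and the two learning rates are all assumptions made for illustration.

```python
import numpy as np


def sgd_loss_history(X_b, y, learning_rate, num_steps=2000, seed=0):
    """Run single-sample SGD and record the full-data MSE after every update."""
    rng = np.random.default_rng(seed)
    theta = np.zeros((X_b.shape[1], 1))
    history = np.empty(num_steps)
    for t in range(num_steps):
        i = rng.integers(len(y))                   # pick one sample at random
        x_i, y_i = X_b[i:i + 1], y[i:i + 1]
        gradient = 2 * x_i.T.dot(x_i.dot(theta) - y_i)
        theta = theta - learning_rate * gradient
        history[t] = np.mean((X_b.dot(theta) - y) ** 2)
    return history


# Small hypothetical dataset: one feature plus intercept, noisy linear target
rng = np.random.default_rng(1)
X = 2 * rng.random((200, 1))
y = 4 + 3 * X + rng.standard_normal((200, 1))
X_b = np.c_[np.ones((200, 1)), X]

small_step = sgd_loss_history(X_b, y, learning_rate=0.01)
large_step = sgd_loss_history(X_b, y, learning_rate=0.1)

# Spread of the loss over the last 500 updates, after both runs have settled
print("std with small step:", small_step[-500:].std())
print("std with large step:", large_step[-500:].std())
```

The full-data loss is recorded only to make the fluctuation visible; a real SGD run would not normally pay for that computation after every update.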

One important pattern of SGD is that the curve is smooth at the beginning, much like gradient descent, and the fluctuation only appears once the curve gets close to the optimum. From a machine learning perspective, this fluctuation can help prevent overfitting and make the model more robust to noise.