## Stochastic Gradient Descent

In this supplementary section, we introduce the powerful idea of stochastic gradient descent. At each iteration (learning step), we compute the gradient direction from a randomly chosen subset of the data, which we call the training data, rather than from the entire data set. As a result, the loss function is not guaranteed to decrease at every iteration. This may seem like a disadvantage at first, since more iterations may be needed to reach the same level of accuracy. However, there is interesting theory to back up this strategy: each iteration requires less computation, and the randomness in the gradient can also help reduce the risk of overfitting the data.
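To make the idea concrete, here is a sketch of the update for the line-fitting example in this notebook, assuming the mean squared error loss and writing $\theta_1$ for the slope and $\theta_2$ for the intercept. The symbols $B$ (the subset of data indices drawn at the current iteration) and $\eta$ (the learning rate) are notation introduced here for illustration only.

$$
L_B(\theta_1, \theta_2) = \frac{1}{|B|} \sum_{i \in B} \left(\theta_1 x_i + \theta_2 - y_i\right)^2
$$

$$
\frac{\partial L_B}{\partial \theta_1} = \frac{2}{|B|} \sum_{i \in B} \left(\theta_1 x_i + \theta_2 - y_i\right) x_i,
\qquad
\frac{\partial L_B}{\partial \theta_2} = \frac{2}{|B|} \sum_{i \in B} \left(\theta_1 x_i + \theta_2 - y_i\right)
$$

$$
\theta_1 \leftarrow \theta_1 - \eta \, \frac{\partial L_B}{\partial \theta_1},
\qquad
\theta_2 \leftarrow \theta_2 - \eta \, \frac{\partial L_B}{\partial \theta_2}
$$

Ordinary gradient descent is the special case in which $B$ contains every data point; in the stochastic version, a new $B$ is drawn at each iteration.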

### Software Prerequisites

The following Python libraries are prerequisites to run this notebook; simply run the following code block to install them. They are also listed in the `requirements.txt` file in the root of this notebook's [GitHub repository](https://github.com/uccs-math-clinic/mc-workshops).

When you run this code block for the first time, you will need to restart the kernel by navigating to `Kernel -> Restart` in the menu bar.

In [None]:
%pip install matplotlib==3.5.1 \
             numpy==1.21.5 \
             scikit-learn

With our package dependencies installed, we can run the following cell in order to import the packages needed for this notebook:

In [None]:
# We import a few different libraries that make our work a bit easier.
# We also give them each aliases (the part after "as") which make them
# a little easier to remember; the ones shown below are those used 
# most commonly, but you can call these whatever you want! This practice
# is quite prevalent in Python as a whole, and doubly so in data science.
#
import numpy as np
import matplotlib.pyplot as plt
import time

# In addition to our usual imports, we add a tool to split our dataset into
# training data (used to compute the gradient direction) and testing data
# (which could be used for validation, although we do not do so here).

from sklearn.model_selection import train_test_split

# This allows us to run animations in this notebook; this isn't necessary
# for the vast majority of notebooks, but it does serve as a useful teaching
# tool.
#
%matplotlib notebook
plt.ion()

### SGD Implementation

The following code block implements stochastic gradient descent by calculating the gradient from a random subset of the data at each iteration.
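Before the animated version, it may help to see the core loop on its own. The following is a minimal sketch, not the notebook's implementation: the function names `minibatch_gradient` and `sgd_line_fit`, the `batch_fraction` parameter, the fixed seeds, and the true slope and intercept of the demo data are all assumptions made for illustration. It draws the subset with NumPy directly rather than with `train_test_split`, and it runs for a fixed number of steps instead of checking for convergence.

```python
import numpy as np


def minibatch_gradient(slope, intercept, x_batch, y_batch):
    # Gradient of the mean squared error, computed on the batch only.
    residual = (slope * x_batch + intercept) - y_batch
    d_slope = 2.0 * np.mean(residual * x_batch)
    d_intercept = 2.0 * np.mean(residual)
    return d_slope, d_intercept


def sgd_line_fit(x, y, learning_rate=0.05, batch_fraction=0.85,
                 n_steps=500, seed=0):
    # Hypothetical helper: fits y ~ slope * x + intercept with SGD.
    rng = np.random.default_rng(seed)
    n = len(x)
    batch_size = max(1, int(batch_fraction * n))
    slope, intercept = 0.0, 0.0
    for _ in range(n_steps):
        # Draw a fresh random subset of the data at every step.
        idx = rng.choice(n, size=batch_size, replace=False)
        d_slope, d_intercept = minibatch_gradient(
            slope, intercept, x[idx], y[idx])
        slope -= learning_rate * d_slope
        intercept -= learning_rate * d_intercept
    return slope, intercept


# Synthetic demo data: a noisy line with an assumed slope of 3 and
# an assumed intercept of 4.
rng = np.random.default_rng(42)
x_demo = np.linspace(-5, 5, 100)
y_demo = 3.0 * x_demo + 4.0 + 2.0 * rng.standard_normal(100)

slope_hat, intercept_hat = sgd_line_fit(x_demo, y_demo)
print(slope_hat, intercept_hat)
```

The default `batch_fraction=0.85` mirrors the 85/15 split used in the cell below. Smaller fractions make each step cheaper but noisier, so the fitted values wander more around the least-squares solution.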

In [None]:
# The next few lines are reproduced from the gradient descent cell above;
# they set up the figure used for the animation.
%matplotlib notebook
plt.ion()
fig, ax = plt.subplots()

# Since our gradient calculation is a bit non-trivial, we're
# going to put that calculation into its own function. This
# code is nothing more than the gradient function defined
# above.
#
def calculate_gradient(slope, intercept, x_vals, y_vals):
    # Calculate the gradient of the mean squared error. Despite its
    # name, abs_error holds the signed prediction error (residual).
    #
    abs_error = (slope * x_vals + intercept) - y_vals
    d_slope = np.sum(2 * abs_error * x_vals) / len(x_vals)
    d_intercept = np.sum(2 * abs_error) / len(x_vals)
    
    # Calculate the mean squared error value
    #
    mse = np.sum(np.power(abs_error, 2)) / len(x_vals)
    
    return (d_slope, d_intercept, mse)

left_xlim = -5
right_xlim = 5

theta_1 = 0
theta_2 = 0

convergence_error_threshold = 0.1

learning_rate = 0.05

[m, b] = np.random.randint(1, 10, size=2)
x = np.linspace(left_xlim, right_xlim, 100)[:, np.newaxis]
y = m * x + b + (2 * np.random.randn(100, 1))
ax.scatter(x, y, color='black')

dtheta_1, dtheta_2, err = calculate_gradient(theta_1, theta_2, x, y)

y_train_predicted = theta_1 * x + theta_2
z, = ax.plot(x, y_train_predicted, color='orange')

ax.set_ylim(bottom=-5)

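# Keep track of the loss: loss_old holds the most recent value, and
# loss_val accumulates the loss history for plotting at the end.
#
loss_old = err
loss_val = loss_old

# Now comes the iterative step: at each iteration we draw a new random
# split of the data.
#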

for i in range(1000):
    
    x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.15)

    ax.scatter(x_train, y_train, color='black')
    ax.scatter(x_test, y_test, color='red')
    
    # Calculate the new gradient and corresponding line based on the training data only
    
    dtheta_1, dtheta_2, err = calculate_gradient(theta_1, theta_2, x_train, y_train)
          
    theta_1 = theta_1 - (learning_rate * dtheta_1)
    theta_2 = theta_2 - (learning_rate * dtheta_2)
    
    y_train_predicted = theta_1 * x_train + theta_2
    
    # Plot new line values - as before, these lines are mostly for
    # animation purposes.
    #
    z.set_xdata(x_train)
    z.set_ydata(y_train_predicted)
    
    fig.canvas.draw()
    fig.canvas.flush_events()
    
    # Comment out this line if you want to see how fast this can converge!
    time.sleep(0.01)
    
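    # For the termination condition, we compare the loss at this iteration with
    # the loss at the previous one, and stop once the change falls below 0.5% of
    # the current loss. Note that the loss is measured on the current training
    # subset, so it fluctuates from one split to the next.
    #
    loss_new = err

    acceptable_min_delta_loss = 0.005 * loss_new

    if abs(loss_new - loss_old) < acceptable_min_delta_loss:
        print('We converged to our specified tolerance in {} iterations!'.format(i + 1))
        break

    loss_old = loss_new

    # Record the loss so that we can plot its history after the loop.
    #
    loss_val = np.append(loss_val, loss_old)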
    
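# Plot the loss recorded at each iteration.
#
plt.figure()
plt.plot(loss_val)
plt.xlabel('Iteration')
plt.ylabel('Mean squared error')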
