## Stochastic Gradient Descent

In this supplementary section, we introduce the idea of stochastic gradient descent (SGD). At each iteration (learning step), we compute the gradient direction from a randomly chosen subset of the data, which we call the training data, rather than from the entire data set. As a result, the loss function is not guaranteed to decrease at every iteration. This may seem like a disadvantage at first, since more iterations may be needed to reach the same level of accuracy, but there is interesting theory to back up the strategy: each iteration requires less computation, and the randomness in the updates can help reduce the risk of overfitting the data.
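
To make the difference concrete, the following minimal sketch compares the gradient computed from all of the data with the gradient computed from a random subset, both evaluated at the same parameter guess. The data, the subset size of 20, the random seeds, and the helper name `mse_gradient` are assumptions chosen only for illustration; they are not taken from the rest of this notebook.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data scattered around the line y = 2x + 1 (illustrative values).
x = np.linspace(-3, 3, 100)
y = 2 * x + 1 + rng.normal(scale=2.0, size=x.shape)

def mse_gradient(slope, intercept, x_vals, y_vals):
    """Gradient of the mean squared error with respect to (slope, intercept)."""
    error = (slope * x_vals + intercept) - y_vals
    return np.array([np.mean(2 * error * x_vals), np.mean(2 * error)])

# Gradient at the initial guess (0, 0) using every data point.
full_gradient = mse_gradient(0.0, 0.0, x, y)

# Gradient at the same guess using a random subset of 20 points.
subset = rng.choice(len(x), size=20, replace=False)
subset_gradient = mse_gradient(0.0, 0.0, x[subset], y[subset])

print('Full-data gradient:', full_gradient)
print('Subset gradient:   ', subset_gradient)
```

The two gradients generally point in similar but not identical directions. That mismatch is why an individual SGD step can increase the loss, and the smaller subset is why each step is cheaper to compute.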

### Software Prerequisites

The following Python libraries are prerequisites to run this notebook; simply run the following code block to install them. They are also listed in the `requirements.txt` file in the root of this notebook's [GitHub repository](https://github.com/uccs-math-clinic/mc-workshops).

In [None]:
%pip install matplotlib==3.5.1 \
             numpy==1.21.5 \
             scikit-learn==1.0.2

The Python kernel must be restarted after running the above code block for the first time within a particular virtual environment. This may be accomplished by navigating to `Kernel -> Restart` in the menu bar.

With our package dependencies installed, we can run the following cell in order to import the packages needed for this notebook:

In [None]:
# We import a few different libraries that make our work a bit easier.
# We also give them each aliases (the part after "as") which make them
# a little easier to remember; the ones shown below are those used 
# most commonly, but you can call these whatever you want! This practice
# is quite prevalent in Python as a whole, and doubly so in data science.
#
import numpy as np
import matplotlib.pyplot as plt
import time

# In addition to our usual imports, we add a tool to split our dataset into
# training data (used to compute the gradient direction) and testing data 
# (which can be used for validation, although it will not happen here).

from sklearn.model_selection import train_test_split

# This allows us to run animations in this notebook; this isn't necessary
# for the vast majority of notebooks, but it does serve as a useful teaching
# tool.
#
%matplotlib notebook
plt.ion()

### SGD Implementation

The following code block implements stochastic gradient descent by calculating the gradient from a randomly selected subset of the data at each iteration. In the animation, the points used to calculate the gradient are drawn in black, and the points held out at that iteration are drawn in red.

In [None]:
# This is our standard MSE (mean squared error) gradient calculation.
#
def calculate_gradient(slope, intercept, x_vals, y_vals):
    # Calculate the gradient of the mean squared error with respect to
    # the slope and the intercept. Note that the error is signed
    # (prediction minus observation), not an absolute value.
    #
    error = (slope * x_vals + intercept) - y_vals
    d_slope = np.sum(2 * error * x_vals) / len(x_vals)
    d_intercept = np.sum(2 * error) / len(x_vals)

    # Calculate the mean squared error value
    #
    mse = np.sum(np.power(error, 2)) / len(x_vals)

    return (d_slope, d_intercept, mse)

def sgd(m, b, left_xlim=-3, right_xlim=3, train_iterations=50, test_size=0.15):

    fig, [ax, err_ax] = plt.subplots(2, 1)
    ax.set_title('Predicted Line')
    err_ax.set_title('Error (MSE)')
    
    fig.tight_layout()


    # Initial parameter (slope and y-intercept) guesses
    #
    theta1 = 0
    theta2 = 0

    convergence_error_threshold = 0.1

    learning_rate = 0.05

    x = np.linspace(left_xlim, right_xlim, 100)[:, np.newaxis]
    y = m * x + b + (2 * np.random.randn(100, 1))
    ax.scatter(x, y, color='black')

    dtheta1, dtheta2, err = calculate_gradient(theta1, theta2, x, y)
    
    err_vals = [err]

    y_train_predicted = theta1 * x + theta2
    z, = ax.plot(x, y_train_predicted, color='orange')
    e, = err_ax.plot(np.arange(0, len(err_vals)), err_vals)
    
    err_ax.set_ylim(bottom=0)
    err_ax.set_xlim(left=0, right=train_iterations)
    
    previous_loss = err

    # Now is the iterative step, where at each iteration we split the data and train
    # only on the split data.
    #
    for i in range(train_iterations):
        # Split training data
        #
        x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=test_size)

        ax.scatter(x_train, y_train, color='black')
        ax.scatter(x_test, y_test, color='red')

        # Calculate the new gradient and corresponding line based on the training data only
        # (i.e., _not_ the entire data set)
        #
        dtheta1, dtheta2, err = calculate_gradient(theta1, theta2, x_train, y_train)

        theta1 = theta1 - (learning_rate * dtheta1)
        theta2 = theta2 - (learning_rate * dtheta2)

        y_train_predicted = theta1 * x_train + theta2
        err_vals.append(err)

        # Plot new line values - as before, these lines are mostly for
        # animation purposes.
        #
        z.set_xdata(x_train)
        z.set_ydata(y_train_predicted)
        e.set_xdata(np.arange(0, len(err_vals)))
        e.set_ydata(err_vals)

        fig.canvas.draw()
        fig.canvas.flush_events()

        # Comment out this line if you want to see how fast this can converge!
        time.sleep(0.05)

        # We may also choose to terminate if the loss changes by less than 0.5%
        # between consecutive iterations. Just set terminate_early to True to
        # see how this affects training.
        #
        terminate_early = False
        if terminate_early and abs(err - previous_loss) < 0.005 * err:
            print('We converged to our specified tolerance in {} iterations!'.format(i + 1))
            break
        previous_loss = err
        
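    # err_vals holds the initial error plus one entry per completed iteration,
    # so this count stays correct even if we terminated early.
    #
    iterations_run = len(err_vals) - 1

    print('Learned slope after {} iterations is {}; control value was {}.'.format(
        iterations_run,
        theta1,
        m,
    ))

    print('Learned y-intercept after {} iterations is {}; control value was {}.'.format(
        iterations_run,
        theta2,
        b,
    ))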
    
    return fig, ax

sgd_fig, sgd_ax = sgd(2, 1, train_iterations=30)
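
One way to sanity-check the learned parameters without the animation is to run the same update rule in a plain loop and compare the result with an ordinary least-squares fit. The sketch below is a minimal version under stated assumptions: the helper name `sgd_fit`, the 200 iterations, the 85% subset fraction, and the seeds are illustrative choices, and the subset is drawn with NumPy's random generator rather than with `train_test_split`.

```python
import numpy as np

def sgd_fit(x, y, learning_rate=0.05, iterations=200, subset_fraction=0.85, seed=0):
    """Fit y ~ slope * x + intercept with SGD on random subsets (illustrative helper)."""
    rng = np.random.default_rng(seed)
    slope, intercept = 0.0, 0.0
    subset_size = max(1, int(subset_fraction * len(x)))

    for _ in range(iterations):
        # Draw a fresh random subset for this iteration's gradient.
        idx = rng.choice(len(x), size=subset_size, replace=False)
        error = (slope * x[idx] + intercept) - y[idx]

        # Same MSE gradient step as above, applied to the subset only.
        slope -= learning_rate * np.mean(2 * error * x[idx])
        intercept -= learning_rate * np.mean(2 * error)

    return slope, intercept

# Synthetic data matching the call above: slope 2, y-intercept 1, noisy.
rng = np.random.default_rng(1)
x = np.linspace(-3, 3, 100)
y = 2 * x + 1 + 2 * rng.standard_normal(100)

sgd_slope, sgd_intercept = sgd_fit(x, y)

# np.polyfit returns coefficients from the highest degree down: [slope, intercept].
ls_slope, ls_intercept = np.polyfit(x, y, deg=1)

print('SGD fit:           slope={:.3f}, intercept={:.3f}'.format(sgd_slope, sgd_intercept))
print('Least-squares fit: slope={:.3f}, intercept={:.3f}'.format(ls_slope, ls_intercept))
```

Because each step uses a different subset, the SGD estimate keeps jittering around the least-squares solution rather than settling on it exactly. Both estimates differ from the control values 2 and 1 because of the noise added to the data.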