In previous notebooks, I've implemented several machine learning algorithms from scratch and applied them to various datasets in an attempt to uncover their strengths and weaknesses and to determine whether or not a specific algorithm is suited for a certain task and why. In the last notebook, for example, I wrote a [K-Means Clustering library and applied it to a housing prices dataset with mixed results](https://www.kaggle.com/bullardla/k-means-from-scratch-instructional-notebook), but those mixed results revealed a shortcoming of K-Means Clustering - it's not always clear which datasets will have underlying clusters, and applying a clustering algorithm to a dataset without clusters is typically fruitless.

A simple and popular machine learning algorithm that I have yet to cover will be the focus of this notebook: Linear Regression. It works on the assumption that one or more variables are linearly correlated with the variable to predict, and predicts a specific value based on the values of the variables in the input vector. 

As a basic example, humidity, temperature, and cloud coverage may be used to predict the probability that it is currently raining. One can expect that as humidity and cloud coverage increase, the probability that it is raining will likewise increase. Thus, with humidity at 90% and cloud coverage at 80%, a certain Linear Regression model may predict a 75% chance that it is raining, while it may only predict a 20% chance of rain if the humidity is 30% and cloud coverage is 5%.

For this notebook, a model will be trained to use the qualities of a diamond to predict its expected retail price. 

Intuitively, there should be some correlation between a diamond's qualities and its price. The bigger the diamond (or the higher its `carat` value), the more expensive it should be. A diamond with a good `cut` should also be more expensive than a diamond with a worse `cut`. 

First, before starting any work on the model, we'll take a look at the various features of the dataset.

In [None]:
import copy
import time
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt


# Make this notebook deterministic.
rs = np.random.RandomState(np.random.MT19937(np.random.SeedSequence(11111)))

# Increase the font size of the graphs.
plt.rcParams.update({'font.size': 16})


df = pd.read_csv("/kaggle/input/diamonds/diamonds.csv")
df = df.drop(["Unnamed: 0"], axis=1)
print("Number of dataset entries: {}".format(len(df)))
# Ordinal Features
for column in ["cut", "color", "clarity"]:
    print("Unique values in '{}': {}".format(column, df[column].unique()))
# Continuous Variables
for column in ["carat", "depth", "table", "x", "y", "z"]:
    print("'{}': [{} - {}]".format(column, df[column].min(), df[column].max()))

Ignoring `index`, which acts as a primary key for the entries, and `price`, which is the variable the Linear Regression model will be trained to predict, there are nine features in the dataset. Six of the variables - `["carat", "depth", "table", "x", "y", "z"]` - are continuous, while the remaining three - `["cut", "color", "clarity"]` - are ordinal. 

All of the ordinal variables will be made numerical as part of pre-processing, as the model will only be able to train on numerical data.

In [None]:
cut = ["Ideal", "Premium", "Very Good", "Good", "Fair"]
color = ["D", "E", "F", "G", "H", "I", "J"]
clarity = ["IF", "VVS1", "VVS2", "VS1", "VS2", "SI1", "SI2", "I1"]

df["cut"] = df["cut"].map(cut.index)
df["color"] = df["color"].map(color.index)
df["clarity"] = df["clarity"].map(clarity.index)

Now, for the actual model.

Recall the simple formula $y = mx + b$ for plotting a line on a graph. If applied to this particular dataset, the actual formula would be $price = (carat \times coeff_{carat}) + (cut \times coeff_{cut}) + (color \times coeff_{color}) + \dots + bias$, which predicts the `price` of a diamond given its features by multiplying them by some learned coefficients and adding some learned bias `bias`. In theory, the coefficients and bias would equal the values that lead to the greatest accuracy of the model's predictions, but how these values are learned varies depending on the specific type of linear regression algorithm being used. 

Given that the Linear Regression formula is $y = m_1x_1 + m_2x_2 + m_3x_3 + \dots + b$, consider for a moment what happens if an additional feature $x_n$ is attached to $b$ at the end. The formula becomes $y = m_1x_1 + m_2x_2 + m_3x_3 + \dots + bx_n$, which can be rewritten as $y = XM$, where $y$ represents `price`, $M$ is a vector of coefficients (with $b$ as its last entry), and $X$ is the input vector of features. Stacking every row of the dataset into $X$ and solving for the $M$ that minimizes the squared error yields $M = (X^TX)^{-1}X^Ty$, which provides a formula the model can use to learn the desired coefficients.

Note, however, that there isn't technically a feature $x_n$, so in order for this formula to work, a dummy column of ones will need to be appended to the input vector. Given that $x_n = 1$ for all input vectors, $y = XM$ expands to $y = m_1x_1 + m_2x_2 + \dots + bx_n = m_1x_1 + m_2x_2 + \dots + (1)b = m_1x_1 + m_2x_2 + \dots + b$.
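The rearrangement to $M = (X^TX)^{-1}X^Ty$ skips a few steps. As a sketch of where it comes from, assume $X$ is the matrix whose rows are the input vectors (including the dummy column of ones), so that the predictions for the whole dataset are $XM$, and assume $X^TX$ is invertible. Minimizing the sum of squared errors $E(M)$ by setting its gradient to zero then gives:

$$
\begin{aligned}
E(M) &= \lVert y - XM \rVert^2 = (y - XM)^T (y - XM) \\
\nabla_M E(M) &= -2X^T(y - XM) = 0 \\
X^TXM &= X^Ty \\
M &= (X^TX)^{-1}X^Ty
\end{aligned}
$$

The symbol $E(M)$ is introduced here only for illustration; it does not appear elsewhere in this notebook.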

Now that we have that formula, one simple technique is to use Ordinary Least Squares to fit the model.

In [None]:
class OLS:
    def __init__(self):
        self.weights = []
        self.bias = 0
        
    def fit(self, X, y):
        X = np.c_[X, np.ones(len(X))]   
        coeffs = np.linalg.inv(X.T @ X) @ (X.T @ y)
        self.weights = coeffs[:-1]
        self.bias = coeffs[-1]
        
    def predict(self, x):
        return (x @ self.weights) + self.bias
        
    def print_debug(self):
        print("Bias: {}".format(self.bias))
        print("Model weights: {}".format(self.weights))

model = OLS()
model.fit(df.drop(["price"], axis=1), df["price"])
model.print_debug()

Training the model above is simple, fast (at least compared to how long other types of models might take), and requires only a couple of lines of code. However, it's not clear how accurate the model is. Metrics will need to be chosen to evaluate the model's performance.

Before that, however, the same model will be trained but only on a single feature; this is so we can graph the model to provide some basic visualization of both the data and the resulting model.

In [None]:
X = df["carat"]
Y = df["price"]

model = OLS()
model.fit(X, Y)

plt.figure(figsize=(6, 6))
plt.scatter(X, Y, color="blue")
plt.plot(X, model.weights[0] * X + model.bias, color="black")
plt.title("Line of Best Fit - OLS Model")
plt.xlabel("Carat")
plt.ylabel("Price")

# The plotted line extends far past the maximum price, so 
# this limit is to focus the graph on the data.
plt.ylim((0, Y.max() * 1.1))
plt.grid()
plt.show()

plt.figure(figsize=(6, 6))
plt.scatter(X, Y, color="blue")
plt.plot(X, np.poly1d(np.polyfit(X, Y, 1))(X), color="red")
plt.title("Line of Best Fit - Numpy")
plt.xlabel("Carat")
plt.ylabel("Price")
plt.ylim((0, Y.max() * 1.1))
plt.grid()
plt.show()

The scatter plot above shows some correlation between `carat` and `price`, but with high variance that suggests the model would have a high error regardless of the metric used to calculate the error. 

As an aside, note that the line of best fit calculated by the Ordinary Least Squares model is the same as the line of best fit calculated using `numpy.polyfit()`. Since `numpy.polyfit()` uses Ordinary Least Squares to calculate its line of best fit, this observation is expected, but it should highlight that the model above is simply computing a classical line of best fit for the data.

To evaluate how well this line of best fit models the data, additional metrics will be needed besides those covered in my previous notebooks. All of the past models that I worked on, apart from K-Means Clustering, were classification models that assigned a class to a given input vector from a finite list of possible classes. This, meanwhile, is a regression model, with an infinite number of possible output values instead of a finite set of possible classes.

To explain further, model accuracy in the context of classification measures what percentage of the time a model predicts the expected value. Such a strict definition of accuracy cannot be reasonably applied to a regression model. Imagine that a model which predicted the price of a diamond was consistently one cent off of the expected value - should such a model have 0% accuracy? Likewise, should a model that predicts the actual price of the diamond 5% of the time, but which has predictions over thousands of dollars off the expected price the other 95% of the time, have a 5% accuracy?
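To make the one-cent example concrete, here is a small sketch using made-up prices (the numbers are illustrative, not taken from the dataset). It compares exact-match accuracy with the average size of the error for a hypothetical model that is always one cent too high.

```python
# Hypothetical prices in dollars; the "model" is always exactly one cent too high.
actual = [326.00, 554.00, 2757.00, 18823.00]
predicted = [price + 0.01 for price in actual]

# Classification-style accuracy: the fraction of predictions that match exactly.
exact_matches = sum(1 for a, p in zip(actual, predicted) if a == p)
accuracy = exact_matches / len(actual)

# Mean absolute error: the average distance between prediction and target.
mean_abs_error = sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

print("Exact-match accuracy: {:.0%}".format(accuracy))
print("Mean absolute error: ${:.2f}".format(mean_abs_error))
```

The accuracy comes out as zero even though every prediction is off by only a cent, which is why a distance-based measure is the more informative one here.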

Metrics for evaluating regression models must measure how close a model's predictions are to the actual values. In other terms, metrics for regression models should measure the deviation of a model's predictions from the expected values.

With that in mind, various metrics commonly used for evaluating regression models will be introduced below and used to evaluate the line of best fit above.

In [None]:
class ModelEvaluator:
    def __init__(self, model):
        self.model = model
    
    # Mean Squared Error
    def mse(self, X, y):
        y_hat = self.model.predict(X)
        return np.mean((y - y_hat)**2)
    
    # Mean Absolute Error
    def mae(self, X, y):
        y_hat = self.model.predict(X)
        return np.mean(np.abs(y - y_hat))
    
    # Root Mean Squared Error
    def rmse(self, X, y):
        return np.sqrt(self.mse(X, y))
    
    # R2: compares the model's MSE to that of a baseline which always predicts the mean of y.
    def r2(self, X, y):
        y_mean = np.mean(y)
        mse_baseline = np.mean((y - y_mean)**2)
        mse_model = self.mse(X, y)
        return 1 - (mse_model / mse_baseline)
    
    # Adjusted R2
    def r2_adjusted(self, X, y):
        num_rows, num_cols = X.shape if len(X.shape) != 1 else (len(X), 1)
        return 1 - (((num_rows - 1) / (num_rows - num_cols - 1)) * (1 - self.r2(X, y)))
    
    # Huber Loss
    def huber(self, X, y):
        y_hat = self.model.predict(X)
        alpha = 1
        # Quadratic for absolute errors below alpha, linear for errors of at least alpha.
        f = lambda error: 0.5 * error**2 if error < alpha else (alpha * error) - (0.5 * alpha**2)
        error = np.abs(y - y_hat)
        return np.mean(np.vectorize(f)(error))
    
    
    def print_metrics(self, X, y):
        print("Metrics:")
        print("MSE: {}".format(self.mse(X, y)))
        print("MAE: {}".format(self.mae(X, y)))
        print("RMSE: {}".format(self.rmse(X, y)))
        print("R-Squared: {}".format(self.r2(X, y)))
        print("R-Squared Adjusted: {}".format(self.r2_adjusted(X, y)))
        print("Huber Loss: {}".format(self.huber(X, y)))

model_eval = ModelEvaluator(model)
model_eval.print_metrics(np.array([X]).T, Y)

Unfortunately, unlike with metrics used to evaluate classification models, the values of the above metrics aren't readily comprehensible.

Mean Squared Error (MSE), Mean Absolute Error (MAE), and Root Mean Squared Error (RMSE) each calculate some variation of the average difference between the predicted and actual values. That is, the value of MSE for this model is the average squared difference between the predicted and actual prices of diamonds. MAE is similar, but is the average _absolute_ difference between the predicted and actual prices. RMSE is simply the square root of MSE, and Huber Loss is a variation of MAE with a motivation that will be discussed later. For all four of these metrics, a value of zero indicates a perfect model.

Given that these metrics are motivated by a similar intuition, the differences between them are subtle yet important. Imagine two models: one which always makes a prediction one dollar off the expected price, and another which is accurate 90% of the time, but which makes a prediction ten dollars off the expected price the other 10% of the time. The first model would have an MSE of `((1.0 ** 2) * 1.0) == 1.0`, while the second model would have an MSE of `((10.0 ** 2) * 0.1) + ((0.0 ** 2) * 0.9) == 10.0`, despite them both having an MAE of `1.0`. Whether these models should be considered equal in performance or not is up to the judgement of the model's architect.
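The two imaginary models can be checked directly. The sketch below assumes ten predictions per model and encodes each model only by its errors (prediction minus actual price); the names `errors_a` and `errors_b` are illustrative.

```python
import numpy as np

# Hypothetical errors (prediction minus actual price, in dollars) for ten diamonds.
errors_a = np.full(10, 1.0)               # always one dollar off
errors_b = np.array([10.0] + [0.0] * 9)   # exact 90% of the time, ten dollars off otherwise

for name, errors in [("Model A", errors_a), ("Model B", errors_b)]:
    mae_value = np.mean(np.abs(errors))
    mse_value = np.mean(errors ** 2)
    rmse_value = np.sqrt(mse_value)
    print("{}: MAE={:.2f} MSE={:.2f} RMSE={:.2f}".format(name, mae_value, mse_value, rmse_value))
```

Both models share the same MAE, while MSE and RMSE penalize the second model for concentrating its error in a single large miss.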

$R^2$ - and adjusted $R^2$ by extension - differ from the other metrics in that they compare a model to a baseline that always outputs the mean of the value to be predicted. An $R^2$ of 1.0 indicates the model always makes an accurate prediction, 0.0 indicates that the model is equal in performance to the baseline, and a negative $R^2$ indicates that the model is worse than the baseline. 

Adjusted $R^2$ is a variation of $R^2$ that avoids a phenomenon in which the $R^2$ value of a model fit with OLS never decreases as more dimensions are added to the input vector. At worst, $R^2$ will remain the same, and it will oftentimes increase due to chance alone. Adjusted $R^2$ penalizes a model for additional features that aren't actually improving the model's accuracy.
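The effect can be seen on a small synthetic example. The sketch below is an illustration under assumed data, not the diamonds dataset: `y` depends only on `x1`, and `x_noise` is a hypothetical feature that is unrelated to `y`. The helper `r2_scores` is introduced here for illustration and fits least squares with `numpy.linalg.lstsq`.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: y depends on x1 only; x_noise is unrelated to y.
n = 50
x1 = rng.normal(size=n)
x_noise = rng.normal(size=n)
y = 3.0 * x1 + rng.normal(scale=0.5, size=n)

def r2_scores(features, y):
    """Fit least squares with an intercept; return (R^2, adjusted R^2)."""
    X = np.column_stack([features, np.ones(len(y))])
    coeffs, *_ = np.linalg.lstsq(X, y, rcond=None)
    residual = y - X @ coeffs
    r2 = 1.0 - np.sum(residual ** 2) / np.sum((y - np.mean(y)) ** 2)
    n_rows, n_features = features.shape
    adjusted = 1.0 - (1.0 - r2) * (n_rows - 1) / (n_rows - n_features - 1)
    return r2, adjusted

r2_one, adj_one = r2_scores(x1.reshape(-1, 1), y)
r2_two, adj_two = r2_scores(np.column_stack([x1, x_noise]), y)

print("x1 only:      R2={:.5f} adjusted R2={:.5f}".format(r2_one, adj_one))
print("x1 + x_noise: R2={:.5f} adjusted R2={:.5f}".format(r2_two, adj_two))
```

Adding the unrelated feature cannot lower $R^2$ for a least-squares fit, while the adjusted value is scaled down according to the number of features used.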

Given the above, the $R^2$ values for the model seem to suggest a good performance, but the MAE shows that on average, the model is just over a thousand dollars off with its predictions, and the RMSE suggests that there are a considerable number of outliers that the model fails to account for.

Now, let's see how the model improves once the remaining features are added.

In [None]:
X = df.drop(["price"], axis=1).values
Y = df["price"]

model = OLS()
model.fit(X, Y)
ModelEvaluator(model).print_metrics(X, Y)

print("Diamond price standard deviation: {}".format(np.std(Y)))
print("Diamond price mean: {}".format(np.mean(Y)))
print("Diamond price min: {}".format(np.min(Y)))
print("Diamond price max: {}".format(np.max(Y)))

All of the evaluation metrics improve, but the model's estimates are still off by `~$800` on average. So how can the model be improved further? 

Depending on the metric being used to evaluate the model, further improvement of a linear regression model may be impossible. OLS is a closed-form solution to the problem of minimizing the mean squared error of a linear regression model. Phrased differently, the output of OLS is the line with the minimum MSE for the given dataset. Other machine learning models that aren't linear regression may be able to obtain a smaller MSE, but no linear regression model will have a smaller MSE on this data than the one calculated by OLS.
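One way to convince yourself of this is to nudge the fitted coefficients and watch the MSE. The sketch below uses synthetic data as a stand-in for a single feature and a noisy target (the numbers are assumptions, not the diamonds dataset), fits a line with `numpy.polyfit`, and then tries a thousand randomly perturbed lines.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic stand-in for a single feature and a noisy linear target.
x = rng.uniform(0.2, 3.0, size=200)
y = 7000.0 * x - 2000.0 + rng.normal(scale=1500.0, size=200)

def mse_of_line(slope, intercept):
    """MSE of the line y = slope * x + intercept on the synthetic data."""
    return np.mean((y - (slope * x + intercept)) ** 2)

# np.polyfit returns coefficients from the highest degree down: [slope, intercept].
slope, intercept = np.polyfit(x, y, 1)
best = mse_of_line(slope, intercept)

# Nudge the fitted coefficients in many random directions; none should do better.
nudges = rng.normal(scale=50.0, size=(1000, 2))
nudged = [mse_of_line(slope + ds, intercept + db) for ds, db in nudges]

print("OLS MSE: {:.1f}".format(best))
print("Smallest MSE among nudged lines: {:.1f}".format(min(nudged)))
```

Every perturbed line has an MSE at least as large as the fitted one, which is what being the minimizer of MSE means in practice.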

But there may still be a linear regression model with a smaller MAE which, as shown in the example above, may be preferred over OLS linear regression depending on what the model's architect believes should be optimized for. If outliers aren't as important to account for, MAE might make more sense.
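The difference in how the two metrics treat outliers is easiest to see with the simplest possible model: one that predicts the same constant for every input. For such a model the mean of the targets minimizes MSE, while the median minimizes MAE. The sketch below uses made-up prices with one extreme value; the helper names are illustrative.

```python
from statistics import mean, median

# Hypothetical prices with one extreme outlier.
prices = [500.0, 520.0, 540.0, 560.0, 18000.0]

def constant_mse(guess):
    """MSE of a model that predicts `guess` for every diamond."""
    return mean((p - guess) ** 2 for p in prices)

def constant_mae(guess):
    """MAE of a model that predicts `guess` for every diamond."""
    return mean(abs(p - guess) for p in prices)

best_for_mse = mean(prices)     # 4024.0, dragged toward the outlier
best_for_mae = median(prices)   # 540.0, unaffected by how extreme the outlier is

print("Mean   {:>7.1f}: MSE={:.0f} MAE={:.1f}".format(
    best_for_mse, constant_mse(best_for_mse), constant_mae(best_for_mse)))
print("Median {:>7.1f}: MSE={:.0f} MAE={:.1f}".format(
    best_for_mae, constant_mse(best_for_mae), constant_mae(best_for_mae)))
```

The same trade-off carries over to lines: a fit that minimizes MSE bends toward outliers, while a fit that minimizes MAE largely ignores how far away they are.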

But unlike with OLS regression, which produces a linear regression model with minimum MSE, there is no analytical method for solving Least Absolute Deviations regression, which would produce a linear regression model that minimizes MAE. Thus, a different, iterative approach will need to be taken to calculate such a model.

This is where gradient descent comes in; it will be introduced, implemented, and explained below.

In [None]:
def mean_error(y, y_pred):
    return np.mean(y - y_pred)

def mse(y, y_pred):
    return np.mean((y - y_pred)**2)

def mae(y, y_pred):
    return np.mean(np.abs(y - y_pred))

def huber(y, y_pred):
    alpha = 1
    # Quadratic for absolute errors below alpha, linear for errors of at least alpha.
    f = lambda error: 0.5 * error**2 if error < alpha else (alpha * error) - (0.5 * alpha**2)
    error = np.abs(y - y_pred)
    return np.mean(np.vectorize(f)(error))


def plot_error_functions(means, sigma, count, points=None):
    '''
    Plots the values of the MSE, MAE, and Huber Loss error functions when applied to mock predicted values.
    This function assumes that the correct value for a given input is always 0, and generates mock guesses for each
    mean in `means` by generating `count` normally distributed predictions around that mean with a standard
    deviation of `sigma`.
    
    The purpose of this function is to show how the values of the error functions are influenced by the standard
    deviation of the individual errors.
    
    :param means: An array of means around which to generate normally distributed mock predicted values.
    :param sigma: The standard deviation to use for generating the normally distributed mock predicted values.
    :param count: The number of mock predicted values to generate for each mean.
    :param points: An optional list of (x, y) tuples to plot as dots on the graph.
    '''
    mse_vals, mae_vals, huber_vals = [], [], []
    y_actual = np.zeros(count)
    for mean in means:
        y_preds = rs.normal(mean, sigma, count)
        mse_vals.append(mse(y_actual, y_preds))
        mae_vals.append(mae(y_actual, y_preds))
        huber_vals.append(huber(y_actual, y_preds))

    plt.figure(figsize=(6, 6))
    plt.plot(means, mse_vals, color="blue", label="MSE")
    plt.plot(means, mae_vals, color="red", label="MAE")
    plt.plot(means, huber_vals, color="black", label="Huber")
    if points:
        plt.scatter([x[0] for x in points], [y[1] for y in points], color="green")
    plt.grid()
    plt.legend()
    plt.title("Standard Deviation: {} Count: {}".format(sigma, count))
    plt.xlabel("Mean error")
    plt.ylabel("Error value")
    plt.show()  

    
plot_error_functions(np.arange(-2, 2.1, 0.1), 0.0, 10_000)
plot_error_functions(np.arange(-2, 2.1, 0.1), 1.0, 10_000)
plot_error_functions(np.arange(-100, 101), 0.0, 10_000)
plot_error_functions(np.arange(-100, 101), 100.0, 10_000)

In the top graph above, three of the common error functions are graphed, showing how each error value changes depending on how off the model's predictions were. For example, if the model made a prediction that was two less than the actual value, the MSE would be 4.0, the MAE would be 2.0, and the Huber Loss would be 1.5. All metrics have a value of zero when the model makes an accurate prediction, and that's the only point at which all three metrics evaluate to the same value.

It is also important to note how the values of the error functions are affected by the standard deviation of the errors. When a model's errors have a mean of -2.0 and the standard deviation increases from 0.0 to 1.0, the value of MSE goes up from 4.0 to 5.0, while the values of MAE and Huber Loss remain effectively unchanged. Similarly, note the effect that standard deviation has when the mean error is -100.
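The jump from 4.0 to 5.0 follows from the identity that the mean of squared errors equals the squared mean error plus the variance of the errors: $(-2)^2 + 1^2 = 5$. The sketch below checks this on freshly generated mock errors; it uses its own random generator rather than the notebook's `rs`, so the exact digits are not meant to match the graphs.

```python
import numpy as np

rng = np.random.default_rng(2)

# Mock errors with mean -2 and standard deviation 1, as in the second graph.
errors = rng.normal(loc=-2.0, scale=1.0, size=200_000)

mse_value = np.mean(errors ** 2)
mae_value = np.mean(np.abs(errors))
decomposed = np.mean(errors) ** 2 + np.var(errors)

print("MSE: {:.3f}".format(mse_value))
print("Squared mean + variance: {:.3f}".format(decomposed))
print("MAE: {:.3f}".format(mae_value))
```

MAE has no such additive variance term, which is why it barely moves as long as most errors keep the same sign.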

In general, observe that MSE forms a parabola, MAE forms a "V", and Huber forms a "V" past a certain error value specified by a hyperparameter (in this case, 1) but has a unique shape within that error value.
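The shape of Huber Loss is easier to see in a vectorized form. The following is a sketch of the standard definition, with the threshold hyperparameter named `delta` here (it plays the role of `alpha` in this notebook): the loss is quadratic for errors within the threshold and linear beyond it, and the two pieces meet at the threshold.

```python
import numpy as np

def huber_loss(error, delta=1.0):
    """Huber loss for each error: quadratic within `delta`, linear beyond it."""
    abs_error = np.abs(error)
    quadratic = 0.5 * abs_error ** 2
    linear = delta * abs_error - 0.5 * delta ** 2
    return np.where(abs_error <= delta, quadratic, linear)

errors = np.array([-2.0, -1.0, -0.5, 0.0, 0.5, 1.0, 2.0])
for e, loss in zip(errors, huber_loss(errors)):
    print("error={:+.1f} -> huber={:.3f}".format(e, loss))
```

An error of 2.0 gives a loss of 1.5, matching the value quoted for the first graph, while an error of 0.5 falls in the quadratic region and gives 0.125.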

Previously, it was shown that there exists a closed-form solution for generating weights that will result in a linear model that minimizes MSE. But, in the absence of, or as an alternative to, such closed-form solutions, gradient descent is another method which can be used to minimize the error of machine learning models. It's applicable to many types of models, from neural networks to support vector machines, but here it will be used to train a linear regression model.

To demonstrate gradient descent, imagine we have a linear regression model for this dataset with weights that poorly model the actual data. Such a model will be initialized and graphed below.

In [None]:
X = df["carat"]
Y = df["price"]

gradient_descent_model = OLS()
gradient_descent_model.weights = [1000.0]
gradient_descent_model.bias = 5000.0

closed_form_model = OLS()
closed_form_model.fit(X, Y)

plt.figure(figsize=(8, 8))
plt.scatter(X, Y, color="blue")
plt.plot(X, gradient_descent_model.weights[0] * X + gradient_descent_model.bias, color="black", label="Weights arbitrarily initialized")
plt.plot(X, closed_form_model.weights[0] * X + closed_form_model.bias, color="red", label="Closed-form solution")
plt.legend()
plt.title("Line of Best Fit - OLS Models")
plt.xlabel("Carat")
plt.ylabel("Price")
plt.ylim((0, Y.max() * 1.1))
plt.grid()
plt.show()

gdmodel_eval = ModelEvaluator(gradient_descent_model)
cfmodel_eval = ModelEvaluator(closed_form_model)
print("Arbitrary Weights")
gdmodel_eval.print_metrics(np.array([X]).T, Y)
print("\nClosed-Form Solution")
cfmodel_eval.print_metrics(np.array([X]).T, Y)

As expected, the model with arbitrary weights performs significantly worse across all metrics compared to the model fit using the closed-form solution. A quick look at the graph details why - the vast majority of data points fall far from the line.

But how can gradient descent help here? And what is gradient descent?

Gradient descent, in short, refers to an iterative process of improving a given model by updating the weights in small increments in some direction that should lead to better results. This is achieved by computing the partial derivative of the cost function with respect to each weight, and then moving each weight a small step in the direction opposite to its partial derivative. 

Intuitively, gradient descent can be thought of as a robot trying to reach the bottom of a valley, but all the robot knows about the environment is from a signal of how steep the slope directly below its treads is every minute. To reach the bottom of the valley, it understands that it should move down the slope, not up it, and it should keep moving down the path of greatest descent every time it gets another signal. 

What's not clear, however, is how far it should move in between each signal. If the robot moves continuously between each signal, it may reach a state that makes it impossible to stop at the bottom of the valley. Imagine that the robot gets a signal, begins moving, but reaches the bottom of the valley before receiving another signal. The robot would then start moving _back up_ the other side of the valley until the next signal is processed, upon receiving which it will turn around and repeat the same mistake in the other direction. If the robot only moves a few steps in between each signal, however, it may take years for it to ever reach the bottom.

These same issues apply to gradient descent. When trying to minimize MSE or whatever other error function is being used to evaluate the model, care has to be taken to decide how much the weights should be updated by their calculated gradients.
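The robot's dilemma can be reproduced on the simplest possible valley, $f(w) = w^2$, whose slope at any point is $2w$. The sketch below is a toy illustration, separate from the diamond model; `descend` is a hypothetical helper that applies the update rule for a fixed number of steps.

```python
def descend(learning_rate, start=1.0, steps=20):
    """Run gradient descent on f(w) = w**2, whose gradient is 2 * w."""
    w = start
    for _ in range(steps):
        w -= learning_rate * 2.0 * w
    return w

for lr in (0.01, 0.1, 1.0, 1.1):
    print("lr={:<4} -> w after 20 steps: {:.4g}".format(lr, descend(lr)))
```

With a learning rate of 0.01 the robot is still far from the bottom after twenty steps, 0.1 gets it close, 1.0 makes it jump between the same two points on opposite slopes forever, and 1.1 carries it higher up the valley with every step.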

With the basics of gradient descent explained, let's get a rough idea of where our model currently lies on the error function.

In [None]:
gd_mean_prediction = np.mean(gradient_descent_model.weights[0] * X + gradient_descent_model.bias)
cf_mean_prediction = np.mean(closed_form_model.weights[0] * X + closed_form_model.bias)
mean_actual = np.mean(Y)
gd_mean_difference = gd_mean_prediction - mean_actual
cf_mean_difference = cf_mean_prediction - mean_actual

points = [(gd_mean_difference, gdmodel_eval.mse(np.array([X]).T, Y)), (cf_mean_difference, cfmodel_eval.mse(np.array([X]).T, Y))]
plot_error_functions(np.arange(-2000, 2010, 10.0), 3000.0, 10_000, points)

The top-right point on the graph indicates the MSE of the gradient descent model, while the point in the center-bottom represents the MSE of the closed-form solution. The blue parabola attempts to show the MSE function for this dataset, but if it were truly the MSE function, both points would lie on the line. In fact, there is no standard deviation value which would produce such a line using the `plot_error_functions()` function, as the models' errors on this dataset aren't normally distributed.

However, the error function suggests that, on average, the model significantly overestimated the price of diamonds, and that the model's weights will need to be appropriately changed. To do so, the partial derivatives and gradients of the weights will be computed.

In [None]:
# There are lots of resources online for how these gradients are calculated, but in general they're obtained by using the
# chain rule: (y_actual - (mx + b))**2 -> -2x * (y_actual - (mx + b)) with respect to m, and -2 * (y_actual - (mx + b))
# with respect to b.
weight_gradient = np.mean((-2.0 * X) * (Y - ((gradient_descent_model.weights[0] * X) + gradient_descent_model.bias)))
bias_gradient = np.mean(-2.0 * (Y - ((gradient_descent_model.weights[0] * X) + gradient_descent_model.bias)))
print("Weight Gradient = {}".format(weight_gradient))
print("Bias Gradient = {}".format(bias_gradient))

Currently, the model calculates `price` using the following formula:

$price = 1000carat + 5000$

The gradients that were computed inform us that to improve the model, the weight for carat should be increased (its gradient is roughly $-60$), and the bias should be decreased (its gradient is roughly $3730$). Gradients are subtracted as opposed to added because the gradient represents the slope, and to reduce the error the weights should be updated to _descend_ the slope, not _ascend_ it. 
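A quick sanity check for gradient formulas like these is to compare them against finite differences of the error function itself. The sketch below does so on synthetic stand-ins for the carat and price columns (the data and the helper names are assumptions for illustration), starting from the same weight of 1000 and bias of 5000.

```python
import numpy as np

rng = np.random.default_rng(3)

# Synthetic stand-ins for the carat and price columns.
x = rng.uniform(0.2, 3.0, size=500)
y = 7000.0 * x - 2000.0 + rng.normal(scale=1000.0, size=500)

def line_mse(m, b):
    """MSE of the line y = m * x + b on the synthetic data."""
    return np.mean((y - (m * x + b)) ** 2)

def analytic_gradients(m, b):
    """Partial derivatives of the MSE with respect to m and b."""
    residual = y - (m * x + b)
    return np.mean(-2.0 * x * residual), np.mean(-2.0 * residual)

m, b = 1000.0, 5000.0
grad_m, grad_b = analytic_gradients(m, b)

# Central finite differences approximate the same slopes numerically.
eps = 1e-3
numeric_m = (line_mse(m + eps, b) - line_mse(m - eps, b)) / (2 * eps)
numeric_b = (line_mse(m, b + eps) - line_mse(m, b - eps)) / (2 * eps)

print("Weight gradient: analytic={:.3f} numeric={:.3f}".format(grad_m, numeric_m))
print("Bias gradient:   analytic={:.3f} numeric={:.3f}".format(grad_b, numeric_b))
```

If the two columns of numbers agree, the chain-rule derivation is very likely correct; a sign error or a missing factor of two shows up immediately.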

However, as shown with the robot analogy, it's not clear how much the weight and bias should be updated. Should the entire gradient be subtracted, some multiple of the gradient, or some fraction of the gradient? Let's try each scenario and see how they work.

In [None]:
def update_via_gradient_descent(model, lr):
    '''
    Performs a single iteration of gradient descent on a univariate model.
    
    :param model: An OLS model to train via gradient descent.
    :param lr: The learning rate of the model.
    '''
    if len(model.weights) == 1:
        weight_gradient = np.mean((-2.0 * X) * (Y - ((model.weights[0] * X) + model.bias)))
        bias_gradient = np.mean(-2.0 * (Y - ((model.weights[0] * X) + model.bias)))
        model.weights[0] -= lr * weight_gradient
        model.bias -= lr * bias_gradient
    
old_weight = gradient_descent_model.weights[0]
old_bias = gradient_descent_model.bias

gd_model_lr_1 = copy.deepcopy(gradient_descent_model)
gd_model_lr_point_1 = copy.deepcopy(gradient_descent_model)
gd_model_lr_2 = copy.deepcopy(gradient_descent_model)

update_via_gradient_descent(gd_model_lr_1, 1.0)
update_via_gradient_descent(gd_model_lr_point_1, 0.1)
update_via_gradient_descent(gd_model_lr_2, 2)

plt.figure(figsize=(10, 10))
plt.scatter(X, Y, color="blue")
plt.plot(X, old_weight * X + old_bias, color="black", label="Gradient Descent - Original Weights")
plt.plot(X, gd_model_lr_1.weights[0] * X + gd_model_lr_1.bias, color="green", label="Gradient Descent - LR 1.0")
plt.plot(X, gd_model_lr_point_1.weights[0] * X + gd_model_lr_point_1.bias, color="magenta", label="Gradient Descent - LR 0.1")
plt.plot(X, gd_model_lr_2.weights[0] * X + gd_model_lr_2.bias, color="orange", label="Gradient Descent - LR 2.0")
plt.plot(X, closed_form_model.weights[0] * X + closed_form_model.bias, color="red", label="Closed-form solution")
plt.legend()
plt.title("Line of Best Fit - OLS Models")
plt.xlabel("Carat")
plt.ylabel("Price")
plt.ylim((0, Y.max() * 1.1))
plt.grid()
plt.show()

print("Gradient Descent - Original Weights")
ModelEvaluator(gradient_descent_model).print_metrics(np.array([X]).T, Y)
print("\nGradient Descent - LR 1.0")
ModelEvaluator(gd_model_lr_1).print_metrics(np.array([X]).T, Y)
print("\nGradient Descent - LR 0.1")
ModelEvaluator(gd_model_lr_point_1).print_metrics(np.array([X]).T, Y)
print("\nGradient Descent - LR 2.0")
ModelEvaluator(gd_model_lr_2).print_metrics(np.array([X]).T, Y)

With a learning rate of 2.0, the model greatly overadjusted based on the calculated gradients and became significantly less accurate than before, with the MSE increasing from $1.61e7$ to $4.26e7$. With a learning rate of 1.0, in which the entire gradient is subtracted, the model slightly improved after a single iteration, with the MSE dropping to $1.58e7$, but the greatest improvement was seen in the model with a learning rate of 0.1, reaching an MSE of $1.49e7$. 

A single iteration of gradient descent suggests that a smaller learning rate may be better, but it'd be ill-advised to generalize the entire process of gradient descent based on the first iteration in a process that can take hundreds of steps. So gradient descent will be continued for these three models to see how much they improve, and how many steps each model requires to reach its best performance.

In [None]:
steps = 150
mse_history = {0.1: [], 1.0: [], 2.0: []}
mean_error_history = {0.1: [], 1.0: [], 2.0: []}
gradient_models = [(gd_model_lr_point_1, 0.1), (gd_model_lr_1, 1.0), (gd_model_lr_2, 2.0)]
XdotT = np.array([X]).T

for step in range(steps):
    for model, lr in gradient_models:
        update_via_gradient_descent(model, lr)
        mse_history[lr].append(mse(Y, model.predict(XdotT)))
        mean_error_history[lr].append(mean_error(Y, model.predict(XdotT)))

fig, axs = plt.subplots(3, figsize=(8, 12))
axs[0].plot(range(steps), mse_history[0.1], color="magenta", label="LR - 0.1")
axs[0].grid()
axs[0].legend()
axs[0].set_title("MSE Error vs. Iteration")
axs[1].plot(range(steps), mse_history[1.0], color="green", label="LR - 1.0")
axs[1].grid()
axs[1].legend()
axs[1].set_yscale("log")
axs[2].plot(range(steps), mse_history[2.0], color="orange", label="LR - 2.0")
axs[2].grid()
axs[2].legend()
axs[2].set_yscale("log")
axs[2].set_xlabel("Iteration Number")
plt.show()

print("Final MSE Value for LR 0.1 - {}".format(mse_history[0.1][-1]))
print("Final MSE Value for LR 1.0 - {}".format(mse_history[1.0][-1]))
print("Final MSE Value for LR 2.0 - {}".format(mse_history[2.0][-1]))

The MSE for the model with a learning rate of 0.1 gradually improves with each subsequent iteration until finally plateauing just above the MSE of the closed-form solution of $2.398e6$. In contrast, the MSE of the models using the other two learning rates increases exponentially with each iteration (note the logarithmic y-axes), although it's not immediately clear why this is the case.

To help explain why gradient descent with a high learning rate leads to worse performance, the same graphs from above will be plotted but this time with each model's mean error instead of their MSE.

In [None]:
fig, axs = plt.subplots(3, figsize=(8, 12))
axs[0].plot(range(steps), mean_error_history[0.1], color="magenta", label="LR - 0.1")
axs[0].grid()
axs[0].legend()
axs[0].set_title("Mean Error vs. Iteration")
axs[1].plot(range(steps), mean_error_history[1.0], color="green", label="LR - 1.0")
axs[1].grid()
axs[1].legend()
axs[1].set_yscale("symlog")
axs[2].plot(range(steps), mean_error_history[2.0], color="orange", label="LR - 2.0")
axs[2].grid()
axs[2].legend()
axs[2].set_yscale("symlog")
axs[2].set_xlabel("Iteration Number")
plt.show()

With a learning rate of 0.1, each iteration of gradient descent reduces the model's mean error until it converges near 0. The learning rates of 1.0 and 2.0 cause an exponential increase in the magnitude of the model's mean error, with the sign of the mean error flipping between positive and negative with each iteration. This peculiar observation can be understood as a result of the calculated gradient for the model leading to an overcompensation in the computed delta for the model's weights. 

Returning to the robot analogy, the model with a learning rate of 0.1 is a robot taking short, but accurate steps towards its goal of the bottom of the valley. The other two learning rates make it impossible for the robot to reach its goal. The robot gets a signal of the slope, travels down the slope, but overestimates how far it needs to travel and ends up continuing far up the other side of the valley before it gets another signal. In fact, the robot goes further up in elevation on the other side of the valley than it was originally on the previous side. And this cycle continues endlessly, as evidenced by the graphs above.
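For an error function shaped like MSE, how large a step is too large can be estimated ahead of time. For a quadratic error surface, gradient descent with a constant learning rate diverges once the learning rate exceeds two divided by the largest curvature, which is the largest eigenvalue of the matrix of second derivatives. The sketch below illustrates this on synthetic data; the data, the names `lr_limit` and `final_mse`, and the starting weights are assumptions for illustration, so the printed limit will not match the diamonds dataset.

```python
import numpy as np

rng = np.random.default_rng(4)

# Synthetic single feature and noisy linear target.
x = rng.uniform(0.2, 3.0, size=1000)
y = 7000.0 * x - 2000.0 + rng.normal(scale=1000.0, size=1000)

# Second derivatives of the MSE with respect to (weight, bias); they do not depend on y.
hessian = 2.0 * np.array([[np.mean(x ** 2), np.mean(x)],
                          [np.mean(x), 1.0]])
largest_curvature = np.linalg.eigvalsh(hessian).max()
lr_limit = 2.0 / largest_curvature

def final_mse(lr, steps=200):
    """Run gradient descent from weight=1000, bias=5000 and return the final MSE."""
    m, b = 1000.0, 5000.0
    for _ in range(steps):
        residual = y - (m * x + b)
        m, b = (m - lr * np.mean(-2.0 * x * residual),
                b - lr * np.mean(-2.0 * residual))
    return np.mean((y - (m * x + b)) ** 2)

print("Estimated learning rate limit: {:.3f}".format(lr_limit))
print("MSE just below the limit: {:.3e}".format(final_mse(0.9 * lr_limit)))
print("MSE just above the limit: {:.3e}".format(final_mse(1.1 * lr_limit)))
```

Just below the limit the robot settles into the valley; just above it, each step lands higher on the opposite slope than the one before.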

But this isn't to say that a smaller learning rate is always better. The error functions being used to evaluate the linear regression models are, fortunately, simple, but in practice the error function can be much more complex, especially as the dimensionality of the error space increases. Such error functions can have multiple local minima, only the lowest of which is the global minimum. To demonstrate this, consider the imaginary error function below.

In [None]:
x = np.linspace(-3, 1, 100)

plt.plot(x, 2*(x**4) + 7*(x**3) + 5*(x**2) + 4)
plt.ylabel("Error Function")
plt.grid()
plt.show()

The error function has two valleys that the robot could get stuck in, but one valley would result in an error value of zero while the other would result in a higher error value. With too small of a learning rate, not only would the model take a long time to converge, but small updates to the weights with each iteration could result in the model converging on sub-optimal weights. Variations of gradient descent seek to avoid this issue: learning rate decay starts with a higher learning rate and reduces it with each iteration, while Stochastic Gradient Descent (SGD) computes each gradient on a random subset of the data, which adds noise to the updates. SGD won't be covered in this notebook as it's not particularly useful here.
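To see the robot get stuck, gradient descent can be run on this same imaginary error function from two different starting points. The sketch below uses the polynomial plotted above and its derivative; the starting points, learning rate, and step count are arbitrary choices for illustration.

```python
def error(x):
    """The imaginary error function plotted above."""
    return 2 * x**4 + 7 * x**3 + 5 * x**2 + 4

def error_slope(x):
    """Derivative of the imaginary error function."""
    return 8 * x**3 + 21 * x**2 + 10 * x

def descend(x, learning_rate=0.01, steps=2000):
    """Run gradient descent with a small constant learning rate."""
    for _ in range(steps):
        x -= learning_rate * error_slope(x)
    return x

left = descend(-3.0)
right = descend(1.0)

print("Start at -3.0 -> x={:.4f}, error={:.4f}".format(left, error(left)))
print("Start at  1.0 -> x={:.4f}, error={:.4f}".format(right, error(right)))
```

Starting on the left, the robot settles in the deeper valley near $x = -2$, where the error is zero. Starting on the right, the same small steps leave it in the shallower valley near $x = 0$, where the error is four, with no way of knowing that a better valley exists.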

Nevertheless, a linear regression model will be trained using gradient descent with decay to show how starting with a large learning rate that slowly decreases can converge faster and avoid the issues seen with a constant high learning rate.

In [None]:
steps = 150
mse_history = []
mean_error_history = []
lr_history = []

lr = 1.0
decay = 0.05

model = copy.deepcopy(gradient_descent_model)

for step in range(steps):
    update_via_gradient_descent(model, lr)
    mse_history.append(mse(Y, model.predict(XdotT)))
    mean_error_history.append(mean_error(Y, model.predict(XdotT)))
    lr_history.append(lr)
    lr -= (lr * decay)

fig, axs = plt.subplots(3, figsize=(8, 12))
axs[0].plot(range(steps), mse_history, color="magenta", label="MSE History")
axs[0].grid()
axs[0].legend()
axs[0].set_title("Gradient Descent with Decay")
axs[1].plot(range(steps), mean_error_history, color="green", label="Mean Error History")
axs[1].grid()
axs[1].legend()
axs[2].plot(range(steps), lr_history, color="orange", label="Learning Rate History")
axs[2].grid()
axs[2].legend()
axs[2].set_xlabel("Iteration Number")
plt.show()

print("Gradient Descent with Decay")
ModelEvaluator(model).print_metrics(XdotT, Y)
print("\nGradient Descent LR 0.1 - After 150 Iterations")
ModelEvaluator(gd_model_lr_point_1).print_metrics(XdotT, Y)
print("\nClosed-Form Solution")
ModelEvaluator(closed_form_model).print_metrics(XdotT, Y)

Even though the model started gradient descent with a learning rate of 1.0, decaying the learning rate over time allowed the model to converge on the known global optimum of $2.40e7$. In fact, the final MSE of the linear regression model with decay was slightly less than that of the model with a constant learning rate of 0.1, though the model with a learning rate of 0.1 was able to obtain a lower MAE than both the gradient descent model with decay and the closed-form solution.

The above graphs also reveal that the early iterations of gradient descent had issues similar to those of the models with a constant high learning rate, but that these issues eventually subsided as the learning rate continued to decay.

Having decay is useful in that it can lead to faster convergence and make gradient descent more robust to local minima, but having decay introduces an additional hyperparameter that requires tuning, which can take time to do.
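
Because the training loop above multiplies the learning rate by `(1 - decay)` after every step, the schedule is geometric: $lr_t = lr_0 (1 - decay)^t$. The sketch below reproduces that schedule on its own to show how quickly the rate falls. The helper name `decayed_learning_rates` and the 0.1 reference threshold are assumptions made for illustration.

```python
def decayed_learning_rates(initial_lr, decay, steps):
    # Mirrors the loop above: record the rate, then shrink it by `decay`.
    rates = []
    lr = initial_lr
    for _ in range(steps):
        rates.append(lr)
        lr -= lr * decay
    return rates

rates = decayed_learning_rates(initial_lr=1.0, decay=0.05, steps=150)

# First iteration at which the rate is below the constant rate of 0.1.
first_below = next(i for i, lr in enumerate(rates) if lr < 0.1)
print("Rate drops below 0.1 at iteration {}".format(first_below))
print("Final rate: {:.6f}".format(rates[-1]))
```

A larger `decay` shortens the unstable early phase but also leaves fewer iterations in which the steps are large enough to make real progress, which is the tuning trade-off mentioned above.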

The above walkthrough of gradient descent should demonstrate that it is a viable alternative to a closed-form solution, especially in the case where minimizing an error function has no associated closed-form solution. Linear regression with gradient descent was able to fit a model that performed as well as that of the closed-form solution, but admittedly the model was trained on only a single variable. 

Gradient descent should work just as well when additional variables are introduced, and to check this, such a model will be trained below.

In [None]:
class LinRegressMSE:
    
    def __init__(self, learning_rate, steps):
        self.alpha = learning_rate
        self.steps = steps
    
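    def _gradient_mse(self, X, loss):
        # `loss` holds the residuals Y - (X @ weights). The value returned is the
        # gradient of MSE with respect to the weights up to a constant factor of 2,
        # which is effectively absorbed into the learning rate.
        entries = len(X)
        return (-X.T @ loss) / entries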

    def train(self, X, Y):
        orig_X = X
        X = np.c_[X, np.ones(len(X))]
        self.mse_costs = []
        self.mae_costs = []
        self.weight_history = []
        self.step_time = []
        entries = X.shape[0]
        self.weights = np.ones(X.shape[1])
        for _ in range(self.steps):
            start_time = time.time()
            loss = Y - (X @ self.weights)
            self.weight_history.append(np.copy(self.weights))
            gradient = self._gradient_mse(X, loss)
            self.weights -= self.alpha * gradient
            self.mse_costs.append(ModelEvaluator(self).mse(orig_X, Y))
            self.mae_costs.append(ModelEvaluator(self).mae(orig_X, Y))
            self.step_time.append(time.time() - start_time)
        
    def predict(self, X):
        return (X @ self.weights[:-1]) + self.weights[-1]

X_df = df.drop(["price"], axis=1)
X = X_df.values
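# Note: X is a NumPy array here, so min() and max() without an axis argument are
# taken over the whole matrix, and every column is scaled by the same shared range.
X = (X - X.min()) / (X.max() - X.min())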
Y = df["price"]

closed_form_model = OLS()
closed_form_model.fit(X, Y)

steps = 250
model = LinRegressMSE(0.1, steps)
model.train(X, Y)

print("Gradient Descent Model")
ModelEvaluator(model).print_metrics(X, Y)
print("\nClosed-Form Solution")
ModelEvaluator(closed_form_model).print_metrics(X, Y)

The expectation of the multivariate linear regression model being as easy to fit with gradient descent as the univariate model seems fairly misguided given the above results. The MSE of the gradient descent model is around an order of magnitude worse than the closed-form solution's MSE, with the other calculated metrics demonstrating a poor performance as well.

But why is this the case? Was the learning rate too large or too small, or did the model not have enough iterations to fit the weights properly? Graphing the values of the model's weights with each iteration will hopefully provide the necessary insight to answer these questions.

In [None]:
plt.figure(figsize=(6, 6))
for i in range(len(model.weights)):
    label_name = X_df.columns[i] if i < len(X_df.columns) else "Bias"
    plt.plot(range(steps), [row[i] for row in model.weight_history], label=label_name)
plt.xlabel("Iteration Number")
plt.ylabel("Weight Value")
plt.legend()
plt.grid()
plt.show()

plt.figure(figsize=(6, 6))
plt.plot(range(steps), model.mse_costs)
plt.xlabel("Iteration Number")
plt.ylabel("MSE")
plt.grid()
plt.show()

Within the first 25 iterations, the model's weights shifted greatly from their starting values of 1 to a wide range of values. Accordingly, the MSE of the model dropped significantly - from over $2.6e7$ to under $1.6e7$ - within this same period.

Nevertheless, all of the weights continued to shift over the remaining iterations, with the MSE slowly dropping as well - with necessary emphasis on _slowly_.

In fact, assuming that the MSE is decreasing linearly after the sharp initial drop (this is a bad assumption), how many more iterations are required for the model to finish training?

In [None]:
slope_estimate = (model.mse_costs[49] - model.mse_costs[249]) / 200
mse_delta = ModelEvaluator(model).mse(X, Y) - ModelEvaluator(closed_form_model).mse(X, Y)
print("Additional Iterations Required: {}".format(mse_delta / slope_estimate))

Even with the rather optimistic assumption of a linear decrease in MSE over each iteration, the model would still require ~15,000 more steps before it converges on the MSE of the closed-form solution.

Running this many iterations will take significantly more time, so a more desirable solution would be to adjust the learning rate or potentially add a decay parameter to help train the model faster. Out of curiosity, though, the model will be trained for the calculated number of iterations just to observe the outcome. 

Additionally, a rolling average of the time it takes to go through each iteration of gradient descent will be graphed to see if the iteration time increases significantly over time. This might be the case given the increasing size of `weight_history` and the other lists that are being used to track the model's metadata.
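
For reference, a rolling average can be computed by convolving the series with a window of ones and dividing by the window length. The sketch below shows the idea on a tiny made-up list. The helper name `rolling_average` is hypothetical and only used here.

```python
import numpy as np

def rolling_average(values, window):
    # "valid" keeps only the positions where the window fully overlaps the data,
    # so the result has len(values) - window + 1 entries.
    return np.convolve(values, np.ones(window), "valid") / window

smoothed = rolling_average([1.0, 2.0, 3.0, 4.0, 5.0], window=3)
print(smoothed)
```

Each output value is the mean of three consecutive inputs, which is why the smoothed series is shorter than the original by `window - 1` entries.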

In [None]:
steps = 16_000
model = LinRegressMSE(0.1, steps)
model.train(X, Y)

plt.figure(figsize=(6, 6))
for i in range(len(model.weights)):
    label_name = X_df.columns[i] if i < len(X_df.columns) else "Bias"
    plt.plot(range(steps), [row[i] for row in model.weight_history], label=label_name)
plt.xlabel("Iteration Number")
plt.ylabel("Weight Value")
plt.legend()
plt.grid()
plt.show()

plt.figure(figsize=(6, 6))
plt.plot(range(steps), model.mse_costs)
plt.xlabel("Iteration Number")
plt.ylabel("MSE")
plt.grid()
plt.show()

step_time_rolling_average = np.convolve(model.step_time, np.ones(500), 'valid') / 500
plt.figure(figsize=(6, 6))
plt.plot(range(len(step_time_rolling_average)), step_time_rolling_average)
plt.xlabel("Iteration Number")
plt.ylabel("Iteration Time (Rolling Average)")
plt.grid()
plt.show()

As expected, the model's MSE decreases sublinearly with each iteration. The closer the model's MSE gets to the optimal value, the smaller the gradient becomes, and thus the smaller the change in the weights between iterations. It's likely the model won't converge even after a million more iterations with the current hyperparameters (I tried this separately and confirmed that this is the case), and given that each iteration consistently takes around 60 ms, it's not feasible to run that many iterations in this notebook. So, somehow, the model will need to be tweaked.
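
The slowdown can be quantified for a one-dimensional quadratic loss $\frac{1}{2}\lambda(w - w^*)^2$, which is a reasonable stand-in for a single direction of the MSE surface. There, each gradient descent step multiplies the distance to the optimum by $|1 - \alpha\lambda|$, where $\lambda$ is the curvature along that direction. The sketch below uses that fact. The curvature values are made up for illustration and are not measured from the diamond dataset.

```python
import math

def steps_to_shrink(curvature, lr, factor=1e-3):
    """Steps needed to shrink the distance to the optimum by `factor`."""
    rate = abs(1 - lr * curvature)  # per-step contraction of the error
    return math.ceil(math.log(factor) / math.log(rate))

steep = steps_to_shrink(curvature=1.0, lr=0.1)
flat = steps_to_shrink(curvature=0.001, lr=0.1)

print("Steep direction: {} steps".format(steep))
print("Flat direction:  {} steps".format(flat))
```

Under these assumptions, a direction that is a thousand times flatter needs roughly a thousand times as many steps at the same learning rate, which is consistent with a few weights lagging far behind the rest.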

The most obvious change in the case of slow model convergence is to increase the learning rate, so the learning rate of the model will be increased sharply from $0.1$ to $1.0$.

In [None]:
steps = 10_000
model = LinRegressMSE(1.0, steps)
model.train(X, Y)

plt.figure(figsize=(6, 6))
for i in range(len(model.weights)):
    label_name = X_df.columns[i] if i < len(X_df.columns) else "Bias"
    plt.plot(range(steps), [row[i] for row in model.weight_history], label=label_name)
plt.xlabel("Iteration Number")
plt.ylabel("Weight Value")
plt.legend()
plt.grid()
plt.show()

plt.figure(figsize=(6, 6))
plt.plot(range(steps), model.mse_costs)
plt.xlabel("Iteration Number")
plt.ylabel("MSE")
plt.grid()
plt.show()

print("Gradient Descent Model")
ModelEvaluator(model).print_metrics(X, Y)

Increasing the learning rate to 1.0 certainly helps to speed up the learning process of the model, but as the MSE approaches the optimal value it becomes more evident how slowly the model improves the better it performs.

We could continue blindly increasing the learning rate but, in fact, I tried this on my own and found that raising the learning rate even slightly above 1.0 causes the MSE to grow with each iteration in the same manner that it did in the univariate regression model. Decay also won't help here, as the issue in this case is that the learning rate is too small rather than too large. So is there anything that can be done?

Many of the techniques developed to improve gradient descent concern the domain of neural networks, or other loss functions that feature local minima. In such contexts, gradient descent needs to be performed in a way so as to ideally reach the global minimum while avoiding the numerous local minima that a model may get stuck in. 

But MSE for linear regression happens to be a convex function regardless of the dimensionality of the dataset. This fact can be exploited to develop unconventional gradient descent optimizations that would be impractical if the loss function had any local minima, but which can greatly reduce training time when no such local minima are present. A naive example would be to slowly increase the learning rate over time until the loss starts increasing, and then to reduce the learning rate and backtrack until the minimum loss is reached. In the presence of local minima, doing so could increase the likelihood of the model converging at a local minimum, but such a concern isn't necessary here.
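
One way the naive idea above could look in code is sketched below. It is a heuristic sometimes called "bold driver": grow the learning rate after every step that lowers the loss, and reject the step and shrink the rate whenever the loss goes up. The function name, the growth and shrink factors, and the synthetic data are all assumptions chosen for illustration; this is not the approach used in the rest of the notebook.

```python
import numpy as np

def fit_with_adaptive_lr(X, y, lr=0.01, grow=1.1, shrink=0.5, steps=2000):
    Xb = np.c_[X, np.ones(len(X))]          # append a bias column
    w = np.ones(Xb.shape[1])
    loss = np.mean((y - Xb @ w) ** 2)
    for _ in range(steps):
        gradient = -2 * Xb.T @ (y - Xb @ w) / len(Xb)
        candidate = w - lr * gradient
        candidate_loss = np.mean((y - Xb @ candidate) ** 2)
        if candidate_loss < loss:
            # The step helped: keep it and be slightly bolder next time.
            w, loss = candidate, candidate_loss
            lr *= grow
        else:
            # The step overshot: discard it (backtrack) and be more careful.
            lr *= shrink
    return w, loss

# Synthetic, noiseless data with known weights (illustrative only).
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 1, size=(200, 2))
true_w = np.array([3.0, -2.0, 0.5])
y_demo = np.c_[X_demo, np.ones(200)] @ true_w

w_demo, loss_demo = fit_with_adaptive_lr(X_demo, y_demo)
print(w_demo, loss_demo)
```

Because a step is only accepted when it lowers the loss, the loss never increases, and on a convex surface there is no risk of being steered into a worse valley.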

Many of the techniques used to improve gradient descent in the context of local minima could be useful here as well. One such example is momentum, which, in plain terms, keeps track of the previous weight update and adds a fraction of it to the current update. The result is weights that gain velocity over successive iterations, as shown by the definition below.

Consider the old formula for calculating the next weight after each iteration:

$w_{i+1} = w_i - (\alpha \cdot \text{gradient})$

Imagine, instead, weights are updated using the following formulae:

$v_i = (\gamma \cdot v_{i-1}) + (\alpha \cdot \text{gradient})$

$w_{i+1} = w_i - v_i$

If a model's weight had small gradients over the past few iterations (where "few" depends on the value of the hyperparameter $\gamma$), $v_i$ will be small, and the behavior of gradient descent will be comparable to when there was no momentum. If a model's weight had large gradients over the past few iterations, however, $v_i$ will be much larger, and learning will be much faster with momentum than without it.
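
To get a feel for how much velocity can build up, suppose the gradient stays at a constant value $g$. Repeatedly applying the update for $v_i$ then produces a geometric series that approaches $\alpha g / (1 - \gamma)$, so the effective step can be up to $1 / (1 - \gamma)$ times larger than without momentum. The sketch below checks this numerically. The constant gradient is an idealization chosen purely for illustration.

```python
def velocity_after(steps, lr, momentum, gradient):
    # Apply v = momentum * v + lr * gradient repeatedly, starting from v = 0.
    v = 0.0
    for _ in range(steps):
        v = momentum * v + lr * gradient
    return v

lr, gradient = 0.1, 1.0
for momentum in (0.0, 0.9, 0.99):
    v = velocity_after(steps=5000, lr=lr, momentum=momentum, gradient=gradient)
    limit = lr * gradient / (1 - momentum)
    print("momentum={}: velocity={:.4f}, predicted limit={:.4f}".format(momentum, v, limit))
```

By the same formula, a momentum of 0.99 allows steps up to 100 times the plain update, and a momentum of 0.999 allows up to 1000 times, provided the gradient keeps pointing in the same direction.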

Momentum will be implemented below to see by how much it's able to improve the convergence time of the model.
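
Since $v_{i-1} = w_{i-1} - w_i$, the velocity update can also be written purely in terms of the last two weight vectors: $w_{i+1} = w_i + \gamma (w_i - w_{i-1}) - (\alpha \cdot \text{gradient})$. The sketch below runs both forms on a toy quadratic loss and checks that they agree. The toy loss, the function names, and the hyperparameter values are assumptions made for this comparison only.

```python
import numpy as np

def toy_gradient(w, curvature):
    # Gradient of the toy loss 0.5 * sum(curvature * w**2).
    return curvature * w

def velocity_form(w0, curvature, lr, momentum, steps):
    w, v = w0.copy(), np.zeros_like(w0)
    for _ in range(steps):
        v = momentum * v + lr * toy_gradient(w, curvature)
        w = w - v
    return w

def weight_difference_form(w0, curvature, lr, momentum, steps):
    # Starting with w_prev == w corresponds to an initial velocity of zero.
    w_prev, w = w0.copy(), w0.copy()
    for _ in range(steps):
        step = momentum * (w - w_prev) - lr * toy_gradient(w, curvature)
        w_prev, w = w, w + step
    return w

w0 = np.array([1.0, 1.0])
curvature = np.array([1.0, 0.05])
a = velocity_form(w0, curvature, lr=0.1, momentum=0.9, steps=50)
b = weight_difference_form(w0, curvature, lr=0.1, momentum=0.9, steps=50)
print(a, b)
```

The second form is convenient when a history of weights is already being recorded, since no separate velocity variable needs to be stored.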

In [None]:
class LinRegressMSE:
    
    def __init__(self, learning_rate, momentum, steps):
        self.alpha = learning_rate
        self.momentum = momentum
        self.steps = steps
    
    def _gradient_mse(self, X, loss):
        entries = len(X)
        return (-X.T @ loss) / entries

    def train(self, X, Y):
        orig_X = X
        X = np.c_[X, np.ones(len(X))]
        self.mse_costs = []
        self.mae_costs = []
        self.weight_history = []
        self.step_time = []
        entries = X.shape[0]
        self.weights = np.ones(X.shape[1])
        for _ in range(self.steps):
            start_time = time.time()
            loss = Y - (X @ self.weights)
            self.weight_history.append(np.copy(self.weights))
            gradient = self._gradient_mse(X, loss)
            velocity = (self.weight_history[-1] - self.weight_history[-2]) * self.momentum if len(self.weight_history) > 1 else np.zeros(len(self.weights))
            self.weights += velocity - (self.alpha * gradient)
            self.mse_costs.append(ModelEvaluator(self).mse(orig_X, Y))
            self.mae_costs.append(ModelEvaluator(self).mae(orig_X, Y))
            self.step_time.append(time.time() - start_time)
        
    def predict(self, X):
        return (X @ self.weights[:-1]) + self.weights[-1]
  

steps = 10_000
model = LinRegressMSE(1.0, 0.99, steps)
model.train(X, Y)

plt.figure(figsize=(6, 6))
for i in range(len(model.weights)):
    label_name = X_df.columns[i] if i < len(X_df.columns) else "Bias"
    plt.plot(range(steps), [row[i] for row in model.weight_history], label=label_name)
plt.xlabel("Iteration Number")
plt.ylabel("Weight Value")
plt.legend()
plt.grid()
plt.show()

plt.figure(figsize=(6, 6))
plt.plot(range(steps), model.mse_costs)
plt.xlabel("Iteration Number")
plt.ylabel("MSE")
plt.grid()
plt.show()

print("Gradient Descent Model with Momentum")
ModelEvaluator(model).print_metrics(X, Y)
print("\nClosed-Form Solution for the weight of carat: {}".format(closed_form_model.weights[0]))

The above results highlight a few interesting findings.

First, momentum was able to significantly speed up the training of the model. In fact, the model with momentum reaches an MSE of $1.6e7$ in the same number of iterations that it took to reach $2.8e7$ without momentum.

Second, even with a large value of $0.99$ for momentum (a value of $1.0$ would make convergence nearly impossible), the weight for `carat` is still converging on its optimal value quite slowly. Testing outside of this notebook suggested the model takes around 50,000 iterations to converge, but ideally this number could be shortened by fine-tuning the hyperparameters.

Lastly, with the introduction of momentum, the MSE of the model no longer strictly decreases with each iteration. This is a consequence of some of the weights gaining too much momentum and overshooting their optimum, and the effects can be observed in both graphs within the first 500 iterations of training.
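
The overshoot can be reproduced on the simplest possible loss. The sketch below runs gradient descent with and without momentum on the toy loss $\frac{1}{2}w^2$ and checks whether the loss falls at every step. The toy loss and the hyperparameter values are assumptions chosen for illustration.

```python
def loss_history(lr, momentum, steps=30, start=1.0):
    # Toy loss 0.5 * w**2, whose gradient is simply w.
    w, v, losses = start, 0.0, []
    for _ in range(steps):
        v = momentum * v + lr * w
        w -= v
        losses.append(0.5 * w**2)
    return losses

def strictly_decreasing(values):
    return all(later < earlier for earlier, later in zip(values, values[1:]))

plain = loss_history(lr=0.1, momentum=0.0)
with_momentum = loss_history(lr=0.1, momentum=0.9)

print("Without momentum, loss always falls:", strictly_decreasing(plain))
print("With momentum, loss always falls:   ", strictly_decreasing(with_momentum))
```

With momentum, the weight carries enough velocity to cross zero and climb the other side, so the loss temporarily rises before falling again.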

Given the above findings, we might expect some issues if we increase the momentum hyperparameter by much more, but regardless, we'll try increasing it from $0.99$ to $0.999$ below.

In [None]:
steps = 10_000
model = LinRegressMSE(1.0, 0.999, steps)
model.train(X, Y)

plt.figure(figsize=(6, 6))
for i in range(len(model.weights)):
    label_name = X_df.columns[i] if i < len(X_df.columns) else "Bias"
    plt.plot(range(steps), [row[i] for row in model.weight_history], label=label_name)
plt.xlabel("Iteration Number")
plt.ylabel("Weight Value")
plt.legend()
plt.grid()
plt.show()

plt.figure(figsize=(6, 6))
plt.plot(range(steps), model.mse_costs)
plt.xlabel("Iteration Number")
plt.ylabel("MSE")
plt.grid()
plt.show()

print("Gradient Descent Model with Momentum")
ModelEvaluator(model).print_metrics(X, Y)

The oscillation experienced by each weight is far more evident with a momentum of $0.999$ than it was with a momentum of $0.99$. Before, the oscillation wasn't readily discernible past the first 500 iterations, but now all of the weights seem to be experiencing some degree of oscillation even at the 10,000th iteration. Also, while the rolling average of the MSE is decreasing over time, the volatility with which it is doing so is extreme.

Nevertheless, the resulting MSE of $1.480291e7$ is better than the previous model's MSE of $1.607e7$, and very close to the closed-form solution's MSE of $1.479989e7$, even if the manner by which this result is achieved is dubious.

I could dedicate more time in this notebook to fine-tuning these hyperparameters and seeing how gradient descent is then affected, but instead of looking further into how the model can be improved, I would rather discuss why weights like `carat` converge so slowly while other weights like `clarity` converge much sooner.

To do this, the MSE loss landscapes of the univariate linear regression models will be graphed to get better insight into how the loss landscape looks for different variables.

In [None]:
def mse_contour(df, variable_name, target_variable_name, level_count = 9, start_level = 0, samples = 35):
    '''
    Plots the error landscape of MSE in regards to a linear regression model where 
    `target_variable_name` is the target variable and `variable_name` is the variable
    to use to train the model.
    
    :param df: The pandas DataFrame.
    :param variable_name: The name of the column to show the MSE loss contour graph for.
    :param target_variable_name: The target variable to be predicted by the linear regression model.
    :param level_count: The number of levels shown in the contour graph. The lowest level is 0, 
    and the remaining levels increase by powers of 2 from (2 << `start_level`).
    :param start_level: The power of 2 to start populating the level values from.
    :param samples: The number of samples to use in calculating the MSE landscape. A higher value
    will produce more granular results, but at the expense of runtime.
    '''
    X = df[variable_name]
    X = (X - X.min()) / (X.max() - X.min())
    Y = df[target_variable_name]
    Y = (Y - Y.min()) / (Y.max() - Y.min())
    model = OLS()
    model.fit(X, Y)

    sample_size = samples
    sample_range = 10
    sample_weights = np.linspace(model.weights[0] - sample_range, model.weights[0] + sample_range, sample_size)
    sample_biases = np.linspace(model.bias - sample_range, model.bias + sample_range, sample_size)
    mse_vals = np.zeros(shape=(sample_size, sample_size))

    # Calculate MSE values for various weights close to optimal weights.
    for i, weight in enumerate(sample_weights):
        for j, bias in enumerate(sample_biases):     
            mse_vals[i, j] = mse(Y, (X * weight) + bias)

    # Set the levels to use to form boundaries in the contour plot.
    levels = [0] + [((2 << start_level) << i) for i in range(level_count - 1)]

    # Plot the Contour Graph
    plt.figure(figsize=(10, 10))
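    # contour/contourf expect Z indexed as [y, x]. mse_vals is indexed as
    # [weight, bias], so it is transposed to put the weight on the x-axis.
    plt.contourf(sample_weights, sample_biases, mse_vals.T, alpha=0.5, levels=levels)
    contour_plot = plt.contour(sample_weights, sample_biases, mse_vals.T, levels=levels)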
    plt.clabel(contour_plot, fontsize=12, colors="black")
    plt.title("MSE Loss Plane (Optimal MSE = {})".format(mse(Y, (X * model.weights[0]) + model.bias)))
    plt.xlabel("{} - (Optimal = {})".format(variable_name, model.weights[0]))
    plt.ylabel("Bias - (Optimal = {})".format(model.bias))
    plt.show()
    
    
mse_contour(df, "clarity", "price")

Before interpreting the above graph, it should be noted that min-max normalization was used to produce the above contour graph. This was simply to scale the features so that multiple contour graphs can be more easily compared.
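
As a reminder of what min-max normalization does, the sketch below rescales a small made-up array so that each column spans 0 to 1. It also shows one detail worth keeping in mind: on a two-dimensional NumPy array, `min()` and `max()` without an `axis` argument are taken over the whole array, whereas the pandas Series used in `mse_contour` is a single column. The sample values and function names are invented for illustration.

```python
import numpy as np

def min_max_per_column(X):
    # Scale every column to the range [0, 1] independently.
    X = np.asarray(X, dtype=float)
    col_min = X.min(axis=0)
    col_max = X.max(axis=0)
    return (X - col_min) / (col_max - col_min)

def min_max_global(X):
    # Scale the whole array by one shared minimum and maximum.
    X = np.asarray(X, dtype=float)
    return (X - X.min()) / (X.max() - X.min())

# Two columns on very different scales (made-up values).
sample = np.array([[0.2, 55.0],
                   [1.0, 60.0],
                   [5.0, 95.0]])

per_column = min_max_per_column(sample)
shared = min_max_global(sample)
print(per_column)
print(shared)
```

With a shared range, the column with small raw values stays squeezed near 0 after scaling, while per-column scaling stretches both columns across the full 0 to 1 range.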

Above is a contour graph of the MSE relative to `clarity` and the bias term, where the actual contours represent the value of the MSE for the model. As an example, the top-right corner of the graph would represent a linear regression model $price = 10.0clarity + 10.0$, with the MSE of such a model having a value between 128 and 256, as shown by the green region. The central ellipse, meanwhile, represents the contour region with the lowest obtainable MSE for a $price = (weight*clarity) + bias$ linear regression model. 

The above loss landscape shows, mainly, that shifting both the weight and bias by some `x` in the same direction will impact the performance of the model more than shifting them by that same `x` in opposite directions. For example, imagine a model with a weight of 0.0 and a bias of 2.5. Updating both by either +2.5 or -2.5 will move the model multiple contour levels, but changing one by +2.5 and the other by -2.5 won't have any impact on the contour level.
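
The asymmetry has a simple cause: when the input values are all non-negative, raising the weight and raising the bias both push every prediction up, so their effects add together, whereas moving them in opposite directions lets the changes partly cancel. The sketch below measures this on synthetic data. The inputs, the "optimal" parameters, and the helper name `toy_mse` are assumptions made for illustration.

```python
import numpy as np

def toy_mse(y_true, y_pred):
    return np.mean((y_true - y_pred) ** 2)

x = np.linspace(0, 1, 101)   # inputs already scaled to [0, 1]
weight, bias = 2.0, 1.0      # made-up optimal parameters
y = weight * x + bias        # noiseless targets, so the optimal MSE is 0

shift = 2.5
same_direction = toy_mse(y, (weight + shift) * x + (bias + shift))
opposite_direction = toy_mse(y, (weight + shift) * x + (bias - shift))

print("Both shifted by +2.5:      MSE = {:.3f}".format(same_direction))
print("Shifted by +2.5 and -2.5:  MSE = {:.3f}".format(opposite_direction))
```

In this sketch the opposite-direction shift still raises the MSE, just far less than the same-direction shift, which is what a diagonal valley in the contour graph represents.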

To see the impact this observation has on gradient descent, two univariate linear regression models will be trained to predict `price` given `clarity`. Both will be initialized with a bias of `100`, but one will start with a weight of `optimal_weight + 100` and the other with a weight of `optimal_weight - 100`.

In [None]:
X = df["clarity"]
X = (X - X.min()) / (X.max() - X.min())
Y = df["price"]
Y = (Y - Y.min()) / (Y.max() - Y.min())

cf_model = OLS()
cf_model.fit(X, Y)

model1 = OLS()
model1.weights = [cf_model.weights[0] + 100]
model1.bias = 100

model2 = OLS()
model2.weights = [cf_model.weights[0] - 100]
model2.bias = 100

steps = 1000
mse_history = [[], []]
models = [(0, model1), (1, model2)]
XdotT = np.array([X]).T

for step in range(steps):
    # Iterate over a copy so that removing a converged model doesn't skip the next one.
    for i, model in list(models):
        update_via_gradient_descent(model, 0.1)
        mse_history[i].append(mse(Y, model.predict(XdotT)))
        if mse_history[i][-1] < 0.4:
            models.remove((i, model))
        
plt.figure(figsize=(8, 8))
plt.plot(range(len(mse_history[0])), mse_history[0], color="magenta")
plt.plot(range(len(mse_history[1])), mse_history[1], color="green")
plt.scatter(len(mse_history[0]) - 1, mse_history[0][-1], color="magenta")
plt.scatter(len(mse_history[1]) - 1, mse_history[1][-1], color="green")
plt.grid()
plt.title("MSE Error vs. Iteration")
plt.show()

Both of the above models have the same bias, and their weights are the same distance from the target weight, but one model takes ~300 iterations to converge while the other takes ~450. One model also starts with an MSE of over 13,000, while the other starts with an MSE under 2,000. Yet the model that converges faster happens to be the one that starts with the larger MSE.

As illustrated by the previous loss landscape, the model that starts with the higher MSE is also the model that has a steeper descent to the global optimum MSE value. Thus, by extension, that model will also have larger gradients, and therefore require fewer steps to converge than the other model.

This observation won't always hold in gradient descent. MSE when applied to linear functions just happens to be a conveniently simple loss function. Furthermore, the conclusion from the results above shouldn't be that a higher starting MSE equates to a steeper slope. A simple demonstration below will disprove this.

In [None]:
X = df["clarity"]
X = (X - X.min()) / (X.max() - X.min())
Y = df["price"]
Y = (Y - Y.min()) / (Y.max() - Y.min())

cf_model = OLS()
cf_model.fit(X, Y)

model1 = OLS()
model1.weights = [cf_model.weights[0] + 100]
model1.bias = 100

model2 = OLS()
model2.weights = [cf_model.weights[0] - 100]
model2.bias = 100

model3 = OLS()
model3.weights = [cf_model.weights[0] - 200]
model3.bias = 100

steps = 1000
mse_history = [[], [], []]
models = [(0, model1), (1, model2), (2, model3)]
XdotT = np.array([X]).T

for step in range(steps):
    # Iterate over a copy so that removing a converged model doesn't skip the next one.
    for i, model in list(models):
        update_via_gradient_descent(model, 0.1)
        mse_history[i].append(mse(Y, model.predict(XdotT)))
        if mse_history[i][-1] < 0.4:
            models.remove((i, model))
        
plt.figure(figsize=(8, 8))
plt.plot(range(len(mse_history[0])), mse_history[0], color="magenta")
plt.plot(range(len(mse_history[1])), mse_history[1], color="green")
plt.plot(range(len(mse_history[2])), mse_history[2], color="blue")
plt.scatter(len(mse_history[0]) - 1, mse_history[0][-1], color="magenta")
plt.scatter(len(mse_history[1]) - 1, mse_history[1][-1], color="green")
plt.scatter(len(mse_history[2]) - 1, mse_history[2][-1], color="blue")
plt.grid()
plt.title("MSE Error vs. Iteration")
plt.show()

The new model, indicated by the blue line, has an MSE slightly above 2,000, and yet it takes longer to converge than both the model with a smaller MSE and the model with a larger MSE. Convergence time, therefore, depends not on the starting value of a loss function, but on how steep the remaining descent for the model's weights is.
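
The same effect can be reproduced without any dataset. The sketch below runs gradient descent on a toy two-weight quadratic loss with one steep and one flat direction, starting the same distance from the optimum along each. The curvatures, learning rate, and convergence threshold are assumptions chosen for illustration.

```python
def toy_loss(w, curvatures):
    return 0.5 * sum(c * wi**2 for c, wi in zip(curvatures, w))

def steps_until_converged(start, curvatures, lr=0.1, threshold=0.01, max_steps=10_000):
    w = list(start)
    initial_loss = toy_loss(w, curvatures)
    for step in range(max_steps):
        if toy_loss(w, curvatures) < threshold:
            return initial_loss, step
        # The gradient of the toy loss for each weight is curvature * weight.
        w = [wi - lr * c * wi for c, wi in zip(curvatures, w)]
    return initial_loss, max_steps

curvatures = (1.0, 0.05)  # first direction is steep, second is flat

steep_loss, steep_steps = steps_until_converged((10.0, 0.0), curvatures)
flat_loss, flat_steps = steps_until_converged((0.0, 10.0), curvatures)

print("Steep start: initial loss {:.1f}, converged in {} steps".format(steep_loss, steep_steps))
print("Flat start:  initial loss {:.1f}, converged in {} steps".format(flat_loss, flat_steps))
```

The start on the steep side has the much higher initial loss but needs far fewer steps, because its gradients are larger all the way down.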

Having introduced loss landscapes, we'll now take a look at the loss landscape for `carat` and see how it compares to the loss landscape for `clarity`.

In [None]:
mse_contour(df, "clarity", "price")
mse_contour(df, "carat", "price")

The loss landscape for `carat` is a deep valley, and unlike the loss landscape for `clarity`, the valley runs parallel to one of the axes rather than diagonally. The significance of this is that large changes to one of the two parameters have minimal impact on the MSE of the model, while slight changes to the other affect the MSE greatly.

Unfortunately, it wouldn't be trivial to graph the loss landscape of the multivariate linear regression model, but it's likely that there would be a long, flat valley associated with the `carat` weight which is causing the slow convergence for that term. As a consequence of this, large deltas to the weight for `carat` of the multivariate regression model might not have much of an impact on MSE, especially when compared to similar deltas for the other weights.
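
Although the multivariate landscape can't easily be drawn, its shape can be summarized numerically. For MSE, the curvature in every direction is given by the eigenvalues of the Hessian $2X^TX/n$, and the ratio of the largest to the smallest eigenvalue (the condition number) indicates how stretched the valley is. The sketch below computes that ratio for two synthetic single-feature datasets. The data and the function name are assumptions and are not taken from the diamond dataset.

```python
import numpy as np

def mse_condition_number(X):
    Xb = np.c_[X, np.ones(len(X))]       # append the bias column
    hessian = 2 * Xb.T @ Xb / len(Xb)    # Hessian of MSE w.r.t. the weights
    eigenvalues = np.linalg.eigvalsh(hessian)  # returned in ascending order
    return eigenvalues[-1] / eigenvalues[0]

rng = np.random.default_rng(0)
balanced = rng.uniform(0, 1, size=(1000, 1))     # feature spread over [0, 1]
squeezed = rng.uniform(0, 0.05, size=(1000, 1))  # feature squeezed near 0

print("Balanced feature: condition number = {:.1f}".format(mse_condition_number(balanced)))
print("Squeezed feature: condition number = {:.1f}".format(mse_condition_number(squeezed)))
```

A feature whose scaled values all sit close to 0 produces a far larger condition number, which means a far flatter direction in the loss landscape and slower convergence for the associated weight.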

Now, having taken a look at the loss landscapes, let's return to training multivariate linear regression models and try them out on a few sample inputs to see how well they predict the prices of various diamonds.

In [None]:
X_df = df.drop(["price", "depth", "table", "x", "y", "z"], axis=1)
X = X_df.values
Y = df["price"]

steps = 250
model = LinRegressMSE(0.1, 0.9, steps)
model.train(X, Y)

closed_form_model = OLS()
closed_form_model.fit(X, Y)

print("Gradient Descent Model")
ModelEvaluator(model).print_metrics(X, Y)
print("\nClosed-Form Solution")
ModelEvaluator(closed_form_model).print_metrics(X, Y)

For the above model, I discarded some of the variables I'm not personally familiar with when it comes to diamonds so that I'm able to make predictions on diamonds with parameters that I understand. Fortunately, the model above trained on only `[carat, cut, color, clarity]` is able to converge in just 250 steps, and has an adjusted $R^2$ not much smaller than the adjusted $R^2$ of the closed-form solution trained on all available input parameters (0.9513 vs. 0.9528). It doesn't seem that much information was lost from the removed columns.
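
For readers unfamiliar with the metric, adjusted $R^2$ penalizes $R^2$ for the number of input variables, which is what makes it fair to compare a four-variable model against one trained on every column. The sketch below uses the usual textbook definition, $1 - (1 - R^2)\frac{n - 1}{n - p - 1}$, and may differ in detail from what `ModelEvaluator` computes. The function names and sample values are made up for illustration.

```python
import numpy as np

def r_squared(y_true, y_pred):
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total variation
    return 1 - ss_res / ss_tot

def adjusted_r_squared(y_true, y_pred, n_features):
    n = len(y_true)
    r2 = r_squared(y_true, y_pred)
    return 1 - (1 - r2) * (n - 1) / (n - n_features - 1)

y_true = [1.0, 2.0, 3.0, 4.0, 5.0]
y_pred = [1.1, 1.9, 3.2, 3.8, 5.0]

print("R^2:          {:.4f}".format(r_squared(y_true, y_pred)))
print("Adjusted R^2: {:.4f}".format(adjusted_r_squared(y_true, y_pred, n_features=1)))
```

With tens of thousands of rows and only a handful of variables, the penalty is tiny, so the adjusted value stays very close to the plain $R^2$.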

Now, let's feed the model some example diamonds and see what their expected prices are.

In [None]:
cut = ["Ideal", "Premium", "Very Good", "Good", "Fair"]
color = ["D", "E", "F", "G", "H", "I", "J"]
clarity = ["IF", "VVS1", "VVS2", "VS1", "VS2", "SI1", "SI2", "I1"]

def pretty_predict(model, carat_val, cut_val, color_val, clarity_val):
    expected_price = model.predict(np.array([carat_val, cut.index(cut_val), color.index(color_val), clarity.index(clarity_val)]))
    print("Carat - {}, Cut - {}, Color - {}, Clarity - {} --- Expected Price: {}".format(carat_val, cut_val, color_val, clarity_val, expected_price))
    
pretty_predict(model, 1.0, "Ideal", "D", "VVS2")
pretty_predict(model, 1.0, "Ideal", "J", "VVS2")
pretty_predict(model, 1.0, "Fair", "D", "VVS2")
pretty_predict(model, 1.0, "Ideal", "D", "I1")
pretty_predict(model, 2.0, "Fair", "J", "I1")
pretty_predict(model, 1.5, "Fair", "J", "I1")

The weight of a diamond is important, but it is not all that matters when it comes to value. A `1.5, Fair, J, I1` diamond is only worth ~6,950 while a `1.0, Ideal, D, VVS2` diamond is worth ~7,750. The color also seems to have a significant impact on price, while the cut doesn't seem to have that much of an effect. Overall, the model seems to do a decent job predicting the prices of diamonds, but a different type of model - perhaps a neural network - could likely achieve better results.

In this notebook, we took a deep look at linear regression and gradient descent and encountered many of the common pitfalls that can occur when training a model with gradient descent. Common gradient descent hyperparameters such as learning rate, decay, and momentum were discussed, implemented, and utilized, and several linear regression models were trained on various subsets of the features of the dataset.

Linear regression models are intuitive, fast, and good for predicting the values of continuous features, but their simplicity often leads to them being outperformed by more complex models. Still, linear regression can be a good way of modeling a dataset while avoiding having to worry about as many hyperparameters as with other models. Linear regression can also be applied to various loss functions besides the popular mean-squared error, although gradient descent or a similar iterative approach is necessary to train a model on many of those other error functions.

Hopefully this notebook was useful to anyone who took a look! If this was helpful, I recommend looking at my past notebooks for similar tutorials of popular machine learning models and techniques. Any feedback is also appreciated.