# CS-6580 Lecture 13 - Gradient Descent
**Dylan Zwick**

*Weber State University*

## Gradient Descent

In our last lecture, we discussed how in linear regression the goal is to minimize the sum of squared errors, while in logistic regression the goal is to maximize the likelihood of the observations. However, unlike with linear regression, it's generally very hard, and in most cases essentially impossible, to find the optimal coefficients exactly. Consequently, the maximum likelihood must be approximated using numerical techniques, and today we'll discuss the foundation for most of them: gradient descent.

The basic idea is this: suppose you're on a mountain and you're trying to get down it as quickly as possible, but you're in a fog and can only see a couple of feet in front of you. You don't know where the lowest point on the mountain is, but you do know the direction that will get you down the farthest on your next step. So, you just take a step in that direction, look around, figure out the step that will take you down the farthest from your new position, take that step, and so on. Every time, you're taking the step that gets you down the farthest locally, and hopefully this approach will get you to the bottom of the mountain ASAP.

&nbsp;

<center>
    <img src="Gradient_Descent.gif" width="600">
</center>

&nbsp;

Stated more mathematically, at any given point it's frequently straightforward to find the gradient (the vector of partial derivatives) of our function. This gradient tells us the direction of greatest increase at that point, and so if we're looking to maximize a function, we can take a step in that direction. If we're looking to minimize the function, we can take a step in the other direction.
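
As a minimal sketch of that idea, the cell below minimizes a simple bowl-shaped function by repeatedly stepping opposite its gradient. The function `f`, its minimum at $(3, -1)$, the starting point, and the step size are all assumptions chosen for illustration; they are not part of the logistic regression problem.

```python
# Hypothetical example function: a bowl with its minimum at (3, -1)
def f(point):
    x, y = point
    return (x - 3.0) ** 2 + (y + 1.0) ** 2

# Gradient of f: the vector of partial derivatives with respect to x and y
def grad_f(point):
    x, y = point
    return [2.0 * (x - 3.0), 2.0 * (y + 1.0)]

# Repeatedly step opposite the gradient in order to minimize f
def gradient_descent(start, step_size, n_steps):
    point = list(start)
    for _ in range(n_steps):
        g = grad_f(point)
        point = [p - step_size * g_i for p, g_i in zip(point, g)]
    return point

minimum = gradient_descent([0.0, 0.0], 0.1, 100)
print(minimum, f(minimum))
```

To maximize instead of minimize, the only change would be to add `step_size * g_i` rather than subtract it.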

For the sigmoid function, the partial derivative with respect to a coefficient is relatively easy to calculate:

&nbsp;

<center>
    $\displaystyle \frac{\partial S}{\partial c_{i}} = \frac{X_{i}e^{-(c_{1}X_{1}+c_{2}X_{2}+\cdots+c_{n}X_{n}+b)}}{\left(1 + e^{-(c_{1}X_{1}+c_{2}X_{2}+\cdots+c_{n}X_{n}+b)}\right)^{2}} = X_{i}\left(1-\frac{1}{1 + e^{-(c_{1}X_{1}+c_{2}X_{2}+\cdots+c_{n}X_{n}+b)}}\right)\left(\frac{1}{1 + e^{-(c_{1}X_{1}+c_{2}X_{2}+\cdots+c_{n}X_{n}+b)}}\right) = X_{i}(1-S)S$
</center>
&nbsp;

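Here we treat the intercept as $c_{0} = b$ with $X_{0} = 1$, so the same formula also gives the partial derivative with respect to $b$.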
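
One way to sanity-check the formula $X_{i}(1-S)S$ is to compare it against a finite-difference approximation. The sketch below does that for a pair of made-up coefficients and inputs; the helper names and the sample values are assumptions for illustration only.

```python
from math import exp

# Sigmoid of the linear combination c_1*X_1 + ... + c_n*X_n + b
def sigmoid_output(coefs, b, X):
    z = b + sum(c * x for c, x in zip(coefs, X))
    return 1.0 / (1.0 + exp(-z))

# Analytic partial derivative with respect to c_i: X_i * (1 - S) * S
def analytic_partial(coefs, b, X, i):
    S = sigmoid_output(coefs, b, X)
    return X[i] * (1.0 - S) * S

# Central finite-difference approximation of the same partial derivative
def numeric_partial(coefs, b, X, i, h=1e-6):
    up = list(coefs)
    up[i] += h
    down = list(coefs)
    down[i] -= h
    return (sigmoid_output(up, b, X) - sigmoid_output(down, b, X)) / (2.0 * h)

# Hypothetical coefficients, intercept, and inputs
coefs, b, X = [0.5, -0.25], 0.1, [2.0, 1.0]
for i in range(len(coefs)):
    print(i, analytic_partial(coefs, b, X, i), numeric_partial(coefs, b, X, i))
```

The two columns should agree to several decimal places, which is a quick check that the derivative was worked out correctly.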

From this, we can fairly easily calculate the gradient at any point $\textbf{X}$. Now, an important hyperparameter here is the step size: how far you move in the direction of (or opposite to) the gradient at each step. This is sometimes known as the *learning rate*, and its study is an important field within machine learning. For today, we'll assume the step size is constant, but please note that isn't always the case.

Something to note is that the number of terms in the product that makes up our likelihood is equal to the number of observations in our dataset. For a relatively small dataset this isn't a problem, but for a huge dataset this can make each likelihood calculation rather involved, and iterating those calculations extremely resource intensive.
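
To make that cost concrete, here is a minimal sketch of the likelihood as a product with one factor per observation, alongside the equivalent log-likelihood as a sum. The four rows and the coefficients are made up for illustration, and the function names are assumptions, not something defined elsewhere in this lecture.

```python
from math import exp, log

# Predicted probability of class 1 for one row of the form [x1, x2, label]
def probability(row, coefficients):
    z = coefficients[0]
    for i in range(len(row) - 1):
        z += coefficients[i + 1] * row[i]
    return 1.0 / (1.0 + exp(-z))

# Likelihood: a product with one factor per observation
def likelihood(data, coefficients):
    total = 1.0
    for row in data:
        p = probability(row, coefficients)
        total *= p if row[-1] == 1 else (1.0 - p)
    return total

# Log-likelihood: the same information as a sum, again one term per observation
def log_likelihood(data, coefficients):
    total = 0.0
    for row in data:
        p = probability(row, coefficients)
        total += log(p) if row[-1] == 1 else log(1.0 - p)
    return total

# Hypothetical data and coefficients
rows = [[1.0, 2.0, 0], [2.0, 1.0, 0], [4.0, 1.5, 1], [5.0, 0.5, 1]]
coefs = [-3.0, 1.0, 0.0]
print(likelihood(rows, coefs), log_likelihood(rows, coefs))
```

Either way, every evaluation touches every row, so the work per gradient step grows with the size of the dataset.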

#### Stochastic Gradient Descent

One way to get around this issue is with *stochastic gradient descent*. The idea behind stochastic gradient descent is that (in the extreme) we only see how our prediction works on one single observation, and then we adjust our parameters based upon our prediction for that observation. In other words, we nudge each coefficient up or down depending on the direction and size of the error on that one observation.

This can make the optimization problem *much* less computationally intensive, although it does introduce some potential problems, and makes your optimization dependent on the order in which you evaluate the observations. A middle ground between pure gradient descent and extreme stochastic gradient descent is batched stochastic gradient descent (often called mini-batch gradient descent), where we divide our data into disjoint groups and take a gradient step for each group in turn.
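
The "disjoint groups" part of the batched approach can be sketched in a couple of lines. The helper name `make_batches` and the batch size are assumptions for illustration; in practice the rows would usually be shuffled first so that the batches don't depend on the original ordering.

```python
# Split rows into disjoint, consecutive batches of at most batch_size rows
def make_batches(rows, batch_size):
    return [rows[i:i + batch_size] for i in range(0, len(rows), batch_size)]

rows = list(range(10))  # stand-in for ten observations
batches = make_batches(rows, 4)
print(batches)  # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```

A batch size of 1 recovers the extreme stochastic case, and a batch size equal to the length of the dataset recovers ordinary gradient descent.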

Let's take a look at how we could implement logistic regression with stochastic gradient descent. In today's lecture, instead of relying on our standard libraries like pandas and numpy, we're going to try something different and (as much as possible) write everything from scratch, just to see how it would be done.

First, let's take a look at a simple sample dataset with two predictive inputs and a binary categorical output.

In [None]:
# Sample Data
dataset = [[2.7810836,2.550537003,0],
[1.465489372,2.362125076,0],
[3.396561688,4.400293529,0],
[1.38807019,1.850220317,0],
[3.06407232,3.005305973,0],
[7.627531214,2.759262235,1],
[5.332441248,2.088626775,1],
[6.922596716,1.77106367,1],
[8.675418651,-0.242068655,1],
[7.673756466,3.508563011,1]]

For an observation and a set of coefficients for our logistic regression model, we can write a function that returns the predicted probability (between 0 and 1) that the observation belongs to class 1. Rounding that probability gives the predicted class (0 or 1).

In [None]:
from math import exp

# Make a prediction with coefficients
def predict(row, coefficients):
    X = coefficients[0]
    for i in range(len(row)-1):
        X += coefficients[i + 1] * row[i]
    return 1.0 / (1.0 + exp(-X))

Let's try this out with some initial coefficients all set to $1.0$.

In [None]:
coef = [1.0, 1.0, 1.0]

In [None]:
for row in dataset:
    yhat = predict(row, coef)
    print("Expected=%.3f, Predicted=%.3f [%d]" % (row[-1], yhat, round(yhat)))

With these coefficients the model predicts 1 for every observation, regardless of the inputs. Let's see if we can use stochastic gradient descent to do better.

First, let's write a function for performing stochastic gradient descent.

In [None]:
# Estimate logistic regression coefficients using stochastic gradient descent
def coefficients_sgd(train, l_rate, n_epoch):
    coef = [1.0 for i in range(len(train[0]))]  # Start every coefficient at 1.0
    for epoch in range(n_epoch):
        for row in train:
            yhat = predict(row, coef)
            error = row[-1] - yhat
            coef[0] = coef[0] + l_rate * error * yhat * (1.0 - yhat)
            for i in range(len(row)-1):
                coef[i + 1] = coef[i + 1] + l_rate * error * yhat * (1.0 - yhat) * row[i]
    return coef
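
To see what a single pass through the inner loop does, here is one update worked out on the first row of the sample data. Starting from coefficients of $0.0$ is an assumption made here only so the arithmetic is easy to follow, and `sgd_step` is a hypothetical helper that mirrors the update rule above.

```python
from math import exp

# Same prediction function as earlier in the lecture
def predict(row, coefficients):
    X = coefficients[0]
    for i in range(len(row) - 1):
        X += coefficients[i + 1] * row[i]
    return 1.0 / (1.0 + exp(-X))

# One SGD update for a single row, using the same rule as coefficients_sgd
def sgd_step(row, coef, l_rate):
    yhat = predict(row, coef)
    error = row[-1] - yhat
    scale = l_rate * error * yhat * (1.0 - yhat)
    new_coef = [coef[0] + scale]  # intercept update
    for i in range(len(row) - 1):
        new_coef.append(coef[i + 1] + scale * row[i])
    return new_coef

row = [2.7810836, 2.550537003, 0]
coef = [0.0, 0.0, 0.0]  # assumed starting point for this illustration
new_coef = sgd_step(row, coef, 0.3)
# yhat = 0.5, error = -0.5, so scale = 0.3 * (-0.5) * 0.5 * 0.5 = -0.0375
print(new_coef)
print(predict(row, coef), predict(row, new_coef))
```

The label is 0, so the update pushes every coefficient down and the predicted probability for this row drops below 0.5. Note that the factor `yhat * (1.0 - yhat)` is the sigmoid derivative $S(1-S)$ from earlier, which is what the chain rule produces when the quantity being minimized is the squared error between the label and the prediction; the gradient of the log-likelihood with respect to $c_{i}$ would be just `error * row[i]`, without that factor.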

Then, let's give it a run with a learning rate of $0.3$, and 100 epochs (trips through the dataset).

In [None]:
l_rate = 0.3
n_epoch = 100
coef = coefficients_sgd(dataset, l_rate, n_epoch)
print(coef)

Hopefully, this improves the predictive power of our model.

In [None]:
for row in dataset:
    yhat = predict(row, coef)
    print("Expected=%.3f, Predicted=%.3f [%d]" % (row[-1], yhat, round(yhat)))

Nailed it! Alright, let's try something a bit more challenging. We're going to build a logistic regression model using the Pima Indians diabetes dataset, which you can read more about [here](https://www.kaggle.com/datasets/uciml/pima-indians-diabetes-database).

First, let's load the data. We'll write a function to do that, along with a function for converting strings to floats.

In [None]:
from csv import reader

# Load a CSV file
def load_csv(filename):
    dataset = list()
    with open(filename, 'r') as file:
        csv_reader = reader(file)
        for row in csv_reader:
            if not row:
                continue
            dataset.append(row)
    return dataset

# Convert string column to float
def str_column_to_float(dataset, column):
    for row in dataset:
        row[column] = float(row[column].strip())

Next, we'll write some functions to handle some data normalization. We'll want to do this because we're using the same step size (learning rate) for each input variable, and a single step size is a poor fit when the inputs are on very different scales.

In [None]:
# Find the min and max values for each column
def dataset_minmax(dataset):
    minmax = list()
    for i in range(len(dataset[0])):
        col_values = [row[i] for row in dataset]
        value_min = min(col_values)
        value_max = max(col_values)
        minmax.append([value_min, value_max])
    return minmax

# Rescale dataset columns to the range 0-1
def normalize_dataset(dataset, minmax):
    for row in dataset:
        for i in range(len(row)):
            row[i] = (row[i] - minmax[i][0]) / (minmax[i][1] - minmax[i][0])
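
Here is a small, self-contained sketch of the same min-max rescaling on three made-up rows. It differs from the functions above in two ways that are assumptions of this example: it returns a new list instead of modifying the rows in place, and it maps a constant column to $0.0$ rather than dividing by zero.

```python
# Column-wise minimum and maximum, in the same spirit as dataset_minmax
def column_minmax(rows):
    return [[min(col), max(col)] for col in zip(*rows)]

# Rescale each column to the range 0-1; constant columns become 0.0
# (the guard against a zero range is an addition for this sketch)
def rescale(rows, minmax):
    scaled = []
    for row in rows:
        new_row = []
        for value, (lo, hi) in zip(row, minmax):
            new_row.append((value - lo) / (hi - lo) if hi > lo else 0.0)
        scaled.append(new_row)
    return scaled

# Hypothetical rows: the third column is constant
rows = [[50.0, 0.2, 7.0], [100.0, 0.4, 7.0], [150.0, 0.8, 7.0]]
scaled = rescale(rows, column_minmax(rows))
for row in scaled:
    print(row)
```

After rescaling, the smallest value in each non-constant column is 0 and the largest is 1, so one learning rate is applied to inputs of comparable size.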

Next, we'll create our own function for randomly splitting the data into folds for cross validation.

In [None]:
from random import randrange
# Split a dataset into k folds
def cross_validation_split(dataset, n_folds):
    dataset_split = list()
    dataset_copy = list(dataset)
    fold_size = int(len(dataset) / n_folds)
    for i in range(n_folds):
        fold = list()
        while len(fold) < fold_size:
            index = randrange(len(dataset_copy))
            fold.append(dataset_copy.pop(index))
        dataset_split.append(fold)
    return dataset_split
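
One detail worth noticing is that every fold gets exactly `int(len(dataset) / n_folds)` rows, so when the dataset size isn't a multiple of `n_folds`, a few rows end up in no fold at all. The sketch below repeats the same splitting logic but also returns the unused rows; the function name, the 11-row stand-in dataset, and the seed are assumptions for illustration.

```python
from random import Random

# Same splitting logic as cross_validation_split, but also returns unused rows
def split_with_leftover(rows, n_folds, rng):
    remaining = list(rows)
    fold_size = int(len(rows) / n_folds)
    folds = []
    for _ in range(n_folds):
        fold = []
        while len(fold) < fold_size:
            # Draw a random remaining row without replacement
            fold.append(remaining.pop(rng.randrange(len(remaining))))
        folds.append(fold)
    return folds, remaining

rng = Random(1)  # hypothetical fixed seed so the split is repeatable
rows = list(range(11))  # stand-in for 11 observations
folds, leftover = split_with_leftover(rows, 3, rng)
print([len(fold) for fold in folds], len(leftover))  # [3, 3, 3] 2
```

The folds are disjoint because each row is popped from the working copy as soon as it is chosen.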

A simple function for calculating the accuracy of a binary prediction model.

In [None]:
# Calculate accuracy percentage
def accuracy_metric(actual, predicted):
    correct = 0
    for i in range(len(actual)):
        if actual[i] == predicted[i]:
            correct += 1
    return correct / float(len(actual)) * 100.0

A logistic regression function that creates a model on training data, and returns predictions on test data.

In [None]:
# Logistic Regression Algorithm With Stochastic Gradient Descent
def logistic_regression(train, test, l_rate, n_epoch):
    predictions = list()
    coef = coefficients_sgd(train, l_rate, n_epoch)
    for row in test:
        yhat = predict(row, coef)
        yhat = round(yhat)
        predictions.append(yhat)
    return predictions

A function that evaluates a binary prediction algorithm using cross validation: each fold is held out in turn as the test set, while the remaining folds are combined into the training set.

In [None]:
# Evaluate an algorithm using a cross validation split
def evaluate_algorithm(dataset, algorithm, n_folds, *args):
    folds = cross_validation_split(dataset, n_folds)
    scores = list()
    for fold in folds:
        train_set = list(folds)
        train_set.remove(fold)
        train_set = sum(train_set, [])
        test_set = list()
        for row in fold:
            row_copy = list(row)
            test_set.append(row_copy)
            row_copy[-1] = None
        predicted = algorithm(train_set, test_set, *args)
        actual = [row[-1] for row in fold]
        accuracy = accuracy_metric(actual, predicted)
        scores.append(accuracy)
    return scores

Finally, we'll test the logistic regression algorithm on the diabetes dataset.

In [None]:
from random import seed

# Test the logistic regression algorithm on the diabetes dataset
seed(1)
# load and prepare data
filename = 'pima-indians-diabetes.csv'
dataset = load_csv(filename)
for i in range(len(dataset[0])):
    str_column_to_float(dataset, i)
# normalize
minmax = dataset_minmax(dataset)
normalize_dataset(dataset, minmax)
# evaluate algorithm
n_folds = 5
l_rate = 0.1
n_epoch = 100
scores = evaluate_algorithm(dataset, logistic_regression, n_folds, l_rate, n_epoch)
print('Scores: %s' % scores)
print('Mean Accuracy: %.3f%%' % (sum(scores)/float(len(scores))))