# Stochastic Gradient Descent

---

In this tutorial we will develop the basic SGD algorithm in the context of two common problems: simple linear regression and logistic regression for binary classification.

---

_**NB:** If you need to install a package in `Python` you may use the command `!pip install <package-name>` in a code cell._

<small>**Credits:** This tutorial is based on [Joseph Boyd](https://jcboyd.github.io/assets/lsml2018/stochastic_gradient_descent.html) and [Francis Bach](https://www.di.ens.fr/~fbach/learning_theory_class_2024/index.html) tutorials.</small>

## Preliminaries

Before turning to the SGD algorithm (our focus today), let's take a look at the deterministic gradient descent algorithm on a very simple example.

In [None]:
import numpy as np
import pandas as pd
import random as rd

import matplotlib.pyplot as plt
import numpy.linalg as la
import scipy.special as sp

Consider the two functions $f$ and $g$ defined below.

##### <span style="color:purple">**Todo:** Visualize these two functions.</span>

In [None]:
f = lambda x: 2*x**4 + -2*x**3 -4*x**2 + 6
g = lambda x: x**2

In [None]:
## TO BE COMPLETED ##
# Plot graph



In [None]:
# %load solutions/prelim/plot_graph.py

The `grad_f` and `grad_g` functions defined below can be used to calculate the gradient of $f$ and $g$ respectively, at any point $x$.

##### <span style="color:purple">**Todo:** Given a gradient function (typically `grad_f` or `grad_g`), write a gradient descent algorithm.</span>

The function takes the following arguments:
* `gradient` : gradient function
* `start` : initial value for the GD
* `learn_rate` : learning rate for the GD
* `n_iter` : number of iterations
* `vect` : whether the algorithm should also return the successive values of theta (`vect=True`) or not (`vect=False`)
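
The update rule behind such a function is $\theta \leftarrow \theta - \eta\,\nabla(\theta)$, where $\eta$ is the learning rate. The following is only a minimal sketch of one possible implementation, not the reference solution: the name `gradient_descent_sketch` and the choice of returning a `(theta, history)` pair when `vect=True` are assumptions made for illustration.

```python
def gradient_descent_sketch(gradient, start, learn_rate, n_iter, vect=False):
    """Plain gradient descent: theta <- theta - learn_rate * gradient(theta)."""
    theta = start
    history = [theta]  # all iterates, starting point included
    for _ in range(n_iter):
        theta = theta - learn_rate * gradient(theta)
        history.append(theta)
    # Assumed convention: with vect=True, also return every iterate
    return (theta, history) if vect else theta


# Example on g(x) = x**2, whose gradient is 2*x and whose minimizer is 0
theta_g = gradient_descent_sketch(lambda x: 2 * x, start=10.0, learn_rate=0.2, n_iter=50)
print(theta_g)
```

Keeping the history is mostly useful for plotting the path taken by the iterates on top of the graph of the function.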

In [None]:
grad_f = lambda x: 4*2*x**3 + -2*3*x**2 -4*2*x
grad_g = lambda x: 2*x

In [None]:
## TO BE COMPLETED ##

def gradient_descent(gradient, start, learn_rate, n_iter, vect=False):
    theta = ...
    [...]
    return theta

In [None]:
# %load solutions/prelim/gradient_descent.py

##### <span style="color:purple">**Todo:** Go back to the figures above and visualize the gradient descent for $f$ and $g$ respectively.</span>

You can vary the starting point and the learning rate.

In [None]:
## TO BE COMPLETED ##
# Influence of starting position, learning rate


In [None]:
# %load solutions/prelim/gradient_descent_viz.py

> Comments?

## Linear regression

We now move on to a so-called supervised statistical learning context: we have access to a labeled dataset $(x,y)$, and we want to learn the relationship between the observations $x$ and their labels $y$.

The aim of this section is to apply stochastic gradient descent to linear regression. Here, an observation corresponds to a pair $(X_i,y_i)$, where $X_i=(x_{i1},\ldots,x_{ip})$ is the row matrix containing the $p$ measurements for experiment $i\in\{1,\ldots,n\}$, and $y_i$ is the label associated with these measurements. 
For example, we might want to explain the number of sales of a product as a function of the amount of advertising carried out for it (See Section 4).

We start with a synthetic data set.

In [None]:
from sklearn.datasets import make_regression

In [None]:
XX, yy = make_regression(n_samples=100, 
                         n_features=1,
                         n_informative=1,
                         noise=20,
                         random_state=0)

fig = plt.figure(figsize=(5, 5))
ax = fig.add_subplot(1, 1, 1, xlabel='x', ylabel='y')
ax.scatter(XX, yy, alpha=0.5)
ax.set_title('Our dataset')

plt.show()

This is typically a simple linear regression problem: for each sample $i\in\{1,\ldots,n\}$, we observe a pair $(x_i,y_i)\in\mathbb{R}^p\times\mathbb{R}$.

##### <span style="color:purple">**Question:** What is the dimension $p$ of the observations?</span>

> Answer

Our aim is to learn a linear function $f_\theta(x)=\theta_0+\theta_1x$ such that for all $i$, $y_i\simeq f_\theta(x_i)$. Note that this problem is equivalent to asking for $Y\simeq X\theta$, where $Y=(y_i)_i\in\mathbb{R}^n$, $X=(1,X_i)_i\in\mathcal{M}_{n,p+1}(\mathbb{R})$ and $\theta=(\theta_j)_j\in\mathbb{R}^{p+1}$. The matrix $X$ thus constructed is called the _design matrix_.
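
To make the role of the column of ones concrete, here is a small sketch on made-up data (the names `XX_demo`, `X_demo` and `theta_demo`, and the parameter values, are purely illustrative). It shows that the matrix product $X\theta$ reproduces $\theta_0+\theta_1x_i$ row by row.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5
XX_demo = rng.normal(size=(n, 1))                 # n observations, p = 1 feature
X_demo = np.column_stack([np.ones(n), XX_demo])   # prepend the intercept column

theta_demo = np.array([2.0, -1.0])                # (theta_0, theta_1), arbitrary values
y_hat = X_demo @ theta_demo                       # theta_0 + theta_1 * x_i for each i

print(X_demo.shape)
```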

##### <span style="color:purple">**Todo:** Construct matrix `X` and vector `y` from `XX` and `yy` previously defined.</span>

In [None]:
## TO BE COMPLETED ##
# Build X and y


In [None]:
# %load solutions/reg_lin/build_Xy.py

### Least Squares Estimator

We will measure the learning error via the mean square error (MSE) loss:
$$ \mathcal{L}_{MSE}(\theta) \,=\, \frac1n\Vert X\theta-y\Vert^2 \,=\, \frac1n\sum_{i=1}^n (\theta_0+\theta_1x_i-y_i)^2 $$
Note that the gradient of $\mathcal{L}_{MSE}$ is given by 
$$\nabla\mathcal{L}_{MSE}(\theta)=\frac2n X^\top(X\theta-y)$$
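
A gradient formula like this one can be sanity-checked numerically with central finite differences. The sketch below does so on random data; the helper names `mse_demo` and `grad_mse_demo` are hypothetical and only serve this check.

```python
import numpy as np

def mse_demo(X, y, theta):
    residual = X @ theta - y
    return residual @ residual / len(y)

def grad_mse_demo(X, y, theta):
    # (2/n) X^T (X theta - y)
    return 2.0 / len(y) * X.T @ (X @ theta - y)

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(50), rng.normal(size=50)])
y = rng.normal(size=50)
theta = np.array([0.5, -1.5])

# Central differences along each coordinate direction
eps = 1e-6
numeric = np.array([
    (mse_demo(X, y, theta + eps * e) - mse_demo(X, y, theta - eps * e)) / (2 * eps)
    for e in np.eye(2)
])
print(np.allclose(numeric, grad_mse_demo(X, y, theta), atol=1e-6))
```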

##### <span style="color:purple">**Todo:** Deduce a closed form for $\theta^\ast$, the minimizer of this loss function.</span>

> Answer

##### <span style="color:purple">**Question:** Does this computation seem reasonable to you when many different variables are observed, _i.e._ if $p$ is large?</span>

> Answer

The high-dimensional case for $\theta$ is common in deep learning. In this experiment, we will assume that this is the context. However, for practical reasons, we will work in a low-dimensional setting (specifically, $p = 1$!). Still, the insights you gain here will remain valid in higher dimensions.

##### <span style="color:purple">**Todo:** Calculate the optimal parameter vector $\theta^\ast$ associated with the formula you have just found. Then make the prediction $\hat{y}=X\theta^\ast$ on the training data using the learned parameters.</span>

In [None]:
## TO BE COMPLETED ##
# Optimal theta


In [None]:
# %load solutions/reg_lin/optimal_theta.py

##### <span style="color:purple">**Todo:** Visualize the regression line associated with `y_pred_star` on the `XX`, `yy` data set.</span>

In [None]:
## TO BE COMPLETED ##
# Regression line


In [None]:
# %load solutions/reg_lin/optimal_reglin.py

##### <span style="color:purple">**Todo:** Implement the MSE loss.</span>

Compare the MSE value associated with $\theta^\ast$ to the noise used to generate the data.

In [None]:
## TO BE COMPLETED ##

def mean_square_error(X, y, theta):
    return ...

mse = 0 ####
print('MSE: %.02f RMSE: %.02f' % (mse, np.sqrt(mse)))

In [None]:
# %load solutions/reg_lin/mean_square_error.py

> Comments

### Batch Gradient Descent

We recall that the gradient of $\mathcal{L}_{MSE}$ is given by
$$\nabla\mathcal{L}_{MSE}(\theta)=\frac2n X^\top(X\theta-y)$$

##### <span style="color:purple">**Todo:** Implement a function to compute the gradient.</span>

Check that this gradient vanishes at $\theta^\ast$.
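
As an illustration of this check on synthetic data, the sketch below obtains a least-squares solution from NumPy's `np.linalg.lstsq` solver (used here in place of a hand-derived formula, which is an assumption of this example) and evaluates the gradient there. Up to floating-point error, its norm should be zero.

```python
import numpy as np

rng = np.random.default_rng(1)
X = np.column_stack([np.ones(100), rng.normal(size=100)])
y = 3.0 + 2.0 * X[:, 1] + rng.normal(scale=0.5, size=100)

# Least-squares solution computed by NumPy's solver
theta_ls, *_ = np.linalg.lstsq(X, y, rcond=None)

# Gradient of the MSE at that point: (2/n) X^T (X theta - y)
grad_at_min = 2.0 / len(y) * X.T @ (X @ theta_ls - y)
print(np.linalg.norm(grad_at_min))
```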

In [None]:
## TO BE COMPLETED ##

def grad_MSE(X, y, theta):
    return ...

In [None]:
# %load solutions/reg_lin/grad_MSE.py

Recall that the stochastic gradient algorithm applies to a problem of the form
$$ \min_{\theta\in\Theta}\,\mathbb{E}_{u\sim\mathbb{P}}[j(\theta,u)] \,. $$
In this tutorial, we consider the MSE defined for the observed dataset $(X,y)$ by
$$ MSE(\theta) \,=\, \frac1n\Vert{X\theta-y}\Vert^2 \,=\, \frac1n\sum_{i=1}^n (X_i\theta -y_i)^2 \,. $$
By the Monte Carlo approximation, this empirical mean approximates an expectation over the data distribution $\mathbb{P}$:
$$ MSE(\theta) \,\simeq\, \mathbb{E}_{u\sim\mathbb{P}}\big[(u_1\theta-u_2)^2\big] \,=\, \int (u_1\theta-u_2)^2 \,\mathrm{d}\mathbb{P}(u) \,, \qquad u=(u_1,u_2) \,. $$
We can now apply stochastic gradient descent with $j(\theta,u)=(u_1\theta-u_2)^2$. In practice, sampling $u=(u_1,u_2)\in\mathbb{R}^p\times\mathbb{R}$, where $u$ follows the same distribution as our original dataset, amounts to picking a point of the dataset uniformly at random.
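
The reason a single randomly picked point is a sensible substitute for the whole dataset is that its gradient is an unbiased estimate of the full gradient: averaging the per-sample gradients over all indices gives back $\nabla MSE(\theta)$ exactly. A small numerical sketch on synthetic data (all variable names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = 1.0 + 2.0 * X[:, 1] + rng.normal(size=n)
theta = np.zeros(2)

# Full gradient of the MSE
full_grad = 2.0 / n * X.T @ (X @ theta - y)

# Gradient of the single-sample loss (X_i theta - y_i)^2, one row per sample
per_sample_grads = 2.0 * (X @ theta - y)[:, None] * X

# The mean over a uniformly drawn index i recovers the full gradient
print(np.allclose(per_sample_grads.mean(axis=0), full_grad))

i = rng.integers(n)                    # one stochastic gradient draw
stochastic_grad = per_sample_grads[i]
```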

##### <span style="color:purple">**Todo:** Implement the batch gradient descent algorithm for the linear regression problem.</span>

You can draw inspiration from the code written in the preliminary section. Your function must take as input
* `X` the design matrix for the MSE
* `y` the label vector
* `theta_init` the initial value of $\theta$
* `learn_rate` the learning rate
* `n_iter` the number of iterations
* `vect`, defined as before
  
In particular, we do not give the gradient as an input value: we consider here that we are dealing exclusively with linear regression, and that we can therefore use the `grad_MSE` function without further precaution.

We recall that the batch version of gradient descent corresponds to a stochastic gradient descent where the Monte Carlo approximation is computed from the whole dataset.
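
One way this could look is sketched below. It is not the reference solution: the name `batch_gd_mse_sketch`, the inlined gradient and the `(theta, history)` return convention for `vect=True` are assumptions. The usage example compares the result with NumPy's least-squares solver on synthetic data.

```python
import numpy as np

def batch_gd_mse_sketch(X, y, theta_init, learn_rate=1e-1, n_iter=30, vect=False):
    theta = np.asarray(theta_init, dtype=float).copy()
    history = [theta.copy()]
    for _ in range(n_iter):
        grad = 2.0 / len(y) * X.T @ (X @ theta - y)   # gradient on the whole dataset
        theta = theta - learn_rate * grad
        history.append(theta.copy())
    # Assumed convention: with vect=True, also return all iterates
    return (theta, np.array(history)) if vect else theta

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(100), rng.normal(size=100)])
y = 1.0 + 2.0 * X[:, 1] + rng.normal(scale=0.1, size=100)

theta_gd = batch_gd_mse_sketch(X, y, theta_init=np.zeros(2), n_iter=200)
theta_ref, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.allclose(theta_gd, theta_ref, atol=1e-6))
```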

In [None]:
def batch_gradient_descent_MSE(X, y, theta_init, learn_rate=1e-1, n_iter=30, vect=False):
    [...]

In [None]:
# %load solutions/reg_lin/batch_gradient_descent_MSE.py

##### <span style="color:purple">**Todo:** Visualize gradient descent during iterations.</span>

Compare the theta obtained by GD with that obtained by exact calculation.

In [None]:
## TO BE COMPLETED ##
# GD visualization & theta comparison


In [None]:
# %load solutions/reg_lin/batch_gradient_descent_viz.py

##### <span style="color:purple">**Todo:** Visualize the MSE loss function over the iterations for different learning rates.</span>

A log scale on the y-axis can help to visualize the loss better if the learning rates vary "a lot". 

In [None]:
## TO BE COMPLETED ##
# Influence of the learning rates


In [None]:
# %load solutions/reg_lin/batch_gradient_descent_lr.py

> Comments

### Stochastic Gradient Descent

Here, we have no particular difficulty in calculating the gradient using the whole observed dataset.
However, _to practice_, we will code a "pure" stochastic version of this gradient.
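
A "pure" stochastic step replaces the full gradient by the gradient of the loss at a single, randomly drawn observation. The sketch below is one possible version under stated assumptions: the name `sgd_mse_sketch` and the `seed` argument are additions for reproducibility, and the `vect` option is left out for brevity.

```python
import numpy as np

def sgd_mse_sketch(X, y, theta_init, learn_rate=1e-2, n_iter=30, seed=0):
    rng = np.random.default_rng(seed)
    theta = np.asarray(theta_init, dtype=float).copy()
    for _ in range(n_iter):
        i = rng.integers(len(y))                      # draw one observation uniformly
        grad_i = 2.0 * (X[i] @ theta - y[i]) * X[i]   # gradient of (X_i theta - y_i)^2
        theta = theta - learn_rate * grad_i
    return theta

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(100), rng.normal(size=100)])
y = 1.0 + 2.0 * X[:, 1] + rng.normal(scale=0.1, size=100)

theta_sgd = sgd_mse_sketch(X, y, theta_init=np.zeros(2), n_iter=5000)
print(theta_sgd)
```

With a constant learning rate the iterates do not settle exactly on the minimizer but keep fluctuating around it, which is worth keeping in mind when comparing with the batch version.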

##### <span style="color:purple">**Todo:** Implement the stochastic gradient descent algorithm for the linear regression problem.</span>

In [None]:
def stochastic_gradient_descent_MSE(X, y, theta_init, learn_rate=1e-2, n_iter=30, vect=False):
    [...]

In [None]:
# %load solutions/reg_lin/stochastic_gradient_descent_MSE.py

##### <span style="color:purple">**Todo:** Compare the batch gradient descent and the stochastic gradient descent.</span>

In [None]:
## TO BE COMPLETED ##
# Compare gradient descent


In [None]:
# %load solutions/reg_lin/compare_gradient_descent.py

### Your turn!

The `Marketing_Data` dataset contains, for 200 advertising experiments, the social media budget and the resulting sales (in thousands of dollars).


In [None]:
Marketing_Data = pd.read_csv('Marketing_Data.csv')
Marketing_Data.head()

In [None]:
XX = Marketing_Data.iloc[:, :3].to_numpy()
ones_column = np.ones((XX.shape[0], 1))

X_sales = np.hstack((ones_column, XX))
y_sales = Marketing_Data.iloc[:, 3].to_numpy()

##### <span style="color:purple">**Question:** Can you predict the sales from the social media budget?</span>

Answer this question numerically using a suitable algorithm.

## Binary classification

For this problem, we will use the dataset available at [www.di.ens.fr/%7Efbach/orsay2017/data_orsay_2017.mat](http://www.di.ens.fr/%7Efbach/orsay2017/data_orsay_2017.mat)

Although this is a Matlab file, we can use a scipy function to read the data.

In [None]:
from scipy import io

In [None]:
data = io.loadmat('data_orsay_2017.mat')

XX_train, y_train = data['Xtrain'], data['ytrain']
XX_test, y_test = data['Xtest'], data['ytest']

print('XX_train shape: %s' % str(XX_train.shape))
print('y_train shape: %s' % str(y_train.shape))
print('XX_test shape: %s' % str(XX_test.shape))
print('y_test shape: %s' % str(y_test.shape))

In logistic regression, we encode the positive and negative classes as $y\in\{-1,1\}$. 
Thus, 
$$ p(y=1\mid X\theta) = \sigma(X\theta) \qquad\text{and}\qquad p(y=-1 \mid X\theta) = 1-\sigma(X\theta) = \sigma(-X\theta) \,, $$
where $\displaystyle\sigma(x) = \frac{1}{1 + \exp(-x)}$ is the sigmoid function, and $X$ is the design matrix (_i.e._ the observation matrix augmented with a column of ones).

To train the model, we minimize the _binary cross-entropy_ between the predicted distribution and the ground truth, which we can assume to be _one-hot_, assigning all probability to the correct class. This simplifies the loss function to
$$ \mathcal{L}_{BCE}(\theta) \,=\, \frac1n\sum_{i=1}^n \log\big(1+\exp(-y_i\,X_i\theta)\big) \,. $$
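
Evaluating $\log(1+\exp(-m))$ naively overflows when the margin $m=y_i\,X_i\theta$ is very negative. One common workaround, sketched here with the hypothetical name `bce_sketch` and assuming 1-D arrays for `y` and `theta`, is NumPy's `np.logaddexp`, which computes $\log(e^a+e^b)$ in a stable way.

```python
import numpy as np

def bce_sketch(X, y, theta):
    margins = y * (X @ theta)                  # y_i * X_i theta, with y_i in {-1, +1}
    # logaddexp(0, -m) = log(1 + exp(-m)), stable for large |m|
    return np.mean(np.logaddexp(0.0, -margins))

X = np.array([[1.0, 2.0], [1.0, -1.0], [1.0, 0.5]])
y = np.array([1.0, -1.0, 1.0])

print(bce_sketch(X, y, np.zeros(2)))   # equals log(2) when theta = 0
```

At $\theta=0$ every margin is zero, so each term is $\log 2$: a convenient value to test an implementation against.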

##### <span style="color:purple">**Todo:** Implement the loss function of the logistic regression, _i.e._ the binary cross entropy.</span>

In [None]:
def binary_cross_entropy(X, y, theta):
    return ...

In [None]:
# %load solutions/reg_log/binary_cross_entropy.py

##### <span style="color:purple">**Todo:** Implement the gradient of the previous loss.</span>

Note that
$$ \nabla\mathcal{L}_{BCE}(\theta) \,=\, \frac1n\sum_{i=1}^n \frac{-y_i\,X_i^\top}{1+\exp(y_i\,X_i\theta)} \,. $$
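
The sketch below differentiates the averaged loss $\mathcal{L}_{BCE}$ as defined earlier, so it carries the same $\frac1n$ factor as the loss, and checks the result against central finite differences. The names `bce_sketch` and `grad_bce_sketch` are hypothetical, and 1-D arrays are assumed for `y` and `theta`.

```python
import numpy as np

def bce_sketch(X, y, theta):
    return np.mean(np.logaddexp(0.0, -y * (X @ theta)))

def grad_bce_sketch(X, y, theta):
    margins = y * (X @ theta)
    # 1 / (1 + exp(m)) written as exp(-log(1 + exp(m))) to avoid overflow
    weights = np.exp(-np.logaddexp(0.0, margins))
    return -(X.T @ (y * weights)) / len(y)

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(40), rng.normal(size=(40, 2))])
y = rng.choice([-1.0, 1.0], size=40)
theta = rng.normal(size=3)

# Central differences along each coordinate direction
eps = 1e-6
numeric = np.array([
    (bce_sketch(X, y, theta + eps * e) - bce_sketch(X, y, theta - eps * e)) / (2 * eps)
    for e in np.eye(3)
])
print(np.allclose(numeric, grad_bce_sketch(X, y, theta), atol=1e-6))
```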

In [None]:
def grad_BCE(X, y, theta):
    return ...

In [None]:
# %load solutions/reg_log/grad_BCE.py

In [None]:
# Design matrices

X_train = np.concatenate([np.ones((XX_train.shape[0], 1)), XX_train], axis=1)
X_test = np.concatenate([np.ones((XX_test.shape[0], 1)), XX_test], axis=1)

### Mini-Batch Stochastic Gradient Descent

In large-scale applications, computing the full gradient can be computationally expensive. Moreover, using a small sample of size $m<n$ from a large dataset at each iteration is often sufficient for making an accurate descent step. Minibatch gradient descent addresses this by using a subsample of size $m$ at each iteration. In the extreme case where $m= 1$, this method is known as stochastic gradient descent (SGD). As a result, the complexity of the gradient computation is reduced from $\mathcal{O}(np)$ to $\mathcal{O}(mp)$.

In the context of training deep neural networks, minibatch gradient descent (and its variants) has become the most widely used approach. Additionally, the inherent stochasticity of this method can help avoid getting stuck in local minima of non-convex loss functions.

Operationally, the main change is the size of the data fed into the gradient function, referred to as the "batch". The most straightforward strategy for selecting a batch is to cycle through the (pre-shuffled) dataset and slice the next $m$ values.
A full cycle of the training data is known as an _epoch_.
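
The cycling logic can be isolated from the descent itself. The generator below is a sketch of that idea only; the name `cycle_batches_sketch`, the `seed` argument and the choice to shuffle once rather than at every epoch are assumptions. It yields the row indices of each successive batch and wraps around at the end of an epoch.

```python
import numpy as np

def cycle_batches_sketch(n_samples, batch_size, n_iter, seed=0):
    """Yield index arrays that sweep a shuffled dataset batch by batch."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_samples)   # shuffle once up front
    start = 0
    for _ in range(n_iter):
        if start >= n_samples:           # epoch finished: start a new pass
            start = 0
        yield order[start:start + batch_size]
        start += batch_size

batches = list(cycle_batches_sketch(n_samples=10, batch_size=4, n_iter=5))
print([len(b) for b in batches])   # [4, 4, 2, 4, 4]
```

When `batch_size` does not divide the dataset size, the last batch of an epoch is simply shorter, as the printed lengths show.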

##### <span style="color:purple">**Todo:** Implement a cycling strategy for `minibatch_gradient_descent` with a given `batch_size`.</span>

Follow the same structure as previously.

In [None]:
def cycle_minibatch_gradient_descent_BCE(X, y, theta_init, batch_size, learn_rate=1e-2, n_iter=30, vect=False):
    [...]

In [None]:
# %load solutions/reg_log/cycle_minibatch_gradient_descent_BCE.py

##### <span style="color:purple">**Todo:** Evaluate the accuracy of the estimate using the test set.</span>

1. Estimate a parameter $\theta$ using `cycle_minibatch_gradient_descent_BCE`
2. Using the [`expit`](https://docs.scipy.org/doc/scipy/reference/generated/scipy.special.expit.html) function in the `scipy.special` package, compute the output probabilities
3. Make a prediction by thresholding the probabilities at $0.5$: any probability $\geqslant0.5$ is classified as positive, and any probability $<0.5$ as negative
4. Check model accuracy using the [`accuracy_score`](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html) function from `sklearn.metrics`

In [None]:
from sklearn.metrics import accuracy_score

In [None]:
## TO BE COMPLETED ##
# Estimation accuracy 


In [None]:
theta_init = np.random.rand(X_train.shape[1], 1)
batch_size = 20
learn_rate = 1e-2
n_iter = 30

# Estimate theta, then turn the linear scores into probabilities with the sigmoid
theta_cycle_BCE = cycle_minibatch_gradient_descent_BCE(X_train, y_train, theta_init, batch_size, learn_rate, n_iter, vect=False)
probs = sp.expit(X_test.dot(theta_cycle_BCE))
# Threshold at 0.5 to obtain labels in {-1, 1}
y_pred_cycle_BCE = np.where(probs >= 0.5, 1, -1)

print('Accuracy:', accuracy_score(y_test, y_pred_cycle_BCE))

> Comments

##### <span style="color:purple">**Todo:** Visualize the descent curves for different batch sizes.</span>


In [None]:
## TO BE COMPLETED ##
# Batch size comparison


In [None]:
# %load solutions/reg_log/cycle_minibatch_gradient_descent_viz.py

An alternative to our cycling strategy is to randomly sample a batch at each iteration (as we did for the linear regression).
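
Drawing such a batch comes down to sampling row indices. The snippet below is a sketch on made-up data, and sampling without replacement within a batch is one possible choice rather than the only one. It uses NumPy's `Generator.choice` to pick `batch_size` distinct rows.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, batch_size = 100, 20

# Draw batch_size distinct indices uniformly at random (redone at each iteration)
batch_idx = rng.choice(n_samples, size=batch_size, replace=False)

X_demo = rng.normal(size=(n_samples, 3))
X_batch = X_demo[batch_idx]          # rows fed to the gradient function
print(X_batch.shape)
```

Unlike cycling, this does not guarantee that every observation is visited once per epoch, since batches drawn at different iterations may overlap.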

##### <span style="color:purple">**Todo:** Implement a sampling strategy for `minibatch_gradient_descent` with a given `batch_size`.</span>

Compare the two methods.

In [None]:
def shuffle_minibatch_gradient_descent_BCE(X, y, theta_init, batch_size, learn_rate=1e-2, n_iter=30, vect=False):
    [...]

In [None]:
# %load solutions/reg_log/cycle_minibatch_gradient_descent_BCE.py

In [None]:
## TO BE COMPLETED ##
# Comparison of the methods


> Comments?