### The Gradient Descent Optimization Algorithm

<br>

On this page I summarize some of the key ideas behind the gradient descent optimization algorithm. The text and code have been lifted and summarized from the blog post
[An Introduction to Gradient Descent in Python](https://tillbe.github.io/python-gradient-descent.html). I recommend visiting that page for a more in-depth treatment of the subject matter (and for some great plotting examples). 

<br>

<blockquote>*A Machine Learning model is trained by starting with an initial guess for the weights and bias and iteratively **adjusting those guesses** until learning the weights and bias with the lowest possible loss.*<sup>[1](#quote1)</sup> 

</blockquote>

<a name="quote1">1</a>: [source](https://developers.google.com/machine-learning/crash-course/reducing-loss/an-iterative-approach.html)

<br>

* Adjusting guesses is an iterative process:
  * iterate until overall loss stops decreasing or decreases extremely slowly.
  * at each iteration, the true values are subtracted from the guessed (predicted) values. The resulting difference (referred to as *delta* or *error*) is then used to work out how much each model parameter should be adjusted.

* Initialization of the model parameters
 * The first stage in gradient descent is to pick starting values for the model parameters. These are stored in a vector $\Theta$ of size equal to the number of features *n*.
 * Many algorithms initialize $\Theta$ with zeros or random values.
 * The gradient descent algorithm then calculates the gradient of the loss curve at the starting values and updates $\Theta$ by subtracting the gradient, scaled by the learning rate, from its current values. 
   
   * Note: if $\Theta$ is initialized to, say, all zeros, then in the first iteration the predictions will all be zeros. Thus, in the first iteration, the vector of observed differences $\delta$ is equal to $-y$.
   
 * $\Theta$ is used to decide which feature should have the most weight in helping predict the outcome.
 * $\Theta$ does not pertain to any specific training example; rather, it is derived from the training examples as a whole. The size of $\Theta$ is equal to the number of features.
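
To make the shapes concrete, here is a minimal sketch of initialization and a single update. The arrays `X` and `y` and the learning rate `alpha` are made-up values for illustration, not taken from the original post. It shows that $\Theta$ has one entry per feature and that, with zero initialization, the first error vector equals $-y$.

```python
import numpy as np

# Made-up data for illustration: m = 4 training examples, n = 2 features.
X = np.array([[1.0, 2.0],
              [1.0, 3.0],
              [1.0, 4.0],
              [1.0, 5.0]])
y = np.array([5.0, 7.0, 9.0, 11.0])
m, n = X.shape

theta = np.zeros(n)          # one parameter per feature, not per example
pred = X @ theta             # all zeros on the first iteration
delta = pred - y             # therefore delta == -y

alpha = 0.01                 # assumed learning rate
gradient = X.T @ delta / m   # shape (n,)
theta = theta - alpha * gradient

print(theta.shape)                 # (2,)
print(np.array_equal(delta, -y))   # True
```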

<br>

**The Gradient Descent process**

Below is a barebones description of the *gradient descent* process. It omits refinements such as regularization. 

```  
- initialize theta with some values e.g. all zeros
- repeat:
  - multiply the [m x n] matrix X by the [n x 1] theta vector to obtain a [m x 1] vector of predictions 
  - subtract the vector of truth values y from the vector of predictions to get a [m x 1] "delta" vector of errors i.e. one error for each training example.
  - compute gradient:
       
       gradient = X.T @ delta / m  # [n x 1] = [n x m] x [m x 1]
       gradient  = learning rate * gradient
  - theta = theta - gradient
  - compute cost using theta of current iteration e.g. sum the squared error of delta
     
```            
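
Written as equations, the steps above correspond to the update below. This assumes the half mean squared error cost used in the code at the end of this page; the symbol $\alpha$ for the learning rate is borrowed from that code.

$$
\delta = X\Theta - y, \qquad
J(\Theta) = \frac{1}{2m}\,\delta^{T}\delta, \qquad
\nabla_{\Theta} J = \frac{1}{m}\,X^{T}\delta, \qquad
\Theta \leftarrow \Theta - \alpha\,\nabla_{\Theta} J
$$

The factor $\frac{1}{2}$ in the cost cancels the 2 that appears when the squared error is differentiated, which is why the gradient is simply $X^{T}\delta / m$.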

 <br>
 
**Batch vs. Stochastic Gradient Descent** 

In gradient descent, a batch is the total number of examples used to calculate the gradient in a single iteration. For example, the batch can be the entire set of training data. This is a common scenario when implementing gradient descent for, say, educational *Kaggle* competitions. However, in many real-life use cases, data sets contain billions of examples or more. Using all of the data in one batch may cause even a single iteration to take a very long time to compute.
  
By choosing examples at random from the data set, it is possible to estimate (albeit, noisily) a big average from a much smaller one. *Stochastic gradient descent (SGD)* takes this idea to the extreme--it uses only a single example (a batch size of 1) per iteration. Given enough iterations, SGD works but is very noisy. The term "stochastic" indicates that the one example comprising each batch is chosen at random.
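
The idea of estimating a big average from a much smaller one can be illustrated with a small sketch. It compares the gradient computed on the full data set with estimates from a random batch of 100 examples and from a single example. The synthetic data, the batch size of 100, and the helper name `gradient` are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (assumed for illustration): 10,000 examples, 3 features.
m, n = 10_000, 3
X = rng.normal(size=(m, n))
true_theta = np.array([1.0, -2.0, 0.5])
y = X @ true_theta + rng.normal(scale=0.1, size=m)

theta = np.zeros(n)

def gradient(X_batch, y_batch, theta):
    """Gradient of the half mean squared error, averaged over the given batch."""
    delta = X_batch @ theta - y_batch
    return X_batch.T @ delta / len(y_batch)

full = gradient(X, y, theta)                      # uses every example

idx = rng.choice(m, size=100, replace=False)      # random batch of 100
mini = gradient(X[idx], y[idx], theta)            # noisy estimate

i = rng.integers(m)                               # SGD: batch size of 1
single = gradient(X[i:i + 1], y[i:i + 1], theta)  # noisiest estimate

print("full batch :", full)
print("100 samples:", mini)
print("1 sample   :", single)
```

On a typical run the 100-example estimate lands fairly close to the full-batch gradient, while the single-example estimate can be far off in any one iteration.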
 
      
* Some considerations:

  * The gradient vector in the batch approach is averaged over all training examples and has one entry per element of the model parameter vector $\Theta$. If we want the gradient for a single training example *i* (i.e. **stochastic** gradient descent) we could do:
       
        gradient_i = X[i, :] * delta[i]  # [n x 1] = [n x 1] x [1 x 1]
       
     Note how we don't divide by *m*, as we only deal with one training example.
  
  * In the batch approach, gradients are summed as part of the matrix multiplication: 
  
        gradient = X.T @ delta / m  # [n x 1] = [n x m] x [m x 1]
  
     However, when we do **stochastic** gradient descent, we need to **accumulate** the gradients ourselves, e.g.:
   
        Phi[i,:] += gradient_ij
        Theta[:, j] += gradient_ij
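
Below is a minimal sketch of per-example updates for a plain linear model. Rather than accumulating into separate matrices, this version applies each single-example gradient to `theta` immediately, which is a common textbook form of SGD. It also visits the examples in a fresh random order on every pass instead of drawing one at random each time. The synthetic data, `alpha`, and `epochs` are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic, noise-free data (assumed): y depends linearly on 2 features.
m, n = 200, 2
X = rng.normal(size=(m, n))
true_theta = np.array([3.0, -1.0])
y = X @ true_theta

theta = np.zeros(n)
alpha = 0.05      # assumed learning rate
epochs = 20       # assumed number of passes over the data

for _ in range(epochs):
    for i in rng.permutation(m):            # visit examples in random order
        delta_i = X[i, :] @ theta - y[i]    # scalar error for example i
        gradient_i = X[i, :] * delta_i      # shape (n,), no division by m
        theta = theta - alpha * gradient_i  # update after every example

print(theta)  # should end up close to true_theta
```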
       
       
    

In [1]:
# source: https://tillbe.github.io/python-gradient-descent.html

# load the necessary libraries
import numpy as np

# make_regression is imported from sklearn.datasets;
# the old sklearn.datasets.samples_generator module no longer exists in recent releases
from sklearn.datasets import make_regression


# The make_regression() function will create a dataset with a linear relationship between inputs and the outputs.
X, y = make_regression(n_samples = 100, 
                       n_features=3, 
                       n_informative=3, 
                       noise=10,
                       random_state=2015)

print('X shape', X.shape)
print('y shape', y.shape)
# print(np.random.rand(X.shape[1], 1))

def gradient_descent(X, y, iters, alpha):
    """Run batch gradient descent; return final theta, per-iteration costs and sampled predictions."""
    costs = []
    m = y.size  # number of data points

    theta = np.random.rand(X.shape[1])  # shape (n,): one weight per feature

    preds = []  # predictions saved every 25 iterations
    for i in range(iters):
        pred = np.dot(X, theta)  # [m x 1] = [m x n] x [n x 1] i.e. a prediction for each example
        error = pred - y

        cost = np.sum(error ** 2) / (2 * m)  # half mean squared error
        costs.append(cost)

        if i % 25 == 0:
            preds.append(pred)

        gradient = X.T.dot(error) / m  # [n x 1] = [n x m] x [m x 1]
        theta = theta - alpha * gradient  # update

    return theta, costs, preds
 
# add intercept term
X = np.c_[np.ones(X.shape[0]), X] 

alpha = 0.001 # set step-size
iters = 5 # set number of iterations
theta, cost, preds = gradient_descent(X, y, iters, alpha)

print('final theta', theta)
print('final theta shape', theta.shape)


X shape (100, 3)
y shape (100,)
final theta [0.96082907 0.64587089 0.78438197 0.19516988]
final theta shape (4,)
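
The function above always runs a fixed number of iterations. The stopping rule mentioned near the top of this page (iterate until the loss stops decreasing or decreases extremely slowly) could be sketched as follows. The function name `gradient_descent_tol` and the parameters `tol` and `max_iters` are hypothetical, and the data is synthetic rather than the `make_regression` data used above.

```python
import numpy as np

def gradient_descent_tol(X, y, alpha=0.1, tol=1e-8, max_iters=10_000):
    """Batch gradient descent that stops once the cost improves by less than tol."""
    m, n = X.shape
    theta = np.zeros(n)
    costs = []
    for _ in range(max_iters):
        error = X @ theta - y
        cost = np.sum(error ** 2) / (2 * m)
        # Stop when the decrease in cost since the last iteration is tiny.
        if costs and costs[-1] - cost < tol:
            break
        costs.append(cost)
        theta = theta - alpha * (X.T @ error) / m
    return theta, costs

# Synthetic data (assumed for illustration), with an intercept column.
rng = np.random.default_rng(1)
X = np.c_[np.ones(100), rng.normal(size=(100, 3))]
y = X @ np.array([2.0, 1.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=100)

theta, costs = gradient_descent_tol(X, y)
print("iterations run:", len(costs))
print("theta:", theta)
```

The threshold `tol` trades accuracy for run time: a larger value stops earlier with a rougher estimate of the parameters.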
