# Topic 5 -- Advanced Optimizations

Welcome back! In the previous topic, you learned how Neural Networks are formed by **stacking** Logistic Regression layers on top of one another, creating a construct that can learn **much more complicated hypothesis functions**. While this allows you to tackle more challenging problems such as computer vision, having so many $w$ and $b$ parameters means Neural Networks **take much longer to train** and are **prone to overfitting**.

In this notebook, we are going to cover various optimization techniques that can **dramatically speed up learning**, as well as advanced methods to **address overfitting**. Later on, we are going to revisit our **handwritten digits classifier** and see if we can **improve its performance** with our new knowledge.

## Table of Contents

1. [Big Data](#bigdata)
    - [Stochastic Gradient Descent](#sgd)
    - [Mini-Batch Gradient Descent](#minibatch)
    
    
2. [Coding Exercise: Data Generator](#datagen)
    - [Getting Familiar with `yield`](#yield)
    - [Data Generator](#gen)
    
    
3. [Dropout](#dropout)
    - [A New Form of Regularization](#newreg)
    - [Implementation in PyTorch](#dppy)

### Before we Begin...

Let's first import our modules as always. The modules used in this notebook are the same as the ones used previously.

- **NumPy**: Powerful linear algebra library
- **Pandas**: Used for organizing our data
- **scikit-learn**: Abstract machine learning library
- **Matplotlib, Bokeh,** and **Seaborn**: Data visualization libraries
- **utils.py**: A custom Python script that contains functions used in this course.

In [1]:
import numpy as np
import pandas as pd
import sklearn
import matplotlib.pyplot as plt
from bokeh.plotting import figure, show, output_notebook
from utils import *

## Big Data <a name="bigdata"></a>

Up until now, whenever we trained a learning algorithm, whether simple Linear/Logistic Regression or a Neural Network, we passed in the **entire dataset**. Training an ML algorithm by iterating over the entire training dataset at once is called **Batch Gradient Descent**.

So far, batch gradient descent has worked for our purposes. However, it has two main problems in the context of Neural Networks. **Firstly**, batch gradient descent is **slow**: you need to make one pass over the **entire training set** before taking **a single gradient descent step**. The **second**, and more important, issue is that Deep Learning often calls for datasets that are **massive**; a Deep Neural Net may be trained on more than 100,000 hi-res images. A computer often simply **does not have enough memory** to load a dataset that large all at once.

### Stochastic Gradient Descent <a name="sgd"></a>

**Stochastic Gradient Descent**, or **SGD**, is probably something you have seen before. In the PyTorch portions of the course, we've used the `SGD()` optimizer to perform gradient descent on our cost function. <u>To be clear, we have used `SGD` to perform **Batch Gradient Descent** in all of our projects so far</u>. The formal concept of SGD, however, is quite the opposite of batch gradient descent. Instead of feeding forward the entire training set and then updating the weights and biases, SGD **feeds in one training example** at a time and updates the parameters **after every training example**.
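
To make the difference concrete, here is a minimal NumPy sketch (not the course's PyTorch code) that runs one epoch of each method on a one-feature linear model. The toy data, the learning rate, and the helper names `gradients`, `batch_gd_epoch`, and `sgd_epoch` are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data (assumed for illustration): y = 2x + 1 with no noise
X = rng.uniform(-1, 1, size=(50, 1))
Y = 2 * X + 1

def gradients(w, b, x, y):
    """Gradients of the mean squared error for the model y_hat = w*x + b."""
    error = (w * x + b) - y
    return 2 * np.mean(error * x), 2 * np.mean(error)

def batch_gd_epoch(w, b, lr=0.1):
    """One epoch of batch gradient descent: a single update from all examples."""
    dw, db = gradients(w, b, X, Y)
    return w - lr * dw, b - lr * db, 1

def sgd_epoch(w, b, lr=0.1):
    """One epoch of SGD: one update per training example."""
    updates = 0
    for i in rng.permutation(len(X)):      # visit the examples in random order
        dw, db = gradients(w, b, X[i], Y[i])
        w, b = w - lr * dw, b - lr * db
        updates += 1
    return w, b, updates

w_batch, b_batch, n_batch = batch_gd_epoch(0.0, 0.0)
w_sgd, b_sgd, n_sgd = sgd_epoch(0.0, 0.0)
print(n_batch, n_sgd)   # updates per epoch: 1 vs. 50
```

Both functions see the same 50 examples once; the only difference is how often the parameters change along the way.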

##### Question for the Students:

Other than much more frequent parameter updates, what are some other benefits of SGD? What are some drawbacks?

### Mini-Batch Gradient Descent <a name="minibatch"></a>

On one end of the extreme we have batch gradient descent which iterates on the entire training set, and on the other extreme we have Stochastic Gradient Descent that iterates on the individual training examples. **Mini-batch Gradient Descent** is the happy medium in between, where the training set is divided into multiple "mini-batches", and the learning algorithm iterates on those mini-batches.

Mini-batch sizes typically range from 8 to 512. Most people train with mini-batch sizes of 16, 32, 64, or 128, as these strike a good balance between update frequency and memory usage.
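
As a rough illustration of the idea, the sketch below slices a toy training set into consecutive mini-batches. The helper name `split_into_minibatches` and the toy array are assumptions, not part of the course code.

```python
import math
import numpy as np

def split_into_minibatches(X, batch_size):
    """Return consecutive slices of X, each with at most batch_size rows."""
    return [X[start:start + batch_size] for start in range(0, len(X), batch_size)]

X = np.arange(100).reshape(-1, 1)      # toy data: 100 examples, 1 feature
batches = split_into_minibatches(X, batch_size=32)

print(len(batches))                    # 4
print([len(b) for b in batches])       # [32, 32, 32, 4]
print(math.ceil(len(X) / 32))          # 4 parameter updates per epoch
```

When the training set size is not a multiple of the batch size, the last mini-batch comes out smaller than the others.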

##### Question for the Students:

If SGD updates more frequently than mini-batch gradient descent, then why is mini-batch gradient descent the preferred method of training a Neural Network?
    
    
---

## Coding Exercise: Data Generator <a name="datagen"></a>

A **data generator** is a piece of code that helps "funnel" data into the Neural Network. Previously, we did not need one because all we had to do was send the entire training set into the model. Our data generator will **break up** the training set into mini-batches, which are then fed into the model one at a time. In this section you will learn to use the `yield` statement in Python, which turns a function into a **generator** that you can iterate over.


### Getting Familiar with `yield` <a name="yield"></a>

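A function that contains a `yield` statement returns a **generator**, an iterable object that you can step through. You can almost treat `yield` like the `return` statement, except that it hands a value back to the caller without terminating the function: the function pauses, and resumes from the same spot the next time a value is requested.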

In [2]:
def arange():
    n = 1
    yield n
    
    n += 1 
    yield n
        
    n += 1 
    yield n
        
    n += 1 
    yield n
        
    n += 1 
    yield n
        
    n += 1 
    yield n
        
    n += 1 
    yield n
        
    n += 1 
    yield n
    

In the function above, we wrote 8 `yield` statements. Calling this function and assigning the result to a variable creates a **generator** that you can step through using the built-in function `next()`. Try stepping through it 8 times, or even more, to see what happens.

In [3]:
numbers = arange()

print(next(numbers))
print(next(numbers))
print(next(numbers))
print(next(numbers))
print(next(numbers))
print(next(numbers))
print(next(numbers))
print(next(numbers))
print(next(numbers))


1
2
3
4
5
6
7
8


StopIteration: 

As you can see, once we call `next()` on our simple generator for the 9<sup>th</sup> time, it raises a `StopIteration` exception because there are no more `yield` statements to reach. Now that we are familiar with Python generators and the `yield` statement, we can move on to designing our data generator. Let's take a look at our current training loop and see if we can make any modifications. Note that this is **pseudo-code.**

```python
def fit(X_train, Y_train):

    for i in range(epochs):
        zero_grads()
    
        Y_pred = forward_prop(X_train)
        cost = loss_fn(Y_train, Y_pred)
    
        grads = back_prop(cost)
        update_weights(grads)

```

Here, notice that for every iteration, the **entire** training set is passed into the network via the `forward_prop()` function. We want to use **generators** to feed **mini-batches** into the model. 

Previously, we've used **epochs** and **iterations** interchangeably. One epoch is defined as ***a full pass through the training set***. In batch gradient descent, every iteration passes the entire training set through the network, so one iteration is exactly one epoch.

In mini-batch gradient descent, we iterate on the mini-batches, so there are **multiple iterations** (steps) per epoch. Let's say that we have 100 training examples in our training data. If we choose a mini-batch size of 20, each epoch consists of 100 / 20 = 5 steps. With that said, here's the new training function, `fit_minibatch()`:

```python
def train_generator(dataset):
    # ... some code here ...
    yield (X_train_batch, Y_train_batch)

def fit_minibatch(train_generator):
    for i in range(epochs):
        for j in range(steps_per_epoch):
            zero_grads()

            # Pull the next mini-batch from the generator
            X_train_batch, Y_train_batch = next(train_generator)

            Y_pred_batch = forward_prop(X_train_batch)
            cost = loss_fn(Y_train_batch, Y_pred_batch)

            grads = back_prop(cost)
            update_weights(grads)

```
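
The pseudo-code above can be turned into a small runnable sketch. The version below is a stand-in built on several assumptions, not the course implementation: it uses a one-feature linear model in NumPy instead of a Neural Network, and the names `batch_generator` and `fit_minibatch`, the learning rate, the number of epochs, and the toy data are all made up for illustration.

```python
import numpy as np

def batch_generator(X, Y, batch_size):
    """Yield (X_batch, Y_batch) pairs forever, restarting after each full pass."""
    while True:
        for start in range(0, len(X), batch_size):
            yield X[start:start + batch_size], Y[start:start + batch_size]

def fit_minibatch(train_generator, steps_per_epoch, epochs=200, lr=0.1):
    """Fit y_hat = w*x + b with one gradient step per mini-batch."""
    w, b = 0.0, 0.0
    for _ in range(epochs):
        for _ in range(steps_per_epoch):
            X_batch, Y_batch = next(train_generator)
            error = (w * X_batch + b) - Y_batch      # forward pass and residual
            w -= lr * 2 * np.mean(error * X_batch)   # gradient step on w
            b -= lr * 2 * np.mean(error)             # gradient step on b
    return w, b

# Toy data: 100 examples following y = 3x - 1
X_train = np.linspace(-1, 1, 100).reshape(-1, 1)
Y_train = 3 * X_train - 1

batch_size = 20
steps_per_epoch = len(X_train) // batch_size         # 100 / 20 = 5 steps
w, b = fit_minibatch(batch_generator(X_train, Y_train, batch_size), steps_per_epoch)
print(steps_per_epoch, round(float(w), 2), round(float(b), 2))
```

Note that the training loop never touches the full dataset directly; it only asks the generator for the next mini-batch, which is what makes this pattern useful when the data does not fit in memory.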

### Data Generator <a name="gen"></a>

Now it's time to put together what we've learned and create a data generator! First, let's create some simple **training data** we can use to test our generator.

In [5]:
trainset = np.arange(0, 100).reshape(-1, 1)
trainset

array([[ 0],
       [ 1],
       [ 2],
       [ 3],
       [ 4],
       [ 5],
       [ 6],
       [ 7],
       [ 8],
       [ 9],
       [10],
       [11],
       [12],
       [13],
       [14],
       [15],
       [16],
       [17],
       [18],
       [19],
       [20],
       [21],
       [22],
       [23],
       [24],
       [25],
       [26],
       [27],
       [28],
       [29],
       [30],
       [31],
       [32],
       [33],
       [34],
       [35],
       [36],
       [37],
       [38],
       [39],
       [40],
       [41],
       [42],
       [43],
       [44],
       [45],
       [46],
       [47],
       [48],
       [49],
       [50],
       [51],
       [52],
       [53],
       [54],
       [55],
       [56],
       [57],
       [58],
       [59],
       [60],
       [61],
       [62],
       [63],
       [64],
       [65],
       [66],
       [67],
       [68],
       [69],
       [70],
       [71],
       [72],
       [73],
       [74],
       [75],
       [76],

Here, we've created an array with shape of `(100, 1)`. This represents a training set with 100 examples and one feature per example. Next, let's build the generator:

In [16]:
def generator(dataset, batch_size=16):
    """Yield batches of `batch_size` rows forever, wrapping around at the end."""
    lower = 0
    upper = batch_size
    while True:
        if upper <= dataset.shape[0]:  # Normal operation
            yield dataset[lower:upper]
        else:  # Wrap around
            batch = dataset[lower:]

            # How many examples are already in the batch
            already_added = dataset.shape[0] - lower
            # How many still need to be added
            left_in_batch = batch_size - already_added
            # Join the last few training examples with the first few
            batch = np.concatenate((batch, dataset[:left_in_batch]), axis=0)

            upper = left_in_batch
            yield batch

        lower = upper
        upper += batch_size

In [24]:
gen = generator(trainset)

print(next(gen), '\n')
print(next(gen), '\n')
print(next(gen), '\n')
print(next(gen), '\n')
print(next(gen), '\n')
print(next(gen), '\n')
print(next(gen), '\n')
print(next(gen), '\n')
print(next(gen), '\n')


[[ 0]
 [ 1]
 [ 2]
 [ 3]
 [ 4]
 [ 5]
 [ 6]
 [ 7]
 [ 8]
 [ 9]
 [10]
 [11]
 [12]
 [13]
 [14]
 [15]] 

[[16]
 [17]
 [18]
 [19]
 [20]
 [21]
 [22]
 [23]
 [24]
 [25]
 [26]
 [27]
 [28]
 [29]
 [30]
 [31]] 

[[32]
 [33]
 [34]
 [35]
 [36]
 [37]
 [38]
 [39]
 [40]
 [41]
 [42]
 [43]
 [44]
 [45]
 [46]
 [47]] 

[[48]
 [49]
 [50]
 [51]
 [52]
 [53]
 [54]
 [55]
 [56]
 [57]
 [58]
 [59]
 [60]
 [61]
 [62]
 [63]] 

[[64]
 [65]
 [66]
 [67]
 [68]
 [69]
 [70]
 [71]
 [72]
 [73]
 [74]
 [75]
 [76]
 [77]
 [78]
 [79]] 

[[80]
 [81]
 [82]
 [83]
 [84]
 [85]
 [86]
 [87]
 [88]
 [89]
 [90]
 [91]
 [92]
 [93]
 [94]
 [95]] 

[[96]
 [97]
 [98]
 [99]
 [ 0]
 [ 1]
 [ 2]
 [ 3]
 [ 4]
 [ 5]
 [ 6]
 [ 7]
 [ 8]
 [ 9]
 [10]
 [11]] 

[[12]
 [13]
 [14]
 [15]
 [16]
 [17]
 [18]
 [19]
 [20]
 [21]
 [22]
 [23]
 [24]
 [25]
 [26]
 [27]] 

[[28]
 [29]
 [30]
 [31]
 [32]
 [33]
 [34]
 [35]
 [36]
 [37]
 [38]
 [39]
 [40]
 [41]
 [42]
 [43]] 



Awesome, looks like our generator works very well! We will write something similar for our programming project.
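
A more compact way to get the same wrap-around behaviour is to build the row indices with the remainder operator. This is an alternative sketch rather than the version used above, and the name `wrapping_generator` is an assumption.

```python
import numpy as np

def wrapping_generator(dataset, batch_size=16):
    """Yield batches of exactly batch_size rows, wrapping to the start when needed."""
    n = dataset.shape[0]
    lower = 0
    while True:
        # Indices that run past the end wrap back to 0 via the remainder operator
        indices = np.arange(lower, lower + batch_size) % n
        yield dataset[indices]
        lower = (lower + batch_size) % n

trainset = np.arange(0, 100).reshape(-1, 1)
gen = wrapping_generator(trainset)

batches = [next(gen) for _ in range(7)]
print(batches[6].ravel().tolist())
# [96, 97, 98, 99, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]
```

The trade-off is that indexing with an array of indices copies the selected rows, whereas the slice-based version only copies when it has to join the end of the dataset with the beginning.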

---

## Dropout <a name="dropout"></a>

In the previous topics, you learned about **overfitting** and **underfitting**, and how to address them. This is more important than ever when talking about Neural Networks. Since Neural Networks are able to **fit more complex relations**, they are **more prone to overfitting.** This is why, over the years, computer scientists have come up with more advanced methods of regularizing a NN's weights and biases.

### A New Form of Regularization <a name="newreg"></a>

So far, we've seen **L2 regularization** in action, smoothing out high degree polynomials so that they **generalize better**. While this works great for logistic regression, L2 regularization often doesn't work as well for Neural Networks. That's not to say you *shouldn't* use L2; in fact, you'll encounter NNs that use L2 more often than ones that do not.

**Dropout** is a powerful method of regularization that often works better than L2. It works by randomly turning off a subset of neurons during training, which results in a simpler network that is less prone to overfitting. In the figure below, notice how on every iteration, **a different set of neurons is turned off**.

<img src="images/dropout.png" alt="cannot display image">

Besides simplifying the network, which reduces overfitting, dropout also **changes** the network architecture from one iteration to the next. In short, your Neural Network is effectively fitting many different architectures to the same problem.
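
The mechanics can be sketched in a few lines of NumPy. This is a simplified illustration of so-called inverted dropout, where the surviving activations are scaled up during training (PyTorch's dropout uses the same scaling convention). The activation matrix is made up and the helper name `dropout_forward` is hypothetical.

```python
import numpy as np

def dropout_forward(a, drop_prob, rng, training=True):
    """Zero each activation with probability drop_prob and rescale the survivors."""
    if not training or drop_prob == 0.0:
        return a                                   # dropout is disabled at test time
    keep_prob = 1.0 - drop_prob
    mask = rng.random(a.shape) < keep_prob         # True where a neuron is kept
    return a * mask / keep_prob                    # keeps the expected value unchanged

rng = np.random.default_rng(0)
activations = np.ones((4, 5))                      # toy layer output: 4 examples, 5 neurons

print(dropout_forward(activations, 0.4, rng))      # a new random mask on every call
print(dropout_forward(activations, 0.4, rng))
print(dropout_forward(activations, 0.4, rng, training=False))  # input returned as-is
```

Because a fresh mask is drawn on every call, each training step effectively runs a slightly different, thinner network.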

### Implementation in PyTorch <a name="dppy"></a>

In PyTorch, L2 regularization was implemented in the optimizer, meaning that the entire Neural Network shared the same L2 regularization parameter. Dropout gives you more flexibility, as **each layer** can have its own dropout module. Dropout is applied **after** the layer's activation function:

```python
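# Assumes the dropout modules were created once in __init__, for example:
#   self.d1 = nn.Dropout(0.3); self.d2 = nn.Dropout(0.4); self.d3 = nn.Dropout(0.6)
# Building a new Dropout inside forward() would leave it permanently in
# training mode, so model.eval() could not switch it off.
def forward(self, x):
    x = self.z1(x)
    x = self.a1(x)
    x = self.d1(x)  # drop probability 0.3

    x = self.z2(x)
    x = self.a2(x)
    x = self.d2(x)  # drop probability 0.4

    x = self.z3(x)
    x = self.a3(x)
    x = self.d3(x)  # drop probability 0.6

    # Output layer: no dropout
    x = self.z4(x)
    x = self.a4(x)

    return x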
```

Here, we can pass a parameter into each dropout layer which represents the probability that a certain neuron is dropped. Notice that no dropout is applied to the output layer, as you do not want to randomly get rid of class predictions!
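
One common way to wire this up is to create the dropout modules once in `__init__` and reuse them in `forward`, so that `model.train()` and `model.eval()` can switch them on and off. The sketch below assumes a small fully connected network; the class name `Net`, the layer sizes, and the dropout probability are illustrative assumptions rather than the course's model.

```python
import torch
import torch.nn as nn

class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.z1 = nn.Linear(20, 16)
        self.a1 = nn.ReLU()
        self.d1 = nn.Dropout(p=0.3)    # drops each activation with probability 0.3
        self.z2 = nn.Linear(16, 3)     # output layer: no dropout afterwards

    def forward(self, x):
        x = self.d1(self.a1(self.z1(x)))
        return self.z2(x)

torch.manual_seed(0)
model = Net()
x = torch.randn(8, 20)

model.train()                          # dropout active: a new mask on every pass
train_out_1, train_out_2 = model(x), model(x)

model.eval()                           # dropout disabled: passes are deterministic
with torch.no_grad():
    eval_out_1, eval_out_2 = model(x), model(x)

print(torch.equal(eval_out_1, eval_out_2))   # True
```

Remember to call `model.eval()` before validating or making predictions; otherwise neurons keep being dropped at random and the outputs become noisy.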