<a href="https://colab.research.google.com/github/spatank/CIS-522/blob/main/W4_Tutorial1_SPP.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# CIS-522 Week 4 Part 1
# Optimization

__Instructor__: Lyle Ungar

__Content creator:__ Rongguang Wang

__Content reviewer:__ Pooja Consul


---
# Intro: Optimization and why it matters

Video to be watched **before** the pod meets.


In [None]:
#@title Video : Introduction video
from IPython.display import YouTubeVideo
video = YouTubeVideo(id="cqQ7dVSYn7c", width=854, height=480, fs=1)
print("Video available at https://youtube.com/watch?v=" + video.id)

import time
try: t0;
except NameError: t0=time.time()

video

## Objectives for today

We show how gradient descent can be tweaked using minibatches, adaptive learning rates, and other techniques to speed up the optimization process, and hint at the theory behind it.

0.   Make sure to optimize the right thing!
1.   The optimization landscape; geometric intuition behind Stochastic Gradient Descent (SGD) and momentum
2.   Select batch size for minibatch gradient descent 
3.   Know batch normalization strengths and weaknesses

In [None]:
#@markdown What is your Pennkey and pod? (text, not numbers, e.g. bfranklin)
my_pennkey = '' #@param {type:"string"}
my_pod = 'Select' #@param ['Select', 'euclidean-wombat', 'sublime-newt', 'buoyant-unicorn', 'lackadaisical-manatee','indelible-stingray','superfluous-lyrebird','discreet-reindeer','quizzical-goldfish','astute-jellyfish','ubiquitous-cheetah','nonchalant-crocodile','fashionable-lemur','spiffy-eagle','electric-emu','quotidian-lion']


## Recap the experience from last week

What did you learn last week? What questions do you have? [15 min discussion]

In [None]:
learning_from_previous_week = '' #@param {type:"string"}

*Estimated time: 20 minutes since start*

---
# Setup
Note that some of the code for today can take up to an hour to run. We have therefore "hidden" that code and shown the resulting outputs.

[Here](https://docs.google.com/presentation/d/1NSE9VQPKhWQMlRniuxrUiKeVbjhEf7_rxjAwbQ-qzeE/edit?usp=sharing) are the slides for today's videos (in case you want to take notes). **Do not read them now.**

In [None]:
# imports
from __future__ import print_function
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torchvision import datasets, transforms
from torch.optim.lr_scheduler import StepLR
from torch.utils.data import Dataset
import time
import numpy as np
import matplotlib.pyplot as plt
from collections import defaultdict
import requests
import io
from urllib.request import urlopen


In [None]:
# @title Figure Settings
import ipywidgets as widgets
%matplotlib inline 
fig_w, fig_h = (8, 6)
plt.rcParams.update({'figure.figsize': (fig_w, fig_h)})
%config InlineBackend.figure_format = 'retina'
SMALL_SIZE = 12


plt.style.use("https://raw.githubusercontent.com/NeuromatchAcademy/"
              "course-content/master/nma.mplstyle")

# plt.rcParams.update(plt.rcParamsDefault)
# plt.rc('font', size=SMALL_SIZE)          # controls default text sizes
# plt.rc('axes', titlesize=SMALL_SIZE)     # fontsize of the axes title
# plt.rc('axes', labelsize=SMALL_SIZE)    # fontsize of the x and y labels
# plt.rc('xtick', labelsize=SMALL_SIZE)    # fontsize of the tick labels
# plt.rc('ytick', labelsize=SMALL_SIZE)    # fontsize of the tick labels
# plt.rc('legend', fontsize=SMALL_SIZE)    # legend fontsize
# plt.rc('figure', titlesize=SMALL_SIZE)  # fontsize of the figure title

---
# Section 0: Before we get to work...
# A cautionary tale
Generally speaking, when optimizing, make sure that you:
- optimize for the right thing, and
- are aware of the potential unintended consequences that can arise from your optimization scheme



Consider the following example:

You are optimizing a food delivery service that will deliver 12 meals over a 3-hour evening period. Each person you deliver to will give you a rating of 1-5 stars, mostly based on whether the food is on time or late. Assume customers start at 3 stars and deduct one star for each quarter hour that the delivery is late. You have all 12 meals at hand (they are precooked and delivered cold for people to heat up), and it takes 15 minutes to deliver each meal.
On days that things go right, you’re perfectly on time with all meals. 

If you get delayed by 15 minutes at the start of the night, what is an optimal strategy for maximizing the number of stars you get?

What might be a better loss function?

[~3 minutes discussion]

In [None]:
strategy = '' #@param {type:"string"}
loss_fn = '' #@param {type:"string"}

## Question of the week

**How can we improve gradient descent?** [10 min discussion]

Our aim is to find the lowest point of the loss function. However, the actual direction to this minimum is not known. We can only look locally, repeatedly taking small steps in the direction of steepest descent, that is, the negative of the gradient of the loss function w.r.t. the weights.

We want to solve for
$$\min_wf(w)$$
where the loss function $f$ is a continuous and differentiable function.

Gradient descent:
$$w_{t+1}=w_t-\eta\nabla f(w_t)$$
where 

*   $w_{t+1}$ is the updated weight after the $t$-th iteration,
*   $w_t$ is the weight before the $t$-th iteration,
*   $\eta$ is the step size,
*   $\nabla f(w_t)$ is the gradient of the loss function $f$ with respect to all the weights $w_j$, $df/dw_j$, evaluated with the current weights $w_t$.

In standard gradient descent, the loss function $f$ is the loss (L2 or maxent) between the neural net's output and the correct answer, averaged over all the observations. In stochastic gradient descent, the loss is computed for just a single observation. In mini-batch gradient descent, it is averaged over all of the points in the batch, $B$.
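
To make the update rule concrete, here is a minimal sketch in plain Python. The quadratic loss $f(w)=(w-3)^2$, the helper name `gradient_descent`, and the values of `eta` and `steps` are illustrative assumptions, not part of the tutorial's exercises.

```python
def gradient_descent(grad, w0, eta, steps):
    """Iterate w_{t+1} = w_t - eta * grad(w_t) and return every iterate."""
    ws = [w0]
    for _ in range(steps):
        ws.append(ws[-1] - eta * grad(ws[-1]))
    return ws


# Illustrative loss f(w) = (w - 3)^2, whose gradient is 2 * (w - 3).
# Its minimum is at w = 3.
def grad_f(w):
    return 2.0 * (w - 3.0)


trajectory = gradient_descent(grad_f, w0=0.0, eta=0.1, steps=50)
print("first iterates:", trajectory[:4])
print("last iterate:  ", trajectory[-1])
```

With this step size, each update shrinks the distance to the minimum by a constant factor, so the iterates approach $w=3$ geometrically. Try a much larger `eta` to see the iterates overshoot.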



*Estimated time: 35 minutes since start*

---
# Section 1: Why is optimization hard ? 

[~2min discussion]

In [None]:
#@title Video: Example of an optimization landscape

try: t1;
except NameError: t1=time.time()

video = YouTubeVideo(id="g0zOEcPix2w", width=854, height=480, fs=1)
print("Video available at https://youtube.com/watch?v=" + video.id)
video

Hint: think about what the optimization landscape can look like for more complex functions. Then take a look at the video and the interactive plot below to better understand why it is hard to find the global minimum and not get stuck in a local minimum.

[Interactive visualization](https://losslandscape.com/explorer)


So what do you think about the difficulty of optimization now?

[~2min discussion]

In [None]:
#@title Video : The difficulty of training a deep neural network
video = YouTubeVideo(id="68VFZeWWe-s", width=854, height=480, fs=1)
print("Video available at https://youtube.com/watch?v=" + video.id)
video

As you've just seen, the function whose minimum we are looking for can have a very complex landscape.
  

In [None]:
#@markdown **Student response**: Can you come up with some basic characteristics that we need for a good gradient descent algorithm ? 
characteristics_for_gd = '' #@param {type:"string"}

*Estimated time: 45 minutes since start*

---
# Section 2: Minibatch stochastic gradient descent (SGD)

In [None]:
#@title Video: Minibatch

try: t2;
except NameError: t2=time.time()

video = YouTubeVideo(id="l4n7BZjNbTI", width=854, height=480, fs=1)
print("Video available at https://youtube.com/watch?v=" + video.id)
video

In stochastic gradient descent, we replace the actual gradient vector with a stochastic estimation of the gradient vector. Specifically for a neural network, the stochastic estimation uses the gradient of the loss for a single data point (single instance).

Given $f_i=l(x_i, y_i, w)$, the expected value of the $t$-th step of SGD is the same as the $t$-th step of full gradient descent.

$$\mathbb{E}[w_{t+1}]=w_t-\eta \mathbb{E}[\nabla f_i(w_t)]=w_t-\eta\nabla f(w_t)$$

where $i$ is chosen uniformly at random, so that $\nabla f_i$ is a noisy but unbiased estimator of $\nabla f$.

$$w_{t+1}=w_t-\eta\nabla f_i(w_t)$$

We update the weights according to the gradient over $f_i$ (as opposed to the gradient over the total loss $f$).
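
The unbiasedness claim can be checked numerically. The sketch below assumes a toy one-parameter least-squares problem with $f_i(w)=\tfrac12(w x_i-y_i)^2$; the data, the helper names `grad_i` and `full_grad`, and the sample counts are all made up for illustration.

```python
import random

random.seed(0)

# Toy data (an assumption for illustration): y is roughly 2 * x plus noise.
xs = [random.uniform(-1.0, 1.0) for _ in range(200)]
ys = [2.0 * x + random.gauss(0.0, 0.1) for x in xs]


def grad_i(w, i):
    """Gradient of the single-example loss f_i(w) = 0.5 * (w * x_i - y_i)^2."""
    return (w * xs[i] - ys[i]) * xs[i]


def full_grad(w):
    """Gradient of the total loss f, the average of all f_i."""
    return sum(grad_i(w, i) for i in range(len(xs))) / len(xs)


w = 0.5
# Draw many single-example gradients with i chosen uniformly at random.
samples = [grad_i(w, random.randrange(len(xs))) for _ in range(20000)]
sgd_mean = sum(samples) / len(samples)

print("full gradient:             ", full_grad(w))
print("mean of sampled gradients: ", sgd_mean)
print("one sampled gradient:      ", samples[0])
```

Any single sampled gradient can be far from the full gradient, but their average settles close to it, which is what the expectation above states.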

SGD advantages:
*   The noise in the SGD update can prevent convergence to a bad (shallow) local minimum.
*   It is drastically cheaper to compute (as you don’t go over all data points).


### Minibatching

Often we are able to make better use of our hardware by using mini batches instead of single instances. We compute the loss over a mini-batch -- a set of randomly selected instances instead of calculating it over just one instance. This reduces the noise in the step update.

Given the $t$-th minibatch $B_t$ consisting of $k$ observations (so $|B_t|=k$):

$$w_{t+1}=w_t-\eta \frac{1}{|B_t|}\sum_{i\in B_t}\nabla f_i(w_t)$$
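
To see how averaging over a minibatch reduces the noise in the step, the following sketch measures the spread of the minibatch gradient estimate for several batch sizes. It assumes a toy one-parameter least-squares problem; the data and the helper names `minibatch_grad` and `grad_std` are hypothetical.

```python
import random
import statistics

random.seed(0)

# Toy data (an assumption for illustration): y is roughly 2 * x plus noise.
n = 1000
xs = [random.uniform(-1.0, 1.0) for _ in range(n)]
ys = [2.0 * x + random.gauss(0.0, 0.1) for x in xs]


def minibatch_grad(w, batch):
    """Average gradient of f_i(w) = 0.5 * (w * x_i - y_i)^2 over the batch."""
    return sum((w * xs[i] - ys[i]) * xs[i] for i in batch) / len(batch)


def grad_std(w, k, trials=2000):
    """Standard deviation of the size-k minibatch gradient estimate."""
    estimates = [minibatch_grad(w, random.sample(range(n), k))
                 for _ in range(trials)]
    return statistics.pstdev(estimates)


std_by_k = {k: grad_std(0.5, k) for k in (1, 10, 100)}
for k, s in std_by_k.items():
    print("batch size {:>3}: gradient std = {:.4f}".format(k, s))
```

The spread shrinks as the batch grows, at the price of more gradient evaluations per update.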


In [None]:
#@markdown How would the plot of training error vs epochs differ for minibatch gradient descent when compared to stochastic gradient descent?
sgd_vs_minibtach_plot = '' #@param {type:"string"}

#@markdown What are the advantages of minibatch gradient descent over stochastic gradient descent? (Select all that apply)
noisy_learning_process = False #@param {type:"boolean"}
computationally_efficient = False #@param {type:"boolean"}
increased_model_update_frequency  = False #@param {type:"boolean"}
memory_efficient = False #@param {type:"boolean"}


## Exercise 1: Finding the optimal minibatch size 

### Exercise 1.1

One of the main constraints of training deep neural networks is the relatively limited size of GPU memory. Being able to quickly estimate whether your minibatch fits in that memory will save you time and out-of-memory errors.

What do we need to store at training time?
- outputs of intermediate layers (forward pass)
- model parameters
- error signal at each neuron
- the gradient of parameters
- any extra memory needed by the optimizer (e.g. for momentum)




Fully connected layers
- #weights = #outputs x #inputs
- #biases = #outputs


This is dominated by the weights and their gradients. (You can confirm that there are far fewer node outputs than weights.) Assume we need to store the weights, their gradients and momentum, at 4 bytes/weight.  
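
The counting rule can be sketched as a small helper. The layer sizes below are hypothetical and deliberately not those of the exercise model. The sketch assumes three stored copies per parameter (weight, gradient, momentum) at 4 bytes each, and takes 1 MB to be $10^6$ bytes.

```python
def fc_param_count(layer_sizes):
    """Weights plus biases for a stack of fully connected layers."""
    total = 0
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        total += n_out * n_in  # #weights = #outputs x #inputs
        total += n_out         # #biases  = #outputs
    return total


def training_megabytes(n_params, copies=3, bytes_per_value=4):
    """Memory for `copies` stored values per parameter, in MB (10^6 bytes)."""
    return n_params * copies * bytes_per_value / 1e6


# Hypothetical network: 1000 inputs -> 500 hidden units -> 10 outputs.
layer_sizes = [1000, 500, 10]
n_params = fc_param_count(layer_sizes)
megabytes = training_megabytes(n_params)
print(n_params, "parameters,", megabytes, "MB")
```

Swap in the layer sizes of your own model to get a quick estimate before launching a run.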


In [None]:
#@markdown How many megabytes is this for the model specified in exercise 1.2?

megabytes_1 = '' #@param {type:"string"}

#@markdown If we also store a gradient for every observation in a minibatch of size 50 (to allow parallel processing), how many megabytes will now be needed?

megabytes_2 = '' #@param {type:"string"}


### Exercise 1.2

We find the optimal minibatch size using a 2-layer neural network on the handwritten digit classification dataset (MNIST). There are 10 classes (0 - 9) in the dataset. We use the stochastic gradient descent (SGD) algorithm for training.

Plot test accuracy as a function of minibatch size (with constant wall time); explain the pattern.
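
Before looking at the results, it may help to count how many weight updates one epoch contains for each batch size. This sketch assumes the standard MNIST training split of 60,000 images and that the last, smaller batch is kept rather than dropped.

```python
import math

n_train = 60000  # size of the standard MNIST training split
batch_sizes = [8, 16, 32, 64, 256, 512, 1024]

# One update per minibatch; the final partial batch still counts as one update.
updates_per_epoch = {b: math.ceil(n_train / b) for b in batch_sizes}

for b, n_updates in updates_per_epoch.items():
    print("batch size {:>4}: {:>5} updates per epoch".format(b, n_updates))
```

Keep these counts in mind when you interpret the accuracy curve.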

In [None]:
class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.fc2 = nn.Linear(128, 10)
        self.fc3 = nn.Linear(784, 128)

    def forward(self, x):
        x = torch.flatten(x, 1)
        x = self.fc3(x)
        x = F.relu(x)
        x = self.fc2(x)
        output = F.log_softmax(x, dim=1)
        return output

def train(args, model, device, train_loader, optimizer, epoch):
    model.train()
    for batch_idx, (data, target) in enumerate(train_loader):
        data, target = data.to(device), target.to(device)
        optimizer.zero_grad()
        output = model(data)
        loss = F.nll_loss(output, target)
        loss.backward()
        optimizer.step()
        if batch_idx % args['log_interval'] == 0:
            print('Train Epoch: {} [{}/{} ({:.0f}%)]\tLoss: {:.6f}'.format(
                epoch, batch_idx * len(data), len(train_loader.dataset),
                100. * batch_idx / len(train_loader), loss.item()))

def test(model, device, test_loader):
    model.eval()
    test_loss = 0
    correct = 0
    with torch.no_grad():
        for data, target in test_loader:
            data, target = data.to(device), target.to(device)
            output = model(data)
            test_loss += F.nll_loss(output, target, reduction='sum').item()  # sum up batch loss
            pred = output.argmax(dim=1, keepdim=True)  # get the index of the max log-probability
            correct += pred.eq(target.view_as(pred)).sum().item()

    test_loss /= len(test_loader.dataset)

    print('\nTest set: Average loss: {:.4f}, Accuracy: {}/{} ({:.4f}%)\n'.format(
        test_loss, correct, len(test_loader.dataset),
        100. * correct / len(test_loader.dataset)))
    return 100. * correct / len(test_loader.dataset)
    
def main(args):
    use_cuda = not args['no_cuda'] and torch.cuda.is_available()
    torch.manual_seed(args['seed'])
    device = torch.device('cuda' if use_cuda else 'cpu')

    train_kwargs = {'batch_size': args['batch_size']}
    test_kwargs = {'batch_size': args['test_batch_size']}
    if use_cuda:
        cuda_kwargs = {'num_workers': 1,
                       'pin_memory': True,
                       'shuffle': True}
        train_kwargs.update(cuda_kwargs)
        test_kwargs.update(cuda_kwargs)

    transform=transforms.Compose([
        transforms.ToTensor(),
        transforms.Normalize((0.1307,), (0.3081,))
        ])
    train_loader = torch.utils.data.DataLoader(datasets.MNIST('../data', train=True, download=True,
                       transform=transform),**train_kwargs)
    test_loader = torch.utils.data.DataLoader(datasets.MNIST('../data', train=False,
                       transform=transform), **test_kwargs)

    model = Net().to(device)
    optimizer = optim.SGD(model.parameters(), lr=args['lr'], momentum=args['momentum'])

    acc_list, time_list = [], []
    start_time = time.time()
    for epoch in range(1, args['epochs'] + 1):
        train(args, model, device, train_loader, optimizer, epoch)
        time_list.append(time.time()-start_time)
        acc = test(model, device, test_loader)
        acc_list.append(acc)

    return acc_list, time_list

Training takes over 20 minutes. Please skip the cells below for now and come back if time allows.

In [None]:
# @markdown ### Train (run me)
# Training settings
args = {'batch_size': 32,
        'test_batch_size': 1000,
        'epochs': 3,
        'lr': 0.01,
        'momentum': 0.9,
        'no_cuda': False,
        'seed': 1,
        'log_interval': 100
        }

batch_size = [8, 16, 32, 64, 256, 512, 1024]
acc_dict = {}
test_acc = []

for i in range(len(batch_size)):
    args['batch_size'] = batch_size[i]
    acc, timer = main(args)
    acc_dict['acc'+str(batch_size[i])] = acc
    acc_dict['time'+str(batch_size[i])] = timer
    test_acc.append(acc[-1])

In [None]:
# @markdown ### Plot (run me)
with plt.xkcd():
    plt.plot(batch_size, test_acc, linewidth=2)
    plt.title('Optimal Minibatch Size')
    plt.ylabel('Test Accuracy (%)')
    plt.xscale('log')
    plt.xlabel('Batch Size')
    plt.savefig('minibatch.png')
    plt.show()

Plot the optimal batch size curve by running the cell below.

In [None]:
url = "https://raw.githubusercontent.com/CIS-522/course-content/main/tutorials/W4_Optimization/static/W4_Tutorial1_Exercise1_minibatch.png"
img = plt.imread(requests.get(url, stream=True).raw)
plt.imshow(img)
plt.axis('off')

In [None]:
#@markdown How did the convergence speed vary with batch size? Why?
convergence_speed = '' #@param {type:"string"}

*Estimated time: 70 minutes since start*

---
# Section 3: Batch normalization

In [None]:
#@title Video: Batch Normalization

try: t3;
except NameError: t3=time.time()

video = YouTubeVideo(id="FAnd9Ra7v-E", width=854, height=480, fs=1)
print("Video available at https://youtube.com/watch?v=" + video.id)
video

Rather than improving the optimization algorithms, batch normalization improves the network structure itself by adding additional layers in between existing layers. The goal is to improve the optimization and generalization performance.

In neural networks, we typically alternate linear (weighted summation) operations with non-linear operations, the activation functions, such as ReLU. The most common practice is to put the normalization between the linear layers and the activation functions.

More formally, normalization is as follows:
$$\tilde x_j = a\frac{x_j-\mu_j}{\sigma_j}+b$$
where
*   $x_j$ is the output of a neuron or, equivalently, the input to the next layer,
*   $\tilde x_j$ is that same feature after being normalized,
*   $\mu_j$ is the mean of the feature $x_j$ over the minibatch,
*   $\sigma_j$ is the estimate of the standard deviation of $x_j$ over the minibatch (with $\epsilon$ added, so we don't divide by zero),
*   $a$ is the learnable scaling factor,
*   $b$ is the learnable bias term.
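
The normalization formula can be sketched for a single feature $x_j$ in plain Python. The helper name `batch_norm_feature`, the example activations, and the default `eps` are illustrative assumptions. The sketch uses the population variance of the minibatch, and it treats $a$ and $b$ as fixed numbers, whereas in a network they would be learned.

```python
import statistics


def batch_norm_feature(xs, a=1.0, b=0.0, eps=1e-5):
    """Normalize one feature over a minibatch: a * (x - mu) / sigma + b."""
    mu = statistics.fmean(xs)
    var = statistics.pvariance(xs, mu)
    sigma = (var + eps) ** 0.5  # eps keeps us from dividing by zero
    return [a * (x - mu) / sigma + b for x in xs]


# Made-up activations of one neuron for a minibatch of four examples.
activations = [2.0, 4.0, 6.0, 8.0]

normalized = batch_norm_feature(activations)
scaled = batch_norm_feature(activations, a=2.0, b=1.0)

print("normalized:", normalized)
print("with a=2, b=1:", scaled)
```

With $a=1$ and $b=0$ the output has roughly zero mean and unit standard deviation; other values of $a$ and $b$ move the output to standard deviation $a$ and mean $b$.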

Batch normalization tries to reduce “internal covariate shift”. Internal covariate shift is the change in the distribution of network activations due to the change in parameters during training. In neural networks, the output of the first layer feeds into the second layer, the output of the second layer feeds into the third, and so on. When the parameters of a layer change, so does the distribution of inputs to subsequent layers. These shifts in input distributions can be problematic for neural networks, especially deep neural networks that could have a large number of layers. Batch normalization tries to mitigate this. You can check out [this](https://arxiv.org/abs/1502.03167) paper where the idea of mitigating internal covariate shift with batch normalization was first introduced.


The advantages of BN are as follows:

*   Networks with normalization layers are easier to optimize, allowing for the use of larger learning rates, speeding up the training of neural networks.
*   The mean/std deviation estimates are noisy due to the randomness of the samples in batch. This extra “noise” sometimes results in better generalization. Normalization has a regularization effect.
*   Normalization reduces sensitivity to weight initialization.


In [None]:
#@markdown Why do we need learnable parameters a and b for batch normalization? Why isn't the unit gaussian form sufficient?
batch_norm_ab = '' #@param {type:"string"}


## Exercise 2: The joys and perils of batch normalization

We implement 4 networks: 2-layer without batch norm, 2-layer with batch norm, 5-layer without batch norm, and 5-layer with batch norm, to see how BN works in a shallow network and in a deep one.

In [None]:
class BNShallowNet(nn.Module):
    def __init__(self):
        super(BNShallowNet, self).__init__()
        self.fc1 = nn.Linear(784, 128)
        self.fc2 = nn.Linear(128, 10)
        self.bn = nn.BatchNorm1d(128)
        self.dropout = nn.Dropout(0.5)
        
    def forward(self, x):
        x = torch.flatten(x, 1)
        x = self.fc1(x)
        x = self.bn(x)
        x = F.relu(x)
        x = self.dropout(x)
        x = self.fc2(x)
        output = F.log_softmax(x, dim=1)
        return output

class BNDeepNet(nn.Module):
    def __init__(self):
        super(BNDeepNet, self).__init__()
        self.fc1 = nn.Linear(784, 128)
        self.fc2 = nn.Linear(128, 64)
        self.fc3 = nn.Linear(64, 32)
        self.fc4 = nn.Linear(32, 32)
        self.fc5 = nn.Linear(32, 10)
        self.bn1 = nn.BatchNorm1d(128)
        self.bn2 = nn.BatchNorm1d(64)
        self.bn3 = nn.BatchNorm1d(32)
        self.bn4 = nn.BatchNorm1d(32)
        self.dropout = nn.Dropout(0.5)
        
    def forward(self, x):
        x = torch.flatten(x, 1)
        x = self.fc1(x)
        x = self.bn1(x)
        x = F.relu(x)
        x = self.fc2(x)
        x = self.bn2(x)
        x = F.relu(x)
        x = self.fc3(x)
        x = self.bn3(x)
        x = F.relu(x)
        x = self.fc4(x)
        x = self.bn4(x)
        x = F.relu(x)
        x = self.dropout(x)
        x = self.fc5(x)
        output = F.log_softmax(x, dim=1)
        return output

class DeepNet(nn.Module):
    def __init__(self):
        super(DeepNet, self).__init__()
        self.fc1 = nn.Linear(784, 128)
        self.fc2 = nn.Linear(128, 64)
        self.fc3 = nn.Linear(64, 32)
        self.fc4 = nn.Linear(32, 32)
        self.fc5 = nn.Linear(32, 10)
        self.dropout = nn.Dropout(0.5)
        
    def forward(self, x):
        x = torch.flatten(x, 1)
        x = self.fc1(x)
        x = F.relu(x)
        x = self.fc2(x)
        x = F.relu(x)
        x = self.fc3(x)
        x = F.relu(x)
        x = self.fc4(x)
        x = F.relu(x)
        x = self.dropout(x)
        x = self.fc5(x)
        output = F.log_softmax(x, dim=1)
        return output

In [None]:
# @markdown ### helper functions (Run Me)
def train(args, model, device, train_loader, optimizer, epoch):
    model.train()
    avg_loss = 0.
    for batch_idx, (data, target) in enumerate(train_loader):
        data, target = data.to(device), target.to(device)
        optimizer.zero_grad()
        output = model(data)
        loss = F.nll_loss(output, target)
        avg_loss += loss.item()
        loss.backward()
        optimizer.step()
        if batch_idx % args['log_interval'] == 0:
            print('Train Epoch: {} [{}/{} ({:.0f}%)]\tLoss: {:.6f}'.format(
                epoch, batch_idx * len(data), len(train_loader.dataset),
                100. * batch_idx / len(train_loader), loss.item()))
    # Each `loss` is already a per-batch mean, so average over the number of batches
    avg_loss /= len(train_loader)
    return avg_loss
            
def test(model, device, test_loader):
    model.eval()
    test_loss = 0
    correct = 0
    with torch.no_grad():
        for data, target in test_loader:
            data, target = data.to(device), target.to(device)
            output = model(data)
            test_loss += F.nll_loss(output, target, reduction='sum').item()  # sum up batch loss
            pred = output.argmax(dim=1, keepdim=True)  # get the index of the max log-probability
            correct += pred.eq(target.view_as(pred)).sum().item()

    test_loss /= len(test_loader.dataset)

    print('\nTest set: Average loss: {:.4f}, Accuracy: {}/{} ({:.4f}%)\n'.format(
        test_loss, correct, len(test_loader.dataset),
        100. * correct / len(test_loader.dataset)))
    return test_loss

def bn_eval(args):
    use_cuda = not args['no_cuda'] and torch.cuda.is_available()
    torch.manual_seed(args['seed'])
    device = torch.device('cuda' if use_cuda else 'cpu')

    train_kwargs = {'batch_size': args['batch_size']}
    test_kwargs = {'batch_size': args['test_batch_size']}
    if use_cuda:
        cuda_kwargs = {'num_workers': 1,
                       'pin_memory': True,
                       'shuffle': True}
        train_kwargs.update(cuda_kwargs)
        test_kwargs.update(cuda_kwargs)

    transform=transforms.Compose([
        transforms.ToTensor(),
        transforms.Normalize((0.1307,), (0.3081,))
        ])
    train_loader = torch.utils.data.DataLoader(datasets.MNIST('../data', train=True, download=True,
                       transform=transform),**train_kwargs)
    test_loader = torch.utils.data.DataLoader(datasets.MNIST('../data', train=False,
                       transform=transform), **test_kwargs)

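    if args['net_type'] == 'Shallow':
        model = Net().to(device)
    elif args['net_type'] == 'BNShallow':
        model = BNShallowNet().to(device)
    elif args['net_type'] == 'Deep':
        model = DeepNet().to(device)
    elif args['net_type'] == 'BNDeep':
        model = BNDeepNet().to(device)
    else:
        # Fail early instead of hitting a NameError on `model` below
        raise ValueError('Unknown net_type: {}'.format(args['net_type']))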
    optimizer = optim.SGD(model.parameters(), lr=args['lr'], momentum=args['momentum'])

    train_list, test_list = [], []
    for epoch in range(1, args['epochs'] + 1):
        train_loss = train(args, model, device, train_loader, optimizer, epoch)
        test_loss = test(model, device, test_loader)
        train_list.append(train_loss)
        test_list.append(test_loss)

    return train_list, test_list

Training takes over 20 minutes. Please skip the cells below for now and come back if time allows.

In [None]:
# @markdown ### Train function (Run me)
# Training settings
args = {'batch_size': 64,
        'test_batch_size': 1000,
        'epochs': 10,
        'lr': 0.01,
        'momentum': 0.9,
        'net_type': 'Net',
        'no_cuda': False,
        'seed': 1,
        'log_interval': 100
        }

net = ['Shallow', 'BNShallow', 'Deep', 'BNDeep']
loss_dict = {}

for i in range(len(net)):
    args['net_type'] = net[i]
    train_loss, test_loss = bn_eval(args)
    loss_dict['train' + str(net[i])] = train_loss
    loss_dict['test' + str(net[i])] = test_loss

In [None]:
# @markdown ### Plot (run me)
with plt.xkcd():
    fig, axs = plt.subplots(1, 2, figsize=(10,4))
    axs[0].plot(loss_dict['trainShallow'], label='Shallow w/o BN', color='b')
    axs[1].plot(loss_dict['testShallow'], label='Shallow w/o BN', color='b', linestyle='dashed')
    axs[0].plot(loss_dict['trainBNShallow'], label='Shallow BN', color='r')
    axs[1].plot(loss_dict['testBNShallow'], label='Shallow BN', color='r', linestyle='dashed')
    axs[0].plot(loss_dict['trainDeep'], label='Deep w/o BN', color='g')
    axs[1].plot(loss_dict['testDeep'], label='Deep w/o BN', color='g', linestyle='dashed')
    axs[0].plot(loss_dict['trainBNDeep'], label='Deep BN', color='orange')
    axs[1].plot(loss_dict['testBNDeep'], label='Deep BN', color='orange', linestyle='dashed')
    axs[0].set_title('Train')
    axs[1].set_title('Test')
    axs[0].set_ylabel('Loss')
    #plt.yscale('log')
    axs[0].set_xlabel('Epoch')
    axs[1].set_xlabel('Epoch')
    axs[0].legend()
    axs[1].legend()
    plt.show()

Plot the train and test convergence curves of the 4 networks by running the cell below.

In [None]:
url = "https://raw.githubusercontent.com/CIS-522/course-content/main/tutorials/W4_Optimization/static/W4_Tutorial1_Exercise2_batchnorm.png"
img = plt.imread(requests.get(url, stream=True).raw)
plt.imshow(img)
plt.axis('off')

In [None]:
#@markdown You looked at a shallow and a deep network. When did BN help or hurt?  Why do you think that happens?

bn_deep_shallow = '' #@param {type:"string"}

## Momentum 

Momentum in gradient descent is similar to the concept of momentum in physics. The optimization process resembles a ball rolling down the hill. Momentum keeps the ball moving in the same direction that it is already moving in. The gradient can be thought of as a force pushing the ball in some other direction.

<p align="center">
  <img width="460" height="300" src="https://miro.medium.com/max/640/1*i1Qc2E0TVlPHEKG7LepXgA.gif">
</p>

Mathematically, it can be expressed as follows:
$$w_{t+1}=w_t-\eta (\nabla f(w_t) +\beta m_{t}) $$
$$m_{t+1}= \nabla f(w_t) +\beta m_{t}$$
or, equivalently
$$w_{t+1}= w_t -\eta\nabla f(w_t) +\beta (w_{t} -w_{t-1})$$

where
*   $m$ is the momentum (the running average of the past gradients, initialized at zero),
*   $\beta\in [0,1)$ is the damping factor, usually $0.9$ or $0.99$.



Let’s consider two extreme cases to understand the damping factor $\beta$ (also called the decay rate) better. If the decay rate is 0, then it is exactly the same as (vanilla) gradient descent (blue ball). If the decay rate is 1 (and provided that the learning rate is reasonably small), then it rocks back and forth endlessly like the frictionless ball we saw previously; you do not want that. Typically the decay rate is chosen around 0.8–0.9. It’s like a surface with a little bit of friction, so the ball eventually slows down and stops (purple ball).
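
The two forms of the momentum update can be compared numerically. The sketch below assumes the illustrative loss $f(w)=(w-3)^2$; the function names, `eta`, `beta`, and the step count are hypothetical choices. It also runs the $\beta=0$ case, which reduces to vanilla gradient descent.

```python
def momentum_velocity_form(grad, w0, eta, beta, steps):
    """m_{t+1} = grad(w_t) + beta * m_t;  w_{t+1} = w_t - eta * m_{t+1}."""
    w, m = w0, 0.0  # momentum is initialized at zero
    ws = [w]
    for _ in range(steps):
        m = grad(w) + beta * m
        w = w - eta * m
        ws.append(w)
    return ws


def momentum_position_form(grad, w0, eta, beta, steps):
    """w_{t+1} = w_t - eta * grad(w_t) + beta * (w_t - w_{t-1})."""
    w_prev, w = w0, w0  # w_{-1} = w_0 corresponds to zero initial momentum
    ws = [w]
    for _ in range(steps):
        w_next = w - eta * grad(w) + beta * (w - w_prev)
        w_prev, w = w, w_next
        ws.append(w)
    return ws


# Illustrative loss f(w) = (w - 3)^2 with gradient 2 * (w - 3).
def grad_f(w):
    return 2.0 * (w - 3.0)


velocity = momentum_velocity_form(grad_f, 0.0, eta=0.05, beta=0.9, steps=200)
position = momentum_position_form(grad_f, 0.0, eta=0.05, beta=0.9, steps=200)
vanilla = momentum_velocity_form(grad_f, 0.0, eta=0.05, beta=0.0, steps=200)

max_gap = max(abs(a - b) for a, b in zip(velocity, position))
print("largest gap between the two forms:", max_gap)
print("final iterate with momentum:", velocity[-1])
print("final iterate without momentum:", vanilla[-1])
```

The two trajectories agree up to floating-point rounding. Printing the first few iterates of `velocity` shows the overshoot and oscillation around $w=3$ that the rolling ball picture describes.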

<p align="center">
  <img width="460" height="300" src="https://miro.medium.com/max/800/1*zVi4ayX9u0MQQwa90CnxVg.gif">
</p>



Go check this out --> [Interactive view](https://distill.pub/2017/momentum/)

In [None]:
#@markdown What are the advantages of momentum compared to vanilla gradient descent? (Select all that apply)
reduced_oscillations = False #@param {type:"boolean"}
faster_convergence = False #@param {type:"boolean"}
possibility_of_evading_local_minima  = False #@param {type:"boolean"}
adaptive = False #@param {type:"boolean"}



*Estimated time: 90 minutes since start*

---
# Wrap up

## Submit responses

In [None]:
#@markdown #Run Cell to Show Airtable Form
#@markdown ##**Confirm your answers and then click "Submit"**

import time
import numpy as np
from IPython.display import IFrame

def prefill_form(src, fields: dict):
  '''
  src: the original src url to embed the form
  fields: a dictionary of field:value pairs,
  e.g. {"pennkey": my_pennkey, "location": my_location}
  '''
  prefills = "&".join(["prefill_%s=%s"%(key, fields[key]) for key in fields])
  src = src + prefills
  src = "+".join(src.split(" "))
  return src


#autofill time if it is not present
try: t0;
except NameError: t0 = time.time()
try: t1;
except NameError: t1 = time.time()
try: t2;
except NameError: t2 = time.time()
try: t3;
except NameError: t3 = time.time()
try: t4;
except NameError: t4 = time.time()

times = [(t-t0) for t in [t1,t2,t3,t4]]


#autofill fields if they are not present
#a missing pennkey and pod will result in an Airtable warning
#which is easily fixed user-side.
try: my_pennkey;
except NameError: my_pennkey = ""
try: my_pod;
except NameError: my_pod = ""
try: learning_from_previous_week;
except NameError: learning_from_previous_week = "" 
try: strategy;
except NameError: strategy = ""
try: loss_fn;
except NameError: loss_fn = ""
try: characteristics_for_gd;
except NameError: characteristics_for_gd = ""
try: sgd_vs_minibtach_plot;
except NameError: sgd_vs_minibtach_plot = ""
try: noisy_learning_process;
except NameError: noisy_learning_process = False
try: computationally_efficient;
except NameError: computationally_efficient = False
try: increased_model_update_frequency;
except NameError: increased_model_update_frequency = False
try: memory_efficient;
except NameError: memory_efficient = False
try: convergence_speed;
except NameError: convergence_speed = ""
try: batch_norm_ab;
except NameError: batch_norm_ab = ""
try: bn_deep_shallow;
except NameError: bn_deep_shallow = ""
try: reduced_oscillations;
except NameError: reduced_oscillations = False
try: faster_convergence;
except NameError: faster_convergence = False
try: possibility_of_evading_local_minima;
except NameError: possibility_of_evading_local_minima = False
try: adaptive;
except NameError: adaptive = False
try: megabytes_1;
except NameError: megabytes_1 = ""
try: megabytes_2;
except NameError: megabytes_2 = ""


fields = {
    "my_pennkey": my_pennkey,
    "my_pod": my_pod, 
    "learning_from_previous_week": learning_from_previous_week,
    "strategy": strategy, 
    "loss_fn": loss_fn, 
    "characteristics_for_gd": characteristics_for_gd,
    "sgd_vs_minibtach_plot": sgd_vs_minibtach_plot,
    "noisy_learning_process": noisy_learning_process,
    "computationally_efficient": computationally_efficient,
    "increased_model_update_frequency": increased_model_update_frequency,
    "memory_efficient": memory_efficient,
    "convergence_speed": convergence_speed,
    "batch_norm_ab": batch_norm_ab,
    "bn_deep_shallow": bn_deep_shallow,
    "reduced_oscillations": reduced_oscillations,
    "faster_convergence": faster_convergence,
    "possibility_of_evading_local_minima": possibility_of_evading_local_minima,
    "adaptive": adaptive,
    "cumulative_times": times,
    "megabytes_1": megabytes_1,
    "megabytes_2": megabytes_2
}

src = "https://airtable.com/embed/shrxiEcmPSW5yr7np?"

#now instead of the original source url, we do: src = prefill_form(src, fields)
display(IFrame(src = prefill_form(src, fields), width = 800, height = 400))


# Feedback

*   How could this session have been better?
*   How happy are you in your group?
*   How do you feel right now?

Feel free to use the embedded form below or use this link:
<a target="_blank" rel="noopener noreferrer" href="https://airtable.com/shrNSJ5ECXhNhsYss">https://airtable.com/shrNSJ5ECXhNhsYss</a>

In [None]:
# report to Airtable
display(IFrame(src="https://airtable.com/embed/shrNSJ5ECXhNhsYss?backgroundColor=red", width = 800, height = 400))