# 01 Sequential
**In this episode, we're going to learn how to use PyTorch's Sequential class to build neural networks.**
## PyTorch Sequential Module
The `Sequential` class allows us to build PyTorch neural networks on-the-fly **without** having to build an **explicit class**. This makes it much easier to rapidly build networks and allows us to skip over the step where we implement the `forward()` method. When we use the sequential way of building a PyTorch network, we construct the `forward()` method implicitly by defining our network's architecture sequentially.

A sequential module is a **container or wrapper** class that **extends** the `nn.Module` base class and allows us to compose modules together. We can compose any `nn.Module` with any other `nn.Module`.

This means that we can compose layers to make networks, and since networks are also `nn.Module` instances, we can also compose networks with one another. Additionally, since the Sequential class is also a `nn.Module` itself, we can even compose `Sequential` modules with one another.
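To make the composition idea concrete, here is a minimal sketch that nests two `Sequential` modules inside a third. The names `features`, `classifier`, and `composed`, the layer sizes, and the random input batch are assumptions chosen for illustration only.

```python
import torch
import torch.nn as nn

# A small sub-network built as its own Sequential.
features = nn.Sequential(
    nn.Flatten(start_dim=1),
    nn.Linear(784, 392),
)

# A second sub-network, also a Sequential.
classifier = nn.Sequential(
    nn.Linear(392, 10),
)

# Because each Sequential is itself an nn.Module, the two compose directly.
composed = nn.Sequential(features, classifier)

batch = torch.rand(4, 1, 28, 28)  # hypothetical batch of four 28x28 images
output = composed(batch)
print(output.shape)  # torch.Size([4, 10])
```

The outer container simply calls `features` and then `classifier`, so nesting does not change how the data flows.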

At this point, we may be wondering about other required functions and operations, like pooling operations or activation functions. Well, the answer is that the commonly used functions and operations in the `nn.functional` API, including pooling and activation functions, have been wrapped up into `nn.Module` classes. This allows us to pass things like activation functions to `Sequential` wrappers to fully build out our networks in a sequential way.
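As a quick check of this idea, the sketch below applies a ReLU and a max pooling operation twice: once through the functional API and once through the corresponding module classes inside a `Sequential`. The input tensor `t` is a made-up example.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

t = torch.randn(2, 3, 8, 8)  # hypothetical input tensor

# Functional form: call the functions directly.
functional_out = F.max_pool2d(F.relu(t), kernel_size=2, stride=2)

# Module form: the same operations wrapped as nn.Module instances,
# so they can be placed inside a Sequential.
module_out = nn.Sequential(
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=2, stride=2),
)(t)

print(torch.equal(functional_out, module_out))  # True
```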

## Building PyTorch Sequential Networks
There are **three** ways to create a Sequential model. Let's see them in action.
### Code Setup
Firstly, we handle our imports.

In [1]:
import torch
import torch.nn as nn
import torch.nn.functional as F

import torchvision
import torchvision.transforms as transforms

import matplotlib.pyplot as plt
import math

from collections import OrderedDict

torch.set_printoptions(linewidth=150)

Then, we need to create a dataset that we can use for the purposes of passing a sample to the networks we will be building.

In [2]:
train_set = torchvision.datasets.FashionMNIST(
    root = './data',
    train = True,
    download = True,
    transform = transforms.Compose([transforms.ToTensor()])
)

Now, we'll grab a sample image from the FashionMNIST dataset instance.

In [3]:
image, label = train_set[0]
image.shape

torch.Size([1, 28, 28])

Now, we'll grab some values that will be used to construct our network.

In [4]:
in_features = image.numel()
in_features

784

In [5]:
out_features = math.floor(in_features / 2)
out_features

392

In [6]:
type(image.numel())

int

In [7]:
out_classes = len(train_set.classes)
out_classes

10

### Sequential Model Initialization: Way 1
The first way to create a sequential model is to pass `nn.Module` instances **directly** to the `Sequential` class constructor.

In [8]:
network1 = nn.Sequential(
    nn.Flatten(start_dim = 1),
    nn.Linear(in_features, out_features),
    nn.Linear(out_features, out_classes)
)

In [10]:
network1

Sequential(
  (0): Flatten()
  (1): Linear(in_features=784, out_features=392, bias=True)
  (2): Linear(in_features=392, out_features=10, bias=True)
)

### Sequential Model Initialization: Way 2
The second way to create a sequential model is to create an `OrderedDict` that contains `nn.Module` instances. Then, pass the dictionary to the `Sequential` class constructor.

In [9]:
layers = OrderedDict([
    ('flat',nn.Flatten(start_dim=1)),
    ('hidden',nn.Linear(in_features, out_features)),
    ('output',nn.Linear(out_features, out_classes))
])

network2 = nn.Sequential(layers)

In [11]:
network2

Sequential(
  (flat): Flatten()
  (hidden): Linear(in_features=784, out_features=392, bias=True)
  (output): Linear(in_features=392, out_features=10, bias=True)
)

This way of initialization allows us to name the `nn.Module` instances explicitly.
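One practical benefit of naming, sketched below under the same layer sizes used above: named modules can be reached as attributes, and the names show up in the parameter names. The variable name `network` is used here only for illustration.

```python
from collections import OrderedDict

import torch.nn as nn

network = nn.Sequential(OrderedDict([
    ('flat', nn.Flatten(start_dim=1)),
    ('hidden', nn.Linear(784, 392)),
    ('output', nn.Linear(392, 10)),
]))

# Named modules are available as attributes...
print(network.hidden)

# ...and positional indexing still works.
print(network[1] is network.hidden)  # True

# The names also prefix the parameter names (Flatten has no parameters).
print([name for name, _ in network.named_parameters()])
# ['hidden.weight', 'hidden.bias', 'output.weight', 'output.bias']
```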
### Sequential Model Initialization: Way 3
The third way of creating a sequential model is to create a sequential instance using an empty constructor. Then, we can use the `add_module()` method to add `nn.Module` instances to the network after it has already been initialized.

In [12]:
network3 = nn.Sequential()
network3.add_module('flat',nn.Flatten(start_dim = 1))
network3.add_module('hidden',nn.Linear(in_features,out_features))
network3.add_module('output',nn.Linear(out_features,out_classes))

In [13]:
network3

Sequential(
  (flat): Flatten()
  (hidden): Linear(in_features=784, out_features=392, bias=True)
  (output): Linear(in_features=392, out_features=10, bias=True)
)
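Whichever of the three ways we use, the resulting network is called the same way. The sketch below passes a single sample through a network built with `add_module()`. A random tensor stands in for the FashionMNIST image here so that the example runs without downloading the dataset; that substitution is an assumption for illustration.

```python
import torch
import torch.nn as nn

network = nn.Sequential()
network.add_module('flat', nn.Flatten(start_dim=1))
network.add_module('hidden', nn.Linear(784, 392))
network.add_module('output', nn.Linear(392, 10))

image = torch.rand(1, 28, 28)  # stand-in for one FashionMNIST sample

# Add a batch dimension so the input has shape [batch, channels, height, width].
batch = image.unsqueeze(0)
prediction = network(batch)

print(batch.shape)       # torch.Size([1, 1, 28, 28])
print(prediction.shape)  # torch.Size([1, 10])
```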

## Class Definition Vs Sequential
So far in this course, we've been working with a CNN that we defined using a class definition. The network is defined like this:

In [14]:
class Network(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(1, 6, 5)
        self.conv2 = nn.Conv2d(6, 12, 5)

        self.fc1 = nn.Linear(in_features=12*4*4, out_features=120)
        self.fc2 = nn.Linear(in_features=120, out_features=60)
        self.out = nn.Linear(in_features=60, out_features=10)

    def forward(self, t):

        t = F.relu(self.conv1(t))
        t = F.max_pool2d(t, kernel_size=2, stride=2)

        t = F.relu(self.conv2(t))
        t = F.max_pool2d(t, kernel_size=2, stride=2)

        t = t.flatten(start_dim=1)
        t = F.relu(self.fc1(t))
        t = F.relu(self.fc2(t))
        t = self.out(t)

        return t

We get an instance of the network like so:
```python
network = Network()
```
Now, let's see how this same network can be created using the `Sequential` class. It works like this:

In [15]:
sequential = nn.Sequential(
      nn.Conv2d(in_channels=1, out_channels=6, kernel_size=5)
    , nn.ReLU()
    , nn.MaxPool2d(kernel_size=2, stride=2)
    , nn.Conv2d(in_channels=6, out_channels=12, kernel_size=5)
    , nn.ReLU()
    , nn.MaxPool2d(kernel_size=2, stride=2)
    , nn.Flatten(start_dim=1)  
    , nn.Linear(in_features=12*4*4, out_features=120)
    , nn.ReLU()
    , nn.Linear(in_features=120, out_features=60)
    , nn.ReLU()
    , nn.Linear(in_features=60, out_features=10)
)

We said that these networks are the **same**. But what do we mean? In this case, we mean that the networks have the **same architecture**. From a programming standpoint, the two networks are **different types** under the hood.

Note that we can get the same output predictions for these two networks if we fix the seed that is used to generate random numbers in PyTorch. This is needed because both networks will have randomly generated weights. To be sure the weights are the same, we use the PyTorch method below before creating each network.
`torch.manual_seed(50)`
It's important to note that the method must be called twice, once before each network initialization.
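The following sketch puts this together: it seeds the generator, builds the class-based network, seeds again, builds the sequential network, and then compares the predictions for one input. The random input tensor `x` is an assumption standing in for a real image batch.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class Network(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(1, 6, 5)
        self.conv2 = nn.Conv2d(6, 12, 5)
        self.fc1 = nn.Linear(in_features=12*4*4, out_features=120)
        self.fc2 = nn.Linear(in_features=120, out_features=60)
        self.out = nn.Linear(in_features=60, out_features=10)

    def forward(self, t):
        t = F.max_pool2d(F.relu(self.conv1(t)), kernel_size=2, stride=2)
        t = F.max_pool2d(F.relu(self.conv2(t)), kernel_size=2, stride=2)
        t = t.flatten(start_dim=1)
        t = F.relu(self.fc1(t))
        t = F.relu(self.fc2(t))
        return self.out(t)


# Seed once before each network so both draw the same initial weights.
torch.manual_seed(50)
network = Network()

torch.manual_seed(50)
sequential = nn.Sequential(
    nn.Conv2d(in_channels=1, out_channels=6, kernel_size=5),
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=2, stride=2),
    nn.Conv2d(in_channels=6, out_channels=12, kernel_size=5),
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=2, stride=2),
    nn.Flatten(start_dim=1),
    nn.Linear(in_features=12*4*4, out_features=120),
    nn.ReLU(),
    nn.Linear(in_features=120, out_features=60),
    nn.ReLU(),
    nn.Linear(in_features=60, out_features=10),
)

x = torch.rand(1, 1, 28, 28)  # hypothetical single-image batch
with torch.no_grad():
    same = torch.allclose(network(x), sequential(x))

print(same)  # True
```

This works because both definitions create their weighted layers in the same order, so the random number generator is consumed identically after each seed call.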

## Quiz 01
1. The `Sequential` class allows us to build PyTorch neural networks on-the-fly without having to build an explicit _______________.
  * class

2. When we build a `Sequential` model, our `forward()` method is defined explicitly.
  * False
  
3. A sequential module is a container or _______________ class that allows us to compose modules together.
  * wrapper
  
4. The Sequential class extends the _______________ class.
  * nn.Module
  
5. The `nn.Flatten()` module is a wrapper class that wraps the `torch.flatten()` function.
  * True
  
---
---

# 02 Batch Normalization In PyTorch
**In this episode, we're going to see how we can add batch normalization to a PyTorch CNN.**

## What Is Batch Normalization?
In order to understand batch normalization, we need to first understand what data normalization is in general, and we learned about this concept in the episode on dataset [normalization](https://deeplizard.com/learn/video/lu7TCu7HeYc).

When we normalize a dataset, we are normalizing the **input data** that will be passed to the network, and when we add **batch normalization** to our network, we are normalizing the data again **after** it has passed through one or more **layers**.

One question that may come to mind is the following:
<center><b>Why normalize again if the input is already normalized?</b></center>

Well, as the data begins moving through layers, the values will begin to shift as the layer transformations are performed. Normalizing the outputs from a layer ensures that the **scale stays in a specific range** as the data flows through the network from input to output.
    
The specific normalization technique that is typically used is called **standardization**. This is where we calculate a z-score using the mean and standard deviation.
$$z=\frac{x-\text{mean}}{\text{std}}$$
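As a small illustration of standardization, the sketch below computes z-scores for a tensor of made-up values and checks that the result has a mean near 0 and a standard deviation near 1. The tensor `x` and its scale are assumptions chosen for the example.

```python
import torch

x = torch.rand(1000) * 255  # hypothetical raw values in [0, 255)

# z-score: subtract the mean, then divide by the standard deviation.
z = (x - x.mean()) / x.std()

print(z.mean())  # very close to 0
print(z.std())   # very close to 1
```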

### How Batch Norm Works
When using batch norm, the mean and standard deviation values are calculated with respect to the **batch** at the time normalization is applied. This is **as opposed to** the **entire dataset**, like we saw with dataset normalization.

Additionally, there are two learnable parameters that allow the data to be scaled and shifted. We saw this in the paper: [Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift](https://arxiv.org/pdf/1502.03167.pdf)

Note that the scaling given by $\gamma$ corresponds to the multiplication operation, and the shifting given by $\beta$ corresponds to the addition operation.

The *scale* and *shift* operations sound fancy, but they simply mean *multiply* and *add*.

These learnable parameters give the distribution of values more freedom to move around, adjusting to the right fit.

The scale and shift values can be thought of as the slope and y-intercept values of a line, both of which allow the line to be adjusted to fit various locations on the 2D plane.
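To connect these pieces, here is a minimal sketch that reproduces the training-mode output of `nn.BatchNorm1d` by hand: standardize each feature over the batch, then scale by $\gamma$ (stored as `weight`) and shift by $\beta$ (stored as `bias`). The batch values are made up, and the manual computation assumes the layer's default settings, which use the biased batch variance plus a small `eps` for numerical stability.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
batch = torch.randn(8, 4) * 3 + 5  # hypothetical batch: 8 samples, 4 features

bn = nn.BatchNorm1d(4)  # a new module is in training mode by default
expected = bn(batch)

# Standardize each feature using statistics of the current batch.
mean = batch.mean(dim=0)
var = batch.var(dim=0, unbiased=False)
normalized = (batch - mean) / torch.sqrt(var + bn.eps)

# Scale (multiply by gamma) and shift (add beta).
manual = bn.weight * normalized + bn.bias

print(torch.allclose(manual, expected, atol=1e-5))  # True
```

At initialization `weight` is all ones and `bias` is all zeros, so the layer starts out as plain standardization and learns any scaling and shifting during training.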

## Adding Batch Norm To A CNN
Alright, let's create two networks, **one with batch norm** and **one without**. Then, we'll test these setups using the testing framework we've developed so far in the course. To do this, we'll make use of the `nn.Sequential` class.

Our first network will be called `network1`:

In [40]:
import json
import time

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import torchvision
import torchvision.transforms as transforms
import pandas as pd


from torch.utils.tensorboard import SummaryWriter
from itertools import product
from collections import namedtuple, OrderedDict
from IPython import display

torch.set_printoptions(linewidth=120)  # Display options for output

In [41]:
torch.manual_seed(50)
network1 = nn.Sequential(
      nn.Conv2d(in_channels=1, out_channels=6, kernel_size=5)
    , nn.ReLU()
    , nn.MaxPool2d(kernel_size=2, stride=2)
    , nn.Conv2d(in_channels=6, out_channels=12, kernel_size=5)
    , nn.ReLU()
    , nn.MaxPool2d(kernel_size=2, stride=2)
    , nn.Flatten(start_dim=1)  
    , nn.Linear(in_features=12*4*4, out_features=120)
    , nn.ReLU()
    , nn.Linear(in_features=120, out_features=60)
    , nn.ReLU()
    , nn.Linear(in_features=60, out_features=10)
)

In [42]:
# Our second network will be called network2
torch.manual_seed(50)
network2 = nn.Sequential(
      nn.Conv2d(in_channels=1, out_channels=6, kernel_size=5)
    , nn.ReLU()
    , nn.MaxPool2d(kernel_size=2, stride=2)
    , nn.BatchNorm2d(6)
    , nn.Conv2d(in_channels=6, out_channels=12, kernel_size=5)
    , nn.ReLU()
    , nn.MaxPool2d(kernel_size=2, stride=2)
    , nn.Flatten(start_dim=1)  
    , nn.Linear(in_features=12*4*4, out_features=120)
    , nn.ReLU()
    , nn.BatchNorm1d(120)
    , nn.Linear(in_features=120, out_features=60)
    , nn.ReLU()
    , nn.Linear(in_features=60, out_features=10)
)
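The argument passed to each batch norm layer matches the number of channels (for `nn.BatchNorm2d`) or features (for `nn.BatchNorm1d`) coming out of the layers before it. The short sketch below inspects the learnable parameters each layer adds; the variable names are chosen for illustration.

```python
import torch.nn as nn

bn2d = nn.BatchNorm2d(6)    # placed after the first conv block, which outputs 6 channels
bn1d = nn.BatchNorm1d(120)  # placed after the first linear layer, which outputs 120 features

for name, layer in [('BatchNorm2d(6)', bn2d), ('BatchNorm1d(120)', bn1d)]:
    learnable = {n: tuple(p.shape) for n, p in layer.named_parameters()}
    print(name, learnable)
# BatchNorm2d(6) {'weight': (6,), 'bias': (6,)}
# BatchNorm1d(120) {'weight': (120,), 'bias': (120,)}
```

Each layer holds one scale value and one shift value per channel or feature.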

Now, we'll create a networks dictionary that we'll use to store the two networks.

In [43]:
networks = {
    'no_batch_norm':network1,
    'batch_norm':network2
}

The names or keys of this dictionary will be used inside our run loop to access each network. To configure our runs, we can use the keys of the dictionary as opposed to writing out each value explicitly. This is pretty cool because it allows us to easily test different networks against one another simply by adding more networks to the dictionary.

In [44]:
params = OrderedDict(
    lr = [.01],
    batch_size = [1000],
    num_workers = [1],
    device = ['cuda'],
    trainset = ['normal'],
    network = list(networks.keys())
)

In [45]:
train_set = torchvision.datasets.FashionMNIST(
    root='./data'
    ,train=True
    ,download=True
    ,transform=transforms.Compose([
        transforms.ToTensor()
    ])
)

In [46]:
loader = torch.utils.data.DataLoader(train_set, batch_size = len(train_set), num_workers = 1)
data = next(iter(loader))
mean = data[0].mean()
std = data[0].std()  # data[0] holds the images, data[1] holds the labels

In [47]:
train_set_normal = torchvision.datasets.FashionMNIST(
    root='./data'
    ,train = True
    ,download=True
    ,transform=transforms.Compose([
        transforms.ToTensor()
        ,transforms.Normalize(mean, std)
    ])
)

In [48]:
class RunBuilder():
    @staticmethod
    def get_runs(params):
        Run = namedtuple('Run', params.keys())

        runs = []
        for v in product(*params.values()):
            runs.append(Run(*v))

        return runs
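To see what the run builder produces, the sketch below applies the same `namedtuple` and `product` logic to a reduced parameter set. The reduced `params` dictionary is an assumption for illustration; the network names match the keys of the `networks` dictionary.

```python
from collections import OrderedDict, namedtuple
from itertools import product

params = OrderedDict(
    lr=[.01],
    batch_size=[1000],
    network=['no_batch_norm', 'batch_norm'],
)

# One Run tuple per combination of parameter values.
Run = namedtuple('Run', params.keys())
runs = [Run(*values) for values in product(*params.values())]

for run in runs:
    print(run)
# Run(lr=0.01, batch_size=1000, network='no_batch_norm')
# Run(lr=0.01, batch_size=1000, network='batch_norm')
```

Since only the `network` parameter has more than one value, we get exactly one run per network.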

In [49]:
class RunManager():
    def __init__(self):
        self.epoch_count = 0
        self.epoch_loss = 0
        self.epoch_num_correct = 0
        self.epoch_start_time = None

        self.run_params = None
        self.run_count = 0
        self.run_data = []
        self.run_start_time = None

        self.network = None
        self.loader = None
        self.tb = None

    def begin_run(self, run, network, loader):

        self.run_start_time = time.time()
        self.run_params = run
        self.run_count += 1

        self.network = network
        self.loader = loader
        self.tb = SummaryWriter(comment=f'-{run}')

        images,labels = next(iter(self.loader))
        grid = torchvision.utils.make_grid(images)

        self.tb.add_image('images',grid)
        #self.tb.add_graph(self.network, images.to(getattr(run,'device','cpu')))

    def end_run(self):
        self.tb.close()
        self.epoch_count = 0

    def begin_epoch(self):
        self.epoch_start_time = time.time()

        self.epoch_count += 1
        self.epoch_loss = 0
        self.epoch_num_correct = 0
    
    def end_epoch(self):

        epoch_duration = time.time() - self.epoch_start_time
        run_duration = time.time() - self.run_start_time

        loss = self.epoch_loss / len(self.loader.dataset)
        accuracy = self.epoch_num_correct / len(self.loader.dataset)

        self.tb.add_scalar('Loss',loss,self.epoch_count)
        self.tb.add_scalar('Accuracy',accuracy,self.epoch_count)

        for name,param in self.network.named_parameters():
            self.tb.add_histogram(name, param, self.epoch_count)
            self.tb.add_histogram(f'{name}.grad', param.grad, self.epoch_count)

        results = OrderedDict()
        results["run"] = self.run_count
        results["epoch"] = self.epoch_count
        results["loss"] = loss
        results["accuracy"] = accuracy
        results["epoch duration"] = epoch_duration
        results["run duration"] = run_duration
        for k, v in self.run_params._asdict().items():
            results[k] = v
        self.run_data.append(results)

        df = pd.DataFrame.from_dict(self.run_data,orient='columns')
        
        
        display.clear_output(wait=True)
        display.display(df)

    def get_num_correct(self,preds, labels):
        return preds.argmax(dim=1).eq(labels).sum().item()

    def track_loss(self,loss,batch):
        self.epoch_loss += loss.item() * batch[0].shape[0]

    def track_num_correct(self,preds,labels):
        self.epoch_num_correct += self.get_num_correct(preds,labels)

    def save(self, fileName):

        pd.DataFrame.from_dict(
            self.run_data,orient = 'columns'
        ).to_csv(f'{fileName}.csv')

        with open(f'{fileName}.json','w',encoding='utf-8') as f:
            json.dump(self.run_data,f, ensure_ascii=False, indent = 4)

In [50]:
m = RunManager()

In [51]:
trainsets = {
    'not_normal':train_set
    ,'normal':train_set_normal
}

In [53]:
for run in RunBuilder.get_runs(params):

    device = torch.device(run.device)
    network = networks[run.network].to(device)## UPDATE
    loader = torch.utils.data.DataLoader(
          trainsets[run.trainset]
        , batch_size=run.batch_size
        , num_workers=run.num_workers
    )
    optimizer = optim.Adam(network.parameters(), lr=run.lr)

    m.begin_run(run, network, loader)
    for epoch in range(20):
        m.begin_epoch()
        for batch in loader:

            images = batch[0].to(device)
            labels = batch[1].to(device)
            preds = network(images) # Pass Batch
            loss = F.cross_entropy(preds, labels) # Calculate Loss
            optimizer.zero_grad() # Zero Gradients
            loss.backward() # Calculate Gradients
            optimizer.step() # Update Weights

            m.track_loss(loss, batch)
            m.track_num_correct(preds, labels)
        m.end_epoch()
    m.end_run()
m.save('results3')

| | run | epoch | loss | accuracy | epoch duration | run duration | lr | batch_size | num_workers | device | trainset | network |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | 0.902308 | 0.6657 | 11.045463 | 12.710009 | 0.01 | 1000 | 1 | cuda | normal | no_batch_norm |
| 1 | 1 | 2 | 0.48159 | 0.816767 | 10.109992 | 22.955631 | 0.01 | 1000 | 1 | cuda | normal | no_batch_norm |
| 2 | 1 | 3 | 0.394888 | 0.854983 | 10.082085 | 33.174318 | 0.01 | 1000 | 1 | cuda | normal | no_batch_norm |
| 3 | 1 | 4 | 0.355903 | 0.86995 | 10.0192 | 43.323171 | 0.01 | 1000 | 1 | cuda | normal | no_batch_norm |
| 4 | 1 | 5 | 0.336055 | 0.875917 | 10.001249 | 53.456068 | 0.01 | 1000 | 1 | cuda | normal | no_batch_norm |
| 5 | 1 | 6 | 0.317772 | 0.883133 | 10.050117 | 63.638831 | 0.01 | 1000 | 1 | cuda | normal | no_batch_norm |
| 6 | 1 | 7 | 0.300082 | 0.88945 | 10.098988 | 73.863483 | 0.01 | 1000 | 1 | cuda | normal | no_batch_norm |
| 7 | 1 | 8 | 0.284005 | 0.8946 | 9.989282 | 83.986406 | 0.01 | 1000 | 1 | cuda | normal | no_batch_norm |
| 8 | 1 | 9 | 0.276586 | 0.897367 | 10.084028 | 94.206071 | 0.01 | 1000 | 1 | cuda | normal | no_batch_norm |
| 9 | 1 | 10 | 0.275722 | 0.8972 | 10.04912 | 104.388834 | 0.01 | 1000 | 1 | cuda | normal | no_batch_norm |


In [54]:
pd.DataFrame.from_dict(m.run_data).sort_values('accuracy', ascending=False)

| | run | epoch | loss | accuracy | epoch duration | run duration | lr | batch_size | num_workers | device | trainset | network |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 39 | 2 | 20 | 0.168771 | 0.935367 | 10.035158 | 207.777825 | 0.01 | 1000 | 1 | cuda | normal | batch_norm |
| 38 | 2 | 19 | 0.174516 | 0.932917 | 10.030169 | 197.570128 | 0.01 | 1000 | 1 | cuda | normal | batch_norm |
| 36 | 2 | 17 | 0.187812 | 0.92985 | 10.041142 | 177.159722 | 0.01 | 1000 | 1 | cuda | normal | batch_norm |
| 37 | 2 | 18 | 0.184065 | 0.929483 | 10.037162 | 187.373412 | 0.01 | 1000 | 1 | cuda | normal | batch_norm |
| 35 | 2 | 16 | 0.191689 | 0.9277 | 10.083067 | 166.950032 | 0.01 | 1000 | 1 | cuda | normal | batch_norm |
| 34 | 2 | 15 | 0.196619 | 0.925683 | 10.192805 | 156.711448 | 0.01 | 1000 | 1 | cuda | normal | batch_norm |
| 33 | 2 | 14 | 0.197369 | 0.925033 | 10.125924 | 146.349096 | 0.01 | 1000 | 1 | cuda | normal | batch_norm |
| 32 | 2 | 13 | 0.203388 | 0.92305 | 10.139892 | 136.051644 | 0.01 | 1000 | 1 | cuda | normal | batch_norm |
| 31 | 2 | 12 | 0.205429 | 0.9225 | 10.020198 | 125.746195 | 0.01 | 1000 | 1 | cuda | normal | batch_norm |
| 30 | 2 | 11 | 0.209708 | 0.922033 | 10.109958 | 115.554456 | 0.01 | 1000 | 1 | cuda | normal | batch_norm |
