# 01 Training Loop Run Builder - Neural Network Experimentation Code

**In this episode, we’ll code a `RunBuilder` class that will allow us to generate multiple runs with varying parameters.**

## Using The `RunBuilder` Class

The purpose of this episode and the last couple of episodes of this series is to get ourselves into a position to be able to **efficiently experiment** with the training process that we’ve constructed. For this reason, we’re going to expand on something we touched on in the episode on hyperparameter experimentation. We’re going to make what we saw there a bit **cleaner**.

We’re going to build a class called `RunBuilder`, but before we look at how to build the class, let’s see what it will allow us to do. We’ll start with our imports.

In [1]:
from collections import OrderedDict
from collections import namedtuple
from itertools import product

We’re importing `OrderedDict` and `namedtuple` from collections and we’re importing a function called `product` from `itertools`. This `product()` function is the one we saw last time that computes a **Cartesian product** given multiple list inputs.
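As a quick illustration of what `product()` returns, here is a minimal sketch using two small lists that mirror the parameters used later in this episode. The variable names `lrs` and `batch_sizes` are made up for this example.

```python
from itertools import product

# Illustrative inputs (hypothetical names, not part of the episode's code)
lrs = [.01, .001]
batch_sizes = [1000, 10000]

# product() yields one tuple per combination, varying the last input fastest
combos = list(product(lrs, batch_sizes))
print(combos)
```

Each tuple pairs one learning rate with one batch size, which is exactly the shape of data a single run needs.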

Alright. This is the `RunBuilder` class that will build sets of parameters that define our runs. We’ll see how it works after we see how to use it. See also: [Usage of `*`](https://blog.csdn.net/yzj99848873/article/details/48025593/)

In [2]:
class RunBuilder():
    @staticmethod
    def get_runs(params):

        Run = namedtuple('Run', params.keys())

        runs = []
        for v in product(*params.values()):
            runs.append(Run(*v))

        return runs

The main thing to note about using this class is that it has a `static` method called `get_runs()`. This method will get the runs for us that it builds based on the parameters we pass in.

Let’s define some parameters now.

In [3]:
params = OrderedDict(
    lr = [.01, .001]
    ,batch_size = [1000, 10000]
)

Here, we’ve defined a set of parameters and values inside a dictionary. We have a set of learning rates and a set of batch sizes we want to try out. When we say try out, we mean that we want to do a **training** run for **each learning rate** and **each batch size** in the dictionary.

To get these runs, we just call the `get_runs()` function of the `RunBuilder` class, passing in the parameters we’d like to use.

In [4]:
runs = RunBuilder.get_runs(params)

In [5]:
runs

[Run(lr=0.01, batch_size=1000),
 Run(lr=0.01, batch_size=10000),
 Run(lr=0.001, batch_size=1000),
 Run(lr=0.001, batch_size=10000)]

Great, we can see that the `RunBuilder` class has built and returned a **list** of four runs. Each of these runs has a learning rate and a batch size that defines the run.

We can access an individual run by indexing into the list like so:

In [6]:
run = runs[0]
run

Run(lr=0.01, batch_size=1000)

Notice the string representation of the run output. This string representation was automatically generated for us by the **`Run` tuple class**, and this string can be used to **uniquely** identify the run if we want to write out run statistics to disk for TensorBoard or any other visualization program.
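A minimal sketch of how that string could serve as an identifier. The `Run` class is recreated here so the snippet is self-contained, and the `results-` file-name pattern is purely an assumption for illustration.

```python
from collections import namedtuple

Run = namedtuple('Run', ['lr', 'batch_size'])
run = Run(lr=0.01, batch_size=1000)

# str(run) is the same text the notebook displays for the tuple
run_id = str(run)
print(run_id)

# Hypothetical use: build a label or file name that is distinct per run
file_name = f'results-{run_id}.csv'
print(file_name)
```

Two runs share the same string only if every parameter value matches, so distinct parameter combinations get distinct labels.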

Additionally, because the run object is a `tuple` with named attributes, we can access the values using dot notation like so:

In [7]:
print(run.lr, run.batch_size)

0.01 1000


Finally, since the list of runs is a Python iterable, we can iterate over the runs cleanly like so:

In [8]:
for run in runs:
    print(run,run.lr,run.batch_size)

Run(lr=0.01, batch_size=1000) 0.01 1000
Run(lr=0.01, batch_size=10000) 0.01 10000
Run(lr=0.001, batch_size=1000) 0.001 1000
Run(lr=0.001, batch_size=10000) 0.001 10000


All we have to do to **add** additional values is to add them to the original **parameter list**, and if we want to add an additional type of parameter, all we have to do is add it. The new parameter and its values will automatically become available to be consumed inside the run. The string output for the run updates as well.

Two parameters:

In [9]:
params = OrderedDict(
    lr = [.01, .001]
    ,batch_size = [1000, 10000]
)

runs = RunBuilder.get_runs(params)
runs

[Run(lr=0.01, batch_size=1000),
 Run(lr=0.01, batch_size=10000),
 Run(lr=0.001, batch_size=1000),
 Run(lr=0.001, batch_size=10000)]

In [10]:
# Three parameters:
params = OrderedDict(
    lr = [.01, .001]
    ,batch_size = [1000, 10000]
    ,device = ["cuda", "cpu"]
)

runs = RunBuilder.get_runs(params)
runs

[Run(lr=0.01, batch_size=1000, device='cuda'),
 Run(lr=0.01, batch_size=1000, device='cpu'),
 Run(lr=0.01, batch_size=10000, device='cuda'),
 Run(lr=0.01, batch_size=10000, device='cpu'),
 Run(lr=0.001, batch_size=1000, device='cuda'),
 Run(lr=0.001, batch_size=1000, device='cpu'),
 Run(lr=0.001, batch_size=10000, device='cuda'),
 Run(lr=0.001, batch_size=10000, device='cpu')]

This functionality will allow us to have greater control as we experiment with different values during training.
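One thing to keep in mind is that the number of runs grows multiplicatively with each parameter we add. A small sketch of how the count could be computed up front, assuming Python 3.8 or later for `math.prod`:

```python
from collections import OrderedDict
from math import prod

params = OrderedDict(
    lr = [.01, .001]
    ,batch_size = [1000, 10000]
    ,device = ["cuda", "cpu"]
)

# One run per combination, so the count is the product of the list lengths
num_runs = prod(len(values) for values in params.values())
print(num_runs)  # 2 * 2 * 2 = 8
```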

Let’s see how to build this `RunBuilder` class.

## Coding The `RunBuilder` Class
The first thing we need to have is a **dictionary of parameters and values** we’d like to try.

In [11]:
params = OrderedDict(
    lr = [.01, .001]
    ,batch_size = [1000, 10000]
)

In [12]:
# Next, we get a list of keys from the dictionary.
params.keys()

odict_keys(['lr', 'batch_size'])

In [14]:
# Then, we get a list of values from the dictionary.
params.values()

odict_values([[0.01, 0.001], [1000, 10000]])

Once we have both of these, we just make sure we understand both of them by inspecting their output. Once we do, we use these keys and values for what comes next. We’ll start with the keys.

In [16]:
Run = namedtuple('Run',params.keys())

This line creates a new `tuple` **subclass** called `Run` that has named fields. This `Run` class is used to encapsulate the data for each of our runs. The field names of this class are set by the list of names passed to the constructor. First, we are passing the **class name**. Then, we are passing the **field names**, and in our case, we are passing the list of **keys from our dictionary**.
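To make the behavior of `namedtuple` concrete, here is a short self-contained sketch. It passes the field names as a plain list instead of dictionary keys; any iterable of strings works.

```python
from collections import namedtuple

# Class name first, then the field names (here a plain list of strings)
Run = namedtuple('Run', ['lr', 'batch_size'])

print(Run._fields)             # the field names, in order
print(issubclass(Run, tuple))  # Run is a tuple subclass

run = Run(lr=0.01, batch_size=1000)
print(run.lr, run[0])          # access by name or by position
```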

Now that we have a class for our runs, we are ready to create some runs.

In [18]:
runs = []
for v in product(*params.values()):
    runs.append(Run(*v))

First we create a list called `runs`. Then, we use the `product()` function from `itertools` to create the **Cartesian product** using the values for each parameter inside our dictionary. This gives us a set of **ordered pairs** that define our runs. We iterate over these adding a run to the `runs` list for each one.

For each value in the Cartesian product we have an ordered tuple. The Cartesian product gives us every ordered pair, so we have all possible ordered pairs of learning rates and batch sizes. When we pass the `tuple` to the `Run` constructor, we use the `*` operator to tell the constructor to accept the tuple values as arguments, as opposed to the `tuple` itself.
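The effect of the star can be checked directly. This sketch recreates a two-field `Run` class and compares the call with and without unpacking; the flag name `unpacking_required` is made up for the example.

```python
from collections import namedtuple

Run = namedtuple('Run', ['lr', 'batch_size'])
v = (0.01, 1000)

# Run(*v) is equivalent to Run(0.01, 1000)
run = Run(*v)
print(run)

# Without the star, the whole tuple is passed as the single argument lr,
# and batch_size is missing, so the call raises a TypeError
try:
    Run(v)
    unpacking_required = False
except TypeError:
    unpacking_required = True
print(unpacking_required)
```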

Finally, we wrap this code in our `RunBuilder` class.

In [21]:
class RunBuilder():
    @staticmethod
    def get_runs(params):

        Run = namedtuple('Run', params.keys())

        runs = []
        for v in product(*params.values()):
            runs.append(Run(*v))

        return runs

Since the `get_runs()` method is static, we can call it using the class itself. We don’t need an instance of the class.
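A toy sketch of what `@staticmethod` changes. The `Greeter` class is hypothetical and exists only to show the calling convention.

```python
class Greeter():
    @staticmethod
    def greet(name):
        # No self parameter: the method does not use instance state
        return f'hello {name}'

# Called on the class itself, no instance required
from_class = Greeter.greet('run')
print(from_class)

# Calling through an instance also works and gives the same result
from_instance = Greeter().greet('run')
print(from_instance)
```

Because `get_runs()` only depends on the `params` argument, making it static avoids creating a `RunBuilder` object that would hold no state.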

Now, this allows us to update our training code in the following way:

Before:
```python
for lr, batch_size, shuffle in product(*param_values):
    comment = f' batch_size={batch_size} lr={lr} shuffle={shuffle}'

    # Training process given the set of parameters
```
After:
```python
for run in RunBuilder.get_runs(params):
    comment = f'-{run}'

    # Training process given the set of parameters
```

## What Is A Cartesian Product?

Do you know about the Cartesian product? Like many things in life, the Cartesian product is a mathematical concept. The Cartesian product is a binary operation. The operation takes two sets as arguments and returns a third set as an output. Let's look at a general mathematical example.

Suppose that $X$ is a set.

Suppose that $Y$ is a set.

The Cartesian product of two sets is denoted as $X×Y$. The Cartesian product of the set $X$ and the set $Y$ is defined to be the set of all ordered pairs $(x,y)$ such that $x∈X$ and $y∈Y$. This can be expressed in the following way:$$X×Y=\{(x,y)∣x∈X \text{ and } y∈Y\}$$

This way of expressing the output of the Cartesian product is called set builder notation. It is cool. So $X×Y$ is the set of all ordered pairs $(x,y)$ such that $x∈X$ and $y∈Y$.

To compute $X×Y$ we do the following:

For every $x∈X$ and for every $y∈Y$, we collect the corresponding pair $(x,y)$. The resulting collection gives us the set of all ordered pairs $(x,y)$ such that $x∈X$ and $y∈Y$.

Here is a concrete example expressed in Python:
```python
X = {1,2,3}
Y = {1,2,3}

{ (x,y) for x in X for y in Y }
```
Output
```python
{(1, 1), (1, 2), (1, 3), (2, 1), (2, 2), (2, 3), (3, 1), (3, 2), (3, 3)}
```
Notice how powerful the mathematical code is. It covers all cases. Maybe you noticed that this can be achieved using for-loop iteration like so:
```python
X = {1,2,3}
Y = {1,2,3}
cartesian_product = set()
for x in X:
    for y in Y:
        cartesian_product.add((x,y))
cartesian_product
```
Output:
```python
{(1, 1), (1, 2), (1, 3), (2, 1), (2, 2), (2, 3), (3, 1), (3, 2), (3, 3)}
```
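The same set can also be produced with `itertools.product()`, the function `RunBuilder` relies on. A short sketch that checks the two approaches agree:

```python
from itertools import product

X = {1,2,3}
Y = {1,2,3}

# product() yields the same ordered pairs as the set comprehension
pairs = set(product(X, Y))
print(pairs == { (x,y) for x in X for y in Y })

# The number of pairs is the size of X times the size of Y
print(len(pairs) == len(X) * len(Y))
```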

## Quiz 01
1. This `Run` class is used to _______________ the data for each of our runs.
```python
Run = namedtuple('Run', params.keys())
```
* encapsulate<br>
<br>
2. The Cartesian product contains _______________.
```python
cp = product(*params.values())
```
* ordered tuples
<br><br>
3. Suppose we pass the parameters dictionary below to the `get_runs()` function. How many runs will be generated?
```python
params = OrderedDict(
    lr = [.01, .001]
    ,batch_size = [1000, 10000]
)
runs = RunBuilder.get_runs(params)
```
* 4
<br><br>
4. Suppose we pass the parameters dictionary below to the `get_runs()` function. How many runs will be generated?
```python
params = OrderedDict(
    lr = [.01, .001]
    ,batch_size = [1000, 10000]
    ,device = ["cuda", "cpu"]
)
runs = RunBuilder.get_runs(params)
```
* 8

---
---

# 02 CNN Training Loop Refactoring - Simultaneous Hyperparameter Testing

## Refactoring The CNN Training Loop

**In this episode, we will see how we can experiment with large numbers of hyperparameter values easily while still keeping our training loop and our results organized.**

## Cleaning Up The Training Loop And Extracting Classes

We built out quite a lot of functionality that allowed us to experiment with many different parameters and values, and we also made the calls needed inside our training loop that would get our results into TensorBoard.

All of this work has helped, but our training loop is pretty **crowded** now. In this episode, we're going to **clean up** our training loop and set the stage for more experimentation by using the `RunBuilder` class that we built last time and by building a new class called `RunManager`.

Our **goal** is to be able to *add parameters and values at the top*, and have all the values tested or tried during multiple training runs.

For example, in this case, we are saying that we want to use **two parameters**, *lr* and *batch_size*, and for the batch_size we want to try two different values. This gives us a total of two training runs. Both runs will have the **same learning rate** while the **batch size varies**.

```python
params = OrderedDict(
    lr = [.01]
    ,batch_size = [1000, 2000]
)
```
For the results, we'd like to see and be able to compare both runs.

<table class="table table-sm table-hover">
                                                    <thead>
                                                        <tr>
                                                            <th>run</th>
                                                            <th>epoch</th>
                                                            <th>loss</th>
                                                            <th>accuracy</th>
                                                            <th>epoch duration</th>
                                                            <th>run duration</th>
                                                            <th>lr</th>
                                                            <th>batch_size</th>
                                                        </tr>
                                                    </thead>
                                                    <tbody>
                                                        <tr>
                                                            <td>1</td>
                                                            <td>1</td>
                                                            <td>0.983</td>
                                                            <td>0.618</td>
                                                            <td>48.697</td>
                                                            <td>50.563</td>
                                                            <td>0.01</td>
                                                            <td>1000</td>
                                                        </tr>
                                                        <tr>
                                                            <td>1</td>
                                                            <td>2</td>
                                                            <td>0.572</td>
                                                            <td>0.777</td>
                                                            <td>19.165</td>
                                                            <td>69.794</td>
                                                            <td>0.01</td>
                                                            <td>1000</td>
                                                        </tr>
                                                        <tr>
                                                            <td>1</td>
                                                            <td>3</td>
                                                            <td>0.468</td>
                                                            <td>0.827</td>
                                                            <td>19.366</td>
                                                            <td>89.252</td>
                                                            <td>0.01</td>
                                                            <td>1000</td>
                                                        </tr>
                                                        <tr>
                                                            <td>1</td>
                                                            <td>4</td>
                                                            <td>0.428</td>
                                                            <td>0.843</td>
                                                            <td>18.840</td>
                                                            <td>108.176</td>
                                                            <td>0.01</td>
                                                            <td>1000</td>
                                                        </tr>
                                                        <tr>
                                                            <td>1</td>
                                                            <td>5</td>
                                                            <td>0.389</td>
                                                            <td>0.857</td>
                                                            <td>19.082</td>
                                                            <td>127.320</td>
                                                            <td>0.01</td>
                                                            <td>1000</td>
                                                        </tr>
                                                        <tr>
                                                            <td>2</td>
                                                            <td>1</td>
                                                            <td>1.271</td>
                                                            <td>0.528</td>
                                                            <td>18.558</td>
                                                            <td>19.627</td>
                                                            <td>0.01</td>
                                                            <td>2000</td>
                                                        </tr>
                                                        <tr>
                                                            <td>2</td>
                                                            <td>2</td>
                                                            <td>0.623</td>
                                                            <td>0.757</td>
                                                            <td>19.822</td>
                                                            <td>39.520</td>
                                                            <td>0.01</td>
                                                            <td>2000</td>
                                                        </tr>
                                                        <tr>
                                                            <td>2</td>
                                                            <td>3</td>
                                                            <td>0.526</td>
                                                            <td>0.791</td>
                                                            <td>21.101</td>
                                                            <td>60.694</td>
                                                            <td>0.01</td>
                                                            <td>2000</td>
                                                        </tr>
                                                        <tr>
                                                            <td>2</td>
                                                            <td>4</td>
                                                            <td>0.478</td>
                                                            <td>0.814</td>
                                                            <td>20.332</td>
                                                            <td>81.110</td>
                                                            <td>0.01</td>
                                                            <td>2000</td>
                                                        </tr>
                                                        <tr>
                                                            <td>2</td>
                                                            <td>5</td>
                                                            <td>0.440</td>
                                                            <td>0.835</td>
                                                            <td>20.413</td>
                                                            <td>101.600</td>
                                                            <td>0.01</td>
                                                            <td>2000</td>
                                                        </tr>
                                                    </tbody>
                                                </table>

### The Two Classes We Will Build

To do this, we need to build two new classes. We built the first class called `RunBuilder` in the last episode. It's being called at the top.
```python
for run in RunBuilder.get_runs(params):
```

In [None]:
class RunBuilder():
    @staticmethod
    def get_runs(params):

        Run = namedtuple('Run', params.keys())

        runs = []
        for v in product(*params.values()):
            runs.append(Run(*v))

        return runs

Now, we need to build this `RunManager` class that will allow us to manage each run inside our run loop. The `RunManager` instance will allow us to **pull out** a lot of the tedious TensorBoard calls and allow us to add additional functionality as well.

We'll see that as our number of parameters and runs gets **larger**, TensorBoard will start to break down as a viable solution for reviewing our results.

The `RunManager` will be invoked at **different stages** inside each of our runs. We'll have calls at the start and end of both the **run** and the **epoch** phases. We'll also have calls to **track the loss** and the **number of correct predictions** inside each epoch. Finally, at the end, we'll **save the run results** to disk.
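The order of those calls can be sketched with a stand-in class that only records which hook was invoked. `CallRecorder`, the placeholder runs, and the loop sizes are hypothetical and exist only for this illustration. The method names follow the ones used in this episode, but their arguments are omitted here.

```python
class CallRecorder():
    """Hypothetical stand-in for RunManager that only logs hook names."""
    def __init__(self):
        self.calls = []

    def _log(self, name):
        self.calls.append(name)

    def begin_run(self): self._log('begin_run')
    def end_run(self): self._log('end_run')
    def begin_epoch(self): self._log('begin_epoch')
    def end_epoch(self): self._log('end_epoch')
    def track_loss(self): self._log('track_loss')
    def track_num_correct(self): self._log('track_num_correct')
    def save(self): self._log('save')

m = CallRecorder()
for run in ['run-a', 'run-b']:        # placeholder runs
    m.begin_run()
    for epoch in range(2):            # placeholder epoch count
        m.begin_epoch()
        for batch in range(3):        # placeholder batches
            # forward pass, loss, backward pass, and optimizer step go here
            m.track_loss()
            m.track_num_correct()
        m.end_epoch()
    m.end_run()
m.save()

print(m.calls[:4])
```

The training loop itself stays short because every bookkeeping concern sits behind one of these hooks.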

Let's see how to build this RunManager class.

## Building The `RunManager` For Training Loop Runs
Let's kick things off with our imports:

In [23]:
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
import torchvision
import torchvision.transforms as transforms

from torch.utils.data import DataLoader
from torch.utils.tensorboard import SummaryWriter
from IPython.display import display, clear_output
import pandas as pd
import time
import json

from itertools import product
from collections import namedtuple
from collections import OrderedDict

For now, we'll take **no arguments in the constructor**, and we'll just define some attributes that will enable us to keep **track** of data across runs and across epochs.

We'll track the following:
* The number of epochs.
* The running loss for an epoch.
* The number of correct predictions for an epoch.
* The start time of the epoch.

Remember we saw that the `RunManager` class has two methods with epoch in the name. We have `begin_epoch()` and `end_epoch()`. These two methods will allow us to manage these values across the **epoch lifecycle**.

Now, next we have some attributes for the runs. We have an attribute called `run_params`. This is the run definition in terms of the run parameters. Its value will be one of the **runs** returned by the `RunBuilder` class.

Next, we have attributes to track the `run_count`, and the `run_data`. The `run_count` gives us the **run number** and the `run_data` is a list we'll use to keep track of the **parameter values** and the **results of each epoch** for each run, and so we'll see that we add a value to this list for each epoch. Then, we have the `run_start_time` which will be used to calculate the **run duration**.

Alright, next we will save the network and the data loader that are being used for the run, as well as a `SummaryWriter` that we can use to save data for TensorBoard.

In [24]:
# First, we declare the class using the class keyword.
class RunManager():
    # Next, we'll define the class constructor
    def __init__(self):
        
        self.epoch_count = 0
        self.epoch_loss = 0
        self.epoch_num_correct = 0
        self.epoch_start_time = None
        
        self.run_params = None
        self.run_count = 0
        self.run_data = []
        self.run_start_time = None
        
        self.network = None
        self.loader = None
        self.tb = None

### What Are Code Smells?
Do you smell that? There's something that doesn't smell right about this code. Have you heard of code smells before? Have you smelled them? A code smell is a term used to describe a condition where something about the code in front of our eyes doesn't seem right. It's like a **gut feeling** for software developers.

A code smell **doesn't mean** that something is definitely **wrong**. A code smell **does not mean** the code is **incorrect**. It just means that there is likely a better way. In this case, the code smell is the fact that we have several variable names that have a **prefix**. The use of the prefix here indicates that the variables somehow belong together.

Anytime we see this, we need to be thinking about **removing** these prefixes. Data that belongs together should be together. This is done by **encapsulating the data inside of a `class`**. After all, if the data belongs together, object oriented languages give us the ability to express this fact using classes.

### Refactoring By Extracting A Class
It's fine to leave this code in now, but later we might want to refactor this code by doing what is referred to as extracting a class. This is a refactoring **technique** where we remove these prefixes and create a class called `Epoch`, that has these attributes, `count`, `loss`, `num_correct`, and `start_time`.
```python
class Epoch():
    def __init__(self):
        self.count = 0
        self.loss = 0
        self.num_correct = 0
        self.start_time = None 
```
Then, we'll replace these prefixed attributes with an **instance** of the `Epoch` class. We might even change the count variable to have a more intuitive name, like say `number` or `id`. The reason we can leave this now is because refactoring is an iterative process, and this is our first iteration.
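A minimal sketch of what that later iteration might look like, assuming the `Epoch` class above. The `RefactoredRunManager` name and its trimmed-down methods are hypothetical and are not the version built in this episode.

```python
import time

class Epoch():
    def __init__(self):
        self.count = 0
        self.loss = 0
        self.num_correct = 0
        self.start_time = None

class RefactoredRunManager():
    def __init__(self):
        # The four epoch_* attributes collapse into one object
        self.epoch = Epoch()

    def begin_epoch(self):
        self.epoch.start_time = time.time()
        self.epoch.count += 1
        self.epoch.loss = 0
        self.epoch.num_correct = 0

m = RefactoredRunManager()
m.begin_epoch()
print(m.epoch.count)
```

The prefix `epoch_` turns into the attribute path `self.epoch.`, so the grouping that the names only hinted at is now expressed in the structure of the code.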

### Extracting Classes Creates Layers Of Abstraction
Actually, what we are doing now by building this class is **extracting a class** from our **main training loop** program. The code smell that we were addressing is the fact that our loop was becoming **cluttered** and beginning to appear overly **complex**.

When we write a main program and then refactor it, we can think of this as creating layers of abstraction that make the main program more and more **readable** and easier to understand. Each part of the program should be very easy to understand in its own right.

When we **extract** code into its own **class** or **method**, we are creating additional **layers of abstraction**, and if we want to understand the implementation details of any of the layers, we dive in so to speak.

In an iterative way, we can think of ***starting* with one single program**, and then, ***later* extracting the code** that creates deeper and deeper layers. The process can be viewed as a *branching tree-like structure*.

### Beginning A Training Loop Run
Anyway, let's look at the first method of this class which extracts the code needed to begin a run.

In [25]:
def begin_run(self, run, network, loader):

    self.run_start_time = time.time()

    self.run_params = run
    self.run_count += 1

    self.network = network
    self.loader = loader
    self.tb = SummaryWriter(comment=f'-{run}')

    images, labels = next(iter(self.loader))
    grid = torchvision.utils.make_grid(images)

    self.tb.add_image('images', grid)
    self.tb.add_graph(self.network, images)

First, we capture the **start time** for the run. Then, we save the passed in run parameters and **increment** the run count by one. After this, we **save** our **network** and our **data loader**, and then, we initialize a `SummaryWriter` for TensorBoard. Notice how we are passing our **run** as the **comment** argument. This will allow us to **uniquely** identify our run inside TensorBoard.

Alright, next we just have some TensorBoard calls that we made in our training loop before. These calls add our network and a batch of images to TensorBoard.

When we end a run, all we have to do is **close** the TensorBoard handle and set the **epoch count** back to **zero** to be ready for the next run.

In [26]:
def end_run(self):
    self.tb.close()
    self.epoch_count = 0

For starting an epoch, we first save the start time. Then, we increment the `epoch_count` by one and set the `epoch_loss` and `epoch_num_correct` to zero.

In [27]:
def begin_epoch(self):
    self.epoch_start_time = time.time()
    
    self.epoch_count += 1
    self.epoch_loss = 0
    self.epoch_num_correct = 0

Now, let's look at where the bulk of the action occurs which is **ending** an epoch.

In [28]:
def end_epoch(self):
    epoch_duration = time.time() - self.epoch_start_time
    run_duration = time.time() - self.run_start_time
    
    loss = self.epoch_loss / len(self.loader.dataset)
    accuracy = self.epoch_num_correct / len(self.loader.dataset)
    
    self.tb.add_scalar('Loss', loss, self.epoch_count)
    self.tb.add_scalar('Accuracy',accuracy, self.epoch_count)
    
    for name,param in self.network.named_parameters():
        self.tb.add_histogram(name, param, self.epoch_count)
        self.tb.add_histogram(f'{name}.grad',param.grad, self.epoch_count)

We start by calculating the **epoch duration** and the **run duration**. Since we are at the **end** of an epoch, the epoch duration is final, but the **run duration** here represents the running time of the **current run**. The value will keep running until the run ends. However, we'll still **save** it with each epoch.

Next, we compute the `epoch_loss` and `accuracy`, and we do it relative to the size of the training set. This gives us the **average loss** per sample. Then, we pass both of these values to TensorBoard.
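To see why dividing by the dataset size gives a per-sample average, here is a small arithmetic sketch. It assumes each batch reports the mean loss over its samples (the default reduction for loss functions such as `F.cross_entropy`), so the running total is built by scaling each batch loss by its batch size. The numbers are made up.

```python
# Made-up (mean batch loss, batch size) pairs; the last batch is smaller
batches = [(0.9, 1000), (0.7, 1000), (0.5, 500)]
dataset_size = sum(size for _, size in batches)

# Scale each mean loss back to a per-batch sum, then accumulate
epoch_loss = 0
for batch_loss, batch_size in batches:
    epoch_loss += batch_loss * batch_size

# Dividing the summed loss by the number of samples gives the average
average_loss = epoch_loss / dataset_size
print(average_loss)
```

Weighting by batch size matters when the last batch is smaller than the others. A plain mean of the batch losses would give that batch too much influence.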

Next, we pass our network's **weights** and **gradient** values to TensorBoard like we did before.

### Tracking Our Training Loop Performance
We're now ready for what's new in this process. This is the part that we are adding to give us **additional insight** when we perform **large numbers** of **runs**. We're going to **save** all of the data ourselves so we can analyze it outside of TensorBoard.
```python
def end_epoch(self):
    ...

    results = OrderedDict()
    results["run"] = self.run_count
    results["epoch"] = self.epoch_count
    results['loss'] = loss
    results["accuracy"] = accuracy
    results['epoch duration'] = epoch_duration
    results['run duration'] = run_duration
    for k,v in self.run_params._asdict().items(): results[k] = v
    self.run_data.append(results)

    df = pd.DataFrame.from_dict(self.run_data, orient='columns')

    ...
```
Here, we are building a dictionary that contains the keys and values we care about for our run. We add in the `run_count`, the `epoch_count`, the `loss`, the `accuracy`, the `epoch_duration`, and the `run_duration`.

Then, we iterate over the keys and values inside our **run parameters** adding them to the **results dictionary**. This will allow us to see the parameters that are associated with the performance results.

Finally, we append the **results** to the `run_data` list.

Once the data is added to the list, we turn the data list into a `pandas` **data frame** so we can have formatted output.
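The merge step relies on the `_asdict()` method that every namedtuple provides. A self-contained sketch with made-up result values:

```python
from collections import OrderedDict, namedtuple

Run = namedtuple('Run', ['lr', 'batch_size'])
run_params = Run(lr=0.01, batch_size=1000)

results = OrderedDict()
results['run'] = 1          # made-up values for illustration
results['epoch'] = 1
results['loss'] = 0.983

# _asdict() maps each field name to its value, so the run's
# parameters become extra columns next to the metrics
for k, v in run_params._asdict().items():
    results[k] = v

print(list(results.keys()))
```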

In [29]:
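def end_epoch(self):
    # epoch_duration, run_duration, loss, and accuracy are computed
    # in the first part of end_epoch(), as shown above
    
    results = OrderedDict()
    results["run"] = self.run_count
    results["epoch"] = self.epoch_count
    results['loss'] = loss
    results["accuracy"] = accuracy
    results['epoch duration'] = epoch_duration
    results['run duration'] = run_duration
    # Copy the run's parameters into the results row
    for k,v in self.run_params._asdict().items():
        results[k] = v
    # Append once per epoch, after all parameters have been copied
    self.run_data.append(results)
    
    df = pd.DataFrame.from_dict(self.run_data, orient = 'columns')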

The next two lines are specific to **Jupyter notebook**. We clear the current output and display the new data frame.
```python
clear_output(wait=True)
display(df)
```

Alright, that ends an epoch. One thing you may be wondering is how the `epoch_loss` and `epoch_num_correct` values were **tracked**. Well, we have two methods just below for that.

We have a method called `track_loss()` and a method called `track_num_correct()`. These methods are called inside the training loop **after each batch**. The loss is passed into the `track_loss()` method and the predictions and labels are passed into the `track_num_correct()` method.

In [30]:
def track_loss(self, loss, batch):
    self.epoch_loss += loss.item() * batch[0].shape[0]
    
def track_num_correct(self, preds, labels):
    self.epoch_num_correct += self._get_num_correct(preds, labels)

To calculate the number of correct predictions, we are using the same `get_num_correct()` function that we defined in previous episodes. The difference here is that the function is now **encapsulated inside our RunManager class**.

In [31]:
def _get_num_correct(self, preds, labels):
    return preds.argmax(dim = 1).eq(labels).sum().item()

Lastly, we have a method called `save()` that **saves the `run_data`** in **two formats**, **JSON** and **CSV**. This output goes to disk and makes it available for other apps to consume. For example, we can open the CSV file in Excel, or we can even build our own, even better TensorBoard with the data.

In [32]:
def save(self,fileName):
    
    pd.DataFrame.from_dict(
        self.run_data, orient = 'columns'
    ).to_csv(f'{fileName}.csv')
    
    with open(f'{fileName}.json','w',encoding = 'utf-8') as f:
        json.dump(self.run_data, f, ensure_ascii = False,indent = 4)
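
If you want to try the save-and-reload round trip without pandas, here is a sketch that swaps in the standard library's `csv` module for `DataFrame.to_csv()`. The `run_data` rows and the standalone `save()` function are hypothetical stand-ins for the `RunManager` state and method. Unlike the pandas version, this one does not write an index column.

```python
import csv
import json
import os
import tempfile
from collections import OrderedDict

# Hypothetical run data, shaped like RunManager.run_data.
run_data = [
    OrderedDict(run=1, epoch=1, loss=0.56, accuracy=0.79, lr=0.01, batch_size=100),
    OrderedDict(run=1, epoch=2, loss=0.39, accuracy=0.86, lr=0.01, batch_size=100),
]

def save(run_data, file_name):
    # CSV: one row per epoch, columns taken from the first row's keys.
    with open(f'{file_name}.csv', 'w', newline='', encoding='utf-8') as f:
        writer = csv.DictWriter(f, fieldnames=list(run_data[0].keys()))
        writer.writeheader()
        writer.writerows(run_data)

    # JSON: the list of dictionaries is written as-is.
    with open(f'{file_name}.json', 'w', encoding='utf-8') as f:
        json.dump(run_data, f, ensure_ascii=False, indent=4)

# Write into a temporary directory and read both files back.
with tempfile.TemporaryDirectory() as tmp:
    base = os.path.join(tmp, 'results')
    save(run_data, base)

    with open(f'{base}.json', encoding='utf-8') as f:
        loaded_json = json.load(f)
    with open(f'{base}.csv', newline='', encoding='utf-8') as f:
        loaded_csv = list(csv.DictReader(f))

print(loaded_json[0]['loss'], loaded_csv[1]['epoch'])
```

One thing to keep in mind is that values read back from the CSV file are strings, while the JSON file preserves numbers.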

**Before:**

In [1]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

import torchvision
import torchvision.transforms as transforms

torch.set_printoptions(linewidth=120) # Display options for output
torch.set_grad_enabled(True) # Already on by default

from torch.utils.tensorboard import SummaryWriter

print(torch.__version__)
print(torchvision.__version__)

1.6.0
0.7.0


In [None]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

import torchvision
import torchvision.transforms as transforms

torch.set_printoptions(linewidth=120)  # Display options for output
torch.set_grad_enabled(True)  # Already on by default

from torch.utils.tensorboard import SummaryWriter
from itertools import product

def get_num_correct(preds, labels):
    return preds.argmax(dim=1).eq(labels).sum().item()


class Network(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(in_channels=1, out_channels=6, kernel_size=5)
        self.conv2 = nn.Conv2d(in_channels=6, out_channels=12, kernel_size=5)

        self.fc1 = nn.Linear(in_features=12 * 4 * 4, out_features=120)
        self.fc2 = nn.Linear(in_features=120, out_features=60)
        self.out = nn.Linear(in_features=60, out_features=10)

    def forward(self, t):
        t = t

        t = self.conv1(t)
        t = F.relu(t)
        t = F.max_pool2d(t,  kernel_size=2, stride=2)

        t = self.conv2(t)
        t = F.relu(t)
        t = F.max_pool2d(t, kernel_size=2, stride=2)

        t = t.reshape(-1,12*4*4)
        t = self.fc1(t)
        t = F.relu(t)

        t = self.fc2(t)
        t = F.relu(t)

        t = self.out(t)

        return t

train_set = torchvision.datasets.FashionMNIST(
    root='./data/FashionMNIST'
    ,download=True
    ,transform = transforms.Compose([transforms.ToTensor()])
)

parameters = dict(
    lr = [.01,.001]
    ,batch_size = [100, 1000]
    ,shuffle = [True, False]
)

param_values = [v for v in parameters.values()]

for lr,batch_size,shuffle in product(*param_values):
    comment = f'batch_size={batch_size} lr={lr} shuffle={shuffle}'

    network = Network()

    train_loader = torch.utils.data.DataLoader(
        train_set,batch_size = batch_size,shuffle = shuffle
    )

    optimizer = torch.optim.Adam(
        network.parameters(), lr=lr
    )

    images, labels = next(iter(train_loader))
    grid = torchvision.utils.make_grid(images)

    tb = SummaryWriter(comment=comment)
    tb.add_image('images',grid)
    tb.add_graph(network, images)

    for epoch in range(2):

        total_loss = 0
        total_correct = 0

        for batch in train_loader:
            images,labels = batch

            preds = network(images)
            loss = F.cross_entropy(preds,labels)

            optimizer.zero_grad()
            loss.backward()
            optimizer.step() # update weights

            total_loss += loss.item() * batch_size
            total_correct += get_num_correct(preds,labels)

        tb.add_scalar('Loss',total_loss,epoch)
        tb.add_scalar('Number Correct', total_correct, epoch)
        tb.add_scalar('Accuracy', total_correct / len(train_set), epoch)

        for name,weight in network.named_parameters():
            tb.add_histogram(name,weight,epoch)
            tb.add_histogram(f'{name}.grad',weight.grad,epoch)

        print("epoch", epoch, "total_correct:", total_correct, "loss:", total_loss)

    tb.close()

**After:**

In [None]:
import json
import time

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import torchvision
import torchvision.transforms as transforms
import pandas as pd


from torch.utils.tensorboard import SummaryWriter
from itertools import product
from collections import namedtuple, OrderedDict

torch.set_printoptions(linewidth=120)  # Display options for output
torch.set_grad_enabled(True)  # Already on by default




class Network(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(in_channels=1, out_channels=6, kernel_size=5)
        self.conv2 = nn.Conv2d(in_channels=6, out_channels=12, kernel_size=5)

        self.fc1 = nn.Linear(in_features=12 * 4 * 4, out_features=120)
        self.fc2 = nn.Linear(in_features=120, out_features=60)
        self.out = nn.Linear(in_features=60, out_features=10)

    def forward(self, t):
        t = t

        t = self.conv1(t)
        t = F.relu(t)
        t = F.max_pool2d(t,  kernel_size=2, stride=2)

        t = self.conv2(t)
        t = F.relu(t)
        t = F.max_pool2d(t, kernel_size=2, stride=2)

        t = t.reshape(-1,12*4*4)
        t = self.fc1(t)
        t = F.relu(t)

        t = self.fc2(t)
        t = F.relu(t)

        t = self.out(t)

        return t


class RunBuilder():
    @staticmethod
    def get_runs(params):
        Run = namedtuple('Run', params.keys())

        runs = []
        for v in product(*params.values()):
            runs.append(Run(*v))

        return runs

class RunManager():
    def __init__(self):
        self.epoch_count = 0
        self.epoch_loss = 0
        self.epoch_num_correct = 0
        self.epoch_start_time = None

        self.run_params = None
        self.run_count = 0
        self.run_data = []
        self.run_start_time = None

        self.network = None
        self.loader = None
        self.tb = None

    def begin_run(self,run, network, loader):

        self.run_start_time = time.time()

        self.run_params = run
        self.run_count += 1

        self.network = network
        self.loader = loader
        self.tb = SummaryWriter(comment=f'-{run}')

        images, labels = next(iter(self.loader))
        grid = torchvision.utils.make_grid(images)

        self.tb.add_image('images',grid)
        self.tb.add_graph(self.network, images)

    def end_run(self):
        self.tb.close()
        self.epoch_count = 0

    def begin_epoch(self):
        self.epoch_start_time = time.time()

        self.epoch_count += 1
        self.epoch_loss = 0
        self.epoch_num_correct = 0

    def end_epoch(self):

        epoch_duration = time.time() - self.epoch_start_time
        run_duration = time.time() - self.run_start_time

        loss = self.epoch_loss / len(self.loader.dataset)
        accuracy = self.epoch_num_correct / len(self.loader.dataset)

        self.tb.add_scalar('Loss',loss,self.epoch_count)
        self.tb.add_scalar('Accuracy',accuracy,self.epoch_count)

        for name,param in self.network.named_parameters():
            self.tb.add_histogram(name, param, self.epoch_count)
            self.tb.add_histogram(f'{name}.grad', param.grad, self.epoch_count)

        results = OrderedDict()
        results["run"] = self.run_count
        results["epoch"] = self.epoch_count
        results["loss"] = loss
        results["accuracy"] = accuracy
        results["epoch duration"] = epoch_duration
        results["run duration"] = run_duration
        for k,v in self.run_params._asdict().items():
            results[k] = v
        self.run_data.append(results)

        df = pd.DataFrame.from_dict(self.run_data,orient='columns')

    def get_num_correct(self,preds, labels):
        return preds.argmax(dim=1).eq(labels).sum().item()

    def track_loss(self,loss,batch):
        self.epoch_loss += loss.item() * batch[0].shape[0]

    def track_num_correct(self,preds,labels):
        self.epoch_num_correct += self.get_num_correct(preds,labels)

    def save(self, fileName):

        pd.DataFrame.from_dict(
            self.run_data,orient = 'columns'
        ).to_csv(f'{fileName}.csv')

        with open(f'{fileName}.json','w',encoding='utf-8') as f:
            json.dump(self.run_data,f, ensure_ascii=False, indent = 4)

train_set = torchvision.datasets.FashionMNIST(
    root='./data/FashionMNIST'
    ,download=True
    ,transform = transforms.Compose([transforms.ToTensor()])
)


params = OrderedDict(
    lr = [.01]
    ,batch_size = [100, 1000]
)

m = RunManager()

for run in RunBuilder.get_runs(params):

    network = Network()
    loader = torch.utils.data.DataLoader(train_set, batch_size = run.batch_size)

    optimizer = torch.optim.Adam(
        network.parameters(), lr=run.lr
    )
    
    m.begin_run(run,network,loader)

    for epoch in range(2):

        m.begin_epoch()
        for batch in loader:

            images,labels = batch
            preds = network(images)
            loss = F.cross_entropy(preds,labels)

            optimizer.zero_grad()
            loss.backward()
            optimizer.step() # update weights

            m.track_loss(loss,batch)
            m.track_num_correct(preds, labels)
        m.end_epoch()
    m.end_run()
m.save('results')

**results.csv**  
<table>
<tr><td></td><td>run</td><td>epoch</td><td>loss</td><td>accuracy</td><td>epoch duration</td><td>run duration</td><td>lr</td><td>batch_size</td></tr>
<tr><td>0</td><td>1</td><td>1</td><td>0.5638383385042349</td><td>0.7888166666666667</td><td>16.63007378578186</td><td>16.860456705093384</td><td>0.01</td><td>100</td></tr>
<tr><td>1</td><td>1</td><td>2</td><td>0.38631543077528474</td><td>0.85825</td><td>13.861908674240112</td><td>30.825090169906616</td><td>0.01</td><td>100</td></tr>
<tr><td>2</td><td>2</td><td>1</td><td>0.9647192517916362</td><td>0.63365</td><td>13.846972942352295</td><td>14.716645240783691</td><td>0.01</td><td>1000</td></tr>
<tr><td>3</td><td>2</td><td>2</td><td>0.5320606335997582</td><td>0.7906</td><td>13.615568399429321</td><td>28.440921545028687</td><td>0.01</td><td>1000</td></tr>
</table>



## Quiz 02
1. Groups of variables that share a prefix in their name are a code smell. Which refactoring technique can be used to improve code like this?  
```python
self.epoch_count = 0
self.epoch_loss = 0
self.epoch_num_correct = 0
self.epoch_start_time = None
```
* Extract class<br><br>

2. How many arguments does the run manager class constructor accept?
* 0<br><br>


3. In Python, the underscore prefix for methods of classes is used to signal what?
```python
def _get_num_correct(self, preds, labels):
    return preds.argmax(dim=1).eq(labels).sum().item()
```
* The method is meant to be used internally inside the class<br><br>

4. What does it feel like to be wrong?
* It feels the same way as it feels to be right.

# 03 PyTorch DataLoader Num_workers - Deep Learning Speed Limit Increase

## PyTorch DataLoader `num_workers` Test - Speed Things Up

**In this episode, we will see how we can speed up the neural network training process by utilizing the multiple process capabilities of the PyTorch DataLoader class.**

## Speeding Up The Training Process
To speed up the training process, we will make use of the optional `num_workers` attribute of the `DataLoader` class.

The `num_workers` attribute tells the data loader instance how many sub-processes to use for data loading. By default, the `num_workers` value is set to zero, and a value of zero tells the loader to load the data inside the main process.

This means that the training process will work sequentially inside the **main process**. After a batch is used during the training process and another one is needed, we read the batch data from disk.

Now, if we have a worker process, we can make use of the fact that our machine has **multiple cores**. This means that the next batch can already be loaded and ready to go by the time the main process is ready for another batch. This is where the speed up comes from. The batches are loaded using additional worker processes and are **queued up** in memory.
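
The queueing idea can be sketched with the standard library alone. Note the assumptions: this uses a **thread** and a `queue.Queue` as a stand-in, whereas the `DataLoader` uses separate worker processes, and the function names and fake batches are hypothetical. It is only meant to show the pattern of one side loading batches while the other side consumes them.

```python
import queue
import threading

def load_batches(num_batches, out_queue):
    """Stand-in for a data loading worker: produce batches and queue them."""
    for i in range(num_batches):
        batch = [i] * 4        # pretend this was read from disk
        out_queue.put(batch)   # blocks while the queue is full
    out_queue.put(None)        # sentinel: no more batches

def train(num_batches=5, prefetch=2):
    q = queue.Queue(maxsize=prefetch)  # batches queued up in memory
    worker = threading.Thread(
        target=load_batches, args=(num_batches, q), daemon=True
    )
    worker.start()

    seen = []
    while True:
        batch = q.get()          # usually ready by the time we ask for it
        if batch is None:
            break
        seen.append(sum(batch))  # stand-in for the forward and backward pass
    worker.join()
    return seen

print(train())
```

The bounded queue is the important part. The worker stays a couple of batches ahead of the consumer, but no further.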

### Optimal Value For The `num_workers` Attribute
The natural question that arises is, **how many** worker processes should we add? There are a lot of factors that can affect the optimal number here, so the best way to find out is to test.

## Testing Values For The `num_workers` Attribute
To set up this test, we'll create a list of `num_workers` values to try. We'll try the following values:
* 0 (default)
* 1
* 2
* 4
* 8
* 16

For each of these values, we'll vary the batch size by trying the following values:
* 100
* 1000
* 10000

For the learning rate, we'll keep it at a constant value of `.01` for all of the runs.

The last thing to mention about the setup here is the fact that we are only doing a single epoch for each of the runs.
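
Assuming we reuse the `RunBuilder` class from the previous episode, the setup described above could look something like the sketch below. The `params` dictionary here is reconstructed from the lists above rather than copied from the original test code, so treat it as an assumption.

```python
from collections import OrderedDict, namedtuple
from itertools import product

class RunBuilder():
    @staticmethod
    def get_runs(params):
        Run = namedtuple('Run', params.keys())
        return [Run(*v) for v in product(*params.values())]

# Assumed parameter set, based on the values listed above.
params = OrderedDict(
    lr = [.01]
    ,batch_size = [100, 1000, 10000]
    ,num_workers = [0, 1, 2, 4, 8, 16]
)

runs = RunBuilder.get_runs(params)

print(len(runs))  # 1 * 3 * 6 combinations
print(runs[0])
```

Inside the training loop, each run's value would then be passed along when building the loader, for example `DataLoader(train_set, batch_size=run.batch_size, num_workers=run.num_workers)`.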

Alright, let's see what we get.

## Different `num_workers` Values: Results
We can see the results down below. We completed a total of eighteen runs. We have three groups of differing batch sizes, and inside each of these groups, we varied the number of worker processes.
<table class="table table-sm table-hover">
                                                    <thead>
                                                        <tr style="text-align: right;">
                                                            <th>run</th>
                                                            <th>epoch</th>
                                                            <th>loss</th>
                                                            <th>accuracy</th>
                                                            <th>epoch duration</th>
                                                            <th>run duration</th>
                                                            <th>lr</th>
                                                            <th>batch_size</th>
                                                            <th>num_workers</th>
                                                        </tr>
                                                    </thead>
                                                    <tbody>
                                                        <tr>
                                                            <td>1</td>
                                                            <td>1</td>
                                                            <td>0.566253</td>
                                                            <td>0.782583</td>
                                                            <td>23.281029</td>
                                                            <td>23.374832</td>
                                                            <td>0.01</td>
                                                            <td>100</td>
                                                            <td>0</td>
                                                        </tr>
                                                        <tr>
                                                            <td>2</td>
                                                            <td>1</td>
                                                            <td>0.573350</td>
                                                            <td>0.783917</td>
                                                            <td>18.125359</td>
                                                            <td>18.965940</td>
                                                            <td>0.01</td>
                                                            <td>100</td>
                                                            <td>1</td>
                                                        </tr>
                                                        <tr>
                                                            <td>3</td>
                                                            <td>1</td>
                                                            <td>0.574852</td>
                                                            <td>0.782133</td>
                                                            <td>18.161020</td>
                                                            <td>19.037995</td>
                                                            <td>0.01</td>
                                                            <td>100</td>
                                                            <td>2</td>
                                                        </tr>
                                                        <tr>
                                                            <td>4</td>
                                                            <td>1</td>
                                                            <td>0.593246</td>
                                                            <td>0.775067</td>
                                                            <td>18.637056</td>
                                                            <td>19.669869</td>
                                                            <td>0.01</td>
                                                            <td>100</td>
                                                            <td>4</td>
                                                        </tr>
                                                        <tr>
                                                            <td>5</td>
                                                            <td>1</td>
                                                            <td>0.587598</td>
                                                            <td>0.777500</td>
                                                            <td>18.631994</td>
                                                            <td>20.123626</td>
                                                            <td>0.01</td>
                                                            <td>100</td>
                                                            <td>8</td>
                                                        </tr>
                                                        <tr>
                                                            <td>6</td>
                                                            <td>1</td>
                                                            <td>0.596401</td>
                                                            <td>0.775983</td>
                                                            <td>20.110439</td>
                                                            <td>22.930428</td>
                                                            <td>0.01</td>
                                                            <td>100</td>
                                                            <td>16</td>
                                                        </tr>
                                                        <tr>
                                                            <td>7</td>
                                                            <td>1</td>
                                                            <td>1.105825</td>
                                                            <td>0.577500</td>
                                                            <td>21.254815</td>
                                                            <td>21.941008</td>
                                                            <td>0.01</td>
                                                            <td>1000</td>
                                                            <td>0</td>
                                                        </tr>
                                                        <tr>
                                                            <td>8</td>
                                                            <td>1</td>
                                                            <td>1.013017</td>
                                                            <td>0.612267</td>
                                                            <td>15.961835</td>
                                                            <td>17.457127</td>
                                                            <td>0.01</td>
                                                            <td>1000</td>
                                                            <td>1</td>
                                                        </tr>
                                                        <tr>
                                                            <td>9</td>
                                                            <td>1</td>
                                                            <td>0.881558</td>
                                                            <td>0.666200</td>
                                                            <td>16.060656</td>
                                                            <td>17.614599</td>
                                                            <td>0.01</td>
                                                            <td>1000</td>
                                                            <td>2</td>
                                                        </tr>
                                                        <tr>
                                                            <td>10</td>
                                                            <td>1</td>
                                                            <td>1.034153</td>
                                                            <td>0.606767</td>
                                                            <td>16.206196</td>
                                                            <td>17.883490</td>
                                                            <td>0.01</td>
                                                            <td>1000</td>
                                                            <td>4</td>
                                                        </tr>
                                                        <tr>
                                                            <td>11</td>
                                                            <td>1</td>
                                                            <td>0.963817</td>
                                                            <td>0.626400</td>
                                                            <td>16.700765</td>
                                                            <td>18.882340</td>
                                                            <td>0.01</td>
                                                            <td>1000</td>
                                                            <td>8</td>
                                                        </tr>
                                                        <tr>
                                                            <td>12</td>
                                                            <td>1</td>
                                                            <td>1.046822</td>
                                                            <td>0.601683</td>
                                                            <td>17.912993</td>
                                                            <td>21.747298</td>
                                                            <td>0.01</td>
                                                            <td>1000</td>
                                                            <td>16</td>
                                                        </tr>
                                                        <tr>
                                                            <td>13</td>
                                                            <td>1</td>
                                                            <td>2.173913</td>
                                                            <td>0.265983</td>
                                                            <td>22.219368</td>
                                                            <td>27.145123</td>
                                                            <td>0.01</td>
                                                            <td>10000</td>
                                                            <td>0</td>
                                                        </tr>
                                                        <tr>
                                                            <td>14</td>
                                                            <td>1</td>
                                                            <td>2.156031</td>
                                                            <td>0.191167</td>
                                                            <td>16.563987</td>
                                                            <td>23.368729</td>
                                                            <td>0.01</td>
                                                            <td>10000</td>
                                                            <td>1</td>
                                                        </tr>
                                                        <tr>
                                                            <td>15</td>
                                                            <td>1</td>
                                                            <td>2.182048</td>
                                                            <td>0.210250</td>
                                                            <td>16.128202</td>
                                                            <td>23.030015</td>
                                                            <td>0.01</td>
                                                            <td>10000</td>
                                                            <td>2</td>
                                                        </tr>
                                                        <tr>
                                                            <td>16</td>
                                                            <td>1</td>
                                                            <td>2.245768</td>
                                                            <td>0.200683</td>
                                                            <td>16.248334</td>
                                                            <td>22.108252</td>
                                                            <td>0.01</td>
                                                            <td>10000</td>
                                                            <td>4</td>
                                                        </tr>
                                                        <tr>
                                                            <td>17</td>
                                                            <td>1</td>
                                                            <td>2.177970</td>
                                                            <td>0.206483</td>
                                                            <td>16.921782</td>
                                                            <td>23.897321</td>
                                                            <td>0.01</td>
                                                            <td>10000</td>
                                                            <td>8</td>
                                                        </tr>
                                                        <tr>
                                                            <td>18</td>
                                                            <td>1</td>
                                                            <td>2.153342</td>
                                                            <td>0.208017</td>
                                                            <td>18.555999</td>
                                                            <td>26.654219</td>
                                                            <td>0.01</td>
                                                            <td>10000</td>
                                                            <td>16</td>
                                                        </tr>
                                                    </tbody>
</table>
                                                
The main take-away from these results is that, across all three batch sizes, having a single worker process in addition to the main process resulted in a speed up of about twenty percent.

<center><b>20% Faster!</b></center>

Adding more worker processes after the first one didn't show any further improvement.

## Interpreting The Results
The twenty percent speed up that we see after adding a **single worker process** makes sense because the main process had less work to do.

While the **main process** is busy performing the forward and backward passes, the worker process is loading the next batch. By the time the main process is ready for another batch, the worker process already has it queued up in memory.

As a result, the main process doesn't have to read the data from disk. Instead, the data is already in memory, and this gives us the twenty percent speed up.

Now, why are we not seeing additional speed ups after adding more workers?

## Make It Go Faster With More Workers?
Well, if one worker is enough to keep the queue full of data for the main process, then adding more batches of data to the queue **isn't** going to do anything. This is what I think we are seeing here.

Just because we are adding more batches to the queue doesn't mean the batches are being processed faster. Thus, we are bounded by the time it takes to forward and backward propagate a given batch.
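
One way to reason about this bound is with a toy timing model. This is a simplification with made-up numbers, not a measurement. The `epoch_time()` function and the timings are hypothetical, and the model ignores the cost of starting and coordinating worker processes.

```python
def epoch_time(load_time, compute_time, num_workers):
    """Toy estimate of epoch duration in seconds (simplified model)."""
    if num_workers == 0:
        # Main process loads a batch, then trains on it, one after the other.
        return load_time + compute_time
    # Loading overlaps with training and is split across the workers,
    # so the slower of the two stages sets the pace.
    return max(compute_time, load_time / num_workers)

# Hypothetical timings, not measurements.
load_time, compute_time = 5.0, 18.0

for n in [0, 1, 2, 4]:
    print(n, epoch_time(load_time, compute_time, n))
```

In this model, the first worker hides the loading time completely, and every worker after that changes nothing because the forward and backward passes are already the bottleneck. The overhead that the model leaves out may be part of why things slow down again with many workers.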

We can even see that things start **bogging down** as we get to **16 workers**.

Hope this helps speed you up!

## Quiz 03
1. By default, the `num_workers` value is set to _______________.
* 0<br><br>

2. A value of _______________ tells the data loader to load the data inside the main process.
* 0<br><br>

3. If we have a worker process, we can make use of the fact that our machine has multiple _______________. This means that the next batch can be ready to go by the time the main process is ready for another batch.
* cores