# 04 Run PyTorch Code On A GPU - Neural Network Programming Guide

**In this episode, we're going to learn how to use the GPU with PyTorch. We'll see how to use the GPU in general, and we'll see how to apply these general techniques to training our neural network.**

## Using A GPU For Deep Learning
### PyTorch GPU Example
PyTorch allows us to seamlessly move data to and from our GPU as we perform computations inside our programs.

When we go to the GPU, we can use the `cuda()` method, and when we go to the CPU, we can use the `cpu()` method.

We can also use the `to()` method. To go to the GPU, we write `to('cuda')` and to go to the CPU, we write `to('cpu')`. The `to()` method is the preferred way mainly because it is more flexible. We'll see one example using the first two, and then we'll default to always using the `to()` variant.

| <center><b>CPU</b></center> | <center><b>GPU</b></center> |
| --- | --- |
| <center>`cpu()`</center> | <center>`cuda()`</center> |
| <center>`to('cpu')`</center> | <center>`to('cuda')`</center> |

To make use of our GPU during the training process, there are two essential requirements: the **data** must be moved to the GPU, and the **network** must be moved to the GPU.
1. Data on the GPU
2. Network on the GPU

By default, when a PyTorch tensor or a PyTorch neural network module is created, the corresponding data is initialized on the **CPU**. Specifically, the **data** exists inside the CPU's memory.

Now, let's create a tensor and a network, and see how we make the move from CPU to GPU.

Here, we create a tensor and a network:

In [1]:
import json
import time

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import torchvision
import torchvision.transforms as transforms
import pandas as pd


from torch.utils.tensorboard import SummaryWriter
from itertools import product
from collections import namedtuple, OrderedDict

torch.set_printoptions(linewidth=120)  # Display options for output
torch.set_grad_enabled(True)  # Already on by default

<torch.autograd.grad_mode.set_grad_enabled at 0x217323b7d60>

In [2]:
class Network(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(in_channels=1, out_channels=6, kernel_size=5)
        self.conv2 = nn.Conv2d(in_channels=6, out_channels=12, kernel_size=5)

        self.fc1 = nn.Linear(in_features=12 * 4 * 4, out_features=120)
        self.fc2 = nn.Linear(in_features=120, out_features=60)
        self.out = nn.Linear(in_features=60, out_features=10)

    def forward(self, t):
        t = t

        t = self.conv1(t)
        t = F.relu(t)
        t = F.max_pool2d(t,  kernel_size=2, stride=2)

        t = self.conv2(t)
        t = F.relu(t)
        t = F.max_pool2d(t, kernel_size=2, stride=2)

        t = t.reshape(-1,12*4*4)
        t = self.fc1(t)
        t = F.relu(t)

        t = self.fc2(t)
        t = F.relu(t)

        t = self.out(t)

        return t

In [3]:
t = torch.ones(1,1,28,28)
network = Network()

Now, we call the `cuda()` method and reassign the tensor and network to returned values that have been copied onto the GPU:

In [4]:
t = t.cuda()
network = network.cuda()

Next, we can get a prediction from the network and see that the prediction tensor's device attribute confirms that the data is on cuda, which is the GPU:

In [5]:
gpu_pred = network(t)
gpu_pred.device

device(type='cuda', index=0)

Likewise, we can go the **opposite** way:

In [6]:
t = t.cpu()
network = network.cpu()

cpu_pred = network(t)
cpu_pred.device

device(type='cpu')

This is, in a nutshell, how we can utilize the GPU capabilities of PyTorch. What we should turn to now are some important details that are lurking beneath the surface of the code we've just seen.

For example, although we've used the `cuda()` and `cpu()` methods, they actually **aren't our best options**. Furthermore, what's the difference between these methods on the **network instance** and on the **tensor instance**? These are, after all, different object types, which means the methods are different too. Finally, we want to integrate this code into a working example and do a performance test.

### General Idea Of Using A GPU
The **main takeaway** at this point is that our **network** and our **data** must **both exist on the GPU** in order to perform computations using the GPU, and this applies to any programming language or framework.
![CPUGPU](https://deeplizard.com/images/gpu%20vs%20cpu.jpg)
As we'll see in our next demonstration, this is **also true for the CPU**. GPUs and CPUs are compute devices that compute on data, and so any two values that are directly being used with one another in a computation, **must exist on the same device**.

## PyTorch `Tensor` Computations On A GPU
Let's dive deeper by demonstrating some tensor computations.

We'll start by creating two tensors:

In [7]:
t1 = torch.tensor([
    [1,2],
    [3,4]
])

t2 = torch.tensor([
    [5,6],
    [7,8]
])

Now, we'll check which **device** these tensors were **initialized** on by inspecting the device attribute:

In [8]:
t1.device, t2.device

(device(type='cpu'), device(type='cpu'))

As we'd expect, we see that, indeed, both tensors are on the **same device**, which is the CPU. Let's **move** the first tensor t1 to the **GPU**.

In [9]:
t1 = t1.to('cuda')
t1.device

device(type='cuda', index=0)

We can see that this tensor's device has been changed to `cuda`, the GPU. Note the use of the `to()` method here. Instead of calling a particular method to move to a device, we call the same method and pass an argument that specifies the device. Using the `to()` method is the preferred way of moving data to and from devices.

Also, note the reassignment. The operation is not in-place, and so the reassignment is required.
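To make the point about reassignment concrete, here is a minimal sketch. The `device` fallback is an assumption added so the snippet also runs on a machine without a GPU, and the name `moved` is purely illustrative.

```python
import torch

# Fall back to the CPU when no GPU is present (an assumption so the sketch runs anywhere)
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

t = torch.tensor([[1, 2], [3, 4]])

moved = t.to(device)   # returns a tensor on `device`; `t` itself is not modified

print(t.device)        # the original tensor is still on the CPU
print(moved.device)    # cuda:0 on a GPU machine, otherwise cpu
```

Without the assignment, the result of `t.to(device)` would simply be discarded. If the tensor already lives on the requested device, `to()` returns the same tensor rather than a copy.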

Let's try an experiment. I'd like to test what we discussed earlier by attempting to perform a computation on these **two tensors**, `t1` and `t2`, that we now know to be on **different devices**.

Since we expect an error, we'll wrap the call in a `try` block and catch the exception with `except`:

In [10]:
try:
    t1+t2
except Exception as e:
    print(e)

Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!


This error is telling us that the binary plus operation expects both of its arguments to be on the same device. Understanding the meaning of this error can help when debugging these types of device mismatches.

Finally, for completion, let's move the second tensor to the cuda device to see the operation succeed.

In [12]:
t2 = t2.to('cuda')
t1 + t2

tensor([[ 6,  8],
        [10, 12]], device='cuda:0')

## PyTorch `nn.Module` Computations On A GPU
We've just seen how tensors can be **moved** to and from devices. Now, let's see how this is done with PyTorch `nn.Module` instances.

More generally, we are interested in understanding **how** and **what** it means for a **network** to be on a device like a GPU or CPU. PyTorch aside, this is the essential issue.

We put a network on a device by moving the network's parameters to that device. Let's create a network and take a look at what we mean.

In [13]:
network = Network()

In [15]:
# Now, let's look at the network's parameters:
for name,param in network.named_parameters():
    print(name,'\t\t',param.shape)

conv1.weight 		 torch.Size([6, 1, 5, 5])
conv1.bias 		 torch.Size([6])
conv2.weight 		 torch.Size([12, 6, 5, 5])
conv2.bias 		 torch.Size([12])
fc1.weight 		 torch.Size([120, 192])
fc1.bias 		 torch.Size([120])
fc2.weight 		 torch.Size([60, 120])
fc2.bias 		 torch.Size([60])
out.weight 		 torch.Size([10, 60])
out.bias 		 torch.Size([10])


Here, we've created a PyTorch network, and we've iterated through the network's parameters. As we can see, the network's parameters are the **weights** and **biases** inside the network.

In other words, these are simply tensors that live on a device like we have already seen. Let's verify this by checking the **device** of each of the parameters.

In [16]:
for n,p in network.named_parameters():
    print(p.device,'',n)

cpu  conv1.weight
cpu  conv1.bias
cpu  conv2.weight
cpu  conv2.bias
cpu  fc1.weight
cpu  fc1.bias
cpu  fc2.weight
cpu  fc2.bias
cpu  out.weight
cpu  out.bias


This shows us that all the **parameters** inside the **network** are, by default, initialized on the **CPU**.

An important consideration of this is that it explains why `nn.Module` instances like networks don't actually have a device. **It's not the *network* that lives on a device**, but the ***tensors* inside the *network* that live on a device**.
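Since a module has no device of its own, one way to ask "where is this network?" is to look at its parameters. The helper below, `module_devices`, is a hypothetical name introduced for this sketch and is not part of PyTorch.

```python
import torch
import torch.nn as nn

def module_devices(module):
    """Return the set of devices that hold the module's parameters."""
    return {p.device for p in module.parameters()}

layer = nn.Linear(in_features=4, out_features=2)

print(hasattr(layer, 'device'))   # nn.Module defines no device attribute
print(module_devices(layer))      # the devices of the weight and bias tensors
```

Returning a set rather than a single value reflects the idea above: each parameter tensor carries its own device, so in principle a module's parameters could be spread over more than one.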

Let's see what happens when we ask a network to be moved `to()` the GPU:

In [17]:
network.to('cuda')

Network(
  (conv1): Conv2d(1, 6, kernel_size=(5, 5), stride=(1, 1))
  (conv2): Conv2d(6, 12, kernel_size=(5, 5), stride=(1, 1))
  (fc1): Linear(in_features=192, out_features=120, bias=True)
  (fc2): Linear(in_features=120, out_features=60, bias=True)
  (out): Linear(in_features=60, out_features=10, bias=True)
)

Note here that a **reassignment** was not required. This is because the operation is in-place as far as the network instance is concerned. However, this operation can be used as a reassignment operation. This is preferred for consistency between `nn.Module` instances and PyTorch tensors.
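The following sketch illustrates this behavior with a single layer standing in for the network. The `device` fallback is an assumption so that the snippet also runs without a GPU.

```python
import torch
import torch.nn as nn

# Fall back to the CPU when no GPU is present (an assumption so the sketch runs anywhere)
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

layer = nn.Linear(in_features=4, out_features=2)
returned = layer.to(device)   # moves the parameters and returns the module itself

print(returned is layer)                 # the same Python object
print(next(layer.parameters()).device)   # the parameters are now on `device`
```

Because the module itself is returned, writing `network = network.to(device)` is harmless and keeps the code looking the same for modules and tensors.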

Here, we can see that now all the network parameters have a device of `cuda`:

In [18]:
for n,p in network.named_parameters():
    print(p.device,'',n)

cuda:0  conv1.weight
cuda:0  conv1.bias
cuda:0  conv2.weight
cuda:0  conv2.bias
cuda:0  fc1.weight
cuda:0  fc1.bias
cuda:0  fc2.weight
cuda:0  fc2.bias
cuda:0  out.weight
cuda:0  out.bias


### Passing A Sample To The Network
Let's round off this demonstration by passing a **sample** to the network.

In [19]:
sample = torch.ones(1,1,28,28)
sample.shape

torch.Size([1, 1, 28, 28])

In [20]:
# This gives us a sample tensor we can pass like so:
try:
    network(sample)
except Exception as e:
    print(e)

Input type (torch.FloatTensor) and weight type (torch.cuda.FloatTensor) should be the same


Since our **network** is on the **GPU** and this newly created **sample** is on the **CPU** by **default**, we are getting an error. The error is telling us that the CPU tensor was expected to be a GPU tensor when calling the forward method of the first convolutional layer. This is precisely what we saw before when adding two tensors directly.

We can fix this issue by sending our sample to the GPU like so:

In [21]:
try:
    pred = network(sample.to('cuda'))
    print(pred)
except Exception as e:
    print(e)

tensor([[-0.0682, -0.1137,  0.0062, -0.1020, -0.1043, -0.1616,  0.0101, -0.0623, -0.1047, -0.0606]], device='cuda:0',
       grad_fn=<AddmmBackward>)


Finally, everything works as expected, and we get a prediction.
### Writing Device Agnostic PyTorch Code
Before we wrap up, we need to talk about writing device agnostic code. This term `device agnostic` means that our code **doesn't depend on the underlying device**. You may come across this terminology when reading PyTorch documentation.

For example, suppose we write code that uses the `cuda()` method everywhere, and then, we give the code to a user who **doesn't have a GPU**. This won't work. Don't worry. We've got options!

Remember earlier when we saw the `cuda()` and `cpu()` methods?

Well, one of the reasons that the `to()` method is preferred is that it is **parameterized**, and this makes it easier to **alter the device we are choosing**, i.e. it's flexible!

For example, a user could pass in `cpu` or `cuda` as an argument to a deep learning program, and this would allow the program to be device agnostic.

Allowing the user of a program to pass an argument that determines the program's behavior is perhaps the best way to make a program be device agnostic. However, we can also use PyTorch to check for a supported GPU, and set our devices that way.
```python
torch.cuda.is_available()
True
```
Like, if cuda is available, then use it!
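Both ideas, an explicit user choice and an automatic check, can be combined in a few lines. This is a minimal sketch; the function name `pick_device` and its `requested` parameter are hypothetical and only serve the illustration.

```python
import torch

def pick_device(requested=None):
    """Return the requested device, or the GPU when one is available, else the CPU."""
    if requested is not None:
        return torch.device(requested)
    return torch.device('cuda' if torch.cuda.is_available() else 'cpu')

device = pick_device()            # automatic choice
t = torch.ones(2, 2).to(device)   # the same line works on either kind of machine

print(t.device)
```

A program could pass a command-line argument straight through as `requested`, so that the user's choice wins when given and the availability check is used otherwise.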

## PyTorch GPU Training Performance Test
Let's see now how to add the use of a **GPU** to the **training loop**. We're going to be doing this addition with the code we've been developing so far in the series.

This will allow us to easily compare times, CPU vs GPU.

### Refactoring The RunManager Class
Before we update the training loop, we need to update the `RunManager` class. Inside the `begin_run()` method, we need to modify the **device** of the images **tensor** that is passed to the `add_graph()` method.

It should look like this:
```python
def begin_run(self, run, network, loader):
    
    self.run_start_time = time.time()
    
    self.run_params = run
    self.run_count += 1
    
    self.network = network
    self.loader = loader
    self.tb = SummaryWriter(comment=f'-{run}')
    
    images,labels = next(iter(self.loader))
    grid = torchvision.utils.make_grid(images)
    
    self.tb.add_image('images',grid)
    self.tb.add_graph(self.network,images.to(getattr(run, 'device', 'cpu')))
```

Here, we are using the `getattr()` **built-in function** to **get the value of the device** on the run object. If the run object **doesn't have a device**, then **cpu is returned**. This makes the **code backward compatible**. It will still work if we don't specify a device for our run.

Note that the **network doesn't need to be moved to a device** because its device was set before being passed in. However, the images tensor is obtained from the loader, so it still has to be moved.
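The fallback behavior of `getattr()` is easy to check in isolation. In the sketch below, `OldRun` and `NewRun` are hypothetical run definitions, one written before the device parameter existed and one after.

```python
from collections import namedtuple

# Hypothetical run definitions for illustration only
OldRun = namedtuple('OldRun', ['lr', 'batch_size'])
NewRun = namedtuple('NewRun', ['lr', 'batch_size', 'device'])

old_run = OldRun(lr=0.01, batch_size=1000)
new_run = NewRun(lr=0.01, batch_size=1000, device='cuda')

print(getattr(old_run, 'device', 'cpu'))  # → cpu  (attribute missing, default used)
print(getattr(new_run, 'device', 'cpu'))  # → cuda (attribute present, default ignored)
```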

### Refactoring The Training Loop
We'll set our configuration parameters to have a device. The two logical options here are `cuda` and `cpu`.
```python
params = OrderedDict(
    lr = [.01]
    ,batch_size = [1000, 10000, 20000]
    , num_workers = [0, 1]
    , device = ['cuda', 'cpu']
)
```
With these device values added to our configuration, they'll now be available to be accessed inside our training loop.
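As a quick sanity check, the sketch below expands this configuration into runs the same way the `RunBuilder` class in the full listing does, using `product()` and a `namedtuple`. It only counts and prints the runs; nothing is trained.

```python
from collections import OrderedDict, namedtuple
from itertools import product

params = OrderedDict(
    lr=[.01],
    batch_size=[1000, 10000, 20000],
    num_workers=[0, 1],
    device=['cuda', 'cpu'],
)

# One Run per combination of parameter values (the Cartesian product)
Run = namedtuple('Run', params.keys())
runs = [Run(*values) for values in product(*params.values())]

print(len(runs))        # 1 * 3 * 2 * 2 = 12 runs
print(runs[0])          # the first run uses the first value of every parameter
print(runs[1].device)   # the last parameter varies fastest
```

Because `device` is the last key, consecutive runs alternate between `cuda` and `cpu`, which makes the two devices easy to compare side by side in the results.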

At the top of our run, we'll create a device that will be passed around inside the run and inside the training loop.
```python
device = torch.device(run.device)
```
The first place we'll use this device is when **initializing our network**.
```python
network = Network().to(device)
```
This will ensure that the network is moved to the appropriate device. Finally, we'll update our `images` and `labels` tensors by unpacking them separately and sending them to the device like so:
```python
images = batch[0].to(device)
labels = batch[1].to(device)
```

**Code:**

In [None]:
import json
import time

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import torchvision
import torchvision.transforms as transforms
import pandas as pd


from torch.utils.tensorboard import SummaryWriter
from itertools import product
from collections import namedtuple, OrderedDict

torch.set_printoptions(linewidth=120)  # Display options for output
torch.set_grad_enabled(True)  # Already on by default

class Network(nn.Module):
    def __init__(self):
        super(Network, self).__init__()
        self.conv1 = nn.Conv2d(in_channels=1, out_channels=6, kernel_size=5)
        self.conv2 = nn.Conv2d(in_channels=6, out_channels=12, kernel_size=5)

        self.fc1 = nn.Linear(in_features=12 * 4 * 4,out_features=120)
        self.fc2 = nn.Linear(in_features=120, out_features=60)
        self.out = nn.Linear(in_features=60, out_features=10)

    def forward(self,t):
        t = t

        t = self.conv1(t)
        t = F.relu(t)
        t = F.max_pool2d(t, kernel_size=2, stride=2)

        t = self.conv2(t)
        t = F.relu(t)
        t = F.max_pool2d(t, kernel_size = 2,stride = 2)

        t = t.reshape(-1,12*4*4)
        t = self.fc1(t)
        t = F.relu(t)

        t = self.fc2(t)
        t = F.relu(t)

        t = self.out(t)

        return t


class RunBuilder():
    @staticmethod
    def get_runs(params):
        Run = namedtuple('Run',params.keys())

        runs = []
        for v in product(*params.values()):
            runs.append(Run(*v))

        return runs

class RunManager():
    def __init__(self):
        self.epoch_count = 0
        self.epoch_loss = 0
        self.epoch_num_correct = 0
        self.epoch_start_time = None

        self.run_params = None
        self.run_count = 0
        self.run_data = []
        self.run_start_time = None

        self.network = None
        self.loader = None
        self.tb = None

    def begin_run(self, run, network, loader):

        self.run_start_time = time.time()
        self.run_params = run
        self.run_count += 1

        self.network = network
        self.loader = loader
        self.tb = SummaryWriter(comment=f'-{run}')

        images,labels = next(iter(self.loader))
        grid = torchvision.utils.make_grid(images)

        self.tb.add_image('images',grid)
        self.tb.add_graph(self.network, images.to(getattr(run,'device','cpu')))

    def end_run(self):
        self.tb.close()
        self.epoch_count = 0

    def begin_epoch(self):
        self.epoch_start_time = time.time()

        self.epoch_count += 1
        self.epoch_loss = 0
        self.epoch_num_correct = 0

    def end_epoch(self):

        epoch_duration = time.time() - self.epoch_start_time
        run_duration = time.time() - self.run_start_time

        loss = self.epoch_loss / len(self.loader.dataset)
        accuracy = self.epoch_num_correct / len(self.loader.dataset)

        self.tb.add_scalar('Loss',loss,self.epoch_count)
        self.tb.add_scalar('Accuracy',accuracy,self.epoch_count)

        for name,param in self.network.named_parameters():
            self.tb.add_histogram(name,param, self.epoch_count)
            self.tb.add_histogram(f'{name}.grad',param.grad, self.epoch_count)

        results = OrderedDict()
        results["run"] = self.run_count
        results["epoch"] = self.epoch_count
        results["loss"] = loss
        results["accuracy"] = accuracy
        results["epoch duration"] = epoch_duration
        results["run duration"] = run_duration
        for k,v in self.run_params._asdict().items():  # copy the run's hyperparameters into the results row
            results[k] = v
        self.run_data.append(results)

        df = pd.DataFrame.from_dict(self.run_data,orient='columns')

    def get_num_correct(self, preds, labels):
        return preds.argmax(dim=1).eq(labels).sum().item()

    def track_loss(self,loss,batch):
        self.epoch_loss += loss.item() * batch[0].shape[0]

    def track_num_correct(self,preds, labels):
        self.epoch_num_correct += self.get_num_correct(preds,labels)

    def save(self, fileName):
        pd.DataFrame.from_dict(self.run_data,orient='columns').to_csv(f'{fileName}.csv')

        with open(f'{fileName}.json','w',encoding='utf-8') as f:
            json.dump(self.run_data, f, ensure_ascii=False, indent=4)



train_set = torchvision.datasets.FashionMNIST(
    root = './data/FashionMNIST',download=True,transform=transforms.Compose([transforms.ToTensor()])
)

params = OrderedDict(
    lr = [.01]
    ,batch_size = [1000,10000,20000]
    ,num_workers = [0,1]
    , device = ['cuda','cpu']
)

m = RunManager()

for run in RunBuilder.get_runs(params):

    network = Network().to(run.device)
    loader = torch.utils.data.DataLoader(train_set, batch_size=run.batch_size, num_workers=run.num_workers)

    optimizer = torch.optim.Adam(network.parameters(), lr=run.lr)

    m.begin_run(run,network,loader)

    for epoch in range(2):

        m.begin_epoch()
        for batch in loader:
            images = batch[0].to(run.device)
            labels = batch[1].to(run.device)
            preds = network(images)
            loss = F.cross_entropy(preds,labels)

            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

            m.track_loss(loss,batch)
            m.track_num_correct(preds, labels)
        m.end_epoch()
    m.end_run()
m.save('result_GPU')




Results:
<table>
<tr><td></td><td>run</td><td>epoch</td><td>loss</td><td>accuracy</td><td>epoch duration</td><td>run duration</td><td>lr</td><td>batch_size</td><td>num_workers</td><td>device</td></tr>
<tr><td>0</td><td>1</td><td>1</td><td>1.028867890437444</td><td>0.61065</td><td>7.907843589782715</td><td>10.033156394958496</td><td>0.01</td><td>1000</td><td>0</td><td>cuda</td></tr>
<tr><td>1</td><td>1</td><td>2</td><td>0.5684726412097613</td><td>0.7791833333333333</td><td>7.863961696624756</td><td>18.047714710235596</td><td>0.01</td><td>1000</td><td>0</td><td>cuda</td></tr>
<tr><td>2</td><td>2</td><td>1</td><td>1.1850317627191544</td><td>0.5521833333333334</td><td>13.191706418991089</td><td>14.168094873428345</td><td>0.01</td><td>1000</td><td>0</td><td>cpu</td></tr>
<tr><td>3</td><td>2</td><td>2</td><td>0.6565005630254745</td><td>0.7424166666666666</td><td>12.927437782287598</td><td>27.199254989624023</td><td>0.01</td><td>1000</td><td>0</td><td>cpu</td></tr>
<tr><td>4</td><td>3</td><td>1</td><td>1.0353816469510397</td><td>0.5979166666666667</td><td>7.899864435195923</td><td>8.667809009552002</td><td>0.01</td><td>1000</td><td>1</td><td>cuda</td></tr>
<tr><td>5</td><td>3</td><td>2</td><td>0.5281389872233073</td><td>0.7974833333333333</td><td>7.695444345474243</td><td>16.482932567596436</td><td>0.01</td><td>1000</td><td>1</td><td>cuda</td></tr>
<tr><td>6</td><td>4</td><td>1</td><td>1.0003941506147385</td><td>0.6135</td><td>12.406805515289307</td><td>13.301412343978882</td><td>0.01</td><td>1000</td><td>1</td><td>cpu</td></tr>
<tr><td>7</td><td>4</td><td>2</td><td>0.5597394168376922</td><td>0.7819333333333334</td><td>12.996736526489258</td><td>26.39090061187744</td><td>0.01</td><td>1000</td><td>1</td><td>cpu</td></tr>
<tr><td>8</td><td>5</td><td>1</td><td>2.1817620595296225</td><td>0.21105</td><td>9.979301452636719</td><td>14.71061372756958</td><td>0.01</td><td>10000</td><td>0</td><td>cuda</td></tr>
<tr><td>9</td><td>5</td><td>2</td><td>1.5009960730870564</td><td>0.41345</td><td>7.790157794952393</td><td>22.617460012435913</td><td>0.01</td><td>10000</td><td>0</td><td>cuda</td></tr>
<tr><td>10</td><td>6</td><td>1</td><td>2.191338042418162</td><td>0.25776666666666664</td><td>12.654212713241577</td><td>20.27781581878662</td><td>0.01</td><td>10000</td><td>0</td><td>cpu</td></tr>
<tr><td>11</td><td>6</td><td>2</td><td>1.5385146339734395</td><td>0.4116166666666667</td><td>13.750212907791138</td><td>34.12576651573181</td><td>0.01</td><td>10000</td><td>0</td><td>cpu</td></tr>
<tr><td>12</td><td>7</td><td>1</td><td>2.0937188069025674</td><td>0.24205</td><td>10.781154155731201</td><td>16.329310655593872</td><td>0.01</td><td>10000</td><td>1</td><td>cuda</td></tr>
<tr><td>13</td><td>7</td><td>2</td><td>1.6782972415288289</td><td>0.3495666666666667</td><td>8.667845487594604</td><td>25.11484146118164</td><td>0.01</td><td>10000</td><td>1</td><td>cuda</td></tr>
<tr><td>14</td><td>8</td><td>1</td><td>2.181113620599111</td><td>0.18073333333333333</td><td>12.430742979049683</td><td>20.360525608062744</td><td>0.01</td><td>10000</td><td>1</td><td>cpu</td></tr>
<tr><td>15</td><td>8</td><td>2</td><td>1.4258009195327759</td><td>0.4513</td><td>12.419771671295166</td><td>32.86806273460388</td><td>0.01</td><td>10000</td><td>1</td><td>cpu</td></tr>
<tr><td>16</td><td>9</td><td>1</td><td>2.281795342763265</td><td>0.113</td><td>12.738913536071777</td><td>21.488025665283203</td><td>0.01</td><td>20000</td><td>0</td><td>cuda</td></tr>
<tr><td>17</td><td>9</td><td>2</td><td>1.8872746229171753</td><td>0.33266666666666667</td><td>7.82509183883667</td><td>29.428784132003784</td><td>0.01</td><td>20000</td><td>0</td><td>cuda</td></tr>
<tr><td>18</td><td>10</td><td>1</td><td>2.276853322982788</td><td>0.1421</td><td>14.396483182907104</td><td>28.404006242752075</td><td>0.01</td><td>20000</td><td>0</td><td>cpu</td></tr>
<tr><td>19</td><td>10</td><td>2</td><td>1.9167550802230835</td><td>0.29265</td><td>13.226643323898315</td><td>41.743351221084595</td><td>0.01</td><td>20000</td><td>0</td><td>cpu</td></tr>
<tr><td>20</td><td>11</td><td>1</td><td>2.2801879247029624</td><td>0.24583333333333332</td><td>12.877546787261963</td><td>22.319284915924072</td><td>0.01</td><td>20000</td><td>1</td><td>cuda</td></tr>
<tr><td>21</td><td>11</td><td>2</td><td>1.8660159905751545</td><td>0.39941666666666664</td><td>7.981645345687866</td><td>30.41063666343689</td><td>0.01</td><td>20000</td><td>1</td><td>cuda</td></tr>
<tr><td>22</td><td>12</td><td>1</td><td>2.291093111038208</td><td>0.15393333333333334</td><td>13.899812459945679</td><td>27.670968294143677</td><td>0.01</td><td>20000</td><td>1</td><td>cpu</td></tr>
<tr><td>23</td><td>12</td><td>2</td><td>1.9686975479125977</td><td>0.35846666666666666</td><td>12.957333087921143</td><td>40.71407151222229</td><td>0.01</td><td>20000</td><td>1</td><td>cpu</td></tr>
</table>
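One way to read the table at a glance is to average the epoch durations per device. The sketch below hard-codes the eight epoch durations from the `batch_size = 1000` rows above, rounded to three decimals. These are the timings from this particular machine, so the ratio should be taken as an illustration rather than a general result.

```python
from statistics import mean

# Epoch durations in seconds, copied from the batch_size = 1000 rows of the table
epoch_durations = {
    'cuda': [7.908, 7.864, 7.900, 7.695],
    'cpu': [13.192, 12.927, 12.407, 12.997],
}

for device, durations in epoch_durations.items():
    print(f'{device}: {mean(durations):.2f} s per epoch')

speedup = mean(epoch_durations['cpu']) / mean(epoch_durations['cuda'])
print(f'CPU / GPU time ratio: {speedup:.2f}')
```

With the saved `result_GPU.csv` file at hand, the same summary over all rows could be obtained with pandas, for example by grouping on the `device` column and averaging the `epoch duration` column.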


## Quiz 04
1. When we move data to the GPU, we can use the cuda() method.
```python
network = Network().cuda()
```
* True<br><br>

2. In neural network programming, it is ideal to put the data on the GPU while leaving the network on the CPU. This speeds up processing!  
* False

3. By default, when a PyTorch tensor or a PyTorch neural network module is created, the corresponding data is initialized on the _______________.
* CPU

4. GPUs and CPUs are compute devices that compute on data, and any two values that are directly being used with one another in a computation, must exist on the same device.
* True

5. By default, PyTorch initializes tensors on the _______________.
* CPU

6. What's the significance of the `0` in the `cuda` device below?
```python
> t2 = t2.to('cuda')
> t1 + t2

tensor([[ 6,  8],
      [10, 12]], device='cuda:0')
```
* Given multiple GPUs, it tells us which one

7. If a PyTorch program is device agnostic, the program will only run on machines that have a GPU.
* False

8. PyTorch `tensors` and PyTorch `nn.Module` instances both have device attributes.
* False

---
---

# 05 PyTorch Dataset Normalization - Torchvision.Transforms.Normalize()
**In this episode, we're going to learn how to normalize a dataset. We'll see how dataset normalization is carried out in code, and we'll see how normalization affects the neural network training process.**

## Data Normalization
The idea of data [normalization](https://en.wikipedia.org/wiki/Normalization_(statistics)) is a **general concept** that refers to the act of **transforming** the original values of a dataset to new values. The new values are typically encoded relative to the dataset itself and are scaled in some way.

### Feature Scaling
For this reason, another name for data normalization that is sometimes used is [feature scaling](https://en.wikipedia.org/wiki/Feature_scaling). This term refers to the fact that when normalizing data, we often transform different features of a given dataset to a similar scale.

In this case, we are not just thinking of a dataset of values but rather **a dataset of elements** that have **multiple features**, each with its own value.

Suppose for example that we are dealing with a dataset of **people**, and we have two relevant features in our dataset, **age** and **weight**. In this case, we can observe that the **magnitudes** or **scales** of these two feature sets are **different**, i.e., the weights on average are larger than the ages.

This difference in magnitude can be **problematic** when comparing or computing using machine learning algorithms. Hence, this can be **one reason** we might want to scale the values of these features to some similar scale via feature scaling.

### Normalization Example
When we normalize a dataset, we said that we typically encode some form of information about each particular value relative to the dataset at large and rescale the data. Let's consider an example.

Suppose we have a set $S$ of positive numbers. Now, suppose we choose a random value $x$ from the set $S$ and ask the following question:<br><br>
<center><b>Is this value $x$ the largest member of the set $S$?</b></center>

In this case, the answer is that **we don't know**. We simply don't have enough information to answer the question.

However, let's suppose now that we are told that the set $S$ has been normalized by **dividing** every value by **the largest value** inside the set. Given this normalization process, the information of which value is largest has been encoded and the data has been rescaled.

The **largest** member of the set is **1**, and the data has been scaled to the interval $[0, 1]$.
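A tiny worked example of this kind of normalization, using an illustrative set of three positive numbers:

```python
import torch

s = torch.tensor([2.0, 5.0, 10.0])   # an illustrative set of positive numbers
normalized = s / s.max()             # divide every value by the largest one

print(normalized)        # tensor([0.2000, 0.5000, 1.0000])
print(normalized.max())  # tensor(1.)
```

After the division, a value of 1 identifies the largest member, and every other value reads as a fraction of it.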

### What Is Standardization
Data [standardization](https://en.wikipedia.org/wiki/Standard_score) is a specific type of normalization technique. It is sometimes referred to as **z-score normalization**. The z-score, a.k.a. **standard score**, is the transformed value for each data point.

To normalize a dataset using standardization, we take every value $x$ inside the dataset and transform it to its corresponding $z$ value using the following formula:
$$z=\frac{x-mean}{std}$$

After performing this computation on every $x$ value inside our dataset, we have a new normalized dataset of $z$ values. The mean and standard deviation values are with respect to the dataset as a whole.

Suppose that a given set $S$ of numbers has $n$ members.

The mean of the set $S$ is given by the following equation:
$$ mean = \frac{1}{n} \left( \sum_{i=1}^{n} x_{i} \right) $$
The standard deviation of the set $S$ is given by the following equation:
$$ std = \sqrt{\frac{1}{n} \left(\sum\limits_{i=1}^{n} \left( x_{i}-mean \right) ^{2}\right)} $$
Just as normalizing by dividing by the largest value had the effect of transforming the largest value to **1**, this standardization process transforms the dataset's mean value to **0** and its standard deviation to **1**.
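The formulas above can be checked numerically on a small illustrative set. One detail in this sketch is worth flagging: PyTorch's `std()` applies Bessel's correction (division by $n-1$) by default, so `unbiased=False` is passed here to match the $\frac{1}{n}$ form of the formula.

```python
import torch

x = torch.tensor([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])   # illustrative values

mean = x.mean()
std = x.std(unbiased=False)   # divide by n, as in the formula above
z = (x - mean) / std          # z-score of every value

print(mean, std)
print(z.mean(), z.std(unbiased=False))   # approximately 0 and 1
```

For a dataset with many values, such as all the pixels of a training set, the difference between dividing by $n$ and by $n-1$ is negligible.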

It's **important** to note that when we normalize a dataset, we typically **group** these operations by **feature**. This means that the mean and standard deviation values are relative to **each feature set** that's being normalized. If we are working with **images**, the features are the **RGB color channels**, so we **normalize each color channel** with respect to the **mean** and **standard deviation** values calculated across **all pixels** in every **image** for the respective **color channel**.
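A minimal sketch of per-channel standardization, assuming a batch laid out as `(N, C, H, W)`. The random tensor is only a stand-in for real RGB images.

```python
import torch

torch.manual_seed(0)
images = torch.rand(8, 3, 4, 4)   # illustrative batch: 8 images, 3 channels, 4x4 pixels

# Reduce over the batch, height and width axes, leaving one value per channel
mean = images.mean(dim=(0, 2, 3))
std = images.std(dim=(0, 2, 3), unbiased=False)

# Reshape to (1, C, 1, 1) so the statistics broadcast over every pixel of their channel
normalized = (images - mean[None, :, None, None]) / std[None, :, None, None]

print(mean.shape, std.shape)                           # one entry per channel
print(normalized.mean(dim=(0, 2, 3)))                  # approximately 0 per channel
print(normalized.std(dim=(0, 2, 3), unbiased=False))   # approximately 1 per channel
```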

## Normalize A Dataset In Code
Let's jump into a code example. The first step is to initialize our dataset, so in this example we'll use the Fashion MNIST dataset that we've been working with up to this point in the series.
```python
train_set = torchvision.datasets.FashionMNIST(
    root = './data'
    ,train=True
    ,download = True
    ,transform = transforms.Compose([
        transforms.ToTensor()
    ])
)
```
PyTorch allows us to normalize our dataset using the **standardization process** we've just seen by passing in the mean and standard deviation values for **each color channel** to the `Normalize()` transform.
```python
torchvision.transforms.Normalize(
    [meanOfChannel1, meanOfChannel2, meanOfChannel3] 
    , [stdOfChannel1, stdOfChannel2, stdOfChannel3] 
)
```

Since the images inside our dataset only have a **single channel**, we only need to pass in a single mean value and a single standard deviation value. In order to do this, we first need to calculate these values. Sometimes the values might be posted online somewhere, so we can get them that way. However, when in doubt, we can just calculate them manually.

There are two ways it can be done. The easy way, and the harder way. The **easy way** can be achieved if the dataset is **small enough to fit into memory all at once**. **Otherwise**, we have to **iterate over the data** which is slightly harder.

### Calculating `mean` And `std` The Easy Way
The easy way is easy. All we have to do is load the dataset using the data loader and get **a single batch tensor** that contains **all the data**. To do this we set the **batch size** to be equal to the **training set length**.
```python
loader = torch.utils.data.DataLoader(train_set, batch_size = len(train_set), num_workers = 1)
data = next(iter(loader))
data[0].mean(),data[0].std()
```

In [69]:
import json
import time

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import torchvision
import torchvision.transforms as transforms
import pandas as pd


from torch.utils.tensorboard import SummaryWriter
from itertools import product
from collections import namedtuple, OrderedDict
from IPython import display

torch.set_printoptions(linewidth=120)  # Display options for output
torch.set_grad_enabled(True)  # Already on by default

<torch.autograd.grad_mode.set_grad_enabled at 0x276aad111f0>

**from IPython import display**: don't leave this import out. `RunManager` uses it below to refresh the results table.

In [2]:
train_set = torchvision.datasets.FashionMNIST(
    root='./data'
    ,train=True
    ,download=True
    ,transform=transforms.Compose([
        transforms.ToTensor()
    ])
)

In [6]:
loader = torch.utils.data.DataLoader(train_set, batch_size = len(train_set), num_workers = 1)
data = next(iter(loader))
data[0].mean(),data[0].std()  # data[0] holds the images, data[1] holds the labels

(tensor(0.2860), tensor(0.3530))

Here, we can obtain the mean and standard deviation values by simply using the corresponding PyTorch tensor methods.
### Calculating `mean` And `std` The Hard Way
The hard way is hard because we need to **manually** implement the formulas for the mean and standard deviation and **iterate** over smaller batches of the dataset.

First, we create a data loader with a **smaller batch size**.

In [29]:
loader = torch.utils.data.DataLoader(train_set,batch_size = 1000, num_workers = 1)

In [35]:
loader

<torch.utils.data.dataloader.DataLoader at 0x276aad096a0>

Then, we calculate our $n$ value or **total number** of **pixels**:

In [30]:
num_of_pixels = len(train_set) * 28 * 28

Note that $28 \times 28$ is the height times the width of the images inside our dataset. Now, we **sum** the **pixel values** by **iterating over each batch**, and we calculate the **mean** by **dividing** this **sum** by the total number of pixels. Because these are single-channel grayscale images converted with `ToTensor()`, the pixel values lie between 0 and 1.

In [31]:
total_sum = 0
for batch in loader:
    total_sum += batch[0].sum()  # batch[0] is the tensor of image pixels
mean = total_sum / num_of_pixels

In [36]:
mean

tensor(0.2860)

Next, we calculate the sum of the **squared errors** by iterating through each batch. Dividing this sum by the total number of pixels and taking the square root of the result gives us the standard deviation.

In [37]:
sum_of_squared_error = 0
for batch in loader:
    sum_of_squared_error += ((batch[0] - mean).pow(2)).sum()  # sum of squared errors
std = torch.sqrt(sum_of_squared_error / num_of_pixels)

In [38]:
mean,std

(tensor(0.2860), tensor(0.3530))
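The two loops above can also be folded into a single pass by accumulating both the sum and the sum of squares, then using the identity $\sigma^2 = E[x^2] - (E[x])^2$. This is an alternative technique, not the approach used in this notebook. The sketch below runs on a synthetic `TensorDataset` (an assumption for illustration) rather than Fashion MNIST.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Synthetic stand-in for an image dataset: 100 single-channel 28x28 images.
torch.manual_seed(0)
images = torch.rand(100, 1, 28, 28)
labels = torch.zeros(100, dtype=torch.long)
loader = DataLoader(TensorDataset(images, labels), batch_size=10)

num_of_pixels = images.numel()

# Accumulate in float64 to limit rounding error in the running sums.
total_sum = torch.tensor(0.0, dtype=torch.float64)
total_sum_of_squares = torch.tensor(0.0, dtype=torch.float64)
for batch in loader:
    pixels = batch[0].double()
    total_sum += pixels.sum()
    total_sum_of_squares += pixels.pow(2).sum()

mean = total_sum / num_of_pixels
# Population variance: E[x^2] - (E[x])^2
std = torch.sqrt(total_sum_of_squares / num_of_pixels - mean.pow(2))

print(mean.item(), std.item())
```

The trade-off is that this form is more sensitive to floating-point rounding than the two-pass version, which is why the sums are kept in `float64` here.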

### Using The `mean` And `std` Values
Our task is to use these values to transform the pixel values inside our dataset to their corresponding standardized values. To do this, we create a new `train_set`, only this time we pass a **normalization transform** to the transforms composition.

In [39]:
train_set_normal = torchvision.datasets.FashionMNIST(
    root='./data'
    ,train = True
    ,download=True
    ,transform=transforms.Compose([
        transforms.ToTensor()
        ,transforms.Normalize(mean, std)
    ])
)

Note that the **order** of the transforms **matters** inside the composition. The images are loaded as **PIL image objects**, so we must add the `ToTensor()` transform before the `Normalize()` transform because the `Normalize()` transform expects a **tensor** as input.

Now that our dataset has a `Normalize()` transform, the data will be **normalized** when it is **loaded** by the data loader. Remember, for each image the **following transform** will be applied to **every pixel** in the image:

$$z=\frac{x-mean}{std}$$
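As a small worked illustration of this transform, the sketch below applies the formula by hand to three sample pixel values, using the rounded mean and standard deviation calculated earlier. The sample pixel values are chosen for illustration only.

```python
import torch

# Rounded statistics calculated earlier for the Fashion MNIST training set.
mean, std = 0.2860, 0.3530

# Three sample pixel values: black, the dataset mean, and white.
x = torch.tensor([0.0, 0.2860, 1.0])

# z = (x - mean) / std, applied element-wise.
z = (x - mean) / std
print(z)  # roughly [-0.81, 0.00, 2.02]
```

A pixel equal to the mean maps to 0, and every other value is expressed as a number of standard deviations above or below the mean.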

This has the effect of rescaling our data relative to the mean and standard deviation of the dataset. Let's see this in action by recalculating these values.

In [41]:
loader = torch.utils.data.DataLoader(
    train_set_normal
    ,batch_size=len(train_set)
    ,num_workers = 1
)
data = next(iter(loader))
data[0].mean(),data[0].std()

(tensor(1.2368e-05), tensor(1.0000))

Here, we can see that the mean value is now approximately **0** and the standard deviation value is now **1**.
## Training With Normalized Data
Let's now see how training with and without normalized data affects the training process. To run this test, we'll do 20 epochs under each condition.

Let's create a **dictionary** of **training sets** that we can use to run the test in the framework that we've been building throughout the course.

In [42]:
trainsets = {
    'not_normal':train_set
    ,'normal':train_set_normal
}

Now, we can add these two train_sets to our configuration and access the values inside our runs loop.

In [43]:
params = OrderedDict(
    lr = [.01]
    ,batch_size = [1000]
    ,num_workers = [1]
    ,device = ['cuda']
    ,trainset = ['not_normal','normal']
)

In [44]:
class Network(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(in_channels=1, out_channels=6, kernel_size=5)
        self.conv2 = nn.Conv2d(in_channels=6, out_channels=12, kernel_size=5)

        self.fc1 = nn.Linear(in_features=12 * 4 * 4, out_features=120)
        self.fc2 = nn.Linear(in_features=120, out_features=60)
        self.out = nn.Linear(in_features=60, out_features=10)

    def forward(self, t):
        t = t

        t = self.conv1(t)
        t = F.relu(t)
        t = F.max_pool2d(t,  kernel_size=2, stride=2)

        t = self.conv2(t)
        t = F.relu(t)
        t = F.max_pool2d(t, kernel_size=2, stride=2)

        t = t.reshape(-1,12*4*4)
        t = self.fc1(t)
        t = F.relu(t)

        t = self.fc2(t)
        t = F.relu(t)

        t = self.out(t)

        return t



In [45]:
class RunBuilder():
    @staticmethod
    def get_runs(params):
        Run = namedtuple('Run', params.keys())

        runs = []
        for v in product(*params.values()):
            runs.append(Run(*v))

        return runs
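
To see which runs this produces for a configuration like ours, here is a standalone sketch of the same Cartesian-product idea. It uses a reduced set of parameters (an assumption, to keep the output short) and inlines the logic of `get_runs`.

```python
from collections import OrderedDict, namedtuple
from itertools import product

# Reduced configuration: one value for lr and batch_size, two training sets.
params = OrderedDict(
    lr=[.01],
    batch_size=[1000],
    trainset=['not_normal', 'normal'],
)

# Same idea as RunBuilder.get_runs: one Run per combination of parameter values.
Run = namedtuple('Run', params.keys())
runs = [Run(*values) for values in product(*params.values())]

for run in runs:
    print(run)
```

Because only `trainset` has more than one value, this yields two runs, one per training set, and `run.trainset` can be used as a key into the `trainsets` dictionary.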

In [70]:
class RunManager():
    def __init__(self):
        self.epoch_count = 0
        self.epoch_loss = 0
        self.epoch_num_correct = 0
        self.epoch_start_time = None

        self.run_params = None
        self.run_count = 0
        self.run_data = []
        self.run_start_time = None

        self.network = None
        self.loader = None
        self.tb = None

    def begin_run(self, run, network, loader):

        self.run_start_time = time.time()
        self.run_params = run
        self.run_count += 1

        self.network = network
        self.loader = loader
        self.tb = SummaryWriter(comment=f'-{run}')

        images,labels = next(iter(self.loader))
        grid = torchvision.utils.make_grid(images)

        self.tb.add_image('images',grid)
        self.tb.add_graph(self.network, images.to(getattr(run,'device','cpu')))

    def end_run(self):
        self.tb.close()
        self.epoch_count = 0

    def begin_epoch(self):
        self.epoch_start_time = time.time()

        self.epoch_count += 1
        self.epoch_loss = 0
        self.epoch_num_correct = 0
    
    def end_epoch(self):

        epoch_duration = time.time() - self.epoch_start_time
        run_duration = time.time() - self.run_start_time

        loss = self.epoch_loss / len(self.loader.dataset)
        accuracy = self.epoch_num_correct / len(self.loader.dataset)

        self.tb.add_scalar('Loss',loss,self.epoch_count)
        self.tb.add_scalar('Accuracy',accuracy,self.epoch_count)

        for name,param in self.network.named_parameters():
            self.tb.add_histogram(name, param, self.epoch_count)
            self.tb.add_histogram(f'{name}.grad', param.grad, self.epoch_count)

        results = OrderedDict()
        results["run"] = self.run_count
        results["epoch"] = self.epoch_count
        results["loss"] = loss
        results["accuracy"] = accuracy
        results["epoch duration"] = epoch_duration
        results["run duration"] = run_duration
        for k,v in self.run_params._asdict().items():results[k] = v
        self.run_data.append(results)

        df = pd.DataFrame.from_dict(self.run_data,orient='columns')
        
        
        display.clear_output(wait=True)
        display.display(df)

    def get_num_correct(self,preds, labels):
        return preds.argmax(dim=1).eq(labels).sum().item()

    def track_loss(self,loss,batch):
        self.epoch_loss += loss.item() * batch[0].shape[0]

    def track_num_correct(self,preds,labels):
        self.epoch_num_correct += self.get_num_correct(preds,labels)

    def save(self, fileName):

        pd.DataFrame.from_dict(
            self.run_data,orient = 'columns'
        ).to_csv(f'{fileName}.csv')

        with open(f'{fileName}.json','w',encoding='utf-8') as f:
            json.dump(self.run_data,f, ensure_ascii=False, indent = 4)

In [71]:
m = RunManager()

In [72]:
for run in RunBuilder.get_runs(params):

    device = torch.device(run.device)
    network = Network().to(device)
    loader = torch.utils.data.DataLoader(
          trainsets[run.trainset]
        , batch_size=run.batch_size
        , num_workers=run.num_workers
    )
    optimizer = optim.Adam(network.parameters(), lr=run.lr)

    m.begin_run(run, network, loader)
    for epoch in range(20):
        m.begin_epoch()
        for batch in loader:

            images = batch[0].to(device)
            labels = batch[1].to(device)
            preds = network(images) # Pass Batch
            loss = F.cross_entropy(preds, labels) # Calculate Loss
            optimizer.zero_grad() # Zero Gradients
            loss.backward() # Calculate Gradients
            optimizer.step() # Update Weights

            m.track_loss(loss, batch)
            m.track_num_correct(preds, labels)
        m.end_epoch()
    m.end_run()
m.save('results2')

| | run | epoch | loss | accuracy | epoch duration | run duration | lr | batch_size | num_workers | device | trainset |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 1 | 1 | 0.947734 | 0.65295 | 7.394213 | 9.47062 | 0.01 | 1000 | 1 | cuda | not_normal |
| 1 | 1 | 2 | 0.503703 | 0.80665 | 7.25763 | 16.86289 | 0.01 | 1000 | 1 | cuda | not_normal |
| 2 | 1 | 3 | 0.415752 | 0.846583 | 7.157878 | 24.144438 | 0.01 | 1000 | 1 | cuda | not_normal |
| 3 | 1 | 4 | 0.370302 | 0.8634 | 7.059152 | 31.336203 | 0.01 | 1000 | 1 | cuda | not_normal |
| 4 | 1 | 5 | 0.340771 | 0.8744 | 7.065094 | 38.531948 | 0.01 | 1000 | 1 | cuda | not_normal |
| 5 | 1 | 6 | 0.310214 | 0.88405 | 7.095062 | 45.756663 | 0.01 | 1000 | 1 | cuda | not_normal |
| 6 | 1 | 7 | 0.292965 | 0.891167 | 7.107009 | 52.999277 | 0.01 | 1000 | 1 | cuda | not_normal |
| 7 | 1 | 8 | 0.282665 | 0.895983 | 7.073123 | 60.207014 | 0.01 | 1000 | 1 | cuda | not_normal |
| 8 | 1 | 9 | 0.269787 | 0.9 | 7.095014 | 67.437665 | 0.01 | 1000 | 1 | cuda | not_normal |
| 9 | 1 | 10 | 0.265291 | 0.901583 | 7.113964 | 74.686269 | 0.01 | 1000 | 1 | cuda | not_normal |


In [73]:
pd.DataFrame.from_dict(m.run_data).sort_values('accuracy', ascending=False)

| | run | epoch | loss | accuracy | epoch duration | run duration | lr | batch_size | num_workers | device | trainset |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 19 | 1 | 20 | 0.221359 | 0.915817 | 7.379379 | 148.569969 | 0.01 | 1000 | 1 | cuda | not_normal |
| 39 | 2 | 20 | 0.222482 | 0.915383 | 12.208193 | 227.286297 | 0.01 | 1000 | 1 | cuda | normal |
| 17 | 1 | 18 | 0.225766 | 0.914917 | 7.224724 | 133.260759 | 0.01 | 1000 | 1 | cuda | not_normal |
| 18 | 1 | 19 | 0.225115 | 0.9145 | 7.580782 | 141.047989 | 0.01 | 1000 | 1 | cuda | not_normal |
| 38 | 2 | 19 | 0.227248 | 0.913667 | 12.207076 | 214.93748 | 0.01 | 1000 | 1 | cuda | normal |
| 36 | 2 | 17 | 0.230035 | 0.913633 | 11.658277 | 190.525438 | 0.01 | 1000 | 1 | cuda | normal |
| 35 | 2 | 16 | 0.229931 | 0.913033 | 11.896156 | 178.719088 | 0.01 | 1000 | 1 | cuda | normal |
| 34 | 2 | 15 | 0.235163 | 0.911317 | 11.82856 | 166.674348 | 0.01 | 1000 | 1 | cuda | normal |
| 37 | 2 | 18 | 0.23175 | 0.9112 | 11.905782 | 202.583802 | 0.01 | 1000 | 1 | cuda | normal |
| 15 | 1 | 16 | 0.235477 | 0.910983 | 7.080069 | 118.692657 | 0.01 | 1000 | 1 | cuda | not_normal |
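
One way to compare the two conditions directly is to group the results by `trainset`. The sketch below does this on four rows copied from the sorted output above. It is not the full run data; in the notebook, one would build the frame from `m.run_data` instead.

```python
import pandas as pd

# A few rows copied from the sorted results above (not the full run data).
rows = [
    {'epoch': 20, 'accuracy': 0.915817, 'trainset': 'not_normal'},
    {'epoch': 20, 'accuracy': 0.915383, 'trainset': 'normal'},
    {'epoch': 18, 'accuracy': 0.914917, 'trainset': 'not_normal'},
    {'epoch': 19, 'accuracy': 0.913667, 'trainset': 'normal'},
]
df = pd.DataFrame(rows)

# Best accuracy reached under each condition.
best = df.groupby('trainset')['accuracy'].max()
print(best)
```

In this particular run, the best accuracies under the two conditions differ only in the fourth decimal place.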


## Quiz 05
1. Data normalization refers to a single specific algorithm for transforming data to a new form.
* False

2. After data is normalized, the new values typically encode information relative to the original dataset and are _______________ in some way.
* rescaled

3. When we normalize a dataset, we usually target each feature set inside the dataset independently.
* True

4. Feature scaling is the act of transforming different features of a dataset to similar scales.
* True

5. The _______________ of a feature set refers to the value range of the data.
* scale

6. Suppose we normalize a set of positive values by dividing each value by the maximum value of the set. What will the largest value of the new normalized set be?
* 1

7. Suppose we normalize a set of positive values by dividing each value by the maximum value of the set. These normalized values will be rescaled to which interval?
* $[0,1]$

8. Data normalization is a specific type of standardization technique.
* False