<h1 style="text-align: center;">Deep Learning with PyTorch</h1>

<br>

This lesson is dedicated to implementing complex DL models in just a few lines of Python code. It doesn't pretend to be a complete DL manual, as the field is very wide and dynamic. The goal is to make you familiar with the specifics and implementation details of the PyTorch library, assuming that you're already familiar with DL fundamentals.

<img src="assets/pytorch1.png">

<br>

# 01. Tensors

---

Below is some terminology that we are going to use in this lesson:
- A **tensor** is a multi-dimensional array.
- A **scalar** (single number) is like a point, which is zero-dimensional. 
- A **vector** is one-dimensional like a line segment.
- A **matrix** is a two-dimensional object. 
- Three-dimensional collections of numbers can be pictured as a parallelepiped of numbers, but they don't have a separate name the way a matrix does. We use one term for them and for collections of higher dimensions: **multi-dimensional matrices or tensors**.

<img width="750px" src="assets/tensor.png">

<br>

### 01.1. Creation of Tensors

As you know, the central purpose of the NumPy library is to handle multi-dimensional arrays in a generic way. In NumPy, such arrays aren't called tensors, but, in fact, they are tensors. Tensors are used very widely in scientific computations as generic storage for data. For example, a color image could be encoded as a 3D tensor with dimensions of width, height, and color plane.

<br>

A tensor is characterized by its dimensions and its element type. PyTorch supports **eight basic numeric types** (newer releases add others, such as boolean and bfloat16):

1. Three float types (16-bit, 32-bit, and 64-bit) 
2. Five integer types (8-bit signed, 8-bit unsigned, 16-bit, 32-bit, and 64-bit). 

<br>

Tensors are represented by different **type classes**. The most common ones are as follows:
1. torch.FloatTensor (corresponding to a 32-bit float)
2. torch.ByteTensor (an 8-bit unsigned integer)
3. torch.LongTensor (a 64-bit signed integer). The rest can be found in the documentation.
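
As a small sketch of how the type classes above relate to element types, the snippet below creates three tensors and prints both names. The variable names are chosen here for illustration only.

```python
import torch

# Create small tensors with explicitly chosen element types
f = torch.zeros(2, dtype=torch.float32)   # 32-bit float
b = torch.zeros(2, dtype=torch.uint8)     # 8-bit unsigned integer
l = torch.zeros(2, dtype=torch.int64)     # 64-bit signed integer

# type() reports the type-class name, dtype reports the element type
for t in (f, b, l):
    print(t.type(), t.dtype, t.element_size(), "bytes per element")
```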

<br>

There are three ways of **creating a tensor:**
1. Calling a constructor of the required type.
2. Converting a NumPy array or a Python list into a tensor. In this case, the tensor's type is taken from the array's type.
3. Calling a built-in function like torch.zeros().

To give you examples of these methods, let's look at a simple session:

In [1]:
# Import the libraries
import torch
import numpy as np

In [2]:
# Create a 3x2 float type tensor
a = torch.FloatTensor(3, 2)  # PyTorch allocates memory for the tensor, but doesn't initialize it with anything
print(a)

tensor([[0.0000e+00, 2.0000e+00],
        [0.0000e+00, 2.0000e+00],
        [4.2039e-45, 0.0000e+00]])


In [3]:
# Clear the tensor's content
a.zero_()

tensor([[0., 0.],
        [0., 0.],
        [0., 0.]])

There are two types of operation for tensors: 
- **Inplace:** Inplace operations have an underscore appended to their name and operate on the tensor's content. After this, the object itself is returned. Inplace operations are usually more efficient from a performance and memory point of view.

- **Functional:** The functional equivalent creates a copy of the tensor with the performed modification, leaving the original tensor untouched.
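
A minimal sketch of the difference, using abs() and abs_() as one example pair (the variable names are illustrative assumptions):

```python
import torch

a = torch.tensor([-1.0, 2.0, -3.0])

# Functional variant: returns a new tensor, the original is untouched
b = a.abs()
a_after_functional = a.tolist()

# In-place variant: modifies the content of `a` and returns the same tensor
c = a.abs_()

print(b, a_after_functional, a)
print(c.data_ptr() == a.data_ptr())  # both names refer to the same storage
```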


Another way to create a tensor by its constructor is to provide a Python iterable (for example, a list or tuple), which will be used as the contents of the newly created tensor:

In [5]:
# Convert a list of lists into PyTorch tensor
torch.FloatTensor([[1,2,3],[3,2,1]])

tensor([[1., 2., 3.],
        [3., 2., 1.]])

The **torch.tensor** function accepts a NumPy array as an argument and creates a tensor of the appropriate shape from it.

In the following example, we create a NumPy array initialized with zeros, which is a double (64-bit float) array by default. So, the resulting tensor has the DoubleTensor type.

Usually in deep learning, double precision is not required, and it adds extra memory and performance overhead. The common practice is to use the 32-bit float type, or even the 16-bit float type, which is more than enough.

In [7]:
# Create a zero object using NumPy
n = np.zeros(shape = (3, 2))
print(n)

array([[0., 0.],
       [0., 0.],
       [0., 0.]])

In [9]:
# Convert the zeros object into PyTorch tensor
b = torch.tensor(n)
print(b)

tensor([[0., 0.],
        [0., 0.],
        [0., 0.]], dtype=torch.float64)


In [11]:
# Create a zero object using NumPy (with 32-bit float type)
n = np.zeros(shape = (3, 2), dtype = np.float32)

# Convert the zeros object into PyTorch tensor
b = torch.tensor(n)
print(b)

tensor([[0., 0.],
        [0., 0.],
        [0., 0.]])


As an option, the type of the desired tensor can be provided to the torch.tensor function in the dtype argument. Be careful to pass a PyTorch type here (such as torch.float32) and not a NumPy type.

In [8]:
# Create a 3x2 NumPy array of zeros (64-bit float by default)
n = np.zeros(shape = (3, 2))

# Convert the zeros object into PyTorch tensor + Specify the type
torch.tensor(n, dtype = torch.float32)

tensor([[0., 0.],
        [0., 0.],
        [0., 0.]])

**Compatibility note:** The torch.tensor() function and PyTorch type specification were added in the 0.4.0 release. In previous versions, the torch.from_numpy() function was the recommended way to convert NumPy arrays, but it had issues with handling combinations of Python lists and NumPy arrays. The from_numpy() function is still available: it returns a tensor that shares memory with the source array, whereas the more flexible torch.tensor() always copies the data.
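
The practical difference between the two conversion routes in current PyTorch releases can be sketched as follows: torch.from_numpy() returns a tensor that shares memory with the NumPy array, while torch.tensor() makes a copy. The variable names are illustrative.

```python
import numpy as np
import torch

n = np.zeros((3, 2), dtype=np.float32)

shared = torch.from_numpy(n)  # shares memory with the NumPy array
copied = torch.tensor(n)      # copies the data into a new tensor

# Modify the NumPy array after the conversion
n[0, 0] = 1.0

print(shared[0, 0].item())  # the change is visible through the shared tensor
print(copied[0, 0].item())  # the copy still holds the old value
```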

<br>

### 01.2. Scalar Tensors

Since the 0.4.0 release, PyTorch supports zero-dimensional tensors that correspond to scalar values (on the left of Figure 1). Such tensors can be the result of some operations, such as summing all values in a tensor. They can also be created directly with the torch.tensor() function. To access the actual Python value of such a tensor, use the item() method.

In [3]:
# Creating a tensor
a = torch.tensor([1,2,3])
print(a)

tensor([1, 2, 3])


In [10]:
# Sum up the values inside the tensor
s = a.sum()
print(s)

tensor(6)

In [11]:
# Get the actual Python value of the tensor
s.item()

6

In [12]:
# Create a zero-dimensional tensor
torch.tensor(1)

tensor(1)

<br>

### 01.3. Tensor Operations

There are lots of operations that you can perform on tensors. You can search all of them in the PyTorch documentation at http://pytorch.org/docs.

Besides the inplace and functional variants (that is, with and without an underscore, like abs() and abs_()), there are two places to look for operations: 
- **The torch package:** In this case, the function usually accepts the tensor as an argument.
- **The tensor class:** In this case, it operates on the called tensor.

Most of the time, tensor operations are trying to correspond to their NumPy equivalent, so if there is some not-very-specialized function in NumPy, then there is a good chance that PyTorch will also have it. Examples are torch.stack(), torch.transpose(), and torch.cat().
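
A short sketch of the functions named above, also showing the same operation called from the torch package and from the tensor class (variable names are illustrative):

```python
import torch

a = torch.tensor([[1, 2], [3, 4]])
b = torch.tensor([[5, 6], [7, 8]])

# Concatenate along an existing dimension: result has shape (4, 2)
cat = torch.cat([a, b], dim=0)

# Stack along a new dimension: result has shape (2, 2, 2)
stacked = torch.stack([a, b], dim=0)

# The same operation from the torch package and from the tensor class
t1 = torch.transpose(a, 0, 1)
t2 = a.transpose(0, 1)

print(cat.shape, stacked.shape, torch.equal(t1, t2))
```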

<br>

### 01.4. GPU Tensors

PyTorch supports CUDA GPUs, which means that all operations have two versions — CPU and GPU. Every tensor type that we mentioned is for CPU and has its GPU equivalent as well. Let's see the difference between these two:

- **CPU:** CPU tensors reside in the _torch._ package. For example, _torch.FloatTensor_ is a 32-bit float tensor which resides in CPU memory. 

- **GPU:** GPU tensors reside in the _torch.cuda_ package. For example, _torch.cuda.FloatTensor_ is the GPU counterpart: a 32-bit float tensor which resides in GPU memory. 

<br>

There are two ways for converting from CPU to GPU:
1. There is a tensor method **to(device)** which creates a copy of the tensor to a specified device (which could be CPU or GPU). Device type can be specified in different ways. You can pass a string name of the device, which is "cpu" for CPU memory or "cuda" for GPU. A GPU device could have an optional device index specified after the colon, for example, the second GPU card in the system could be addressed by "cuda:1" (index is zero-based).
<br><br>
2. Another slightly more efficient way to specify a device in the to() method is using the torch.device class, which accepts the device name and optional index. For accessing the device that your tensor is currently residing in, it has a device property.
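
Both ways of specifying a device can be sketched as below. The fallback to the CPU when no GPU is present is an assumption added here so that the snippet runs on any machine.

```python
import torch

# Pick the first GPU if one is available, otherwise stay on the CPU
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

a = torch.tensor([2.0, 3.0])

# Way 1: pass the device name as a string
on_cpu = a.to("cpu")

# Way 2: pass a torch.device object
b = a.to(device)

# The device property tells where each tensor currently resides
print(on_cpu.device, b.device)
```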

In [17]:
# Create a float tensor
a = torch.FloatTensor([2,3])
print(a)

tensor([2., 3.])


In [13]:
print(a + 1)

tensor([3., 4.])


In [None]:
# Copy the tensor a to GPU
c = a.cuda() 
print(c)

_tensor([ 2.,  3.], device='cuda:0')_

In [None]:
c + 1

_tensor([ 3.,  4.], device='cuda:0')_

In [None]:
# Check the device of tensor 'c'
c.device

_device(type='cuda', index=0)_

The following code is taken from the **PyTorch Documentation** at https://pytorch.org/docs/stable/tensor_attributes.html#torch.torch.device:

In [18]:
# Define a CPU device type with index 0 (first CPU)
device_cpu = torch.device("cpu:0")

In [19]:
# Create a tensor in the given device
a = torch.FloatTensor([2,3], device = device_cpu)
print(a)

tensor([2., 3.])


In [None]:
# Check which device tensor 'a' resides on
a.device

In [20]:
# Create a random tensor in the given device
torch.randn((2,3), device = device_cpu)

tensor([[-0.2818,  0.4418,  0.0888],
        [-1.5435,  0.1090, -0.3801]])

<br>

# 02. Gradients

---

The functionality of the ___automatic gradients computation___ was originally implemented in the Caffe toolkit and then became the de-facto standard in DL libraries. Computing gradients manually is extremely painful to implement and debug, even for the simplest neural network (NN). You have to calculate derivatives for all your functions, apply the chain rule, and then implement the result of the calculations, praying that everything is done right.

Now defining an NN of hundreds of layers requires nothing more than assembling it from predefined building blocks. All gradients will be carefully calculated for you, backpropagated, and applied to the network. To be able to achieve this, you need to define your network architecture in terms of the DL library used, which can be different in details, but in general, must be the same: you define the order in which your network will transform inputs to outputs.

<img width="600px" src="./assets/gradient flow.png">

There are two approaches to how gradients are calculated:

1. **Static graph:** <br> In this method, we define our calculation in advance, and we can't change it later. The graph gets processed and optimized by the DL library before any computation is done. This model is implemented in TensorFlow and Theano.


2. **Dynamic graph:** <br> In this method, there is no need to define the graph in advance. We simply execute the operations for data transformation on the actual data, while the library records the order of the operations performed. When gradients are needed, the library unrolls its history of operations, accumulating the gradients of the network parameters. This method is also called _notebook gradients_. It is implemented in PyTorch and Chainer.

<br>

Both methods have their <u>strengths and weaknesses</u>. For example:
1. **Static graph:**
    - Faster since all computations can be moved to the GPU, minimizing the data transfer overhead. 
    - The library has more freedom in optimizing the order that computations are performed in, or even removing parts of the graph. 


2. **Dynamic graph:**
    - Higher computation overhead
    - More freedom for the developer. For example, they can say, "For this piece of data, I can apply this network two times, and for this piece of data, I'll use a completely different model with gradients clipped by the batch mean." 
    - Allows us to express our transformation more naturally and in a more "Pythonic" way. 
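
The freedom of the dynamic approach can be sketched with ordinary Python control flow. In this toy example (the function name and the rule for choosing the number of repetitions are assumptions made for illustration), a weight is applied twice or once depending on the data, and the gradient follows whichever path was actually executed.

```python
import torch

def grad_wrt_weight(x):
    # A fresh scalar weight for every call, so gradients don't accumulate
    w = torch.tensor(3.0, requires_grad=True)

    # Data-dependent control flow: apply the weight twice or once
    repeats = 2 if x.sum().item() > 0 else 1
    out = x
    for _ in range(repeats):
        out = out * w

    # The recorded history matches the operations that were actually run
    out.sum().backward()
    return w.grad.item()

g_pos = grad_wrt_weight(torch.tensor([1.0, 2.0]))    # w applied twice
g_neg = grad_wrt_weight(torch.tensor([-1.0, -2.0]))  # w applied once
print(g_pos, g_neg)
```

For the first input the result is sum(x) * w^2, so the gradient is 2 * w * sum(x); for the second it is sum(x) * w, so the gradient is just sum(x).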

<br>

### 02.1. Tensors and Gradients

PyTorch tensors have a built-in gradient calculation and tracking machinery, so all you need to do is to convert the data into tensors and perform computations using the tensor's methods. Of course, if you need to access underlying low-level details, you always can, but most of the time, PyTorch does what you're expecting.

There are several attributes related to gradients that every tensor has:
- **grad:** A property which holds a tensor of the same shape containing computed gradients.


- **is_leaf:** <br>
    - True if this tensor was constructed by the user
    - False if the tensor is the result of a function transformation


- **requires_grad:** <br>
    - True if this tensor requires gradients to be calculated. This property is inherited from leaf tensors, which get this value at the tensor construction step (torch.zeros(), torch.tensor(), etc.).
    - False (the default) otherwise. If you want gradients to be calculated for your tensor, you need to say so explicitly.

To make all of this gradient-leaf machinery clearer, let's consider this session:

In [5]:
# Create a tensor for which gradients should be calculated
v1 = torch.tensor([1.0, 1.0], requires_grad = True)

In [6]:
# Create a tensor for which gradients should NOT be calculated
v2 = torch.tensor([2.0, 2.0])

In [7]:
# Add up the tensors element-wise (which is vector [3, 3])
v_sum = v1 + v2

In [8]:
# Double every element and sum the elements together
v_res = (v_sum*2).sum()
v_res

tensor(12., grad_fn=<SumBackward0>)

The result is a zero-dimension tensor with the value 12. Okay, so this is simple math so far. Now let's look at the underlying graph that our expressions created:

<img width="500px" src="assets/graph expression.png">

If we check the attributes of our tensors, then we find that v1 and v2 are the only leaf nodes and every variable except v2 requires gradients to be calculated:

In [23]:
print("Is v1 constructed by user: ", v1.is_leaf)
print("Is v2 constructed by user: ", v2.is_leaf)
print("Is v_sum constructed by user: ", v_sum.is_leaf)
print("Is v_res constructed by user: ", v_res.is_leaf)

Is v1 constructed by user:  True
Is v2 constructed by user:  True
Is v_sum constructed by user:  False
Is v_res constructed by user:  False


In [24]:
print("Is gradient calculation required in v1:", v1.requires_grad)
print("Is gradient calculation required in v2:", v2.requires_grad)
print("Is gradient calculation required in v_sum:", v_sum.requires_grad)
print("Is gradient calculation required in v_res:", v_res.requires_grad)

Is gradient calculation required in v1: True
Is gradient calculation required in v2: False
Is gradient calculation required in v_sum: True
Is gradient calculation required in v_res: True


In [25]:
# Compute the gradients of our graph - calculate the numerical derivative of the v_res variable, with respect to any variable that our graph has
v_res.backward()

In this particular example, the value of 2 in v1's gradients means that increasing any single element of v1 by one makes the resulting value of v_res grow by two.

In [28]:
# Get the gradient value of v1
v1.grad

tensor([2., 2.])
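
The value can be checked by hand. A short derivation, using only the quantities defined in the session above (the index $i$ is introduced here for illustration):

$$
v_{res} = \sum_{i} 2\,(v_{1,i} + v_{2,i})
\quad\Longrightarrow\quad
\frac{\partial v_{res}}{\partial v_{1,i}} = 2 \quad \text{for each } i
$$

So the gradient with respect to v1 is the vector [2, 2], which matches the output.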

As mentioned, PyTorch calculates gradients only for leaf tensors with requires_grad=True. Indeed, if we try to check the gradients of v2 we get nothing:

In [29]:
# Get the gradient value of v2 (there's nothing because requires_grad=False)
v2.grad

The reason for that is efficiency in terms of computations and memory: in real life, our network can have millions of optimized parameters, with hundreds of intermediate operations performed on them. During gradient descent optimization, we're not interested in the gradients of any intermediate matrix multiplication; the only thing we need in order to adjust the model is the gradients of the loss with respect to the model parameters (weights). Of course, if you want to calculate the gradients of the input data (this can be useful if you want to generate adversarial examples to fool an existing NN or adjust pretrained word embeddings), then you can easily do so by passing requires_grad=True on tensor creation.
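
A minimal sketch of requesting gradients for input data. Here `w` stands in for a model parameter and `x` for an input sample; both names and values are assumptions made for illustration.

```python
import torch

# A stand-in for a model parameter
w = torch.tensor([0.5, -1.0], requires_grad=True)

# Input data for which we explicitly request gradients
x = torch.tensor([2.0, 3.0], requires_grad=True)

# A toy "loss": the dot product of the parameter and the input
loss = (w * x).sum()
loss.backward()

# The gradient with respect to the input equals the weights, and vice versa
print(x.grad, w.grad)
```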

<br>

# 03. NN Building Blocks

---

There are lots of predefined classes in the torch.nn package. They are designed with practice in mind: for example, they support minibatches and have sane default weight initialization. Every class can act as a function when applied to its argument. For example, the Linear class implements a feed-forward layer with optional bias.

In [2]:
# Import the libraries
import torch.nn as nn

In [5]:
# Create a feed-forward layer with two inputs and five outputs
l = nn.Linear(in_features = 2, out_features = 5, bias = True)

In [10]:
# Create a float tensor
v = torch.FloatTensor([1, 2])

In [11]:
# Input the tensor into the layer
l(v)

tensor([ 0.7293,  1.3634, -1.8153,  1.8636, -0.6975], grad_fn=<AddBackward0>)

All classes in the torch.nn package inherit from the nn.Module base class, which you can use to implement your own higher-level NN blocks. We'll see how you can do this in the next section.

Now let's look at useful methods that all nn.Module children provide. They are as follows:

- **parameters():** <br>A function that returns iterator of all variables which require gradient computation (that is, module weights) <br><br>
- **zero_grad():** <br>This function initializes all gradients of all parameters to zero<br><br>
- **to(device):** <br>This moves all module parameters to a given device (CPU or GPU)<br><br>
- **state_dict():** <br>This returns the dictionary with all module parameters and is useful for model serialization<br><br>
- **load_state_dict():** <br>This initializes the module with the state dictionary<br><br>
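
A short sketch of some of these methods on a single Linear layer. Copying the state into a second layer called `clone` is an illustration chosen here, not something taken from the lesson.

```python
import torch
import torch.nn as nn

layer = nn.Linear(in_features=2, out_features=5)

# parameters(): iterate over the weights (2*5) and biases (5)
n_params = sum(p.numel() for p in layer.parameters())

# state_dict(): a dictionary with all module parameters
state = layer.state_dict()

# load_state_dict(): initialize another module of the same shape from it
clone = nn.Linear(in_features=2, out_features=5)
clone.load_state_dict(state)

# Both layers now produce the same output for the same input
x = torch.tensor([[1.0, 2.0]])
same = torch.equal(layer(x), clone(x))
print(n_params, list(state.keys()), same)
```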

There is a class that allows us to combine layers into a pipe: __Sequential__. Let's see an example of it:

In [14]:
# Create a sequential and combine multiple layers
s = nn.Sequential(nn.Linear(in_features = 2, out_features = 5), 
                  nn.ReLU(), 
                  nn.Linear(in_features = 5, out_features = 20),
                  nn.ReLU(),
                  nn.Linear(in_features = 20, out_features = 10),
                  nn.Dropout(p = 0.3),
                  nn.Softmax(dim = 1))

In [15]:
s

Sequential(
  (0): Linear(in_features=2, out_features=5, bias=True)
  (1): ReLU()
  (2): Linear(in_features=5, out_features=20, bias=True)
  (3): ReLU()
  (4): Linear(in_features=20, out_features=10, bias=True)
  (5): Dropout(p=0.3, inplace=False)
  (6): Softmax(dim=1)
)

Here, we defined a three-layer NN with softmax on output, applied along dimension 1 (dimension 0 is batch samples), ReLU nonlinearities and dropout. Let's push something through it:

In [16]:
# Push a tensor through the layers
tensor = torch.FloatTensor([[1, 2]]) 
s(tensor)

tensor([[0.0898, 0.1059, 0.1098, 0.0902, 0.1264, 0.1270, 0.0524, 0.1016, 0.0958,
         0.1011]], grad_fn=<SoftmaxBackward>)

So, our minibatch of one example successfully traversed the network!
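
To make the role of the dim argument concrete, here is a small sketch with a hand-made batch of two samples (the values are arbitrary): with dim = 1, every row, that is every sample, is normalized separately.

```python
import torch
import torch.nn as nn

# A batch of two samples with three raw scores each
batch = torch.tensor([[1.0, 2.0, 3.0],
                      [1.0, 1.0, 1.0]])

# Softmax along dimension 1 normalizes every sample separately
probs = nn.Softmax(dim=1)(batch)

# Each row sums to one; equal scores give equal probabilities
row_sums = probs.sum(dim=1)
print(probs)
print(row_sums)
```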

In [17]:
s.state_dict()

OrderedDict([('0.weight', tensor([[ 0.6234,  0.1310],
                      [ 0.3247, -0.6904],
                      [ 0.1651, -0.2135],
                      [ 0.1598, -0.5944],
                      [ 0.1943, -0.0256]])),
             ('0.bias', tensor([-0.4926,  0.5587, -0.3992,  0.2365, -0.4030])),
             ('2.weight',
              tensor([[-0.2622,  0.1180, -0.0293, -0.2041,  0.1821],
                      [-0.0837,  0.2156, -0.0450, -0.0928,  0.1783],
                      [ 0.4273,  0.0384, -0.1828, -0.0298,  0.0754],
                      [ 0.4340, -0.3011,  0.3925,  0.2849,  0.1206],
                      [-0.3380, -0.2890,  0.2108, -0.0456,  0.0236],
                      [-0.3773, -0.1504,  0.2294,  0.1127,  0.1589],
                      [ 0.3415,  0.2268, -0.2070, -0.1196, -0.2004],
                      [ 0.1787,  0.2802,  0.3674,  0.4372,  0.3728],
                      [ 0.3034,  0.4025, -0.3040,  0.0568, -0.4281],
                      [ 0.3720, -0.1388, -0.3515

<br>

# 04. Custom Layers

---

By subclassing the nn.Module class, you can create your own building blocks which can be stacked together, reused later, and integrated into the PyTorch framework flawlessly.

<br>

At its core, nn.Module provides quite rich functionality to its children. For example:

1. Track all sub-modules that the current module includes.
2. Functions for dealing with all parameters of the registered submodules. For example:
    - Obtain the full list of the module's parameters (parameters() method)
    - Zero the gradients (zero_grad() method)
    - Move to CPU or GPU (to(device) method)
    - Serialize and deserialize the module (state_dict() and load_state_dict())
    - Perform generic transformations using your own callable (apply() method).
3. It establishes the convention of module application to data. Every module needs to perform its data transformation in the forward() method by overriding it.
4. There are some more functions, such as the ability to register a hook function to tweak module transformation or gradients flow, but it's more for advanced use cases.
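
Two of the less common facilities from the list, apply() and hooks, can be sketched as follows. The helper names `init_zero_bias` and `record_shape` are hypothetical and introduced only for this illustration.

```python
import torch
import torch.nn as nn

net = nn.Sequential(nn.Linear(2, 3), nn.ReLU(), nn.Linear(3, 1))

# apply(): run our own callable on every submodule
def init_zero_bias(module):
    if isinstance(module, nn.Linear):
        nn.init.zeros_(module.bias)

net.apply(init_zero_bias)

# Forward hook: called after the module's forward pass with its output
shapes = []

def record_shape(module, inputs, output):
    shapes.append(tuple(output.shape))

handle = net[0].register_forward_hook(record_shape)
net(torch.zeros(4, 2))

# After removing the hook, further calls are no longer recorded
handle.remove()
net(torch.zeros(4, 2))
print(shapes)
```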

These functionalities allow us to nest our submodels into higher-level models in a unified way.

<br>

To create a custom module, we usually have to do two things: 
* Register submodules.
* Implement the forward() method.

Let's look at how this can be done for our Sequential example from the previous section, but in a more generic and reusable way:

In [22]:
# Import the libraries
import torch
import torch.nn as nn

In [23]:
# Our module class that inherits nn.Module
class OurModule(nn.Module):
    
    # The constructor
    def __init__(self, num_inputs, num_classes, dropout_prob = 0.3):
        
        # Call the parent's constructor to initialize itself
        super(OurModule, self).__init__()
        
        # Create a Sequential container with a bunch of layers
        self.pipe = nn.Sequential(nn.Linear(in_features = num_inputs, out_features = 5),
                                  nn.ReLU(),
                                  nn.Linear(in_features = 5, out_features = 20),
                                  nn.ReLU(),
                                  nn.Linear(in_features = 20, out_features = num_classes),
                                  nn.Dropout(p = dropout_prob),
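                                  # Note: no dim is passed here, so PyTorch picks the dimension implicitly
                                  # (and warns about it); passing dim = 1 states the intent explicitly
                                  nn.Softmax())

    # Forward function (we override the built-in forward method with our own data transformation)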
    def forward(self, x):
        return self.pipe(x)

In [24]:
if __name__ == "__main__":
    
    # Create ourmodule with the desired number of inputs and outputs
    net = OurModule(num_inputs = 2, num_classes = 3)
    print(net)
    
    # Create a tensor
    v = torch.FloatTensor([[2, 3]])
    
    # Ask ourmodule to transform our tensor
    out = net(v)
    print(out)
    
    # Use Cuda if available
    print("Cuda's availability is %s" % torch.cuda.is_available())
    if torch.cuda.is_available():
        print("Data from cuda: %s" % out.to('cuda'))

OurModule(
  (pipe): Sequential(
    (0): Linear(in_features=2, out_features=5, bias=True)
    (1): ReLU()
    (2): Linear(in_features=5, out_features=20, bias=True)
    (3): ReLU()
    (4): Linear(in_features=20, out_features=3, bias=True)
    (5): Dropout(p=0.3, inplace=False)
    (6): Softmax(dim=None)
  )
)
tensor([[0.3555, 0.3358, 0.3087]], grad_fn=<SoftmaxBackward>)
Cuda's availability is False


<br>

# 05. Final glue – loss functions and optimizers

---

It's not enough to have a network that only transforms the input into output. Our learning objective is to have a function that accepts two arguments: the network's output and the desired output. This function should return a single number which tells us how close the network's prediction is to the desired result. This function is called the __loss function__, and its output is the __loss value__. 

After getting the loss value, we will calculate the  __gradients of network parameters__ and adjust them to decrease this loss value, which pushes our model to better results in the future. 

<br>

### 05.1. Loss functions

Loss functions reside in the nn package and are implemented as an nn.Module subclass. Usually, they accept two arguments: output from the network (prediction), and desired output (ground-truth data which is also called the label of the data sample). 

The most common loss functions are:

- **nn.MSELoss:** <br>The Mean Square Error between arguments, which is the standard loss for regression problems.<br><br>

- **nn.BCELoss and nn.BCEWithLogitsLoss:** <br>Binary Cross-Entropy loss. The first version expects a single probability value (usually it's the output of the Sigmoid layer), while the second version assumes raw scores as input and applies Sigmoid itself. The second way is usually more numerically stable and efficient. These losses (as their names suggest) are frequently used in binary classification problems.<br><br>

- **nn.CrossEntropyLoss and nn.NLLLoss:** Famous "Maximum Likelihood" Criteria, which is used in multi-class classification problems. The first version expects raw scores for each class and applies LogSoftmax internally, while the second expects to have log probabilities as the input.
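
The relationship between the paired losses can be checked numerically. The sketch below uses made-up scores and labels and compares each "raw scores" version with its counterpart applied after the corresponding activation.

```python
import torch
import torch.nn as nn

# Multi-class case: raw scores for 2 samples and 3 classes
logits = torch.tensor([[2.0, 0.5, -1.0],
                       [0.1, 0.2, 3.0]])
targets = torch.tensor([0, 2])

# CrossEntropyLoss on raw scores vs. NLLLoss on log probabilities
ce = nn.CrossEntropyLoss()(logits, targets)
nll = nn.NLLLoss()(nn.LogSoftmax(dim=1)(logits), targets)

# Binary case: raw scores and 0/1 labels
raw = torch.tensor([0.3, -1.2, 2.0])
labels = torch.tensor([1.0, 0.0, 1.0])

# BCEWithLogitsLoss on raw scores vs. BCELoss on Sigmoid outputs
bce_logits = nn.BCEWithLogitsLoss()(raw, labels)
bce = nn.BCELoss()(torch.sigmoid(raw), labels)

print(ce.item(), nll.item())
print(bce_logits.item(), bce.item())
```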

<br>

### 05.2. Optimizers

An optimizer takes the gradients of the model parameters and changes the parameters in order to decrease the loss value. By decreasing the loss value, we're pushing our model towards the desired outputs. 

In torch.optim package, PyTorch provides a lot of optimizer implementations. The most widely known are as follows:

- **SGD:** <br>A vanilla stochastic gradient descent algorithm with optional momentum extension.<br><br>
- **RMSprop:** <br>An optimizer, proposed by G. Hinton<br><br>
- **Adagrad:** <br>An adaptive gradients optimizer

On construction, you need to pass an iterable of tensors that will be modified during the optimization process. The usual practice is to pass the result of the parameters() call of the top-level nn.Module instance, which returns an iterable of all leaf tensors that require gradients. Now, let's discuss the common blueprint of a training loop:

In [None]:
# Iterate through input and output batches of data
for batch_samples, batch_labels in iterate_batches(data, batch_size = 32): 
    
    # Convert the samples into tensor
    batch_samples_t = torch.tensor(batch_samples)
    
    # Convert the labels into tensor
    batch_labels_t = torch.tensor(batch_labels)
    
    # Pass the data samples to the network
    out_t = net(batch_samples_t) 
    
    # Calculate the loss
    loss_t = loss_function(out_t, batch_labels_t) 
    
    # Calculate the gradients for every leaf tensor with requires_grad=True (every time a gradient is calculated, it is accumulated in the tensor.grad field)
    loss_t.backward()
    
    # Apply the optimizer
    optimizer.step()
    
    # Zero the gradients of the parameters
    optimizer.zero_grad()
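
The blueprint above leaves the data, the network, the loss and the optimizer undefined. A minimal runnable sketch of the same loop is shown below, under assumptions made only for illustration: synthetic data for y = 2x + 1, a single Linear layer as the network, MSELoss, SGD with a learning rate of 0.1, and a hypothetical `iterate_batches` helper that simply slices the tensors.

```python
import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(0)

# Synthetic regression data (an assumption for this sketch): y = 2x + 1
x = torch.linspace(-1, 1, 64).unsqueeze(1)
y = 2 * x + 1

net = nn.Linear(in_features=1, out_features=1)
loss_function = nn.MSELoss()
optimizer = optim.SGD(net.parameters(), lr=0.1)

# Hypothetical helper: yields consecutive slices of samples and labels
def iterate_batches(samples, labels, batch_size):
    for i in range(0, len(samples), batch_size):
        yield samples[i:i + batch_size], labels[i:i + batch_size]

with torch.no_grad():
    initial_loss = loss_function(net(x), y).item()

for epoch in range(50):
    for batch_samples_t, batch_labels_t in iterate_batches(x, y, batch_size=16):
        out_t = net(batch_samples_t)                    # forward pass
        loss_t = loss_function(out_t, batch_labels_t)   # loss value
        loss_t.backward()                               # accumulate gradients
        optimizer.step()                                # update parameters
        optimizer.zero_grad()                           # reset gradients

with torch.no_grad():
    final_loss = loss_function(net(x), y).item()

print(initial_loss, final_loss)
```

Forgetting the zero_grad() call is a common mistake: gradients from different batches would then keep adding up in the grad fields.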

<br>

# 06. Monitoring with TensorBoard
---

If you have ever tried to train a NN on your own, then you may know how painful it is to tune the hyperparameters. Of course, with practice and experience, you'll develop a strong intuition about the possible causes of problems, but intuition needs input data about what's going on inside your network. So you need to be able to peek inside your training process somehow and observe its dynamics. Below is a list of things that you should usually observe during training:
- __Loss value:__ This can consist of several components, like base loss and regularization losses. You should monitor both the total loss and the individual components over time.
- __Validation results__ on training and test sets.
- __Gradients and weights__ statistics.
- __Hyperparameters__ that get adjusted over time like learning rate.

The list could be much longer and include domain-specific metrics, such as word embeddings' projections, audio samples, and images generated by GAN. You also may want to monitor values related to training speed, like how long an epoch takes, to see the effect of your optimizations or problems with hardware.

To make a long story short, you need a generic solution to track lots of values over time and represent them for analysis, preferably developed specially for DL. Luckily, such tools exist.

<br>

### 06.1. TensorBoard 101

TensorFlow included a special tool called TensorBoard, developed to observe and analyze various NN characteristics over training. TensorBoard is a powerful, generic solution with a large community and it looks quite pretty:

<img width="600px" src="assets/tensorboard.png">

TensorBoard is a Python web service which you can start on your computer, passing it the directory where your training process will save values to be analyzed. Then you point your browser to TensorBoard's port (usually 6006), and it shows you an interactive web interface with values updated in real time. TensorBoard was originally distributed as a part of TensorFlow, but it has since been moved to a separate project and has its own package name. 

In theory, this is all you need to start monitoring your networks, as the tensorflow package provides you with classes to write the data that TensorBoard will be able to read. However, it's not very practical, as those classes are very low level. To overcome this, there are several third-party open-source libraries that provide a convenient high-level interface. One of my favorites, which is used in this book, is tensorboard-pytorch (https://github.com/lanpa/tensorboard-pytorch). You can install it using the following line:

<br>

<center><code>pip install tensorboard-pytorch</code></center>

In [5]:
!pip install tensorboard-pytorch
!pip install tensorboardX



<br>

### 06.2. Plotting stuff

To give you an impression of how simple tensorboard-pytorch is, let's consider a small example that is not related to NNs, but is just about writing stuff into TensorBoard (the full example code is in Chapter03/02_tensorboard.py).

In [1]:
# Import the libraries
import math
from tensorboardX import SummaryWriter

# Begin the program
if __name__ == "__main__":
    
    # Writer of data
    writer = SummaryWriter()

    # Functions that are going to be visualized
    funcs = {"sin": math.sin, "cos": math.cos, "tan": math.tan}

    # Loop over angle ranges in degrees
    for angle in range(-360, 360):
        
        # Convert the angle range (in degrees) into radians
        angle_rad = angle * math.pi / 180
        
        # Loop over functions and their names
        for name, fun in funcs.items():
            
            # Calculate our functions' values
            val = fun(angle_rad)
            
            # Add every value to the writer
            writer.add_scalar(name, val, angle)

    # Close the writer
    writer.close()

By default, SummaryWriter will create a unique directory under the runs directory for every launch, so that different launches of training can be compared. The name of the new directory includes the current date and time and the hostname. To override this, you can pass the log_dir argument to SummaryWriter. You can also add a suffix to the name of the directory by passing a comment option, for example to capture different experiments' semantics, such as dropout=0.3 or strong_regularisation.

Note that the writer does a periodical flush (by default, every two minutes), so even in the case of a lengthy optimization process, you still will see your values.

The result of running this will be zero output on the console, but you will see a new directory created inside the runs directory with a single file. To look at the result, we need to start TensorBoard:

In [None]:
!tensorboard --logdir runs --host localhost

Now you can open http://localhost:6006 in your browser to see something like this:

<img width="800px" src="assets/plot.png">

The graphs are interactive, so you can hover over them with your mouse to see the actual values and select regions to zoom into details. To zoom out, double-click inside the graph. If you run your program several times, then you will see several items in the "runs" list on the left, which can be enabled and disabled in any combinations, allowing you to compare the dynamics of several optimizations. TensorBoard allows you to analyze not only scalar values but also images, audio, text data, and embeddings, and it can even show you the structure of your network. Refer to the documentation of tensorboard-pytorch and tensorboard for all those features.

<br>

# 07. Example – GAN on Atari images
---

In this example, we'll train a __Generative Adversarial Network (GAN)__ to generate screenshots of various Atari games.

The simplest GAN architecture is this: we have two networks and the first works as a "cheater" (that is ___generator___), and the other is a "detective" (that is ___discriminator___). Both networks compete with each other: The generator tries to generate fake data, which will be hard for the discriminator to distinguish from your dataset, and the discriminator tries to detect the generated data samples. Over time, both networks improve their skills. The generator produces more and more realistic data samples, and the discriminator invents more sophisticated ways to distinguish the fake items. 
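
To make the competition concrete, here is a minimal sketch of the binary cross-entropy objectives that the two networks commonly minimise, written in plain Python. The function names (`bce`, `discriminator_loss`, `generator_loss`) and the sample probabilities are illustrative assumptions, not part of the example code.

```python
import math


def bce(prob, label):
    """Binary cross-entropy for one predicted probability and a 0/1 label."""
    return -(label * math.log(prob) + (1 - label) * math.log(1 - prob))


def discriminator_loss(p_real, p_fake):
    # The discriminator wants p_real -> 1 (real data) and p_fake -> 0 (generated data)
    return bce(p_real, 1) + bce(p_fake, 0)


def generator_loss(p_fake):
    # The generator wants the discriminator to output 1 on generated data
    return bce(p_fake, 1)


# Hypothetical discriminator outputs: fairly sure about both real and fake samples
d_loss = discriminator_loss(p_real=0.9, p_fake=0.1)
g_loss = generator_loss(p_fake=0.1)
print(d_loss, g_loss)
```

With these made-up numbers the discriminator loss is small and the generator loss is large: the more confident the discriminator is, the more the generator is penalised, and vice versa.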

Practical usage of GANs includes image quality improvement, realistic image generation, and feature learning. So, let's get started. The whole example code is in the file Chapter03/03_atari_gan.py.

In [2]:
# Import the libraries
import random
import argparse
import cv2
import torch
import torch.nn as nn
import torch.optim as optim
from tensorboardX import SummaryWriter
import torchvision.utils as vutils
import gym
import gym.spaces
import numpy as np

In [3]:
log = gym.logger
log.set_level(gym.logger.INFO)

In [4]:
# Hyperparameters
LATENT_VECTOR_SIZE = 100
DISCR_FILTERS = 64
GENER_FILTERS = 64
BATCH_SIZE = 16
IMAGE_SIZE = 64 # side length (in pixels) to which input images are rescaled
LEARNING_RATE = 0.0001
REPORT_EVERY_ITER = 100
SAVE_IMAGE_EVERY_ITER = 1000

In [5]:
# Input wrapper class that inherits gym.ObservationWrapper
class InputWrapper(gym.ObservationWrapper):
    """
    Preprocessing of the input numpy array:
        1. resize the image to a predefined size
        2. move the color channel axis to the first position
        3. cast to float32 and rescale the values to the 0..1 range
    """
    # The constructor
    def __init__(self, *args):
        
        # Call the parent's constructor to initialize itself
        super(InputWrapper, self).__init__(*args)
        
        # The type of observation space should be Box
        assert isinstance(self.observation_space, gym.spaces.Box)
        
        # The old observation space
        old_space = self.observation_space
        
        # The new observation space
        self.observation_space = gym.spaces.Box(self.observation(old_space.low), self.observation(old_space.high), dtype=np.float32)

        
    def observation(self, observation):
        
        # Resize image from 210×160 (standard Atari resolution) to a square size 64×64
        new_obs = cv2.resize(observation, (IMAGE_SIZE, IMAGE_SIZE))
        
        # Move color plane of the image from the last position to the first (e.g. 64x64x3 => 3x64x64) - This is PyTorch convention for using convolution layers
        new_obs = np.moveaxis(new_obs, 2, 0)
        
        # Cast the image from bytes to float + rescale its values to a 0..1 range
        return new_obs.astype(np.float32) / 255.0
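
The two array operations in `observation()` can be tried in isolation. The sketch below is only an illustration: it uses a random array as a stand-in for an already resized frame (so cv2 is not needed), and the variable names are not part of the example code.

```python
import numpy as np

# Stand-in for a resized Atari frame: height x width x channels, one byte per value
frame = np.random.randint(0, 256, size=(64, 64, 3), dtype=np.uint8)

# Move the channel axis from the last position to the first
chw = np.moveaxis(frame, 2, 0)

# Cast to float32 and rescale to the 0..1 range
scaled = chw.astype(np.float32) / 255.0

print(frame.shape, chw.shape)  # (64, 64, 3) (3, 64, 64)
print(scaled.dtype, float(scaled.min()) >= 0.0, float(scaled.max()) <= 1.0)
```

Note that `np.moveaxis` only rearranges the axes: `chw[0]` holds the same values as `frame[:, :, 0]`.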

Then we define two nn.Module classes: Discriminator and Generator. The first takes our scaled color image as input and, by applying five layers of convolutions, converts it into a single number, passed through a sigmoid nonlinearity. The output from Sigmoid is interpreted as the probability that Discriminator thinks our input image is from the real dataset.
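
To see why five convolutions are enough to reduce a 64×64 image to a single number, we can trace the spatial size through the layers. The sketch below applies the usual output-size formula for a convolution without dilation to the kernel, stride and padding values used in the Discriminator class of this example; the helper name `conv_out_size` is an assumption made for illustration.

```python
def conv_out_size(size, kernel_size, stride, padding):
    """Spatial output size of a convolution (no dilation)."""
    return (size + 2 * padding - kernel_size) // stride + 1


# (kernel_size, stride, padding) of the five Conv2d layers
conv_layers = [(4, 2, 1), (4, 2, 1), (4, 2, 1), (4, 2, 1), (4, 1, 0)]

size = 64
conv_sizes = [size]
for kernel_size, stride, padding in conv_layers:
    size = conv_out_size(size, kernel_size, stride, padding)
    conv_sizes.append(size)

print(conv_sizes)  # [64, 32, 16, 8, 4, 1]
```

Each of the first four layers halves the image, and the last one collapses the remaining 4×4 map into a single value per image.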

Generator takes as input a vector of random numbers (latent vector) and using the "transposed convolution" operation (it is also known as deconvolution), converts this vector into a color image of the original resolution.
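
The same kind of size trace works in the opposite direction for the Generator. The sketch below uses the output-size formula for a transposed convolution without dilation or output padding, with the kernel, stride and padding values from the Generator class of this example; the helper name `deconv_out_size` is an assumption made for illustration.

```python
def deconv_out_size(size, kernel_size, stride, padding):
    """Spatial output size of a transposed convolution (no dilation, no output padding)."""
    return (size - 1) * stride - 2 * padding + kernel_size


# (kernel_size, stride, padding) of the five ConvTranspose2d layers
deconv_layers = [(4, 1, 0), (4, 2, 1), (4, 2, 1), (4, 2, 1), (4, 2, 1)]

size = 1  # the latent vector is fed in as a 1x1 "image" with many channels
deconv_sizes = [size]
for kernel_size, stride, padding in deconv_layers:
    size = deconv_out_size(size, kernel_size, stride, padding)
    deconv_sizes.append(size)

print(deconv_sizes)  # [1, 4, 8, 16, 32, 64]
```

The first layer expands the 1×1 input to 4×4, and each following layer doubles the size until the 64×64 target resolution is reached.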

<img width="800px" src="./assets/gymenv.png">

As input, we'll use screenshots from several Atari games played simultaneously by a random agent. Figure 6 is an example of what the input data looks like.

In [6]:
# Discriminator class that inherits nn.Module
class Discriminator(nn.Module):
    """
    Discriminator Network.
    """
    
    # Constructor
    def __init__(self, input_shape):
        
        # Call the parent's constructor to initialize itself
        super(Discriminator, self).__init__()
        
        # Pipe for convolving the image into a single number - the output is the probability that the network thinks the input image is real
        self.conv_pipe = nn.Sequential(
            nn.Conv2d(in_channels=input_shape[0], out_channels=DISCR_FILTERS, kernel_size=4, stride=2, padding=1),
            nn.ReLU(),

            nn.Conv2d(in_channels=DISCR_FILTERS, out_channels=DISCR_FILTERS*2, kernel_size=4, stride=2, padding=1),
            nn.BatchNorm2d(DISCR_FILTERS*2),
            nn.ReLU(),

            nn.Conv2d(in_channels=DISCR_FILTERS*2, out_channels=DISCR_FILTERS*4, kernel_size=4, stride=2, padding=1),
            nn.BatchNorm2d(DISCR_FILTERS*4),
            nn.ReLU(),

            nn.Conv2d(in_channels=DISCR_FILTERS*4, out_channels=DISCR_FILTERS*8, kernel_size=4, stride=2, padding=1),
            nn.BatchNorm2d(DISCR_FILTERS*8),
            nn.ReLU(),

            nn.Conv2d(in_channels=DISCR_FILTERS*8, out_channels=1, kernel_size=4, stride=1, padding=0),
            nn.Sigmoid())
    
    # Forward function 
    def forward(self, x):
        conv_out = self.conv_pipe(x)
        return conv_out.view(-1, 1).squeeze(dim=1)

In [7]:
# Generator class that inherits nn.Module
class Generator(nn.Module):
    """
    Generator Network.
    """
    # Constructor
    def __init__(self, output_shape):
        
        # Call the parent's constructor to initialize itself
        super(Generator, self).__init__()
        
        # Pipe for deconvolving the input vector into a 3x64x64 image
        self.pipe = nn.Sequential(
            nn.ConvTranspose2d(in_channels=LATENT_VECTOR_SIZE, out_channels=GENER_FILTERS*8, kernel_size=4, stride=1, padding=0),
            nn.BatchNorm2d(GENER_FILTERS * 8),
            nn.ReLU(),
            
            nn.ConvTranspose2d(in_channels=GENER_FILTERS*8, out_channels=GENER_FILTERS*4, kernel_size=4, stride=2, padding=1),
            nn.BatchNorm2d(GENER_FILTERS * 4),
            nn.ReLU(),
            
            nn.ConvTranspose2d(in_channels=GENER_FILTERS*4, out_channels=GENER_FILTERS*2, kernel_size=4, stride=2, padding=1),
            nn.BatchNorm2d(GENER_FILTERS * 2),
            nn.ReLU(),
            
            nn.ConvTranspose2d(in_channels=GENER_FILTERS*2, out_channels=GENER_FILTERS, kernel_size=4, stride=2, padding=1),
            nn.BatchNorm2d(GENER_FILTERS),
            nn.ReLU(),
            
            nn.ConvTranspose2d(in_channels=GENER_FILTERS, out_channels=output_shape[0], kernel_size=4, stride=2, padding=1),
            nn.Tanh())

    # Forward function 
    def forward(self, x):
        return self.pipe(x)

In [8]:
# Function for getting the batches
def iterate_batches(envs, batch_size = BATCH_SIZE):
    
    # Get the environments and reset them
    batch = [e.reset() for e in envs]
    
    # Endless iterator that picks a random environment on every step
    env_gen = iter(lambda: random.choice(envs), None)
    
    # Infinite loop
    while True:
        
        # Get the next environment
        e = next(env_gen)
        
        # Take a random action and get the tuple of (new observation, reward, terminal state, extra information)
        obs, reward, is_done, _ = e.step(e.action_space.sample())
        
        # If the mean of observations is above 0.01
        if np.mean(obs) > 0.01:
            
            # Append the observations into the batch
            batch.append(obs)
            
        # If length of batch is equal to batch size
        if len(batch) == batch_size:
            
            # Rescale from the wrapper's 0..1 range to -1..1 (the range of the generator's Tanh output)
            batch_np = np.array(batch, dtype=np.float32) * 2.0 - 1.0
            
            # Yield the batch tensor
            yield torch.tensor(batch_np)
            
            # Clear the batch
            batch.clear()
            
        # If terminal state
        if is_done:
            
            # Reset
            e.reset()
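
The `iter(lambda: random.choice(envs), None)` line relies on the two-argument form of the built-in `iter`, which may be unfamiliar. The sketch below shows the idiom on its own, using a list of strings as a stand-in for the environments; the names and the fixed seed are assumptions made for illustration.

```python
import random
from itertools import islice

random.seed(0)  # only to make the sketch reproducible
names = ["Breakout", "AirRaid", "Pong"]  # stand-ins for the environments

# iter(callable, sentinel) keeps calling `callable` until it returns `sentinel`.
# random.choice never returns None here, so the iterator never ends.
name_gen = iter(lambda: random.choice(names), None)

# Take the first five picks from the endless iterator
sample = list(islice(name_gen, 5))
print(sample)
```

Because the iterator is endless, it has to be consumed with `next()` or `islice()`, as here, rather than with a plain `list()` call.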

In [None]:
if __name__ == "__main__":
    
    # Select CPU or GPU - OPTION I
    #parser = argparse.ArgumentParser()
    #parser.add_argument("--cuda", default=False, action='store_true')
    #args = parser.parse_args()
    #device = torch.device("cuda" if args.cuda else "cpu")
    
    # Select CPU or GPU - OPTION II
    device = torch.device("cpu")
    
    # Environment names
    env_names = ('Breakout-v0', 'AirRaid-v0', 'Pong-v0')
    
    # Get the environments
    envs = [InputWrapper(gym.make(name)) for name in env_names]
    
    # Get the input shape
    input_shape = envs[0].observation_space.shape

    # Discriminator network
    net_discr = Discriminator(input_shape=input_shape).to(device)

    # Generator network
    net_gener = Generator(output_shape=input_shape).to(device)

    # Binary cross-entropy (BCE) loss function
    objective = nn.BCELoss()

    # Optimizer for generator
    gen_optimizer = optim.Adam(params=net_gener.parameters(), lr=LEARNING_RATE)

    # Optimizer for discriminator
    dis_optimizer = optim.Adam(params=net_discr.parameters(), lr=LEARNING_RATE)

    # Writer of data
    writer = SummaryWriter()

    # Initialize a list for the accumulated generator losses
    gen_losses = []

    # Initialize a list for the accumulated discriminator losses
    dis_losses = []

    # Initialize the iterator counter
    iter_no = 0

    # Initialize variables for true labels
    true_labels_v = torch.ones(BATCH_SIZE, dtype=torch.float32, device=device)

    # Initialize variables for fake labels
    fake_labels_v = torch.zeros(BATCH_SIZE, dtype=torch.float32, device=device)
    
    # Iterate over batches of real screenshots from the environments
    for batch_v in iterate_batches(envs):

        # Sample the generator input: a 4D tensor (batch, filters, x, y) of standard normal noise, created on the given device
        gen_input_v = torch.randn(BATCH_SIZE, LATENT_VECTOR_SIZE, 1, 1, device=device)

        # Make the batch in the given device
        batch_v = batch_v.to(device)

        # Pass the batch into the Generator network
        gen_output_v = net_gener(gen_input_v)

        ###### Train the discriminator: ######

        # Zero out the optimizer for discriminator
        dis_optimizer.zero_grad()

        # Pass the batch into the discriminator's network
        dis_output_true_v = net_discr(batch_v)

        # Pass the generated batch into the discriminator's network.
        # detach() keeps the gradients of this training pass from flowing into the generator:
        # it returns a tensor that shares the same data but is cut off from the computation graph that produced it.
        dis_output_fake_v = net_discr(gen_output_v.detach())

        # Get the loss for the TRUE data samples + Get the loss for the FAKE data samples
        dis_loss = objective(dis_output_true_v, true_labels_v) + objective(dis_output_fake_v, fake_labels_v)

        # Calculate the gradients for every leaf tensor with requires_grad=True
        dis_loss.backward()

        # Apply the optimizer
        dis_optimizer.step()

        # Append the discriminator loss into dis_losses list
        dis_losses.append(dis_loss.item())
        
        ###### Train the generator: ######

        # Zero out the optimizer for generator
        gen_optimizer.zero_grad()

        # Pass the generated batch from generator into the discriminator's network
        dis_output_v = net_discr(gen_output_v)

        # Get the loss
        gen_loss_v = objective(dis_output_v, true_labels_v)

        # Calculate the gradients for every leaf tensor with requires_grad=True
        gen_loss_v.backward()

        # Apply the optimizer
        gen_optimizer.step()

        # Append the generator loss into gen_losses list
        gen_losses.append(gen_loss_v.item())
        
        ######

        # Increment the iteration counter
        iter_no += 1

        # Every REPORT_EVERY_ITER times
        if iter_no % REPORT_EVERY_ITER == 0:

            # Print the losses
            log.info("Iter %d: gen_loss=%.3e, dis_loss=%.3e", iter_no, np.mean(gen_losses), np.mean(dis_losses))
            writer.add_scalar("gen_loss", np.mean(gen_losses), iter_no)
            writer.add_scalar("dis_loss", np.mean(dis_losses), iter_no)

            # Reset the accumulated generator losses
            gen_losses = []

            # Reset the accumulated discriminator losses
            dis_losses = []

        # Every SAVE_IMAGE_EVERY_ITER times
        if iter_no % SAVE_IMAGE_EVERY_ITER == 0:

            # Feed image samples to TensorBoard
            writer.add_image("fake", vutils.make_grid(gen_output_v.data[:64]), iter_no)
            writer.add_image("real", vutils.make_grid(batch_v.data[:64]), iter_no)

INFO: Making new env: Breakout-v0
INFO: Making new env: AirRaid-v0
INFO: Making new env: Pong-v0
INFO: Iter 100: gen_loss=6.203e+00, dis_loss=2.577e-02
INFO: Iter 200: gen_loss=7.509e+00, dis_loss=1.073e-03
INFO: Iter 300: gen_loss=8.070e+00, dis_loss=5.415e-04
INFO: Iter 400: gen_loss=8.376e+00, dis_loss=3.789e-04
INFO: Iter 500: gen_loss=8.583e+00, dis_loss=2.952e-04
INFO: Iter 600: gen_loss=8.727e+00, dis_loss=2.408e-04
INFO: Iter 700: gen_loss=8.932e+00, dis_loss=1.859e-04
INFO: Iter 800: gen_loss=9.104e+00, dis_loss=1.525e-04
INFO: Iter 900: gen_loss=9.334e+00, dis_loss=1.190e-04
INFO: Iter 1000: gen_loss=9.505e+00, dis_loss=9.962e-05
INFO: Iter 1100: gen_loss=9.925e+00, dis_loss=6.740e-05
INFO: Iter 1200: gen_loss=9.975e+00, dis_loss=6.361e-05
INFO: Iter 1300: gen_loss=9.829e+00, dis_loss=7.089e-05
INFO: Iter 1400: gen_loss=1.014e+01, dis_loss=5.115e-05
INFO: Iter 1500: gen_loss=1.045e+01, dis_loss=3.787e-05
INFO: Iter 1600: gen_loss=1.076e+01, dis_loss=2.826e-05
INFO: Iter 1700:
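
The training loop above calls `detach()` on the generator's output during the discriminator step. The effect is easy to check on a toy example. In the sketch below, two scalar tensors stand in for a generator parameter and a discriminator parameter; all names and values are assumptions made for illustration and are not part of the example code.

```python
import torch

w_gen = torch.tensor(2.0, requires_grad=True)  # stand-in for a generator parameter
w_dis = torch.tensor(3.0, requires_grad=True)  # stand-in for a discriminator parameter

fake = w_gen * 1.5  # "generator" output, still attached to w_gen

# Discriminator step: detach() cuts the graph, so nothing flows back to w_gen
loss_d = (w_dis * fake.detach()) ** 2
loss_d.backward()
grad_gen_after_dis_step = w_gen.grad
print(grad_gen_after_dis_step)  # None
print(w_dis.grad is not None)   # True

# Generator step: no detach(), so the gradient reaches w_gen through the "discriminator"
loss_g = (w_dis * fake) ** 2
loss_g.backward()
print(w_gen.grad is not None)   # True
```

Note that the second `backward()` call also accumulates a gradient in `w_dis`. In the training loop this is harmless, because only the generator's optimizer takes a step at that point and the discriminator's gradients are zeroed at the start of its next step.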

The training of this example is quite a lengthy process. On a GTX 1080 GPU, 100 iterations take about 40 seconds. At the beginning, the generated images are completely random noise, but after 10k-20k iterations, the generator becomes more and more proficient at its job and the generated images become more and more similar to the real game screenshots.

My experiments gave the following images after 40k-50k training iterations (several hours on a GPU):

<img width="700px" src="./assets/generator sample.png">



<br>

# 08. Summary
---

In this chapter, we took a quick tour of PyTorch functionality and features. We talked about fundamental pieces such as tensors and gradients, saw how an NN can be made from the basic building blocks, and learned how to implement those blocks ourselves. We discussed loss functions and optimizers, as well as the monitoring of training dynamics. The goal of the chapter was to give a very quick introduction to PyTorch, which will be used later in the book. In the next chapter, we're ready to start dealing with the main subject of this book: RL methods.

***THE END***