### Reference
https://adventuresinmachinelearning.com/pytorch-tutorial-deep-learning/

PyTorch has several advantages over TensorFlow and Keras:
1. easier debugging
2. dynamic computational graph construction, unlike the static graphs of TensorFlow 1.x (and of Keras running on top of it)
3. supported by Facebook, Twitter, NVIDIA, etc.
4. works easily with NumPy, SciPy, scikit-learn, etc.

#### Computational Graph
A computational graph is a set of calculations. It is made up of nodes, where every node is an input, an output, or a calculation.
**Advantages:** <br>
1. each node is an independent piece of code, which allows performance optimizations such as *threading, multiprocessing, parallelism, etc.*
2. all deep learning frameworks (TensorFlow, Theano, PyTorch) construct a computational graph: neural networks are built on these graphs, and the gradients of a network are back-propagated through them
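
To make the idea of nodes and back-propagation concrete, here is a minimal sketch of a scalar computational graph in plain Python. The `Node` class and its attributes are hypothetical names invented for this illustration; this is not how PyTorch is implemented, only the same principle at toy scale.

```python
class Node:
    """One node of a tiny scalar computational graph (illustrative only)."""

    def __init__(self, value, parents=(), local_grads=()):
        self.value = value              # result of the forward calculation
        self.parents = parents          # nodes this node was computed from
        self.local_grads = local_grads  # d(self)/d(parent), one per parent
        self.grad = 0.0                 # filled in by backward()

    def __add__(self, other):
        return Node(self.value + other.value, (self, other), (1.0, 1.0))

    def __mul__(self, other):
        return Node(self.value * other.value, (self, other),
                    (other.value, self.value))

    def backward(self, upstream=1.0):
        # Accumulate the incoming gradient, then pass it on to the
        # parents, scaled by the local derivative (chain rule).
        self.grad += upstream
        for parent, local in zip(self.parents, self.local_grads):
            parent.backward(upstream * local)


x = Node(2.0)                             # input node
z = Node(2.0) * (x * x) + Node(5.0) * x   # calculation nodes: z = 2x^2 + 5x
z.backward()
print(z.value, x.grad)  # 18.0 13.0
```

The gradient of `x` is the sum of the contributions arriving along every path from `z` back to `x`, which is why `backward` accumulates with `+=`.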

#### Tensors
1. A tensor is a multi-dimensional array data structure (a generalization of vectors and matrices)
2. tensors are essential components of deep learning libraries
3. they are essential for efficient computation: operations between tensors can be computed efficiently on GPUs
4. NumPy-style slicing is available
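
The points above can be sketched as follows. The variable names are illustrative, and the GPU branch only runs if CUDA happens to be available on the machine.

```python
import numpy as np
import torch

# A 2x3 tensor: [[0, 1, 2], [3, 4, 5]]
a = torch.arange(6, dtype=torch.float32).reshape(2, 3)

# NumPy-style slicing
first_row = a[0, :]    # tensor([0., 1., 2.])
last_col = a[:, -1]    # tensor([2., 5.])

# torch.from_numpy shares memory with the source array (CPU tensors only),
# so a change made through the array is visible through the tensor
arr = np.ones((2, 3), dtype=np.float32)
t = torch.from_numpy(arr)
arr[0, 0] = 7.0

# Run a matrix product on the GPU if there is one, otherwise on the CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
a_dev = a.to(device)
product = (a_dev @ a_dev.T).cpu()   # shape (2, 2)
```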

#### AutoGrad
1. Autograd is the mechanism by which error gradients are calculated and back-propagated through a computational graph in PyTorch.
2. The **Variable class** is the main component of this autograd system (since PyTorch 0.4 its functionality is part of the tensor itself, and *Variable(T)* simply returns a tensor).
    1. It wraps a tensor T
    2. It allows automatic gradient computation on the tensor T when the *.backward()* function is called
3. The object contains:
    1. the data of the tensor
    2. the gradient of the tensor (calculated with respect to the loss)
    3. a reference to the function that created the variable (the reference is None if the variable was created by the user)
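
As a minimal sketch, assuming a PyTorch version of 0.4 or later where plain tensors carry the autograd information, the three pieces listed above are exposed as `.data`, `.grad` and `.grad_fn`:

```python
import torch

x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)  # created by the user
y = (x ** 2).sum()                                      # created by an operation

print(x.data)      # the raw values: tensor([1., 2., 3.])
print(x.grad_fn)   # None, because x was not produced by an operation
print(y.grad_fn)   # a backward-function object recorded by autograd

y.backward()       # y is a scalar, so no argument is needed
print(x.grad)      # dy/dx = 2x -> tensor([2., 4., 6.])
```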

In [1]:
import torch
from torch.autograd import Variable

### Tensor Example

In [2]:
x = torch.rand(2,3)
x

tensor([[0.3661, 0.6672, 0.1241],
        [0.5327, 0.8319, 0.0384]])

In [3]:
y = torch.ones(2,3) + x
# indexing: y[row, column] = value
#   y[:, column] selects every row of that column
#   y[row, :]    selects every column of that row
# here: set the last column (index 2) to 0
y[:,2] = 0

In [4]:
y

tensor([[1.3661, 1.6672, 0.0000],
        [1.5327, 1.8319, 0.0000]])

### Autograd Example

In [5]:
x = Variable(torch.ones(2, 2) * 2, requires_grad=True)

In [6]:
x

tensor([[2., 2.],
        [2., 2.]], requires_grad=True)

In the Variable declaration above, we created a 2x2 tensor filled with the value 2 and specified that this variable requires a gradient.

*If we were using a variable with **requires_grad = True** in a neural network, this **Variable** would be **trainable**. If we set this flag to False, the Variable would not be trained.*

For this simple example we aren’t training anything, but we do want to inspect the gradient of this Variable, as shown below.

In [7]:
z = 2 * (x * x) + 5 * x # another variable from x

z = 2x^2 + 5x <br>
Differentiating z with respect to x analytically gives dz/dx = 4x + 5. <br>
So for a 2x2 tensor filled with 2 ([[2,2],[2,2]]), the gradient is 4 * 2 + 5 = 13 everywhere: ([[13,13],[13,13]]). <br>
Let us compute the same thing with PyTorch autograd: <br>
1. call the *.backward()* function on z
2. since z is not a scalar, pass it a tensor of the same shape as z; each entry weights the corresponding element of z in the backward pass, so a tensor of ones yields the plain element-wise dz/dx
3. read the gradient of x from *.grad* <br>
**NB:** the gradient is stored in the x Variable, in **the property .grad**.

In [8]:
z.backward(torch.ones(2, 2))
print(x.grad)

tensor([[13., 13.],
        [13., 13.]])
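
To see what the tensor passed to *.backward()* does, the following sketch repeats the same calculation with a non-uniform argument. The `weights` tensor is an arbitrary choice made for illustration; each of its entries scales the gradient of the matching element of z.

```python
import torch

x = torch.full((2, 2), 2.0, requires_grad=True)
z = 2 * (x * x) + 5 * x            # dz/dx = 4x + 5 = 13 for x = 2

# Arbitrary illustrative weights, one per element of z
weights = torch.tensor([[1.0, 0.0],
                        [0.0, 2.0]])
z.backward(weights)

print(x.grad)  # tensor([[13., 0.], [0., 26.]])
```

With a tensor of ones every element gets weight 1, which recovers the 13s shown above.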


### Build a Neural Network to classify MNIST using PyTorch

4-layer network:
1. input layer: 1D vector of 28x28 = 784 nodes (pixel values)
2. 1st hidden layer: a fully connected layer of 200 nodes followed by a ReLU activation function
3. 2nd hidden layer: a fully connected layer of 200 nodes followed by a ReLU activation function
4. output layer: 1D vector of 10 nodes (10 digits : 0, 1, 2, 3, 4, 5, 6, 7, 8 and 9)
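
As a quick piece of arithmetic on the architecture above, the number of trainable parameters can be counted by hand. This sketch assumes every fully connected layer has one weight per input-output pair plus one bias per output node.

```python
# Layer widths taken from the description above
layer_sizes = [784, 200, 200, 10]

# weights (n_in * n_out) plus biases (n_out) for each pair of consecutive layers
params_per_layer = [
    n_in * n_out + n_out
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:])
]
total_params = sum(params_per_layer)

print(params_per_layer)  # [157000, 40200, 2010]
print(total_params)      # 199210
```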

#### Creating the neural network class
We will use Python class inheritance: PyTorch provides a neural network base class, nn.Module, that we can inherit from. <br>
==> we can use all the nn functionality (*here torch.nn.functional as F*) and override the forward pass through the network (*here the forward() function is overridden*). <br>

**A fully connected neural network layer** is represented by the **nn.Linear** object with **2 arguments**: <br>
1. the 1st argument = the number of nodes in layer l
2. the 2nd argument = the number of nodes in layer l+1
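
A small sketch of how the two arguments determine the shapes involved. The batch of 32 random vectors is a made-up input used only to show the shapes.

```python
import torch
import torch.nn as nn

layer = nn.Linear(784, 200)     # 784 nodes in layer l, 200 nodes in layer l+1
batch = torch.randn(32, 784)    # hypothetical batch of 32 flattened images
out = layer(batch)

print(out.shape)           # torch.Size([32, 200])
print(layer.weight.shape)  # torch.Size([200, 784])
print(layer.bias.shape)    # torch.Size([200])
```

Note that the weight matrix is stored as (outputs, inputs), the transpose of the argument order.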

In [2]:
import torch.nn as nn
import torch.nn.functional as F

In [3]:
import torch.optim as optim

from torchvision import datasets, transforms

In [4]:
class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        # the skeleton of the network
        
        self.fc1 = nn.Linear(28 * 28, 200)
        self.fc2 = nn.Linear(200, 200)
        self.fc3 = nn.Linear(200, 10)
    def forward(self, x):
        """ defines the forward pass of the network
            1st, the input x is fed to a fully connected layer fc1 => 1st intermediate output o1
            2nd, a relu activation is applied on the output o1 => 2nd intermediate output o2
            3rd, o2 is fed to a fully connected layer fc2 => 3rd intermediate output o3
            4th, a relu activation is applied on the output o3 => 4th intermediate output o4
            5th, o4 is fed to a fully connected layer fc3 => 5th almost final output o5
            6th, a log softmax activation is applied on o5 to output the final output
        """
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)
        # dim=1: normalize across the 10 classes of each sample
        return F.log_softmax(x, dim=1)

In [5]:
# Creating an instance of our  Net Class
net = Net()
print(net)

Net(
  (fc1): Linear(in_features=784, out_features=200, bias=True)
  (fc2): Linear(in_features=200, out_features=200, bias=True)
  (fc3): Linear(in_features=200, out_features=10, bias=True)
)
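
The last step of the forward pass, the log softmax, can be checked in isolation. In this sketch `scores` is a random stand-in for the raw output of fc3, not real network output.

```python
import torch
import torch.nn.functional as F

scores = torch.randn(4, 10)                # stand-in for fc3 output, 4 samples
log_probs = F.log_softmax(scores, dim=1)   # normalize across the 10 classes

# Exponentiating log-probabilities gives probabilities that sum to 1 per sample
row_sums = log_probs.exp().sum(dim=1)
print(row_sums)
```

Passing `dim=1` explicitly matters: the class scores of one sample lie along the second dimension, and the first dimension indexes the samples in the batch.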


#### Loading Data
1. torch.utils.data.DataLoader(**dataset**, **batch_size**=1, **shuffle**=False, sampler=None, batch_sampler=None, num_workers=0, collate_fn=None, pin_memory=False, drop_last=False, timeout=0, worker_init_fn=None) is the class used to **load data**; we usually only change the first 3 arguments (in bold)
2. **Normalize** the data: neural networks train better when the input data is normalized, i.e. rescaled to a small, roughly zero-centered range. To do that we call the **.Compose()** function from the ***torchvision*** package: <br> ***Numerous transforms can be chained together in a list using the Compose() function***. Here, we chain 2 transformations:
    1. convert the data into a PyTorch tensor (ToTensor() also scales the pixel values to the range 0 to 1):
        1. A PyTorch tensor is a specific data type used in PyTorch for all of the various data and weight operations within the network
        2. it is simply a multi-dimensional matrix
        3. In any case, PyTorch requires the data set to be transformed into a tensor so it can be consumed in the training and testing of the network
    2. normalize the data by subtracting a mean of 0.1307 and dividing by a standard deviation of 0.3081 (the values commonly quoted for the MNIST training pixels), so that the result has approximately zero mean and unit standard deviation
    3. **NB:** if we have many channels we need to provide the mean and std of each channel in this way:
        1. transforms.Normalize((M1, M2, ... Mn), (Std1, Std2, ... Stdn))
        2. the normalization formula is the following: input[channel] = (input[channel] - mean[channel]) / std[channel]
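
Applying the normalization formula by hand shows the range the pixel values end up in. This sketch assumes the inputs have already been scaled to the range 0 to 1; `normalize` is a helper written for this illustration, not a torchvision function.

```python
MEAN, STD = 0.1307, 0.3081   # single-channel values used in this notebook


def normalize(pixel, mean=MEAN, std=STD):
    """Same formula as above: (input - mean) / std."""
    return (pixel - mean) / std


darkest = normalize(0.0)     # a black pixel, about -0.4242
brightest = normalize(1.0)   # a white pixel, about 2.8215
print(round(darkest, 4), round(brightest, 4))
```

So the normalized values do not lie in a fixed interval such as -1 to 1; they are shifted and scaled around zero.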

**DataLoader has many advantages:**
1. the ability to **shuffle** the data easily
2. the ability to easily **batch** the data
3. the ability to make data consumption more efficient by loading the data in parallel using **multiprocessing**
4. A data loader can be used **as an iterator** – so to **extract the data** we can just **use** the standard Python iterators such as **enumerate**
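
The batching and iteration behaviour can be tried without downloading anything. This sketch uses a small synthetic dataset (10 random samples, made up for illustration) wrapped in a `TensorDataset`.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Synthetic stand-in for MNIST: 10 flattened "images" with random labels
features = torch.randn(10, 784)
labels = torch.randint(0, 10, (10,))
dataset = TensorDataset(features, labels)

loader = DataLoader(dataset, batch_size=4, shuffle=True)

batch_sizes = []
for batch_idx, (data, target) in enumerate(loader):
    batch_sizes.append(len(data))
    print(batch_idx, tuple(data.shape), tuple(target.shape))

print(len(loader))   # 3 batches: two of 4 samples and a last one of 2
```

Setting `num_workers` to a positive value would load batches in worker processes; it is left at its default of 0 here to keep the sketch simple.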

      

In [6]:
batch_size=200
learning_rate=0.01
epochs=10
log_interval=10

transformation = transforms.Compose([
                           transforms.ToTensor(),
                           transforms.Normalize((0.1307,), (0.3081,))
                       ])

train_loader = torch.utils.data.DataLoader(
    datasets.MNIST('./data/MNIST', train=True, download=True,
                   transform=transformation),
    batch_size=batch_size, shuffle=True)

test_loader = torch.utils.data.DataLoader(
    datasets.MNIST('./data/MNIST', train=False, transform=transformation),
    batch_size=batch_size, shuffle=True)

#### Training the network
1. choose the **optimizer** (*here Stochastic Gradient Descent*)
2. set the **learning rate** and **momentum** of the optimization process
3. set the **parameters we want to optimize**; in PyTorch, the **.parameters()** method returns all of the network's parameters
4. set the **loss with respect to which the optimization is done** (*here we chose the **negative log-likelihood loss (NLL)***: a log softmax as the last activation combined with NLL is equivalent to a cross-entropy loss, which is what we need for a **multiclass (multinomial) classification**, here with 10 classes)
5. train the network by dividing the input into mini-batches for many epochs:
    1. convert data and target to PyTorch Variables (no longer required since PyTorch 0.4, where tensors can be used directly)
    2. reshape data to fit the fully connected layer
    3. reset the gradients to 0 using **.zero_grad()** so that the model is ready for the next back propagation pass. In other libraries this is performed implicitly, **but in PyTorch you have to remember to do it explicitly.**
    4. pass the input data to the network => the forward pass is executed => for each sample we get an output 1D vector of 10 elements
    5. measure how far the network's output is from the target by calculating the loss *criterion*
    6. backpropagate the calculated error. **NB:** here we call ***.backward()*** without an argument because the loss is a scalar, and scalars don’t require one; only non-scalar tensors require a matching-sized tensor argument to be passed to the .backward() operation.
    7. execute a gradient descent step based on the gradients calculated during the .backward() operation using ***optimizer.step()***
    8. **PS:** we can track the network's progress by printing the loss value calculated for each batch and epoch. The loss returned by nn.NLLLoss (negative log-likelihood loss) is a tensor holding a single scalar; older PyTorch versions read it with **.data[0]**, while current versions use **.item()**
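
The equivalence mentioned in step 4 (log softmax followed by NLL equals cross-entropy) can be checked numerically. The logits and targets below are random values made up for the check.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(5, 10)             # raw scores: 5 samples, 10 classes
target = torch.randint(0, 10, (5,))     # one class index per sample

# Route 1: log softmax, then negative log-likelihood
nll = nn.NLLLoss()(F.log_softmax(logits, dim=1), target)

# Route 2: cross-entropy applied directly to the raw scores
ce = nn.CrossEntropyLoss()(logits, target)

print(nll.item(), ce.item())            # the two values agree
```

This is why the network in this notebook ends with a log softmax: nn.NLLLoss expects log-probabilities, whereas nn.CrossEntropyLoss would expect the raw scores.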

In [11]:
# create a stochastic gradient descent optimizer
optimizer = optim.SGD(net.parameters(), lr=learning_rate, momentum=0.9)
# create a loss function
criterion = nn.NLLLoss()

# run the main training loop
for epoch in range(epochs):
    for batch_idx, (data, target) in enumerate(train_loader):
        # tensors can be fed to the network directly; wrapping them in Variable
        # has been unnecessary since PyTorch 0.4
        # resize data from (batch_size, 1, 28, 28) to (batch_size, 28*28) using .view();
        # "-1" lets PyTorch infer the batch dimension, whatever its size
        data = data.view(-1, 28*28)
        # reset all the gradients in the model so that they do not accumulate
        # across back propagation passes
        optimizer.zero_grad()
        net_out = net(data)
        # nn.NLLLoss() is used for multiclass classification;
        # its arguments are: input of shape (N, C) and target of shape (N),
        # where N = number of samples in the mini-batch and C = number of classes
        loss = criterion(net_out, target)
        loss.backward()
        optimizer.step()
        if batch_idx % log_interval == 0:
            print('Train Epoch: {} [{}/{} ({:.0f}%)]\tLoss: {:.6f}'.format(
                epoch, batch_idx * len(data), len(train_loader.dataset),
                       100. * batch_idx / len(train_loader), loss.item()))
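
The explicit `optimizer.zero_grad()` call in the loop above matters because PyTorch adds new gradients to the ones already stored. A minimal sketch with a single made-up parameter `w` shows the accumulation; resetting `w.grad` by hand plays the same role here as `zero_grad()` does for a whole model.

```python
import torch

w = torch.tensor(1.0, requires_grad=True)

(3 * w).backward()
first = w.grad.item()         # 3.0

(3 * w).backward()            # no reset: the new gradient is added to the old one
accumulated = w.grad.item()   # 6.0

w.grad.zero_()                # reset the stored gradient by hand
(3 * w).backward()
after_reset = w.grad.item()   # 3.0

print(first, accumulated, after_reset)
```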







#### Testing the network
1. predict on test data
2. get the predicted class using **.data.max(1)[1]** *(further explanation in the code)*
3. calculate the accuracy of the model using **.eq()** function
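
Steps 2 and 3 can be tried on a tiny hand-made example. The probabilities and targets below are invented for illustration: 3 samples and 3 classes instead of 10.

```python
import torch

# Log-probabilities for 3 samples over 3 classes (made-up values)
log_probs = torch.log(torch.tensor([[0.7, 0.2, 0.1],
                                    [0.1, 0.1, 0.8],
                                    [0.3, 0.4, 0.3]]))
target = torch.tensor([0, 2, 0])

# max(1) returns (max values, their indices) along the class dimension
values, indices = log_probs.max(1)
pred = indices                            # tensor([0, 2, 1])

# eq() compares element-wise; summing counts the correct predictions
correct = pred.eq(target).sum().item()    # 2 of 3 are right
print(pred, correct)
```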

In [17]:
# run a test loop
test_loss = 0
correct = 0
# no gradients are needed for evaluation;
# torch.no_grad() replaces the removed volatile=True flag
with torch.no_grad():
    for data, target in test_loader:
        data = data.view(-1, 28 * 28)
        net_out = net(data)
        # sum up batch loss (criterion returns the mean loss of each batch)
        test_loss += criterion(net_out, target).item()
        # net_out.shape = (batch_size, 10), i.e. for every sample of the batch we have
        # a 10-element vector holding the log-probabilities of the sample belonging
        # to the digit at each index =>
        # net_out[0,0] = log-probability of sample 0 belonging to the class digit 0
        # net_out[10,5] = log-probability of sample 10 belonging to the class digit 5
        # to get the predicted digit we use .data.max(1), i.e. the max along the 2nd dimension
        # (the 1st dimension indexes the samples, the 2nd dimension indexes the classes)
        # .data.max(1) returns 2 values: the max value and its position (index);
        # we want the index, so we use .data.max(1)[1]
        pred = net_out.data.max(1)[1]  # get the index of the max log-probability
        # now that we have the prediction, we compare it to the actual target class
        # and count how many times in the batch the network predicted correctly:
        # .eq() compares the values in two tensors element-wise and returns
        # 1 where they match and 0 where they don't
        correct += pred.eq(target.data).sum().item()

# note: test_loss is a sum of per-batch means, so dividing by the number of samples
# gives roughly the per-sample loss divided by batch_size
test_loss /= len(test_loader.dataset)
print('\nTest set: Average loss: {:.4f}, Accuracy: {}/{} ({:.0f}%)\n'.format(
        test_loss, correct, len(test_loader.dataset),
        100. * correct / len(test_loader.dataset)))

  """



Test set: Average loss: 0.0003, Accuracy: 9789/10000 (98%)

