## COMP-551 Applied ML



# http://pytorch.org/

## Why we need GPUs ?

* We do a lot of matrix multiplications !
* CPUs are fast and good for sequential tasks.
* GPUs are a bit slower but work really well for parallel tasks.
* GPUs have much smaller memory when compared to CPUs
* For training deep nets GPU with CUDA can lead to a 50 – 100 x boost.* 


\* <sub> (https://github.com/jcjohnson/cnn-benchmarks)



## Why Deep Learning Frameworks ?

* So you don’t have to deal with CUDA libraries.

* Build symbolic graphs of computation. ( No need to manually calculate and code the gradients for each parameter )

* Can take gradients of some scalar loss with weights/parameters.

* Applies the chain rule for you !!




## Popular Deep Learning Frameworks


* TensorFlow – from Google
* PyTorch – from Facebook, based on Torch by NYU
* Caffe2 – from Facebook, based on Caffe by Berkley
* Chainer – from Preferred Networks

All of the above are open source!


## Why PyTorch ?

![py](./pytorch_vs_tf.png)

In [2]:
# Imports 
from __future__ import print_function
import pickle as pkl
import torch 
import numpy as np
import argparse
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torchvision import datasets, transforms
from torch.autograd import Variable


ModuleNotFoundError: No module named 'torch'

There are three moajor components to PyTorch:

## Tensors

* Its like a numpy ndarray 
* Doesn't know anything about deep learning or computational graphs or gradients
* **Also runs runs on GPU !**
* Can convert back and forth from the numpy array 


In [4]:
x = torch.Tensor(5, 3)
print(x)


 0.0000e+00 -3.6893e+19  1.6560e+10
-1.5849e+29  1.7153e-34  4.5835e-41
 1.7184e-34  4.5835e-41  1.7184e-34
 4.5835e-41  0.0000e+00  4.2039e-45
 0.0000e+00  0.0000e+00  7.0065e-45
[torch.FloatTensor of size 5x3]



In [5]:
print(x.size())

torch.Size([5, 3])


In [6]:
y = torch.rand(5, 3)
print(x + y)


 8.9475e-02 -3.6893e+19  1.6560e+10
-1.5849e+29  3.0225e-01  9.8345e-01
 9.8165e-01  2.2902e-01  2.5777e-01
 2.0098e-01  8.8830e-01  9.1491e-01
 7.5256e-01  8.5380e-01  9.9057e-01
[torch.FloatTensor of size 5x3]



Supports most numpy operations like broadcasting, arithmetic, reshaping, indexing, etc.

### Converting from and back numpy

In [7]:
z = torch.LongTensor([[1, 3], [2, 9]])
print(z.type())
# Cast to numpy ndarray
print(z.numpy().dtype)

torch.LongTensor
int64


In [8]:
# Data type inferred from numpy
print(torch.from_numpy(np.random.rand(5, 3)).type())
print(torch.from_numpy(np.random.rand(5, 3).astype(np.float32)).type())

torch.DoubleTensor
torch.FloatTensor


## Moving things to GPU

In [9]:
x = torch.FloatTensor(5, 3).uniform_(-1, 1)
print(x)


 0.3390 -0.0033 -0.0531
-0.3678 -0.7003  0.7110
 0.5751 -0.6191 -0.3941
-0.4094  0.1602 -0.1897
 0.1409 -0.2891  0.9764
[torch.FloatTensor of size 5x3]



In [10]:
# move tensor to GPU using
x = x.cuda()
print(x)


# move back to CPU
x = x.cpu()
print(x)

AssertionError: Torch not compiled with CUDA enabled

## Variables 

* Responsible for Automatic Differentiation
* Node in a computational graph; stores data and gradient ( `.data` and `.grad` ) 
    * If `x` is a variable, `x.data` is a tensor
    * `x.grad` is a Variable of gradients (same shape as x.data), with respect to some scalar value.
    * `x.grad.data` is a Tensor of gradients
    


![py](./Variable.png)

* Tensors and Variables have the almost same API

* Variables remember how they were created (for backprop), denoted by `.grad_fn`

* Once you finish your computation you can call `.backward()` and have all the gradients computed automatically.

In [11]:
x = Variable(torch.ones(2, 2), requires_grad=True)
print(x)

Variable containing:
 1  1
 1  1
[torch.FloatTensor of size 2x2]



In [12]:
y = x + 2
print(y)

Variable containing:
 3  3
 3  3
[torch.FloatTensor of size 2x2]



In [4]:
p = torch.ones(2, 2)
z = y * p * 3
out = z.mean()

print(z, out)

NameError: name 'torch' is not defined

In [14]:
out.backward()

In [15]:
# d(out)/dx

print(x.grad)


Variable containing:
 4.5000  4.5000
 4.5000  4.5000
[torch.FloatTensor of size 2x2]



# Dynamic Computation Graphs

* PyTorch maintains a graph that records all of the operations performed on variables as you execute your operations.
* This results in a directed acyclic graph whose leaves are the input variables and roots are the output variables. 
* By tracing this graph from roots to leaves, you can automatically compute the gradients using the chain rule.


![py](./dynamic_graph.gif)

## Modules

* Differetiable objects; may store state or learnable weights
* Can define a new module; it inputs and outputs Variables and corresponding input and output functions


### torch.nn

Neural networks can be constructed using the **torch.nn** package. Provides pretty much all neural network related functionalities for this course such as :

* Linear layers - nn.Linear, nn.Bilinear
* Convolution Layers - nn.Conv1d, nn.Conv2d, nn.Conv3d, nn.ConvTranspose2d
* Nonlinearities - nn.Sigmoid, nn.Tanh, nn.ReLU, nn.LeakyReLU
* Pooling Layers - nn.MaxPool1d, nn.AveragePool2d
* Recurrent Networks - nn.LSTM, nn.GRU
* Normalization - nn.BatchNorm2d
* Dropout - nn.Dropout, nn.Dropout2d
* Embedding - nn.Embedding
* Loss Functions - nn.MSELoss, nn.CrossEntropyLoss, nn.NLLLoss


In [16]:
class TwoLayerNet(torch.nn.Module):
    def __init__(self, D_in, H, D_out):
        """
        In the constructor we instantiate two nn.Linear modules and assign them as
        member variables.
        
        Args:
            - D_in : input dimension of the data
            - H : size of the first hidden layer
            - D_out : size of the output/ second layer
        """
        super(TwoLayerNet, self).__init__() # intialize recursively 
        self.linear1 = torch.nn.Linear(D_in, H) # create a linear layer 
        self.linear2 = torch.nn.Linear(H, D_out) 

    def forward(self, x):
        """
        In the forward function we accept a Variable of input data 
        and we must return a Variable of output data. We can use 
        Modules defined in the constructor as well as arbitrary 
        operators on Variables.
        """
        h_relu = self.linear1(x)
        y_pred = self.linear2(h_relu)
        return y_pred
    

In [17]:
# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.

N, D_in, H, D_out = 64, 1000, 100, 10

In [18]:
# Create random Tensors to hold inputs and outputs, and wrap them in Variables

x = Variable(torch.randn(N, D_in))
y = Variable(torch.randn(N, D_out), requires_grad=False)


In [19]:
# Construct our model by instantiating the class defined above

model = TwoLayerNet(D_in, H, D_out)


### Construct our loss function and an Optimizer. 

The call to **model.parameters()** in the SGD constructor will contain the learnable parameters of the two nn.Linear modules which are part of the model.

In [20]:
# loss function
criterion = torch.nn.MSELoss(size_average=False)

# optimizer 
optimizer = torch.optim.SGD(model.parameters(), lr=1e-4)

In [21]:
for epoch in range(50):
    # Forward pass: Compute predicted y by passing x to the model
    y_pred = model(x)

    # Compute and print loss
    loss = criterion(y_pred, y)
    print(epoch, loss.data[0])

    # Reset gradients to zero, perform a backward pass, and update the weights.
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

0 689.3861083984375
1 590.0618896484375
2 507.8590393066406
3 438.5547790527344
4 379.36199951171875
5 328.35858154296875
6 284.16656494140625
7 245.75306701660156
8 212.31654357910156
9 183.21006774902344
10 157.8936767578125
11 135.90509033203125
12 116.84040069580078
13 100.34255981445312
14 86.09352111816406
15 73.80962371826172
16 63.23798751831055
17 54.15391540527344
18 46.35847473144531
19 39.676605224609375
20 33.95465087890625
21 29.058568954467773
22 24.87167739868164
23 21.292922973632812
24 18.234994888305664
25 15.622568130493164
26 13.390934944152832
27 11.484526634216309
28 9.855705261230469
29 8.46373462677002
30 7.273780822753906
31 6.256101131439209
32 5.385315418243408
33 4.639800548553467
34 4.001124382019043
35 3.453582763671875
36 2.9838263988494873
37 2.5804648399353027
38 2.2338199615478516
39 1.9356528520584106
40 1.6789436340332031
41 1.457719087600708
42 1.2668880224227905
43 1.102110505104065
44 0.959691047668457
45 0.8364707231521606
46 0.7297531962394714


## References:

* MILA Pytorch tutorial ( https://github.com/mila-udem/welcome_tutorials/tree/master/pytorch )
* Justin Johnson's tutorial ( http://pytorch.org/tutorials/beginner/pytorch_with_examples.html )