## Learning PyTorch with Examples

Fundamental concepts of PyTorch through self-contained examples.

[Link to tutorial](https://pytorch.org/tutorials/beginner/pytorch_with_examples.html)

### Warm-up: numpy

Numpy provides an n-dimensional array object and many functions for manipulating these arrays. It is a generic framework for numerical computing and knows nothing about computation graphs or gradients. Even so, it is easy to implement a two-layer neural network by manually computing the forward and backward passes through the network using numpy operations.

In [1]:
import numpy as np

In [2]:
# N = batch_size; D_in = input dimensions
# H = hidden dimensions; D_out = output dimensions

N, D_in, H, D_out = 64, 1000, 100, 10

# create random input and output data
x = np.random.randn(N, D_in)
y = np.random.randn(N, D_out)

# Randomly init weights
w1 = np.random.randn(D_in, H)
w2 = np.random.randn(H, D_out)

learning_rate = 1e-6

for t in range(500):
    
    # Forward Pass: compute predicted y
    h = x.dot(w1)
    h_relu = np.maximum(h, 0)
    y_pred = h_relu.dot(w2)
    
    # Compute and print loss
    loss = np.square(y_pred - y).sum()
    if t % 100 == 99:
        print(t, loss)
    
    # Backprop to compute gradients of loss w.r.t. w1 and w2
    grad_y_pred = 2.0 * (y_pred - y)
    grad_w2 = h_relu.T.dot(grad_y_pred)
    grad_h_relu = grad_y_pred.dot(w2.T)
    grad_h = grad_h_relu.copy()
    grad_h[h < 0] = 0
    grad_w1 = x.T.dot(grad_h)
    
    # Update weights
    w1 -= learning_rate * grad_w1
    w2 -= learning_rate * grad_w2

99 531.1132028511398
199 1.7900854386712282
299 0.008965826537142917
399 5.03740318953559e-05
499 2.9681312725929016e-07
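
The hand-derived backward pass can be sanity-checked against a numerical derivative. The following is a minimal sketch, not part of the original tutorial: the small layer sizes, the seed, the step size `eps`, and the helper name `loss_fn` are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, only for reproducibility

# Small sizes (assumed for illustration) keep the check fast
N, D_in, H, D_out = 4, 5, 3, 2
x = rng.standard_normal((N, D_in))
y = rng.standard_normal((N, D_out))
w1 = rng.standard_normal((D_in, H))
w2 = rng.standard_normal((H, D_out))

def loss_fn(w1, w2):
    """Sum-of-squares loss of the two-layer ReLU network (hypothetical helper)."""
    h_relu = np.maximum(x.dot(w1), 0)
    return np.square(h_relu.dot(w2) - y).sum()

# Analytic gradient of the loss w.r.t. w1, same formulas as the training loop
h = x.dot(w1)
h_relu = np.maximum(h, 0)
grad_y_pred = 2.0 * (h_relu.dot(w2) - y)
grad_h = grad_y_pred.dot(w2.T)
grad_h[h < 0] = 0
grad_w1 = x.T.dot(grad_h)

# Central finite difference for a single entry of w1
eps = 1e-6
i, j = 0, 0
w1_plus, w1_minus = w1.copy(), w1.copy()
w1_plus[i, j] += eps
w1_minus[i, j] -= eps
numeric = (loss_fn(w1_plus, w2) - loss_fn(w1_minus, w2)) / (2 * eps)

print(numeric, grad_w1[i, j])  # the two values should agree closely
```

Checking one entry is enough to catch most sign and transpose mistakes; looping over all entries is straightforward but slower.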


### PyTorch: Tensors

A PyTorch Tensor is conceptually identical to a numpy array, but behind the scenes Tensors can keep track of a computational graph and gradients. They can also run on a GPU to accelerate numeric computations.

The same two-layer network as above, reimplemented with PyTorch Tensors:

In [3]:
import torch

In [4]:
dtype = torch.float
device = torch.device("cpu")

In [5]:
# init
N, D_in, H, D_out = 64, 1000, 100, 10

# random input and output data
x = torch.randn(N, D_in, device=device, dtype=dtype)
y = torch.randn(N, D_out, device=device, dtype=dtype)

# random init weights
w1 = torch.randn(D_in, H, device=device, dtype=dtype)
w2 = torch.randn(H, D_out, device=device, dtype=dtype)

learning_rate = 1e-6

for t in range(500):
    
    # Forward pass
    h = x.mm(w1)
    h_relu = h.clamp(min=0)
    y_pred = h_relu.mm(w2)
    
    # Compute and print loss
    loss = (y_pred - y).pow(2).sum().item()
    if t % 100 == 99:
        print(t, loss)
        
    # Backpropagation
    grad_y_pred = 2.0 * (y_pred - y)
    grad_w2 = h_relu.t().mm(grad_y_pred)
    grad_h_relu = grad_y_pred.mm(w2.t())
    grad_h = grad_h_relu.clone()
    grad_h[h < 0] = 0
    grad_w1 = x.t().mm(grad_h)
    
    # Update weights using gradient descent
    w1 -= learning_rate * grad_w1
    w2 -= learning_rate * grad_w2

99 375.7138671875
199 1.4995532035827637
299 0.010684078559279442
399 0.00028402096359059215
499 5.4468731832457706e-05
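
The cell above hard-codes the CPU device. As a rough sketch of how the same code could target a GPU when one is present, and how Tensors interoperate with numpy arrays (the array contents and variable names here are illustrative assumptions):

```python
import numpy as np
import torch

# Fall back to CPU when no CUDA device is available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

a = np.arange(6, dtype=np.float32).reshape(2, 3)

# from_numpy wraps the array as a Tensor; .to(device) copies it if the device is a GPU
t = torch.from_numpy(a).to(device)

result = t.mm(t.t())          # matrix product runs on the selected device
back = result.cpu().numpy()   # move to CPU before converting back to numpy

print(device, back)
```

Only the `device` line changes between CPU and GPU runs; the rest of the code is device-agnostic.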


## Autograd

Autograd provides automatic differentiation, which automates the computation of backward passes in neural networks.

When using autograd, the forward pass creates a computational graph in which the nodes are Tensors and the edges are functions that produce output Tensors from input Tensors. Backpropagating through this graph then lets us easily compute gradients.

Example: if *x* is a Tensor with *x.requires_grad=True*, then after a backward pass *x.grad* is another Tensor holding the gradient of some scalar value (typically the loss) w.r.t. *x*.
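
A minimal sketch of this behavior on a three-element Tensor (the values and the name `s` are illustrative choices):

```python
import torch

x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)

s = (x ** 2).sum()   # scalar value: 1 + 4 + 9 = 14
s.backward()         # populates x.grad with ds/dx

print(x.grad)        # ds/dx = 2 * x -> tensor([2., 4., 6.])
print(x.is_leaf, s.grad_fn is not None)  # x is a graph leaf; s records how it was computed
```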

Implementing the above network with autograd to automate the backward pass.

In [6]:
dtype = torch.float
device = torch.device("cpu")

# init
N, D_in, H, D_out = 64, 1000, 100, 10

# random input and output data
# requires_grad defaults to False, which is fine since we do not need gradients w.r.t. the data
x = torch.randn(N, D_in, device=device, dtype=dtype)
y = torch.randn(N, D_out, device=device, dtype=dtype)

# random tensors for weights
# set requires_grad=True since we need to compute gradients for this during backward pass
w1 = torch.randn(D_in, H, device=device, dtype=dtype, requires_grad=True)
w2 = torch.randn(H, D_out, device=device, dtype=dtype, requires_grad=True)

learning_rate = 1e-6

for t in range(500):
    
    # Forward pass: Compute predicted y
    # Exactly the same as above
    # But we no longer need to keep references to intermediate values since we 
    # are not implementing backward pass manually
    y_pred = x.mm(w1).clamp(min=0).mm(w2)
    
    # Compute and print loss using operations on Tensors
    # loss is a scalar (zero-dimensional) Tensor of shape ()
    # loss.item() gets the scalar value
    loss = (y_pred - y).pow(2).sum()
    if t % 100 == 99:
        print(t, loss.item())
        
    # using autograd to compute the backward pass
    # This will compute the gradient of loss w.r.t all Tensors with requires_grad=True
    # After this call w1.grad and w2.grad will be Tensors holding the gradient of the loss
    # w.r.t w1 and w2 respectively
    loss.backward()
    
    # Manually update weights using gradient descent
    # Need to do this within torch.no_grad() as weights have requires_grad=True
    # but we do not need to track this in autograd
    with torch.no_grad():
        w1 -= learning_rate * w1.grad
        w2 -= learning_rate * w2.grad
        
        # Manually zero the gradients after updating weights
        w1.grad.zero_()
        w2.grad.zero_()

99 291.8914794921875
199 1.058808445930481
299 0.006029502488672733
399 0.00015789813187438995
499 2.9902899768785574e-05
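
The role of *torch.no_grad()* in the update step can be seen in isolation. This is a small sketch with made-up values, not taken from the tutorial: operations inside the context are not recorded in the graph, and an in-place update of a leaf Tensor that requires grad is rejected outside of it.

```python
import torch

w = torch.ones(3, requires_grad=True)

tracked = w * 2              # recorded in the autograd graph
with torch.no_grad():
    untracked = w * 2        # not recorded

print(tracked.requires_grad, untracked.requires_grad)  # True False

# Outside no_grad, an in-place update of a leaf that requires grad raises an error
try:
    w -= 0.1
    raised = False
except RuntimeError:
    raised = True

# Inside no_grad the same update is allowed, and w still requires grad afterwards
with torch.no_grad():
    w -= 0.1

print(raised, w.requires_grad)
```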


## Defining new autograd functions

Each autograd operator is really just two functions that operate on Tensors.

The **forward** function computes output Tensors from input Tensors. The **backward** function receives the gradient of some scalar value w.r.t. the output Tensors and computes the gradient of that same scalar value w.r.t. the input Tensors.

In PyTorch, we can define our own autograd operator by defining a subclass of *torch.autograd.Function* and implementing the forward and backward functions as static methods. We can then use the new operator by calling its *apply* method like a function, passing Tensors containing input data.

Implementing the same two-layer network with a custom autograd function for the ReLU non-linearity:

In [7]:
class myReLu(torch.autograd.Function):
    
    """
    Implementing custom autograd Functions by subclassing torch.autograd.Function
    and implementing forward and backward passes which operate on Tensors.
    """
    
    @staticmethod
    def forward(ctx, input):
        """
        In forward pass we receive a Tensor containing the input and return a Tensor
        containing the output. 
        ctx is a context object that can be used to stash information for backward
        computation. 
        Tensors needed in the backward pass can be cached using the
        ctx.save_for_backward method.
        """
        ctx.save_for_backward(input)
        return input.clamp(min=0)
    
    @staticmethod
    def backward(ctx, grad_output):
        """
        In the backward pass we receive a Tensor containing the gradient of the loss
        w.r.t the output and we compute the gradient of the loss w.r.t the input.
        """
        input, = ctx.saved_tensors
        grad_input = grad_output.clone()
        grad_input[input < 0] = 0
        return grad_input

In [8]:
dtype = torch.float
device = torch.device("cpu")

# init
N, D_in, H, D_out = 64, 1000, 100, 10

# random input and output data
x = torch.randn(N, D_in, device=device, dtype=dtype)
y = torch.randn(N, D_out, device=device, dtype=dtype)

# random init weights
w1 = torch.randn(D_in, H, device=device, dtype=dtype, requires_grad=True)
w2 = torch.randn(H, D_out, device=device, dtype=dtype, requires_grad=True)

learning_rate = 1e-6

for t in range(500):
    
    # To apply function we use Function.apply method
    relu = myReLu.apply
    
    # Forward pass: compute predicted y using operations;
    # Computing ReLU using custom autograd operation.
    y_pred = relu(x.mm(w1)).mm(w2)
    
    loss = (y_pred - y).pow(2).sum()
    if t % 100 == 99:
        print(t, loss.item())
        
    # using autograd to compute the backward pass
    loss.backward()
    
    # Update weights using gradient descent
    with torch.no_grad():
        w1 -= learning_rate * w1.grad
        w2 -= learning_rate * w2.grad
        
        # Zero the gradients
        w1.grad.zero_()
        w2.grad.zero_()

99 801.5262451171875
199 8.402155876159668
299 0.13024316728115082
399 0.002469656290486455
499 0.0001722846063785255
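
A hand-written backward function is easy to get wrong, so it can help to compare it with numerical gradients. The sketch below uses `torch.autograd.gradcheck` for that; the class name `CheckedReLU`, the input values (kept away from the kink at zero), and the tolerances are assumptions made for this example.

```python
import torch
from torch.autograd import gradcheck

class CheckedReLU(torch.autograd.Function):
    """Same ReLU logic as above, redefined so this example is self-contained."""

    @staticmethod
    def forward(ctx, input):
        ctx.save_for_backward(input)
        return input.clamp(min=0)

    @staticmethod
    def backward(ctx, grad_output):
        input, = ctx.saved_tensors
        grad_input = grad_output.clone()
        grad_input[input < 0] = 0
        return grad_input

# gradcheck needs double precision; inputs avoid 0, where ReLU is not differentiable
x = torch.tensor([-2.0, -0.5, 0.5, 2.0], dtype=torch.double, requires_grad=True)

ok = gradcheck(CheckedReLU.apply, (x,), eps=1e-6, atol=1e-4)
matches_builtin = torch.equal(CheckedReLU.apply(x), torch.relu(x))

print(ok, matches_builtin)
```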


## *nn* module

### PyTorch: nn

Computational graphs and autograd are very powerful for defining complex operators and automatically taking derivatives, but for large neural networks raw autograd can be a bit too low-level.

We frequently think of neural networks as arranging computation into layers, some of which have learnable parameters that are optimized during training.

In PyTorch, the **nn** package defines a set of **Modules**, which are roughly equivalent to neural network layers. A Module receives input Tensors and computes output Tensors, and may also hold internal state such as Tensors containing learnable parameters. The **nn** package also defines a set of commonly used loss functions.
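
To make "holds learnable parameters" concrete, here is a small sketch that inspects a single `torch.nn.Linear` Module and reproduces its output by hand. The layer sizes and the seed are arbitrary choices for illustration.

```python
import torch

torch.manual_seed(0)  # arbitrary seed, only for reproducibility

layer = torch.nn.Linear(4, 3)   # 4 input features, 3 output features
x = torch.randn(2, 4)           # batch of 2 samples

# The Module stores its weight with shape (out_features, in_features) plus a bias
print(layer.weight.shape, layer.bias.shape)

# Calling the Module computes x @ W^T + b
manual = x.mm(layer.weight.t()) + layer.bias
print(torch.allclose(layer(x), manual, atol=1e-6))
```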

Using the **nn** package to implement the same two-layer neural network:

In [9]:
# init
N, D_in, H, D_out = 64, 1000, 100, 10

# random Tensors to hold inputs and outputs
x = torch.randn(N, D_in)
y = torch.randn(N, D_out)

# nn.Sequential contains other Modules, and applies them in sequence to produce output
# Each Linear Module computes output from input using a linear function and holds
# internal Tensors for its weights and bias
model = torch.nn.Sequential(
    torch.nn.Linear(D_in, H),
    torch.nn.ReLU(),
    torch.nn.Linear(H, D_out)
)

# It also has several loss functions implementations;
# Here Mean Squared Error is used (MSE)
loss_fn = torch.nn.MSELoss(reduction='sum')

learning_rate = 1e-4

for t in range(500):
    
    # Forward pass: compute predicted y by passing x to the model
    # Module objects override the __call__ operator, so they can be called like functions
    y_pred = model(x)
    
    # Compute and print loss. 
    # We pass the predicted and true values of y
    # Loss Function returns a Tensor containing the loss
    loss = loss_fn(y_pred, y)
    
    if t % 100 == 99:
        print(t, loss.item())
        
    # Zero the gradients before running the backward pass
    model.zero_grad()
    
    # Backward pass: Compute the gradient of the loss w.r.t all the learnable params of model
    # Internally, the params of each Module are stored in Tensors with requires_grad=True
    # So this will compute gradients for all learnable params in the model
    loss.backward()
    
    # Update the weights using gradient descent
    with torch.no_grad():
        for param in model.parameters():
            
            param -= learning_rate * param.grad

99 2.3831310272216797
199 0.027469538152217865
299 0.0006769889732822776
399 2.754073284449987e-05
499 1.4018295360074262e-06


#### Why do we need to *zero_grad()* the gradients before backward pass?

- In PyTorch, we need to set the gradients to zero before backpropagation because PyTorch accumulates (i.e. sums) the gradients into each parameter's *.grad* on every loss.backward() call. This accumulation is convenient while training RNNs, so it is the default behavior.

- Because of this, you should zero out the gradients at the start of each training iteration so that the parameter update uses only the current gradient. Otherwise the accumulated gradient would point in a direction other than the intended one towards the minimum (or maximum, in case of maximization objectives).
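
The accumulation described in the two points above can be observed on a single scalar parameter. This is a minimal sketch with made-up values:

```python
import torch

w = torch.tensor(2.0, requires_grad=True)

(3 * w).backward()
first = w.grad.item()    # d(3w)/dw = 3.0

(3 * w).backward()       # no zeroing in between, so the new gradient is added
second = w.grad.item()   # 3.0 + 3.0 = 6.0

w.grad.zero_()           # reset the accumulated gradient
(3 * w).backward()
third = w.grad.item()    # back to 3.0

print(first, second, third)
```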

### PyTorch: optim

So far the weights of the models have been updated manually by mutating the Tensors holding learnable parameters inside *torch.no_grad()* (to avoid tracking history in autograd). This is not a huge burden for simple optimization algorithms like stochastic gradient descent, but in practice we often train neural networks using more sophisticated optimizers like AdaGrad, RMSProp, Adam, etc.

The *optim* package provides implementations of several commonly used optimization algorithms.

In [10]:
# init
N, D_in, H, D_out = 64, 1000, 100, 10

# random Tensors to hold inputs and outputs
x = torch.randn(N, D_in)
y = torch.randn(N, D_out)

# nn.Sequential contains other Modules, and applies them in sequence to produce output
# Each Linear Module computes output from input using a linear function and holds
# internal Tensors for its weights and bias
model = torch.nn.Sequential(
    torch.nn.Linear(D_in, H),
    torch.nn.ReLU(),
    torch.nn.Linear(H, D_out)
)

# It also has several loss functions implementations;
# Here Mean Squared Error is used (MSE)
loss_fn = torch.nn.MSELoss(reduction='sum')

# use the optim package to define an Optimizer that will manage the weights update.
# Adam is used here
# The first argument to the Adam constructor tells the optimizer
# which Tensors should be updated
learning_rate = 1e-4
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)

for t in range(500):
    
    # Forward Pass : Compute y from X
    y_pred = model(x)
    
    # Compute loss
    loss = loss_fn(y_pred, y)
    
    if t % 100 == 99:
        print(t, loss.item())
        
    # Before backward pass, use optimizer to zero all gradients for 
    # the variables it will update (which are the weights of the model)
    # This is because gradients are accumulated in buffers (i.e. not overwritten)
    optimizer.zero_grad()
    
    # Backward Pass: Compute gradient of loss w.r.t model params
    loss.backward()
    
    # Update the parameters using step function of optimizer
    optimizer.step()

99 46.92584991455078
199 0.6106742024421692
299 0.0011610741494223475
399 2.366290061672771e-07
499 8.524011357868844e-11
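
To relate the optimizer to the manual updates used earlier, the sketch below compares one hand-written gradient descent step with one `torch.optim.SGD` step (plain SGD without momentum, not the Adam optimizer used in the cell above). The model size, seed, and learning rate are assumptions for illustration.

```python
import copy
import torch

torch.manual_seed(0)  # arbitrary seed, only for reproducibility
x = torch.randn(8, 5)
y = torch.randn(8, 2)
lr = 1e-2

model_a = torch.nn.Linear(5, 2)
model_b = copy.deepcopy(model_a)   # identical starting weights
loss_fn = torch.nn.MSELoss(reduction='sum')

# (a) manual update, as in the nn example
loss_fn(model_a(x), y).backward()
with torch.no_grad():
    for p in model_a.parameters():
        p -= lr * p.grad

# (b) the same step expressed through the optim package
optimizer = torch.optim.SGD(model_b.parameters(), lr=lr)
optimizer.zero_grad()
loss_fn(model_b(x), y).backward()
optimizer.step()

same = all(
    torch.allclose(pa, pb, atol=1e-6)
    for pa, pb in zip(model_a.parameters(), model_b.parameters())
)
print(same)
```

Swapping `torch.optim.SGD` for another optimizer changes only the construction line; the zero_grad / backward / step pattern stays the same.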


## PyTorch: Custom nn Modules

Sometimes you may need models that are more complex than existing Modules. For such cases custom nn modules can be defined by subclassing nn.Module and defining a *forward* function using other modules or autograd operations on Tensors.

Implementing the above two-layer neural network as a custom Module subclass:

In [11]:
class TwoLayerNet(torch.nn.Module):
    
    def __init__(self, D_in, H, D_out):
        
        """
        In the constructor we instantiate two nn.Linear modules and assign them 
        as member variables.
        """
        super(TwoLayerNet, self).__init__()
        self.linear1 = torch.nn.Linear(D_in, H)
        self.linear2 = torch.nn.Linear(H, D_out)
        
    def forward(self, x):
        
        """
        Accept a Tensor of input data and return a Tensor of output data.
        We can use Modules defined in the constructor as well as arbitrary 
        operators on Tensors
        """
        h_relu = self.linear1(x).clamp(min=0)
        y_pred = self.linear2(h_relu)
        return y_pred

In [12]:
# init
N, D_in, H, D_out = 64, 1000, 100, 10

# random Tensors to hold inputs and output
x = torch.randn(N, D_in)
y = torch.randn(N, D_out)

# construct model by instantiating class defined above
model = TwoLayerNet(D_in, H, D_out)

# Loss
criterion = torch.nn.MSELoss(reduction='sum')
# optimizer
optimizer = torch.optim.SGD(model.parameters(), lr=1e-4)

for t in range(500):
    
    # Forward Pass
    y_pred = model(x)
    
    # Calc Loss
    loss = criterion(y_pred, y)
    
    if t % 100 == 99:
        print(t, loss.item())
        
    # zero grad
    optimizer.zero_grad()
    
    # backward pass
    loss.backward()
    
    # update weights
    optimizer.step()

99 3.033494234085083
199 0.04141934961080551
299 0.0009407281177118421
399 2.65647831838578e-05
499 8.788263130554697e-07
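
Assigning Modules as attributes in the constructor is what makes their parameters visible to the optimizer through `model.parameters()`. A small sketch showing this registration, using a hypothetical `TinyNet` class with arbitrary sizes:

```python
import torch

class TinyNet(torch.nn.Module):
    """Hypothetical two-layer net, same structure as TwoLayerNet but smaller."""

    def __init__(self, d_in, h, d_out):
        super().__init__()
        self.linear1 = torch.nn.Linear(d_in, h)
        self.linear2 = torch.nn.Linear(h, d_out)

    def forward(self, x):
        return self.linear2(self.linear1(x).clamp(min=0))

model = TinyNet(4, 3, 2)

# Parameters are named after the attribute they were assigned to
names = [name for name, _ in model.named_parameters()]
n_params = sum(p.numel() for p in model.parameters())

print(names)
print(n_params)  # (4*3 + 3) + (3*2 + 2) = 23
```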


## PyTorch: Control Flow + Weight Sharing

An example of dynamic graphs and weight sharing: a fully connected ReLU network that on each forward pass chooses a random number between 1 and 4 and uses that many hidden layers, reusing the same weights multiple times to compute the innermost hidden layers.

In [15]:
import random

class DynamicNet(torch.nn.Module):
    
    def __init__(self, D_in, H, D_out):
    
        """
        Defining three nn.Linear instances that will be used in the forward pass
        """
        super(DynamicNet, self).__init__()
        self.input_linear = torch.nn.Linear(D_in, H)
        self.middle_linear = torch.nn.Linear(H, H)
        self.output_linear = torch.nn.Linear(H, D_out)
        
    def forward(self, x):
        
        """
        For the forward pass we randomly choose 0, 1, 2, or 3 and reuse the
        middle_linear Module that many times to compute hidden layer representations.

        Since each forward pass builds a dynamic computation graph, we can use
        normal Python control-flow operators like loops or conditional statements
        when defining forward pass of the model.
        
        It is perfectly safe to reuse the same Module many times when defining a 
        computational graph.
        """
        h_relu = self.input_linear(x).clamp(min=0)
        for _ in range(random.randint(0, 3)):
            h_relu = self.middle_linear(h_relu).clamp(min=0)
        y_pred = self.output_linear(h_relu)
        return y_pred

In [17]:
# init
N, D_in, H, D_out = 64, 1000, 100, 10

# inputs and outputs
x = torch.randn(N, D_in)
y = torch.randn(N, D_out)

# instantiate the model defined
model = DynamicNet(D_in, H, D_out)

# loss function and optimizer
criterion = torch.nn.MSELoss(reduction='sum')
optimizer = torch.optim.SGD(model.parameters(), lr=1e-4, momentum=0.9)

for t in range(500):
    
    # Forward Pass
    y_pred = model(x)
    
    # loss
    loss = criterion(y_pred, y)
    if t % 100 == 99:
        print(t, loss.item())
        
    # zero gradients
    optimizer.zero_grad()
    
    # Backward pass and weight update
    loss.backward()
    optimizer.step()

99 96.77654266357422
199 7.2930755615234375
299 0.9710150957107544
399 0.5101909637451172
499 0.23329325020313263
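
When a Module is reused within one forward pass, autograd sums the gradient contributions from every use into the same parameter. A minimal sketch with a single shared 1x1 weight (the values are chosen so the result is easy to verify by hand):

```python
import torch

shared = torch.nn.Linear(1, 1, bias=False)
with torch.no_grad():
    shared.weight.fill_(3.0)   # set w = 3 so the arithmetic is easy to follow

x = torch.tensor([[2.0]])

# Apply the same Module twice: out = w * (w * x) = w^2 * x = 18
out = shared(shared(x))
out.sum().backward()

# d/dw (w^2 * x) = 2 * w * x = 12, the sum of the contributions from both uses
print(out.item(), shared.weight.grad)
```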
