## Learning PyTorch with Examples

Fundamental concepts of PyTorch through self-contained examples.

[Link to tutorial](https://pytorch.org/tutorials/beginner/pytorch_with_examples.html)

### Warm-up: numpy

Numpy provides an n-dimensional array object and many functions for manipulating these arrays. Numpy knows nothing about computation graphs or gradients, but it is still easy to implement a two-layer neural network by manually computing the forward and backward passes through the network using numpy operations.

In [1]:
import numpy as np

In [2]:
# N = batch_size; D_in = input dimensions
# H = hidden dimensions; D_out = output dimensions

N, D_in, H, D_out = 64, 1000, 100, 10

# create random input and output data
x = np.random.randn(N, D_in)
y = np.random.randn(N, D_out)

# Randomly init weights
w1 = np.random.randn(D_in, H)
w2 = np.random.randn(H, D_out)

learning_rate = 1e-6

for t in range(500):
    
    # Forward Pass: compute predicted y
    h = x.dot(w1)
    h_relu = np.maximum(h, 0)
    y_pred = h_relu.dot(w2)
    
    # Compute and print loss
    loss = np.square(y_pred - y).sum()
    if t % 100 == 99:
        print(t, loss)
    
    # Backprop to compute gradients of the loss w.r.t. w1 and w2
    grad_y_pred = 2.0 * (y_pred - y)
    grad_w2 = h_relu.T.dot(grad_y_pred)
    grad_h_relu = grad_y_pred.dot(w2.T)
    grad_h = grad_h_relu.copy()
    grad_h[h < 0] = 0
    grad_w1 = x.T.dot(grad_h)
    
    # Update weights
    w1 -= learning_rate * grad_w1
    w2 -= learning_rate * grad_w2

99 893.4391539095168
199 4.984153199775717
299 0.048670892078728725
399 0.0005884561088501409
499 7.912235315402832e-06
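
The manual backward pass above can be sanity-checked against numerical derivatives. The following is a minimal sketch, not part of the original tutorial: it reuses the same forward and backward formulas on a much smaller network (the sizes, the seed, and the helper names `loss_fn` and `numeric_grad` are assumptions chosen for illustration) and compares the analytic gradients with central finite differences.

```python
import numpy as np

# Small, hypothetical problem sizes so the finite-difference loop stays cheap
rng = np.random.default_rng(0)
N, D_in, H, D_out = 4, 5, 3, 2
x = rng.standard_normal((N, D_in))
y = rng.standard_normal((N, D_out))
w1 = rng.standard_normal((D_in, H))
w2 = rng.standard_normal((H, D_out))

def loss_fn():
    # Same forward pass as above: linear -> ReLU -> linear -> squared error
    h_relu = np.maximum(x.dot(w1), 0)
    return np.square(h_relu.dot(w2) - y).sum()

# Analytic gradients, using the same formulas as the training loop above
h = x.dot(w1)
h_relu = np.maximum(h, 0)
y_pred = h_relu.dot(w2)
grad_y_pred = 2.0 * (y_pred - y)
grad_w2 = h_relu.T.dot(grad_y_pred)
grad_h = grad_y_pred.dot(w2.T)
grad_h[h < 0] = 0
grad_w1 = x.T.dot(grad_h)

def numeric_grad(w, eps=1e-6):
    # Central differences: perturb one entry of w at a time, in place
    g = np.zeros_like(w)
    for idx in np.ndindex(*w.shape):
        old = w[idx]
        w[idx] = old + eps
        loss_plus = loss_fn()
        w[idx] = old - eps
        loss_minus = loss_fn()
        w[idx] = old  # restore the original value
        g[idx] = (loss_plus - loss_minus) / (2 * eps)
    return g

err_w1 = np.abs(numeric_grad(w1) - grad_w1).max()
err_w2 = np.abs(numeric_grad(w2) - grad_w2).max()
print("max abs error w1:", err_w1)
print("max abs error w2:", err_w2)
```

Both errors should be tiny compared with the size of the gradients. A large mismatch would point to a mistake in one of the hand-written backward formulas.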


### PyTorch: Tensors

A PyTorch Tensor is conceptually identical to a numpy array, but behind the scenes Tensors can also keep track of the computational graph and gradients. They can also run on a GPU to accelerate numeric computations.
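
As a small illustration of the GPU point, the sketch below picks a CUDA device when one is available and falls back to the CPU otherwise. The tensor names and shapes are arbitrary assumptions. The same code runs unchanged on either device.

```python
import torch

# Use the GPU if PyTorch can see one, otherwise stay on the CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

a = torch.randn(3, 4, device=device)
b = torch.randn(4, 2, device=device)
c = a.mm(b)  # matrix multiply runs on whichever device holds a and b

# Tensors must be moved back to the CPU before converting to numpy
c_np = c.cpu().numpy()
print(c.device, c_np.shape)
```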

The code below runs the same computation as the numpy example using PyTorch Tensors.

In [3]:
import torch

In [4]:
dtype = torch.float
device = torch.device("cpu")

In [5]:
# init
N, D_in, H, D_out = 64, 1000, 100, 10

# random input and output data
x = torch.randn(N, D_in, device=device, dtype=dtype)
y = torch.randn(N, D_out, device=device, dtype=dtype)

# random init weights
w1 = torch.randn(D_in, H, device=device, dtype=dtype)
w2 = torch.randn(H, D_out, device=device, dtype=dtype)

learning_rate = 1e-6

for t in range(500):
    
    # Forward pass
    h = x.mm(w1)
    h_relu = h.clamp(min=0)
    y_pred = h_relu.mm(w2)
    
    # Compute and print loss
    loss = (y_pred - y).pow(2).sum().item()
    if t % 100 == 99:
        print(t, loss)
        
    # Backpropagation
    grad_y_pred = 2.0 * (y_pred - y)
    grad_w2 = h_relu.t().mm(grad_y_pred)
    grad_h_relu = grad_y_pred.mm(w2.t())
    grad_h = grad_h_relu.clone()
    grad_h[h < 0] = 0
    grad_w1 = x.t().mm(grad_h)
    
    # Update weights using gradient descent
    w1 -= learning_rate * grad_w1
    w2 -= learning_rate * grad_w2

99 302.541015625
199 0.6131243109703064
299 0.002454133238643408
399 9.12375544430688e-05
499 2.1819892936036922e-05


## Autograd

Autograd provides automatic differentiation, which automates the computation of backward passes in neural networks.

When using autograd, the forward pass builds a computational graph whose nodes are Tensors and whose edges are functions that produce output Tensors from input Tensors. Backpropagating through this graph then makes it easy to compute gradients.

Example: if *x* is a Tensor with *x.requires_grad=True*, then after a backward pass *x.grad* is another Tensor holding the gradient of some scalar value (typically the loss) w.r.t. *x*.
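
A minimal sketch of this behaviour, using an arbitrary three-element Tensor and the scalar s = sum(x²), whose gradient w.r.t. x is 2x:

```python
import torch

x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)

# Forward pass: build a scalar from x; autograd records the graph
s = (x ** 2).sum()

# Backward pass: populates x.grad with ds/dx = 2 * x
s.backward()

print(x.grad)  # tensor([2., 4., 6.])
```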

The code below implements the same network, using autograd to automate the backward pass.

In [6]:
dtype = torch.float
device = torch.device("cpu")

# init
N, D_in, H, D_out = 64, 1000, 100, 10

# random input and output data
# requires_grad defaults to False; we do not need gradients w.r.t. the data
x = torch.randn(N, D_in, device=device, dtype=dtype)
y = torch.randn(N, D_out, device=device, dtype=dtype)

# random tensors for weights
# set requires_grad=True since we need to compute gradients for this during backward pass
w1 = torch.randn(D_in, H, device=device, dtype=dtype, requires_grad=True)
w2 = torch.randn(H, D_out, device=device, dtype=dtype, requires_grad=True)

learning_rate = 1e-6

for t in range(500):
    
    # Forward pass: Compute predicted y
    # Exactly the same as above
    # But we no longer need to keep references to intermediate values since we 
    # are not implementing backward pass manually
    y_pred = x.mm(w1).clamp(min=0).mm(w2)
    
    # Compute and print loss using operations on Tensors
    # loss is a 0-dimensional Tensor (shape ())
    # loss.item() gets the scalar value
    loss = (y_pred - y).pow(2).sum()
    if t % 100 == 99:
        print(t, loss.item())
        
    # using autograd to compute the backward pass
    # This will compute the gradient of loss w.r.t all Tensors with requires_grad=True
    # After this call w1.grad and w2.grad will be Tensors holding the gradient of the loss
    # w.r.t w1 and w2 respectively
    loss.backward()
    
    # Manually update weights using gradient descent
    # Need to do this within torch.no_grad() as weights have requires_grad=True
    # but we do not need to track this in autograd
    with torch.no_grad():
        w1 -= learning_rate * w1.grad
        w2 -= learning_rate * w2.grad
        
        # Manually zero the gradients after updating weights
        w1.grad.zero_()
        w2.grad.zero_()

99 511.6468505859375
199 2.5115370750427246
299 0.021443337202072144
399 0.00042859985842369497
499 5.5764678108971566e-05


## Defining new autograd functions

Each autograd operator is really just two functions that operate on Tensors.

The **forward** function computes output Tensors from input Tensors. The **backward** function receives the gradient of some scalar value w.r.t. the output Tensors and computes the gradient of that same scalar value w.r.t. the input Tensors.

In PyTorch, we can define our own autograd operator by defining a subclass of *torch.autograd.Function* and implementing the forward and backward functions as static methods. We can then use the new operator by calling its *apply* method like a function, passing Tensors containing input data.
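
Before the ReLU version, here is a deliberately simple sketch of the same pattern. The `Cube` operator is a hypothetical example, not part of the original tutorial: it computes x³ in the forward pass and 3x² times the incoming gradient in the backward pass. `torch.autograd.gradcheck` then compares the hand-written backward against numerical derivatives, which needs double-precision inputs.

```python
import torch

class Cube(torch.autograd.Function):
    """Hypothetical custom operator: y = x ** 3."""

    @staticmethod
    def forward(ctx, inp):
        # Save the input Tensor; it is needed to compute the gradient
        ctx.save_for_backward(inp)
        return inp ** 3

    @staticmethod
    def backward(ctx, grad_output):
        (inp,) = ctx.saved_tensors
        # Chain rule: dL/dx = dL/dy * dy/dx = grad_output * 3 * x^2
        return grad_output * 3 * inp ** 2

# Use the operator through its apply method
x = torch.tensor([1.0, 2.0, -3.0], requires_grad=True)
Cube.apply(x).sum().backward()
print(x.grad)  # tensor([ 3., 12., 27.])

# Numerically verify backward() on a separate double-precision input
x_check = torch.randn(5, dtype=torch.double, requires_grad=True)
ok = torch.autograd.gradcheck(Cube.apply, (x_check,), eps=1e-6, atol=1e-4)
print(ok)
```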

The code below implements the same two-layer network with a custom autograd function for the ReLU non-linearity.

In [7]:
class myReLu(torch.autograd.Function):
    
    """
    Implementing custom autograd Functions by subclassing torch.autograd.Function
    and implementing forward and backward passes which operate on Tensors.
    """
    
    @staticmethod
    def forward(ctx, input):
        """
        In forward pass we receive a Tensor containing the input and return a Tensor
        containing the output. 
        ctx is a context object that can be used to stash information for backward
        computation. 
        Tensors can be cached for use in the backward pass using the
        ctx.save_for_backward method.
        """
        ctx.save_for_backward(input)
        return input.clamp(min=0)
    
    @staticmethod
    def backward(ctx, grad_output):
        """
        In the backward pass we receive a Tensor containing the gradient of the loss
        w.r.t the output and we compute the gradient of the loss w.r.t the input.
        """
        input, = ctx.saved_tensors
        grad_input = grad_output.clone()
        grad_input[input < 0] = 0
        return grad_input

In [8]:
dtype = torch.float
device = torch.device("cpu")

# init
N, D_in, H, D_out = 64, 1000, 100, 10

# random input and output data
x = torch.randn(N, D_in, device=device, dtype=dtype)
y = torch.randn(N, D_out, device=device, dtype=dtype)

# random init weights
w1 = torch.randn(D_in, H, device=device, dtype=dtype, requires_grad=True)
w2 = torch.randn(H, D_out, device=device, dtype=dtype, requires_grad=True)

learning_rate = 1e-6

for t in range(500):
    
    # To apply function we use Function.apply method
    relu = myReLu.apply
    
    # Forward pass: compute predicted y using operations;
    # Computing ReLU using custom autograd operation.
    y_pred = relu(x.mm(w1)).mm(w2)
    
    loss = (y_pred - y).pow(2).sum()
    if t % 100 == 99:
        print(t, loss.item())
        
    # using autograd to compute the backward pass
    loss.backward()
    
    # Update weights using gradient descent
    with torch.no_grad():
        w1 -= learning_rate * w1.grad
        w2 -= learning_rate * w2.grad
        
        # Zero the gradients
        w1.grad.zero_()
        w2.grad.zero_()

99 427.4384460449219
199 1.1275361776351929
299 0.0049002934247255325
399 0.00013411974941845983
499 2.937408135039732e-05


## *nn* module

### PyTorch: nn

Computational graphs and autograd are very powerful for defining complex operators, but for large neural networks raw autograd can be a bit too low-level.

We frequently think of neural networks as arranging computation into layers, some of which have learnable parameters which will be optimized during training.

The **nn** package defines a set of **Modules**, which are roughly equivalent to neural network layers. A Module receives input Tensors and computes output Tensors, and may also hold internal state such as Tensors containing learnable parameters. The **nn** package also defines a set of commonly used loss functions.
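
To make "holds learnable parameters" concrete, the sketch below inspects a single `torch.nn.Linear` layer. The layer sizes and variable names are arbitrary assumptions.

```python
import torch

# A Linear Module mapping 4 input features to 3 output features
layer = torch.nn.Linear(4, 3)

# The Module owns its weight and bias; both are tracked by autograd
for name, param in layer.named_parameters():
    print(name, tuple(param.shape), param.requires_grad)
# weight (3, 4) True
# bias (3,) True

# Calling the Module applies it to a batch of inputs
out = layer(torch.randn(2, 4))
print(tuple(out.shape))  # (2, 3)
```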

The code below uses the **nn** package to implement the same two-layer neural network.

In [10]:
# init
N, D_in, H, D_out = 64, 1000, 100, 10

# random Tensors to hold inputs and outputs
x = torch.randn(N, D_in)
y = torch.randn(N, D_out)

# nn.Sequential contains other Modules, and applies them in sequence to produce output
# Each Linear Module computes output from input using a linear function and holds
# internal Tensors for its weights and bias
model = torch.nn.Sequential(
    torch.nn.Linear(D_in, H),
    torch.nn.ReLU(),
    torch.nn.Linear(H, D_out)
)

# It also has several loss functions implementations;
# Here Mean Squared Error is used (MSE)
loss_fn = torch.nn.MSELoss(reduction='sum')

learning_rate = 1e-4

for t in range(500):
    
    # Forward pass: Get y from X 
    # Module objects override the __call__ operator so they can be called like functions
    y_pred = model(x)
    
    # Compute and print loss. 
    # We pass the predicted and true values of y
    # Loss Function returns a Tensor containing the loss
    loss = loss_fn(y_pred, y)
    
    if t % 100 == 99:
        print(t, loss.item())
        
    # Zero the gradients before running the backward pass
    model.zero_grad()
    
    # Backward pass: Compute the gradient of the loss w.r.t all the learnable params of model
    # Internally, the params of each Module are stored in Tensors with requires_grad=True
    # So this will compute gradients for all learnable params in the model
    loss.backward()
    
    # Update the weights using gradient descent
    with torch.no_grad():
        for param in model.parameters():
            
            param -= learning_rate * param.grad

99 2.4043242931365967
199 0.05005412548780441
299 0.00284843728877604
399 0.0002749993873294443
499 3.715698767337017e-05


#### Why do we need to *zero_grad()* the gradients before backward pass?

- In PyTorch, we need to set the gradients to zero before starting backpropagation because PyTorch accumulates gradients across subsequent backward passes. This is convenient when training RNNs. The default behavior is therefore to accumulate (i.e. sum) the gradients on every loss.backward() call.

- Because of this, each iteration of the training loop should zero out the gradients before the backward pass so that the parameter update is correct. Otherwise the gradient would point in some direction other than the intended one towards the minimum (or maximum, for maximization objectives).
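
The accumulation behaviour described above can be seen on a single scalar parameter. This is a minimal sketch with an arbitrary value and the function 3·w, whose gradient w.r.t. w is always 3:

```python
import torch

w = torch.tensor(2.0, requires_grad=True)

(3 * w).backward()
first = w.grad.item()   # 3.0

# A second backward pass adds to the existing gradient instead of replacing it
(3 * w).backward()
second = w.grad.item()  # 6.0

# Zeroing the gradient restores the expected value on the next pass
w.grad.zero_()
(3 * w).backward()
third = w.grad.item()   # 3.0

print(first, second, third)
```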

### PyTorch: optim

So far the weights of the models have been updated manually by mutating the Tensors holding learnable parameters inside *torch.no_grad()* (to avoid tracking history in autograd). This is not a huge burden for simple optimization algorithms like stochastic gradient descent, but in practice we often train neural networks using more sophisticated optimizers like AdaGrad, RMSProp, Adam, etc.

The *optim* package provides implementations of several commonly used optimization algorithms.
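
To connect the manual updates used so far with the *optim* package, the sketch below checks that one step of `torch.optim.SGD` (without momentum) matches the hand-written update `w -= learning_rate * w.grad`. The toy loss, the seed, and the variable names are assumptions made for illustration.

```python
import torch

torch.manual_seed(0)

# Two independent copies of the same starting weights
w_manual = torch.randn(3, requires_grad=True)
w_optim = w_manual.detach().clone().requires_grad_(True)

lr = 0.1
optimizer = torch.optim.SGD([w_optim], lr=lr)

def toy_loss(w):
    # Hypothetical loss used only for this comparison
    return (w ** 2).sum()

# Manual gradient descent step, as in the earlier examples
toy_loss(w_manual).backward()
with torch.no_grad():
    w_manual -= lr * w_manual.grad

# The same step performed by the optimizer
optimizer.zero_grad()
toy_loss(w_optim).backward()
optimizer.step()

print(torch.allclose(w_manual, w_optim))  # True
```

Swapping in a different optimizer such as Adam only changes the constructor line, while the `zero_grad()`, `backward()`, `step()` sequence stays the same.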

In [11]:
# init
N, D_in, H, D_out = 64, 1000, 100, 10

# random Tensors to hold inputs and outputs
x = torch.randn(N, D_in)
y = torch.randn(N, D_out)

# nn.Sequential contains other Modules, and applies them in sequence to produce output
# Each Linear Module computes output from input using a linear function and holds
# internal Tensors for its weights and bias
model = torch.nn.Sequential(
    torch.nn.Linear(D_in, H),
    torch.nn.ReLU(),
    torch.nn.Linear(H, D_out)
)

# It also has several loss functions implementations;
# Here Mean Squared Error is used (MSE)
loss_fn = torch.nn.MSELoss(reduction='sum')

# use the optim package to define an Optimizer that will manage the weights update.
# Adam is used here
# The first argument to the Adam constructor tells the optimizer
# which Tensors should be updated
learning_rate = 1e-4
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)

for t in range(500):
    
    # Forward Pass : Compute y from X
    y_pred = model(x)
    
    # Compute loss
    loss = loss_fn(y_pred, y)
    
    if t % 100 == 99:
        print(t, loss.item())
        
    # Before backward pass, use optimizer to zero all gradients for 
    # the variables it will update (which are the weights of the model)
    # This is because gradients are accumulated in buffers (i.e. not overwritten)
    optimizer.zero_grad()
    
    # Backward Pass: Compute gradient of loss w.r.t model params
    loss.backward()
    
    # Update the parameters using step function of optimizer
    optimizer.step()

99 47.2345085144043
199 0.5374274849891663
299 0.0024964988697320223
399 3.3603600968490355e-06
499 1.3696800360563088e-09
