## Learning PyTorch with Examples

Fundamental concepts of PyTorch through self-contained examples.

[Link to tutorial](https://pytorch.org/tutorials/beginner/pytorch_with_examples.html)

### Warm-up: NumPy

NumPy provides an n-dimensional array object and many functions for manipulating such arrays. It knows nothing about computation graphs or gradients, but it is still easy to implement a two-layer neural network by manually coding the forward and backward passes with NumPy operations.

In [1]:
import numpy as np

In [3]:
# N = batch_size; D_in = input dimensions
# H = hidden dimensions; D_out = output dimensions

N, D_in, H, D_out = 64, 1000, 100, 10

# create random input and output data
x = np.random.randn(N, D_in)
y = np.random.randn(N, D_out)

# Randomly init weights
w1 = np.random.randn(D_in, H)
w2 = np.random.randn(H, D_out)

learning_rate = 1e-6

for t in range(500):
    
    # Forward Pass: compute predicted y
    h = x.dot(w1)
    h_relu = np.maximum(h, 0)
    y_pred = h_relu.dot(w2)
    
    # Compute and print loss
    loss = np.square(y_pred - y).sum()
    if t % 100 == 99:
        print(t, loss)
    
    # Backprop to compute gradients of loss w.r.t. w1 and w2
    grad_y_pred = 2.0 * (y_pred - y)
    grad_w2 = h_relu.T.dot(grad_y_pred)
    grad_h_relu = grad_y_pred.dot(w2.T)
    grad_h = grad_h_relu.copy()
    grad_h[h < 0] = 0
    grad_w1 = x.T.dot(grad_h)
    
    # Update weights
    w1 -= learning_rate * grad_w1
    w2 -= learning_rate * grad_w2

99 251.20427851697787
199 0.45502238791405125
299 0.0017380299155694192
399 8.466744213910492e-06
499 4.3970570705584016e-08
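
A quick way to gain confidence in hand-written backward formulas like the ones above is to compare them against central finite differences. The following is a minimal sketch, not part of the original tutorial: the tiny layer sizes, the seed, and the helper names `loss_fn` and `numerical_grad` are assumptions chosen to keep the check cheap.

```python
import numpy as np

rng = np.random.default_rng(0)

# Small hypothetical sizes so the finite-difference loop stays cheap
N, D_in, H, D_out = 4, 5, 3, 2
x = rng.standard_normal((N, D_in))
y = rng.standard_normal((N, D_out))
w1 = rng.standard_normal((D_in, H))
w2 = rng.standard_normal((H, D_out))


def loss_fn():
    """Sum-of-squares loss of the two-layer network for the current w1, w2."""
    h_relu = np.maximum(x.dot(w1), 0)
    return np.square(h_relu.dot(w2) - y).sum()


# Analytic gradients, using the same formulas as the training loop
h = x.dot(w1)
h_relu = np.maximum(h, 0)
y_pred = h_relu.dot(w2)
grad_y_pred = 2.0 * (y_pred - y)
grad_w2 = h_relu.T.dot(grad_y_pred)
grad_h = grad_y_pred.dot(w2.T)
grad_h[h < 0] = 0
grad_w1 = x.T.dot(grad_h)


def numerical_grad(w, eps=1e-6):
    """Central finite-difference estimate of d(loss)/dw, perturbing w in place."""
    grad = np.zeros_like(w)
    for idx in np.ndindex(*w.shape):
        old = w[idx]
        w[idx] = old + eps
        f_plus = loss_fn()
        w[idx] = old - eps
        f_minus = loss_fn()
        w[idx] = old  # restore the original entry
        grad[idx] = (f_plus - f_minus) / (2 * eps)
    return grad


err_w1 = np.max(np.abs(numerical_grad(w1) - grad_w1))
err_w2 = np.max(np.abs(numerical_grad(w2) - grad_w2))
print(err_w1, err_w2)  # both should be very small
```

The check can fail spuriously if a hidden activation happens to lie extremely close to zero, because ReLU is not differentiable there.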


### PyTorch: Tensors

A PyTorch Tensor is conceptually identical to a NumPy array, but Tensors can also keep track of a computational graph and gradients, and they can run on a GPU to accelerate numeric computations.

The same computation as in the NumPy example, this time with PyTorch Tensors:

In [4]:
import torch

In [5]:
dtype = torch.float
device = torch.device("cpu")

In [8]:
# init
N, D_in, H, D_out = 64, 1000, 100, 10

# random input and output data
x = torch.randn(N, D_in, device=device, dtype=dtype)
y = torch.randn(N, D_out, device=device, dtype=dtype)

# random init weights
w1 = torch.randn(D_in, H, device=device, dtype=dtype)
w2 = torch.randn(H, D_out, device=device, dtype=dtype)

learning_rate = 1e-6

for t in range(500):
    
    # Forward pass
    h = x.mm(w1)
    h_relu = h.clamp(min=0)
    y_pred = h_relu.mm(w2)
    
    # Compute and print loss
    loss = (y_pred - y).pow(2).sum().item()
    if t % 100 == 99:
        print(t, loss)
        
    # Backpropagation
    grad_y_pred = 2.0 * (y_pred - y)
    grad_w2 = h_relu.t().mm(grad_y_pred)
    grad_h_relu = grad_y_pred.mm(w2.t())
    grad_h = grad_h_relu.clone()
    grad_h[h < 0] = 0
    grad_w1 = x.t().mm(grad_h)
    
    # Update weights using gradient descent
    w1 -= learning_rate * grad_w1
    w2 -= learning_rate * grad_w2

99 496.5876159667969
199 2.450652599334717
299 0.015336411073803902
399 0.0002590751100797206
499 3.859804928652011e-05
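
The cells above pin everything to the CPU. As a small sketch of the GPU point and of the similarity to NumPy arrays, the snippet below picks a device at run time and moves data between NumPy and PyTorch. The variable names are illustrative only, and the fallback to the CPU is an assumption about how one might want to handle machines without CUDA.

```python
import numpy as np
import torch

# Use a GPU when one is available, otherwise fall back to the CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

a_np = np.arange(6, dtype=np.float32).reshape(2, 3)

# from_numpy shares memory with a_np while the Tensor stays on the CPU;
# moving it to a GPU makes a copy
a = torch.from_numpy(a_np).to(device)

# Tensor counterpart of a_np.dot(a_np.T)
b = a.mm(a.t())

# Bring the result back to the CPU before converting to NumPy
b_np = b.cpu().numpy()
print(device, b_np)
```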


## Autograd

Autograd provides automatic differentiation, which automates the computation of backward passes in neural networks.

When using autograd, the forward pass builds a computational graph in which the nodes are Tensors and the edges are functions that produce output Tensors from input Tensors. Backpropagating through this graph then makes it easy to compute gradients.

Example: If *x* is a tensor with *x.requires_grad=True*, then after a backward pass *x.grad* is another tensor holding the gradient of some scalar value (typically the loss) w.r.t. *x*.
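
A minimal illustration of this, using a made-up scalar function `s = sum(x**2)` whose gradient `2 * x` is easy to verify by hand:

```python
import torch

x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)

# Scalar value built from x: 1 + 4 + 9
s = (x ** 2).sum()

# Populates x.grad with ds/dx = 2 * x
s.backward()
print(x.grad)
```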

The same network, now using autograd to automate the backward pass:

In [22]:
dtype = torch.float
device = torch.device("cpu")

# init
N, D_in, H, D_out = 64, 1000, 100, 10

# random input and output data
# requires_grad defaults to False, which is what we want here since
# we do not need gradients w.r.t. the data
x = torch.randn(N, D_in, device=device, dtype=dtype)
y = torch.randn(N, D_out, device=device, dtype=dtype)

# random tensors for weights
# set requires_grad=True since we need to compute gradients for this during backward pass
w1 = torch.randn(D_in, H, device=device, dtype=dtype, requires_grad=True)
w2 = torch.randn(H, D_out, device=device, dtype=dtype, requires_grad=True)

learning_rate = 1e-6

for t in range(500):
    
    # Forward pass: Compute predicted y
    # Exactly the same as above
    # But we no longer need to keep references to intermediate values since we 
    # are not implementing backward pass manually
    y_pred = x.mm(w1).clamp(min=0).mm(w2)
    
    # Compute and print loss using operations on Tensors
    # loss is a zero-dimensional Tensor (shape ())
    # loss.item() returns its value as a Python number
    loss = (y_pred - y).pow(2).sum()
    if t % 100 == 99:
        print(t, loss.item())
        
    # using autograd to compute the backward pass
    # This will compute the gradient of loss w.r.t all Tensors with requires_grad=True
    # After this call w1.grad and w2.grad will be Tensors holding the gradient of the loss
    # w.r.t w1 and w2 respectively
    loss.backward()
    
    # Manually update weights using gradient descent
    # Need to do this within torch.no_grad() as weights have requires_grad=True
    # but we do not need to track this in autograd
    with torch.no_grad():
        w1 -= learning_rate * w1.grad
        w2 -= learning_rate * w2.grad
        
        # Manually zero the gradients after updating weights
        w1.grad.zero_()
        w2.grad.zero_()

99 622.78955078125
199 3.378997325897217
299 0.02787669003009796
399 0.0005281756748445332
499 6.523747288156301e-05
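
The explicit `zero_()` calls in the loop are needed because `backward()` adds to an existing `.grad` instead of overwriting it. A small sketch with a toy weight vector `w` (not taken from the network above) makes the accumulation visible:

```python
import torch

w = torch.tensor([1.0, 2.0], requires_grad=True)

# First backward pass: d/dw sum(3 * w) = 3 for every element
(3 * w).sum().backward()
first = w.grad.clone()

# Second backward pass without zeroing: the new gradient is added on top
(3 * w).sum().backward()
second = w.grad.clone()

# Resetting in place, as done at the end of each training step
w.grad.zero_()
third = w.grad.clone()

print(first, second, third)
```

Without the reset, every update step in the training loop would use the sum of all gradients computed so far.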


## Defining new autograd functions

Each autograd operator is really just two functions that operate on Tensors.

The **forward** function computes output Tensors from input Tensors. The **backward** function receives the gradient of some scalar value w.r.t. the output Tensors and computes the gradient of that same scalar value w.r.t. the input Tensors.

In PyTorch, we can define our own autograd operator by subclassing *torch.autograd.Function* and implementing the forward and backward functions as static methods. We can then use the new operator by calling its *apply* method like a function, passing Tensors containing input data.

The same two-layer network, now with a custom autograd function implementing the ReLU non-linearity:

In [23]:
class myReLu(torch.autograd.Function):
    
    """
    Implementing custom autograd Functions by subclassing torch.autograd.Function
    and implementing forward and backward passes which operate on Tensors.
    """
    
    @staticmethod
    def forward(ctx, input):
        """
        In forward pass we receive a Tensor containing the input and return a Tensor
        containing the output. 
        ctx is a context object that can be used to stash information for backward
        computation. 
        Tensors can be cached for use in the backward pass using the
        ctx.save_for_backward method.
        """
        ctx.save_for_backward(input)
        return input.clamp(min=0)
    
    @staticmethod
    def backward(ctx, grad_output):
        """
        In the backward pass we receive a Tensor containing the gradient of the loss
        w.r.t the output and we compute the gradient of the loss w.r.t the input.
        """
        input, = ctx.saved_tensors
        grad_input = grad_output.clone()
        grad_input[input < 0] = 0
        return grad_input

In [26]:
dtype = torch.float
device = torch.device("cpu")

# init
N, D_in, H, D_out = 64, 1000, 100, 10

# random input and output data
x = torch.randn(N, D_in, device=device, dtype=dtype)
y = torch.randn(N, D_out, device=device, dtype=dtype)

# random init weights
w1 = torch.randn(D_in, H, device=device, dtype=dtype, requires_grad=True)
w2 = torch.randn(H, D_out, device=device, dtype=dtype, requires_grad=True)

learning_rate = 1e-6

for t in range(500):
    
    # To apply function we use Function.apply method
    relu = myReLu.apply
    
    # Forward pass: compute predicted y using operations;
    # Computing ReLU using custom autograd operation.
    y_pred = relu(x.mm(w1)).mm(w2)
    
    loss = (y_pred - y).pow(2).sum()
    if t % 100 == 99:
        print(t, loss.item())
        
    # using autograd to compute the backward pass
    loss.backward()
    
    # Update weights using gradient descent
    with torch.no_grad():
        w1 -= learning_rate * w1.grad
        w2 -= learning_rate * w2.grad
        
        # Zero the gradients
        w1.grad.zero_()
        w2.grad.zero_()

99 736.150146484375
199 5.508229732513428
299 0.07747425138950348
399 0.001725903246551752
499 0.0001641077979002148
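
One way to sanity-check a custom backward implementation is `torch.autograd.gradcheck`, which compares the analytic gradient against finite differences. The sketch below restates the `myReLu` class so that it runs on its own. The double-precision inputs and the hand-picked values, kept away from the kink at zero, are assumptions made so that the numerical comparison is well behaved.

```python
import torch


class myReLu(torch.autograd.Function):
    """Same custom ReLU as above, repeated to keep this snippet self-contained."""

    @staticmethod
    def forward(ctx, input):
        ctx.save_for_backward(input)
        return input.clamp(min=0)

    @staticmethod
    def backward(ctx, grad_output):
        input, = ctx.saved_tensors
        grad_input = grad_output.clone()
        grad_input[input < 0] = 0
        return grad_input


# gradcheck expects double precision; values avoid 0, where ReLU has no derivative
inp = torch.tensor([-2.0, -0.5, 0.5, 3.0], dtype=torch.double, requires_grad=True)

# Returns True if analytic and numerical gradients agree, raises otherwise
ok = torch.autograd.gradcheck(myReLu.apply, (inp,), eps=1e-6, atol=1e-4)
print(ok)
```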


## *nn* module

### PyTorch: nn