## Autograd

### PyTorch: Variables and autograd

In the above examples, we had to manually implement both the forward and backward passes of our neural network. Manually implementing the backward pass is not a big deal for a small two-layer network, but can quickly get very hairy for large complex networks.

Thankfully, we can use [automatic differentiation](https://en.wikipedia.org/wiki/Automatic_differentiation) to automate the computation of backward passes in neural networks. The `autograd` package in PyTorch provides exactly this functionality. When using `autograd`, the forward pass of your network will define a ***computational graph***; nodes in the graph will be *Tensors*, and edges will be *functions* that produce output Tensors from input Tensors. Backpropagating through this graph then allows you to easily compute gradients.

This sounds complicated, but it’s pretty simple to use in practice. We wrap our PyTorch Tensors in `Variable` objects; a `Variable` represents a node in a *computational graph*. If `x` is a `Variable` then `x.data` is a Tensor, and `x.grad` is another Variable holding the gradient of some scalar value with respect to `x`.

PyTorch Variables have the same API as PyTorch Tensors: (almost) any operation that you can perform on a Tensor also works on Variables; the difference is that using Variables defines a computational graph, allowing you to automatically compute gradients.
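
To make the computational graph idea concrete, here is a minimal pure-Python sketch of reverse-mode automatic differentiation on scalars. It is not how PyTorch implements `autograd`; the `Node` class and its `parents` field are hypothetical names chosen for illustration, and only addition and multiplication are supported.

```python
class Node:
    """A scalar in a computational graph (hypothetical; not PyTorch's Variable)."""

    def __init__(self, data, parents=()):
        self.data = data        # value computed in the forward pass
        self.grad = 0.0         # d(output)/d(self), filled in by backward()
        self.parents = parents  # pairs of (parent node, local derivative)

    def __add__(self, other):
        return Node(self.data + other.data, ((self, 1.0), (other, 1.0)))

    def __mul__(self, other):
        return Node(self.data * other.data,
                    ((self, other.data), (other, self.data)))

    def backward(self):
        # Collect nodes in topological order, then walk them in reverse so
        # each node's gradient is complete before it flows to its parents.
        order, seen = [], set()

        def visit(node):
            if id(node) not in seen:
                seen.add(id(node))
                for parent, _ in node.parents:
                    visit(parent)
                order.append(node)

        visit(self)
        self.grad = 1.0
        for node in reversed(order):
            for parent, local in node.parents:
                parent.grad += local * node.grad  # chain rule


x, w, b = Node(2.0), Node(3.0), Node(1.0)
out = x * w + b   # the forward pass builds the graph as a side effect
out.backward()    # the backward pass fills in the .grad fields
print(out.data, x.grad, w.grad, b.grad)  # 7.0 3.0 2.0 1.0
```

As with Variables, the forward expression looks like ordinary arithmetic; the graph is recorded implicitly and the gradients come out of a single `backward()` call.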

Here we use PyTorch Variables and `autograd` to implement our two-layer network; now we no longer need to manually implement the backward pass through the network:

In [1]:
# Typical PyTorch import.
import torch

# Import Variable from PyTorch's autograd package.
from torch.autograd import Variable

In [2]:
# If you have cuda enabled with torch, use it
# otherwise, run on the CPU.
dtype = torch.cuda.FloatTensor if torch.cuda.is_available() else torch.FloatTensor

In [3]:
# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

In [4]:
# Create random Tensors to hold inputs and outputs, and wrap them in
# Variables. Setting requires_grad to False indicates we don't need to
# compute gradients w.r.t. these Variables during the backward pass.
x = Variable(torch.randn(N, D_in).type(dtype), requires_grad=False)
y = Variable(torch.randn(N, D_out).type(dtype), requires_grad=False)

In [5]:
# Create random Tensors for weights and wrap them in Variables.
# Setting requires_grad to True indicates we want to compute
# gradients w.r.t. these Variables during the backward pass.
w1 = Variable(torch.randn(D_in, H).type(dtype), requires_grad=True)
w2 = Variable(torch.randn(H, D_out).type(dtype), requires_grad=True)

In [6]:
# Learning rate.
lr = 1e-6

In [7]:
# Training iterations.
train_iter = 500

for t in range(train_iter):
    # Forward pass: compute predicted y using operations
    # on Variables; these are exactly the same operations we
    # used to compute the forward pass using Tensors, but
    # we don't need to keep references to intermediate values
    # since we are not implementing the backward pass by hand.
    y_pred = x.mm(w1).clamp(min=0).mm(w2)
    
    # Compute and print loss using operations on the Variables.
    # Now loss is a Variable of shape (1,) and loss.data is a
    # Tensor of shape (1,); loss.data[0] is a scalar value holding
    # the loss.
    loss = (y_pred - y).pow(2).sum()
    print(f'\rt = {t+1:,}\tloss = {loss.data[0]:.2f}', end='')
    
    # Use autograd to compute the backward pass. This will
    # compute the gradient of loss w.r.t. all Variables with
    # requires_grad set to True. After this call, w1.grad and
    # w2.grad will be holding the gradients of the loss w.r.t.
    # w1 and w2 respectively.
    loss.backward()
    
    # Update weights using gradient descent; w1.data and w2.data are
    # Tensors, w1.grad and w2.grad are Variables, and w1.grad.data and
    # w2.grad.data are Tensors.
    w1.data -= lr * w1.grad.data
    w2.data -= lr * w2.grad.data
    
    # Manually zero out the gradient buffer after updating 
    # the weights to prevent gradient accumulation.
    w1.grad.data.zero_()
    w2.grad.data.zero_()

t = 500	loss = 0.0089452
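
The training loop above zeroes the gradient buffers after each update because `backward()` adds into `.grad` rather than overwriting it. The following pure-Python sketch mimics that accumulation for a single scalar weight; `Param` and `backward` here are hypothetical stand-ins, not PyTorch objects.

```python
class Param:
    """A scalar weight with a gradient buffer (hypothetical stand-in)."""

    def __init__(self, data):
        self.data = data
        self.grad = 0.0


def backward(w, x, y):
    # Gradient of the loss (w * x - y)^2 w.r.t. w, accumulated into w.grad.
    w.grad += 2.0 * (w.data * x - y) * x


w = Param(1.0)

backward(w, x=2.0, y=3.0)
first = w.grad            # gradient from one backward pass

backward(w, x=2.0, y=3.0)
stale = w.grad            # buffer was not zeroed, so the gradient doubled

w.grad = 0.0              # what w1.grad.data.zero_() does in the loop
backward(w, x=2.0, y=3.0)
fresh = w.grad            # back to the correct single-pass gradient

print(first, stale, fresh)  # -4.0 -8.0 -4.0
```

Skipping the zeroing step would make every update use the sum of all gradients seen so far instead of the current one.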

### PyTorch: Defining new autograd functions

Under the hood, each primitive *autograd* operator is really two functions that operate on Tensors. The `forward` function computes output Tensors from input Tensors. The `backward` function receives the gradient of the output Tensors w.r.t. some scalar value, and computes the gradient of the input Tensors w.r.t. that same scalar value.
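
As a rough illustration of this forward/backward pairing, the sketch below writes both functions for ReLU on plain Python lists and checks the backward result against finite differences. The function names and the choice of loss (sum of squared outputs) are assumptions made for the example.

```python
def relu_forward(xs):
    """Forward: compute outputs from inputs."""
    return [max(x, 0.0) for x in xs]


def relu_backward(xs, grad_outputs):
    """Backward: given dL/d(output), return dL/d(input)."""
    return [0.0 if x < 0 else g for x, g in zip(xs, grad_outputs)]


def loss(xs):
    # Hypothetical scalar value: the sum of squared ReLU outputs.
    return sum(y * y for y in relu_forward(xs))


xs = [-1.5, 0.5, 2.0]
ys = relu_forward(xs)
grad_outputs = [2.0 * y for y in ys]           # dL/dy for L = sum(y^2)
grad_inputs = relu_backward(xs, grad_outputs)  # dL/dx via the chain rule

# Check against central finite differences of the scalar loss.
eps = 1e-6
numeric = []
for i in range(len(xs)):
    plus = xs[:i] + [xs[i] + eps] + xs[i + 1:]
    minus = xs[:i] + [xs[i] - eps] + xs[i + 1:]
    numeric.append((loss(plus) - loss(minus)) / (2 * eps))

print(grad_inputs)  # [0.0, 1.0, 4.0]
```

The backward function never needs to know what the scalar loss is; it only needs the incoming gradient and whatever the forward pass saved.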

In PyTorch, we can easily define our own *autograd operator* by defining a subclass of `torch.autograd.Function` and implementing the `forward` and `backward` static methods. We can then use our new autograd operator by calling its `apply` method like a function, passing Variables containing input data.

In this example we define our own custom autograd function for performing the *ReLU nonlinearity*, and use it to implement our two-layer network:

In [1]:
# Typical PyTorch import
import torch

# Import Variable from PyTorch's autograd package.
from torch.autograd import Variable

In [2]:
class MyReLU(torch.autograd.Function):
    """
    We can build our own custom autograd functions
    by creating a subclass of the `torch.autograd.Function`
    class and overriding the `forward` and `backward`
    static methods.
    """
    
    @staticmethod
    def forward(ctx, input):
        """
        In the forward pass, we receive a Tensor containing
        the input and return a Tensor containing the computed
        output. In this case, we return the ReLU activation.
        
        `ctx` is a context object that can be used to stash
        information for the backward computation. You can cache
        arbitrary objects for use in the backward pass using
        the `ctx.save_for_backward` method.
        """
        ctx.save_for_backward(input)
        return input.clamp(min=0)
    
    @staticmethod
    def backward(ctx, grad_output):
        """
        We receive the gradient of the loss w.r.t. the output.
        Now, we compute the gradient of the loss w.r.t. the input.
        
        `ctx` is the same context object used in the forward pass;
        the Tensors saved there are available as `ctx.saved_tensors`,
        which returns a tuple of saved Tensors.
        
        Since we saved a single Tensor, the tuple contains a single
        value, so we unpack it by putting a comma (,) after the
        variable name:
        
        >>> names = ('John',)  # Comma makes this a 1-element tuple.
        >>> john, = names
        >>> print(john)
        John
        
        >>> names = ('John')  # No comma: this is just the string 'John'.
        >>> (john) = names
        >>> print(john)
        John
        
        """
        input, = ctx.saved_tensors
        grad_input = grad_output.clone()
        grad_input[input < 0] = 0  # ReLU derivative.
        return grad_input

In [3]:
# If you have cuda enabled with torch, use it
# otherwise, run on the CPU.
dtype = torch.cuda.FloatTensor if torch.cuda.is_available() else torch.FloatTensor

In [4]:
# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

In [5]:
# Create random Tensors to hold inputs and outputs, and wrap them in
# Variables. Setting requires_grad to False indicates we don't need to
# compute gradients w.r.t. these Variables during the backward pass.
x = Variable(torch.randn(N, D_in).type(dtype), requires_grad=False)
y = Variable(torch.randn(N, D_out).type(dtype), requires_grad=False)

In [6]:
# Create random Tensors for weights and wrap them in Variables.
# Setting requires_grad to True indicates we want to compute
# gradients w.r.t. these Variables during the backward pass.
w1 = Variable(torch.randn(D_in, H).type(dtype), requires_grad=True)
w2 = Variable(torch.randn(H, D_out).type(dtype), requires_grad=True)

In [7]:
# Learning rate.
lr = 1e-6

In [8]:
# Training iterations.
train_iter = 500

for t in range(train_iter):
    # We don't create an instance of our custom class;
    # instead we use `Function.apply` to apply our
    # custom activation function.
    # We use `apply` because we want PyTorch to keep
    # the computation inside the current graph, along
    # with its history.
    relu = MyReLU.apply
    
    # Forward pass: We multiply our input by the 1st weight
    # matrix (w1) then we use our custom non-linearity
    # then matrix multiply the 2nd weight to get our prediction.
    y_pred = relu(x.mm(w1)).mm(w2)
    
    # Compute and print loss using operations on the Variables.
    loss = (y_pred - y).pow(2).sum()
    print(f'\rt = {t+1:,}\tloss = {loss.data[0]:.2f}', end='')
    
    # Use autograd to compute the backward pass.
    loss.backward()
    
    # Update the weights using Gradient Descent.
    w1.data -= lr * w1.grad.data
    w2.data -= lr * w2.grad.data
    
    # Manually zero out the gradient buffer after updating 
    # the weights to prevent gradient accumulation.
    w1.grad.data.zero_()
    w2.grad.data.zero_()

t = 500	loss = 0.0069152

### TensorFlow: Static Graphs

PyTorch autograd looks a lot like TensorFlow: in both frameworks we define a computational graph, and use automatic differentiation to compute gradients. The biggest difference between the two is that TensorFlow’s computational graphs are **static** and PyTorch uses **dynamic computational graphs**.

In TensorFlow, we define the computational graph once and then execute the same graph over and over again, possibly feeding different input data to the graph. In PyTorch, each forward pass defines a new computational graph.

Static graphs are nice because you can optimize the graph up front; for example a framework might decide to fuse some graph operations for efficiency, or to come up with a strategy for distributing the graph across many GPUs or many machines. If you are reusing the same graph over and over, then this potentially costly up-front optimization can be amortized as the same graph is rerun over and over.
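
The "define once, optimize up front, run many times" pattern can be sketched in pure Python. This is only an illustration of the idea, not TensorFlow's implementation: `Placeholder`, `Const`, `Op`, and `fold_constants` are hypothetical names, and constant folding stands in for the kinds of graph optimization a real framework might apply.

```python
import operator


class Placeholder:
    """Input slot whose value is supplied each time the graph runs."""

    def __init__(self, name):
        self.name = name

    def run(self, feed):
        return feed[self.name]


class Const:
    """A fixed value baked into the graph."""

    def __init__(self, value):
        self.value = value

    def run(self, feed):
        return self.value


class Op:
    """Applies `fn` to the results of its input nodes."""

    def __init__(self, fn, *inputs):
        self.fn = fn
        self.inputs = inputs

    def run(self, feed):
        return self.fn(*(node.run(feed) for node in self.inputs))


def fold_constants(node):
    """One-time optimization: precompute subgraphs made only of constants."""
    if isinstance(node, Op):
        inputs = [fold_constants(i) for i in node.inputs]
        if all(isinstance(i, Const) for i in inputs):
            return Const(node.fn(*(i.value for i in inputs)))
        return Op(node.fn, *inputs)
    return node


# Define the graph once: (x * (2 * 3)) + 1. Nothing is computed yet.
x = Placeholder("x")
scale = Op(operator.mul, Const(2.0), Const(3.0))
graph = Op(operator.add, Op(operator.mul, x, scale), Const(1.0))

# Optimize once, up front: the (2 * 3) subgraph becomes Const(6.0).
graph = fold_constants(graph)

# Execute the same optimized graph repeatedly with different inputs.
outputs = [graph.run({"x": value}) for value in (0.0, 1.0, 2.0)]
print(outputs)  # [1.0, 7.0, 13.0]
```

The cost of `fold_constants` is paid once, while every subsequent `run` benefits from the smaller graph.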

One aspect where static and dynamic graphs differ is control flow. For some models we may wish to perform different computation for each data point; for example a recurrent network might be unrolled for different numbers of time steps for each data point; this unrolling can be implemented as a loop. With a static graph the loop construct needs to be a part of the graph; for this reason TensorFlow provides operators such as `tf.scan` for embedding loops into the graph. With dynamic graphs the situation is simpler: since we build graphs on-the-fly for each example, we can use normal **imperative flow control** to perform computation that differs for each input.
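
The dynamic-graph side of this contrast can be sketched without any framework. In the hypothetical example below, `run_rnn` unrolls a scalar recurrent cell with an ordinary Python `for` loop and records one entry per step on a `tape`, standing in for the graph a dynamic framework would build on the fly. The weights and inputs are made-up values.

```python
import math


def run_rnn(sequence, w_h=0.5, w_x=1.0):
    """Unroll a scalar recurrent cell over `sequence`.

    Returns the final hidden state and a tape with one entry per step,
    which plays the role of the per-example computational graph.
    """
    tape = []
    h = 0.0
    for step, x_t in enumerate(sequence):  # normal imperative control flow
        h = math.tanh(w_h * h + w_x * x_t)
        tape.append(("tanh_cell", step))
    return h, tape


# Two data points with different numbers of time steps.
examples = [[0.1, 0.2], [0.3, 0.1, -0.2, 0.4, 0.0]]
tape_lengths = []
for sequence in examples:
    h, tape = run_rnn(sequence)
    tape_lengths.append(len(tape))

print(tape_lengths)  # [2, 5]
```

Each example yields a graph of a different size, and no special loop operator is needed to express that.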

To contrast with the PyTorch autograd example above, here we use TensorFlow to fit a simple two-layer net:

In [1]:
# Typical NumPy import.
import numpy as np

# Standard way to import TensorFlow.
import tensorflow as tf

In [2]:
# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 100, 1000, 10

In [3]:
# Create placeholders for the input and output,
# to serve as gateway for feeding inputs and outputs
# to the network during execution.
x = tf.placeholder(tf.float32, shape=[N, D_in])
y = tf.placeholder(tf.float32, shape=[N, D_out])

In [4]:
# Randomly initialize learnable weights. A TensorFlow
# Variable persists its value across executions of the graph.
w1 = tf.Variable(tf.random_normal(shape=[D_in, H]))
w2 = tf.Variable(tf.random_normal(shape=[H, D_out]))

In [5]:
# Learning rate.
lr = 1e-6

In [6]:
# Forward pass: Propagate the input through the network
# by performing some operations on TensorFlow's Tensors.
# NOTE: No operation is actually being run at this point,
# we're just setting up the computational graph that'll
# be executed/run later on.
h = tf.matmul(x, w1)
h_relu = tf.maximum(h, tf.zeros(1))
y_pred = tf.matmul(h_relu, w2)

In [7]:
# Compute the loss: Three different ways to compute Squared error
# in TensorFlow.
loss = tf.reduce_sum(tf.squared_difference(y_pred, y))
# loss = tf.reduce_sum(tf.square(y - y_pred))
# loss = tf.reduce_sum((y - y_pred) ** 2.0)

print(loss)  # Prints the node that holds the operation on loss.

Tensor("Sum:0", shape=(), dtype=float32)


In [8]:
# Compute the gradient of the loss, w.r.t. w1 & w2
grad_w1, grad_w2 = tf.gradients(loss, [w1, w2])

In [9]:
# Update the weights using gradient descent. To actually update the weights
# we need to evaluate new_w1 and new_w2 when executing the graph. Note that
# in TensorFlow the act of updating the value of the weights is part of
# the computational graph; in PyTorch this happens outside the computational
# graph.
new_w1 = w1.assign(w1 - lr * grad_w1)
new_w2 = w2.assign(w2 - lr * grad_w2)

In [10]:
# It's time to run our computational graph.
# We run graphs using a TensorFlow Session.
with tf.Session() as sess:
    # Run the graph once to initialize the Variables w1 & w2.
    sess.run(tf.global_variables_initializer())
    
    # Create NumPy arrays that hold our actual data.
    x_value = np.random.randn(N, D_in)
    y_value = np.random.randn(N, D_out)
    
    # Training iterations.
    train_iter = 500
    
    for t in range(train_iter):
        # Execute the graph many times. Each time it executes we want to bind
        # x_value to x and y_value to y, specified with the feed_dict argument.
        # Each time we execute the graph we want to compute the values for loss,
        # new_w1, and new_w2; the values of these Tensors are returned as numpy
        # arrays.
        _loss, _, _ = sess.run([loss, new_w1, new_w2], 
                               feed_dict={ x: x_value, y: y_value })
        # Print training progress.
        print(f'\rt = {t+1:,}\tLoss = {_loss:.2f}', end='')

t = 500	Loss = 0.00390050