# Warm up – `NumPy`

Before introducing `PyTorch`, we will first implement the network using `numpy`.

NumPy provides an *n-dimensional* array object and many functions for manipulating these arrays. NumPy is a generic framework for scientific computing; it does not know anything about computation graphs, deep learning, or gradients. However, we can easily use NumPy to fit a two-layer network to random data by manually implementing the forward and backward passes through the network using NumPy operations:

In [1]:
# Typical NumPy import.
import numpy as np

In [2]:
# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

In [3]:
# Create random input and output data.
x = np.random.randn(N, D_in)
y = np.random.randn(N, D_out)

In [4]:
# Randomly initialize weights.
w1 = np.random.randn(D_in, H)
w2 = np.random.randn(H, D_out)

In [5]:
# Learning rate.
lr = 1e-6

In [6]:
# Training iterations.
train_iter = 500

for t in range(train_iter):
    # Forward pass.
    h = x.dot(w1)
    h_relu = np.maximum(h, 0)
    y_pred = h_relu.dot(w2)
    
    # Compute loss: Squared error.
    loss = np.square(y - y_pred).sum()
    
    # Print loss and current time step.
    print(f'\rt = {t+1:,}\tloss = {loss:.4f}', end='')
    
    # Backpropagation: compute gradients
    # of the loss w.r.t. w1 and w2.
    grad_y_pred = 2.0 * (y_pred - y)
    grad_w2 = h_relu.T.dot(grad_y_pred)
    grad_h_relu = grad_y_pred.dot(w2.T)
    grad_h = grad_h_relu.copy()
    grad_h[h < 0] = 0
    grad_w1 = x.T.dot(grad_h)
    
    # Update weights using Gradient Descent.
    w1 -= lr * grad_w1
    w2 -= lr * grad_w2

t = 500	loss = 0.000001655
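A hand-written backward pass is easy to get subtly wrong, so it can help to compare it against numerical gradients. The following is a minimal sketch of such a check, not part of the original notebook: it uses deliberately tiny dimensions, a fixed seed, and two hypothetical helpers, `loss_fn` and `numerical_grad`, introduced here only for illustration.

```python
import numpy as np

np.random.seed(0)

# Tiny dimensions (an assumption for this sketch) keep the check fast.
N, D_in, H, D_out = 4, 5, 3, 2
x = np.random.randn(N, D_in)
y = np.random.randn(N, D_out)
w1 = np.random.randn(D_in, H)
w2 = np.random.randn(H, D_out)

def loss_fn(w1, w2):
    # Same forward pass and squared-error loss as in the training loop.
    h_relu = np.maximum(x.dot(w1), 0)
    return np.square(h_relu.dot(w2) - y).sum()

def numerical_grad(f, w, eps=1e-6):
    # Central differences: perturb one entry of w at a time, in place.
    grad = np.zeros_like(w)
    for idx in np.ndindex(*w.shape):
        old = w[idx]
        w[idx] = old + eps
        f_plus = f()
        w[idx] = old - eps
        f_minus = f()
        w[idx] = old  # restore the original value
        grad[idx] = (f_plus - f_minus) / (2 * eps)
    return grad

# Analytic gradients, computed exactly as in the training loop.
h = x.dot(w1)
h_relu = np.maximum(h, 0)
y_pred = h_relu.dot(w2)
grad_y_pred = 2.0 * (y_pred - y)
grad_w2 = h_relu.T.dot(grad_y_pred)
grad_h = grad_y_pred.dot(w2.T)
grad_h[h < 0] = 0
grad_w1 = x.T.dot(grad_h)

# Largest absolute disagreement between the two approaches.
err_w1 = np.max(np.abs(numerical_grad(lambda: loss_fn(w1, w2), w1) - grad_w1))
err_w2 = np.max(np.abs(numerical_grad(lambda: loss_fn(w1, w2), w2) - grad_w2))
print(f'max error w1 = {err_w1:.2e}, w2 = {err_w2:.2e}')
```

If both errors are tiny compared with the gradient magnitudes, the manual backward pass is very likely correct. The check can misbehave when an entry of `h` lies extremely close to zero, because ReLU is not differentiable there.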

# Warm up – `PyTorch`: Tensors

`Numpy` is a great framework, but it cannot utilize GPUs to accelerate its numerical computations. For modern deep neural networks, GPUs often provide speedups of [50x or greater](https://github.com/jcjohnson/cnn-benchmarks), so unfortunately `numpy` won’t be enough for modern deep learning.

Here we introduce the most fundamental `PyTorch` concept: *the Tensor*. **A PyTorch Tensor** is conceptually identical to a numpy array: a Tensor is an n-dimensional array, and PyTorch provides many functions for operating on these Tensors. Like numpy arrays, PyTorch Tensors do not know anything about deep learning or computational graphs or gradients; they are a generic tool for scientific computing.

However, unlike numpy, PyTorch Tensors can utilize GPUs to accelerate their numeric computations. To run a PyTorch Tensor on a GPU, you simply need to cast it to a new datatype.
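The cast described above is the approach used throughout this notebook. As an aside, and assuming PyTorch 0.4 or later is installed, the same placement can also be expressed with `torch.device` and `Tensor.to`. The sketch below falls back to the CPU when CUDA is unavailable; the variable names are illustrative only.

```python
import torch

# Pick the GPU when CUDA is available, otherwise stay on the CPU.
# (torch.device and Tensor.to assume PyTorch 0.4 or later.)
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

a = torch.randn(3, 4)                      # created on the CPU
a_dev = a.to(device)                       # moved to the chosen device
b_dev = torch.randn(4, 2, device=device)   # created directly on the device

# Both operands live on the same device, so the product does too.
c = a_dev.mm(b_dev)
print(c.device, tuple(c.shape))
```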

Here we use PyTorch Tensors to fit a two-layer network to random data. Like the numpy example above we need to manually implement the forward and backward passes through the network:

In [1]:
# Typical PyTorch import.
import torch

In [2]:
# If you have cuda enabled with torch, use it
# otherwise, run on the CPU.
dtype = torch.cuda.FloatTensor if torch.cuda.is_available() else torch.FloatTensor

In [3]:
# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

In [4]:
# Create random input and output data.
x = torch.randn(N, D_in).type(dtype)
y = torch.randn(N, D_out).type(dtype)

In [5]:
# Randomly initialize weights.
w1 = torch.randn(D_in, H).type(dtype)
w2 = torch.randn(H, D_out).type(dtype)

In [6]:
# Learning rate.
lr = 1e-6

In [7]:
# Training iterations.
train_iter = 500

for t in range(train_iter):
    # Forward pass.
    h = x.mm(w1)
    h_relu = h.clamp(min=0)
    y_pred = h_relu.mm(w2)
    
    # Compute loss: Squared error.
    loss = (y_pred - y).pow(2).sum()
    print(f'\rt = {t+1:,}\tloss = {loss:.2f}', end='')
    
    # Backpropagation: compute gradients
    # of the loss w.r.t. w1 and w2.
    grad_y_pred = 2.0 * (y_pred - y)
    grad_w2 = h_relu.t().mm(grad_y_pred)
    grad_h_relu = grad_y_pred.mm(w2.t())
    grad_h = grad_h_relu.clone()
    grad_h[h < 0] = 0
    grad_w1 = x.t().mm(grad_h)
    
    # Update weights using Gradient Descent.
    w1 -= lr * grad_w1
    w2 -= lr * grad_w2

t = 500	loss = 0.0083716
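The Tensor version above mirrors the NumPy version almost call for call: `dot` becomes `mm`, `np.maximum(h, 0)` becomes `h.clamp(min=0)`, `.T` becomes `.t()`, and `copy` becomes `clone`. The following small sketch, which is not part of the original notebook, checks that correspondence on arbitrary toy matrices by converting between the two libraries with `torch.from_numpy` and `Tensor.numpy`.

```python
import numpy as np
import torch

# Toy matrices (sizes chosen arbitrarily for this sketch).
a_np = np.random.randn(3, 4)
b_np = np.random.randn(4, 2)

# torch.from_numpy shares memory with the NumPy arrays (CPU only).
a = torch.from_numpy(a_np)
b = torch.from_numpy(b_np)

# Matrix product followed by ReLU, once per library.
out_np = np.maximum(a_np.dot(b_np), 0)
out_t = a.mm(b).clamp(min=0)

# Compare the results after converting the Tensor back to NumPy.
same_output = np.allclose(out_np, out_t.numpy())
same_transpose = np.allclose(a_np.T, a.t().numpy())
print(same_output, same_transpose)
```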

## Autograd

### PyTorch: Variables and autograd

In the above examples, we had to manually implement both the forward and backward passes of our neural network. Manually implementing the backward pass is not a big deal for a small two-layer network, but can quickly get very hairy for large complex networks.

Thankfully, we can use [automatic differentiation](https://en.wikipedia.org/wiki/Automatic_differentiation) to automate the computation of backward passes in neural networks. The `autograd` package in PyTorch provides exactly this functionality. When using `autograd`, the forward pass of your network will define a ***computational graph***; nodes in the graph will be *Tensors*, and edges will be *functions* that produce output Tensors from input Tensors. Backpropagating through this graph then allows you to easily compute gradients.

This sounds complicated, but it’s pretty simple to use in practice. We wrap our PyTorch Tensors in `Variable` objects; a `Variable` represents a node in a *computational graph*. If `x` is a `Variable` then `x.data` is a Tensor, and `x.grad` is another Variable holding the gradient of some scalar value with respect to `x`.

PyTorch Variables have the same API as PyTorch Tensors: (almost) any operation that you can perform on a Tensor also works on Variables; the difference is that using Variables defines a computational graph, allowing you to automatically compute gradients.
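As a minimal illustration of this idea (a toy sketch with made-up values, not part of the original notebook), we can wrap a two-element Tensor in a `Variable`, build a scalar from it, and ask `autograd` for the gradient:

```python
import torch
from torch.autograd import Variable

# Wrap a small Tensor and ask autograd to track operations on it.
x = Variable(torch.Tensor([2.0, 3.0]), requires_grad=True)

# The forward pass builds the graph: z = x[0]^2 + x[1]^2.
z = (x * x).sum()

# The backward pass fills x.grad with dz/dx = 2 * x.
z.backward()
print(x.grad)
```

Since the gradient of `z` is `2 * x`, `x.grad` holds the values 4 and 6. How it is printed depends on the PyTorch version.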

Here we use PyTorch Variables and `autograd` to implement our two-layer network; now we no longer need to manually implement the backward pass through the network:

In [1]:
# Typical PyTorch import.
import torch

# Import Variable from PyTorch's autograd package.
from torch.autograd import Variable

In [2]:
# If you have cuda enabled with torch, use it
# otherwise, run on the CPU.
dtype = torch.cuda.FloatTensor if torch.cuda.is_available() else torch.FloatTensor

In [3]:
# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

In [4]:
# Create random Tensors to hold inputs and outputs, and wrap them in
# Variables. Setting requires_grad to False indicates we don't need to
# compute gradients w.r.t. these Variables during the backward pass.
x = Variable(torch.randn(N, D_in).type(dtype), requires_grad=False)
y = Variable(torch.randn(N, D_out).type(dtype), requires_grad=False)

In [5]:
# Create random Tensors for weights and wrap them in Variables.
# Setting requires_grad to True indicates we want to compute
# gradients w.r.t. these Variables during the backward pass.
w1 = Variable(torch.randn(D_in, H).type(dtype), requires_grad=True)
w2 = Variable(torch.randn(H, D_out).type(dtype), requires_grad=True)

In [6]:
# Learning rate.
lr = 1e-6

In [7]:
# Training iterations.
train_iter = 500

for t in range(train_iter):
    # Forward pass: compute predicted y using operations
    # on Variables; these are exactly the same operations we
    # used to compute the forward pass using Tensors, but
    # we don't need to keep references to intermediate values
    # since we are not implementing the backward pass by hand.
    y_pred = x.mm(w1).clamp(min=0).mm(w2)
    
    # Compute and print loss using operations on the Variables.
    # Now loss is a Variable of shape (1,) and loss.data is a
    # Tensor of shape (1,); loss.data[0] is a scalar value holding
    # the loss.
    loss = (y_pred - y).pow(2).sum()
    print(f'\rt = {t+1:,}\tloss = {loss.data[0]:.2f}', end='')
    
    # Use autograd to compute the backward pass. This will
    # compute the gradient of loss w.r.t. all Variables with
    # requires_grad set to True. After this call, w1.grad and
    # w2.grad will be holding the gradients of the loss w.r.t.
    # w1 and w2 respectively.
    loss.backward()
    
    # Update weights using Gradient Descent; w1.data and w2.data are
    # Tensors, w1.grad and w2.grad are Variables and w1.grad.data and
    # w2.grad.data are Tensors.
    w1.data -= lr * w1.grad.data
    w2.data -= lr * w2.grad.data
    
    # Manually zero out the gradient buffer after updating the weights.
    w1.grad.data.zero_()
    w2.grad.data.zero_()

t = 500	loss = 0.0074408
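The loop above relies on the `Variable` wrapper and on `loss.data[0]`. PyTorch 0.4 merged Tensors and Variables, so assuming that version or later, roughly the same training loop can be sketched without the wrapper. Here `requires_grad=True` is set directly on the weight Tensors, `loss.item()` extracts the scalar, and `torch.no_grad()` keeps the weight update out of the graph. The `losses` list is an addition for this sketch only; it records the loss so the trend can be inspected afterwards.

```python
import torch

N, D_in, H, D_out = 64, 1000, 100, 10

# Inputs and targets do not need gradients (the default).
x = torch.randn(N, D_in)
y = torch.randn(N, D_out)

# Weights are tracked by autograd directly; no Variable wrapper.
w1 = torch.randn(D_in, H, requires_grad=True)
w2 = torch.randn(H, D_out, requires_grad=True)

lr = 1e-6
losses = []  # hypothetical helper list, used only to inspect the trend

for t in range(500):
    # Forward pass and squared-error loss, as before.
    y_pred = x.mm(w1).clamp(min=0).mm(w2)
    loss = (y_pred - y).pow(2).sum()
    losses.append(loss.item())  # .item() returns a Python float

    # Backward pass: populates w1.grad and w2.grad.
    loss.backward()

    # Update the weights without recording the update in the graph,
    # then reset the gradient buffers for the next iteration.
    with torch.no_grad():
        w1 -= lr * w1.grad
        w2 -= lr * w2.grad
        w1.grad.zero_()
        w2.grad.zero_()

print(f'first loss = {losses[0]:.2f}, last loss = {losses[-1]:.6f}')
```

This sketch runs on the CPU for simplicity. The exact loss values vary from run to run because the data and weights are random.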