## Warm up – `NumPy`

Before introducing `PyTorch`, we will first implement a small two-layer network using `numpy`.

Numpy provides an *n-dimensional* array object and many functions for manipulating these arrays. Numpy is a generic framework for scientific computing; it does not know anything about computation graphs, deep learning, or gradients. However, we can easily use numpy to fit a two-layer network to random data by manually implementing the forward and backward passes through the network using numpy operations:

In [1]:
# Typical NumPy import.
import numpy as np

In [2]:
# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

In [3]:
# Create random input and output data.
x = np.random.randn(N, D_in)
y = np.random.randn(N, D_out)

In [4]:
# Randomly initialize weights.
w1 = np.random.randn(D_in, H)
w2 = np.random.randn(H, D_out)

In [5]:
# Learning rate.
lr = 1e-6

In [6]:
# Training iterations.
train_iter = 500

for t in range(train_iter):
    # Forward pass.
    h = x.dot(w1)
    h_relu = np.maximum(h, 0)
    y_pred = h_relu.dot(w2)
    
    # Compute loss: Squared error.
    loss = np.square(y - y_pred).sum()
    
    # Print loss and current time step.
    print(f'\rt = {t+1:,}\tloss = {loss:.4f}', end='')
    
    # Backpropagation: compute gradients
    # of the loss w.r.t. w1 and w2.
    grad_y_pred = 2.0 * (y_pred - y)
    grad_w2 = h_relu.T.dot(grad_y_pred)
    grad_h_relu = grad_y_pred.dot(w2.T)
    grad_h = grad_h_relu.copy()
    grad_h[h < 0] = 0
    grad_w1 = x.T.dot(grad_h)
    
    # Update weights using Gradient Descent.
    w1 -= lr * grad_w1
    w2 -= lr * grad_w2

t = 500	loss = 0.000072631
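A hand-written backward pass is easy to get subtly wrong, so it can help to compare it against finite differences. The following is a minimal sketch of such a check, not part of the original notebook: it uses much smaller dimensions than the example above so the loop stays fast, and the helper names `loss_fn`, `manual_grads`, and `numerical_grad` are hypothetical.

```python
import numpy as np

# Small, seeded problem so the check is fast and repeatable (assumed sizes).
rng = np.random.default_rng(0)
N, D_in, H, D_out = 4, 5, 3, 2
x = rng.standard_normal((N, D_in))
y = rng.standard_normal((N, D_out))
w1 = rng.standard_normal((D_in, H))
w2 = rng.standard_normal((H, D_out))

def loss_fn(w1, w2):
    # Same forward pass and squared-error loss as the training loop.
    h_relu = np.maximum(x.dot(w1), 0)
    return np.square(h_relu.dot(w2) - y).sum()

def manual_grads(w1, w2):
    # Same backward pass as the training loop.
    h = x.dot(w1)
    h_relu = np.maximum(h, 0)
    y_pred = h_relu.dot(w2)
    grad_y_pred = 2.0 * (y_pred - y)
    grad_w2 = h_relu.T.dot(grad_y_pred)
    grad_h = grad_y_pred.dot(w2.T)
    grad_h[h < 0] = 0
    grad_w1 = x.T.dot(grad_h)
    return grad_w1, grad_w2

def numerical_grad(f, w, eps=1e-6):
    # Central differences: perturb one entry of w at a time, in place.
    grad = np.zeros_like(w)
    for idx in np.ndindex(*w.shape):
        old = w[idx]
        w[idx] = old + eps
        plus = f()
        w[idx] = old - eps
        minus = f()
        w[idx] = old
        grad[idx] = (plus - minus) / (2 * eps)
    return grad

grad_w1, grad_w2 = manual_grads(w1, w2)
num_w1 = numerical_grad(lambda: loss_fn(w1, w2), w1)
num_w2 = numerical_grad(lambda: loss_fn(w1, w2), w2)

# Largest absolute disagreement between analytic and numerical gradients.
max_err_w1 = np.abs(grad_w1 - num_w1).max()
max_err_w2 = np.abs(grad_w2 - num_w2).max()
print(max_err_w1, max_err_w2)
```

If both errors are tiny relative to the gradient magnitudes, the manual backward pass agrees with the loss it is supposed to differentiate.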

## Warm up – `PyTorch`: Tensors

`Numpy` is a great framework, but it cannot utilize GPUs to accelerate its numerical computations. For modern deep neural networks, GPUs often provide speedups of [50x or greater](https://github.com/jcjohnson/cnn-benchmarks), so unfortunately `numpy` won’t be enough for modern deep learning.

Here we introduce the most fundamental `PyTorch` concept: *the Tensor*. **A PyTorch Tensor** is conceptually identical to a numpy array: a Tensor is an *n-dimensional* array, and PyTorch provides many functions for operating on these Tensors. Like numpy arrays, PyTorch Tensors do not know anything about deep learning or computational graphs or gradients; they are a generic tool for scientific computing.
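To make the parallel concrete, here is a small sketch (assuming both `numpy` and `torch` are installed) that runs the same ReLU and matrix product on a numpy array and on a Tensor created from it. The variable names are illustrative only.

```python
import numpy as np
import torch

# A small float32 array and a Tensor that shares its memory.
a_np = np.arange(6, dtype=np.float32).reshape(2, 3)
a_t = torch.from_numpy(a_np)

# ReLU of (a - 2): np.maximum on arrays, clamp on Tensors.
relu_np = np.maximum(a_np - 2.0, 0)
relu_t = (a_t - 2.0).clamp(min=0)

# Matrix product a @ a^T: dot/.T on arrays, mm/.t() on Tensors.
prod_np = a_np.dot(a_np.T)
prod_t = a_t.mm(a_t.t())

print(np.allclose(relu_np, relu_t.numpy()))
print(np.allclose(prod_np, prod_t.numpy()))
```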

However, unlike numpy, PyTorch Tensors can utilize GPUs to accelerate their numeric computations. To run a PyTorch Tensor on a GPU, you simply need to cast it to a CUDA tensor type such as `torch.cuda.FloatTensor`.
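As an alternative to casting between tensor types, PyTorch also lets you select the target through a `torch.device` object. The sketch below is not the approach used in this notebook; it only illustrates the same CPU/GPU fallback with the `device=` argument and `.to(device)`.

```python
import torch

# Pick the GPU when CUDA is available, otherwise fall back to the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Create a tensor directly on the chosen device...
x = torch.randn(64, 1000, device=device)
# ...or move an existing tensor there.
w1 = torch.randn(1000, 100).to(device)

# Operands live on the same device, so the product does too.
h = x.mm(w1)
print(h.device, h.shape)
```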

Here we use PyTorch Tensors to fit a two-layer network to random data. As in the numpy example above, we need to manually implement the forward and backward passes through the network:

In [1]:
# Typical PyTorch import.
import torch

In [2]:
# Use CUDA tensors if CUDA is available;
# otherwise, run on the CPU.
dtype = torch.cuda.FloatTensor if torch.cuda.is_available() else torch.FloatTensor

In [3]:
# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

In [4]:
# Create random input and output data.
x = torch.randn(N, D_in).type(dtype)
y = torch.randn(N, D_out).type(dtype)

In [5]:
# Randomly initialize weights.
w1 = torch.randn(D_in, H).type(dtype)
w2 = torch.randn(H, D_out).type(dtype)

In [6]:
# Learning rate.
lr = 1e-6

In [7]:
# Training iterations.
train_iter = 500

for t in range(train_iter):
    # Forward pass.
    h = x.mm(w1)
    h_relu = h.clamp(min=0)
    y_pred = h_relu.mm(w2)
    
    # Compute loss: Squared error.
    loss = (y_pred - y).pow(2).sum()
    print(f'\rt = {t+1:,}\tloss = {loss:.2f}', end='')
    
    # Backpropagation: compute gradients
    # of the loss w.r.t. w1 and w2.
    grad_y_pred = 2.0 * (y_pred - y)
    grad_w2 = h_relu.t().mm(grad_y_pred)
    grad_h_relu = grad_y_pred.mm(w2.t())
    grad_h = grad_h_relu.clone()
    grad_h[h < 0] = 0
    grad_w1 = x.t().mm(grad_h)
    
    # Update weights using Gradient Descent.
    w1 -= lr * grad_w1
    w2 -= lr * grad_w2

t = 500	loss = 0.0081636