# Learning PyTorch with Examples

**Author:** Justin Johnson  
    
This tutorial introduces the fundamental concepts of PyTorch through self-contained examples.  
    
At its core, PyTorch provides two main features:  
    
- An n-dimensional Tensor, similar to a numpy array, that can also run on GPUs
- Automatic differentiation for building and training neural networks
    
We will use a fully-connected ReLU network as our running example. The network will have a single hidden layer, and will
be trained with gradient descent to fit random data by minimizing the squared Euclidean distance between the network
output and the true output.
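
As a compact reference, the network and loss described above can be written as follows. This is a sketch under assumed notation: the symbols $W_1$, $W_2$, $h$, and $\hat{y}$ are chosen here to mirror the variable names `w1`, `w2`, `h`, and `y_pred` used in the code, and the network is taken to have no bias terms, as in the examples.

$$
h = x W_1, \qquad h_{\text{relu}} = \max(h, 0), \qquad \hat{y} = h_{\text{relu}} W_2, \qquad L = \sum_{i=1}^{N} \sum_{j=1}^{D_{\text{out}}} \left(\hat{y}_{ij} - y_{ij}\right)^2
$$

Here $x$ has shape $N \times D_{\text{in}}$, $W_1$ has shape $D_{\text{in}} \times H$, and $W_2$ has shape $H \times D_{\text{out}}$.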

# Tensors

## Warm-up: numpy
    
Before introducing PyTorch, we will first implement the network using numpy.  
    
Numpy provides an n-dimensional array object, and many functions for manipulating these arrays. Numpy is a generic   
framework for scientific computing; it does not know anything about computation graphs, or deep learning, or gradients.   
However, we can easily use numpy to fit a two-layer network to random data by manually implementing the forward and   
backward passes through the network using numpy operations:  

In [3]:
# -*- coding: utf-8 -*-
import numpy as np

# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension
N, D_in, H, D_out = 64, 1000, 100, 10

# Create random input and output data
x = np.random.randn(N, D_in)
y = np.random.randn(N, D_out)

# Randomly initialize weights
w1 = np.random.randn(D_in, H)
w2 = np.random.randn(H, D_out)

learning_rate = 1e-6
for t in range(500):
    # Forward pass: compute predicted y
    h = x.dot(w1)
    h_relu = np.maximum(h, 0)
    y_pred = h_relu.dot(w2)
    
    # Compute and print loss
    loss = np.square(y_pred - y).sum()
    print(t, loss)
    
    # Backprop to compute gradients of loss with respect to w1 and w2
    grad_y_pred = 2.0 * (y_pred - y)
    grad_w2 = h_relu.T.dot(grad_y_pred)
    grad_h_relu = grad_y_pred.dot(w2.T)
    grad_h = grad_h_relu.copy()
    grad_h[h < 0] = 0
    grad_w1 = x.T.dot(grad_h)
    
    # Update weights
    w1 -= learning_rate * grad_w1
    w2 -= learning_rate * grad_w2

0 32512637.15117716
1 30130367.18497808
2 32112882.331481837
3 32674762.65250075
4 28096102.076332755
5 19245204.9476401
6 10813076.339444432
7 5528340.472117875
8 2936092.707918415
9 1769241.1439698804
10 1224442.4722123006
11 937299.2979155439
12 761522.7290322038
13 638892.8198740529
14 545452.3136335465
15 470590.6392900933
16 408977.59545085375
17 357493.1606781024
18 313995.9559709025
19 276921.9426135201
20 245181.6597324201
21 217855.45493256536
22 194203.05668158483
23 173703.13538940827
24 155808.99770627473
25 140127.3125375644
26 126333.82910770437
27 114160.2537346893
28 103390.41514447962
29 93828.12239409874
30 85316.07853242781
31 77720.89000181489
32 70928.15092001304
33 64837.57048365174
34 59365.616777927804
35 54437.21867535901
36 49989.62123996641
37 45969.04121266958
38 42328.22709177427
39 39022.40968312568
40 36016.05326573577
41 33265.45516730071
42 30759.585512654332
43 28473.947480361705
44 26387.09860553934
45 24477.898643201363
46 22731.042010226985
47 2113

400 0.0016749033141179683
401 0.001606416040506588
402 0.0015407547307367542
403 0.0014777689130455406
404 0.0014173609430441175
405 0.0013594348808299
406 0.0013039143831843607
407 0.0012506526317361231
408 0.0011995703666519352
409 0.0011505750692011553
410 0.0011036019366654534
411 0.001058565006415029
412 0.0010153401462224985
413 0.0009739236168972808
414 0.0009341713280003783
415 0.0008960698022216565
416 0.0008595174646974644
417 0.000824458789706526
418 0.0007908434032905634
419 0.0007586036647089785
420 0.0007276842218193697
421 0.0006980304491920656
422 0.0006695802571610049
423 0.0006422960548513613
424 0.000616125339990481
425 0.0005910301837518044
426 0.0005669625521745329
427 0.0005438761695975088
428 0.0005217279465181735
429 0.0005004936344051014
430 0.0004801258260643072
431 0.0004605820330437538
432 0.00044184051351570233
433 0.0004238728442685487
434 0.00040663149496959663
435 0.0003900946361593656
436 0.0003742300509931636
437 0.0003590147111505833
438 0.00034441496
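
A hand-written backward pass like the one above is easy to get subtly wrong, so it can help to compare it against a numerical estimate. The sketch below is not part of the original example: the tiny problem sizes, the fixed seed, and the helper names `loss_fn`, `analytic_grads`, and `numeric_grad` are all assumptions made for illustration. It reuses the same gradient formulas and checks them with central differences.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the check is repeatable

# Small sizes keep the finite-difference loop cheap (assumed values)
N, D_in, H, D_out = 4, 5, 3, 2
x = rng.standard_normal((N, D_in))
y = rng.standard_normal((N, D_out))
w1 = rng.standard_normal((D_in, H))
w2 = rng.standard_normal((H, D_out))

def loss_fn(w1, w2):
    """Sum-of-squares loss of the two-layer ReLU network."""
    h_relu = np.maximum(x.dot(w1), 0)
    return np.square(h_relu.dot(w2) - y).sum()

def analytic_grads(w1, w2):
    """Same backward pass as in the training loop above."""
    h = x.dot(w1)
    h_relu = np.maximum(h, 0)
    grad_y_pred = 2.0 * (h_relu.dot(w2) - y)
    grad_w2 = h_relu.T.dot(grad_y_pred)
    grad_h = grad_y_pred.dot(w2.T)
    grad_h[h < 0] = 0
    return x.T.dot(grad_h), grad_w2

def numeric_grad(f, w, eps=1e-6):
    """Central-difference estimate of df/dw, one entry at a time."""
    grad = np.zeros_like(w)
    for idx in np.ndindex(*w.shape):
        old = w[idx]
        w[idx] = old + eps
        f_plus = f()
        w[idx] = old - eps
        f_minus = f()
        w[idx] = old  # restore the original entry
        grad[idx] = (f_plus - f_minus) / (2 * eps)
    return grad

grad_w1, grad_w2 = analytic_grads(w1, w2)
num_w1 = numeric_grad(lambda: loss_fn(w1, w2), w1)
num_w2 = numeric_grad(lambda: loss_fn(w1, w2), w2)

# Largest absolute disagreement between the two estimates
max_err_w1 = np.abs(grad_w1 - num_w1).max()
max_err_w2 = np.abs(grad_w2 - num_w2).max()
print(max_err_w1, max_err_w2)
```

Both printed errors should be very small relative to the gradient magnitudes. A large discrepancy would point to a mistake in the backward pass.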

## PyTorch: Tensors
  
Numpy is a great framework, but it cannot utilize GPUs to accelerate its numerical computations. For modern deep neural   
networks, GPUs often provide speedups of **50x or greater**, so unfortunately numpy won’t be enough for modern deep   
learning.  
    
Here we introduce the most fundamental PyTorch concept: the **Tensor**. A PyTorch Tensor is conceptually identical to a   
numpy array: a Tensor is an n-dimensional array, and PyTorch provides many functions for operating on these Tensors.   
Behind the scenes, Tensors can keep track of a computational graph and gradients, but they’re also useful as a generic   
tool for scientific computing.  
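
To make the similarity concrete, here is a small sketch that runs the same computation on a numpy array and on a PyTorch Tensor. The array contents and variable names are arbitrary choices for illustration. `torch.from_numpy` and `Tensor.numpy()` convert between the two representations on the CPU.

```python
import numpy as np
import torch

a_np = np.arange(6, dtype=np.float32).reshape(2, 3)

# torch.from_numpy shares memory with the source array on CPU
a_t = torch.from_numpy(a_np)

# The same computation written with each library
b_np = np.maximum(a_np - 2.0, 0).dot(a_np.T)
b_t = (a_t - 2.0).clamp(min=0).mm(a_t.t())

print(b_np)
print(b_t.numpy())  # .numpy() converts a CPU Tensor back to an array
```

The two printed matrices should agree, since `clamp(min=0)` and `mm` play the roles of `np.maximum(..., 0)` and `dot` here.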
    
Also unlike numpy, PyTorch Tensors can utilize GPUs to accelerate their numeric computations. To run a PyTorch Tensor on
GPU, you simply need to place it on the correct device, for example by passing a `device` argument when creating it.
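
A minimal sketch of device placement, assuming a machine that may or may not have a CUDA GPU. The fallback to the CPU and the tensor shapes are illustrative choices, not part of the original example.

```python
import torch

# Use the first CUDA GPU when one is visible, otherwise stay on the CPU
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

# Create a Tensor directly on the chosen device ...
x = torch.randn(3, 4, device=device)
# ... or move an existing CPU Tensor there with .to()
w = torch.randn(4, 2).to(device)

out = x.mm(w)  # both operands must live on the same device
print(out.device, out.shape)
```

Writing code against a single `device` variable, as the examples in this tutorial do, lets the same script run on either kind of hardware.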
    
Here we use PyTorch Tensors to fit a two-layer network to random data. Like the numpy example above, we need to manually
implement the forward and backward passes through the network:

In [4]:
# -*- coding: utf-8 -*-

import torch

dtype = torch.float
device = torch.device("cpu")
# device = torch.device("cuda:0") # Uncomment this to run on GPU

# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension
N, D_in, H, D_out = 64, 1000, 100, 10

# Create random input and output data
x = torch.randn(N, D_in, device=device, dtype=dtype)
y = torch.randn(N, D_out, device=device, dtype=dtype)

# Randomly initialize weights
w1 = torch.randn(D_in, H, device=device, dtype=dtype)
w2 = torch.randn(H, D_out, device=device, dtype=dtype)

learning_rate = 1e-6
for t in range(500):
    # Forward pass: compute predicted y
    h = x.mm(w1)
    h_relu = h.clamp(min=0)
    y_pred = h_relu.mm(w2)
    
    # Compute and print loss
    loss = (y_pred - y).pow(2).sum().item()
    print(t, loss)
    
    # Backprop to compute gradients of loss with respect to w1 and w2
    grad_y_pred = 2.0 * (y_pred - y)
    grad_w2 = h_relu.t().mm(grad_y_pred)
    grad_h_relu = grad_y_pred.mm(w2.t())
    grad_h = grad_h_relu.clone()
    grad_h[h < 0] = 0
    grad_w1 = x.t().mm(grad_h)
    
    # Update weights using gradient descent
    w1 -= learning_rate * grad_w1
    w2 -= learning_rate * grad_w2

0 27916228.0
1 22337372.0
2 20150272.0
3 18466598.0
4 16090425.0
5 12877976.0
6 9452834.0
7 6476761.0
8 4269442.0
9 2790474.5
10 1857088.75
11 1279612.5
12 920495.5
13 691431.875
14 539902.625
15 435130.0
16 359416.375
17 302342.3125
18 257814.90625
19 222087.0625
20 192813.5625
21 168391.703125
22 147799.296875
23 130264.34375
24 115207.78125
25 102204.4296875
26 90913.015625
27 81056.1015625
28 72431.9609375
29 64860.91015625
30 58195.1953125
31 52309.68359375
32 47101.31640625
33 42479.15234375
34 38365.578125
35 34696.12890625
36 31418.759765625
37 28485.09375
38 25854.30859375
39 23492.6484375
40 21368.31640625
41 19454.56640625
42 17727.716796875
43 16167.0712890625
44 14755.8173828125
45 13478.146484375
46 12319.8623046875
47 11268.794921875
48 10314.8046875
49 9447.27734375
50 8657.9833984375
51 7939.4599609375
52 7284.63037109375
53 6687.28955078125
54 6142.71142578125
55 5645.2431640625
56 5190.6689453125
57 4774.87109375
58 4394.41796875
59 4046.112548828125
60 3726.95019531

420 4.237335087964311e-05
421 4.179903044132516e-05
422 4.118539436603896e-05
423 4.054736200487241e-05
424 4.0005394112085924e-05
425 3.937944711651653e-05
426 3.861612640321255e-05
427 3.7925667129457e-05
428 3.722254768945277e-05
429 3.661458322312683e-05
430 3.6142071621725336e-05
431 3.5552533518057317e-05
432 3.493516851449385e-05
433 3.427968840696849e-05
434 3.361038034199737e-05
435 3.312835906399414e-05
436 3.267878128099255e-05
437 3.206097608199343e-05
438 3.159174957545474e-05
439 3.113192360615358e-05
440 3.0681352654937655e-05
441 3.0252889700932428e-05
442 2.983474769280292e-05
443 2.9379742045421153e-05
444 2.9165188607294112e-05
445 2.8553300580824725e-05
446 2.801929622364696e-05
447 2.77465078397654e-05
448 2.7243389922659844e-05
449 2.6911604436463676e-05
450 2.6496394639252685e-05
451 2.6119874746655114e-05
452 2.5653709599282593e-05
453 2.5276236556237563e-05
454 2.493792271707207e-05
455 2.4637796741444618e-05
456 2.4247947294497862e-05
457 2.389493238297291e-05

# Autograd

## PyTorch: Tensors and autograd
    
In the above examples, we had to manually implement both the forward and backward passes of our neural network. Manually   
implementing the backward pass is not a big deal for a small two-layer network, but can quickly get very hairy for large   
complex networks.  
    
Thankfully, we can use automatic differentiation to automate the computation of backward passes in neural networks. The   
**autograd** package in PyTorch provides exactly this functionality. When using autograd, the forward pass of your   
network will define a **computational graph**; nodes in the graph will be Tensors, and edges will be functions that   
produce output Tensors from input Tensors. Backpropagating through this graph then allows you to easily compute   
gradients.  
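
The graph can be inspected directly. In this small sketch (the tensor values and variable names are arbitrary), each Tensor produced by an operation carries a `grad_fn` attribute pointing at the function that created it, while Tensors created by the user are leaves with no `grad_fn`.

```python
import torch

a = torch.tensor([1.0, -2.0, 3.0], requires_grad=True)
b = a.clamp(min=0)   # recorded as one step in the graph
c = b.pow(2).sum()   # scalar result at the end of the graph

# Leaf Tensors created by the user have no grad_fn;
# Tensors produced by operations point at the function that made them.
print(a.is_leaf, a.grad_fn)
print(b.is_leaf, b.grad_fn is not None)
print(c.requires_grad)
```

Because `a` requires gradients, every Tensor computed from it does too, which is how autograd knows what to traverse during backpropagation.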
    
This sounds complicated, but it is pretty simple to use in practice. Each Tensor represents a node in a computational graph.
If `x` is a Tensor that has `x.requires_grad=True`, then after a backward pass `x.grad` is another Tensor holding the
gradient of some scalar value with respect to `x`.
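
A short sketch of how `x.grad` behaves, using an arbitrary three-element Tensor and the scalar `sum(x**2)`, whose gradient is `2 * x`. It also illustrates that gradients accumulate across calls to `backward()`, which is why training loops reset them between steps.

```python
import torch

x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)

s = x.pow(2).sum()       # scalar value
s.backward()             # fills x.grad with d(s)/dx = 2 * x
first = x.grad.clone()

# A second backward pass through a fresh graph adds to x.grad
x.pow(2).sum().backward()
second = x.grad.clone()

# Reset the accumulated gradient in place
x.grad.zero_()
print(first, second, x.grad)
```

After the second call the stored gradient is twice the first one, and `zero_()` clears it again.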
    
Here we use PyTorch Tensors and autograd to implement our two-layer network; now we no longer need to manually implement   
the backward pass through the network:  

In [7]:
# -*- coding: utf-8 -*-
import torch

dtype = torch.float
device = torch.device("cpu")
# device = torch.device("cuda:0") # Uncomment this to run on GPU

# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension
N, D_in, H, D_out = 64, 1000, 100, 10

# Create random Tensors to hold input and outputs
# requires_grad defaults to False, which indicates that we do not need to
# compute gradients with respect to these Tensors during the backward pass.
x = torch.randn(N, D_in, device=device, dtype=dtype)
y = torch.randn(N, D_out, device=device, dtype=dtype)

# Create random Tensors for weights
# Setting requires_grad=True indicates that we want to compute gradients with
# respect to these Tensors during the backward pass.
w1 = torch.randn(D_in, H, device=device, dtype=dtype, requires_grad=True)
w2 = torch.randn(H, D_out, device=device, dtype=dtype, requires_grad=True)

learning_rate = 1e-6
for t in range(500):
    # Forward pass: compute predicted y using operations on Tensors; these
    # are exactly the same operations we used to compute the forward pass using
    # Tensors, but we do not need to keep references to intermediate values since
    # we are not implementing the backward pass by hand.
    y_pred = x.mm(w1).clamp(min=0).mm(w2)
    
    # Compute and print loss using operations on Tensors.
    # Now loss is a zero-dimensional Tensor (shape ()).
    # loss.item() gets the scalar value held in the loss.
    loss = (y_pred - y).pow(2).sum()
    print(t, loss.item())
    
    # Use autograd to compute the backward pass. This call will compute the
    # gradient of loss with respect to all Tensors with requires_grad=True.
    # After this call w1.grad and w2.grad will be Tensors holding the gradient
    # of the loss with respect to w1 and w2 respectively.
    loss.backward()
    
    # Manually update weights using gradient descent. Wrap in torch.no_grad()
    # because weights have requires_grad=True, but we don't need to track this
    # in autograd.
    # An alternative way is to operate on weight.data and weight.grad.data.
    # Recall that tensor.data gives a tensor that shares the storage with
    # tensor, but doesn't track history.
    # You can also use torch.optim.SGD to achieve this.
    with torch.no_grad():
        w1 -= learning_rate * w1.grad
        w2 -= learning_rate * w2.grad
        
        # Manually zero the gradients after updating weights
        w1.grad.zero_()
        w2.grad.zero_()

0 32839986.0
1 28735776.0
2 25800786.0
3 21426596.0
4 15761454.0
5 10358772.0
6 6384536.0
7 3910374.0
8 2497607.25
9 1705025.0
10 1246455.75
11 964584.25
12 778280.375
13 645931.25
14 546290.625
15 468147.375
16 404848.6875
17 352473.5
18 308675.90625
19 271546.03125
20 239851.4375
21 212588.484375
22 189019.90625
23 168533.65625
24 150676.78125
25 135057.84375
26 121343.609375
27 109256.5390625
28 98567.6484375
29 89086.171875
30 80646.5078125
31 73130.875
32 66416.5234375
33 60406.05859375
34 55016.765625
35 50176.6953125
36 45817.44921875
37 41887.75
38 38339.484375
39 35130.82421875
40 32222.53125
41 29585.90234375
42 27191.333984375
43 25014.015625
44 23031.423828125
45 21223.638671875
46 19574.166015625
47 18066.84765625
48 16688.759765625
49 15431.44921875
50 14279.802734375
51 13224.126953125
52 12254.7724609375
53 11363.443359375
54 10543.6875
55 9789.0380859375
56 9093.796875
57 8453.8330078125
58 7863.2333984375
59 7317.99169921875
60 6813.978515625
61 6347.8857421875
62 591