Numpy provides an n-dimensional array object and many functions for manipulating these arrays. Numpy is a generic framework for scientific computing; it does not know anything about computation graphs, deep learning, or gradients. However, we can easily use numpy to fit a two-layer network to random data by manually implementing the forward and backward passes through the network using numpy operations:

In [88]:
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

In [60]:
# -*- coding: utf-8 -*-
import numpy as np

# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
# N, D_in, H, D_out = 64, 1000, 100, 10
N, D_in, H, D_out = 3, 4, 2, 2

# Create random input and output data
x = np.random.randn(N, D_in)
y = np.random.randn(N, D_out)

# Randomly initialize weights
w1 = np.random.randn(D_in, H)
w2 = np.random.randn(H, D_out)

In [61]:
x.shape

(3, 4)

In [62]:
x

array([[-1.67965798, -0.08477309,  0.64281521, -0.34688563],
       [ 2.77727155, -0.09237655,  0.00802048,  0.32063501],
       [-0.04064602,  0.38100734, -1.05766499,  0.15905582]])

In [63]:
w1

array([[ 0.68076699,  1.32732059],
       [ 1.44452066, -1.0294422 ],
       [ 0.63106914,  0.74884408],
       [ 0.35696426,  1.80052292]])

In [64]:
x.dot(w1)

array([[-0.98407712, -2.28538278],
       [ 1.87675168,  4.3647428 ],
       [-0.08797999, -0.95181786]])

In [66]:
w2

array([[-2.15469262, -1.11042737],
       [-0.37643149,  1.21851607]])

In [67]:
y

array([[ 0.1800653 ,  1.1367831 ],
       [ 0.53919874,  0.44426596],
       [-1.10453301, -0.56561339]])

In [72]:
learning_rate = 1e-3
for t in range(50):
    # Forward pass: compute predicted y
    h = x.dot(w1)
    h_relu = np.maximum(h, 0)
    y_pred = h_relu.dot(w2)

   
    # Compute and print loss
    loss = np.square(y_pred - y).sum()
    print(t, loss)

    # Backprop to compute gradients of loss with respect to w1 and w2
    grad_y_pred = 2.0 * (y_pred - y)
    grad_w2 = h_relu.T.dot(grad_y_pred)
    grad_h_relu = grad_y_pred.dot(w2.T)
    grad_h = grad_h_relu.copy()
    grad_h[h < 0] = 0
    grad_w1 = x.T.dot(grad_h)

    # Update weights
    w1 -= learning_rate * grad_w1
    w2 -= learning_rate * grad_w2

0 2.864611001842192
1 2.8646110018421913
2 2.864611001842191
3 2.86461100184219
4 2.864611001842189
5 2.8646110018421886
6 2.8646110018421878
7 2.864611001842187
8 2.8646110018421864
9 2.864611001842186
10 2.864611001842185
11 2.8646110018421846
12 2.864611001842184
13 2.8646110018421833
14 2.864611001842183
15 2.8646110018421824
16 2.8646110018421815
17 2.8646110018421815
18 2.8646110018421806
19 2.86461100184218
20 2.8646110018421798
21 2.8646110018421793
22 2.864611001842179
23 2.864611001842179
24 2.864611001842178
25 2.8646110018421775
26 2.864611001842177
27 2.864611001842177
28 2.8646110018421767
29 2.864611001842176
30 2.864611001842176
31 2.8646110018421753
32 2.8646110018421753
33 2.864611001842175
34 2.8646110018421744
35 2.8646110018421744
36 2.864611001842174
37 2.864611001842174
38 2.8646110018421735
39 2.8646110018421735
40 2.8646110018421727
41 2.8646110018421727
42 2.8646110018421727
43 2.864611001842172
44 2.8646110018421718
45 2.8646110018421718
46 2.8646110018421718
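
A hand-written backward pass is easy to get subtly wrong, so it can help to compare it against a numerical estimate. The sketch below is not part of the original notebook: it re-creates the same two-layer network with a fixed seed and checks the analytic gradients against central finite differences. The helper names (loss_fn, analytic_grads, numeric_grad) and the step size eps are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D_in, H, D_out = 3, 4, 2, 2
x = rng.standard_normal((N, D_in))
y = rng.standard_normal((N, D_out))
w1 = rng.standard_normal((D_in, H))
w2 = rng.standard_normal((H, D_out))


def loss_fn(w1, w2):
    # Same forward pass and summed squared error as the training loop
    h_relu = np.maximum(x.dot(w1), 0)
    return np.square(h_relu.dot(w2) - y).sum()


def analytic_grads(w1, w2):
    # Same backward pass as the training loop
    h = x.dot(w1)
    h_relu = np.maximum(h, 0)
    grad_y_pred = 2.0 * (h_relu.dot(w2) - y)
    grad_w2 = h_relu.T.dot(grad_y_pred)
    grad_h = grad_y_pred.dot(w2.T)
    grad_h[h < 0] = 0
    grad_w1 = x.T.dot(grad_h)
    return grad_w1, grad_w2


def numeric_grad(f, w, eps=1e-6):
    # Central differences: perturb one entry of w at a time, in place
    grad = np.zeros_like(w)
    for idx in np.ndindex(*w.shape):
        old = w[idx]
        w[idx] = old + eps
        f_plus = f()
        w[idx] = old - eps
        f_minus = f()
        w[idx] = old
        grad[idx] = (f_plus - f_minus) / (2 * eps)
    return grad


grad_w1, grad_w2 = analytic_grads(w1, w2)
num_w1 = numeric_grad(lambda: loss_fn(w1, w2), w1)
num_w2 = numeric_grad(lambda: loss_fn(w1, w2), w2)
print(np.allclose(grad_w1, num_w1, atol=1e-4))
print(np.allclose(grad_w2, num_w2, atol=1e-4))
```

The comparison can fail spuriously if a hidden activation sits almost exactly at zero, because ReLU is not differentiable there.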

PyTorch: Tensors

Numpy is a great framework, but it cannot utilize GPUs to accelerate its numerical computations. For modern deep neural networks, GPUs often provide speedups of 50x or greater, so unfortunately numpy won’t be enough for modern deep learning.

Here we introduce the most fundamental PyTorch concept: the Tensor. A PyTorch Tensor is conceptually identical to a numpy array: a Tensor is an n-dimensional array, and PyTorch provides many functions for operating on these Tensors. Like numpy arrays, PyTorch Tensors do not know anything about deep learning or computational graphs or gradients; they are a generic tool for scientific computing.

However, unlike numpy, PyTorch Tensors can utilize GPUs to accelerate their numeric computations. To run a PyTorch Tensor on GPU, you simply need to place it on the right device, for example by passing a device argument when creating it.
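
In current PyTorch releases, where a Tensor lives is usually controlled with a torch.device object. The following is a minimal sketch, assuming a recent PyTorch install; it falls back to the CPU when no CUDA device is available, and the variable names are illustrative only.

```python
import torch

# Pick the GPU if one is visible, otherwise stay on the CPU
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

# Create a Tensor directly on the chosen device
a = torch.randn(3, 4, device=device)

# Or create it on the CPU first and move it afterwards
b = torch.randn(4, 2)
b = b.to(device)

# Both operands must live on the same device
c = a.mm(b)
print(c.device, c.shape)

# Bring the result back to the CPU, e.g. before converting to numpy
c_cpu = c.cpu()
```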

Here we use PyTorch Tensors to fit a two-layer network to random data. Like the numpy example above, we need to manually implement the forward and backward passes through the network:

In [75]:
# -*- coding: utf-8 -*-

import torch


dtype = torch.float
device = torch.device("cpu")
# device = torch.device("cuda:0") # Uncomment this to run on GPU

# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

# Create random input and output data
x = torch.randn(N, D_in, device=device, dtype=dtype)
y = torch.randn(N, D_out, device=device, dtype=dtype)

# Randomly initialize weights
w1 = torch.randn(D_in, H, device=device, dtype=dtype)
w2 = torch.randn(H, D_out, device=device, dtype=dtype)

learning_rate = 1e-6
for t in range(500):
    # Forward pass: compute predicted y
    h = x.mm(w1)
    h_relu = h.clamp(min=0)
    y_pred = h_relu.mm(w2)

    # Compute and print loss
    loss = (y_pred - y).pow(2).sum().item()
    print(t, loss)

    # Backprop to compute gradients of loss with respect to w1 and w2
    grad_y_pred = 2.0 * (y_pred - y)
    grad_w2 = h_relu.t().mm(grad_y_pred)
    grad_h_relu = grad_y_pred.mm(w2.t())
    grad_h = grad_h_relu.clone()
    grad_h[h < 0] = 0
    grad_w1 = x.t().mm(grad_h)

    # Update weights using gradient descent
    w1 -= learning_rate * grad_w1
    w2 -= learning_rate * grad_w2

0 25233694.0
1 23002228.0
2 24326592.0
3 25894804.0
4 25161994.0
5 20851028.0
6 14577095.0
7 8811716.0
8 4958921.0
9 2804366.5
10 1697150.875
11 1128627.625
12 821972.625
13 641584.4375
14 524307.875
15 440944.8125
16 377357.71875
17 326647.125
18 284973.75
19 250090.28125
20 220447.3125
21 195002.0
22 173017.59375
23 153939.390625
24 137316.890625
25 122774.7265625
26 110019.1640625
27 98791.625
28 88886.2734375
29 80116.6640625
30 72346.3125
31 65432.6953125
32 59270.57421875
33 53766.96484375
34 48843.35546875
35 44433.08203125
36 40472.25
37 36908.5859375
38 33697.37890625
39 30800.328125
40 28183.82421875
41 25816.892578125
42 23672.603515625
43 21726.181640625
44 19961.556640625
45 18356.17578125
46 16894.205078125
47 15566.021484375
48 14355.1728515625
49 13248.474609375
50 12235.6767578125
51 11308.1669921875
52 10457.587890625
53 9677.5361328125
54 8962.3203125
55 8305.2109375
56 7700.4228515625
57 7143.56640625
58 6630.4501953125
59 6157.33447265625
60 5721.0810546875
61 5318

439 0.0001704092137515545
440 0.0001674112572800368
441 0.00016403938934672624
442 0.00016097180196084082
443 0.00015809535398148
444 0.00015529118536505848
445 0.0001522652746643871
446 0.0001495950564276427
447 0.00014693527191411704
448 0.00014424168330151588
449 0.00014203590399120003
450 0.0001391172845615074
451 0.00013621925609186292
452 0.00013383899931795895
453 0.00013172629405744374
454 0.00012943071487825364
455 0.0001272310473723337
456 0.00012451416114345193
457 0.00012270131264813244
458 0.00012068985233781859
459 0.00011847742280224338
460 0.00011672811524476856
461 0.00011495107901282609
462 0.00011290542170172557
463 0.0001111301826313138
464 0.00010928417032118887
465 0.00010724136518547311
466 0.000105615436041262
467 0.00010394397395430133
468 0.00010230938642052934
469 0.00010035462764790282
470 9.893770038615912e-05
471 9.745088027557358e-05
472 9.582490019965917e-05
473 9.423097799299285e-05
474 9.268492431147024e-05
475 9.129592217504978e-05
476 9.0150111645925

Autograd

PyTorch: Tensors and autograd

In the above examples, we had to manually implement both the forward and backward passes of our neural network. Manually implementing the backward pass is not a big deal for a small two-layer network, but can quickly get very hairy for large complex networks.

Thankfully, we can use automatic differentiation to automate the computation of backward passes in neural networks. The autograd package in PyTorch provides exactly this functionality. When using autograd, the forward pass of your network will define a computational graph; nodes in the graph will be Tensors, and edges will be functions that produce output Tensors from input Tensors. Backpropagating through this graph then allows you to easily compute gradients.

This sounds complicated, but it’s pretty simple to use in practice. Each Tensor represents a node in a computational graph. If x is a Tensor that has x.requires_grad=True, then after a backward pass x.grad is another Tensor holding the gradient of some scalar value with respect to x.
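
To make the link between autograd and the manual backward pass concrete, the sketch below computes the gradients of a small two-layer network both ways and compares them. It is an illustration rather than part of the original notebook: the sizes, the seed, and the use of double precision are assumptions made to keep the comparison tight.

```python
import torch

torch.manual_seed(0)
N, D_in, H, D_out = 4, 5, 3, 2
x = torch.randn(N, D_in, dtype=torch.double)
y = torch.randn(N, D_out, dtype=torch.double)
w1 = torch.randn(D_in, H, dtype=torch.double, requires_grad=True)
w2 = torch.randn(H, D_out, dtype=torch.double, requires_grad=True)

# Forward pass; autograd records the graph because w1 and w2 require grad
h = x.mm(w1)
h_relu = h.clamp(min=0)
y_pred = h_relu.mm(w2)
loss = (y_pred - y).pow(2).sum()

# Autograd fills w1.grad and w2.grad
loss.backward()

# The same gradients, written out by hand as in the earlier examples
with torch.no_grad():
    grad_y_pred = 2.0 * (y_pred - y)
    manual_w2 = h_relu.t().mm(grad_y_pred)
    grad_h = grad_y_pred.mm(w2.t())
    grad_h[h < 0] = 0
    manual_w1 = x.t().mm(grad_h)

print(torch.allclose(w1.grad, manual_w1))
print(torch.allclose(w2.grad, manual_w2))
```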

Here we use PyTorch Tensors and autograd to implement our two-layer network; now we no longer need to manually implement the backward pass through the network:

In [74]:
# -*- coding: utf-8 -*-
import torch

dtype = torch.float
device = torch.device("cpu")
# device = torch.device("cuda:0") # Uncomment this to run on GPU

# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

# Create random Tensors to hold input and outputs.
# Leaving requires_grad at its default of False indicates that we do not need
# to compute gradients with respect to these Tensors during the backward pass.
x = torch.randn(N, D_in, device=device, dtype=dtype)
y = torch.randn(N, D_out, device=device, dtype=dtype)

# Create random Tensors for weights.
# Setting requires_grad=True indicates that we want to compute gradients with
# respect to these Tensors during the backward pass.
w1 = torch.randn(D_in, H, device=device, dtype=dtype, requires_grad=True)
w2 = torch.randn(H, D_out, device=device, dtype=dtype, requires_grad=True)

learning_rate = 1e-6
for t in range(500):
    # Forward pass: compute predicted y using operations on Tensors; these
    # are exactly the same operations we used to compute the forward pass using
    # Tensors, but we do not need to keep references to intermediate values since
    # we are not implementing the backward pass by hand.
    y_pred = x.mm(w1).clamp(min=0).mm(w2)

    # Compute and print loss using operations on Tensors.
    # Now loss is a zero-dimensional Tensor (shape ()).
    # loss.item() gets the scalar value held in the loss.
    loss = (y_pred - y).pow(2).sum()
    print(t, loss.item())

    # Use autograd to compute the backward pass. This call will compute the
    # gradient of loss with respect to all Tensors with requires_grad=True.
    # After this call w1.grad and w2.grad will be Tensors holding the gradient
    # of the loss with respect to w1 and w2 respectively.
    loss.backward()

    # Manually update weights using gradient descent. Wrap in torch.no_grad()
    # because weights have requires_grad=True, but we don't need to track this
    # in autograd.
    # An alternative way is to operate on weight.data and weight.grad.data.
    # Recall that tensor.data gives a tensor that shares the storage with
    # tensor, but doesn't track history.
    # You can also use torch.optim.SGD to achieve this.
    with torch.no_grad():
        w1 -= learning_rate * w1.grad
        w2 -= learning_rate * w2.grad

        # Manually zero the gradients after updating weights
        w1.grad.zero_()
        w2.grad.zero_()

0 29700598.0
1 25226410.0
2 25419906.0
3 26285604.0
4 25099588.0
5 20640374.0
6 14429671.0
7 8793727.0
8 5029030.5
9 2896818.25
10 1785171.125
11 1203988.875
12 884656.0625
13 693464.5625
14 567827.0625
15 477857.25
16 409187.0625
17 354365.4375
18 309337.3125
19 271738.6875
20 239867.359375
21 212605.765625
22 189129.96875
23 168799.96875
24 151113.515625
25 135676.953125
26 122147.3359375
27 110237.265625
28 99716.90625
29 90397.0859375
30 82117.609375
31 74735.6875
32 68137.7734375
33 62233.5234375
34 56935.30078125
35 52166.49609375
36 47865.4296875
37 43978.921875
38 40459.85546875
39 37268.65234375
40 34368.515625
41 31729.1171875
42 29323.60546875
43 27127.58984375
44 25120.4921875
45 23283.93359375
46 21603.189453125
47 20060.74609375
48 18643.03125
49 17338.4453125
50 16137.263671875
51 15030.3505859375
52 14008.44140625
53 13064.9765625
54 12195.1318359375
55 11389.380859375
56 10643.30078125
57 9951.5830078125
58 9310.341796875
59 8715.5927734375
60 8163.0537109375
61 7649.3

442 0.0004352903342805803
443 0.000425031321356073
444 0.00041443013469688594
445 0.0004048759874422103
446 0.0003944701747968793
447 0.00038468322600238025
448 0.00037608068669214845
449 0.0003668770077638328
450 0.00035825828672386706
451 0.0003499225713312626
452 0.00034090003464370966
453 0.0003332629567012191
454 0.00032575841760262847
455 0.00031867812504060566
456 0.0003112089179921895
457 0.00030399911338463426
458 0.0002974079689010978
459 0.0002906351292040199
460 0.0002837925567291677
461 0.00027763034449890256
462 0.0002720996562857181
463 0.000266393821220845
464 0.0002599817526061088
465 0.00025407355860807
466 0.00024855678202584386
467 0.00024312268942594528
468 0.00023802924260962754
469 0.0002326656976947561
470 0.00022822350729256868
471 0.00022383870964404196
472 0.00021890435891691595
473 0.00021444754383992404
474 0.0002105297171510756
475 0.00020610821957234293
476 0.00020229077199473977
477 0.000197950066649355
478 0.00019415195856709033
479 0.000190640581422485
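
The comments in the cell above mention torch.optim.SGD as an alternative to updating the weights by hand. A minimal sketch of that variant follows, assuming the same network sizes and learning rate as the notebook; the seed, the shorter loop of 100 steps, and the losses list are additions for illustration.

```python
import torch

torch.manual_seed(0)
N, D_in, H, D_out = 64, 1000, 100, 10
x = torch.randn(N, D_in)
y = torch.randn(N, D_out)
w1 = torch.randn(D_in, H, requires_grad=True)
w2 = torch.randn(H, D_out, requires_grad=True)

# Plain SGD (no momentum) applies w -= lr * w.grad to each Tensor it is given
optimizer = torch.optim.SGD([w1, w2], lr=1e-6)

losses = []
for t in range(100):
    y_pred = x.mm(w1).clamp(min=0).mm(w2)
    loss = (y_pred - y).pow(2).sum()
    losses.append(loss.item())

    optimizer.zero_grad()  # clear gradients left over from the previous step
    loss.backward()        # populate w1.grad and w2.grad
    optimizer.step()       # update w1 and w2 in place

print(losses[0], losses[-1])
```

The optimizer takes over both the torch.no_grad() update and the manual zeroing of gradients, so the loop body no longer refers to individual weights.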

In [76]:
# -*- coding: utf-8 -*-
import torch


class MyReLU(torch.autograd.Function):
    """
    We can implement our own custom autograd Functions by subclassing
    torch.autograd.Function and implementing the forward and backward passes
    which operate on Tensors.
    """

    @staticmethod
    def forward(ctx, input):
        """
        In the forward pass we receive a Tensor containing the input and return
        a Tensor containing the output. ctx is a context object that can be used
        to stash information for backward computation. You can cache Tensors
        for use in the backward pass using the ctx.save_for_backward method.
        """
        ctx.save_for_backward(input)
        return input.clamp(min=0)

    @staticmethod
    def backward(ctx, grad_output):
        """
        In the backward pass we receive a Tensor containing the gradient of the loss
        with respect to the output, and we need to compute the gradient of the loss
        with respect to the input.
        """
        input, = ctx.saved_tensors
        grad_input = grad_output.clone()
        grad_input[input < 0] = 0
        return grad_input


dtype = torch.float
device = torch.device("cpu")
# device = torch.device("cuda:0") # Uncomment this to run on GPU

# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

# Create random Tensors to hold input and outputs.
x = torch.randn(N, D_in, device=device, dtype=dtype)
y = torch.randn(N, D_out, device=device, dtype=dtype)

# Create random Tensors for weights.
w1 = torch.randn(D_in, H, device=device, dtype=dtype, requires_grad=True)
w2 = torch.randn(H, D_out, device=device, dtype=dtype, requires_grad=True)

learning_rate = 1e-6
for t in range(500):
    # To apply our Function, we use the Function.apply method. We alias this as 'relu'.
    relu = MyReLU.apply

    # Forward pass: compute predicted y using operations; we compute
    # ReLU using our custom autograd operation.
    y_pred = relu(x.mm(w1)).mm(w2)

    # Compute and print loss
    loss = (y_pred - y).pow(2).sum()
    print(t, loss.item())

    # Use autograd to compute the backward pass.
    loss.backward()

    # Update weights using gradient descent
    with torch.no_grad():
        w1 -= learning_rate * w1.grad
        w2 -= learning_rate * w2.grad

        # Manually zero the gradients after updating weights
        w1.grad.zero_()
        w2.grad.zero_()

0 27166500.0
1 20970908.0
2 18985478.0
3 18315846.0
4 17370156.0
5 15410640.0
6 12438221.0
7 9178292.0
8 6286936.5
9 4127377.75
10 2672513.0
11 1755572.375
12 1191745.75
13 845611.1875
14 628256.4375
15 487178.96875
16 391240.25
17 322814.34375
18 271749.53125
19 232152.484375
20 200453.203125
21 174443.296875
22 152718.5
23 134373.6875
24 118769.15625
25 105323.9921875
26 93676.984375
27 83535.3046875
28 74668.0078125
29 66886.703125
30 60038.5703125
31 53990.5859375
32 48640.02734375
33 43893.4921875
34 39672.90625
35 35918.98046875
36 32566.08203125
37 29567.0234375
38 26886.876953125
39 24481.765625
40 22317.48046875
41 20368.8359375
42 18611.880859375
43 17024.146484375
44 15586.87109375
45 14284.61328125
46 13102.888671875
47 12029.373046875
48 11053.1962890625
49 10164.1591796875
50 9355.1064453125
51 8616.654296875
52 7942.0771484375
53 7325.32861328125
54 6760.859375
55 6243.7666015625
56 5769.6474609375
57 5334.5771484375
58 4935.037109375
59 4567.98779296875
60 4230.46337890

378 0.00010606206342345104
379 0.00010349374497309327
380 0.00010111406299984083
381 9.864105959422886e-05
382 9.646104444982484e-05
383 9.452350786887109e-05
384 9.235912875737995e-05
385 9.023745951708406e-05
386 8.815152250463143e-05
387 8.629106741864234e-05
388 8.44119640532881e-05
389 8.23202935862355e-05
390 8.063963468885049e-05
391 7.858459139242768e-05
392 7.688406913075596e-05
393 7.543944229837507e-05
394 7.373918924713507e-05
395 7.214029028546065e-05
396 7.06145801814273e-05
397 6.898605352034792e-05
398 6.752143235644326e-05
399 6.633148586843163e-05
400 6.514985579997301e-05
401 6.352246418828145e-05
402 6.249133730307221e-05
403 6.112233677413315e-05
404 6.0162070440128446e-05
405 5.900589167140424e-05
406 5.779591447208077e-05
407 5.642140968120657e-05
408 5.5429751228075475e-05
409 5.435301864054054e-05
410 5.3424781071953475e-05
411 5.248595334705897e-05
412 5.151376171852462e-05
413 5.076089655631222e-05
414 4.993747643311508e-05
415 4.9007936468115076e-05
416 4.79
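
A custom autograd Function can be checked with torch.autograd.gradcheck, which compares the hand-written backward against finite differences. The sketch below repeats the MyReLU definition so that it runs on its own. The input shape, the seed, and the trick of pushing inputs away from zero (where ReLU has no derivative) are assumptions for illustration; gradcheck expects double-precision inputs.

```python
import torch


class MyReLU(torch.autograd.Function):
    @staticmethod
    def forward(ctx, input):
        ctx.save_for_backward(input)
        return input.clamp(min=0)

    @staticmethod
    def backward(ctx, grad_output):
        input, = ctx.saved_tensors
        grad_input = grad_output.clone()
        grad_input[input < 0] = 0
        return grad_input


torch.manual_seed(0)
inp = torch.randn(5, 3, dtype=torch.double)
# Move every entry at least 0.1 away from the kink at zero
inp = inp + 0.1 * inp.sign()
inp.requires_grad_(True)

# Returns True when analytic and numerical gradients agree, raises otherwise
ok = torch.autograd.gradcheck(MyReLU.apply, (inp,), eps=1e-6, atol=1e-4)
print(ok)
```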

In [209]:
# understand the backpropagation and weight

In [211]:
# -*- coding: utf-8 -*-
import torch

# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 3, 4, 2, 1

# Create random Tensors to hold inputs and outputs
x = torch.randn(N, D_in)
y = torch.randn(N, D_out)

In [212]:
# Use the nn package to define our model as a sequence of layers. nn.Sequential
# is a Module which contains other Modules, and applies them in sequence to
# produce its output. Each Linear Module computes output from input using a
# linear function, and holds internal Tensors for its weight and bias.
model = torch.nn.Sequential(
    torch.nn.Linear(D_in, H),
    torch.nn.ReLU(),
    torch.nn.Linear(H, D_out),
)

In [213]:
# The nn package also contains definitions of popular loss functions; in this
# case we will use Mean Squared Error (MSE) as our loss function.
# reduction='sum' replaces the deprecated size_average=False argument.
loss_fn = torch.nn.MSELoss(reduction='sum')
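
The hand-written loops earlier used a summed squared error. In current PyTorch releases that corresponds to reduction="sum" on MSELoss, while the default reduction="mean" divides by the number of elements; the older size_average flag is deprecated. A small sketch with made-up random tensors shows the relationship:

```python
import torch

torch.manual_seed(0)
y_pred = torch.randn(3, 1)
y = torch.randn(3, 1)

loss_sum = torch.nn.MSELoss(reduction="sum")(y_pred, y)
loss_mean = torch.nn.MSELoss(reduction="mean")(y_pred, y)
manual = (y_pred - y).pow(2).sum()

print(torch.allclose(loss_sum, manual))
print(torch.allclose(loss_mean, manual / y.numel()))
```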

In [214]:
learning_rate = 1e-4
for t in range(500):
    # Forward pass: compute predicted y by passing x to the model. Module objects
    # override the __call__ operator so you can call them like functions. When
    # doing so you pass a Tensor of input data to the Module and it produces
    # a Tensor of output data.
    y_pred = model(x)

    # Compute and print loss. We pass Tensors containing the predicted and true
    # values of y, and the loss function returns a Tensor containing the
    # loss.
    loss = loss_fn(y_pred, y)
    if t == 499:
        print(t, loss.item())

    # Zero the gradients before running the backward pass.
    model.zero_grad()

    # Backward pass: compute gradient of the loss with respect to all the learnable
    # parameters of the model. Internally, the parameters of each Module are stored
    # in Tensors with requires_grad=True, so this call will compute gradients for
    # all learnable parameters in the model.
    loss.backward()

    # Update the weights using gradient descent. Each parameter is a Tensor, so
    # we can access its gradients like we did before.
    with torch.no_grad():
        # model.parameters() yields the weight and bias Tensors of each Linear layer
        for param in model.parameters():
            param -= learning_rate * param.grad

499 6.60897159576416


In [215]:
x

tensor([[ 1.3065, -0.6920, -2.0258,  0.2677],
        [-0.4559, -0.4160, -0.3827,  0.3101],
        [-1.2335, -1.3084,  0.9000,  1.0761]])

In [216]:
y_pred

tensor([[-0.3008],
        [-0.2863],
        [-0.2863]])

In [217]:
y

tensor([[ 1.4009],
        [-0.7626],
        [ 1.5809]])

In [218]:
for param in model.parameters():
    print(param)

Parameter containing:
tensor([[-0.1920, -0.2201, -0.2183,  0.0186],
        [ 0.1046,  0.3282, -0.1726, -0.0245]])
Parameter containing:
tensor([-0.3493, -0.1472])
Parameter containing:
tensor([[-0.5682, -0.1369]])
Parameter containing:
tensor([-0.2857])
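
The four parameters printed above are, in order, the weight and bias of the first Linear layer and the weight and bias of the second. A Linear layer stores its weight with shape (out_features, in_features) and computes x times the transposed weight, plus the bias. The sketch below builds a fresh model with the same sizes (the seed and variable names are assumptions) and reproduces its output by hand from those parameters:

```python
import torch

torch.manual_seed(0)
N, D_in, H, D_out = 3, 4, 2, 1
x = torch.randn(N, D_in)

model = torch.nn.Sequential(
    torch.nn.Linear(D_in, H),
    torch.nn.ReLU(),
    torch.nn.Linear(H, D_out),
)

# Names are "<index in Sequential>.<weight|bias>"; ReLU has no parameters
for name, param in model.named_parameters():
    print(name, tuple(param.shape))

W1, b1 = model[0].weight, model[0].bias  # shapes (H, D_in) and (H,)
W2, b2 = model[2].weight, model[2].bias  # shapes (D_out, H) and (D_out,)

with torch.no_grad():
    # Linear: x @ W^T + b, then ReLU, then Linear again
    manual = (x.mm(W1.t()) + b1).clamp(min=0).mm(W2.t()) + b2
    print(torch.allclose(manual, model(x)))
```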


In [130]:
ts1 = torch.tensor([[-0.4191,  0.7407, -0.1794, -0.2367],
        [ 0.3248, -0.2662, -0.2573, -0.0398]])

In [146]:
ts1

tensor([[-0.4191,  0.7407, -0.1794, -0.2367],
        [ 0.3248, -0.2662, -0.2573, -0.0398]])

In [144]:
ts1.size()

torch.Size([2, 4])

In [149]:
ts1t = ts1.transpose(1,0)

In [151]:
a1 = torch.mm(x, ts1t )

In [160]:
a1

tensor([[-0.7253,  1.0946],
        [ 1.3099, -0.5605],
        [ 0.1304,  0.3551]])

In [161]:
ts2 = torch.tensor(([ 0.5273,  0.4526]))#.transpose(1,0)
ts2

tensor([ 0.5273,  0.4526])

In [155]:
ts2.size()

torch.Size([2])

In [157]:
ts2.transpose(-1,0)

tensor([ 0.5273,  0.4526])

In [158]:
ts2.size()

torch.Size([2])

In [162]:
torch.mm(a1, ts2)

RuntimeError: dimension out of range (expected to be in range of [-1, 0], but got 1)
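
The error above arises because torch.mm only accepts two 2-D Tensors, and ts2 is 1-D (transposing a 1-D Tensor leaves it unchanged). A few ways around this are sketched below, using the values of a1 and ts2 printed earlier; the result names are illustrative.

```python
import torch

a1 = torch.tensor([[-0.7253,  1.0946],
                   [ 1.3099, -0.5605],
                   [ 0.1304,  0.3551]])
ts2 = torch.tensor([0.5273, 0.4526])

# Matrix-vector product: (3, 2) times (2,) gives (3,)
out_mv = torch.mv(a1, ts2)

# Turn the vector into a (2, 1) column so that mm applies: result is (3, 1)
out_mm = torch.mm(a1, ts2.unsqueeze(1))

# The @ operator (torch.matmul) handles the 1-D case itself: result is (3,)
out_matmul = a1 @ ts2

print(out_mv.shape, out_mm.shape, out_matmul.shape)
```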

In [203]:
# what exactly does backward do?
x = torch.tensor([[1., -1.], [1., 1.]], requires_grad=True)
out = x.pow(2).sum()
out.backward()
x.grad

tensor([[ 2., -2.],
        [ 2.,  2.]])
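
For out = sum(x^2), the gradient with respect to x is 2x, which matches the values above. One detail worth knowing is that backward() adds into x.grad rather than overwriting it, which is why the training loops zero the gradients on every step. A short sketch of that behaviour, with illustrative variable names:

```python
import torch

x = torch.tensor([[1., -1.], [1., 1.]], requires_grad=True)

# First backward pass: x.grad becomes 2 * x
x.pow(2).sum().backward()
first = x.grad.clone()

# A second forward/backward pass accumulates into the same x.grad: now 4 * x
x.pow(2).sum().backward()
second = x.grad.clone()

# Reset in place before the next step
x.grad.zero_()

print(first)
print(second)
print(x.grad)
```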

In [192]:
x

tensor([[ 1., -1.],
        [ 1.,  1.]])

In [193]:
x.size()

torch.Size([2, 2])

In [194]:
out

tensor(4.)

In [195]:
out.backward()

In [196]:
x.grad

tensor([[ 2., -2.],
        [ 2.,  2.]])

In [200]:
a = x.grad.transpose(0, 1)

In [202]:
a

tensor([[ 2.,  2.],
        [-2.,  2.]])

In [201]:
torch.mm(x, a)

tensor([[ 4.,  0.],
        [ 0.,  4.]])

In [189]:
tensor = torch.ones((2,3), dtype=torch.int8)
data = [[0, 1], [2, 3]]
tensor.new_tensor(data)

tensor([[ 0,  1],
        [ 2,  3]], dtype=torch.int8)

In [188]:
data = [[0, 1], [2, 3]]
torch.tensor(data).dtype

torch.int64

In [190]:
tensor

tensor([[ 1,  1,  1],
        [ 1,  1,  1]], dtype=torch.int8)

In [181]:
tensor.size()

torch.Size([2, 3])