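This tutorial introduces the basic concepts of PyTorch through self-contained examples.
At its core, PyTorch provides two main features:
- An n-dimensional Tensor, similar to a NumPy array but able to run on GPUs
- Automatic differentiation for building and training neural networks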
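
The two features listed above can be illustrated with a minimal sketch. The tensor values here are arbitrary and chosen only for illustration; they do not come from the tutorial's later examples.

```python
import torch

# Feature 1: Tensors support NumPy-like array operations.
a = torch.ones(2, 3)
b = a.mm(a.t())  # (2, 3) x (3, 2) matrix product -> shape (2, 2)

# Feature 2: autograd records operations on Tensors that have
# requires_grad=True and can compute gradients for them.
x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
y = (x ** 2).sum()  # y = x1^2 + x2^2 + x3^2
y.backward()        # fills x.grad with dy/dx = 2 * x
print(b.shape, x.grad)
```

After `backward()`, `x.grad` holds the gradient of `y` with respect to `x`, which for this function is `2 * x`.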

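The examples below use a fully connected ReLU network with a single hidden layer. It is trained with gradient descent to fit random data by minimizing the squared Euclidean distance between the network output and the true output.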

In [1]:
import torch
torch.__version__

'1.0.0'

### Tensors

#### Warm-up: NumPy

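NumPy provides an n-dimensional array object and many functions for manipulating arrays. It is a general framework for scientific computing and knows nothing about computation graphs, deep learning, or gradients. Even so, we can easily use NumPy to fit a two-layer network by implementing the forward and backward passes by hand.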

In [2]:
import numpy as np

# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10


In [3]:
# Create random input and output data
x = np.random.randn(N, D_in) # 64*1000
y = np.random.randn(N, D_out) # 64*10


In [4]:
# Randomly initialize weights
w1 = np.random.randn(D_in, H) # 1000*100
w2 = np.random.randn(H, D_out) #100*10

In [5]:
learning_rate = 1e-6
for t in range(500):
    # Forward pass: compute predicted y
    h = x.dot(w1) # 64*100
    h_relu = np.maximum(h, 0)
    y_pred = h_relu.dot(w2)

    # Compute and print loss
    loss = np.square(y_pred - y).sum()
    print(t, loss)

    # Backprop to compute gradients of w1 and w2 with respect to loss
    grad_y_pred = 2.0 * (y_pred - y)
    grad_w2 = h_relu.T.dot(grad_y_pred)
    grad_h_relu = grad_y_pred.dot(w2.T)
    grad_h = grad_h_relu.copy()
    grad_h[h < 0] = 0
    grad_w1 = x.T.dot(grad_h)

    # Update weights
    w1 -= learning_rate * grad_w1
    w2 -= learning_rate * grad_w2

0 28958576.350019682
1 22365420.804534644
2 19189189.574356735
3 16584854.506709348
4 13638920.651529526
5 10446100.22967545
6 7495926.168461818
7 5159493.744811254
8 3503850.244585237
9 2410146.796518678
10 1709318.7210664763
11 1259568.7504110925
12 965188.881802811
13 765324.6502102218
14 623667.3539769723
15 518924.7048754768
16 438559.43902589
17 374901.1448789744
18 323263.87139005115
19 280640.53322817874
20 244947.0413094657
21 214706.60657176166
22 188961.4117286838
23 166846.34528998914
24 147753.00418768259
25 131201.49934245006
26 116774.07907213
27 104148.20373561876
28 93076.10145394402
29 83352.18371278283
30 74778.0184429689
31 67197.8715920417
32 60480.002997679534
33 54511.078333231475
34 49201.245771077454
35 44466.04463404311
36 40233.725490821125
37 36446.460512661
38 33054.300161038664
39 30010.210636413765
40 27271.457298920788
41 24806.580712781746
42 22585.93325565823
43 20582.397175215698
44 18771.72441113919
45 17134.532514807892
46 15651.771593680905
...
384 3.8149490376865714e-06
385 3.590543580726219e-06
386 3.379344063316113e-06
387 3.180635374202221e-06
388 2.9935984291880825e-06
389 2.817623323887632e-06
390 2.651972605555563e-06
391 2.4960949651018845e-06
392 2.349433136532214e-06
393 2.2113637072894683e-06
394 2.0814221069900735e-06
395 1.959134481953547e-06
396 1.844041093132007e-06
397 1.735742679344427e-06
398 1.6337972969831432e-06
399 1.537856049372566e-06
400 1.4475552848771188e-06
401 1.3625754425393204e-06
402 1.2825945674659461e-06
403 1.2072981763567572e-06
404 1.1364375965835407e-06
405 1.0697462880980658e-06
406 1.0069689458030403e-06
407 9.478790481869049e-07
408 8.92263030636611e-07
409 8.399139961368447e-07
410 7.906559879741601e-07
411 7.442916345280498e-07
412 7.0063885246619e-07
413 6.595636106434961e-07
414 6.208907337193296e-07
415 5.844857020697227e-07
416 5.502170445429194e-07
417 5.179663992474502e-07
418 4.876054778044821e-07
419 4.5903944817253343e-07
420 4.3214508886735013e-07
421 4.068289362572598e-07
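
The hand-derived gradients in the NumPy loop above can be sanity-checked against central finite differences. The sketch below is not part of the original tutorial: the helper names `loss_fn`, `analytic_grads`, and `numeric_grad` are hypothetical, and the layer sizes are shrunk so the check runs quickly.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D_in, H, D_out = 4, 5, 3, 2  # tiny sizes (assumption, for speed only)
x = rng.standard_normal((N, D_in))
y = rng.standard_normal((N, D_out))
w1 = rng.standard_normal((D_in, H))
w2 = rng.standard_normal((H, D_out))


def loss_fn(w1, w2):
    """Same loss as the tutorial: sum of squared errors."""
    h_relu = np.maximum(x.dot(w1), 0)
    return np.square(h_relu.dot(w2) - y).sum()


def analytic_grads(w1, w2):
    """Same backward-pass formulas as the training loop above."""
    h = x.dot(w1)
    h_relu = np.maximum(h, 0)
    grad_y_pred = 2.0 * (h_relu.dot(w2) - y)
    grad_w2 = h_relu.T.dot(grad_y_pred)
    grad_h = grad_y_pred.dot(w2.T)
    grad_h[h < 0] = 0
    return x.T.dot(grad_h), grad_w2


def numeric_grad(f, w, eps=1e-6):
    """Central finite differences, perturbing one entry of w at a time."""
    grad = np.zeros_like(w)
    for idx in np.ndindex(*w.shape):
        old = w[idx]
        w[idx] = old + eps
        plus = f()
        w[idx] = old - eps
        minus = f()
        w[idx] = old  # restore the original value
        grad[idx] = (plus - minus) / (2 * eps)
    return grad


grad_w1, grad_w2 = analytic_grads(w1, w2)
num_w1 = numeric_grad(lambda: loss_fn(w1, w2), w1)
num_w2 = numeric_grad(lambda: loss_fn(w1, w2), w2)
max_err = max(np.abs(grad_w1 - num_w1).max(), np.abs(grad_w2 - num_w2).max())
print("max abs difference:", max_err)
```

A small `max_err` suggests the manual backward pass matches the loss it is supposed to differentiate. The check can be unreliable if a hidden activation happens to sit almost exactly at zero, where ReLU is not differentiable.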


#### PyTorch: Tensors

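NumPy is a great framework, but it cannot use GPUs to accelerate its numerical computations. For modern deep neural networks, GPUs often provide speedups of 50x or more, so unfortunately NumPy alone is not enough for deep learning. A PyTorch Tensor fills this gap: it is conceptually similar to a NumPy array and can be created from one, and to run it on a GPU you only need to move it to the GPU device, casting it to a different dtype if needed.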
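
As a minimal sketch of the conversion described above, the snippet below creates a Tensor from a NumPy array, moves it to a device, and casts its dtype. The array contents are arbitrary, and the device falls back to the CPU when no GPU is available.

```python
import numpy as np
import torch

a = np.arange(6, dtype=np.float64).reshape(2, 3)

# torch.from_numpy shares memory with the source array and keeps its dtype.
t = torch.from_numpy(a)

# Pick the GPU if one is available, otherwise stay on the CPU.
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

# .to() moves the Tensor to the target device and casts it in one call.
t32 = t.to(device=device, dtype=torch.float32)

# Going back to NumPy requires a CPU Tensor.
back = t32.cpu().numpy()
print(t.dtype, t32.dtype, t32.device, back.dtype)
```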

In [6]:
import torch


dtype = torch.float
device = torch.device("cpu")
# device = torch.device("cuda:0") # Uncomment this to run on GPU

# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

# Create random input and output data
x = torch.randn(N, D_in, device=device, dtype=dtype)
y = torch.randn(N, D_out, device=device, dtype=dtype)

# Randomly initialize weights
w1 = torch.randn(D_in, H, device=device, dtype=dtype)
w2 = torch.randn(H, D_out, device=device, dtype=dtype)

learning_rate = 1e-6
for t in range(500):
    # Forward pass: compute predicted y
    h = x.mm(w1)
    h_relu = h.clamp(min=0)
    y_pred = h_relu.mm(w2)

    # Compute and print loss
    loss = (y_pred - y).pow(2).sum().item()
    print(t, loss)

    # Backprop to compute gradients of w1 and w2 with respect to loss
    grad_y_pred = 2.0 * (y_pred - y)
    grad_w2 = h_relu.t().mm(grad_y_pred)
    grad_h_relu = grad_y_pred.mm(w2.t())
    grad_h = grad_h_relu.clone()
    grad_h[h < 0] = 0
    grad_w1 = x.t().mm(grad_h)

    # Update weights using gradient descent
    w1 -= learning_rate * grad_w1
    w2 -= learning_rate * grad_w2

0 27811180.0
1 22621152.0
2 21444972.0
3 21299924.0
4 20385646.0
5 17887780.0
6 14070144.0
7 10004315.0
8 6587963.0
9 4179952.0
10 2648428.75
11 1727941.375
12 1182779.875
13 856333.625
14 653502.9375
15 521329.0
16 430104.46875
17 363580.8125
18 312613.0
19 272014.8125
20 238745.5625
21 210867.09375
22 187232.453125
23 166931.875
24 149341.734375
25 133990.0625
26 120530.34375
27 108677.203125
28 98206.765625
29 88920.9921875
30 80659.9375
31 73294.65625
32 66706.671875
33 60816.65234375
34 55540.75390625
35 50791.5625
36 46511.015625
37 42646.5625
38 39152.5703125
39 35989.0703125
40 33118.8203125
41 30510.7421875
42 28137.82421875
43 25974.45703125
44 24001.109375
45 22199.44921875
46 20550.89453125
47 19041.3359375
48 17657.169921875
49 16386.10546875
50 15218.7802734375
51 14144.380859375
52 13155.98046875
53 12245.359375
54 11405.267578125
55 10630.3046875
56 9914.236328125
57 9252.3369140625
58 8641.04296875
59 8075.0888671875
60 7550.53564453125
61 7064.0888671875
62 6612.25488

409 0.0020090541802346706
410 0.0019443284254521132
411 0.0018787927692756057
412 0.0018177597085013986
413 0.0017581165302544832
414 0.0017034081974998116
415 0.0016494517913088202
416 0.0015948006184771657
417 0.0015458507696166635
418 0.0014972281642258167
419 0.0014486000873148441
420 0.0014052803162485361
421 0.001361504546366632
422 0.0013165536802262068
423 0.001277274452149868
424 0.0012400309788063169
425 0.001200544647872448
426 0.0011667788494378328
427 0.0011303613428026438
428 0.001095717423595488
429 0.0010622999398037791
430 0.001031265826895833
431 0.0010012057609856129
432 0.0009732930338941514
433 0.00094562181038782
434 0.0009168715914711356
435 0.0008909876341931522
436 0.0008663850603625178
437 0.000841077824588865
438 0.0008184680482372642
439 0.0007957076304592192
440 0.0007734476821497083
441 0.000752027437556535
442 0.0007310989312827587
443 0.0007130213780328631
444 0.0006943501066416502
445 0.0006744704442098737
446 0.0006563618080690503
447 0.000638870173133

### Autograd

#### PyTorch: Tensors and autograd

In [7]:
import torch

dtype = torch.float
device = torch.device("cpu")
# device = torch.device("cuda:0") # Uncomment this to run on GPU

# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

# Create random Tensors to hold input and outputs.
# requires_grad defaults to False, which indicates that we do not need to
# compute gradients with respect to these Tensors during the backward pass.
x = torch.randn(N, D_in, device=device, dtype=dtype)
y = torch.randn(N, D_out, device=device, dtype=dtype)

# Create random Tensors for weights.
# Setting requires_grad=True indicates that we want to compute gradients with
# respect to these Tensors during the backward pass.
w1 = torch.randn(D_in, H, device=device, dtype=dtype, requires_grad=True)
w2 = torch.randn(H, D_out, device=device, dtype=dtype, requires_grad=True)

learning_rate = 1e-6
for t in range(500):
    # Forward pass: compute predicted y using operations on Tensors; these
    # are exactly the same operations we used to compute the forward pass using
    # Tensors, but we do not need to keep references to intermediate values since
    # we are not implementing the backward pass by hand.
    y_pred = x.mm(w1).clamp(min=0).mm(w2)

    # Compute and print loss using operations on Tensors.
    # Now loss is a 0-dimensional Tensor (shape ());
    # loss.item() gets the scalar value held in the loss.
    loss = (y_pred - y).pow(2).sum()
    print(t, loss.item())

    # Use autograd to compute the backward pass. This call will compute the
    # gradient of loss with respect to all Tensors with requires_grad=True.
    # After this call w1.grad and w2.grad will be Tensors holding the gradient
    # of the loss with respect to w1 and w2 respectively.
    loss.backward()

    # Manually update weights using gradient descent. Wrap in torch.no_grad()
    # because weights have requires_grad=True, but we don't need to track this
    # in autograd.
    # An alternative way is to operate on weight.data and weight.grad.data.
    # Recall that tensor.data gives a tensor that shares the storage with
    # tensor, but doesn't track history.
    # You can also use torch.optim.SGD to achieve this.
    with torch.no_grad():
        w1 -= learning_rate * w1.grad
        w2 -= learning_rate * w2.grad

        # Manually zero the gradients after updating weights
        w1.grad.zero_()
        w2.grad.zero_()

0 25926948.0
1 20848332.0
2 20564796.0
3 22190614.0
4 23376976.0
5 22297840.0
6 18272966.0
7 12830817.0
8 7923541.0
9 4582182.0
10 2641209.75
11 1606126.125
12 1060568.25
13 763505.5
14 590064.75
15 479491.625
16 402520.375
17 344813.34375
18 299194.1875
19 261840.859375
20 230569.453125
21 203966.03125
22 181126.75
23 161374.28125
24 144202.875
25 129211.359375
26 116052.6328125
27 104476.234375
28 94253.390625
29 85209.9375
30 77181.71875
31 70031.984375
32 63649.13671875
33 57936.921875
34 52822.38671875
35 48225.8359375
36 44089.44140625
37 40363.01171875
38 36997.17578125
39 33951.875
40 31189.02734375
41 28680.1640625
42 26399.767578125
43 24324.29296875
44 22432.46875
45 20705.716796875
46 19127.4140625
47 17683.6796875
48 16361.427734375
49 15149.3125
50 14036.28515625
51 13014.1064453125
52 12074.263671875
53 11208.826171875
54 10411.4814453125
55 9676.341796875
56 8998.1015625
57 8371.7705078125
58 7792.9130859375
59 7257.728515625
60 6762.2802734375
61 6303.6962890625
...
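
The training loop above zeroes `w1.grad` and `w2.grad` after every update because `backward()` accumulates into `.grad` rather than overwriting it. A minimal sketch of that behaviour, using an arbitrary toy tensor `w` that is not part of the tutorial:

```python
import torch

w = torch.ones(3, requires_grad=True)

# First backward pass: d/dw sum(2 * w) = 2 for every element.
(w * 2).sum().backward()
first = w.grad.clone()

# Second backward pass without zeroing: the new gradient is added
# to the existing one, so every element becomes 4.
(w * 2).sum().backward()
second = w.grad.clone()

# Zero the gradient in place, as the training loop does.
w.grad.zero_()
print(first, second, w.grad)
```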

#### PyTorch: Defining new autograd functions

We can define our own autograd operator, here MyReLU, by subclassing torch.autograd.Function and implementing its forward and backward methods:

In [8]:
class MyReLU(torch.autograd.Function):
    """
    We can implement our own custom autograd Functions by subclassing
    torch.autograd.Function and implementing the forward and backward passes
    which operate on Tensors.
    """

    @staticmethod
    def forward(ctx, input):
        """
        In the forward pass we receive a Tensor containing the input and return
        a Tensor containing the output. ctx is a context object that can be used
        to stash information for backward computation. You can cache Tensors for
        use in the backward pass using the ctx.save_for_backward method; other
        objects can be stored as attributes on ctx.
        """
        ctx.save_for_backward(input)
        return input.clamp(min=0)

    @staticmethod
    def backward(ctx, grad_output):
        """
        In the backward pass we receive a Tensor containing the gradient of the loss
        with respect to the output, and we need to compute the gradient of the loss
        with respect to the input.
        """
        input, = ctx.saved_tensors
        grad_input = grad_output.clone()
        grad_input[input < 0] = 0
        return grad_input


dtype = torch.float
device = torch.device("cpu")
# device = torch.device("cuda:0") # Uncomment this to run on GPU

# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

# Create random Tensors to hold input and outputs.
x = torch.randn(N, D_in, device=device, dtype=dtype)
y = torch.randn(N, D_out, device=device, dtype=dtype)

# Create random Tensors for weights.
w1 = torch.randn(D_in, H, device=device, dtype=dtype, requires_grad=True)
w2 = torch.randn(H, D_out, device=device, dtype=dtype, requires_grad=True)

learning_rate = 1e-6
for t in range(500):
    # To apply our Function, we use Function.apply method. We alias this as 'relu'.
    relu = MyReLU.apply

    # Forward pass: compute predicted y using operations; we compute
    # ReLU using our custom autograd operation.
    y_pred = relu(x.mm(w1)).mm(w2)

    # Compute and print loss
    loss = (y_pred - y).pow(2).sum()
    print(t, loss.item())

    # Use autograd to compute the backward pass.
    loss.backward()

    # Update weights using gradient descent
    with torch.no_grad():
        w1 -= learning_rate * w1.grad
        w2 -= learning_rate * w2.grad

        # Manually zero the gradients after updating weights
        w1.grad.zero_()
        w2.grad.zero_()

0 45001164.0
1 43679672.0
2 41144116.0
3 31395926.0
4 18558364.0
5 9181109.0
6 4580579.5
7 2651858.5
8 1816941.5
9 1391145.75
10 1128448.875
11 941960.75
12 798746.9375
13 684180.75
14 590500.0625
15 512758.3125
16 447638.65625
17 392662.84375
18 345857.375
19 305752.8125
20 271250.34375
21 241377.375
22 215411.140625
23 192702.0
24 172814.546875
25 155314.90625
26 139870.890625
27 126210.5
28 114093.2265625
29 103321.296875
30 93711.109375
31 85120.0234375
32 77425.296875
33 70520.5
34 64314.87109375
35 58724.1796875
36 53680.19921875
37 49123.5234375
38 44997.92578125
39 41260.984375
40 37871.953125
41 34793.6640625
42 31994.763671875
43 29446.544921875
44 27124.7890625
45 25005.787109375
46 23069.86328125
47 21298.49609375
48 19676.41015625
49 18190.640625
50 16828.7109375
51 15578.28515625
52 14429.5078125
53 13373.84375
54 12402.4482421875
55 11507.79296875
56 10683.4921875
57 9923.5625
58 9223.044921875
59 8577.1474609375
60 7980.357421875
61 7428.740234375
62 6918.447265625
...
392 0.0012766688596457243
393 0.0012360293185338378
394 0.0011969381012022495
395 0.0011607768246904016
396 0.0011253359261900187
397 0.0010900308843702078
398 0.0010561617091298103
399 0.0010238498216494918
400 0.0009933901019394398
401 0.0009628396364860237
402 0.0009339578682556748
403 0.0009061859454959631
404 0.0008779699564911425
405 0.0008517073001712561
406 0.000826644420158118
407 0.0008021410903893411
408 0.0007782618049532175
409 0.0007564020925201476
410 0.000735228939447552
411 0.0007132455939427018
412 0.0006930502131581306
413 0.0006733828922733665
414 0.000654185947496444
415 0.0006354742217808962
416 0.0006175328744575381
417 0.0006004718015901744
418 0.0005844010156579316
419 0.0005676964647136629
420 0.0005524905282072723
421 0.0005382957751862705
422 0.0005232636467553675
423 0.0005103001021780074
424 0.0004965214757248759
425 0.00048314110608771443
426 0.00047040751087479293
427 0.00045820578816346824
428 0.0004456683818716556
429 0.0004339969018474221
430 0.000423
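
One way to gain confidence in a custom backward method such as the one above is `torch.autograd.gradcheck`, which compares the analytic gradient against finite differences. The sketch below repeats the MyReLU definition so that it is self-contained; the input values are arbitrary and deliberately kept away from zero, where ReLU has a kink.

```python
import torch


class MyReLU(torch.autograd.Function):
    """Same custom ReLU as above, repeated so this snippet runs on its own."""

    @staticmethod
    def forward(ctx, input):
        ctx.save_for_backward(input)
        return input.clamp(min=0)

    @staticmethod
    def backward(ctx, grad_output):
        input, = ctx.saved_tensors
        grad_input = grad_output.clone()
        grad_input[input < 0] = 0
        return grad_input


# gradcheck expects double-precision inputs that require gradients.
inp = torch.tensor([[-2.0, -0.5, 0.7], [1.5, -1.0, 3.0]],
                   dtype=torch.double, requires_grad=True)
ok = torch.autograd.gradcheck(MyReLU.apply, (inp,), eps=1e-6, atol=1e-4)
print("gradcheck passed:", ok)
```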

#### TensorFlow: Static Graphs

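TensorFlow uses static computation graphs, while PyTorch uses dynamic ones. In other words, in TensorFlow we define the computation graph once and then execute it over and over, possibly feeding it different input data; in PyTorch, each forward pass defines a new computation graph. The code in this section uses the TensorFlow 1.x graph API (tf.placeholder, tf.Session); TensorFlow 2.x executes eagerly by default.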

In [9]:
import tensorflow as tf
import numpy as np

# First we set up the computational graph:

# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

# Create placeholders for the input and target data; these will be filled
# with real data when we execute the graph.
x = tf.placeholder(tf.float32, shape=(None, D_in))
y = tf.placeholder(tf.float32, shape=(None, D_out))

# Create Variables for the weights and initialize them with random data.
# A TensorFlow Variable persists its value across executions of the graph.
w1 = tf.Variable(tf.random_normal((D_in, H)))
w2 = tf.Variable(tf.random_normal((H, D_out)))

# Forward pass: Compute the predicted y using operations on TensorFlow Tensors.
# Note that this code does not actually perform any numeric operations; it
# merely sets up the computational graph that we will later execute.
h = tf.matmul(x, w1)
h_relu = tf.maximum(h, tf.zeros(1))
y_pred = tf.matmul(h_relu, w2)

# Compute loss using operations on TensorFlow Tensors
loss = tf.reduce_sum((y - y_pred) ** 2.0)

# Compute gradient of the loss with respect to w1 and w2.
grad_w1, grad_w2 = tf.gradients(loss, [w1, w2])

# Update the weights using gradient descent. To actually update the weights
# we need to evaluate new_w1 and new_w2 when executing the graph. Note that
# in TensorFlow the act of updating the value of the weights is part of
# the computational graph; in PyTorch this happens outside the computational
# graph.
learning_rate = 1e-6
new_w1 = w1.assign(w1 - learning_rate * grad_w1)
new_w2 = w2.assign(w2 - learning_rate * grad_w2)

# Now we have built our computational graph, so we enter a TensorFlow session to
# actually execute the graph.
with tf.Session() as sess:
    # Run the graph once to initialize the Variables w1 and w2.
    sess.run(tf.global_variables_initializer())

    # Create numpy arrays holding the actual data for the inputs x and targets
    # y
    x_value = np.random.randn(N, D_in)
    y_value = np.random.randn(N, D_out)
    for _ in range(500):
        # Execute the graph many times. Each time it executes we want to bind
        # x_value to x and y_value to y, specified with the feed_dict argument.
        # Each time we execute the graph we want to compute the values for loss,
        # new_w1, and new_w2; the values of these Tensors are returned as numpy
        # arrays.
        loss_value, _, _ = sess.run([loss, new_w1, new_w2],
                                    feed_dict={x: x_value, y: y_value})
        print(loss_value)

22893544.0
16164776.0
12798790.0
10716744.0
9152317.0
7747419.5
6436664.5
5269739.0
4357545.0
3530537.5
2828006.5
2251097.5
1782578.5
1407333.8
1113942.8
885945.06
710336.8
574883.75
470527.56
389668.75
326321.28
276243.75
236266.56
203962.12
177604.88
155844.23
137682.47
122370.484
109357.39
98181.28
88509.75
80071.97
72667.875
66128.15
60326.316
55153.336
50519.35
46355.23
42604.723
39212.824
36143.46
33358.215
30825.684
28514.885
26403.906
24472.105
22702.38
21077.566
19584.258
18209.824
16943.828
15776.969
14702.822
13712.971
12798.141
11951.812
11167.579
10440.693
9766.479
9140.395
8558.902
8018.0757
7515.177
7046.9517
6611.1807
6204.8496
5825.76
5472.078
5142.0464
4833.842
4545.838
4276.539
4024.621
3789.0442
3568.4258
3361.77
3168.267
2986.752
2816.4607
2656.7446
2506.8828
2366.145
2233.9365
2109.752
1992.9669
1883.103
1779.7732
1682.571
1591.065
1504.8848
1423.7108
1347.3025
1275.2222
1207.2258
1143.1221
1082.6945
1025.6704
971.82056
920.97986
872.9823
827.6333
784.78705
744.28
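
For contrast with the static graph above, here is a minimal sketch of a dynamic graph in PyTorch: the number of multiplications is decided by ordinary Python control flow on each call, and autograd differentiates whichever graph was actually built. The function `forward` and its `n_repeats` argument are hypothetical names used only for illustration.

```python
import torch


def forward(x, w, n_repeats):
    """Multiply x by w n_repeats times; the graph depends on n_repeats."""
    out = x
    for _ in range(n_repeats):
        out = out * w
    return out.sum()


w = torch.tensor(2.0, requires_grad=True)
x = torch.ones(3)

grads = []
for n in (1, 3):
    w.grad = None                # discard the gradient from the previous graph
    forward(x, w, n).backward()  # builds and differentiates a new graph
    grads.append(w.grad.item())

# The loss is 3 * w**n, so the gradient is 3 * n * w**(n - 1).
print(grads)
```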

### nn module

#### PyTorch: nn

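Computation graphs and autograd are a very powerful paradigm for defining complex operators and automatically taking derivatives, but for large neural networks raw autograd can be a bit too low-level. When building neural networks we usually arrange the computation into layers, some of which have learnable parameters that are optimized during training. In PyTorch, the nn package provides Modules for building such layers, along with a set of useful loss functions for training networks.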
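
As a small sketch of what the nn package manages, the snippet below builds a three-module network, lists the learnable parameters it holds, and checks that `MSELoss(reduction='sum')` matches a hand-written sum of squared errors. The sizes mirror the ones used throughout this tutorial; the seed is arbitrary.

```python
import torch

torch.manual_seed(0)

model = torch.nn.Sequential(
    torch.nn.Linear(1000, 100),
    torch.nn.ReLU(),
    torch.nn.Linear(100, 10),
)

# Each Linear layer registers a weight and a bias; ReLU has no parameters.
shapes = {name: tuple(p.shape) for name, p in model.named_parameters()}
n_params = sum(p.numel() for p in model.parameters())

x = torch.randn(64, 1000)
y = torch.randn(64, 10)
y_pred = model(x)

# reduction='sum' adds up the squared errors instead of averaging them.
loss_nn = torch.nn.MSELoss(reduction='sum')(y_pred, y)
loss_manual = (y_pred - y).pow(2).sum()
print(shapes, n_params, loss_nn.item(), loss_manual.item())
```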

In [10]:
import torch

# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

# Create random Tensors to hold inputs and outputs
x = torch.randn(N, D_in)
y = torch.randn(N, D_out)

# Use the nn package to define our model as a sequence of layers. nn.Sequential
# is a Module which contains other Modules, and applies them in sequence to
# produce its output. Each Linear Module computes output from input using a
# linear function, and holds internal Tensors for its weight and bias.
model = torch.nn.Sequential(
    torch.nn.Linear(D_in, H),
    torch.nn.ReLU(),
    torch.nn.Linear(H, D_out),
)

# The nn package also contains definitions of popular loss functions; in this
# case we will use Mean Squared Error (MSE) as our loss function.
loss_fn = torch.nn.MSELoss(reduction='sum')

learning_rate = 1e-4
for t in range(500):
    # Forward pass: compute predicted y by passing x to the model. Module objects
    # override the __call__ operator so you can call them like functions. When
    # doing so you pass a Tensor of input data to the Module and it produces
    # a Tensor of output data.
    y_pred = model(x)

    # Compute and print loss. We pass Tensors containing the predicted and true
    # values of y, and the loss function returns a Tensor containing the
    # loss.
    loss = loss_fn(y_pred, y)
    print(t, loss.item())

    # Zero the gradients before running the backward pass.
    model.zero_grad()

    # Backward pass: compute gradient of the loss with respect to all the learnable
    # parameters of the model. Internally, the parameters of each Module are stored
    # in Tensors with requires_grad=True, so this call will compute gradients for
    # all learnable parameters in the model.
    loss.backward()

    # Update the weights using gradient descent. Each parameter is a Tensor, so
    # we can access its gradients like we did before.
    with torch.no_grad():
        for param in model.parameters():
            param -= learning_rate * param.grad

0 667.6985473632812
1 620.8004760742188
2 579.9766845703125
3 543.4854125976562
4 510.8024597167969
5 481.4676818847656
6 454.6184387207031
7 429.6540222167969
8 406.3977355957031
9 384.6302795410156
10 364.2885437011719
11 345.0937194824219
12 326.9579772949219
13 309.6826171875
14 293.2193908691406
15 277.5777282714844
16 262.6214904785156
17 248.35557556152344
18 234.72096252441406
19 221.71412658691406
20 209.28143310546875
21 197.3482208251953
22 185.95448303222656
23 175.1243438720703
24 164.80184936523438
25 154.97731018066406
26 145.59194946289062
27 136.71107482910156
28 128.29815673828125
29 120.3134765625
30 112.7813720703125
31 105.6783447265625
32 98.97616577148438
33 92.66381072998047
34 86.73333740234375
35 81.17061614990234
36 75.94145202636719
37 71.03512573242188
38 66.43218231201172
39 62.12113952636719
40 58.085060119628906
41 54.310302734375
42 50.777976989746094
43 47.47459030151367
44 44.38212585449219
45 41.49032211303711
46 38.787872314453125
47 36.265811920166

425 1.885332130768802e-05
426 1.8339169400860555e-05
427 1.7837684936239384e-05
428 1.735145451675635e-05
429 1.6878388123586774e-05
430 1.641749076952692e-05
431 1.5969842934282497e-05
432 1.5535191778326407e-05
433 1.5112698747543618e-05
434 1.4700642168463673e-05
435 1.4300992916105315e-05
436 1.3912416761741042e-05
437 1.3533836863643955e-05
438 1.3164762094675098e-05
439 1.2806764061679132e-05
440 1.2459092431527097e-05
441 1.2120801329729147e-05
442 1.1791194083343726e-05
443 1.1471560355857946e-05
444 1.116071052820189e-05
445 1.0858417226700112e-05
446 1.0562380339251831e-05
447 1.0277381079504266e-05
448 9.99864460027311e-06
449 9.727556971483864e-06
450 9.464391951041762e-06
451 9.20798811421264e-06
452 8.95881385076791e-06
453 8.71674365043873e-06
454 8.480194992444012e-06
455 8.250912287621759e-06
456 8.028404408833012e-06
457 7.810933311702684e-06
458 7.599854143336415e-06
459 7.395019110845169e-06
460 7.194992122094845e-06
461 7.000927780609345e-06
...

#### PyTorch: optim

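In PyTorch, the optim package abstracts the idea of an optimization algorithm and provides implementations of commonly used ones such as AdaGrad, RMSProp, and Adam. Using it removes the need, seen in the code above, to update the weights by hand inside torch.no_grad() or through .data to avoid tracking history in autograd. For example: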
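
To make the connection with the manual updates concrete, the sketch below applies one plain SGD step in two ways, by hand under `torch.no_grad()` and through `torch.optim.SGD`, and compares the results. The toy loss (sum of squares of the weights) and the names `w_manual` and `w_optim` are assumptions made for illustration.

```python
import torch

torch.manual_seed(0)
lr = 0.1

# Two copies of the same starting weights.
w_manual = torch.randn(5, requires_grad=True)
w_optim = w_manual.detach().clone().requires_grad_(True)

# Manual update, as in the earlier sections.
(w_manual ** 2).sum().backward()
with torch.no_grad():
    w_manual -= lr * w_manual.grad

# The same update performed by an optimizer.
optimizer = torch.optim.SGD([w_optim], lr=lr)
optimizer.zero_grad()
(w_optim ** 2).sum().backward()
optimizer.step()

print(torch.allclose(w_manual, w_optim))
```

With plain SGD (no momentum or weight decay) the two updates should agree; optimizers such as Adam keep extra per-parameter state and follow a different update rule.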

In [11]:
import torch

# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

# Create random Tensors to hold inputs and outputs
x = torch.randn(N, D_in)
y = torch.randn(N, D_out)

# Use the nn package to define our model and loss function.
model = torch.nn.Sequential(
    torch.nn.Linear(D_in, H),
    torch.nn.ReLU(),
    torch.nn.Linear(H, D_out),
)
loss_fn = torch.nn.MSELoss(reduction='sum')

# Use the optim package to define an Optimizer that will update the weights of
# the model for us. Here we will use Adam; the optim package contains many other
# optimization algorithms. The first argument to the Adam constructor tells the
# optimizer which Tensors it should update.
learning_rate = 1e-4
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)
for t in range(500):
    # Forward pass: compute predicted y by passing x to the model.
    y_pred = model(x)

    # Compute and print loss.
    loss = loss_fn(y_pred, y)
    print(t, loss.item())

    # Before the backward pass, use the optimizer object to zero all of the
    # gradients for the variables it will update (which are the learnable
    # weights of the model). This is because by default, gradients are
    # accumulated in buffers (i.e., not overwritten) whenever .backward()
    # is called. Check out the docs of torch.autograd.backward for more details.
    optimizer.zero_grad()

    # Backward pass: compute gradient of the loss with respect to model
    # parameters
    loss.backward()

    # Calling the step function on an Optimizer makes an update to its
    # parameters
    optimizer.step()


0 634.4086303710938
1 618.0187377929688
2 602.0963745117188
3 586.6731567382812
4 571.6885375976562
5 557.1368408203125
6 543.0049438476562
7 529.3257446289062
8 515.9849853515625
9 503.01068115234375
10 490.3847351074219
11 478.1793212890625
12 466.3462829589844
13 454.8947448730469
14 443.7497863769531
15 432.9693908691406
16 422.4825134277344
17 412.3030700683594
18 402.3389892578125
19 392.6382141113281
20 383.17315673828125
21 373.9485168457031
22 364.9387512207031
23 356.1969909667969
24 347.675048828125
25 339.3837585449219
26 331.27325439453125
27 323.373779296875
28 315.6773986816406
29 308.14520263671875
30 300.8003234863281
31 293.6183166503906
32 286.63336181640625
33 279.7945251464844
34 273.0917663574219
35 266.51708984375
36 260.061279296875
37 253.75604248046875
38 247.59205627441406
39 241.57864379882812
40 235.68722534179688
41 229.96202087402344
42 224.36709594726562
43 218.890380859375
44 213.52285766601562
45 208.26100158691406
46 203.10235595703125
47 198.04032897

377 2.5283210561610758e-05
378 2.3869428332545795e-05
379 2.2534019080922008e-05
380 2.1271118384902366e-05
381 2.007919283641968e-05
382 1.8951657693833113e-05
383 1.7889311493490823e-05
384 1.6883212083484977e-05
385 1.5933881513774395e-05
386 1.5035258002171759e-05
387 1.41892560350243e-05
388 1.3388556908466853e-05
389 1.2632922334887553e-05
390 1.1917649317183532e-05
391 1.1243043445574585e-05
392 1.0607817785057705e-05
393 1.0006028787756804e-05
394 9.438242159376387e-06
395 8.901691217033658e-06
396 8.394894393859431e-06
397 7.917793482192792e-06
398 7.466381248377729e-06
399 7.040027412585914e-06
400 6.637949354626471e-06
401 6.258002031245269e-06
402 5.899550160393119e-06
403 5.5615910241613165e-06
404 5.242736733634956e-06
405 4.941441602568375e-06
406 4.6571180973842274e-06
407 4.388241904962342e-06
408 4.135792551096529e-06
409 3.896581802109722e-06
410 3.6711530810862314e-06
411 3.4583631531859282e-06
412 3.257361640862655e-06
413 3.0685441743116826e-06
...

#### PyTorch: Custom Modules

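Sometimes you will want to specify models that are more complex than a sequence of existing Modules. In those cases you can define your own Module by subclassing nn.Module and implementing a forward method that receives input Tensors and produces output Tensors using other Modules or other autograd operations on Tensors. Below, a two-layer network is implemented as a custom Module.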

In [12]:
import torch


class TwoLayerNet(torch.nn.Module):
    def __init__(self, D_in, H, D_out):
        """
        In the constructor we instantiate two nn.Linear modules and assign them as
        member variables.
        """
        super(TwoLayerNet, self).__init__()
        self.linear1 = torch.nn.Linear(D_in, H)
        self.linear2 = torch.nn.Linear(H, D_out)

    def forward(self, x):
        """
        In the forward function we accept a Tensor of input data and we must return
        a Tensor of output data. We can use Modules defined in the constructor as
        well as arbitrary operators on Tensors.
        """
        h_relu = self.linear1(x).clamp(min=0)
        y_pred = self.linear2(h_relu)
        return y_pred


# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

# Create random Tensors to hold inputs and outputs
x = torch.randn(N, D_in)
y = torch.randn(N, D_out)

# Construct our model by instantiating the class defined above
model = TwoLayerNet(D_in, H, D_out)

# Construct our loss function and an Optimizer. The call to model.parameters()
# in the SGD constructor will contain the learnable parameters of the two
# nn.Linear modules which are members of the model.
criterion = torch.nn.MSELoss(reduction='sum')
optimizer = torch.optim.SGD(model.parameters(), lr=1e-4)
for t in range(500):
    # Forward pass: Compute predicted y by passing x to the model
    y_pred = model(x)

    # Compute and print loss
    loss = criterion(y_pred, y)
    print(t, loss.item())

    # Zero gradients, perform a backward pass, and update the weights.
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

0 683.7316284179688
1 633.8895263671875
2 590.671142578125
3 552.86328125
4 519.4605102539062
5 489.2613525390625
6 461.6614074707031
7 436.3767395019531
8 412.9505310058594
9 391.03863525390625
10 370.3308410644531
11 350.8232116699219
12 332.4227294921875
13 314.9851989746094
14 298.4110107421875
15 282.64056396484375
16 267.6932067871094
17 253.47125244140625
18 239.99681091308594
19 227.12557983398438
20 214.87789916992188
21 203.1916046142578
22 192.02249145507812
23 181.3760528564453
24 171.23214721679688
25 161.58773803710938
26 152.43545532226562
27 143.73287963867188
28 135.46656799316406
29 127.61888122558594
30 120.17344665527344
31 113.13884735107422
32 106.49720764160156
33 100.19879913330078
34 94.24391174316406
35 88.64103698730469
36 83.36966705322266
37 78.3985824584961
38 73.73113250732422
39 69.33782958984375
40 65.2093276977539
41 61.3371467590332
42 57.70439910888672
43 54.29240798950195
44 51.095970153808594
45 48.095218658447266
46 45.28593444824219
47 42.6446418

480 4.145267666899599e-06
481 4.027604063594481e-06
482 3.9136348277679645e-06
483 3.8029686493246118e-06
484 3.695871782838367e-06
485 3.5915459193347488e-06
486 3.4907484405266587e-06
487 3.39188045472838e-06
488 3.29628483086708e-06
489 3.2037992241384927e-06
490 3.1137913083512103e-06
491 3.026452986887307e-06
492 2.94137771561509e-06
493 2.8590040983544895e-06
494 2.778928092084243e-06
495 2.701332277865731e-06
496 2.6259804144501686e-06
497 2.5527494926791405e-06
498 2.481564706613426e-06
499 2.4125708932842826e-06
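
As a quick check of how a custom Module behaves, the sketch below repeats a compact version of TwoLayerNet with hypothetical small sizes. It shows that submodules assigned as attributes in the constructor are registered automatically, so `model.parameters()` can find their weights, and that calling the model runs `forward`.

```python
import torch


class TwoLayerNet(torch.nn.Module):
    """Compact copy of the module above so this snippet runs on its own."""

    def __init__(self, D_in, H, D_out):
        super().__init__()
        self.linear1 = torch.nn.Linear(D_in, H)
        self.linear2 = torch.nn.Linear(H, D_out)

    def forward(self, x):
        return self.linear2(self.linear1(x).clamp(min=0))


# Small sizes chosen only for illustration.
model = TwoLayerNet(D_in=8, H=4, D_out=2)

# Parameters are named after the attributes they were assigned to.
param_names = [name for name, _ in model.named_parameters()]

# Calling the model invokes forward() and returns one row per input row.
out = model(torch.randn(3, 8))
print(param_names, out.shape)
```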


#### PyTorch: Control Flow + Weight Sharing

In [13]:
import random
import torch


class DynamicNet(torch.nn.Module):
    def __init__(self, D_in, H, D_out):
        """
        In the constructor we construct three nn.Linear instances that we will use
        in the forward pass.
        """
        super(DynamicNet, self).__init__()
        self.input_linear = torch.nn.Linear(D_in, H)
        self.middle_linear = torch.nn.Linear(H, H)
        self.output_linear = torch.nn.Linear(H, D_out)

    def forward(self, x):
        """
        For the forward pass of the model, we randomly choose either 0, 1, 2, or 3
        and reuse the middle_linear Module that many times to compute hidden layer
        representations.

        Since each forward pass builds a dynamic computation graph, we can use normal
        Python control-flow operators like loops or conditional statements when
        defining the forward pass of the model.

        Here we also see that it is perfectly safe to reuse the same Module many
        times when defining a computational graph. This is a big improvement from Lua
        Torch, where each Module could be used only once.
        """
        h_relu = self.input_linear(x).clamp(min=0)
        for _ in range(random.randint(0, 3)):
            h_relu = self.middle_linear(h_relu).clamp(min=0)
        y_pred = self.output_linear(h_relu)
        return y_pred


# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

# Create random Tensors to hold inputs and outputs
x = torch.randn(N, D_in)
y = torch.randn(N, D_out)

# Construct our model by instantiating the class defined above
model = DynamicNet(D_in, H, D_out)

# Construct our loss function and an Optimizer. Training this strange model with
# vanilla stochastic gradient descent is tough, so we use momentum
criterion = torch.nn.MSELoss(reduction='sum')
optimizer = torch.optim.SGD(model.parameters(), lr=1e-4, momentum=0.9)
for t in range(500):
    # Forward pass: Compute predicted y by passing x to the model
    y_pred = model(x)

    # Compute and print loss
    loss = criterion(y_pred, y)
    print(t, loss.item())

    # Zero gradients, perform a backward pass, and update the weights.
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

0 664.0790405273438
1 620.3710327148438
2 622.1655883789062
3 616.0892333984375
4 615.776123046875
5 610.5848388671875
6 613.6027221679688
7 444.4853515625
8 611.56103515625
9 380.3213195800781
10 339.25830078125
11 580.3104858398438
12 252.95806884765625
13 210.50469970703125
14 167.01426696777344
15 125.942138671875
16 595.7234497070312
17 534.3563232421875
18 585.9985961914062
19 58.68450164794922
20 470.385498046875
21 68.06253051757812
22 68.5880355834961
23 388.017333984375
24 59.717350006103516
25 513.6089477539062
26 305.22100830078125
27 475.09100341796875
28 445.548828125
29 407.6017150878906
30 186.94747924804688
31 102.25706481933594
32 149.20132446289062
33 93.86891174316406
34 246.63917541503906
35 342.28509521484375
36 61.832557678222656
37 188.4534149169922
38 109.4961929321289
39 97.34200286865234
40 128.5032196044922
41 185.66307067871094
42 77.59613037109375
43 97.02117919921875
44 72.08903503417969
45 183.8531036376953
46 74.17464447021484
47 227.40806579589844
...
385 17.334943771362305
386 0.936115562915802
387 11.452579498291016
388 18.721126556396484
389 17.995149612426758
390 1.0159624814987183
391 3.8388514518737793
392 28.04918098449707
393 8.060794830322266
394 14.634716987609863
395 1.091517686843872
396 5.343496799468994
397 9.471994400024414
398 9.796541213989258
399 8.877714157104492
400 3.6639461517333984
401 4.467319011688232
402 0.9149323105812073
403 3.8236098289489746
404 3.319032669067383
405 2.742666006088257
406 1.1291255950927734
407 0.7717222571372986
408 8.300225257873535
409 0.41998419165611267
410 0.3257797062397003
411 12.692811965942383
412 0.7627459764480591
413 3.2951242923736572
414 3.7787179946899414
415 8.818142890930176
416 2.7031748294830322
417 1.1190990209579468
418 0.8271508812904358
419 4.238048076629639
420 5.143354892730713
421 11.789674758911133
422 11.835980415344238
423 4.553065776824951
424 8.45146656036377
425 37.04496383666992
426 11.845982551574707
427 3.7197623252868652
428 0.9960892796516418
...
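
Weight sharing, as used by `middle_linear` above, means one parameter appears several times in the graph, and autograd sums the gradient contributions from every use. A minimal sketch with a single scalar weight, where the values 3.0 and 2.0 are arbitrary:

```python
import torch

# A 1x1 linear layer without bias is just multiplication by one weight w.
shared = torch.nn.Linear(1, 1, bias=False)
with torch.no_grad():
    shared.weight.fill_(3.0)

x = torch.tensor([[2.0]])

# Apply the same Module twice: out = w * (w * x) = w**2 * x.
out = shared(shared(x))
out.sum().backward()

# Both uses of w contribute: d/dw (w**2 * x) = 2 * w * x.
grad = shared.weight.grad.item()
print(out.item(), grad)
```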

### References

[PyTorch Tutorials](https://pytorch.org/tutorials/beginner/pytorch_with_examples.html#)