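# Learning PyTorch with Examples

PyTorch provides two main features:
- An n-dimensional tensor, similar to a numpy array but able to run on GPUs
- Automatic differentiation for building and training neural networks

We will use a fully connected ReLU network as our running example. The network has a single hidden layer and is trained on random data by minimizing the squared Euclidean distance between the network output and the true output.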
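
For reference, the running example can be written out as a formula. This is only a sketch in notation chosen here for illustration: $W_1$ and $W_2$ stand for the weight matrices called `w1` and `w2` in the code of the following sections, $x$ is the input batch, $y$ the target batch, and biases are omitted, as in the hand-written versions.

```latex
\hat{y} = \max(x W_1,\, 0)\, W_2,
\qquad
L = \sum_{i,j} \left( \hat{y}_{ij} - y_{ij} \right)^2
```

Here $\max(\cdot, 0)$ is applied elementwise (the ReLU), and the sum runs over every entry of the output batch.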


## Tensors
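### Warm-up: numpy
Before introducing PyTorch, we will first implement the network using numpy.

Numpy provides an n-dimensional array object and many functions for manipulating these arrays. Numpy is a generic framework for scientific computing; it knows nothing about computation graphs, deep learning, or gradients. However, we can easily use numpy to fit a two-layer network to random data by manually implementing the forward and backward passes with numpy operations: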

In [7]:
import numpy as np

# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

# Create random input and output data
x = np.random.randn(N, D_in)
y = np.random.randn(N, D_out)

# Randomly initialize weights
w1 = np.random.randn(D_in, H)
w2 = np.random.randn(H, D_out)

learning_rate = 1e-6
for t in range(500):
    # Forward pass: compute predicted y
    h = x.dot(w1)
    h_relu = np.maximum(h, 0)
    y_pred = h_relu.dot(w2)

    # Compute and print loss
    loss = np.square(y_pred - y).sum()
    print(t, loss)

    # Backprop to compute gradients of loss with respect to w1 and w2
    grad_y_pred = 2.0 * (y_pred - y)
    grad_w2 = h_relu.T.dot(grad_y_pred)
    grad_h_relu = grad_y_pred.dot(w2.T)
    grad_h = grad_h_relu.copy()
    grad_h[h < 0] = 0
    grad_w1 = x.T.dot(grad_h)

    # Update weights
    w1 -= learning_rate * grad_w1
    w2 -= learning_rate * grad_w2

0 39071195.20903869
1 40251569.19603732
2 43185249.62662539
3 39239287.98193921
4 27537593.895930406
5 14680130.57115881
6 6871547.284445027
7 3416465.930285142
8 2037249.8606096457
9 1434524.0968556034
10 1116188.3197998707
11 912720.7332097872
12 764814.8798884479
13 649655.0784746656
14 556913.7918845976
15 480655.77630065655
16 417186.1890909355
17 363852.97666853294
18 318755.2989969051
19 280388.9981744852
20 247554.54628007454
21 219282.55485051087
22 194826.61801176472
23 173623.74695786872
24 155191.24895168672
25 139094.67225409992
26 124953.1778311485
27 112481.6260442081
28 101452.60395917633
29 91674.0741270049
30 82976.04874194015
31 75222.60773628247
32 68284.0468860255
33 62077.43763968386
34 56517.385105248904
35 51535.58830406274
36 47059.26066006157
37 43022.22304204956
38 39374.069407806644
39 36073.260689457806
40 33081.39822644148
41 30369.513705181464
42 27904.518402098875
43 25660.562388205984
44 23616.183097184155
45 21751.76352824879
46 20053.778108382925
47 1

435 2.210841539340442e-05
436 2.1077183192977594e-05
437 2.0094257533969005e-05
438 1.915718218440378e-05
439 1.826382503140739e-05
440 1.7412278976410022e-05
441 1.6600711842749758e-05
442 1.5826882313497722e-05
443 1.5089233263994879e-05
444 1.4386334124297187e-05
445 1.3716016276148366e-05
446 1.3077042638387338e-05
447 1.2467958900626311e-05
448 1.1887410816114386e-05
449 1.133399991195646e-05
450 1.0806288755039822e-05
451 1.030319336200485e-05
452 9.823614514658395e-06
453 9.366380777542751e-06
454 8.930557754809348e-06
455 8.515094723836108e-06
456 8.11890163676978e-06
457 7.741181080892796e-06
458 7.381151922711635e-06
459 7.037898698236437e-06
460 6.710643478158052e-06
461 6.398634251320804e-06
462 6.1011561668808725e-06
463 5.8175434185345315e-06
464 5.547212253155069e-06
465 5.289487671967309e-06
466 5.043713548571806e-06
467 4.8093715270937824e-06
468 4.585917262110242e-06
469 4.372926722272162e-06
470 4.169815258385056e-06
471 3.976185429362954e-06
472 3.7915786868728118e-
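
Hand-written backward passes are easy to get subtly wrong, so it can help to compare them against finite differences. The sketch below repeats the forward and backward formulas from the numpy example on a much smaller network and checks them numerically. The helper names `loss_and_grads` and `numeric_grad`, the tiny sizes, and the step `eps` are illustrative choices, not part of the original example.

```python
import numpy as np

def loss_and_grads(x, y, w1, w2):
    """Forward and backward pass of the two-layer ReLU network."""
    h = x.dot(w1)
    h_relu = np.maximum(h, 0)
    y_pred = h_relu.dot(w2)
    loss = np.square(y_pred - y).sum()

    grad_y_pred = 2.0 * (y_pred - y)
    grad_w2 = h_relu.T.dot(grad_y_pred)
    grad_h = grad_y_pred.dot(w2.T)
    grad_h[h < 0] = 0
    grad_w1 = x.T.dot(grad_h)
    return loss, grad_w1, grad_w2

def numeric_grad(f, w, eps=1e-6):
    """Central finite-difference estimate of df/dw, one entry at a time."""
    grad = np.zeros_like(w)
    for idx in np.ndindex(*w.shape):
        old = w[idx]
        w[idx] = old + eps
        f_plus = f()
        w[idx] = old - eps
        f_minus = f()
        w[idx] = old  # restore the original entry
        grad[idx] = (f_plus - f_minus) / (2 * eps)
    return grad

rng = np.random.default_rng(0)
N, D_in, H, D_out = 4, 5, 3, 2  # tiny sizes keep the check fast
x = rng.standard_normal((N, D_in))
y = rng.standard_normal((N, D_out))
w1 = rng.standard_normal((D_in, H))
w2 = rng.standard_normal((H, D_out))

_, grad_w1, grad_w2 = loss_and_grads(x, y, w1, w2)
loss_only = lambda: loss_and_grads(x, y, w1, w2)[0]
err_w1 = np.abs(grad_w1 - numeric_grad(loss_only, w1)).max()
err_w2 = np.abs(grad_w2 - numeric_grad(loss_only, w2)).max()
print(err_w1, err_w2)  # both should be very small
```

The check uses float64 and a small network because finite differences cost one forward pass per weight entry and are sensitive to rounding.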

### PyTorch: Tensors
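Numpy is a great framework, but it cannot use GPUs to accelerate its numerical computations. For modern deep neural networks, GPUs often provide speedups of 50x or greater, so unfortunately numpy alone is not enough for modern deep learning.

Here we introduce the most fundamental PyTorch concept: the Tensor. A PyTorch Tensor is conceptually identical to a numpy array: a Tensor is an n-dimensional array, and PyTorch provides many functions for operating on these Tensors. Like numpy arrays, Tensors are a generic tool for scientific computing; in this section we use them without any of PyTorch's computation graph or gradient machinery.

Unlike numpy, however, PyTorch Tensors can use GPUs to accelerate their numerical computations. To run a PyTorch Tensor on a GPU, you only need to place it on a GPU device, for example by passing a `device` argument when creating it; the code below does this through its `device` variable.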
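
As a minimal sketch of device placement, the snippet below picks a GPU when PyTorch can see one and otherwise falls back to the CPU, so it should run on either kind of machine. The tensor names `a`, `b`, and `c` and the shapes are arbitrary and chosen only for illustration.

```python
import torch

# Pick a GPU if one is visible to PyTorch, otherwise fall back to the CPU.
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

# Create a tensor directly on the chosen device ...
a = torch.randn(3, 4, device=device, dtype=torch.float)

# ... or move an existing CPU tensor there with .to().
b = torch.ones(4, 2).to(device)

c = a.mm(b)  # runs on whichever device holds a and b
print(c.device, c.shape)
```

Operands of an operation such as `mm` need to live on the same device, which is why both tensors are placed on `device` before multiplying.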

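Here we use PyTorch Tensors to fit a two-layer network to random data. As in the numpy example above, we need to manually implement the forward and backward passes through the network.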

In [8]:
import torch


dtype = torch.float
device = torch.device("cpu")
#device = torch.device("cuda:0") # Uncomment this to run on GPU

# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

# Create random input and output data
x = torch.randn(N, D_in, device=device, dtype=dtype)
y = torch.randn(N, D_out, device=device, dtype=dtype)

# Randomly initialize weights
w1 = torch.randn(D_in, H, device=device, dtype=dtype)
w2 = torch.randn(H, D_out, device=device, dtype=dtype)

learning_rate = 1e-6
for t in range(500):
    # Forward pass: compute predicted y
    h = x.mm(w1)
    h_relu = h.clamp(min=0)
    y_pred = h_relu.mm(w2)

    # Compute and print loss
    loss = (y_pred - y).pow(2).sum().item()
    print(t, loss)

    # Backprop to compute gradients of loss with respect to w1 and w2
    grad_y_pred = 2.0 * (y_pred - y)
    grad_w2 = h_relu.t().mm(grad_y_pred)
    grad_h_relu = grad_y_pred.mm(w2.t())
    grad_h = grad_h_relu.clone()
    grad_h[h < 0] = 0
    grad_w1 = x.t().mm(grad_h)

    # Update weights using gradient descent
    w1 -= learning_rate * grad_w1
    w2 -= learning_rate * grad_w2

0 32292580.0
1 31060580.0
2 31537196.0
3 28888644.0
4 21786662.0
5 13504984.0
6 7299084.0
7 3882846.75
8 2235437.5
9 1458964.25
10 1064120.5
11 836790.875
12 687146.5
13 578618.75
14 494601.46875
15 426949.5
16 371160.03125
17 324488.09375
18 285126.1875
19 251629.984375
20 222900.453125
21 198133.28125
22 176742.515625
23 158142.328125
24 141875.75
25 127599.4609375
26 115034.0859375
27 103934.21875
28 94102.09375
29 85368.0859375
30 77596.03125
31 70658.84375
32 64452.06640625
33 58882.47265625
34 53875.33203125
35 49365.73046875
36 45292.08984375
37 41604.84375
38 38262.84375
39 35229.05859375
40 32470.328125
41 29957.1171875
42 27666.833984375
43 25576.240234375
44 23663.375
45 21911.580078125
46 20304.9296875
47 18830.55859375
48 17479.048828125
49 16238.3583984375
50 15096.236328125
51 14042.75
52 13070.859375
53 12173.3095703125
54 11343.1572265625
55 10575.130859375
56 9863.81640625
57 9204.73046875
58 8593.5654296875
59 8026.228515625
60 7499.63720703125
61 7010.4736328125
62 

473 7.511164585594088e-05
474 7.391851977445185e-05
475 7.284889579750597e-05
476 7.206273585325107e-05
477 7.081220246618614e-05
478 6.96156348567456e-05
479 6.892380042700097e-05
480 6.798841786803678e-05
481 6.688141729682684e-05
482 6.582759669981897e-05
483 6.486199708888307e-05
484 6.382920400938019e-05
485 6.31639632047154e-05
486 6.219403439899907e-05
487 6.151812704047188e-05
488 6.087141446187161e-05
489 5.9911930293310434e-05
490 5.902994598727673e-05
491 5.8306599385105073e-05
492 5.769906420027837e-05
493 5.692374179488979e-05
494 5.606894410448149e-05
495 5.540060737985186e-05
496 5.471308031701483e-05
497 5.4316700698109344e-05
498 5.374009924707934e-05
499 5.3167757869232446e-05


## Autograd
### PyTorch: Tensors and autograd
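In the examples above, we had to manually implement both the forward and backward passes of our neural network. Thankfully, we can use automatic differentiation to automate the computation of the backward pass. The autograd package in PyTorch provides exactly this functionality. When using autograd, the forward pass of the network defines a computation graph; nodes in the graph are Tensors, and edges are functions that produce output Tensors from input Tensors. Backpropagating through this graph then lets us easily compute gradients.

Each Tensor represents a node in a computation graph. If `x` is a Tensor with `x.requires_grad=True`, then after a backward pass `x.grad` is another Tensor holding the gradient of some scalar value with respect to `x`.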
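
A very small illustration of `requires_grad` and `.grad`, separate from the network example: the scalar `s` and the input values are arbitrary choices made here for the sketch.

```python
import torch

x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
s = (x ** 2).sum()   # scalar: s = x1^2 + x2^2 + x3^2
s.backward()         # fills x.grad with ds/dx
print(x.grad)        # ds/dx = 2 * x, i.e. tensor([2., 4., 6.])
```

Calling `backward()` on the scalar is what populates `x.grad`; before that call, `x.grad` is `None`.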

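Here we use PyTorch Tensors and autograd to implement our two-layer network; now we no longer need to manually implement the backward pass through the network.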

In [9]:
import torch

dtype = torch.float
device = torch.device("cpu")
#device = torch.device("cuda:0") # Uncomment this to run on GPU

# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

# Create random Tensors to hold input and outputs.
# Leaving requires_grad at its default of False indicates that we do not need
# to compute gradients with respect to these Tensors during the backward pass.
x = torch.randn(N, D_in, device=device, dtype=dtype)
y = torch.randn(N, D_out, device=device, dtype=dtype)

# Create random Tensors for weights.
# Setting requires_grad=True indicates that we want to compute gradients with
# respect to these Tensors during the backward pass.
w1 = torch.randn(D_in, H, device=device, dtype=dtype, requires_grad=True)
w2 = torch.randn(H, D_out, device=device, dtype=dtype, requires_grad=True)

learning_rate = 1e-6
for t in range(500):
    # Forward pass: compute predicted y using operations on Tensors; these
    # are exactly the same operations we used to compute the forward pass using
    # Tensors, but we do not need to keep references to intermediate values since
    # we are not implementing the backward pass by hand.
    y_pred = x.mm(w1).clamp(min=0).mm(w2)  # no intermediates kept: autograd handles the backward pass

    # Compute and print loss using operations on Tensors.
    # Now loss is a zero-dimensional Tensor (shape ()).
    # loss.item() gets the scalar value held in the loss as a Python number.
    loss = (y_pred - y).pow(2).sum()
    print(t, loss.item())

    # Use autograd to compute the backward pass. This call will compute the
    # gradient of loss with respect to all Tensors with requires_grad=True.
    # After this call w1.grad and w2.grad will be Tensors holding the gradient
    # of the loss with respect to w1 and w2 respectively.
    loss.backward()

    # Manually update weights using gradient descent. Wrap in torch.no_grad()
    # because weights have requires_grad=True, but we don't need to track this
    # in autograd.
    # An alternative way is to operate on weight.data and weight.grad.data.
    # Recall that tensor.data gives a tensor that shares the storage with
    # tensor, but doesn't track history.
    # You can also use torch.optim.SGD to achieve this.
    with torch.no_grad():
        w1 -= learning_rate * w1.grad
        w2 -= learning_rate * w2.grad

        # Manually zero the gradients after updating weights
        w1.grad.zero_()
        w2.grad.zero_()

0 33413614.0
1 32287334.0
2 36719192.0
3 39732540.0
4 35616844.0
5 24275524.0
6 12935127.0
7 6057996.5
8 3010592.75
9 1772891.625
10 1238876.0
11 964418.375
12 793190.5
13 669664.0625
14 573041.125
15 494412.5625
16 429036.53125
17 374046.375
18 327423.03125
19 287647.28125
20 253515.546875
21 224080.84375
22 198591.9375
23 176420.8125
24 157091.125
25 140195.953125
26 125373.3671875
27 112334.6796875
28 100831.875
29 90662.203125
30 81646.2265625
31 73650.359375
32 66527.421875
33 60181.66796875
34 54516.41796875
35 49449.19140625
36 44903.33203125
37 40818.890625
38 37144.87890625
39 33834.75390625
40 30847.51171875
41 28150.91015625
42 25710.9609375
43 23501.576171875
44 21498.767578125
45 19681.66015625
46 18031.66015625
47 16531.138671875
48 15166.2353515625
49 13922.666015625
50 12789.546875
51 11757.498046875
52 10817.34765625
53 9959.4365234375
54 9174.685546875
55 8456.3818359375
56 7797.98876953125
57 7194.6318359375
58 6641.35400390625
59 6133.5556640625
60 5667.203125
61 52

402 0.00029957416700199246
403 0.0002917884849011898
404 0.0002846306888386607
405 0.00027779332594946027
406 0.0002703032805584371
407 0.00026455195620656013
408 0.0002578228886704892
409 0.00025123567320406437
410 0.0002455591456964612
411 0.00023986159067135304
412 0.00023428384156432003
413 0.0002284967340528965
414 0.00022339455608744174
415 0.00021846243180334568
416 0.00021328286675270647
417 0.0002081180427921936
418 0.00020339907496236265
419 0.00019902669009752572
420 0.00019415834685787559
421 0.00018984048801939934
422 0.00018597797316033393
423 0.00018197903409600258
424 0.00017796950123738497
425 0.0001744989276630804
426 0.00016993920144159347
427 0.000166288111358881
428 0.00016341089212801307
429 0.00015941566380206496
430 0.00015660330245736986
431 0.00015322504623327404
432 0.00015030663053039461
433 0.00014719807950314134
434 0.00014356800238601863
435 0.00014084053691476583
436 0.00013790577941108495
437 0.00013470312114804983
438 0.00013221146946307272
439 0.00012
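
The training loop above zeroes `w1.grad` and `w2.grad` after each update because `backward()` adds to existing gradients instead of overwriting them. The following sketch shows that accumulation on a toy tensor; the tensor `w` and the function `3 * w` are illustrative choices only.

```python
import torch

w = torch.tensor([1.0, -2.0], requires_grad=True)

(3 * w).sum().backward()
first = w.grad.clone()        # gradient of 3*w1 + 3*w2 is [3, 3]

(3 * w).sum().backward()      # second backward pass without zeroing
accumulated = w.grad.clone()  # gradients were added: [6, 6]

w.grad.zero_()                # reset before the next step
print(first, accumulated, w.grad)
```

Without the `zero_()` call, each training step would apply the sum of all gradients computed so far instead of the current one.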

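### PyTorch: Defining new autograd functions
Under the hood, each primitive autograd operator is really two functions that operate on Tensors. The `forward` function computes output Tensors from input Tensors. The `backward` function receives the gradient of some scalar value with respect to the output Tensors, and computes the gradient of that same scalar value with respect to the input Tensors.

In PyTorch we can easily define our own autograd operator by defining a subclass of `torch.autograd.Function` and implementing the `forward` and `backward` functions as static methods. We can then use the new operator by calling its `apply` method like a function, passing Tensors containing input data.

In this example we define our own custom autograd function for performing the ReLU nonlinearity, and use it to implement our two-layer network: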

In [10]:
import torch


class MyReLU(torch.autograd.Function):
    """
    We can implement our own custom autograd Functions by subclassing
    torch.autograd.Function and implementing the forward and backward passes,
    which operate on Tensors.
    """

    @staticmethod
    def forward(ctx, input):
        """
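        In the forward pass we receive a Tensor containing the input and return
        a Tensor containing the output. ctx is a context object that can be used
        to stash information for the backward computation. Tensors needed in the
        backward pass can be cached with the ctx.save_for_backward method.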
        """
        ctx.save_for_backward(input)
        return input.clamp(min=0)

    @staticmethod
    def backward(ctx, grad_output):
        """
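        In the backward pass we receive a Tensor containing the gradient of the
        loss with respect to the output, and we need to compute the gradient of
        the loss with respect to the input.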
        """
        input, = ctx.saved_tensors
        grad_input = grad_output.clone()
        grad_input[input < 0] = 0
        return grad_input


dtype = torch.float
device = torch.device("cpu")
# device = torch.device("cuda:0")  # Uncomment this to run on GPU

# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

# Create random Tensors to hold input and outputs.
x = torch.randn(N, D_in, device=device, dtype=dtype)
y = torch.randn(N, D_out, device=device, dtype=dtype)

# Create random Tensors for weights.
w1 = torch.randn(D_in, H, device=device, dtype=dtype, requires_grad=True)
w2 = torch.randn(H, D_out, device=device, dtype=dtype, requires_grad=True)

learning_rate = 1e-6
for t in range(500):
    # To apply our Function, we use Function.apply method. We alias this as 'relu'.
    relu = MyReLU.apply

    # Forward pass: compute predicted y using operations; we compute
    # ReLU using our custom autograd operation.
    y_pred = relu(x.mm(w1)).mm(w2)

    # Compute and print loss
    loss = (y_pred - y).pow(2).sum()
    print(t, loss.item())

    # Use autograd to compute the backward pass.
    loss.backward()

    # Update weights using gradient descent
    with torch.no_grad():
        w1 -= learning_rate * w1.grad
        w2 -= learning_rate * w2.grad

        # Manually zero the gradients after updating weights
        w1.grad.zero_()
        w2.grad.zero_()

0 39895596.0
1 39561284.0
2 38922060.0
3 32049182.0
4 20852250.0
5 11065561.0
6 5511743.5
7 2971464.75
8 1865241.625
9 1337737.375
10 1044394.625
11 854298.0
12 716406.4375
13 609585.125
14 523963.28125
15 453828.125
16 395357.125
17 346117.21875
18 304352.59375
19 268638.40625
20 237939.484375
21 211414.390625
22 188461.6875
23 168451.75
24 150939.5
25 135564.203125
26 122028.5546875
27 110060.4296875
28 99441.9609375
29 90003.28125
30 81583.2890625
31 74057.2265625
32 67316.0078125
33 61268.38671875
34 55833.1796875
35 50939.70703125
36 46530.44140625
37 42549.6328125
38 38945.96484375
39 35683.8515625
40 32727.09375
41 30040.345703125
42 27597.6953125
43 25374.72265625
44 23349.26171875
45 21500.994140625
46 19812.455078125
47 18269.322265625
48 16857.533203125
49 15565.3916015625
50 14380.8876953125
51 13294.462890625
52 12297.2978515625
53 11382.220703125
54 10541.37109375
55 9767.6630859375
56 9055.2373046875
57 8398.9375
58 7793.80126953125
59 7235.8037109375
60 6721.05078125
61

386 0.0003526775981299579
387 0.0003435269754845649
388 0.00033382821129634976
389 0.0003245858824811876
390 0.00031563217635266483
391 0.00030693504959344864
392 0.0002988700580317527
393 0.00028939684852957726
394 0.00028223206754773855
395 0.0002749452833086252
396 0.00026821403298527
397 0.00026096237706951797
398 0.0002545782190281898
399 0.0002482221752870828
400 0.00024159273016266525
401 0.00023519121168646961
402 0.00022947545221541077
403 0.00022380670998245478
404 0.00021836427913513035
405 0.0002128253981936723
406 0.00020770510309375823
407 0.0002032677730312571
408 0.00019823148613795638
409 0.00019358746067155153
410 0.0001890761632239446
411 0.00018460107094142586
412 0.00018018463742919266
413 0.0001760832965373993
414 0.00017216161359101534
415 0.00016830886306706816
416 0.00016420661995653063
417 0.00016042626521084458
418 0.00015707399870734662
419 0.00015343369159381837
420 0.00015008813352324069
421 0.00014692342665512115
422 0.00014394166646525264
423 0.000140961
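
A custom `backward` can be checked numerically with `torch.autograd.gradcheck`, which compares it against finite differences. The sketch below redefines the same `MyReLU` so that it is self-contained; the input shape, the seed, the 0.1 offset that moves inputs away from zero, and the tolerances are assumptions made for this illustration.

```python
import torch

class MyReLU(torch.autograd.Function):
    @staticmethod
    def forward(ctx, input):
        ctx.save_for_backward(input)
        return input.clamp(min=0)

    @staticmethod
    def backward(ctx, grad_output):
        input, = ctx.saved_tensors
        grad_input = grad_output.clone()
        grad_input[input < 0] = 0
        return grad_input

torch.manual_seed(0)
# gradcheck compares against finite differences, so use double precision
# and keep inputs away from the kink at 0, where ReLU is not differentiable.
x = torch.randn(5, 4, dtype=torch.double)
x = x + 0.1 * torch.sign(x)
x.requires_grad_(True)

ok = torch.autograd.gradcheck(MyReLU.apply, (x,), eps=1e-6, atol=1e-4)
print(ok)
```

`gradcheck` raises an error describing the mismatch if the analytical and numerical gradients disagree, so a return value of `True` means the check passed.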

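### TensorFlow: Static Graphs
PyTorch autograd looks a lot like TensorFlow: in both frameworks we define a computation graph and use automatic differentiation to compute gradients. **The biggest difference between the two is that TensorFlow's computation graphs are static, while PyTorch uses dynamic computation graphs**.

In TensorFlow, we define the computation graph once and then execute the same graph over and over, possibly feeding different input data to it. In PyTorch, each forward pass defines a new computation graph.

Static graphs are nice because you can optimize the graph up front; for example, a framework might decide to fuse some graph operations for efficiency, or come up with a strategy for distributing the graph across many GPUs or many machines. If you reuse the same graph over and over, this potentially costly up-front optimization can be amortized as the same graph is rerun many times.

One aspect where static and dynamic graphs differ is control flow. For some models we may wish to perform different computation for each data point; for example, a recurrent network might be unrolled for a different number of time steps for each data point, and this unrolling can be implemented as a loop. With a static graph the loop construct needs to be part of the graph; for this reason TensorFlow provides operators such as `tf.scan` for embedding loops into the graph. With dynamic graphs the situation is simpler: since we build the graph on the fly for each example, we can use ordinary imperative flow control to perform computation that differs for each input.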
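
To make the dynamic side of this comparison concrete, here is a small PyTorch sketch in which the number of operations depends on the input value and autograd still differentiates through whatever was executed. The function `repeated_scale`, its branching rule, and the numbers are hypothetical and chosen only for illustration.

```python
import torch

def repeated_scale(x, w):
    """Multiply x by w a data-dependent number of times."""
    steps = 1 if x.item() < 0 else 3   # ordinary Python branching on the data
    out = x
    for _ in range(steps):             # ordinary Python loop
        out = out * w
    return out, steps

w = torch.tensor(2.0, requires_grad=True)

out, steps = repeated_scale(torch.tensor(1.5), w)   # positive input: 3 steps
out.backward()
# out = x * w**3, so d(out)/dw = 3 * x * w**2 = 3 * 1.5 * 4 = 18
print(steps, out.item(), w.grad.item())
```

The graph that `backward()` walks is the one recorded during this particular call, so a different input could produce a different graph with no extra machinery.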

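Here we use TensorFlow to fit a simple two-layer network. The code uses the TensorFlow 1.x graph API (`tf.placeholder`, `tf.Session`):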

In [11]:
import tensorflow as tf
import numpy as np

# First we set up the computational graph:

# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

# Create placeholders for the input and target data; these will be filled
# with real data when we execute the graph.
x = tf.placeholder(tf.float32, shape=(None, D_in))
y = tf.placeholder(tf.float32, shape=(None, D_out))

# Create Variables for the weights and initialize them with random data.
# A TensorFlow Variable persists its value across executions of the graph.
w1 = tf.Variable(tf.random_normal((D_in, H)))
w2 = tf.Variable(tf.random_normal((H, D_out)))

# Forward pass: Compute the predicted y using operations on TensorFlow Tensors.
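# Note that this code does not actually perform any numeric operations; it
# merely sets up the computational graph that we will later execute.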
h = tf.matmul(x, w1)
h_relu = tf.maximum(h, tf.zeros(1))
y_pred = tf.matmul(h_relu, w2)

# Compute loss using operations on TensorFlow Tensors
loss = tf.reduce_sum((y - y_pred) ** 2.0)

# Compute gradient of the loss with respect to w1 and w2.
grad_w1, grad_w2 = tf.gradients(loss, [w1, w2])

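# Update the weights using gradient descent. To actually update the weights
# we need to evaluate new_w1 and new_w2 when executing the graph. Note that
# in TensorFlow the act of updating the value of the weights is part of the
# computational graph; in PyTorch this happens outside the computational
# graph.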
learning_rate = 1e-6
new_w1 = w1.assign(w1 - learning_rate * grad_w1)
new_w2 = w2.assign(w2 - learning_rate * grad_w2)

# Now we have built our computational graph, so we enter a TensorFlow session to
# actually execute the graph.
with tf.Session() as sess:
    # Run the graph once to initialize the Variables w1 and w2.
    sess.run(tf.global_variables_initializer())

    # Create numpy arrays holding the actual data for the inputs x and targets y
    x_value = np.random.randn(N, D_in)
    y_value = np.random.randn(N, D_out)
    for _ in range(500):
        # Execute the graph many times. Each time it executes we want to bind
        # x_value to x and y_value to y, specified with the feed_dict argument.
        # Each time we execute the graph we want to compute the values for loss,
        # new_w1, and new_w2; the values of these Tensors are returned as numpy
        # arrays.
        loss_value, _, _ = sess.run([loss, new_w1, new_w2],
                                    feed_dict={x: x_value, y: y_value})
        print(loss_value)


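## The nn module
### PyTorch: nn
For large neural networks, raw autograd can be a bit too low-level.

When building neural networks we frequently think of arranging the computation into layers, some of which have learnable parameters that are optimized during learning.

In TensorFlow, packages like Keras, TensorFlow-Slim, and TFLearn provide higher-level abstractions over raw computational graphs that are useful for building neural networks.

In PyTorch, the `nn` package serves this same purpose. The `nn` package defines a set of Modules, which are roughly equivalent to neural network layers. A Module receives input Tensors and computes output Tensors, but may also hold internal state such as Tensors containing learnable parameters. The `nn` package also defines a set of useful loss functions that are commonly used when training neural networks.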
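
As a quick look at the internal state a Module holds, the sketch below builds a single `torch.nn.Linear` layer and lists its learnable parameters. The layer sizes and the batch of 4 inputs are arbitrary values picked for illustration.

```python
import torch

layer = torch.nn.Linear(3, 2)   # maps 3 input features to 2 output features

# The learnable state lives in Parameters registered on the Module.
shapes = {name: tuple(p.shape) for name, p in layer.named_parameters()}
print(shapes)   # weight has shape (out_features, in_features)

out = layer(torch.randn(4, 3))  # a batch of 4 inputs
print(out.shape)
```

These Parameters have `requires_grad=True`, which is why calling `backward()` on a loss computed from the layer's output fills in their gradients.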

In this example we use the `nn` package to implement our two-layer network:

In [12]:
import torch
# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

# Create random Tensors to hold inputs and outputs
x = torch.randn(N, D_in)
y = torch.randn(N, D_out)

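# Use the nn package to define our model as a sequence of layers. nn.Sequential
# is a Module which contains other Modules, and applies them in sequence to
# produce its output. Each Linear Module computes output from input using a
# linear function, and holds internal Tensors for its weight and bias.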
model = torch.nn.Sequential(
    torch.nn.Linear(D_in, H),
    torch.nn.ReLU(),
    torch.nn.Linear(H, D_out),
)

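# The nn package also contains definitions of popular loss functions; in this
# case we will use Mean Squared Error (MSE) as our loss function. Setting
# reduction='sum' sums the squared errors instead of averaging them; it
# replaces the deprecated size_average=False.
loss_fn = torch.nn.MSELoss(reduction='sum')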

learning_rate = 1e-4
for t in range(500):
    # Forward pass: compute predicted y by passing x to the model. Module objects
    # override the __call__ operator so you can call them like functions. When
    # doing so you pass a Tensor of input data to the Module and it produces
    # a Tensor of output data.
    y_pred = model(x)

    # Compute and print loss. We pass Tensors containing the predicted and true
    # values of y, and the loss function returns a Tensor containing the
    # loss.
    loss = loss_fn(y_pred, y)
    print(t, loss.item())

    # Zero the gradients before running the backward pass.
    model.zero_grad()

    # Backward pass: compute gradient of the loss with respect to all the learnable
    # parameters of the model. Internally, the parameters of each Module are stored
    # in Tensors with requires_grad=True, so this call will compute gradients for
    # all learnable parameters in the model.
    loss.backward()

    # Update the weights using gradient descent. Each parameter is a Tensor, so
    # we can access its gradients like we did before.
    with torch.no_grad():
        for param in model.parameters():
            param -= learning_rate * param.grad

0 663.7093505859375
1 616.6390380859375
2 575.8261108398438
3 540.2291870117188
4 508.6159973144531
5 480.0029296875
6 453.5695495605469
7 429.1519775390625
8 406.4804992675781
9 385.2314758300781
10 365.10235595703125
11 346.14324951171875
12 328.2267761230469
13 311.2895812988281
14 295.1182556152344
15 279.758056640625
16 265.1029052734375
17 251.15463256835938
18 237.8461456298828
19 225.15834045410156
20 213.01055908203125
21 201.46218872070312
22 190.4517059326172
23 179.94390869140625
24 169.94737243652344
25 160.44834899902344
26 151.42315673828125
27 142.8560791015625
28 134.72528076171875
29 127.02009582519531
30 119.7266845703125
31 112.8185043334961
32 106.29139709472656
33 100.12326049804688
34 94.30625915527344
35 88.81932067871094
36 83.64681243896484
37 78.77819061279297
38 74.19483184814453
39 69.88298034667969
40 65.80712127685547
41 61.9797477722168
42 58.385005950927734
43 54.99726104736328
44 51.81507873535156
45 48.8189697265625
46 46.0037956237793
47 43.355243682

395 0.00019319410785101354
396 0.00018744324916042387
397 0.00018185966473538429
398 0.0001764492189977318
399 0.00017119462427217513
400 0.0001660970301600173
401 0.00016116323240567
402 0.00015637102478649467
403 0.00015171935956459492
404 0.00014720720355398953
405 0.0001428359973942861
406 0.0001385946525260806
407 0.00013447555829770863
408 0.0001304812467424199
409 0.0001266032923012972
410 0.0001228474429808557
411 0.00011919681128347293
412 0.00011565543536562473
413 0.00011222995817661285
414 0.00010889588884310797
415 0.00010566398850642145
416 0.00010253120126435533
417 9.949234663508832e-05
418 9.654103632783517e-05
419 9.367888560518622e-05
420 9.090009552892298e-05
421 8.820823859423399e-05
422 8.55917387525551e-05
423 8.305804658448324e-05
424 8.059834362939e-05
425 7.821070175850764e-05
426 7.58942260290496e-05
427 7.364660268649459e-05
428 7.146417192416266e-05
429 6.934795237611979e-05
430 6.729710730724037e-05
431 6.530756218126044e-05
432 6.337389640975744e-05
433 6
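
The nn examples configure `MSELoss` to sum the squared errors rather than average them, which matches the hand-written loss from the earlier sections. In PyTorch releases that accept the `reduction` argument, the two behaviours can be compared directly, as in this sketch; the tensor shapes and the seed are arbitrary choices.

```python
import torch

torch.manual_seed(0)
y_pred = torch.randn(4, 3)
y = torch.randn(4, 3)

manual_sum = (y_pred - y).pow(2).sum()
loss_sum = torch.nn.MSELoss(reduction="sum")(y_pred, y)
loss_mean = torch.nn.MSELoss(reduction="mean")(y_pred, y)

sum_matches = torch.allclose(loss_sum, manual_sum)
mean_matches = torch.allclose(loss_mean, manual_sum / y.numel())
print(sum_matches, mean_matches)
```

Because the summed loss is larger than the mean by a factor of the number of elements, the same learning rate produces correspondingly larger steps.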

### PyTorch: optim
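Up to this point we have updated the weights of our models by manually mutating the Tensors holding learnable parameters (using `torch.no_grad()` or `.data` to avoid tracking history in autograd). This is not a huge burden for simple optimization algorithms like stochastic gradient descent, but in practice we often train neural networks using more sophisticated optimizers such as AdaGrad, RMSProp, and Adam.

The `optim` package in PyTorch abstracts the idea of an optimization algorithm and provides implementations of commonly used optimization algorithms.

In this example we will use the `nn` package to define our model as before, but we will optimize the model using the Adam algorithm provided by the `optim` package: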

In [13]:
import torch

# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

# Create random Tensors to hold inputs and outputs
x = torch.randn(N, D_in)
y = torch.randn(N, D_out)

# Use the nn package to define our model and loss function.
model = torch.nn.Sequential(
    torch.nn.Linear(D_in, H),
    torch.nn.ReLU(),
    torch.nn.Linear(H, D_out),
)
loss_fn = torch.nn.MSELoss(reduction='sum')

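# Use the optim package to define an Optimizer that will update the weights of
# the model for us. The first argument to the Adam constructor tells the
# optimizer which Tensors it should update.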
learning_rate = 1e-4
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)
for t in range(500):
    # Forward pass: compute predicted y by passing x to the model.
    y_pred = model(x)

    # Compute and print loss.
    loss = loss_fn(y_pred, y)
    print(t, loss.item())

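    # Before the backward pass, use the optimizer object to zero all of the
    # gradients for the variables it will update (which are the learnable
    # weights of the model). This is because by default, gradients are
    # accumulated in buffers (i.e., not overwritten) whenever .backward()
    # is called. Check out the docs of torch.autograd.backward for more details.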
    optimizer.zero_grad()

    # Backward pass: compute gradient of the loss with respect to model
    # parameters
    loss.backward()

    # Calling the step function on an Optimizer makes an update to its
    # parameters
    optimizer.step()

0 671.9166259765625
1 654.60009765625
2 637.75439453125
3 621.3634643554688
4 605.4441528320312
5 589.977294921875
6 574.9342651367188
7 560.2990112304688
8 546.0906372070312
9 532.3499145507812
10 518.9791259765625
11 506.02301025390625
12 493.48394775390625
13 481.25823974609375
14 469.35888671875
15 457.7319641113281
16 446.43182373046875
17 435.437744140625
18 424.71026611328125
19 414.2809143066406
20 404.12603759765625
21 394.1982421875
22 384.52618408203125
23 375.0975036621094
24 365.90460205078125
25 357.01080322265625
26 348.3647766113281
27 339.9387512207031
28 331.73065185546875
29 323.6925964355469
30 315.8161315917969
31 308.1129150390625
32 300.631103515625
33 293.3351745605469
34 286.2066345214844
35 279.20635986328125
36 272.3838806152344
37 265.7102355957031
38 259.1704406738281
39 252.76783752441406
40 246.49026489257812
41 240.37808227539062
42 234.39483642578125
43 228.5413818359375
44 222.79962158203125
45 217.15980529785156
46 211.62904357910156
47 206.2116851806

357 7.397195440717041e-05
358 6.927813228685409e-05
359 6.487496284535155e-05
360 6.0745940572815016e-05
361 5.686932854587212e-05
362 5.3234656661516055e-05
363 4.982754762750119e-05
364 4.663382424041629e-05
365 4.3634117901092395e-05
366 4.0825383621267974e-05
367 3.8191734347492456e-05
368 3.572478090063669e-05
369 3.341143383295275e-05
370 3.124417344224639e-05
371 2.921352097473573e-05
372 2.7310734367347322e-05
373 2.55310569627909e-05
374 2.3862503439886495e-05
375 2.2299709598883055e-05
376 2.0835821487708017e-05
377 1.946713382494636e-05
378 1.8184349755756557e-05
379 1.698630876489915e-05
380 1.5862740838201717e-05
381 1.4813365851296112e-05
382 1.383057860948611e-05
383 1.2912205420434475e-05
384 1.2052719284838531e-05
385 1.124723712564446e-05
386 1.0495934475329705e-05
387 9.79468495643232e-06
388 9.137833330896683e-06
389 8.522112693754025e-06
390 7.949604878376704e-06
391 7.412737886625109e-06
392 6.910810043336824e-06
393 6.443691290769493e-06
394 6.0067459344281815e-0
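
To connect the optimizer back to the manual updates used earlier, the sketch below checks that one step of plain `torch.optim.SGD` (no momentum, no weight decay) gives the same result as `param -= learning_rate * param.grad`. The small model, the data, and the learning rate are illustrative assumptions; Adam and other optimizers apply different update rules.

```python
import torch

torch.manual_seed(0)
model = torch.nn.Linear(3, 1)
x, y = torch.randn(8, 3), torch.randn(8, 1)
lr = 1e-2

optimizer = torch.optim.SGD(model.parameters(), lr=lr)
loss = (model(x) - y).pow(2).sum()
optimizer.zero_grad()
loss.backward()

# What a manual update would produce, computed before the optimizer steps.
expected = [p.detach() - lr * p.grad for p in model.parameters()]

optimizer.step()
matches = all(torch.allclose(p.detach(), e)
              for p, e in zip(model.parameters(), expected))
print(matches)
```

Swapping in a different optimizer only changes the constructor line; the `zero_grad()`, `backward()`, `step()` sequence stays the same.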

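### PyTorch: Custom nn Modules
Sometimes you will want to specify models that are more complex than a sequence of existing Modules; for these cases you can define your own Modules by subclassing `nn.Module` and defining a `forward` function that receives input Tensors and produces output Tensors using other Modules or other autograd operations on Tensors.

In this example we implement our two-layer network as a custom Module subclass: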

In [14]:
import torch


class TwoLayerNet(torch.nn.Module):
    def __init__(self, D_in, H, D_out):
        """
        In the constructor we instantiate two nn.Linear modules and assign them as
        member variables.
        """
        super(TwoLayerNet, self).__init__()
        self.linear1 = torch.nn.Linear(D_in, H)
        self.linear2 = torch.nn.Linear(H, D_out)

    def forward(self, x):
        """
        In the forward function we accept a Tensor of input data and we must return
        a Tensor of output data. We can use Modules defined in the constructor as
        well as arbitrary operators on Tensors.
        """
        h_relu = self.linear1(x).clamp(min=0)
        y_pred = self.linear2(h_relu)
        return y_pred


# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

# Create random Tensors to hold inputs and outputs
x = torch.randn(N, D_in)
y = torch.randn(N, D_out)

# Construct our model by instantiating the class defined above
model = TwoLayerNet(D_in, H, D_out)

# Construct our loss function and an Optimizer. The call to model.parameters()
# in the SGD constructor will contain the learnable parameters of the two
# nn.Linear modules which are members of the model.
criterion = torch.nn.MSELoss(reduction='sum')
optimizer = torch.optim.SGD(model.parameters(), lr=1e-4)
for t in range(500):
    # Forward pass: Compute predicted y by passing x to the model
    y_pred = model(x)

    # Compute and print loss
    loss = criterion(y_pred, y)
    print(t, loss.item())

    # Zero gradients, perform a backward pass, and update the weights.
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

0 662.81201171875
1 611.7794189453125
2 568.3722534179688
3 530.99072265625
4 497.9374084472656
5 468.3583679199219
6 441.49468994140625
7 416.9985046386719
8 394.4599609375
9 373.40081787109375
10 353.7264099121094
11 335.3880310058594
12 318.1219177246094
13 301.7886962890625
14 286.3387756347656
15 271.5947570800781
16 257.5471496582031
17 244.22637939453125
18 231.53509521484375
19 219.38796997070312
20 207.81436157226562
21 196.79275512695312
22 186.24588012695312
23 176.19705200195312
24 166.61752319335938
25 157.51490783691406
26 148.8199920654297
27 140.50662231445312
28 132.61431884765625
29 125.11134338378906
30 117.99544525146484
31 111.2489013671875
32 104.8473892211914
33 98.78971099853516
34 93.05940246582031
35 87.62672424316406
36 82.50284576416016
37 77.66847229003906
38 73.11614227294922
39 68.83349609375
40 64.80198669433594
41 61.003108978271484
42 57.42695999145508
43 54.0562744140625
44 50.8875617980957
45 47.91623306274414
46 45.1260871887207
47 42.50768280029297

376 0.0003788210451602936
377 0.0003688350843731314
378 0.00035910660517401993
379 0.0003496432036627084
380 0.0003404446179047227
381 0.0003314804343972355
382 0.00032276453566737473
383 0.00031429401133209467
384 0.0003061079769395292
385 0.0002981401630677283
386 0.00029039097717031837
387 0.0002828491269610822
388 0.0002755024179350585
389 0.0002683513448573649
390 0.000261396897258237
391 0.0002546270261518657
392 0.0002480343682691455
393 0.00024161914188880473
394 0.00023537616652902216
395 0.00022929877741262317
396 0.00022337758855428547
397 0.0002176160633098334
398 0.00021201041818130761
399 0.0002065538428723812
400 0.00020123583090025932
401 0.00019606013665907085
402 0.00019101949874311686
403 0.00018611511040944606
404 0.00018133936100639403
405 0.00017668567306827754
406 0.0001721588778309524
407 0.00016774525283835828
408 0.00016345623589586467
409 0.0001592708722455427
410 0.00015520087617915124
411 0.0001512452436145395
412 0.00014737623860128224
413 0.00014361352077
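
Assigning Modules as attributes in the constructor is what makes their parameters visible to `model.parameters()`. The sketch below repeats a compact version of `TwoLayerNet` so that it is self-contained, then lists the registered parameter names and counts the learnable values; the compact `forward` and the variable names `names` and `total` are choices made for this illustration.

```python
import torch

class TwoLayerNet(torch.nn.Module):
    def __init__(self, D_in, H, D_out):
        super().__init__()
        self.linear1 = torch.nn.Linear(D_in, H)
        self.linear2 = torch.nn.Linear(H, D_out)

    def forward(self, x):
        return self.linear2(self.linear1(x).clamp(min=0))

D_in, H, D_out = 1000, 100, 10
model = TwoLayerNet(D_in, H, D_out)

names = [name for name, _ in model.named_parameters()]
total = sum(p.numel() for p in model.parameters())
print(names)
print(total)   # weights plus biases of both Linear layers
```

The total is `D_in*H + H + H*D_out + D_out`, which is 101110 for these sizes.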

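### PyTorch: Control Flow + Weight Sharing
As an example of dynamic graphs and weight sharing, we implement a very strange model: a fully connected ReLU network that on each forward pass chooses a random number between 1 and 4 and uses that many hidden layers, reusing the same weights multiple times to compute the innermost hidden layers. In the code below this is one input layer followed by 0 to 3 applications of a shared middle layer.

For this model we can use normal Python flow control to implement the loop, and we can implement weight sharing among the innermost layers by simply reusing the same Module multiple times when defining the forward pass.

We can easily implement this model as a Module subclass: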

In [15]:
import random
import torch


class DynamicNet(torch.nn.Module):
    def __init__(self, D_in, H, D_out):
        """
        In the constructor we construct three nn.Linear instances that we will use
        in the forward pass.
        """
        super(DynamicNet, self).__init__()
        self.input_linear = torch.nn.Linear(D_in, H)
        self.middle_linear = torch.nn.Linear(H, H)
        self.output_linear = torch.nn.Linear(H, D_out)

    def forward(self, x):
        """
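        For the forward pass of the model, we randomly choose either 0, 1, 2, or 3
        and reuse the middle_linear Module that many times to compute hidden layer
        representations.

        Since each forward pass builds a dynamic computation graph, we can use normal
        Python control-flow operators like loops or conditional statements when
        defining the forward pass of the model.

        Here we also see that it is perfectly safe to reuse the same Module many
        times when defining a computational graph. This is a big improvement from Lua
        Torch, where each Module could be used only once.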
        """
        h_relu = self.input_linear(x).clamp(min=0)
        for _ in range(random.randint(0, 3)):
            h_relu = self.middle_linear(h_relu).clamp(min=0)
        y_pred = self.output_linear(h_relu)
        return y_pred


# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

# Create random Tensors to hold inputs and outputs
x = torch.randn(N, D_in)
y = torch.randn(N, D_out)

# Construct our model by instantiating the class defined above
model = DynamicNet(D_in, H, D_out)

# Construct our loss function and an Optimizer. Training this strange model with
# vanilla stochastic gradient descent is tough, so we use momentum
criterion = torch.nn.MSELoss(reduction='sum')
optimizer = torch.optim.SGD(model.parameters(), lr=1e-4, momentum=0.9)
for t in range(500):
    # Forward pass: Compute predicted y by passing x to the model
    y_pred = model(x)

    # Compute and print loss
    loss = criterion(y_pred, y)
    print(t, loss.item())

    # Zero gradients, perform a backward pass, and update the weights.
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

0 659.8651123046875
1 651.5489501953125
2 658.4490356445312
3 655.7042236328125
4 613.88037109375
5 648.2877197265625
6 652.7698974609375
7 640.0823974609375
8 635.5528564453125
9 561.7874145507812
10 675.159423828125
11 626.7305908203125
12 546.9501953125
13 641.5557861328125
14 618.0827026367188
15 527.4638061523438
16 637.4470825195312
17 635.2836303710938
18 632.4288940429688
19 598.169189453125
20 209.58197021484375
21 478.1297607421875
22 457.5307312011719
23 607.3297119140625
24 598.9736328125
25 136.50843811035156
26 127.12139892578125
27 506.7463073730469
28 489.5917663574219
29 539.0492553710938
30 96.84986114501953
31 87.17533111572266
32 403.6473083496094
33 460.5151672363281
34 59.448455810546875
35 52.188270568847656
36 225.99461364746094
37 206.72984313964844
38 181.8202667236328
39 268.3722839355469
40 243.5713348388672
41 60.00242614746094
42 279.6679382324219
43 62.45329284667969
44 164.39317321777344
45 55.12975311279297
46 83.1775894165039
47 75.76084899902344
48 11

371 9.200246810913086
372 0.3190368711948395
373 0.5158654451370239
374 8.967663764953613
375 6.525460720062256
376 0.7959885597229004
377 1.2791703939437866
378 4.193874835968018
379 4.022400856018066
380 0.9740561842918396
381 0.9067827463150024
382 1.0042133331298828
383 9.341996192932129
384 6.082249164581299
385 0.7155448794364929
386 2.4169487953186035
387 3.117792844772339
388 2.819190740585327
389 1.6765252351760864
390 4.409648895263672
391 6.913157939910889
392 3.5174758434295654
393 5.776248455047607
394 8.6685791015625
395 7.299280166625977
396 1.2427598237991333
397 10.755184173583984
398 6.063807964324951
399 3.4013984203338623
400 19.627605438232422
401 13.113521575927734
402 1.8217564821243286
403 7.831426620483398
404 6.63269567489624
405 3.5836942195892334
406 2.242685556411743
407 1.220637559890747
408 3.590414047241211
409 1.772204041481018
410 2.306589365005493
411 1.3408851623535156
412 1.732959270477295
413 1.1590219736099243
414 1.7966840267181396
415 1.16374528
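
Weight sharing can be seen in isolation with a one-weight layer applied twice. In this sketch the layer `shared`, its fixed weight of 3, and the input value 2 are all hypothetical values picked so the arithmetic is easy to follow.

```python
import torch

shared = torch.nn.Linear(1, 1, bias=False)
with torch.no_grad():
    shared.weight.fill_(3.0)        # fix the single weight w = 3 for the demo

x = torch.tensor([[2.0]])
out = shared(shared(x))             # same Module applied twice: out = w * w * x
out.sum().backward()

# d(out)/dw = 2 * w * x = 2 * 3 * 2 = 12; both uses contribute to one .grad
print(out.item(), shared.weight.grad.item())
print(len(list(shared.parameters())))
```

Both applications of the layer read the same Parameter, so autograd sums their contributions into a single gradient and the optimizer sees one weight, not two.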