This tutorial introduces the fundamental concepts of PyTorch through self-contained examples.

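At its core, PyTorch provides two main features:
1. An n-dimensional tensor, similar to a numpy array, that can also run on GPUs.
2. Automatic differentiation for building and training neural networks.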
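
As a rough illustration of the two features listed above, the following minimal sketch does numpy-style arithmetic on a tensor and then lets autograd differentiate a small scalar expression. The tensor values and variable names are made up for this example.

```python
import torch

# Feature 1: an n-dimensional tensor with numpy-like arithmetic.
a = torch.tensor([[1.0, 2.0], [3.0, 4.0]])
b = a * 2 + 1          # element-wise, just like a numpy array

# Feature 2: automatic differentiation.
w = torch.tensor(3.0, requires_grad=True)
loss = (2.0 * w - 1.0) ** 2
loss.backward()        # fills w.grad with d(loss)/dw = 4 * (2w - 1)

print(b)
print(w.grad)          # 4 * (2 * 3 - 1) = 20
```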

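We use a fully connected ReLU network as our running example. The network has a single hidden layer, and its loss is the squared Euclidean distance between the network output and the true output. We train it with gradient descent so that the loss decreases.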
## Tensors
### Warm-up: numpy
Before introducing PyTorch, let us first build the network with numpy.

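NumPy provides an n-dimensional array object and many functions for manipulating these arrays. It is a generic framework for scientific computing and knows nothing about neural networks, deep learning, or computational graphs. Even so, we can easily use NumPy to build a two-layer network and fit it to random data by implementing the forward and backward passes by hand.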
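
For reference, the gradients that a hand-written backward pass has to compute follow from the chain rule. The sketch below assumes the bias-free network $h = x W_1$, $\hat{y} = \max(h, 0)\, W_2$ with a sum-of-squares loss. The symbols $W_1$, $W_2$, $\hat{y}$ and $\eta$ (the learning rate) are notation introduced here for illustration.

$$
\begin{aligned}
L &= \sum_{n,k} (\hat{y}_{nk} - y_{nk})^2, \\
\frac{\partial L}{\partial \hat{y}} &= 2\,(\hat{y} - y), \\
\frac{\partial L}{\partial W_2} &= \max(h, 0)^{\top}\, \frac{\partial L}{\partial \hat{y}}, \\
\frac{\partial L}{\partial h} &= \left( \frac{\partial L}{\partial \hat{y}}\, W_2^{\top} \right) \odot \mathbb{1}[h \ge 0], \\
\frac{\partial L}{\partial W_1} &= x^{\top}\, \frac{\partial L}{\partial h}, \\
W_i &\leftarrow W_i - \eta\, \frac{\partial L}{\partial W_i}, \quad i \in \{1, 2\}.
\end{aligned}
$$

The indicator uses $h \ge 0$ because the code in this tutorial zeroes the gradient only where $h < 0$; the value of the ReLU derivative at exactly $0$ is a convention.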

In [20]:
# -*- coding: utf-8 -*-
import numpy as np

# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

# Create random input and output data
x = np.random.randn(N, D_in)
y = np.random.randn(N, D_out)

# Randomly initialize weights
w1 = np.random.randn(D_in, H)
w2 = np.random.randn(H, D_out)

learning_rate = 1e-6
for t in range(500):
    # Forward pass: compute predicted y
    h = x.dot(w1)
    h_relu = np.maximum(h, 0)
    y_pred = h_relu.dot(w2)

    # Compute and print loss
    loss = np.square(y_pred - y).sum()
    print(t, loss)

    # Backprop to compute gradients of w1 and w2 with respect to loss
    grad_y_pred = 2.0 * (y_pred - y)
    grad_w2 = h_relu.T.dot(grad_y_pred)
    grad_h_relu = grad_y_pred.dot(w2.T)
    grad_h = grad_h_relu.copy()
    grad_h[h < 0] = 0
    grad_w1 = x.T.dot(grad_h)

    # Update weights
    w1 -= learning_rate * grad_w1
    w2 -= learning_rate * grad_w2

0 20906465.065902412
1 14227937.79717914
2 10776258.605970811
3 8715497.928467307
4 7326177.700569801
5 6244811.466690807
6 5338562.848111852
7 4519270.828561199
8 3789650.607038034
9 3131884.1773236627
10 2565368.5417481503
11 2079659.3818183597
12 1678938.2594843348
13 1349330.9844425316
14 1085176.3536188398
15 873268.5190502892
16 706117.4939495663
17 573558.6432866682
18 469091.73450063297
19 386021.37920733157
20 320292.4210825541
21 267612.51662272
22 225264.57284129082
23 190910.60456121957
24 162955.28242505953
25 139989.60327322074
26 121037.23006386695
27 105257.23907174717
28 92047.43109921299
29 80917.36386264175
30 71469.57758106489
31 63399.14696900124
32 56463.52130649614
33 50473.89576143497
34 45272.98972120971
35 40730.55121319842
36 36747.067349677374
37 33237.49780646419
38 30137.038638655053
39 27382.7808194434
40 24930.468567983255
41 22739.692550784992
42 20776.761326666347
43 19014.21001272743
44 17427.396935974866
45 15994.31502691857
46 14698.197203074109
47 
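
A hand-written backward pass is easy to get subtly wrong, so it can help to compare it against finite differences. The following is a minimal sketch of such a check for `w1`, assuming much smaller layer sizes than the training run above so that it stays fast. The helper names `loss_fn`, `analytic_grad_w1`, and `numeric_grad_w1` are hypothetical and exist only in this example.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D_in, H, D_out = 4, 5, 3, 2          # tiny sizes keep the check fast
x = rng.standard_normal((N, D_in))
y = rng.standard_normal((N, D_out))
w1 = rng.standard_normal((D_in, H))
w2 = rng.standard_normal((H, D_out))


def loss_fn(w1, w2):
    """Sum-of-squares loss of the two-layer ReLU network."""
    y_pred = np.maximum(x.dot(w1), 0).dot(w2)
    return np.square(y_pred - y).sum()


def analytic_grad_w1(w1, w2):
    """Hand-written backward pass for w1, same steps as the training loop."""
    h = x.dot(w1)
    grad_y_pred = 2.0 * (np.maximum(h, 0).dot(w2) - y)
    grad_h = grad_y_pred.dot(w2.T)
    grad_h[h < 0] = 0
    return x.T.dot(grad_h)


def numeric_grad_w1(w1, w2, eps=1e-6):
    """Central finite differences, perturbing one weight at a time."""
    grad = np.zeros_like(w1)
    for idx in np.ndindex(*w1.shape):
        w_plus, w_minus = w1.copy(), w1.copy()
        w_plus[idx] += eps
        w_minus[idx] -= eps
        grad[idx] = (loss_fn(w_plus, w2) - loss_fn(w_minus, w2)) / (2 * eps)
    return grad


max_abs_diff = np.abs(analytic_grad_w1(w1, w2) - numeric_grad_w1(w1, w2)).max()
print(max_abs_diff)   # should be tiny if the backward pass is correct
```

The same idea applies to `w2`. The check can fail spuriously if a hidden activation happens to sit within `eps` of the ReLU kink at zero, which is unlikely for random data.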

### PyTorch: Tensors
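NumPy is a great framework, but it cannot use GPUs to accelerate its numerical computations. For modern deep neural networks, GPUs often provide speedups of <font color=red>50x</font> or more.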

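We start with the most fundamental PyTorch concept: the Tensor. A Tensor is conceptually very similar to a numpy array. Unlike a numpy array, however, a PyTorch Tensor can keep track of a computational graph and gradients. It is also useful as a generic tool for scientific computing, just like numpy.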
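
To make the similarity concrete, here is a small sketch of moving data between numpy and PyTorch on the CPU. It relies on `torch.from_numpy` and `Tensor.numpy`, which share memory with the source array rather than copying it. The array contents are arbitrary.

```python
import numpy as np
import torch

a = np.ones((2, 3))            # float64 numpy array
t = torch.from_numpy(a)        # Tensor backed by the same memory (CPU only)
t.mul_(2)                      # an in-place op on the Tensor...
print(a)                       # ...is visible through the numpy array

s = t.mm(t.t()).sum()          # numpy-style math works on Tensors too
back = t.numpy()               # and back again, still without copying
```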

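Also unlike numpy arrays, Tensors can use GPUs to accelerate their computations. To run a Tensor on a GPU, you only need to create it on, or move it to, the appropriate device.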
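
A common way to write device-agnostic code is sketched below. It assumes you want to fall back to the CPU when no CUDA-capable GPU is present; the tensor shapes simply mirror the sizes used in this tutorial.

```python
import torch

# Fall back to the CPU when no CUDA-capable GPU is available.
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

x = torch.randn(64, 1000, device=device)   # created directly on the device
w = torch.randn(1000, 100).to(device)      # created on the CPU, then moved
h = x.mm(w)                                # runs on the device that holds x and w

print(h.device, h.shape)
```

Both operands of an operation must live on the same device, which is why every tensor in the examples that follow is created with the same `device` argument.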

Below we use PyTorch Tensors to fit a two-layer network to random data. As in the numpy example above, we implement the forward and backward passes of the network by hand.

In [21]:
# -*- coding: utf-8 -*-

import torch


dtype = torch.float
device = torch.device("cpu")
# device = torch.device("cuda:0") # Uncomment this to run on GPU

# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

# Create random input and output data
x = torch.randn(N, D_in, device=device, dtype=dtype)
y = torch.randn(N, D_out, device=device, dtype=dtype)

# Randomly initialize weights
w1 = torch.randn(D_in, H, device=device, dtype=dtype)
w2 = torch.randn(H, D_out, device=device, dtype=dtype)

learning_rate = 1e-6
for t in range(500):
    # Forward pass: compute predicted y
    h = x.mm(w1)
    h_relu = h.clamp(min=0)
    y_pred = h_relu.mm(w2)

    # Compute and print loss
    loss = (y_pred - y).pow(2).sum().item()
    print(t, loss)

    # Backprop to compute gradients of w1 and w2 with respect to loss
    grad_y_pred = 2.0 * (y_pred - y)
    grad_w2 = h_relu.t().mm(grad_y_pred)
    grad_h_relu = grad_y_pred.mm(w2.t())
    grad_h = grad_h_relu.clone()
    grad_h[h < 0] = 0
    grad_w1 = x.t().mm(grad_h)

    # Update weights using gradient descent
    w1 -= learning_rate * grad_w1
    w2 -= learning_rate * grad_w2

0 30779596.0
1 27251474.0
2 26100854.0
3 23798004.0
4 19198232.0
5 13468350.0
6 8454856.0
7 5049365.5
8 3055320.25
9 1964674.5
10 1366129.25
11 1021370.4375
12 806607.25
13 661185.5
14 555262.1875
15 473572.65625
16 408196.46875
17 354524.9375
18 309604.71875
19 271654.3125
20 239324.203125
21 211569.96875
22 187632.890625
23 166881.953125
24 148847.15625
25 133083.625
26 119259.546875
27 107094.75
28 96361.9140625
29 86866.2890625
30 78442.3359375
31 70952.5
32 64273.484375
33 58311.578125
34 52977.98828125
35 48193.3046875
36 43896.03125
37 40030.76953125
38 36545.12109375
39 33397.6015625
40 30550.927734375
41 27973.244140625
42 25636.462890625
43 23517.927734375
44 21592.197265625
45 19840.453125
46 18244.6328125
47 16790.140625
48 15462.1826171875
49 14249.080078125
50 13138.740234375
51 12122.349609375
52 11191.2431640625
53 10337.3798828125
54 9553.873046875
55 8834.40625
56 8173.2314453125
57 7565.3603515625
58 7005.90576171875
59 6490.7548828125
60 6016.32763671875
61 5578.920

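## Automatic differentiation (Autograd)
### PyTorch: Tensors and autograd
In the examples above, we implemented both the forward and backward passes of the network by hand. That is not hard for a simple two-layer network, but implementing the backward pass by hand quickly becomes tedious as the network grows more complex.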

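Fortunately, PyTorch supports <font color=red>automatic differentiation</font>, which automates the computation of the backward pass of a neural network. The **autograd** package provides exactly this functionality. When using autograd, the forward pass of the network defines a **computational graph**: the nodes of the graph are Tensors, and the edges are functions that produce output Tensors from input Tensors. Backpropagating through this graph then makes it easy to compute gradients.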
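
The graph described above can be inspected directly. The sketch below uses the `is_leaf` and `grad_fn` attributes of a Tensor; the shapes and values are arbitrary choices for illustration.

```python
import torch

x = torch.ones(2, 3)                         # plain data, no gradient needed
w = torch.randn(3, 4, requires_grad=True)    # leaf node we want gradients for

h = x.mm(w)                                  # edge: matrix multiply
out = h.clamp(min=0).sum()                   # edges: clamp, then sum

print(w.is_leaf, w.grad_fn)    # a user-created leaf has no grad_fn
print(h.is_leaf, h.grad_fn)    # h was produced by an op, so it records one
print(out.requires_grad)       # out depends on w, so it requires grad too
```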

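This sounds complicated, but it is simple to use in practice. Each Tensor represents a node in the computational graph. If a Tensor x has <font color=red>x.requires_grad=True</font>, then after a backward pass <font color=red>x.grad</font> is another Tensor holding the gradient of some scalar value with respect to x.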
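
A minimal sketch of how `x.grad` behaves, using a made-up three-element Tensor. It also shows that repeated backward passes add to `x.grad` instead of overwriting it, which is why training loops reset the gradients on every iteration.

```python
import torch

x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)

(x ** 2).sum().backward()
first = x.grad.clone()        # d/dx sum(x^2) = 2x

(x ** 2).sum().backward()     # a second backward pass adds to x.grad
second = x.grad.clone()

x.grad.zero_()                # reset before the next iteration

print(first, second, x.grad)
```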

Below we use autograd to implement our two-layer network. We no longer need to implement the backward pass by hand:

In [22]:
# -*- coding: utf-8 -*-
import torch

dtype = torch.float
device = torch.device("cpu")
# device = torch.device("cuda:0") # Uncomment this to run on GPU

# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

# Create random Tensors to hold input and outputs.
# Setting requires_grad=False indicates that we do not need to compute gradients
# with respect to these Tensors during the backward pass.
x = torch.randn(N, D_in, device=device, dtype=dtype)
y = torch.randn(N, D_out, device=device, dtype=dtype)

# Create random Tensors for weights.
# Setting requires_grad=True indicates that we want to compute gradients with
# respect to these Tensors during the backward pass.
w1 = torch.randn(D_in, H, device=device, dtype=dtype, requires_grad=True)
w2 = torch.randn(H, D_out, device=device, dtype=dtype, requires_grad=True)

learning_rate = 1e-6
for t in range(500):
    # Forward pass: compute predicted y using operations on Tensors; these
    # are exactly the same operations we used to compute the forward pass using
    # Tensors, but we do not need to keep references to intermediate values since
    # we are not implementing the backward pass by hand.
    y_pred = x.mm(w1).clamp(min=0).mm(w2)

    # Compute and print loss using operations on Tensors.
    # Now loss is a 0-dimensional Tensor (shape ()).
    # loss.item() gets the scalar value held in the loss.
    loss = (y_pred - y).pow(2).sum()
    print(t, loss.item())

    # Use autograd to compute the backward pass. This call will compute the
    # gradient of loss with respect to all Tensors with requires_grad=True.
    # After this call w1.grad and w2.grad will be Tensors holding the gradient
    # of the loss with respect to w1 and w2 respectively.
    loss.backward()

    # Manually update weights using gradient descent. Wrap in torch.no_grad()
    # because weights have requires_grad=True, but we don't need to track this
    # in autograd.
    # An alternative way is to operate on weight.data and weight.grad.data.
    # Recall that tensor.data gives a tensor that shares the storage with
    # tensor, but doesn't track history.
    # You can also use torch.optim.SGD to achieve this.
    with torch.no_grad():
        w1 -= learning_rate * w1.grad
        w2 -= learning_rate * w2.grad

        # Manually zero the gradients after updating weights
        w1.grad.zero_()
        w2.grad.zero_()

0 25804096.0
1 19378746.0
2 16154960.0
3 13881237.0
4 11752503.0
5 9545616.0
6 7421660.0
7 5542492.5
8 4041871.25
9 2913414.5
10 2107581.25
11 1543397.25
12 1153300.625
13 881651.25
14 690392.0
15 552860.6875
16 451855.71875
17 375814.3125
18 317135.375
19 270785.125
20 233423.046875
21 202800.546875
22 177331.640625
23 155890.078125
24 137675.921875
25 122085.078125
26 108616.75
27 96935.640625
28 86752.71875
29 77817.8125
30 69957.03125
31 63017.2890625
32 56872.61328125
33 51418.1328125
34 46565.40234375
35 42237.71484375
36 38367.578125
37 34900.453125
38 31789.271484375
39 28992.724609375
40 26473.66796875
41 24201.486328125
42 22148.763671875
43 20292.080078125
44 18610.203125
45 17083.845703125
46 15697.3349609375
47 14436.638671875
48 13288.7509765625
49 12242.103515625
50 11287.037109375
51 10415.0
52 9616.962890625
53 8886.75
54 8217.5009765625
55 7603.8896484375
56 7040.86669921875
57 6523.7568359375
58 6048.47265625
59 5611.1220703125
60 5208.376953125
61 4837.390625
62 449

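### PyTorch: Defining new autograd functions
Under the hood, each primitive autograd operator (the ones PyTorch itself defines) is really two functions that operate on Tensors.
The forward function computes output Tensors from input Tensors. The backward function receives the gradient of some scalar value with respect to the output Tensors and computes the gradient of that same scalar value with respect to the input Tensors.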

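In PyTorch, we can define our own autograd operator by subclassing <font color=red>torch.autograd.Function</font> and implementing its <font color=red>forward</font> and <font color=red>backward</font> methods. We then use the new operator by calling its `apply` method on Tensors that hold the input data, just as we would call a function.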
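
Before the ReLU example, here is a smaller sketch of the same pattern. `Square` is a hypothetical operator invented for this illustration; `torch.autograd.gradcheck` then compares its `backward` against finite differences, which requires double-precision inputs.

```python
import torch


class Square(torch.autograd.Function):
    """Hypothetical example operator: y = x * x."""

    @staticmethod
    def forward(ctx, inp):
        ctx.save_for_backward(inp)       # keep the input for the backward pass
        return inp * inp

    @staticmethod
    def backward(ctx, grad_output):
        (inp,) = ctx.saved_tensors
        return grad_output * 2 * inp     # chain rule: dL/dx = dL/dy * 2x


x = torch.tensor([1.0, -2.0, 3.0], dtype=torch.double, requires_grad=True)
Square.apply(x).sum().backward()
print(x.grad)                            # 2x

# gradcheck compares backward() against numerical gradients.
x_check = torch.randn(5, dtype=torch.double, requires_grad=True)
ok = torch.autograd.gradcheck(Square.apply, (x_check,))
print(ok)
```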

In the example below, we define a custom autograd function that performs the ReLU nonlinearity and use it to implement our two-layer network:

In [23]:
# -*- coding: utf-8 -*-
import torch


class MyReLU(torch.autograd.Function):
    """
    We can implement our own custom autograd Functions by subclassing
    torch.autograd.Function and implementing the forward and backward passes
    which operate on Tensors.
    """

    @staticmethod
    def forward(ctx, input):
        """
        In the forward pass we receive a Tensor containing the input and return
        a Tensor containing the output. ctx is a context object that can be used
        to stash information for backward computation. You can cache Tensors
        for use in the backward pass using the ctx.save_for_backward method.
        """
        ctx.save_for_backward(input)
        return input.clamp(min=0)

    @staticmethod
    def backward(ctx, grad_output):
        """
        In the backward pass we receive a Tensor containing the gradient of the loss
        with respect to the output, and we need to compute the gradient of the loss
        with respect to the input.
        """
        input, = ctx.saved_tensors
        grad_input = grad_output.clone()
        grad_input[input < 0] = 0
        return grad_input


dtype = torch.float
device = torch.device("cpu")
# device = torch.device("cuda:0") # Uncomment this to run on GPU

# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

# Create random Tensors to hold input and outputs.
x = torch.randn(N, D_in, device=device, dtype=dtype)
y = torch.randn(N, D_out, device=device, dtype=dtype)

# Create random Tensors for weights.
w1 = torch.randn(D_in, H, device=device, dtype=dtype, requires_grad=True)
w2 = torch.randn(H, D_out, device=device, dtype=dtype, requires_grad=True)

learning_rate = 1e-6
for t in range(500):
    # To apply our Function, we use Function.apply method. We alias this as 'relu'.
    relu = MyReLU.apply

    # Forward pass: compute predicted y using operations; we compute
    # ReLU using our custom autograd operation.
    y_pred = relu(x.mm(w1)).mm(w2)

    # Compute and print loss
    loss = (y_pred - y).pow(2).sum()
    print(t, loss.item())

    # Use autograd to compute the backward pass.
    loss.backward()

    # Update weights using gradient descent
    with torch.no_grad():
        w1 -= learning_rate * w1.grad
        w2 -= learning_rate * w2.grad

        # Manually zero the gradients after updating weights
        w1.grad.zero_()
        w2.grad.zero_()

0 40182000.0
1 44604100.0
2 51147420.0
3 47243488.0
4 30326618.0
5 13479520.0
6 5230170.0
7 2465083.25
8 1557415.25
9 1172155.25
10 947315.8125
11 787694.25
12 663797.6875
13 564145.3125
14 482571.0625
15 415119.125
16 358957.59375
17 311837.15625
18 272020.0625
19 238185.671875
20 209342.6875
21 184634.78125
22 163334.484375
23 144912.5
24 128923.9296875
25 114992.3671875
26 102820.8984375
27 92155.2421875
28 82782.4140625
29 74516.4296875
30 67208.9609375
31 60732.9375
32 54978.046875
33 49855.94921875
34 45287.6328125
35 41198.71875
36 37534.20703125
37 34244.05859375
38 31286.48046875
39 28620.798828125
40 26214.890625
41 24043.93359375
42 22077.896484375
43 20295.58984375
44 18677.83984375
45 17207.4296875
46 15870.03125
47 14651.9013671875
48 13540.2099609375
49 12524.3505859375
50 11595.248046875
51 10745.1201171875
52 9967.275390625
53 9253.1171875
54 8597.046875
55 7993.78662109375
56 7438.693359375
57 6927.36962890625
58 6455.8466796875
59 6020.46337890625
60 5618.2783203125


383 0.007019145414233208
384 0.006793208420276642
385 0.0065741403959691525
386 0.006370732560753822
387 0.006168598309159279
388 0.005969941150397062
389 0.005783785600215197
390 0.00560460239648819
391 0.005429211538285017
392 0.005253802984952927
393 0.005087706726044416
394 0.004935507196933031
395 0.004776965361088514
396 0.004633964039385319
397 0.004492755979299545
398 0.004353789146989584
399 0.004222291521728039
400 0.004090385045856237
401 0.003963903523981571
402 0.0038454143796116114
403 0.003729637013748288
404 0.003620771924033761
405 0.0035131131298840046
406 0.00340801733545959
407 0.00330742378719151
408 0.003211765317246318
409 0.003113222075626254
410 0.003022616496309638
411 0.002935245167464018
412 0.0028514559380710125
413 0.002769692335277796
414 0.0026885350234806538
415 0.002611204283311963
416 0.002539660781621933
417 0.0024681666400283575
418 0.002396887866780162
419 0.0023297625593841076
420 0.002266267780214548
421 0.002197726396843791
422 0.002137326402589

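### TensorFlow: Static Graphs
PyTorch autograd looks a lot like TensorFlow: both frameworks define a computational graph and use automatic differentiation to compute gradients. The biggest difference between the two is that TensorFlow's computational graphs are static, whereas PyTorch's are dynamic.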

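In TensorFlow, we define the computational graph once and then execute the same graph over and over, possibly feeding it different input data. In PyTorch, each forward pass defines a new computational graph.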
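
One consequence of building the graph on every forward pass is that ordinary Python branching can decide which operations are recorded. The sketch below is a made-up PyTorch example: the function name `forward` and the input values are chosen only for illustration.

```python
import torch


def forward(x):
    # Plain Python control flow picks the operations for this call.
    if x.sum() > 0:
        return (x ** 2).sum()
    return (x ** 3).sum()


a = torch.tensor([1.0, 2.0], requires_grad=True)
b = torch.tensor([-1.0, -2.0], requires_grad=True)

forward(a).backward()    # graph records x ** 2, so a.grad = 2 * a
forward(b).backward()    # graph records x ** 3, so b.grad = 3 * b ** 2

print(a.grad, b.grad)
```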

For the differences between static and dynamic graphs, and their respective trade-offs, see the [official documentation](https://pytorch.org/tutorials/beginner/pytorch_with_examples.html#tensors).

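For comparison with the PyTorch autograd example above, here is a TensorFlow implementation of the same network. Note that it is written against the TensorFlow 1.x graph API (`tf.placeholder`, `tf.Session`); in TensorFlow 2.x these names are only available under `tf.compat.v1`.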

In [24]:
# -*- coding: utf-8 -*-
import tensorflow as tf
import numpy as np

# First we set up the computational graph:

# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

# Create placeholders for the input and target data; these will be filled
# with real data when we execute the graph.
x = tf.placeholder(tf.float32, shape=(None, D_in))
y = tf.placeholder(tf.float32, shape=(None, D_out))

# Create Variables for the weights and initialize them with random data.
# A TensorFlow Variable persists its value across executions of the graph.
w1 = tf.Variable(tf.random_normal((D_in, H)))
w2 = tf.Variable(tf.random_normal((H, D_out)))

# Forward pass: Compute the predicted y using operations on TensorFlow Tensors.
# Note that this code does not actually perform any numeric operations; it
# merely sets up the computational graph that we will later execute.
h = tf.matmul(x, w1)
h_relu = tf.maximum(h, tf.zeros(1))
y_pred = tf.matmul(h_relu, w2)

# Compute loss using operations on TensorFlow Tensors
loss = tf.reduce_sum((y - y_pred) ** 2.0)

# Compute gradient of the loss with respect to w1 and w2.
grad_w1, grad_w2 = tf.gradients(loss, [w1, w2])

# Update the weights using gradient descent. To actually update the weights
# we need to evaluate new_w1 and new_w2 when executing the graph. Note that
# in TensorFlow the act of updating the value of the weights is part of
# the computational graph; in PyTorch this happens outside the computational
# graph.
learning_rate = 1e-6
new_w1 = w1.assign(w1 - learning_rate * grad_w1)
new_w2 = w2.assign(w2 - learning_rate * grad_w2)

# Now we have built our computational graph, so we enter a TensorFlow session to
# actually execute the graph.
with tf.Session() as sess:
    # Run the graph once to initialize the Variables w1 and w2.
    sess.run(tf.global_variables_initializer())

    # Create numpy arrays holding the actual data for the inputs x and targets
    # y
    x_value = np.random.randn(N, D_in)
    y_value = np.random.randn(N, D_out)
    for _ in range(500):
        # Execute the graph many times. Each time it executes we want to bind
        # x_value to x and y_value to y, specified with the feed_dict argument.
        # Each time we execute the graph we want to compute the values for loss,
        # new_w1, and new_w2; the values of these Tensors are returned as numpy
        # arrays.
        loss_value, _, _ = sess.run([loss, new_w1, new_w2],
                                    feed_dict={x: x_value, y: y_value})
        print(loss_value)

27708024.0
20404152.0
16943356.0
14545310.0
12597423.0
11169524.0
9489526.0
7606004.5
5824834.0
4284547.0
3091786.0
2225090.0
1633637.9
1210321.2
912617.6
701452.1
551049.4
441974.53
361619.62
301015.44
254341.16
217597.45
188121.06
164038.61
144062.28
127283.51
112991.47
100704.98
90060.125
80780.04
72648.79
65484.4
59147.312
53522.08
48515.438
44044.0
40042.047
36454.777
33232.61
30332.95
27718.41
25358.312
23223.012
21289.332
19537.148
17945.166
16497.773
15179.957
13979.285
12885.818
11887.873
10975.775
10141.784
9377.699
8677.127
8034.2334
7443.503
6900.7847
6401.8467
5942.539
5519.3213
5129.0537
4768.8467
4436.3457
4129.1445
3845.1182
3582.3564
3338.8076
3113.413
2905.464
2712.5474
2533.5078
2367.2432
2212.7483
2069.1138
1935.425
1811.0018
1695.1992
1587.2727
1486.6848
1392.9102
1305.4362
1223.7642
1147.5046
1076.3052
1009.80145
947.6404
889.5327
835.19666
784.38416
736.77966
692.2229
650.48987
611.39813
574.7767
540.4502
508.28613
478.10565
449.7967
423.2304
398.31354
374.90967


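## The nn module
### PyTorch: nn
Computational graphs and autograd are a powerful paradigm for defining complex operators and taking derivatives automatically. For large neural networks, however, raw autograd can be a bit too low-level.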

In TensorFlow, packages such as Keras, TensorFlow-Slim, and TFLearn provide higher-level abstractions over the raw computational graph.

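In PyTorch, the nn package serves the same purpose. It defines a set of Modules, which are roughly equivalent to neural network layers. A Module receives input Tensors and computes output Tensors, and it may also hold internal state such as Tensors containing learnable parameters. The nn package also defines a set of useful loss functions that are commonly used when training neural networks.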
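
The following sketch shows what a Module holds and how a loss function from nn behaves. The layer sizes and the `pred`/`target` values are made up for illustration; `reduction="sum"` and `reduction="mean"` are the two `MSELoss` modes compared.

```python
import torch

layer = torch.nn.Linear(4, 3)
shapes = {name: tuple(p.shape) for name, p in layer.named_parameters()}
print(shapes)   # weight is (out_features, in_features), bias is (out_features,)

pred = torch.tensor([[1.0, 2.0], [3.0, 4.0]])
target = torch.zeros(2, 2)
sum_loss = torch.nn.MSELoss(reduction="sum")(pred, target)     # 1 + 4 + 9 + 16
mean_loss = torch.nn.MSELoss(reduction="mean")(pred, target)   # sum / 4 elements

print(sum_loss.item(), mean_loss.item())
```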

In the following example, we use the nn package to implement our two-layer network:

In [25]:
# -*- coding: utf-8 -*-
import torch

# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

# Create random Tensors to hold inputs and outputs
x = torch.randn(N, D_in)
y = torch.randn(N, D_out)

# Use the nn package to define our model as a sequence of layers. nn.Sequential
# is a Module which contains other Modules, and applies them in sequence to
# produce its output. Each Linear Module computes output from input using a
# linear function, and holds internal Tensors for its weight and bias.
model = torch.nn.Sequential(
    torch.nn.Linear(D_in, H),
    torch.nn.ReLU(),
    torch.nn.Linear(H, D_out),
)

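# The nn package also contains definitions of popular loss functions; in this
# case we will use Mean Squared Error (MSE) as our loss function. Setting
# reduction='sum' sums the squared errors instead of averaging them
# (this replaces the deprecated size_average=False argument).
loss_fn = torch.nn.MSELoss(reduction='sum')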

learning_rate = 1e-4
for t in range(500):
    # Forward pass: compute predicted y by passing x to the model. Module objects
    # override the __call__ operator so you can call them like functions. When
    # doing so you pass a Tensor of input data to the Module and it produces
    # a Tensor of output data.
    y_pred = model(x)
    
    # Compute and print loss. We pass Tensors containing the predicted and true
    # values of y, and the loss function returns a Tensor containing the
    # loss.
    loss = loss_fn(y_pred, y)
    print(t, loss.item())
    
    # Zero the gradients before running the backward pass.
    model.zero_grad()
    
    # Backward pass: compute gradient of the loss with respect to all the learnable
    # parameters of the model. Internally, the parameters of each Module are stored
    # in Tensors with requires_grad=True, so this call will compute gradients for
    # all learnable parameters in the model.
    loss.backward()
    
    # Update the weights using gradient descent. Each parameter is a Tensor, so
    # we can access its gradients like we did before.
    with torch.no_grad():
        for param in model.parameters():
            param -= learning_rate * param.grad

0 684.7862548828125
1 631.8350219726562
2 586.1903686523438
3 546.3281860351562
4 510.8560791015625
5 478.83428955078125
6 449.90350341796875
7 423.55419921875
8 399.4801330566406
9 377.142333984375
10 356.2527160644531
11 336.6210021972656
12 318.0293273925781
13 300.39898681640625
14 283.6458740234375
15 267.7770080566406
16 252.6905059814453
17 238.42552185058594
18 224.80099487304688
19 211.87591552734375
20 199.60760498046875
21 187.9495391845703
22 176.88211059570312
23 166.38902282714844
24 156.44691467285156
25 147.04124450683594
26 138.11964416503906
27 129.70370483398438
28 121.7616958618164
29 114.26985931396484
30 107.19932556152344
31 100.55964660644531
32 94.30105590820312
33 88.41084289550781
34 82.87235260009766
35 77.67949676513672
36 72.80027770996094
37 68.20752716064453
38 63.91462707519531
39 59.8980827331543
40 56.11278533935547
41 52.57206344604492
42 49.25984191894531
43 46.16121292114258
44 43.260765075683594
45 40.5465087890625
46 38.01221466064453
47 35.64551

427 9.650840183894616e-06
428 9.352876077173278e-06
429 9.066947313840501e-06
430 8.788272680249065e-06
431 8.51783715916099e-06
432 8.256973160314374e-06
433 8.003274160728324e-06
434 7.758379979350138e-06
435 7.5208286034467164e-06
436 7.290140274562873e-06
437 7.067822025419446e-06
438 6.852201750007225e-06
439 6.641794698225567e-06
440 6.438526725105476e-06
441 6.243170901143458e-06
442 6.052041499060579e-06
443 5.8683663155534305e-06
444 5.689677436748752e-06
445 5.515743396244943e-06
446 5.348531431081938e-06
447 5.184470865060575e-06
448 5.028105988458265e-06
449 4.875534159509698e-06
450 4.726908628072124e-06
451 4.5837246034352574e-06
452 4.4446837819123175e-06
453 4.310079930291977e-06
454 4.179532425041543e-06
455 4.052982603752753e-06
456 3.930430921172956e-06
457 3.811515625784523e-06
458 3.695799023262225e-06
459 3.584662636058056e-06
460 3.4765994314511772e-06
461 3.3710691695887363e-06
462 3.270291699664085e-06
463 3.1711786050436785e-06
464 3.0762128062633565e-06
465 2

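### PyTorch: optim
Up to this point we have updated the weights of our models by hand, using **with torch.no_grad()** or **.data** to keep autograd from tracking the history of the update (these computations do not need gradients of the loss). This is not a big burden for a simple optimization algorithm such as the SGD we use here. In practice, however, we often train neural networks with more sophisticated optimizers such as AdaGrad, RMSprop, and Adam, and writing those updates by hand becomes cumbersome.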

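The optim package abstracts the idea of an optimization algorithm and provides implementations of commonly used optimization algorithms.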
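
To see what this abstraction replaces, the sketch below checks that one `step()` of `torch.optim.SGD` (with no momentum or weight decay) matches the manual update `w - lr * w.grad`. The parameter values and learning rate are arbitrary.

```python
import torch

w = torch.nn.Parameter(torch.tensor([1.0, -2.0]))
optimizer = torch.optim.SGD([w], lr=0.1)

loss = (w ** 2).sum()
optimizer.zero_grad()
loss.backward()                           # w.grad = 2w = [2, -4]

manual = w.detach() - 0.1 * w.grad        # what we would compute by hand
optimizer.step()                          # what the optimizer does for us

print(w.detach(), manual)
```

Optimizers such as Adam additionally keep per-parameter state between steps, which is the part that becomes tedious to maintain by hand.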

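In the example below, we define our model with the nn package as before, but optimize it with the Adam algorithm provided by the optim package, which updates the model weights for us: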

In [26]:
# -*- coding: utf-8 -*-
import torch

# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

# Create random Tensors to hold inputs and outputs
x = torch.randn(N, D_in)
y = torch.randn(N, D_out)

# Use the nn package to define our model and loss function.
model = torch.nn.Sequential(
    torch.nn.Linear(D_in, H),
    torch.nn.ReLU(),
    torch.nn.Linear(H, D_out),
)
loss_fn = torch.nn.MSELoss(reduction='sum')

# Use the optim package to define an Optimizer that will update the weights of
# the model for us. Here we will use Adam; the optim package contains many other
# optimization algorithms. The first argument to the Adam constructor tells the
# optimizer which Tensors it should update.
learning_rate = 1e-4
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)
for t in range(500):
    # Forward pass: compute predicted y by passing x to the model.
    y_pred = model(x)
    
    # Compute and print loss.
    loss = loss_fn(y_pred, y)
    print(t, loss.item())
    
    # Before the backward pass, use the optimizer object to zero all of the
    # gradients for the variables it will update (which are the learnable
    # weights of the model). This is because by default, gradients are
    # accumulated in buffers (i.e., not overwritten) whenever .backward()
    # is called. Check out the docs of torch.autograd.backward for more details.
    optimizer.zero_grad()
    
    # Backward pass: compute gradient of the loss with respect to model
    # parameters
    loss.backward()
    
    # Calling the step function on an Optimizer makes an update to its
    # parameters
    optimizer.step()

0 734.5089111328125
1 716.6924438476562
2 699.4212646484375
3 682.7864990234375
4 666.6070556640625
5 650.8390502929688
6 635.4844970703125
7 620.4832153320312
8 605.82666015625
9 591.6136474609375
10 577.7608032226562
11 564.3089599609375
12 551.2717895507812
13 538.5928955078125
14 526.2965087890625
15 514.34716796875
16 502.7384033203125
17 491.4090881347656
18 480.3922424316406
19 469.6605529785156
20 459.1554870605469
21 448.9193420410156
22 438.98681640625
23 429.3438415527344
24 419.90411376953125
25 410.65850830078125
26 401.6029357910156
27 392.72998046875
28 384.0431823730469
29 375.5223083496094
30 367.17034912109375
31 358.99322509765625
32 350.9848937988281
33 343.140625
34 335.4920654296875
35 327.9921875
36 320.6512756347656
37 313.45037841796875
38 306.3865661621094
39 299.4817810058594
40 292.7358703613281
41 286.1057434082031
42 279.6031799316406
43 273.22283935546875
44 266.9549560546875
45 260.8047180175781
46 254.77830505371094
47 248.8665771484375
48 243.068801879

373 0.00010490290878806263
374 9.98975956463255e-05
375 9.512828546576202e-05
376 9.058920841198415e-05
377 8.626213093521073e-05
378 8.214269473683089e-05
379 7.821746112313122e-05
380 7.448037649737671e-05
381 7.092149462550879e-05
382 6.753413617843762e-05
383 6.430737266782671e-05
384 6.123378989286721e-05
385 5.830320151289925e-05
386 5.551668073167093e-05
387 5.286150917527266e-05
388 5.033198249293491e-05
389 4.79196387459524e-05
390 4.562483445624821e-05
391 4.343812179286033e-05
392 4.135718336328864e-05
393 3.9376813219860196e-05
394 3.7485315260710195e-05
395 3.568711326806806e-05
396 3.3977001294260845e-05
397 3.234485848224722e-05
398 3.079184898524545e-05
399 2.9310836907825433e-05
400 2.790126745821908e-05
401 2.6559015168459155e-05
402 2.528329423512332e-05
403 2.4064574972726405e-05
404 2.29056258831406e-05
405 2.180227602366358e-05
406 2.0750572730321437e-05
407 1.97505814867327e-05
408 1.8795883079292253e-05
409 1.7888152797240764e-05
410 1.70243984030094e-05
411 1.6

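### PyTorch: Custom nn Modules
Sometimes you will want to specify a model that is more complex than a sequence of existing Modules. In that case, you can define your own Module by subclassing <font color=red>nn.Module</font> and implementing a <font color=red>forward</font> method that receives input Tensors and produces output Tensors using other Modules or other autograd operations on Tensors.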
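
As a minimal sketch of this pattern, `TinyNet` below is a hypothetical one-layer Module invented for this illustration. It shows that assigning a Module as an attribute in the constructor is enough for its parameters to be found by `named_parameters()`.

```python
import torch


class TinyNet(torch.nn.Module):
    """Hypothetical one-layer module used only for illustration."""

    def __init__(self):
        super().__init__()
        # Assigning a Module as an attribute registers its parameters.
        self.fc = torch.nn.Linear(2, 1)

    def forward(self, x):
        return self.fc(x).clamp(min=0)


net = TinyNet()
names = [name for name, _ in net.named_parameters()]
out = net(torch.zeros(3, 2))

print(names, out.shape)
```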

In the example below, we implement our two-layer network as a custom Module subclass:

In [27]:
# -*- coding: utf-8 -*-
import torch

class TwoLayerNet(torch.nn.Module):
    def __init__(self, D_in, H, D_out):
        """
        In the constructor we instantiate two nn.Linear modules and assign them as
        member variables.
        """
        super(TwoLayerNet, self).__init__()
        self.linear1 = torch.nn.Linear(D_in, H)
        self.linear2 = torch.nn.Linear(H, D_out)
    
    def forward(self, x):
        """
        In the forward function we accept a Tensor of input data and we must return
        a Tensor of output data. We can use Modules defined in the constructor as
        well as arbitrary operators on Tensors.
        """
        h_relu = self.linear1(x).clamp(min=0)
        y_pred = self.linear2(h_relu)
        return y_pred


# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

# Create random Tensors to hold inputs and outputs
x = torch.randn(N, D_in)
y = torch.randn(N, D_out)

# Construct our model by instantiating the class defined above
model = TwoLayerNet(D_in, H, D_out)

# Construct our loss function and an Optimizer. The call to model.parameters()
# in the SGD constructor will contain the learnable parameters of the two
# nn.Linear modules which are members of the model.
criterion = torch.nn.MSELoss(reduction='sum')
optimizer = torch.optim.SGD(model.parameters(), lr=1e-4)
for t in range(500):
    # Forward pass: Compute predicted y by passing x to the model
    y_pred = model(x)

    # Compute and print loss
    loss = criterion(y_pred, y)
    print(t, loss.item())

    # Zero gradients, perform a backward pass, and update the weights.
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

0 647.2852783203125
1 594.937744140625
2 550.7913818359375
3 512.53955078125
4 478.7217712402344
5 448.4839782714844
6 421.3774108886719
7 396.7521667480469
8 374.2297058105469
9 353.339599609375
10 333.8704528808594
11 315.5453796386719
12 298.35516357421875
13 282.15692138671875
14 266.75946044921875
15 252.1375732421875
16 238.3279571533203
17 225.24188232421875
18 212.8157958984375
19 201.0371551513672
20 189.83340454101562
21 179.16014099121094
22 168.9839324951172
23 159.31640625
24 150.1078338623047
25 141.36097717285156
26 133.0697784423828
27 125.18647766113281
28 117.70693969726562
29 110.61231231689453
30 103.8929214477539
31 97.54997253417969
32 91.55021667480469
33 85.9017562866211
34 80.57682800292969
35 75.5542984008789
36 70.83242797851562
37 66.40731811523438
38 62.25764465332031
39 58.36204528808594
40 54.70687484741211
41 51.28962326049805
42 48.08914566040039
43 45.0911865234375
44 42.2833366394043
45 39.66024398803711
46 37.21010208129883
47 34.91743850708008
48 32

405 3.892788299708627e-05
406 3.789678157772869e-05
407 3.6891826312057674e-05
408 3.5916335036745295e-05
409 3.496699355309829e-05
410 3.4045388019876555e-05
411 3.3144202461699024e-05
412 3.227366687497124e-05
413 3.142184141324833e-05
414 3.0598253943026066e-05
415 2.9795450245728716e-05
416 2.9015600375714712e-05
417 2.82557539321715e-05
418 2.751828833424952e-05
419 2.6799421902978793e-05
420 2.609907096484676e-05
421 2.5419401936233044e-05
422 2.4756973289186135e-05
423 2.41136658587493e-05
424 2.348708949284628e-05
425 2.2877960873302072e-05
426 2.228602534160018e-05
427 2.1709554857807234e-05
428 2.1148045561858453e-05
429 2.060057158814743e-05
430 2.006962677114643e-05
431 1.955235165951308e-05
432 1.904921191453468e-05
433 1.8558128431322984e-05
434 1.808127126423642e-05
435 1.761790190357715e-05
436 1.7165417375508696e-05
437 1.6724459783290513e-05
438 1.6298428818117827e-05
439 1.5880792489042506e-05
440 1.5476016415050253e-05
441 1.507936030975543e-05
442 1.469513972551794

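### PyTorch: Control flow + weight sharing
As an example of dynamic graphs and weight sharing, we implement a rather strange model: a fully connected ReLU network that, on each forward pass, picks a random number between 1 and 4 and uses that many hidden layers (the input layer followed by 0 to 3 applications of a middle layer), reusing the same weights for every application of the middle layer.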

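For this model, we can use ordinary Python control flow to implement the loop, and we can share weights among the innermost layers simply by reusing the same Module several times in the forward pass.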
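
The effect of reusing a Module can be seen on a toy case. In the sketch below, `shared` is a hypothetical 1x1 linear layer with its weight fixed to 2 so that the result is predictable; applying it twice gives `y = w * w * x`, and autograd sums the gradient contributions from both uses.

```python
import torch

shared = torch.nn.Linear(1, 1, bias=False)
with torch.no_grad():
    shared.weight.fill_(2.0)             # w = 2, fixed for a predictable result

x = torch.tensor([[3.0]])
y = shared(shared(x))                    # same Module applied twice: y = w * w * x
y.sum().backward()                       # dy/dw = 2 * w * x

n_params = sum(p.numel() for p in shared.parameters())
print(y.item(), shared.weight.grad.item(), n_params)
```

The parameter count stays at one no matter how many times the Module is applied, which is what weight sharing means here.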

We can easily implement this model as a Module subclass:

In [28]:
# -*- coding: utf-8 -*-
import random
import torch


class DynamicNet(torch.nn.Module):
    def __init__(self, D_in, H, D_out):
        """
        In the constructor we construct three nn.Linear instances that we will use
        in the forward pass.
        """
        super(DynamicNet, self).__init__()
        self.input_linear = torch.nn.Linear(D_in, H)
        self.middle_linear = torch.nn.Linear(H, H)
        self.output_linear = torch.nn.Linear(H, D_out)

    def forward(self, x):
        """
        For the forward pass of the model, we randomly choose either 0, 1, 2, or 3
        and reuse the middle_linear Module that many times to compute hidden layer
        representations.

        Since each forward pass builds a dynamic computation graph, we can use normal
        Python control-flow operators like loops or conditional statements when
        defining the forward pass of the model.

        Here we also see that it is perfectly safe to reuse the same Module many
        times when defining a computational graph. This is a big improvement from Lua
        Torch, where each Module could be used only once.
        """
        h_relu = self.input_linear(x).clamp(min=0)
        for _ in range(random.randint(0, 3)):
            h_relu = self.middle_linear(h_relu).clamp(min=0)
        y_pred = self.output_linear(h_relu)
        return y_pred


# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

# Create random Tensors to hold inputs and outputs
x = torch.randn(N, D_in)
y = torch.randn(N, D_out)

# Construct our model by instantiating the class defined above
model = DynamicNet(D_in, H, D_out)

# Construct our loss function and an Optimizer. Training this strange model with
# vanilla stochastic gradient descent is tough, so we use momentum
criterion = torch.nn.MSELoss(reduction='sum')
optimizer = torch.optim.SGD(model.parameters(), lr=1e-4, momentum=0.9)
for t in range(500):
    # Forward pass: Compute predicted y by passing x to the model
    y_pred = model(x)

    # Compute and print loss
    loss = criterion(y_pred, y)
    print(t, loss.item())

    # Zero gradients, perform a backward pass, and update the weights.
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

0 744.5186767578125
1 684.8750610351562
2 678.1241455078125
3 522.2393798828125
4 660.3095703125
5 670.9830932617188
6 355.5094299316406
7 312.73919677734375
8 266.21026611328125
9 666.8522338867188
10 659.3114013671875
11 663.8992309570312
12 661.396484375
13 129.5880584716797
14 595.8967895507812
15 95.91957092285156
16 647.8789672851562
17 620.425048828125
18 69.75782775878906
19 510.2338562011719
20 482.1440124511719
21 570.2634887695312
22 89.85477447509766
23 372.3770751953125
24 576.2099609375
25 483.392822265625
26 89.67530059814453
27 80.15411376953125
28 242.4844512939453
29 394.5157470703125
30 197.1688690185547
31 333.33184814453125
32 301.5603942871094
33 53.70644760131836
34 51.46847915649414
35 328.835693359375
36 291.7894287109375
37 180.2469024658203
38 169.86001586914062
39 49.474815368652344
40 43.96903610229492
41 35.52791213989258
42 166.0587615966797
43 25.647275924682617
44 152.04454040527344
45 98.12335205078125
46 189.38558959960938
47 123.11140441894531
48 304