## 3.13 Dropout

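Besides the weight decay introduced in the previous section, deep learning models often use dropout to deal with overfitting. Dropout has several variants; in this section, the term refers specifically to inverted dropout.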

### 3.13.1 Method

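Recall that Figure 3.3 in the ["Multilayer Perceptron"](mlp.ipynb) section describes a multilayer perceptron with a single hidden layer. It has 4 inputs and 5 hidden units, and the hidden unit $h_i$ ($i=1, \ldots, 5$) is computed as

$$h_i = \phi\left(x_1 w_{1i} + x_2 w_{2i} + x_3 w_{3i} + x_4 w_{4i} + b_i\right),$$

where $\phi$ is the activation function, $x_1, \ldots, x_4$ are the inputs, $w_{1i}, \ldots, w_{4i}$ are the weight parameters of hidden unit $i$, and $b_i$ is its bias parameter. When dropout is applied to this hidden layer, each of its hidden units is dropped with some probability. Let the drop probability be $p$: with probability $p$, $h_i$ is set to zero, and with probability $1-p$, $h_i$ is divided by $1-p$, which scales it up. The drop probability is a hyperparameter of dropout. Concretely, let the random variable $\xi_i$ take the values 0 and 1 with probabilities $p$ and $1-p$, respectively. With dropout, we compute the new hidden unit $h_i'$ as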

$$h_i' = \frac{\xi_i}{1-p} h_i.$$

Since $E(\xi_i) = 1-p$, we have

$$E(h_i') = \frac{E(\xi_i)}{1-p}h_i = h_i.$$

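That is, dropout does not change the expected value of its input. Let us apply dropout to the hidden layer in Figure 3.3. One possible outcome is shown in Figure 3.5, where $h_2$ and $h_5$ are set to zero. The output then no longer depends on $h_2$ and $h_5$, and during backpropagation the gradients of the weights associated with these two hidden units are all 0. Because hidden units are dropped at random during training, any of $h_1, \ldots, h_5$ may be set to zero, so the output layer cannot rely too heavily on any single one of them. This acts as regularization during training and can be used to deal with overfitting. When testing the model, we generally do not use dropout, so that the results are more deterministic.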

![A multilayer perceptron with dropout applied to the hidden layer](../img/dropout.svg)
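
The zero-gradient claim can be checked on a toy version of the network in Figure 3.5. The following is a minimal NumPy sketch, not code from this chapter: the $\tanh$ activation, the drop probability $p = 0.5$, the single linear output unit with weights `v`, and the random values are all assumptions made for illustration. The mask `xi` is fixed by hand so that $h_2$ and $h_5$ are the dropped units.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy network: 4 inputs, 5 hidden units, 1 linear output (illustrative values).
x = rng.normal(size=4)        # inputs x_1..x_4
W = rng.normal(size=(4, 5))   # W[j, i] plays the role of w_{ji}
b = rng.normal(size=5)        # hidden biases b_i
v = rng.normal(size=5)        # hypothetical output-layer weights

p = 0.5                                    # assumed drop probability
xi = np.array([1.0, 0.0, 1.0, 1.0, 0.0])   # h_2 and h_5 are dropped
dropped = [1, 4]                           # zero-based indices of h_2, h_5

# Forward pass with inverted dropout on the hidden layer.
z = x @ W + b
h = np.tanh(z)
h_drop = xi / (1 - p) * h
o = h_drop @ v

# Manual backward pass for the scalar output o.
grad_v = h_drop                              # do/dv_i
grad_z = v * xi / (1 - p) * (1 - h ** 2)     # do/dz_i, using tanh' = 1 - tanh^2
grad_W = np.outer(x, grad_z)                 # do/dW[j, i] = x_j * do/dz_i
grad_b = grad_z                              # do/db_i

print(grad_W[:, dropped])   # columns belonging to the dropped units
print(grad_v[dropped])
```

In this sketch the columns of `grad_W` and the entries of `grad_v` and `grad_b` that belong to the dropped units are exactly zero, because the factor `xi` multiplies every path from those units to the output.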

### 3.13.2 Implementation from Scratch

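Given the definition of dropout, implementing it is straightforward. The `dropout` function below drops each element of the NDArray input `X` with probability `drop_prob`.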

In [29]:
import d2lzh as d2l
from mxnet import autograd, gluon, init, nd
from mxnet.gluon import loss as gloss, nn

def dropout(X, drop_prob):
    assert 0 <= drop_prob <= 1
    keep_prob = 1 - drop_prob
    # In this case, all elements are dropped
    if keep_prob == 0:
        return X.zeros_like()
    mask = nd.random_uniform(0, 1, X.shape) < keep_prob
    return mask * X / keep_prob

Let us run a few examples to test the `dropout` function, with drop probabilities of 0, 0.5, and 1.

In [30]:
X = nd.arange(16).reshape((2, 8))
dropout(X, 0)


[[ 0.  1.  2.  3.  4.  5.  6.  7.]
 [ 8.  9. 10. 11. 12. 13. 14. 15.]]
<NDArray 2x8 @cpu(0)>

In [31]:
dropout(X, 0.5)


[[ 0.  2.  0.  6.  0. 10. 12. 14.]
 [16.  0.  0.  0. 24.  0. 28.  0.]]
<NDArray 2x8 @cpu(0)>

In [32]:
dropout(X, 1)


[[0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0.]]
<NDArray 2x8 @cpu(0)>
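
To connect these outputs with $E(h_i') = h_i$, one can average many independent draws at a drop probability of 0.5 and compare the average with `X`. The sketch below re-implements the function in NumPy so that it runs without MXNet; the name `dropout_np`, the seed, and the number of trials are assumptions for illustration and are not part of the original code.

```python
import numpy as np

def dropout_np(X, drop_prob, rng):
    """NumPy analogue of the dropout function above (inverted dropout)."""
    assert 0 <= drop_prob <= 1
    keep_prob = 1 - drop_prob
    # With keep_prob == 0, every element is dropped.
    if keep_prob == 0:
        return np.zeros_like(X)
    mask = rng.uniform(0, 1, X.shape) < keep_prob
    return mask * X / keep_prob

rng = np.random.default_rng(0)
X = np.arange(16, dtype=np.float64).reshape(2, 8)

# A single draw: every entry is either 0 or X / keep_prob = 2 * X.
sample = dropout_np(X, 0.5, rng)

# Averaging many draws should approach X, since E(h') = h.
num_trials = 20000
avg = np.mean([dropout_np(X, 0.5, rng) for _ in range(num_trials)], axis=0)
max_abs_err = np.abs(avg - X).max()
print(max_abs_err)
```

The maximum deviation shrinks roughly like one over the square root of the number of trials, so it stays small relative to the entries of `X` but does not reach zero.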

#### Defining Model Parameters

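In the experiments, we again use the Fashion-MNIST dataset introduced in the ["Softmax Regression Implementation from Scratch"](softmax-regression-scratch.ipynb) section. We define a multilayer perceptron with two hidden layers, each with 256 outputs.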

In [33]:
num_inputs, num_outputs, num_hiddens1, num_hiddens2 = 784, 10, 256, 256

W1 = nd.random_normal(scale=0.01, shape=(num_inputs, num_hiddens1))
b1 = nd.zeros(num_hiddens1)
W2 = nd.random_normal(scale=0.01, shape=(num_hiddens1, num_hiddens2))
b2 = nd.zeros(num_hiddens2)
W3 = nd.random_normal(scale=0.01, shape=(num_hiddens2, num_outputs))
b3 = nd.zeros(num_outputs)

params = [W1, b1, W2, b2, W3, b3]
for param in params:
    param.attach_grad()

#### Defining the Model

In [34]:
drop_prob1, drop_prob2 = 0.2, 0.5

def net(X):
    X = X.reshape((-1, num_inputs))
    H1 = (nd.dot(X, W1) + b1).relu()
    if autograd.is_training(): # Use dropout only in training mode
        H1 = dropout(H1, drop_prob1) # Dropout after the first fully connected layer
    H2 = (nd.dot(H1, W2) + b2).relu()
    if autograd.is_training():
        H2 = dropout(H2, drop_prob2) # Dropout after the second fully connected layer
    return nd.dot(H2, W3) + b3
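
The effect of switching dropout on only in training mode can be illustrated without MXNet. The following is a rough NumPy sketch in which an explicit `training` flag stands in for `autograd.is_training()`; the names `dropout_np` and `forward`, the small layer sizes, and the random parameters are assumptions for illustration, not the model trained in this section.

```python
import numpy as np

def dropout_np(H, drop_prob, rng):
    # Inverted dropout; this sketch assumes 0 <= drop_prob < 1.
    keep_prob = 1 - drop_prob
    mask = rng.uniform(size=H.shape) < keep_prob
    return mask * H / keep_prob

def forward(X, params, drop_probs, training, rng=None):
    W1, b1, W2, b2, W3, b3 = params
    H1 = np.maximum(X @ W1 + b1, 0)   # ReLU
    if training:                      # dropout only in training mode
        H1 = dropout_np(H1, drop_probs[0], rng)
    H2 = np.maximum(H1 @ W2 + b2, 0)
    if training:
        H2 = dropout_np(H2, drop_probs[1], rng)
    return H2 @ W3 + b3

rng = np.random.default_rng(0)
params = []
for n_in, n_out in [(8, 16), (16, 16), (16, 3)]:
    params += [rng.normal(size=(n_in, n_out)), np.zeros(n_out)]
X = rng.normal(size=(4, 8))
drop_probs = (0.2, 0.5)

# Two training-mode passes draw different masks.
train_a = forward(X, params, drop_probs, training=True, rng=rng)
train_b = forward(X, params, drop_probs, training=True, rng=rng)
# Two test-mode passes skip dropout entirely.
test_a = forward(X, params, drop_probs, training=False)
test_b = forward(X, params, drop_probs, training=False)

print(np.array_equal(test_a, test_b))
print(np.allclose(train_a, train_b))
```

The two test-mode outputs are identical because no randomness is involved, whereas the two training-mode outputs generally differ because each pass samples its own masks.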

#### Training and Testing the Model

This part is similar to the training and testing of the multilayer perceptron described earlier.

In [35]:
num_epochs, lr, batch_size = 30, 0.5, 256
loss = gloss.SoftmaxCrossEntropyLoss()
train_iter, test_iter = d2l.load_data_fashion_mnist(batch_size)
d2l.train_ch3(net, train_iter, test_iter, loss, num_epochs, batch_size,
             params, lr)

epoch 1, loss 1.1398, train acc 0.557, test acc 0.768
epoch 2, loss 0.6074, train acc 0.773, test acc 0.826
epoch 3, loss 0.5094, train acc 0.812, test acc 0.842
epoch 4, loss 0.4603, train acc 0.833, test acc 0.838
epoch 5, loss 0.4275, train acc 0.844, test acc 0.860
epoch 6, loss 0.4060, train acc 0.852, test acc 0.856
epoch 7, loss 0.3906, train acc 0.858, test acc 0.862
epoch 8, loss 0.3726, train acc 0.865, test acc 0.860
epoch 9, loss 0.3673, train acc 0.868, test acc 0.873
epoch 10, loss 0.3552, train acc 0.871, test acc 0.877
epoch 11, loss 0.3465, train acc 0.873, test acc 0.871
epoch 12, loss 0.3398, train acc 0.875, test acc 0.882
epoch 13, loss 0.3289, train acc 0.879, test acc 0.883
epoch 14, loss 0.3243, train acc 0.881, test acc 0.884
epoch 15, loss 0.3176, train acc 0.883, test acc 0.882
epoch 16, loss 0.3102, train acc 0.885, test acc 0.873
epoch 17, loss 0.3039, train acc 0.887, test acc 0.877
epoch 18, loss 0.2996, train acc 0.888, test acc 0.883
epoch 19, loss 0.29

### 3.13.3 Concise Implementation

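In Gluon, we only need to add a `Dropout` layer after each fully connected layer and specify the drop probability. When training the model, the `Dropout` layer randomly drops output elements of the previous layer with the specified drop probability; when testing the model, the `Dropout` layer has no effect.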

In [36]:
net = nn.Sequential()
net.add(nn.Dense(256, activation="relu"),
       nn.Dropout(drop_prob1), # Dropout after the first fully connected layer
       nn.Dense(256, activation="relu"),
       nn.Dropout(drop_prob2), # Dropout after the second fully connected layer
       nn.Dense(10))
net.initialize(init.Normal(sigma=0.01))

Next, we train and test the model.

In [37]:
trainer = gluon.Trainer(net.collect_params(), 'sgd', {'learning_rate': lr})
d2l.train_ch3(net, train_iter, test_iter, loss, num_epochs, batch_size, None,
             None, trainer)

epoch 1, loss 1.2124, train acc 0.531, test acc 0.768
epoch 2, loss 0.6007, train acc 0.777, test acc 0.832
epoch 3, loss 0.5024, train acc 0.816, test acc 0.848
epoch 4, loss 0.4574, train acc 0.833, test acc 0.861
epoch 5, loss 0.4265, train acc 0.844, test acc 0.856
epoch 6, loss 0.4060, train acc 0.853, test acc 0.862
epoch 7, loss 0.3982, train acc 0.856, test acc 0.861
epoch 8, loss 0.3799, train acc 0.861, test acc 0.875
epoch 9, loss 0.3665, train acc 0.865, test acc 0.870
epoch 10, loss 0.3567, train acc 0.870, test acc 0.869
epoch 11, loss 0.3487, train acc 0.873, test acc 0.881
epoch 12, loss 0.3379, train acc 0.877, test acc 0.878
epoch 13, loss 0.3380, train acc 0.875, test acc 0.882
epoch 14, loss 0.3282, train acc 0.879, test acc 0.873
epoch 15, loss 0.3210, train acc 0.882, test acc 0.871
epoch 16, loss 0.3142, train acc 0.884, test acc 0.883
epoch 17, loss 0.3126, train acc 0.885, test acc 0.886
epoch 18, loss 0.3062, train acc 0.887, test acc 0.888
epoch 19, loss 0.30

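### Summary

- We can use dropout to deal with overfitting.
- Dropout is used only when training the model.

### Exercises

- What happens if the two drop probability hyperparameters in this section are swapped?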

In [38]:
drop_prob1, drop_prob2 = 0.5, 0.2

net = nn.Sequential()
net.add(nn.Dense(256, activation="relu"),
       nn.Dropout(drop_prob1), # Dropout after the first fully connected layer
       nn.Dense(256, activation="relu"),
       nn.Dropout(drop_prob2), # Dropout after the second fully connected layer
       nn.Dense(10))
net.initialize(init.Normal(sigma=0.01))

In [39]:
trainer = gluon.Trainer(net.collect_params(), 'sgd', {'learning_rate': lr})
d2l.train_ch3(net, train_iter, test_iter, loss, num_epochs, batch_size, None,
             None, trainer)

epoch 1, loss 1.1171, train acc 0.563, test acc 0.771
epoch 2, loss 0.5877, train acc 0.780, test acc 0.832
epoch 3, loss 0.5078, train acc 0.812, test acc 0.845
epoch 4, loss 0.4690, train acc 0.827, test acc 0.854
epoch 5, loss 0.4422, train acc 0.838, test acc 0.856
epoch 6, loss 0.4231, train acc 0.844, test acc 0.863
epoch 7, loss 0.4049, train acc 0.851, test acc 0.866
epoch 8, loss 0.3971, train acc 0.855, test acc 0.868
epoch 9, loss 0.3834, train acc 0.859, test acc 0.866
epoch 10, loss 0.3775, train acc 0.860, test acc 0.876
epoch 11, loss 0.3659, train acc 0.864, test acc 0.874
epoch 12, loss 0.3592, train acc 0.867, test acc 0.877
epoch 13, loss 0.3568, train acc 0.868, test acc 0.876
epoch 14, loss 0.3477, train acc 0.872, test acc 0.880
epoch 15, loss 0.3464, train acc 0.871, test acc 0.883
epoch 16, loss 0.3375, train acc 0.875, test acc 0.882
epoch 17, loss 0.3341, train acc 0.877, test acc 0.881
epoch 18, loss 0.3312, train acc 0.877, test acc 0.880
epoch 19, loss 0.32

- Increase the number of epochs and compare the results with and without dropout.

In [41]:
# With dropout
num_epochs = 40
drop_prob1, drop_prob2 = 0.2, 0.5

net = nn.Sequential()
net.add(nn.Dense(256, activation="relu"),
       nn.Dropout(drop_prob1), # Dropout after the first fully connected layer
       nn.Dense(256, activation="relu"),
       nn.Dropout(drop_prob2), # Dropout after the second fully connected layer
       nn.Dense(10))
net.initialize(init.Normal(sigma=0.01))

In [42]:
trainer = gluon.Trainer(net.collect_params(), 'sgd', {'learning_rate': lr})
d2l.train_ch3(net, train_iter, test_iter, loss, num_epochs, batch_size, None,
             None, trainer)

epoch 1, loss 1.1370, train acc 0.563, test acc 0.775
epoch 2, loss 0.5851, train acc 0.781, test acc 0.839
epoch 3, loss 0.4923, train acc 0.819, test acc 0.855
epoch 4, loss 0.4484, train acc 0.836, test acc 0.833
epoch 5, loss 0.4230, train acc 0.846, test acc 0.864
epoch 6, loss 0.3966, train acc 0.856, test acc 0.868
epoch 7, loss 0.3843, train acc 0.858, test acc 0.869
epoch 8, loss 0.3701, train acc 0.865, test acc 0.870
epoch 9, loss 0.3554, train acc 0.871, test acc 0.878
epoch 10, loss 0.3509, train acc 0.872, test acc 0.877
epoch 11, loss 0.3368, train acc 0.876, test acc 0.880
epoch 12, loss 0.3308, train acc 0.878, test acc 0.882
epoch 13, loss 0.3258, train acc 0.879, test acc 0.876
epoch 14, loss 0.3179, train acc 0.883, test acc 0.880
epoch 15, loss 0.3113, train acc 0.886, test acc 0.884
epoch 16, loss 0.3032, train acc 0.888, test acc 0.882
epoch 17, loss 0.2996, train acc 0.889, test acc 0.887
epoch 18, loss 0.2913, train acc 0.892, test acc 0.889
epoch 19, loss 0.29

In [43]:
# Without dropout
num_epochs = 40
drop_prob1, drop_prob2 = 0, 0

net = nn.Sequential()
net.add(nn.Dense(256, activation="relu"),
       nn.Dropout(drop_prob1), # Dropout layer with drop probability 0 (no effect)
       nn.Dense(256, activation="relu"),
       nn.Dropout(drop_prob2), # Dropout layer with drop probability 0 (no effect)
       nn.Dense(10))
net.initialize(init.Normal(sigma=0.01))

In [44]:
trainer = gluon.Trainer(net.collect_params(), 'sgd', {'learning_rate': lr})
d2l.train_ch3(net, train_iter, test_iter, loss, num_epochs, batch_size, None,
             None, trainer)

epoch 1, loss 1.1036, train acc 0.565, test acc 0.783
epoch 2, loss 0.5718, train acc 0.784, test acc 0.829
epoch 3, loss 0.4613, train acc 0.828, test acc 0.844
epoch 4, loss 0.4159, train acc 0.845, test acc 0.852
epoch 5, loss 0.3806, train acc 0.859, test acc 0.871
epoch 6, loss 0.3639, train acc 0.864, test acc 0.867
epoch 7, loss 0.3478, train acc 0.871, test acc 0.869
epoch 8, loss 0.3345, train acc 0.876, test acc 0.875
epoch 9, loss 0.3184, train acc 0.881, test acc 0.880
epoch 10, loss 0.3110, train acc 0.883, test acc 0.869
epoch 11, loss 0.2988, train acc 0.888, test acc 0.883
epoch 12, loss 0.2899, train acc 0.891, test acc 0.886
epoch 13, loss 0.2795, train acc 0.895, test acc 0.884
epoch 14, loss 0.2745, train acc 0.897, test acc 0.890
epoch 15, loss 0.2644, train acc 0.900, test acc 0.886
epoch 16, loss 0.2628, train acc 0.901, test acc 0.885
epoch 17, loss 0.2505, train acc 0.905, test acc 0.893
epoch 18, loss 0.2457, train acc 0.908, test acc 0.885
epoch 19, loss 0.24

- If the model is made more complex, for example by adding hidden units, is the effect of dropout on overfitting more pronounced?

In [45]:
num_epochs = 40
drop_prob1, drop_prob2 = 0.2, 0.5

net = nn.Sequential()
net.add(nn.Dense(512, activation="relu"),
       nn.Dropout(drop_prob1), # Dropout after the first fully connected layer
       nn.Dense(512, activation="relu"),
       nn.Dropout(drop_prob2), # Dropout after the second fully connected layer
       nn.Dense(10))
net.initialize(init.Normal(sigma=0.01))

In [46]:
trainer = gluon.Trainer(net.collect_params(), 'sgd', {'learning_rate': lr})
d2l.train_ch3(net, train_iter, test_iter, loss, num_epochs, batch_size, None,
             None, trainer)

epoch 1, loss 1.1726, train acc 0.549, test acc 0.789
epoch 2, loss 0.5851, train acc 0.782, test acc 0.824
epoch 3, loss 0.4888, train acc 0.819, test acc 0.838
epoch 4, loss 0.4400, train acc 0.838, test acc 0.850
epoch 5, loss 0.4084, train acc 0.849, test acc 0.865
epoch 6, loss 0.3897, train acc 0.857, test acc 0.872
epoch 7, loss 0.3723, train acc 0.863, test acc 0.866
epoch 8, loss 0.3617, train acc 0.868, test acc 0.876
epoch 9, loss 0.3465, train acc 0.873, test acc 0.883
epoch 10, loss 0.3370, train acc 0.875, test acc 0.877
epoch 11, loss 0.3279, train acc 0.878, test acc 0.884
epoch 12, loss 0.3208, train acc 0.881, test acc 0.873
epoch 13, loss 0.3131, train acc 0.884, test acc 0.884
epoch 14, loss 0.3059, train acc 0.885, test acc 0.883
epoch 15, loss 0.3011, train acc 0.888, test acc 0.886
epoch 16, loss 0.2938, train acc 0.890, test acc 0.887
epoch 17, loss 0.2909, train acc 0.890, test acc 0.888
epoch 18, loss 0.2849, train acc 0.894, test acc 0.885
epoch 19, loss 0.28

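- Using the model in this section as an example, compare the effects of dropout and weight decay. What happens if dropout and weight decay are used together?

Dropout works better than weight decay. Using dropout and weight decay together does not work well; it is worse than using either dropout or weight decay alone.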