# Robustness through Perturbations 

Three ways to help a model generalize to unseen test data:

1. Constrain the dimensionality of the hypothesis function.
2. Constrain the norm of the weights.
3. Add noise to every layer's input.

A key challenge is how to add noise without introducing undue bias. For intermediate layers, we do this by perturbing each activation coordinate $h$ independently, where $p$ is the dropout probability:
$$
h' =
\begin{cases}
    0 & \text{with probability } p \\
    \frac{h}{1-p} & \text{otherwise}
\end{cases}
$$
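
To see why dividing by $1-p$ avoids bias, it helps to take the expectation of $h'$ over the random perturbation for a fixed activation $h$. This is a short worked step under the definition above, assuming $0 \le p < 1$:

$$
\begin{aligned}
E[h'] &= p \cdot 0 + (1-p) \cdot \frac{h}{1-p} \\
      &= h
\end{aligned}
$$

Each perturbed coordinate therefore matches the original activation in expectation, even though any single draw is either $0$ or larger than $h$ in magnitude.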

# Implementation from Scratch

In [1]:
import d2l
from mxnet import autograd, gluon, init, nd
from mxnet.gluon import loss as gloss, nn

In [3]:
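def dropout(X, drop_prob):
    """Zero each element of X with probability drop_prob and rescale the rest."""
    assert 0 <= drop_prob <= 1
    # Everything is dropped; return early to avoid dividing by zero below.
    if drop_prob == 1:
        return X.zeros_like()
    # mask is 1 where an element is kept (probability 1 - drop_prob), else 0.
    mask = nd.random.uniform(0, 1, X.shape) > drop_prob
    # Rescale the kept elements so the expected value stays equal to X.
    return mask * X / (1.0 - drop_prob)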

In [4]:
X = nd.arange(16).reshape((2, 8))
print(dropout(X, 0))
print(dropout(X, 0.5))
print(dropout(X, 1))


[[ 0.  1.  2.  3.  4.  5.  6.  7.]
 [ 8.  9. 10. 11. 12. 13. 14. 15.]]
<NDArray 2x8 @cpu(0)>

[[ 0.  0.  0.  0.  8. 10. 12.  0.]
 [16.  0. 20. 22.  0.  0.  0. 30.]]
<NDArray 2x8 @cpu(0)>

[[0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0.]]
<NDArray 2x8 @cpu(0)>
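
The output for `drop_prob = 0.5` shows both effects of the function: some entries are zeroed and the survivors are doubled. The following is a minimal sketch of a sanity check, written with only the Python standard library rather than MXNet so that it runs anywhere. `dropout_py` is a hypothetical list-based stand-in for `dropout`, not part of `d2l` or MXNet. It averages many perturbed copies of a single activation to check that the mean stays near the original value.

```python
import random


def dropout_py(values, drop_prob, rng):
    """Inverted dropout on a list of floats (hypothetical stand-in for dropout)."""
    assert 0 <= drop_prob <= 1
    if drop_prob == 1:
        return [0.0 for _ in values]
    keep_prob = 1.0 - drop_prob
    # Keep an entry when the uniform draw exceeds drop_prob, then rescale it.
    return [v / keep_prob if rng.random() > drop_prob else 0.0 for v in values]


rng = random.Random(0)  # fixed seed so the run is repeatable
h, drop_prob, trials = 3.0, 0.5, 100_000

# Apply dropout to many copies of the same activation and average the results.
samples = dropout_py([h] * trials, drop_prob, rng)
empirical_mean = sum(samples) / trials
print(empirical_mean)  # close to h = 3.0, up to sampling noise
```

Every individual sample is either `0.0` or `h / (1 - drop_prob)`; only the average recovers `h`.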


## Defining Model Parameters

In [20]:
num_inputs, num_outputs, num_hiddens1, num_hiddens2 = 784, 10, 256, 256

W1 = nd.random.normal(scale=0.01, shape=(num_inputs, num_hiddens1))
b1 = nd.zeros(num_hiddens1)
W2 = nd.random.normal(scale=0.01, shape=(num_hiddens1, num_hiddens2))
b2 = nd.zeros(num_hiddens2)
W3 = nd.random.normal(scale=0.01, shape=(num_hiddens2, num_outputs))
b3 = nd.zeros(num_outputs)

params = [W1, b1, W2, b2, W3, b3]
for param in params:
    param.attach_grad()

## Defining the Model

In [21]:
drop_prob1, drop_prob2 = 0.2, 0.5

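def net(X):
    # Flatten each image into a vector of length num_inputs.
    X = X.reshape((-1, num_inputs))
    H1 = (nd.dot(X, W1) + b1).relu()
    # Apply dropout only in training mode, e.g. inside autograd.record().
    if autograd.is_training():
        H1 = dropout(H1, drop_prob1)
    H2 = (nd.dot(H1, W2) + b2).relu()
    if autograd.is_training():
        H2 = dropout(H2, drop_prob2)
    # The output layer returns raw scores; no dropout is applied here.
    return nd.dot(H2, W3) + b3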
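
The `autograd.is_training()` guard matters because dropout should perturb activations only during training; at prediction time the hidden activations pass through unchanged. Below is a rough pure-Python sketch of the same gating, with an explicit `training` argument standing in for `autograd.is_training()`. The names `dropout_py` and `hidden_layer` are hypothetical and used only for illustration.

```python
import random


def dropout_py(values, drop_prob, rng):
    """Inverted dropout on a list of floats (hypothetical helper)."""
    assert 0 <= drop_prob <= 1
    if drop_prob == 1:
        return [0.0 for _ in values]
    keep_prob = 1.0 - drop_prob
    return [v / keep_prob if rng.random() > drop_prob else 0.0 for v in values]


def hidden_layer(values, drop_prob, training, rng):
    """ReLU followed by dropout that is applied only when training is True."""
    activated = [max(0.0, v) for v in values]
    if training:
        activated = dropout_py(activated, drop_prob, rng)
    return activated


rng = random.Random(1)
x = [-1.0, 2.0, 3.0, -4.0, 5.0]

# Two evaluation passes give identical results because no noise is injected.
eval_out_1 = hidden_layer(x, 0.5, training=False, rng=rng)
eval_out_2 = hidden_layer(x, 0.5, training=False, rng=rng)
print(eval_out_1 == eval_out_2)  # True

# A training pass zeroes some entries and doubles the rest (drop_prob = 0.5).
train_out = hidden_layer(x, 0.5, training=True, rng=rng)
print(train_out)
```

Passing the mode explicitly is a design choice of this sketch; the model above reads it from MXNet's global autograd state instead.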

## Training and Testing

In [22]:
num_epochs, lr, batch_size = 10, 0.5, 256
loss = gloss.SoftmaxCrossEntropyLoss()
train_iter, test_iter = d2l.load_data_fashion_mnist(batch_size)
d2l.train_ch3(net, train_iter, test_iter, loss, num_epochs, batch_size, params, lr)

epoch 1, loss 1.1804, train acc 0.546, test acc 0.768
epoch 2, loss 0.5933, train acc 0.780, test acc 0.832
epoch 3, loss 0.4965, train acc 0.819, test acc 0.848
epoch 4, loss 0.4506, train acc 0.837, test acc 0.862
epoch 5, loss 0.4173, train acc 0.848, test acc 0.865
epoch 6, loss 0.4005, train acc 0.853, test acc 0.869
epoch 7, loss 0.3857, train acc 0.860, test acc 0.874
epoch 8, loss 0.3697, train acc 0.866, test acc 0.877
epoch 9, loss 0.3585, train acc 0.871, test acc 0.874
epoch 10, loss 0.3499, train acc 0.872, test acc 0.882


# Concise Implementation

In [16]:
net = nn.Sequential()
net.add(nn.Dense(256, activation='relu'),
        nn.Dropout(drop_prob1),
        nn.Dense(256, activation='relu'),
        nn.Dropout(drop_prob2),
        nn.Dense(10))
net.initialize(init.Normal(sigma=0.01))

In [17]:
trainer = gluon.Trainer(net.collect_params(), 'sgd', {'learning_rate': lr})
d2l.train_ch3(net, train_iter, test_iter, loss, num_epochs, batch_size, None, None, trainer)

epoch 1, loss 1.2270, train acc 0.522, test acc 0.763
epoch 2, loss 0.6112, train acc 0.774, test acc 0.834
epoch 3, loss 0.5155, train acc 0.812, test acc 0.842
epoch 4, loss 0.4666, train acc 0.830, test acc 0.849
epoch 5, loss 0.4382, train acc 0.840, test acc 0.867
epoch 6, loss 0.4152, train acc 0.850, test acc 0.863
epoch 7, loss 0.4007, train acc 0.855, test acc 0.868
epoch 8, loss 0.3839, train acc 0.862, test acc 0.874
epoch 9, loss 0.3704, train acc 0.866, test acc 0.875
epoch 10, loss 0.3641, train acc 0.868, test acc 0.875


# Problems

1. Try out what happens if you change the dropout probabilities for layers 1 and 2. In particular, what happens if you swap the probabilities of the two layers?

2. Increase the number of epochs and compare the results obtained when using dropout with those when not using it.

3. Compute the variance of the activation random variables after applying dropout.

4. Why should you typically not use dropout?

5. If changes are made to the model to make it more complex, such as adding hidden layer units, will the effect of using dropout to cope with overfitting be more obvious?

6. Using the model in this section as an example, compare the effects of using dropout and weight decay. What if dropout and weight decay are used at the same time?

7. What happens if we apply dropout to the individual weights of the weight matrix rather than the activations?

8. Replace the dropout activation with a random variable that takes on values in $\{0, \gamma/2, \gamma\}$. Can you design something that works better than the binary dropout function? Why might you want to use it? Why not?