# Team members
Name: Harsh Agarwal  
Matrikelnummer: 7024725  
email: haag00001@stud.uni-saarland.de

Name: Zurana Mehrin Ruhi  
Matrikelnummer: 7023892  
email: zuru00001@stud.uni-saarland.de


# 7.4 Dropout and Regularization
## DO NOT EDIT THIS FILE FROM HERE ONWARDS

Bishop proposed a simple way to regularize the L2 loss function by injecting noise into the data.
Then, in 2014, Srivastava et al. [Srivastava et al., 2014] developed a clever idea for how to apply Bishop’s idea to the internal layers of a network, too. Namely, they proposed to inject noise into each layer of the network before calculating the subsequent layer during training. They realized that when training a deep network with many layers, injecting noise only at the input enforces smoothness just on the input-output mapping.

Their idea, called dropout, involves injecting noise while computing each internal layer during forward propagation, and it has become a standard technique for training neural networks. The method is called dropout because we literally drop out some neurons during training. Throughout training, on each iteration, standard dropout consists of zeroing out some fraction of the nodes in each layer before calculating the subsequent layer.

The key challenge then is how to inject this noise. One idea is to inject the noise in an unbiased manner, so that the expected value of each layer (while fixing the others) equals the value it would have taken absent noise.

In Bishop’s work, he added Gaussian noise to the inputs to a linear model. At each training iteration, he added zero-mean noise $\epsilon \sim \mathcal{N}(0, \sigma^2)$ to the input $x$, yielding a perturbed point $x' = x + \epsilon$. In expectation, $E[x'] = x$.
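
As a quick numerical illustration of this unbiasedness, the following minimal sketch perturbs a fixed input many times and averages the results. The helper name `perturb`, the example vector, and the noise scale $\sigma = 0.1$ are illustrative assumptions and are not part of the exercise framework.

```python
import numpy as np

def perturb(x, sigma, rng):
    """Return x' = x + eps, with eps ~ N(0, sigma^2) drawn independently per entry."""
    eps = rng.normal(loc=0.0, scale=sigma, size=x.shape)
    return x + eps

rng = np.random.default_rng(0)
x = np.array([1.0, -2.0, 0.5])          # hypothetical input point

# Average many perturbed copies; the sample mean should approach x itself.
samples = np.stack([perturb(x, 0.1, rng) for _ in range(20000)])
mean_x_prime = samples.mean(axis=0)
print(mean_x_prime)
```

The sample mean should land close to `x`, since the added noise has mean zero.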

In inverted dropout regularization, one debiases each layer by normalizing by the fraction of nodes that were retained (not dropped out). In other words, with dropout probability $p$, each intermediate activation $h$ is replaced by a random variable $h'$ as follows:

$$
h' = \begin{cases} 0 & \text{with probability } p \\ \dfrac{h}{1-p} & \text{otherwise} \end{cases}
$$

By design, the expectation remains unchanged, i.e., $E[h'] = p \cdot 0 + (1-p) \cdot \frac{h}{1-p} = h$.

Also read the original dropout paper at the link below:

https://jmlr.org/papers/v15/srivastava14a.html 
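
The inverted dropout rule above can be sketched in a few lines of NumPy. This is only an illustration under assumptions: the function name `inverted_dropout` and its signature are hypothetical and are not the interface expected in ./layers/Dropout.py.

```python
import numpy as np

def inverted_dropout(h, p, rng):
    """Zero each entry of h with probability p; scale the survivors by 1 / (1 - p)."""
    if not 0.0 <= p < 1.0:
        raise ValueError("p must lie in [0, 1)")
    keep_mask = rng.random(h.shape) >= p      # True with probability 1 - p
    return h * keep_mask / (1.0 - p)

rng = np.random.default_rng(0)
h = np.full(200_000, 3.0)                     # constant activations, for easy checking
h_prime = inverted_dropout(h, p=0.5, rng=rng)

dropped_fraction = np.mean(h_prime == 0.0)    # should be close to p
sample_mean = h_prime.mean()                  # should be close to the original value 3.0
print(dropped_fraction, sample_mean)
```

Each surviving activation is doubled here because $1/(1-p) = 2$ for $p = 0.5$, which is what keeps the mean near the original value.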

In [2]:
import numpy as np
from activations import ReLU, LeakyReLU, Tanh, Softmax, Sigmoid
from losses import CrossEntropy, MSELoss
from layers import Linear
from layers import L2regularization, Dropout
from model import Model
import torch

## 7.4.1 Dropout (1.5 points)
In this exercise we are going to implement inverted dropout.
We implement dropout as a layer wrapper: the Dropout class takes two arguments,
Dropout(layer, probability). Although dropout can be applied to several types
of layers, we only apply it to linear layers in this exercise.
Implement the wrapper in ./layers/Dropout.py; it transforms the input by setting randomly chosen activations to 0.

### Implement the forward pass in ./layers/Dropout.py (1 point)
You have to implement the \__call\__() function which applies dropout to 
the linear layer.

### Implement the backward pass in ./layers/Dropout.py (0.5 point)
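
The forward and backward passes of such a wrapper might be sketched as follows. This is not the framework's actual code: `TinyLinear` and `DropoutSketch` are hypothetical stand-ins, the `training` flag is an added assumption, and the sketch assumes the mask is applied to the input of the wrapped layer. The key point is that the backward pass reuses the mask drawn in the forward pass.

```python
import numpy as np

class TinyLinear:
    """Hypothetical stand-in for the framework's Linear layer (y = x @ W + b)."""
    def __init__(self, n_in, n_out, rng):
        self.weights = rng.normal(scale=0.1, size=(n_in, n_out))
        self.bias = np.zeros((1, n_out))

    def __call__(self, x):
        self.x = x                                # cache the input for backward
        return x @ self.weights + self.bias

    def backward(self, grad_out):
        self.grad_weights = self.x.T @ grad_out
        self.grad_bias = grad_out.sum(axis=0, keepdims=True)
        return grad_out @ self.weights.T          # gradient w.r.t. the layer input


class DropoutSketch:
    """Wraps a layer and applies inverted dropout to its input (an assumption)."""
    def __init__(self, layer, p, rng):
        self.layer, self.p, self.rng = layer, p, rng
        self.training = True                      # hypothetical train/eval switch

    def __call__(self, x):
        if self.training:
            keep = self.rng.random(x.shape) >= self.p
            self.mask = keep / (1.0 - self.p)     # 0 if dropped, 1 / (1 - p) if kept
        else:
            self.mask = np.ones_like(x)           # no dropout outside training
        return self.layer(x * self.mask)

    def backward(self, grad_out):
        # Chain rule: the same mask scales the gradient, so dropped units get zero.
        return self.layer.backward(grad_out) * self.mask


rng = np.random.default_rng(0)
drop = DropoutSketch(TinyLinear(6, 3, rng), p=0.5, rng=rng)
x = rng.normal(size=(4, 6))
out = drop(x)
grad_x = drop.backward(np.ones_like(out))         # gradient of out.sum() w.r.t. x
```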


In [3]:
np.random.seed(123)

layer1 = Linear(1000, 100)
activation1 = ReLU()
layer2 = Dropout(Linear(100, 10), p=0.5)
activation2 = ReLU()
loss = CrossEntropy()

x = np.random.randn(2, 1000)
y_true = np.zeros((2, 10,))
y_true[0, 4] = 1
y_true[1, 1] = 1
m = Model([layer1, activation1, layer2, activation2])
out = m.forward(x)

# numpy seed is fixed so you should get the same value after each run
print(loss(out, y_true)) # = 2.15028 with tolerance 5e-3

2.0169578738329816


## 7.4.2 L2 Regularization (1.5 points)
In this exercise we are going to implement a wrapper around the Linear layer
that applies L2 regularization to the layer. During the forward pass, the squared norm of the
layer's weights, scaled by the regularization coefficient, is added to the loss function. We modify the backward pass
of the linear layer to incorporate the gradient of the regularization term.
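
As a rough sketch of the two pieces involved: with coefficient $\lambda$, the penalty is $\lambda \sum_{ij} W_{ij}^2$, so its gradient with respect to the weights is $2 \lambda W$, which is the extra term added in the backward pass. The function names `l2_penalty` and `l2_penalty_grad` and the random weight matrix below are illustrative assumptions, not the interface of the `L2regularization` wrapper.

```python
import numpy as np

def l2_penalty(weights, coefficient):
    """Regularization term added to the loss: coefficient * sum of squared weights."""
    return coefficient * np.sum(np.square(weights))

def l2_penalty_grad(weights, coefficient):
    """Gradient of the penalty w.r.t. the weights: 2 * coefficient * W."""
    return 2.0 * coefficient * weights

rng = np.random.default_rng(0)
W = rng.normal(size=(5, 3))                      # hypothetical weight matrix
coeff = 0.01

# Central finite-difference check of a single entry of the gradient.
eps = 1e-6
W_plus, W_minus = W.copy(), W.copy()
W_plus[2, 1] += eps
W_minus[2, 1] -= eps
numeric = (l2_penalty(W_plus, coeff) - l2_penalty(W_minus, coeff)) / (2 * eps)
analytic = l2_penalty_grad(W, coeff)[2, 1]
print(numeric, analytic)
```

The two printed values should agree closely, which is a cheap way to check a hand-written backward pass before comparing against PyTorch.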

In [4]:
# Implement a simple two layer model in our framework
# with L2 regularization loss added in the end

from model import Model
np.random.seed(123)

reg_coeff = 0.01

layer1 = Linear(1000, 100)
activation1 = ReLU()
layer2 = L2regularization(Linear(100, 10), coefficient=reg_coeff)
activation2 = ReLU()
loss = CrossEntropy()

x = np.random.randn(2, 1000)
y_true = np.zeros((2, 10,))
y_true[0, 4] = 1
y_true[1, 1] = 1
m = Model([layer1, activation1, layer2, activation2])
out = m.forward(x)

regularization = reg_coeff * np.sum(np.square(layer2.weights))

# Print the regularized loss value
print(loss(out, y_true) + regularization)

1.9887497161910503


In [5]:
# Create a similar model in pytorch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

class Net(nn.Module):
    def __init__(self) -> None:
        super().__init__()
        self.layer1 = nn.Linear(1000, 100, bias=True)
        self.layer2 = nn.Linear(100, 10, bias=True)
    
    def forward(self, x):
        x = F.relu(self.layer1(x))
        x = F.relu(self.layer2(x))
        return x
criterion = nn.CrossEntropyLoss()
net = Net()

# initialize it to the same weights
with torch.no_grad():
    net.layer1.weight.copy_(torch.tensor(layer1.weights).t())
    net.layer1.bias.copy_(torch.tensor(layer1.bias[0,:]))
    net.layer2.weight.copy_(torch.tensor(layer2.weights).t())
    net.layer2.bias.copy_(torch.tensor(layer2.bias[0,:]))

optimizer = optim.SGD(net.parameters(), lr=0.001, momentum=0)

x_torch = torch.tensor(x, dtype=torch.float32)
out = net(x_torch)
y_true_torch = torch.tensor(y_true, dtype=torch.float32)
loss_torch = criterion(out, y_true_torch) + (reg_coeff * torch.sum(torch.square( net.layer2.weight)))
print(loss_torch.item())

1.9887497425079346


In [6]:
# Calculate the gradients of our network in our toy framework
grads = m.backward(loss.grad())
m.update_parameters(grads, 0.001)
out = m.forward(x)
regularization = reg_coeff * np.sum(np.square(layer2.weights))
# Print the loss with the updated parameters
model_loss_ours = loss(out, y_true) + regularization
print(model_loss_ours)

1.8874745546591278


In [7]:
# Calculate the gradients of our pytorch model
loss_torch.backward()
optimizer.step()
# Print the loss with the updated parameters
out = net(x_torch)
model_loss_pt = criterion(out, y_true_torch).item() + (reg_coeff * torch.sum(torch.square( net.layer2.weight)))
print(model_loss_pt)

# Similar to A6 we compare the loss of pytorch model and our model
np.allclose(model_loss_ours, model_loss_pt.detach(), atol=1e-3)

tensor(1.8871, grad_fn=<AddBackward0>)


True

After the parameter update, the loss of our model and the loss of the PyTorch model
should not differ by more than 0.001.