# Team members
Name: Harsh Agarwal  
Matrikelnummer: 7024725  
email: haag00001@stud.uni-saarland.de

Name: Zurana Mehrin Ruhi  
Matrikelnummer: 7023892  
email: zuru00001@stud.uni-saarland.de


# Exercise 6.2 (6 points)

In this exercise you will continue building the toy library that you started in the previous assignment. Since the main topic of this week's lecture was backpropagation, you are asked to perform backpropagation on a neural network model that you will build in this exercise.  
  
In this toy library, we do not implement autograd or any other form of automatic differentiation. Still, it will be extremely helpful to know the basics of how PyTorch autograd works (e.g. for checking your implementation of the gradient calculations). A good starting point is the [PyTorch autograd tutorial](https://pytorch.org/tutorials/beginner/blitz/autograd_tutorial.html).   

All classes that you implemented in the previous week must now have a `grad` function which computes and returns gradients. The `grad` function in the classes of the *loss* functions (`MSELoss` and `CrossEntropy`) must compute the gradients of the loss w.r.t. its input. The `grad` function for *activation functions* must take the incoming gradient (coming from the component that follows it in the forward pass, or from the loss function) and compute the gradients of the loss w.r.t. its input. The `grad` function for *layers* (in this exercise we have only the `Linear` layer) must take the incoming gradient and compute the gradients of the loss w.r.t. both its input and its weights (you can ignore the gradients w.r.t. the biases).
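
As an illustration of the call-then-`grad` pattern described above, here is a minimal sketch. The `Square` class is hypothetical and not part of the toy library; it only assumes that the forward call caches its input so that `grad` can apply the chain rule to the incoming gradient.

```python
import numpy as np

class Square:
    """Hypothetical element-wise component y = x**2 (illustration only)."""

    def __call__(self, x):
        # Cache the input: the backward pass needs it to evaluate dy/dx.
        self.x = np.asarray(x, dtype=float)
        return self.x ** 2

    def grad(self, in_gradient):
        # Chain rule, element-wise: dL/dx = dL/dy * dy/dx = in_gradient * 2x.
        return in_gradient * 2.0 * self.x

square = Square()
x = np.array([[1.0, -2.0], [0.5, 3.0]])   # N=2 datapoints, D=2 features
y = square(x)
dx = square.grad(np.ones_like(y))          # incoming gradient of ones
```

The library's own activations may store different intermediate values (for example the output instead of the input); the sketch only shows the shape of the interface.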

For each gradient calculation, we provide some low-dimensional data. After you finish the implementation of each `grad` function, simply run the corresponding cell (**do not** change the contents of these cells). To check the correctness of your implementation, we ask you to *call* the corresponding function from PyTorch on the same input data, compute the gradients, and compare them with the gradients from your implementation. If your solution is correct, they must be the same (or differ by less than $10^{-3}$).
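
Besides the PyTorch comparison requested above, a library-free second check is a central finite difference. The helper below is a sketch under that assumption; `numerical_gradient` is a hypothetical name, and it deliberately uses a loop because it is a checking tool rather than part of the vectorized library.

```python
import numpy as np

def numerical_gradient(f, x, eps=1e-6):
    """Central-difference estimate of d f(x) / d x for a scalar-valued f."""
    x = np.asarray(x, dtype=float)
    grad = np.zeros_like(x)
    for idx in np.ndindex(x.shape):
        step = np.zeros_like(x)
        step[idx] = eps
        # Perturb one entry at a time in both directions.
        grad[idx] = (f(x + step) - f(x - step)) / (2 * eps)
    return grad

# Example: f(x) = sum(x**2) has the analytic gradient 2*x.
x = np.array([[0.1, -0.3], [0.5, 0.9]])
numeric = numerical_gradient(lambda v: np.sum(v ** 2), x)
analytic = 2 * x
max_diff = np.max(np.abs(numeric - analytic))
```

Here `max_diff` should be far below the $10^{-3}$ tolerance mentioned above.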

We are providing the correct solution for the previous assignment so that you can still work on this one, even if you had some mistakes in your solution. Simply replace the files or work from a different directory.

Please remember that everything is processed in minibatches and gradients must be calculated accordingly. The input of each model component has dimensions `N * D`, where `N` is the number of datapoints in the minibatch and `D` is the number of features. Ideally, all gradient computations should be implemented in vectorized form (without using any loops).

In [2]:
import numpy as np
from activations import ReLU, LeakyReLU, Tanh, Softmax, Sigmoid
from losses import CrossEntropy, MSELoss
from layers import Linear
import torch
np.random.seed(23)

## Exercise 6.2.1 Implement gradient calculation for MSE loss (0.5 points)
Check the correctness of the gradient by calculating it on the same data using PyTorch  

In [3]:
y_pred = np.array([0, 1, 2])
y_true = np.array([1, 3, 3])
loss = MSELoss()

print('predictions')
print(y_pred)
print('true values')
print(y_true)

print('MSE loss')
print(loss(y_pred, y_true))

print('MSE gradient')
print(loss.grad())

predictions
[0 1 2]
true values
[1 3 3]
MSE loss
2.0
MSE gradient
[-0.66666667 -1.33333333 -0.66666667]


In [4]:
y_pred = torch.tensor([0, 1, 2], requires_grad=True, dtype=torch.float32)
y_true = torch.tensor([1, 3, 3], dtype=torch.float32)

print('predictions')
print(y_pred)
print('true values')
print(y_true)

print('MSE loss')
loss = torch.nn.MSELoss()
actual_loss = loss(y_pred, y_true)
print(actual_loss)

print('MSE gradient')
actual_loss.backward()
print(y_pred.grad)

predictions
tensor([0., 1., 2.], requires_grad=True)
true values
tensor([1., 3., 3.])
MSE loss
tensor(2., grad_fn=<MseLossBackward0>)
MSE gradient
tensor([-0.6667, -1.3333, -0.6667])
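
The values printed above are consistent with the closed form $\frac{\delta L}{\delta \hat{y}_i} = \frac{2}{N}(\hat{y}_i - y_i)$ for a loss averaged over all $N$ elements. The sketch below recomputes them in plain NumPy; `mse_loss` and `mse_grad` are hypothetical helper names, not the methods of the library's `MSELoss` class.

```python
import numpy as np

def mse_loss(y_pred, y_true):
    # Mean over all elements, like the default reduction='mean' of torch.nn.MSELoss.
    return np.mean((y_pred - y_true) ** 2)

def mse_grad(y_pred, y_true):
    # d/dy_pred of mean((y_pred - y_true)^2) = 2 * (y_pred - y_true) / number_of_elements
    return 2.0 * (y_pred - y_true) / y_pred.size

y_pred = np.array([0.0, 1.0, 2.0])
y_true = np.array([1.0, 3.0, 3.0])
loss_value = mse_loss(y_pred, y_true)
grad_value = mse_grad(y_pred, y_true)
```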



## Exercise 6.2.2 Implement gradient calculation for Cross Entropy Loss (1.5 points)
First prove that the partial derivative of the loss w.r.t. one of the input variables is <br> $\frac{\delta L}{\delta o_i} = p_i - y_i$  
where <br> $o_i$ - one of the input variables,   
$p_i$ - probability for that input variable calculated using softmax,   
$y_i$ - label for that input variable ($y_i \in \{0, 1\}$).  
For simplicity of the proof, you can prove it for just one datapoint, but in the code, you should properly extrapolate it for computing the gradients for the whole minibatch (`N` datapoints).  

Write the solution in the markdown cell below.
  
Please remember that a typical Cross Entropy Loss implementation, including ours, implicitly applies Softmax before calculating the CE loss.
  
Check the correctness of the gradient by calculating it on the same data using PyTorch  

$L = -\sum_k y_k * log(p_k)$, with $p_k = \frac{e^{o_k}}{\sum_j e^{o_j}}$ and $\sum_k y_k = 1$ <br>
$\frac{\delta L}{\delta o_i} = \sum_k \frac{\delta L}{\delta p_k} * \frac{\delta p_k}{\delta o_i}$ <br>
$\frac{\delta L}{\delta p_k} = -\frac{y_k}{p_k}$ (Cross Entropy Derivative) <br> 
$\frac{\delta p_k}{\delta o_i} = p_k(1 - p_i)$ for $k = i$ and $\frac{\delta p_k}{\delta o_i} = -p_k p_i$ for $k \neq i$ (Softmax Derivative) <br> 
$\frac{\delta L}{\delta o_i} = -\frac{y_i}{p_i} * p_i(1 - p_i) + \sum_{k \neq i} \frac{y_k}{p_k} * p_k p_i = -y_i + p_i \sum_k y_k = p_i - y_i$  
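
For a minibatch, the result $\frac{\delta L}{\delta o_i} = p_i - y_i$ is applied row by row and divided by `N` when the loss is averaged. The sketch below assumes that averaging convention; `softmax`, `cross_entropy` and `cross_entropy_grad` are hypothetical helper names, not the library's `CrossEntropy` class.

```python
import numpy as np

def softmax(logits):
    # Subtract the row-wise maximum for numerical stability; the result is unchanged.
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

def cross_entropy(logits, targets):
    # Mean over the N datapoints of -sum_i y_i * log(p_i).
    p = softmax(logits)
    return -np.sum(targets * np.log(p)) / logits.shape[0]

def cross_entropy_grad(logits, targets):
    # (p - y) per datapoint, divided by N because the loss is averaged over the minibatch.
    return (softmax(logits) - targets) / logits.shape[0]

logits = np.array([[0.4, 0.35, 0.71, 0.30],
                   [0.01, 0.01, 0.01, 0.65]])
targets = np.array([[0.0, 0.0, 1.0, 0.0],
                    [0.0, 0.0, 0.0, 1.0]])
ce_value = cross_entropy(logits, targets)
ce_gradient = cross_entropy_grad(logits, targets)
```

Because the softmax probabilities and the one-hot targets each sum to one, every row of `ce_gradient` sums to zero, which is a quick sanity check for an implementation.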

In [5]:
ce_loss = CrossEntropy(average=True)
predictions = np.array([[0.4,0.35,0.71,0.30],
                        [0.01,0.01,0.01,0.65]])
targets = np.array([[0,0,1,0],
                  [0,0,0,1]])

print('predictions')
print(predictions)
print('targets')
print(targets)

print('cross entropy loss')
print(ce_loss(predictions, targets))

print('gradient of the cross entropy loss')
print(ce_loss.grad())

predictions
[[0.4  0.35 0.71 0.3 ]
 [0.01 0.01 0.01 0.65]]
targets
[[0 0 1 0]
 [0 0 0 1]]
cross entropy loss
1.0391157169091105
gradient of the cross entropy loss
[[ 0.11849768  0.11271848 -0.33843729  0.10722113]
 [ 0.10211415  0.10211415  0.10211415 -0.30634246]]


In [6]:
predictions = torch.tensor([[0.4,0.35,0.71,0.30],
                            [0.01,0.01,0.01,0.65]], dtype=torch.float32, requires_grad=True)
targets = torch.tensor([[0,0,1,0],
                        [0,0,0,1]], dtype=torch.float32)

print('predictions')
print(predictions)
print('targets')
print(targets)

print('cross entropy loss')
loss = torch.nn.CrossEntropyLoss()
actual_loss = loss(predictions, targets)
print(actual_loss)

print('gradient of the cross entropy loss')
actual_loss.backward()
print(predictions.grad)

predictions
tensor([[0.4000, 0.3500, 0.7100, 0.3000],
        [0.0100, 0.0100, 0.0100, 0.6500]], requires_grad=True)
targets
tensor([[0., 0., 1., 0.],
        [0., 0., 0., 1.]])
cross entropy loss
tensor(1.0391, grad_fn=<DivBackward1>)
gradient of the cross entropy loss
tensor([[ 0.1185,  0.1127, -0.3384,  0.1072],
        [ 0.1021,  0.1021,  0.1021, -0.3063]])


## Exercise 6.2.3 Implement gradient calculation for linear layer (1.5 points)

First prove that $\frac{\delta L}{\delta X} = \frac{\delta L}{\delta Y} W^T$ and $\frac{\delta L}{\delta W} = X^T \frac{\delta L}{\delta Y}$  
where $Y = XW$ <br> ($X$ is the input data matrix of dimension `N * in_features` and $W$ is the weight matrix of dimension `in_features * out_features`),  
$\frac{\delta L}{\delta Y}$ is the incoming gradient of dimension `N * out_features` (e.g. from the loss function that is applied on the outputs of the linear layer).  

Write the solution in the markdown cell below.

Check the correctness of the gradient by calculating it on the same data using PyTorch.

$Y_{nj} = \sum_k X_{nk} W_{kj}$ <br>
$\frac{\delta L}{\delta X_{nk}} = \sum_j \frac{\delta L}{\delta Y_{nj}} * \frac{\delta Y_{nj}}{\delta X_{nk}} = \sum_j \frac{\delta L}{\delta Y_{nj}} * W_{kj}$ <br>
$=>\frac{\delta L}{\delta X} = \frac{\delta L}{\delta Y} * W^T$
<br><br>
$\frac{\delta L}{\delta W_{kj}} = \sum_n \frac{\delta Y_{nj}}{\delta W_{kj}} * \frac{\delta L}{\delta Y_{nj}} = \sum_n X_{nk} * \frac{\delta L}{\delta Y_{nj}}$ <br>
$=>\frac{\delta L}{\delta W} = X^T * \frac{\delta L}{\delta Y}$
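
The two identities can be sketched directly in NumPy and spot-checked with a finite difference. The random data, the seed and the helper `scalar_loss` are assumptions made for this illustration only; the bias is omitted, as in the statement $Y = XW$.

```python
import numpy as np

rng = np.random.default_rng(0)
N, in_features, out_features = 4, 5, 2
X = rng.standard_normal((N, in_features))
W = rng.standard_normal((in_features, out_features))
dY = np.ones((N, out_features))        # incoming gradient dL/dY

dX = dY @ W.T                          # shape (N, in_features)
dW = X.T @ dY                          # shape (in_features, out_features)

# Independent check: L = sum(dY * (X @ W)) has exactly dY as its gradient w.r.t. Y,
# so perturbing a single weight must reproduce the matching entry of dW.
def scalar_loss(weights):
    return np.sum(dY * (X @ weights))

eps = 1e-6
W_plus, W_minus = W.copy(), W.copy()
W_plus[2, 1] += eps
W_minus[2, 1] -= eps
numeric_dW_21 = (scalar_loss(W_plus) - scalar_loss(W_minus)) / (2 * eps)
```

With an all-ones incoming gradient, every row of `dX` is identical and all columns of `dW` are equal to the column sums of `X`.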

In [7]:
minibatch_size = 4
in_features = 5
out_features = 2
minibatch = np.random.randn(minibatch_size, in_features)
print('input data')
print(minibatch)

layer = Linear(in_features, out_features)
print('output of the linear layer')
print(layer(minibatch))

in_gradient = np.ones((minibatch_size, out_features,))
gradient_weights, gradient_input = layer.grad(in_gradient)
print('gradient w.r.t weights')
print(gradient_weights)
print('gradient w.r.t. inputs')
print(gradient_input)

input data
[[ 0.66698806  0.02581308 -0.77761941  0.94863382  0.70167179]
 [-1.05108156 -0.36754812 -1.13745969 -1.32214752  1.77225828]
 [-0.34745899  0.67014016  0.32227152  0.06034293 -1.04345   ]
 [-1.00994188  0.44173637  1.12887685 -1.83806777 -0.93876863]]
output of the linear layer
[[ 0.0549832   0.11302227]
 [ 0.08216551 -0.18497617]
 [ 0.08671981  0.08423143]
 [ 0.1108269  -0.12477559]]
gradient w.r.t weights
[[-1.74149438 -1.74149438]
 [ 0.7701415   0.7701415 ]
 [-0.46393073 -0.46393073]
 [-2.15123853 -2.15123853]
 [ 0.49171144  0.49171144]]
gradient w.r.t. inputs
[[ 0.04217654  0.06751403 -0.03557016  0.05654907 -0.05248106]
 [ 0.04217654  0.06751403 -0.03557016  0.05654907 -0.05248106]
 [ 0.04217654  0.06751403 -0.03557016  0.05654907 -0.05248106]
 [ 0.04217654  0.06751403 -0.03557016  0.05654907 -0.05248106]]


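Note that, to get the same output, the *weights* and *biases* of the `Linear` layer instantiated above and of the `Linear` layer from PyTorch must be the same. PyTorch stores the weight matrix with shape `out_features * in_features`, which is why the weights are transposed when they are copied in the cell below. 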

In [8]:
print('input data')
minibatch = torch.tensor(minibatch, dtype=torch.float32, requires_grad=True)
print(minibatch)

print('output of the linear layer')
m = torch.nn.Linear(in_features, out_features)
m.weight = torch.nn.Parameter(torch.tensor(layer.weights, dtype=torch.float32, requires_grad=True).T)
m.bias = torch.nn.Parameter(torch.tensor(layer.bias, dtype=torch.float32, requires_grad=True))
linear_output = m(minibatch)
print(linear_output)

print('gradient w.r.t weights')
linear_output.backward(torch.tensor(in_gradient))
print(m.weight.grad.T)

print('gradient w.r.t. inputs')
print(minibatch.grad)


input data
tensor([[ 0.6670,  0.0258, -0.7776,  0.9486,  0.7017],
        [-1.0511, -0.3675, -1.1375, -1.3221,  1.7723],
        [-0.3475,  0.6701,  0.3223,  0.0603, -1.0434],
        [-1.0099,  0.4417,  1.1289, -1.8381, -0.9388]], requires_grad=True)
output of the linear layer
tensor([[ 0.0550,  0.1130],
        [ 0.0822, -0.1850],
        [ 0.0867,  0.0842],
        [ 0.1108, -0.1248]], grad_fn=<AddmmBackward0>)
gradient w.r.t weights
tensor([[-1.7415, -1.7415],
        [ 0.7701,  0.7701],
        [-0.4639, -0.4639],
        [-2.1512, -2.1512],
        [ 0.4917,  0.4917]])
gradient w.r.t. inputs
tensor([[ 0.0422,  0.0675, -0.0356,  0.0565, -0.0525],
        [ 0.0422,  0.0675, -0.0356,  0.0565, -0.0525],
        [ 0.0422,  0.0675, -0.0356,  0.0565, -0.0525],
        [ 0.0422,  0.0675, -0.0356,  0.0565, -0.0525]])


## Exercise 6.2.4 Implement gradient calculation for activation functions (1 point)
Check the correctness of the gradients by calculating them on the same data using PyTorch.  

In [9]:
x = np.array([[0.1, -0.3, 0.5, 0.9, 0, -1.0],
              [0.2, -0.4, 1.1, 0.4, 0.3, 0]])
sigmoid = Sigmoid()
print(sigmoid(x))

in_gradient = np.ones((2, 6,))
print(sigmoid.grad(in_gradient))

[[0.52497919 0.42555748 0.62245933 0.7109495  0.5        0.26894142]
 [0.549834   0.40131234 0.75026011 0.59868766 0.57444252 0.5       ]]
[[0.24937604 0.24445831 0.23500371 0.20550031 0.25       0.19661193]
 [0.24751657 0.24026075 0.18736988 0.24026075 0.24445831 0.25      ]]


In [10]:
x_torch = torch.tensor(x, requires_grad=True)
m = torch.nn.Sigmoid()
output = m(x_torch)
print(output)
output.backward(torch.tensor(in_gradient))
print(x_torch.grad)

tensor([[0.5250, 0.4256, 0.6225, 0.7109, 0.5000, 0.2689],
        [0.5498, 0.4013, 0.7503, 0.5987, 0.5744, 0.5000]], dtype=torch.float64,
       grad_fn=<SigmoidBackward0>)
tensor([[0.2494, 0.2445, 0.2350, 0.2055, 0.2500, 0.1966],
        [0.2475, 0.2403, 0.1874, 0.2403, 0.2445, 0.2500]], dtype=torch.float64)


In [11]:
tanh = Tanh()
print(tanh(x))
print(tanh.grad(in_gradient))

[[ 0.09966799 -0.29131261  0.46211716  0.71629787  0.         -0.76159416]
 [ 0.19737532 -0.37994896  0.80049902  0.37994896  0.29131261  0.        ]]
[[0.99006629 0.91513696 0.78644773 0.48691736 1.         0.41997434]
 [0.96104298 0.85563879 0.35920132 0.85563879 0.91513696 1.        ]]


In [12]:
x_torch = torch.tensor(x, requires_grad=True)
m = torch.nn.Tanh()
output = m(x_torch)
print(output)
output.backward(torch.tensor(in_gradient))
print(x_torch.grad)

tensor([[ 0.0997, -0.2913,  0.4621,  0.7163,  0.0000, -0.7616],
        [ 0.1974, -0.3799,  0.8005,  0.3799,  0.2913,  0.0000]],
       dtype=torch.float64, grad_fn=<TanhBackward0>)
tensor([[0.9901, 0.9151, 0.7864, 0.4869, 1.0000, 0.4200],
        [0.9610, 0.8556, 0.3592, 0.8556, 0.9151, 1.0000]], dtype=torch.float64)


In [13]:
relu = ReLU()
print(relu(x))
print(relu.grad(in_gradient))

[[ 0.1 -0.   0.5  0.9  0.  -0. ]
 [ 0.2 -0.   1.1  0.4  0.3  0. ]]
[[1. 0. 1. 1. 0. 0.]
 [1. 0. 1. 1. 1. 0.]]


In [14]:
x_torch = torch.tensor(x, requires_grad=True)
m = torch.nn.ReLU()
output = m(x_torch)
print(output)
output.backward(torch.tensor(in_gradient))
print(x_torch.grad)

tensor([[0.1000, 0.0000, 0.5000, 0.9000, 0.0000, 0.0000],
        [0.2000, 0.0000, 1.1000, 0.4000, 0.3000, 0.0000]], dtype=torch.float64,
       grad_fn=<ReluBackward0>)
tensor([[1., 0., 1., 1., 0., 0.],
        [1., 0., 1., 1., 1., 0.]], dtype=torch.float64)


In [15]:
leaky_relu = LeakyReLU()
print(leaky_relu(x))
print(leaky_relu.grad(in_gradient))

[[ 0.1   -0.003  0.5    0.9    0.    -0.01 ]
 [ 0.2   -0.004  1.1    0.4    0.3    0.   ]]
[[1.   0.01 1.   1.   0.01 0.01]
 [1.   0.01 1.   1.   1.   0.01]]


In [16]:
x_torch = torch.tensor(x, requires_grad=True)
m = torch.nn.LeakyReLU()
output = m(x_torch)
print(output)
output.backward(torch.tensor(in_gradient))
print(x_torch.grad)

tensor([[ 0.1000, -0.0030,  0.5000,  0.9000,  0.0000, -0.0100],
        [ 0.2000, -0.0040,  1.1000,  0.4000,  0.3000,  0.0000]],
       dtype=torch.float64, grad_fn=<LeakyReluBackward0>)
tensor([[1.0000, 0.0100, 1.0000, 1.0000, 0.0100, 0.0100],
        [1.0000, 0.0100, 1.0000, 1.0000, 1.0000, 0.0100]], dtype=torch.float64)
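
The four derivatives checked above can be written compactly in NumPy. This is a sketch rather than the library's classes: it assumes the incoming gradient is multiplied element-wise with the local derivative, and it uses a negative slope of `0.01`, the default of `torch.nn.LeakyReLU`.

```python
import numpy as np

x = np.array([[0.1, -0.3, 0.5, 0.9, 0.0, -1.0],
              [0.2, -0.4, 1.1, 0.4, 0.3, 0.0]])
in_gradient = np.ones_like(x)
slope = 0.01  # negative slope of the leaky ReLU

s = 1.0 / (1.0 + np.exp(-x))
sigmoid_grad = in_gradient * s * (1.0 - s)                    # sigma'(x) = sigma(x) * (1 - sigma(x))
tanh_grad = in_gradient * (1.0 - np.tanh(x) ** 2)             # tanh'(x) = 1 - tanh(x)^2
relu_grad = in_gradient * (x > 0)                             # 1 for x > 0, else 0
leaky_relu_grad = in_gradient * np.where(x > 0, 1.0, slope)   # 1 for x > 0, else slope
```

At `x = 0` this convention gives a gradient of `0` for ReLU and `0.01` for LeakyReLU, which agrees with the values printed above.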


## Exercise 6.2.5 Implement a model class (1.5 points)
Implement a model class which stores a list of components of the model (in this exercise those are only the *layers* and *activation functions*). 
It must perform the forward pass and also be able to calculate and store the gradients for all the layers, and perform a parameter update step (here we deviate from PyTorch since we don't use *autograd*).  
For simplicity, you don't have to compare the value of each model parameter with the PyTorch implementation; just check the value of the resulting loss (before and after the parameter update step). We provide all the code below, including the PyTorch code. You don't have to change the cells below; just check whether your implementation of the model achieves the same decrease in loss as the equivalent implementation in PyTorch.
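
One possible shape for such a class is sketched below. Everything in it is an assumption made for illustration: `TinyLinear`, `TinyReLU` and `SketchModel` are hypothetical stand-ins, biases are omitted, weight gradients are returned as a dictionary keyed by component index, and an MSE loss replaces cross entropy to keep the example short. The only conventions taken from the cells in this notebook are that a layer's `grad` returns `(gradient_weights, gradient_input)` and that the model exposes `forward`, `backward` and `update_parameters`.

```python
import numpy as np

class TinyLinear:
    """Hypothetical stand-in for the library's Linear layer (bias omitted)."""
    def __init__(self, in_features, out_features, rng):
        self.weights = 0.1 * rng.standard_normal((in_features, out_features))

    def __call__(self, x):
        self.x = x                      # cache the input for the backward pass
        return x @ self.weights

    def grad(self, in_gradient):
        # Returns (gradient w.r.t. weights, gradient w.r.t. input).
        return self.x.T @ in_gradient, in_gradient @ self.weights.T

class TinyReLU:
    def __call__(self, x):
        self.x = x
        return np.maximum(x, 0.0)

    def grad(self, in_gradient):
        return in_gradient * (self.x > 0)

class SketchModel:
    def __init__(self, components):
        self.components = components

    def forward(self, x):
        for component in self.components:
            x = component(x)
        return x

    def backward(self, in_gradient):
        # Walk the components in reverse order, collecting weight gradients per layer.
        grads = {}
        for index in reversed(range(len(self.components))):
            result = self.components[index].grad(in_gradient)
            if isinstance(result, tuple):
                grads[index], in_gradient = result
            else:
                in_gradient = result
        return grads

    def update_parameters(self, grads, learning_rate):
        # Plain gradient descent step: W <- W - lr * dL/dW.
        for index, grad_w in grads.items():
            self.components[index].weights -= learning_rate * grad_w

rng = np.random.default_rng(0)
model = SketchModel([TinyLinear(3, 4, rng), TinyReLU(), TinyLinear(4, 2, rng)])
x = rng.standard_normal((5, 3))
y_true = rng.standard_normal((5, 2))

out = model.forward(x)
loss_before = np.mean((out - y_true) ** 2)
grads = model.backward(2.0 * (out - y_true) / out.size)   # gradient of the MSE loss
model.update_parameters(grads, 0.01)
loss_after = np.mean((model.forward(x) - y_true) ** 2)
```

After one small gradient descent step, `loss_after` should be lower than `loss_before`, mirroring the before/after comparison in the cells that follow.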

In [17]:
from model import Model
np.random.seed(123)

layer1 = Linear(1000, 100)
activation1 = ReLU()
layer2 = Linear(100, 10)
activation2 = ReLU()
loss = CrossEntropy()

x = np.random.randn(2, 1000)
y_true = np.zeros((2, 10,))
y_true[0, 4] = 1
y_true[1, 1] = 1
m = Model([layer1, activation1, layer2, activation2])
out = m.forward(x)
print(loss(out, y_true))

1.9648272572841394


In [18]:
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

class Net(nn.Module):
    def __init__(self) -> None:
        super().__init__()
        self.layer1 = nn.Linear(1000, 100, bias=True)
        self.layer2 = nn.Linear(100, 10, bias=True)
    
    def forward(self, x):
        x = F.relu(self.layer1(x))
        x = F.relu(self.layer2(x))
        return x
criterion = nn.CrossEntropyLoss()
net = Net()

with torch.no_grad():
    net.layer1.weight.copy_(torch.tensor(layer1.weights).t())
    net.layer1.bias.copy_(torch.tensor(layer1.bias[0,:]))
    net.layer2.weight.copy_(torch.tensor(layer2.weights).t())
    net.layer2.bias.copy_(torch.tensor(layer2.bias[0,:]))

optimizer = optim.SGD(net.parameters(), lr=0.001, momentum=0)

x_torch = torch.tensor(x, dtype=torch.float32)
out = net(x_torch)
y_true_torch = torch.tensor(y_true, dtype=torch.float32)
loss_torch = criterion(out, y_true_torch)
print(loss_torch.item())

1.964827299118042


In [19]:
grads = m.backward(loss.grad())
m.update_parameters(grads, 0.001)
out = m.forward(x)
model_loss_ours = loss(out, y_true)
print(model_loss_ours)

1.8635426227579552


In [20]:
loss_torch.backward()
optimizer.step()

out = net(x_torch)
model_loss_pt = criterion(out, y_true_torch).item()
print(model_loss_pt)

1.863140344619751


In [21]:
# sanity check within some acceptable tolerance level

np.allclose(model_loss_ours, model_loss_pt, atol=1e-3)

True