You will write your own implementation of the backpropagation algorithm for training your own neural network, as
well as a few other features such as activation and loss functions.

## **Note**:

### - Complete the following implementation of neural networks from scratch using NumPy. You may refer to the first hour of [this video](https://www.youtube.com/live/KdjBONblhHw?si=Gb-wujeVmgYwmaYP).
### - Submit this notebook as well once your implementations are complete. Wherever required, explain your understanding and code logic in comments.

## Task 1: Activations

Implement the `forward` and `derivative` class methods for each activation function.
* The identity function has been implemented for you as an example.
* The output of the activation should be stored in the `self.state` variable of the class. The `self.state`
variable should be used for calculating the derivative during the backward pass.
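As a rough illustration of the `self.state` pattern described above, here is a minimal sketch using softplus, an activation that is not part of this assignment. The class name `SoftplusSketch` is hypothetical; the point is only that `forward` caches its output and `derivative` is computed from that cached output alone.

```python
import numpy as np


class SoftplusSketch:
    """Illustrative only: softplus(x) = log(1 + exp(x)), not one of the graded classes."""

    def __init__(self):
        self.state = None

    def forward(self, x):
        # Element-wise operation, so the output keeps the shape of x.
        # np.logaddexp(0, x) evaluates log(exp(0) + exp(x)) in a stable way.
        self.state = np.logaddexp(0.0, x)
        return self.state

    def derivative(self):
        # With y = softplus(x), dy/dx = sigmoid(x) = 1 - exp(-y),
        # so the cached output is all that is needed here.
        return 1.0 - np.exp(-self.state)


act = SoftplusSketch()
x = np.array([[-2.0, 0.0, 2.0]])
out = act.forward(x)
grad = act.derivative()
print(out.shape, grad.shape)  # both match x.shape
```

The same structure applies to the classes below: decide what `forward` has to remember, then express the derivative in terms of it.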

In [None]:
import numpy as np
import math

In [None]:
class Activation(object):

    """
    Interface for activation functions (non-linearities).

    In all implementations, the state attribute must contain the result,
    i.e. the output of forward.
    """

    # No additional work is needed for this class, as it acts like an
    # abstract base class for the others

    # Note that these activation functions are scalar operations. I.e, they
    # shouldn't change the shape of the input.

    def __init__(self):
        self.state = None

    def __call__(self, x):
        return self.forward(x)

    def forward(self, x):
        raise NotImplementedError

    def derivative(self):
        raise NotImplementedError

In [None]:
class Identity(Activation):

    """
    Identity function (already implemented).
    """

    # This class is a gimme as it is already implemented for you as an example

    def __init__(self):
        super(Identity, self).__init__()

    def forward(self, x):
        self.state = x
        return x

    def derivative(self):
        return 1.0

In [None]:
class Sigmoid(Activation):

    """
    Sigmoid non-linearity
    """

    # Remember: do not change the function signatures; they need
    # to stay the same for AutoLab.

    def __init__(self):
        super(Sigmoid, self).__init__()

    def forward(self, x):
        # Might we need to store something before returning?
        raise NotImplementedError

    def derivative(self):
        # Maybe something we need later in here...
        raise NotImplementedError

In [None]:
class Tanh(Activation):

    """
    Tanh non-linearity
    """

    def __init__(self):
        super(Tanh, self).__init__()

    def forward(self, x):
        raise NotImplementedError

    def derivative(self):
        raise NotImplementedError

In [None]:
class ReLU(Activation):

    """
    ReLU non-linearity
    """

    def __init__(self):
        super(ReLU, self).__init__()

    def forward(self, x):
        raise NotImplementedError

    def derivative(self):
        raise NotImplementedError

## Task 2: Loss
Implement the forward and derivative methods for `SoftmaxCrossEntropy`.
* This class inherits the base `Criterion` class.
* We will be using the softmax cross entropy loss detailed in the appendix of this writeup; use the
LogSumExp trick to ensure numerical stability.

The LogSumExp trick prevents numerical overflow and underflow, which can occur when an exponent is very large or very negative. For example, try exponentiating in Python as shown below:

```python
import math
try:
    print(math.e**1000)
except OverflowError as err:
    print("OverflowError:", err)  # exponent too large for a float
print(math.e**(-1000))  # underflows to 0.0
```

As you will see, Python raises an `OverflowError` for exponents that are too large, and rounds the result down to zero for exponents that are too negative.
We can avoid these errors by using the LogSumExp trick:

![alt text](https://imgur.com/download/L0P17iv)

You can read more about the derivation of the equivalence [here](https://www.xarg.org/2016/06/the-log-sum-exp-trick-in-machine-learning/) and [here](https://blog.feedly.com/tricks-of-the-trade-logsumexp/)
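For a concrete feel of the trick, the sketch below computes `log(sum(exp(x)))` row by row after subtracting each row's maximum. The function name `logsumexp` and the test values are illustrative choices, not something prescribed by the assignment.

```python
import numpy as np


def logsumexp(x, axis=1):
    # Shift by the maximum so the largest term becomes exp(0) = 1,
    # then add the maximum back outside the log.
    m = np.max(x, axis=axis, keepdims=True)
    return np.squeeze(m, axis=axis) + np.log(np.sum(np.exp(x - m), axis=axis))


logits = np.array([[1000.0, 1000.0],
                   [-1000.0, -1000.0]])
print(logsumexp(logits))  # approximately [1000.693, -999.307]
```

Evaluating `np.log(np.sum(np.exp(logits), axis=1))` directly on the same input would overflow in the first row and underflow in the second, which is exactly what the shift avoids.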

In [None]:
# The following Criterion class will be used again as the basis for a number
# of loss functions (which are in the form of classes so that they can be
# exchanged easily (it's how PyTorch and other ML libraries do it))

class Criterion(object):
    """
    Interface for loss functions.
    """

    # Nothing needs to be done to this class; it is the base for the loss classes that follow

    def __init__(self):
        self.logits = None
        self.labels = None
        self.loss = None

    def __call__(self, x, y):
        return self.forward(x, y)

    def forward(self, x, y):
        raise NotImplementedError

    def derivative(self):
        raise NotImplementedError

* Implement the softmax cross entropy operation on a batch of output vectors.
  *  Hint: Add a class attribute to keep track of intermediate values necessary for the backward computation
* Calculate the ‘derivative’ of softmax cross entropy using intermediate values saved in the forward pass.
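One way to sanity-check the two bullets above is to compare an analytic gradient against a finite-difference estimate. The sketch below is a standalone function rather than the `SoftmaxCrossEntropy` class. It assumes one-hot labels and a per-example loss, and uses the standard result that the gradient with respect to the logits is `softmax(x) - y`. The name `softmax_xent` is hypothetical.

```python
import numpy as np


def softmax_xent(logits, labels):
    # Log-softmax via the shifted LogSumExp for numerical stability.
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_softmax = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    loss = -(labels * log_softmax).sum(axis=1)   # shape (batch,)
    grad = np.exp(log_softmax) - labels          # shape (batch, classes)
    return loss, grad


rng = np.random.default_rng(0)
logits = rng.normal(size=(4, 10))
labels = np.eye(10)[rng.integers(0, 10, size=4)]  # one-hot rows
loss, grad = softmax_xent(logits, labels)

# Finite-difference estimate for a single logit entry.
eps = 1e-6
bumped = logits.copy()
bumped[0, 3] += eps
numeric = (softmax_xent(bumped, labels)[0][0] - loss[0]) / eps
print(abs(numeric - grad[0, 3]))  # should be very small
```

In the class version, the softmax (or log-softmax) computed in `forward` is the natural intermediate value to keep as an attribute for `derivative`.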

In [None]:
class SoftmaxCrossEntropy(Criterion):
    """
    Softmax loss
    """

    def __init__(self):
        super(SoftmaxCrossEntropy, self).__init__()

    def forward(self, x, y):
        """
        Argument:
            x (np.array): (batch size, 10)
            y (np.array): (batch size, 10)
        Return:
            out (np.array): (batch size, )
        """
        self.logits = x
        self.labels = y

        raise NotImplementedError

    def derivative(self):
        """
        Return:
            out (np.array): (batch size, 10)
        """

        raise NotImplementedError

## Task 3: Linear Layer
Implement the forward and backward methods for the `Linear` class.
* Hint: Add a class attribute to keep track of intermediate values necessary for the backward computation.

Write the code for the backward method of Linear. 
* The input delta is the derivative of the loss with respect to the output of the linear layer. It has the same shape as the linear layer output. 
* Calculate `self.dW` and `self.db` for the backward method. `self.dW` and `self.db` represent the gradients of the loss (averaged across the batch) w.r.t `self.W` and `self.b`. Their shapes are the same as the weight `self.W` and the bias `self.b`.
* Calculate the return value for the backward method. `dx` is the derivative of the loss with respect to the input of the linear layer and has the same shape as the input.
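The shape bookkeeping described above can be checked with plain arrays before writing the class. This is a sketch under the stated conventions (`W` is `(in feature, out feature)`, `b` is `(1, out feature)`, gradients of the parameters are averaged across the batch); the variable names and sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
batch, in_feature, out_feature = 5, 4, 3

x = rng.normal(size=(batch, in_feature))
W = rng.normal(size=(in_feature, out_feature))
b = np.zeros((1, out_feature))

out = x @ W + b                          # forward: (batch, out_feature)
delta = rng.normal(size=out.shape)       # stand-in for dLoss/dOut

dW = x.T @ delta / batch                 # (in_feature, out_feature), batch-averaged
db = delta.mean(axis=0, keepdims=True)   # (1, out_feature), batch-averaged
dx = delta @ W.T                         # (batch, in_feature), same shape as x

print(dW.shape == W.shape, db.shape == b.shape, dx.shape == x.shape)
```

Note that `backward` needs `x` from the forward pass, which is why the hint suggests storing it as an attribute.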

In [None]:
class Linear():
    def __init__(self, in_feature, out_feature, weight_init_fn, bias_init_fn):

        """
        Attributes:
            W (np.array): (in feature, out feature)
            dW (np.array): (in feature, out feature)
            momentum_W (np.array): (in feature, out feature)

            b (np.array): (1, out feature)
            db (np.array): (1, out feature)
            momentum_b (np.array): (1, out feature)
        """

        self.W = weight_init_fn(in_feature, out_feature)
        self.b = bias_init_fn(out_feature)

        # TODO: Complete these but do not change the names.
        self.dW = np.zeros(None)
        self.db = np.zeros(None)

        self.momentum_W = np.zeros(None)
        self.momentum_b = np.zeros(None)

    def __call__(self, x):
        return self.forward(x)

    def forward(self, x):
        """
        Argument:
            x (np.array): (batch size, in feature)
        Return:
            out (np.array): (batch size, out feature)
        """
        raise NotImplementedError

    def backward(self, delta):

        """
        Argument:
            delta (np.array): (batch size, out feature)
        Return:
            out (np.array): (batch size, in feature)
        """
        raise NotImplementedError

## Task 4: Simple MLP
In this section of the homework, you will implement a Multi-Layer Perceptron with an API similar to popular automatic differentiation libraries such as PyTorch.
Go through the functions of the given `MLP` class thoroughly and make sure you understand what each function in the class does so that you can create a generic implementation that supports an arbitrary number of layers, types of activations and network sizes.

The parameters for the MLP class are:
* `input_size`: The size of each individual data example.
* `output_size`: The number of outputs.
* `hiddens`: A list with the number of units in each hidden layer.
* `activations`: A list of Activation objects for each layer.
* `weight_init_fn`: A function applied to each weight matrix before training.
* `bias_init_fn`: A function applied to each bias vector before training.
* `criterion`: A Criterion object to compute the loss and its derivative.
* `lr`: The learning rate.

The attributes of the MLP class are:
* `linear_layers`: A list of Linear objects.
* `bn_layers`: A list of BatchNorm objects. (Should be None until completing 3.3).

The methods of the MLP class are:
* `forward`: Forward pass. Accepts a mini-batch of data and returns a batch of output activations.
* `backward`: Backward pass. Accepts ground truth labels and computes gradients for all parameters.
  Hint: Use the state stored in the activations during the forward pass to simplify your code.
* `zero_grads`: Set all gradient terms to 0.
* `step`: Apply gradients computed in backward to the parameters.
* `train` (Already implemented): Set the mode of the network to train.
* `eval` (Already implemented): Set the mode of the network to evaluation.

Note: Pay attention to the data structures being passed into the constructor and the class attributes specified initially.

Sample constructor call:
```python
MLP(784, 10, [64, 64, 32], [Sigmoid(), Sigmoid(), Sigmoid(), Identity()],
weight_init_fn, bias_init_fn, SoftmaxCrossEntropy(), 0.008)
```
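The sample call relies on `weight_init_fn` and `bias_init_fn`, which are not defined in this notebook. The sketch below shows one possible pair, with signatures inferred from how `Linear.__init__` calls them; the scaling factor and the zero bias are assumptions, not requirements. It also shows how `zip` pairs consecutive layer sizes, which is one way to read the hint in the constructor.

```python
import numpy as np


# Hypothetical initializers; only the signatures follow Linear.__init__.
def weight_init_fn(in_feature, out_feature):
    return np.random.randn(in_feature, out_feature) * 0.01


def bias_init_fn(out_feature):
    return np.zeros((1, out_feature))


input_size, output_size, hiddens = 784, 10, [64, 64, 32]

# Consecutive (in_feature, out_feature) pairs, one per linear layer.
sizes = [input_size] + hiddens + [output_size]
layer_shapes = list(zip(sizes[:-1], sizes[1:]))
print(layer_shapes)  # [(784, 64), (64, 64), (64, 32), (32, 10)]
```

There are `len(hiddens) + 1` pairs, which matches both `self.nlayers` and the four activations in the sample call.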

In [None]:
class MLP(object):

    """
    A simple multilayer perceptron
    """

    def __init__(self, input_size, output_size, hiddens, activations, weight_init_fn,
                 bias_init_fn, criterion, lr):

        # Don't change this -->
        self.train_mode = True
        self.nlayers = len(hiddens) + 1
        self.input_size = input_size
        self.output_size = output_size
        self.activations = activations
        self.criterion = criterion
        self.lr = lr
        # <---------------------

        # Don't change the name of the following class attributes,
        # the autograder will check against these attributes. But you will need to change
        # the values in order to initialize them correctly

        # Initialize and add all your linear layers into the list 'self.linear_layers'
        # (HINT: self.foo = [ bar(???) for ?? in ? ])
        # (HINT: Can you use zip here?)
        self.linear_layers = None


    def forward(self, x):
        """
        Argument:
            x (np.array): (batch size, input_size)
        Return:
            out (np.array): (batch size, output_size)
        """
        # Complete the forward pass through your entire MLP.
        raise NotImplementedError

    def zero_grads(self):
        # Use numpyArray.fill(0.0) to zero out your backpropped derivatives in each
        # of your linear and batchnorm layers.
        raise NotImplementedError

    def step(self):
        # Apply a step to the weights and biases of the linear layers.
        # (You will add momentum later in the assignment to the linear layers)

        for i in range(len(self.linear_layers)):
            # Update weights and biases here
            pass
        # Do the same for batchnorm layers

        raise NotImplementedError

    def backward(self, labels):
        # Backpropagate through the activation functions, batch norm and
        # linear layers.
        # Be aware of which return derivatives and which are pure backward passes,
        # i.e. take in the loss derivative w.r.t. its output.
        raise NotImplementedError

    def error(self, labels):
        return (np.argmax(self.output, axis = 1) != np.argmax(labels, axis = 1)).sum()

    def total_loss(self, labels):
        return self.criterion(self.output, labels).sum()

    def __call__(self, x):
        return self.forward(x)

    def train(self):
        self.train_mode = True

    def eval(self):
        self.train_mode = False



## Task 5: Momentum
Modify the `step` function present in the MLP class to include momentum in your gradient descent. We will be using the following momentum update equation:

![alt text](https://imgur.com/download/ZVA66FC)

The momentum value will be passed as a parameter to the `MLP`.
Copy the rest of your code from above.
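The update equation above is given as an image, so the sketch below assumes the common form of momentum: the velocity is `momentum * velocity - lr * gradient`, and the parameter is then incremented by the velocity. If the equation in the image differs (for example in sign convention), follow the image. The numbers are made up and the gradient is held constant purely for illustration.

```python
import numpy as np

lr, momentum = 0.1, 0.9

W = np.array([[1.0, -2.0]])
dW = np.array([[0.5, 0.5]])      # pretend gradient, held constant here
momentum_W = np.zeros_like(W)    # velocity starts at zero

for _ in range(2):
    # Assumed form: v <- momentum * v - lr * dW, then W <- W + v
    momentum_W = momentum * momentum_W - lr * dW
    W = W + momentum_W

print(W)  # approximately [[0.855, -2.145]]
```

With `momentum=0.0` this reduces to plain gradient descent, which is consistent with the default value in the constructor below.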

In [None]:
class MLP(object):

    """
    A simple multilayer perceptron
    """

    def __init__(self, input_size, output_size, hiddens, activations, weight_init_fn,
                 bias_init_fn, criterion, lr, momentum=0.0):

        # Don't change this -->
        self.train_mode = True
        self.nlayers = len(hiddens) + 1
        self.input_size = input_size
        self.output_size = output_size
        self.activations = activations
        self.criterion = criterion
        self.lr = lr
        self.momentum = momentum
        # <---------------------

        # Initialize and add all your linear layers into the list 'self.linear_layers'
        # (HINT: self.foo = [ bar(???) for ?? in ? ])
        # (HINT: Can you use zip here?)
        self.linear_layers = None

    def forward(self, x):
        """
        Argument:
            x (np.array): (batch size, input_size)
        Return:
            out (np.array): (batch size, output_size)
        """
        # Complete the forward pass through your entire MLP.
        raise NotImplementedError

    def zero_grads(self):
        # Use numpyArray.fill(0.0) to zero out your backpropped derivatives in each
        # of your linear and batchnorm layers.
        raise NotImplementedError

    def step(self):
        # Apply a step to the weights and biases of the linear layers,
        # now including the momentum term (self.momentum).

        for i in range(len(self.linear_layers)):
            # Update weights and biases here
            pass
        # Do the same for batchnorm layers

        raise NotImplementedError

    def backward(self, labels):
        # Backpropagate through the activation functions, batch norm and
        # linear layers.
        # Be aware of which return derivatives and which are pure backward passes,
        # i.e. take in the loss derivative w.r.t. its output.
        raise NotImplementedError

    def error(self, labels):
        return (np.argmax(self.output, axis = 1) != np.argmax(labels, axis = 1)).sum()

    def total_loss(self, labels):
        return self.criterion(self.output, labels).sum()

    def __call__(self, x):
        return self.forward(x)

    def train(self):
        self.train_mode = True

    def eval(self):
        self.train_mode = False


### Now upload this notebook to your GitHub repo