<p style="font-size:20px; text-align:center">Assignment 1</p>
<p style="font-size:18px; text-align:center">CS6910: Fundamentals of Deep Learning</p>
<p>Sujay Bokil: ME17B120<br>
Avyay Rao: ME17B130</p>

In [2]:
# import necessary libraries
from copy import deepcopy
import numpy as np 
from sklearn.datasets import make_classification

# import templates that we have created to make different kinds of layers, losses, optimizers etc.
from sujay.templates import AutoDiffFunction, Layer, Loss, Optimizer

<b>Outline of the framework</b>

<b>Activation functions</b> 

For this assignment we implement the following activation functions:

1. Sigmoid activation

$$y = \sigma(x) = \frac{1}{1 + e^{-x}}$$

$$\frac{dy}{dx} = \frac{e^{-x}}{(1 + e^{-x})^2} = \sigma(x)(1 - \sigma(x))$$

2. ReLU activation

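$$y = ReLU(x) = \max(0, x)$$

$$ \frac{dy}{dx} = \left\{
\begin{array}{ll}
      1 & x > 0\\
      0 & x \leq 0 \\
\end{array} 
\right.$$

ReLU is not differentiable at $x = 0$; by convention we take the derivative there to be $0$, which matches the implementation below.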

3. Tanh activation

$$y = \tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}$$

$$\frac{dy}{dx} = \frac{4}{(e^x + e^{-x})^2} = 1 - (\tanh(x))^2$$

In [12]:
class Sigmoid(AutoDiffFunction):
    """ 
    Represents the Sigmoid Activation function
    """
    def __init__(self) -> None:
        super().__init__()

    def forward(self, x):
        self.saved_for_backward = 1/(1 + np.exp(-x))
        return self.saved_for_backward

    def compute_grad(self, x):
        y = self.saved_for_backward

        return {"x": y*(1-y)}

    def backward(self, dy):
        return dy * self.grad["x"]      


class ReLU(AutoDiffFunction):
    """ 
    Represents the ReLU Activation function
    """
    def __init__(self) -> None:
        super().__init__()

    def forward(self, x):
        self.saved_for_backward = np.where(x>0.0, 1.0, 0.0)

        return x * self.saved_for_backward

    def compute_grad(self, x):
        return {"x": self.saved_for_backward}

    def backward(self, dy):
        return dy * self.grad["x"]
    
class Tanh(AutoDiffFunction):
    """ 
    Represents the Tanh Activation function
    """
    def __init__(self) -> None:
        super().__init__()

    def forward(self, x):
        # np.tanh avoids the overflow (inf/inf -> nan) that the explicit
        # exponential formula produces for inputs of large magnitude
        self.saved_for_backward = np.tanh(x)

        return self.saved_for_backward

    def compute_grad(self, x):
        y = self.saved_for_backward

        return {"x": 1 - y**2}

    def backward(self, dy):
        return dy * self.grad["x"]
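
As a sanity check on the derivative formulas above, the sketch below compares each analytic derivative against a central finite-difference estimate. It is a standalone illustration that does not use the `AutoDiffFunction` template; the helper names `numerical_derivative`, `sigmoid` and `relu` are introduced here only for this check.

```python
import numpy as np

def numerical_derivative(f, x, eps=1e-6):
    """Central-difference estimate of df/dx, applied elementwise."""
    return (f(x + eps) - f(x - eps)) / (2 * eps)

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def relu(x):
    return np.maximum(0.0, x)

# Sample points that avoid x = 0, where ReLU is not differentiable
x = np.array([-2.0, -0.5, 0.5, 2.0])

# Closed-form derivatives, as stated in the formulas above
analytic = {
    "sigmoid": sigmoid(x) * (1 - sigmoid(x)),
    "relu": np.where(x > 0.0, 1.0, 0.0),
    "tanh": 1 - np.tanh(x) ** 2,
}

# Finite-difference estimates of the same derivatives
numeric = {
    "sigmoid": numerical_derivative(sigmoid, x),
    "relu": numerical_derivative(relu, x),
    "tanh": numerical_derivative(np.tanh, x),
}

# Largest absolute disagreement per activation
max_errors = {
    name: float(np.max(np.abs(analytic[name] - numeric[name])))
    for name in analytic
}
print(max_errors)
```

If the formulas are right, every entry of `max_errors` should be tiny compared with the derivative values themselves.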

<b>Layers</b>

For this assignment, we only use fully connected (dense) layers, in which every input neuron is connected to every output neuron, along with a bias unit. Below is a representation of a fully connected layer.

![Representation of a fully connected layer](FullyConnectedLayer.png)

The equation for such a layer, with $x$ holding one sample per row as in the implementation below, is simply

$$y = FullyConnected(x) = xw + b$$

$$\frac{dy}{dw} = x^T \quad\frac{dy}{dx} = w^T  \quad\frac{dy}{db} = 1$$

In [13]:
class FC(Layer):
    """
    Class representing a fully connected layer; the weights inside the class decide its output
    """
    def __init__(self, in_dim, out_dim) -> None:
        super().__init__()
        self.initialize_weights(in_dim, out_dim)

    def initialize_weights(self, in_dim, out_dim):
        
        self.weights["w"] = np.random.randn(in_dim, out_dim)
        self.weights["b"] = np.random.randn(1, out_dim)

    def compute_grad(self, x):
        
        gradients = {}

        # y = x @ w + b
        # we compute gradients wrt w and x
        # gradient wrt b is not required explicitly since we know that its value is 1
        gradients["w"] = self.saved_for_backward["x"].T
        gradients["x"] = self.weights["w"].T

        return gradients


    def forward(self, x):
        
        output = x @ self.weights["w"] + self.weights["b"]
        self.saved_for_backward["x"] = x
        
        return output

    def backward(self, dy):
        dx = dy @ self.grad["x"]
        
        # calculating gradients wrt weights
        dw = self.grad["w"] @ dy
        db = np.sum(dy, axis=0, keepdims=True)

        self.absolute_gradients = {"w": dw, "b": db}

        return dx

    def update_weights(self):
        self.optimizer.step(self)
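
To illustrate the shapes and expressions used in `backward`, here is a minimal standalone sketch that reproduces the three gradient expressions with plain NumPy arrays and checks them against finite differences. It does not use the `Layer` template; the surrogate loss, the array sizes and the helper `numerical_grad` are assumptions made only for this check.

```python
import numpy as np

rng = np.random.default_rng(0)
batch, in_dim, out_dim = 4, 3, 2

x = rng.standard_normal((batch, in_dim))
w = rng.standard_normal((in_dim, out_dim))
b = rng.standard_normal((1, out_dim))
dy = rng.standard_normal((batch, out_dim))  # stand-in for the upstream gradient

# Same expressions as in FC.backward
dx = dy @ w.T                           # shape (batch, in_dim)
dw = x.T @ dy                           # shape (in_dim, out_dim)
db = np.sum(dy, axis=0, keepdims=True)  # shape (1, out_dim)

def surrogate_loss():
    """Scalar loss whose gradient wrt y = x @ w + b is exactly dy."""
    return np.sum(dy * (x @ w + b))

def numerical_grad(f, arr, eps=1e-6):
    """Central differences of the scalar f() wrt each entry of arr (perturbed in place)."""
    grad = np.zeros_like(arr)
    for idx in np.ndindex(arr.shape):
        original = arr[idx]
        arr[idx] = original + eps
        plus = f()
        arr[idx] = original - eps
        minus = f()
        arr[idx] = original  # restore the entry
        grad[idx] = (plus - minus) / (2 * eps)
    return grad

dx_num = numerical_grad(surrogate_loss, x)
dw_num = numerical_grad(surrogate_loss, w)
db_num = numerical_grad(surrogate_loss, b)

print(np.allclose(dx, dx_num, atol=1e-6),
      np.allclose(dw, dw_num, atol=1e-6),
      np.allclose(db, db_num, atol=1e-6))
```

Summing `dy` over the batch axis for `db` reflects that the same bias row is broadcast to every sample in the forward pass.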

<b>Loss function</b>

The loss function measures how well the output of the neural network matches the targets. Since we use the MNIST dataset, our task is classification, and hence we use the categorical cross-entropy loss function. The equation is given as


$$L(p, y) = -\sum_{i=1}^{N} \sum_{k=1}^{K} y_{ik} \log p_{ik}$$

where $$y_{ik} = \left\{
\begin{array}{ll}
      1 & x_i \in \text{class } k\\
      0 & \text{else} \\
\end{array} 
\right.$$

$p_{ik} =$ probability that $i^{th}$ sample falls in $k^{th}$ class

In our implementation, the given loss function is applied together with the activation function of the last layer, i.e. the softmax activation. Its formula is given by the following equation

$ f: [x_1, x_2, ... x_K] \rightarrow [p_1, p_2, ... p_K]$ such that $p_i = \frac{e^{x_i}}{\sum_{k=1}^{K} e^{x_k}}$
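
A minimal sketch of the softmax mapping for a batch of score vectors follows. Subtracting the row-wise maximum before exponentiating is a common stabilisation that leaves the result unchanged, because the common factor cancels in the ratio; this is an illustrative choice here and not necessarily how the assignment implements it.

```python
import numpy as np

def softmax(x):
    """Row-wise softmax for a (batch, K) array of scores."""
    # Shifting by the row maximum avoids overflow in np.exp
    shifted = x - np.max(x, axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=1, keepdims=True)

scores = np.array([[1.0, 2.0, 3.0],
                   [1000.0, 1000.0, 1000.0]])  # large scores would overflow without the shift
probs = softmax(scores)
print(probs)
print(probs.sum(axis=1))  # each row sums to 1
```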

Now, to find the derivative of the loss w.r.t. the input, we have to apply the chain rule. Let $p(x)$ represent the softmax activation and $L$ represent the loss. Then the expression turns out to be

$$\frac{\partial L}{\partial x} = \frac{\partial L(p, y)}{\partial p} \frac{\partial p(x)}{\partial x} = p - y$$
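
The result $p - y$ can be checked numerically. The sketch below assumes the loss is the negative log-likelihood summed over the samples, with one-hot targets $y$, and compares $p - y$ with a central finite-difference gradient. The helpers `softmax` and `cross_entropy` and the random test data are hypothetical and exist only for this check.

```python
import numpy as np

def softmax(x):
    """Row-wise softmax with max-shift for numerical stability."""
    shifted = x - np.max(x, axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=1, keepdims=True)

def cross_entropy(x, y):
    """Softmax followed by categorical cross-entropy, summed over samples."""
    p = softmax(x)
    return -np.sum(y * np.log(p))

rng = np.random.default_rng(0)
n_samples, n_classes = 5, 4
x = rng.standard_normal((n_samples, n_classes))
labels = rng.integers(0, n_classes, size=n_samples)
y = np.eye(n_classes)[labels]  # one-hot targets

# Analytic gradient from the formula above
analytic = softmax(x) - y

# Central finite differences, one input entry at a time
eps = 1e-6
numeric = np.zeros_like(x)
for idx in np.ndindex(x.shape):
    x_plus = x.copy()
    x_minus = x.copy()
    x_plus[idx] += eps
    x_minus[idx] -= eps
    numeric[idx] = (cross_entropy(x_plus, y) - cross_entropy(x_minus, y)) / (2 * eps)

print(np.max(np.abs(analytic - numeric)))
```

Combining softmax and the loss in one step also avoids forming the full softmax Jacobian, which is one reason the two are often implemented together.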