<p style="font-size:20px; text-align:center">Assignment 1</p>
<p style="font-size:18px; text-align:center">CS6910: Fundamentals of Deep Learning</p>
<p>Sujay Bokil: ME17B120<br>
Avyay Rao: ME17B130</p>

In [1]:
# import necessary libraries
from copy import deepcopy
import numpy as np 
from sklearn.datasets import make_classification

# import templates that we have created to make different kinds of layers, losses, optimizers etc.
from sujay.templates import AutoDiffFunction, Layer, Loss, Optimizer

<b>Outline of the framework</b>

<b>Activation functions</b> 

For this assignment, we implement two activation functions:

1. Sigmoid activation

$$y = \sigma(x) = \frac{1}{1 + e^{-x}}$$

$$\frac{dy}{dx} = \frac{e^{-x}}{(1 + e^{-x})^2} = \sigma(x)(1 - \sigma(x))$$

2. ReLU activation

$$y = ReLU(x) = max(0, x)$$

$$ \frac{dy}{dx} = \left\{
\begin{array}{ll}
      1 & x > 0\\
      0 & x \leq 0 \\
\end{array} 
\right.$$
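
As a quick sanity check on these derivative formulas, the sketch below compares the analytic derivatives with central finite differences. It is standalone NumPy and does not use the `AutoDiffFunction` template; the helper names `sigmoid`, `relu` and `numerical_derivative` are illustrative assumptions, and the sigmoid follows the form used in `Sigmoid.forward`.

```python
import numpy as np

def sigmoid(x):
    # logistic function, same expression as in Sigmoid.forward
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    return np.maximum(0.0, x)

def numerical_derivative(f, x, eps=1e-6):
    # central finite difference, applied elementwise
    return (f(x + eps) - f(x - eps)) / (2 * eps)

# sample points chosen away from 0, where ReLU is not differentiable
x = np.array([-2.0, -0.5, 0.5, 2.0])

sigmoid_grad = sigmoid(x) * (1 - sigmoid(x))   # analytic sigmoid derivative
relu_grad = np.where(x > 0, 1.0, 0.0)          # analytic ReLU derivative

sigmoid_ok = np.allclose(sigmoid_grad, numerical_derivative(sigmoid, x), atol=1e-6)
relu_ok = np.allclose(relu_grad, numerical_derivative(relu, x), atol=1e-6)
print(sigmoid_ok, relu_ok)
```

The same finite-difference idea can be reused to check any new activation before plugging it into the framework.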

In [3]:
class Sigmoid(AutoDiffFunction):
    """ Represents the Sigmoid Activation function
    """
    def __init__(self) -> None:
        super().__init__()

    def forward(self, x):
        self.saved_for_backward = 1/(1 + np.exp(-x))
        return self.saved_for_backward

    def compute_grad(self, x):
        y = self.saved_for_backward

        return {"x": y*(1-y)}

    def backward(self, dy):
        return dy * self.grad["x"]      


class ReLU(AutoDiffFunction):
    """ Represents the ReLU Activation function
    """
    def __init__(self) -> None:
        super().__init__()

    def forward(self, x):
        self.saved_for_backward = np.where(x>0.0, 1.0, 0.0)

        return x * self.saved_for_backward

    def compute_grad(self, x):
        return {"x": self.saved_for_backward}

    def backward(self, dy):
        return dy * self.grad["x"]

<b>Layers</b>

For this assignment, we use only fully connected (dense) layers, in which every input neuron is connected to every output neuron and each output neuron has a bias term. Below is a representation of a fully connected layer.

![Representation of a fully connected layer](FullyConnectedLayer.png)

The equation for such a layer is simply

$$y = FullyConnected(x) = xw + b$$

$$\frac{dy}{dw} = x^T \quad\frac{dy}{dx} = w^T  \quad\frac{dy}{db} = 1$$
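
To show how these local derivatives combine with an upstream gradient, here is a minimal standalone sketch. It assumes the batch-first convention used in the `FC` class (`x` of shape `(batch, in_dim)`, `w` of shape `(in_dim, out_dim)`); the variable names and the surrogate loss are illustrative, not part of the framework.

```python
import numpy as np

rng = np.random.default_rng(0)
batch, in_dim, out_dim = 4, 3, 2

x = rng.standard_normal((batch, in_dim))
w = rng.standard_normal((in_dim, out_dim))
b = rng.standard_normal((1, out_dim))
dy = rng.standard_normal((batch, out_dim))   # upstream gradient dL/dy

# analytic gradients for y = x @ w + b
dx = dy @ w.T                          # shape (batch, in_dim)
dw = x.T @ dy                          # shape (in_dim, out_dim)
db = dy.sum(axis=0, keepdims=True)     # shape (1, out_dim)

def surrogate_loss(w_):
    # scalar loss L = sum(y * dy), chosen so that dL/dy equals dy
    return np.sum((x @ w_ + b) * dy)

# finite-difference gradient with respect to each entry of w
eps = 1e-6
dw_numeric = np.zeros_like(w)
for i in range(in_dim):
    for j in range(out_dim):
        step = np.zeros_like(w)
        step[i, j] = eps
        dw_numeric[i, j] = (surrogate_loss(w + step) - surrogate_loss(w - step)) / (2 * eps)

dw_ok = np.allclose(dw, dw_numeric, atol=1e-5)
print(dx.shape, dw.shape, db.shape, dw_ok)
```

Each gradient has the same shape as the quantity it belongs to, which is a useful first check when debugging a backward pass.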

In [4]:
class FC(Layer):
    """Class representing a fully connected layer, the weights inside the class decide its output
    """
    def __init__(self, in_dim, out_dim) -> None:
        super().__init__()
        self.initialize_weights(in_dim, out_dim)

    def initialize_weights(self, in_dim, out_dim):
        
        self.weights["w"] = np.random.randn(in_dim, out_dim)
        self.weights["b"] = np.random.randn(1, out_dim)

    def compute_grad(self, x):
        
        gradients = {}

        # y = x * w + b        
        # we compute gradients wrt w and x 
        # the gradient wrt b is not stored explicitly since it is always 1
        gradients["w"] = self.saved_for_backward["x"].T
        gradients["x"] = self.weights["w"].T

        return gradients


    def forward(self, x):
        
        output = x @ self.weights["w"] + self.weights["b"]
        self.saved_for_backward["x"] = x
        self.saved_for_backward["w"] = self.weights["w"]
        
        return output

    def backward(self, dy):
        dx = dy @ self.grad["x"]
        
        # calculating gradients wrt weights
        dw = self.grad["w"] @ dy
        db = np.sum(dy, axis=0, keepdims=True)

        self.absolute_gradients = {"w": dw, "b": db}

        return dx

    def update_weights(self):
        self.optimizer.step(self)

<b>Loss function</b>

The loss function measures how far the output of the neural network is from the target. Since we use the Fashion MNIST dataset, our task is classification, and hence we use the categorical cross-entropy loss function. The equation is given as


$$L(p, y) = -\sum_{i=1}^{N} \sum_{k=1}^{K} y_{ik} \log p_{ik}$$ 

where $$y_{ik} = \left\{
\begin{array}{ll}
      1 & x_i \in \text{class } k\\
      0 & \text{otherwise} \\
\end{array} 
\right.$$

and $p_{ik}$ is the probability that the $i^{th}$ sample falls in the $k^{th}$ class.

In our implementation, the loss function is applied together with the activation function of the last layer, i.e. the softmax activation. Its formula is given by the following equation

$ f: [x_1, x_2, ... x_K] \rightarrow [p_1, p_2, ... p_K]$ such that $p_i = \frac{e^{x_i}}{\sum_{k=1}^{K} e^{x_k}}$

Now, to find the derivative of the loss w.r.t. the input, we have to apply the chain rule. Let $p(x)$ represent the softmax activation and $L$ represent the loss. Then the expression turns out to be

$$\frac{\partial L}{\partial x} = \frac{\partial L(p, y)}{\partial p} \frac{\partial p(x)}{\partial x} = p - y$$
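
The result $p - y$ can be checked numerically. The following sketch is a standalone NumPy check, not the framework's `Loss` class: it assumes the loss is summed over samples (so the gradient is exactly $p - y$ with no $1/N$ factor), and the helper names and the small random inputs are illustrative.

```python
import numpy as np

def softmax(x):
    # subtracting the row-wise max leaves the result unchanged but avoids overflow
    v = np.exp(x - x.max(axis=1, keepdims=True))
    return v / v.sum(axis=1, keepdims=True)

def cross_entropy(x, y_onehot):
    # summed (not averaged) over samples
    return -np.sum(y_onehot * np.log(softmax(x)))

rng = np.random.default_rng(0)
n, k = 5, 3
logits = rng.standard_normal((n, k))
labels = np.array([0, 2, 1, 1, 0])
y_onehot = np.eye(k)[labels]          # one-hot rows selected by label

analytic = softmax(logits) - y_onehot

# central finite differences, one logit at a time
eps = 1e-6
numeric = np.zeros_like(logits)
for i in range(n):
    for j in range(k):
        step = np.zeros_like(logits)
        step[i, j] = eps
        numeric[i, j] = (cross_entropy(logits + step, y_onehot)
                         - cross_entropy(logits - step, y_onehot)) / (2 * eps)

grad_ok = np.allclose(analytic, numeric, atol=1e-5)
print(grad_ok)
```

If the loss is averaged over the batch instead, the same check should pass after dividing the analytic gradient by the number of samples.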

In [8]:
class CrossEntropyLossFromLogits(Loss):
    """ Class that holds CrossEntropy loss along with softmax activation
    """
    @staticmethod
    def softmax(x):
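        # subtract the row-wise max before exponentiating to avoid overflow
        v = np.exp(x - np.max(x, axis=1, keepdims=True))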

        return v / np.sum(v, axis=1, keepdims=True)

    @staticmethod
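    def encode(y, num_classes=None):
        # one-hot encode integer labels; pass num_classes explicitly because
        # a single batch may not contain every class
        d = len(np.unique(y)) if num_classes is None else num_classes
        encoded_y = np.zeros(shape=(len(y), d))

        for i in range(len(y)):
            encoded_y[i,y[i]] = 1

        return encoded_y

    def forward(self, y_pred, y_true):
         
        probabilities = self.softmax(y_pred)
        y_true_encoded = self.encode(y_true, num_classes=probabilities.shape[1])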

        loss_value = np.mean(np.sum(- y_true_encoded * np.log(probabilities), axis=1))

        self.saved_for_backward["probabilities"] = probabilities
        self.saved_for_backward["y_true"] = y_true_encoded

        return loss_value

    def compute_grad(self, y_pred, y_true):

        return {"x": self.saved_for_backward["probabilities"] - self.saved_for_backward["y_true"]} 
    

class MSE(Loss):
    @staticmethod
    def softmax(x):
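        # subtract the row-wise max before exponentiating to avoid overflow
        v = np.exp(x - np.max(x, axis=1, keepdims=True))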

        return v / np.sum(v, axis=1, keepdims=True)
    
    @staticmethod
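    def encode(y, num_classes=None):
        # one-hot encode integer labels; pass num_classes explicitly because
        # a single batch may not contain every class
        d = len(np.unique(y)) if num_classes is None else num_classes
        encoded_y = np.zeros(shape=(len(y), d))

        for i in range(len(y)):
            encoded_y[i,y[i]] = 1

        return encoded_y
    
    def forward(self, y_pred, y_true):
        
        probabilities = self.softmax(y_pred)
        y_true_encoded = self.encode(y_true, num_classes=probabilities.shape[1])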
        
        loss_value = np.mean(np.sum((probabilities - y_true_encoded)**2, axis=1))
        
        self.saved_for_backward["probabilities"] = probabilities
        self.saved_for_backward["y_true"] = y_true_encoded
        
        return loss_value
    
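    def compute_grad(self, y_pred, y_true):
        p = self.saved_for_backward["probabilities"]
        # gradient of the squared error w.r.t. the probabilities
        dp = 2 * (p - self.saved_for_backward["y_true"])
        # chain through the softmax Jacobian to get the gradient w.r.t. the logits
        return {"x": p * (dp - np.sum(dp * p, axis=1, keepdims=True))}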

<b>Optimizers</b>

For this assignment, we have created the six optimizers given in the question:

1. SGD
2. Momentum-based gradient descent
3. Nesterov accelerated gradient descent
4. RMSprop
5. Adam
6. Nadam
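
To illustrate how a stateful optimizer differs from plain SGD, here is a minimal standalone sketch of a momentum update on a dictionary of weights. It assumes one common formulation, $v \leftarrow \beta v + \eta g$ followed by $w \leftarrow w - v$; the function name `momentum_step`, its parameters and the toy quadratic objective are hypothetical and not part of the `Optimizer` template.

```python
import numpy as np

def momentum_step(weights, grads, velocity, lr=0.1, beta=0.9):
    """One momentum update: v <- beta * v + lr * g, then w <- w - v."""
    for name in weights:
        velocity[name] = beta * velocity[name] + lr * grads[name]
        weights[name] = weights[name] - velocity[name]
    return weights, velocity

weights = {"w": np.array([1.0, -2.0])}
velocity = {"w": np.zeros(2)}          # one velocity buffer per weight

# minimise f(w) = 0.5 * ||w||^2, whose gradient is w itself
for _ in range(200):
    grads = {"w": weights["w"]}
    weights, velocity = momentum_step(weights, grads, velocity)

final_norm = np.linalg.norm(weights["w"])
print(final_norm)
```

The velocity buffer is the extra state that plain SGD does not need. Keeping one copy of the optimizer per layer is one way to store such buffers per weight.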

In [9]:
class SGD(Optimizer):
    def __init__(self, lr):
        self.lr = lr
    
    def step(self, layer):
        for weight_name, _ in layer.weights.items():
            layer.weights[weight_name] = layer.weights[weight_name] - self.lr * layer.absolute_gradients[weight_name]

<b>Structuring the neural network</b>

We create a class to hold all the above components together, so that we can initialize a custom neural network with required layer sizes, activations and weights.

In [10]:
class NeuralNet():
    def __init__(self, layers) -> None:
        self.layers = layers

    def __call__(self, *args, **kwds):
        return self.forward(*args, **kwds)

    def compile(self, loss, optimizer):
        self.loss = loss

        for layer in self.layers:
            if isinstance(layer, Layer):
                layer.optimizer = deepcopy(optimizer)

    def calculate_loss(self, y_pred, y_true):
        return self.loss(y_pred, y_true)

    def forward(self, x):
        for layer in self.layers:
            x = layer(x)

        return x

    def backward(self):

        gradient = self.loss.backward()
        for layer in reversed(self.layers):
            gradient = layer.backward(gradient)

        return gradient

    def update_weights(self):

        for layer in reversed(self.layers):
            if isinstance(layer, Layer):
                layer.update_weights()

<b>Utility functions</b>

We add some utility functions to preprocess and batch the dataset given, and fit the model to the dataset

In [11]:
def create_batches(X, y, batch_size=32):
    """Creates batches of the dataset
    """
    batches = []

    for i in range(len(y) // batch_size):
        start_idx = batch_size * i
        end_idx = batch_size * (i + 1)

        batches.append([X[start_idx: end_idx], y[start_idx: end_idx]])

    return batches


def fit_model(model, batches, optimizer, epochs=10):
    """ Trains the model on the given data
    """
    training_stats = []
    num_batches = len(batches)
 
    loss = CrossEntropyLossFromLogits()
    model.compile(loss=loss, optimizer=optimizer)

    for epoch in range(1, epochs+1):

        total_loss = 0
        total_accuracy = 0

        for X, y in batches:

            preds = model(X)
            total_loss += model.loss(preds, y)
            # fraction of samples whose highest-scoring class matches the label
            total_accuracy += np.mean(np.argmax(preds, axis=1) == y)

            _ = model.backward()
            model.update_weights()

        loss_per_epoch = total_loss / num_batches
        accuracy = total_accuracy / num_batches

        print(f"Epoch: {epoch} Train Loss: {loss_per_epoch} Train Accuracy: {accuracy}")

        training_stats.append({"Epoch" : epoch, 
                                "Train Loss": loss_per_epoch,
                                "Train Accuracy": accuracy})

    
    return training_stats
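
To make the batching behaviour concrete, the sketch below applies the same slicing logic to a small synthetic array and computes accuracy from logits with `argmax`. The function `make_batches` is a hypothetical, condensed stand-in for `create_batches`, and the random logits are placeholders for model outputs. Note that samples beyond the last full batch are dropped.

```python
import numpy as np

def make_batches(X, y, batch_size=32):
    # same slicing as create_batches: only full batches are kept
    return [(X[i * batch_size:(i + 1) * batch_size],
             y[i * batch_size:(i + 1) * batch_size])
            for i in range(len(y) // batch_size)]

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 8))     # 100 samples, 8 features
y = rng.integers(0, 3, size=100)      # integer class labels in {0, 1, 2}

batches = make_batches(X, y, batch_size=32)
# 100 // 32 = 3 full batches, so the last 4 samples are unused
print(len(batches), batches[0][0].shape)

# accuracy from logits: the predicted class is the index of the largest logit
logits = rng.standard_normal((32, 3))  # placeholder for model(X_batch)
accuracy = np.mean(np.argmax(logits, axis=1) == batches[0][1])
```

If every sample should be used, one option is to shuffle the data each epoch so that different samples fall into the dropped remainder.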

<b>Downloading the Fashion MNIST dataset using Keras</b>