# Assignment #3
## P556: Applied Machine Learning

More often than not, we will use a deep learning library (Tensorflow, Pytorch, or the wrapper known as Keras) to implement our models. However, the abstraction afforded by those libraries can make it hard to troubleshoot issues if we don't understand what is going on under the hood. In this assignment you will implement a fully-connected and a convolutional neural network from scratch. To simplify the implementation, we are asking you to implement static architectures, but you are free to support variable number of layers/neurons/activations/optimizers/etc. We recommend that you make use of private methods so you can easily troubleshoot small parts of your model as you develop them, instead of trying to figure out which parts are not working correctly after implementing everything. Also, keep in mind that there is code from your fully-connected neural network that can be re-used on the CNN. 

Problem #1.1 (40 points): Implement a fully-connected neural network from scratch. The neural network will have the following architecture:

- Input layer
- Dense hidden layer with 512 neurons, using relu as the activation function
- Dropout with a value of 0.2
- Dense hidden layer with 512 neurons, using relu as the activation function
- Dropout with a value of 0.2
- Output layer, using softmax as the activation function

The model will use categorical crossentropy as its loss function. 
We will optimize the gradient descent using RMSProp, with a learning rate of 0.001 and a rho value of 0.9.
We will evaluate the model using accuracy.

Why this architecture? We are trying to reproduce from scratch the following [example from the Keras documentation](https://keras.io/examples/mnist_mlp/). This means that you can compare your results by running the Keras code provided above to see if you are on the right track.

In [0]:
class NeuralNetwork():
    def __init__(self, epochs, learning_rate):
        self.epochs = epochs
        self.learning_rate = learning_rate
        
    def fit(self,X,y):        
        def relu(X):
            return np.maximum(0,X)
        def relu_derivative(X):
            if(X.all()>0):
                return 1
            else:
                return 0
#The dropout concept and python insight has been referred from https://gluon.mxnet.io/chapter03_deep-neural-networks/mlp-dropout-scratch.html"
        def dropout(x,drop):
            p = 1 - drop
            mask = np.random.uniform(0, 1.0, x.shape) < p
            if p > 0.0:
                scale = (1/p)
            else:
                scale = 0.0
            return mask * x * scale
#The softmax concept and entropy loss details has been referred from https://deepnotes.io/softmax-crossentropy"
        def softmax(X):
            exponent = np.exp(X)
            return exponent / np.sum(exponent, axis=1, keepdims=True)
        def crossentropy_loss(y,y_pred):
            m = y.shape[0]
            loss = -1/m * np.sum(y * np.log(y_pred))
            return loss
        def loss_derivative(y1,y2):
            return (y2-y1)
#While trying different weight initializations, finally came up with Xavier initialization whose concept referred from https://towardsdatascience.com/weight-initialization-in-neural-networks-a-journey-from-the-basics-to-kaiming-954fb9b47c79       
        def initialize_weight():
            np.random.seed(0)
            w1 = np.random.randn(784, 512)*np.sqrt(2/784)
            b1 = np.zeros((1,512))
            w2 = np.random.randn(512, 512)*np.sqrt(2/512)
            b2 = np.zeros((1, 512))
            w3 = np.random.randn(512,10)*np.sqrt(2/512)
            b3 = np.zeros((1, 10))
            return w1,w2,w3,b1,b2,b3
        
        w1,w2,w3,b1,b2,b3 = initialize_weight()
        grad_w1=grad_w2=grad_w3=grad_b1=grad_b2=grad_b3 = 0

#The mini batch approach has been referred from  https://ml-cheatsheet.readthedocs.io/en/latest/optimizers.html#sgd        
        for i in range(self.epochs):
            x_batch = X
            y_batch = y
            step_size = 10000
            for j in range (6):
                x_train=x_batch[j:j+step_size]
                y_train=y_batch[j:j+step_size]
           
#The vectorized approach below has been referred from #https://www.kaggle.com/daphnecor/week-1-3-layer-nn
#https://gluon.mxnet.io/chapter03_deep-neural-networks/mlp-dropout-scratch.html#Define-the-model
                h1 = x_train.dot(w1) + b1
                z1 = relu(h1)
                z1 = dropout(z1,0.2)
                h2 = z1.dot(w2) + b2
                z2 = relu(h2)
                z2 = dropout(z2,0.2)
                h3 = z2.dot(w3) + b3
                z3 = softmax(h3)

#Scaling of the gradients is done by dividing it by the number of datapoints
                n = y_train.shape[0]
                g3 = loss_derivative(y_train,z3)
                dw3 = 1/n*(z2.T).dot(g3) 
                db3 = 1/n*np.sum(g3, axis=0)

                g2 = np.multiply(g3.dot(w3.T),relu_derivative(z2))
                dw2 = 1/n*np.dot(z1.T, g2)
                db2 = 1/n*np.sum(g2, axis=0)

                g1 = np.multiply(g2.dot(w2.T),relu_derivative(z1))
                dw1 = 1/n*np.dot(x_train.T,g1)
                db1 = 1/n*np.sum(g1,axis=0) 

#The rmsprop concept and approach has been referred from https://towardsdatascience.com/understanding-rmsprop-faster-neural-network-learning-62e116fcf29a               
                rho = 0.9
                grad_w1 = rho * grad_w1 + 0.1 * dw1 * dw1
                grad_w2 = rho * grad_w2 + 0.1 * dw2 * dw2
                grad_w3 = rho * grad_w3 + 0.1 * dw3 * dw3
                grad_b1 = rho * grad_b1 + 0.1 * db1 * db1
                grad_b2 = rho * grad_b2 + 0.1 * db2 * db2
                grad_b3 = rho * grad_b3 + 0.1 * db3 * db3
                
                #clip has been used to avoid the divide by zero error
                w1 = w1 - (self.learning_rate / np.sqrt(grad_w1.clip(min = 0.00000001))) * dw1
                b1 = b1 - (self.learning_rate / np.sqrt(grad_b1.clip(min = 0.00000001))) * db1
                w2 = w2 - (self.learning_rate / np.sqrt(grad_w2.clip(min = 0.00000001))) * dw2
                b2 = b2 - (self.learning_rate / np.sqrt(grad_b2.clip(min = 0.00000001))) * db2
                w3 = w3 - (self.learning_rate / np.sqrt(grad_w3.clip(min = 0.00000001))) * dw3
                b3 = b3 - (self.learning_rate / np.sqrt(grad_b3.clip(min = 0.00000001))) * db3

                y_dash = np.argmax(y_train,axis=1)
                y_pred = np.argmax(z3,axis=1)
                scores = []
                acc = []
                from sklearn.metrics import accuracy_score
                scores.append(accuracy_score(y_dash, y_pred, normalize=True)*100)
            acc.append(np.mean(scores))
        print("The training accuracy is: ",max(acc))
        return w1,w2,w3,b1,b2,b3
        
    def evaluate(self,x_test,y_test,W1,W2,W3,B1,B2,B3):        
        def relu(X):
            return np.maximum(0,X)
        def softmax(X):
            exponent = np.exp(X)
            return exponent / np.sum(exponent, axis=1, keepdims=True)
        h1 = x_test.dot(W1) + B1  
        z1 = relu(h1)
        h2 = z1.dot(W2) + B2
        z2 = relu(h2)
        h3 = z2.dot(W3) + B3
        z3 = softmax(h3)
        y_dash = np.argmax(y_test,axis=1)
        y_pred = np.argmax(z3,axis=1)
        from sklearn.metrics import accuracy_score
        return accuracy_score(y_dash, y_pred, normalize=True)*100

Problem #1.2 (10 points): Train your fully-connected neural network on the Fashion-MNIST dataset using 5-fold cross validation. Report accuracy on the folds, as well as on the test set.

In [3]:
# To simplify the usage of our dataset, we will be importing it from the Keras 
# library. Keras can be installed using pip: python -m pip install keras

# Original source for the dataset:
# https://github.com/zalandoresearch/fashion-mnist

# Reference to the Fashion-MNIST's Keras function: 
# https://keras.io/datasets/#fashion-mnist-database-of-fashion-articles

import keras
import numpy as np
from keras.datasets import fashion_mnist

# the data, split between train and test sets
(x_train, y_train), (x_test, y_test) = fashion_mnist.load_data()

x_train = x_train.reshape(60000, 784)
x_test = x_test.reshape(10000, 784)
x_train = x_train.astype('float32')
x_test = x_test.astype('float32')
x_train /= 255
x_test /= 255
print(x_train.shape[0], 'train samples')
print(x_test.shape[0], 'test samples')

# convert class vectors to binary class matrices
num_classes = 10
y_train = keras.utils.to_categorical(y_train, num_classes)
y_test = keras.utils.to_categorical(y_test, num_classes)
NN = NeuralNetwork(20,0.001)
#NN.evaluate(x_test,y_test,w1,w2,w3,b1,b2,b3)

d={}
d['w1'],d['w2'],d['w3'],d['b1'],d['b2'],d['b3'],d['acc']=[],[],[],[],[],[],[]

#The crossvalidation concepts and extracting index refererred from https://machinelearningmastery.com/k-fold-cross-validation/
#https://medium.com/@salsabilabasalamah/cross-validation-of-an-artificial-neural-network-f72a879ea6d5
from sklearn.model_selection import KFold
from sklearn.metrics import accuracy_score
rg=KFold(5)
for index_tr,index_te in rg.split(x_train):
  x_train1=x_train[index_tr]
  y_train1=y_train[index_tr]
  w1,w2,w3,b1,b2,b3 = NN.fit(x_train1,y_train1)
#The dictionary stores all the weights, bias and accuracy for each validation iteration
  d['w1'].append(w1)
  d['w2'].append(w2)
  d['w3'].append(w3)
  d['b1'].append(b1)
  d['b2'].append(b2)
  d['b3'].append(b3)
  x_test1=x_train[index_te]
  y_test1=y_train[index_te]
  d['acc'].append(NN.evaluate(x_test1,y_test1,w1,w2,w3,b1,b2,b3))
  print('Validation accuracy:',d['acc'][-1])

#The weights and biases with maximum accuracy are passed to evaluate final accuracy on test data  
test_acc = NN.evaluate(x_test,y_test,d['w1'][d['acc'].index(max(d['acc']))],d['w2'][d['acc'].index(max(d['acc']))],d['w3'][d['acc'].index(max(d['acc']))],d['b1'][d['acc'].index(max(d['acc']))],d['b2'][d['acc'].index(max(d['acc']))],d['b3'][d['acc'].index(max(d['acc']))])
print("The testing accuracy is: ",test_acc)

60000 train samples
10000 test samples
The training accuracy is:  72.39
Validation accuracy: 76.75
The training accuracy is:  72.37
Validation accuracy: 75.35833333333333
The training accuracy is:  72.37
Validation accuracy: 76.23333333333333
The training accuracy is:  72.37
Validation accuracy: 76.325
The training accuracy is:  72.37
Validation accuracy: 76.06666666666668
The testing accuracy is:  75.01


Problem #2.1 (40 points): Implement a Convolutional Neural Network from scratch. Similarly to problem 1.1, we will be implementing the same architecture as the one shown in [Keras' CNN documentation](https://keras.io/examples/mnist_cnn/). That is:

- Input layer
- Convolutional hidden layer with 32 neurons, a kernel size of (3,3), and relu activation function
- Convolutional hidden layer with 64 neurons, a kernel size of (3,3), and relu activation function
- Maxpooling with a pool size of (2,2)
- Dropout with a value of 0.25
- Flatten layer
- Dense hidden layer, with 128 neurons, and relu activation function
- Dropout with a value of 0.5
- Output layer, using softmax as the activation function

Our loss function is categorical crossentropy and the evaluation will be done using accuracy, as in Problem 1.1. However, we will not be using the gradient optimizer known as Adadelta.

In [0]:
class ConvolutionalNeuralNetwork(object):
  def __init__(epochs, learning_rate):
    pass
  
  def fit(self):
    pass
  
  def evaluate(self):
    pass

Problem #2.2 (10 points): Train your convolutional neural network on the Fashion-MNIST dataset using 5-fold cross validation. Report accuracy on the folds, as well as on the test set.

In [0]:
import keras
from keras.datasets import fashion_mnist

# the data, split between train and test sets
(x_train, y_train), (x_test, y_test) = fashion_mnist.load_data()

x_train = x_train.reshape(60000, 784)
x_test = x_test.reshape(10000, 784)
x_train = x_train.astype('float32')
x_test = x_test.astype('float32')
x_train /= 255
x_test /= 255
print(x_train.shape[0], 'train samples')
print(x_test.shape[0], 'test samples')

# convert class vectors to binary class matrices
num_classes = 10
y_train = keras.utils.to_categorical(y_train, num_classes)
y_test = keras.utils.to_categorical(y_test, num_classes)