# Practical Part: Neural Network Implementation & Experiments

Load the Fashion MNIST data:
Note: keep your file structures like this for reading input data without
using ```import os``` for path change!
```
./Homework 3
├── 3_practical_part.ipynb
├── circles.txt
├── data
│   ├── fashion
│   │   ├── t10k-images-idx3-ubyte.gz
│   │   ├── t10k-labels-idx1-ubyte.gz
│   │   ├── train-images-idx3-ubyte.gz
│   │   └── train-labels-idx1-ubyte.gz
│   └── mnist
│       └── README.md
├── hw3
│   └── d3english.pdf
├── overleaf_url.txt
└── utils
    ├── __init__.py
    ├── __pycache__
    │   ├── __init__.cpython-36.pyc
    │   └── mnist_reader.cpython-36.pyc
    ├── argparser.py
    ├── helper.py
    └── mnist_reader.py
```

In [294]:
import utils.mnist_reader as mnist_reader
import numpy as np
import math
import copy 

X_train, y_train = mnist_reader.load_mnist('data/fashion', kind='train')
X_test, y_test = mnist_reader.load_mnist('data/fashion', kind='t10k')

print(X_test.shape)

(10000, 784)


Load the Circles data:

In [365]:
import numpy as np

circlesData = np.loadtxt(open('circles.txt','r'))
circlesTarget = circlesData[:,2]
circlesData = circlesData[:,[0,1]] 
#circlesData = circlesData
circlesData = np.expand_dims(circlesData, axis = 1)
circlesData = circlesData.reshape(1100, 2, 1)
print(circlesData.shape)
circlesTarget = np.array([int(i) for i in circlesTarget])
print(circlesTarget)

(1100, 2, 1)
[1 1 0 ... 0 1 1]


In [50]:
class loadData:
    def __init__(self):
        self.addOnes = False
        self.data_path = '/data/'
    
    def convertTarget(self, targetValues):
        # Convert to one-hot encoding
        numClasses = np.max(targetValues) + 1
        return np.eye(numClasses)[targetValues]
    

    def loadNumData(self, data, target):
        # Split into train/validation/test
        np.random.seed(6390)
        randIndices = np.random.permutation(data.shape[0])
        data, target = data[randIndices], target[randIndices]
        
        div1 = int(math.floor(0.8 * data.shape[0]))
        div2 = int(math.floor(0.9 * data.shape[0]))
        trainData, trainTarget = data[:div1], target[:div1]
        validData, validTarget = data[div1:div2], target[div1:div2]
        testData, testTarget = data[div2:], target[div2:]
    
        # Get one hot encoding
        trainTarget = self.convertTarget(trainTarget)
        validTarget = self.convertTarget(validTarget)
        testTarget = self.convertTarget(testTarget)
        
        return trainData, trainTarget, validData, validTarget, testData, testTarget

dataLoader = loadData()
 
trainData, trainTarget, validData, validTarget, testData, testTarget = dataLoader.loadNumData(circlesData, circlesTarget)
print(trainData.shape, validData.shape, testData.shape)




(880, 2) (110, 2) (110, 2)


## Experiments

### Part 1

> As a beginning, start with an implementation that computes the gradients for a single example, and check that the gradient is correct using the finite difference method described above.

In [301]:
class BatchSampler(object):
    '''
    A (very) simple wrapper to randomly sample batches without replacement.
    '''
    
    def __init__(self, data, targets, batch_size):
        self.num_points = data.shape[0]
        self.features = data.shape[1]
        self.data = data
        self.targets = targets
        self.batch_size = batch_size
        self.indices = np.arange(self.num_points)

    def random_batch_indices(self, m=None):
        if m is None:
            indices = np.random.choice(self.indices, self.batch_size, replace=False)
        else:
            indices = np.random.choice(self.indices, m, replace=False)
        return indices 

    def get_batch(self, m=None):
        '''
        Get a random batch without replacement from the dataset.
        If m is given the batch will be of size m. 
        Otherwise will default to the class initialized value.
        '''
        indices = self.random_batch_indices(m)
        X_batch = np.take(self.data, indices, 0)
        y_batch = self.targets[indices]
        return X_batch, y_batch

In [302]:
# Our own activation functions

def relu(pre_activation):
    '''
    preactivation is a vector
    '''
    relu_output = np.zeros(pre_activation.shape[0])
    for i, neuron in enumerate(pre_activation):
        if neuron[0] > 0:
            relu_output[i] = neuron[0]
    return relu_output

def softmax(pre_activation):
    '''
    Numerically stable because subtracting the max value makes bit overflow impossible,
    we will only have non-positive values in the vector
    '''
    exps = np.exp(pre_activation - np.max(pre_activation))
    return exps / np.sum(exps)

In [388]:
class neuralNet():
    def __init__(self, d, dh, m, eta=1, regularize=None):
        self.inputDim = d #inputDim
        self.hiddenDim = dh #hiddenDim
        self.outputDim = m #outputDim
        self.regularize = regularize # lambda value
        self.learningRate = eta
        #may use xavier init - maybe explore this later.
        
        # Initial weights and biases
        self.W_1 = np.random.uniform(-1/np.sqrt(d), 1/np.sqrt(d), d*dh).reshape(dh, d)
        self.W_2 = np.random.uniform(-1/np.sqrt(dh), 1/np.sqrt(dh), dh*m).reshape(m, dh)   
        self.b_1 = np.zeros(dh).reshape(dh,1)
        self.b_2 = np.zeros(m).reshape(m, 1)
        
    def fprop(self, x):
        #print("in fprop, ", self.W_1.shape, x.shape, self.b_1.shape)
        print("fprop ", x.shape)
        self.h_a = np.dot(self.W_1, x) + self.b_1
        self.h_s = np.expand_dims(relu(self.h_a), axis = 1)
        #print("self.h_a.shape", self.h_a.shape)
        #print("self.h_s.shape", self.h_s.shape)

        self.o_a = np.dot(self.W_2, self.h_s) + self.b_2
        self.o_s = softmax(self.o_a)
        #print("self.o_a.shape", self.o_a.shape)
    def errorRate(self):
        '''
        negative log
        -logO_s(x)
        '''
        negLog = []
        #print('oa', self.o_a)
        for i in range(self.inputDim):
            #print("self.o_a[i]", self.o_a[i])
            #print('np.sum(self.o_a, axis=1))', np.sum(self.o_a, axis=1))
            negLog.append(-self.o_a[i] + np.log(np.sum(np.exp(self.o_a), axis=1)))
        
        error = np.array(negLog)
        #print('returning this error:', error)
        return error
            
    def bprop(self, batchData, batchTarget):
        '''
        batchTarget already in one-hot format
        '''
        
        self.grad_oa = self.o_s - batchTarget
        
        # the np.mean is taking avg of loss
        # pack prop taking the avg to multiply over all the grads
        self.grad_oa = np.dot(self.grad_oa, np.mean(self.errorRate()))
        self.grad_W2 = np.dot(self.grad_oa, self.h_s.T)
        self.grad_b2 = self.grad_oa
        self.grad_hs = np.dot(self.grad_oa.T, self.W_2)
        
        # Check this (dim mismatch maybe)
        print("self.h_a ", self.h_a.shape)
        print("(np.where(self.h_a > 0, 1, 0)))", (np.where(self.h_a > 0, 1, 0)).shape)
        print("self.grad_hs.T ", self.grad_hs.T.shape)
        self.grad_ha = np.multiply(self.grad_hs.T, (np.where(self.h_a > 0, 1, 0)))  
        
        #np.dot(np.expand_dims(self.grad_oa, axis=1), np.expand_dims(self.h_s, axis=1).T)
        self.grad_W1 = np.dot(self.grad_ha, batchData.T)
        self.grad_b1 = self.grad_ha
        
        self.grad_reg1 = 0# regularization for layer 1
        self.grad_reg2 = 0# regularization for layer 2
    
    def updateParams(self):
        self.W_1 -= self.grad_W1 * self.learningRate
        self.W_2 -= self.grad_W2 * self.learningRate
        print("grad_b1 ", self.grad_b1.shape)
        
        self.b_1 -= self.grad_b1 * self.learningRate
        self.b_2 -= self.grad_b2 * self.learningRate
    
    def gradDescentLoop(self, batchData, batchTarget, K):
        # Call each example in the data (over the minibatches) in a loop
        
        
        if K == 1: # For batch size = 1 (one example)
            self.fprop(batchData[0])
            self.bprop(batchData[0], batchTarget)
        else: # For minibatch with size K
            running_error = 0
            for i in range(K):
                self.fprop(batchData[i])
                running_error += self.errorRate()
            avg_error = running_error / K
            self.bprop(batchData[i], batchTarget[i])
            self.updateParams()
            
            
    '''def gradDescentMat():
        # Feed the entire data matrix in as input
        miniBatch = sample(K indices)
        #bprop using matrix operations
        
    '''

        
    def dataSplit(self):
        '''
        train
        test
        valid
        '''

In [391]:
def finiteGradCheck(nn, sigma, x, y):
    print(x.shape)
    # Error for current parameters
    orig_error = nn.errorRate()

    # Perturb each parameter
    nnRes = nn.o_s
    nnKeepCopy = copy.deepcopy(nn)
    netCopy = copy.deepcopy(nn)
    
    #print("deep copy?? nn.W_2,", nn.W_2)
    netCopy.W_1 += sigma
    # W_2 nc = hidden layer dim
    netCopy.W_2 += sigma
    netCopy.b_1 += sigma
    netCopy.b_2 += sigma
    
    #print("netCopy W_2", netCopy.W_2)
    
    # forward prop the copy net
    # x[0] is a heck when we input a single data point
    # b/c we reshaped the whole circle data set into (1100, 2, 1)
    netCopy.fprop(x[0])
    netCopy.bprop(x[0], y)
    
    # Calculate perturbed loss
    new_error = netCopy.errorRate()
    
    print('orig error:', orig_error)
    print('new_error:', new_error)
    gradDiff = (orig_error - new_error) / sigma
    
    print("error difference ", gradDiff)
    gradDiffArr = []
    gradDiffArr.append((netCopy.grad_oa - nnKeepCopy.grad_oa) / sigma)
    gradDiffArr.append((netCopy.grad_W2 - nnKeepCopy.grad_W2) / sigma)
    gradDiffArr.append((netCopy.grad_W1 - nnKeepCopy.grad_W1) / sigma)
    gradDiffArr.append((netCopy.grad_b2 - nnKeepCopy.grad_b2) / sigma)
    gradDiffArr.append((netCopy.grad_b1 - nnKeepCopy.grad_b1) / sigma)
    
    return gradDiffArr
        
'''      
def earlyStopping(net):
    totalEpoch = 10 #may not be enough??
    for each epoch
        train
        valid
        test
        plot / print errors
'''



'      \ndef earlyStopping(net):\n    totalEpoch = 10 #may not be enough??\n    for each epoch\n        train\n        valid\n        test\n        plot / print errors\n'

In [392]:
# Do forward and backprop for one example

# Initialize net
#print(circlesData[0].shape)
nn = neuralNet(2, 1, 2)
nn.gradDescentLoop(circlesData[0:1], circlesTarget[0:1], 1)

# Print the gradients
print('W1 gradient: \n{} \n\n b1 gradient:\n{}'.format(nn.grad_W1, nn.grad_b1))
print("softmax result: \n {}".format(nn.o_s.shape))
print("For finite gradient check - errors divided by sigma: \n", finiteGradCheck(nn, 0.000001,
                                                                                 np.array([circlesData[0]]).T,
                                                                                 np.array([circlesTarget[0]])))

fprop  (2, 1)
self.h_a  (1, 1)
(np.where(self.h_a > 0, 1, 0))) (1, 1)
self.grad_hs.T  (1, 1)
W1 gradient: 
[[3.24347214e-18 1.41458738e-18]] 

 b1 gradient:
[[4.42315798e-18]]
softmax result: 
 (2, 1)
(1, 2, 1)
fprop  (2, 1)
self.h_a  (1, 1)
(np.where(self.h_a > 0, 1, 0))) (1, 1)
self.grad_hs.T  (1, 1)
orig error: [[ 1.38777878e-17  8.05778976e-02]
 [-8.05778976e-02  3.12250226e-17]]
new_error: [[ 4.16333634e-17  8.05785516e-02]
 [-8.05785516e-02 -1.73472348e-17]]
error difference  [[-2.77555756e-11 -6.53962517e-01]
 [ 6.53962517e-01  4.85722573e-11]]
For finite gradient check - errors divided by sigma: 
 [array([[4.51143755e-12],
       [4.16217983e-12]]), array([[1.14125939e-12],
       [1.05290759e-12]]), array([[-1.29739790e-12, -5.65838897e-13]]), array([[4.51143755e-12],
       [4.16217983e-12]]), array([[-1.76927553e-12]])]


### Part 2

> Display  the  gradients  for  both  methods (direct computation and finite difference) for a small network (e.g. $d = 2$ and $d_{h} = 2$) with random weights and for a single example.

### Part 3

> Add a hyperparameter for the minibatch size $K$ to allow computing the gradients on a minibatch of $K$ examples (in a matrix), by looping over the $K$ examples (this is a small addition to your previous code).

### Part 4

> Display the gradients for both methods (direct computation and finite difference) for a small network (e.g. $d = 2$ and $d_{h} = 2$) with random weights and for a minibatch with 10 examples (you can use examples from both classes from the two circles dataset).

In [385]:
nn = neuralNet(2, 2, 2)
print(np.transpose(circlesData[0:10]).shape)
nn.gradDescentLoop((circlesData[0:10]), circlesTarget[0:10], 10)

(1, 2, 10)
fprop  (2, 1)
fprop  (2, 1)
fprop  (2, 1)
fprop  (2, 1)
fprop  (2, 1)
fprop  (2, 1)
fprop  (2, 1)
fprop  (2, 1)
fprop  (2, 1)
fprop  (2, 1)
self.h_a  (2, 1)
(np.where(self.h_a > 0, 1, 0))) (2, 1)
self.grad_hs.T  (2, 1)
grad_b1  (2, 1)


### Part 5

> Train your neural network using gradient descent on the two circles dataset. Plot the decision regions for several different values of the hyperparameters (weight decay, number of hidden units, early stopping) so as to illustrate their effect on the capacity of the model.

### Part 6

> As a second step, copy your existing implementation to modify it to a new implementation that will use matrix calculus (instead of a loop) on batches of size $K$ to improve efficiency. **Take the matrix expressions in numpy derived in the first part, and adapt them for a minibatch of size $K$. Show in your report what you have modified (describe the former and new expressions with the shapes of each matrix).**

### Part 7

> Compare both implementations (with a loop and with matrix calculus) to check that they both give the same values for the gradients on the parameters, first for $K = 1$, then for $K = 10$. Display the gradients for both methods.

### Part 8

> Time how long an epoch takes on Fashion MNIST (1 epoch = 1 full traversal through the whole training set) for $K = 100$ for both versions (loop over a minibatch and matrix caluclus).

### Part 9

> Adapt your code to compute the error (proportion of misclassified examples) on the training set as well as the total loss on the training set during each epoch of the training procedure, and at the end of each epoch, it computes the error and average loss on the validation set and the test set. Display the 6 corresponding figures (error and average loss on train/valid/test), and write them in a log file.

### Part 10

> Train your network on the Fashion MNIST dataset. Plot the training/valid/test curves (error and loss as a function of the epoch number, corresponding to what you wrote in a file in the last question). Add to your report the curves obtained using your best hyperparameters, i.e. for which you obtained your best error on the validations et. We suggest 2 plots: the first one will plot the error rate (train/valid/test with different colors, show which color in a legend) and the other one for the averaged loss (on train/valid/test). You should be able to get less than 20% test error.