# Fully Connected Neural Network for MNIST

## Requirements

### Python-Modules

In [1]:
# third party
import numpy as np
# mnist data
from deep_teaching_commons.data.fundamentals.mnist import Mnist
# tensorflow
import tensorflow as tf

from tqdm import tqdm

### Data

In [2]:
# create mnist loader from deep_teaching_commons
mnist_loader = Mnist(data_dir='data')

# load all data, labels are one-hot-encoded, images are flatten and pixel squashed between [0,1]
train_images, train_labels, test_images, test_labels = mnist_loader.get_all_data(flatten=False, one_hot_enc=True, normalized=True)
print(train_images.shape, train_labels.shape)

# reshape to match generel framework architecture
train_images, test_images = train_images.reshape(60000, 28, 28, 1), test_images.reshape(10000, 28, 28, 1)
print(train_images.shape, train_labels.shape)

# shuffle training data
shuffle_index = np.random.permutation(60000)
train_images, train_labels = train_images[shuffle_index], train_labels[shuffle_index]

auto download is active, attempting download
mnist data directory already exists, download aborted
(60000, 28, 28) (60000, 10)
(60000, 28, 28, 1) (60000, 10)


## Building the Neural Network
For a better understanding of neural networks you will start to implement your own framework. The given notebook explaines some core functions and concetps of the framework, so you all have the same starting point. The  Pipeline will be: 

**define a model architecture -> construct a neural network from the model -> define a evaluation citeria -> optimize the network**

### Creating a custom architecture
To create a custom model you have to define layers and activation functions that can be used to do so. Layers and activation functions are modeled as objects. Each object that you want to use has to implement a `forward` that is used by the `NeuralNetwork` class. Additionally the `self.params` attribute is mandatory to meet the specification of the `NeuralNetwork` class. It is used to store all learnable parameters that you need for the optimization algorithm. We implement our neural network so that we can use the objects as building blocks and stack them up to create a custom model. 

#### Layers  Class
The file `layer.py` contains implementations of neural network layers and regularization techniques that can be inserted as layers into the architecture.

In [3]:
class Flatten():
    ''' Flatten layer used to reshape inputs into vector representation
    
    Layer should be used in the forward pass before a dense layer to 
    transform a given tensor into a vector. 
    '''
    def __init__(self):
        self.params = []

    def forward(self, X):
        ''' Reshapes a n-dim representation into a vector 
            by preserving the number of input rows.
        
        Args:
            X: Images set
    
        Returns:
            X_: Matrix with images in a flatten represenation
            
        Examples:
            [10000, 1, 28, 28] -> [10000,784]
        '''
        return tf.reshape(X, [-1, X.shape[1] * X.shape[2] * X.shape[3]])

Later on, we are going to define some activation functions which we obviously going to use on top of the layers. Thus we add to the constructor a parameter ```activation_func``` to the class.

In [4]:
class FullyConnected():
    ''' Fully connected layer implemtenting linear function hypothesis 
        in the forward pass and its derivation in the backward pass.
    '''
    def __init__(self, in_size, out_size, activation_func=None,stddev=0.1):
        ''' Initilize all learning parameters in the layer
        
        Weights will be initilized with modified Xavier initialization.
        Biases will be initilized with zero. 
        '''
        self.W = tf.Variable(tf.truncated_normal([in_size, out_size], stddev=stddev))
        self.b = tf.Variable(tf.ones([out_size])/10)
        self.params = [self.W, self.b]
        self.activation_func = activation_func
        self.out_size = out_size

    def forward(self, X):
        ''' Linear combiationn of images, weights and bias terms
            
        Args:
            X: Matrix of images (flatten represenation)
    
        Returns:
            out: Sum of X*W+b  
        '''
        Z = tf.matmul(X, self.W) + self.b
        if self.activation_func is None:
            return Z
        else:
            return self.activation_func.forward(Z)

#### Activation Functions
The file `activation_func.py` contains implementations of activation functions you can use as a none linearity in your network. The functions work on the basis of matrix operations and not discret values, so that these can also be inserted as a layer. As an example the ReLU and Softmax function is implemented:

$$
f ( x ) = \left\{ \begin{array} { l l } { x } & { \text { if } x > 0 } \\ { 0 } & { \text { otherwise } } \end{array} \right.
$$

In [5]:
class ReLU():
    ''' Implements activation function rectified linear unit (ReLU) 
    '''

    def __init__(self):
        self.params = []

    def forward(self, X):
        return tf.nn.relu(X)

In [6]:
class Softmax():
    ''' Implements activation function softmax
    '''

    def __init__(self):
        self.params = []

    def forward(self, X):
        return tf.nn.softmax(X)

#### Loss Function
Implementations of loss functions can be found in `loss_func.py`. A loss function object defines the criteria your network is evaluated during the optimization process and also contains score functions that can be used as classification criteria for predictions with the final model. Therefore it is necessary to create a loss function object and provide to the optimization algorithm.

In [7]:
class Loss_functions:
    ''' Implements different typs of loss functions for neural networks
    '''

    def cross_entropy(X, y):
        ''' Computes loss and prepares dout for backprop 

        https://eli.thegreenplace.net/2016/the-softmax-function-and-its-derivative/
        '''
        cross_entropy = -tf.reduce_mean(y * tf.log(X)) * 10
        return cross_entropy

## NeuralNetwork class

In [8]:
class NeuralNetwork:
    ''' Creates a neural network from a given layer architecture 

    This class is suited for fully connected network and
    convolutional neural network architectures. It connects 
    the layers and passes the data from one end to another.
    '''

    def __init__(self, layers):
        ''' Setup a global parameter list and initilize a
            score function that is used for predictions.

        Args:
            layer: neural network architecture based on layer and activation function objects
            score_func: function that is used as classifier on the output
        '''
        self.layers = layers
        self.params = []
        for layer in self.layers:
            if len(layer.params) > 0:
                self.params.append(layer.params)

    def forward(self, X):
        ''' Pass input X through all layers in the network 
        '''
        for layer in self.layers:
            X = layer.forward(X)
        return X

    def predict(self, X):
        ''' Run a forward pass and use the score function to classify 
            the output.
        '''
        temp = tf.placeholder(tf.float32, X.shape)
        pred = self.forward(temp)
        with tf.Session() as sess:
            init = tf.global_variables_initializer()
            sess.run(init)
            pred = sess.run(pred, feed_dict={temp: X})
        return np.argmax(pred, axis=1)

## Optimization with SGD
The file `optimizer.py` contains implementations of optimization algorithms. Your optimizer needs your custom `network`, `data` and `loss function` and some additional hyperparameter as arguments to optimzie your model. 

In [9]:
class Optimizer():

    def get_minibatches(X, y, batch_size):
        ''' Decomposes data set into small subsets (batch)
        '''
        m = X.shape[0]
        batches = []
        for i in range(0, m, batch_size):
            X_batch = X[i:i + batch_size, :, :, :]
            y_batch = y[i:i + batch_size, ]
            batches.append((X_batch, y_batch))
        return batches

    def calculate_gradient(network, loss_function):
        grad = []
        for param in network.params:
            W, b = param
            dW = tf.gradients(loss_function, W)
            db = tf.gradients(loss_function, b)
            grad.append([dW, db])
        return grad

    def sgd(network, X_train, y_train, loss_function, batch_size=32, epoch=100, learning_rate=0.01, X_test=None, y_test=None, verbose=None):
        ''' Optimize a given network with stochastic gradient descent 
        '''
        X_shape, y_shape = [None], [None]
        for x in X_train.shape[1:]:
            X_shape.append(x)
        for y in y_train.shape[1:]:
            y_shape.append(y)
        X = tf.placeholder(tf.float32, X_shape)
        Y = tf.placeholder(tf.float32, y_shape)

        loss = loss_function(network.forward(X), Y)
        grads = Optimizer.calculate_gradient(network, loss)

        minibatches = Optimizer.get_minibatches(X_train, y_train, batch_size)
        for i in range(epoch):
            if verbose:
                print('Epoch', i + 1)
            for X_mini, y_mini in tqdm(minibatches):
                for param, grad in zip(network.params, grads):
                    with tf.Session() as sess:
                        init = tf.global_variables_initializer()
                        sess.run(init)
                        sess_grad = sess.run(grad, feed_dict={X: X_mini, Y: y_mini})
                    for i in range(len(sess_grad)):
                        param[i] = param[i] - learning_rate * sess_grad[i][0]
            if verbose:
                with tf.Session() as sess:
                    init = tf.global_variables_initializer()
                    sess.run(init)
                    sess_loss = sess.run(loss, feed_dict={X: X_mini, Y: y_mini})
                train_acc = np.mean(np.argmax(y_train, axis=1) == network.predict(X_train))
                test_acc = np.mean(np.argmax(y_test, axis=1) == network.predict(X_test))
                print("Loss = {0} :: Training = {1} :: Test = {2}".format(
                    sess_loss, train_acc, test_acc))
        return network

# Put it all together
Now you have parts together to create and train a fully connected neural network. First, you have to define an individual network architecture by flatten the input and stacking fully connected layer with activation functions. Your custom architecture is given to a `NeuralNetwork` object that handles the inter-layer communication during the forward and backward pass. Finally, you optimize the model with a chosen algorithm, here stochastic gradient descent. That kind of pipeline is similar to the one you would create with a framework like Tensorflow or PyTorch.

In [10]:
# design a three hidden layer architecture with Dense-Layer
# and ReLU as activation function
def fcn_mnist():
    flat = Flatten()
    hidden_01 = FullyConnected(784, 300, activation_func=ReLU())
    hidden_02 = FullyConnected(300, 200, activation_func=ReLU())
    hidden_03 = FullyConnected(200, 100, activation_func=ReLU())
    ouput = FullyConnected(100, 10, activation_func=Softmax())
    return [flat, hidden_01, hidden_02, hidden_03, ouput]

# create a neural network on specified architecture with softmax as score function
fcn = NeuralNetwork(fcn_mnist())

In [None]:
# optimize the network and a softmax loss
fcn = Optimizer.sgd(fcn, train_images, train_labels, Loss_functions.cross_entropy, batch_size=256, epoch=10, learning_rate=0.01, X_test=test_images, y_test=test_labels, verbose=True)

  0%|          | 0/235 [00:00<?, ?it/s]

Epoch 1


100%|██████████| 235/235 [02:29<00:00,  1.57it/s]
  0%|          | 0/235 [00:00<?, ?it/s]

Loss = 2.413881778717041 :: Training = 0.12285 :: Test = 0.1105
Epoch 2


100%|██████████| 235/235 [06:08<00:00,  1.57s/it]
  0%|          | 0/235 [00:00<?, ?it/s]

Loss = 2.3438308238983154 :: Training = 0.10123333333333333 :: Test = 0.1133
Epoch 3


100%|██████████| 235/235 [09:47<00:00,  2.50s/it]
  0%|          | 0/235 [00:00<?, ?it/s]

Loss = 2.551849126815796 :: Training = 0.12831666666666666 :: Test = 0.0889
Epoch 4


100%|██████████| 235/235 [14:00<00:00,  3.58s/it]
  0%|          | 0/235 [00:00<?, ?it/s]

Loss = 2.560546398162842 :: Training = 0.08156666666666666 :: Test = 0.1382
Epoch 5


100%|██████████| 235/235 [18:34<00:00,  4.74s/it]
  0%|          | 0/235 [00:00<?, ?it/s]

Loss = 2.3954720497131348 :: Training = 0.06816666666666667 :: Test = 0.1136
Epoch 6


100%|██████████| 235/235 [22:54<00:00,  5.85s/it]
  0%|          | 0/235 [00:00<?, ?it/s]

Loss = 2.367936849594116 :: Training = 0.1372 :: Test = 0.0882
Epoch 7


 83%|████████▎ | 194/235 [21:40<04:34,  6.71s/it]