# SelfImplementedNeuralNetwork
A neural network implemented using only NumPy.

In [7]:
import numpy as np

## Initialise
Initialises the weights and biases of the network by sampling them uniformly from the interval $[-1,1)$. The network comprises 4 layers having 784, 128, 128 and 10 neurons respectively.

In [108]:
def init_network():
    # The layer sizes
    n0 = 28*28
    n1 = 128
    n2 = 128
    n3 = 10
    
    init_weights = lambda m, n: np.random.uniform(-1,1,(n,m))
    init_biases = lambda m: np.random.uniform(-1,1,(m,))
    
    layer0_1_weights = init_weights(n0,n1)
    layer0_1_biases = init_biases(n1)
    layer0_1 = layer0_1_weights, layer0_1_biases
    
    layer1_2_weights = init_weights(n1,n2)
    layer1_2_biases = init_biases(n2)
    layer1_2 = layer1_2_weights, layer1_2_biases
    
    layer2_3_weights = init_weights(n2,n3)
    layer2_3_biases = init_biases(n3)
    layer2_3 = layer2_3_weights, layer2_3_biases
    
    return layer0_1, layer1_2, layer2_3
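
The same initialisation can also be written as a loop over adjacent layer sizes. The following is only a sketch: the helper name `init_network_from_sizes` and its `seed` parameter are hypothetical, and it uses `np.random.default_rng` instead of the global `np.random.uniform` so that the result is reproducible. It is mainly meant to show the resulting array shapes.

```python
import numpy as np

def init_network_from_sizes(sizes=(28 * 28, 128, 128, 10), seed=None):
    """Hypothetical helper: one (weights, biases) pair per pair of adjacent layers."""
    rng = np.random.default_rng(seed)
    network = []
    for n_in, n_out in zip(sizes[:-1], sizes[1:]):
        # Row j of the weight matrix holds all weights leading into neuron j
        weights = rng.uniform(-1, 1, (n_out, n_in))
        biases = rng.uniform(-1, 1, (n_out,))
        network.append((weights, biases))
    return tuple(network)

network_sketch = init_network_from_sizes(seed=0)
shapes = [(w.shape, b.shape) for w, b in network_sketch]
print(shapes)
# [((128, 784), (128,)), ((128, 128), (128,)), ((10, 128), (10,))]
```

Each weight matrix has shape `(n_out, n_in)`, which is the layout that `init_network` produces as well.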

## Activation functions
*ReLU* is defined as $\text{max}(0,x)$ and is the go-to activation function for hidden layers.

In [109]:
def relu(x):
    return np.maximum(0,x)

The *Softmax* activation function is a generalization of the *Sigmoid* function. It is applied to a 1D-Array of values, which are squashed into the interval $(0,1)$ such that the sum of all values is 1. Thereby, we get a valid probability distribution, which is quite convenient. It is defined as

$$\frac{e^{x_i}}{\sum_{j}e^{x_j}}.$$

To be compatible with forward propagation on a 2D-Array of multiple training examples we extend the function to also take 2D-Arrays as a parameter and perform the softmax row-wise. (How this works in detail is annotated in the comments.)

Moreover, if the arrays contain big values, `np.exp` quickly overflows. Therefore, we subtract the maximum value of each row (or in the 1D case: simply the maximum) from all values. This doesn't alter the end result, because the common factor cancels between numerator and denominator, but it rids us of possible (pseudo-)infinities along the way.

In [386]:
def softmax(x):
    # If x is a 2D-array of multiple training examples, we transpose so that axis=0 is the axis of a single 
    # training example. (We can't simply use axis=1 because that would break if we use the function for a 
    # 1D-array, i.e. only one training example. Transposing first and then using axis=0 works in both cases.)
    x = x.T
    # To avoid overflow of np.exp (doesn't alter the value)
    x = x - np.max(x, axis=0)
    # The main calculation (the exponentials are computed only once)
    exps = np.exp(x)
    result = exps / np.sum(exps, axis=0)
    # Transpose back to normal form where each row is a training example (only makes a difference if array is 2D;
    # doesn't change anything if array is 1D.)
    return result.T
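
The effect of subtracting the maximum can be checked numerically. The sketch below uses two hypothetical 1D helpers, `naive_softmax` and `stable_softmax`, which are not part of the notebook, and made-up input values. It assumes 64-bit floats, for which `np.exp(1000)` overflows.

```python
import numpy as np

def naive_softmax(x):
    # Direct translation of the formula, without any shift
    e = np.exp(x)
    return e / np.sum(e)

def stable_softmax(x):
    # Same formula after subtracting the maximum
    e = np.exp(x - np.max(x))
    return e / np.sum(e)

small = np.array([1.0, 2.0, 3.0])
large = np.array([1000.0, 1001.0, 1002.0])

# For small inputs both variants agree
same_on_small = np.allclose(naive_softmax(small), stable_softmax(small))

# For large inputs np.exp overflows to inf and inf / inf gives nan
with np.errstate(over="ignore", invalid="ignore"):
    naive_large = naive_softmax(large)
stable_large = stable_softmax(large)

print(same_on_small, naive_large, stable_large)
```

Since `large` is just `small` shifted by a constant, the stable variant returns the same distribution for both inputs.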

## Forward propagation
Here we implement forward propagation. This function can either take a 1D-Array of a single training example or a 2D-Array of multiple training examples where each row is a training example. The transpositions (`.T`) are necessary so that the dot product and the vector addition work in the 2D case. In the 1D case they have no effect, because transposing a 1D-Array returns it unchanged.

In [281]:
def feed_forward(data, network):
    layer0 = data

#   # How it would work iteratively:
#   layer1 = []
#   for weights, bias in zip(network[0][0], network[0][1]):
#      neuron = relu(np.dot(weights, layer0) + bias)
#      layer1.append(neuron)  
#   layer1 = np.array(layer1)
    
    layer1 = relu(np.dot(network[0][0], layer0.T).T + network[0][1])
        
    layer2 = relu(np.dot(network[1][0], layer1.T).T + network[1][1])
            
    layer3 = softmax(np.dot(network[2][0], layer2.T).T + network[2][1])
    
    return layer3
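
To see why the transpositions work in both cases, here is a small sketch for a single layer without activation. The sizes (6 inputs, 4 neurons, 3 examples) and the helper names `affine_transposed` and `affine_matmul` are made up for illustration. The second helper shows an equivalent formulation, `x @ weights.T`, which is an alternative and not what the notebook uses.

```python
import numpy as np

rng = np.random.default_rng(0)
weights = rng.uniform(-1, 1, (4, 6))  # same (n_out, n_in) layout as in init_network
biases = rng.uniform(-1, 1, (4,))
batch = rng.uniform(0, 1, (3, 6))     # 3 examples, one per row
single = batch[0]                     # a single 1D example

def affine_transposed(x):
    # The pattern used in feed_forward
    return np.dot(weights, x.T).T + biases

def affine_matmul(x):
    # Equivalent formulation without transposing the data
    return x @ weights.T + biases

batch_out = affine_transposed(batch)
single_out = affine_transposed(single)
print(batch_out.shape, single_out.shape)
# (3, 4) (4,)
```

In the 2D case the biases of shape `(4,)` are broadcast across the rows of the `(3, 4)` result, so every example receives the same bias vector.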

## Get data
We use the MNIST handwritten digit classification data, which ships with Keras.

In [277]:
network = init_network()

In [278]:
import tensorflow as tf
(x_train_raw, y_train), (x_test_raw, y_test) = tf.keras.datasets.mnist.load_data()

Flatten the data to make it a well-behaved input for our neural network:

In [279]:
x_train = x_train_raw.reshape(60000, 28*28)
x_test = x_test_raw.reshape(10000, 28*28)

The feed forward function can either take a single example or an array of examples:

In [285]:
feed_forward(x_train[10], network) == feed_forward(x_train, network)[10]

array([ True,  True,  True,  True,  True,  True,  True,  True,  True,
        True])

## Cost function
First, the loss function, i.e. the error for a single training example. There is a multitude of loss functions, though there is an "industry standard" for each type of problem. While the simpler and better-known MSE (*mean squared error*) is used for regression problems, one uses *cross entropy loss* for classification problems (which fits our case precisely).

Given a prediction $\hat{y}_i$ and the correct solution $y_i$ (in our case for $0\leq i<10$), we can define *cross entropy loss* as 
$$H(y,\hat{y})=-\sum_{i=0}^9 y_i\cdot \text{log}(\hat{y}_i).$$
Since we know that $y_i$ is zero for all but one $i$, we can simplify this to
$$H(y,\hat{y})=-\text{log}(\hat{y}_k)$$
where $k$ is the index of the correct solution.

In [636]:
def cross_entropy_loss(y_hat, y):
    # Predicted probability of the correct class
    y_val = y_hat[y]
    # To avoid taking log(0), which would give -inf
    if y_val == 0:
        # np.finfo(float).eps is the gap between 1.0 and the next float, 
        # not the smallest positive float. 
        # np.spacing(0) gives the smallest number bigger than 0.
        y_val = np.spacing(0)
    return -np.log(y_val)
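
The simplification from the full sum to a single logarithm can be checked on a small example. The prediction `y_hat` over three classes and the correct index `k` below are made-up values for illustration.

```python
import numpy as np

y_hat = np.array([0.1, 0.7, 0.2])  # made-up prediction over 3 classes
k = 1                              # index of the correct class

# One-hot encoding of the correct solution: 1 at index k, 0 elsewhere
y_one_hot = np.zeros_like(y_hat)
y_one_hot[k] = 1.0

full_sum = -np.sum(y_one_hot * np.log(y_hat))  # the general definition
shortcut = -np.log(y_hat[k])                   # the simplified form

print(np.isclose(full_sum, shortcut))
# True
```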

In [605]:
result = feed_forward(x_train, network)

In [606]:
cross_entropy_loss(result[10], y_train[10])

744.4400719213812

Then, the cost function, i.e. the error for a whole training set (or subset).

In [607]:
def cross_entropy_cost(y_hats, ys):
    # Built-in sum: passing a generator to np.sum is deprecated
    return sum(cross_entropy_loss(y_hat, y) for y_hat, y in zip(y_hats, ys))
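
For large training sets the Python-level loop can become slow. A possible vectorised variant is sketched below. The function name `cross_entropy_cost_vectorised` and the small arrays are hypothetical. It assumes that `y_hats` is a 2D-Array with one prediction per row and that `ys` holds integer class indices.

```python
import numpy as np

def cross_entropy_cost_vectorised(y_hats, ys):
    # Pick y_hats[n, ys[n]] for every row n in one indexing operation
    picked = y_hats[np.arange(len(ys)), ys]
    # Same guard as in the per-example loss: replace exact zeros before taking the log
    picked = np.where(picked == 0, np.spacing(0.0), picked)
    return -np.sum(np.log(picked))

# Made-up predictions for 3 examples and 2 classes
y_hats = np.array([[0.5, 0.5],
                   [0.0, 1.0],
                   [0.25, 0.75]])
ys = np.array([0, 0, 1])

# Reference value computed example by example
loop_cost = sum(-np.log(p if p > 0 else np.spacing(0.0))
                for p in (y_hats[n, y] for n, y in enumerate(ys)))
vectorised_cost = cross_entropy_cost_vectorised(y_hats, ys)
print(np.isclose(loop_cost, vectorised_cost))
# True
```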

In [608]:
cross_entropy_cost(result, y_train)

40916654.30157135

## Backpropagation
Finally, we implement the training.

### Definitions

- Let the layers be labeled from $n=0$ to $n=3$.
- Let $x_i^{(0)}$ be the $i$-th $x$-value, that is the $i$-th input of the network ($0\leq i<28^2$).
- Let $x_i^{(1)}$ and $x_i^{(2)}$ be the value of the $i$-th neuron of (hidden) layer 1 or 2, respectively ($0\leq i<128$).
- Let $y_i$ be the $i$-th output, that is the value of the $i$-th neuron in layer 3 ($0\leq i< 10$).
- Let $w_{ij}^{(n)}$ represent the weight on the connection of the $i$-th neuron in layer $n$ to the $j$-th neuron in layer $n+1$. (Can be found in `network[n][0][j][i]`.)

### Stochastic Gradient Descent
Let us consider the gradient, i.e. the partial derivatives, for specific groups of parameters:
#### Layer 2 to 3 weights
We consider the weights 
$$w_{ij}^{(2)} \qquad\text{where}\; 0\leq i\leq127,\; 0\leq j\leq9$$ 
each of which represents the weight on the connection of the $i$-th neuron in layer 2 to the $j$-th neuron in layer 3. (Can be found in `network[2][0][j][i]`.)

Our goal is to determine
$$\frac{\partial}{\partial w_{ij}^{(2)}} \;\text{cross_entropy_cost}$$
$$=\quad\sum_{\forall\;\text{samples}} \frac{\partial}{\partial w_{ij}^{(2)}} \;\text{cross_entropy_loss}$$
$$=\quad\sum_{\forall\;\text{samples}} \frac{\partial}{\partial w_{ij}^{(2)}} \;\left(-\text{log}(y_k)\right),$$
where $k$ is the index of the correct solution for a given sample.

Hence, let us consider a single sample. Let
$$z_j = \sum_{r=0}^{127} w_{rj}^{(2)}\cdot x_r^{(2)} + b_j$$
be the input of the $j$-th neuron in layer 3 before the softmax is applied, where $b_j$ is its bias (`network[2][1][j]`), so that
$$y_k = \frac{e^{z_k}}{\sum_{m=0}^{9}e^{z_m}} \qquad\text{and}\qquad -\text{log}(y_k) = -z_k + \text{log}\left(\sum_{m=0}^{9}e^{z_m}\right).$$
The weight $w_{ij}^{(2)}$ only appears in $z_j$, hence
$$\frac{\partial}{\partial w_{ij}^{(2)}} \;\left(-\text{log}(y_k)\right)$$
$$=\quad \frac{\partial}{\partial z_j}\;\left(-\text{log}(y_k)\right)\cdot\frac{\partial z_j}{\partial w_{ij}^{(2)}}$$
$$=\quad \frac{\partial}{\partial z_j}\;\left(-\text{log}(y_k)\right)\cdot x_i^{(2)}.$$
We see that if $j\neq k$ 
$$\frac{\partial}{\partial w_{ij}^{(2)}} \;\text{cross_entropy_loss} = \frac{e^{z_j}}{\sum_{m=0}^{9}e^{z_m}}\cdot x_i^{(2)} = y_j\cdot x_i^{(2)}.$$
Otherwise if $j = k$
$$\frac{\partial}{\partial w_{ik}^{(2)}} \;\text{cross_entropy_loss} = \left(-1 + \frac{e^{z_k}}{\sum_{m=0}^{9}e^{z_m}}\right)\cdot x_i^{(2)} = (y_k - 1)\cdot x_i^{(2)}.$$
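
A derivative like this one can be checked numerically with central finite differences. The sketch below assumes a softmax output layer with cross entropy loss, for which the closed form of the derivative with respect to $w_{ij}^{(2)}$ is $(y_j - [j=k])\cdot x_i^{(2)}$. All names (`softmax_1d`, `loss`, `analytic`, `numeric`) and the random stand-in values are hypothetical and not part of the notebook.

```python
import numpy as np

def softmax_1d(z):
    e = np.exp(z - np.max(z))
    return e / np.sum(e)

def loss(weights, biases, x, k):
    # Cross entropy loss of a softmax layer for the correct class k
    return -np.log(softmax_1d(weights @ x + biases)[k])

rng = np.random.default_rng(0)
weights = rng.uniform(-1, 1, (10, 128))  # weights[j][i] corresponds to w_ij^(2)
biases = rng.uniform(-1, 1, (10,))
x = rng.uniform(0, 1, (128,))            # stand-in for the layer 2 activations
k = 3                                    # index of the correct class

# Closed form, arranged in the same (10, 128) layout as the weights
y = softmax_1d(weights @ x + biases)
one_hot = np.zeros(10)
one_hot[k] = 1.0
analytic = np.outer(y - one_hot, x)

# Central finite differences for a few selected entries (j, i)
eps = 1e-6
checked = [(0, 0), (3, 5), (9, 127)]
numeric = {}
for j, i in checked:
    shifted = weights.copy()
    shifted[j, i] += eps
    up = loss(shifted, biases, x, k)
    shifted[j, i] -= 2 * eps
    down = loss(shifted, biases, x, k)
    numeric[(j, i)] = (up - down) / (2 * eps)

for j, i in checked:
    print((j, i), np.isclose(numeric[(j, i)], analytic[j, i], atol=1e-5))
```

The outer product gives the derivatives for all weights of the layer at once, which is convenient because it has the same shape as `network[2][0]`.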


In [620]:
def sgd(x_train, y_train, network, batch_size=50, epochs=3):
    x_single = x_train[0]
    y_single = y_train[0]
    
    result = feed_forward(x_single, network)
    
    print(result.shape)
    
    # Layer 2to3 weights
    
    # Layer 3 biases
    
    # Layer 1to2 weights
    
    # Layer 2 biases
    
    # Layer 0to1 weights
    
    # Layer 1 biases