# A First Simulation

Now, finally, we're going to actually train a neural network!  We'll use the techniques we've learned so far, and compare the results we get with the various activation functions and cost functions we've been talking about.

## Data

To do this, we need data.  All kinds of data will work, but there are standard benchmark datasets that people use, and we'll use one of them.  It is called **MNIST**, and it consists of a large number of labeled handwritten digits (0 to 9), divided into training and test sets (60000 and 10000 examples, respectively).

You can get the data <a href="http://yann.lecun.com/exdb/mnist/">here</a>, but the format is annoying.  You could read about the format and write your own parser, or you could use the parser <a href="http://g.sweyla.com/blog/2012/mnist-numpy/">here</a>.  To make that even easier for my purposes, I've written a convenience function on top of his parser; you can get it <a href="mnist.py">here</a>, although it only works if the data files are in your current working directory.

In [1]:
from mnist_import import get_mnist_nice

train_X, train_Y, test_X, test_Y = get_mnist_nice()

n = train_X.shape[1]
k = train_Y.shape[1]

Observe that $n=784$ and $k=10$.

This is because each data point is a 28x28 greyscale image, and the floats indicate the darkness of each pixel, written out into a single long row.  Since there are 60000 training examples and 10000 test examples, `train_X` is a 60000x784 array and `test_X` is a 10000x784 array.
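
To make the "single long row" concrete, here is a minimal sketch of the flattening step.  The image is made up (it is not read from the MNIST files), and the names `image`, `row`, and `batch` are only for illustration.

```python
import numpy as np

# A made-up 28x28 "image"; real MNIST pixels would come from the data files.
image = np.arange(28 * 28, dtype=float).reshape(28, 28)

# Write the rows of the image out one after another into a single row of 784 values.
row = image.reshape(-1)

# Stacking several such rows gives an (examples x 784) array shaped like train_X.
batch = np.stack([row, row, row])

print(row.shape)    # (784,)
print(batch.shape)  # (3, 784)
```

With this layout, the pixel at image position (1, 0) ends up at index 28 of the row.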

We aim to classify the digits.  Since each digit can be 0, 1, 2, 3, 4, 5, 6, 7, 8, or 9, but these don't naturally have any kind of useful ordering, we consider them each their own class.  So the 4th column (for example) of `train_Y` is always 0 or 1, and is 1 if and only if that row is a 3 (we start from zero).  Thus `train_Y` is a 60000x10 array, and `test_Y` is a 10000x10 array.
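
This encoding is usually called one-hot encoding.  As a rough sketch of what it looks like (the labels below are made up, and this is not necessarily how `get_mnist_nice` builds `train_Y`):

```python
import numpy as np

labels = np.array([3, 0, 9, 3])   # made-up digit labels, one per example
num_classes = 10

# Row i of the identity matrix is the one-hot vector for class i,
# so indexing the identity matrix by the labels gives one row per example.
one_hot = np.eye(num_classes)[labels]

print(one_hot.shape)   # (4, 10)
print(one_hot[0])      # a 1 in the 4th column (index 3), zeros elsewhere

# argmax along each row recovers the original labels.
recovered = np.argmax(one_hot, axis=1)
print(recovered)       # [3 0 9 3]
```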

## Putting the Pieces Together

Now that we have the dataset and the problem picked out, we need to do the following:
1. Decide on the architecture of the neural network, including hyperparameters, cost functions, etc.
2. Initialize the neural network
3. Train the neural network
4. Evaluate the neural network

We aren't too worried about accuracy right now; we just want to kick the tires and make sure everything is working.  So we'll go with a basic setup, with sigmoid neurons and only one hidden layer.

I've placed all the code from the preceding notebooks in the file <a href="basic_nn.py">basic_nn.py</a>, so we can just invoke the old code.

In [2]:
from basic_nn import *

np.random.seed(31)

# Step 1: pick architecture (in prose above)
cost_function = cost_CE     # this is a classification problem
learning_rate = 0.125       # found by repeated halving (see the note below)

# Step 2: initialize

# 100 input neurons, 100 neurons in a hidden layer, 10 output neurons: 100+100+10 total
neuron_sizes = [100, 100]

weights, biases = initialize_xavier_sigmoid(n, k, neuron_sizes)
acts = [act_sigmoid for _ in range(0, len(weights))] # all sigmoid

*Note*: I picked the learning rate with a little care.  I started with 1, but the error wasn't consistently dropping.  Then I cut it in half (down to 0.5), but the error would again start to rise after a certain point.  Then I cut it in half again (down to 0.25) and got the same problem.  Finally, I got it down to 0.125, and the error dropped consistently with that learning rate.  It's annoying, but this really is the way you pick learning rates: pick a number that causes blowup, then keep halving it until it doesn't.
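
The halving procedure can be sketched in a few lines.  This is only an illustration on a made-up two-parameter quadratic, not on MNIST and not using anything from `basic_nn.py`; the names `curvature`, `cost`, `grad`, and `cost_always_drops` are hypothetical.

```python
import numpy as np

# Made-up toy problem: cost(w) = 0.5 * sum(curvature * w**2).
curvature = np.array([1.0, 10.0])

def cost(w):
    return 0.5 * np.sum(curvature * w ** 2)

def grad(w):
    return curvature * w

def cost_always_drops(learning_rate, num_steps=20):
    """Run plain gradient descent; report whether the cost fell on every step."""
    w = np.array([1.0, 1.0])
    previous = cost(w)
    for _ in range(num_steps):
        w = w - learning_rate * grad(w)
        current = cost(w)
        if current >= previous:
            return False   # the cost rose (or stalled): this rate is too large
        previous = current
    return True

# Start with a rate that blows up, then keep halving until the cost behaves.
learning_rate = 1.0
while not cost_always_drops(learning_rate):
    learning_rate /= 2

print(learning_rate)   # 0.125 on this toy problem
```

On this toy problem the rates 1, 0.5, and 0.25 all make the cost rise, and 0.125 is the first one that doesn't.  That it matches the value used above is a coincidence of the chosen curvatures.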

Each pass through the training data is called an *epoch*, probably because it takes forever.  There are two ways to know when to stop: when the training converges, or when you have run a specified number of epochs.  The second choice seems to be more common.  For the sake of time, let's do 20 epochs.
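
The two stopping rules can be combined in one loop.  The sketch below uses a made-up one-parameter problem (minimizing `(w - 3)**2`) rather than the network, and the function name `train` and its `tolerance` parameter are assumptions for illustration only.

```python
def train(max_epochs, tolerance=None):
    """Gradient descent on (w - 3)**2, stopping on convergence or an epoch cap."""
    w = 0.0
    learning_rate = 0.1
    previous_cost = (w - 3.0) ** 2
    for epoch in range(max_epochs):
        w -= learning_rate * 2.0 * (w - 3.0)   # gradient of (w - 3)**2 is 2(w - 3)
        cost = (w - 3.0) ** 2
        # Rule 1: stop once the cost has (nearly) stopped improving.
        if tolerance is not None and previous_cost - cost < tolerance:
            return w, epoch + 1
        previous_cost = cost
    # Rule 2: stop after a fixed number of epochs.
    return w, max_epochs

w_fixed, epochs_fixed = train(max_epochs=20)
w_conv, epochs_conv = train(max_epochs=1000, tolerance=1e-6)
print(epochs_fixed, epochs_conv)
```

The fixed-epoch run always uses exactly 20 epochs, while the convergence run stops as soon as an epoch improves the cost by less than `tolerance`.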

In [3]:
# Step 3: train
import time

t1 = time.time()

num_epochs = 20

old_predictions = [0] * num_epochs

for epoch in range(0, num_epochs):
    x, z, y = forward_prop(weights, biases, acts, train_X)
    
    bp_grad_w, bp_grad_b = back_prop(weights, biases, acts, cost_function, train_X, train_Y, x, y, z)
    
    # Just for fun, let's save the successive predictions as we go
    old_predictions[epoch] = y[-1]
    
    for i in range(0, len(weights)):
        weights[i] -= learning_rate * bp_grad_w[i] / len(train_X)
        biases[i] -= learning_rate * bp_grad_b[i] / len(train_X)
    
    cost = cost_function(y[-1], train_Y)
    print("Cost {2:0.7f} before epoch {0}; took {1:0.3f} seconds so far.".format(epoch, time.time()-t1, cost))

Cost 6.9314718 before epoch 0; took 5.159 seconds so far.
Cost 5.3662717 before epoch 1; took 10.635 seconds so far.
Cost 4.5788508 before epoch 2; took 15.947 seconds so far.
Cost 4.1597876 before epoch 3; took 21.177 seconds so far.
Cost 3.9122014 before epoch 4; took 26.492 seconds so far.
Cost 3.7528381 before epoch 5; took 31.919 seconds so far.
Cost 3.6435288 before epoch 6; took 37.667 seconds so far.
Cost 3.5649101 before epoch 7; took 43.407 seconds so far.
Cost 3.5062749 before epoch 8; took 48.587 seconds so far.
Cost 3.4612781 before epoch 9; took 53.992 seconds so far.
Cost 3.4259450 before epoch 10; took 59.105 seconds so far.
Cost 3.3976713 before epoch 11; took 64.285 seconds so far.
Cost 3.3746859 before epoch 12; took 69.413 seconds so far.
Cost 3.3557465 before epoch 13; took 74.639 seconds so far.
Cost 3.3399587 before epoch 14; took 80.061 seconds so far.
Cost 3.3266638 before epoch 15; took 85.277 seconds so far.
Cost 3.3153672 before epoch 16; took 90.438 seconds
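
The starting cost is worth a second look: 6.9314718 agrees with $10 \ln 2$ to the digits shown.  That is the value a cross-entropy summed over 10 sigmoid outputs would take if every output were exactly 0.5.  Whether `cost_CE` is defined this way is an assumption; the function below is a hypothetical stand-in, not the code from `basic_nn.py`.

```python
import numpy as np

def summed_binary_cross_entropy(y_hat, y):
    # Hypothetical stand-in for cost_CE: binary cross-entropy on each output,
    # summed over the outputs and averaged over the examples.
    per_output = -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))
    return np.mean(np.sum(per_output, axis=1))

y = np.eye(10)[[3, 0, 9]]          # three made-up one-hot labels
y_hat = np.full_like(y, 0.5)       # every output sits at exactly 0.5

initial_cost = summed_binary_cross_entropy(y_hat, y)
print("{0:0.7f}".format(initial_cost))       # 6.9314718
print("{0:0.7f}".format(10 * np.log(2)))     # 6.9314718
```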

Some takeaways:
1. Convergence is happening, but very slowly.  We need a lot more epochs.
2. Epochs take a very long time (5 seconds apiece on my laptop).

Points (1) and (2) are certainly in conflict, and in the next notebook we'll discuss stochastic gradient descent, which will let us accomplish more in each epoch by subdividing the data into more manageable pieces.

## Evaluating the Results

Now, let's look at classification error.  That is, how often is our network getting the right answers?  The network's prediction for each row is whichever class has the highest score.  We can easily get this from the `np.argmax(arr, axis=1)` function, then just check how often the prediction matches the answer.
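
A small made-up example shows what the `axis` argument does.  With one example per row and one class per column, the per-row prediction is the `axis=1` result.

```python
import numpy as np

# Two made-up examples (rows) with scores for three classes (columns).
scores = np.array([[0.1, 0.6, 0.3],
                   [0.5, 0.3, 0.2]])

per_row = np.argmax(scores, axis=1)      # best class for each example
per_column = np.argmax(scores, axis=0)   # best example for each class

print(per_row)      # [1 0]
print(per_column)   # [1 0 0]
```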

In [4]:
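def classification_success_rate(y_hat, y):
    """Fraction of rows where the highest-scoring class matches the label."""
    predicted_classes = np.argmax(y_hat, axis=1)   # predicted class for each row
    actual_classes = np.argmax(y, axis=1)          # true class from the one-hot row
    errors = predicted_classes - actual_classes    # nonzero wherever they differ
    
    return 1 - (np.count_nonzero(errors) / len(errors))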

In [5]:
for i in range(0, num_epochs):
    success = classification_success_rate(old_predictions[i], train_Y)
    print("Success rate before epoch {0}: {1:0.3f}%".format(i, 100*success))

Success rate before epoch 0: 9.872%
Success rate before epoch 1: 11.237%
Success rate before epoch 2: 11.237%
Success rate before epoch 3: 11.237%
Success rate before epoch 4: 11.237%
Success rate before epoch 5: 11.237%
Success rate before epoch 6: 11.238%
Success rate before epoch 7: 11.238%
Success rate before epoch 8: 11.238%
Success rate before epoch 9: 11.238%
Success rate before epoch 10: 11.238%
Success rate before epoch 11: 11.238%
Success rate before epoch 12: 11.238%
Success rate before epoch 13: 11.238%
Success rate before epoch 14: 11.238%
Success rate before epoch 15: 11.238%
Success rate before epoch 16: 11.238%
Success rate before epoch 17: 11.238%
Success rate before epoch 18: 11.238%
Success rate before epoch 19: 11.238%
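
A success rate that freezes like this is consistent with the network assigning nearly every row to a single class, so that only the rows truly in that class count as correct.  One way to check is to count how often each class is predicted.  The scores below are made up to mimic that situation; they are not the real `old_predictions`.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up network outputs in which column 1 always has the largest score.
y_hat = rng.uniform(0.0, 0.4, size=(1000, 10))
y_hat[:, 1] = 0.9

predicted_classes = np.argmax(y_hat, axis=1)

# counts[c] is the number of rows predicted as class c.
counts = np.bincount(predicted_classes, minlength=10)
print(counts)   # all 1000 predictions land in class 1
```

On the real run, the same breakdown would come from `np.bincount(np.argmax(old_predictions[-1], axis=1), minlength=10)`.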


The success rate is not inspiring, I admit.  The honest truth is that with thousands or even millions of epochs, this would improve dramatically, but I don't have time for that and neither does anyone else.  There is a better way.