# A SIMPLE INTRODUCTION TO NEURAL NETWORKS WITH PYTHON

The goal of this notebook is to provide a simple, straightforward implementation of a multilayer neural network so we can explore various concepts and building blocks you will run into again and again while working with neural nets. Many great examples exist online that are far more robust. However, if you are new to machine learning, the added complexity of these tutorials comes with a cognitive tax: there is too much to learn at once. I decided to write the canonical Python neural network I wish I had found when I was first starting my journey into deep learning.

By implementing a two layer neural network using only Python and NumPy (with a tiny sprinkling of scikit-learn), I hope we can explore each step in a consumable manner. Future tutorials will bring in more robust frameworks that we will call upon as our projects get more powerful and creative, such as Google's **TensorFlow** or Microsoft's **Cognitive Toolkit**. These are the two frameworks I use most at work.

But being able to write a full neural network from basic Python building blocks will be a great first step to understanding what is actually happening under the hood without getting too lost along the way. Let's get started...


## Step 1: Loading and preparing our data

We will build a neural network that can classify which type of Iris we are analyzing based on certain measurements of the flower. We will be using the famous [Iris dataset](https://en.wikipedia.org/wiki/Iris_flower_data_set), first presented in 1936 by Ronald Fisher using measurements collected by Edgar Anderson.

There are three types of Irises the dataset classifies among its 150 records. These classes are:

![IRISES](images/irises.png)

Each of these three labels has 50 records of flower measurements (for a total of 150 records in the full dataset). Each flower record contains four measurements we will use as our input *features*. These features record the sepal length, sepal width, petal length and petal width for each iris.

Fortunately, since the Iris dataset is very famous, it already comes as one of the default datasets that ships with scikit-learn. So let's go ahead and grab it from there:


In [1]:
# WE WILL BE BUILDING OUR NEURAL NETWORK IN PYTHON, AND MAINLY USE NUMPY FOR OUR OPERATIONS
import numpy as np

# WHILE SCIKIT-LEARN COMES WITH A LOT OF MACHINE LEARNING HELPER CLASSES, WE WILL LIMIT OURSELVES
# TO ONLY USING IT AS AN EASY WAY TO GET US ACCESS TO THE IRIS DATASET
from sklearn import datasets
iris = datasets.load_iris()

features = iris.data
labels = iris.target


The dataset already separates the measurement *features* we will be using as our inputs from the corresponding category *label* for each record, which tells us which type of iris we are evaluating.

Let's look at the features of the first five records in our dataset to see what the measurements look like:

In [8]:
print("total feature records:", len(features))
print("\nFirst five records:\n", features[:5])

total feature records: 150

First five records:
 [[ 5.1  3.5  1.4  0.2]
 [ 4.9  3.   1.4  0.2]
 [ 4.7  3.2  1.3  0.2]
 [ 4.6  3.1  1.5  0.2]
 [ 5.   3.6  1.4  0.2]]


As expected, each record has four measurements. This means when we build our Neural Network, our **input layer** will need four neurons to hold each measurement.

Now let's look at the labels for all the records in the dataset. There should be 150 of them, which will match up one-to-one with the 150 feature records we saw above:

In [10]:
print("total labels:", len(labels))
print("\nLabel records:\n", labels)

total labels: 150

Label records:
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2]


Since our simple neural network will be updating itself a record at a time, we may want to shuffle up the order to get a better training distribution for our neural network to learn.

Since we already have the features and labels split, however, we need to make sure that we shuffle them both in the same order, otherwise we'll have broken our training data if the labels don't match up to their features anymore!

Here's an easy way to do that in Python:

In [12]:
# LET'S CREATE A VECTOR OF INDICES FOR EACH FEATURE AND THEN SHUFFLE IT
idx = np.arange(features.shape[0])
np.random.seed(42)
np.random.shuffle(idx)

# AS LONG AS BOTH THE FEATURES AND LABELS USE THE SAME INDEX VECTOR, WE'RE STILL ALL GOOD
features = features[idx]
labels = labels[idx]

# NOW LET'S LOOK AT THE LABELS AGAIN AND MAKE SURE THEY'RE SHUFFLED
print("total labels:", len(labels))
print("\nLabel records:\n", labels)

total labels: 150

Label records:
 [1 0 2 1 1 0 1 2 1 1 2 0 0 0 0 1 2 1 1 2 0 2 0 2 2 2 2 2 0 0 0 0 1 0 0 2 1
 0 0 0 2 1 1 0 0 1 2 2 1 2 1 2 1 0 2 1 0 0 0 1 2 0 0 0 1 0 1 2 0 1 2 0 2 2
 1 1 2 1 0 1 2 0 0 1 1 0 2 0 0 1 1 2 1 2 2 1 0 0 2 2 0 0 0 1 2 0 2 2 0 1 1
 2 1 2 0 2 1 2 1 1 1 0 1 1 0 1 2 2 0 1 2 2 0 2 0 1 2 2 1 2 1 1 2 2 0 1 2 0
 1 2]


As we can see above, we have three classes of Irises so each label is either a 0, 1, or 2 (for Setosa, Versicolor, or Virginica respectively).

Now, the Iris dataset knows exactly what type of flower each record relates to, so it can give a single label. However, the purpose of a trained, classification neural network is to generate a prediction that takes into account all *possible categories* it has been trained on. That means that for each record, the Neural Network will not generate a single answer, but give each possible label an individual score across a *probability distribution.* 

Basically, that means that if we have three categories (Setosa, Versicolor, Virginica), each record will get a triple score that looks something like [0.89  0.07  0.04], where the category with the highest score indicates the highest probability. In this example, that means the prediction would be Setosa, with a probability score of 89%.
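To make the idea of "the highest score wins" concrete, here is a minimal sketch that picks the predicted class out of the example triple from the paragraph above. The `class_names` list is an illustrative assumption; it simply follows the 0, 1, 2 label order of the dataset.

```python
import numpy as np

# Assumed class names, in the same order as the dataset's 0/1/2 labels
class_names = ["Setosa", "Versicolor", "Virginica"]

# The example score triple from the text above
scores = np.array([0.89, 0.07, 0.04])

# The scores form a probability distribution, so they add up to 1
total = scores.sum()

# The predicted class is the position of the highest score
predicted_index = int(np.argmax(scores))
predicted_name = class_names[predicted_index]

print(predicted_name, scores[predicted_index])
```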

In order to properly train this network, therefore, we need to reformat all single valued training labels into a probability distribution across the three categories. Since we are 100% sure the labels in our training data are correct, each probability distribution only needs one class scored with a 1 and the other possible class labels zeroed out. 

It looks like this for our three classes:

'0' becomes [1  0  0]

'1' becomes [0  1  0]

'2' becomes [0  0  1]

Because exactly one position in each vector is "hot" (set to 1), this is called **one hot encoding**, and you can code it like this:


In [15]:
total_classes = 3

def convert_to_one_hot(vector, num_classes):
    result = np.zeros(shape=(len(vector), num_classes))
    result[np.arange(len(vector)), vector] = 1
    return result.astype(int)

one_hot_labels = convert_to_one_hot(labels, total_classes)

print("The original first 5 labels\n",labels[:5])
print("\nThe first 5 one hot labels\n", one_hot_labels[:5])

The original first 5 labels
 [1 0 2 1 1]

The first 5 one hot labels
 [[0 1 0]
 [1 0 0]
 [0 0 1]
 [0 1 0]
 [0 1 0]]


There is one last step we should do to properly prepare our data. As of now, we have 150 records we can use for training our neural network. However, if we use all the data for training, we won't have any *clean* data to see how well it performs against new data it hasn't seen before.

Fortunately, scikit-learn also gives us an easy function that splits up data into training and testing groups, based on some percentage. We will use X and y as our features and labels for training, and only use X_test and y_test once we've built and trained the model to see how good our model is at making new predictions.


In [17]:
from sklearn.model_selection import train_test_split
X, X_test, y, y_test = train_test_split(features,one_hot_labels, test_size=0.1)

print("total training records:", len(y))
print("\ntotal testing records:", len(y_test))

total training records: 135

total testing records: 15
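If you are curious what a hold-out split boils down to, the sketch below does roughly the same job with only NumPy: shuffle the row indices, then slice off a fraction for testing. The helper name `holdout_split`, its parameters, and the stand-in data are all assumptions made for illustration; this is not how scikit-learn implements `train_test_split`.

```python
import numpy as np

def holdout_split(features, labels, test_fraction=0.1, seed=0):
    """Hypothetical helper: shuffle row indices, then slice off a test portion."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(features))
    n_test = int(round(len(features) * test_fraction))
    test_idx, train_idx = idx[:n_test], idx[n_test:]
    # Features and labels are indexed with the same indices so they stay paired
    return features[train_idx], features[test_idx], labels[train_idx], labels[test_idx]

# Stand-in data with the same shapes as the Iris features and one hot labels
demo_features = np.arange(150 * 4, dtype=float).reshape(150, 4)
demo_labels = np.eye(3, dtype=int)[np.arange(150) % 3]

X_demo, X_test_demo, y_demo, y_test_demo = holdout_split(demo_features, demo_labels)
print(len(X_demo), len(X_test_demo))
```

Fixing the seed makes the split repeatable, which helps when you want to compare two training runs on exactly the same records.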


Now that we have loaded all our data, properly shuffled the records, one hot encoded the labels and split out a few records to use for final testing, we're ready to build our model!


## Step 2: Building our Neural Network

We are going to build a neural network that takes in four inputs (one per feature) all connected to a hidden layer that has five neurons. This hidden layer is fully connected to a final output layer that has three neurons (one per each class of iris).

The secret sauce of a neural network is the set of *weights* that connect all these neurons together. For example, each of the four input neurons has a unique weight for each of the 5 neurons it is connected to in the hidden layer. That means there are 4x5 or 20 weights leading into the hidden layer.

There is another set of weights for each of the 5 neurons on the hidden layer connecting to each of the 3 output neurons, for a total of 5x3 or 15 weights in the second set.
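The weight counts above can be double-checked by building arrays of the right shape and asking NumPy for their size. This is only a counting sketch with placeholder zeros; the last array assumes the bias is modelled as one extra, always-on input neuron, which adds a fifth row of weights into the hidden layer.

```python
import numpy as np

input_neurons, hidden_neurons, output_neurons = 4, 5, 3

# One weight per (source neuron, destination neuron) pair
weights_in_to_hidden = np.zeros((input_neurons, hidden_neurons))
weights_hidden_to_out = np.zeros((hidden_neurons, output_neurons))

print(weights_in_to_hidden.size)   # 4 x 5 weights
print(weights_hidden_to_out.size)  # 5 x 3 weights

# Assumption: the bias is treated as one extra input neuron,
# so the first weight matrix gains one more row
weights_with_bias = np.zeros((input_neurons + 1, hidden_neurons))
print(weights_with_bias.size)
```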

The way our neural network will work is that for each record we use for training, the 4 measurement features are used as our input values **X** for the 4 input neurons. Each of those input values is then multiplied by its unique weight **W** for each neuron in the hidden layer. Each of the 5 neurons in the hidden layer sums up the four distinct **(X \* W)** values it received from the input neurons and adds an additional value called a bias.

After a neuron has added up all the input values multiplied by their unique weights, plus the bias, the last step is to make the result **non-linear** with an **activation function**. Without this step, stacking layers would only ever produce a linear combination of the inputs, so the activation is what lets the network learn curved relationships as it optimizes itself through training. While there are multiple types of activation functions, we will use the popular **sigmoid** function as the activation for our hidden layer neurons. A [sigmoid function](https://en.wikipedia.org/wiki/Sigmoid_function) takes any number and squashes it down to a value between 0 and 1.
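Here is what that calculation looks like for a single hidden neuron. The input is the first record of the dataset; the weights and bias are made-up numbers chosen purely for illustration, not values a trained network would have.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# First record of the dataset: sepal length, sepal width, petal length, petal width
x = np.array([5.1, 3.5, 1.4, 0.2])

# Made-up weights and bias for ONE hidden neuron (illustrative values only)
w = np.array([0.1, -0.2, 0.05, 0.3])
bias = 0.1

# Multiply each input by its weight, add everything up, then add the bias
weighted_sum = np.dot(x, w) + bias

# Squash the sum into the range 0 to 1
activation = sigmoid(weighted_sum)
print(weighted_sum, activation)
```

The hidden layer repeats this for each of its 5 neurons, each with its own set of weights.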

Once all the neurons on the hidden layer have the summed and activated values they received from the input features, this process is repeated between the hidden layer and the output layer. Each of the 5 hidden layer neurons multiplies its calculated value by its unique weights per output node and sends them along. Each output node sums up all the values it received from the hidden layer. However, the final output layer doesn't use sigmoid, but an activation technique called **softmax** which will take its three values and convert them into a true probability distribution. This means that like sigmoid, each of the three output neurons (which represent the three possible iris categories) will have their values squashed down to a value between 0 and 1, but softmax takes it a step further and makes sure that collectively, all three values add up to 1.0 (representing all possible scenarios).

Let's first setup these sigmoid and softmax functions, which we will make use of in our model:


In [19]:
# SIGMOID activation squashes a number onto a curve from 0 to 1
def sigmoid(x):
    return 1. / (1. + np.exp(-x))

# SOFTMAX is used on the output layer to both squash each neuron value
# to a number on a curve from 0 to 1 but also make sure that all
# of their unique values collectively add up to exactly 1.0, 
# representing a true probability distribution among all possible categories
def softmax(w):
    e = np.exp(w - np.amax(w))
    dist = e / np.sum(e)
    return dist
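A quick sanity check of the softmax defined above, using made-up input values: the outputs should add up to 1, and subtracting the maximum before calling `np.exp()` should keep large inputs from overflowing without changing the result.

```python
import numpy as np

def softmax(w):
    # Same calculation as the cell above
    e = np.exp(w - np.amax(w))
    return e / np.sum(e)

small = softmax(np.array([1.0, 2.0, 3.0]))
print(small, small.sum())

# Large inputs would overflow np.exp() without subtracting the maximum first.
# Shifting every input by the same amount leaves the distribution unchanged.
large = softmax(np.array([1001.0, 1002.0, 1003.0]))
print(large)
```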


What we've discussed so far is called the **forward pass**. The flow is input values (**X**) come in from the features of each record, these input values are multiplied by their unique weights and summed together in each neuron on the hidden layer. The hidden layer adds an additional bias term (treated as an additional weight in practice), and then applies a sigmoid activation to the result. Each activation value is then multiplied by its own unique set of weights and those values are summed together in each of the three output layer neurons. The output layer applies softmax to its three neuron values to produce a final probability distribution that should match the *one hot encoded* label for that record, which is what the answer *should* be.

All this sounds well and good. But you need to remember that the important thing here as the programmer is that we build the *model* correctly, not that we figure out the secret sauce, the weights, ourselves. That's the whole point of a neural network. But until we have a good set of weight values, our neural network, even if it is flawlessly built, is still going to be useless and just spew out junk. 

It is important to realize this early on, and I don't believe this can be stated explicitly enough when you're first starting out: when we first build our neural network, *all these weights are initialized to just random garbage.* That means it doesn't matter how many *forward passes* you run through, nothing is going to change. Those initial random weights, which are rubbish, are just going to stay rubbish and your neural network will never *learn*. That is where **backpropagation** comes in. Here's how it works.

When we run a **forward pass**, we end up with a probability distribution that will look something like [0.43  0.37  0.20]. Each value represents the probability that the features map to either a setosa, versicolor or virginica iris. As we've discussed above, these first *several hundred* predictions are most likely going to be horribly, horribly wrong. But that is why we are **training** the neural network with features that have the correct label available.

What backpropagation does is look at the result of the forward pass, the predicted answer, and compare it to what the real answer should be. The difference between what the model produced and what the answer should have been gives us a metric, an error score, that we can exploit with the power of calculus! With that error score and a bit of this *calculus magic* (specifically derivatives, partial derivatives and the chain rule) we can start walking backward through the neural network and have each weight calculate which way it should change itself to move in the 'less wrong' direction. By the way, how *big of a step* it takes in this 'optimized' direction is called the **learning rate**. Each "forward pass / backward pass" training cycle nudges the weights in a direction that should bring the calculated prediction slightly closer to what the answer *should* have been. By running this training cycle many, many times, we will slowly *grow* our neural network into an accurately weighted classification machine.
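The "nudge each weight in the less wrong direction" idea is easier to see with a single weight. The sketch below is a toy problem, not part of the network: the error function `(w - 3)^2`, the starting guess, and the learning rate are all assumptions chosen so the numbers are easy to follow.

```python
# Toy problem: find the weight w that minimises error(w) = (w - 3)^2.
def error(w):
    return (w - 3.0) ** 2

# The derivative, 2 * (w - 3), tells us which direction is "less wrong"
def error_slope(w):
    return 2.0 * (w - 3.0)

learning_rate = 0.1   # how big a step we take each cycle
w = 0.0               # arbitrary starting guess

history = []
for step in range(50):
    # Step against the slope, scaled by the learning rate
    w = w - learning_rate * error_slope(w)
    history.append(error(w))

print(w, history[0], history[-1])
```

Each cycle moves `w` a little closer to 3 and the error shrinks. A larger learning rate takes bigger steps, but past a certain size the steps overshoot and the error stops going down.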

If you want to learn more about backpropagation and how it works, *once you're done with **this** notebook,* check out this [5 minute video](https://www.youtube.com/watch?v=q555kfIFUCM) by Siraj Raval for a great introduction to the topic. 

For now, though, if you're OK considering backpropagation as a form of *calculus magic*, we can move on. There is one explicit function we will need to define to help us out, though. This function calculates the derivative of the sigmoid function we defined for the hidden layer. We need it for backprop to calculate how much each weight needs to be tweaked, based on how much it contributed to the final error. Note that it takes the neuron's already-activated value (the output of sigmoid) as its argument, not the raw sum. We are also going to import one more scikit-learn function, which we will use during backprop to compute our error score.

Here they both are:


In [22]:
from sklearn.metrics import log_loss

def sigmoidPrime(y):
    return np.multiply(y, (1. - y))
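One way to convince yourself that `sigmoidPrime` really is the slope of `sigmoid` is to compare it with a slope measured numerically. The sketch below does that at a few arbitrary test points. Notice that `sigmoidPrime` is fed `sigmoid(x)`, the activated value, rather than `x` itself.

```python
import numpy as np

def sigmoid(x):
    return 1. / (1. + np.exp(-x))

def sigmoidPrime(y):
    # y is expected to be sigmoid(x), not x itself
    return np.multiply(y, (1. - y))

x = np.array([-2.0, 0.0, 0.5, 3.0])   # arbitrary test points
eps = 1e-6

# Slope estimated numerically (central difference)
numeric = (sigmoid(x + eps) - sigmoid(x - eps)) / (2 * eps)

# Slope from the shortcut formula, fed with the activated values
analytic = sigmoidPrime(sigmoid(x))

print(numeric)
print(analytic)
```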


With all that out of the way, we're ready for the main event: defining our simple Neural Network class. When we initialize it, it will expect to know how many input features it should process (in our case, 4 measurements). It will also want to know how many neurons we want in our hidden layer (we will start with 5 neurons, though this is a **hyper-parameter**, or number you may want to play around with to see if the model trains better or faster). Finally, it will want to know how many classes should exist in our output layer's probability distribution.

We are also adding a reset_all() function so that we can test and poke around this Neural Network once it's working and reset all the learned weights back to initial, randomized values (which will predict garbage again until we retrain the network).

We define the forward and backward passes as their own functions, though in practice we won't call these independently. We will instead call the **training_cycle()** function, passing it in a record of features (**X**) and one_hot_labels (**y**) to be used for training. 

Instead of having a discrete training_cycle() function, most Neural Network samples provide a function called fit() or train() that runs this training cycle many, many times. I wanted to make it a single forward/backward pass function, though, so we can start off very slowly and send a few records in one at a time to see what is actually happening during a training cycle. Once satisfied, we can speed up our training by calling this training_cycle() in its own loop.

Here is the full class definition:


In [36]:
class Neural_Network(object):
    
    def __init__(self, input_size, hidden_size, output_size):

        self.learning_rate = 0.01
        
        # A very easy way of adding a bias to each neuron is to treat the bias
        # as one more input value, which will be fully connected to all neurons
        # in the hidden layer
        self.input_size = input_size + 1
        self.hidden_size = hidden_size
        self.output_size = output_size
        
        # Let's use our own reset_all() function to initialize 
        # the weights with small, random values
        self.reset_all()
        
    def reset_all(self):
        self.total_training_cycle = 0
    
        # Setup our activations, or initial values, of our neurons
        # We can initialize these all to '1' for now.
        self.activations_input = np.ones(self.input_size)
        self.activations_hidden = np.ones(self.hidden_size)
        self.activations_output = np.ones(self.output_size)
        
        # Setup our weights. Here, we don't want to initialize them all to 1, to zero, or to
        # any other single number, as they each need to start and optimize themselves (grow) in
        # their own distinct direction. The best thing to do is to fill them up with
        # random numbers that cover a small range around zero, like so:
        input_range = 0.2

        # The size of our input_layer weights is (input_size, hidden_size) or (5, 5),
        # since input_size already includes the extra bias neuron
        self.weights_input_layer  = np.random.normal(loc = 0, scale = input_range, size = (self.input_size, self.hidden_size))
        
        # The size of our output_layer weights is (hidden_size, output_size) or (5, 3)
        self.weights_output_layer = np.random.normal(loc = 0, scale = input_range, size = (self.hidden_size, self.output_size))
    
    
    def feedForward(self, inputs):
        
        # Don't forget, we added an extra input neuron to serve as the bias for the hidden layer,
        # so only fill up the first 4 (not 5) values with the 4 input features of the record
        # we are processing this learning cycle
        self.activations_input[0:self.input_size-1] = inputs
        
        # Since the weights are a matrix, we dot product to multiply them 
        # together with their corresponding input values
        hidden_sums = np.dot(self.weights_input_layer.T, self.activations_input)
        # we then apply our activation function on the sum totals to get our neuron's final value
        self.activations_hidden = sigmoid(hidden_sums)

        # We repeat the process, only this time with the hidden neurons and their weights
        # connecting them to the output neurons
        output_sums = np.dot(self.weights_output_layer.T, self.activations_hidden)
        # we apply softmax instead of sigmoid on the final output layer to get a true
        # probability distribution ( all 3 values will add up to 1.0 )
        self.activations_output = softmax(output_sums)

        # we return our activations_output, which contains 3 values, 
        # one probability prediction for each of the 3 classes of irises 
        return self.activations_output
    
    
    def backPropagate(self, targets):
        
        # the targets are our one_hot_encoded labels values for the record being processed
        output_deltas = -(targets - self.activations_output)
 
        # **********************************************************************************
        # BEGIN CALCULUS MAGIC. ONCE YOU GROK THE FULL THING, FEEL FREE TO COME BACK 
        # HERE AND DIVE IN. BUT DON'T GET DISTRACTED UNTIL YOU FIRST FINISH THE FULL
        # PROJECT AND SEE IT WORKING. THEN COME BACK HERE AND JUMP IN IF YOU LIKE!
        error = np.dot(self.weights_output_layer, output_deltas)
        hidden_deltas = sigmoidPrime(self.activations_hidden) * error
        
        # update the weights connecting the hidden layer to the output layer
        output_gradients = output_deltas * np.reshape(self.activations_hidden, (self.activations_hidden.shape[0], 1))
        self.weights_output_layer = self.weights_output_layer - self.learning_rate * output_gradients
        
        # update the weights connecting the input layer to the hidden layer
        hidden_gradients = hidden_deltas * np.reshape(self.activations_input, (self.activations_input.shape[0], 1))
        self.weights_input_layer = self.weights_input_layer - self.learning_rate * hidden_gradients
        # **********************************************************************************
        
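        # This will give us an overall 'error score' we can use to see how well our Neural Network is performing.
        # Note: given two 1-D arrays, log_loss scores each of the 3 outputs as its own yes/no prediction and averages them.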
        log_error = log_loss(targets, self.activations_output)
        return log_error
    
    def training_cycle(self, X, y):
        # A training cycle expects the features (X) and 
        # one_hot_label (y) from a record in our training data
        
        # we pass the features as our input data for the forward pass...
        self.feedForward(X)
        
        # then compare the prediction against the true answer (y) in our backward pass...
        error = self.backPropagate(y)
        
        self.total_training_cycle += 1
        # let's return some extra information we may want to explore as we watch the Neural Network learn and grow
        return self.total_training_cycle, self.activations_output, error
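For readers who come back to the "calculus magic" later: the `output_deltas` line in backPropagate() is simply "prediction minus target". The sketch below checks numerically that this matches the slope of a categorical cross-entropy loss applied to the softmax output. That choice of loss is an assumption made for this check (the notebook does not name the loss the deltas come from), and the input values are made up.

```python
import numpy as np

def softmax(w):
    e = np.exp(w - np.amax(w))
    return e / np.sum(e)

def cross_entropy(output_sums, target):
    # Assumed loss: categorical cross-entropy on the softmax output
    return -np.sum(target * np.log(softmax(output_sums)))

output_sums = np.array([0.2, -0.1, 0.4])   # made-up pre-softmax values
target = np.array([0, 1, 0])               # one hot label

# The shortcut used for output_deltas: prediction minus target
analytic = softmax(output_sums) - target

# The same slopes estimated by nudging each output sum a tiny amount
eps = 1e-6
numeric = np.zeros(3)
for i in range(3):
    nudge = np.zeros(3)
    nudge[i] = eps
    numeric[i] = (cross_entropy(output_sums + nudge, target)
                  - cross_entropy(output_sums - nudge, target)) / (2 * eps)

print(analytic)
print(numeric)
```

The two printed vectors should agree to several decimal places, which is why no separate derivative function is needed for the output layer.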
        

## Step 3: Training our Neural Network

Now that we've defined our neural network, let's instantiate it and start passing in records from our Iris dataset:

In [37]:

total_features = 4 # each record has 4 measurement features
hidden_neurons = 5 # We want 5 neurons on our hidden layer
output_neurons = 3 # our output layer has one neuron per flower category

NN = Neural_Network(total_features, hidden_neurons, output_neurons)


Let's see what the two sets of weights look like. In fact, if you run the following cell multiple times, you will see that the weights are different each time, since they are always initialized to just random junk and will need to be grown over time through many, many training_cycle() calls...

In [61]:
# In case you want to run this cell multiple times, you'll see we are always resetting
# and randomizing our weights when we first initialize our untrained Neural Network
NN.reset_all()

# LET'S SEE WHAT THE RANDOM WEIGHTS ARE AFTER BUILDING OUR NETWORK
print("INITIALIZED WEIGHTS_INPUT_LAYER:")
print(NN.weights_input_layer,"\n")

print("INITIALIZED WEIGHTS_OUTPUT_LAYER:")
print(NN.weights_output_layer)

INITIALIZED WEIGHTS_INPUT_LAYER:
[[ 0.2693492   0.02535863 -0.15192788 -0.00067017 -0.01379613]
 [-0.51564309 -0.11227337  0.06284075 -0.08484311 -0.12924187]
 [ 0.18227524 -0.18330916  0.14232453 -0.26654889  0.55215239]
 [-0.04313711 -0.06465306  0.22861318 -0.08374097  0.29098556]
 [ 0.07431877 -0.02367958 -0.04500118  0.24151843  0.04609744]] 

INITIALIZED WEIGHTS_OUTPUT_LAYER:
[[ 0.08646175  0.10820037  0.02658724]
 [ 0.10397445 -0.02366468 -0.00996173]
 [-0.22382299  0.05955205 -0.3439082 ]
 [ 0.05671161 -0.0561232   0.07491702]
 [-0.20976567 -0.42387147 -0.07017306]]


In [63]:
# LET'S PASS THE FIRST RECORD IN OUR TRAINING DATASET THROUGH OUR NEW NEURAL NETWORK
loop_count, prediction, error = NN.training_cycle(X[0], y[0])
print("loop:",loop_count," target: ", y[0], " => ", prediction, "  log_error: ", error)


loop: 1  target:  [0 1 0]  =>  [ 0.34420125  0.31466026  0.34113849]   log_error:  0.665134996522


As we can see, the very first loop produces horrible results, and why shouldn't it? The weights, the secret sauce of the neural network, are totally untrained. Actually, the network has been slightly trained, since it has experienced its first training cycle, including the backPropagate() call that slightly tweaked its weights. Let's verify this by looking at the output layer weights again:

In [64]:
print("WEIGHTS_OUTPUT_LAYER AFTER ONE TRAINING CYCLE:")
print(NN.weights_output_layer)

WEIGHTS_OUTPUT_LAYER AFTER ONE TRAINING CYCLE:
[[ 0.08391677  0.11326768  0.0240649 ]
 [ 0.10302124 -0.02176675 -0.01090645]
 [-0.22565606  0.0632019  -0.34572497]
 [ 0.05587271 -0.05445286  0.07408558]
 [-0.2128824  -0.41766574 -0.07326206]]


We can see that the output weights are now *slightly* different. Backpropagation has used that calculus magic to lightly nudge them in a direction that would have gotten their error score a little smaller. Let's call the training_cycle a few more times and see what happens. It will help if we define a singlePass() function for this:

In [43]:
def singlePass(idx):
    loop_count, prediction, error = NN.training_cycle(X[idx], y[idx])
    print("loop:",loop_count," target: ", y[idx], " => ", prediction, "  log_error: ", error)

print("First 5 records:\n", X[:5])
print("\nFirst 5 matching labels:\n", y[:5])

print("\nRunning the first 5 records through the Neural Network:\n")
singlePass(0)
singlePass(1)
singlePass(2)
singlePass(3)
singlePass(4)

First 5 records:
 [[ 5.5  2.3  4.   1.3]
 [ 6.3  2.8  5.1  1.5]
 [ 5.4  3.4  1.5  0.4]
 [ 6.   2.9  4.5  1.5]
 [ 6.9  3.1  5.1  2.3]]

First 5 matching labels:
 [[0 1 0]
 [0 0 1]
 [1 0 0]
 [0 1 0]
 [0 0 1]]

Running the first 5 records through the Neural Network:

loop: 2  target:  [0 1 0]  =>  [ 0.37362072  0.36521779  0.26116149]   log_error:  0.592578851778
loop: 3  target:  [0 0 1]  =>  [ 0.37290538  0.36834671  0.25874792]   log_error:  0.759324481499
loop: 4  target:  [1 0 0]  =>  [ 0.35627257  0.37793727  0.26579016]   log_error:  0.605244649705
loop: 5  target:  [0 1 0]  =>  [ 0.37326426  0.36389688  0.26283887]   log_error:  0.594354608299
loop: 6  target:  [0 0 1]  =>  [ 0.37142832  0.36774208  0.2608296 ]   log_error:  0.755550349497


The log error doesn't really give us much to go on yet, since it's all over the map. That is because there are only three categories, so even with this very young neural network almost blindly guessing, it's still going to be right about a third of the time, and the error for a single record mostly reflects whether that guess happened to lean the right way.
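It helps to know what error a completely blind guess would score, so we have a baseline to beat. The sketch below assumes the error is computed by scoring each of the three outputs as its own yes/no prediction and averaging the three scores; the helper name `mean_binary_log_loss` is made up for this illustration. That assumption reproduces the error printed for the very first training cycle earlier in this notebook.

```python
import numpy as np

def mean_binary_log_loss(target, prediction):
    # Assumption: each of the three outputs is scored as its own yes/no
    # prediction, and the three scores are averaged
    target = np.asarray(target, dtype=float)
    prediction = np.asarray(prediction, dtype=float)
    per_output = -(target * np.log(prediction)
                   + (1 - target) * np.log(1 - prediction))
    return per_output.mean()

# A network that has learned nothing and spreads its bets evenly
blind_guess = mean_binary_log_loss([0, 1, 0], [1/3, 1/3, 1/3])
print(blind_guess)

# The very first training cycle printed earlier in this notebook
first_cycle = mean_binary_log_loss([0, 1, 0], [0.34420125, 0.31466026, 0.34113849])
print(first_cycle)
```

A blind guess lands at roughly 0.64, so errors that bounce between about 0.59 and 0.76 are still hovering around that baseline. The number to watch is whether the error drops well below it.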

Let's reset our Neural Network and start to run a lot more training cycles. When you run through all the training data one time, that is usually called an **epoch**. So let's build a function that will run a full epoch, or all 135 training records (don't forget, we still have 15 records set aside to use for testing once we've grown a strong neural network):

In [46]:
# Remember, this resets the weights so they will no longer match
# with the above explorations. We are starting fresh and from square one again...
NN.reset_all()

def run_full_epoch():
    # Let's start averaging the epoch error, which will make it
    # much more accurate than the individual error per record
    # which doesn't account for random guesses being right a third of the time
    epoch_error = 0
    total_records = len(X)
    
    # This will run through each record in our dataset once,
    # slightly optimizing the weights and thus training our model
    # each time we call training_cycle()
    for i in range(total_records):
        _, prediction, error = NN.training_cycle(X[i], y[i])
        epoch_error += error
    
    # Let's see what our average error rate is for that Epoch of training.
    # This is what we will look at to make sure it is going DOWN.
    # That means the error is getting smaller, or our neural network is getting smarter!
    epoch_error = epoch_error / float(total_records)
    print("\nEpoch average error: ", epoch_error )

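    # After each Epoch of training, let's pull a random sample
    # from our dataset to see how well the predictions match up to the one_hot_label
    # they're shooting for. Note that training_cycle() also trains on this sample,
    # which is why the loop count ends up one higher than the number of records.
    # (np.random.randint excludes the upper bound, so len(X) covers every record.)
    random_sample = int(np.random.randint(0, len(X)))
    loop_count, prediction, _ = NN.training_cycle(X[random_sample], y[random_sample])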
    print("loop: ", loop_count, " target: ", y[random_sample], " => ", prediction)

# Let's run a full epoch, or go through each of the 135 records in our dataset once...
run_full_epoch()



Epoch average error:  0.635613824874
loop:  136  target:  [0 1 0]  =>  [ 0.29955778  0.33991338  0.36052883]


We see our average error is around .63 and our prediction still seems pretty much like a pure guess.

This isn't good. But it's a start!

Let's run through 10 more epochs and see if our average error starts going down:

In [48]:
for i in range(10):
    run_full_epoch()



Epoch average error:  0.630789634342
loop:  272  target:  [0 1 0]  =>  [ 0.29650255  0.32728224  0.37621521]

Epoch average error:  0.626099344244
loop:  408  target:  [1 0 0]  =>  [ 0.31410764  0.33649905  0.34939331]

Epoch average error:  0.619721270147
loop:  544  target:  [0 0 1]  =>  [ 0.27732252  0.31004189  0.41263559]

Epoch average error:  0.610521606591
loop:  680  target:  [0 0 1]  =>  [ 0.2546135   0.30617147  0.43921503]

Epoch average error:  0.597263664073
loop:  816  target:  [1 0 0]  =>  [ 0.36663587  0.33142909  0.30193504]

Epoch average error:  0.578927551171
loop:  952  target:  [0 1 0]  =>  [ 0.29049441  0.32656956  0.38293602]

Epoch average error:  0.556195224188
loop:  1088  target:  [1 0 0]  =>  [ 0.45151613  0.31029856  0.23818531]

Epoch average error:  0.529385568667
loop:  1224  target:  [0 1 0]  =>  [ 0.22703763  0.32694988  0.44601249]

Epoch average error:  0.501418015058
loop:  1360  target:  [0 1 0]  =>  [ 0.23142739  0.33674207  0.43183054]

Epoch 

**WOOT!**

Our average error is starting to go down. Our predictions are still rubbish, but they are slowly getting better on average. Let's crank up more revolutions. Say 25 more epochs:

In [50]:
for i in range(25):
    run_full_epoch()



Epoch average error:  0.448567024666
loop:  1632  target:  [0 0 1]  =>  [ 0.1245329   0.34442173  0.53104537]

Epoch average error:  0.426363477651
loop:  1768  target:  [0 1 0]  =>  [ 0.20045424  0.36866163  0.43088412]

Epoch average error:  0.407129555377
loop:  1904  target:  [0 1 0]  =>  [ 0.23083145  0.37915419  0.39001436]

Epoch average error:  0.391014724891
loop:  2040  target:  [0 0 1]  =>  [ 0.06537836  0.34608456  0.58853708]

Epoch average error:  0.377478142141
loop:  2176  target:  [1 0 0]  =>  [ 0.78302919  0.15558821  0.0613826 ]

Epoch average error:  0.365613883777
loop:  2312  target:  [1 0 0]  =>  [ 0.76758564  0.1671593   0.06525507]

Epoch average error:  0.355544579925
loop:  2448  target:  [1 0 0]  =>  [ 0.84055955  0.12009904  0.03934141]

Epoch average error:  0.346847022065
loop:  2584  target:  [1 0 0]  =>  [ 0.84662814  0.11771079  0.03566107]

Epoch average error:  0.339084253057
loop:  2720  target:  [0 0 1]  =>  [ 0.02878808  0.33588543  0.63532649]



Nice.

Our average error is still going down and now we can see that our sample predictions are returning accurate results for the most part. Let's hit it with another 30 epochs of training and see how low we get:

In [52]:
for i in range(30):
    run_full_epoch()



Epoch average error:  0.245658417988
loop:  5032  target:  [0 0 1]  =>  [ 0.01608487  0.34978043  0.6341347 ]

Epoch average error:  0.240878907391
loop:  5168  target:  [0 0 1]  =>  [ 0.00802199  0.26666623  0.72531178]

Epoch average error:  0.235623977402
loop:  5304  target:  [0 1 0]  =>  [ 0.21382681  0.6131437   0.17302948]

Epoch average error:  0.230378797318
loop:  5440  target:  [0 0 1]  =>  [ 0.01486166  0.35672817  0.62841017]

Epoch average error:  0.225946741653
loop:  5576  target:  [1 0 0]  =>  [ 0.90223454  0.09300606  0.0047594 ]

Epoch average error:  0.220638425339
loop:  5712  target:  [1 0 0]  =>  [ 0.92016144  0.07656778  0.00327077]

Epoch average error:  0.215914576195
loop:  5848  target:  [0 0 1]  =>  [ 0.00398369  0.19417624  0.80184007]

Epoch average error:  0.211361469717
loop:  5984  target:  [0 1 0]  =>  [ 0.07432277  0.63040103  0.2952762 ]

Epoch average error:  0.206368959141
loop:  6120  target:  [1 0 0]  =>  [ 0.91849981  0.07881315  0.00268704]



One more time...

In [55]:
for i in range(30):
    run_full_epoch()



Epoch average error:  0.115360887617
loop:  10472  target:  [0 0 1]  =>  [  4.58138259e-04   6.46581629e-02   9.34883699e-01]

Epoch average error:  0.113859561054
loop:  10608  target:  [0 1 0]  =>  [ 0.024821    0.75625344  0.21892556]

Epoch average error:  0.112267572556
loop:  10744  target:  [0 0 1]  =>  [ 0.00423611  0.2994531   0.69631079]

Epoch average error:  0.111284499649
loop:  10880  target:  [1 0 0]  =>  [  9.59954140e-01   3.98353141e-02   2.10545784e-04]

Epoch average error:  0.109537652705
loop:  11016  target:  [0 0 1]  =>  [  3.75934118e-04   5.60469590e-02   9.43577107e-01]

Epoch average error:  0.10822115072
loop:  11152  target:  [0 1 0]  =>  [ 0.04601588  0.88427512  0.06970899]

Epoch average error:  0.106877787148
loop:  11288  target:  [0 0 1]  =>  [  5.40714140e-04   7.59178147e-02   9.23541471e-01]

Epoch average error:  0.105679360796
loop:  11424  target:  [0 0 1]  =>  [ 0.00219782  0.21704295  0.78075923]

Epoch average error:  0.104663608881
loop:  

After almost 100 epochs of training, with each epoch exposing the neural network to the full training set, our average error is down to roughly 0.10! That's great. We could keep going, though we would most likely hit a plateau eventually and never get it all the way down to zero.
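If you would rather not guess how many epochs to run, one option is to keep training until the average error stops improving. The sketch below is only an illustration of that idea, not part of the notebook: `train_until_plateau`, `min_improvement` and `patience` are names I made up, and it assumes an epoch function that *returns* its average error (our `run_full_epoch()` prints it instead). A fake epoch function stands in for real training so the block runs on its own.

```python
def train_until_plateau(run_epoch, max_epochs=500, min_improvement=1e-4, patience=5):
    """Call run_epoch() until the average error stops improving.

    Assumption: run_epoch() returns the epoch's average error as a float.
    Training stops after `patience` epochs in a row that fail to beat the
    best error seen so far by more than `min_improvement`.
    """
    best_error = float("inf")
    stale_epochs = 0
    history = []
    for _ in range(max_epochs):
        error = run_epoch()
        history.append(error)
        if best_error - error > min_improvement:
            # Meaningful improvement: remember it and reset the counter.
            best_error = error
            stale_epochs = 0
        else:
            stale_epochs += 1
            if stale_epochs >= patience:
                break
    return history


def make_fake_epoch(start=0.6, floor=0.05, decay=0.9):
    """Hypothetical stand-in for training: the error decays toward a floor."""
    state = {"error": start}

    def run_epoch():
        state["error"] = floor + (state["error"] - floor) * decay
        return state["error"]

    return run_epoch


history = train_until_plateau(make_fake_epoch())
print("epochs run:", len(history), "final error:", round(history[-1], 4))
```

The fake error never reaches zero, and the loop gives up once the gains become too small to matter, which is the plateau described above.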

## Step 4: Testing our Neural Network

For now, though, this is more than enough to prove that our Neural Network works, learns by being exposed to data through the forward and backward passes, and has grown a nice collection of weights. Let's use our testing data now and see how our Neural Network stands up to new data it has never seen before.

For testing, we won't call a full training cycle anymore but just call the forward pass directly and see what our trained weights get us. We split out 15 records for testing from our original Iris dataset, so let's just run through them all.



In [58]:
# Run every held-out test record through the trained network (forward pass only).
total_test_records = len(y_test)

for i in range(total_test_records):
    test_result = NN.feedForward(X_test[i])
    print("Test:", i, "\nTarget:", y_test[i], "\nPrediction: ", test_result)

Test: 0 
Target: [0 1 0] 
Prediction:  [ 0.0084245   0.64473482  0.34684068]
Test: 1 
Target: [1 0 0] 
Prediction:  [  9.62175655e-01   3.77177240e-02   1.06621302e-04]
Test: 2 
Target: [0 1 0] 
Prediction:  [ 0.0459506   0.93628104  0.01776837]
Test: 3 
Target: [0 0 1] 
Prediction:  [  5.94041677e-04   9.75696254e-02   9.01836333e-01]
Test: 4 
Target: [0 0 1] 
Prediction:  [  2.25100103e-04   4.75673453e-02   9.52207555e-01]
Test: 5 
Target: [1 0 0] 
Prediction:  [  9.69292761e-01   3.06223645e-02   8.48745599e-05]
Test: 6 
Target: [1 0 0] 
Prediction:  [  9.66004420e-01   3.39020587e-02   9.35214425e-05]
Test: 7 
Target: [0 0 1] 
Prediction:  [ 0.00093543  0.13727543  0.86178914]
Test: 8 
Target: [0 1 0] 
Prediction:  [ 0.04571727  0.93677001  0.01751273]
Test: 9 
Target: [0 1 0] 
Prediction:  [ 0.12464612  0.86967803  0.00567584]
Test: 10 
Target: [0 0 1] 
Prediction:  [  3.69233349e-04   6.49375410e-02   9.34693226e-01]
Test: 11 
Target: [1 0 0] 
Prediction:  [  9.69730403e-01   3.


As a quick side note, that **e-01** or **e-02** etc. at the end of some results is scientific notation. In this context, it means you move the decimal point to the left that many places, since the **e** number in these results is negative (-01, -02, ...). When the e number is positive, by the way, you move the decimal point to the *right* that many places, but none of our results here will have a positive e number, since each prediction is made up of fractions between 0 and 1 that add up to 1.0.

So, for example, a value of **9.65e-01** means you move the decimal point one place to the left, resulting in a standard notation of **0.965**. If this is new to you, here is a shortcut: the farther to the left the decimal point moves, the smaller the value is, so just look for the **e-** number closest to zero to find the largest of the three predictions. A number with **e-02** at the end of it will be larger than a number with **e-03**.
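If you would rather not read scientific notation at all, Python can do the conversion for you. This is just a small illustration: the `prediction` values are copied from the "Test: 1" output above, and the variable names are my own rather than anything from the notebook's code.

```python
# Values copied from the "Test: 1" prediction printed above.
prediction = [9.62175655e-01, 3.77177240e-02, 1.06621302e-04]

# Python reads scientific notation directly: 9.65e-01 is just 0.965.
as_float = float("9.65e-01")
print(as_float)

# Format each value in fixed-point notation with four decimal places.
fixed = [f"{p:.4f}" for p in prediction]
print(fixed)

# The index of the largest value is the class the network is betting on.
largest_index = max(range(len(prediction)), key=lambda i: prediction[i])
print("largest value is at index", largest_index)
```

Here the largest value sits at index 0, which matches the target of `[1 0 0]` for that test record.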

How did you do?!

All 15 of my test records were successfully predicted by our now trained and properly weighted neural net. That's pretty amazing when you think about it. 
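Rather than eyeballing each result, you can also count the hits. A prediction counts as correct when its largest value sits in the same position as the 1 in the target. The sketch below is my own illustration, not code from the notebook: `argmax` and `accuracy` are hypothetical helpers, and the data is the first four test records copied from the output above so the block runs without the trained network.

```python
def argmax(values):
    """Return the index of the largest value in a list."""
    return max(range(len(values)), key=lambda i: values[i])


def accuracy(targets, predictions):
    """Fraction of records whose predicted class matches the target class."""
    correct = sum(argmax(t) == argmax(p) for t, p in zip(targets, predictions))
    return correct / len(targets)


# Tests 0 to 3, copied from the printed results above.
targets = [[0, 1, 0], [1, 0, 0], [0, 1, 0], [0, 0, 1]]
predictions = [
    [0.0084245, 0.64473482, 0.34684068],
    [9.62175655e-01, 3.77177240e-02, 1.06621302e-04],
    [0.0459506, 0.93628104, 0.01776837],
    [5.94041677e-04, 9.75696254e-02, 9.01836333e-01],
]

print("accuracy:", accuracy(targets, predictions))
```

With the real network, you would collect `NN.feedForward(X_test[i])` for every test record and pass those results in together with `y_test`.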

## IN CONCLUSION

We, as programmers, didn't *solve the problem* of how to turn four iris measurements into the correct classification. That is what **classical computer science** does: as programmers, we ourselves are the ones who create the algorithm, the secret sauce that solves the problem.

Instead, with **multilayer neural nets**, what we do is build a brain, an artificial neural network, and make sure the *structure* is correct: that we can feed it the features, that we have created enough neurons, each with weights and an activation function, and that we can measure the results. As long as we have enough data, and our brain can feedforward that data to produce a prediction, measure *how wrong* that prediction is against what was expected, then use backpropagation and some calculus magic to slightly nudge its weights in a more optimal direction, our artificial brain will **itself** learn to solve the problem.
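To see that loop stripped down to its bones, here is a toy sketch with a single neuron and a single input. It is not the network from this notebook: the data, the `sigmoid` helper, the learning rate and the epoch count are all assumptions I picked for illustration. The three steps inside the loop are the same ones described above: feedforward, measure how wrong we are, nudge the weights.

```python
import math


def sigmoid(x):
    """Squash any number into the range 0 to 1."""
    return 1.0 / (1.0 + math.exp(-x))


# Toy data (made up): the target is 1 when the feature is positive, else 0.
samples = [(-2.0, 0.0), (-1.0, 0.0), (1.0, 1.0), (2.0, 1.0)]

weight, bias = 0.0, 0.0
learning_rate = 0.5


def average_error(weight, bias):
    """Mean absolute difference between targets and predictions."""
    return sum(abs(t - sigmoid(weight * x + bias)) for x, t in samples) / len(samples)


error_before = average_error(weight, bias)

for epoch in range(200):
    for x, target in samples:
        prediction = sigmoid(weight * x + bias)            # feedforward
        error = prediction - target                        # how wrong are we?
        gradient = error * prediction * (1 - prediction)   # chain rule, squared error
        weight -= learning_rate * gradient * x             # nudge the weight
        bias -= learning_rate * gradient                   # nudge the bias

error_after = average_error(weight, bias)
print("error before:", error_before, "error after:", error_after)
```

Nobody told the neuron that "positive means 1". It found a weight that encodes that rule on its own, just by being nudged after every mistake.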

This is the biggest mental shift to grasp when switching from classical computer programming to AI programming and deep learning. We no longer create the secret sauce ourselves. We build a brain that can take in large amounts of data and learn for itself what the secret sauce is. 

That is why this AI boom is so important. For decades, we as humans were creating software using classical programming, but the assumption has always been we can only write solutions for problems we as humans are smart enough to solve. But there is an entire universe of problems that are too complex for us to solve with our human brains.

AI starts where classical programming meets its limit. If we *can* solve a problem using classical programming, then we should do it. If you do it right, the output should be 100% correct all the time. But for all those problems we can't solve with 100% certainty, as long as we have enough real world data of previous inputs that led to various outputs, we can now build an **artificial brain** that can chew over that data and see patterns we can't. Give it enough data, and it can learn the data well enough to be able to make predictions on new information. That's amazing. And that's A.I.

@rickbarraza
