## Perceptron Recap

![](perceptron.gif)

1. Input Layer
2. Weights
3. Activation Function
4. Bias
5. Output
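
To make the five pieces above concrete, here is a minimal sketch of a single perceptron in NumPy. The step activation and the particular weights and bias (chosen so the unit behaves like a logical AND) are assumptions for illustration, not values from this lesson.

```python
import numpy as np

def step(z):
    """Step activation: 1 if the weighted sum is positive, else 0."""
    return 1 if z > 0 else 0

def perceptron(inputs, weights, bias):
    """Weighted sum of the inputs plus the bias, passed through the activation."""
    z = np.dot(inputs, weights) + bias
    return step(z)

# Hypothetical weights and bias chosen so the perceptron computes logical AND.
and_weights = np.array([1.0, 1.0])
and_bias = -1.5

and_outputs = [perceptron(np.array(pair), and_weights, and_bias)
               for pair in [(0, 0), (0, 1), (1, 0), (1, 1)]]
print(and_outputs)  # [0, 0, 0, 1]
```

Only the input `(1, 1)` pushes the weighted sum above zero, so it is the only case where the output is 1.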

## How Neural Networks Work Recap

1. We specify the architecture of our neural network. (Much like model building, this is based on our data and assumptions, and we get better with practice.)
2. We decide how many epochs $n$ we want to run and how many batches $k$ go in an epoch.
3. Split training data into $k$ batches.
4. Feed first batch into neural network.
5. Calculate error and update weights/bias accordingly.
6. Feed next batch into neural network.
7. Repeat steps 5 and 6 until all $k$ batches have gone through exactly once. This ends the epoch.
8. Repeat steps 4-7 until we have completed $n$ epochs.
9. Make adjustments as necessary and/or use model for prediction.
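
The loop structure in the steps above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the "network" is a single linear unit, the toy data (`y = 2x + 1`), the learning rate, and the values of `n_epochs` and `n_batches` are all made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data (an assumption for illustration): y = 2x + 1 with no noise.
X = np.linspace(-1, 1, 40)
y = 2 * X + 1

n_epochs = 200       # n: number of passes over the training data
n_batches = 4        # k: number of batches per epoch
learning_rate = 0.1

w, b = 0.0, 0.0      # the "network" here is a single linear unit

for epoch in range(n_epochs):
    # Step 3: shuffle, then split the training data into k batches.
    order = rng.permutation(len(X))
    for batch_idx in np.array_split(order, n_batches):
        x_batch, y_batch = X[batch_idx], y[batch_idx]
        # Steps 4 and 6: feed the batch through the model.
        error = (w * x_batch + b) - y_batch
        # Step 5: update the weight and bias using the gradient of the
        # mean squared error on this batch.
        w -= learning_rate * 2 * np.mean(error * x_batch)
        b -= learning_rate * 2 * np.mean(error)

print(w, b)  # close to 2 and 1
```

Each pass of the outer loop is one epoch; each pass of the inner loop is one batch and one weight update.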

### Building a Neural Network in NumPy

In [1]:
import numpy as np

input_data = np.array([-1,4])

weights = { 'node_0_0': np.array([3,3]),
            'node_0_1': np.array([3,3]),
            'node_1_0': np.array([3,3]),
            'node_1_1': np.array([3,3]),
            'output': np.array([2,-1])}

In [2]:
def relu(x):
    # Rectified linear unit: return x if it is positive, otherwise 0
    return max(0, x)

def predict_with_network(input_data):
    # Calculate node 0 in the first hidden layer
    node_0_0_input = (input_data * weights['node_0_0']).sum()
    node_0_0_output = relu(node_0_0_input)

    # Calculate node 1 in the first hidden layer
    node_0_1_input = (input_data * weights['node_0_1']).sum()
    node_0_1_output = relu(node_0_1_input)

    # Put node values into array: hidden_0_outputs
    hidden_0_outputs = np.array([node_0_0_output, node_0_1_output])

    # Calculate node 0 in the second hidden layer
    node_1_0_input = (hidden_0_outputs * weights['node_1_0']).sum()
    node_1_0_output = relu(node_1_0_input)

    # Calculate node 1 in the second hidden layer
    node_1_1_input = (hidden_0_outputs * weights['node_1_1']).sum()
    node_1_1_output = relu(node_1_1_input)

    # Put node values into array: hidden_1_outputs
    hidden_1_outputs = np.array([node_1_0_output, node_1_1_output])

    # Calculate model output: model_output
    model_output = (hidden_1_outputs * weights['output']).sum()

    # Return model_output
    return model_output

output = predict_with_network(input_data)
print(output)

54
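
The same forward pass can also be written with one weight matrix per layer instead of one array per node. The sketch below is an alternative formulation, not the lesson's own code; the matrix names are hypothetical and the values are copied from the `weights` dictionary above, with one row per node.

```python
import numpy as np

def relu(x):
    """Rectified linear unit: negative values become 0."""
    return np.maximum(0, x)

input_data = np.array([-1, 4])

# One row per node, so each layer is a single matrix.
hidden_0_weights = np.array([[3, 3],
                             [3, 3]])
hidden_1_weights = np.array([[3, 3],
                             [3, 3]])
output_weights = np.array([2, -1])

# Each layer is a matrix-vector product followed by the activation.
hidden_0_outputs = relu(np.dot(hidden_0_weights, input_data))
hidden_1_outputs = relu(np.dot(hidden_1_weights, hidden_0_outputs))
model_output = np.dot(output_weights, hidden_1_outputs)
print(model_output)  # 54
```

Writing layers as matrices is what lets libraries such as TensorFlow and Keras process whole batches of inputs at once.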

### (Briefly) Building a Neural Network in TensorFlow

In [3]:
# Complete program (uses the TensorFlow 1.x graph-and-session API)
import numpy as np
import tensorflow as tf

# Model parameters
W = tf.Variable([.3], dtype=tf.float32)
b = tf.Variable([-.3], dtype=tf.float32)

# Model input and output
x = tf.placeholder(tf.float32)
linear_model = W * x + b
y = tf.placeholder(tf.float32)

# loss
loss = tf.reduce_sum(tf.square(linear_model - y)) # sum of the squares

# optimizer
optimizer = tf.train.GradientDescentOptimizer(0.01)
train = optimizer.minimize(loss)

# training data
x_train = [1,2,3,4]
y_train = [0,-1,-2,-3]

# training loop
init = tf.global_variables_initializer()
sess = tf.Session()
sess.run(init) # set W and b to their (deliberately wrong) starting values
for i in range(1000):
    sess.run(train, {x:x_train, y:y_train})

# evaluate the fitted parameters and the loss on the training data
curr_W, curr_b, curr_loss = sess.run([W, b, loss], {x:x_train, y:y_train})
print("W: %s b: %s loss: %s"%(curr_W, curr_b, curr_loss))

W: [-0.9999969] b: [ 0.99999082] loss: 5.69997e-11
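
As a sanity check on those numbers, the same line can be fit directly by least squares. This sketch uses `np.polyfit`, which is not part of the TensorFlow program above; it is included only to show that gradient descent is converging to the exact solution `W = -1`, `b = 1`.

```python
import numpy as np

x_train = [1, 2, 3, 4]
y_train = [0, -1, -2, -3]

# Degree-1 least-squares fit; polyfit returns [slope, intercept].
slope, intercept = np.polyfit(x_train, y_train, 1)
print(slope, intercept)  # approximately -1 and 1
```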


## Gradient Descent Algorithm

**Goal: To find the best choice of parameter!** 

1. Pick a "learning rate," called $\alpha$.
2. Pick a starting point for your parameter - let's call it $\beta_{1_1}$.
3. Check the value of your loss function at $\beta_{1_1}$ and calculate the gradient.
4. Replace $\beta_{1_1}$ with $\beta_{1_2}$ := $\beta_{1_1} - \alpha\left(\text{gradient of loss function}\right)$. (Check the picture below to see an example of calculating the gradient for one particular loss function.)
5. Repeat until you satisfy some condition indicating that you've minimized your loss function (or maximized your utility function). This will be the same idea as our "stopping condition" in iteratively reweighted least squares!
6. Take your final value, $\beta_{1_n}$, and set it equal to your estimate $\hat{\beta}_1$.

![](pic1.jpg)

![](pic2.jpg)
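
The six steps above can be sketched for a single parameter. The loss function $L(\beta) = (\beta - 3)^2$, the learning rate, the starting point, and the stopping tolerance are all assumptions picked for illustration; the gradient of this loss is $2(\beta - 3)$, so the minimum sits at $\beta = 3$.

```python
def loss(beta):
    """Hypothetical loss function with its minimum at beta = 3."""
    return (beta - 3.0) ** 2

def gradient(beta):
    """Derivative of the loss with respect to beta."""
    return 2.0 * (beta - 3.0)

alpha = 0.1          # Step 1: learning rate
beta = 0.0           # Step 2: starting point
tolerance = 1e-8     # stopping condition on the size of the update
max_steps = 10000    # safety cap so the loop always terminates

for step in range(max_steps):
    update = alpha * gradient(beta)   # Step 3: gradient at the current value
    beta = beta - update              # Step 4: move against the gradient
    if abs(update) < tolerance:       # Step 5: stop when updates are tiny
        break

beta_hat = beta                       # Step 6: final value is our estimate
print(beta_hat, loss(beta_hat))
```

With a learning rate that is too large (for this loss, anything above 1), the updates overshoot and the loop diverges instead of settling near 3.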

Gradient descent is the method by which we optimize many parameters. A slight variation on this is "stochastic gradient descent," where each update uses the gradient computed on a randomly selected observation (or a small random batch of observations) rather than on the full training set.

**Check:** Why might we include these stochastic (random) choices of observations?

**Check:** What might gradient descent require in terms of our loss function?

# Building a simple neural-network with Keras

**Author: Xavier Snelgrove**

This is a simple quick-start for performing digit recognition with a neural network in Keras, written for a short tutorial at the University of Toronto. It is largely based on the `mnist_mlp.py` example from the Keras source.

## Install prerequisites
We need to install the following packages:

    pip install numpy jupyter keras==1.2.2 matplotlib tensorflow
    
In a terminal, type `nano ~/.keras/keras.json` and make sure the file contains:

    {
        "epsilon": 1e-07,
        "floatx": "float32",
        "image_data_format": "channels_last",
        "backend": "tensorflow"
    }
    
Close your terminal and re-open it.

Close your jupyter notebook and re-open it.

## Time to build a neural network!
First let's import some prerequisites.

In [8]:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
plt.rcParams['figure.figsize'] = (7,7) # Make the figures a bit bigger

from keras.datasets import mnist
from keras.models import Sequential
from keras.layers.core import Dense, Dropout, Activation
from keras.utils import np_utils


## Load training data

In [5]:
nb_classes = 10 ## Number of categories for prediction.

# the data, shuffled and split between train and test sets
(X_train, y_train), (X_test, y_test) = mnist.load_data()
print("X_train original shape", X_train.shape)
print("y_train original shape", y_train.shape)


Let's look at some examples of the training data:

In [None]:
for i in range(9):
    plt.subplot(3,3,i+1)
    plt.imshow(X_train[i], cmap='gray', interpolation='none')
    plt.title("Class {}".format(y_train[i]))

## Format the data for training
Our neural network is going to take a single vector for each training example, so we need to reshape the input so that each 28x28 image becomes a single 784-dimensional vector. We'll also scale the inputs to be in the range [0, 1] rather than [0, 255].

In [None]:
X_train = X_train.reshape(60000, 784)
X_test = X_test.reshape(10000, 784)
X_train = X_train.astype('float32')
X_test = X_test.astype('float32')
X_train /= 255
X_test /= 255
print("Training matrix shape", X_train.shape)
print("Testing matrix shape", X_test.shape)

Modify the target matrices to be in the one-hot format, with one position for each of the ten digits, i.e.

```
0 -> [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
1 -> [0, 1, 0, 0, 0, 0, 0, 0, 0, 0]
2 -> [0, 0, 1, 0, 0, 0, 0, 0, 0, 0]
etc.
```

In [None]:
Y_train = np_utils.to_categorical(y_train, nb_classes)
Y_test = np_utils.to_categorical(y_test, nb_classes)
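
For intuition, the same encoding can be written in plain NumPy. This is a sketch of the idea rather than how Keras implements `to_categorical`; the helper name `one_hot` and the sample labels are hypothetical.

```python
import numpy as np

def one_hot(labels, n_classes):
    """Return a (len(labels), n_classes) array with a single 1 per row."""
    # Row i of the identity matrix is the one-hot vector for class i.
    return np.eye(n_classes)[labels]

labels = np.array([0, 1, 2, 9])
encoded = one_hot(labels, 10)
print(encoded.shape)  # (4, 10)
print(encoded[2])     # 1 in position 2, 0 elsewhere
```

Taking `argmax` along each row recovers the original integer labels.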

## Build the neural network
Here we'll build a simple 3-layer fully connected network.
![](figure.png)

In [None]:
model = Sequential()
model.add(Dense(512, input_shape=(784,)))
model.add(Activation('relu')) # An "activation" is just a non-linear function applied to the output
                              # of the layer above. Here, with a "rectified linear unit",
                              # we clamp all values below 0 to 0.
                           
model.add(Dropout(0.2))   # Dropout helps protect the model from memorizing or "overfitting" the training data
model.add(Dense(512))
model.add(Activation('relu'))
model.add(Dropout(0.2))
model.add(Dense(10))
model.add(Activation('softmax')) # The "softmax" activation ensures the output is a valid
                                 # probability distribution, that is, that its values are
                                 # all non-negative and sum to 1.
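
The two activations used in this model are short enough to write out in NumPy. This is a sketch for intuition, not the Keras implementation; the `scores` vector is a made-up example of a layer's raw outputs.

```python
import numpy as np

def relu(z):
    """Clamp all values below 0 to 0."""
    return np.maximum(0, z)

def softmax(z):
    """Turn raw scores into non-negative values that sum to 1."""
    # Subtracting the max does not change the result but avoids overflow in exp.
    exp_z = np.exp(z - np.max(z))
    return exp_z / exp_z.sum()

scores = np.array([2.0, -1.0, 0.5])  # hypothetical layer outputs
relu_out = relu(scores)
probs = softmax(scores)
print(relu_out)            # [2.  0.  0.5]
print(probs, probs.sum())  # three positive values that sum to 1
```

Softmax preserves the ordering of the scores, so the largest score always receives the largest probability.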

## Compile the model
Keras is built on top of Theano (and now TensorFlow as well), both packages that allow you to define a *computation graph* in Python, which they then compile and run efficiently on the CPU or GPU without the overhead of the Python interpreter.

When compiling a model, Keras asks you to specify your **loss function** and your **optimizer**. The loss function we'll use here is called *categorical crossentropy*, which is a loss function well-suited to comparing two probability distributions.

Here our predictions are probability distributions across the ten different digits:

| P(X = 3) | P(X = 8) | P(X = 2) | P(X = 5) | ...
|:--------:|:--------:|:--------:|:--------:|:--------:
|   0.80   |   0.10   |   0.05   |   0.01   | ...

(e.g. "we're 80% confident this image is a 3, 10% sure it's an 8, 5% it's a 2, etc.").

The target is a probability distribution with 100% for the correct category, and 0 for everything else:

| P(X = 3) | P(X = 8) | P(X = 2) | P(X = 5) | ...
|:--------:|:--------:|:--------:|:--------:|:--------:
|   1.00   |   0.00   |   0.00   |   0.00   | ...

The **cross-entropy** is a measure of how different your predicted distribution is from the target distribution. [See more detail at Wikipedia!](https://en.wikipedia.org/wiki/Cross_entropy)
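
To see the number this produces, here is a small NumPy sketch of categorical cross-entropy applied to the example above. The probabilities for digits 3, 8, 2 and 5 come from the table; the values for the remaining digits are assumptions filled in so that the distribution sums to 1, and the helper function is illustrative rather than the Keras implementation.

```python
import numpy as np

def categorical_crossentropy(target, predicted, eps=1e-7):
    """Cross-entropy between a one-hot target and predicted probabilities."""
    # Clip to avoid taking log(0).
    predicted = np.clip(predicted, eps, 1.0)
    return -np.sum(target * np.log(predicted))

# Predicted probabilities for digits 0 through 9.
predicted = np.array([0.005, 0.005, 0.05, 0.80, 0.005,
                      0.01, 0.01, 0.01, 0.10, 0.005])

# One-hot target: the true digit is 3.
target = np.zeros(10)
target[3] = 1.0

loss_value = categorical_crossentropy(target, predicted)
print(loss_value)  # -log(0.80), roughly 0.223
```

Because the target is one-hot, the sum reduces to minus the log of the probability assigned to the correct class. A confident correct prediction gives a loss near 0, while assigning only 10% to the correct digit would give a loss of about 2.3.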

The optimizer helps determine how quickly the model learns and how resistant it is to getting "stuck" or "blowing up". We won't discuss this in too much detail, but "adam" is often a good choice (it was co-developed at the University of Toronto).

In [None]:
model.compile(loss='categorical_crossentropy', optimizer='adam')

## Train the model!
This is the fun part: you can feed the training data loaded earlier into this model, and it will learn to classify digits.

In [None]:
model.fit(X_train, Y_train,
          batch_size=128, nb_epoch=4,
          verbose=1, validation_data=(X_test, Y_test))

## Finally, evaluate its performance

In [None]:
score = model.evaluate(X_test, Y_test, verbose=0)

print('Test score:', score)
print('Test metric:', model.metrics_names)

### Inspecting the output

It's always a good idea to inspect the output and make sure everything looks sane. Here we'll look at some examples it gets right, and some examples it gets wrong.

In [None]:
# The predict_classes function outputs the highest probability class
# according to the trained classifier for each input example.
predicted_classes = model.predict_classes(X_test)

# Check which items we got right / wrong
correct_indices = np.nonzero(predicted_classes == y_test)[0]
incorrect_indices = np.nonzero(predicted_classes != y_test)[0]
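
The same bookkeeping can be tried on a toy example without a trained model. In this sketch the probability matrix and labels are made up; the predicted class is simply the column with the highest probability in each row.

```python
import numpy as np

# Hypothetical predicted probabilities for 4 examples over 3 classes.
probabilities = np.array([[0.7, 0.2, 0.1],
                          [0.1, 0.8, 0.1],
                          [0.3, 0.3, 0.4],
                          [0.6, 0.3, 0.1]])
true_labels = np.array([0, 1, 1, 0])

# Highest-probability class for each example.
predicted = probabilities.argmax(axis=1)

# Indices of the examples we got right and wrong.
correct_idx = np.nonzero(predicted == true_labels)[0]
incorrect_idx = np.nonzero(predicted != true_labels)[0]

# Fraction of examples classified correctly.
accuracy = np.mean(predicted == true_labels)
print(correct_idx, incorrect_idx, accuracy)  # [0 1 3] [2] 0.75
```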

In [None]:
plt.figure()
for i, correct in enumerate(correct_indices[:9]):
    plt.subplot(3,3,i+1)
    plt.imshow(X_test[correct].reshape(28,28), cmap='gray', interpolation='none')
    plt.title("Predicted {}, Class {}".format(predicted_classes[correct], y_test[correct]))
    
plt.figure()
for i, incorrect in enumerate(incorrect_indices[:9]):
    plt.subplot(3,3,i+1)
    plt.imshow(X_test[incorrect].reshape(28,28), cmap='gray', interpolation='none')
    plt.title("Predicted {}, Class {}".format(predicted_classes[incorrect], y_test[incorrect]))

# That's all!

There are lots of other great examples on the Keras homepage at http://keras.io and in the source code at https://github.com/fchollet/keras.

# Wrapping Up

**Check:** Suppose you could use Keras or TensorFlow. Which would you use?

**Check:** How would you describe a perceptron?

**Check:** How would you describe an activation function?

**Check:** How would you describe a neural network?