# Keras Tutorial

<!-- include motivation -->

<!-- include introduction to what neural networks are in general -->

## Introduction

This tutorial introduces some basic techniques and ideas you will need to start building neural networks in Keras. A neural network is a large collection of "neural units" connected to each other, loosely modeled on clusters of neurons connected by axons in the human brain. Neural networks typically consist of several layers. Each layer has several units that generally take inputs from the previous layer, apply some function, and pass their outputs to the next layer.

Neural networks can be used to solve a variety of tasks. They have long been used for things like classifying data and making predictions. More recently, we have heard about AlphaGo, a neural-network-based system that beat top players in Go, as well as research projects out of Google where neural networks were able to "learn" their own primitive encryption scheme.

Throughout this tutorial we will be exploring how to create these very powerful networks with the tool "Keras" which makes all these advanced ideas simple to implement. Hope you enjoy this tutorial!

## Installing the Libraries

Before getting started, you'll need to install and import the libraries we will use throughout this tutorial. You will need to install both Theano and Keras using `pip`:

    > pip install theano
    > pip install keras
    
Keras defaults to using TensorFlow as a back-end for its computation, but I will be using Theano for this tutorial because it is compatible with Windows.

After installing these libraries, you will need to change some configuration settings in the json file `C:\Users\$USER\.keras\keras.json`. You should be able to just copy paste the following:

    {
        "image_dim_ordering": "tf",
        "epsilon": 1e-07,
        "floatx": "float32",
        "backend": "theano"
    }
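
If you would rather not edit the file by hand, the same change can be scripted. The sketch below is one way to do it using only Python's standard `json` module. The helper name `set_keras_backend` is hypothetical and not part of Keras, and it is demonstrated on a temporary directory rather than on your real `keras.json`.

```python
import json
import os
import tempfile


def set_keras_backend(config_path, backend="theano"):
    """Set the "backend" key in a keras.json-style file, keeping other keys.

    Hypothetical helper for illustration; it is not part of Keras.
    """
    config = {}
    if os.path.exists(config_path):
        with open(config_path) as f:
            config = json.load(f)
    config["backend"] = backend
    with open(config_path, "w") as f:
        json.dump(config, f, indent=4)
    return config


# Demonstrate on a throwaway file instead of the real keras.json.
demo_path = os.path.join(tempfile.mkdtemp(), "keras.json")
with open(demo_path, "w") as f:
    json.dump({"floatx": "float32", "backend": "tensorflow"}, f)

updated = set_keras_backend(demo_path, "theano")
print(updated["backend"])
```

To use it for real, you would pass the path to your own `keras.json` instead of `demo_path`.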

You may also want to install GCC so that Theano can compile optimized code and run faster. On Windows you can install "TDM GCC," making sure to enable OpenMP support during the installation (http://tdm-gcc.tdragon.net/).

We will be importing Keras modules as needed throughout the tutorial.

In [1]:
import numpy as np

## Loading example data

In order to go over the basics of Keras we will have to start by loading a dataset. Keras uses numpy arrays throughout its implementation as inputs and outputs. We will start by loading the Iris dataset, an exceedingly simple dataset described as "perhaps the best known database to be found in pattern recognition literature." The dataset can be found here: `https://archive.ics.uci.edu/ml/datasets/Iris`. The following loads the dataset in the necessary format:

In [19]:
def classConverter(o):
    if o == "Iris-versicolor":
        return 0
    elif o == "Iris-virginica":
        return 1
    else:
        return 2
    
np.random.seed(5)
    
totData = np.loadtxt("iris.data", delimiter=",", converters= {4: classConverter})
#shuffle data so we get good distribution in train/test
np.random.shuffle(totData)
#take last 30 (20 %) as test
trainData = totData[:120,:]
testData = totData[120:,:]
#then have to split attributes and labels
trainX = trainData[:,:4]
trainY = np.array(map(lambda cat: np.array([0.0 if x != cat else 1.0\
                                            for x in range(3)]), trainData[:,4]))
testX = testData[:,:4]
testY = np.array(map(lambda cat: np.array([0.0 if x != cat else 1.0\
                                           for x in range(3)]), testData[:,4]))

print(len(trainX), 'train sequences')
print(len(testX), 'test sequences')

(120, 'train sequences')
(30, 'test sequences')


Since the data and the labels are stored together in the data file, we had to manually split them up. Additionally, we had to shuffle the data so that examples of the same class weren't clustered together in the train/test split.

Another issue with the data is that the labels were categorical strings. Neural networks (and Keras) are unable to handle strings as labels, so we had to convert each label into an array of indicator variables, with a 1 in the position of the true class and 0 elsewhere.
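
This indicator-variable encoding is often called one-hot encoding. As a minimal sketch in plain Python (the helper name `one_hot` is my own, not a Keras function), using the same class numbering as `classConverter`:

```python
def one_hot(label, num_classes):
    """Return a list with 1.0 at position `label` and 0.0 elsewhere."""
    return [1.0 if i == int(label) else 0.0 for i in range(num_classes)]


# Class indices as produced by classConverter: 0, 1, or 2.
labels = [1.0, 0.0, 2.0]
encoded = [one_hot(label, 3) for label in labels]
for row in encoded:
    print(row)
```

Keras also provides `np_utils.to_categorical` for this kind of conversion, which may be more convenient on larger datasets.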

We can get a sense of what our data looks like here, with the left side being the attributes of the data and the right side being the categorical indicator:

In [20]:
for i in xrange(len(testX)):
    print testX[i], testY[i]

[ 6.7  3.1  5.6  2.4] [ 0.  1.  0.]
[ 6.4  3.2  4.5  1.5] [ 1.  0.  0.]
[ 7.6  3.   6.6  2.1] [ 0.  1.  0.]
[ 5.5  3.5  1.3  0.2] [ 0.  0.  1.]
[ 6.5  3.2  5.1  2. ] [ 0.  1.  0.]
[ 5.   3.6  1.4  0.2] [ 0.  0.  1.]
[ 6.9  3.1  5.1  2.3] [ 0.  1.  0.]
[ 5.1  3.5  1.4  0.2] [ 0.  0.  1.]
[ 6.6  2.9  4.6  1.3] [ 1.  0.  0.]
[ 5.4  3.9  1.7  0.4] [ 0.  0.  1.]
[ 6.3  2.9  5.6  1.8] [ 0.  1.  0.]
[ 7.2  3.   5.8  1.6] [ 0.  1.  0.]
[ 4.5  2.3  1.3  0.3] [ 0.  0.  1.]
[ 4.9  2.5  4.5  1.7] [ 0.  1.  0.]
[ 5.6  2.8  4.9  2. ] [ 0.  1.  0.]
[ 7.2  3.2  6.   1.8] [ 0.  1.  0.]
[ 6.7  3.1  4.7  1.5] [ 1.  0.  0.]
[ 4.8  3.1  1.6  0.2] [ 0.  0.  1.]
[ 6.7  3.1  4.4  1.4] [ 1.  0.  0.]
[ 5.1  3.8  1.9  0.4] [ 0.  0.  1.]
[ 5.2  3.5  1.5  0.2] [ 0.  0.  1.]
[ 5.5  2.4  3.8  1.1] [ 1.  0.  0.]
[ 5.7  2.5  5.   2. ] [ 0.  1.  0.]
[ 5.   3.4  1.5  0.2] [ 0.  0.  1.]
[ 6.8  3.   5.5  2.1] [ 0.  1.  0.]
[ 4.4  2.9  1.4  0.2] [ 0.  0.  1.]
[ 6.1  2.8  4.7  1.2] [ 1.  0.  0.]
[ 6.7  3.3  5.7  2.5] [ 0.  

## Defining a model

The core data structure of Keras is a model, and the main type of model is a "Sequential" model. This just means that the layers of our neural network are laid out in a linear stack (other options may involve multiple inputs at different layers or layers that are shared). Using different kinds of layers and different parameters, we are able to build out a neural network like the one described in the introduction. We will start by creating a simple sequential model:

In [21]:
from keras.models import Sequential

model = Sequential()

Now that we have a base model defined we can begin adding layers to it. Layers in Keras are just a representation of the layers in a neural network. As you may have guessed, there are many different types of layers to choose from. Keras groups them into categories such as core layers (dense, activation, flatten, masking, ...), convolutional layers (1D convolutions, cropping, upsampling, ...), normalization layers, etc. Stacking different types of layers onto our model is incredibly easy: you can just use the method `.add()` on your model.

In our Iris example we will just be using the simplest and most classic kind of layer, the `Dense` layer. This is just a fully connected layer, meaning that each node is connected to every single node in the next layer. We will also give our layers additional attributes: `input_dim` (for the first layer) and the activation type. The activation type is just the function each node in the layer uses to produce an output from its inputs.
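
To make "fully connected" concrete, here is a rough sketch in plain Python of what a single dense layer computes: each unit takes a weighted sum of all inputs, adds a bias, and applies the activation function. The function names and the weight values are made up for illustration, and Keras itself does this with matrix operations on the back-end.

```python
def relu(x):
    """Rectified linear unit: pass positive values, clamp negatives to 0."""
    return max(0.0, x)


def dense_forward(inputs, weights, biases, activation):
    """Compute one fully connected layer.

    weights[j][i] is the weight from input i to unit j, so every unit
    sees every input. This layout is chosen for readability only.
    """
    outputs = []
    for unit_weights, bias in zip(weights, biases):
        total = sum(w * x for w, x in zip(unit_weights, inputs)) + bias
        outputs.append(activation(total))
    return outputs


# Four input features feeding two units; the numbers are made up.
features = [5.1, 3.5, 1.4, 0.2]
weights = [[0.1, -0.2, 0.3, 0.0],
           [-0.5, 0.1, 0.0, 0.2]]
biases = [0.0, 0.1]
print(dense_forward(features, weights, biases, relu))
```

The second unit's weighted sum is negative, so `relu` clamps its output to 0.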

Now we add our layers:

In [22]:
from keras.layers import Dense

model.add(Dense(4, input_dim=4, init="normal", activation='relu'))
model.add(Dense(3, init="normal", activation='sigmoid'))

We choose to use a sigmoid activation function for the final layer so that our output values will be between 0 and 1. This lets us interpret them roughly as probabilities and pick the largest one as our predicted category (a softmax activation, whose outputs sum to 1, is another common choice here). We can now configure the model's learning process using the method `.compile()`:

In [23]:
model.compile(loss="categorical_crossentropy", optimizer="sgd", metrics=["accuracy"])

We choose a loss function of `categorical_crossentropy`, also known as multiclass log loss, which is a good fit for our indicator-variable label arrays. There are of course a plethora of other loss functions to choose from depending on your needs and fancy. We also stick to a very standard stochastic gradient descent optimizer, although there is an abundance of other optimizers to pick from.
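
For intuition, the cross-entropy loss for a single example can be sketched in a few lines of plain Python. This is an illustration under my own assumptions (rescaling the predictions to sum to 1 and clipping with a small `eps`), not Keras's actual implementation, which runs on back-end tensors and may differ in details.

```python
import math


def categorical_crossentropy(target, predicted, eps=1e-7):
    """Cross-entropy between a one-hot target and predicted scores.

    `predicted` is rescaled to sum to 1 and clipped away from 0 so the
    logarithm is defined; the eps value is an assumption for this sketch.
    """
    total = sum(predicted)
    probs = [min(max(p / total, eps), 1.0 - eps) for p in predicted]
    return -sum(t * math.log(p) for t, p in zip(target, probs))


target = [0.0, 1.0, 0.0]
confident = [0.05, 0.90, 0.05]
unsure = [0.30, 0.40, 0.30]
print(categorical_crossentropy(target, confident))
print(categorical_crossentropy(target, unsure))
```

The confident, correct prediction gets a much smaller loss than the unsure one, which is the behavior the optimizer exploits during training.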

## Using our model

Now that we have defined our model we can use it. First we want to train our model on our training data:

In [24]:
model.fit(trainX, trainY, nb_epoch=100, batch_size=5)

Epoch 1/100
Epoch 2/100
Epoch 3/100
Epoch 4/100
Epoch 5/100
Epoch 6/100
Epoch 7/100
Epoch 8/100
Epoch 9/100
Epoch 10/100
Epoch 11/100
Epoch 12/100
Epoch 13/100
Epoch 14/100
Epoch 15/100
Epoch 16/100
Epoch 17/100
Epoch 18/100
Epoch 19/100
Epoch 20/100
Epoch 21/100
Epoch 22/100
Epoch 23/100
Epoch 24/100
Epoch 25/100
Epoch 26/100
Epoch 27/100
Epoch 28/100
Epoch 29/100
Epoch 30/100
Epoch 31/100
Epoch 32/100
Epoch 33/100
Epoch 34/100
Epoch 35/100
Epoch 36/100
Epoch 37/100
Epoch 38/100
Epoch 39/100
Epoch 40/100
Epoch 41/100
Epoch 42/100
Epoch 43/100
Epoch 44/100
Epoch 45/100
Epoch 46/100
Epoch 47/100
Epoch 48/100
Epoch 49/100
Epoch 50/100
Epoch 51/100
Epoch 52/100
Epoch 53/100
Epoch 54/100
Epoch 55/100
Epoch 56/100
Epoch 57/100
Epoch 58/100
Epoch 59/100
Epoch 60/100
Epoch 61/100
Epoch 62/100
Epoch 63/100
Epoch 64/100
Epoch 65/100
Epoch 66/100
Epoch 67/100
Epoch 68/100
Epoch 69/100
Epoch 70/100
Epoch 71/100
Epoch 72/100
Epoch 73/100
Epoch 74/100
Epoch 75/100
Epoch 76/100
Epoch 77/100
Epoch 78

<keras.callbacks.History at 0xd47db70>

As we can see, Keras gives us progress bars and accuracy/loss values after each epoch. This is useful to see how the training is going.
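
To see what `nb_epoch=100` and `batch_size=5` mean in terms of work done, here is a small sketch that walks over the training set in mini-batches. The helper name `iterate_minibatches` is hypothetical. Keras also shuffles the samples each epoch by default, which this sketch leaves out.

```python
def iterate_minibatches(num_samples, batch_size):
    """Yield (start, end) index pairs covering the data in order."""
    for start in range(0, num_samples, batch_size):
        yield start, min(start + batch_size, num_samples)


num_samples = 120   # size of trainX above
batch_size = 5
num_epochs = 100

batches_per_epoch = len(list(iterate_minibatches(num_samples, batch_size)))
print(batches_per_epoch)               # 24 batches per epoch
print(batches_per_epoch * num_epochs)  # 2400 weight updates in total
```

Each batch produces one update of the weights, so a smaller batch size means more (but noisier) updates per epoch.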

Now that we have finished training, we should try evaluating how this model does on the 30 examples we withheld from the model to test our accuracy:

In [25]:
perf = model.evaluate(testX, testY)
print "Test loss: ", perf[0]
print "Test accuracy %: ", perf[1]

Test loss:  0.144060507417
Test accuracy %:  1.0


As we can see here, the output of `.evaluate()` has two parts: the loss on the test examples, and the accuracy, which is the fraction of examples it got right. In this case that's 100%! Woohoo! If you're not convinced that we've predicted correctly:

In [26]:
idxToClass = {0: "versicolor", 1:"virginica", 2:"setosa"}
pClasses = model.predict_classes(testX, batch_size=1)
origClasses = map(lambda x: list(x).index(1), testY)
print
for i in xrange(len(pClasses)):
    print "{} predicted: {:<10} actual: {:<10}".format("T" if pClasses[i] == origClasses[i] else "F",
                                                       idxToClass[pClasses[i]],
                                                       idxToClass[origClasses[i]])

 1/30 [>.............................] - ETA: 0s
T predicted: virginica  actual: virginica 
T predicted: versicolor actual: versicolor
T predicted: virginica  actual: virginica 
T predicted: setosa     actual: setosa    
T predicted: virginica  actual: virginica 
T predicted: setosa     actual: setosa    
T predicted: virginica  actual: virginica 
T predicted: setosa     actual: setosa    
T predicted: versicolor actual: versicolor
T predicted: setosa     actual: setosa    
T predicted: virginica  actual: virginica 
T predicted: virginica  actual: virginica 
T predicted: setosa     actual: setosa    
T predicted: virginica  actual: virginica 
T predicted: virginica  actual: virginica 
T predicted: virginica  actual: virginica 
T predicted: versicolor actual: versicolor
T predicted: setosa     actual: setosa    
T predicted: versicolor actual: versicolor
T predicted: setosa     actual: setosa    
T predicted: setosa     actual: setosa    
T predicted: versicolor actual: versicolor
T pre
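
The accuracy figure can also be reproduced by hand: pick the highest-scoring class for each example and count how often it matches the label. The sketch below does this in plain Python on made-up scores (not real model output), with hypothetical helper names `argmax` and `accuracy`.

```python
def argmax(values):
    """Index of the largest value (first one wins on ties)."""
    return max(range(len(values)), key=lambda i: values[i])


def accuracy(predicted_scores, one_hot_targets):
    """Fraction of rows whose highest-scoring class matches the target."""
    correct = sum(
        1 for scores, target in zip(predicted_scores, one_hot_targets)
        if argmax(scores) == argmax(target)
    )
    return correct / float(len(one_hot_targets))


# Made-up network outputs for three flowers, not real model output.
scores = [[0.1, 0.8, 0.3], [0.7, 0.2, 0.1], [0.2, 0.6, 0.9]]
targets = [[0.0, 1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
print(accuracy(scores, targets))
```

Here two of the three predictions match their labels, so the accuracy is two thirds.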

## Example application: IMDB Movie reviews sentiment classification

<!-- Small image classification https://keras.io/datasets/ -->

As an example that delves further into different types of Keras layers and settings, we are going to use the IMDB movie review sentiment classification dataset included with Keras. The original network we're adapting from can be found in the references below.

Before we begin, we will import all of the necessary utilities we will need:

In [27]:
from keras.preprocessing import sequence
from keras.utils import np_utils
from keras.models import Sequential
from keras.layers import Dense, Dropout, Activation, Embedding
from keras.layers import LSTM, SimpleRNN, GRU
from keras.datasets import imdb

Now we define some constants we will be using later and load our data into variables. We set the `nb_words` argument when loading the IMDB dataset to specify that we only want to consider the `max_features` most frequent words. We also set a seed so that we get the same data every time.

In [28]:
max_features = 20000
maxlen = 80  # cut texts after this number of words (among top max_features most common words)
batch_size = 32
(X_train, y_train), (X_test, y_test) = imdb.load_data(path="imdb_full.pkl",
                                                      nb_words=max_features,
                                                      seed=388)
print(len(X_train), 'train sequences')
print(len(X_test), 'test sequences')

(25000, 'train sequences')
(25000, 'test sequences')


We then preprocess our data using the Keras sequence preprocessing library to cut each example's text to at most `maxlen` words (shorter examples are padded to that length). This shortens the data and will allow us to train faster on the most "relevant" (frequent) words.

In [29]:
print('Pad sequences (samples x time)')
X_train = sequence.pad_sequences(X_train, maxlen=maxlen)
X_test = sequence.pad_sequences(X_test, maxlen=maxlen)
print('X_train shape:', X_train.shape)
print('X_test shape:', X_test.shape)

Pad sequences (samples x time)
('X_train shape:', (25000L, 80L))
('X_test shape:', (25000L, 80L))
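
The effect of `sequence.pad_sequences` on a single review can be sketched in plain Python. This assumes what I understand to be its defaults (padding and truncating at the front, with 0 as the fill value). The helper `pad_or_truncate` and the word indices are made up for illustration.

```python
def pad_or_truncate(seq, maxlen, value=0):
    """Force `seq` to exactly `maxlen` items.

    Long sequences keep their last `maxlen` items and short ones are
    padded on the left, mirroring what I understand to be the defaults
    of `sequence.pad_sequences` (padding='pre', truncating='pre').
    """
    seq = list(seq)
    if len(seq) > maxlen:
        seq = seq[len(seq) - maxlen:]
    return [value] * (maxlen - len(seq)) + seq


reviews = [[11, 12, 13, 14, 15, 16], [21, 22]]
for review in reviews:
    print(pad_or_truncate(review, 4))
```

Every review ends up with the same length, which is what allows them to be stacked into a single 2D array like the `(25000, 80)` shapes printed above.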


Now that we've processed our data, we can begin building our model:

In [30]:
model = Sequential()
model.add(Embedding(max_features, 128, dropout=0.2))
model.add(LSTM(128, dropout_W=0.2, dropout_U=0.2))  # try using a GRU instead, for fun
model.add(Dense(1))
model.add(Activation('sigmoid'))
model.compile(loss='binary_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])

This model uses several layer types we have not seen before; more detailed information can be found in the Keras documentation pages. The Embedding layer creates a sort of map from the words in the dataset to some continuous vector space. This is a natural language processing trick that is meant to help with text learning. We then use an LSTM layer, which stands for Long Short-Term Memory. This layer type was proposed by Sepp Hochreiter and Jürgen Schmidhuber in 1997 and is considered well suited to learning to classify things when there are very long time lags of unknown size between important events. How this helps learning on this dataset is left as an exercise to the reader. As you can see, there are very complicated layers built upon amazing research that you can use by writing one simple line in Keras.
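
Conceptually, an embedding is a lookup table with one vector per word index. The sketch below illustrates that idea in plain Python with random vectors. In a real network the vectors are trainable weights, and the names `make_embedding` and `embed` are hypothetical.

```python
import random


def make_embedding(vocab_size, dim, seed=0):
    """Create a lookup table with one random vector per word index.

    In a real network these vectors are trainable weights; here they
    are just random numbers for illustration.
    """
    rng = random.Random(seed)
    return [[rng.uniform(-0.05, 0.05) for _ in range(dim)]
            for _ in range(vocab_size)]


def embed(table, word_indices):
    """Replace each integer word index with its vector from the table."""
    return [table[i] for i in word_indices]


table = make_embedding(vocab_size=10, dim=4)
sentence = [3, 7, 3]                  # hypothetical word indices
vectors = embed(table, sentence)
print(len(vectors))     # one vector per word
print(len(vectors[0]))  # each vector has `dim` numbers
```

Note that the same word index always maps to the same vector, so repeated words share a representation.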

Next we have a dense layer like we've seen before, which feeds into an Activation layer that applies the sigmoid function, mapping the output to a float between 0 and 1.

We also can define different loss functions and optimizers. In this case we use `binary_crossentropy` as a loss function, also known as logloss. We also use a different type of optimizer now, `adam`, which is just another method of stochastic optimization. We see again how Keras has provided such power and complexity at our fingertips - we can prototype with awesome speed.
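
For the curious, the update rule behind `adam` can be sketched on a toy one-variable problem. This is an illustration only: the function name `adam_minimize` is hypothetical, the `beta1`, `beta2`, and `eps` values are the defaults suggested in the Adam paper, and the learning rate is picked for this toy problem rather than taken from Keras.

```python
import math


def adam_minimize(grad, x, steps=500, lr=0.1,
                  beta1=0.9, beta2=0.999, eps=1e-8):
    """Minimize a one-variable function with the Adam update rule.

    `grad` returns the derivative at x. The beta and eps values are the
    defaults suggested in the Adam paper; lr is chosen for this toy problem.
    """
    m = 0.0  # running average of gradients
    v = 0.0  # running average of squared gradients
    for t in range(1, steps + 1):
        g = grad(x)
        m = beta1 * m + (1.0 - beta1) * g
        v = beta2 * v + (1.0 - beta2) * g * g
        m_hat = m / (1.0 - beta1 ** t)   # bias correction
        v_hat = v / (1.0 - beta2 ** t)
        x -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return x


# Toy problem: f(x) = (x - 3)^2 has its minimum at x = 3.
result = adam_minimize(lambda x: 2.0 * (x - 3.0), x=0.0)
print(result)
```

Compared with plain gradient descent, the step size here adapts to the recent history of the gradients, which is often why `adam` needs less tuning.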

We then train our model:

In [31]:
model.fit(X_train, y_train,
          batch_size=batch_size, nb_epoch=15,
          validation_data=(X_test, y_test))

Train on 25000 samples, validate on 25000 samples
Epoch 1/15
Epoch 2/15
Epoch 3/15
Epoch 4/15
Epoch 5/15
Epoch 6/15
Epoch 7/15
Epoch 8/15
Epoch 9/15
Epoch 10/15
Epoch 11/15
Epoch 12/15
Epoch 13/15
Epoch 14/15
Epoch 15/15


<keras.callbacks.History at 0x1ee88c88>

and evaluate its accuracy:

In [32]:
score, acc = model.evaluate(X_test, y_test,
                            batch_size=batch_size)
print('Test loss:', score)
print('Test accuracy %:', acc)

('Test loss:', 0.64337649493694304)
('Test accuracy %:', 0.81484000000000001)


With this model we have achieved 81% accuracy on the test dataset and 97% accuracy on the training dataset. If we look at the above output, epoch 3 actually had the best test accuracy. This suggests we might be overfitting the model to the training data, so we can try making some changes to the model to avoid this. The following is a proposed alternate model to reduce overfitting:

In [33]:
from keras.layers import GaussianNoise

model = Sequential()
model.add(Embedding(max_features, 128, dropout=0.25))
model.add(GaussianNoise(0.2))
model.add(GRU(128, dropout_W=0.25, dropout_U=0.25))
model.add(Dense(1))
model.add(Activation('sigmoid'))
model.compile(loss='binary_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])
model.fit(X_train, y_train,
          batch_size=batch_size, nb_epoch=4,
          validation_data=(X_test, y_test))
score, acc = model.evaluate(X_test, y_test,
                            batch_size=batch_size)
print('Test loss:', score)
print('Test accuracy %:', acc)

Train on 25000 samples, validate on 25000 samples
Epoch 1/4
Epoch 2/4
Epoch 3/4
Epoch 4/4
('Test loss:', 0.34797629148483278)
('Test accuracy %:', 0.84560000000000002)


We seem to have increased our accuracy on the testing dataset to 85%! We made several changes to our model. The `GaussianNoise` layer adds zero-centered Gaussian noise to its input during training, with the standard deviation given by its parameter. This means we don't learn the training data too "exactly". We also changed the LSTM to a GRU, or Gated Recurrent Unit, which you can read about here (https://arxiv.org/pdf/1412.3555v1.pdf). This is another type of layer that has been shown to work well with this sort of data. Finally, we increased the dropout rates slightly and trained for 4 epochs instead of 15 to further mitigate overfitting.
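
The idea behind the noise layer can be sketched in plain Python: during training each value gets a random perturbation, and at test time the input passes through untouched. The function below is a simplified illustration with a hypothetical name, not the Keras layer itself.

```python
import random


def gaussian_noise(values, stddev, training=True, rng=random):
    """Add zero-centered Gaussian noise to each value during training.

    At test time (training=False) the input passes through unchanged.
    """
    if not training:
        return list(values)
    return [v + rng.gauss(0.0, stddev) for v in values]


rng = random.Random(0)  # fixed seed so the sketch is repeatable
activations = [0.5, -1.2, 0.0, 2.0]
print(gaussian_noise(activations, 0.2, training=True, rng=rng))
print(gaussian_noise(activations, 0.2, training=False))
```

Because the noise differs on every pass, the network cannot rely on the exact values of any single training example.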

I'm no neural network expert, but I was able to squeeze out 3% more accuracy on the test data with my changes. The takeaway here is that it is incredibly easy to grab and drop different layers into our network to try things out in search of a better network. With something as mathematically and representationally powerful as neural networks, whose complexity often defies human understanding, it is very useful to have a tool that allows us to try different networks efficiently and with very little pain.

## Example Application: Diabetes in Pima Indians

The code for this example is adapted from the Pima Indians reference listed at the end. We now show another example: classifying whether patients have an onset of diabetes based on several variables, including number of times pregnant, triceps skin fold thickness, BMI, and others. We begin by importing our required model and layers and loading the dataset:

In [34]:
from keras.models import Sequential
from keras.layers import Dense
import numpy

# load dataset
dataset = numpy.loadtxt("pima-indians-diabetes.data", delimiter=",")
X = dataset[:,0:8]
Y = dataset[:,8]

We then define and compile a simple model with just 3 layers of densely connected units with different activation functions:

In [35]:
# start from a fresh model so we don't stack onto the IMDB network above
model = Sequential()
model.add(Dense(12, input_dim=8, init='uniform', activation='relu'))
model.add(Dense(8, init='uniform', activation='relu'))
model.add(Dense(1, init='uniform', activation='sigmoid'))

model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

We can then fit our model on the dataset, but also use an optional parameter to split our data automatically into training and validation:

In [36]:
model.fit(X, Y, validation_split=0.25, nb_epoch=150, batch_size=10)

Train on 576 samples, validate on 192 samples
Epoch 1/150
Epoch 2/150
Epoch 3/150
Epoch 4/150
Epoch 5/150
Epoch 6/150
Epoch 7/150
Epoch 8/150
Epoch 9/150
Epoch 10/150
Epoch 11/150
Epoch 12/150
Epoch 13/150
Epoch 14/150
Epoch 15/150
Epoch 16/150
Epoch 17/150
Epoch 18/150
Epoch 19/150
Epoch 20/150
Epoch 21/150
Epoch 22/150
Epoch 23/150
Epoch 24/150
Epoch 25/150
Epoch 26/150
Epoch 27/150
Epoch 28/150
Epoch 29/150
Epoch 30/150
Epoch 31/150
Epoch 32/150
Epoch 33/150
Epoch 34/150
Epoch 35/150
Epoch 36/150
Epoch 37/150
Epoch 38/150
Epoch 39/150
Epoch 40/150
Epoch 41/150
Epoch 42/150
Epoch 43/150
Epoch 44/150
Epoch 45/150
Epoch 46/150
Epoch 47/150
Epoch 48/150
Epoch 49/150
Epoch 50/150
Epoch 51/150
Epoch 52/150
Epoch 53/150
Epoch 54/150
Epoch 55/150
Epoch 56/150
Epoch 57/150
Epoch 58/150
Epoch 59/150
Epoch 60/150
Epoch 61/150
Epoch 62/150
Epoch 63/150
Epoch 64/150
Epoch 65/150
Epoch 66/150
Epoch 67/150
Epoch 68/150
Epoch 69/150
Epoch 70/150
Epoch 71/150
Epoch 72/150
Epoch 73/150
Epoch 74/150
E

<keras.callbacks.History at 0x20f0e4a8>
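
The `validation_split=0.25` argument used above can be pictured as a simple slice. The sketch below assumes, as I understand Keras to behave, that the held-out samples are taken from the end of the data without shuffling first. The helper name is hypothetical and the rows are stand-ins for the real dataset.

```python
def split_train_validation(samples, validation_split):
    """Hold out the last `validation_split` fraction of the samples.

    No shuffling is done before the split, which is my understanding
    of how Keras treats `validation_split`.
    """
    split_at = int(len(samples) * (1.0 - validation_split))
    return samples[:split_at], samples[split_at:]


rows = list(range(768))  # stand-in for the 768 rows of the dataset
train_rows, val_rows = split_train_validation(rows, 0.25)
print(len(train_rows))  # 576
print(len(val_rows))    # 192
```

These sizes match the "Train on 576 samples, validate on 192 samples" line in the output. If your data file is sorted in some way, it may be worth shuffling it yourself before relying on this kind of split.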

We can also apply techniques we've learned in class, such as k-fold cross validation. A quick addition from scikit-learn called `StratifiedKFold` helps with this. As you can see, we can very simply add and remove additional complexity and steps on top of all of our Keras models without a care in the world:

In [37]:
from sklearn.model_selection import StratifiedKFold

kfold = StratifiedKFold(n_splits=10, shuffle=True)
cvscores = []
for train, test in kfold.split(X, Y):
    # create model
    model = Sequential()
    model.add(Dense(12, input_dim=8, init='uniform', activation='relu'))
    model.add(Dense(8, init='uniform', activation='relu'))
    model.add(Dense(1, init='uniform', activation='sigmoid'))
    # Compile model
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    # Fit the model
    model.fit(X[train], Y[train], nb_epoch=150, batch_size=10, verbose=0)
    # evaluate the model
    scores = model.evaluate(X[test], Y[test], verbose=0)
    print("%s: %.2f%%" % (model.metrics_names[1], scores[1]*100))
    cvscores.append(scores[1] * 100)
 
print("%.2f%% (+/- %.2f%%)" % (numpy.mean(cvscores), numpy.std(cvscores)))

acc: 75.32%
acc: 74.03%
acc: 76.62%
acc: 83.12%
acc: 80.52%
acc: 62.34%
acc: 71.43%
acc: 80.52%
acc: 82.89%
acc: 65.79%
75.26% (+/- 6.71%)
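
"Stratified" means each fold keeps roughly the same mix of classes as the full dataset. The sketch below shows one simple way to get that property in plain Python. It is a simplified stand-in with a hypothetical function name, not how scikit-learn's `StratifiedKFold` is actually implemented.

```python
import random
from collections import defaultdict


def stratified_folds(labels, n_splits, seed=0):
    """Assign sample indices to folds so each fold has a similar class mix.

    Indices of each class are shuffled, then dealt out to the folds in
    turn. A simplified stand-in for scikit-learn's StratifiedKFold.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    folds = [[] for _ in range(n_splits)]
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        for position, index in enumerate(indices):
            folds[position % n_splits].append(index)
    return folds


# 20 negatives and 10 positives, split into 5 folds.
labels = [0] * 20 + [1] * 10
for fold in stratified_folds(labels, n_splits=5):
    positives = sum(labels[i] for i in fold)
    print(len(fold), positives)
```

Every fold ends up with 6 samples, 2 of them positive, so each test fold has the same one-in-three positive rate as the whole toy dataset.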


As we can see, it is incredibly easy to make and modify Keras models, as well as build additional infrastructure and techniques on top of them. You can find more information about neural networks and Keras in the links below. I hope you have had as much fun and learned as much reading this tutorial as I did making it.

## Additional Resources

- Keras tutorial videos: https://www.youtube.com/playlist?list=PLFxrZqbLojdKuK7Lm6uamegEFGW2wki6P
- Keras examples: https://github.com/fchollet/keras/tree/master/examples
- Neural Network reading: https://en.wikipedia.org/wiki/Artificial_neural_network
- Additional NN reading: http://www.cs.cmu.edu/~epxing/Class/10701-10s/Lecture/lecture7.pdf
- Old CMU Neural Net class: https://www.cs.cmu.edu/afs/cs/academic/class/15782-f06/

## Summary and References

- Keras: https://keras.io/
- Iris Dataset: https://archive.ics.uci.edu/ml/datasets/Iris
- IMDB Example: https://github.com/fchollet/keras/blob/master/examples/imdb_lstm.py
- IMDB Alternate: https://github.com/fchollet/keras/blob/master/examples/imdb_cnn_lstm.py
- Pima Indians Example: http://machinelearningmastery.com/tutorial-first-neural-network-python-keras/
- Model Evaluation: http://machinelearningmastery.com/evaluate-performance-deep-learning-models-keras/