# Notebook 3: Neural networks with Keras

On to the most exciting machine learning technique: neural networks. As you will see, they are just as easy to use as the sklearn methods.

There are several deep learning libraries for Python. The three most commonly used are:

- Keras: High-level library based on TensorFlow (or other backends) that is easy to use and flexible enough for most standard users. It has great documentation and online support.
- TensorFlow: Google's neural network library. Most widely used in ML research. Flexible and powerful but also (unnecessarily?) complicated.
- PyTorch: The newcomer developed by Facebook. Flexible like TensorFlow but with a nicer, more Pythonic API.

Here we will use Keras, which is a great start for most tasks.

In [1]:
import pickle
import matplotlib.pyplot as plt
%matplotlib inline
import numpy as np
import pandas as pd
import seaborn as sns

In [2]:
DATADIR = './dataset1/'

In [3]:
# Copy from previous notebook
def create_sub(preds, fn=None):
    df =  pd.DataFrame({'Id': range(len(preds)), 'Expected': preds})
    if fn is not None: df.to_csv(DATADIR + fn, index=False)
    return df

In [4]:
# Load the preprocessed data
with open('./tmp/preproc_data.pkl', 'rb') as f:
    X_train, y_train, X_valid, y_valid, X_test = pickle.load(f)
with open('./tmp/dfs.pkl', 'rb') as f:
    df_train, df_test = pickle.load(f)

## Linear regression: The Keras way

First, let's build our own linear regression algorithm with stochastic gradient descent. This should give us the same solution as the sklearn linear regression we did in the last notebook.

In [5]:
from keras.models import Sequential
from keras.layers import *
from keras.optimizers import SGD, Adam

Using TensorFlow backend.


In [6]:
X_train.shape

(728008, 22)

There are two ways to build a model in Keras. We will start with the easier one, a Sequential model, which is simply a succession of layers. For linear regression we only have one linear layer that maps the inputs to the outputs. Layers where all inputs are connected to all outputs are called fully-connected or Dense layers.

In [7]:
ln = Sequential([Dense(1, input_shape=(22,), activation='linear')])

In [8]:
ln.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_1 (Dense)              (None, 1)                 23        
Total params: 23
Trainable params: 23
Non-trainable params: 0
_________________________________________________________________


We can see that the model has 23 parameters, 22 coefficients plus one bias term.
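The count follows directly from the layer's shape: a Dense layer stores one weight per input-output pair plus one bias per output. Here is a minimal sketch of that arithmetic. The helper name `dense_param_count` is made up for illustration and is not part of Keras.

```python
def dense_param_count(n_inputs, n_outputs, use_bias=True):
    """Number of trainable parameters in a fully-connected layer."""
    weights = n_inputs * n_outputs          # one weight per input-output pair
    biases = n_outputs if use_bias else 0   # one bias per output node
    return weights + biases

print(dense_param_count(22, 1))  # 23
```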

Next we need to compile the model, which basically means telling Keras which optimizer to use and which loss function to minimize. The weights themselves were already randomly initialized when the layer was built; with the default settings the biases start at zero.

We will use the Adam optimizer, a variant of SGD that keeps running averages of the gradient and its square to adapt the step size for each parameter: https://arxiv.org/abs/1412.6980
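To make the idea concrete, here is a rough NumPy sketch of the Adam update rule applied to linear regression with an MSE loss. It is not what Keras runs internally: it uses the full dataset for every step instead of mini-batches, the data is synthetic, and the function name `adam_linear_regression` is invented for this example. The default values for `beta1`, `beta2` and `eps` are the ones suggested in the paper.

```python
import numpy as np

def adam_linear_regression(X, y, lr=0.1, n_steps=500,
                           beta1=0.9, beta2=0.999, eps=1e-8):
    """Fit y ~ X @ w + b by minimizing the MSE with full-batch Adam updates."""
    n_samples, n_features = X.shape
    theta = np.zeros(n_features + 1)   # weights followed by the bias
    m = np.zeros_like(theta)           # running average of the gradient
    v = np.zeros_like(theta)           # running average of the squared gradient
    Xb = np.hstack([X, np.ones((n_samples, 1))])  # column of ones for the bias
    for t in range(1, n_steps + 1):
        residual = Xb @ theta - y
        grad = 2.0 / n_samples * Xb.T @ residual  # gradient of the MSE
        m = beta1 * m + (1 - beta1) * grad
        v = beta2 * v + (1 - beta2) * grad ** 2
        m_hat = m / (1 - beta1 ** t)              # bias correction for early steps
        v_hat = v / (1 - beta2 ** t)
        theta -= lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta[:-1], theta[-1]

# Synthetic, noise-free data with known coefficients (purely illustrative)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(1000, 3))
w_true, b_true = np.array([2.0, -1.0, 0.5]), 0.3
y_demo = X_demo @ w_true + b_true

w_fit, b_fit = adam_linear_regression(X_demo, y_demo)
print(w_fit, b_fit)
```

The fitted values should end up close to `w_true` and `b_true`. Dividing by the square root of `v_hat` is what gives every parameter its own effective learning rate.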

In [9]:
ln.compile(Adam(1e-1), 'mse')

Now we can train/fit the model. For this we specify the training data, the batch size, and the number of epochs. Optionally, we can pass the validation data so that we get a validation score after every epoch.
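Batch size and epochs together determine how many gradient updates the model sees. One epoch is one pass over the training data, split into batches. A small sketch of that bookkeeping, with a hypothetical helper `steps_per_epoch`:

```python
import math

def steps_per_epoch(n_samples, batch_size):
    """Gradient updates in one pass over the data (the last batch may be smaller)."""
    return math.ceil(n_samples / batch_size)

n_train = 728008  # number of training samples, from X_train.shape above

print(steps_per_epoch(n_train, 10_000))       # 73
print(12 * steps_per_epoch(n_train, 10_000))  # 876 updates over 12 epochs
```

A smaller batch size means more, but noisier, updates per epoch.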

In [10]:
ln.fit(X_train, y_train, 10_000, epochs=12, validation_data=(X_valid, y_valid))

Train on 728008 samples, validate on 180849 samples
Epoch 1/12
Epoch 2/12
Epoch 3/12
Epoch 4/12
Epoch 5/12
Epoch 6/12
Epoch 7/12
Epoch 8/12
Epoch 9/12
Epoch 10/12
Epoch 11/12
Epoch 12/12


<keras.callbacks.History at 0x21517677080>

In [11]:
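from sklearn.metrics import r2_score
def print_scores(m, X_train=X_train, X_valid=X_valid):
    """Print train/validation R2 and the validation loss (MSE) of a compiled model."""
    # Predict in large batches to speed things up
    preds = m.predict(X_valid, batch_size=10000)
    print('Train R2 = ', r2_score(y_train, m.predict(X_train, batch_size=10000)), 
          ', Valid R2 = ', r2_score(y_valid, preds), ', Valid MSE = ', 
          m.evaluate(X_valid, y_valid, batch_size=10000, verbose=0))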

In [12]:
print_scores(ln)

Train R2 =  0.931859979389 , Valid R2 =  0.918576221325 , Valid MSE =  3.24872250179
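For reference, the two metrics printed here are closely related: R2 is one minus the MSE divided by the variance of the targets. A small NumPy sketch of both for one-dimensional targets (the function names `mse` and `r2` are chosen for this example; the notebook itself uses `r2_score` from sklearn and the Keras loss):

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.mean((y_true - y_pred) ** 2)

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - MSE / variance of the targets."""
    y_true = np.asarray(y_true, dtype=float)
    return 1.0 - mse(y_true, y_pred) / np.var(y_true)

y_demo = [1.0, 2.0, 3.0, 4.0]       # toy targets
pred_demo = [1.1, 1.9, 3.2, 3.8]    # toy predictions
print(mse(y_demo, pred_demo), r2(y_demo, pred_demo))
```

So a lower MSE on the same data always means a higher R2.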


Recall that with the sklearn linear regression algorithm we got a score of around 3.24, which matches the validation MSE here. So the two approaches arrive at pretty much the same solution. But we can do better!

## Neural network with a hidden layer

Now let's actually build a neural network with a hidden layer. For this we simply add another Dense layer, but this time with a non-linear activation function. In our case this is a Rectified Linear Unit (ReLU). There is no set rule for how many hidden layers or nodes to use; this comes down to trial and error.
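What such a network computes can be written out in a few lines of NumPy. This is only a sketch of the forward pass with randomly chosen demo weights, not the Keras implementation, and all names in it are made up for illustration:

```python
import numpy as np

def relu(z):
    """Rectified Linear Unit: negative values are set to zero."""
    return np.maximum(z, 0.0)

def one_hidden_layer_forward(x, W1, b1, W2, b2):
    """Forward pass: Dense + ReLU hidden layer, then a linear output layer."""
    hidden = relu(x @ W1 + b1)
    return hidden @ W2 + b2

rng = np.random.default_rng(0)
x_demo = rng.normal(size=(5, 22))                 # 5 samples, 22 features
W1, b1 = rng.normal(size=(22, 256)) * 0.1, np.zeros(256)
W2, b2 = rng.normal(size=(256, 1)) * 0.1, np.zeros(1)

print(relu(np.array([-1.0, 0.0, 2.0])))                          # [0. 0. 2.]
print(one_hidden_layer_forward(x_demo, W1, b1, W2, b2).shape)    # (5, 1)
```

Without the ReLU the two layers would collapse into a single linear map, which is just linear regression again.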

BTW: You will often see people using powers of two for the batch size or the number of nodes. This is for optimization purposes on the GPU, which are not crucial in our case.

In [13]:
nn = Sequential([
    Dense(256, input_shape=(22,), activation='relu'),
    Dense(1, activation='linear')
])

In [14]:
nn.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_2 (Dense)              (None, 256)               5888      
_________________________________________________________________
dense_3 (Dense)              (None, 1)                 257       
Total params: 6,145
Trainable params: 6,145
Non-trainable params: 0
_________________________________________________________________


In [15]:
nn.compile(Adam(1e-3), 'mse')

In [16]:
nn.fit(X_train, y_train, 1024, epochs=12, validation_data=(X_valid, y_valid))

Train on 728008 samples, validate on 180849 samples
Epoch 1/12
Epoch 2/12
Epoch 3/12
Epoch 4/12
Epoch 5/12
Epoch 6/12
Epoch 7/12
Epoch 8/12
Epoch 9/12
Epoch 10/12
Epoch 11/12
Epoch 12/12


<keras.callbacks.History at 0x21517799b70>

In [17]:
print_scores(nn)

Train R2 =  0.947471880236 , Valid R2 =  0.927105055996 , Valid MSE =  2.90843115181


*Some notes on the results*

- We see that we are quite heavily overfitting. This makes sense in a model with 6k parameters. There are techniques to prevent overfitting in neural networks, most notably dropout and weight decay (L2 regularization). We will cover those later. For some reason, I found those techniques not to work for this model and dataset. Can you figure out why?
- The score is quite significantly better than our best single random forest model. This suggests that the nonlinear computing power of the neural network is useful.
- This is not yet a deep neural network. DNNs have several hidden layers. Again, I found that for this dataset a DNN didn't perform better. This might be because the nonlinearities are not very strong, but feel free to prove me wrong.
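To give a rough idea of the two regularization techniques mentioned in the notes, here is a NumPy sketch of (inverted) dropout and an L2 penalty term. The function names are invented for this example, and it is meant as an illustration of the concepts rather than the Keras layers and regularizers:

```python
import numpy as np

def dropout(activations, rate, rng, training=True):
    """Inverted dropout: zero a random fraction `rate` of units, rescale the rest."""
    if not training or rate == 0.0:
        return activations                      # no dropout at prediction time
    keep_mask = rng.random(activations.shape) >= rate
    return activations * keep_mask / (1.0 - rate)

def l2_penalty(weights, strength):
    """Term added to the loss that penalizes large weights."""
    return strength * np.sum(weights ** 2)

rng = np.random.default_rng(0)
h_demo = np.ones((4, 8))                        # pretend hidden-layer activations
print(dropout(h_demo, rate=0.5, rng=rng))       # entries are either 0 or 2
print(l2_penalty(np.array([1.0, -2.0, 3.0]), strength=0.01))
```

Both make it harder for the network to rely on a few large weights or individual nodes, which is why they usually reduce overfitting.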

## ADVANCED TECHNIQUE ALERT: Embeddings

We saw in the previous notebook that the station information might be really important. We could train a separate neural network for each station but that would reduce the amount of training data for each individual model and probably lead to stronger overfitting. 

Here is a different method of using categorical variables in neural networks. Don't worry if you don't understand embeddings right away. The concept is easy but it takes a while to wrap your head around it (it did for me anyway). 

First we need to map the station identifiers to consecutive integer IDs, starting at zero.

In [18]:
split_date = '2015-01-01'

In [19]:
stations = df_train.station
stations_train = df_train.station[df_train.time < split_date]
stations_valid = df_train.station[df_train.time >= split_date]
stations_test = df_test.station

In [20]:
unique_stations = pd.concat([df_train.station, df_test.station]).unique()

In [21]:
stat2id = {s: i for i, s in enumerate(unique_stations)}

In [22]:
ids = stations.apply(lambda x: stat2id[x])

In [23]:
ids_train = ids[df_train.time < split_date]
ids_valid = ids[df_train.time >= split_date]
ids_test = stations_test.apply(lambda x: stat2id[x])
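The mapping built above is easiest to see on a tiny example. The station codes below are made up; the point is only that each distinct code gets an integer ID in order of first appearance, which is what the embedding layer needs as an index.

```python
station_codes = [10004, 10007, 10004, 10015, 10007]  # made-up station codes

# Distinct codes in order of first appearance (like pandas' unique())
unique_codes = list(dict.fromkeys(station_codes))
code_to_id = {code: i for i, code in enumerate(unique_codes)}
ids_demo = [code_to_id[code] for code in station_codes]

print(code_to_id)  # {10004: 0, 10007: 1, 10015: 2}
print(ids_demo)    # [0, 1, 0, 2, 1]
```

Building the dictionary from the train and test stations together, as done above, ensures that every station in the test set has an ID.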

Now we will have to use Keras's functional API. Check out the official documentation.

We now have two separate inputs: our regular features and the station ID.

An embedding is a mapping from an integer to a vector of real numbers. In our case the vector has length two. The elements are also called latent features. These are then concatenated with the regular features and passed through one hidden layer as before.

The latent features are updated along with the weights and biases during training and can now represent station-specific information.
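Under the hood an embedding is nothing more than a lookup table with one row per station. Here is a NumPy sketch of the lookup and the concatenation. The table values are random and the names are made up for this example. The station count of 522 is inferred from the 1044 embedding parameters that `model.summary()` reports in this notebook (1044 / 2).

```python
import numpy as np

n_stations, emb_size = 522, 2   # inferred from the 1044 embedding parameters
rng = np.random.default_rng(0)

# One row of latent features per station; in Keras these rows are trained
embedding_table = rng.uniform(-0.05, 0.05, size=(n_stations, emb_size))

def embed_and_concat(features, station_ids, table):
    """Look up the latent features for each sample and append them to the features."""
    latent = table[station_ids]                  # shape: (n_samples, emb_size)
    return np.concatenate([features, latent], axis=1)

features_demo = rng.normal(size=(4, 22))         # 4 samples, 22 regular features
ids_demo = np.array([0, 7, 7, 521])              # integer station IDs
hidden_input = embed_and_concat(features_demo, ids_demo, embedding_table)
print(hidden_input.shape)  # (4, 24)
```

Samples from the same station (the two rows with ID 7 here) always get the same two latent features, so the hidden layer effectively sees 24 inputs instead of 22.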

In [24]:
from keras.models import Model

In [25]:
features_in = Input(shape=(22,))
id_in = Input(shape=(1,))
emb = Embedding(len(unique_stations), 2)(id_in)
emb = Flatten()(emb)
x = Concatenate()([features_in, emb])
x = Dense(100, activation='relu')(x)
out = Dense(1, activation='linear')(x)
model = Model(inputs=[features_in, id_in], outputs=out)

In [26]:
model.summary()

__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_2 (InputLayer)            (None, 1)            0                                            
__________________________________________________________________________________________________
embedding_1 (Embedding)         (None, 1, 2)         1044        input_2[0][0]                    
__________________________________________________________________________________________________
input_1 (InputLayer)            (None, 22)           0                                            
__________________________________________________________________________________________________
flatten_1 (Flatten)             (None, 2)            0           embedding_1[0][0]                
__________________________________________________________________________________________________
concatenat

In [27]:
model.compile('adam', 'mse')

In [28]:
model.fit([X_train, ids_train], y_train, 1024, 10, 
          validation_data=([X_valid, ids_valid], y_valid))

Train on 728008 samples, validate on 180849 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x2151b033240>

In [29]:
model.fit([X_train, ids_train], y_train, 1024, 10, 
          validation_data=([X_valid, ids_valid], y_valid))

Train on 728008 samples, validate on 180849 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x21518331198>

In [30]:
print_scores(model, X_train=[X_train, ids_train], X_valid=[X_valid, ids_valid])

Train R2 =  0.955964364947 , Valid R2 =  0.938262241398 , Valid MSE =  2.46327159392


In [31]:
# Submit to Kaggle
df_sub = create_sub(model.predict([X_test, ids_test], 10000).squeeze(), 'nn_emb.csv'); df_sub.head()

   Id  Expected
0   0  4.093746
1   1  1.777899
2   2  0.641055
3   3  3.240536
4   4  1.727655


This technique allows us to build a single model that incorporates station information and gives us the best score. Yay!

But can you do better?

## Your turn

1. As we have seen, there are now a lot of hyperparameters. Try playing around with them and get the best score. How do the parameters influence the skill?
2. Try an ensemble of techniques. This means training a few models (can be several NNs or some NNs with some RFs) and averaging the predictions. This is also a way to prevent overfitting and might just increase your score ;)
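As a starting point for the second exercise, here is a small sketch of prediction averaging with NumPy. The "models" are simulated by adding independent random noise to a synthetic truth, so the numbers are purely illustrative, and `ensemble_mean` is a made-up helper name.

```python
import numpy as np

def ensemble_mean(predictions):
    """Average a list of prediction arrays that all have the same shape."""
    return np.mean(np.stack(predictions, axis=0), axis=0)

rng = np.random.default_rng(0)
y_true = rng.normal(size=10_000)                   # synthetic targets
# Five pretend models whose errors are independent of each other
member_preds = [y_true + rng.normal(scale=1.0, size=y_true.shape) for _ in range(5)]

member_mse = [np.mean((p - y_true) ** 2) for p in member_preds]
ensemble_mse = np.mean((ensemble_mean(member_preds) - y_true) ** 2)
print(member_mse, ensemble_mse)
```

With real models the errors are correlated, so the gain will be smaller than in this idealized case. Also remember that Keras predictions have shape `(n, 1)`, so apply `.squeeze()` as in the submission cell before averaging them with random forest predictions.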