# Introduction

Here I build on the supervised learning techniques and data preparation shown in the previous notebooks by fitting the data to both a Convolutional Neural Network (CNN) and a Long Short-Term Memory network (LSTM, a type of recurrent neural network).

Both of these deep learning models are applied to the classification problem, so the outcome variable is the Speed Index that was predicted in the previous notebook.

As you'll see, getting a convolutional neural network to work on the two-dimensional data I fed into the supervised and unsupervised learning problems takes a great deal of data transformation. The first step is adding a third dimension to the data.

In [1]:
import numpy as np
import pandas as pd

# plot
import matplotlib.pyplot as plt
import seaborn as sns
import re
import scipy
from collections import Counter
%matplotlib inline
sns.set_style("darkgrid")

#sklearn
from sklearn.model_selection import train_test_split

# For the network
import glob
import cv2
import os

import ipywidgets as iw
from IPython.display import display, clear_output

!pip install tensorflow==2.0.0-alpha0

import tensorflow.keras
from tensorflow.keras import layers, models, optimizers, metrics 


#from __future__ import absolute_import, division, print_function, unicode_literals

import tensorflow as tf
#tf.enable_eager_execution()

from tensorflow import feature_column


# Import various components for model building
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout, Flatten, Conv2D, MaxPooling2D
from tensorflow.keras.layers import LSTM, Input, TimeDistributed
from tensorflow.keras.models import Model
from tensorflow.keras.optimizers import RMSprop

# Import the backend
from tensorflow.keras import backend as k

import warnings
# filter warnings
warnings.filterwarnings('ignore')



### Load the data
This will take a little while, as the file is quite large.

In [2]:
traffic_18_m = pd.read_csv('traffic_18_m.csv')

### Set Up the Outcome Variable
This reuses the code that recreated the variable during the supervised learning phase. Learning on the variable in its original form was too hard, so this reduces the number of possible outcomes.

In [33]:
traffic_18_m['speed_index'] = 0

# Shorthand for the observed speed and the posted limit
speed = traffic_18_m['SPEED']
limit = traffic_18_m['speed_limit']

# Minus: 10 or more below the limit
# (assign through df.loc[mask, column] to avoid chained assignment)
traffic_18_m.loc[(speed <= limit - 10) & (speed > limit - 20), 'speed_index'] = -10
traffic_18_m.loc[speed <= limit - 20, 'speed_index'] = -20

# Plus: 10 or more above the limit
traffic_18_m.loc[(speed >= limit + 10) & (speed < limit + 20), 'speed_index'] = 10
traffic_18_m.loc[speed >= limit + 20, 'speed_index'] = 20

# Stopped traffic gets its own class; assigned last so it overrides the bins above
traffic_18_m.loc[speed == 0, 'speed_index'] = -100



#traffic_18_m['speed_index'] = traffic_18_m['speed_index'].astype('category')
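The same binning can also be written as a single vectorized expression. The sketch below is not the code used in this notebook; it is one possible alternative using `np.select`, where the helper name `speed_index` and the toy rows are made up for illustration. The conditions are checked in order, so the stopped-traffic rule is listed first to make it win, which mirrors assigning it last above.

```python
import numpy as np
import pandas as pd

def speed_index(speed, limit):
    """Bin observed speed relative to the posted limit into six classes."""
    diff = speed - limit
    conditions = [
        speed == 0,    # stopped traffic, listed first so it takes priority
        diff <= -20,   # 20 or more below the limit
        diff <= -10,   # 10 to 20 below the limit
        diff >= 20,    # 20 or more above the limit
        diff >= 10,    # 10 to 20 above the limit
    ]
    choices = [-100, -20, -10, 20, 10]
    # Rows that match none of the conditions stay at 0
    return np.select(conditions, choices, default=0)

# Toy rows (made-up values, not taken from the traffic data)
toy = pd.DataFrame({'SPEED': [0, 5, 15, 27, 38, 50],
                    'speed_limit': [25, 25, 25, 25, 25, 25]})
toy['speed_index'] = speed_index(toy['SPEED'], toy['speed_limit'])
print(toy)
```

Because every rule lives in one list, it is easier to see at a glance that the bins do not overlap.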

### Balance the data
500,000 rows of each class, for a total of 3,000,000 rows. I will probably need to cut this down even further later, but for now it will work.

In [34]:
traffic_s = traffic_18_m.sample(frac=1, random_state=1)

traffic_cnn = pd.DataFrame(columns=traffic_s.columns)

for i in traffic_s.speed_index.unique():
    label = traffic_s.loc[traffic_s['speed_index']==i][:500000]
    traffic_cnn = pd.concat([traffic_cnn, label])
    
traffic_cnn.speed_index.value_counts()

-10     500000
-20     500000
-100    500000
 20     500000
 10     500000
 0      500000
Name: speed_index, dtype: int64
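For reference, the shuffle-then-take-the-first-n idea can be sketched without the loop. This is an alternative under my own assumptions, not the code that produced the counts above: `balance_classes` is a hypothetical helper and the toy frame is made up. Since it never concatenates onto an empty DataFrame, it should also keep the original column dtypes, which is likely why `speed_index` ends up as `object` with the loop version.

```python
import pandas as pd

def balance_classes(df, label_col, n_per_class, seed=1):
    """Shuffle the rows, then keep at most n_per_class rows per label value."""
    shuffled = df.sample(frac=1, random_state=seed)
    # groupby(...).head(n) keeps the first n rows of each group
    return shuffled.groupby(label_col).head(n_per_class)

# Toy data with unbalanced classes (made-up values)
toy = pd.DataFrame({'speed_index': [0] * 5 + [10] * 3 + [-10] * 4,
                    'x': range(12)})
balanced = balance_classes(toy, 'speed_index', n_per_class=3)
print(balanced['speed_index'].value_counts())
```

A class with fewer than `n_per_class` rows is simply kept whole, so it is worth checking the counts afterwards, as done above.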

### Clean the Data

In [36]:
# Drop unwanted columns
traffic_cnn = traffic_cnn.drop(columns=['Unnamed: 0','index','LINK_POINTS',
                                        'ENCODED_POLY_LINE','ENCODED_POLY_LINE_LVLS',
                                        'TRANSCOM_ID','Join_ID'])

In [37]:
# Convert to numeric and to codes

tonumeric = ['ID', 'STATUS', 'LINK_ID', 'year', 'month', 'dayofweek', 'hour','minute','poly_num','BikeLane', 
             'weekend','morn_rush_hr', 'eve_rush_hr','Number_Tot','Number_Tra','SeqNum','StreetCode','lion_id',
             'speed_id','speed_limit']
tocategory = ['Snow_Prior','NonPed','RB_Layer','SegmentTyp','FeatureTyp','Street','BOROUGH']

# turn columns into numeric
for i in tonumeric:
    traffic_cnn[i] = pd.to_numeric(traffic_cnn[i])

# to a category then immediately into a coded column
for i in tocategory:
    traffic_cnn[i] = traffic_cnn[i].astype('category')
    traffic_cnn[i+'_codes'] = traffic_cnn[i].cat.codes
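One detail of `cat.codes` that is easy to miss: missing values are encoded as `-1` rather than staying null, so a later `dropna()` will not remove them from the `_codes` column. The toy frame below is made up to show this; it also prints the code-to-category lookup, which is handy when reading the encoded columns later.

```python
import numpy as np
import pandas as pd

# Toy column with one missing value (made-up rows)
toy = pd.DataFrame({'BOROUGH': ['Bronx', 'Queens', np.nan, 'Bronx']})

toy['BOROUGH'] = toy['BOROUGH'].astype('category')
toy['BOROUGH_codes'] = toy['BOROUGH'].cat.codes
print(toy)

# Lookup table from integer code back to the original category
code_to_category = dict(enumerate(toy['BOROUGH'].cat.categories))
print(code_to_category)

# The missing value became -1, so the codes column has no nulls left
print(toy['BOROUGH_codes'].isna().sum())
```

If missing categories should be dropped rather than treated as their own value, filtering on `codes != -1` is one way to do it.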

In [38]:
# Get rid of null values
traffic_cnn = traffic_cnn.drop(columns='NonPed')
traffic_cnn = traffic_cnn.dropna()

In [39]:
# Drop non-numeric columns
traffic_cnn = traffic_cnn.drop(columns=['DATA_AS_OF','OWNER', 'BOROUGH','LINK_NAME','RecordedAtTime',
                                        'LINK_START','LINK_END','LINK_MIDDLE','Street', 'FeatureTyp', 'SegmentTyp', 
                                        'RB_Layer', 'TrafDir','Snow_Prior', 'TRAVEL_TIME','SPEED'])

In [11]:
# Check to see if everything is all right
traffic_cnn.dtypes

ID                        int64
STATUS                    int64
LINK_ID                   int64
year                      int64
month                     int64
dayofweek                 int64
hour                      int64
minute                    int64
poly_num                  int64
speed_id                  int64
speed_limit               int64
lion_id                   int64
SeqNum                    int64
StreetCode                int64
StreetWidt              float64
BikeLane                  int64
Number_Tra                int64
Number_Tot                int64
weekend                   int64
morn_rush_hr              int64
eve_rush_hr               int64
morning_rush_avg_spd    float64
evening_rush_avg_spd    float64
wknd_avg_spd            float64
overall_avg_spd         float64
overall_std_speed       float64
speed_index              object
Snow_Prior_codes           int8
NonPed_codes               int8
RB_Layer_codes             int8
SegmentTyp_codes           int8
FeatureT

## Convolutional Neural Network

Now we begin actually setting up the data to be run through the neural network. First, I take a sample of the data so that the network runs much faster. I tried running it on the full 3 million rows, but training took far too long, and I noticed that removing data didn't change the accuracy or loss of the network a great deal.

In [114]:
from tensorflow.keras.preprocessing import sequence
from tensorflow.keras.layers import Dense, Dropout, Activation
from tensorflow.keras.layers import Embedding
from tensorflow.keras.layers import Conv1D, GlobalMaxPooling1D

# pull out a sample
traffic_test_sm = traffic_cnn.sample(60000, random_state=1)


# split into a training and a test sample
from sklearn.model_selection import train_test_split, cross_val_score
train, test = train_test_split(traffic_test_sm, random_state=0, train_size=.5)

# split into inputs and outputs
train_X, train_y = train.drop(columns=['speed_index']), train['speed_index']
test_X, test_y = test.drop(columns=['speed_index']), test['speed_index']

# to values
train_X, train_y = train_X.values, train_y.values
test_X, test_y = test_X.values, test_y.values

# Reshape into 3D: (samples, 1 step, features)
train_X = np.array(train_X).reshape(train_X.shape[0], -1, train_X.shape[1]) 
test_X = np.array(test_X).reshape(test_X.shape[0], -1, test_X.shape[1]) 

# What's the output?
print('x_train shape:', train_X.shape)
print('x_test shape:', test_X.shape)

x_train shape: (30000, 1, 33)
x_test shape: (30000, 1, 33)
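To make the reshape concrete, here is a small sketch on a made-up array. Each row becomes a "sequence" of length one whose single step holds all of the features, which is the `(samples, steps, features)` layout that the 1D convolution and LSTM layers expect. Inserting a new axis with `np.newaxis` gives the same result.

```python
import numpy as np

# 4 samples with 3 features each (toy values)
X = np.arange(12, dtype=float).reshape(4, 3)

# Same call pattern as above: (samples, -1, features) resolves to one step
X_3d = X.reshape(X.shape[0], -1, X.shape[1])
print(X_3d.shape)

# Equivalent: add a length-1 axis in the middle
same = np.array_equal(X_3d, X[:, np.newaxis, :])
print(same)

# The values of each row are untouched, only wrapped in one extra dimension
print(X_3d[0])
```

With only one step per sample, a convolution with kernel size 1 has nothing to slide over, so it behaves much like a dense layer applied to each row.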


In [75]:
train_y.shape

(30000,)

### Update the outcome to categorical
The output layer of the network produces one value per class, so the outcome has to be converted from a single label per row into an array of binary indicators per row (a one-hot encoding). Luckily there is a Keras utility that can do this.

In [101]:
from tensorflow.keras.utils import to_categorical

train_y = to_categorical(train_y, num_classes=100, dtype='object')
test_y = to_categorical(test_y, num_classes=100, dtype='object')

In [102]:
train_y

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]], dtype=object)
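One thing worth checking with this encoding is the negative labels. As far as I can tell, `to_categorical` places each 1 by plain array indexing, so a negative label would count back from the end of the 100 columns. If that is right, `-100` and `0` land in the same column. The sketch below shows the wrap-around with NumPy alone and one possible way around it: map the labels to contiguous indices first. It is an illustration on made-up labels, not the encoding used in this notebook.

```python
import numpy as np

# The six speed_index values, plus two repeats (toy sample)
labels = np.array([-100, -20, -10, 0, 10, 20, 0, -100])

# Indexing a length-100 array with the raw labels: negatives wrap around
columns = np.arange(100)[labels]
print(columns)   # -100 and 0 both point at column 0

# Map each distinct label to an index 0..5, then one-hot encode those indices
classes, idx = np.unique(labels, return_inverse=True)
one_hot = np.eye(len(classes), dtype='float32')[idx]
print(classes)
print(one_hot.shape)

# The mapping is reversible: argmax gives the index, classes gives the label
decoded = classes[one_hot.argmax(axis=1)]
print(decoded)
```

With six columns instead of 100, the output layer would also only need six units.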

### Fit the Network
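For the network, I use two blocks of two convolutional layers each, with a max pooling layer between them. A global average pooling layer then flattens the output, followed by a dropout layer and finally a dense layer that runs the result through a traditional neural network layer to produce the outcome.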

In [124]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras.layers import Embedding
from tensorflow.keras.layers import Conv1D, GlobalAveragePooling1D, MaxPooling1D

model = Sequential()
model.add(Conv1D(64, 1, activation='relu', input_shape=(1, train_X.shape[2])))
model.add(Conv1D(64, 1, activation='relu'))
model.add(MaxPooling1D(1))
model.add(Conv1D(128, 1, activation='relu'))
model.add(Conv1D(128, 1, activation='relu'))
model.add(GlobalAveragePooling1D())
model.add(Dropout(0.5))
model.add(Dense(100, activation='sigmoid'))

model.compile(loss='binary_crossentropy',
              optimizer='rmsprop',
              metrics=['accuracy'])

model.fit(train_X, train_y, batch_size=8, epochs=50)
score = model.evaluate(test_X, test_y, batch_size=8)

Epoch 1/50
...
Epoch 50/50
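A note on reading the accuracy for this setup. I believe that when the loss is `binary_crossentropy`, Keras resolves `'accuracy'` to an element-wise binary accuracy, meaning every one of the 100 output columns is scored separately. Under that assumption, the sketch below shows with NumPy alone why the number can look very high: a "model" that predicts zero everywhere is still right on 99 of the 100 columns in every row. The arrays are made up and nothing here comes from the actual training run.

```python
import numpy as np

n_samples, n_classes = 1000, 100

# Toy targets: only six of the 100 columns are ever used
true_idx = np.arange(n_samples) % 6
y_true = np.eye(n_classes)[true_idx]

# A "model" that outputs zero for every column of every row
y_pred = np.zeros((n_samples, n_classes))

# Element-wise accuracy: each of the 100 columns counts as its own prediction
elementwise_acc = np.mean((y_pred > 0.5) == (y_true == 1))

# Class accuracy: does the highest-scoring column match the true class?
class_acc = np.mean(y_pred.argmax(axis=1) == true_idx)

print(elementwise_acc, class_acc)
```

For a single-label problem like this one, a softmax output with `categorical_crossentropy` is the more usual pairing, and the accuracy it reports corresponds to the second number.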


___

## LSTM
Here I set up the LSTM in the same way that I set up the CNN. The data needs to be in three dimensions, where the second dimension is the number of timesteps the model looks back over as it runs. I also pull a slightly larger sample here to see whether it affects the results much.
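To illustrate what that second dimension would hold with more than one timestep, here is a minimal sliding-window sketch. It assumes the rows are already sorted by time for a single road link, which is not how the sample in this notebook is arranged, and `make_windows` is a hypothetical helper written for this example. Each sample stacks `n_steps` consecutive rows, and its target is the label of the last row in the window.

```python
import numpy as np

def make_windows(features, targets, n_steps):
    """Stack n_steps consecutive rows into one sample.

    The target of each window is the label of its last row.
    Assumes the rows are in time order.
    """
    n_samples = len(features) - n_steps + 1
    X = np.stack([features[i:i + n_steps] for i in range(n_samples)])
    y = targets[n_steps - 1:]
    return X, y

# 10 time-ordered rows with 2 features each (toy values)
features = np.arange(20, dtype=float).reshape(10, 2)
targets = np.arange(10)

X, y = make_windows(features, targets, n_steps=3)
print(X.shape, y.shape)
print(X[0], y[0])
```

With `n_steps=1` this reduces to the single-step reshape used in this notebook.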

In [118]:
traffic_test = traffic_cnn.sample(80000)

In [119]:
traffic_test.shape

(80000, 34)

In [120]:
traffic_test.columns

Index(['ID', 'STATUS', 'LINK_ID', 'year', 'month', 'dayofweek', 'hour',
       'minute', 'poly_num', 'speed_id', 'speed_limit', 'lion_id', 'SeqNum',
       'StreetCode', 'StreetWidt', 'BikeLane', 'Number_Tra', 'Number_Tot',
       'weekend', 'morn_rush_hr', 'eve_rush_hr', 'morning_rush_avg_spd',
       'evening_rush_avg_spd', 'wknd_avg_spd', 'overall_avg_spd',
       'overall_std_speed', 'speed_index', 'Snow_Prior_codes', 'NonPed_codes',
       'RB_Layer_codes', 'SegmentTyp_codes', 'FeatureTyp_codes',
       'Street_codes', 'BOROUGH_codes'],
      dtype='object')

In [121]:
# to split a training and test sample
from sklearn.model_selection import train_test_split, cross_val_score
train, test = train_test_split(traffic_test, random_state=0)

# split into input and outputs
train_X, train_y = train.drop(columns=['speed_index']), train['speed_index']
test_X, test_y = test.drop(columns=['speed_index']), test['speed_index']
# to values
train_X, train_y = train_X.values, train_y.values
test_X, test_y = test_X.values, test_y.values

train_y = tensorflow.keras.utils.to_categorical(train_y, num_classes=100, dtype='object')
test_y = tensorflow.keras.utils.to_categorical(test_y, num_classes=100, dtype='object')

# reshape input to be 3D [samples, timesteps, features]
train_X = train_X.reshape((train_X.shape[0], 1, train_X.shape[1]))
test_X = test_X.reshape((test_X.shape[0], 1, test_X.shape[1]))
print(train_X.shape, train_y.shape, test_X.shape, test_y.shape)

(60000, 1, 33) (60000, 100) (20000, 1, 33) (20000, 100)


### Fitting the Network
Here I set up a fairly simple LSTM, with just one LSTM layer followed by a dense output layer. This notebook was more of an exercise to see how the data performed on the CNN, but I wanted to take it a bit further and explore the LSTM. A possible next step would be to stack several LSTM layers on top of each other, but even this simple version performs so well that it might not be worth the time.

In [123]:
# design network
model = Sequential()
model.add(LSTM(50, input_shape=(train_X.shape[1], train_X.shape[2])))
model.add(Dense(100))
model.compile(loss='binary_crossentropy',
              optimizer='rmsprop',
              metrics=['accuracy'])

# fit network
history = model.fit(train_X, train_y, epochs=50, batch_size=72, validation_data=(test_X, test_y), shuffle=False)

Train on 60000 samples, validate on 20000 samples
Epoch 1/50
...
Epoch 50/50


## Conclusion

This was just an exercise to see how these two types of neural networks would work with the traffic data. I was a little nervous about running non-image data through the CNN, but it worked far better than I thought it would. I think the fact that the data is so closely tied to time helped the outcome, especially for the LSTM.

If given more time, I would set the data up as more of a traditional time-series problem, and then set up a Recurrent Neural Network and an LSTM to pass the data through. In its current state, the data definitely has a time factor to it, but it isn't as straightforward as it would be in a time-series scenario.