# Revisiting the NYC Taxi DataSet Model Architecture Part 6

This notebook changes the model architecture to examine how the architecture affects prediction quality.
Of all the prior parts, Part 3 used only the location sequence as input and still achieved relatively good accuracy.
The model here replaces the SequenceFeatures layer used so far, which is still experimental (tf.keras.experimental), with DenseFeatures.
Based on the evaluation of independently developed standalone models (built according to the latest TensorFlow tutorials for time series prediction and multi-class classification), the model accuracy is expected to decrease significantly.
The SequenceFeatures layer seems to have a large impact on the sequenced location data and to greatly improve prediction quality.
That assumption is verified here.

In [1]:
import numpy as np
np.random.seed(0)
import tensorflow as tf
import pandas as pd
from tensorflow import feature_column
from tensorflow.keras import layers
import import_ipynb

In [2]:
from model_helper import ModelHelper

importing Jupyter notebook from model_helper.ipynb


# Dataset

In [3]:
df = pd.read_csv("./ma_results/trips_with_zones_final.csv")
df = df.head(10000000)
df.head(10)

Unnamed: 0,medallion,pickup_week_day,pickup_hour,pickup_day,pickup_month,dropoff_week_day,dropoff_hour,dropoff_day,dropoff_month,pickup_location_id,dropoff_location_id
0,00005007A9F30E289E760362F69E4EAD,1,0,1,1,1,0,1,1,162.0,262.0
1,00005007A9F30E289E760362F69E4EAD,1,0,1,1,1,0,1,1,262.0,239.0
2,00005007A9F30E289E760362F69E4EAD,1,0,1,1,1,1,1,1,239.0,236.0
3,00005007A9F30E289E760362F69E4EAD,1,1,1,1,1,1,1,1,236.0,41.0
4,00005007A9F30E289E760362F69E4EAD,1,1,1,1,1,1,1,1,41.0,211.0
5,00005007A9F30E289E760362F69E4EAD,1,1,1,1,1,2,1,1,211.0,238.0
6,00005007A9F30E289E760362F69E4EAD,1,2,1,1,1,2,1,1,238.0,142.0
7,00005007A9F30E289E760362F69E4EAD,1,2,1,1,1,2,1,1,142.0,263.0
8,00005007A9F30E289E760362F69E4EAD,1,2,1,1,1,3,1,1,263.0,48.0
9,00005007A9F30E289E760362F69E4EAD,1,3,1,1,1,3,1,1,48.0,246.0


In [4]:
# Check dtypes of the attributes
df.dtypes

medallion               object
pickup_week_day          int64
pickup_hour              int64
pickup_day               int64
pickup_month             int64
dropoff_week_day         int64
dropoff_hour             int64
dropoff_day              int64
dropoff_month            int64
pickup_location_id     float64
dropoff_location_id    float64
dtype: object

In [5]:
# Drop the medallion, it is not needed for this example
df.drop(['medallion'], axis=1, inplace=True)

Because there are too many taxis (over 9000), it is better to take the 100 taxis with the largest number of records.

In [6]:
# Cast the columns type to int32
dictionary = {'pickup_week_day': 'int32', 'pickup_hour': 'int32', 'pickup_day': 'int32', 'pickup_month': 'int32', 'dropoff_week_day': 'int32', 'dropoff_hour': 'int32', 'dropoff_day': 'int32', 'dropoff_month': 'int32', 'pickup_location_id':'int32', 'dropoff_location_id':'int32'}
df = df.astype(dictionary, copy=True)
df.dtypes

pickup_week_day        int32
pickup_hour            int32
pickup_day             int32
pickup_month           int32
dropoff_week_day       int32
dropoff_hour           int32
dropoff_day            int32
dropoff_month          int32
pickup_location_id     int32
dropoff_location_id    int32
dtype: object

We can use the other taxis to create local test and validation sets.

Now we need to create the location sequence for each user.

In [7]:
mh = ModelHelper(df, 129)

In [8]:
# Call the function
mh.df_to_location_sequence()

print(mh.df)

            index  location_id  day  month  hour_sin      hour_cos  \
0               0          162    1      1  0.000000  1.000000e+00   
1              12          230    1      1  0.707107  7.071068e-01   
2              13          125    1      1  0.707107  7.071068e-01   
3              15           48    1      1  0.866025  5.000000e-01   
4              18          170    1      1  1.000000  6.123234e-17   
...           ...          ...  ...    ...       ...           ...   
13731996  7284341          161   26      1 -0.500000 -8.660254e-01   
13731997  7284341          161   26      1 -0.500000 -8.660254e-01   
13731998  7284342          132   26      1 -0.707107 -7.071068e-01   
13731999  7284343          141   26      1 -0.866025 -5.000000e-01   
13732000  7284344          141   26      1 -0.866025 -5.000000e-01   

          week_day_sin  week_day_cos  weekend  
0             0.781831      0.623490        0  
1             0.781831      0.623490        0  
2             0
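The `hour_sin`/`hour_cos` and `week_day_sin`/`week_day_cos` columns in the printout look like a cyclical encoding of the hour and the week day. The code of `df_to_location_sequence` is not shown here, so the following is only a minimal sketch of how such columns could be computed, assuming a period of 24 for hours and 7 for week days; `cyclical` is a hypothetical helper name.

```python
import math

def cyclical(value, period):
    """Map a periodic integer value onto the unit circle as (sin, cos)."""
    angle = 2.0 * math.pi * value / period
    return math.sin(angle), math.cos(angle)

# Hour 3 of 24 and week day 1 of 7, rounded like the DataFrame printout
hour_sin, hour_cos = cyclical(3, 24)
day_sin, day_cos = cyclical(1, 7)
print(round(hour_sin, 6), round(hour_cos, 6))
print(round(day_sin, 6), round(day_cos, 6))
```

Under these assumptions the results agree with rows of the printout (for example `hour_sin = 0.707107` and `week_day_sin = 0.781831`). The benefit of the encoding is that hour 23 and hour 0 end up close together, which a plain integer column would not express.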

In [9]:
mh.train_val_test_split()
print(len(mh.df_train), 'train examples')
print(len(mh.df_val), 'validation examples')
print(len(mh.df_test), 'test examples')

8788480 train examples
2197120 validation examples
2746401 test examples
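The three sizes printed above add up to 13732001 rows and are consistent with an unshuffled 80/20 split applied twice: first 20% is held out for testing, then 20% of the remainder for validation. The implementation of `train_val_test_split` is not shown, so this is an assumption; `sequential_split_sizes` is a hypothetical helper that only reproduces the arithmetic.

```python
def sequential_split_sizes(n_rows, holdout_tenths=2):
    """Return (train, val, test) sizes for two nested, unshuffled splits."""
    keep = 10 - holdout_tenths
    # Integer arithmetic avoids floating point rounding on large row counts
    n_train_val = n_rows * keep // 10
    n_test = n_rows - n_train_val
    n_train = n_train_val * keep // 10
    n_val = n_train_val - n_train
    return n_train, n_val, n_test

sizes = sequential_split_sizes(13732001)
print(sizes)
```

Keeping the rows in their original order matters for sequence data, because shuffling before the split would mix trips of the same sequence across the training and test sets.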


In [10]:
mh.split_data()
mh.list_test[0]

Unnamed: 0,index,location_id,day,month,hour_sin,hour_cos,week_day_sin,week_day_cos,weekend
10985600,5283998,246,4,1,-0.866025,0.500000,-0.433884,-0.900969,0
10985601,5283999,107,4,1,-0.707107,0.707107,-0.433884,-0.900969,0
10985602,5284000,142,4,1,-0.707107,0.707107,-0.433884,-0.900969,0
10985603,5284001,48,4,1,-0.500000,0.866025,-0.433884,-0.900969,0
10985604,5284001,48,4,1,-0.500000,0.866025,-0.433884,-0.900969,0
...,...,...,...,...,...,...,...,...,...
10985724,5284091,234,7,1,0.000000,1.000000,0.000000,1.000000,0
10985725,5284092,162,7,1,0.258819,0.965926,0.000000,1.000000,0
10985726,5284093,142,7,1,0.500000,0.866025,0.000000,1.000000,0
10985727,5284093,142,7,1,0.500000,0.866025,0.000000,1.000000,0


In [11]:
mh.set_batch_size(128)
mh.create_batch_dataset()
mh.test_dataset

<BatchDataset shapes: ({start_place: (128, 128), start_hour_sin: (128, 128), start_hour_cos: (128, 128), weekend: (128, 128), week_day_sin: (128, 128), week_day_cos: (128, 128), end_hour_sin: (128,), end_hour_cos: (128,), end_weekend: (128,), end_week_day_sin: (128,), end_week_day_cos: (128,)}, (128,)), types: ({start_place: tf.int32, start_hour_sin: tf.float64, start_hour_cos: tf.float64, weekend: tf.int32, week_day_sin: tf.float64, week_day_cos: tf.float64, end_hour_sin: tf.float64, end_hour_cos: tf.float64, end_weekend: tf.int32, end_week_day_sin: tf.float64, end_week_day_cos: tf.float64}, tf.int32)>
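The dataset shapes show 128 input steps per example and a single target, which fits the window length of 129 passed to `ModelHelper`: 128 visited places as input and the next place as the label. How `ModelHelper` actually cuts its windows (in particular the shift between windows) is not shown, so the sketch below is an illustration under that assumption; `make_windows` and its `shift` parameter are hypothetical.

```python
def make_windows(location_ids, window_length=129, shift=1):
    """Cut a sequence into (inputs, target) pairs.

    Each window holds `window_length` consecutive IDs: the first
    `window_length - 1` are the inputs, the last one is the target.
    """
    pairs = []
    for start in range(0, len(location_ids) - window_length + 1, shift):
        window = location_ids[start:start + window_length]
        pairs.append((window[:-1], window[-1]))
    return pairs

ids = list(range(200))          # stand-in for a column of location IDs
pairs = make_windows(ids)
inputs, target = pairs[0]
print(len(pairs), len(inputs), target)
```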

In [12]:
mh.set_target_column_name('location_id')
mh.set_vocab_size()
mh.set_numerical_column_names(['start_hour_sin', 'start_hour_cos', 'weekend', 'week_day'])

In [13]:
# The embedding dimension
embedding_dim = 256

# Number of RNN units
rnn_units = 256

# Create a model
def create_model():
  N = mh.total_window_length
  batch_size = mh.batch_size
  number_of_places = mh.vocab_size

  # Shortcut to the layers package
  l = tf.keras.layers

  # Declare the dictionary for the places sequence as before
  sequence_input = {
      'start_place': tf.keras.Input((N-1,), batch_size=batch_size, dtype=tf.dtypes.int32, name='start_place') # add batch_size=batch_size in case of stateful GRU
  }

  # Handling the categorical feature sequence using one-hot
  places_one_hot = feature_column.categorical_column_with_vocabulary_list(
      'start_place', [i for i in range(number_of_places)])

  # Embed the one-hot encoding
  places_embed = feature_column.embedding_column(places_one_hot, embedding_dim)

  # For an input sequence SequenceFeatures would normally be required;
  # DenseFeatures is used here on purpose to measure the difference
  dense_features = l.DenseFeatures(places_embed)(sequence_input)

  #dense_features = tf.ensure_shape(dense_features, (batch_size, N-1, dense_features.shape[2]))

  dense_features = tf.expand_dims(dense_features, -1)

  # Rnn
  recurrent = l.GRU(rnn_units,
                        batch_size=batch_size, #in case of stateful
                        return_sequences=True,
                        stateful=True,
                        recurrent_initializer='glorot_uniform')(dense_features)

  recurrent_2 = l.GRU(64,
                        batch_size=batch_size, #in case of stateful
                        stateful=True,
                        recurrent_initializer='glorot_uniform')(recurrent)

  # Last layer with one output for each place
  dense_1 = layers.Dense(number_of_places)(recurrent_2)

  # Softmax output layer
  output = l.Softmax()(dense_1)

  # To return the Model, we need to define its inputs and outputs
  # In our case, we need to list all the input layers we have defined
  inputs = list(sequence_input.values())

  # Return the Model
  return tf.keras.Model(inputs=inputs, outputs=output)

In [14]:
# Get the model and compile it
mh.assign_model(create_model())
mh.compile_model()

# Training

In [15]:
mh.set_num_epochs(10)
mh.fit_model()

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 00004: early stopping


# Evaluation

In [16]:
mh.evaluate_model()



In [17]:
mh.model.summary()

Model: "functional_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
start_place (InputLayer)     [(128, 128)]              0         
_________________________________________________________________
dense_features (DenseFeature (128, 256)                67584     
_________________________________________________________________
tf_op_layer_ExpandDims (Tens [(128, 256, 1)]           0         
_________________________________________________________________
gru (GRU)                    (128, 256, 256)           198912    
_________________________________________________________________
gru_1 (GRU)                  (128, 64)                 61824     
_________________________________________________________________
dense (Dense)                (128, 264)                17160     
_________________________________________________________________
softmax (Softmax)            (128, 264)               

In [18]:
mh.print_test_prediction_info()

logits
Shape :  (21248, 264)
Example [0] :  [2.19648192e-03 8.74014746e-04 4.16128978e-06 2.41860398e-05
 4.34706453e-03 1.90446808e-05 2.01919156e-05 3.83476680e-03
 1.04172483e-04 5.41678346e-05 1.80263349e-04 6.18902341e-05
 3.65208281e-04 9.18495003e-03 1.00926019e-03 1.15527189e-04
 6.53688621e-05 9.94463102e-04 1.00447884e-04 2.82751080e-05
 3.84194318e-05 6.17694677e-05 3.57458310e-04 4.39436189e-05
 2.90893484e-03 2.22603604e-03 1.73467881e-04 6.72991609e-06
 2.07932084e-04 1.24590471e-04 4.58845443e-06 2.37560689e-05
 2.09630762e-05 2.92644510e-03 9.57764569e-05 9.25926797e-05
 7.56544410e-04 1.96089805e-03 4.30240689e-05 1.70404965e-04
 1.33508549e-03 3.74178728e-03 3.55806481e-03 1.25618130e-02
 2.32119583e-05 3.35296127e-03 4.52181921e-05 6.79534132e-05
 2.67040282e-02 2.51214718e-03 8.78496841e-03 4.06981781e-05
 1.35310472e-03 9.16527715e-05 1.68430779e-04 3.97323238e-05
 8.80572552e-05 9.95942173e-06 2.85917395e-05 3.76005005e-06
 6.82771424e-05 1.59253646e-03 3.52465257
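Although the helper labels this array "logits", it is taken after the Softmax layer, so each row is a probability distribution over the 264 location IDs. A minimal sketch of how such rows could be turned into predictions and a top-k accuracy is given below. It is not the evaluation code of `ModelHelper`; `top_k_accuracy` and the small arrays are made up for illustration.

```python
import numpy as np

def top_k_accuracy(probs, labels, k=1):
    """Fraction of rows whose true label is among the k most probable classes."""
    # argsort is ascending, so the last k columns hold the k largest probabilities
    top_k = np.argsort(probs, axis=1)[:, -k:]
    hits = (top_k == labels[:, None]).any(axis=1)
    return float(hits.mean())

# Three examples with three classes each (toy data)
probs = np.array([[0.1, 0.7, 0.2],
                  [0.5, 0.3, 0.2],
                  [0.2, 0.3, 0.5]])
labels = np.array([1, 1, 2])

predicted = probs.argmax(axis=1)   # most probable class per row
top1 = top_k_accuracy(probs, labels, k=1)
top2 = top_k_accuracy(probs, labels, k=2)
print(predicted, top1, top2)
```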

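Using the ordinary DenseFeatures layer results in a major reduction in prediction quality.
The model summary shows why: DenseFeatures returns a tensor of shape (128, 256), so the 128 embedded locations of each sequence are combined into a single vector and their order is lost. After the added dimension, the GRU iterates over the 256 embedding dimensions instead of over the visited places.
SequenceFeatures keeps one embedding per time step, which lets the model interpret the data correctly, and is therefore necessary.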
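The shape difference between the two feature layers can be illustrated with plain NumPy. This is a sketch and not the TensorFlow implementation: it assumes that the embedding column combines all IDs of one example by averaging (the default `combiner='mean'` of `embedding_column`), and `table` is a random stand-in for the trained embedding weights.

```python
import numpy as np

rng = np.random.default_rng(0)
batch, steps, vocab, dim = 4, 128, 264, 256   # small batch, other sizes as in the summary

ids = rng.integers(0, vocab, size=(batch, steps))   # location IDs per sequence
table = rng.normal(size=(vocab, dim))               # stand-in embedding table

per_step = table[ids]                # one embedding per time step
pooled = per_step.mean(axis=1)       # sequence averaged into a single vector
expanded = pooled[..., np.newaxis]   # same effect as expand_dims(..., -1)

# Reversing the visiting order leaves the averaged vector unchanged
reversed_pooled = table[ids[:, ::-1]].mean(axis=1)
order_is_lost = bool(np.allclose(pooled, reversed_pooled))

print(per_step.shape, pooled.shape, expanded.shape, order_is_lost)
```

The per-step tensor has the layout a recurrent layer expects (batch, time steps, features), while the averaged and expanded tensor presents the 256 embedding dimensions as if they were time steps with one feature each.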