# Revisiting the NYC Taxi DataSet Model Architecture Part 7

In this notebook, the model architecture is changed to study how the architecture affects prediction quality.
Of all the prior parts, Part 2 used the most complex model.
The model in this part tries to improve prediction quality by additionally using SequenceFeatures for the time components.
Any resulting improvement in accuracy is expected to be small.
The reason is that the time components have little influence on the prediction anyway, as discussed in the evaluation of the prior parts.


In [1]:
import numpy as np
np.random.seed(0)
import tensorflow as tf
import pandas as pd
from tensorflow import feature_column
from tensorflow.keras import layers
import import_ipynb

In [2]:
from model_helper import ModelHelper

importing Jupyter notebook from model_helper.ipynb


# Dataset

In [3]:
df = pd.read_csv("./ma_results/trips_with_zones_final.csv")
df = df.head(10000000)
df.head(10)

Unnamed: 0,medallion,pickup_week_day,pickup_hour,pickup_day,pickup_month,dropoff_week_day,dropoff_hour,dropoff_day,dropoff_month,pickup_location_id,dropoff_location_id
0,00005007A9F30E289E760362F69E4EAD,1,0,1,1,1,0,1,1,162.0,262.0
1,00005007A9F30E289E760362F69E4EAD,1,0,1,1,1,0,1,1,262.0,239.0
2,00005007A9F30E289E760362F69E4EAD,1,0,1,1,1,1,1,1,239.0,236.0
3,00005007A9F30E289E760362F69E4EAD,1,1,1,1,1,1,1,1,236.0,41.0
4,00005007A9F30E289E760362F69E4EAD,1,1,1,1,1,1,1,1,41.0,211.0
5,00005007A9F30E289E760362F69E4EAD,1,1,1,1,1,2,1,1,211.0,238.0
6,00005007A9F30E289E760362F69E4EAD,1,2,1,1,1,2,1,1,238.0,142.0
7,00005007A9F30E289E760362F69E4EAD,1,2,1,1,1,2,1,1,142.0,263.0
8,00005007A9F30E289E760362F69E4EAD,1,2,1,1,1,3,1,1,263.0,48.0
9,00005007A9F30E289E760362F69E4EAD,1,3,1,1,1,3,1,1,48.0,246.0


In [4]:
# Check dtypes of the attributes
df.dtypes

medallion               object
pickup_week_day          int64
pickup_hour              int64
pickup_day               int64
pickup_month             int64
dropoff_week_day         int64
dropoff_hour             int64
dropoff_day              int64
dropoff_month            int64
pickup_location_id     float64
dropoff_location_id    float64
dtype: object

In [5]:
# Drop the medallion column; it is not needed for this example
df.drop(['medallion'], axis=1, inplace=True)

Because there are too many taxis (over 9000), it is better to take the 100 taxis with the largest number of records.

In [6]:
# Cast the columns type to int32
dictionary = {'pickup_week_day': 'int32', 'pickup_hour': 'int32', 'pickup_day': 'int32', 'pickup_month': 'int32', 'dropoff_week_day': 'int32', 'dropoff_hour': 'int32', 'dropoff_day': 'int32', 'dropoff_month': 'int32', 'pickup_location_id':'int32', 'dropoff_location_id':'int32'}
df = df.astype(dictionary, copy=True)
df.dtypes

pickup_week_day        int32
pickup_hour            int32
pickup_day             int32
pickup_month           int32
dropoff_week_day       int32
dropoff_hour           int32
dropoff_day            int32
dropoff_month          int32
pickup_location_id     int32
dropoff_location_id    int32
dtype: object

We can use the other taxis to create local test and validation sets.

Now we need to create the location sequence for each user.

In [7]:
mh = ModelHelper(df, 129)

In [8]:
# Convert the trips DataFrame into one location sequence
mh.df_to_location_sequence()

print(mh.df)

            index  location_id  day  month  hour_sin      hour_cos  \
0               0          162    1      1  0.000000  1.000000e+00   
1              12          230    1      1  0.707107  7.071068e-01   
2              13          125    1      1  0.707107  7.071068e-01   
3              15           48    1      1  0.866025  5.000000e-01   
4              18          170    1      1  1.000000  6.123234e-17   
...           ...          ...  ...    ...       ...           ...   
13731996  7284341          161   26      1 -0.500000 -8.660254e-01   
13731997  7284341          161   26      1 -0.500000 -8.660254e-01   
13731998  7284342          132   26      1 -0.707107 -7.071068e-01   
13731999  7284343          141   26      1 -0.866025 -5.000000e-01   
13732000  7284344          141   26      1 -0.866025 -5.000000e-01   

          week_day_sin  week_day_cos  weekend  
0             0.781831      0.623490        0  
1             0.781831      0.623490        0  
2             0
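
The `hour_sin`/`hour_cos` and `week_day_sin`/`week_day_cos` columns in the output above look like a cyclical encoding of the hour and the weekday. The actual implementation lives in `model_helper.ipynb`, which is not shown here, so the following is only a sketch under the assumption that a value `x` with period `p` is mapped to `(sin(2πx/p), cos(2πx/p))`. The helper name `cyclical_encode` is hypothetical. Under this assumption, hour 3 and weekday 1 reproduce the values printed above (0.707107 and 0.781831).

```python
import math

def cyclical_encode(value, period):
    """Map a periodic value onto the unit circle (hypothetical helper)."""
    angle = 2.0 * math.pi * value / period
    return math.sin(angle), math.cos(angle)

hour_sin, hour_cos = cyclical_encode(3, 24)  # hour 3 of a 24-hour day
day_sin, day_cos = cyclical_encode(1, 7)     # weekday 1 of a 7-day week
print(round(hour_sin, 6), round(hour_cos, 6))
print(round(day_sin, 6), round(day_cos, 6))
```

The benefit of such an encoding is that neighbouring points in time, for example hour 23 and hour 0, also end up close to each other in feature space.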

In [9]:
mh.train_val_test_split()
print(len(mh.df_train), 'train examples')
print(len(mh.df_val), 'validation examples')
print(len(mh.df_test), 'test examples')

8788480 train examples
2197120 validation examples
2746401 test examples
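
The three counts above are consistent with a sequential 80/20 split into (train + validation) and test, followed by another 80/20 split of the first part into train and validation. `train_val_test_split` itself is defined in the helper notebook and is not shown, so the function below (`split_sizes`, a hypothetical name) is only a sketch of that assumption. It uses integer arithmetic to keep the sizes exact.

```python
def split_sizes(n_rows, fraction=(4, 5)):
    """Return (train, val, test) sizes for two nested sequential splits.

    Assumption: 80% of the rows are kept for train + validation and 80% of
    those are used for training. The fraction is a (numerator, denominator)
    pair so that no floating-point rounding is involved.
    """
    num, den = fraction
    n_train_val = n_rows * num // den
    n_train = n_train_val * num // den
    n_val = n_train_val - n_train
    n_test = n_rows - n_train_val
    return n_train, n_val, n_test

# 13732001 is the sum of the three counts printed above
print(split_sizes(13732001))
```

A sequential rather than shuffled split keeps the temporal order of the trips intact, so the model is always evaluated on trips that happen after the ones it was trained on.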


In [10]:
mh.split_data()
mh.list_test[0]

Unnamed: 0,index,location_id,day,month,hour_sin,hour_cos,week_day_sin,week_day_cos,weekend
10985600,5283998,246,4,1,-0.866025,0.500000,-0.433884,-0.900969,0
10985601,5283999,107,4,1,-0.707107,0.707107,-0.433884,-0.900969,0
10985602,5284000,142,4,1,-0.707107,0.707107,-0.433884,-0.900969,0
10985603,5284001,48,4,1,-0.500000,0.866025,-0.433884,-0.900969,0
10985604,5284001,48,4,1,-0.500000,0.866025,-0.433884,-0.900969,0
...,...,...,...,...,...,...,...,...,...
10985724,5284091,234,7,1,0.000000,1.000000,0.000000,1.000000,0
10985725,5284092,162,7,1,0.258819,0.965926,0.000000,1.000000,0
10985726,5284093,142,7,1,0.500000,0.866025,0.000000,1.000000,0
10985727,5284093,142,7,1,0.500000,0.866025,0.000000,1.000000,0


In [11]:
mh.set_batch_size(128)
mh.create_batch_dataset()
mh.test_dataset

<BatchDataset shapes: ({start_place: (128, 128), start_hour_sin: (128, 128), start_hour_cos: (128, 128), weekend: (128, 128), week_day_sin: (128, 128), week_day_cos: (128, 128), end_hour_sin: (128,), end_hour_cos: (128,), end_weekend: (128,), end_week_day_sin: (128,), end_week_day_cos: (128,)}, (128,)), types: ({start_place: tf.int32, start_hour_sin: tf.float64, start_hour_cos: tf.float64, weekend: tf.int32, week_day_sin: tf.float64, week_day_cos: tf.float64, end_hour_sin: tf.float64, end_hour_cos: tf.float64, end_weekend: tf.int32, end_week_day_sin: tf.float64, end_week_day_cos: tf.float64}, tf.int32)>
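
The shapes above suggest that each example consists of 128 consecutive locations (`start_place`) with their time features, plus the time features of the step to predict (`end_*`), and that the following location is the label. The window length of 129 passed to `ModelHelper` would then be 128 inputs plus one target. How `create_batch_dataset` builds these windows is not shown, and in particular the stride between windows is not visible in this notebook, so the sketch below treats it as a parameter. `make_windows` is a hypothetical helper.

```python
def make_windows(locations, window_length, stride=1):
    """Split a location sequence into (inputs, target) pairs.

    Assumption: the first window_length - 1 entries of each window are the
    model input and the last entry is the label. The stride is a free
    parameter because the original value is unknown.
    """
    pairs = []
    for start in range(0, len(locations) - window_length + 1, stride):
        window = locations[start:start + window_length]
        pairs.append((window[:-1], window[-1]))
    return pairs

# Toy sequence built from the first pickup locations in the raw data
trip = [162, 262, 239, 236, 41, 211]
for inputs, target in make_windows(trip, window_length=4):
    print(inputs, '->', target)
```

With `window_length=129` the same function would yield inputs of length 128, matching the second dimension of the sequence features in the dataset.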

In [12]:
mh.set_target_column_name('location_id')
mh.set_vocab_size()
mh.set_numerical_column_names(['start_hour_sin', 'start_hour_cos', 'weekend', 'week_day_sin', 'week_day_cos'])

In [13]:
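def sparse_f(input_dense):
  # Convert a dense float32 tensor into a tf.SparseTensor of its non-zero entries
  zero = tf.constant(0, dtype=tf.float32)
  # Indices of all entries that differ from zero
  indices = tf.where(tf.not_equal(input_dense, zero))
  # Values stored at those indices
  values = tf.gather_nd(input_dense, indices)
  # The dense shape must be int64 for tf.SparseTensor
  sparse = tf.SparseTensor(indices, values, tf.cast(tf.shape(input_dense), dtype=tf.int64))
  return sparse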
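
`sparse_f` turns a dense tensor into the coordinate form that `tf.SparseTensor` stores: the indices of the non-zero entries, their values, and the dense shape. The plain-Python analogue below illustrates what ends up in these three components. `dense_to_coo` is a hypothetical name, and the sketch only handles 2-D nested lists.

```python
def dense_to_coo(dense):
    """Return (indices, values, shape) of the non-zero entries of a 2-D list."""
    indices, values = [], []
    for row, row_values in enumerate(dense):
        for col, value in enumerate(row_values):
            if value != 0:
                indices.append((row, col))
                values.append(value)
    shape = (len(dense), len(dense[0]) if dense else 0)
    return indices, values, shape

# Two sequences with three time steps each; zeros are not stored
batch = [[0.0, 0.5, 1.0],
         [0.7, 0.0, 0.0]]
indices, values, shape = dense_to_coo(batch)
print(indices)
print(values)
print(shape)
```

One consequence worth keeping in mind is that genuine zeros, such as `hour_sin` at midnight, are not stored explicitly. They only reappear as the fill value (0 by default) when the sparse tensor is converted back to a dense one.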

In [16]:
# The embedding dimension
embedding_dim = 256

# Number of RNN units
rnn_units1 = 256
rnn_units2 = 128

# Create a model
def create_model():
  N = mh.total_window_length
  batch_size = mh.batch_size
  number_of_places = mh.vocab_size

  # Shortcut to the layers package
  l = tf.keras.layers

  other_feature_inputs = {
    'end_hour_sin': tf.keras.Input((1, ), batch_size=batch_size, name='end_hour_sin'),
    'end_hour_cos': tf.keras.Input((1, ), batch_size=batch_size, name='end_hour_cos'),
    'end_weekend': tf.keras.Input((1, ), batch_size=batch_size, name='end_weekend'),
    'end_week_day_sin': tf.keras.Input((1, ), batch_size=batch_size, name='end_week_day_sin'),
    'end_week_day_cos': tf.keras.Input((1, ), batch_size=batch_size, name='end_week_day_cos')
  }

  # Numeric sequence features, later concatenated with the place embeddings
  num_features = []
  feature_inputs = {}
  # Handling numerical columns
  for header in mh.numerical_columns_names:
    # Named feature_input to avoid shadowing the built-in input()
    feature_input = tf.keras.Input((N-1,), dtype=tf.dtypes.float32, batch_size=batch_size, name=header)
    feature_inputs[header] = feature_input
    f = feature_column.sequence_numeric_column(header, shape=(N-1), dtype=tf.dtypes.float32)
    # SequenceFeatures is fed a sparse tensor here, so convert the dense Keras input first
    sparse_input = tf.keras.layers.Lambda(sparse_f)(feature_input)
    feature, sequence_length = tf.keras.experimental.SequenceFeatures(f)({header: sparse_input})
    # Reshape to (batch, N-1, 1) so the feature can be concatenated along axis 2
    feature = tf.reshape(feature, (sparse_input.shape[0], sparse_input.shape[1], 1))
    num_features.append(feature)

  end_hour_sin = feature_column.numeric_column("end_hour_sin", shape=(1))
  end_hour_sin_feature = l.DenseFeatures(end_hour_sin)(other_feature_inputs)

  end_hour_cos = feature_column.numeric_column("end_hour_cos", shape=(1))
  end_hour_cos_feature = l.DenseFeatures(end_hour_cos)(other_feature_inputs)

  end_weekend = feature_column.numeric_column("end_weekend", shape=(1))
  end_weekend_feature = l.DenseFeatures(end_weekend)(other_feature_inputs)

  end_week_day_sin = feature_column.numeric_column("end_week_day_sin", shape=(1))
  end_week_day_sin_feature = l.DenseFeatures(end_week_day_sin)(other_feature_inputs)

  end_week_day_cos = feature_column.numeric_column("end_week_day_cos", shape=(1))
  end_week_day_cos_feature = l.DenseFeatures(end_week_day_cos)(other_feature_inputs)


  # Declare the dictionary for the places sequence as before
  sequence_input = {
      'start_place': tf.keras.Input((N-1,), batch_size=batch_size, dtype=tf.dtypes.int32, name='start_place') # add batch_size=batch_size in case of stateful GRU
  }


  # Handling the categorical feature sequence using one-hot
  places_one_hot = feature_column.sequence_categorical_column_with_vocabulary_list(
      'start_place', [i for i in range(number_of_places)])

  # Embed the one-hot encoding
  places_embed = feature_column.embedding_column(places_one_hot, embedding_dim)


  # With a sequence input we can't use the DenseFeatures layer; we need SequenceFeatures instead
  sequence_features, sequence_length = tf.keras.experimental.SequenceFeatures(places_embed)(sequence_input)

  input_sequence = l.Concatenate(axis=2)([sequence_features] + num_features)

  # RNN
  recurrent = l.GRU(rnn_units1,
                        batch_size=batch_size, #in case of stateful
                        return_sequences=True,
                        stateful=True,
                        recurrent_initializer='glorot_uniform')(input_sequence)

  recurrent_2 = l.GRU(rnn_units2,
                        batch_size=batch_size, #in case of stateful
                        stateful=True,
                        recurrent_initializer='glorot_uniform')(recurrent)


  flatten = l.Flatten()(recurrent_2)

  concatenate_2 = l.Concatenate(axis=1)([flatten, end_hour_sin_feature, end_hour_cos_feature, end_weekend_feature, end_week_day_sin_feature, end_week_day_cos_feature])

  # Last layer with one output per place
  dense_1 = layers.Dense(number_of_places)(concatenate_2)

  # Softmax output layer
  output = l.Softmax()(dense_1)

  # To return the Model, we need to define its inputs and outputs
  # In our case, we need to list all the input layers we have defined
  inputs = list(feature_inputs.values()) + list(sequence_input.values()) + list(other_feature_inputs.values())

  # Return the Model
  return tf.keras.Model(inputs=inputs, outputs=output)
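
To get a feeling for the size of this architecture, the trainable parameters of its main layers can be estimated by hand. The sketch below assumes the Keras GRU default `reset_after=True` (two bias vectors per gate), a vocabulary of 264 places (the width of the prediction output shown later in this notebook), and a first GRU input of 256 embedding dimensions plus 5 numeric sequence features. The function names are hypothetical, and the numbers are an estimate, not the output of `model.summary()`.

```python
def gru_params(input_dim, units):
    """Parameters of a Keras GRU layer, assuming reset_after=True.

    Three gates, each with an input kernel, a recurrent kernel and two biases.
    """
    return 3 * units * (input_dim + units + 2)

def dense_params(input_dim, units):
    """Weights plus biases of a fully connected layer."""
    return (input_dim + 1) * units

vocab_size = 264       # assumption: taken from the prediction shape (21248, 264)
embedding_dim = 256
num_numeric = 5        # start_hour_sin, start_hour_cos, weekend, week_day_sin, week_day_cos
num_end_features = 5   # end_hour_sin, end_hour_cos, end_weekend, end_week_day_sin, end_week_day_cos

estimate = {
    'embedding': vocab_size * embedding_dim,
    'gru_1': gru_params(embedding_dim + num_numeric, 256),
    'gru_2': gru_params(256, 128),
    'dense': dense_params(128 + num_end_features, vocab_size),
}
for name, count in estimate.items():
    print(name, count)
print('total', sum(estimate.values()))
```

Under these assumptions the first GRU layer dominates the parameter count, while the five numeric sequence features add only a small share to it.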

In [19]:
# Get the model and compile it
mh.assign_model(create_model())
mh.compile_model()

# Training

In [20]:
mh.set_num_epochs(10)
mh.fit_model()

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 00008: early stopping


# Evaluation

In [21]:
mh.evaluate_model()



In [22]:
mh.model.summary()

Model: "functional_5"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
start_hour_sin (InputLayer)     [(128, 128)]         0                                            
__________________________________________________________________________________________________
start_hour_cos (InputLayer)     [(128, 128)]         0                                            
__________________________________________________________________________________________________
weekend (InputLayer)            [(128, 128)]         0                                            
__________________________________________________________________________________________________
week_day_sin (InputLayer)       [(128, 128)]         0                                            
_______________________________________________________________________________________

In [23]:
mh.print_test_prediction_info()

logits
Shape :  (21248, 264)
Example [0] :  [7.20760610e-04 4.23889520e-04 5.51475114e-06 1.27672483e-05
 3.64277558e-03 3.29216709e-06 3.67066082e-06 4.00454924e-03
 1.07155211e-04 5.22390183e-05 1.89134953e-04 9.41245671e-05
 4.05823404e-04 8.94040801e-03 1.97955189e-04 1.22425627e-04
 5.11538292e-06 2.90986936e-04 4.24823738e-05 1.17550117e-05
 6.44030570e-06 2.06827171e-05 1.57609000e-04 1.45997601e-05
 3.84668889e-03 1.13789400e-03 3.86439897e-05 3.52099573e-06
 2.99521457e-06 2.58960372e-05 1.15756811e-05 1.62377182e-05
 1.80514890e-05 6.98839664e-04 1.05963947e-04 2.26659431e-05
 4.28660685e-04 1.17776124e-03 4.23331803e-05 3.76595963e-05
 1.77694819e-04 7.30000390e-03 4.52563306e-03 6.69823587e-03
 7.40577980e-06 1.08051172e-03 2.67281685e-05 7.30247702e-05
 6.69471398e-02 6.06301415e-04 2.58190446e-02 2.20787479e-05
 5.41590620e-04 1.37046605e-04 1.60067706e-04 4.99217713e-05
 3.34307369e-05 4.42263172e-06 1.88128524e-05 6.63368837e-06
 2.35647149e-05 2.76715029e-04 3.82596889
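
Each row of this output holds one value per place. Although the output is labelled `logits`, the values appear to be the probabilities produced by the final softmax layer. Turning such a row into a ranked list of candidate classes can be sketched as follows. `top_k_places` is a hypothetical helper, and the toy vector is made up for illustration; it is not taken from the model output.

```python
def top_k_places(probabilities, k=3):
    """Return the k class indices with the highest probability, best first."""
    ranked = sorted(range(len(probabilities)),
                    key=lambda i: probabilities[i],
                    reverse=True)
    return ranked[:k]

# Made-up distribution over six classes
toy_probabilities = [0.05, 0.10, 0.50, 0.05, 0.25, 0.05]
print(top_k_places(toy_probabilities, k=3))
```

Looking at the top few candidates instead of only the single best one gives a more forgiving view of prediction quality when many neighbouring zones have similar probabilities.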

Using SequenceFeatures for the time components resulted in only a marginal increase in performance.
This is because the model mostly learns to predict that the taxis remain in the same area.
On other datasets, this model is expected to achieve the best prediction quality.

When aiming for efficiency or reduced model complexity, it is advisable to base the prediction only on:
    * sequences of prior locations
    * end time