# STEP 6 - Choosing a window length

The previous steps produced a windowing model that performs well even on the NYC Taxi dataset.
The next step is to determine the best value for the window length.

In [7]:
import tensorflow as tf
from tensorflow import feature_column
import pandas as pd
import numpy as np
import import_ipynb

## Testing different sequence_lengths

Several models are trained with different sequence lengths and compared in terms of prediction quality.
The expectations are stated before the tests are run.
Sequences that are too short or too long should perform worse, and intermediate lengths are expected to yield the best results.
Short sequences give the network fewer features to learn patterns from.
Long sequences reduce the number of available rows.
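
The effect of the window length on the number of rows can be checked with simple arithmetic. The sketch below assumes a single contiguous sequence and that incomplete windows are dropped; `count_windows` is a hypothetical helper for illustration and is not part of `ModelHelper`.

```python
def count_windows(num_rows: int, window_length: int, stride: int = 1) -> int:
    """Number of complete windows that fit into a sequence of num_rows entries."""
    if window_length <= 0 or stride <= 0:
        raise ValueError("window_length and stride must be positive")
    if num_rows < window_length:
        return 0  # the sequence is too short for even one window
    return (num_rows - window_length) // stride + 1


# Illustrative numbers only: one sequence of 1,000 rows, stride 1.
for window_length in (2, 9, 129):
    print(window_length, count_windows(1_000, window_length))
```

With a stride of 1, every additional element in the window costs one training row per sequence. If windows may not span several sequences (an assumption about the preprocessing, not something confirmed here), the loss applies to each sequence separately.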

In [8]:
from model_helper import ModelHelper

In [9]:
df = pd.read_csv("./NYC/trips_with_zones_final.csv")

Only the first 10,000,000 rows are used.

In [10]:
df = df.head(10000000)
df.head(10)

| Unnamed: 0 | medallion | pickup_week_day | pickup_hour | pickup_day | pickup_month | dropoff_week_day | dropoff_hour | dropoff_day | dropoff_month | pickup_location_id | dropoff_location_id |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 00005007A9F30E289E760362F69E4EAD | 1 | 0 | 1 | 1 | 1 | 0 | 1 | 1 | 162.0 | 262.0 |
| 1 | 00005007A9F30E289E760362F69E4EAD | 1 | 0 | 1 | 1 | 1 | 0 | 1 | 1 | 262.0 | 239.0 |
| 2 | 00005007A9F30E289E760362F69E4EAD | 1 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 239.0 | 236.0 |
| 3 | 00005007A9F30E289E760362F69E4EAD | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 236.0 | 41.0 |
| 4 | 00005007A9F30E289E760362F69E4EAD | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 41.0 | 211.0 |
| 5 | 00005007A9F30E289E760362F69E4EAD | 1 | 1 | 1 | 1 | 1 | 2 | 1 | 1 | 211.0 | 238.0 |
| 6 | 00005007A9F30E289E760362F69E4EAD | 1 | 2 | 1 | 1 | 1 | 2 | 1 | 1 | 238.0 | 142.0 |
| 7 | 00005007A9F30E289E760362F69E4EAD | 1 | 2 | 1 | 1 | 1 | 2 | 1 | 1 | 142.0 | 263.0 |
| 8 | 00005007A9F30E289E760362F69E4EAD | 1 | 2 | 1 | 1 | 1 | 3 | 1 | 1 | 263.0 | 48.0 |
| 9 | 00005007A9F30E289E760362F69E4EAD | 1 | 3 | 1 | 1 | 1 | 3 | 1 | 1 | 48.0 | 246.0 |


The `create_model` function below is used for all of the following tests.

In [11]:
def create_model(mh):
    EMBEDDING_DIM = 256
    # Declare the dictionary for the places sequence as before
    sequence_input = {
      'location_id': tf.keras.Input((mh.sequence_length,), dtype=tf.dtypes.int32, batch_size=mh.batch_size, name='location_id')
    }

    # Handling the categorical feature sequence using one-hot
    places_one_hot = feature_column.sequence_categorical_column_with_vocabulary_list(
      'location_id', list(range(int(mh.vocab_size))))

    # Embed the one-hot encoding
    places_embed = feature_column.embedding_column(places_one_hot, EMBEDDING_DIM)

    sequence_features, sequence_length = tf.keras.experimental.SequenceFeatures(places_embed)(sequence_input)
    sequence_features = tf.ensure_shape(sequence_features, (mh.batch_size, mh.sequence_length, EMBEDDING_DIM))

    # The input shape is inferred from the incoming tensor, so no input_shape argument is needed.
    gru1 = tf.keras.layers.GRU(256,
                               return_sequences=True,
                               stateful=True,
                               recurrent_initializer='glorot_uniform')(sequence_features)
    gru2 = tf.keras.layers.GRU(64,
                               stateful=True,
                               recurrent_initializer='glorot_uniform')(gru1)

    #drop = tf.keras.layers.Dropout(0.3)(gru2)
    #dense = tf.keras.layers.Dense(number_of_places, activation='softmax')(drop)

    dense = tf.keras.layers.Dense(mh.vocab_size)(gru2)
    output = tf.keras.layers.Softmax()(dense)

    model = tf.keras.Model(inputs=list(sequence_input.values()), outputs=output)
    return model

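`n` is the sequence length; the window passed to `ModelHelper` is always one element longer (`n + 1`).
`sequence_stride` is the number of rows by which the window is shifted between consecutive samples.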
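
The interplay of window length and stride can be illustrated without TensorFlow. The sketch below assumes that the extra element of each window serves as the prediction target, which is a plausible reading of `n + 1` but is not confirmed by the `ModelHelper` code shown here; `make_windows` is a hypothetical helper.

```python
from typing import List, Tuple


def make_windows(values: List[int], n: int, sequence_stride: int) -> List[Tuple[List[int], int]]:
    """Cut values into windows of length n + 1 and split each into (inputs, target)."""
    window_length = n + 1
    samples = []
    # Move the window start by sequence_stride; incomplete windows at the end are dropped.
    for start in range(0, len(values) - window_length + 1, sequence_stride):
        window = values[start:start + window_length]
        # Assumption: the last element of the window is the label to predict.
        samples.append((window[:-1], window[-1]))
    return samples


# Pickup location ids from the first ten rows of the table above.
locations = [162, 262, 239, 236, 41, 211, 238, 142, 263, 48]
for inputs, target in make_windows(locations, n=3, sequence_stride=2):
    print(inputs, "->", target)
```

A stride smaller than the window length produces overlapping windows, so neighbouring samples share most of their elements. A larger stride yields fewer, less redundant samples.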

In [12]:
def run_model_helper_for_n(n, sequence_stride):
    mh = ModelHelper(df, n+1)
    mh.df_to_location_sequence()
    mh.set_target_column_name('location_id')
    mh.set_vocab_size()
    mh.basic_split_df()
    mh.drop_all_but_target()
    BATCH_SIZE = 128
    mh.set_batch_size(BATCH_SIZE)
    mh.set_window_generator(['location_id'])
    mh.make_windowed_dataset(sequence_stride)
    mh.assign_model(create_model(mh))
    mh.set_num_epochs(5)
    mh.compile_model(optimizer_type=tf.keras.optimizers.Adam, learning_rate=0.002)
    mh.fit_model(with_early_stopping=False)
    mh.evaluate_model()

In [10]:
run_model_helper_for_n(1,1)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


In [12]:
run_model_helper_for_n(8,3)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


In [13]:
run_model_helper_for_n(128,43)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


The accuracy increases with larger window sizes, but not as much as expected.
The very short window does not behave as expected either.
Even with a window size of 2, the accuracy is already high compared to the window size of 129.
Why the accuracy is this high with so little information still needs to be investigated.
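
One possible starting point for that investigation is a simple frequency baseline: always predict the location that most often followed the current one in the training data. This is offered as an assumption about how the question could be approached, not as the method used in this notebook. The helper names and the toy sequences are hypothetical. If such a baseline already scores close to the window-size-2 model, most of the predictive signal sits in the immediately preceding location.

```python
from collections import Counter, defaultdict
from typing import Dict, List


def fit_most_frequent_successor(sequence: List[int]) -> Dict[int, int]:
    """Map each location to the location that most often follows it."""
    counts = defaultdict(Counter)
    for current, following in zip(sequence, sequence[1:]):
        counts[current][following] += 1
    return {current: c.most_common(1)[0][0] for current, c in counts.items()}


def baseline_accuracy(table: Dict[int, int], sequence: List[int]) -> float:
    """Share of transitions in sequence that the lookup table predicts correctly."""
    pairs = list(zip(sequence, sequence[1:]))
    if not pairs:
        return 0.0
    hits = sum(1 for current, following in pairs if table.get(current) == following)
    return hits / len(pairs)


# Toy data for illustration only, not taken from the taxi dataset.
train = [1, 2, 1, 2, 1, 3, 1, 2]
test = [1, 2, 1, 3]
table = fit_most_frequent_successor(train)
print(table)
print(baseline_accuracy(table, test))
```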
