<a href="https://colab.research.google.com/github/vanderbilt-ml/50-nelson-mlproj-waittime/blob/assignment-4/wait_time_prediction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Wait Time Prediction


## Background

Recently, while planning an upcoming vacation, I discovered that a company called Touringplans (touringplans.com) has many publicly available data sets with captured wait times for attractions at Walt Disney World in Florida, dating back to 2015. I'm intrigued by this data and interested in building a predictive model that uses the historical wait time data to help forecast future wait times.

## Project Description

Using the captured historical wait time data, I would like to create a predictive model that will help me understand future wait times for attractions at Walt Disney World in Florida.

The following columns represent my core data:


*   Date: The date the data was captured
*   DateTime: The date and time the data was captured
*   SActMin: The actual wait time at the given datetime (if captured)
*   SPostMin: The posted wait time at the given datetime



The metadata.csv file provides a lot of relevant information for each date on which data was collected. I can use it by joining metadata.csv to the sample data on the DATE column. The file includes important fields such as:

*   DayOfWeek
*   DayOfYear
*   WeekOfYear
*   MonthOfYear
*   Season
*   MaxTemp
*   MinTemp
*   MeanTemp



## Performance Metric
Given the abundance of available data, I expect to be able to split it into training and testing sets. I would like to create a predictive model with accuracy somewhere in the 80-90% range. At this point, however, I do not know whether that is achievable.
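
Because the wait time is a number of minutes rather than a class label, one way to make the target measurable is to report error in minutes alongside a "close enough" rate. The sketch below is only an illustration: the arrays are made-up values, and the 5-minute tolerance is an assumed threshold, not something taken from the data.

```python
import numpy as np

# Hypothetical actual and predicted wait times in minutes (illustrative values only).
y_true = np.array([10.0, 20.0, 35.0, 60.0])
y_pred = np.array([12.0, 18.0, 45.0, 55.0])

errors = y_pred - y_true
rmse = float(np.sqrt(np.mean(errors ** 2)))        # penalizes large misses more heavily
mae = float(np.mean(np.abs(errors)))               # average miss in minutes
within_5 = float(np.mean(np.abs(errors) <= 5.0))   # share of predictions within 5 minutes (assumed tolerance)

print(f"RMSE: {rmse:.2f}  MAE: {mae:.2f}  within 5 min: {within_5:.0%}")
```

A rate like `within_5` is probably the closest regression analogue to the accuracy range described above.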

## Required Imports

In [41]:
#tables and visualizations
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

#machine learning
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline 
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, LabelBinarizer, StandardScaler
from sklearn import config_context
from sklearn.metrics import classification_report, confusion_matrix, ConfusionMatrixDisplay, roc_curve, roc_auc_score

## Load Data

The metadata is stored in a separate file, so I load both the wait time data and the metadata and then combine them.

In [None]:
wait_time_raw_data = pd.read_csv('https://raw.githubusercontent.com/vanderbilt-ml/50-nelson-mlproj-waittime/assignment-4/big_thunder_mtn.csv')
metadata = pd.read_csv('https://raw.githubusercontent.com/vanderbilt-ml/50-nelson-mlproj-waittime/main/provided_data/metadata.csv')
# To minimize training time for now I've limited the number of metadata columns I'm using to just the following:
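metadata = metadata[['DATE', 'DAYOFWEEK', 'DAYOFYEAR', 'WEEKOFYEAR', 'MONTHOFYEAR', 'SEASON']]
# Rename DATE to match the 'date' column in the wait time data; reassigning avoids modifying a slice in place
metadata = metadata.rename(columns={'DATE': 'date'})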
wait_time_data = pd.merge(wait_time_raw_data, metadata, on ='date')
# Currently having some issues with datetime objects during training; here are some of my attempts to remedy the issue
# wait_time_data['date'] = pd.to_datetime(wait_time_data['date'])
# wait_time_data['datetime'] = pd.to_datetime(wait_time_data['datetime'])
# wait_time_data['datetime'] = np.Timestamp(np.datetime64(wait_time_data['datetime']))
# wait_time_data['datetime'] = wait_time_data['datetime'].values.astype('datetime64[D]')
# wait_time_data['date'] = wait_time_data['date'].values.astype('datetime64[D]')
print(wait_time_data.shape)
print(wait_time_data.head())
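
`pd.merge` defaults to an inner join, so any wait time row whose date is missing from the metadata would be dropped without notice. A minimal sketch of a more defensive join is shown below, using toy frames with made-up values in place of the real files. The `validate` and `indicator` arguments are standard pandas merge parameters; the frame names `waits` and `meta` are placeholders.

```python
import pandas as pd

# Toy stand-ins for the wait time rows and the per-date metadata (values are made up).
waits = pd.DataFrame({
    "date": ["2015-01-01", "2015-01-01", "2015-01-02", "2015-01-03"],
    "SPOSTMIN": [5.0, 15.0, 20.0, 10.0],
})
meta = pd.DataFrame({
    "DATE": ["2015-01-01", "2015-01-02"],
    "SEASON": ["CHRISTMAS PEAK", "CHRISTMAS"],
})

merged = waits.merge(
    meta.rename(columns={"DATE": "date"}),
    on="date",
    how="left",               # keep every wait time row, even without metadata
    validate="many_to_one",   # raise if a date appears more than once in the metadata
    indicator=True,           # adds a _merge column: 'both' or 'left_only'
)

unmatched = int((merged["_merge"] == "left_only").sum())
print(merged.shape, "rows without metadata:", unmatched)
```

Checking `unmatched` before dropping the `_merge` column is a cheap way to confirm that the join kept every row.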

## Data Cleaning and Validation

In [70]:
wait_time_data.isna().sum()


date                0
datetime            0
SACTMIN        260224
SPOSTMIN         8745
DAYOFWEEK           0
DAYOFYEAR           0
WEEKOFYEAR          0
MONTHOFYEAR         0
SEASON          30586
dtype: int64

In [71]:
wait_time_data.shape

(268969, 9)

Many rows have -999 entered as their SPOSTMIN value. I'll go ahead and drop those.

In [72]:
wait_time_data = wait_time_data[wait_time_data.SPOSTMIN != -999]
print(wait_time_data.shape)

(246931, 9)
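
One detail of this filter is worth spelling out: a comparison such as `NaN != -999` evaluates to `True`, so rows that have only an SACTMIN value (and a missing SPOSTMIN) are kept. The sketch below shows this on a toy frame with made-up values.

```python
import numpy as np
import pandas as pd

# Toy frame with one measured wait, two posted waits, and one -999 sentinel.
toy = pd.DataFrame({
    "SACTMIN": [np.nan, 37.0, np.nan, np.nan],
    "SPOSTMIN": [20.0, np.nan, -999.0, 15.0],
})

# NaN != -999 evaluates to True, so the row with only SACTMIN survives the filter.
kept = toy[toy["SPOSTMIN"] != -999]

sentinel_rows = int((toy["SPOSTMIN"] == -999).sum())
print(len(toy), "->", len(kept), "rows; sentinel rows removed:", sentinel_rows)
```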


The SACTMIN and SPOSTMIN entries are mutually exclusive, meaning that each row has data in only one of the two columns. The SACTMIN data should be more valuable than the SPOSTMIN data; I'm not sure yet how to handle this, so I'll leave them as-is for now.

Drop any columns that are completely empty.

In [73]:
# Reassign instead of using inplace=True, since wait_time_data is a filtered slice
wait_time_data = wait_time_data.dropna(how='all', axis=1)
display(wait_time_data)

Unnamed: 0,date,datetime,SACTMIN,SPOSTMIN,DAYOFWEEK,DAYOFYEAR,WEEKOFYEAR,MONTHOFYEAR,SEASON
0,2015-01-01,2015-01-01 08:02:13,,5.0,5,0,0,1,CHRISTMAS PEAK
1,2015-01-01,2015-01-01 08:09:12,,15.0,5,0,0,1,CHRISTMAS PEAK
2,2015-01-01,2015-01-01 08:16:12,,20.0,5,0,0,1,CHRISTMAS PEAK
3,2015-01-01,2015-01-01 08:23:12,,20.0,5,0,0,1,CHRISTMAS PEAK
4,2015-01-01,2015-01-01 08:23:53,,20.0,5,0,0,1,CHRISTMAS PEAK
...,...,...,...,...,...,...,...,...,...
268962,2021-08-31,2021-08-31 20:32:54,,10.0,3,242,35,8,
268963,2021-08-31,2021-08-31 20:40:13,,10.0,3,242,35,8,
268964,2021-08-31,2021-08-31 20:47:24,,10.0,3,242,35,8,
268965,2021-08-31,2021-08-31 20:54:12,,10.0,3,242,35,8,


## Feature Engineering

For now, given the mutually exclusive relationship between SACTMIN and SPOSTMIN, I am going to collapse them into one column. SACTMIN represents a human-captured wait time (someone stood in line and recorded how long they waited), while SPOSTMIN is the posted wait time. In my opinion this makes the SACTMIN data more valuable, but given the small percentage of rows that contain SACTMIN data, I'm not sure what other approach to take at this point.
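
One possible way to collapse the columns without losing track of which kind of measurement each row holds is sketched below on a toy frame with made-up values. The `is_actual` flag is a hypothetical extra column, not part of the original data.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "SACTMIN": [np.nan, 37.0, np.nan],
    "SPOSTMIN": [5.0, np.nan, 20.0],
})

# Prefer the actual wait when present, otherwise fall back to the posted wait.
toy["wait"] = toy["SACTMIN"].fillna(toy["SPOSTMIN"])

# Hypothetical extra column: 1 when the value is a measured wait, 0 when it is posted.
toy["is_actual"] = toy["SACTMIN"].notna().astype(int)

print(toy)
```

Keeping a flag like this would let a later model treat measured and posted waits differently if that turns out to matter.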

In [74]:
wait_time_data[wait_time_data["SACTMIN"].notna()].head()

Unnamed: 0,date,datetime,SACTMIN,SPOSTMIN,DAYOFWEEK,DAYOFYEAR,WEEKOFYEAR,MONTHOFYEAR,SEASON
63,2015-01-01,2015-01-01 14:55:16,37.0,,5,0,0,1,CHRISTMAS PEAK
142,2015-01-02,2015-01-02 08:40:32,3.0,,6,1,0,1,CHRISTMAS
152,2015-01-02,2015-01-02 09:30:53,35.0,,6,1,0,1,CHRISTMAS
160,2015-01-02,2015-01-02 10:16:26,47.0,,6,1,0,1,CHRISTMAS
190,2015-01-02,2015-01-02 13:16:31,54.0,,6,1,0,1,CHRISTMAS


In [75]:
wait_time_data['wait'] = pd.to_numeric(wait_time_data[['SACTMIN', 'SPOSTMIN']].bfill(axis=1).iloc[:, 0])
wait_time_data[wait_time_data["SACTMIN"].notna()].head()

Unnamed: 0,date,datetime,SACTMIN,SPOSTMIN,DAYOFWEEK,DAYOFYEAR,WEEKOFYEAR,MONTHOFYEAR,SEASON,wait
63,2015-01-01,2015-01-01 14:55:16,37.0,,5,0,0,1,CHRISTMAS PEAK,37.0
142,2015-01-02,2015-01-02 08:40:32,3.0,,6,1,0,1,CHRISTMAS,3.0
152,2015-01-02,2015-01-02 09:30:53,35.0,,6,1,0,1,CHRISTMAS,35.0
160,2015-01-02,2015-01-02 10:16:26,47.0,,6,1,0,1,CHRISTMAS,47.0
190,2015-01-02,2015-01-02 13:16:31,54.0,,6,1,0,1,CHRISTMAS,54.0


In [76]:
wait_time_data = wait_time_data.drop(columns=['SACTMIN', 'SPOSTMIN'])
wait_time_data.head()

Unnamed: 0,date,datetime,DAYOFWEEK,DAYOFYEAR,WEEKOFYEAR,MONTHOFYEAR,SEASON,wait
0,2015-01-01,2015-01-01 08:02:13,5,0,0,1,CHRISTMAS PEAK,5.0
1,2015-01-01,2015-01-01 08:09:12,5,0,0,1,CHRISTMAS PEAK,15.0
2,2015-01-01,2015-01-01 08:16:12,5,0,0,1,CHRISTMAS PEAK,20.0
3,2015-01-01,2015-01-01 08:23:12,5,0,0,1,CHRISTMAS PEAK,20.0
4,2015-01-01,2015-01-01 08:23:53,5,0,0,1,CHRISTMAS PEAK,20.0


## Test Train Split

In [77]:
wait_time_data = wait_time_data.dropna(subset=['wait'])
wait_time_data.shape

(246931, 8)

In [78]:
class_column = 'wait'
random_seed = 2435

# Work with only the first 5,000 rows for now
wait_time_data = wait_time_data[:5000]

X_train, X_test, y_train, y_test = train_test_split(wait_time_data.drop(columns=class_column), wait_time_data[class_column],
                                                    test_size=0.25, random_state=random_seed)#, stratify=wait_time_data[class_column])

In [58]:
wait_time_data.shape

(5000, 8)
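
Slicing with `[:5000]` keeps the first 5,000 rows, and the rows are in chronological order, so the subset only covers the start of 2015. If a smaller working set is still wanted, drawing it at random is one alternative. The sketch below illustrates the difference on a synthetic frame; the sizes and values are made up.

```python
import pandas as pd

# Toy frame standing in for wait_time_data: one row per day across two years.
toy = pd.DataFrame({"date": pd.date_range("2015-01-01", periods=730, freq="D")})
toy["wait"] = range(len(toy))

head_subset = toy[:100]                                # first 100 rows only
random_subset = toy.sample(n=100, random_state=2435)   # 100 rows drawn from the whole range

print("head covers months:  ", sorted(head_subset["date"].dt.month.unique()))
print("sample covers months:", sorted(random_subset["date"].dt.month.unique()))
```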

In [79]:
# X Train
print('On X train: ')
print('X train dimensions: ', X_train.shape)
display(X_train.head())

# X test
print('\nOn X test: ')
print('X test dimensions: ', X_test.shape)
display(X_test.head())

On X train: 
X train dimensions:  (3750, 7)


Unnamed: 0,date,datetime,DAYOFWEEK,DAYOFYEAR,WEEKOFYEAR,MONTHOFYEAR,SEASON
4606,2015-02-01,2015-02-01 16:25:06,1,31,5,2,WINTER
163,2015-01-02,2015-01-02 10:37:12,6,1,0,1,CHRISTMAS
4196,2015-01-29,2015-01-29 22:30:05,5,28,4,1,WINTER
4821,2015-02-03,2015-02-03 19:45:05,3,33,5,2,WINTER
3265,2015-01-23,2015-01-23 11:45:05,6,22,3,1,MARTIN LUTHER KING JUNIOR DAY



On X test: 
X test dimensions:  (1250, 7)


Unnamed: 0,date,datetime,DAYOFWEEK,DAYOFYEAR,WEEKOFYEAR,MONTHOFYEAR,SEASON
4993,2015-02-04,2015-02-04 20:30:05,4,34,5,2,WINTER
2204,2015-01-16,2015-01-16 23:00:06,6,15,2,1,MARTIN LUTHER KING JUNIOR DAY
3903,2015-01-27,2015-01-27 12:22:18,3,26,4,1,WINTER
279,2015-01-02,2015-01-02 23:09:12,6,1,0,1,CHRISTMAS
272,2015-01-02,2015-01-02 22:23:12,6,1,0,1,CHRISTMAS


In [80]:
# Y Train
print('On y train: ')
print('y train dimensions: ', y_train.shape)
display(y_train.head())

# Y test
print('\nOn y test: ')
print('y test dimensions: ', y_test.shape)
display(y_test.head())

On y train: 
y train dimensions:  (3750,)


4606    40.0
163     60.0
4196    15.0
4821    10.0
3265    20.0
Name: wait, dtype: float64


On y test: 
y test dimensions:  (1250,)


4993    10.0
2204    10.0
3903    18.0
279     55.0
272     55.0
Name: wait, dtype: float64

In [81]:
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import  RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# create an object of the LinearRegression Model
model_LR = LinearRegression()

# fit the model with the training data
# Running into issues training the simple linear regression model because of datetime objects
model_LR.fit(X_train, y_train)

# predict the target on train and test data
predict_train = model_LR.predict(X_train)
predict_test  = model_LR.predict(X_test)

# Root Mean Squared Error on train and test data
print('RMSE on train data: ', mean_squared_error(y_train, predict_train)**(0.5))
print('RMSE on test data: ',  mean_squared_error(y_test, predict_test)**(0.5))

TypeError: ignored
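
`LinearRegression` expects numeric inputs, so the string `date` and `datetime` columns have to be converted or dropped before fitting. One possible approach, sketched below on synthetic data, is to parse the timestamp and keep numeric parts of it as features. The `hour` and `minute` column names are hypothetical, and the wait values are random, so the printed errors say nothing about the real data.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in for wait_time_data; the values are made up for illustration.
rng = np.random.default_rng(2435)
stamps = pd.date_range("2015-01-01 08:00", periods=400, freq=pd.Timedelta(minutes=95))
toy = pd.DataFrame({
    "date": stamps.strftime("%Y-%m-%d"),
    "datetime": stamps.strftime("%Y-%m-%d %H:%M:%S"),
    "DAYOFWEEK": stamps.dayofweek,
    "wait": rng.integers(5, 60, size=len(stamps)).astype(float),
})

# Parse the string timestamp and derive numeric features from it.
parsed = pd.to_datetime(toy["datetime"])
toy["hour"] = parsed.dt.hour
toy["minute"] = parsed.dt.minute

# Keep only numeric predictors; the raw date strings are dropped.
X = toy.drop(columns=["wait", "date", "datetime"])
y = toy["wait"]
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=2435)

model = LinearRegression().fit(X_tr, y_tr)
rmse_train = mean_squared_error(y_tr, model.predict(X_tr)) ** 0.5
rmse_test = mean_squared_error(y_te, model.predict(X_te)) ** 0.5
print("RMSE on train data:", rmse_train)
print("RMSE on test data: ", rmse_test)
```

On the real data, a categorical column such as SEASON would also need encoding or dropping before this model could be fit.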

In [82]:
#individual pipelines for differing datatypes
cat_pipeline = Pipeline(steps=[('cat_impute', SimpleImputer(missing_values=np.nan, strategy='most_frequent')),
                               ('onehot_cat', OneHotEncoder(handle_unknown='ignore'))])
num_pipeline = Pipeline(steps=[('impute_num', SimpleImputer(missing_values=np.nan, strategy='mean')),
                               ('scale_num', StandardScaler())])

In [83]:
#establish preprocessing pipeline by columns
preproc = ColumnTransformer([('cat_pipe', cat_pipeline, make_column_selector(dtype_include=object)),
                             ('num_pipe', num_pipeline, make_column_selector(dtype_include=np.number))],
                             remainder='passthrough')

In [84]:
#generate the whole modeling pipeline with preprocessing
pipe = Pipeline(steps=[('preproc', preproc),
                       ('mdl', LogisticRegression(penalty='elasticnet', solver='saga', tol=0.01))])

#visualization for steps
with config_context(display='diagram'):
    display(pipe)

## Cross-validation with hyperparameter tuning

In [85]:
tuning_grid = {'mdl__l1_ratio' : np.linspace(0,1,5),
               'mdl__C': np.logspace(-1, 6, 3) }
grid_search = GridSearchCV(pipe, param_grid = tuning_grid, cv = 5, return_train_score=True)

In [86]:
tuning_grid

{'mdl__C': array([1.00000000e-01, 3.16227766e+02, 1.00000000e+06]),
 'mdl__l1_ratio': array([0.  , 0.25, 0.5 , 0.75, 1.  ])}

In [87]:
# Having issues with fitting due to datetime objects
grid_search.fit(X_train, y_train)

75 fits failed out of a total of 75.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
75 fits failed with the following error:
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/dist-packages/sklearn/model_selection/_validation.py", line 680, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/usr/local/lib/python3.7/dist-packages/sklearn/pipeline.py", line 390, in fit
    Xt = self._fit(X, y, **fit_params_steps)
  File "/usr/local/lib/python3.7/dist-packages/sklearn/pipeline.py", line 355, in _fit
    **fit_params_steps[name],
  File "/usr/local/lib/python3.7/dist-packages/joblib/memory.py", line 349, in __call__
    return self.func(*args, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages

TypeError: ignored
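
Separately from the datetime issue, `LogisticRegression` is a classifier, while `wait` is a continuous number of minutes. One possible alternative, which is an assumption on my part and not a tested result, is to keep the same preprocessing but swap in a regressor such as `ElasticNet`. It exposes `l1_ratio` like the grid above and uses `alpha` for regularization strength instead of `C`. The sketch below runs on synthetic data with made-up values, and the grid values are arbitrary.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.impute import SimpleImputer
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for the training data; values are made up for illustration.
rng = np.random.default_rng(2435)
n = 200
season = pd.Series(rng.choice(["WINTER", "CHRISTMAS"], size=n), dtype=object)
season[rng.random(n) < 0.1] = np.nan   # mimic the missing SEASON values
X_toy = pd.DataFrame({
    "hour": rng.integers(8, 23, size=n),
    "DAYOFWEEK": rng.integers(1, 8, size=n),
    "SEASON": season,
})
y_toy = 2.0 * X_toy["hour"] + rng.normal(0, 5, size=n)

cat_pipeline = Pipeline([
    ("cat_impute", SimpleImputer(strategy="most_frequent")),
    ("onehot_cat", OneHotEncoder(handle_unknown="ignore")),
])
num_pipeline = Pipeline([
    ("impute_num", SimpleImputer(strategy="mean")),
    ("scale_num", StandardScaler()),
])
preproc = ColumnTransformer([
    ("cat_pipe", cat_pipeline, make_column_selector(dtype_include=object)),
    ("num_pipe", num_pipeline, make_column_selector(dtype_include=np.number)),
])

# ElasticNet is a regressor; alpha controls the regularization strength.
pipe_reg = Pipeline([("preproc", preproc), ("mdl", ElasticNet(max_iter=10_000))])

grid = GridSearchCV(
    pipe_reg,
    param_grid={"mdl__l1_ratio": [0.1, 0.5, 0.9], "mdl__alpha": [0.01, 0.1, 1.0]},
    cv=5,
    scoring="neg_root_mean_squared_error",
    error_score="raise",   # surface the underlying exception instead of NaN scores
)
grid.fit(X_toy, y_toy)
print(grid.best_params_, -grid.best_score_)
```

Setting `error_score="raise"` makes a failing fit stop with the full exception, which is easier to debug than a table of NaN scores.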

In [None]:
print(grid_search.best_score_)
grid_search.best_params_


In [None]:
pd.DataFrame(grid_search.cv_results_)