<a href="https://colab.research.google.com/github/vanderbilt-ml/50-nelson-mlproj-waittime/blob/assignment-4/wait_time_prediction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Wait Time Prediction


## Background

Recently, when planning an upcoming vacation, I discovered that a company called Touringplans (touringplans.com) has many publicly available data sets with captured wait times for attractions at Walt Disney World in Florida, dating back to 2015. I'm intrigued by this data and am interested in building a predictive model that uses the historical wait time data to help forecast future wait times.

## Project Description

Using the captured historical wait time data, I would like to create a predictive model that helps me understand future wait times for attractions at Walt Disney World in Florida.

The following columns represent my core data:


*   Date: The captured data date
*   DateTime: The captured data datetime
*   SActMin: The actual wait time at the given datetime (if captured)
*   SPostMin: The posted wait time at the given datetime



The metadata.csv file contains a lot of relevant information for each date on which data was collected. I can use it by joining metadata.csv to the sample data on the DATE column. The file includes important fields such as:

*   DayOfWeek
*   DayOfYear
*   WeekOfYear
*   MonthOfYear
*   Season
*   MaxTemp
*   MinTemp
*   MeanTemp
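
The join described above could look roughly like the sketch below. The two small frames are hypothetical stand-ins for the wait-time sample and metadata.csv, the metadata values are made up for illustration, and the assumption that DATE uses the same `MM/DD/YYYY` format as the sample data has not been checked against the real file.

```python
import pandas as pd

# Hypothetical miniature stand-ins for the wait-time sample and metadata.csv.
wait_times = pd.DataFrame({
    "date": ["01/01/2015", "01/01/2015", "01/02/2015"],
    "datetime": ["2015-01-01 08:02:13", "2015-01-01 08:09:12", "2015-01-02 08:05:00"],
    "SPOSTMIN": [5.0, 15.0, 10.0],
})
metadata = pd.DataFrame({
    "DATE": ["01/01/2015", "01/02/2015"],   # assumed format, same as the sample data
    "DayOfWeek": [5, 6],                    # placeholder values
    "Season": ["WINTER", "WINTER"],         # placeholder values
    "MeanTemp": [60.0, 65.5],               # placeholder values
})

# Parse both keys to real dates so the join does not depend on string formatting.
wait_times["date"] = pd.to_datetime(wait_times["date"], format="%m/%d/%Y")
metadata["DATE"] = pd.to_datetime(metadata["DATE"], format="%m/%d/%Y")

# Left join keeps every wait-time row; validate guards against duplicate dates in metadata.
merged = wait_times.merge(
    metadata,
    how="left",
    left_on="date",
    right_on="DATE",
    validate="many_to_one",
).drop(columns="DATE")

print(merged.shape)
```

A left join is used here so that wait-time rows without matching metadata would show up as missing values rather than silently disappearing.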



## Performance Metric
Given the abundance of available data, I expect to be able to split it into training and testing sets. I would like the predictive model to land somewhere in the 80-90% accuracy range. At this point, however, I have no idea whether that is achievable.
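
Because the wait time is a number rather than a class label, "accuracy" needs a concrete definition before it can be measured. The sketch below shows one possible way to do that on synthetic placeholder data: hold out 20% of the rows, then report the mean absolute error and the share of predictions within a tolerance. The 10-minute tolerance, the feature columns, and the mean-value baseline are all assumptions for illustration, not choices made in this project.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic placeholder data; the real features would come from metadata.csv.
demo = pd.DataFrame({
    "DayOfWeek": rng.integers(1, 8, size=200),
    "MeanTemp": rng.uniform(50, 90, size=200),
})
demo["SPOSTMIN"] = rng.integers(1, 13, size=200) * 5.0  # waits in 5-minute steps

X_train, X_test, y_train, y_test = train_test_split(
    demo[["DayOfWeek", "MeanTemp"]],
    demo["SPOSTMIN"],
    test_size=0.2,
    random_state=42,
)

# Naive baseline: always predict the average wait seen in training.
baseline_pred = np.full(len(y_test), y_train.mean())

mae = mean_absolute_error(y_test, baseline_pred)
# Hypothetical "accuracy": fraction of predictions within 10 minutes of the target.
within_10 = float(np.mean(np.abs(y_test - baseline_pred) <= 10))

print(f"MAE: {mae:.1f} min, within 10 min: {within_10:.0%}")
```

Any real model would need to beat a baseline like this one for the target range to be meaningful.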

## Required Imports

In [2]:
#tables and visualizations
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

#machine learning
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline 
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, LabelBinarizer, StandardScaler
from sklearn import config_context
from sklearn.metrics import classification_report, confusion_matrix, ConfusionMatrixDisplay, roc_curve, roc_auc_score

## Load Data

In [23]:
wait_time_data = pd.read_csv('https://raw.githubusercontent.com/vanderbilt-ml/50-nelson-mlproj-waittime/assignment-4/big_thunder_mtn.csv?token=GHSAT0AAAAAABT2AS6GF7J5344ASVQUEGOGYVFF2OQ')
print(wait_time_data.shape)
wait_time_data.head()

(309857, 4)


|   | date | datetime | SACTMIN | SPOSTMIN |
|---|---|---|---|---|
| 0 | 01/01/2015 | 2015-01-01 08:02:13 | NaN | 5.0 |
| 1 | 01/01/2015 | 2015-01-01 08:09:12 | NaN | 15.0 |
| 2 | 01/01/2015 | 2015-01-01 08:16:12 | NaN | 20.0 |
| 3 | 01/01/2015 | 2015-01-01 08:23:12 | NaN | 20.0 |
| 4 | 01/01/2015 | 2015-01-01 08:23:53 | NaN | 20.0 |


## Data Cleaning and Validation

In [24]:
wait_time_data.isna().sum()

date             0
datetime         0
SACTMIN     298127
SPOSTMIN     11730
dtype: int64

Many rows have -999 recorded as their SPOSTMIN value. I'll go ahead and drop those.

In [25]:
wait_time_data = wait_time_data[wait_time_data.SPOSTMIN != -999]
print(wait_time_data.shape)

(286274, 4)
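
The row count falls from 309857 to 286274, so this filter removes 23583 rows. A small sketch on made-up data shows how the sentinel rows can be counted before dropping them, and why rows with a missing SPOSTMIN survive the filter. The `demo` frame and its values are hypothetical.

```python
import numpy as np
import pandas as pd

# Made-up rows: one sentinel value, one row with only an actual wait time.
demo = pd.DataFrame({
    "SACTMIN": [np.nan, 12.0, np.nan, np.nan],
    "SPOSTMIN": [5.0, np.nan, -999.0, 20.0],
})

# Count the sentinel rows first so the size of the drop is known in advance.
n_sentinel = int((demo["SPOSTMIN"] == -999).sum())

# NaN != -999 evaluates to True, so rows with a missing SPOSTMIN are kept.
cleaned = demo[demo["SPOSTMIN"] != -999]

print(n_sentinel, cleaned.shape)
```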


The SACTMIN and SPOSTMIN entries are mutually exclusive, meaning that every row has data in only one of the two columns. The missing-value counts above are consistent with this: 298127 + 11730 = 309857, the total number of rows loaded. The SACTMIN values should be more valuable than the SPOSTMIN values, but I'm not sure yet how I should handle this, so I'll leave them as-is for now.
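
The mutual-exclusivity claim can be checked row by row rather than inferred from totals. The sketch below does this on a made-up frame and also adds a hypothetical `wait_source` column that records which measurement each row carries. That column is only one possible way to keep the distinction available for later, not a decision made in this notebook.

```python
import numpy as np
import pandas as pd

# Made-up rows mirroring the structure of the real data.
demo = pd.DataFrame({
    "SACTMIN": [np.nan, 12.0, np.nan],
    "SPOSTMIN": [5.0, np.nan, 20.0],
})

has_actual = demo["SACTMIN"].notna()
has_posted = demo["SPOSTMIN"].notna()

# XOR is True only when exactly one of the two columns is populated.
exactly_one = bool((has_actual ^ has_posted).all())

# Hypothetical helper column: remember where each row's wait time came from.
demo["wait_source"] = np.where(has_actual, "actual", "posted")

print(exactly_one)
print(demo)
```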