## Hotel Booking Model

---
#### *Introduction*:
The dataset can be found at https://www.kaggle.com/jessemostipak/hotel-booking-demand

This data set contains booking information for a city hotel and a resort hotel, and includes information such as when the booking was made, length of stay, the number of adults, children, and/or babies, and the number of available parking spaces, among other things.

Per the description for the dataset listed above, an exploratory data analysis will be performed, then a model will be built to predict whether a reservation will be canceled.  This could be a useful tool for hotels and resorts when predicting or forecasting profits.


---
### EDA:
First we will import the libraries needed for the analysis.  Some libraries may be imported later to support packages that were not initially thought to be needed.

In [None]:
# load libraries
import pandas as pd
import numpy as np
import csv
import time
from datetime import datetime  # do not import datetime.time here: it would shadow the time module imported above
from plotnine import *
from mizani.breaks import date_breaks
from dfply import *
import seaborn as sns
import pprint as p

In [None]:
# load data set using pandas csv reader
hotel_data = pd.read_csv('hotel_bookings.csv')

#### View Basic Attributes of Data:

In [None]:
# View first 5 rows of data
hotel_data.head()

In [None]:
# how many rows of data and how many variables?
variables = hotel_data.shape[1]
rows = hotel_data.shape[0]

print("There are {:d} variables with {:d} rows in this dataset\n".format(variables, rows))

This data set has 32 variables and 119,390 rows.  It contains many categorical variables mixed with dates.  An interesting metric it tracks is the number of special requests.  Who knew hotels and resorts kept track of such things?

---
What is the date range for reservations?

In [None]:
# convert reservation_status_date into datetime type
date_temp = pd.to_datetime(hotel_data.reservation_status_date, format = '%Y-%m-%d')

#display(date_temp)

In [None]:
# overwrite date time column to proper data type
hotel_data['reservation_status_date'] = date_temp

#display(hotel_data.reservation_status_date.dt.strftime("%B"))

In [None]:
# determine date range
min_res_date = hotel_data.reservation_status_date.min()
max_res_date = hotel_data.reservation_status_date.max()
print("min date: {}\nmax date: {}".format(min_res_date, max_res_date))

It appears that this data spans from 2014 to 2017.

---

### Visualizations:

Let's explore the data with a couple of visualizations that may answer some interesting questions.  How does the number of reservations trend on a monthly basis?

In [None]:
# visualize results:

# create month list
month_list = ('January', 'February', 'March', 'April', 'May', 'June', 'July', 'August', 'September', 'October', 'November', 'December')

# extract time data out of column and convert to proper format
hotel_data['res_status_month'] = hotel_data.reservation_status_date.dt.strftime("%B")
hotel_data['res_status_year'] = hotel_data.reservation_status_date.dt.year
hotel_data['res_status_year'] = hotel_data['res_status_year'].astype("str")

# create the visualization
hotel_hist = (ggplot(hotel_data, aes(x='res_status_month', fill='res_status_year'))+
              geom_bar(stat="count", alpha=0.8)+
              scale_x_discrete(limits = month_list)+
              labs(x='Reservation Month', y='Number of Reservations', fill='Year of Res.', title='Reservations by Month')+
              theme_minimal()+
              theme(axis_text_x=element_text(rotation=45, hjust=1))
             )

hotel_hist

The plot above shows the number of hotel and resort reservations, with color encoding the year of the reservation.  With this encoding, we can see that there are almost no reservations made in 2014 and no reservations made from October on in 2017.  There are also not many reservations made in the early months of 2015.  It's very important to note that the reason for these differences is unknown and we can only speculate.  Perhaps the hotels started collecting data in 2014 but didn't start regularly collecting data until July 2015.  In the next plot, we will normalize the column height to enable more accurate comparisons across months.

In [None]:
# normalize height of bars
hotel_hist_normalized = (ggplot(hotel_data, aes(x='res_status_month', fill='res_status_year'))+
              geom_bar(stat="count", alpha=0.8, position='fill')+
              scale_x_discrete(limits = month_list)+
              labs(x='Reservation Month', y='Percent of Reservations', fill='Year of Res.', title='Reservations by Month')+
              theme_minimal()+
              theme(axis_text_x=element_text(rotation=45, hjust=1))
             )

hotel_hist_normalized

Normalizing the column height makes it easier to compare each year's share of reservations across months.
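To make explicit what `position='fill'` does to the bars, here is a minimal sketch using made-up rows (the values are illustrative and not taken from the dataset): within each month, every year's count is divided by that month's total, so the shares in a month sum to 1.

```python
from collections import Counter

# Hypothetical (month, year) pairs standing in for reservation rows.
rows = [
    ("January", "2016"), ("January", "2016"), ("January", "2017"),
    ("July", "2015"), ("July", "2016"), ("July", "2016"), ("July", "2017"),
]

counts = Counter(rows)                               # reservations per (month, year)
month_totals = Counter(month for month, _ in rows)   # reservations per month

# Share of each month's reservations contributed by each year.
shares = {
    (month, year): n / month_totals[month]
    for (month, year), n in counts.items()
}

for (month, year), share in sorted(shares.items()):
    print(f"{month} {year}: {share:.2f}")
```

Because every bar is rescaled to the same height, the normalized plot shows composition only; the raw-count plot is still needed to compare volumes.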

---

In [None]:
# Plot city vs resort hotel reservation count
hotel_data['is_canceled'] = hotel_data['is_canceled'].astype('str')

c_vs_h = (ggplot(hotel_data, aes(x='hotel', fill='is_canceled'))+
          geom_bar(alpha=0.8)+
          geom_text(aes(label='stat(count)'),
                        position =position_stack(vjust=0.5),
                        stat='count',
                        size=12, 
                        va='top',
                        format_string='{}')+
          labs(x='Hotel Type', y='Number of Reservations', fill='Canceled(0=No)', title='Reservations by Hotel Type')+
          theme_minimal()
         )
c_vs_h

The figure above shows the distribution of reservations by hotel type.  We can see that the City Hotel has about twice as many reservations as the Resort Hotel.  It also looks like the City Hotel has a higher percentage of cancelations.

In [None]:
# Calculate percent of cancelations for the total, then resort and city hotels
# is_canceled was cast to str for plotting above, so cast it back to int before summing
is_canceled_int = hotel_data['is_canceled'].astype(int)
total_cancel = is_canceled_int.sum()
resort_cancel = is_canceled_int[hotel_data['hotel'] == "Resort Hotel"].sum()
city_cancel = is_canceled_int[hotel_data['hotel'] == "City Hotel"].sum()

total_canc_percent = (total_cancel/hotel_data.shape[0])*100
resort_canc_percent = (resort_cancel/hotel_data.loc[hotel_data['hotel'] == "Resort Hotel"].shape[0])*100
city_canc_percent = (city_cancel/hotel_data.loc[hotel_data['hotel'] == "City Hotel"].shape[0])*100

print(f"Total bookings canceled: {total_cancel:,} ({total_canc_percent:.0f} %)")
print(f"Resort hotel bookings canceled: {resort_cancel:,} ({resort_canc_percent:.0f} %)")
print(f"City hotel bookings canceled: {city_cancel:,} ({city_canc_percent:.0f} %)")

---
Let's investigate how cancelations differ between the two types of hotels in the data set, City and Resort.

In [None]:
# plot of reservation month colored by cancelation
dodge_text = position_dodge(width=0.9)
plot3 = (ggplot(hotel_data, aes(x='res_status_month', fill='is_canceled'))+
              geom_bar(stat="count", alpha=0.8, position='dodge')+
              scale_x_discrete(limits = month_list)+
              facet_wrap("hotel")+
              # label each bar with its raw count (count/100 followed by '%' is not a percentage)
              geom_text(aes(label='stat(count)'),
                        ha='right',
                        position=dodge_text, stat='count',
                        size=8, 
                        format_string='{}')+
             coord_flip()+
             labs(x='Reservation Month', y='Count of Reservations', fill='Canceled(0=No)')+#, title='Reservations by Month')+
             theme_minimal()
        )

plot3

The cancelations at the Resort Hotel appear to be much more consistent across months when compared to the City Hotel.  The reservations that were not canceled seem to follow a parabolic trend peaking in August for both hotels.  

---
### Create Correlation Matrix
Before we begin modeling, let's examine a correlation matrix for our data set to see if any variables are highly correlated with cancelations.

In [None]:
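# create correlation matrix
# is_canceled was cast to str for plotting, so cast it back to int and keep numeric columns only
corr_data = hotel_data.assign(is_canceled=hotel_data['is_canceled'].astype(int))
corr = corr_data.select_dtypes(include='number').corr()
sns.heatmap(corr)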

*TODO: add description after hyperparameter tuning.*

---
### Build ML Model with XGBoost:
I chose XGBoost as my machine learning model mainly because I am trying to increase my familiarity with the package and algorithm.  XGBoost is a decision-tree-based ensemble machine learning algorithm that uses gradient boosting.  Check out this article for more: https://towardsdatascience.com/https-medium-com-vishalmorde-xgboost-algorithm-long-she-may-rein-edd9f99be63d
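To give a feel for what "gradient boosting" means, the following is a minimal sketch of plain gradient boosting with squared-error loss on one-split trees (stumps). It is not XGBoost's actual algorithm, which adds regularization and second-order gradient information among other things; the helper names `fit_stump` and `boost` and the toy data are made up for illustration.

```python
def fit_stump(xs, residuals):
    """Find the single threshold split that minimizes squared error."""
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def boost(xs, ys, n_rounds=50, learning_rate=0.1):
    """Each round fits a stump to the current residuals and adds a damped copy."""
    base = sum(ys) / len(ys)   # start from the mean prediction
    stumps = []

    def predict(x):
        return base + learning_rate * sum(stump(x) for stump in stumps)

    for _ in range(n_rounds):
        # For squared error, the negative gradient is simply the residual.
        residuals = [y - predict(x) for x, y in zip(xs, ys)]
        stumps.append(fit_stump(xs, residuals))
    return predict


# Toy data (made up): the target jumps from 0 to 1 once x exceeds 3.
xs = [1, 2, 3, 4, 5, 6]
ys = [0, 0, 0, 1, 1, 1]

model = boost(xs, ys)
print([round(model(x), 2) for x in xs])
```

The `learning_rate` and number of rounds play the same roles as `learning_rate` and `n_estimators` in `XGBClassifier`: a smaller rate needs more rounds to reach the same fit.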

In [None]:
# import more libs
import matplotlib.pyplot as plt
from xgboost.sklearn import XGBClassifier
from sklearn.model_selection import cross_validate
from sklearn import metrics
import xgboost as xgb
from xgboost import plot_importance
%matplotlib inline
from matplotlib.pylab import rcParams
rcParams['figure.figsize'] = 12, 4
from sklearn.metrics import f1_score, precision_score, recall_score

We'll begin by defining a function to run XGBoost:

In [None]:
def modelfit(alg, x_train, y_train, useTrainCV=True, cv_folds=5, early_stopping_rounds=50, feat_plot=False):
    '''
    INPUTS:
        alg                   = estimator to fit (ex: XGBClassifier/XGBRegressor)
        x_train               = training features
        y_train               = training target
        useTrainCV            = if True, use xgb.cv to pick the number of estimators
        cv_folds              = number of cross-validation folds
        early_stopping_rounds = stop once the CV metric has not improved for this many rounds
        feat_plot             = if True, plot feature importance after fitting
    '''
    
    if useTrainCV:
        xgb_param = alg.get_xgb_params()
        xgtrain = xgb.DMatrix(x_train.values, label=y_train) #convert training data into DMatrix
        cvresult = xgb.cv(xgb_param, xgtrain, num_boost_round=alg.get_params()['n_estimators'], nfold=cv_folds,
            metrics='auc', # metric needs to change based on Algorithm passed in
            early_stopping_rounds=early_stopping_rounds)
        alg.set_params(n_estimators=cvresult.shape[0])
        # print inside the if block: cvresult only exists when useTrainCV is True
        print("Optimal estimators for learning rate: ",cvresult.shape[0])
    
    #Fit the algorithm on the data
    alg.fit(x_train, y_train)# ,eval_metric='auc')
        
    #Predict training set:
    dtrain_predictions = alg.predict(x_train)
    dtrain_predprob = alg.predict_proba(x_train)[:,1]
        
    #Print model report:
    print ("\nModel Report")
    print ("Accuracy : %.4g" % metrics.accuracy_score(y_train.values, dtrain_predictions))
    print ("AUC Score (Train): %f" % metrics.roc_auc_score(y_train.values.flatten(), dtrain_predprob))
                    
    # plot feature importance
    if feat_plot:
        plot_importance(alg, grid=False)
        plt.show()

Select Features for Model:

In [None]:
# create new df with just original data
hotel_data_original = hotel_data >> select(~X.res_status_month, ~X.res_status_year)
hotel_data_original.head()

In [None]:
# #display(hotel_data_original)
# hotel_data['is_canceled'] = hotel_data['is_canceled'].astype('str')

# #create correlation list
# cancel_corr = hotel_data_original.corr()['is_canceled']

# display(cancel_corr)
# cancel_corr.abs().sort_values(ascending=False)

Drop variables that will poorly influence the model or don't make sense to include (like reservation status).  Separate numeric and categorical variables.  One-hot encode categorical variables.
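As a reminder of what one-hot encoding produces, here is a minimal sketch in plain Python on a made-up column (the values are hypothetical stand-ins for a categorical feature such as `meal`). It mirrors the `column_value` naming that `pd.get_dummies` generates, for example `hotel_Resort Hotel`.

```python
# Hypothetical values of a categorical column such as "meal".
meals = ["BB", "HB", "BB", "SC"]

# One indicator column per distinct category, in sorted order.
categories = sorted(set(meals))

# Each row gets a 1 in the column matching its category and 0 elsewhere.
encoded = [
    {f"meal_{category}": int(value == category) for category in categories}
    for value in meals
]

for value, row in zip(meals, encoded):
    print(value, row)
```

Every row has exactly one indicator set to 1, so a column with k distinct values becomes k numeric columns.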

In [None]:
from sklearn.preprocessing import OneHotEncoder

# create numeric features
num_feat = ["lead_time", "total_of_special_requests", "required_car_parking_spaces", "previous_cancellations", "is_repeated_guest",
            "agent", "adults", "previous_bookings_not_canceled", "days_in_waiting_list", "adr"]

# create categorical features
cat_feat = ["hotel","arrival_date_month","meal","market_segment","distribution_channel","reserved_room_type","deposit_type",
            "customer_type"]
hotel_cat = hotel_data_original[cat_feat]
#display(hotel_cat)
# # create instance of one-hot encoder
# enc = OneHotEncoder(handle_unknown='ignore')

# # pass values in
# enc_df = pd.DataFrame(enc.fit_transform(hotel_cat).toarray())
# display(enc_df)

In [None]:
# Run model again but use pd.get_dummies instead of OneHotEncoder
hotel_cat2 = pd.get_dummies(hotel_data[cat_feat], dtype='int64')

# merge data frames
hotel_df2 = pd.concat([hotel_data_original[num_feat], hotel_cat2], axis=1)
display(hotel_data_original[num_feat])
#p.pprint(hotel_df2.columns.to_series().groupby(hotel_df2.dtypes).groups)

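# target variable (is_canceled was cast to str for plotting, so cast back to int for modeling)
predictors = hotel_data_original['is_canceled'].astype(int)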

In [None]:
# split into test train set, this was a 60-40 split
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(hotel_df2, predictors, test_size = 0.4,
                                                    random_state=1)

Check the distribution of the training data to ensure it's been evenly split:

In [None]:
#list(x_train.columns.values)

In [None]:
# plot training data
plot_train = (ggplot(x_train, aes(x='hotel_Resort Hotel'))+
              geom_bar(stat='count', position='stack')+
              labs(x='Hotel Type')
)
plot_train

In [None]:
# build model, note that I left all the default input parameters and will edit them during hyperparameter tuning
xgb2 = XGBClassifier(seed=2)

In [None]:
import time

In [None]:
# run model
start = time.time() # start timer

modelfit(xgb2, x_train, y_train) 

print("Building time : " + str(time.time()-start))

In [None]:
plot_importance(xgb2, grid=False, max_num_features=15)
plt.show()

This model produced a training accuracy of 85.5%.  This is not bad given that we used default parameters and it took just under 2 minutes to train.  We can see that the top 3 most important features in the model were *adr* (average daily rate), *lead_time*, and *agent*.  This makes sense, as people who put down a lot of money on their reservation are less likely to cancel.

In [None]:
# Predict on test set:
predictions = xgb2.predict(x_test)
predprob = xgb2.predict_proba(x_test)[:,1]

# Print model report:
print ("\nModel Report")
print ("Accuracy: %.4g" % metrics.accuracy_score(y_test.values, predictions))
print ("AUC Score (Test): %f" % metrics.roc_auc_score(y_test.values.flatten(), predprob))

Looks like accuracy decreased slightly when run on the test set.

In [None]:
# calculate other classification metrics

f1 = f1_score(y_test.values, predictions)
precision =  precision_score(y_test.values, predictions)
recall = recall_score(y_test.values, predictions)

print("f1 Score: {:f}\nPrecision: {:f}\nRecall: {:f}".format(f1, precision, recall))

To get a better understanding of the performance of our model, the precision, recall, and f1 score will be calculated.  The precision of a model is the ratio of true positives/(true positives + false positives).  The recall of a model is the true positives/(true positives + false negatives).  Precision and recall can be easily understood through a fishing example.  Recall is the size of the fishing net cast and precision is how many of whatever you catch are fish.  The best of both worlds is a net that is just the right size that catches only fish.  The F1 score is the harmonic mean of precision and recall. 
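The definitions above can be worked through by hand on a small example. The sketch below uses made-up labels and a hypothetical helper `classification_scores`; it assumes at least one positive label and one positive prediction, so no division by zero is handled.

```python
def classification_scores(y_true, y_pred):
    """Compute precision, recall, and F1 for binary labels (1 = canceled)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

    precision = tp / (tp + fp)   # of everything caught, how much is fish
    recall = tp / (tp + fn)      # of all the fish, how many were caught
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f1


# Made-up example: 4 true cancelations, 3 predicted cancelations.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

precision, recall, f1 = classification_scores(y_true, y_pred)
print(f"precision={precision:.4f} recall={recall:.4f} f1={f1:.4f}")
```

Here there are 2 true positives, 1 false positive, and 2 false negatives, giving a precision of 2/3, a recall of 1/2, and an F1 score of 4/7.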

Here the model has a good combination of precision and recall.  Time to see if it can be improved through hyperparameter tuning.


### Build Model for Just City Hotel and Just Resort Hotel:

In [None]:
hotel_resort = hotel_data_original >> mask(X.hotel=='Resort Hotel')
predictors_R = hotel_resort['is_canceled'].astype(int)  # target, cast back to int for modeling
hotel_cat_resort = hotel_resort[cat_feat]
# one-hot encode the categorical features with pd.get_dummies
hotel_cat_resort2 = pd.get_dummies(hotel_resort[cat_feat], dtype='int64')
# merge data frames
hotel_resort2 = pd.concat([hotel_resort[num_feat], hotel_cat_resort2], axis=1)
#display(hotel_resort2)


hotel_city = hotel_data_original >> mask(X.hotel=='City Hotel')
predictors_C = hotel_city['is_canceled'].astype(int)  # target, cast back to int for modeling
hotel_cat_city = hotel_city[cat_feat]
# one-hot encode the categorical features with pd.get_dummies
hotel_cat_city2 = pd.get_dummies(hotel_city[cat_feat], dtype='int64')
# merge data frames
hotel_city2 = pd.concat([hotel_city[num_feat], hotel_cat_city2], axis=1)
#display(hotel_city2)

#### Resort Hotel Model

In [None]:
import time

In [None]:
# split into test train set, this was a 60-40 split

x_trainR, x_testR, y_trainR, y_testR = train_test_split(hotel_resort2, predictors_R, test_size = 0.4,
                                                    random_state=1)
# build model, note that I left all the default input parameters and will edit them during hyperparameter tuning
xgb_R = XGBClassifier(seed=2)

# run model
start = time.time() # start timer

modelfit(xgb_R, x_trainR, y_trainR) 

print("Building time : " + str(time.time()-start))

In [None]:
plot_importance(xgb_R, grid=False, max_num_features=15)
plt.show()

In [None]:
# Run Test
# Predict on test set:
predictions_R = xgb_R.predict(x_testR)
predprob = xgb_R.predict_proba(x_testR)[:,1]

# Print model report:
print ("\nModel Report")
print ("Accuracy: %.4g" % metrics.accuracy_score(y_testR.values, predictions_R))
print ("AUC Score (Test): %f" % metrics.roc_auc_score(y_testR.values.flatten(), predprob))

f1 = f1_score(y_testR.values, predictions_R)
precision =  precision_score(y_testR.values, predictions_R)
recall = recall_score(y_testR.values, predictions_R)

print("f1 Score: {:f}\nPrecision: {:f}\nRecall: {:f}".format(f1, precision, recall))

#### City Hotel Model

In [None]:
# split into test train set, this was a 60-40 split

x_trainC, x_testC, y_trainC, y_testC = train_test_split(hotel_city2, predictors_C, test_size = 0.4,
                                                    random_state=1)
# build model, note that I left all the default input parameters and will edit them during hyperparameter tuning
xgb_C = XGBClassifier(seed=2)

# run model
start = time.time() # start timer

modelfit(xgb_C, x_trainC, y_trainC) 

print("Building time : " + str(time.time()-start))

In [None]:
plot_importance(xgb_C, grid=False, max_num_features=15)
plt.show()

In [None]:
# test model
# Predict on test set:
predictions_C = xgb_C.predict(x_testC)
predprob = xgb_C.predict_proba(x_testC)[:,1]

# Print model report:
print ("\nModel Report")
print ("Accuracy: %.4g" % metrics.accuracy_score(y_testC.values, predictions_C))
print ("AUC Score (Test): %f" % metrics.roc_auc_score(y_testC.values.flatten(), predprob))

f1 = f1_score(y_testC.values, predictions_C)
precision =  precision_score(y_testC.values, predictions_C)
recall = recall_score(y_testC.values, predictions_C)

print("f1 Score: {:f}\nPrecision: {:f}\nRecall: {:f}".format(f1, precision, recall))

### Hyper-parameter Tuning

In [None]:
'''
note the following format for inputs to xgb:
xgb1 = XGBClassifier(learning_rate = 0.1,
                     n_estimators = 1000,
                     max_depth = 5,          # max depth of tree
                     min_child_weight = 1,   # min sum of weights of all observ. required in a child
                     gamma = 0,              # min loss required to make split
                     subsample = 0.8,        # fraction of observ. to be randomly selected for each tree
                     colsample_bytree = 0.8, # denotes frac. of col. to be randomly sampled for each tree
                     scale_pos_weight = 1    # parameter for high class imbalance [default=1]
)
'''
from sklearn.model_selection import GridSearchCV

In [None]:
# Train new model with new hyper-parameters
xgb3 = XGBClassifier(learning_rate = 0.1,
                     n_estimators = 439,     # 439 comes from previous iterations of training
                     max_depth = 5,          # max depth of tree
                     min_child_weight = 1,   # min sum of weights of all observ. required in a child
                     gamma = 0,              # min loss required to make split
                     subsample = 0.8,        # fraction of observ. to be randomly selected for each tree
                     colsample_bytree = 0.8, # denotes frac. of col. to be randomly sampled for each tree
                     scale_pos_weight = 1    # parameter for high class imbalance [default=1]
)

In [None]:
# run model
start = time.time() # start timer

modelfit(xgb3, x_train, y_train) 

print("Building time : " + str(time.time()-start))

In [None]:
# Predict on test set:
predictions = xgb3.predict(x_test)
predprob = xgb3.predict_proba(x_test)[:,1]

# Print model report:
print ("\nModel Report")
print ("Accuracy: %.4g" % metrics.accuracy_score(y_test.values, predictions))
print ("AUC Score (Test): %f" % metrics.roc_auc_score(y_test.values.flatten(), predprob))

f1 = f1_score(y_test.values, predictions)
precision =  precision_score(y_test.values, predictions)
recall = recall_score(y_test.values, predictions)

print("f1 Score: {:f}\nPrecision: {:f}\nRecall: {:f}".format(f1, precision, recall))

---

In [None]:
# Tune max_depth and min_child_weight first
param_test1 = {
    'max_depth':range(3,10,2),
    'min_child_weight':range(1,6,2)  
}
gsearch1 = GridSearchCV(estimator=XGBClassifier(learning_rate=0.1, n_estimators=439,
                     min_child_weight = 1,   # min sum of weights of all observ. required in a child
                     gamma = 0,              # min loss required to make split
                     subsample = 0.8,        # fraction of observ. to be randomly selected for each tree
                     colsample_bytree = 0.8, # denotes frac. of col. to be randomly sampled for each tree
                     scale_pos_weight = 1,   # parameter for high class imbalance [default=1]
                     seed=2),
                        param_grid=param_test1,
                        scoring='roc_auc',
                        n_jobs=4,
                        # iid=False dropped: the parameter was removed in scikit-learn 0.24
                        cv=5
                       )
gsearch1.fit(x_train, y_train)
#gsearch1.grid_scores_, gsearch1.best_params_, gsearch1.best_score_

print('Best parameter set found on development set: ',gsearch1.best_params_,'\n')
print('Best ROC_AUC: ',gsearch1.best_score_,'\n')
print('Mean test score: ',gsearch1.cv_results_['mean_test_score'],'\n')
print('std. on test score: ',gsearch1.cv_results_['std_test_score'])

In [None]:
# test tuned model on test set
xgb_tuned = XGBClassifier(learning_rate = 0.1,
                     n_estimators = 439,     # 439 comes from previous iterations of training
                     max_depth = 9,          # max depth of tree
                     min_child_weight = 1,   # min sum of weights of all observ. required in a child
                     gamma = 0,              # min loss required to make split
                     subsample = 0.8,        # fraction of observ. to be randomly selected for each tree
                     colsample_bytree = 0.8, # denotes frac. of col. to be randomly sampled for each tree
                     scale_pos_weight = 1    # parameter for high class imbalance [default=1]
)

# run model
start = time.time() # start timer

modelfit(xgb_tuned, x_train, y_train) 

print("Building time : " + str(time.time()-start))

In [None]:
# Predict on test set:
predictions = xgb_tuned.predict(x_test)
predprob = xgb_tuned.predict_proba(x_test)[:,1]

# Print model report:
print ("\nModel Report")
print ("Accuracy: %.4g" % metrics.accuracy_score(y_test.values, predictions))
print ("AUC Score (Test): %f" % metrics.roc_auc_score(y_test.values.flatten(), predprob))

f1 = f1_score(y_test.values, predictions)
precision =  precision_score(y_test.values, predictions)
recall = recall_score(y_test.values, predictions)

print("f1 Score: {:f}\nPrecision: {:f}\nRecall: {:f}".format(f1, precision, recall))

### Results:
The grid search performed above took about 40 minutes.  The best values found for the two hyperparameters were max_depth=9 and min_child_weight=1.  Ideally we would continue to perform grid searches on the remaining parameters, intermittently retraining our model to get the optimal number of estimators (ex: "Optimal estimators for learning rate:  100"), as that can change.  For time's sake, no more grid searches will be performed.
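To see why even a two-parameter search is slow, the sketch below enumerates the same grid as `param_test1` and counts the cross-validated fits it implies. It only counts combinations and trains nothing. `GridSearchCV` additionally refits the best configuration once on the full training set by default.

```python
from itertools import product

# Same grid as param_test1 above.
param_grid = {
    "max_depth": range(3, 10, 2),        # 3, 5, 7, 9
    "min_child_weight": range(1, 6, 2),  # 1, 3, 5
}
cv_folds = 5

# Every combination of the candidate values.
combinations = [
    dict(zip(param_grid, values))
    for values in product(*param_grid.values())
]
total_fits = len(combinations) * cv_folds

print(f"{len(combinations)} combinations x {cv_folds} folds = {total_fits} fits")
# prints: 12 combinations x 5 folds = 60 fits
```

Each of those 60 fits trains 439 boosting rounds, and adding a third parameter with three candidate values would triple the count, which is why the remaining parameters are usually tuned in separate, smaller searches.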

Parameters|Value|Description
---|---|:---
learning_rate|0.1|learning rate
n_estimators|439|number of boosting rounds
max_depth|9|max depth of tree
min_child_weight|1|min sum of weights of all observ. required in a child
gamma|0|min loss required to make split
subsample|0.8|fraction of observ. to be randomly selected for each tree
colsample_bytree|0.8|denotes frac. of col. to be randomly sampled for each tree
scale_pos_weight|1|parameter for high class imbalance [default=1]

Summary of Model Results:

Metric|Before (default)|After (tuning)
---|---|---
Accuracy|0.8440|0.8546
AUC Score|0.9103|0.9213
f1 Score|0.7693|0.7919
Precision|0.8488|0.8408
Recall|0.7034|0.7484






Looks like this model has managed to produce an accuracy of 85.46% on the test set after some minimal hyperparameter tuning.