# Machine Learning Project : Hotel bookings
<p><img width="150" style="float: right;margin:10px 30px 10px 10px" src="https://food.jumia.ug/blog/wp-content/uploads/2016/09/Hotel-booking-iStock_000089313057_Medium-940x529-660x400.jpg"></p>
<br><p>I have been working on this dataset as a school project for the last three weeks. My intention was to try out the major supervised learning algorithms we saw on the course, compare their respective performances and try to bring up some insights from the data.<br>
The dataset includes all the bookings of two hotels, along with information about the guests, the room types, check-in and check-out dates, etc. We will apply Exploratory Data Analysis (EDA) techniques to discover patterns in the data, and then apply basic machine learning methods to predict whether a booking will be cancelled.</p>
## Exploratory Data Analysis
### Import libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.pyplot import figure
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')
warnings.simplefilter('ignore')

### Import project files
We rely on the column descriptions that come with the dataset to understand the content of each column, and import the CSV file itself for analysis.

In [None]:
booking_data = pd.read_csv('../input/hotel-booking-demand/hotel_bookings.csv')
print('The shape of the overall database is: ', booking_data.shape,'\n')

### Preliminary observations and Findings 
First, let's explore the booking columns across the dataset to look for <strong>aberrant or missing values</strong><br>
The first observation is that some columns are not fully populated, so we need to replace the missing values. <strong>Four</strong> columns are affected:
<strong>'children', 'country', 'agent' and 'company'.</strong>

In [None]:
booking_data.loc[:,['is_canceled','children','country', 'agent', 'company']].info()
# Fill the gaps with neutral defaults in one call; assigning the result avoids
# chained in-place fills, which newer pandas versions warn about
booking_data = booking_data.fillna({'country': '-', 'agent': 0, 'company': 0, 'children': 0})
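
A quick way to confirm which columns contain gaps is to count the missing values per column. The sketch below uses a small made-up frame (the `toy` data and its values are assumptions for illustration, not rows from the dataset); the same calls on `booking_data` should list the four columns named above.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for booking_data (values are made up)
toy = pd.DataFrame({
    'is_canceled': [0, 1, 0, 1],
    'children': [0.0, np.nan, 2.0, 0.0],
    'country': ['PRT', None, 'GBR', 'ESP'],
    'agent': [9.0, np.nan, np.nan, 240.0],
})

# Count missing values per column and keep only the columns that have any
missing = toy.isna().sum()
missing = missing[missing > 0].sort_values(ascending=False)
print(missing)

# Fill all gaps in one call with a per-column default
toy_filled = toy.fillna({'children': 0, 'country': '-', 'agent': 0})
print(toy_filled)
```

Passing a dictionary to `fillna` keeps the default for each column in one place, which makes the choice of placeholder values easy to review.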

#### <u>Q. What type of hotels suffer more from cancellation?</u>
Not all bookings are the same!<br>
As you may know, each row in this dataset corresponds to a booking, but one row may cover a single person while another covers several. The same goes for the nights stayed (one guest may book a weekend while another stays longer).
<br>Let's build some useful columns for later calculations, such as total stay nights and the number of booked people (except babies) per row. Another column will be added to account for the difference between the <i>reserved_room_type</i> and the <i>assigned_room_type</i> and included in our analysis.

In [None]:
booking_data['pax'] = booking_data.children+booking_data.adults
booking_data['stay_nights'] = booking_data.stays_in_week_nights+booking_data.stays_in_weekend_nights
booking_data['bill']=booking_data.stay_nights*booking_data.adr
booking_data['room_assignment'] = booking_data['reserved_room_type']==booking_data['assigned_room_type']
print('* Overall, %2.0f bookings were canceled, accounting for %2.0f percent of booked stays.'
      %(booking_data.is_canceled.sum(), booking_data.is_canceled.mean()*100))
plt.figure(figsize=(8,3))
sns.set_style("white")
sns.countplot(x='hotel', hue='is_canceled', data=booking_data)
plt.show()

In [None]:
print('Cancellation Financial impact')
sns.catplot(x='hotel', y='bill', hue='is_canceled', estimator=sum, ci=None, kind='bar', data=booking_data)
plt.show()

<strong>The majority of canceled bookings are in 'City Hotels' even though both establishments suffer from cancellation to a certain degree.</strong>
#### <u>Q. Do people from a specific country tend to cancel their booking more than the others?</u>

We will now build a grouped table to calculate the column values per country. We will use this to first display the top countries by number of fulfilled bookings.

In [None]:
# Sum only the columns we need (avoids aggregating the non-numeric columns)
booking_country = booking_data.groupby('country')[['is_canceled','stay_nights','pax']].sum()
booking_country['booking_count'] = booking_data.groupby('country').hotel.count()
booking_country['cancellation_rate'] =  booking_country.is_canceled.div(booking_country.booking_count)
booking_country['fulfillment_rate'] =  1-booking_country.is_canceled.div(booking_country.booking_count)
booking_country['fulfilled_bookings'] =  booking_country['booking_count']-booking_country['is_canceled']
sns.set(style="whitegrid")
toprint = booking_country.reset_index().sort_values(by='fulfilled_bookings', ascending=False).head(10)
g = sns.PairGrid(toprint, x_vars=['fulfilled_bookings'], y_vars=['country'], height=4)
sns.despine(left=True, bottom=True)
g.map(sns.stripplot, size=20, orient="h", palette="ch:s=1,r=-.1,h=1_r", linewidth=2, edgecolor="w")
plt.show()

The countries at the top of the chart are all European, with a remarkable lead for <strong>Portugal</strong>. If we assume that all the bookings are for the same geographical area, the number of customers from this region suggests that the establishments are located in the <strong>Iberian Peninsula</strong>. And given that <b>Portugal</b> has a small population (10M) compared to the UK, France and Spain, its lead would be hard to explain if the hotels were in Spain. Thus, we conclude that the establishments are in <b>Portugal</b>.
<br><br>Now let's go back to our dataset and analyze the countries with the highest cancellation rates. A country may show a high number of cancellations simply because it has a large number of reservations overall, so the suitable approach is to compare percentages.

In [None]:
plt.figure(figsize=(3,3))
sns.set_style("white")
sns.boxplot(data=booking_country, y='cancellation_rate')
plt.annotate('Filter countries with this rate or more', xy=(0.01, 0.48), xytext=(-0.4, 0.8),
            arrowprops=dict(facecolor='black', shrink=1),
            )
plt.title('Distribution of cancellation rates among countries')
plt.show()

After plotting the distribution of cancellation rates per country, we will filter the countries above the 75th percentile.
<br>The calculations put the 75th percentile at a cancellation rate of 45%. For the sake of the analysis, we will nonetheless leave out the countries with a low reservation count (<100), even if they show high cancellation rates.<br>
The over-sized points are countries that combine a high cancellation rate with a large number of reservations overall. Let's zoom in on the cluster at the left to enumerate those countries.

In [None]:
#print(booking_country.quantile(0.75))
booking_country_plot = booking_country.copy()  # copy so the helper column does not leak into booking_country
booking_country_plot['hue'] = (booking_country['cancellation_rate']<0.45)|(booking_country['booking_count']<100)
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15,3))
sns.set_style('whitegrid')
ax1.set_title('Countries with high cancellation rates')
sns.scatterplot(data=booking_country_plot, x='booking_count', y='is_canceled', size='hue', hue='hue', sizes=(20,150), alpha=0.8, legend=False, ax=ax1)
ax2.set_title('High cancel rates (zoom on the cluster)')
sns.scatterplot(data=booking_country_plot, x='booking_count', y='is_canceled', hue='hue', size='hue', sizes=(20,150), alpha=0.8, legend=False, ax=ax2)
# Zoom only the right-hand plot, addressing its axes explicitly
ax2.set_xlim(0,1200)
ax2.set_ylim(0,600)
plt.show()
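
The 45% threshold is hard-coded in the cell above. As a sketch of how it could be derived from the data instead, the snippet below computes the 75th percentile and applies both filters to a small made-up per-country table (the `toy` frame, its country codes and the `rate_threshold` / `min_bookings` names are assumptions for illustration).

```python
import pandas as pd

# Hypothetical per-country summary (numbers are made up for illustration)
toy = pd.DataFrame({
    'booking_count': [5000, 1200, 300, 150, 40, 10],
    'is_canceled':   [4000,  240, 150,  30, 30,  9],
}, index=['AAA', 'BBB', 'CCC', 'DDD', 'EEE', 'FFF'])
toy['cancellation_rate'] = toy['is_canceled'] / toy['booking_count']

# Derive the rate threshold from the data instead of hard-coding it
rate_threshold = toy['cancellation_rate'].quantile(0.75)
min_bookings = 100  # same cut-off as in the text

# Keep countries that are both above the threshold and large enough to matter
flagged = toy[(toy['cancellation_rate'] >= rate_threshold)
              & (toy['booking_count'] >= min_bookings)]
print('75th percentile of cancellation rates: %.4f' % rate_threshold)
print(flagged)
```

In this toy table, the two countries with the highest rates are dropped by the booking-count filter, which is exactly the effect the text describes.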

In [None]:
print('If we do not account for countries with negligible booking counts, the TOP 3 countries in terms of cancelling reservations are : Portugal, China and Angola.')
display(booking_country_plot[booking_country_plot['hue']==False].sort_values(by='is_canceled', ascending=False).loc[:,'is_canceled':'cancellation_rate'])

#### <u>Q. In which period of the year does the number of bookings peak? Can you spot any seasonality?</u>
Now let's address the seasonality of the bookings, by finding the months during which the bookings peak.

In [None]:
bookings = booking_data[booking_data['is_canceled']==0].pivot_table(index='arrival_date_month', columns='arrival_date_year', values='hotel', aggfunc=len, fill_value=0)
bookings.index = pd.CategoricalIndex(bookings.index, categories=['January', 'February', 'March', 'April','May','June','July', 'August','September', 'October', 'November', 'December'], ordered=True)
bookings = bookings.sort_index()
mask = np.array([[1, 0, 0], [1, 0, 0], [1, 0, 0], [1, 0, 0], [1, 0, 0], [1, 0, 0], [0, 0, 0], [0, 0, 0], [0, 0, 1], [0, 0, 1], [0, 0, 1], [0, 0, 1]])
f, ax = plt.subplots(figsize=(5, 3))
sns.heatmap(bookings, center=2000, annot=True, mask=mask, fmt="d", ax=ax, cmap="YlGnBu")
sns.set_context('paper')
plt.show()

This matrix shows that the number of fulfilled reservations peaks each year in <strong>May</strong> and <strong>October</strong>. Since the pattern repeats from year to year, we may later leave the <i>'arrival_date_year'</i> feature out of the training.
#### <u>Q. What are the features that are more correlated with booking cancellation?</u>

In [None]:
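# numeric_only=True (pandas 1.5+) restricts the correlation matrix to numeric columns,
# which recent pandas versions require when the frame also holds text columns
print('The maximum correlation between \'is_canceled\' and any other numeric feature is  %2.2F'
      %booking_data.corr(numeric_only=True).loc['lead_time':,'is_canceled'].abs().max())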

We can infer that there is little correlation between any numerical feature and the target label <i>'is_canceled'</i>, the closest one being <i>'lead_time'</i>. Let's explore the correlation with categorical features.<br>
By previewing each feature by itself while counting the fulfilled bookings vs the cancelled ones, we can detect some patterns. In fact, most of the features have a homogenous distribution of cancellations among the unique values, which is a clear indication that there is little chance for the feature to have a correlation with the label to predict. The columns <i>'arrival_date_month'</i> and <i>'arrival_date_week_number'</i> illustrate this effect:

In [None]:
sns.set_style("white")
fig, (ax1, ax2) = plt.subplots(1,2, figsize=(15,3))
sns.countplot(data=booking_data, x='arrival_date_week_number', hue='is_canceled', ax=ax1)
sns.countplot(data=booking_data, x='arrival_date_month', hue='is_canceled', ax=ax2)
plt.xticks(rotation=60)
plt.show()

In contrast, other features displayed a certain bias when segregated by cancellation. This pointed us towards some features of interest.

In [None]:
sns.set_style("white")
fig, axes = plt.subplots(2, 3, figsize=(15,8))
sns.countplot(data=booking_data, x='deposit_type', hue='is_canceled', ax=axes[0][0])
sns.countplot(data=booking_data, x='market_segment', hue='is_canceled', ax=axes[0][1])
sns.countplot(data=booking_data, x='distribution_channel', hue='is_canceled', ax=axes[0][2])
sns.countplot(data=booking_data, x='is_repeated_guest', hue='is_canceled', ax=axes[1][0])
sns.countplot(data=booking_data, x='room_assignment', hue='is_canceled', ax=axes[1][1])
sns.countplot(data=booking_data, x='reservation_status', hue='is_canceled', ax=axes[1][2])
plt.show()

In fact, we deduced that these are the columns most strongly associated with the label to predict:
>* Regarding 'deposit_type': 'Non Refund' bookings are canceled almost all the time
* Also, the cancellation rate varies a lot across the values of 'market_segment' and 'distribution_channel'
* Repeated guests cancel their bookings far less often than non-repeated guests do.

As for the last attribute, it maps perfectly onto the label. For each row that has 'reservation_status' set to 'Check-Out', the label 'is_canceled' is 0, and it is 1 in all other cases. Since this feature leaks the label, it will be left out during the training.
<br><i>NB: A test has been done with the feature 'reservation_status', after creating the dummies, and it generated a model with a perfect score with DecisionTree as the algorithm.</i>
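
A simple way to detect this kind of label leakage is to cross-tabulate the feature against the label and check whether every category maps to a single label value. The sketch below does this on a handful of made-up rows (the `toy` frame is an assumption for illustration; the status strings follow the values mentioned in the text).

```python
import pandas as pd

# Hypothetical toy sample (rows are made up for illustration)
toy = pd.DataFrame({
    'reservation_status': ['Check-Out', 'Canceled', 'No-Show', 'Check-Out', 'Canceled'],
    'is_canceled': [0, 1, 1, 0, 1],
})

# Cross-tabulate the candidate feature against the label
table = pd.crosstab(toy['reservation_status'], toy['is_canceled'])
print(table)

# A feature leaks the label if every category maps to exactly one label value
labels_per_status = toy.groupby('reservation_status')['is_canceled'].nunique()
leaks_label = bool((labels_per_status == 1).all())
print('reservation_status determines is_canceled:', leaks_label)
```

Running the same check on each categorical column before training is a cheap safeguard against models that look perfect only because they were handed the answer.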

Before moving on to the ML part, one peculiar thing caught our attention while cleaning the data: the presence of 715 rows with zero stay nights (weekend and week nights alike) and an average daily rate of zero. We investigated the rows to find patterns, and observed that they are scattered across all the features (hotel, arrival dates, meals...), except that:<br>
* The majority are locals (PRT) and the bookings are marked 'Check-Out'
* They are related to 'Transient' clients who signed up for B&B
* They had 'No Deposit'
* They were mostly booked for 2 people (sometimes 1 person) and almost all of them had 0 'days in waiting list'

We immediately started hypothesizing:<br>
<b>H1</b>: They could be clients showing up at the last minute (0 days on the waiting list) and leaving immediately (maybe not liking the establishment, the price...), but they are all labeled as 'Check-Out'<br>
<b>H2</b>: It is not plausible either that they are clients who booked lunch, dinner or the bar, as it does not make sense to create a booking for that, and a lot of the bookings are B&B.<br>
...among another 3 or 4 far-fetched hypotheses. We finally decided that it has to do with the validity of the data. Whether the records were maliciously rigged or are simply the result of a record-keeping mistake, this means that there are bills not accounted for in the establishment. Let's calculate the supposed losses, and drop the rows.

In [None]:
#Average bill amount = average adr for checked-out bookings * average stay duration
fulfilled = booking_data[booking_data['is_canceled']==0]
bb = fulfilled.adr.mean()*fulfilled.stay_nights.mean()
#Number of fulfilled bookings with zero stay nights (680 in the original run), as a plain integer
rr = len(fulfilled[fulfilled['stay_nights']==0])
print('The estimated amount missing from these bookings is $%.2f' %(bb*rr))
booking_data = booking_data[booking_data['stay_nights']!=0]

## Machine Learning
### Import libraries

In [None]:
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier as KNN
from sklearn.ensemble import VotingClassifier, BaggingClassifier, RandomForestClassifier, AdaBoostClassifier
from sklearn.metrics import roc_auc_score

### Prepare the inputs
To tackle the model training part, we need to deal with the categorical features by replacing them with dummy columns. The features at hand are:<br>
<i>hotel, country, market_segment, distribution_channel, room_assignment, deposit_type, customer_type, is_repeated_guest</i>

In [None]:
bookings = booking_data[['is_canceled', 'lead_time', 'country', 'hotel', 'market_segment', 'distribution_channel', 'room_assignment', 'deposit_type', 'customer_type', 'is_repeated_guest']]
booking_data_dummies=pd.get_dummies(data=bookings, columns=['hotel', 'country', 'market_segment', 'distribution_channel', 'room_assignment', 'deposit_type', 'customer_type', 'is_repeated_guest'])
print(booking_data_dummies.shape)
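
To make the effect of the dummy encoding concrete, here is a minimal sketch on a made-up three-row frame (the `toy` data is an assumption for illustration). Each distinct value of the encoded column becomes its own indicator column, while numeric columns pass through unchanged.

```python
import pandas as pd

# Hypothetical toy frame: one numeric and one categorical column
toy = pd.DataFrame({
    'lead_time': [10, 200, 35],
    'deposit_type': ['No Deposit', 'Non Refund', 'No Deposit'],
})

# One indicator column is created per distinct category
encoded = pd.get_dummies(toy, columns=['deposit_type'])
print(encoded.columns.tolist())
print(encoded)
```

This also explains why the encoded booking table becomes wide: a column such as `country` adds one indicator column per distinct country code.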

Then, we need to split the dataframe into features (booking columns) and the label to predict ('is_canceled'). Next, we will split the dataframe into a training set (<strong>X_train and y_train</strong>) and a test set (<strong>X_test and y_test</strong>).

In [None]:
X = booking_data_dummies.drop(['is_canceled'], axis=1).values
y = booking_data_dummies.is_canceled
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=7)

### Prepare the model
By preparing the model, we mean choosing the algorithm, instantiating it and then fitting it to the training set. The algorithms suited to the problem at hand, which is a classification problem, are the following:
<ul><li>Decision trees (CART)</li><li>KNN (with confusion matrix)</li><li>Logistic regression (with ROC)</li><li>A boosted model with a combination of all the above.</li>    
</ul>

#### Decision Tree Classifier
<br>We'll start with a  <strong>decision tree classifier</strong> as a first attempt to model the problem. We will tackle this through the conventional paradigm (instantiate, fit, predict, assess performance)

In [None]:
model_t = DecisionTreeClassifier(criterion='entropy', random_state=7)
model_t.fit(X_train, y_train)
y_pred = model_t.predict(X_test)
print('Score :', accuracy_score(y_test, y_pred))

Using GridSearchCV, we found that increasing the '<i>max_depth</i>' parameter keeps increasing the model accuracy, with no optimum reached. So the choice will be a trade-off between execution time and score improvement.
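
The grid search itself is not shown in the notebook, so here is a minimal sketch of what such a search over `max_depth` might look like. It runs on synthetic data from `make_classification` rather than the booking features, and the candidate depths, the 5-fold setting and the `_demo` variable names are assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the booking features (assumption: not the real data)
X_demo, y_demo = make_classification(n_samples=600, n_features=10,
                                     n_informative=5, random_state=7)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=7)

# Candidate depths are illustrative; None lets the tree grow until pure leaves
param_grid = {'max_depth': [2, 4, 8, 16, None]}
search = GridSearchCV(
    DecisionTreeClassifier(criterion='entropy', random_state=7),
    param_grid, cv=5, scoring='accuracy',
)
search.fit(X_tr, y_tr)

print('Best max_depth:', search.best_params_['max_depth'])
print('Cross-validated accuracy: %.3f' % search.best_score_)
print('Held-out accuracy: %.3f' % search.score(X_te, y_te))
```

Comparing the cross-validated score with the held-out score is a useful check here: a deep tree that keeps improving on the training folds may simply be memorizing them.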
#### K-Nearest Neighbors

In [None]:
model_k = KNN(n_neighbors=4)
model_k.fit(X_train, y_train)
y_pred = model_k.predict(X_test)
print('Score :', accuracy_score(y_test, y_pred))

#### Logistic Regression

In [None]:
model_l = LogisticRegression(random_state=7, solver='liblinear')
model_l.fit(X_train, y_train)
y_pred = model_l.predict(X_test)
print('Score :', accuracy_score(y_test, y_pred))

#### Voting Classifier

In [None]:
classifiers = [('Logistic Regression', model_l),
('K Nearest Neighbours', model_k),
('Classification Tree', model_t)]
vc = VotingClassifier(estimators=classifiers)
vc.fit(X_train, y_train)
y_pred = vc.predict(X_test)
print('Voting Classifier: {:.3f}'.format(accuracy_score(y_test, y_pred)))

#### Bagging Classifier

In [None]:
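# The estimator is passed positionally: its keyword was renamed from base_estimator to estimator in scikit-learn 1.2
bc = BaggingClassifier(model_l, n_estimators=100)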
bc.fit(X_train, y_train)
y_pred = bc.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print('Accuracy of Bagging Classifier: {:.3f}'.format(accuracy))

#### Random Forest

In [None]:
rf = RandomForestClassifier(n_estimators=400, random_state=7)
rf.fit(X_train, y_train)
y_pred = rf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)

In [None]:
print('Accuracy of Random Forest Classifier: {:.3f}'.format(accuracy))

#### ADABoost

In [None]:
dt = DecisionTreeClassifier(max_depth=15, random_state=7)
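# The estimator is passed positionally: its keyword was renamed from base_estimator to estimator in scikit-learn 1.2
adb_clf = AdaBoostClassifier(dt, n_estimators=500)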
adb_clf.fit(X_train, y_train)
y_pred_proba = adb_clf.predict_proba(X_test)[:,1]
adb_clf_roc_auc_score = roc_auc_score(y_test, y_pred_proba)

In [None]:
print('ROC AUC of AdaBoosted Decision Tree Classifier: {:.3f}'.format(adb_clf_roc_auc_score))

So far, the best strategy to predict the outcomes of bookings was <strong>adaptive boosting</strong> on decision trees, which yields a score as high as 85%. Note that this score is a ROC AUC (computed with <i>roc_auc_score</i>), whereas the other models were scored on accuracy, so the figures are not directly comparable.
In real life, we wouldn't go to the management committee of the hotel chain to tell them that the model predicts the fulfillment of reservations 85% of the time, as this would mean nothing and everything. One obvious question would be about the ability of the model to predict the impact on revenues, aka bottom line numbers.
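
Since the models in this section are scored with two different metrics, it can help to see both computed on the same classifier. The sketch below does so on synthetic data (the `make_classification` dataset, the shallow tree and the `_demo` names are assumptions for illustration, not the notebook's models).

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the booking features (assumption: not the real data)
X_demo, y_demo = make_classification(n_samples=600, n_features=10, random_state=7)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=7)

clf = DecisionTreeClassifier(max_depth=3, random_state=7).fit(X_tr, y_tr)

# Accuracy is computed from hard class predictions
acc = accuracy_score(y_te, clf.predict(X_te))
# ROC AUC is computed from the predicted probability of the positive class
auc = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])

print('Accuracy: %.3f' % acc)
print('ROC AUC : %.3f' % auc)
```

Accuracy measures the share of correct predictions at a fixed decision threshold, while ROC AUC measures how well the predicted probabilities rank canceled bookings above fulfilled ones, so the two numbers answer different questions.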
##### So let's crunch the numbers !

In [None]:
booking_sample = booking_data.sample(4420).sort_values(by='bill', ascending=False)
booking_sample.bill.sum()
booking_sample_0 = booking_sample[booking_sample['is_canceled']==0].sort_values(by='bill', ascending=False).head(int(4420*0.113)).bill.sum()
booking_sample_1 = booking_sample[booking_sample['is_canceled']==1].sort_values(by='bill', ascending=False).head(int(4420*0.113)).bill.sum()
upper_bound = booking_sample.bill.sum() + booking_sample_0
lower_bound = booking_sample.bill.sum() - booking_sample_1
print('The actual monthly revenue is between %.2f and %.2f times the predicted sum.' %(lower_bound/booking_sample.bill.sum(), upper_bound/booking_sample.bill.sum()))

To recap what we've just done: we took the mean count of bookings the hotels may get in a regular month, and supposed that the worst-case scenarios are the ones where the wrong predictions fall on the reservations with the biggest bills.<br>
Since the model is 88.7% accurate, we suppose that in one scenario the wrong predictions fall on 11% of the bookings that were actually canceled, which gives us false hope of collecting the corresponding value (lower bound), and vice versa (a scenario where we wrongly assume that the most important 11% of the bookings are going to be canceled, when in fact they will be fulfilled, thus shifting the total revenue to the upper bound).
In conclusion, <b>our model of 88.7% accuracy only helps in predicting the revenue to within a whopping +/- 25%. Further statistical significance testing (p-values) is required to determine whether the revenue prediction can be narrowed down to a smaller interval.</b>

Another use of the prediction model would be to intentionally overbook the establishments in peak season, to compensate for the bookings to-be-canceled. Given that cancellation rates have a mean of 40% in peak months, we can use this rate to overbook, whenever the predicted fulfilled bookings reach nominal hotel capacity.
To adjust for the error of the model, we suppose that 11.3% of the bookings predicted to-be-canceled are going to turn up at the hotel counter once the reserved date comes, and deduct the number from the overbooked capacity.

In [None]:
print('* All in all, a rough calculation to optimally overbook the establishment at peak seasons is', int(88.7/0.6),'%')
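
The one-line calculation above compresses two steps. The sketch below spells them out using the figures quoted in the text; the variable names are hypothetical, and the decomposition is one reading of the `88.7/0.6` expression rather than a validated overbooking policy.

```python
# Figures quoted in the text above
cancellation_rate = 0.40   # mean cancellation rate in peak months
model_accuracy = 0.887     # accuracy quoted for the model

# Step 1: bookings to accept (as a share of capacity) so that the expected
# number of guests who actually show up equals the hotel capacity
naive_overbooking = 1 / (1 - cancellation_rate)

# Step 2: scale down by the model accuracy to allow for bookings that were
# predicted as canceled but turn up anyway (mirrors 88.7 / 0.6 in the cell above)
adjusted_overbooking = model_accuracy * naive_overbooking

print('Naive target: %d%% of capacity' % int(naive_overbooking * 100))
print('Adjusted target: %d%% of capacity' % int(adjusted_overbooking * 100))
```

Keeping the two steps separate makes it easier to swap in a different cancellation rate per month or per hotel, which the seasonality analysis suggests would matter.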