## Context

Reservation cancellations are common in the hotel industry. Each cancellation is a lost revenue opportunity that can never be recovered. When working at the front desk of an airport hotel, we had to call each guest to confirm whether they would show up that afternoon. If a guest confirmed that they could not show up, we could try to sell the room again. This practice protects revenue to a certain degree. However, calling every guest in the afternoon is not sufficient, since the majority of guests check in during the afternoon.

Therefore, if we can predict whether a guest will cancel a reservation, hotels can contact the guests who are most likely to cancel, confirm their plans more efficiently, and resell the rooms to optimize revenue.

## Content
This [data](https://www.kaggle.com/jessemostipak/hotel-booking-demand) set contains booking information for a city hotel and a resort hotel, and includes information such as when the booking was made, length of stay, the number of adults, children, and/or babies, and the number of available parking spaces, among other things.

## Questions
1. Which hotel has more cancelations?
2. Any difference in lead time?
3. How about ADR?
4. Would deposit type make a difference?
5. Any difference in market segments?
6. How about distribution channels?
7. How about Month, Day, and Week Number?

### Import Dataset & Preparation

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(style='darkgrid')
%matplotlib inline

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# Any results you write to the current directory are saved as output.

In [None]:
# load the dataset
df = pd.read_csv('../input/hotel-booking-demand/hotel_bookings.csv')
df.head()

In [None]:
df.describe()

In [None]:
# check for missing values
df.isnull().sum()

The agent and company columns have many missing values, so these two columns can be dropped. For country and children, only a small proportion of values are missing, so we can drop the affected rows instead.
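To make "many" and "small proportion" concrete, the share of missing values per column can be computed directly. The sketch below uses a made-up toy frame, not the booking data, and the 0.5 cutoff is an arbitrary assumption for illustration only.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the booking data; all values are made up.
toy = pd.DataFrame({
    'country': ['PRT', 'GBR', None, 'ESP'],
    'children': [0.0, 1.0, 0.0, np.nan],
    'company': [np.nan, np.nan, np.nan, 40.0],
})

# Fraction of missing values per column, largest first.
missing_share = toy.isnull().mean().sort_values(ascending=False)
print(missing_share)

# Hypothetical cutoff: columns above it are candidates for dropping entirely,
# while the remaining columns only lose the rows with missing values.
cutoff = 0.5
cols_to_drop = missing_share[missing_share > cutoff].index.tolist()
print(cols_to_drop)
```

On the real data, the same two lines can be run on `df` in place of `toy`.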

In [None]:
# check for the shape of dataset
df.shape

In [None]:
# check for data type of each column
df.dtypes

In [None]:
df['is_canceled'].value_counts(normalize=True)

In [None]:
df['reservation_status'].value_counts(normalize=True)

is_canceled and reservation_status provide the same information, so we can drop reservation_status for model building.
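One way to check that two columns carry the same information is to cross-tabulate them. This is a minimal sketch on a made-up toy frame; the column names match the dataset, but the rows are invented.

```python
import pandas as pd

# Toy rows; on the real data, use df instead of toy.
toy = pd.DataFrame({
    'is_canceled': [0, 1, 1, 0, 1],
    'reservation_status': ['Check-Out', 'Canceled', 'No-Show',
                           'Check-Out', 'Canceled'],
})

# Rows are statuses, columns are the values of is_canceled.
overlap = pd.crosstab(toy['reservation_status'], toy['is_canceled'])
print(overlap)

# If every status falls under exactly one value of is_canceled,
# the status column fully determines the target.
status_determines_target = bool(((overlap > 0).sum(axis=1) == 1).all())
print(status_determines_target)
```

When the check holds, keeping reservation_status as a feature would leak the target into the model.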

In [None]:
df.drop(columns=['agent', 'company', 'reservation_status'],inplace=True)
df.dropna(axis=0,inplace=True)
df.shape

In [None]:
df.isnull().sum()

In [None]:
df['meal'].value_counts()

In [None]:
# "meal" contains values "Undefined", which is equal to SC
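# assign the result back; replace(inplace=True) on a selected column may not update df in recent pandas
df['meal'] = df['meal'].replace('Undefined', 'SC')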

In [None]:
df.hist(figsize=(20,20))
plt.show()

It looks like there are some reservations with 0 adults, so we need to take a look at those records.

In [None]:
len(df[(df['adults']==0) & (df['children']==0) & (df['babies']==0)])

In [None]:
zero_guests = df[(df['adults']==0) & (df['children']==0) & (df['babies']==0)].index
df.drop(zero_guests, inplace=True)
df.shape

### EDA

Now the data is cleaned and ready for analysis.

#### Which hotel has more cancelations?

In [None]:
print('There are ' + str(len(df[(df['hotel']=='Resort Hotel') & (df['is_canceled']==1)])) + ' cancelations at Resort Hotel')
print('There are ' + str(len(df[(df['hotel']=='City Hotel') & (df['is_canceled']==1)])) + ' cancelations at City Hotel')

In [None]:
plt.figure(figsize=(6,6))
plt.title(label='Cancellations by Hotel Types')
sns.countplot(x='hotel',hue='is_canceled',data=df)
plt.show()

In [None]:
# % of cancellations in Resort Hotel
df[df['hotel']=='Resort Hotel']['is_canceled'].value_counts(normalize=True)

In [None]:
# % of cancellations in City Hotel
df[df['hotel']=='City Hotel']['is_canceled'].value_counts(normalize=True)

City Hotel has a higher cancellation rate, 41.78%, compared to Resort Hotel’s 27.98%.
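Because is_canceled is coded as 0/1 at this point, the cancellation rate per hotel is simply the group mean. The sketch below shows this on made-up toy rows; the numbers are not the dataset's.

```python
import pandas as pd

# Toy rows; all values are made up.
toy = pd.DataFrame({
    'hotel': ['City Hotel'] * 4 + ['Resort Hotel'] * 4,
    'is_canceled': [1, 1, 0, 0, 1, 0, 0, 0],
})

# The mean of a 0/1 column is the share of 1s, i.e. the cancellation rate.
rate_by_hotel = toy.groupby('hotel')['is_canceled'].mean()
print(rate_by_hotel)
```

This gives both rates in one table instead of one `value_counts` call per hotel.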

#### Any difference in lead time?

In [None]:
plt.figure(figsize=(12,6))
plt.title(label='Cancellation by Lead Time')
sns.barplot(x='hotel',y='lead_time',hue='is_canceled',data=df)
plt.show()

It looks like the longer the lead time, the more likely the reservation is to be canceled.
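The bar plot compares average lead times. A complementary view is to bin lead time and look at the cancellation rate in each bin. This is only a sketch: the bin edges are an arbitrary assumption and the rows are made up.

```python
import pandas as pd

# Toy rows; all values are made up.
toy = pd.DataFrame({
    'lead_time': [3, 10, 25, 45, 80, 120, 200, 300],
    'is_canceled': [0, 0, 0, 1, 0, 1, 1, 1],
})

# Hypothetical bin edges in days.
bins = [0, 30, 90, 365]
labels = ['0-30', '31-90', '91-365']
toy['lead_bin'] = pd.cut(toy['lead_time'], bins=bins, labels=labels,
                         include_lowest=True)

# Cancellation rate within each lead-time bin.
rate_by_bin = toy.groupby('lead_bin', observed=True)['is_canceled'].mean()
print(rate_by_bin)
```

If the rate rises from bin to bin on the real data, that supports the reading of the plot.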

#### How about ADR?

In [None]:
plt.figure(figsize=(6,6))
plt.title(label='Cancellation by ADR')
sns.barplot(x='is_canceled',y='adr',data=df)
plt.show()

In [None]:
plt.figure(figsize=(6,6))
plt.title(label='Cancellation by ADR & Hotel Type')
sns.barplot(x='hotel',y='adr',hue='is_canceled',data=df)
plt.show()

It looks like canceled reservations at the Resort Hotel had a higher ADR.

#### Would deposit type make a difference?

In [None]:
plt.figure(figsize=(6,6))
plt.title(label='Cancellation by Deposit Type')
sns.countplot(x='deposit_type',hue='is_canceled',data=df)
plt.show()

Reservations with a No Deposit or Non Refund policy are more likely to be canceled.

#### Any difference in market segments?

In [None]:
plt.figure(figsize=(6,6))
plt.title(label='Cancellation by Market Segments')
plt.xticks(rotation=45) 
sns.countplot(x='market_segment',hue='is_canceled',data=df)
plt.show()

In [None]:
plt.figure(figsize=(6,6))
plt.title(label='Cancellation by Market Segments & ADR')
plt.xticks(rotation=45) 
sns.barplot(x='market_segment',y='adr',hue='is_canceled',data=df)
plt.show()

• The cancellation percentage of groups is higher than other segments

• The cancellation number of Online TA is higher than other segments

• In almost all segments, canceled reservations have a higher ADR than reservations that were not canceled.
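A count plot shows the number of cancellations, not the percentage, so it helps to put both in one table. The helper below, `cancel_summary`, is a hypothetical name introduced for this sketch, and the rows are made up.

```python
import pandas as pd

def cancel_summary(frame, by):
    """Hypothetical helper: bookings, cancellations, and rate per group."""
    out = frame.groupby(by)['is_canceled'].agg(
        bookings='size',       # number of reservations in the group
        cancellations='sum',   # number of canceled reservations
        cancel_rate='mean',    # share of canceled reservations
    )
    return out.sort_values('cancel_rate', ascending=False)

# Toy rows; all values are made up.
toy = pd.DataFrame({
    'market_segment': ['Groups', 'Groups', 'Online TA', 'Online TA',
                       'Online TA', 'Online TA', 'Online TA', 'Direct'],
    'is_canceled': [1, 1, 1, 1, 1, 0, 0, 0],
})

summary = cancel_summary(toy, 'market_segment')
print(summary)
```

In this toy example the segment with the highest rate is not the one with the most cancellations, which is the distinction the first two bullets make. The same helper can be pointed at other categorical columns such as deposit_type.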

#### Distribution Channels


In [None]:
plt.figure(figsize=(6,6))
plt.title(label='Cancellation by Distribution Channels')
plt.xticks(rotation=45) 
sns.countplot(x='distribution_channel',hue='is_canceled',data=df)
plt.show()

Reservations from Travel Agents or Tour Operators are more likely to be canceled


#### Cancellations by Month, Day and Week Number

In [None]:
df['reservation_status_date'] = pd.to_datetime(df['reservation_status_date'], format='%Y-%m-%d')

In [None]:
plt.figure(figsize=(6,6))
plt.title(label='Cancellation by Month')
plt.xticks(rotation=45) 
sns.countplot(x=df['reservation_status_date'].dt.month,hue='is_canceled',data=df)
plt.show()

In [None]:
plt.figure(figsize=(19,6))
plt.title(label='Cancellation by Week Number')
plt.xticks(rotation=45) 
sns.countplot(x=df['arrival_date_week_number'],hue='is_canceled',data=df)
plt.show()

In [None]:
plt.figure(figsize=(16,6))
plt.title(label='Cancellation by day')
plt.xticks(rotation=45) 
sns.countplot(x=df['reservation_status_date'].dt.day,hue='is_canceled',data=df)
plt.show()

### Modeling

I will use the precision and recall of class 1 (canceled), together with overall accuracy and interpretability, to decide on the best model.
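As a reminder of what these metrics measure, here is a small worked example that derives them from a confusion matrix. The labels are made up; class 1 stands for a canceled reservation.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Made-up labels: 1 = canceled, 0 = not canceled.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

# For binary labels, ravel() returns tn, fp, fn, tp in that order.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = tp / (tp + fp)   # of the predicted cancellations, how many were real
recall = tp / (tp + fn)      # of the real cancellations, how many were caught
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(precision, recall, accuracy)
print(precision_score(y_true, y_pred), recall_score(y_true, y_pred))
```

For the front desk use case, recall on class 1 says how many cancellations the model would flag, and precision says how many calls would go to guests who really intended to cancel.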

In [None]:
cat_cols=['is_canceled','arrival_date_month','meal','market_segment','distribution_channel','reserved_room_type',
      'is_repeated_guest','deposit_type','customer_type']
df[cat_cols] = df[cat_cols].astype('category')
num_cols = ['lead_time','arrival_date_week_number','arrival_date_day_of_month','stays_in_weekend_nights','stays_in_week_nights',
        'adults','children','babies','previous_cancellations','previous_bookings_not_canceled','required_car_parking_spaces',
        'total_of_special_requests','adr']

In [None]:
model_df = df[cat_cols+num_cols]
model_df.shape

In [None]:
# correlations are only defined for the numeric columns
model_df[num_cols].corr()

In [None]:
# Create dummy variables
df_dummies = pd.get_dummies(model_df.drop(columns=['is_canceled']))

In [None]:
df_dummies.head()

In [None]:
y = model_df['is_canceled']
X = df_dummies

In [None]:
# Load modules for machine learning
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report, confusion_matrix

In [None]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)
sc = StandardScaler()

sc.fit(X_train)
X_train_std = sc.transform(X_train)
X_test_std = sc.transform(X_test)

##### Logistic Regression

In [None]:
from sklearn.linear_model import LogisticRegression

lr = LogisticRegression()
lr.fit(X_train_std, y_train)
y_lr_pred = lr.predict(X_test_std)

print('Accuracy: %.4f' % accuracy_score(y_test, y_lr_pred))
print(confusion_matrix(y_test, y_lr_pred))
print(classification_report(y_test,y_lr_pred))

##### Decision Tree

In [None]:
from sklearn.tree import DecisionTreeClassifier

clf = DecisionTreeClassifier()
clf.fit(X_train_std, y_train)
y_clf_pred = clf.predict(X_test_std)

print('Accuracy: %.4f' % accuracy_score(y_test, y_clf_pred))
print(confusion_matrix(y_test, y_clf_pred))
print(classification_report(y_test,y_clf_pred))

##### Ada Boosting

In [None]:
from sklearn.ensemble import AdaBoostClassifier

ada = AdaBoostClassifier()
ada.fit(X_train_std, y_train)
y_ada_pred = ada.predict(X_test_std)

print('Accuracy: %.4f' % accuracy_score(y_test, y_ada_pred))
print(confusion_matrix(y_test, y_ada_pred))
print(classification_report(y_test,y_ada_pred))

##### Gradient Boosting

In [None]:
from sklearn.ensemble import GradientBoostingClassifier

gbc = GradientBoostingClassifier()
gbc.fit(X_train_std, y_train)
y_gbc_pred = gbc.predict(X_test_std)

print('Accuracy: %.4f' % accuracy_score(y_test, y_gbc_pred))
print(confusion_matrix(y_test, y_gbc_pred))
print(classification_report(y_test,y_gbc_pred))

##### XGBoost

In [None]:
from xgboost import XGBClassifier

xgb = XGBClassifier()
xgb.fit(X_train_std, y_train)
y_xgb_pred = xgb.predict(X_test_std)

print('Accuracy: %.4f' % accuracy_score(y_test, y_xgb_pred))
print(confusion_matrix(y_test, y_xgb_pred))
print(classification_report(y_test,y_xgb_pred))

##### Random Forest

In [None]:
from sklearn.ensemble import RandomForestClassifier

rfl = RandomForestClassifier()
rfl.fit(X_train_std, y_train)
y_rfl_pred = rfl.predict(X_test_std)

print('Accuracy: %.4f' % accuracy_score(y_test, y_rfl_pred))
print(confusion_matrix(y_test, y_rfl_pred))
print(classification_report(y_test,y_rfl_pred))

It looks like the random forest has the highest precision, recall, and accuracy. Let's take a look at feature importance.
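Comparing models by scrolling through separate reports is error-prone, so one option is to collect the metrics in a single table. The sketch below runs on synthetic data from `make_classification` rather than the booking data, uses only three of the models, and fixes `random_state` as an assumption to keep the output repeatable.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the scaled training and test sets.
X_toy, y_toy = make_classification(n_samples=400, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.3,
                                          random_state=0)

models = {
    'Logistic Regression': LogisticRegression(max_iter=1000),
    'Decision Tree': DecisionTreeClassifier(random_state=0),
    'Random Forest': RandomForestClassifier(random_state=0),
}

rows = []
for name, model in models.items():
    pred = model.fit(X_tr, y_tr).predict(X_te)
    rows.append({
        'model': name,
        'accuracy': accuracy_score(y_te, pred),
        'precision_1': precision_score(y_te, pred),  # precision for class 1
        'recall_1': recall_score(y_te, pred),        # recall for class 1
    })

results = (pd.DataFrame(rows)
           .set_index('model')
           .sort_values('accuracy', ascending=False))
print(results)
```

With the notebook's data, the same loop would use X_train_std, y_train, X_test_std, and y_test.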

In [None]:
importances = rfl.feature_importances_ 
# Sort feature importances in descending order
indices = np.argsort(importances)[::-1]

# Rearrange feature names so they match the sorted feature importances
names = [X.columns[i] for i in indices]

# Create plot
plt.figure(figsize=(30,30))

# Create plot title
plt.title("Feature Importance")

# Add bars
plt.bar(range(X.shape[1]), importances[indices])

# Add feature names as x-axis labels
plt.xticks(range(X.shape[1]), names, rotation=90)

# Show plot
plt.show()

In [None]:
# print all feature importances
indices = np.argsort(importances)[::-1]
feat_labels = X.columns[:]

for f in range(X_train_std.shape[1]):
    print("%2d) %-*s %f" % (f + 1, 30, 
                            feat_labels[indices[f]], 
                            importances[indices[f]]))

Let's dive deeper into the performance

In [None]:
from sklearn.metrics import roc_curve, roc_auc_score

In [None]:
y_score = rfl.predict_proba(X_test_std)[:,1]
# Create true and false positive rates
false_positive_rate, true_positive_rate, threshold = roc_curve(y_test, y_score)
# Plot ROC curve
plt.title('Receiver Operating Characteristic')
plt.plot(false_positive_rate, true_positive_rate)
plt.plot([0, 1], ls="--")
plt.plot([0, 0], [1, 0] , c=".7"), plt.plot([1, 1] , c=".7")
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.show()

In [None]:
import scikitplot as skplt

In [None]:
# reuse the random forest fitted above; refitting without a fixed random_state would change the results
y_probas = rfl.predict_proba(X_test_std)
skplt.metrics.plot_roc(y_test,y_probas)
plt.show()

In [None]:
skplt.metrics.plot_lift_curve(y_test, y_probas)
plt.show()

So the best model is the Random Forest, with a precision of 0.86, a recall of 0.74, and an overall accuracy of 0.8570. We are also able to interpret the model's results by extracting feature importances.
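To connect the model back to the front desk use case, predicted probabilities can be turned into a short call list. This is a sketch under stated assumptions: the probabilities are made up and stand in for the second column of `predict_proba`, and `booking_id` and `call_capacity` are hypothetical names that do not exist in the dataset.

```python
import pandas as pd

# Made-up probabilities standing in for rfl.predict_proba(X_test_std)[:, 1].
scores = pd.DataFrame({
    'booking_id': [101, 102, 103, 104, 105],   # hypothetical identifier
    'cancel_proba': [0.12, 0.81, 0.47, 0.93, 0.30],
})

# Hypothetical number of calls the front desk can make.
call_capacity = 2

# Call the guests with the highest predicted cancellation probability first.
call_list = (scores.sort_values('cancel_proba', ascending=False)
                   .head(call_capacity))
print(call_list)
```

Ranking by probability rather than by the 0/1 prediction lets the hotel adapt the list length to however many calls the staff can handle on a given day.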