# Introduction

Hello everyone! This is my analysis of the Hotel Booking dataset. It is the first time I have tried to analyze data by defining my own question and solving it, and I am very excited! Let's first take a look at the data.

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import math

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
%matplotlib inline

from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import (GridSearchCV, KFold, cross_val_score,
                                     cross_validate, train_test_split)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
full_data = pd.read_csv("/kaggle/input/hotel-booking-demand/hotel_bookings.csv")

In [None]:
full_data.shape

In [None]:
full_data.head()

In [None]:
print("The parameters are:")
print(full_data.columns.unique())
len(full_data.columns.unique())

In [None]:
print("Following columns contain missing values:")
print(full_data.columns[full_data.isna().any()].unique())
len(full_data.columns[full_data.isna().any()].unique())

That was a quick look at the data. There are 119390 rows with 32 parameters (columns), and four of the parameters have missing values. Let's now look deeper into these four columns and see how many values are missing.

In [None]:
print("Number of empty values in children column: ", full_data['children'].isnull().sum())
print("Percentage: ", full_data['children'].isnull().sum() / full_data.shape[0] * 100)

In [None]:
print("Number of empty values in country column: ", full_data['country'].isnull().sum())
print("Percentage: ", full_data['country'].isnull().sum() / full_data.shape[0] * 100)

In [None]:
print("Number of empty values in agent column: ", full_data['agent'].isnull().sum())
print("Percentage: ", full_data['agent'].isnull().sum() / full_data.shape[0] * 100)

In [None]:
print("Number of empty values in company column: ", full_data['company'].isnull().sum())
print("Percentage: ", full_data['company'].isnull().sum() / full_data.shape[0] * 100)
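The four per-column checks above can also be collapsed into a single summary table. Here is a minimal sketch, assuming a small made-up frame `toy` in place of `full_data`; the helper name `missing_summary` is my own and not part of the notebook.

```python
import numpy as np
import pandas as pd

def missing_summary(df):
    """Return count and percentage of missing values for columns that have any."""
    counts = df.isna().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "percent": counts / len(df) * 100,
    })
    # keep only columns with at least one missing value, worst first
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)

# toy stand-in for full_data (hypothetical values, not the real bookings)
toy = pd.DataFrame({
    "children": [0.0, 1.0, np.nan, 0.0],
    "country": ["PRT", np.nan, "GBR", "PRT"],
    "company": [np.nan, np.nan, np.nan, 40.0],
    "adults": [2, 2, 1, 3],
})
print(missing_summary(toy))
```

On the real data the call would be `missing_summary(full_data)`, which should list the same four columns as the cells above.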

The output above gives the exact number of rows with missing values in each column. The company column has about 94.3% of its values missing, which means we should not use it for further analysis because too little of it is filled in. So let's just drop this column.

In [None]:
full_data = full_data.drop(['company'], axis=1)

In [None]:
full_data.shape

In [None]:
full_data.columns.unique()

# Univariate Data analysis

Now we have seen the general picture of our data. Let's explore some of the columns in more depth: visualize their distributions and check for outliers and other characteristics. Let's begin!

**Hotel**

From this column, we can see that almost twice as many people book a city hotel as a resort hotel. One possible explanation is that a city hotel is usually cheaper than a resort hotel, although this column alone cannot confirm that.

In [None]:
plt.figure(figsize=(15, 10))
plt.title("Hotel Column")
sns.countplot(x=full_data['hotel'])
plt.show()

print("Percentages: ")
print(full_data['hotel'].value_counts() / full_data.shape[0] * 100)

**is_canceled**

From this column, we can see that almost twice as many bookings are kept as are canceled.

In [None]:
plt.figure(figsize=(15, 10))
plt.title("is_canceled Column")
sns.countplot(x=full_data['is_canceled'])
plt.show()

print("Percentages: ")
print(full_data['is_canceled'].value_counts() / full_data.shape[0] * 100)

**lead_time**

Here we see that this column is highly right-skewed. Many people book the hotel only a short time before they arrive, but there are people who book two years prior to their check-in!

In [None]:
plt.figure(figsize=(15, 10))
plt.title("lead_time Column")
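# sns.distplot is deprecated in recent seaborn releases; histplot draws the histogram without a KDE curve
sns.histplot(x=full_data['lead_time'], kde=False)
plt.xlabel("Number of elapsed days")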
plt.show()

In [None]:
plt.figure(figsize=(15, 10))
plt.title("lead_time Column")
sns.boxplot(x=full_data['lead_time'])
plt.show()

print(full_data['lead_time'].describe())
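The skew and the outliers seen in the two plots can also be put into numbers. Here is a rough sketch on a made-up series (the values below are hypothetical, not taken from the dataset): it computes the sample skewness and the upper fence `Q3 + 1.5 * IQR`, which is the limit a default box plot uses before drawing points as outliers.

```python
import pandas as pd

# toy stand-in for full_data['lead_time'] (hypothetical values, in days)
lead_time = pd.Series([0, 1, 2, 3, 5, 8, 14, 30, 60, 120, 400, 700])

q1, q3 = lead_time.quantile([0.25, 0.75])
iqr = q3 - q1
upper_fence = q3 + 1.5 * iqr           # points above this are drawn as outliers
outliers = lead_time[lead_time > upper_fence]

print("skewness:", lead_time.skew())   # positive for a right-skewed column
print("upper fence:", upper_fence)
print("outliers:", outliers.tolist())
```

The same three lines of arithmetic should work on `full_data['lead_time']` directly.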

**Arrival Month**

From this column, we see that there is a slight increase in arrivals during summer, which is also intuitive because families can travel together while the kids are on vacation.

In [None]:
ordered_months = ["January", "February", "March", "April", "May", "June", 
          "July", "August", "September", "October", "November", "December"]

sorted_month = pd.Categorical(full_data["arrival_date_month"], categories=ordered_months, ordered=True)

plt.figure(figsize=(15, 10))
plt.title("Arrival Month")
sns.countplot(x=sorted_month)
plt.show()

print("Percentages: ")
print(sorted_month.value_counts() / full_data.shape[0] * 100)

**stays_in_weekend_nights**

The most common case is no weekend nights at all, but there are people who basically live in the hotel for a few weeks.

In [None]:
plt.figure(figsize=(15, 10))
plt.title("Weekend Nights Column")
sns.countplot(x=full_data['stays_in_weekend_nights'])
plt.show()

print("Percentages: ")
print(full_data['stays_in_weekend_nights'].value_counts() / full_data.shape[0] * 100)

**Number of adults**

So most groups consist of two adults, which is normal. Bookings with 20 or 50 adults are likely to be tour groups.

In [None]:
plt.figure(figsize=(15, 10))
plt.title("Adult Column")
sns.countplot(x=full_data['adults'])
plt.show()

print(full_data['adults'].value_counts() / full_data.shape[0] * 100)

**Number of Children**

So most customers don't bring children to the hotel.

In [None]:
plt.figure(figsize=(15, 10))
plt.title("Children Column")
sns.countplot(x=full_data['children'])
plt.show()

print(full_data['children'].value_counts() / full_data.shape[0] * 100)

**babies**

Again, most customers don't bring babies.

In [None]:
plt.figure(figsize=(15, 10))
plt.title("Baby Column")
sns.countplot(x=full_data['babies'])
plt.show()

print(full_data['babies'].value_counts() / full_data.shape[0] * 100)

**Country**

In [None]:
# value_counts() names its result differently across pandas versions, so set the column name explicitly
country_data = (full_data.loc[full_data["is_canceled"] == 0, "country"]
                .value_counts()
                .rename("Number of Guests")
                .to_frame())
total_guests = country_data["Number of Guests"].sum()
country_data["Guests in %"] = round(country_data["Number of Guests"] / total_guests * 100, 2)
country_data["country"] = country_data.index

guest_map = px.choropleth(country_data,
                    locations=country_data.index,
                    color=country_data["Guests in %"], 
                    hover_name=country_data.index, 
                    color_continuous_scale=px.colors.sequential.Plasma,
                    title="Home country of guests")
guest_map.show()

OK, so I just visualized some columns that I am interested in. This type of analysis did tell us some basic information about the customers, such as what type of hotel they book, where they come from, and how many people usually travel together. All of this is useful, but exploring columns one by one without a goal is too vague and takes a lot of time. So let's now define our question and then see what we can do.

# Define Question

If I were a hotel manager, I would be very interested in knowing whether a customer will cancel a hotel reservation, because cancellations directly affect the revenue the hotel can earn. Therefore, let's try to build a model that predicts from the other parameters whether someone will cancel a reservation.

**Cancellation Correlations**

Now that we have defined the problem, let's first see which columns are highly associated with the is_canceled column.

In [None]:
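# restrict to numeric columns: recent pandas versions raise an error when corr() meets text columns
cancel_corr = full_data.select_dtypes(include="number").corr()["is_canceled"]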
cancel_corr.abs().sort_values(ascending=False)[1:]

This shows that lead_time, total_of_special_requests, required_car_parking_spaces, booking_changes and previous_cancellations are the 5 numerical variables most correlated with the is_canceled column. But to prevent possible data leakage, we should exclude booking_changes, which may include the cancellation of the reservation itself. Furthermore, reservation_status also records whether a customer canceled the booking, so it must also be excluded to prevent data leakage.

Now let's look at the categorical features.

**Hotel**

In [None]:
plt.figure(figsize=(15, 10))
plt.title("Hotel Column")
sns.barplot(x=full_data['hotel'], y=full_data['is_canceled'])
plt.show()

The type of hotel seems to affect if a customer will cancel the reservation.

**Meal**

In [None]:
plt.figure(figsize=(15, 10))
plt.title("Meal Column")
sns.barplot(x=full_data['meal'], y=full_data['is_canceled'])
plt.show()

Meal also seems to affect the chance of cancellation. Specifically, customers who ordered full board are more likely to cancel reservations.

**Market Segment**

In [None]:
plt.figure(figsize=(15, 10))
plt.title("Market Column")
sns.barplot(x=full_data['market_segment'], y=full_data['is_canceled'])
plt.show()

This column also seems to affect reservation cancellation.

**Distribution Channel**

In [None]:
plt.figure(figsize=(15, 10))
plt.title("Distribution Channel Column")
sns.barplot(x=full_data['distribution_channel'], y=full_data['is_canceled'])
plt.show()

Again, the distribution channel has an effect.

**Reserved Room Type**

In [None]:
plt.figure(figsize=(15, 10))
plt.title("Reserved Room Type Column")
sns.barplot(x=full_data['reserved_room_type'], y=full_data['is_canceled'])
plt.show()

This column also seems to have a strong impact on cancellation. Customers who reserved a type P room are more likely to cancel.

**Assigned Room Type**

In [None]:
plt.figure(figsize=(15, 10))
plt.title("Assigned Room Type Column")
sns.barplot(x=full_data['assigned_room_type'], y=full_data['is_canceled'])
plt.show()

Customers who were assigned P or L rooms are more likely to cancel reservations.

**Deposit Type**

In [None]:
plt.figure(figsize=(15, 10))
plt.title("Deposit Column")
sns.barplot(x=full_data['deposit_type'], y=full_data['is_canceled'])
plt.show()

Deposit type definitely indicates whether a customer is more likely to cancel. Visitors with a non-refundable deposit are more likely to cancel their booking.

**Customer Type**

In [None]:
plt.figure(figsize=(15, 10))
plt.title("Customer Type Column")
sns.barplot(x=full_data['customer_type'], y=full_data['is_canceled'])
plt.show()

Customer type also seems to be an indicator. Specifically, transient customers are more likely to cancel reservations.
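Each bar in the plots above is the mean of is_canceled within one category, so the same information can be read as a table. Here is a small sketch on made-up rows (the frame `toy` is hypothetical); it also reports how many bookings sit behind each rate.

```python
import pandas as pd

# toy stand-in for full_data (hypothetical rows)
toy = pd.DataFrame({
    "deposit_type": ["No Deposit", "No Deposit", "Non Refund",
                     "Non Refund", "No Deposit", "Refundable"],
    "is_canceled":  [0, 1, 1, 1, 0, 0],
})

# mean of a 0/1 column = share of canceled bookings;
# count shows how many rows back each rate
rates = (toy.groupby("deposit_type")["is_canceled"]
            .agg(cancel_rate="mean", bookings="count")
            .sort_values("cancel_rate", ascending=False))
print(rates)
```

Reading the count next to the rate matters because a category with only a handful of bookings can show an extreme cancellation rate by chance.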

# Model Selection

Before going further, let's select a machine learning model for our prediction. In this section I will create several simple models and see which one performs best.

In [None]:
num_features = ["lead_time","arrival_date_week_number","arrival_date_day_of_month",
                "stays_in_weekend_nights","stays_in_week_nights","adults","children",
                "babies","is_repeated_guest", "previous_cancellations",
                "previous_bookings_not_canceled","agent",
                "required_car_parking_spaces", "total_of_special_requests", "adr"]

cat_features = ["hotel","arrival_date_month","meal","market_segment",
                "distribution_channel","reserved_room_type","assigned_room_type", "deposit_type","customer_type"]


features = num_features + cat_features
X = full_data.drop(["is_canceled"], axis=1)[features]
y = full_data["is_canceled"]

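# missing numeric values are filled with 0, the default for strategy="constant", written out explicitly
num_transformer = SimpleImputer(strategy="constant", fill_value=0)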

cat_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown='ignore'))])

preprocessor = ColumnTransformer(transformers=[("num", num_transformer, num_features),
                                               ("cat", cat_transformer, cat_features)])
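To make the preprocessing step above more concrete, here is a minimal sketch of the same kind of `ColumnTransformer` applied to a made-up four-row frame. The frame `toy` and the names `num_cols`, `cat_cols` and `prep` are my own stand-ins; only the structure mirrors the cell above.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# toy stand-in for X (hypothetical values)
toy = pd.DataFrame({
    "agent":  [9.0, np.nan, 240.0, 9.0],
    "adults": [2, 1, 2, 3],
    "meal":   ["BB", "BB", "HB", np.nan],
})
num_cols, cat_cols = ["agent", "adults"], ["meal"]

prep = ColumnTransformer(transformers=[
    # numeric columns: fill gaps with 0
    ("num", SimpleImputer(strategy="constant", fill_value=0), num_cols),
    # categorical columns: fill gaps with the most frequent label, then one-hot encode
    ("cat", Pipeline(steps=[
        ("imputer", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ]), cat_cols),
])

out = prep.fit_transform(toy)
# the result may be a SciPy sparse matrix; densify it for printing
dense = out.toarray() if hasattr(out, "toarray") else np.asarray(out)
print(dense)   # columns: agent, adults, meal_BB, meal_HB
```

The output has no column names: the numeric columns come first, followed by one column per category of each categorical feature.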

In [None]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from xgboost import XGBClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.neighbors import KNeighborsClassifier

base_models = [("DT_model", DecisionTreeClassifier(random_state=42)),
               ("RF_model", RandomForestClassifier(random_state=42,n_jobs=-1)),
               ("LR_model", LogisticRegression(random_state=42,n_jobs=-1)),
               ("XGB_model", XGBClassifier(random_state=42, n_jobs=-1)),
               ("Ada_model", AdaBoostClassifier(random_state=42)),
               ("KNN_model", KNeighborsClassifier(n_jobs=-1))]

kfolds = 4 # 4 = 75% train, 25% validation
split = KFold(n_splits=kfolds, shuffle=True, random_state=42)

'''
for name, model in base_models:
    model_steps = Pipeline(steps=[('preprocessor', preprocessor),
                              ('model', model)])
    
    cv_results = cross_val_score(model_steps, 
                                 X, y, 
                                 cv=split,
                                 scoring="accuracy",
                                 n_jobs=-1)
    
    min_score = round(min(cv_results), 4)
    max_score = round(max(cv_results), 4)
    mean_score = round(np.mean(cv_results), 4)
    std_dev = round(np.std(cv_results), 4)
    print(f"{name} cross validation accuracy score: {mean_score} +/- {std_dev} (std) min: {min_score}, max: {max_score}")
'''

We see that the random forest classifier had the best performance (86.64%). Therefore, we will use this model for later predictions.

Detail: RF_model cross validation accuracy score: 0.8664 +/- 0.0012 (std) min: 0.8646, max: 0.8676
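The comparison loop above can be tried quickly on synthetic data before running it on the full bookings table. Here is a minimal sketch under that assumption: `make_classification` generates the data, only two of the models are included, and the helper name `cv_accuracy_summary` is my own. The scores it prints say nothing about the hotel dataset.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

def cv_accuracy_summary(model, X, y, cv):
    """Cross-validate and return mean, std, min and max accuracy (4 decimals)."""
    scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
    return {"mean": round(float(np.mean(scores)), 4),
            "std": round(float(np.std(scores)), 4),
            "min": round(float(np.min(scores)), 4),
            "max": round(float(np.max(scores)), 4)}

# synthetic binary-classification data, NOT the hotel bookings
X_demo, y_demo = make_classification(n_samples=400, n_features=8, random_state=42)
cv_demo = KFold(n_splits=4, shuffle=True, random_state=42)

demo_models = [("DT_model", DecisionTreeClassifier(random_state=42)),
               ("RF_model", RandomForestClassifier(n_estimators=50, random_state=42))]

results = {name: cv_accuracy_summary(model, X_demo, y_demo, cv_demo)
           for name, model in demo_models}
for name, summary in results.items():
    print(name, summary)
```

Wrapping the scoring in one function also avoids repeating the min, max, mean and std lines every time a model is re-evaluated.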

# More Bivariate Analysis

We have found the columns that are most strongly associated with is_canceled. Now let's actually visualize the relationships.

**lead_time**

We see that, in general, the earlier a customer books the hotel, the more likely they are to cancel the reservation.

In [None]:
plt.figure(figsize=(15, 10))
plt.title("lead_time Column")
sns.barplot(x=full_data['is_canceled'], y=full_data['lead_time'])
plt.show()

print( full_data[["lead_time","is_canceled"]].groupby(["is_canceled"], as_index = False).mean() )

**total_of_special_requests**

We see that the more requests a customer makes, the less likely they are to cancel the reservation.

In [None]:
plt.figure(figsize=(15, 10))
plt.title("total_of_special_requests Column")
sns.barplot(x=full_data['total_of_special_requests'], y=full_data['is_canceled'])
plt.show()

print( full_data[["total_of_special_requests","is_canceled"]].groupby(["total_of_special_requests"], as_index = False).mean() )

**required_car_parking_spaces**

In [None]:
plt.figure(figsize=(15, 10))
plt.title("required_car_parking_spaces Column")
sns.countplot(x=full_data['required_car_parking_spaces'])
plt.show()

print(full_data['required_car_parking_spaces'].value_counts() / full_data.shape[0] * 100)

In [None]:
plt.figure(figsize=(15, 10))
plt.title("required_car_parking_spaces Column")
sns.barplot(x=full_data['required_car_parking_spaces'], y=full_data['is_canceled'])
plt.show()

print( full_data[["required_car_parking_spaces","is_canceled"]].groupby(["required_car_parking_spaces"], as_index = False).mean() )

So we see that if the customer requires at least one car parking space, the chance of canceling the reservation is very small. We could simplify this column so that it only records whether a customer requested parking (denoted as 1) or not (denoted as 0), which may be more straightforward for our model.

In [None]:
full_data.loc[full_data['required_car_parking_spaces'] != 0, 'required_car_parking_spaces'] = 1
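The same binarization can be written as a comparison followed by a cast. Here is a minimal sketch on a made-up series (the values and the name `has_parking_request` are hypothetical); it assumes the counts are never negative, in which case `> 0` and `!= 0` select the same rows.

```python
import pandas as pd

# toy stand-in for full_data['required_car_parking_spaces'] (hypothetical values)
spaces = pd.Series([0, 1, 0, 2, 8, 0])

# True where at least one space was requested, then cast to 0/1
has_parking_request = (spaces > 0).astype(int)
print(has_parking_request.tolist())   # [0, 1, 0, 1, 1, 0]
```

Storing the result in a new column rather than overwriting the original would keep the raw counts available, so the change could be undone without a backup copy.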

After the change:

In [None]:
plt.figure(figsize=(15, 10))
plt.title("required_car_parking_spaces Column")
sns.countplot(x=full_data['required_car_parking_spaces'])
plt.show()

print(full_data['required_car_parking_spaces'].value_counts() / full_data.shape[0] * 100)

In [None]:
plt.figure(figsize=(15, 10))
plt.title("required_car_parking_spaces Column")
sns.barplot(x=full_data['required_car_parking_spaces'], y=full_data['is_canceled'])
plt.show()

print( full_data[["required_car_parking_spaces","is_canceled"]].groupby(["required_car_parking_spaces"], as_index = False).mean() )

Let's see if this would make any difference to our model's prediction score.

In [None]:
X = full_data.drop(["is_canceled"], axis=1)[features]
y = full_data["is_canceled"]

model_steps = Pipeline(steps=[('preprocessor', preprocessor),
                              ('model', RandomForestClassifier(random_state=42,n_jobs=-1))])
    
cv_results = cross_val_score(model_steps, 
                             X, y, 
                             cv=split,
                             scoring="accuracy",
                             n_jobs=-1)

min_score = round(min(cv_results), 4)
max_score = round(max(cv_results), 4)
mean_score = round(np.mean(cv_results), 4)
std_dev = round(np.std(cv_results), 4)
print(f"RF Model cross validation accuracy score: {mean_score} +/- {std_dev} (std) min: {min_score}, max: {max_score}")

Nice! We see an improvement in accuracy, which suggests that this modification works!

**previous_cancellations**

In [None]:
plt.figure(figsize=(15, 10))
plt.title("previous_cancellations Column")
sns.countplot(x=full_data['previous_cancellations'])
plt.show()

print(full_data['previous_cancellations'].value_counts() / full_data.shape[0] * 100)

In [None]:
plt.figure(figsize=(15, 10))
plt.title("previous_cancellations Column")
sns.barplot(x=full_data['previous_cancellations'], y=full_data['is_canceled'])
plt.show()

print( full_data[["previous_cancellations","is_canceled"]].groupby(["previous_cancellations"], as_index = False).mean() )

So we see that if the number of previous_cancellations is 1 or above 12, the chance of canceling the reservation is very high. This makes me wonder whether the prediction would improve if I applied the same kind of modification as for parking spaces. Let's find out.

In [None]:
temp_previous_cancel = full_data['previous_cancellations'].copy()

In [None]:
full_data.loc[(full_data['previous_cancellations'] == 1) | (full_data['previous_cancellations'] >= 13), 'previous_cancellations'] = 1
full_data.loc[(full_data['previous_cancellations'] != 1) & (full_data['previous_cancellations'] < 13), 'previous_cancellations'] = 0

In [None]:
plt.figure(figsize=(15, 10))
plt.title("previous_cancellations Column")
sns.countplot(x=full_data['previous_cancellations'])
plt.show()

print(full_data['previous_cancellations'].value_counts() / full_data.shape[0] * 100)

In [None]:
plt.figure(figsize=(15, 10))
plt.title("previous_cancellations Column")
sns.barplot(x=full_data['previous_cancellations'], y=full_data['is_canceled'])
plt.show()

print( full_data[["previous_cancellations","is_canceled"]].groupby(["previous_cancellations"], as_index = False).mean() )

Now our model's performance is:

In [None]:
X = full_data.drop(["is_canceled"], axis=1)[features]
y = full_data["is_canceled"]

model_steps = Pipeline(steps=[('preprocessor', preprocessor),
                              ('model', RandomForestClassifier(random_state=42,n_jobs=-1))])
    
cv_results = cross_val_score(model_steps, 
                             X, y, 
                             cv=split,
                             scoring="accuracy",
                             n_jobs=-1)

min_score = round(min(cv_results), 4)
max_score = round(max(cv_results), 4)
mean_score = round(np.mean(cv_results), 4)
std_dev = round(np.std(cv_results), 4)
print(f"RF Model cross validation accuracy score: {mean_score} +/- {std_dev} (std) min: {min_score}, max: {max_score}")

Sadly, this actually makes the performance worse, so let's undo this change. 

In [None]:
full_data['previous_cancellations'] = temp_previous_cancel

In [None]:
plt.figure(figsize=(15, 10))
plt.title("previous_cancellations Column")
sns.barplot(x=full_data['previous_cancellations'], y=full_data['is_canceled'])
plt.show()

print( full_data[["previous_cancellations","is_canceled"]].groupby(["previous_cancellations"], as_index = False).mean() )

In [None]:
X = full_data.drop(["is_canceled"], axis=1)[features]
y = full_data["is_canceled"]

model_steps = Pipeline(steps=[('preprocessor', preprocessor),
                              ('model', RandomForestClassifier(random_state=42,n_jobs=-1))])
    
cv_results = cross_val_score(model_steps, 
                             X, y, 
                             cv=split,
                             scoring="accuracy",
                             n_jobs=-1)

min_score = round(min(cv_results), 4)
max_score = round(max(cv_results), 4)
mean_score = round(np.mean(cv_results), 4)
std_dev = round(np.std(cv_results), 4)
print(f"RF Model cross validation accuracy score: {mean_score} +/- {std_dev} (std) min: {min_score}, max: {max_score}")

# Final Tuning

Now that we have made all the modifications, let's finally adjust the hyperparameters and make the final predictions.

In [None]:
rf_model_enh = RandomForestClassifier(n_estimators=160,
                               max_features=0.4,
                               min_samples_split=2,
                               n_jobs=-1,
                               random_state=0)

split = KFold(n_splits=kfolds, shuffle=True, random_state=42)
model_pipe = Pipeline(steps=[('preprocessor', preprocessor),
                              ('model', rf_model_enh)])
cv_results = cross_val_score(model_pipe, 
                                 X, y, 
                                 cv=split,
                                 scoring="accuracy",
                                 n_jobs=-1)
# output:
min_score = round(min(cv_results), 4)
max_score = round(max(cv_results), 4)
mean_score = round(np.mean(cv_results), 4)
std_dev = round(np.std(cv_results), 4)
print(f"Enhanced RF model cross validation accuracy score: {mean_score} +/- {std_dev} (std) min: {min_score}, max: {max_score}")

So our final prediction accuracy is 87.12%.

# Checking Parameters

Now we have trained a random forest classifier. I am curious which factors carry the most weight in this model and whether there is a way for us to improve it further.

In [None]:
from sklearn.model_selection import train_test_split

X = preprocessor.fit_transform(X)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)


rf_model_enh.fit(X_train, y_train)

In [None]:
X_train

In [None]:
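# fit_transform returned a bare matrix, so rebuild the column names (numeric first, then one-hot)
onehot = preprocessor.named_transformers_["cat"].named_steps["onehot"]
names = num_features + [f"{c}_{v}" for c, cats in zip(cat_features, onehot.categories_) for v in cats]
important_features = pd.Series(rf_model_enh.feature_importances_, index=names).sort_values(ascending=False)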