<h1 style="text-align: center;">EDA of Hotel Bookings and ML to Predict Cancellations</h1>

In [None]:
# common imports
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

# machine learning imports
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.base import TransformerMixin
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.base import BaseEstimator
from sklearn import metrics

# display setup
pd.set_option("display.max_columns", None) # the None parameter displays unlimited columns
sns.set(style="whitegrid") # for plots

# 1. Getting the Data

In [None]:
# read the csv file
df = pd.read_csv(r"../input/hotel-booking-demand/hotel_bookings.csv")

In [None]:
# display the first 5 rows for a quick look
df.head()

In [None]:
# DataFrame shape (rows, columns)
# understand the amount of data we are working with
df.shape

In [None]:
# description of data
df.info()

> A first look shows that some features have
> missing values (e.g. the "company" and "agent" columns).
> We will need to take care of this later.

In [None]:
# summary of the numerical attributes
# null values are ignored
df.describe()

> ### Features in the DataFrame:
>> 1. hotel: Resort Hotel or City Hotel
>> 2. is_canceled: Value indicating if the booking was canceled (1) or not (0)
>> 3. lead_time: Number of days between the booking date and the arrival date
>> 4. arrival_date_year: Year of arrival
>> 5. arrival_date_month: Month of arrival
>> 6. arrival_date_week_number: Week number according to year of arrival
>> 7. arrival_date_day_of_month: Day of arrival
>> 8. stays_in_weekend_nights: Number of weekend nights booked (Saturday or Sunday)
>> 9. stays_in_week_nights: Number of week nights booked (Monday to Friday)
>> 10. adults: Number of adults
>> 11. children: Number of children
>> 12. babies: Number of babies
>> 13. meal: Type of meal booked
>> 14. country: Country of origin
>> 15. market_segment: Market segment designation, typically influences the price sensitivity
>> 16. distribution_channel: Booking distribution channel, refers to how the booking was made
>> 17. is_repeated_guest: Value indicating if the booking was from a repeated guest (1) or not (0)
>> 18. previous_cancellations: Number of previous cancellations prior to current booking
>> 19. previous_bookings_not_canceled: Number of previous bookings not canceled prior to current booking
>> 20. reserved_room_type: Code of room type reserved
>> 21. assigned_room_type: Code for the type of room assigned to the booking
>> 22. booking_changes: Number of changes made to the booking since entering the hotel management system
>> 23. deposit_type: Type of deposit made for the reservation
>> 24. agent: ID of the travel agency that made the booking
>> 25. company: ID of the company/organization that made the booking or is responsible for payment
>> 26. days_in_waiting_list: Number of days booking was in the waiting list until it was confirmed
>> 27. customer_type: Type of booking
>> 28. adr: Average Daily Rate (the sum of transactions divided by the number of nights stayed)
>> 29. required_car_parking_spaces: Number of car parking spaces requested
>> 30. total_of_special_requests: Number of special requests made by the customer
>> 31. reservation_status: Last reservation status (Canceled, Check-Out, No-Show)
>> 32. reservation_status_date: Date at which the last status was set
>>
>>> ##### *Understanding the features could help gain insight on how to treat null values.*

In [None]:
# a histogram plot for each numerical attribute
df.hist(bins=50, figsize=(20,15))
plt.tight_layout()
plt.show()

> Initial observations from the histograms:
>> 1. Some weeks have more bookings. This could be because of holiday or summer seasons, when people tend to travel more.
>> 2. According to the lead_time plot, most bookings were made shortly before arrival.
>> 3. Bookings tend to be without children or babies.
>> 4. It seems that most stays are two weeks long or shorter.
>> 5. While most bookings were not canceled, there are thousands of instances that were.

> # Objective
> ## Predicting if a booking will be canceled.
>> ### Chosen Feature:
>> #### *is_canceled* column
>>> 0 means the booking was not canceled
>>>
>>> 1 means the booking was canceled
>> ### Motive:
>> Like any business, hotels are also looking to gain profit. A model that predicts if the booking
>> is likely to be canceled could be a good indication for hotels, as they
>> may prefer to accept the lower risk bookings first.

> ### Splitting the Data:
>> Before further analysis, let's split the data into a training set and a testing set.
>> This avoids the bias that could arise from exploring the data as a whole before evaluating the model.

In [None]:
# use sklearn train_test_split function to split the data
# a test size of 0.15 is sufficient because the dataset is very large
# the random state parameter ensures that data will be shuffled and split the same way in each run
train_set, test_set = train_test_split(df, test_size=0.15, random_state=42)

In [None]:
print("Number of instances in training set: ", len(train_set))
print("Number of instances in testing set: ", len(test_set))
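
> A possible refinement, not used in this notebook: passing `stratify` to `train_test_split` keeps the share of canceled bookings the same in both subsets. The sketch below uses a hypothetical toy DataFrame (`toy`) rather than the real data.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# hypothetical toy data: 1000 bookings, 37% of them canceled
toy = pd.DataFrame({
    "lead_time": np.arange(1000),
    "is_canceled": [1] * 370 + [0] * 630,
})

# stratify keeps the share of canceled bookings (almost) equal in both subsets
toy_train, toy_test = train_test_split(
    toy, test_size=0.15, random_state=42, stratify=toy["is_canceled"]
)

train_rate = toy_train["is_canceled"].mean()
test_rate = toy_test["is_canceled"].mean()
print(round(train_rate, 3), round(test_rate, 3))
```

> Without `stratify`, the two rates can drift apart by chance, which matters more for small or strongly imbalanced datasets than for one of this size.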

# 2. Understanding and Visualizing the Data
> ##### *The motivation for this section is to gain more insights.*

In [None]:
# deep copy of the training set
df2 = train_set.copy()

In [None]:
df2.head(2)

> ### Missing Features:

In [None]:
# the methods below calculate the number of missing values
missing_values = df2.isna().sum()
missing_values = missing_values[missing_values != 0]
missing_values

In [None]:
# replace missing values

# can assume that there were no children
df2.fillna({"children": 0}, inplace=True)

# missing countries can be labeled unknown
df2.fillna({"country": "Unknown"}, inplace=True)

# missing agent ID can be zero, presuming the booking was made privately
df2.fillna({"agent": 0}, inplace=True)

# missing company ID can be zero (for the same reason as agent ID)
df2.fillna({"company": 0}, inplace=True)

In [None]:
# check that the values were filled
df2.isna().sum()

> ### Numerical Attributes:

In [None]:
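# method creates a correlations matrix
# select the numeric columns first, since newer pandas versions raise an error on text columns
corr_matrix = df2.select_dtypes(include="number").corr()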

In [None]:
# looking at attributes correlation with is_canceled feature
corr_matrix["is_canceled"].sort_values(ascending=False)

In [None]:
# experimenting with attribute combinations

# create a column with total amount of guests
df2["guests_stayed"] = df2["adults"] + df2["children"] + df2["babies"]

# create a column with total nights booked
df2["nights_stayed"] = df2["stays_in_week_nights"] + df2["stays_in_weekend_nights"]

In [None]:
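# looking at the correlation matrix again with the added columns
# select the numeric columns first, since newer pandas versions raise an error on text columns
corr_matrix = df2.select_dtypes(include="number").corr()
corr_matrix["is_canceled"].sort_values(ascending=False)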

> ### Correlations with is_canceled Attribute - Overview:
>> The strongest positive correlations (0.1 or more) are:
>> * lead_time
>> * previous_cancellations
>>
>> The strongest negative correlations (-0.1 or less) are:
>> * total_of_special_requests
>> * required_car_parking_spaces
>> * booking_changes
>>
> The attribute combinations tested (guests stayed and nights stayed) both had weak correlations.
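
> The threshold filtering described above can also be done in code instead of by eye. This is a minimal sketch on synthetic data; the columns `related`, `noise` and `label` are made up for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500
target = rng.integers(0, 2, size=n)

# hypothetical toy data
toy = pd.DataFrame({
    "is_canceled": target,
    "related": target * 2.0 + rng.normal(size=n),  # built to correlate with the target
    "noise": rng.normal(size=n),                    # independent of the target
    "label": ["a", "b"] * (n // 2),                 # non-numeric, excluded below
})

# keep numeric columns only, then correlate each one with the target
corr_with_target = (
    toy.select_dtypes(include="number").corr()["is_canceled"].drop("is_canceled")
)

# keep features whose absolute correlation reaches the 0.1 threshold
strong = corr_with_target[corr_with_target.abs() >= 0.1].sort_values(ascending=False)
print(strong)
```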

> ### Cancellations According to Lead Time

In [None]:
# hist plot of lead time
# kde = kernel density estimation (displays distribution function, density curve)
# shows the distribution and highest concentration points
plt.figure(figsize=(10,5))
lead_time = df2['lead_time']
lead_time = pd.DataFrame(sorted(lead_time, reverse = True), columns = ['Lead'])
sns.histplot(lead_time, kde=True)
plt.title("Lead Time", size=20)
plt.xlabel("lead time days", size=15)
plt.tight_layout()
plt.show()

In [None]:
# divides lead time by less than 100 days, 100-364 days and 365 or more days
lead_time_1 = df2[df2["lead_time"] < 100]
lead_time_2 = df2[(df2["lead_time"] >= 100) & (df2["lead_time"] < 365)]
lead_time_3 = df2[df2["lead_time"] >= 365]

In [None]:
# calculates cancellations according to lead time groups
lead_cancel_1 = lead_time_1["is_canceled"].value_counts()
lead_cancel_2 = lead_time_2["is_canceled"].value_counts()
lead_cancel_3 = lead_time_3["is_canceled"].value_counts()

In [None]:
# hist plot for each lead time group
fig, (bx1, bx2, bx3) = plt.subplots(1,3,figsize=(21,6))
sns.histplot(lead_time_1["lead_time"], ax = bx1, kde=True)
bx1.set_title("lead_time [0,100) days", size=20)
sns.histplot(lead_time_2["lead_time"], ax = bx2, kde=True)
bx2.set_title("lead_time [100,365) days", size=20)
sns.histplot(lead_time_3["lead_time"], ax = bx3, kde=True)
bx3.set_title("lead_time [365,max) days", size=20)
plt.tight_layout()
plt.show()

In [None]:
# total count of lead time according to cancellation
total_lead_days_cancel = pd.DataFrame(data=[lead_cancel_1,lead_cancel_2,lead_cancel_3],
             index=["[0,100) days", "[100,365) days", "[365,max) days"])
total_lead_days_cancel

In [None]:
# pie plot for each lead time group
# columns 0 and 1 are the is_canceled values; .iloc selects the lead time group by position
fig, ax = plt.subplots(1,3, figsize=(21,6))
ax[0].pie(np.array([total_lead_days_cancel[0].iloc[0], total_lead_days_cancel[1].iloc[0]]),
          labels=["not_canceled", "canceled"], autopct='%1.1f%%', startangle=90,
          colors=['forestgreen', 'firebrick'])
ax[0].set_title("lead_time [0,100) days", size=20)
ax[1].pie(np.array([total_lead_days_cancel[0].iloc[1], total_lead_days_cancel[1].iloc[1]]),
          labels=["not_canceled", "canceled"], autopct='%1.1f%%', startangle=90,
          colors=['forestgreen', 'firebrick'])
ax[1].set_title("lead_time [100,365) days", size=20)
ax[2].pie(np.array([total_lead_days_cancel[0].iloc[2], total_lead_days_cancel[1].iloc[2]]),
          labels=["not_canceled", "canceled"], autopct='%1.1f%%', startangle=90,
          colors=['forestgreen', 'firebrick'])
ax[2].set_title("lead_time [365,max) days", size=20)
plt.tight_layout()
plt.show()

> #### Observations:
>> * Most bookings occur about 5 days prior to arrival.
>> * When the lead time is larger the chances for cancellation increase.
>> * The amount of bookings is steady overall between 20-100 days, then drops.

> ### Cancellations According to Previous Cancellations

In [None]:
# get previous cancellations column
prev_cancel = df2["previous_cancellations"]

In [None]:
# sort the index values
prev_cancel.value_counts().sort_index()

In [None]:
print("Cancellation Rates:\n")
print('Never canceled =' ,str(round(df2[df2['previous_cancellations']==0]
                                            ['is_canceled'].mean()*100,2))+' %')
print('Canceled once =' ,str(round(df2[df2['previous_cancellations']==1]
                                            ['is_canceled'].mean()*100,2))+' %')
print('Canceled more than 10 times:',str(round(df2[df2['previous_cancellations']>10]
                                            ['is_canceled'].mean()*100,2))+' %')
print('Canceled more than 11 times:' ,str(round(df2[df2['previous_cancellations']>11]
                                            ['is_canceled'].mean()*100,2))+' %')

In [None]:
# create a list with previous cancellations indices
prev_cancel_index = df2["previous_cancellations"].value_counts().index.to_list()
# sort the list
prev_cancel_index.sort()

# calculate the average percentage of cancellations for each value in the DataFrame
percentage_prev_cancel= []
for i in prev_cancel_index:
    percentage_prev_cancel.append((round(df2[df2["previous_cancellations"]==i]["is_canceled"].mean()*100,2)))

In [None]:
# create a DataFrame with the results
df_prev_cancel = pd.DataFrame(percentage_prev_cancel, index=prev_cancel_index, columns=["Previous Cancellations %"])
df_prev_cancel

In [None]:
# plot previous cancellations by percentages
df_prev_cancel.plot(figsize= (10,5), linewidth=3)
plt.title("Previous Cancellations", size=20)
plt.xlabel("Number of Previous Cancellations", size=15)
plt.ylabel("%", size=15)
plt.tight_layout()
plt.show()

> ### Observations:
>> The percentages show that when there are more previous cancellations, there is
>> a substantially higher chance the customer will cancel again.

> ### Cancellations According to Total of Special Requests

In [None]:
# number of instances for each value
df2["total_of_special_requests"].value_counts()

In [None]:
# group by cancellations
is_canceled = df2.groupby(by="is_canceled")

In [None]:
# get groups according to binary outcome
canceled = is_canceled.get_group(1)
not_canceled = is_canceled.get_group(0)

In [None]:
# count values for each outcome
special_requests_0 = not_canceled["total_of_special_requests"].value_counts()
special_requests_1 = canceled["total_of_special_requests"].value_counts()

In [None]:
# create a DataFrame for each outcome
df_special_requests_0 = pd.DataFrame(special_requests_0.values, index=special_requests_0.index, columns=["not_canceled"])
df_special_requests_1 = pd.DataFrame(special_requests_1.values, index=special_requests_1.index, columns=["canceled"])

In [None]:
# join both DataFrames side by side
df_special_requests= df_special_requests_0.join(df_special_requests_1)

In [None]:
# add total of both outcomes
special_requests_total = df_special_requests["not_canceled"] + df_special_requests["canceled"]

# calculate percentage of cancellations for each number of requests value individually
special_requests_percentage = []
for i in special_requests_total.index:
    special_requests_percentage.append(round((special_requests_1[i]/special_requests_total[i])*100,2))
special_requests_percentage

In [None]:
# add percentages as new column in DataFrame
df_special_requests.join(pd.DataFrame(special_requests_percentage, index=df_special_requests.index,
             columns=["cancellations %"]))

In [None]:
# plot special requests according to cancellations
plt.figure(figsize=(10,5))
sns.countplot(x=df2["total_of_special_requests"], hue=df2["is_canceled"])
plt.title("Special Requests", size=20)
plt.xlabel("Number of Special Requests", size=15)
plt.legend(["not canceled", "canceled"])
plt.tight_layout()
plt.show()

> ### Observations:
>> * Nearly half of the bookings without special requests are canceled.
>> * There are fewer cancellations when the number of special requests increases.

> ### Cancellations According to Required Car Parking Spaces

In [None]:
# number of instances for each value
df2["required_car_parking_spaces"].value_counts().sort_index()

In [None]:
# count values for each outcome with previous groupby
parking_spaces_0 = not_canceled["required_car_parking_spaces"].value_counts()
parking_spaces_1 = canceled["required_car_parking_spaces"].value_counts()

In [None]:
# value counts for non canceled instances
parking_spaces_0.sort_index()

In [None]:
# value counts for canceled instances
parking_spaces_1

In [None]:
# pie plot of cancellations with zero required parking spaces
plt.pie(x=[parking_spaces_0[0], parking_spaces_1[0]], labels=["not_canceled", "canceled"],autopct='%1.1f%%',
        startangle=90, colors=['forestgreen', 'firebrick'])
plt.title("Zero Required Parking Spaces Cancellations", size=20)
plt.tight_layout()
plt.show()

> ### Observations:
>> * Dividing the instances into groups according to cancellations shows that the only canceled
>> bookings were those without required parking spaces.
>> * This could make the feature a misleading indicator. The model could learn
>> that a booking can be canceled **only** if no parking spaces were required, which does not
>> necessarily have to be the case.

> ### Cancellations According to Booking Changes

In [None]:
# number of instances for each value
df2["booking_changes"].value_counts().sort_index()

In [None]:
# count values for each outcome with previous groupby
booking_changes_0 = not_canceled["booking_changes"].value_counts()
booking_changes_1 = canceled["booking_changes"].value_counts()

In [None]:
# count index of not canceled
len(booking_changes_0.index)

In [None]:
# count index of canceled
len(booking_changes_1.index)

In [None]:
# fill missing values
# the outcome 0 has more distinct values than the outcome 1
# aligning both indices (missing counts become 0) enables joining the DataFrames later
booking_changes_1 = booking_changes_1.reindex(booking_changes_0.index, fill_value=0)

In [None]:
# add total of both outcomes
booking_changes_total = booking_changes_0 + booking_changes_1

# calculate percentage of cancellations for each number of booking changes individually
percentage_booking_changes = []
for i in booking_changes_total.index:
    percentage_booking_changes.append(round((booking_changes_1[i]/booking_changes_total[i])*100,2))

In [None]:
# create a DataFrame with the percentage of cancellations
df_percentage_booking_changes = pd.DataFrame(percentage_booking_changes, index=booking_changes_total.index,
                                             columns=["cancellations %"])

In [None]:
# create a DataFrame for each outcome
df_booking_changes_0 = pd.DataFrame(booking_changes_0.values, index=booking_changes_0.index, columns=["not_canceled"])
df_booking_changes_1 = pd.DataFrame(booking_changes_1.values, index=booking_changes_1.index, columns=["canceled"])

In [None]:
# join all three DataFrames side by side
df_booking_changes = df_booking_changes_0.join\
    ([df_booking_changes_1, df_percentage_booking_changes])

# remove rows with 0% cancellations
df_booking_changes = df_booking_changes[df_booking_changes["cancellations %"]!=0]
df_booking_changes

> ### Observations:
>> * While a large number of bookings with no changes were canceled, this feature can change over time,
>> which could possibly be a source of leakage.

> ### Understanding the ADR Feature
>> Since this feature is not entirely clear from the description on Kaggle,
>> I've decided to further assess it.
>>
>> The Average Daily Rate (ADR) is typically calculated by taking the total revenue
>> earned from the rooms and dividing by the number of rooms sold (excluding rooms occupied
>> by staff).
>>
>> Since it is not clear if an ADR of zero indicates that the booking was canceled or
>> if the hotel did not earn revenue from it, I will look at instances listed with an ADR
>> of zero. This should provide enough insight to see if this feature should be removed
>> before model evaluation.

In [None]:
df2[df2["adr"]==0]["reservation_status"].value_counts()

In [None]:
df2[df2["adr"]==0]["is_canceled"].value_counts()

> ### Observations:
>> * Most bookings with an ADR of zero are labeled as checked-out and not canceled.
>> This settles the earlier question: an ADR of zero does not indicate a cancellation.

> ### Categorical Attributes:

> ### Cancellations According to Hotels and Arrival Month

In [None]:
df2["hotel"].value_counts()

In [None]:
# a plot of the number of instances for each hotel according to cancellations
plt.figure(figsize=(10,5))
sns.countplot(x=df2["hotel"], hue=df2["is_canceled"])
plt.title("Hotel Cancellations", size=20)
plt.legend(["not canceled", "canceled"])
plt.tight_layout()
plt.show()

In [None]:
ordered_months = ["January", "February", "March", "April", "May", "June",
          "July", "August", "September", "October", "November", "December"]

resort_canceled_percent = []
city_canceled_percent = []

# divide cancellation outcome by hotel and month of arrival
resort_1 = canceled[canceled["hotel"]=="Resort Hotel"]["arrival_date_month"].value_counts()
resort_0 = not_canceled[not_canceled["hotel"]=="Resort Hotel"]["arrival_date_month"].value_counts()
city_1 = canceled[canceled["hotel"]=="City Hotel"]["arrival_date_month"].value_counts()
city_0 = not_canceled[not_canceled["hotel"]=="City Hotel"]["arrival_date_month"].value_counts()

# calculate cancellation percentage according to hotel
for i in ordered_months:
    resort_canceled_percent.append(round((resort_1[i] / (resort_0[i]+resort_1[i]))*100,2))
    city_canceled_percent.append(round((city_1[i]/(city_0[i]+city_1[i]))*100,2))

# create a DataFrame with the cancellation percentage of each hotel
df_resort_cancel = pd.DataFrame(resort_canceled_percent, index=ordered_months, columns=["Resort Hotel Canceled %"])
df_city_cancel = pd.DataFrame(city_canceled_percent, index=ordered_months, columns=["City Hotel Canceled %"])

# join DataFrames
df_hotel_cancel = df_resort_cancel.join(df_city_cancel)
df_hotel_cancel

> ### Observations:
>> * There are more instances for City Hotel than Resort Hotel in the data.
>> * City Hotel has a higher cancellation rate according to arrival months.

> ### Cancellations According to Meal Booked

In [None]:
# plot meal according to cancellations
plt.figure(figsize=(10,5))
sns.countplot(x=df2["meal"], hue=df2["is_canceled"])
plt.title("Cancellations According to Meal Booked", size=20)
plt.xlabel("meal", size=15)
plt.legend(["not canceled", "canceled"])
plt.tight_layout()
plt.show()

> ### Observations:
>> * The BB (Bed & Breakfast) meal is most common. It also has the most cancellations in absolute numbers.

> ### Cancellations According to Market Segment, Distribution Channel, Customer Type and Room Type

In [None]:
df2["market_segment"].value_counts()

In [None]:
# calculate cancellation percentage according to market segment
market_segment_percent = []

market_segment_1 = canceled["market_segment"].value_counts()
market_segment_total = df2["market_segment"].value_counts()

for i in market_segment_total.index:
    market_segment_percent.append(str(i+": ") +
                    str(round((market_segment_1[i]/market_segment_total[i])*100,2)))
market_segment_percent

In [None]:
df2["distribution_channel"].value_counts()

In [None]:
# calculate cancellation percentage according to distribution channel
distribution_channel_percent = []

distribution_channel_1 = canceled["distribution_channel"].value_counts()
distribution_channel_total = df2["distribution_channel"].value_counts()

for i in distribution_channel_total.index:
    distribution_channel_percent.append(str(i+": ") +
                    str(round((distribution_channel_1[i]/distribution_channel_total[i])*100,2)))
distribution_channel_percent

In [None]:
df2["customer_type"].value_counts()

In [None]:
# calculate cancellation percentage according to customer type
customer_type_percent = []

customer_type_1 = canceled["customer_type"].value_counts()
customer_type_total = df2["customer_type"].value_counts()

for i in customer_type_total.index:
    customer_type_percent.append(str(i+": ") +
                    str(round((customer_type_1[i]/customer_type_total[i])*100,2)))
customer_type_percent

In [None]:
# plot of cancellations according to room type
plt.figure(figsize=(10,5))
sns.countplot(x=df2["reserved_room_type"], hue=df2["is_canceled"])
plt.title("Cancellations According to Room Type", size=20)
plt.legend(["not canceled", "canceled"], loc=1)
plt.tight_layout()
plt.show()

> ### Observations:
>> * Market segment cancellation rates are highest amongst travel agencies and tour operators.
>> * Distribution channel cancellation rates are highest amongst groups, travel agencies and tour operators.
>> * Customer type cancellation rates are highest amongst transient
>> (meaning the booking is not part of a group or contract and is not associated to another transient booking).
>> * The room type "A" is canceled most frequently.

> ### Cancellations According to Deposit Type

In [None]:
df2["deposit_type"].value_counts()

In [None]:
# calculate deposit type instances percentage in data
deposit_percent = round(df2["deposit_type"].value_counts()/len(df2["deposit_type"])*100,4)
deposit_percent

In [None]:
# use groupby to divide according to deposit type
deposit = df2.groupby(by="deposit_type")
non_refund = deposit.get_group("Non Refund")
refundable = deposit.get_group("Refundable")
no_deposit = deposit.get_group("No Deposit")

In [None]:
# calculate number of cancellations according to deposit type
no_deposit_0 = (no_deposit["is_canceled"]==0).sum()
no_deposit_1 = (no_deposit["is_canceled"]==1).sum()
non_refund_0 = (non_refund["is_canceled"]==0).sum()
non_refund_1 = (non_refund["is_canceled"]==1).sum()
refundable_0 = (refundable["is_canceled"]==0).sum()
refundable_1 = (refundable["is_canceled"]==1).sum()
all_canceled = no_deposit_1 + non_refund_1 + refundable_1
all_not_canceled = no_deposit_0 + non_refund_0 + refundable_0

In [None]:
# check that all values were calculated
all_canceled + all_not_canceled == df2["deposit_type"].size

In [None]:
# create a DataFrame with the number of instances for each deposit type
df_deposit_type = pd.DataFrame(index=["Not Canceled", "Canceled"])
df_deposit_type["no_deposit"] = [no_deposit_0, no_deposit_1]
df_deposit_type["non_refund"] = [non_refund_0, non_refund_1]
df_deposit_type["refundable"] = [refundable_0, refundable_1]
df_deposit_type

In [None]:
# pie plot of cancellations according to deposit type
cancel_labels = ["canceled", "not_canceled"]
fig, dx = plt.subplots(1,3, figsize=(21,6))
dx[0].pie(np.array([no_deposit_1, no_deposit_0]), labels=cancel_labels, autopct='%1.1f%%', startangle=90,
          colors=['firebrick', 'forestgreen'])
dx[0].set_title("No Deposit Cancellations", size=20)
dx[1].pie(np.array([non_refund_1, non_refund_0]), labels=cancel_labels, autopct='%1.1f%%', startangle=90,
          colors=['firebrick', 'forestgreen'])
dx[1].set_title("Non Refund Cancellations", size=20)
dx[2].pie(np.array([refundable_1, refundable_0]), labels=cancel_labels, autopct='%1.1f%%', startangle=90,
          colors=['firebrick', 'forestgreen'])
dx[2].set_title("Refundable Cancellations", size=20)
plt.tight_layout()
plt.show()

> #### Observations:
>> * The non refund values and graph look a bit off. It almost seems that the values
>> for canceled and not canceled were accidentally switched!
>> In light of this, it might be better to evaluate the model both with and without this
>> feature.

> ### Cancellations According to Country of Origin

In [None]:
df2["country"].unique().size

In [None]:
canceled["country"].value_counts()

In [None]:
# calculate countries by number of instances that appear in data
country_1 = (df2["country"].value_counts() <= 1).sum()
country_10 = (df2["country"].value_counts() <= 10).sum()
country_50 = (df2["country"].value_counts() <= 50).sum()
country_100 = (df2["country"].value_counts() <= 100).sum()
country_1000 = (df2["country"].value_counts() <= 1000).sum()

print("Number of countries with one or fewer instances:", country_1,
      "\nNumber of countries with 10 or fewer instances:", country_10,
      "\nNumber of countries with 50 or fewer instances:", country_50,
      "\nNumber of countries with 100 or fewer instances:", country_100,
      "\nNumber of countries with 1000 or fewer instances:", country_1000)

> ### Observations:
>> * There are 175 unique countries. This indicates that guests come from all over
>> the world rather than from one specific region.
>> * More than half of the countries have 50 or fewer instances in the DataFrame.
>> * A model would likely generalize better if we avoid using this column.

# 3. Data Cleaning

In [None]:
# clean copy of the training set
df3 = train_set.copy()

In [None]:
# custom transformer removes instances with zero guests

class RemoveZeroGuests(TransformerMixin, BaseEstimator):

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        # treat a missing children count as zero so those rows are not dropped silently
        guests = X["adults"] + X["children"].fillna(0) + X["babies"]
        return X.loc[guests > 0]

In [None]:
df3.shape

In [None]:
# use transformer to remove instances with zero guests stayed
df3 = RemoveZeroGuests().fit_transform(df3)

In [None]:
df3.shape
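
> For context on the custom transformer pattern used here: inheriting from `TransformerMixin` supplies `fit_transform`, and inheriting from `BaseEstimator` supplies `get_params`, which scikit-learn uses when cloning. The class `DropZeroTotal` below is a hypothetical, parameterized variant written only to illustrate this.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin, clone

class DropZeroTotal(TransformerMixin, BaseEstimator):
    """Hypothetical transformer: drop rows whose listed columns sum to zero."""

    def __init__(self, columns=("adults", "children", "babies")):
        self.columns = columns

    def fit(self, X, y=None):
        return self  # nothing to learn

    def transform(self, X):
        # missing counts are treated as zero before summing
        total = X[list(self.columns)].fillna(0).sum(axis=1)
        return X.loc[total > 0]

# hypothetical toy bookings: the second row has no guests at all
toy = pd.DataFrame({"adults": [2, 0, 1], "children": [0, 0, None], "babies": [0, 0, 0]})

dropper = DropZeroTotal()
kept = dropper.fit_transform(toy)   # fit_transform comes from TransformerMixin
params = dropper.get_params()       # get_params comes from BaseEstimator
dropper_copy = clone(dropper)       # cloning relies on get_params
print(kept.index.tolist(), params)
```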

In [None]:
# separate predictors from target values

# drop- creates a copy without changing the training set
X_train = df3.drop("is_canceled", axis=1)

# create a deep copy of the target values
y_train = df3["is_canceled"].copy()

> ### Removing the Following Columns:
>> #### Numerical Attributes:
>> * arrival_date_year: This feature refers to specific years. This could be
>> problematic for instances from years that do not appear in the training data, and
>> it could bias the model towards certain years because of the unequal number of
>> observations per year in the training data.
>> * arrival_date_day_of_month: The arrival_date_week_number column generalizes this.
>> * booking_changes: Could change over time, potentially causing data leakage.
>> * days_in_waiting_list: Could constantly change over time. Additionally, it takes many
>> distinct values. This could prevent the model from generalizing.
>> * agent & company: Represented by an ID. These columns are uninformative since they
>> contain a substantial amount of various numerical values without having an actual
>> numerical meaning. Since other columns (such as market segment) indicate the type of
>> reservation, these columns won't be needed.
>>
>> #### Categorical Attributes:
>> * country: There are many categories, most with few instances. In order to make a model
>> that generalizes, it is better to dismiss this category.
>> * assigned_room_type: Similar to reserved_room_type and seems like the reserved room is
>> a more suitable choice.
>> * reservation_status: Major data leakage! The categories are Check-Out, Canceled and No-Show.
>> This is exactly what we are trying to predict.
>> * reservation_status_date: This is the date when the reservation status was last changed,
>> and therefore is irrelevant.

In [None]:
num_features = ["lead_time", "stays_in_weekend_nights", "stays_in_week_nights", "adults", "children", "babies",
                "is_repeated_guest", "previous_cancellations", "previous_bookings_not_canceled", "adr",
                "required_car_parking_spaces", "total_of_special_requests"]

cat_features = ["hotel", "arrival_date_month", "arrival_date_week_number", "meal", "market_segment",
                "distribution_channel", "reserved_room_type", "deposit_type", "customer_type"]

In [None]:
# Undefined/SC both represent no meal package and can be combined

class ReplaceMeal(TransformerMixin, BaseEstimator):

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        XData = X.copy()
        # assign the result back; an in-place replace on a selected column may not modify XData
        XData["meal"] = XData["meal"].replace("Undefined", "SC")
        return XData

In [None]:
# SimpleImputer constant default fills values with zero
# MinMaxScaler normalizes data (rescales between 0-1)
num_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="constant")),
    ("min_max", MinMaxScaler())
])

In [None]:
# SimpleImputer fills missing values with 'Unknown'
# OneHotEncoder converts categories to a numeric dummy array
# (one binary attribute per category)
cat_pipeline = Pipeline([
    ("meal", ReplaceMeal()),
    ("imputer", SimpleImputer(strategy="constant", fill_value="Unknown")),
    ("one_hot", OneHotEncoder(handle_unknown="ignore"))
])

In [None]:
# column transformer:
# features generated by each transformer will be concatenated to form a single feature space
# columns of the original feature matrix that are not specified are dropped
full_pipeline = ColumnTransformer([
    ("numerical", num_pipeline, num_features),
    ("categorical", cat_pipeline, cat_features)
])

In [None]:
# transform training data using pipeline
X_train_prepared = full_pipeline.fit_transform(X_train)

# transform training data without fit for testing
X_tr_testing = full_pipeline.transform(X_train)
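
> To make the effect of the column transformer concrete, here is a minimal sketch on a hypothetical three-column DataFrame. It mirrors the structure of `full_pipeline` but leaves out the custom `ReplaceMeal` step; the names `toy` and `toy_pipeline` are made up.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# hypothetical toy data with one missing value per feature
toy = pd.DataFrame({
    "lead_time": [0.0, 50.0, np.nan, 100.0],
    "meal": ["BB", "HB", "BB", np.nan],
    "ignored": [1, 2, 3, 4],  # not listed below, so it is dropped
})

toy_pipeline = ColumnTransformer([
    ("numerical", Pipeline([
        ("imputer", SimpleImputer(strategy="constant")),  # fills NaN with 0
        ("min_max", MinMaxScaler()),
    ]), ["lead_time"]),
    ("categorical", Pipeline([
        ("imputer", SimpleImputer(strategy="constant", fill_value="Unknown")),
        ("one_hot", OneHotEncoder(handle_unknown="ignore")),
    ]), ["meal"]),
])

prepared = toy_pipeline.fit_transform(toy)
# one scaled numeric column plus three one-hot columns (BB, HB, Unknown)
print(prepared.shape)
```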

# 4. Training and Evaluating Models

> Accuracy is less informative for an imbalanced classification problem,
> so it is important to evaluate with a metric that reflects both kinds of error.
>
> Chosen evaluation metric:
>
> The F1 score is the harmonic mean of precision (the share of positive predictions that are correct) and
> recall (the share of actual positive instances that are correctly classified).
> It weights both equally, so it is high only when false positives and false negatives are both low.
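
> As a small worked example, the F1 score can be computed by hand from a confusion matrix and compared with `metrics.f1_score`. The labels below are hypothetical.

```python
from sklearn import metrics

# hypothetical labels: 1 = canceled, 0 = not canceled
actual = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

# for binary labels, ravel() returns the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = metrics.confusion_matrix(actual, pred).ravel()

precision = tp / (tp + fp)  # share of predicted cancellations that were real
recall = tp / (tp + fn)     # share of real cancellations that were found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(precision, recall, f1)
print(metrics.f1_score(actual, pred))
```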

In [None]:
# function prints scores
def display_evaluation(actual, pred):
    print("Confusion Matrix:\n", metrics.confusion_matrix(actual, pred), "\n")
    print("Classification Report:\n", metrics.classification_report(actual, pred))

#### Model 1: KNN

In [None]:
# instantiate classifier
# default k=5
knn = KNeighborsClassifier()

In [None]:
# fit the training set
knn.fit(X_train_prepared, y_train)

In [None]:
# test on a few instances from training data
some_data = X_train.iloc[:10]
some_labels = y_train.iloc[:10]
some_data_prepared = full_pipeline.transform(some_data)
print("Predictions:", knn.predict(some_data_prepared))
print("Labels:", list(some_labels))

In [None]:
# predict using the training data
knn_pred = knn.predict(X_tr_testing)

In [None]:
# use function to show results
display_evaluation(y_train, knn_pred)

#### Model 2: KNN

In [None]:
# instantiate KNN model using distance instead of uniform
# distance means closer instances have a larger weight
# uniform weighs all instances equally
# default k=5
knn = KNeighborsClassifier(weights="distance")

In [None]:
# fit the training set
knn.fit(X_train_prepared, y_train)

In [None]:
# test on a few instances from training data
some_data = X_train.iloc[:10]
some_labels = y_train.iloc[:10]
some_data_prepared = full_pipeline.transform(some_data)
print("Predictions:", knn.predict(some_data_prepared))
print("Labels:", list(some_labels))

In [None]:
# predict using the training data
knn_pred_2 = knn.predict(X_tr_testing)

In [None]:
# use function to show results
display_evaluation(y_train, knn_pred_2)

> Using distance weights instead of uniform weights drastically improved
> the KNN results on the training data.
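>
> One caveat when reading training-set scores for distance-weighted KNN: every training instance
> is its own nearest neighbor at distance zero, so the model can reproduce the training labels.
> The sketch below shows this on synthetic data (an assumption, not the hotel bookings) and
> compares it with a cross-validated score.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# synthetic stand-in data, not the hotel bookings
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=42)

knn_demo = KNeighborsClassifier(weights="distance")
knn_demo.fit(X_demo, y_demo)

# every training point is its own neighbor at distance zero
train_accuracy = knn_demo.score(X_demo, y_demo)

# cross-validation scores each point with a model that never saw it
cv_accuracy = cross_val_score(knn_demo, X_demo, y_demo, cv=5).mean()

print("training accuracy:", train_accuracy)
print("cross-validated accuracy:", cv_accuracy)
```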

#### Model 3: Random Forest Classifier

> ### What is the Random Forest Classification Model?

Forests are based on multiple decision trees, so it is vital to first understand how decision
trees work.

A decision tree is a non-linear model built by constructing many linear boundaries.
The tree works as a sequence of yes-or-no questions that progress down
the tree until a predicted class is reached. At each node the data is split based on
a feature value. This model works well when no single straight line
can divide the data. The Gini impurity of a node is the probability that a randomly chosen
sample in that node would be incorrectly classified, so each split aims to reduce it as much as possible.
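
As a small sketch of the Gini impurity described above, the helper below (a hypothetical
function written for this example, not part of scikit-learn) computes one minus the sum of
squared class proportions in a node.

```python
from collections import Counter

def gini_impurity(labels):
    """Gini impurity of a node: 1 - sum of squared class proportions."""
    total = len(labels)
    counts = Counter(labels)
    return 1.0 - sum((count / total) ** 2 for count in counts.values())

# a pure node has impurity 0, an evenly mixed binary node has impurity 0.5
print(gini_impurity([0, 0, 0, 0]))
print(gini_impurity([0, 0, 1, 1]))
```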

A single decision tree can overfit the training data. For example,
an unrestricted tree could create a leaf node (the predicted class) for each instance.
A forest usually generalizes better to new data. Each tree in a random forest is trained
on a random sample of the instances and considers a random subset of features at each split.
The final prediction is made by averaging the predictions of all the decision trees.
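
The averaging step can be checked directly in scikit-learn. The sketch below uses synthetic
data (an assumption, not the hotel bookings) and compares the forest's predicted probabilities
with the mean of the probabilities predicted by its individual trees.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# synthetic stand-in data, not the hotel bookings
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=0)

forest_demo = RandomForestClassifier(n_estimators=10, random_state=0)
forest_demo.fit(X_demo, y_demo)

# average the class probabilities predicted by the individual trees
tree_average = np.mean(
    [tree.predict_proba(X_demo) for tree in forest_demo.estimators_], axis=0
)

# compare with the probabilities reported by the forest itself
matches = np.allclose(tree_average, forest_demo.predict_proba(X_demo))
print(matches)
```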

In [None]:
# max_features default is "sqrt" (number of features considered per split)
# bootstrap default is True (each tree is trained on a bootstrap sample)
# n_estimators default is 100 (number of decision trees in the forest)
rf_clf = RandomForestClassifier(random_state=42, n_jobs=-1)

In [None]:
# fit the training set
rf_clf.fit(X_train_prepared, y_train)

In [None]:
# test on a few instances from training data
some_data = X_train.iloc[:10]
some_labels = y_train.iloc[:10]
some_data_prepared = full_pipeline.transform(some_data)
print("Predictions:", rf_clf.predict(some_data_prepared))
print("Labels:", list(some_labels))

In [None]:
# predict using the training data
rf_pred = rf_clf.predict(X_tr_testing)

In [None]:
# use function to show results
display_evaluation(y_train, rf_pred)

> The Random Forest Classification model performed slightly better than the second
> KNN model. The next step is to find the hyperparameters
> that provide the best results.
>
> Since the dataset is rather large, an exhaustive search over every
> hyperparameter combination would take a long time, so randomized search
> with cross-validation is a practical option. The size of the dataset is
> also the reason I skipped a randomized search for the KNN model.
>
> The randomized search runs a specified number of iterations
> and tries a random combination of the listed hyperparameter values in each one.
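>
> A rough count of the savings, using the same search space as the random forest search in this
> notebook. `ParameterGrid` and `ParameterSampler` are scikit-learn helpers that list the
> candidate combinations without training anything; the `_demo` names are made up for this sketch.

```python
from sklearn.model_selection import ParameterGrid, ParameterSampler

# same search space as the random forest search in this notebook
param_space_demo = {
    "n_estimators": [10, 50, 100, 500],
    "max_features": ["sqrt", 8, 16],
    "bootstrap": [True, False],
}

# every combination (grid search) versus 10 random combinations (randomized search)
all_combinations = list(ParameterGrid(param_space_demo))
sampled_combinations = list(ParameterSampler(param_space_demo, n_iter=10, random_state=42))

cv_folds = 5
print("grid search fits:", len(all_combinations) * cv_folds)
print("randomized search fits:", len(sampled_combinations) * cv_folds)
```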

#### Random Search Cross Validation 1

In [None]:
# parameters for random search
param_dist_rf = [{"n_estimators": [10, 50, 100, 500], "max_features": ["sqrt", 8, 16], "bootstrap": [True, False]}]

In [None]:
# instantiate randomized search
rf_cv = RandomizedSearchCV(rf_clf, param_dist_rf, n_iter=10, random_state=42, cv=5, scoring="f1")

In [None]:
# fit the training set
rf_cv.fit(X_train_prepared, y_train)

In [None]:
# show the best score
rf_cv.best_score_

In [None]:
# show the best estimator parameters
rf_clf = rf_cv.best_estimator_
rf_clf

In [None]:
# show results for each iteration
cvres = rf_cv.cv_results_
for mean_score, params in zip(cvres["mean_test_score"], cvres["params"]):
    print(mean_score, params)

#### Model 4: Random Forest Classifier

In [None]:
# predict using training data
rf_pred_2 = rf_clf.predict(X_tr_testing)

In [None]:
# display evaluation scores
display_evaluation(y_train, rf_pred_2)

#### Feature Importance

In [None]:
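# pair the encoded feature names with the importances from the best estimator
# one-hot encoding expands each categorical feature into several columns, so the
# encoder's output names are needed (get_feature_names_out requires scikit-learn >= 1.0)
feature_importance = rf_cv.best_estimator_.feature_importances_
one_hot = full_pipeline.named_transformers_["categorical"].named_steps["one_hot"]
features = num_features + list(one_hot.get_feature_names_out(cat_features))
sorted(zip(feature_importance, features), reverse=True)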

> Deposit type is pretty high on the list, which raises suspicion. As seen earlier,
> the cancellation rate was nearly 100% in the Non Refund category.
> Let's train a model without this feature.
>
> Additionally, let's train a model without the features that have a feature
> importance below 0.005. If the training error is nearly the
> same when using fewer features, it might be more efficient to
> train a model without them.
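>
> Because one-hot encoding spreads a categorical feature over several columns, one possible way
> to apply the threshold per original feature is to sum the importances of its encoded columns
> first. The sketch below uses made-up importances and hypothetical column names, not the values
> from the search above.

```python
# hypothetical importances keyed by encoded column name (numbers are made up)
encoded_importance = {
    "lead_time": 0.30,
    "adr": 0.20,
    "deposit_type_No Deposit": 0.12,
    "deposit_type_Non Refund": 0.18,
    "babies": 0.004,
    "meal_BB": 0.001,
    "meal_SC": 0.002,
}
original_features = ["lead_time", "adr", "deposit_type", "babies", "meal"]

# sum the encoded columns that belong to each original feature
totals = {name: 0.0 for name in original_features}
for column, importance in encoded_importance.items():
    for name in original_features:
        if column == name or column.startswith(name + "_"):
            totals[name] += importance
            break

# keep only the features at or above the 0.005 threshold
threshold = 0.005
kept = [name for name, total in totals.items() if total >= threshold]
print(kept)
```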

In [None]:
# features left

num_features_2 = ["lead_time", "stays_in_weekend_nights", "stays_in_week_nights", "adults", "children",
                  "previous_cancellations", "adr", "required_car_parking_spaces", "total_of_special_requests"]

cat_features_2 = ["hotel", "arrival_date_month"]


# category pipeline without meal transformer
cat_pipeline_2 = Pipeline([
    ("imputer", SimpleImputer(strategy="constant", fill_value="Unknown")),
    ("one_hot", OneHotEncoder(handle_unknown="ignore"))
])

# pipeline with new features
full_pipeline_2 = ColumnTransformer([
    ("numerical", num_pipeline, num_features_2),
    ("categorical", cat_pipeline_2, cat_features_2)
])

In [None]:
# transform data with new pipeline
X_train_prepared_2 = full_pipeline_2.fit_transform(X_train)
X_tr_testing_2 = full_pipeline_2.transform(X_train)

#### Model 5: Random Forest Classifier

In [None]:
# instantiate model
rf_clf_2 = RandomForestClassifier(random_state=42, n_jobs=-1)

In [None]:
# fit the training set
rf_clf_2.fit(X_train_prepared_2, y_train)

In [None]:
# predictions on training data
rf_pred_3 = rf_clf_2.predict(X_tr_testing_2)

In [None]:
# display evaluation scores
display_evaluation(y_train, rf_pred_3)

#### Random Search Cross Validation 2

In [None]:
# parameters for random search
param_dist = [{"n_estimators": [10, 50, 100, 500], "max_features": [4, 8, 16], "bootstrap": [True, False]}]

In [None]:
# instantiate randomized search
rf_cv_2 = RandomizedSearchCV(rf_clf_2, param_dist, cv=5, n_iter=10, scoring="f1", random_state=42)

In [None]:
# fit the training set
rf_cv_2.fit(X_train_prepared_2, y_train)

In [None]:
# show the best score
rf_cv_2.best_score_

In [None]:
# show the best estimator parameters
rf_clf_3 = rf_cv_2.best_estimator_
rf_clf_3

In [None]:
# show results for each iteration
cvres = rf_cv_2.cv_results_
for mean_score, params in zip(cvres["mean_test_score"], cvres["params"]):
    print(mean_score, params)

In [None]:
# predict using training data
rf_pred_4 = rf_clf_3.predict(X_tr_testing_2)

In [None]:
# display evaluation scores
display_evaluation(y_train, rf_pred_4)

#### Dummy Classifier
> The dummy classifier serves as an indication and comparison for model performance.

In [None]:
# dummy classifier
# classifies every instance as not canceled
# BaseEstimator provides get_params and set_params
class NeverCanceledClassifier(BaseEstimator):

    def fit(self, X, y=None):
        # nothing to learn; return self to follow the scikit-learn convention
        return self

    def predict(self, X):
        # one prediction per row, always 0 (not canceled)
        return np.zeros(len(X), dtype=int)

In [None]:
# instantiate dummy classifier
never_canceled = NeverCanceledClassifier()

In [None]:
# fit the training set
never_canceled.fit(X_train, y_train)

In [None]:
# predict using dummy classifier
never_canceled_pred = never_canceled.predict(X_train)

In [None]:
# evaluate scores for comparison
# the F1 score is not informative here:
# with no positive predictions, precision is 0/0 (undefined)
print("Accuracy:", metrics.accuracy_score(y_train, never_canceled_pred))
print("Confusion Matrix:\n", metrics.confusion_matrix(y_train, never_canceled_pred))
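
> As an alternative sketch, scikit-learn ships a `DummyClassifier` that can play the same
> baseline role without a custom class. The labels below are made up for the example.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# made-up labels: 3 of 8 bookings canceled
X_dummy_demo = np.zeros((8, 1))   # the features are ignored by the dummy model
y_dummy_demo = np.array([0, 0, 1, 0, 1, 0, 0, 1])

# always predict the constant class 0 (not canceled)
dummy_demo = DummyClassifier(strategy="constant", constant=0)
dummy_demo.fit(X_dummy_demo, y_dummy_demo)
dummy_pred_demo = dummy_demo.predict(X_dummy_demo)

dummy_accuracy = accuracy_score(y_dummy_demo, dummy_pred_demo)
print("Accuracy:", dummy_accuracy)
```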

> ### Overview:
>> #### Removing nearly half of the features did not drastically change the score.
>> * The model prior to feature selection had an F1 score of 0.99 on the training data.
>> * The model after feature selection has an F1 score of 0.98 on the training data.
>> * The model performs better than the dummy classifier.

In [None]:
# use predictions to get precision recall curve values
rf_scores = rf_cv_2.best_estimator_.predict_proba(X_tr_testing_2)[:, 1]
precisions, recalls, thresholds = metrics.precision_recall_curve(y_train, rf_scores)

In [None]:
# plot precision recall curve
plt.figure(figsize=(10,5))
# recall on the x-axis and precision on the y-axis, matching the axis labels
plt.plot(recalls, precisions, linewidth=3)
plt.title("Precision Recall Curve for Hotel Cancellations", size=20)
plt.xlabel("Recall", size=15)
plt.ylabel("Precision", size=15)
plt.tight_layout()
plt.show()
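
> One way such a curve can be used is to pick a decision threshold that reaches a target
> precision. The sketch below does this on made-up labels and scores; the `_demo` names and the
> 0.9 target are assumptions for illustration, not choices made in this notebook.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# made-up labels and predicted probabilities of cancellation
y_true_demo = np.array([0, 0, 1, 1])
y_scores_demo = np.array([0.1, 0.4, 0.35, 0.8])

precisions_demo, recalls_demo, thresholds_demo = precision_recall_curve(
    y_true_demo, y_scores_demo
)

# precisions and recalls have one more entry than thresholds,
# so drop the last entry before searching
target_precision = 0.9
first_index = np.argmax(precisions_demo[:-1] >= target_precision)
chosen_threshold = thresholds_demo[first_index]

# predict a cancellation only when the probability reaches the threshold
y_pred_demo = (y_scores_demo >= chosen_threshold).astype(int)
print(chosen_threshold, y_pred_demo)
```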

# 5. Evaluating the Test Set

In [None]:
# separate test set predictors and labels
X_test = test_set.drop("is_canceled", axis=1)
y_test = test_set["is_canceled"].copy()

In [None]:
# transform test set
X_test_prep = full_pipeline_2.transform(X_test)

In [None]:
final_model = rf_cv_2.best_estimator_
final_model

In [None]:
# predict test set
final_predictions = final_model.predict(X_test_prep)

In [None]:
# evaluate predictions
display_evaluation(y_test, final_predictions)

> #### Resources:
> 1. Hotel Booking Demand Dataset <a href="https://www.kaggle.com/jessemostipak/hotel-booking-demand"
> title="Kaggle">link</a>
> 2. Hotel Booking Demand Article <a href="https://www.sciencedirect.com/science/article/pii/S2352340918315191"
> title="Article">link</a>
> 3. Average Daily Rate Article <a href="https://www.investopedia.com/terms/a/average-daily-rate.asp"
> title="Investopedia">link</a>
> 4. Random Forest Article <a href="https://towardsdatascience.com/an-implementation-and-explanation-of-the-random-forest-in-python-77bf308a9b76" title="towardsdatascience">link</a>
