# Hotel Booking Demand
## **Table of contents**
> 1. Introduction 
> 2. Exploratory Data Analysis
> 3. Preprocessing
> 4. Modeling
> 5. Model Evaluation

## 1. Introduction 

### 1.1 Context:
This hotel booking dataset can help you answer questions about the hotel business!
For example:
* Have you ever wondered when the best time of year to book a hotel room is? Or the optimal length of stay to get the best daily rate?
* What if you wanted to predict whether or not a hotel was likely to receive a disproportionately high number of special requests?

The data covers a <code>city hotel</code> and a <code>resort hotel</code> and includes information such as when the booking was made, the length of stay, and the number of adults and children, among other things.
All personally identifying information has been removed from the data.

### 1.2 Task:
After answering some questions about the hotel business, we will predict whether a booking will be canceled (the <code>is_canceled</code> column). Using the decision tree algorithm (and the algorithms that evolved from it), we can tell which column has the most influence on the cancellation rate in **this dataset**.
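
As a rough illustration of how a decision tree reports the most influential column, here is a minimal sketch on synthetic data. The `toy` frame, the 150-day rule, and the tree settings are assumptions made for the example, not results from the real dataset.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
# Synthetic stand-in for the booking data (hypothetical values).
toy = pd.DataFrame({
    "lead_time": rng.integers(0, 300, size=500),
    "adults": rng.integers(1, 4, size=500),
})
# Made-up rule: bookings placed more than 150 days ahead are canceled.
toy["is_canceled"] = (toy["lead_time"] > 150).astype(int)

features = ["lead_time", "adults"]
tree = DecisionTreeClassifier(max_depth=3, random_state=0)
tree.fit(toy[features], toy["is_canceled"])

# feature_importances_ sums to 1; larger values mean more influence on the splits.
importances = pd.Series(tree.feature_importances_, index=features).sort_values(ascending=False)
print(importances)
```

Because the synthetic target depends only on `lead_time`, that column ends up at the top of the ranking.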





## 2. Exploratory Data Analysis 
In this section, we will ask some questions about the data and answer them step by step.
If you were a hotel owner, which questions would you want answered?
* Where do the guests come from? (GER or ENG, agency or direct, ...)
* How much do guests pay for a room per night?
* Which month has the highest number of cancellations?
* How does the cancellation rate relate to the other columns?

In [None]:
# load libraries
import pandas as pd 
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt 
import matplotlib
import datetime
%matplotlib inline 
import os 
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

font = {
    'family': 'sans-serif',  # 'normal' is not a font family name and makes matplotlib warn
    'weight': 'normal',
    'size': 12
}

matplotlib.rc('font', **font)
from IPython.display import HTML  # importing HTML from IPython.core.display is deprecated
HTML("""
<style>
.output_png {
    display: table-cell;
    text-align: center;
    vertical-align: middle;
}
</style>
""")

Now let's look at the first few rows to get a feel for the data.

In [None]:
df = pd.read_csv('/kaggle/input/hotel-booking-demand/hotel_bookings.csv')
print('Size:', len(df))
df.head()

Next, let's count the missing (NULL) values in each column.

In [None]:
df.isnull().sum()

### Where do the guests come from?

In [None]:
temp_df = df[['country', 'is_canceled']]
# rename_axis + reset_index(name=...) gives the same two columns on every pandas version
# (value_counts().reset_index() names its columns differently in pandas 1.x and 2.x)
number_of_guests_from_each_country_df = (
    temp_df['country'].value_counts()
    .rename_axis('country')
    .reset_index(name='number_of_guests')
)

unknown_countries = []
def convert_small_country_name(name, number_of_guests):
    # Group countries with 3000 reservations or fewer under a single 'UNK' label
    if number_of_guests <= 3000:
        unknown_countries.append(name)
        return 'UNK'
    return name

number_of_guests_from_each_country_df['new_country_name'] = number_of_guests_from_each_country_df.apply(
    lambda x: convert_small_country_name(x['country'], x['number_of_guests']), axis=1)
number_of_guests_from_each_country_df = number_of_guests_from_each_country_df.groupby(['new_country_name']).agg({
    'number_of_guests':'sum'
}).reset_index().sort_values(['number_of_guests'])


fig = plt.figure(figsize=(6,6))
plt.pie(number_of_guests_from_each_country_df['number_of_guests'], 
        labels=number_of_guests_from_each_country_df['new_country_name'], autopct='%1.1f%%')
plt.title("Share of reservations by country of origin")
# plt.legend(fontsize=10)
plt.show()
unknown_countries = list(set(unknown_countries))

There are 177 countries in this data, too many to visualize at once, so I group every country with 3000 reservations or fewer into **UNK**.
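
The same grouping can also be written without a helper function. The sketch below is only an illustration on made-up data: the `toy` frame and the `min_count` threshold are hypothetical, and the notebook itself uses a threshold of 3000.

```python
import pandas as pd

# Made-up country column standing in for df['country'].
toy = pd.DataFrame({"country": ["PRT"] * 5 + ["GBR"] * 3 + ["FRA"] * 2 + ["ESP", "ITA"]})

min_count = 2  # hypothetical threshold; countries at or below it become 'UNK'
# Number of reservations of each row's country, aligned with the rows.
counts = toy["country"].map(toy["country"].value_counts())
# Keep the country name where it is frequent enough, otherwise use 'UNK'.
toy["country_grouped"] = toy["country"].where(counts > min_count, "UNK")

grouped_counts = toy["country_grouped"].value_counts()
print(grouped_counts)
```

Here `PRT` and `GBR` keep their names, while `FRA`, `ESP` and `ITA` are merged into `UNK`.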

Most of our reservations come from PRT (40.9%), so the data we are analyzing **may be** from hotels in Portugal. Note that I say **reservation**, not **guest**.

Do you want to know which country has the highest cancellation ratio? Let's check the data.

In [None]:
# Reservations per country (column names are the same on every pandas version)
temp_df = df['country'].value_counts().rename_axis('country').reset_index(name='number_of_guests')
temp_canceled_df = (
    df[df['is_canceled']==1]['country'].value_counts()
    .rename_axis('country')
    .reset_index(name='number_of_guests_canceled')
)

temp_df = pd.merge(temp_df, temp_canceled_df, how='left', on=['country'])
# Countries without any cancellation are missing after the left merge; count them as 0
temp_df['number_of_guests_canceled'] = temp_df['number_of_guests_canceled'].fillna(0)
temp_df['new_country_name'] = temp_df['country'].apply(
    lambda x: x if x not in unknown_countries else 'UNK'
)
temp_df = temp_df.groupby(['new_country_name']).agg({
    'number_of_guests_canceled':'sum',
    'number_of_guests':'sum'
}).reset_index()
# Cancellation rate = canceled reservations / all reservations
temp_df['canceled_rate'] = temp_df['number_of_guests_canceled']/temp_df['number_of_guests']
temp_df = temp_df.sort_values(['canceled_rate'])

fig, ax = plt.subplots(1,1,figsize=(10,7))
rect1 = ax.bar(height=temp_df['canceled_rate'], 
        x=temp_df['new_country_name'])
def autolabel(rects):
    """Attach a text label above each bar in *rects*, displaying its height."""
    for rect in rects:
        height = rect.get_height()
        ax.annotate('%.2f'%(height),
                    xy=(rect.get_x() + rect.get_width() / 2, height),
                    xytext=(0, 3),  # 3 points vertical offset
                    textcoords="offset points",
                    ha='center', va='bottom')
autolabel(rect1)
# ax.set_xticklabels()
plt.title("Cancellation rate of each country")
# plt.legend(fontsize=10)
plt.show()


Portugal has the highest cancellation ratio. If you were the hotel's owner, you should focus on this market, because about 40% of your reservations come from Portugal.
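
The rate charted above is the number of canceled reservations divided by all reservations per country. Since `is_canceled` is 0 or 1, the same rate is simply its mean per group. A minimal sketch on made-up data (the `toy` frame and its values are assumptions, not the real dataset):

```python
import pandas as pd

# Made-up reservations; 1 means the reservation was canceled.
toy = pd.DataFrame({
    "country": ["PRT", "PRT", "PRT", "PRT", "GBR", "GBR", "FRA", "FRA"],
    "is_canceled": [1, 1, 0, 1, 0, 1, 0, 0],
})

rates = (
    toy.groupby("country")["is_canceled"]
    .agg(reservations="count", canceled="sum", canceled_rate="mean")
    .sort_values("canceled_rate")
)
print(rates)
```

Keeping the `reservations` count next to the rate helps to judge how much weight a country's rate deserves.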
Okay, let's look at your actual guests: those who reserved and did not cancel.

In [None]:
# Reservations that were not canceled, counted per country
temp_not_cancel_df = (
    df[df['is_canceled']==0]['country'].value_counts()
    .rename_axis('country')
    .reset_index(name='number_of_guests')
)
temp_not_cancel_df['new_country_name'] = temp_not_cancel_df['country'].apply(
     lambda x: x if x not in unknown_countries else 'UNK'        
)
temp_not_cancel_df = temp_not_cancel_df.groupby(['new_country_name']).agg({
    'number_of_guests':'sum'
}).reset_index().sort_values(['number_of_guests'])

fig, ax = plt.subplots(1,1,figsize=(10,7))
rect1 = ax.bar(height=temp_not_cancel_df['number_of_guests'], 
        x=temp_not_cancel_df['new_country_name'])
def autolabel(rects):
    """Attach a text label above each bar in *rects*, displaying its height."""
    for rect in rects:
        height = rect.get_height()
        ax.annotate('%d'%(height),
                    xy=(rect.get_x() + rect.get_width() / 2, height),
                    xytext=(0, 3),  # 3 points vertical offset
                    textcoords="offset points",
                    ha='center', va='bottom')
autolabel(rect1)
# ax.set_xticklabels()
plt.title("Number of real guests of each country")
# plt.legend(fontsize=10)
plt.show()

Most of our guests come from Portugal, followed by Great Britain and France in second and third place. (We don't discuss UNK because it is a group of *small* countries.)

### How does the price per night vary over the year?

To keep it simple, I'm using the average price per night per person, regardless of room type and meal.



In [None]:
temp_df = df.groupby(['hotel', 'arrival_date_month']).agg({
    'adr':'sum',
    'adults':'sum',
    'children':'sum'
}).reset_index()
temp_df['adr_ppn'] = temp_df['adr'] / (temp_df['adults'] + temp_df['children'])

ordered_months = ["January", "February", "March", "April", "May", "June", 
          "July", "August", "September", "October", "November", "December"]
temp_df["arrival_date_month"] = pd.Categorical(temp_df["arrival_date_month"], 
                                                          categories=ordered_months, ordered=True)
plt.figure(figsize=(12, 8))
sns.lineplot(x = "arrival_date_month", y="adr_ppn", hue="hotel", data=temp_df, 
            hue_order = ["City Hotel", "Resort Hotel"], ci="sd", size="hotel", sizes=(2.5, 2.5))
plt.title("Room price per night and person over the year", fontsize=16)
plt.xlabel("Month", fontsize=16)
plt.xticks(rotation=45)
plt.ylabel("Price [EUR]", fontsize=16)
plt.show()

The prices in the Resort hotel are much higher during the summer. 

The price of the city hotel varies less and is most expensive during spring and autumn.
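
The per-person price plotted above divides the summed `adr` of a month by the summed number of adults and children. As a small worked example on made-up numbers (the `toy` frame is an assumption for illustration only):

```python
import pandas as pd

# Made-up bookings: adr is the average daily rate of each booking.
toy = pd.DataFrame({
    "arrival_date_month": ["July", "July", "January", "January"],
    "adr": [200.0, 100.0, 60.0, 40.0],
    "adults": [2, 1, 2, 2],
    "children": [2.0, 0.0, 0.0, 0.0],
})

# Sum price and guests per month, then divide: a ratio of sums.
monthly = toy.groupby("arrival_date_month")[["adr", "adults", "children"]].sum()
monthly["adr_ppn"] = monthly["adr"] / (monthly["adults"] + monthly["children"])
print(monthly)
```

In this toy data July comes out at 300 / 5 = 60 per person and January at 100 / 4 = 25.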

### Which month has the highest number of cancellations?

In [None]:
temp_df = df.groupby(['hotel', 'arrival_date_month']).agg({
    'is_canceled':'sum'
}).reset_index()

ordered_months = ["January", "February", "March", "April", "May", "June", 
          "July", "August", "September", "October", "November", "December"]
temp_df["arrival_date_month"] = pd.Categorical(temp_df["arrival_date_month"], 
                                                          categories=ordered_months, ordered=True)
plt.figure(figsize=(12, 8))
sns.lineplot(x = "arrival_date_month", y="is_canceled", hue="hotel", data=temp_df, 
            hue_order = ["City Hotel", "Resort Hotel"], ci="sd", size="hotel", sizes=(2.5, 2.5))
plt.title("Cancellation number over the year", fontsize=16)
plt.xlabel("Month", fontsize=16)
plt.xticks(rotation=45)
plt.ylabel("Number of cancellations", fontsize=16)
plt.show()

The Resort Hotel has its highest number of cancellations in August.
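
To read the peak month from a table instead of the chart, one option is to pick the row with the largest count per hotel. The sketch below uses made-up data (the `toy` frame is hypothetical) and also computes the cancellation rate, since a busy month can have many cancellations without a high rate.

```python
import pandas as pd

# Made-up bookings for two hotels.
toy = pd.DataFrame({
    "hotel": ["Resort Hotel"] * 5 + ["City Hotel"] * 5,
    "arrival_date_month": ["August", "August", "August", "January", "January",
                           "August", "August", "May", "May", "May"],
    "is_canceled": [1, 1, 0, 1, 0, 1, 0, 1, 1, 0],
})

per_month = toy.groupby(["hotel", "arrival_date_month"], as_index=False).agg(
    cancellations=("is_canceled", "sum"),
    cancel_rate=("is_canceled", "mean"),
)
# For each hotel, keep the row with the largest number of cancellations.
peak_rows = per_month.loc[per_month.groupby("hotel")["cancellations"].idxmax()]
peak_month = peak_rows.set_index("hotel")["arrival_date_month"]
print(peak_month)
```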

### What is the relationship between lead time and cancellation ratio?

In [None]:
fig, ax = plt.subplots(1,1, figsize=(8,5))
sns.boxplot(x='is_canceled', y='lead_time', hue='hotel',data=df,
            hue_order=["City Hotel", "Resort Hotel"],
            fliersize=0)
plt.title('Relationship between lead time and cancellation')
plt.show()


Canceled reservations have a slightly longer lead time than the others in both the City Hotel and the Resort Hotel.
So you should pay more attention to guests who reserve far in advance.
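
Another way to look at this relationship is to bucket the lead time and compute the cancellation rate per bucket. This is only a sketch on made-up data; the `toy` frame and the bin edges are assumptions chosen for illustration.

```python
import pandas as pd

# Made-up bookings: lead_time is the number of days between booking and arrival.
toy = pd.DataFrame({
    "lead_time": [2, 10, 25, 40, 90, 120, 200, 300],
    "is_canceled": [0, 0, 0, 1, 0, 1, 1, 1],
})

bins = [0, 30, 180, 365]  # hypothetical bucket edges in days
labels = ["0-30", "31-180", "181-365"]
toy["lead_time_bucket"] = pd.cut(toy["lead_time"], bins=bins, labels=labels, include_lowest=True)

# The mean of a 0/1 column is the share of canceled bookings in each bucket.
rate_by_bucket = toy.groupby("lead_time_bucket", observed=True)["is_canceled"].mean()
print(rate_by_bucket)
```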

In [None]:
fig, ax = plt.subplots(1,1, figsize=(8,5))
sns.boxplot(hue='is_canceled', y='adr', x='hotel',data=df,
            hue_order=[0, 1],
            fliersize=0)
ax.set_ylim(0,400)
plt.title('Relationship between ADR and Cancellation ratio')
plt.legend(loc='upper right')
plt.show()


In the City Hotel, canceled and non-canceled reservations have the same ADR distribution.
In the Resort Hotel, we can see that canceled reservations have a slightly higher ADR.
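
The box plot comparison can be backed by numbers by computing the median ADR per hotel and cancellation status. A minimal sketch on made-up values (the `toy` frame is hypothetical and only mimics the pattern described above):

```python
import pandas as pd

# Made-up ADR values for canceled (1) and non-canceled (0) reservations.
toy = pd.DataFrame({
    "hotel": ["City Hotel"] * 4 + ["Resort Hotel"] * 4,
    "is_canceled": [0, 0, 1, 1, 0, 0, 1, 1],
    "adr": [100.0, 110.0, 100.0, 110.0, 60.0, 80.0, 90.0, 110.0],
})

# Rows are hotels, columns are the two cancellation statuses.
adr_summary = toy.groupby(["hotel", "is_canceled"])["adr"].median().unstack("is_canceled")
print(adr_summary)
```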

## 3. Preprocessing

We will remove the following columns:
* <code>reservation_status</code>, <code>reservation_status_date</code> (information leakage: when a guest reserves a room, we cannot know the reservation status yet, right?)
* <code>company</code> (too many NULL values)
* <code>days_in_waiting_list</code> (almost always 0)
* <code>arrival_date_year</code>, <code>arrival_date_day_of_month</code>, <code>arrival_date_week_number</code> (we want to predict the future, and these columns do not help with that. Another reason is that each value has too few records. How many records do you have for 2015-10-10?) They are dropped only after the year and the day of month have been used to derive an <code>arrival_weekday</code> feature.
* <code>assigned_room_type</code> (also dropped in the code below)
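
The reasons in the list above can be checked with a few one-liners. The sketch below runs on a made-up frame (the `toy` values are assumptions, not the real dataset) and shows the share of NULL values, the share of zeros, and a cross table that exposes a leaking column.

```python
import numpy as np
import pandas as pd

# Made-up rows with the same column names as the dataset.
toy = pd.DataFrame({
    "company": [np.nan, np.nan, np.nan, 40.0],
    "days_in_waiting_list": [0, 0, 0, 5],
    "reservation_status": ["Canceled", "Check-Out", "Canceled", "Check-Out"],
    "is_canceled": [1, 0, 1, 0],
})

# Share of missing values per column; a high share suggests dropping the column.
null_share = toy.isnull().mean()
# Share of rows where the waiting list is empty.
zero_share = (toy["days_in_waiting_list"] == 0).mean()
# If every status value maps to exactly one target value, the column leaks the target.
leak_table = pd.crosstab(toy["reservation_status"], toy["is_canceled"])

print(null_share, zero_share, leak_table, sep="\n\n")
```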

In [None]:
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, f1_score, precision_score, accuracy_score, recall_score
import random

In [None]:
pre_df = df.copy()
pre_df['country'] = pre_df['country'].apply(
    lambda x: x if x not in unknown_countries else 'UNK'                                       
)
pre_df.drop(columns=['reservation_status', 'reservation_status_date', 'company', 'days_in_waiting_list'], 
            inplace=True)
# Assign the result back instead of calling fillna(inplace=True) on a column selection
pre_df['agent'] = pre_df['agent'].fillna('UNK').astype('str')
pre_df['country'] = pre_df['country'].fillna('UNK')

In [None]:
# Map month names to 1..12 using the calendar order defined earlier
month_dict = {month: number for number, month in enumerate(ordered_months, start=1)}

pre_df['arrival_date_month'] = pre_df['arrival_date_month'].map(month_dict)
# Weekday of arrival (Monday=0 ... Sunday=6), built from the year/month/day columns
arrival_date = pd.to_datetime(pd.DataFrame({
    'year': pre_df['arrival_date_year'],
    'month': pre_df['arrival_date_month'],
    'day': pre_df['arrival_date_day_of_month'],
}))
pre_df['arrival_weekday'] = arrival_date.dt.weekday
pre_df.drop(columns=['arrival_date_year', 'arrival_date_day_of_month', 
                     'arrival_date_week_number', 'assigned_room_type'], 
            inplace=True)
# Prefix the room type with the hotel name so the same letter is a separate category per hotel
pre_df['reserved_room_type'] = pre_df['hotel'] + pre_df['reserved_room_type']

In [None]:
categorical_columns = [
    'hotel', 'arrival_date_month', 'meal', 'country', 'market_segment', 
    'distribution_channel', 'is_repeated_guest', 'reserved_room_type',                   
    'deposit_type', 'customer_type', 'arrival_weekday', 'agent'
]
numeric_columns = [
    'lead_time', 'stays_in_weekend_nights', 'stays_in_week_nights', 'adults', 'children', 'babies',
    'previous_cancellations', 'previous_bookings_not_canceled', 'booking_changes', 'adr',
    'required_car_parking_spaces', 'total_of_special_requests'
]

In [None]:
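label_encoder = LabelEncoder()
for col in categorical_columns:
    # fit_transform refits the encoder on every column, so each column gets its own integer codes.
    # Errors are no longer swallowed: a column that cannot be encoded stops the cell with a traceback.
    pre_df[col] = label_encoder.fit_transform(pre_df[col])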

In [None]:
X, Y = pre_df[categorical_columns + numeric_columns], pre_df['is_canceled']
train_x, test_x, train_y, test_y = train_test_split(X, Y, test_size=0.2, random_state=10)
print(train_x.shape)
print(test_x.shape)

## 4. Modeling 