# HOTEL BOOKINGS CANCELLATION PREDICTION

# Import Libraries
As usual, before we begin any analysis and modeling, let's import several necessary libraries to work with the data.

In [None]:
# Data Analysis
import pandas as pd
import numpy as np

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()

# Interactive Plotting
import plotly.express as px
import plotly.io as pio
pio.templates.default = "plotly_white"

# Additional Packages:
# pycountry: ISO country, subdivision, language, currency and script definitions and their translations
# ppscore: implementation of the Predictive Power Score (PPS)
!pip install pycountry-convert
!pip install ppscore
import pycountry
import pycountry_convert as pc
import ppscore as pps

import warnings
warnings.filterwarnings('ignore')

# Data Wrangling
Before we jump into any visualization or modeling step, we have to make sure our data is ready.

## Import Data

In [None]:
hotel = pd.read_csv("/kaggle/input/hotel-booking-demand/hotel_bookings.csv")
hotel.head()

In [None]:
hotel.info()

Our dataframe `hotel` contains 119390 rows of bookings and 32 columns with data description as follows:

- `hotel`: Hotel (H1 = Resort Hotel or H2 = City Hotel)
- `is_canceled`: Value indicating if the booking was canceled (1) or not (0)
- `lead_time`: Number of days that elapsed between the entering date of the booking into the PMS and the arrival date
- `arrival_date_year`: Year of arrival date
- `arrival_date_month`: Month of arrival date
- `arrival_date_week_number`: Week number of year for arrival date
- `arrival_date_day_of_month`: Day of arrival date
- `stays_in_weekend_nights`: Number of weekend nights (Saturday or Sunday) the guest stayed or booked to stay at the hotel
- `stays_in_week_nights`: Number of week nights (Monday to Friday) the guest stayed or booked to stay at the hotel
- `adults`: Number of adults
- `children`: Number of children
- `babies`: Number of babies
- `meal`: Type of meal booked. Categories are presented in standard hospitality meal packages: 
    - Undefined/SC – no meal package;
    - BB – Bed & Breakfast;
    - HB – Half board (breakfast and one other meal – usually dinner); 
    - FB – Full board (breakfast, lunch and dinner)
- `country`: Country of origin. Categories are ISO 3166-1 country codes, mostly in the three-letter alpha-3 form (e.g. PRT)
- `market_segment`: Market segment designation. In categories, the term “TA” means “Travel Agents” and “TO” means “Tour Operators”
- `distribution_channel`: Booking distribution channel. The term “TA” means “Travel Agents” and “TO” means “Tour Operators”
- `is_repeated_guest`: Value indicating if the booking name was from a repeated guest (1) or not (0)
- `previous_cancellations`: Number of previous bookings that were cancelled by the customer prior to the current booking
- `previous_bookings_not_canceled`: Number of previous bookings not cancelled by the customer prior to the current booking
- `reserved_room_type`: Code of room type reserved. Code is presented instead of designation for anonymity reasons.
- `assigned_room_type`: Code for the type of room assigned to the booking. Sometimes the assigned room type differs from the reserved room type due to hotel operation reasons (e.g. overbooking) or by customer request. Code is presented instead of designation for anonymity reasons.
- `booking_changes`: Number of changes/amendments made to the booking from the moment the booking was entered on the PMS until the moment of check-in or cancellation
- `deposit_type`: Indication on if the customer made a deposit to guarantee the booking. This variable can assume three categories: 
    - No Deposit – no deposit was made; 
    - Non Refund – a deposit was made in the value of the total stay cost; 
    - Refundable – a deposit was made with a value under the total cost of stay.
- `agent`: ID of the travel agency that made the booking
- `company`: ID of the company/entity that made the booking or responsible for paying the booking. ID is presented instead of designation for anonymity reasons
- `days_in_waiting_list`: Number of days the booking was in the waiting list before it was confirmed to the customer
- `customer_type`: Type of booking, assuming one of four categories:
    - Contract - when the booking has an allotment or other type of contract associated to it;
    - Group – when the booking is associated to a group; 
    - Transient – when the booking is not part of a group or contract, and is not associated to other transient booking; 
    - Transient-party – when the booking is transient, but is associated to at least other transient booking
- `adr`: Average Daily Rate as defined by dividing the sum of all lodging transactions by the total number of staying nights
- `required_car_parking_spaces`: Number of car parking spaces required by the customer
- `total_of_special_requests`: Number of special requests made by the customer (e.g. twin bed or high floor)
- `reservation_status`: Reservation last status, assuming one of three categories:
    - Canceled – booking was canceled by the customer; 
    - Check-Out – customer has checked in but already departed; 
    - No-Show – customer did not check-in and did inform the hotel of the reason why
- `reservation_status_date`: Date at which the last status was set. This variable can be used in conjunction with `reservation_status` to understand when the booking was canceled or when the customer checked out of the hotel.

## Missing Values
Check if there are any missing values in `hotel`:

In [None]:
hotel.isna().sum().sort_values(ascending = False)

Four columns have missing values. Here is how we handle them:
- Drop columns `agent` and `company`, since the missing values are too many and we won't use them for prediction.
- Create "UNKNOWN" category for `country`.
- Fill `children` with the value 0.

In [None]:
hotel.drop(columns = ['agent', 'company'], inplace = True)
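# Assign the result back: calling fillna(inplace=True) on a selected column is unreliable in recent pandas
hotel['country'] = hotel['country'].fillna("UNKNOWN")
hotel['children'] = hotel['children'].fillna(0)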
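
To judge whether a column has "too many" missing values, it can help to look at the share of missing values rather than the raw count. The following is a minimal sketch on a made-up toy frame (the values are not taken from the dataset); `toy` and `missing_share` are illustrative names only.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for `hotel`; the values are made up for illustration
toy = pd.DataFrame({
    "company": [np.nan, np.nan, np.nan, 40.0],
    "country": ["PRT", None, "GBR", "ESP"],
    "children": [0.0, 1.0, 0.0, 2.0],
})

# isna().mean() gives the fraction of missing values per column
missing_share = toy.isna().mean().sort_values(ascending=False)
print(missing_share)
```

A column that is mostly empty, like `company` in this toy frame, is a natural candidate for dropping, while a column with only a few gaps can be filled.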

Check whether any missing values remain:

In [None]:
hotel.isna().values.any()

## Data Type Conversion

### Categorical
We convert object columns to the category data type to save memory. We also map the boolean columns `is_canceled` and `is_repeated_guest` to Yes/No categories for readability.

In [None]:
category_cols = ['hotel', 'meal', 'country', 'market_segment', 'distribution_channel', 'reserved_room_type', 'assigned_room_type', 'deposit_type', 'customer_type', 'reservation_status']
boolean_cols = ['is_canceled', 'is_repeated_guest']

boolean_map = {1:'Yes', 0:'No'}

hotel['is_canceled'] = hotel['is_canceled'].map(boolean_map)
hotel['is_repeated_guest'] = hotel['is_repeated_guest'].map(boolean_map)

hotel[category_cols + boolean_cols] = hotel[category_cols + boolean_cols].astype('category')
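# reorder_categories no longer accepts inplace=True in recent pandas, so assign the result back
hotel['is_canceled'] = hotel['is_canceled'].cat.reorder_categories(list(boolean_map.values()))
hotel['is_repeated_guest'] = hotel['is_repeated_guest'].cat.reorder_categories(list(boolean_map.values()))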
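
The memory saving from the category data type can be checked directly. The sketch below uses a synthetic column with two repeated labels (not the real data) and compares the deep memory usage of both representations; the variable names are illustrative.

```python
import pandas as pd

# Synthetic column with only two distinct labels, repeated many times
as_object = pd.Series(["City Hotel", "Resort Hotel"] * 5000, dtype="object")
as_category = as_object.astype("category")

# deep=True counts the memory of the Python strings themselves
object_bytes = as_object.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)
print(object_bytes, category_bytes)
```

The saving comes from storing each distinct label once and keeping only small integer codes per row, so it is largest for columns with few distinct values.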


### Numerical
Convert `children` from float to integer.

In [None]:
# Make sure no fractional values would be lost before casting to integer
assert hotel['children'].apply(float.is_integer).all()
hotel['children'] = hotel['children'].astype('int')

### Datetime
Convert `reservation_status_date` to datetime.

In [None]:
# pd.to_datetime works across pandas versions; astype('datetime64') without a unit fails in pandas 2.x
hotel['reservation_status_date'] = pd.to_datetime(hotel['reservation_status_date'])

Here are the final data types of `hotel`:

In [None]:
hotel.dtypes

## Feature Engineering
Feature engineering is the process of using domain knowledge to extract features from provided raw data. These features can be used to improve the performance of machine learning models.

### Room Type Assignment
Instead of considering each assigned and reserved room type separately, we create a new column `is_assigned_as_reserved` that flags whether the customer got the room type they reserved.

In [None]:
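# Both columns need identical categories before they can be compared; assign back instead of using inplace
hotel['reserved_room_type'] = hotel['reserved_room_type'].cat.set_categories(hotel['assigned_room_type'].cat.categories)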
hotel['is_assigned_as_reserved'] = (hotel['assigned_room_type'] == hotel['reserved_room_type']).astype('category')
hotel['is_assigned_as_reserved']
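
The reason for aligning the categories first can be seen on a small example. This is a sketch with made-up room codes: comparing two categorical Series with different category sets raises a `TypeError`, while the comparison works once both share the same categories.

```python
import pandas as pd

reserved = pd.Series(["A", "B", "A"], dtype="category")   # categories: A, B
assigned = pd.Series(["A", "C", "B"], dtype="category")   # categories: A, B, C

# Comparing categoricals with different category sets is not allowed
try:
    assigned == reserved
    mismatch_raises = False
except TypeError:
    mismatch_raises = True

# Give `reserved` the same categories as `assigned`, then compare element-wise
reserved = reserved.cat.set_categories(assigned.cat.categories)
is_assigned_as_reserved = assigned == reserved
print(mismatch_raises, is_assigned_as_reserved.tolist())
```

Note that `set_categories` turns any value missing from the new category list into NaN, so this approach assumes every reserved code also appears among the assigned codes.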

### Arrival Date
Combine `arrival_date_year`, `arrival_date_month`, `arrival_date_day_of_month` into one column `arrival_date` so that we can extract more information from the date.

In [None]:
arrival_date_cols = ['arrival_date_year', 'arrival_date_month', 'arrival_date_day_of_month']
hotel[arrival_date_cols] = hotel[arrival_date_cols].astype(str)
hotel['arrival_date'] = pd.to_datetime(hotel[arrival_date_cols].apply('-'.join, axis = 1), format = "%Y-%B-%d")
hotel.drop(columns = arrival_date_cols + ['arrival_date_week_number'], inplace = True)

### Booking Date
Create `booking_date` by subtracting `lead_time` days from `arrival_date`.

In [None]:
hotel['booking_date'] = hotel['arrival_date'] - pd.to_timedelta(hotel['lead_time'], unit = 'days')
hotel[['booking_date', 'arrival_date', 'lead_time']].head()
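
The two date steps above can be traced on a tiny made-up frame. This sketch assumes English month names, which is what the `%B` directive expects in a default locale; the rows are illustrative and not taken from the dataset.

```python
import pandas as pd

# Two made-up bookings with the same column layout as `hotel`
toy = pd.DataFrame({
    "arrival_date_year": [2015, 2016],
    "arrival_date_month": ["July", "February"],
    "arrival_date_day_of_month": [1, 29],
    "lead_time": [7, 60],
})

cols = ["arrival_date_year", "arrival_date_month", "arrival_date_day_of_month"]

# Join year, month name, and day into strings such as "2015-July-1", then parse them
date_strings = toy[cols].astype(str).apply("-".join, axis=1)
toy["arrival_date"] = pd.to_datetime(date_strings, format="%Y-%B-%d")

# Booking date = arrival date minus lead time in days
toy["booking_date"] = toy["arrival_date"] - pd.to_timedelta(toy["lead_time"], unit="days")
print(toy[["booking_date", "arrival_date", "lead_time"]])
```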

### Country and Continent Name
The column `country` holds ISO 3166-1 country codes (mostly three-letter alpha-3 codes; the function below also accepts two-letter alpha-2 codes). Using the code-to-name mapping provided by the `pycountry` package, we derive `country_name`, and from it `continent_name`.

In [None]:
additional_code2name = {'TMP': 'East Timor'}

def convertCountryCode2Name(code):
    country_name = None
    try:
        if len(code) == 2:
            country_name = pycountry.countries.get(alpha_2 = code).name
        elif len(code) == 3:
            country_name = pycountry.countries.get(alpha_3 = code).name
    except (AttributeError, LookupError):  # code not found in pycountry
        if code in additional_code2name.keys():
            country_name = additional_code2name[code]
    return country_name if country_name is not None else code
    
hotel['country_name'] = hotel['country'].apply(convertCountryCode2Name).astype('category')
hotel['country_name'].head()

In [None]:
additional_name2continent = {'East Timor': 'Asia', 'United States Minor Outlying Islands': 'North America', 'French Southern Territories': 'Antarctica', 'Antarctica': 'Antarctica'}

def convertCountryName2Continent(country_name):
    continent_name = None
    try:
        alpha2 = pc.country_name_to_country_alpha2(country_name)
        continent_code = pc.country_alpha2_to_continent_code(alpha2)
        continent_name = pc.convert_continent_code_to_continent_name(continent_code)
    except:
        if country_name in additional_name2continent.keys():
            continent_name = additional_name2continent[country_name]
        else:
            continent_name = "UNKNOWN"
    return continent_name if continent_name is not None else country_name

hotel['continent_name'] = hotel['country_name'].apply(convertCountryName2Continent).astype('category')
hotel['continent_name'].head()

## Suspicious Observations
Some anomalies are hidden in the booking observations. Let's create two new variables and plot their frequencies:
- `total_guest`: Total number of `adults`, `children`, and `babies`
- `total_nights`: Number of nights the guest stayed at the hotel, sum of `stays_in_weekend_nights` and `stays_in_week_nights`

In [None]:
hotel['total_guest'] = hotel[['adults', 'children', 'babies']].sum(axis = 1)
hotel['total_nights'] = hotel[['stays_in_weekend_nights', 'stays_in_week_nights']].sum(axis = 1)

data2plot = [hotel['total_guest'].value_counts().sort_index(ascending = False),
             hotel['total_nights'].value_counts().sort_index(ascending = False)[-21:]]

ylabs = ["Total Guest per Booking (Person)", "Total Nights per Booking"]
titles = ["FREQUENCY OF TOTAL GUEST PER BOOKING\n", "FREQUENCY OF TOTAL NIGHTS PER BOOKING\n(UP TO 20 NIGHTS ONLY)"]

fig, axes = plt.subplots(1, 2, figsize = (15, 5))
for ax, data, ylab, title in zip(axes, data2plot, ylabs, titles):
    bp = data.plot(kind = 'barh', rot = 0, ax = ax)
    for rect in bp.patches:
        height = rect.get_height()
        width = rect.get_width()
        bp.text(rect.get_x() + width, 
                rect.get_y() + height, 
                int(width), 
                ha = 'left',
                va = 'top',
                fontsize = 8)
    bp.set_xlabel("Frequency")
    bp.set_ylabel(ylab)
    ax.set_title(title, fontweight = "bold")

There are 180 bookings without any guest (`total_guest` = 0) and 715 bookings with zero nights at the hotel (`total_nights` = 0). Ideally, such cases should not occur in our bookings data. From this point onwards we therefore exclude observations that fall into either case, since they could distort the modeling outcome.

In [None]:
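# .copy() makes the filtered frame independent, so columns can be added later without chained-assignment warnings
hotel = hotel[(hotel['total_guest'] != 0) & (hotel['total_nights'] != 0)].copy()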
hotel.shape

We end up with 118565 rows of bookings, down from the original 119390 rows.
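
The row counts can be reconciled with simple arithmetic. The sketch below only uses the figures quoted in this section and assumes both anomaly counts were taken on the unfiltered data; since fewer rows were removed than the two counts add up to, some bookings must fall into both cases.

```python
# Figures quoted in the text above
rows_before = 119390
rows_after = 118565
zero_guest = 180     # bookings with total_guest == 0
zero_nights = 715    # bookings with total_nights == 0

# Rows actually dropped by the filter
removed = rows_before - rows_after

# Inclusion-exclusion: |A or B| = |A| + |B| - |A and B|
both = zero_guest + zero_nights - removed
print(removed, both)
```

Under that assumption, 825 rows were removed and 70 of them have both zero guests and zero nights.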

# Exploratory Data Analysis (EDA)

## What is the proportion of booking cancellations by reservation status?

In [None]:
df_cancel_status = pd.crosstab(index = hotel.is_canceled,
                               columns = hotel.reservation_status,
                               margins = True)

ax = df_cancel_status.iloc[:-1,:-1].plot(kind = 'bar', stacked = True, rot = 0)
for rect in ax.patches:
    height = rect.get_height()
    width = rect.get_width()
    if height != 0:
        ax.text(rect.get_x() + width, 
                rect.get_y() + height/2, 
                int(height), 
                ha = 'left',
                va = 'center',
                color = "black",
                fontsize = 10)

handles, labels = ax.get_legend_handles_labels()
ax.legend(handles = handles, labels = labels)

percent_no = (100*df_cancel_status/df_cancel_status.iloc[-1,-1]).loc["No", "All"]
ax.set_xticklabels(["Yes\n({:.2f} %)".format(100-percent_no), "No\n({:.2f} %)".format(percent_no)])
ax.set_xlabel("Canceled?")
ax.set_ylabel("Number of Bookings")
plt.title("BOOKING CANCELLATION PROPORTION", fontweight = "bold")
plt.show()

The target variable `is_canceled` is somewhat balanced: 37.26% of the bookings are canceled. These fall into two cases:
- Canceled: Booking was canceled by the customer, or
- No-Show: Customer did not check-in and did inform the hotel of the reason why.

## Where do most bookings come from?

In [None]:
df_choropleth = hotel.copy()
df_choropleth['booking_date_year'] = df_choropleth['booking_date'].dt.year
df_country_year_count = df_choropleth.groupby(['country', 'booking_date_year']).count()['hotel'].fillna(0).reset_index() \
                        .rename(columns={'country': 'country_code', 'booking_date_year': 'year', 'hotel':'count'})
df_country_year_count['country_name'] = df_country_year_count['country_code'].apply(convertCountryCode2Name)
df_country_year_count['count'] = df_country_year_count['count'].astype('int')

fig = px.choropleth(df_country_year_count[df_country_year_count["year"] != 2013], 
                    locations = "country_code", color = "count", animation_frame = "year",
                    hover_name = "country_name", 
                    range_color = (0, 5000),
                    color_continuous_scale = px.colors.sequential.Reds,
                    projection = "natural earth")
fig.update_layout(title = 'ANNUAL HOTEL BOOKING COUNTS',
                  template = "seaborn")
fig.show()

The choropleth map shows that, across all years, most bookings come from Europe. The country with the highest count is Portugal (PRT).

## Which continent has the greatest cancellation rate?
From the previous section, we know Europe is the continent with most bookings. But how about the cancellation rate?

In [None]:
ax = pd.crosstab(index = hotel['continent_name'],
                 columns = hotel['is_canceled'],
                 margins = True).sort_values('All').iloc[:-1,:-1].plot(kind = 'barh')
ax.legend(bbox_to_anchor = (1, 1), title = "Canceled?")
ax.set_xlabel("Number of Bookings")
ax.set_ylabel("Continent Name")
ax.set_title("BOOKINGS BY EACH CONTINENT", fontweight = "bold")
plt.show()

In [None]:
ax = (pd.crosstab(index = hotel['continent_name'],
                  columns = hotel['is_canceled'],
                  normalize = 'index').sort_values('Yes') * 100).plot(kind = 'barh', stacked = True)
ax.legend(bbox_to_anchor = (1, 1), title = "Canceled?")
ax.set_xlabel("Percentage of Bookings")
ax.set_ylabel("Continent Name")
ax.set_title("PERCENTAGE OF BOOKINGS CANCELLATION BY EACH CONTINENT", fontweight = "bold")
plt.show()

It turns out that Africa has the highest cancellation rate among the continents.

## How does the cancellation rate change over time?

In [None]:
df_cancellation = hotel.copy()
df_cancellation['date_period'] = df_cancellation['reservation_status_date'].dt.to_period('M')
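# Count bookings per month, cancellation flag, and hotel type
counts = df_cancellation.groupby(['date_period', 'is_canceled', 'hotel']).size()
# Convert counts to percentages within each (month, hotel) pair; transform keeps the original index
percent = 100 * counts / counts.groupby(level = ['date_period', 'hotel']).transform('sum')
df_cancellation_percent = percent.unstack(level = 'is_canceled') \
                            .rename(columns = str).reset_index().rename_axis(None, axis = 1).rename(columns = {'hotel': 'Hotel Type'})
# Convert monthly periods to timestamps so plotly can place them on a date axis
df_cancellation_percent['date_period'] = df_cancellation_percent['date_period'].dt.to_timestamp()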

fig = px.line(df_cancellation_percent, x = 'date_period', y = 'Yes', color = 'Hotel Type')
fig.update_traces(mode = "markers+lines",
                  hovertemplate = "Rate: %{y:.2f}%")
fig.update_layout(title = 'CANCELLATION RATE OVER TIME BY HOTEL TYPE',
                  xaxis_title = 'Cancellation Period',
                  yaxis_title = 'Cancellation Rate (%)',
                  hovermode = 'x',
                  template = "seaborn",
                  xaxis = dict(tickformat="%b %Y"))
fig.show()

Most of the time, the City Hotel has a higher cancellation rate than the Resort Hotel. The good news is that both rates fall towards zero in mid-2017, meaning that hardly any bookings were canceled in that period. However, we should stay alert for the beginning of 2018, because the cancellation rate has peaked in January in each of the past three years. In the next section, we therefore use a machine learning model to predict, from the predictor variables in our data, whether a booking will be canceled by the customer.

# Predictive Power Score (PPS)
According to [Florian Wetschoreck](https://towardsdatascience.com/rip-correlation-introducing-the-predictive-power-score-3d90808b9598), PPS is an asymmetric and data-type-agnostic score that can detect linear or non-linear relationships between two columns. One column acts as a univariate predictor, whereas the other acts as the target variable. The score ranges from 0 (no predictive power) to 1 (perfect predictive power). It can be used as an alternative to the correlation matrix.

The PPS is calculated as follows:

$PPS = \dfrac{F1_{model} - F1_{naive}}{1 - F1_{naive}}$

where:
- $F1_{naive}$ is the weighted F1 score of a naive model that always predicts the most common class of the target column.
- $F1_{model}$ is the weighted F1 score of a classifier using `sklearn.tree.DecisionTreeClassifier`.

Detailed explanation is available [here](https://github.com/8080labs/ppscore).
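
To make the formula concrete, here is a simplified sketch that computes it by hand for a single numeric predictor on synthetic data. The helper `simple_pps` is hypothetical and is not part of `ppscore`; the library's own implementation may differ in details such as sampling, cross-validation, and the choice of baseline, so the numbers are not expected to match it exactly.

```python
import numpy as np
from sklearn.metrics import f1_score
from sklearn.model_selection import cross_val_predict
from sklearn.tree import DecisionTreeClassifier


def simple_pps(x, y, cv=4, random_state=0):
    """Simplified PPS of one numeric feature x for a class label y (illustrative only)."""
    x = np.asarray(x, dtype=float).reshape(-1, 1)
    y = np.asarray(y)

    # Naive baseline: always predict the most common class
    classes, counts = np.unique(y, return_counts=True)
    naive_pred = np.full(y.shape, classes[counts.argmax()])
    f1_naive = f1_score(y, naive_pred, average="weighted", zero_division=0)

    # Model: out-of-fold predictions from a decision tree on the single feature
    tree = DecisionTreeClassifier(random_state=random_state)
    model_pred = cross_val_predict(tree, x, y, cv=cv)
    f1_model = f1_score(y, model_pred, average="weighted", zero_division=0)

    # A model that does not beat the baseline gets a score of 0
    if f1_model <= f1_naive:
        return 0.0
    return (f1_model - f1_naive) / (1 - f1_naive)


rng = np.random.default_rng(0)
y = (rng.random(400) < 0.35).astype(int)               # imbalanced binary target
informative = y + rng.normal(scale=0.1, size=400)      # almost determines y
noise = rng.normal(size=400)                           # unrelated to y

pps_informative = simple_pps(informative, y)
pps_noise = simple_pps(noise, y)
print(round(pps_informative, 3), round(pps_noise, 3))
```

The informative feature should score close to 1, while the unrelated feature should score much lower. Because the baseline is the weighted F1 of a constant prediction, a noise feature is not guaranteed to land at exactly 0.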

## Data preparation

Before we investigate the PPS, we do the following:
- Use the day of year instead of the full datetime for `booking_date`, `reservation_status_date`, and `arrival_date`.
- Ignore `assigned_room_type` and `reserved_room_type` because they have many levels; `is_assigned_as_reserved` is used instead.
- Ignore `country` and `country_name` because they have too many levels; `continent_name` is used instead.
- Convert the categorical columns into dummy variables.

In [None]:
datetime_cols = ['booking_date', 'reservation_status_date', 'arrival_date']
for col in datetime_cols:
    hotel[f"{col}_dayofyear"] = hotel[col].dt.dayofyear

ignore_cols = ['assigned_room_type', 'reserved_room_type', 'country', 'country_name']
hotel_pps_data = hotel.drop(datetime_cols + ignore_cols, axis = 1)

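# Request uint8 explicitly: recent pandas returns bool dummies, and the dtype check further below relies on uint8
hotel_pps_dummy = pd.get_dummies(hotel_pps_data, dtype = 'uint8')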
hotel_pps_dummy.head()
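
The dummy encoding also explains where a column name such as `is_canceled_Yes` comes from. The sketch below runs `pd.get_dummies` on a made-up three-row frame: numeric columns pass through unchanged, and each categorical column is expanded into one indicator column per level, named `<column>_<level>`.

```python
import pandas as pd

# Made-up rows with one numeric and two categorical columns
toy = pd.DataFrame({
    "lead_time": [10, 200, 35],
    "is_canceled": pd.Categorical(["No", "Yes", "No"], categories=["Yes", "No"]),
    "deposit_type": pd.Categorical(["No Deposit", "Non Refund", "No Deposit"]),
})

toy_dummy = pd.get_dummies(toy)
print(toy_dummy.columns.tolist())
```

Since `is_canceled_Yes` and `is_canceled_No` are exact complements of each other, only one of them is needed as the target.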

We treat each column of `hotel_pps_dummy` as a univariate predictor of `is_canceled`, then calculate the PPS and present the result as a DataFrame.

In [None]:
pps_score = []
target = 'is_canceled_Yes'
for col in hotel_pps_dummy.columns:
    if col == target:
        continue
    d = {}
    d['feature'] = col
    d['dtypes'] = 'categorical' if hotel_pps_dummy[col].dtypes == 'uint8' else 'numerical'
    d['pps'] = pps.score(hotel_pps_dummy, x = col, y = target, task = 'classification')['ppscore']
    pps_score.append(d)
    
hotel_pps = pd.DataFrame(pps_score).set_index('feature')
hotel_pps.head()

In [None]:
ax = hotel_pps[hotel_pps['dtypes'] == 'numerical'].sort_values('pps')\
        .plot(kind = 'barh', legend = False, figsize = (5, 5))
for rect in ax.patches:
    height = rect.get_height()
    width = rect.get_width()
    ax.text(rect.get_x() + width, 
            rect.get_y() + height, 
            round(width, 5), 
            ha = 'left',
            va = 'top',
            fontsize = 8)
ax.set_xlabel("PPS")
ax.set_ylabel("Predictor Variable")
plt.title("NUMERICAL PREDICTORS PREDICTIVE POWER SCORE\n TARGET: is_canceled", fontweight = "bold")
plt.show()

Based on the PPS of the numerical variables, we will ignore some columns for modeling:
- `reservation_status_date_dayofyear`: the score is high because, from the business perspective, this value is updated together with `is_canceled`. So we cannot use it as a predictor.
- `total_nights`: already explained by `stays_in_week_nights` and `stays_in_weekend_nights`.
- `required_car_parking_spaces`, `total_of_special_requests`, `total_guest`, `babies`, `children`, and `adults` are ignored because their scores are nearly 0. `booking_changes`, however, is still considered.

In [None]:
ax = hotel_pps[hotel_pps['dtypes'] == 'categorical'].sort_values('pps')[:-1]\
        .plot(kind = 'barh', legend = False, figsize = (5, 12))
for rect in ax.patches:
    height = rect.get_height()
    width = rect.get_width()
    ax.text(rect.get_x() + width, 
            rect.get_y() + height, 
            round(width, 5), 
            ha = 'left',
            va = 'top',
            fontsize = 8)
ax.set_xlabel("PPS")
ax.set_ylabel("Predictor Variable")
plt.title("CATEGORICAL PREDICTORS PREDICTIVE POWER SCORE\n TARGET: is_canceled", fontweight = "bold")
plt.show()

Based on the PPS of the categorical variables, we will ignore one column for modeling:
- `reservation_status`: the score is high because, from the business perspective, this value is actually `is_canceled` broken down into three categories. So we cannot use it as a predictor.

# MODELING