# Tutorial Part 2 - Cancellation Prediction

This notebook is based on and extends:
 - [EDA of bookings and ML to predict cancelations](https://www.kaggle.com/marcuswingen/eda-of-bookings-and-ml-to-predict-cancelations) by [Marcus Wingen](https://www.kaggle.com/marcuswingen)
 - [Exploring the Data & analysing the best Regression](https://www.kaggle.com/amarloni/exploring-the-data-analysing-the-best-regression) by [Amar Loni](https://www.kaggle.com/amarloni)

![sklearn-logo](https://scikit-learn.org/stable/_static/scikit-learn-logo-small.png)

We will use [scikit-learn](https://scikit-learn.org/) to train a model that predicts whether a booking will be canceled.


<!-- - [Factors influencing Hotel Booking - Quick Study](https://www.kaggle.com/samiranbera/factors-influencing-hotel-booking-quick-study) by [SamiranBera](https://www.kaggle.com/samiranbera)
 - [EDA of Hotel Bookings](https://www.kaggle.com/listonlt/eda-of-hotel-bookings) by [Liston Tellis](https://www.kaggle.com/listonlt)
-->


## Dataset description

The [Hotel booking demand](https://www.kaggle.com/jessemostipak/hotel-booking-demand) dataset available on Kaggle was originally described in [Antonio et al. (2019): Hotel booking demand datasets](https://doi.org/10.1016/j.dib.2018.11.126.). It was cleaned by Thomas Mock and Antoine Bichat for #TidyTuesday during the week of February 11th, 2020.

It contains booking data (31 variables) on two hotels in Portugal:
 - **H1:** a resort hotel at the Algarve (40,060 observations)
 - **H2:** a city hotel in Lisbon (79,330 observations)
 
Each observation represents a hotel booking (due to arrive between July 1, 2015 and August 31, 2017), including **bookings that effectively arrived and bookings that were canceled**. 

The data is from real hotel bookings, but all data pertaining to hotel or customer identification were deleted.

### Complete list of variables

- `hotel`: `Resort Hotel` or `City Hotel` *(Categorical)*
- `is_canceled` Value indicating if the booking was canceled (`1`) or not (`0`) *(Categorical)*



- `lead_time` Number of days that elapsed between the entering date of the booking into the property management system (PMS) and the arrival date *(Integer)*
- `arrival_date_year` Year of arrival date *(Integer)*
- `arrival_date_month` Month of arrival date with 12 categories: `January` to `December` *(Categorical)*
- `arrival_date_week_number` Week number of the arrival date *(Integer)*
- `arrival_date_day_of_month` Day of the month of the arrival date *(Integer)*



- `stays_in_weekend_nights` Number of weekend nights (Saturday or Sunday) the guest stayed or booked to stay at the hotel *(Integer)*
- `stays_in_week_nights` Number of week nights (Monday to Friday) the guest stayed or booked to stay at the hotel *(Integer)*



- `adults` Number of adults *(Integer)*
- `children` Number of children *(Integer)*
- `babies` Number of babies *(Integer)*
- `meal` Type of meal booked. Categories are presented in standard hospitality meal packages *(Categorical)*:
    - `Undefined/SC` - no meal package;
    - `BB` - Bed & Breakfast;
    - `HB` - Half board (breakfast and one other meal, usually dinner);
    - `FB` - Full board (breakfast, lunch and dinner)
- `country` Country of origin. Categories are represented in the ISO 3155–3:2013 format *(Categorical)*

- `market_segment` Market segment designation *(Categorical)*
    - `TA` means Travel Agents and
    - `TO` means Tour Operators
- `distribution_channel` Booking distribution channel *(Categorical)*
    - `TA` means Travel Agents and
    - `TO` means Tour Operators



- `is_repeated_guest` Value indicating if the booking was from a repeated guest (`1`) or not (`0`) *(Categorical)*
- `previous_cancellations` Number of previous bookings that were cancelled by the customer prior to the current booking *(Integer)*
- `previous_bookings_not_canceled` Number of previous bookings not cancelled by the customer prior to the current booking *(Integer)*


- `reserved_room_type` Code of room type reserved. Code is presented instead of designation for anonymity reasons *(Categorical)*
- `assigned_room_type` Code for the type of room assigned to the booking. Sometimes the assigned room type differs from the reserved room type due to hotel operation reasons (e.g. overbooking) or by customer request. Code is presented instead of designation for anonymity reasons *(Categorical)*


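- `booking_changes` Number of changes/amendments made to the booking from the moment the booking was entered on the PMS until the moment of check-in or cancellation *(Integer)*
- `deposit_type` Indication of whether the customer made a deposit to guarantee the booking, with the categories `No Deposit`, `Non Refund` and `Refundable` *(Categorical)*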



- `agent` ID of the travel agency that made the booking *(Categorical)*
- `company` ID of the company/entity that made the booking or responsible for paying the booking. ID is presented instead of designation for anonymity reasons *(Categorical)*

- `days_in_waiting_list` Number of days the booking was in the waiting list before it was confirmed to the customer *(Integer)*
- `customer_type` Type of booking, assuming one of four categories *(Categorical)*:
    - `Contract` - when the booking has an allotment or other type of contract associated with it;
    - `Group` - when the booking is associated with a group;
    - `Transient` - when the booking is not part of a group or contract, and is not associated with another transient booking;
    - `Transient-party` - when the booking is transient, but is associated with at least one other transient booking

- `adr` Average Daily Rate - Calculated by dividing the sum of all lodging transactions by the total number of staying nights *(Numeric)*
- `required_car_parking_spaces` Number of car parking spaces required by the customer *(Integer)*
- `total_of_special_requests` Number of special requests made by the customer (e.g. twin bed or high floor) *(Integer)*
- `reservation_status` Reservation last status, assuming one of three categories *(Categorical)*:
    - `Canceled` - booking was canceled by the customer;
    - `Check-Out` - customer has checked in but already departed;
    - `No-Show` - customer did not check in and did inform the hotel of the reason why
- `reservation_status_date` Date at which the last status was set. This variable can be used in conjunction with `reservation_status` to understand when the booking was canceled or when the customer checked out of the hotel *(Date)*

**Note:** there are some differences between the original data set described in the paper and the dataset here:
 - in the paper, there is a separate data set for each hotel, which have been merged with an added column `hotel`
 - omission of redundant variables (e.g., Categorical and Integer versions of month)
 

Let's see if that matches the data we have..

### Loading and preprocessing

In [None]:
# Load common libraries:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import folium

# set some display options:
sns.set(style="whitegrid")
pd.set_option("display.max_columns", 36)

# load data:
file_path = "../input/hotel-booking-demand/hotel_bookings.csv"
full_data = pd.read_csv(file_path)

# Replace missing values:
nan_replacements = {"children": 0.0, "country": "Unknown", "agent": 0, "company": 0}
full_data_cln = full_data.fillna(nan_replacements)

# "meal" contains values "Undefined", which is equal to SC.
full_data_cln["meal"] = full_data_cln["meal"].replace("Undefined", "SC")

# Get rid of bookings for 0 adults, 0 children, and 0 babies:
zero_guests = (full_data_cln["adults"]
               + full_data_cln["children"]
               + full_data_cln["babies"]) == 0
full_data_cln = full_data_cln[~zero_guests]

# Delete a record with ADR greater than 5000
full_data_cln = full_data_cln[full_data_cln['adr'] < 5000]
ax = sns.boxplot(x=full_data_cln['adr'])

## Cancellation prediction

So what do we want to predict?

 - the **number of cancellations** over time
 - **whether or not** a given booking will be canceled
 - **how likely** it is that a given booking will be canceled

### What variables are correlated with cancelations?

In [None]:
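# Only numeric columns can be correlated (recent pandas versions raise an error otherwise):
cor_mat = full_data.corr(numeric_only=True)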
fig, ax = plt.subplots(figsize=(17,7))
sns.heatmap(cor_mat, ax=ax, cmap="RdBu", center=0, linewidths=0.1)

Our target variable `is_canceled` is positively correlated with:
 - `lead_time`: bookings that are made well in advance are more likely to be canceled.
 - `previous_cancellations`: bookings made by customers who have canceled bookings in the past are more likely to be canceled.
 - `adults`: single bookings tend to be canceled less frequently than bookings for larger parties.
 - `days_in_waiting_list`: customers don't like to wait.
 - `adr`: more expensive bookings tend to be canceled more often (a bit).

It is negatively correlated with:
 - `is_repeated_guest`: repeat guests seem to be more loyal and don't cancel as much
 - `booking_changes`: a higher number of changes is associated with fewer cancelations
 - `required_car_parking_spaces`: customers who come by car don't cancel as much (resort vs. city?)
 - `total_of_special_requests`: bookings with higher number of special requests are canceled less often.

A few other observations:
- Pos. corr. between `children` and `adr`: it's expensive to be parents.. ;)
- Pos. corr. between `previous_cancellations` and `lead_time`: customers that have canceled more often tend to book well in advance..
- Pos. corr. between `previous_cancellations` and `previous_bookings_not_canceled`: probably due to repeat customers who come more than once (and cancel some of their bookings)
- Neg. corr. between `arrival_date_week_number` and `arrival_date_year`: not surprising, as the data set begins in July 2015 and ends in August 2017 - therefore, the first year (2015) only has the higher week numbers (starting from 27), whereas the last year (2017) only includes week numbers up to 35.
- Pos. corr. between `stays_in_week_nights` and `stays_in_weekend_nights`: longer stays (1 week in resort) increase both

- the association of `is_repeated_guest` and `company` may be spurious, as `company` is a categorical ID that is merely encoded as a number

Let's have a closer look at the correlation with our target variable again:

In [None]:
cancel_corr = full_data.corr(numeric_only=True)["is_canceled"]
# Skip the first entry, which is the correlation of is_canceled with itself:
cancel_corr.sort_values(ascending=False)[1:]
#cancel_corr.abs().sort_values(ascending=False)[1:]

**OK, so `lead_time`, `total_of_special_requests`, `required_car_parking_spaces`, `booking_changes` and `previous_cancellations` are the 5 most important numerical features.**

Can we actually use them to make predictions?

As we'll see..
- we cannot use `booking_changes`, as this changes over time and is a possible source of leakage (we don't know in advance how often there will be a booking change for a new booking; therefore, we would train the model with data that it's not supposed to have/does not have in real-world use)
- `total_of_special_requests`: assuming that these requests are made when the booking is made, we can include them; otherwise we may also have to discard that.

## Preparations

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import LabelEncoder
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import OneHotEncoder
from sklearn.metrics import confusion_matrix
from xgboost import XGBClassifier

import numpy as np

### Separate features and predicted values

In [None]:
# Separate features and predicted value
y = full_data["is_canceled"] # what we want to predict
X = full_data.drop(["is_canceled"], axis=1) # remove target variable from features

### Train-test split

Let's divide the data into training and validation sets:

In [None]:
# 70 % for training, 30 % for validation
X_train_full, X_valid_full, y_train, y_valid = train_test_split(X, y, train_size=0.7,
                                                                test_size=0.3, random_state=0)

Note that we set a random seed (`random_state`) to get reproducible results.
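
As a small illustration of what the seed does, here is a sketch on toy arrays (`X_toy` and `y_toy` are made-up stand-ins, not the booking data). It checks that two splits with the same `random_state` are identical. The `stratify` argument shown at the end is an optional extra that is not used in this notebook; it keeps the share of canceled bookings similar in both parts.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data standing in for the booking features and target (assumption).
X_toy = np.arange(40).reshape(20, 2)
y_toy = np.array([0] * 12 + [1] * 8)

# Two splits with the same seed select exactly the same rows.
split_a = train_test_split(X_toy, y_toy, train_size=0.7, random_state=0)
split_b = train_test_split(X_toy, y_toy, train_size=0.7, random_state=0)
same_split = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))
print("Identical splits:", same_split)

# Optional: stratify on the target to preserve the class balance in both parts.
_, _, y_tr, y_va = train_test_split(
    X_toy, y_toy, train_size=0.7, random_state=0, stratify=y_toy
)
print("Positive rate train/valid:", y_tr.mean(), y_va.mean())
```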

## Feature engineering

### Numeric Features

Let's choose the columns that we want to include as features.

Looking at the features we have, we will exclude the following ones to prevent leakage and make the model more general (`assigned_room_type` and `reservation_status` are categorical, but are listed here because they are excluded for the same reason):
 - `arrival_date_year`: we want to avoid correlated features - as we have seen above, the data begins and ends in summer, so there are two years with data only on half of the year; this is likely to introduce spurious patterns.
 - `assigned_room_type`: we only know that once the customer checks in, i.e., we cannot use that to make predictions ("leakage")
 - `booking_changes`: same here
 - `reservation_status`: same here; "predicting" that `is_canceled=1` if `reservation_status` is `Canceled` won't help us ;) 
 - `days_in_waiting_list`: at the time of booking, we don't know how long the booking will be on the waiting list 

We will include the following numeric features:


In [None]:
num_features = ["lead_time",
                "arrival_date_week_number",
                "arrival_date_day_of_month",
                "stays_in_weekend_nights",
                "stays_in_week_nights",
                "adults",
                "children",
                "babies",
                "is_repeated_guest",
                "previous_cancellations",
                "previous_bookings_not_canceled",
                "agent",
                "company",
                "required_car_parking_spaces",
                "total_of_special_requests",
                "adr"]

### Categorical features

Next, we select which categorical features to include to make our predictions.

Here, we have to exclude `reservation_status` (values: `Check-Out`, `Canceled` and `No-Show`) to prevent leakage.

Whether to include certain attributes depends on whether we want to use the trained model only for these specific hotels or want a model that generalizes to other hotels.
For example, we may want to exclude `country`, because distance may actually play a role, rather than just nationality.



In [None]:
cat_features = ["hotel",
                "arrival_date_month",
                "meal",
                "market_segment",
                "distribution_channel",
                "reserved_room_type",
                "deposit_type",
                "customer_type"]

### Construct training and validation sets with features

In [None]:
X_train = X_train_full[num_features + cat_features].copy()
X_valid = X_valid_full[num_features + cat_features].copy()

# Preprocessing for numerical features: fill missing values with 0
# (the raw data has missing values in "children", "agent" and "company")
num_transformer = SimpleImputer(strategy="constant")

# Preprocessing for categorical features:
cat_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="constant", fill_value="Unknown")),
    ("onehot", OneHotEncoder(handle_unknown='ignore'))])

# Bundle preprocessing for numerical and categorical features:
preprocessor = ColumnTransformer(transformers=[('num', num_transformer, num_features),
                                               ('cat', cat_transformer, cat_features)])
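
To see what this preprocessor does, here is a minimal sketch on a made-up two-column frame (`toy_train` and `toy_valid` are hypothetical, not part of the data set). It mirrors the transformers defined above. The only addition is `sparse_threshold=0`, which forces a dense array so the result is easy to print.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical mini data set with one numeric and one categorical column.
toy_train = pd.DataFrame({"lead_time": [10.0, np.nan, 30.0],
                          "meal": ["BB", "HB", np.nan]})
toy_valid = pd.DataFrame({"lead_time": [5.0], "meal": ["FB"]})  # "FB" never seen in training

toy_cat = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="constant", fill_value="Unknown")),
    ("onehot", OneHotEncoder(handle_unknown="ignore"))])

toy_pre = ColumnTransformer(
    transformers=[("num", SimpleImputer(strategy="constant"), ["lead_time"]),
                  ("cat", toy_cat, ["meal"])],
    sparse_threshold=0)  # always return a dense array

# Columns: lead_time, then one column each for "BB", "HB" and "Unknown".
train_matrix = np.asarray(toy_pre.fit_transform(toy_train))
# The unseen category "FB" is encoded as all zeros instead of raising an error.
valid_matrix = np.asarray(toy_pre.transform(toy_valid))
print(train_matrix)
print(valid_matrix)
```

Note that the missing `lead_time` is filled with 0 and the missing `meal` becomes its own `Unknown` category.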

## Training

We will use a [random forest classifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html) to make predictions (cf. [this article](https://towardsdatascience.com/random-forest-and-its-implementation-71824ced454f) for a more in-depth explanation).

Random forest is a supervised learning algorithm that trains many decision trees, each on a random sample of the data and features, and combines their outputs to obtain the final prediction.

![Random forest](https://miro.medium.com/max/1306/0*f_qQPFpdofWGLQqc.png)

<!--
![Random forest](randomForest.png)
-->
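
To make "combining the trees" concrete, here is a sketch on synthetic data (`make_classification` is used as a stand-in, so the numbers have nothing to do with the bookings). In scikit-learn the forest averages the class probabilities of its trees rather than taking a hard majority vote, which we can check by hand:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (assumption), small enough to run quickly.
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)
forest = RandomForestClassifier(n_estimators=25, random_state=0).fit(X_demo, y_demo)

# Each fitted tree is available in `estimators_`; average their class probabilities.
tree_probas = np.stack([tree.predict_proba(X_demo) for tree in forest.estimators_])
mean_proba = tree_probas.mean(axis=0)

# The forest's prediction is the class with the highest averaged probability.
manual_pred = forest.classes_[mean_proba.argmax(axis=1)]
proba_matches = np.allclose(mean_proba, forest.predict_proba(X_demo))
pred_matches = bool((manual_pred == forest.predict(X_demo)).all())
print(proba_matches, pred_matches)
```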

In [None]:
# Define Random Forest classifier:
rfc_model = RandomForestClassifier(random_state=0,n_jobs=-1)
rfc_pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                              ('model', rfc_model)])

# Preprocessing of training data, fit model:
rfc_pipeline.fit(X_train, y_train)

## Validation

In [None]:
from sklearn.metrics import accuracy_score

# Preprocessing of validation data, get predictions:
y_pred = rfc_pipeline.predict(X_valid)

In [None]:
y_pred

In [None]:
# Evaluate the model:
score = accuracy_score(y_valid, y_pred)
print("Random Forest accuracy_score: ", score)

So with this model, we achieve an accuracy of about 0.86.

What does that mean?

#### Definition: Accuracy
The [scikit-learn user guide](https://scikit-learn.org/stable/modules/model_evaluation.html#accuracy-score) says that if $\hat{y}_i$ is the predicted value of the $i$-th sample and $y_i$ is the corresponding true value, then the fraction of correct predictions over $n_\text{samples}$ is defined as

$$\texttt{accuracy}(y, \hat{y}) = \frac{1}{n_\text{samples}} \sum_{i=0}^{n_\text{samples}-1} 1(\hat{y}_i = y_i)$$



**So pretty straightforward: about 86% of the predictions are correct.**
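
As a quick sanity check of the formula, here is a sketch with made-up labels (`y_true_toy` and `y_pred_toy` are hypothetical). It computes the fraction of matching entries by hand and compares it with `accuracy_score`:

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical labels: 1 = canceled, 0 = not canceled.
y_true_toy = np.array([0, 0, 1, 1, 0, 1, 0, 0])
y_pred_toy = np.array([0, 1, 1, 0, 0, 1, 0, 0])

# The indicator 1(y_hat_i = y_i), averaged over all samples.
manual_accuracy = np.mean(y_pred_toy == y_true_toy)
sklearn_accuracy = accuracy_score(y_true_toy, y_pred_toy)
print(manual_accuracy, sklearn_accuracy)  # 6 of 8 predictions match
```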

Now, that sounds impressive, but is that any good?<br/> 
What should be our baseline here? <br/>
What would your best guess be if you didn't know anything?

In [None]:
full_data_cln["is_canceled"].mean()

So if you follow a naïve strategy and just guess that a booking will not be canceled without knowing anything about it, you will already be correct in 63% (1-0.37) of cases. 

So, 0.863 is not too bad, but also not all *that* impressive.. 
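
The naive strategy can also be written down as a model. The sketch below uses scikit-learn's `DummyClassifier` on a made-up target with roughly the class balance found above (`y_toy` is hypothetical, not the real data). It always predicts the most frequent class:

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Hypothetical target: 37 canceled and 63 not canceled bookings.
y_toy = np.array([1] * 37 + [0] * 63)
X_toy = np.zeros((len(y_toy), 1))  # the dummy model ignores the features

baseline = DummyClassifier(strategy="most_frequent").fit(X_toy, y_toy)
baseline_pred = baseline.predict(X_toy)

baseline_accuracy = accuracy_score(y_toy, baseline_pred)
print("Baseline accuracy:", baseline_accuracy)
```

Any real model should clearly beat this number to be worth using.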

To control for an **imbalanced dataset** like this, we should use the balanced accuracy score:

#### Definition: Balanced Accuracy

If $y_i$ is the true value of the $i$-th sample, and $w_i$ is the corresponding sample weight, then we adjust the sample weight to:

$$\hat{w}_i = \frac{w_i}{\sum_j{1(y_j = y_i) w_j}}$$

Balanced accuracy is then defined as:

$$\texttt{balanced-accuracy}(y, \hat{y}, w) = \frac{1}{\sum{\hat{w}_i}} \sum_i 1(\hat{y}_i = y_i) \hat{w}_i$$

In [None]:
from sklearn.metrics import balanced_accuracy_score

balanced_accuracy_score(y_valid, y_pred)

So not too bad..
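
If all sample weights are 1, the definition above boils down to the average of the per-class recalls. Here is a sketch with made-up labels (`y_true_toy` and `y_pred_toy` are hypothetical) that computes it both ways:

```python
import numpy as np
from sklearn.metrics import accuracy_score, balanced_accuracy_score, recall_score

# Hypothetical imbalanced labels: six negatives, two positives.
y_true_toy = np.array([0, 0, 0, 0, 0, 0, 1, 1])
y_pred_toy = np.array([0, 0, 0, 0, 0, 1, 1, 0])

# With unit weights, each sample is weighted by 1 / (size of its true class) ...
class_sizes = np.bincount(y_true_toy)
w_hat = 1.0 / class_sizes[y_true_toy]
manual_balanced = np.sum((y_pred_toy == y_true_toy) * w_hat) / np.sum(w_hat)

# ... which is the same as averaging the recall of each class.
mean_recall = recall_score(y_true_toy, y_pred_toy, average="macro")
sklearn_balanced = balanced_accuracy_score(y_true_toy, y_pred_toy)
plain_accuracy = accuracy_score(y_true_toy, y_pred_toy)
print(manual_balanced, mean_recall, sklearn_balanced, plain_accuracy)
```

Here the plain accuracy (0.75) looks better than the balanced accuracy (about 0.67), because the minority class is predicted less reliably.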


Let's say you want to act on the predictions, e.g. by overbooking a room. 

What else would you like to know before you make a decision on whether to do that?

Well, if you do, four things may happen (treating a cancellation, i.e. `is_canceled = 1`, as the positive class):
 - the customer predicted to cancel actually cancels (True positive → $T_p$)
 - the customer predicted to show up actually shows up (True negative → $T_n$)
 
 or
 - the customer predicted to cancel does not cancel (False positive → $F_p$)
 - the customer not predicted to cancel does cancel (False negative → $F_n$)
 
So, how can we look into this? The *confusion matrix* does the job here:

In [None]:
confusion_matrix(y_valid, y_pred)

Let's put this a little more nicely..

In [None]:
labels = ['cancelled', 'not cancelled']
true = pd.Categorical(np.where(np.array(y_valid) == 1, 'cancelled', 'not cancelled'),
                      categories=labels)
pred = pd.Categorical(np.where(np.array(y_pred) == 1, 'cancelled', 'not cancelled'),
                      categories=labels)

# Rows: predicted class, columns: actual class
pd.crosstab(pred, true,
            rownames=['Predicted'],
            colnames=['Actual'])
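
The four counts can also be read directly from `confusion_matrix`. Here is a sketch with made-up labels (`y_true_toy` and `y_pred_toy` are hypothetical), following scikit-learn's convention that rows are the actual classes and columns the predicted classes, with labels in sorted order (0, then 1):

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical labels: 1 = canceled (positive class), 0 = not canceled.
y_true_toy = np.array([0, 0, 0, 0, 1, 1, 1, 0])
y_pred_toy = np.array([0, 0, 1, 0, 1, 0, 1, 0])

# For a binary problem, flattening the 2x2 matrix yields tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true_toy, y_pred_toy).ravel()
print(f"TN={tn}, FP={fp}, FN={fn}, TP={tp}")
```

Keep in mind that this layout is transposed relative to a cross table with predictions in the rows.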

Now, this is helpful, but can we summarize these in simple metrics?

#### Definition: Precision, Recall, F1 ([Documentation](https://scikit-learn.org/stable/auto_examples/model_selection/plot_precision_recall.html#sphx-glr-auto-examples-model-selection-plot-precision-recall-py))

Precision (a.k.a. positive predictive value) $P$ is defined as the number of true positives $T_p$ over the number of true positives plus the number of false positives $F_p$:

$$ P = \frac{T_p}{T_p+F_p} $$

Recall (a.k.a. true positive rate or sensitivity) $R$ is defined as the number of true positives $T_p$ over the number of true positives plus the number of false negatives $F_n$:

$$R = \frac{T_p}{T_p + F_n}$$

These quantities are also related to the $F1$ score, which is defined as the harmonic mean of precision and recall:

$$F1 = 2\frac{P \times R}{P+R}$$


In [None]:
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import f1_score

print("Precision: ", precision_score(y_valid, y_pred))
print("Recall: ", recall_score(y_valid, y_pred))
print("F1: ", f1_score(y_valid, y_pred))
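
To connect the formulas with the functions, here is a sketch with made-up labels (`y_true_toy` and `y_pred_toy` are hypothetical) that computes the three metrics from the raw counts and compares them with scikit-learn:

```python
import numpy as np
from sklearn.metrics import f1_score, precision_score, recall_score

# Hypothetical labels: 1 = canceled (positive class), 0 = not canceled.
y_true_toy = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred_toy = np.array([1, 1, 0, 0, 1, 0, 0, 0, 0, 0])

tp = np.sum((y_pred_toy == 1) & (y_true_toy == 1))  # predicted and actual cancellations
fp = np.sum((y_pred_toy == 1) & (y_true_toy == 0))  # predicted, but not canceled
fn = np.sum((y_pred_toy == 0) & (y_true_toy == 1))  # missed cancellations

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
print(precision, recall, f1)

# The library functions give the same values.
sk_precision = precision_score(y_true_toy, y_pred_toy)
sk_recall = recall_score(y_true_toy, y_pred_toy)
sk_f1 = f1_score(y_true_toy, y_pred_toy)
```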

Since we are dealing with a binary classification problem here (vs. a multi-label classification problem), we can nicely see the trade-off between precision and recall in a plot.

In [None]:
from sklearn.metrics import PrecisionRecallDisplay

# plot_precision_recall_curve was removed in scikit-learn 1.2; use the display class instead
disp = PrecisionRecallDisplay.from_estimator(rfc_pipeline, X_valid, y_valid)
disp.ax_.set_title('Precision-Recall curve')

**Interpretation:** in principle, we could set a different threshold for the classifier (0.5 by default) to favor precision or recall and get, e.g.,
 - (almost) perfect precision (i.e., every predicted cancellation is an actual cancellation), if we are ok with only predicting approx. 40% of the cancellations
 - (almost) perfect recall (i.e., predict all actual cancellations), if we are ok with about 40% of our predictions being incorrect
 - or something in between..
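
Here is a sketch of how such a threshold could be applied, using synthetic data from `make_classification` as a stand-in for the bookings (an assumption; the printed numbers say nothing about the hotel data). Instead of `predict`, we take the estimated probability of the positive class from `predict_proba` and compare it with a threshold of our choice:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data with roughly 37 % positives (assumption).
X_demo, y_demo = make_classification(n_samples=2000, n_features=10,
                                     weights=[0.63, 0.37], flip_y=0.05,
                                     random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X_demo, y_demo, train_size=0.7,
                                          random_state=0)
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)

# Column 1 holds the estimated probability of the positive class ("how likely").
proba_cancel = clf.predict_proba(X_va)[:, 1]

recalls = []
for threshold in (0.3, 0.5, 0.7):
    pred_at_t = (proba_cancel >= threshold).astype(int)
    recalls.append(recall_score(y_va, pred_at_t))
    print(f"threshold={threshold}: "
          f"precision={precision_score(y_va, pred_at_t, zero_division=0):.2f}, "
          f"recall={recalls[-1]:.2f}")
```

Raising the threshold can only shrink the set of predicted cancellations, so recall never increases, while precision typically goes up.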
 
Other important metrics include the [Area Under the Curve (AUC)](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.auc.html#sklearn.metrics.auc) and the [Receiver Operating Characteristic (ROC) curve](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_curve.html#sklearn.metrics.roc_curve). 
There are also various more specialized ones..
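
As a small sketch of these two, with made-up scores (`y_true_toy` and `scores_toy` are hypothetical): `roc_curve` returns the false and true positive rates for all thresholds, and the area under that curve equals the share of (positive, negative) pairs in which the positive sample gets the higher score.

```python
import numpy as np
from sklearn.metrics import auc, roc_auc_score, roc_curve

# Hypothetical true labels and predicted scores for the positive class.
y_true_toy = np.array([0, 0, 1, 1, 0, 1])
scores_toy = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.9])

fpr, tpr, thresholds = roc_curve(y_true_toy, scores_toy)
auc_from_curve = auc(fpr, tpr)             # area under the ROC curve
auc_direct = roc_auc_score(y_true_toy, scores_toy)

# Pairwise view: 8 of the 9 (positive, negative) pairs are ranked correctly.
pos, neg = scores_toy[y_true_toy == 1], scores_toy[y_true_toy == 0]
pairwise = np.mean(pos[:, None] > neg[None, :])
print(auc_from_curve, auc_direct, pairwise)
```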

## How about Regression?

Even though the target variable is dichotomous (binary), we can just as well treat this as a regression problem. 

Random forests, which we used for classification before, can also be used for regression tasks.
In that case, instead of the mode of the predictions made by the individual trees, the mean prediction is returned.
In scikit-learn, we can use the [RandomForestRegressor](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html) and otherwise the same preprocessing pipeline as before.

In [None]:
from sklearn.ensemble import RandomForestRegressor 

rfe_model = RandomForestRegressor(n_estimators = 100, random_state = 0) 
rfe_pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                              ('model', rfe_model)])
rfe_pipeline.fit(X_train, y_train)

### Validation

In [None]:
y_pred = rfe_pipeline.predict(X_valid)

In [None]:
y_pred

In [None]:
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score

# Use the built-in round(), which works whether the metric returns a NumPy or a plain Python float:
print('Mean Absolute Error:', round(mean_absolute_error(y_valid, y_pred), 4))
print('Mean Squared Error:', round(mean_squared_error(y_valid, y_pred), 4))
print('Root Mean Squared Error:', round(np.sqrt(mean_squared_error(y_valid, y_pred)), 4))
print('r2_score:', round(r2_score(y_valid, y_pred), 3))
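
To relate the regression output back to the classification task, here is a sketch on synthetic stand-in data (an assumption; it does not use the booking data). The regressor's predictions for a 0/1 target lie between 0 and 1, so they can be thresholded into class labels. Their mean squared error is also the Brier score of the predictions read as probabilities.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import accuracy_score, brier_score_loss, mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in data with a 0/1 target (assumption).
X_demo, y_demo = make_classification(n_samples=1000, n_features=10, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X_demo, y_demo, train_size=0.7,
                                          random_state=0)

reg = RandomForestRegressor(n_estimators=50, random_state=0).fit(X_tr, y_tr)
# Mean of the trees' predictions; clipped to guard against floating-point overshoot.
scores = np.clip(reg.predict(X_va), 0.0, 1.0)

# Thresholding turns the scores into labels that can be compared with a classifier.
labels_at_half = (scores >= 0.5).astype(int)
acc_at_half = accuracy_score(y_va, labels_at_half)

# For a 0/1 target, the MSE of the scores equals the Brier score.
mse = mean_squared_error(y_va, scores)
brier = brier_score_loss(y_va, scores)
print(acc_at_half, mse, brier)
```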

# References
[Nuno Antonio, Ana de Almeida, Luis Nunes: Hotel booking demand datasets, Data in Brief, Volume 22, 2019, p. 41-49](https://doi.org/10.1016/j.dib.2018.11.126.)