# Practical task: Hotel cancellations

## Table of Contents

* [Introduction](#intro)
* [Importing libraries and data](#import)
* [Preparing the data and exploratory data analysis](#pre)
* [Building a model](#model)
* [Conclusion](#conclusion)

<a id='intro'></a>
# Introduction

This data set contains information on 119,390 hotel bookings between July 2015 and August 2017. Each observation represents a hotel booking.

The data covers two hotels, both located in Portugal: the Resort Hotel is in the Algarve region and the City Hotel is in the city of Lisbon. A variety of categorical and numeric features are provided, including whether the booking was cancelled.

Hotel management would find it useful to be able to predict whether a booking is likely to be cancelled.

<a id='features'></a>
## Features

| Feature | Description | Notes |
| --- | --- | --- |
|hotel| Type of hotel |Resort Hotel|
| | | City Hotel |
|meal|Type of meal booked |Undefined/SC – no meal package |
| | | BB – Bed & Breakfast|
| | | HB – Half board (breakfast and one other meal – usually dinner)|
| | | FB – Full board (breakfast, lunch and dinner) |
|market_segment| Where the booking came from|The term “TA” means “Travel Agents” and “TO” means “Tour Operators” |
|distribution_channel|Booking distribution channel | The term “TA” means “Travel Agents” and “TO” means “Tour Operators” |
|reserved_room_type|Code of room type reserved | |
|assigned_room_type|Code for the type of room assigned to the booking| Sometimes the assigned room type differs from the reserved room type due to hotel operation reasons or customer request |
|deposit_type| What kind of deposit was taken| |
|customer_type| Type of booking| Contract – when the booking has a contract associated with it |
| | | Group – when the booking is associated with a group |
| | | Transient – when the booking is not part of a group or contract, and is not associated with another transient booking |
| | | Transient-party – when the booking is transient, but is associated with at least one other transient booking |
|lead_time| How long in advance the booking was made (days)| |
|stays_in_weekend_nights|Number of weekend nights (Saturday or Sunday) the guest stayed or booked to stay at the hotel | |
|stays_in_week_nights|Number of week nights (Monday to Friday) the guest stayed or booked to stay at the hotel | |
|adults| Number of adult guests on the booking| |
|children| Number of children | |
|babies| Number of babies| |
|is_repeated_guest| Whether the booking came from a repeat customer| 0 - no |
| | | 1 - yes |
|previous_cancellations| Number of previous cancellations by the customer| |
|previous_bookings_not_canceled| Number of previous bookings by the customer not cancelled| |
|booking_changes| Number of changes made by the customer after initial booking| |
|days_in_waiting_list| Number of days the booking was in the waiting list before it was confirmed to the customer| |
|required_car_parking_spaces|Number of car parking spaces required by the customer | |
|total_of_special_requests| Number of special requests made by the customer| |
|adr|Average daily rate| |
|is_canceled| Whether the booking was cancelled | 0 - not cancelled |
| | | 1 - cancelled |

<a id='import'></a>
# Importing libraries and data

<a id='libraries'></a>
## Importing the libraries

In [None]:
# pandas for data analysis
import pandas as pd

# seaborn for visualisation
import seaborn as sns

import matplotlib.pyplot as plt

# seaborn has some unhelpful warnings at the moment
import warnings
warnings.filterwarnings("ignore", module="seaborn")

# Loading and saving models
import pickle

# Import functions from sklearn for building the model, training-testing split, visualising the model and metrics
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import plot_tree
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Function to draw the model
def plot_decision_tree(tree_model, feature_names, fontsize=12):
    """Plot a fitted decision tree; feature_names lists the training columns, e.g. X.columns."""
    fig, ax = plt.subplots(figsize=(20, 8))
    plot_tree(tree_model,
        filled=True,
        impurity=False,
        feature_names=list(feature_names),
        class_names=["No", "Yes"],
        proportion=True,
        ax=ax,
        fontsize=fontsize)
    plt.show()

<a id='data'></a>
## Importing the data

In [None]:
hotel_data = pd.read_csv('bookings_2023.csv')
hotel_data

In [None]:
hotel_data.info()

In [None]:
hotel_data['canceled'] = hotel_data['is_canceled'].replace({0: 'No', 1: 'Yes'})
hotel_data['repeated_guest'] = hotel_data['is_repeated_guest'].replace({0: 'No', 1: 'Yes'})
hotel_data

Two new columns, canceled and repeated_guest, were added to convert the is_canceled and is_repeated_guest variables from numerical to categorical form, which makes them easier to analyse.

<a id='pre'></a>
# Preparing the data and exploratory data analysis

In [None]:
hotel_data.describe().T

## Explanation of the Numeric Variables
**lead_time:** The average time between booking and arrival is approximately 104 days, which means that customers book their reservations roughly three and a half months in advance on average.

**stays_in_weekend_nights:** Customers stay for around 1 weekend night on average, which suggests that a typical booking includes a weekend night.

**children:** There are about 0.1 children per booking on average, with a maximum of 10 children on a single booking.

**previous_cancellations:** Customers have about 0.09 previous cancellations on average. This may be a useful indicator of future behaviour, as a higher number of previous cancellations might be correlated with a higher chance of cancelling again.

**days_in_waiting_list:** The average time a booking spends on the waiting list is approximately 2.32 days. This may also help explain cancellations, as customers may be more likely to cancel the longer they wait.

**adr:** The average daily rate (ADR) is approximately 101.83. ADR measures the average amount paid per room sold over a given period. Customers might be more likely to cancel when the ADR is higher than usual. Note also that the minimum ADR of -6.38 points to potential errors in the data.

**Note:** The adults variable has a minimum value of 0, which is implausible because a booking needs at least one adult. This is another data quality concern.

## Explanation of the Categorical Variables

In [None]:
hotel_data.describe(include='object').T

In [None]:
hotel_data['hotel'].value_counts().plot.pie(autopct='%1.1f%%', figsize = (15,8));



**hotel:** There are two hotels. "City Hotel" has more bookings, accounting for 79,330 of the 119,390 entries.

**canceled:** Out of 119,390 entries, 75,166 bookings were not canceled.

**repeated_guest:** The majority of customers are not repeat guests.




In [None]:
hotel_data["customer_type"].value_counts()

There are four types of customers. The most frequent customer type is "Transient", followed by "Transient-Party".

In [None]:
hotel_data["deposit_type"].value_counts()

## Missing value handling

In [None]:
missing_values_percentage = (hotel_data.isnull().sum() / len(hotel_data)) * 100
missing_values_percentage[missing_values_percentage > 0]

Since only the children variable has some missing values, it is reasonable to replace them with 0.

In [None]:
# Impute missing values in 'children' column with 0
hotel_data['children'] = hotel_data['children'].fillna(0)

In [None]:
missing_values_percentage = (hotel_data.isnull().sum() / len(hotel_data)) * 100
missing_values_percentage[missing_values_percentage > 0]

The missing values have been handled: as you can see, no variable has missing values any more.
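The check-and-impute steps above can be tried on a small frame. This is only a sketch: `toy` and its values are made up for illustration and are not taken from the hotel data.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for hotel_data (values are invented for illustration)
toy = pd.DataFrame({
    "adults": [2, 1, 2, 2],
    "children": [0.0, np.nan, 1.0, np.nan],
})

# isnull().mean() gives the fraction of missing values per column
missing_pct = toy.isnull().mean() * 100
print(missing_pct[missing_pct > 0])

# Replace missing 'children' values with 0, as done in the notebook
toy["children"] = toy["children"].fillna(0)

# Total number of missing cells left in the frame
remaining = int(toy.isnull().sum().sum())
print(remaining)
```

Using `isnull().mean()` is equivalent to dividing `isnull().sum()` by the number of rows, so it gives the same percentages as the cell above.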

## Handling problematic variables

**Problematic values include bookings with 0 adults (which is implausible), a negative adr, 5 or more babies, and more than 8 children.**

In [None]:
# Count bookings with no adults
(hotel_data["adults"] == 0).sum()

In [None]:
# Count bookings with a negative average daily rate
(hotel_data["adr"] < 0).sum()

In [None]:
# Count bookings with 5 or more babies
(hotel_data["babies"] >= 5).sum()

In [None]:
# Count bookings with more than 8 children
(hotel_data["children"] > 8).sum()

**Given that these rows are a small proportion of the dataset, it seems reasonable to remove them.**

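In [None]:
# Keep only bookings with plausible values.
# Note: adr > 0 also drops bookings with a zero rate, not just the negative ones.
hotel_data = hotel_data[hotel_data["adr"] > 0]
hotel_data = hotel_data[hotel_data["adults"] != 0]
hotel_data = hotel_data[hotel_data["babies"] < 5]
# <= 8 matches the check above, which flags only bookings with more than 8 children
hotel_data = hotel_data[hotel_data["children"] <= 8]
hotel_data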
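The clean-up step can also be written with a single combined mask, which makes it easy to report how many rows are removed. The sketch below uses a made-up toy frame (`toy`) rather than the real data, and the thresholds follow the ones discussed in this section.

```python
import pandas as pd

# Toy bookings (invented values): only the first row passes every check
toy = pd.DataFrame({
    "adr":      [95.0, -6.38, 0.0, 120.0, 80.0],
    "adults":   [2, 2, 1, 0, 2],
    "babies":   [0, 0, 0, 0, 9],
    "children": [0, 0, 0, 0, 1],
})

# Combine the individual conditions into one boolean mask
valid = (
    (toy["adr"] > 0)
    & (toy["adults"] != 0)
    & (toy["babies"] < 5)
    & (toy["children"] <= 8)
)

cleaned = toy[valid]
n_removed = len(toy) - len(cleaned)
print(f"Removed {n_removed} of {len(toy)} rows")
```

Keeping the mask in a variable also makes it possible to inspect the rejected rows with `toy[~valid]` before dropping them.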

In [None]:
df = hotel_data.copy()

## Encoding Categorical Variables

Encoding is the process of converting categorical or textual data into a numerical format so that it can be used as input to a machine learning algorithm, since most algorithms cannot work with text labels directly. In this dataset, there are four categorical variables we are interested in: "hotel", "market_segment", "deposit_type" and "customer_type". Setting `drop_first=True` drops the first category of each variable, which then acts as the baseline.

In [None]:
one_hot_cols = ['hotel','market_segment','deposit_type', 'customer_type']
hotel_data = pd.get_dummies(hotel_data, columns=one_hot_cols, drop_first=True)
hotel_data
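To see what one-hot encoding does to a single column, here is a minimal sketch on a toy frame. The frame and its lead_time values are made up; only the column names mirror the dataset.

```python
import pandas as pd

# Toy frame with one categorical and one numeric column (invented values)
toy = pd.DataFrame({
    "deposit_type": ["No Deposit", "Non Refund", "Refundable", "No Deposit"],
    "lead_time": [10, 200, 35, 7],
})

# One indicator column per category; drop_first=True removes the first
# category ("No Deposit"), which becomes the implicit baseline
encoded = pd.get_dummies(toy, columns=["deposit_type"], drop_first=True)
print(encoded.columns.tolist())
# ['lead_time', 'deposit_type_Non Refund', 'deposit_type_Refundable']
```

A row where both indicator columns are 0 (or False) therefore corresponds to the baseline category "No Deposit".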

## Checking Data Imbalance

In [None]:
hotel_data["canceled"].value_counts()


There are 117,179 entries. 73,251 of them (63%) were not canceled and 43,928 (37%) were canceled. The classes are somewhat imbalanced, but not severely, so no balancing technique is applied to the target variable.
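The class shares can be computed directly rather than by hand. The sketch below reuses the two counts quoted above, and the short `labels` series is a made-up example showing the `normalize=True` shortcut.

```python
import pandas as pd

# Counts quoted in the text above
counts = pd.Series({"No": 73251, "Yes": 43928}, name="canceled")
shares = (counts / counts.sum() * 100).round(1)
print(shares)

# On a raw label column, value_counts(normalize=True) returns the shares directly
labels = pd.Series(["No"] * 5 + ["Yes"] * 3)  # made-up labels
proportions = labels.value_counts(normalize=True)
print(proportions)
```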

## Visualisation for exploratory data analysis

In [None]:
hotel_data.head()

In [None]:
plt.figure(figsize= (14,8))
sns.barplot(x = 'hotel', y = 'is_canceled',  data = df);
plt.tight_layout()

Bookings at the City Hotel are canceled more often than bookings at the Resort Hotel. One possible reason is that cancellation fees may be lower at the City Hotel than at the Resort Hotel.

In [None]:
sns.catplot(data=hotel_data, kind='box',x='lead_time', y='canceled', aspect=2);

Customers who cancel their bookings tend to have longer lead times. Booking far in advance brings more uncertainty, as it is hard to predict what might happen before the arrival date, so these bookings are more likely to be canceled.

In [None]:
plt.figure(figsize= (14,8))
sns.barplot(x = 'canceled', y = 'babies', data = df);
plt.tight_layout()

Canceled bookings have fewer babies on average, so bookings with babies appear less likely to be canceled. This is interesting, as one might assume that travelling with babies makes a cancellation more likely.

In [None]:
plt.figure(figsize= (14,8))
sns.barplot(x = 'canceled', y = 'lead_time', hue='customer_type', data = df);
plt.tight_layout()

This also shows that bookings with longer lead times are more likely to be canceled, especially for the Contract and Transient-Party customer types.

In [None]:
features=['lead_time', 'stays_in_weekend_nights', 'stays_in_week_nights', 'adults', 'children', 'babies',
'previous_cancellations', 'previous_bookings_not_canceled', 'booking_changes', 'days_in_waiting_list',
'required_car_parking_spaces', 'total_of_special_requests', 'adr']
# use group by to get the means of these for canceled and non-canceled bookings
df.groupby('canceled')[features].mean().round(2)

This table shows clear relationships between cancellations and several other variables.

1. Customers seem more likely to cancel as their time on the waiting list increases
2. Customers with more special requests are less likely to cancel
3. Customers with more previous cancellations are more likely to cancel

<a id='model'></a>
# Building a model

In [None]:
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

# Evaluation
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_val_score
from sklearn.metrics import f1_score

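**Some variables are unlikely to have a direct impact on whether a booking will be canceled, so we remove them before fitting the model. The helper columns `canceled` and `repeated_guest` are also removed: they duplicate `is_canceled` and `is_repeated_guest`, and keeping `canceled` would leak the target into the features.**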

In [None]:
hotel_data.drop(['meal', 'distribution_channel','reserved_room_type','assigned_room_type','canceled','repeated_guest'], axis = 1, inplace = True) 

In [None]:
# Define the features (X) and the output labels (y)
X = hotel_data.drop('is_canceled', axis=1)
y = hotel_data['is_canceled'] 

Before fitting the model, we split the data into two parts: a training set with 80% of the data and a test set with the remaining 20%, which is used to see how well the model predicts. This way the model is evaluated on data it has not seen before. Setting `stratify=y` keeps the proportion of canceled bookings the same in both sets.

In [None]:
# Splitting data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0, stratify=y)
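The effect of a stratified 80/20 split can be checked on synthetic labels. This is a sketch only: `X_toy` and `y_toy` are made up, with a class mix close to the one reported earlier.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels: 375 positives and 625 negatives (37.5% positive)
y_toy = np.array([1] * 375 + [0] * 625)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)  # placeholder feature

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=0, stratify=y_toy
)

print(len(y_tr), len(y_te))      # 800 200
print(y_tr.mean(), y_te.mean())  # 0.375 0.375
```

Without `stratify`, the share of positives in the two parts would vary slightly from one random split to another.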

In [None]:
X_train.head()

In [None]:
y_train.head()

## Decision Tree

**We used two models on this data: the first is a decision tree and the second is an XGBoost classifier.**

In [None]:
dt = DecisionTreeClassifier()
dt.fit(X_train , y_train)
dt_ypred = dt.predict(X_test)
dt_ypredTrain = dt.predict(X_train)
dt_testAccuracy = accuracy_score(y_test, dt_ypred)
dt_trainAccuracy = accuracy_score(y_train, dt_ypredTrain)
print("Accuracy on testing data:" ,dt_testAccuracy,"\nAccuracy on training data: ",dt_trainAccuracy)

The accuracy of the decision tree model on the testing data is around 82%, which is a good result.

The accuracy of the decision tree model on the training data is around 98%. This gap is a problem: it suggests that the model has partly memorised the training set rather than learning relationships that generalise. This problem is known as overfitting.

Although this model overfits, its accuracy on the testing data is still reasonably good.
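One common way to reduce this kind of overfitting is to limit the depth of the tree. The sketch below is not part of the original analysis: it uses synthetic data from `make_classification`, and the depth of 5 is an arbitrary illustrative choice, not a tuned value.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, slightly noisy classification data (not the hotel data)
X_toy, y_toy = make_classification(
    n_samples=2000, n_features=10, n_informative=4, flip_y=0.1, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=0, stratify=y_toy
)

# Compare an unrestricted tree with a depth-limited one
gaps = {}
for depth in (None, 5):
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_tr, y_tr)
    train_acc = tree.score(X_tr, y_tr)
    test_acc = tree.score(X_te, y_te)
    gaps[depth] = train_acc - test_acc  # a large gap is a sign of overfitting
    print(f"max_depth={depth}: train={train_acc:.3f}, test={test_acc:.3f}")
```

The unrestricted tree typically fits the training data almost perfectly, while the depth-limited tree shows a much smaller gap between training and testing accuracy.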

In [None]:
dt_f1score = f1_score(y_test, dt_ypred)

In [None]:
from yellowbrick.classifier import ClassificationReport
visualizer = ClassificationReport(dt, support=True)
visualizer.fit(X_train, y_train)       
visualizer.score(X_test, y_test)       
visualizer.show();

Here we have all of the results from the model. 

Recall is the proportion of actual positives that were identified correctly, and precision is the proportion of positive predictions that were actually correct.

So 76% of the actual canceled bookings were identified as canceled by our model.

And 76% of the bookings that were predicted as canceled were actually canceled.

The F1 score is the harmonic mean of precision and recall, which combines the two into a single metric to give a better understanding of model performance. For this model, the F1 score is 76% for canceled bookings and 85% for bookings that were not canceled.
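The three metrics can be computed by hand from the counts of true positives, false positives and false negatives. In the sketch below, `precision_recall_f1` is a hypothetical helper written for illustration, and the counts are made up rather than taken from the model's confusion matrix.

```python
def precision_recall_f1(tp, fp, fn):
    """Return precision, recall and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp)  # share of predicted positives that are correct
    recall = tp / (tp + fn)     # share of actual positives that are found
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f1

# Made-up counts: 76 cancellations caught, 24 false alarms, 24 missed
p, r, f1 = precision_recall_f1(tp=76, fp=24, fn=24)
print(round(p, 2), round(r, 2), round(f1, 2))  # 0.76 0.76 0.76
```

Because the harmonic mean is pulled towards the smaller of the two values, a model cannot reach a high F1 score by doing well on precision or recall alone.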

In [None]:
from yellowbrick.classifier import ConfusionMatrix
cm = ConfusionMatrix(dt)
cm.fit(X_train, y_train)
cm.score(X_test, y_test)
cm.show();

In [None]:
from yellowbrick.classifier import ClassPredictionError
visualizer = ClassPredictionError(dt)
visualizer.fit(X_train, y_train) 
visualizer.score(X_test, y_test) 
visualizer.show();

## XGBoost Classifier

This is the second model

In [None]:
from xgboost import XGBClassifier
xgb = XGBClassifier()
xgb.fit(X_train , y_train)
xgb_ypred = xgb.predict(X_test)
xgb_ypredTrain = xgb.predict(X_train)
xgb_testAccuracy = accuracy_score(y_test, xgb_ypred)
xgb_trainAccuracy = accuracy_score(y_train, xgb_ypredTrain)
print("Accuracy on testing data:" ,xgb_testAccuracy,"\nAccuracy on training data: ",xgb_trainAccuracy)

This model performs better than the decision tree model. Its accuracy on the testing data is 84%, which is 2 percentage points higher than the decision tree model. Its accuracy on the training data is 85%, so the gap between training and testing accuracy is small and the model does not show the overfitting problem seen with the decision tree.

Overall, this model is better than the decision tree model, which is why it should be used to predict future bookings.

In [None]:
visualizer = ClassificationReport(xgb, support=True)
visualizer.fit(X_train, y_train)       
visualizer.score(X_test, y_test)        
visualizer.show();

Here we have all of the results from the XGBoost classifier model.

85% of the actual canceled bookings were identified as canceled by our model.

And 70% of the bookings that were predicted as canceled were actually canceled.

For this model, the F1 score is 76% for canceled bookings and 88% for bookings that were not canceled.

**These results also indicate that this is the better model overall.**

In [None]:
from yellowbrick.classifier import ConfusionMatrix
cm = ConfusionMatrix(xgb)
cm.fit(X_train, y_train)
cm.score(X_test, y_test)
cm.show();

## Feature Importance of the XGBoost Model

In [None]:
feature_important = xgb.get_booster().get_score(importance_type='weight')
keys = list(feature_important.keys())
values = list(feature_important.values())

data = pd.DataFrame(data=values, index=keys, columns=["score"]).sort_values(by = "score", ascending=False)
data.nlargest(40, columns="score").plot(kind='barh', figsize = (20,10)) 

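This shows which variables the model relies on most when predicting cancellations. With `importance_type='weight'`, the score counts how many times a feature is used to split the data across all trees, so it measures how often a feature is used rather than how much each split improves the predictions. By this measure, adr and lead_time are by far the most important variables.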
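A complementary way to judge importance is permutation importance, which measures how much the test score drops when one feature's values are shuffled. This is a different technique from the split counts used above, shown here only as a sketch: it uses scikit-learn's `permutation_importance` with a small decision tree on synthetic data, not the XGBoost model or the hotel data.

```python
from sklearn.datasets import make_classification
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data: with shuffle=False the first two columns are informative
# and the remaining three are pure noise
X_toy, y_toy = make_classification(
    n_samples=1000, n_features=5, n_informative=2, n_redundant=0,
    shuffle=False, random_state=0,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=0, stratify=y_toy
)

model = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X_tr, y_tr)

# Shuffle each feature 10 times and record the average drop in test accuracy
result = permutation_importance(model, X_te, y_te, n_repeats=10, random_state=0)
for i, score in enumerate(result.importances_mean):
    print(f"feature {i}: {score:.3f}")
```

Features whose shuffling barely changes the score contribute little to the predictions, however often the trees split on them.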

## Saving the Model

In [None]:
# Use a context manager so the file is closed after writing
with open('xgb.pkl', 'wb') as f:
    pickle.dump(xgb, f)

This code saves the model so it can be shared and used by other data scientists.
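Loading the model back is the mirror image of saving it. The sketch below round-trips a small stand-in model through a temporary file; the toy decision tree and the file name `model.pkl` are illustrative and are not the notebook's XGBoost model.

```python
import os
import pickle
import tempfile

from sklearn.tree import DecisionTreeClassifier

# Small stand-in model trained on four made-up points
model = DecisionTreeClassifier(random_state=0).fit([[0], [1], [2], [3]], [0, 0, 1, 1])

# Save the model to a temporary location
path = os.path.join(tempfile.mkdtemp(), "model.pkl")
with open(path, "wb") as f:
    pickle.dump(model, f)

# Load it back and check that it predicts the same labels
with open(path, "rb") as f:
    restored = pickle.load(f)

same = list(restored.predict([[0], [3]])) == list(model.predict([[0], [3]]))
print(same)  # True
```

Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code, and the loading environment should use the same library versions that were used for saving.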

<a id='conclusion'></a>
# Conclusion

In this project, our aim was to create a model that predicts hotel cancellations and to identify which variables have the most impact on them.

First, we explored and analysed the data. We looked for relationships between variables using visualisations and tables.

Then, we cleaned the data by handling missing values, removing rows with problematic values and dropping variables that were not needed for the model. This process was needed to prepare the data for modelling.

After that, we used two models: a decision tree and an XGBoost classifier. Looking at the accuracy, precision, recall and F1 scores of both models, we concluded that the XGBoost model was the better one to use.

Finally, we looked at feature importance, where we found adr (average daily rate) to be the variable with the most impact on booking cancellations, followed by lead_time.

This model can be improved further, for example by using hyperparameter tuning techniques.
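As a starting point for such tuning, a grid search with cross-validation could look like the sketch below. It is an illustration under stated assumptions: the data is synthetic, the model is a decision tree rather than XGBoost, and the parameter grid is an arbitrary example rather than a recommended search space.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic data standing in for the training set
X_toy, y_toy = make_classification(n_samples=500, n_features=8, random_state=0)

# Candidate values to try for each hyperparameter (illustrative only)
param_grid = {"max_depth": [3, 5, None], "min_samples_leaf": [1, 10]}

# Evaluate every combination with 3-fold cross-validation, scored by F1
search = GridSearchCV(
    DecisionTreeClassifier(random_state=0), param_grid, cv=3, scoring="f1"
)
search.fit(X_toy, y_toy)

print(search.best_params_, round(search.best_score_, 3))
```

The grid search should be fitted on the training set only, so that the test set remains unseen for the final evaluation.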