In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

 **“Predicting Hotel Booking Cancellations with Python”**
1. Introduction
        The Covid-19 pandemic has disrupted travel plans worldwide as travel restrictions were imposed and flights were cancelled. Visitors have scrambled to cancel their bookings for hotels and tours, and the global travel industry has been overwhelmed by the large number of coronavirus-induced cancellations. But hotel cancellations are nothing new.

2. Objectives of Study
        • To evaluate feature importance, i.e. which features matter most for predicting hotel booking cancellations.
        • To predict which guests are most likely to cancel their reservation, which helps to generate better forecasts and reduce uncertainty in business decisions.
        • To build a model that can predict bookings with a high cancellation probability.



In [None]:
#Import Libraries

import numpy as np
import pandas as pd
import seaborn as sns

import matplotlib.pyplot as plt
import statsmodels.formula.api as smf

from sklearn.metrics import accuracy_score, roc_auc_score, roc_curve, confusion_matrix, auc
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.preprocessing import LabelEncoder, StandardScaler 
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

In [None]:
# load data:
hotel = pd.read_csv("/kaggle/input/hotel-booking-demand/hotel_bookings.csv")

In [None]:
# Quick look or Sample Data 
hotel.head()

1. **Data Cleaning**

In [None]:
# checking for missing values
hotel.isnull().sum()

In [None]:
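# handling missing values
# Assign the result back instead of calling fillna(inplace=True) on a column
# selection, which may not update the DataFrame in recent pandas versions.
hotel['children'] = hotel['children'].fillna(0)
hotel['country'] = hotel['country'].fillna('PRT')
hotel = hotel.drop(['agent','company'], axis=1)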
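
The choice between filling and dropping is easier to justify when the share of missing values per column is visible. The following is a minimal sketch on a small invented frame, not the booking data itself; the values, the 50% threshold, and the idea of filling `country` with its most frequent value are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the booking data (all values are invented).
toy = pd.DataFrame({
    "children": [0.0, 1.0, np.nan, 0.0],
    "country": ["PRT", None, "GBR", "PRT"],
    "company": [np.nan, np.nan, np.nan, 40.0],
})

# Fraction of missing values per column.
missing_share = toy.isnull().mean()

# Hypothetical rule: drop columns that are mostly empty, fill the rest.
threshold = 0.5
to_drop = missing_share[missing_share > threshold].index.tolist()
toy = toy.drop(columns=to_drop)
toy["children"] = toy["children"].fillna(0)
toy["country"] = toy["country"].fillna(toy["country"].mode()[0])

print(missing_share)
print(toy)
```

On the real data the same `isnull().mean()` call would show whether `agent` and `company` are sparse enough to justify dropping them.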

**Data Visualization**

In [None]:
# How many Bookings were Cancelled at the Hotel?
sns.set(style = "darkgrid")
plt.title("Is Booking Canceled or not", fontdict = {'fontsize': 20})
ax = sns.countplot(x = "is_canceled", data = hotel)

According to this graph, 63% of bookings were not canceled and 37% of bookings were canceled.
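
A count plot shows absolute counts, so percentages like these have to be computed separately. A minimal sketch, assuming a 0/1 column like `is_canceled`; the eight labels below are invented for illustration.

```python
import pandas as pd

# Invented labels: 1 = canceled, 0 = not canceled.
is_canceled = pd.Series([0, 0, 1, 0, 1, 0, 0, 1], name="is_canceled")

# Share of each class in percent.
share = is_canceled.value_counts(normalize=True).mul(100).round(2)
print(share)
```

Applied to the notebook's data, the same call would be `hotel['is_canceled'].value_counts(normalize=True)`.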

In [None]:
#How many Bookings were Cancelled by Hotel Type?
sns.set(style = "darkgrid")
plt.title("Is Canceled or not by Hotel Type", fontdict = {'fontsize': 20})
ax = sns.countplot(x = "hotel", hue = 'is_canceled', data = hotel)

About 27% of resort hotel bookings and about 40% of city hotel bookings were cancelled. These rates are high and can affect the hotels' sales and revenue.
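
Because `is_canceled` is coded 0/1, the mean of the column within a group is that group's cancellation rate. A small sketch on invented rows (the real column names are reused, the values are not from the dataset):

```python
import pandas as pd

toy = pd.DataFrame({
    "hotel": ["City Hotel"] * 4 + ["Resort Hotel"] * 4,
    "is_canceled": [1, 1, 0, 0, 1, 0, 0, 0],
})

# Mean of a 0/1 column per group = cancellation rate per group (in percent).
rate_by_hotel = toy.groupby("hotel")["is_canceled"].mean().mul(100)
print(rate_by_hotel)
```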

In [None]:
# Which Month is the Most Occupied with Bookings at the Hotel?
month_sorted = ['January','February','March','April','May','June','July','August','September','October','November','December']
plt.figure(figsize=(14,6))
plt.title("What times of the year do we have the highest bookings?", fontdict = {'fontsize': 20})
# Pass the column by keyword; recent seaborn versions reject positional data vectors
sns.countplot(x = 'arrival_date_month', data = hotel, palette='tab10', order = month_sorted)
plt.xticks(rotation = 90)
plt.show()

According to the graph, August is the busiest month with 11.66% of bookings, and January is the quietest month with 5% of bookings.

In [None]:
# Which month has the highest cancellation rate?
month_sorted = ['January','February','March','April','May','June','July','August','September','October','November','December']
plt.figure(figsize = (13,10))
sns.set(style="dark")
plt.title("Cancellation Rate -- Monthly", fontdict={'fontsize': 20})
# barplot shows the mean of is_canceled per month, i.e. the cancellation rate
sns.barplot(x = 'arrival_date_month', y = 'is_canceled', data = hotel, order = month_sorted);

Cancellation rates do not differ greatly between months, but the lowest-demand months have the lowest cancellation rates and the highest-demand months have the highest. Cancellations peak in June, July and August and are lowest in November, December and January.
This pattern works in the hotels' favor, since cancelled rooms are easier to fill during the peak season.



In [None]:
# Total Number of Bookings by Market Segment
plt.figure(figsize = (13,10))
sns.set(style = "darkgrid")
plt.title("Bookings by Market Segment", fontdict = {'fontsize':20})
ax = sns.countplot(x = "market_segment", data = hotel)

Around 47% of bookings are made via online travel agents, almost 20% via offline travel agents, and less than 20% are direct bookings made without any agent.

In [None]:
# Total Number of Bookings cancellation by Market Segment
plt.figure(figsize = (13,10))
sns.set(style = "darkgrid")
plt.title("Booking Cancellation by Segments", fontdict = {'fontsize':20})
ax = sns.countplot(x = "market_segment", hue = 'is_canceled', data = hotel)

The Groups segment has a cancellation rate of more than 50%. The Offline TA/TO (Travel Agents/Tour Operators) and Online TA segments have cancellation rates of more than 33%. The Direct segment has a cancellation rate of less than 20%.
It is surprising that the cancellation rate in these segments is high despite the application of a deposit. Group reservations tend to be cancelled collectively, which helps explain the high cancellation rate in that segment.
    Cancellation rates for online reservations are high, as expected in a dynamic environment with high turnover. The cancellation rate in the Direct segment, by contrast, is strikingly low. I believe that one-to-one communication establishes mutual trust between guest and hotel. I will not pursue this further, but I think there is a psychological factor here.


In [None]:
# Arrival Date Year vs. Lead Time By Booking Cancellation Status
sns.set(style = "darkgrid")
plt.title("Arrival Date Year vs Lead Time By Booking Cancellation Status", fontdict = {'fontsize': 20})
ax = sns.barplot(x = "arrival_date_year", y = "lead_time" ,hue = 'is_canceled', data = hotel)


In all three years, bookings with a lead time of less than 100 days are less likely to be canceled, and bookings with a lead time of more than 100 days are more likely to be canceled.
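
The bar plot shows average lead time per group, so a direct check of the 100-day claim needs the cancellation rate on each side of that threshold. A minimal sketch using `pd.cut` on invented rows; the bucket labels are assumptions for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    "lead_time": [5, 30, 80, 99, 120, 200, 310, 400],
    "is_canceled": [0, 0, 1, 0, 1, 0, 1, 1],
})

# Split lead time at 100 days; right=False gives the bins [0, 100) and [100, inf).
bins = [0, 100, float("inf")]
labels = ["< 100 days", ">= 100 days"]
toy["lead_bucket"] = pd.cut(toy["lead_time"], bins=bins, labels=labels, right=False)

# Cancellation rate within each lead-time bucket.
rate_by_bucket = toy.groupby("lead_bucket", observed=True)["is_canceled"].mean()
print(rate_by_bucket)
```

Adding `arrival_date_year` to the `groupby` keys would give the same comparison per year.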

In [None]:
# deposit type vs cancellation status
plt.figure(figsize=(14,6))
plt.title("Booking Canceled or not by Deposit Type", fontdict = {'fontsize': 20})
sns.countplot(x='deposit_type',data=hotel,hue='is_canceled',palette='hls')
plt.show()

Around 28% of bookings were canceled by guests who paid no deposit, followed by 22% of bookings canceled with a refundable deposit. These numbers are costly if the hotels cannot replace the cancelled bookings in time. This suggests that guests who pay no deposit when booking are more likely to cancel.
  It is also interesting that non-refundable deposits had more cancellations than refundable deposits. One would have expected refundable deposits to have more cancellations, since hotel rates are usually higher for refundable rooms and customers pay more in anticipation of cancelling.
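
Percentages of this kind depend on the denominator: the cancellation rate within a deposit type is not the same as that type's canceled bookings as a share of all bookings. A minimal sketch with `pd.crosstab` on invented rows shows both views.

```python
import pandas as pd

toy = pd.DataFrame({
    "deposit_type": ["No Deposit"] * 6 + ["Non Refund"] * 2,
    "is_canceled": [1, 0, 0, 0, 0, 0, 1, 1],
})

# Rate within each deposit type (each row sums to 1).
within = pd.crosstab(toy["deposit_type"], toy["is_canceled"], normalize="index")

# Share of all bookings (the whole table sums to 1).
overall = pd.crosstab(toy["deposit_type"], toy["is_canceled"], normalize="all")

print(within)
print(overall)
```

In this toy example the no-deposit group has the lower rate within its own group but is much larger, which is why stating the denominator matters when comparing deposit types.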


In [None]:
#Some subplots of the remaining attributes:
plt.figure(figsize=(16,12))
sns.set(palette = "tab10")
# Columns are passed by keyword (x=, hue=, data=); recent seaborn versions
# no longer accept a data vector as the first positional argument.
plt.subplot(221)
sns.countplot(x='meal', hue='is_canceled', data=hotel)
plt.xlabel('Meal Type')
plt.subplot(222)
sns.countplot(x='customer_type', hue='is_canceled', data=hotel)
plt.xlabel('Customer Type')
plt.subplot(223)
sns.countplot(x='reserved_room_type', hue='is_canceled', data=hotel)
plt.xlabel('Reserved Room Type')
plt.subplot(224)
sns.countplot(x='reservation_status', hue='is_canceled', data=hotel)
plt.xlabel('Reservation Status')
plt.show()

It's clear that meal type and reserved room type don't have bookings evenly distributed. In these features, bookings heavily favor one category and hence we will drop these columns. We will drop deposit type (visualized previously) for the same reasons.

**Model building**



**1. Data Cleaning**
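The “.dropna” function can be used to omit null values; here the missing values were already handled in the cleaning step above.
The columns that will not be used are dropped, and the remaining categorical variables are converted to dummy variables with the following code: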


In [None]:
hotel = hotel.drop(['meal','country','reserved_room_type','assigned_room_type','deposit_type','reservation_status','reservation_status_date'], axis=1)
hotel = pd.concat([hotel, 
                 pd.get_dummies(hotel['hotel'], drop_first=True), 
                 pd.get_dummies(hotel['arrival_date_month'], drop_first=True), 
                 pd.get_dummies(hotel['market_segment'], drop_first=True),
                 pd.get_dummies(hotel['distribution_channel'], drop_first=True),
                 pd.get_dummies(hotel['customer_type'], drop_first=True)
                 ], axis=1)
hotel = hotel.drop(['hotel','arrival_date_month','market_segment','distribution_channel','customer_type'], axis=1)
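
When dummies from several columns are concatenated without a prefix, two source columns that share a category label (for example a "Direct" value in both `market_segment` and `distribution_channel`) produce dummy columns with identical names. The sketch below is an alternative under that assumption, not the notebook's method: passing `columns=` to `pd.get_dummies` prefixes each dummy with its source column. The three rows are invented.

```python
import pandas as pd

toy = pd.DataFrame({
    "is_canceled": [0, 1, 0],
    "market_segment": ["Direct", "Corporate", "Online TA"],
    "distribution_channel": ["Direct", "Corporate", "TA/TO"],
})

# columns= encodes the listed columns in place and prefixes each dummy
# with the source column name, so the resulting names stay unique.
encoded = pd.get_dummies(
    toy,
    columns=["market_segment", "distribution_channel"],
    drop_first=True,
    dtype=int,
)
print(encoded.columns.tolist())
```

The target column stays first, so selecting features with `iloc[:, 1:]` would still work on a frame encoded this way.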

**Split data into training and test data.**
The data is then split into training and test sets using the ‘train_test_split’ function, with 25% of the rows held out for testing.

In [None]:
X = hotel.iloc[:, 1:].values
y = hotel.iloc[:, 0].values
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 0)
print(X)
print(y)

**Train or fit the data and apply K-Nearest Neighbors model:**

In [None]:
# Empty dictionary of model accuracy results
model_accuracy_results = {}

# Function for calculating accuracy from confusion matrix
from sklearn.metrics import confusion_matrix
def model_accuracy(y_test, y_pred):
    cm = confusion_matrix(y_test, y_pred)
    accuracy = ((cm[0,0] + cm[1,1]) * 100 / len(y_test)).round(2)
    return accuracy
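
The helper above computes accuracy as the diagonal of the confusion matrix divided by the number of samples, which is the same quantity `accuracy_score` returns. A quick check on invented labels:

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Invented true and predicted labels for illustration.
y_true = np.array([0, 0, 1, 1, 1, 0, 1, 0])
y_hat = np.array([0, 1, 1, 1, 0, 0, 1, 0])

cm = confusion_matrix(y_true, y_hat)

# Correct predictions sit on the diagonal of the confusion matrix.
from_cm = np.trace(cm) / cm.sum()
from_sklearn = accuracy_score(y_true, y_hat)

print(from_cm, from_sklearn)
```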

In [None]:
# Fit and train
from sklearn.neighbors import KNeighborsClassifier
classifier = KNeighborsClassifier(n_neighbors = 10)
classifier.fit(X_train,y_train)

# Predict
y_pred = classifier.predict(X_test)

# Computing accuracy
model_accuracy_results['KNearestNeighbors'] = model_accuracy(y_test, y_pred)
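
K-Nearest Neighbors relies on distances, so features on large scales (such as `lead_time` or `adr`) can dominate the 0/1 dummy columns. The notebook imports `StandardScaler` but fits K-NN on unscaled data. The sketch below shows one way to combine the two in a pipeline; it runs on synthetic data from `make_classification` with one feature deliberately inflated, so the scores it prints say nothing about the hotel data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data; one feature is given a much larger scale.
X_demo, y_demo = make_classification(n_samples=400, n_features=6, random_state=0)
X_demo[:, 0] *= 1000

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0, stratify=y_demo
)

# K-NN on raw features versus K-NN on standardized features.
raw_knn = KNeighborsClassifier(n_neighbors=10).fit(X_tr, y_tr)
scaled_knn = make_pipeline(
    StandardScaler(), KNeighborsClassifier(n_neighbors=10)
).fit(X_tr, y_tr)

print("raw:", raw_knn.score(X_te, y_te))
print("scaled:", scaled_knn.score(X_te, y_te))
```

Putting the scaler inside the pipeline means it is fitted on the training split only, which avoids leaking test-set statistics into the model.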

**Train or fit the data and apply Random Forest classifier model:**

In [None]:
# Fit and train
from sklearn.ensemble import RandomForestClassifier
classifier = RandomForestClassifier(n_estimators=10, criterion='entropy', random_state=0)
classifier.fit(X_train,y_train)

# Predict
y_pred = classifier.predict(X_test)

from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
result = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:")
print(result)
result1 = classification_report(y_test, y_pred)
print("Classification Report:")
print(result1)
# Computing accuracy
model_accuracy_results['RandomForest'] = model_accuracy(y_test, y_pred)

**Comparing model accuracy of K-NN model and Random Forest Classifier model:**

In [None]:
df_model_accuracies = pd.DataFrame(list(model_accuracy_results.values()), index=model_accuracy_results.keys(), columns=['Accuracy'])
df_model_accuracies

Comparing the accuracy of both models, the Random Forest Classifier has the higher accuracy (85.34%) compared with the K-NN model (76.44%).
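
A single train/test split gives one accuracy figure per model, and that figure can shift with the split. Cross-validation is one way to see how stable the comparison is. The sketch below uses `cross_val_score` (already imported in the notebook) with the same two model configurations, but on synthetic data, so its numbers are not comparable to the accuracies above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the prepared feature matrix and target.
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

models = {
    "KNearestNeighbors": KNeighborsClassifier(n_neighbors=10),
    "RandomForest": RandomForestClassifier(
        n_estimators=10, criterion="entropy", random_state=0
    ),
}

# Mean and spread of accuracy over 5 folds for each model.
cv_results = {}
for name, model in models.items():
    scores = cross_val_score(model, X_demo, y_demo, cv=5, scoring="accuracy")
    cv_results[name] = (scores.mean(), scores.std())
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```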

**Conclusion**
• Cancellations and their prediction are a real problem for the tourism industry. A good understanding of this problem and of the features related to it will be very useful for reducing investment risk in this important industry.
• The features most important for predicting hotel booking cancellations are lead_time, deposit_type, arrival_date_day_of_month, country, arrival_date_year, adr, and market_segment.
• The Random Forest algorithm predicts hotel booking cancellations with higher accuracy (85.34%) than K-Nearest Neighbors.
• Online booking websites also encourage customers to book several hotels and decide later which one to stay in, which contributes to the rising number of cancellations. But technological advancement is not the only reason hotels see more cancellations; psychology plays a role as well. Consumers always look for ways to minimize what they pay, so if they find the same thing at a lower price than they paid, they will try to cancel and repurchase, and that is what usually happens with hotel bookings.
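
The notebook does not show the code behind the feature ranking. One common way to obtain such a ranking from a fitted Random Forest is its `feature_importances_` attribute. The sketch below uses synthetic data and placeholder feature names (`feature_0` and so on are hypothetical); with the notebook's data, the names would have to be kept separately, for example from `hotel.columns[1:]`, because `X` is converted to a NumPy array.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Placeholder names and synthetic data for illustration only.
feature_names = [f"feature_{i}" for i in range(6)]
X_demo, y_demo = make_classification(
    n_samples=300, n_features=6, n_informative=3, random_state=0
)

forest = RandomForestClassifier(n_estimators=10, criterion="entropy", random_state=0)
forest.fit(X_demo, y_demo)

# Impurity-based importances, one per feature, sorted from most to least important.
importances = pd.Series(
    forest.feature_importances_, index=feature_names
).sort_values(ascending=False)
print(importances)
```

Impurity-based importances can only rank columns that were passed to the model, so features dropped before training will not appear in such a ranking.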
