__Hotel Booking__ - Notebook to perform EDA on hotel booking data and build a model to predict cancellation of bookings



![](https://media-cdn.tripadvisor.com/media/photo-s/16/1a/ea/54/hotel-presidente-4s.jpg)

## 1. Loading libraries

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

import warnings
warnings.filterwarnings('ignore')

%matplotlib inline

## 2. Loading dataset

In [None]:
df = pd.read_csv("../input/hotel-booking-demand/hotel_bookings.csv")

## 3. Data overview

In [None]:
df.head()

The dataset has 36 columns, so let's list them first.

In [None]:
df.columns

Our objective is to predict cancellations, and many of these columns could plausibly play a role.

In [None]:
df.describe()

In [None]:
df.info()

Later on we'll split our categorical and continuous data into two different groups. First, the data needs some cleaning.

## 4. Data Cleaning

In [None]:
def missing_percent(df):
    nan_percent= 100*(df.isnull().sum()/len(df))
    nan_percent= nan_percent[nan_percent>0].sort_values()
    return nan_percent

In [None]:
missing_percent(df)

`company` has too many missing values, so let's just drop it.

In [None]:
df = df.drop("company",axis = 1)

Now let's deal with the `agent` column.

In [None]:
df["agent"].value_counts()

So let's just replace each `NaN` with the previous valid value in that column (forward fill), and later convert the column to `str` type.

In [None]:
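# Forward fill: each NaN takes the value of the closest valid row above it
# (fillna(method='ffill') is deprecated in recent pandas versions)
df["agent"] = df["agent"].ffill()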
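
To make the forward fill above concrete, here is a minimal sketch on a toy series. The values are made up for illustration and are not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the `agent` column (made-up values).
agent = pd.Series([9.0, np.nan, np.nan, 240.0, np.nan])

# Forward fill copies the most recent non-missing value downwards.
filled = agent.ffill()
print(filled.tolist())

# A leading NaN has nothing above it to copy, so it stays missing.
leading = pd.Series([np.nan, 9.0]).ffill()
print(int(leading.isna().sum()))
```

Note that a forward fill depends on row order, so it is more of a quick fix than a principled imputation.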

Now just drop every remaining row that has a missing value.

In [None]:
df = df.dropna()

Now we're good to go. Let's just drop unrelated columns.

## 5. Dropping columns

In [None]:
# df[]

For sure `credit_card` doesn't have any valuable information, dropped!

The same argument goes for `email` and `name`. But what about `phone-number`? Could we extract something meaningful out of it?

In [None]:
# phone_code = df["phone-number"].apply(lambda p : p.split("-")[0])
# phone_code.value_counts()

As we can see, `phone_code` might be useful in the future, so we'll make a new column and then drop `phone-number`.

In [None]:
# df["phone_code"] = phone_code.astype("str")
df["agent"] = df["agent"].astype("str")

In [None]:
# df = df.drop(["email","name"], axis = 1)

## 6. EDA

Since our main target is `is_canceled`, let's jump into it and then explore everything related.

In [None]:
# Pass the column as x explicitly; recent seaborn versions no longer accept it positionally
sns.countplot(x=df["is_canceled"])

* The cancellation rate is pretty high.
* More than 70,000 bookings were not canceled vs. more than 40,000 canceled.
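
To put a number on the imbalance instead of reading it off the plot, `value_counts(normalize=True)` gives the share of each class. This is a sketch on made-up data: the 7:4 split only mimics the rough ratio above and is not the real count.

```python
import pandas as pd

# Toy target column: 7 kept bookings and 4 cancellations (made-up numbers).
is_canceled = pd.Series([0] * 7 + [1] * 4, name="is_canceled")

# Share of each class instead of raw counts.
share = is_canceled.value_counts(normalize=True)
print(share.round(3))
```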

In [None]:
sns.countplot(x=df["reservation_status"], hue=df["is_canceled"])

As we can see, this column is pretty much a cheat code. `reservation_status` is just `is_canceled` plus some `No-show`s, so if we want to predict which guests cancel, we have to remove this column.
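
One way to confirm this kind of leakage is a cross-tabulation: if every status value maps to exactly one value of the target, the column gives the answer away. The sketch below uses a tiny made-up frame, not the real data.

```python
import pandas as pd

# Toy rows (made up) that mirror the pattern seen in the plot.
toy = pd.DataFrame({
    "reservation_status": ["Check-Out", "Canceled", "No-Show", "Check-Out", "Canceled"],
    "is_canceled": [0, 1, 1, 0, 1],
})

# Rows: status values, columns: target values, cells: counts.
table = pd.crosstab(toy["reservation_status"], toy["is_canceled"])
print(table)

# True when every status appears with only one target value.
leaks_target = bool((table > 0).sum(axis=1).eq(1).all())
print(leaks_target)
```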

In [None]:
sns.histplot(df["lead_time"])
plt.xlim(0,500)

As we can see, `lead_time` follows a roughly exponential distribution!

Now let's compare cancellations between the two hotels.

In [None]:
sns.countplot(x=df["hotel"], hue=df["is_canceled"])

The cancellation rate is higher for `City Hotel`. This could help us with our model.

In [None]:
fig, ax =plt.subplots(2,1)
sns.barplot(x = df["arrival_date_year"], y = df["is_canceled"], ax = ax[0])
sns.barplot(x = df["arrival_date_month"], y = df["is_canceled"], ax = ax[1])
plt.xticks(rotation=90)
plt.show()

The plots suggest that `arrival_date_year` and `arrival_date_month` don't play a heavy role here, so maybe we should simplify our model and remove these columns.

In [None]:
sns.barplot(x = df["customer_type"], y = df["is_canceled"])

We must definitely consider `customer_type` in our model. 

In [None]:
sns.barplot(x = df["is_canceled"], y = df["previous_cancellations"], hue = df["is_repeated_guest"])

* Customers with a history of cancellations tend to cancel more often.
* Interestingly repeated guests tend to cancel more!
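
The bar plot shows the mean of `previous_cancellations` per group. To look at the cancellation rate per guest type directly, a `groupby` on the target works too. A minimal sketch on made-up rows (the numbers say nothing about the real data):

```python
import pandas as pd

# Toy bookings (made up): four first-time guests, two repeated guests.
toy = pd.DataFrame({
    "is_repeated_guest": [0, 0, 0, 0, 1, 1],
    "is_canceled":       [1, 0, 1, 0, 0, 0],
})

# Mean of a 0/1 column is the fraction of cancellations in each group.
cancel_rate = toy.groupby("is_repeated_guest")["is_canceled"].mean()
print(cancel_rate)
```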

In [None]:
sns.barplot(x = df["is_canceled"], y = df["days_in_waiting_list"])

The more days on the waiting list, the higher the chance of cancelling.

In [None]:
sns.barplot(y = df["is_canceled"], x = df["deposit_type"])

* Bookings with a Non Refundable deposit tend to cancel more.
* No Deposit and Refundable types are more or less the same.

In [None]:
month = pd.to_datetime(df["reservation_status_date"]).dt.month
year = pd.to_datetime(df["reservation_status_date"]).dt.year


In [None]:
fig, ax =plt.subplots(2,1)
sns.countplot(x=month, ax = ax[0])
sns.countplot(x=year, ax = ax[1])

We extracted `month` and `year` from `reservation_status_date` to get some useful insights.

With these, we can conclude that the reservation status date plays a big role in whether a guest cancels or not.

We can see that cancellations tend to decrease as we get close to the end of the year, after a sudden rise in July (which could be accounted for by summer travel).

Likewise, cancellations vary by year.

In [None]:
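# Correlation is only defined for numeric columns; recent pandas raises on object columns otherwise
a = df.corr(numeric_only=True)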
plt.figure(figsize=(12,12))
k = 15
cols = a.nlargest(k, 'is_canceled')['is_canceled'].index
cm = np.corrcoef(df[cols].values.T)
sns.set(font_scale=1.25)
hm = sns.heatmap(cm, annot=True, square=True, fmt='.2f', annot_kws={'size': 12}, yticklabels=cols.values, xticklabels=cols.values)
plt.show()

In [None]:
# Absolute correlation, restricted to the numeric columns
cor = df.corr(numeric_only=True).abs()
cor_mat = cor["is_canceled"].sort_values(ascending=True)
cor_mat*100

Finally, we took the absolute value of the correlation matrix and sorted it.

Columns with an absolute correlation below 3% get dropped.
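
The threshold step can also be written out in code rather than done by eye. Here is a sketch on a small hand-built frame; the column names `signal` and `noise` are hypothetical, and `noise` is constructed to be exactly uncorrelated with the target.

```python
import pandas as pd

# Hand-built toy data: `signal` follows the target, `noise` is balanced against it.
toy = pd.DataFrame({
    "is_canceled": [0, 1, 0, 1, 0, 1, 0, 1],
    "signal":      [0.1, 0.9, 0.2, 1.1, 0.0, 1.0, 0.1, 0.8],
    "noise":       [1, 1, -1, -1, 1, 1, -1, -1],
})

# Absolute correlation of every feature with the target.
abs_corr = toy.corr(numeric_only=True)["is_canceled"].abs().drop("is_canceled")

# Features under the 3% threshold are candidates for dropping.
weak = abs_corr[abs_corr < 0.03].index.tolist()
print(weak)
```

Keep in mind that a low linear correlation does not rule out a non-linear relationship, so this is only a rough filter.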

## 7. Feature engineering

In [None]:
df.drop(['days_in_waiting_list', 'arrival_date_year', 'assigned_room_type', 'booking_changes',
         'reservation_status', 'country'], axis = 1, inplace=True)

Now we know what parameters we're dealing with!

So it seems like `children`, `stays_in_weekend_nights`, `stays_in_week_nights`, `arrival_date_year`, `arrival_date_week_number`, `arrival_date_day_of_month` and `babies` don't contribute much to our model.

In [None]:
df.columns

## 8. Feature Selection

First we split the data into categorical and numerical.

In [None]:
# Collect the object-dtype (categorical) columns.
# .copy() keeps later edits to catg from raising SettingWithCopyWarning.
catg = df.select_dtypes(include="object").copy()
catg

In [None]:
num = df.drop(columns=catg.columns)
num = num.drop("is_canceled", axis = 1)
num

We log-transform the skewed numerical features for better performance.

In [None]:
num['lead_time'] = np.log(num['lead_time'] + 1)
num['arrival_date_week_number'] = np.log(num['arrival_date_week_number'] + 1)
num['arrival_date_day_of_month'] = np.log(num['arrival_date_day_of_month'] + 1)
num['adr'] = np.log(num['adr'] + 1)
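
As a side note, `np.log(x + 1)` is what `np.log1p(x)` computes. The sketch below checks that on made-up values and also shows that the transform is undefined for inputs below -1, so a column that can hold such values would need clipping first.

```python
import numpy as np

# Made-up lead times spanning several orders of magnitude.
lead_time = np.array([0, 1, 10, 100, 500], dtype=float)

manual = np.log(lead_time + 1)
builtin = np.log1p(lead_time)
print(np.allclose(manual, builtin))

# Inputs below -1 have no real logarithm, so the result is NaN.
with np.errstate(invalid="ignore"):
    bad = np.log1p(np.array([-2.0]))
print(bad)
```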

With decision trees we typically don't use `one hot encoding` when there are many categories, like the situation here, so instead we just go with label encoding.

For more information check this article: <a href="https://kiwidamien.github.io/are-you-getting-burned-by-one-hot-encoding.html">Are You Getting Burned By One-Hot Encoding?</a>
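
To see the difference in shape between the two encodings, here is a small sketch on a made-up categorical column: one-hot creates one column per category, while label encoding keeps a single integer column.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy categorical column (values are made up for illustration).
toy = pd.DataFrame({"segment": ["Online TA", "Direct", "Groups", "Online TA", "Corporate"]})

# One-hot: one new column per distinct category.
one_hot = pd.get_dummies(toy["segment"])
print(one_hot.shape[1])

# Label encoding: a single column of integer codes (classes sorted alphabetically).
codes = LabelEncoder().fit_transform(toy["segment"])
print(codes.tolist())
```

The integer codes impose an arbitrary order on the categories. Trees can usually split around that, but linear models and distance-based models may be misled by it.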

First of all let's extract some useful information from `reservation_status_date` and then encode the `catg`.

In [None]:
# Extract year, month and day from the reservation status date

catg["reservation_status_date"] = pd.to_datetime(catg["reservation_status_date"])

catg["year"] = catg["reservation_status_date"].dt.year
catg["month"] = catg["reservation_status_date"].dt.month
catg["day"] = catg["reservation_status_date"].dt.day

catg = catg.drop("reservation_status_date", axis = 1)

In [None]:
from sklearn.preprocessing import LabelEncoder

label = LabelEncoder()

columns = catg.columns
for col in columns:
    catg[col] = label.fit_transform(catg[col])
catg

In [None]:
catg = catg.drop(["agent"],axis =1)

In [None]:
X = catg.join(num).drop("adr",axis = 1)
y = df["is_canceled"]

## 9. Splitting features

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.33, random_state=42)

## 10. Creating the model

### 1. Decision Tree

In [None]:
from sklearn.tree import DecisionTreeClassifier

tree_model = DecisionTreeClassifier()
tree_model.fit(X_train, y_train)
y_pred = tree_model.predict(X_test)

In [None]:
from sklearn.metrics import confusion_matrix,classification_report

print(classification_report(y_test,y_pred))

__96% Accuracy with decision tree__

Now let's check with Grid Search whether hyperparameter tuning can improve performance.

In [None]:
from sklearn.model_selection import GridSearchCV

param_dist = {
    "criterion" : ["gini", "entropy"],
    "max_depth" : list(range(1, 26)) + [None]  # depths 1..25, plus unlimited depth
}

grid = GridSearchCV(tree_model, param_grid=param_dist, cv = 10,n_jobs = -1)
grid.fit(X_train,y_train)
grid.best_estimator_
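
Besides `best_estimator_`, a fitted `GridSearchCV` also exposes `best_params_` and `best_score_` (the mean cross-validated score of the best combination). A minimal sketch on synthetic data, with a smaller grid and fewer folds than above to keep it fast:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the booking features (not the real data).
X_toy, y_toy = make_classification(n_samples=300, n_features=8, random_state=0)

param_grid = {"criterion": ["gini", "entropy"], "max_depth": [2, 4, 8, None]}
search = GridSearchCV(DecisionTreeClassifier(random_state=0), param_grid=param_grid, cv=3)
search.fit(X_toy, y_toy)

print(search.best_params_)            # winning combination from the grid
print(round(search.best_score_, 3))   # its mean cross-validated accuracy
```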

In [None]:
tree_model = DecisionTreeClassifier(criterion="entropy", max_depth=30)
tree_model.fit(X_train, y_train)
y_pred = tree_model.predict(X_test)
print(classification_report(y_test,y_pred))

__Got 97% Accuracy with `entropy` and `max_depth` of 30.__

That said, we might have overfitted our model for a gain of only 1 percent.
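
A quick way to get a feel for overfitting is to compare accuracy on the training split with accuracy on the test split. The sketch below uses synthetic, slightly noisy data (an assumption, not the booking data) and an unconstrained tree.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with 10% label noise, so a perfect test score is unlikely.
X_toy, y_toy = make_classification(n_samples=500, n_features=10, flip_y=0.1, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.33, random_state=42)

# No depth limit: the tree can keep splitting until it memorises the training set.
deep_tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)

train_acc = deep_tree.score(X_tr, y_tr)
test_acc = deep_tree.score(X_te, y_te)
print(train_acc, test_acc)
```

A large gap between the two numbers is a hint that the tree is fitting noise in the training data.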

### MinMax Scaling features for Logistic Regression

In [None]:
from sklearn.preprocessing import MinMaxScaler

sc = MinMaxScaler()
X_train = sc.fit_transform(X_train)
# Reuse the min/max learned on the training set instead of refitting on the test set
X_test = sc.transform(X_test)
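
The usual convention is to fit a scaler on the training split only and then apply it to the test split. The sketch below, on made-up one-column arrays, shows how the two approaches give different test values.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up single-feature splits.
train = np.array([[0.0], [5.0], [10.0]])
test = np.array([[5.0], [20.0]])

# Fit on train, then transform test with the training min/max.
scaler = MinMaxScaler().fit(train)
consistent = scaler.transform(test)

# Refitting on test rescales it with its own min/max instead.
refit = MinMaxScaler().fit_transform(test)

print(consistent.ravel().tolist())
print(refit.ravel().tolist())
```

With the refit, the value 5.0 lands at 0.5 in the training data but at 0.0 in the test data, so the model sees the same raw value on two different scales.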

### 2. Logistic Regression

In [None]:
from sklearn.linear_model import LogisticRegressionCV

log_model = LogisticRegressionCV(max_iter=1000)
log_model.fit(X_train,y_train)
y_pred = log_model.predict(X_test)

In [None]:
print(classification_report(y_test,y_pred))

__79% Accuracy with Logistic Regression__

### Standard Scaling features for KNN

In [None]:
from sklearn.preprocessing import StandardScaler

sc = StandardScaler()
X_train = sc.fit_transform(X_train)
# Apply the mean and standard deviation learned on the training set
X_test = sc.transform(X_test)

### 3. KNN

In [None]:
from sklearn.neighbors import KNeighborsClassifier
knn_model = KNeighborsClassifier()
knn_model.fit(X_train, y_train)
y_pred = knn_model.predict(X_test)

In [None]:
print(classification_report(y_test,y_pred))

__88% Accuracy with KNN__