### Summary
Every year, more than 140 million bookings are made on the internet, and many hotel bookings are made through top-visited travel websites such as Booking.com, Expedia.com, and Hotels.com. According to Google data, hotels are booked 12 weeks in advance.

This dataset contains 31 features describing each booking, such as Average Daily Rate, Arrival Time, Room Type, and Special Requests, covering the years 2015 to 2017.

In this kernel, I explore the booking data with exploratory data analysis, do some feature engineering, review correlations between features, tune hyperparameters, and visualize the most important features and their distributions. Based on these analyses, I aim to find the best tree-based model for predicting hotel booking cancellations from the remaining features in the dataset. The goal of the predictive analysis is to find the model with the highest accuracy while avoiding overfitting.

### Boosting
Boosting refers to a family of algorithms that convert weak learners into strong learners. The main principle of boosting is to fit a sequence of weak learners (models that are only slightly better than random guessing, such as small decision trees) to weighted versions of the data, where more weight is given to examples that were misclassified in earlier rounds. The predictions are then combined through a weighted majority vote (classification) or a weighted sum (regression) to produce the final prediction. The principal difference between boosting and committee methods such as bagging is that the base learners are trained in sequence on a weighted version of the data.
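
The reweighting idea can be sketched in a few lines. The example below is a minimal, from-scratch version of discrete AdaBoost with depth-1 decision trees, run on synthetic data rather than the hotel dataset. The helper names `fit_adaboost` and `predict_adaboost` are hypothetical and only for illustration; this is not how scikit-learn implements its classifier internally.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier


def fit_adaboost(X, y, n_rounds=20):
    """Fit discrete AdaBoost with decision stumps; labels must be -1 or +1."""
    weights = np.full(len(y), 1.0 / len(y))  # start with uniform example weights
    stumps, alphas = [], []
    for _ in range(n_rounds):
        stump = DecisionTreeClassifier(max_depth=1)  # weak learner
        stump.fit(X, y, sample_weight=weights)
        pred = stump.predict(X)
        # Weighted error, clipped to keep the logarithm finite
        err = np.clip(weights[pred != y].sum(), 1e-10, 1 - 1e-10)
        alpha = 0.5 * np.log((1 - err) / err)  # vote strength of this learner
        # Increase the weight of misclassified examples, decrease the rest
        weights = weights * np.exp(-alpha * y * pred)
        weights = weights / weights.sum()
        stumps.append(stump)
        alphas.append(alpha)
    return stumps, np.array(alphas)


def predict_adaboost(stumps, alphas, X):
    """Combine all stumps through a weighted majority vote."""
    votes = sum(a * s.predict(X) for a, s in zip(alphas, stumps))
    return np.where(votes >= 0, 1, -1)


# Synthetic toy data (an assumption for the demo, not the hotel dataset)
X_toy, y01 = make_classification(n_samples=300, n_features=5, random_state=0)
y_toy = np.where(y01 == 1, 1, -1)

stumps, alphas = fit_adaboost(X_toy, y_toy)
toy_accuracy = (predict_adaboost(stumps, alphas, X_toy) == y_toy).mean()
print(f"Training accuracy of the boosted stumps: {toy_accuracy:.3f}")
```

Each round trains on the same rows but with different weights, which is what separates boosting from bagging, where learners are trained independently on resampled data.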

### Types of Boosting Algorithms
1. AdaBoost (Adaptive Boosting)
1. Gradient Tree Boosting
1. XGBoost
1. LightGBM
1. CatBoost

## Load Libraries

In [None]:
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns


from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.model_selection import train_test_split, StratifiedKFold, GridSearchCV, cross_val_score
from sklearn.metrics import accuracy_score , classification_report, confusion_matrix, auc, roc_curve, precision_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import GradientBoostingClassifier

from xgboost import XGBClassifier

## Load Dataset

In [None]:
# Import Data

hotel_df = pd.read_csv('../input/hotel-booking-demand/hotel_bookings.csv')

In [None]:
# Show first 5 rows

hotel_df.head(5)

In [None]:
# print some information about data

hotel_df.info()

In [None]:
# print the size of the data
hotel_df.shape

## Exploratory Data Analysis

Which type of hotel is booked most often?

In [None]:
plt.figure(figsize=(8,6))
sns.countplot(x='hotel', data = hotel_df, palette='gist_earth')
plt.title('Hotel Types', weight='bold')
plt.xlabel('Hotel', fontsize=12)
plt.ylabel('Count', fontsize=12)

Check the number of canceled bookings (target variable).

In [None]:
plt.figure(figsize=(8,6))
sns.countplot(x='is_canceled', data= hotel_df, palette='gist_stern')
plt.title('Canceled Situation', weight='bold')
# `is_canceled` is on the x-axis, so the count belongs on the y-axis
plt.xlabel('Canceled or Not Canceled', fontsize=12)
plt.ylabel('Count', fontsize=12)

The violin plot below shows how lead_time is distributed in each arrival_date_year, split by booking cancellation status.

In [None]:
plt.figure(figsize=(8,6))
sns.violinplot(x='arrival_date_year', y='lead_time', hue='is_canceled', data=hotel_df, palette='Set2', bw=.2,
               cut=2, linewidth=2, inner='box', split=True)
sns.despine(left=True)
plt.title('Arrival Year vs Lead Time vs Canceled Situation', weight='bold')
plt.xlabel('Year', fontsize=12)
plt.ylabel('Lead Time', fontsize=12)

For canceled bookings, the means and interquartile ranges are similar across all years, but the shapes of the distributions differ noticeably. The distributions of non-canceled bookings, on the other hand, are almost identical. In every year and for both cancellation statuses, a small number of very large lead_time values pulls the mean up, so the mean is higher than the median.
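
The gap between mean and median can also be checked numerically. The sketch below uses a small made-up frame (`toy_df`, an assumption for the demo) with the same column names as the dataset. On the real data the same `groupby` would be applied to `hotel_df`.

```python
import pandas as pd

# Small made-up sample; on the real data, use hotel_df instead of toy_df
toy_df = pd.DataFrame({
    'arrival_date_year': [2015, 2015, 2015, 2015, 2016, 2016, 2016, 2016],
    'is_canceled':       [0, 0, 0, 0, 1, 1, 1, 1],
    'lead_time':         [2, 5, 9, 300, 10, 20, 30, 400],
})

lead_time_stats = (
    toy_df.groupby(['arrival_date_year', 'is_canceled'])['lead_time']
          .agg(['mean', 'median'])
)
# A positive gap means a few large values pull the mean above the median
lead_time_stats['mean_minus_median'] = lead_time_stats['mean'] - lead_time_stats['median']
print(lead_time_stats)
```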

The next exploration looks at the arrival_date_month feature. The month names are first converted to numbers, which makes the analysis easier.

In [None]:
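# Convert `arrival_date_month` names to numbers (kept as strings here, cast to int later)

month_to_number = {'January' : '1',
        'February' : '2',
        'March' : '3',
        'April' : '4',
        'May' : '5',
        'June' : '6',
        'July' : '7',
        'August' : '8',
        'September' : '9',
        'October' : '10',
        'November' : '11',
        'December' : '12'}
# Assign the result back instead of calling replace(..., inplace=True) on a column selection
hotel_df['arrival_date_month'] = hotel_df['arrival_date_month'].replace(month_to_number)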

In [None]:
#`arrival_date_month` exploration 

plt.figure(figsize=(10,10))
sns.countplot(x='arrival_date_month', data=hotel_df,
              order=hotel_df['arrival_date_month'].value_counts().index, palette='YlOrBr_r')
plt.title('Arrival Month', weight='bold')
plt.xlabel('Month', fontsize=12)
plt.ylabel('Count', fontsize=12)

The graph above shows the number of bookings for each month. August is the busiest month and January is the quietest, with about half as many bookings as August.

Two other important time-related features are stays_in_week_nights and stays_in_weekend_nights.

In [None]:
# Table of `stays_in_week_nights` vs `stays_in_weekend_nights`

pd.crosstab(index=hotel_df['stays_in_week_nights'], columns=hotel_df['stays_in_weekend_nights'],
            margins=True, margins_name='Total').iloc[:10]

The table above suggests a new feature, weekend_or_weekday, with the values stay_just_weekend, stay_just_weekday and stay_both_weekday_and_weekend. The 715 bookings with zero nights in both columns do not fit any of these values and are labeled undefined_data.

In [None]:
## Creating new feature: `Weekday vs Weekend`

def week_function(weekend_nights, week_nights, data_source):
    # Vectorized labelling: avoids the slow row loop and chained assignment,
    # and writes to data_source rather than the global hotel_df
    conditions = [
        (week_nights == 0) & (weekend_nights > 0),
        (week_nights > 0) & (weekend_nights == 0),
        (week_nights > 0) & (weekend_nights > 0),
    ]
    labels = ['stay_just_weekend', 'stay_just_weekday', 'stay_both_weekday_and_weekend']
    # Rows with zero nights in both columns fall through to the default label
    data_source['weekend_or_weekday'] = np.select(conditions, labels, default='undefined_data')


week_function(hotel_df['stays_in_weekend_nights'], hotel_df['stays_in_week_nights'], hotel_df)

In [None]:
#`arrival_date_month` vs `weekend_or_weekday` graph 

hotel_df['arrival_date_month']= hotel_df['arrival_date_month'].astype('int64')
group_data = hotel_df.groupby([ 'arrival_date_month','weekend_or_weekday']).size().unstack(fill_value=0)

# The month is the index of group_data, so sort by index
group_data.sort_index().plot(kind='bar', stacked=True, cmap='Set3', figsize=(12,8))
plt.title('Arrival Month vs Staying Weekend or Weekday', weight='bold')
plt.xlabel('Arrival Month', fontsize=12)
plt.xticks(rotation=0)
plt.ylabel('Count', fontsize=12)

Further feature engineering is applied to the children and babies features. Since there is no obvious difference between them, they are combined into a single feature named all_children.


In [None]:
# Create new feature `all_children` by adding the `children` and `babies` features

hotel_df['all_children'] = hotel_df['children'] + hotel_df['babies']
pd.crosstab(hotel_df['adults'], hotel_df['all_children'], margins=True, margins_name = 'Total')

The table below shows the share of each meal type by hotel type. According to the results, 67% of Bed & Breakfast bookings are made for the City Hotel, and almost all Full Board bookings are made for the Resort Hotel.

In [None]:
# Groupby `Meal` and `Hotel` features

# Each meal column is divided by its own total, so the values are per-meal shares by hotel
group_meal_data = hotel_df.groupby(['hotel','meal']).size().unstack(fill_value=0).transform(lambda x: x/x.sum())
group_meal_data.round(2)
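
The same column-wise shares can be obtained with `pd.crosstab` and `normalize='columns'`, which may be easier to read than the groupby chain. The sketch below runs on a small made-up frame (`toy_meals`, an assumption for the demo), not on the hotel data.

```python
import pandas as pd

# Made-up bookings; on the real data, use hotel_df['hotel'] and hotel_df['meal']
toy_meals = pd.DataFrame({
    'hotel': ['City Hotel', 'City Hotel', 'Resort Hotel', 'Resort Hotel', 'Resort Hotel'],
    'meal':  ['BB', 'BB', 'BB', 'FB', 'FB'],
})

# normalize='columns' divides each meal column by its own total,
# so every column sums to 1
meal_share = pd.crosstab(toy_meals['hotel'], toy_meals['meal'], normalize='columns')
print(meal_share.round(2))
```

Because each column sums to 1, a cell reads as "the share of this meal type that was booked at this hotel", which is how the 67% figure above should be interpreted.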

The graph below shows the top 10 countries of origin of the bookings.

In [None]:
# Create Top 10 Country of Origin graph

plt.figure(figsize=(10,10))
sns.countplot(x='country', data=hotel_df,
              order=hotel_df['country'].value_counts().iloc[:10].index, palette='brg')
plt.title('Top 10 Country of Origin', weight='bold')
plt.xlabel('Country', fontsize=12)
plt.ylabel('Count', fontsize=12)

In [None]:
# `Arrival Month` vs `ADR` vs `Booking Cancellation Status`

hotel_df['adr'] = hotel_df['adr'].astype(float)
plt.figure(figsize=(15,10))
sns.barplot(x='arrival_date_month', y='adr', hue='is_canceled', dodge=True, palette= 'PuBu_r', data=hotel_df)
plt.title('Arrival Month vs ADR vs Booking Cancellation Status', weight='bold')
plt.xlabel('Arrival Month', fontsize=12)
plt.ylabel('ADR', fontsize=12)

## Dealing with Missing Data and Correlation Matrix

In [None]:
## Display sum of null data

hotel_df.isnull().sum()

- About 94% of the company values are missing, so this feature is dropped.
- children and all_children each have only 4 missing values, which are replaced with 0.
- Less than 1% of the country values are missing; they are replaced with the most frequent value.
- Missing agent values are replaced with 0.
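
Percentages like the ones above can be read directly from the mean of the null mask. A minimal sketch on a made-up frame (`toy_missing`, an assumption for the demo); on the real data the same expression would be applied to `hotel_df`.

```python
import numpy as np
import pandas as pd

# Made-up frame; on the real data, replace toy_missing with hotel_df
toy_missing = pd.DataFrame({
    'company':  [np.nan, np.nan, np.nan, 40.0],
    'children': [0.0, 1.0, np.nan, 2.0],
    'country':  ['PRT', 'GBR', 'PRT', 'FRA'],
})

# isnull().mean() gives the fraction of missing values per column
missing_pct = (toy_missing.isnull().mean() * 100).sort_values(ascending=False)
print(missing_pct)
```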

In [None]:
# Fill missing data

hotel_df['children'] = hotel_df['children'].fillna(0)
hotel_df['all_children'] = hotel_df['all_children'].fillna(0)
# mode() returns a Series of the most frequent values; [0] selects the value itself
hotel_df['country'] = hotel_df['country'].fillna(hotel_df['country'].mode()[0])
hotel_df['agent'] = hotel_df['agent'].fillna(0)
hotel_df = hotel_df.drop(['company'], axis=1)

In [None]:
# Change data type

hotel_df['agent']= hotel_df['agent'].astype(int)
#hotel_df['country']= hotel_df['country'].astype(O)

In [None]:
# Label-encode the categorical (object dtype) features

cols = [col for col in hotel_df.columns if hotel_df[col].dtype == 'O']

# Assign with hotel_df[cols] so the columns take the encoded integer dtype
hotel_df[cols] = hotel_df[cols].astype(str).apply(LabelEncoder().fit_transform)

In [None]:
hotel_df.head()

In [None]:
#Create new dataframe for categorical data

hotel_data_categorical = hotel_df[['hotel','is_canceled','arrival_date_month','meal',
                                     'country','market_segment','distribution_channel', 
                                     'is_repeated_guest', 'reserved_room_type',
                                     'assigned_room_type','deposit_type','agent',
                                     'customer_type','reservation_status', 
                                     'weekend_or_weekday']]
hotel_data_categorical.info()

In [None]:
#Create new dataframe for numerical data

hotel_data_numerical= hotel_df.drop(['hotel','is_canceled', 'arrival_date_month','meal',
                                       'country','market_segment','distribution_channel', 
                                       'is_repeated_guest', 'reserved_room_type', 
                                       'assigned_room_type','deposit_type','agent', 
                                       'customer_type','reservation_status',
                                       'weekend_or_weekday'], axis = 1)
hotel_data_numerical.info()

In [None]:
# Correlation Matrix with Spearman method

plt.figure(figsize=(15,15))
corr_categorical=hotel_data_categorical.corr(method='spearman')
mask_categorical = np.triu(np.ones_like(corr_categorical, dtype=bool))

sns.heatmap(corr_categorical, annot=True, fmt=".2f", cmap='BrBG', vmin=-1, vmax=1, center= 0,
            square=True, linewidths=2, cbar_kws={"shrink": .5}).set(ylim=(15, 0))
plt.title("Correlation Matrix Spearman Method- Categorical Data ",size=15, weight='bold')

In [None]:
# Correlation Matrix with pearson method

plt.figure(figsize=(15,15))
corr_numerical=hotel_data_numerical.corr(method='pearson')
mask_numerical = np.triu(np.ones_like(corr_numerical, dtype=bool))
sns.heatmap(corr_numerical, annot=True, fmt=".2f", cmap='RdBu', mask= mask_numerical, vmin=-1, vmax=1, center= 0,
            square=True, linewidths=2, cbar_kws={"shrink": .5}).set(ylim=(17, 0))
plt.title("Correlation Matrix Pearson Method- Numerical Data ",size=15, weight='bold')

In [None]:
# Find highly correlated features (absolute correlation above 0.90, so strong negative pairs count too)

corr_mask_categorical = corr_categorical.mask(mask_categorical)
corr_values_categorical = [c for c in corr_mask_categorical.columns if (corr_mask_categorical[c].abs() > 0.90).any()]
corr_mask_numerical = corr_numerical.mask(mask_numerical)
corr_values_numerical = [c for c in corr_mask_numerical.columns if (corr_mask_numerical[c].abs() > 0.90).any()]
print(corr_values_categorical, corr_values_numerical)

In [None]:
# drop the highly correlated features

hotel_df = hotel_df.drop(['reservation_status', 'children', 'reservation_status_date'], axis=1)

## Separate target and predictor variables

In [None]:
# Separate target variable

hotel_data_tunning = hotel_df
# Select by name rather than position: `is_canceled` is the target, every other column is a predictor
y = hotel_data_tunning['is_canceled']
x = hotel_data_tunning.drop(columns=['is_canceled'])

## Train test split

In [None]:
# train and test split
x_train, x_test, y_train, y_test = train_test_split(x,y, test_size=0.3, random_state=42)

In [None]:
print('X train size: ', x_train.shape)
print('y train size: ', y_train.shape)
print('X test size: ', x_test.shape)
print('y test size: ', y_test.shape)

## Train Model

## Logistic Regression

In [None]:
# Create logistic regression object
lr = LogisticRegression()

# Train logistic regression classifier
lr.fit(x_train, y_train)

#Predict the response for test dataset
y_pred = lr.predict(x_test)

In [None]:
precision_score_lr =  precision_score(y_test, y_pred)
accuracy_score_lr = accuracy_score(y_test, y_pred)
print('The precision score is : ',round(precision_score_lr * 100,2), '%')
print('The accuracy score is : ',round(accuracy_score_lr * 100,2), '%')
print ('\nClassification Report TEST:\n', classification_report(y_test,y_pred))

## AdaBoost Classifier

In [None]:
# base estimator (optional)
dt = DecisionTreeClassifier()

# Create AdaBoost classifier object; the base estimator is passed positionally because its
# keyword was renamed from base_estimator to estimator in scikit-learn 1.2
abc = AdaBoostClassifier(dt, n_estimators=250, learning_rate=1.0, random_state=0)

# Train AdaBoost classifier
abc.fit(x_train, y_train)

#Predict the response for test dataset
y_pred_lg = abc.predict(x_test)

The most important parameters are base_estimator (named estimator since scikit-learn 1.2), n_estimators and learning_rate.
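
To get a feel for how these parameters interact, the sketch below fits AdaBoost with depth-1 trees on synthetic data (an assumption for the demo, not the hotel dataset) and uses `staged_score` to report the test accuracy after different numbers of boosting rounds. The parameter values are illustrative, not tuned.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data, not the hotel dataset
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=0)

# The weak learner is passed positionally because its keyword name differs
# between scikit-learn versions (base_estimator vs. estimator)
demo_abc = AdaBoostClassifier(
    DecisionTreeClassifier(max_depth=1),
    n_estimators=50,
    learning_rate=0.5,
    random_state=0,
)
demo_abc.fit(X_tr, y_tr)

# staged_score yields the test accuracy after each boosting round
staged_acc = list(demo_abc.staged_score(X_te, y_te))
for n in sorted({1, min(10, len(staged_acc)), len(staged_acc)}):
    print(f"{n:>3} estimators: accuracy = {staged_acc[n - 1]:.3f}")
```

A smaller learning_rate shrinks the contribution of each learner, so it usually needs a larger n_estimators to reach the same accuracy.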

In [None]:
precision_score_ab =  precision_score(y_test, y_pred_lg)
accuracy_score_ab = accuracy_score(y_test, y_pred_lg)
print('The precision score is : ',round(precision_score_ab * 100,2), '%')
print('The accuracy score is : ',round(accuracy_score_ab * 100,2), '%')
print ('\nClassification Report TEST:\n', classification_report(y_test,y_pred_lg))

## Gradient Boosting Classifier

In [None]:
# create object
gbc= GradientBoostingClassifier(learning_rate=0.1,min_samples_leaf=10, min_samples_split=200, max_features='sqrt',random_state=10)

# Train Gradient Boosting classifier
gbc.fit(x_train, y_train)

#Predict the response for test dataset
y_pred_gbc = gbc.predict(x_test)

In [None]:
precision_score_gbc =  precision_score(y_test, y_pred_gbc)
accuracy_score_gbc = accuracy_score(y_test, y_pred_gbc)
print('The precision score is : ',round(precision_score_gbc * 100,2), '%')
print('The accuracy score is : ',round(accuracy_score_gbc * 100,2), '%')
print ('\nClassification Report TEST:\n', classification_report(y_test,y_pred_gbc))

## XGBoost

In [None]:
xgbc = XGBClassifier(max_depth=13,n_estimators=300,learning_rate=0.5)
    
# Train XGBoost classifier
xgbc.fit(x_train, y_train)

#Predict the response for test dataset
y_pred_xgbc = xgbc.predict(x_test)

In [None]:
precision_score_xgbc =  precision_score(y_test, y_pred_xgbc)
accuracy_score_xgbc = accuracy_score(y_test, y_pred_xgbc)
print('The precision score is : ',round(precision_score_xgbc * 100,2), '%')
print('The accuracy score is : ',round(accuracy_score_xgbc * 100,2), '%')
print ('\nClassification Report TEST:\n', classification_report(y_test,y_pred_xgbc))

## Accuracy scores of all the models

In [None]:
print('Logistic Regression accuracy score is : ',round(accuracy_score_lr * 100,2), '%')
print('AdaBoost accuracy score is : ',round(accuracy_score_ab * 100,2), '%')
print('Gradient Boosting accuracy score is : ',round(accuracy_score_gbc * 100,2), '%')
print('XGBoost accuracy score is : ',round(accuracy_score_xgbc * 100,2), '%')
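
The scores above come from a single train/test split. To reduce the risk of overfitting to that split, the models could be compared and tuned with cross-validation. The sketch below shows one way to do this with `GridSearchCV` and `StratifiedKFold` (both imported at the top of the kernel) on synthetic data. The parameter grid is hypothetical and deliberately tiny so that it runs quickly.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic data; on the real data, use x_train and y_train
X_cv, y_cv = make_classification(n_samples=300, n_features=8, random_state=0)

# Hypothetical, deliberately tiny grid to keep the run short
param_grid = {'n_estimators': [25, 50], 'max_depth': [2, 3]}

# Stratified folds keep the class ratio the same in every fold
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
search = GridSearchCV(
    GradientBoostingClassifier(random_state=0),
    param_grid,
    scoring='accuracy',
    cv=cv,
)
search.fit(X_cv, y_cv)

print('Best parameters:', search.best_params_)
print('Mean cross-validated accuracy:', round(search.best_score_, 3))
```

The held-out test set should only be scored once, after the best parameters have been chosen on the training folds.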