# Predicting Cancellation Rates

In this notebook, you will build machine learning models that predict whether a customer cancelled a hotel booking. Along the way, you will be introduced to `scikit-learn`, a Python library for machine learning.

We will use a dataset on hotel bookings from the article ["Hotel booking demand datasets"](https://www.sciencedirect.com/science/article/pii/S2352340918315191), published in the Elsevier journal [Data in Brief](https://www.sciencedirect.com/journal/data-in-brief). The abstract of the article states:

> This data article describes two datasets with hotel demand data. One of the hotels (H1) is a resort hotel and the other is a city hotel (H2). Both datasets share the same structure, with 31 variables describing the 40,060 observations of H1 and 79,330 observations of H2. Each observation represents a hotel booking. Both datasets comprehend bookings due to arrive between the 1st of July of 2015 and the 31st of August 2017, including bookings that effectively arrived and bookings that were canceled. 

For convenience, the two datasets have been combined into a single CSV file, which this notebook reads from `/kaggle/input/hotel-booking/hotel_booking.csv`. Let us start by importing everything needed to load, visualize, and model the data.

In [None]:
import os

import numpy as np  # linear algebra
import pandas as pd  # data processing, CSV file I/O (e.g. pd.read_csv)

# Visualization imports
import matplotlib.pyplot as plt
import plotly.express as px
import seaborn as sns

# ML imports
# Data splitting and cross-validation
from sklearn.model_selection import KFold, cross_val_score
# Ensemble learning techniques
from sklearn.ensemble import BaggingClassifier, VotingClassifier, GradientBoostingClassifier
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier, StackingClassifier
# Classifiers
from sklearn.neighbors import KNeighborsClassifier  # instance-based classifier
from sklearn.naive_bayes import GaussianNB  # probabilistic classifier
from sklearn.tree import DecisionTreeClassifier  # tree-based classifier
from sklearn.linear_model import LogisticRegression  # final estimator for stacking
# Pipelines and feature engineering
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.metrics import accuracy_score

plt.rcParams['figure.figsize'] = [8, 4]

# Input data files are available in the read-only "../input/" directory.
# List every file under the input directory.
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [None]:
def running_models(models, preprocessor, nSplits, X, y):
    """Cross-validate each (name, model) pair behind the shared preprocessor and print accuracy statistics."""
    split = KFold(n_splits=nSplits, shuffle=True, random_state=1234)
    for name, model in models:
        # The preprocessor is refit inside every fold, so held-out rows never influence it.
        steps = Pipeline(steps=[('preprocessor', preprocessor), ('model', model)])
        cv_results = cross_val_score(steps, X, y, cv=split, scoring='accuracy', n_jobs=-1)
        print(
            f"[{name}] Cross Validation Accuracy Score: "
            f"{np.mean(cv_results) * 100:.2f} % +/- {np.std(cv_results) * 100:.2f} % (std) "
            f"min: {np.min(cv_results) * 100:.2f} %, max: {np.max(cv_results) * 100:.2f} %"
        )


## 0. Get the data

The first step in any machine learning workflow is to get the data and explore it.

In [None]:
hotel_bookings=pd.read_csv('/kaggle/input/hotel-booking/hotel_booking.csv')
hotel_bookings.head(5)
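
Beyond looking at the first rows, two quick checks are usually worth making before modeling: how balanced the target is, and which columns contain missing values. The following is a minimal sketch on a made-up toy table; the helper name `summarize_target_and_missing` and the toy values are assumptions for illustration and do not come from the dataset.

```python
import numpy as np
import pandas as pd


def summarize_target_and_missing(df: pd.DataFrame, target: str):
    """Return the class shares of `target` and the per-column missing-value counts."""
    class_share = df[target].value_counts(normalize=True).sort_index()
    missing = df.isna().sum()
    # Keep only columns that actually have missing values, largest count first.
    missing = missing[missing > 0].sort_values(ascending=False)
    return class_share, missing


# Toy stand-in for the bookings table (values are invented for illustration).
toy = pd.DataFrame({
    "is_canceled": [0, 1, 0, 0],
    "children": [0.0, np.nan, 1.0, 0.0],
    "agent": [9.0, np.nan, np.nan, 14.0],
    "hotel": ["City Hotel", "Resort Hotel", "City Hotel", "City Hotel"],
})

class_share, missing = summarize_target_and_missing(toy, "is_canceled")
print(class_share)
print(missing)
```

In this notebook the same helper could be called as `summarize_target_and_missing(hotel_bookings, "is_canceled")`.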


Let us look at the number of bookings by month.

In [None]:
bookings_by_month = hotel_bookings.groupby('arrival_date_month', as_index=False)[['hotel']].count().rename(columns={"hotel": "nb_bookings"})
months = ['January', 'February', 'March', 'April', 'May', 'June', 'July', 'August', 'September', 'October', 'November', 'December'] 
fig = px.bar(
    bookings_by_month, 
    x='arrival_date_month', 
    y='nb_bookings', 
    title='Hotel Bookings by Month', 
    category_orders={"arrival_date_month": months}
)
fig.show(config={"displayModeBar": False})

## 2. Choose a class of models and hyperparameters

The next step is to choose a class of models and specify their hyperparameters. As a starting point, we keep the default hyperparameters of `scikit-learn`; a range of values can be searched later to tune each model for better performance. We pick two simple yet effective models: a Decision Tree and a Random Forest.
We will use `scikit-learn` to fit the models and evaluate their performance.
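
As an illustration of what searching over a range of hyperparameter values can look like, here is a minimal sketch using `GridSearchCV` on synthetic data. The grid values are hypothetical and chosen only to keep the example small; they are not tuned recommendations for the bookings data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic data stands in for the bookings features.
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

# Hypothetical grid: 3 depths x 2 leaf sizes = 6 candidate settings.
param_grid = {"max_depth": [2, 4, 8], "min_samples_leaf": [1, 5]}

search = GridSearchCV(
    DecisionTreeClassifier(random_state=1234),
    param_grid=param_grid,
    cv=3,
    scoring="accuracy",
)
search.fit(X_demo, y_demo)

print(search.best_params_)
print(round(search.best_score_, 4))
```

When the model sits inside a `Pipeline`, grid keys are prefixed with the step name, for example `model__max_depth` for a step called `model`.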

In [None]:
from IPython.display import Image
Image("http://scikit-learn.org/dev/_static/ml_map.png", width=750)

In [None]:
models = [
  ("Decision Tree", DecisionTreeClassifier(random_state=1234)),
  ("Random Forest", RandomForestClassifier(random_state=1234,n_jobs=-1)),]

## 3. Preprocess the data

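The next step is to set up a pipeline to preprocess the features. Missing numerical values are imputed with a constant (`SimpleImputer(strategy="constant")` fills numeric columns with 0 by default), missing categorical values are replaced with the label `"Unknown"`, and all categorical features are one-hot encoded.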

In [None]:
# Preprocess numerical features:
features_num = [
    "lead_time", "arrival_date_week_number", "arrival_date_day_of_month", "stays_in_weekend_nights",
    "stays_in_week_nights", "adults", "children", "babies", "is_repeated_guest" ,
    "previous_cancellations", "previous_bookings_not_canceled", "agent", "company", 
    "required_car_parking_spaces", "total_of_special_requests", "adr"
]
transformer_num = SimpleImputer(strategy="constant")  # fill_value defaults to 0 for numeric data

# Preprocess categorical features:
features_cat = [
    "hotel", "arrival_date_month", "meal", "market_segment", "distribution_channel", 
    "reserved_room_type", "deposit_type", "customer_type"
]
transformer_cat = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="constant", fill_value="Unknown")),
    ("onehot", OneHotEncoder(handle_unknown='ignore'))
])

# Create a preprocessing pipeline
preprocessor = ColumnTransformer(transformers=[
    ("num", transformer_num, features_num),
    ("cat", transformer_cat, features_cat)
])
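
To see what this kind of preprocessing produces, the sketch below applies the same two transformers to a three-row toy table with one numerical and one categorical column. The toy values and the name `toy_preprocessor` are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Toy table with one missing value per column (values are invented).
toy = pd.DataFrame({
    "lead_time": [10.0, np.nan, 30.0],
    "meal": pd.Series(["BB", np.nan, "HB"], dtype=object),
})

toy_preprocessor = ColumnTransformer(transformers=[
    ("num", SimpleImputer(strategy="constant"), ["lead_time"]),
    ("cat", Pipeline(steps=[
        ("imputer", SimpleImputer(strategy="constant", fill_value="Unknown")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ]), ["meal"]),
])

encoded = toy_preprocessor.fit_transform(toy)
# The result may be sparse or dense depending on its density; force a dense array.
encoded = encoded.toarray() if hasattr(encoded, "toarray") else np.asarray(encoded)

categories = toy_preprocessor.named_transformers_["cat"].named_steps["onehot"].categories_[0]
print(list(categories))
print(encoded)
```

The first output column is the imputed `lead_time`; the remaining columns are the one-hot indicators, one per category, with the missing meal mapped to its own `Unknown` column.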

## 4. Fit the models and evaluate performance

Finally, we will fit the Decision Tree and Random Forest models and evaluate them with 3-fold cross-validation: each model is trained on two thirds of the bookings and scored on the held-out third, three times over.

In [None]:
features = features_num + features_cat
X = hotel_bookings[features]
y = hotel_bookings["is_canceled"]

In [None]:
running_models(models,preprocessor,3,X,y)
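
The mechanics of the cross-validation used by `running_models` can be sketched on synthetic data. The example below shows how `KFold` partitions the rows and how `cross_val_score` returns one accuracy per fold; the data and sizes are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)
cv = KFold(n_splits=3, shuffle=True, random_state=1234)

# Each fold holds out a different third of the rows for scoring.
fold_sizes = []
for fold, (train_idx, test_idx) in enumerate(cv.split(X_demo), start=1):
    fold_sizes.append((len(train_idx), len(test_idx)))
    print(f"fold {fold}: train={len(train_idx)} rows, test={len(test_idx)} rows")

scores = cross_val_score(
    DecisionTreeClassifier(random_state=1234), X_demo, y_demo, cv=cv, scoring="accuracy"
)
print(f"mean={scores.mean():.4f} std={scores.std():.4f}")
```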

<H1>Ensemble Learning</H1>

<H3>Preparing for Ensemble Learning</H3>

In [None]:
clf_knn= KNeighborsClassifier(n_neighbors=5,algorithm='ball_tree')
clf_dt= DecisionTreeClassifier(random_state=42)
clf_nv= GaussianNB()

pre_models=[('knn',clf_knn),('DT',clf_dt),('NV',clf_nv),]
running_models(pre_models,preprocessor,3,X,y)

<h3>Voting Classifier</h3>

In [None]:
voting_hard = VotingClassifier(estimators=pre_models,voting='hard')
voting_soft = VotingClassifier(estimators=pre_models, voting='soft')
votingModels=[('hard_voting',voting_hard),('soft_voting',voting_soft),]
running_models(votingModels,preprocessor,3,X,y)
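
The difference between hard and soft voting can be shown with a small piece of arithmetic. The sketch below uses made-up probabilities for a single booking and a binary target; it mirrors the idea behind the two modes rather than the internal implementation of `VotingClassifier`.

```python
import numpy as np

# Hypothetical predicted probabilities of class 1 ("canceled") from three
# classifiers for one booking; the numbers are invented for illustration.
proba_class1 = np.array([0.45, 0.40, 0.95])

# Hard voting: each model casts one vote for its most likely class,
# and the majority wins. Here the votes are [0, 0, 1].
votes = (proba_class1 >= 0.5).astype(int)
hard_prediction = int(votes.sum() > len(votes) / 2)

# Soft voting: average the probabilities first (about 0.6 here),
# then pick the most likely class.
mean_proba = proba_class1.mean()
soft_prediction = int(mean_proba >= 0.5)

print(f"hard voting predicts {hard_prediction}, soft voting predicts {soft_prediction}")
```

One confident model can outweigh two lukewarm ones under soft voting, which is why the two modes can disagree on the same booking.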

<h3>Bagging Classifier</h3>

In [None]:
knn_bag=BaggingClassifier(clf_knn,max_samples=0.5, max_features=0.5)
dt_bag =BaggingClassifier(clf_dt,max_samples=0.5,max_features=0.5)
nv_bag =BaggingClassifier(clf_nv,max_samples=0.5,max_features=0.5)
baggingModels=[('KNN bagging',knn_bag),('DT Bagging',dt_bag),('NV Bagging',nv_bag),]
running_models(baggingModels,preprocessor,3,X,y)
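
The effect of `max_samples=0.5` and `max_features=0.5` can be inspected on a fitted ensemble. The following is a small sketch on synthetic data (200 rows, 10 features, both chosen arbitrarily) that counts how many rows and columns each base model was trained on.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=0)

bag = BaggingClassifier(
    DecisionTreeClassifier(random_state=42),
    n_estimators=5,
    max_samples=0.5,   # each tree draws half as many rows as the training set
    max_features=0.5,  # each tree sees half of the columns
    random_state=0,
)
bag.fit(X_demo, y_demo)

rows_per_model = [len(idx) for idx in bag.estimators_samples_]
cols_per_model = [len(idx) for idx in bag.estimators_features_]
print(rows_per_model)
print(cols_per_model)
```

Rows are drawn with replacement by default, so a base model may see some rows more than once.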


<h3>Boosting</h3>

In [None]:
smallAdaBoost=AdaBoostClassifier(n_estimators=100)
bigAdaBoost=AdaBoostClassifier(n_estimators=1000)
smallGB=GradientBoostingClassifier(n_estimators=100, learning_rate=1.0,max_depth=1, random_state=0)
bigGB=GradientBoostingClassifier(n_estimators=1000, learning_rate=0.1,max_depth=1, random_state=0)
boostingModels=[('100 estimators Adaboost',smallAdaBoost),
                ('1000 estimators Adaboost',bigAdaBoost),
                ('100 estimators GB',smallGB),
                ('1000 estimators GB',bigGB),]

running_models(boostingModels,preprocessor,3,X,y)
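
One way to see how boosting improves as estimators are added is to score the ensemble after each stage. The sketch below does this with `staged_predict` on synthetic data with a single train/test split; the data, the split, and the choice of 50 estimators are assumptions to keep the example fast.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X_demo, y_demo = make_classification(n_samples=400, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0
)

gb = GradientBoostingClassifier(
    n_estimators=50, learning_rate=0.1, max_depth=1, random_state=0
)
gb.fit(X_train, y_train)

# staged_predict yields test predictions after 1, 2, ..., n_estimators stumps.
staged_accuracy = [
    accuracy_score(y_test, y_pred) for y_pred in gb.staged_predict(X_test)
]
for n in (1, 10, 50):
    print(f"{n:>3} estimators: accuracy={staged_accuracy[n - 1]:.3f}")
```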


<h3>Stacking</h3>

In [None]:
final_estimator= LogisticRegression(solver='liblinear', random_state=0)
pre_model_stacking=StackingClassifier(estimators=pre_models,final_estimator=final_estimator)
bagging_model_stacking=StackingClassifier(estimators=baggingModels,final_estimator=final_estimator)
boosting_model_stacking=StackingClassifier(estimators=boostingModels,final_estimator=final_estimator)
stackingModels=[('pre model stacking',pre_model_stacking),
               ('bagging model stacking',bagging_model_stacking),
               ('boosting model stacking',boosting_model_stacking),]
running_models(stackingModels,preprocessor,3,X,y)
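
To make the stacking step less of a black box, the sketch below fits a small `StackingClassifier` on synthetic data and looks at the inputs the final estimator receives. The base models echo the ones used above, but the data and the `cv=3` setting are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

base = [
    ("knn", KNeighborsClassifier(n_neighbors=5)),
    ("dt", DecisionTreeClassifier(random_state=42)),
    ("nb", GaussianNB()),
]
stack = StackingClassifier(
    estimators=base,
    final_estimator=LogisticRegression(solver="liblinear", random_state=0),
    cv=3,  # base-model predictions for the final estimator come from held-out folds
)
stack.fit(X_demo, y_demo)

# For a binary target, each base model contributes one probability column.
meta_features = stack.transform(X_demo)
print(meta_features.shape)
print(stack.final_estimator_.coef_)
```

The coefficients of the logistic regression give a rough indication of how much weight the final estimator places on each base model.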