## Step 1: Importing required libraries

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

## Step 2: Importing the hotel booking data set

In [None]:
df= pd.read_csv('../input/hotel-booking/hotel_booking.csv')

#### This data set contains booking information for a city hotel and a resort hotel, and includes information such as when the booking was made, length of stay, the number of adults, children, and/or babies, and the number of available parking spaces, among other things.

In [None]:
df.info()

## Step 3: Data Preparation

### Coping with missing values

In [None]:
# Calculate the percentage of missing values in each column (feature), keep only columns with missing values, and sort them in ascending order
def missing_percentage(df):
    nan_percent= 100*(df.isnull().sum()/len(df))
    nan_percent= nan_percent[nan_percent>0].sort_values()
    return nan_percent
nan_percent= missing_percentage(df)
print(nan_percent)

In [None]:
plt.figure(figsize=(12,6))
sns.barplot(x=nan_percent.index, y=nan_percent, color=(0.2, 0.4, 0.6, 0.6), edgecolor='blue')
plt.xticks(rotation=90)

#### As indicated, more than 90% of the values in the company column are missing, so we drop this feature.

In [None]:
df = df.drop(['company'],axis=1)

#### A missing value in the children column most likely means that there were no children; hence, we fill these values with 0.

In [None]:
df["children"]= df["children"].fillna(0)

In [None]:
df['agent'][df['agent'].isnull()]

#### It seems that a missing value in agent means that no agent was involved; therefore, we fill these values with 0.

In [None]:
df["agent"]= df["agent"].fillna(0)

#### Here, we keep only the numeric columns by excluding every column of object (string) dtype.

In [None]:
df_num= df.select_dtypes(exclude='object')

In [None]:
df_num.info()

#### As demonstrated, there are no missing data.

## Step 4: Exploratory Data Analysis

In [None]:
fig= plt.figure(figsize=(10,10), dpi=500)
plt.rcParams['font.size'] = '8'
sns.heatmap(df_num.corr(), annot=True, cmap="YlGnBu", vmin=-1, vmax=1)

#### In the above heatmap,

#### *  -1 indicates a perfectly negative linear correlation between two variables;
#### *  0 indicates no linear correlation between two variables;
#### *  1 indicates a perfectly positive linear correlation between two variables.
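#### The three reference values can be reproduced on a small toy example. This is a minimal sketch with made-up numbers (the array `x` and the three derived series are assumptions, not columns of the hotel data set):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

# Perfectly positive: y grows linearly with x
r_pos = np.corrcoef(x, 2 * x + 1)[0, 1]
# Perfectly negative: y falls linearly as x grows
r_neg = np.corrcoef(x, -3 * x + 10)[0, 1]
# No linear relationship: a parabola that is symmetric around the mean of x
r_zero = np.corrcoef(x, (x - 3) ** 2)[0, 1]

print(round(r_pos, 3), round(r_neg, 3), round(r_zero, 3))
```

#### Note that the third series depends strongly on `x`, yet its linear correlation is zero: the coefficient only measures linear association.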

In [None]:
corr_matrix = df_num.corr()
print(corr_matrix["is_canceled"].sort_values(ascending=False))

#### Next, we drop the features that are almost uncorrelated with is_canceled (absolute correlation below 0.01).

In [None]:
df_num = df_num.drop(['arrival_date_day_of_month', 'stays_in_weekend_nights', 'children', 'arrival_date_week_number'],axis=1)
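#### The same selection could also be derived from the correlation matrix instead of typing the column names by hand. The following is a sketch under assumptions: the helper `low_correlation_features` and the toy frame are hypothetical and only illustrate the thresholding idea.

```python
import pandas as pd

def low_correlation_features(frame, target, threshold=0.01):
    """Return columns whose absolute Pearson correlation with `target` is below `threshold`."""
    corr_with_target = frame.corr()[target].drop(target)
    return corr_with_target[corr_with_target.abs() < threshold].index.tolist()

# Toy data (hypothetical): 'signal' tracks the target, 'noise' is uncorrelated with it
toy = pd.DataFrame({
    "target": [0, 0, 1, 1],
    "signal": [1, 2, 3, 4],
    "noise": [1, -1, -1, 1],
})
to_drop = low_correlation_features(toy, "target")
print(to_drop)
```

#### On the notebook's data this would presumably be called as `low_correlation_features(df_num, 'is_canceled')`, and its result passed to `drop(columns=...)`.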

In [None]:
fig= plt.figure(figsize=(6,6), dpi=300)

sns.countplot(data=df_num, x='is_canceled')

In [None]:
df_num['is_canceled'].value_counts()

#### The counts show that the data are imbalanced: bookings that were not canceled (is_canceled = 0) clearly outnumber bookings that were canceled (is_canceled = 1).

# Step 5: Determining X (Features) and y (Target Variable)

In [None]:
y = df_num['is_canceled']
X = df_num.drop('is_canceled', axis = 1)

# Step 6: Splitting Train and Test

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y,random_state=101,test_size=0.3)
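#### With an imbalanced target, one option (not used in the split above) is the `stratify` argument of `train_test_split`, which keeps the class proportions the same in the train and test parts. A minimal sketch on a made-up target follows; `X_toy` and `y_toy` are assumptions for illustration only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy imbalanced target (hypothetical): 80 zeros and 20 ones
X_toy = np.arange(100).reshape(-1, 1)
y_toy = np.array([0] * 80 + [1] * 20)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=101, stratify=y_toy
)

# Share of the positive class in the full data, the train part, and the test part
print(y_toy.mean(), y_tr.mean(), y_te.mean())
```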

# Step 7: Coping with imbalanced data applying SMOTE

#### Imbalanced data can bias a classifier toward the majority class. Thus, we employ SMOTE (Synthetic Minority Over-sampling Technique), an over-sampling approach in which the minority class is over-sampled by creating “synthetic” examples <span style="color:crimson;">(Chawla et al., 2002)</span>.
#### The SMOTE algorithm is a popular approach for oversampling the minority class. It can be used to reduce the imbalance or to make the class distribution even. Simply duplicating examples in the minority class would also balance the classes, but duplicates do not add any new information. Instead, new minority-class examples can be synthesized from existing examples in the training dataset. These new examples will be “close” to existing examples in the feature space, but different in small, random ways <span style="color:crimson;">(Brownlee, 2021)</span>.

### References

#### * Chawla, N.V., Bowyer, K.W., Hall, L.O., Kegelmeyer, W.P. (2002). SMOTE: Synthetic minority over-sampling technique. Journal of Artificial Intelligence Research, 16, 321-357.
#### * Brownlee, J. (2021). Imbalanced Classification With Python. https://machinelearningmastery.com/imbalanced-classification-with-python-7-day-mini-course/ (Accessed 12 August 2021)
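#### The interpolation idea described above can be sketched with NumPy alone. This is a simplified illustration, not the imbalanced-learn implementation: the function `smote_like_samples`, its parameters, and the toy points are all assumptions.

```python
import numpy as np

def smote_like_samples(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between a minority sample
    and one of its k nearest minority-class neighbors (simplified SMOTE idea)."""
    rng = np.random.default_rng(seed)
    minority = np.asarray(minority, dtype=float)
    # Pairwise Euclidean distances between minority samples
    diffs = minority[:, None, :] - minority[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))
    np.fill_diagonal(dists, np.inf)          # a point is not its own neighbor
    neighbors = np.argsort(dists, axis=1)[:, :k]

    synthetic = np.empty((n_new, minority.shape[1]))
    for row in range(n_new):
        i = rng.integers(len(minority))      # pick a minority sample at random
        j = rng.choice(neighbors[i])         # pick one of its k nearest neighbors
        gap = rng.random()                   # random position along the segment
        synthetic[row] = minority[i] + gap * (minority[j] - minority[i])
    return synthetic

# Toy minority-class points (hypothetical)
minority_points = np.array([[1.0, 1.0], [1.5, 1.2], [2.0, 1.0], [1.2, 1.8]])
new_points = smote_like_samples(minority_points, n_new=5)
print(new_points.shape)
```

#### Each synthetic point lies on the line segment between two existing minority points, which is why it is “close” to real examples without duplicating them.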

In [None]:
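# Requires the imbalanced-learn package: pip install imbalanced-learn
from imblearn.over_sampling import SMOTE
# Oversample only the minority class until both classes have the same size
smote = SMOTE(random_state=101, sampling_strategy='minority')
# Note: the full data set is resampled before splitting, so the later test split also contains synthetic samples
X_sm, y_sm = smote.fit_resample(X, y)
y_sm.value_counts()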

In [None]:
X_train_sm, X_test_sm, y_train_sm, y_test_sm = train_test_split(X_sm, y_sm)

# Step 8: Classification Machine Learning Algorithms

## 1. Decision Tree Algorithm

#### A classification tree is used to predict a qualitative response rather than a quantitative one <span style="color:crimson;">(James et al., 2021)</span>.

### References

#### * James, G., Witten, D., Hastie, T., Tibshirani, R. (2021). An Introduction to Statistical Learning, 2nd Edition. New York: Springer.

### Regular Decision Tree Algorithm 

In [None]:
from sklearn.tree import DecisionTreeClassifier
Model_DecisionTree = DecisionTreeClassifier()
Model_DecisionTree.fit(X_train, y_train)
y_pred = Model_DecisionTree.predict(X_test)

from sklearn.metrics import confusion_matrix,classification_report
print(classification_report(y_test,y_pred))

### Decision Tree Algorithm using SMOTE

In [None]:
from sklearn.tree import DecisionTreeClassifier
Model_DecisionTree = DecisionTreeClassifier()
Model_DecisionTree.fit(X_train_sm, y_train_sm)
SMOTE_y_preds = Model_DecisionTree.predict(X_test_sm)

from sklearn.metrics import confusion_matrix,classification_report
print(classification_report(y_test_sm,SMOTE_y_preds))

In [None]:
from sklearn.tree import plot_tree
pruned_tree = DecisionTreeClassifier(max_leaf_nodes=7)
pruned_tree.fit(X_train_sm,y_train_sm)
plt.figure(figsize=(6,10),dpi=200)
plot_tree(pruned_tree,filled=True,feature_names=X.columns);

## 2. K Nearest Neighbors Algorithm

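#### K nearest neighbors (KNN) assigns a label to a new observation according to the labels of the K training observations closest to it. The estimated probability of class $j$ at a point $x_0$ is the fraction of points in its neighborhood $N_0$ (the K nearest training points) whose label is $j$: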

#### $\Pr(Y=j \mid X=x_0) = \frac{1}{K} \sum_{i \in N_0} I(y_i = j)$
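#### The formula can be evaluated directly with NumPy. The following is a minimal sketch, assuming Euclidean distance; the helper `knn_class_probability` and the toy points are hypothetical.

```python
import numpy as np

def knn_class_probability(X_train, y_train, x0, j, K):
    """Estimate Pr(Y = j | X = x0) as the share of the K nearest training points with label j."""
    distances = np.linalg.norm(X_train - x0, axis=1)   # Euclidean distance to x0
    nearest = np.argsort(distances)[:K]                # indices of the neighborhood N_0
    return np.mean(y_train[nearest] == j)              # (1/K) * sum of I(y_i = j)

# Toy training set (hypothetical): three points near the origin, two far away
X_toy = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [5.0, 5.0], [6.0, 5.0]])
y_toy = np.array([0, 0, 1, 1, 1])
x_new = np.array([0.2, 0.2])

p_class0 = knn_class_probability(X_toy, y_toy, x_new, j=0, K=3)
print(p_class0)
```

#### With K = 3, the neighborhood consists of the three points near the origin, two of which carry label 0, so the estimate for class 0 is 2/3.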

#### *Because KNN relies on distances, features measured on larger scales would dominate the result; we therefore scale the features before training the model.*
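#### The effect of scaling on distances can be seen on a tiny example. This is a sketch with made-up values (the two columns are hypothetical stand-ins for a large-valued and a small-valued feature); the standardization is written out by hand as subtracting the column mean and dividing by the column standard deviation.

```python
import numpy as np

# Hypothetical features on very different scales, e.g. a count of days and a count of guests
X_toy = np.array([[300.0, 1.0], [310.0, 4.0], [10.0, 1.0]])

# Standardize each column: subtract its mean and divide by its standard deviation
X_scaled = (X_toy - X_toy.mean(axis=0)) / X_toy.std(axis=0)

def distance(a, b):
    """Euclidean distance between two points."""
    return np.linalg.norm(a - b)

# Raw distances from the first point: the large-valued first column dominates
raw_01, raw_02 = distance(X_toy[0], X_toy[1]), distance(X_toy[0], X_toy[2])
# Scaled distances from the first point: both columns now contribute
scaled_01, scaled_02 = distance(X_scaled[0], X_scaled[1]), distance(X_scaled[0], X_scaled[2])

print(raw_01, raw_02)
print(scaled_01, scaled_02)
```

#### Before scaling, the second point looks far closer to the first than the third does; after scaling, the difference in the small-valued column counts as well and the two distances become comparable.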

### Regular K Nearest Neighbors Algorithm

In [None]:
# Feature Scaling
from sklearn.preprocessing import StandardScaler
scaler= StandardScaler()
scaler.fit(X_train)
scaled_X_train= scaler.transform(X_train)
scaled_X_test= scaler.transform(X_test)

# Training the model
from sklearn.neighbors import KNeighborsClassifier
knn_model= KNeighborsClassifier(n_neighbors=1)
knn_model.fit(scaled_X_train, y_train)
y_pred= knn_model.predict(scaled_X_test)

from sklearn.metrics import confusion_matrix,classification_report
print(classification_report(y_test,y_pred))

### K Nearest Neighbors Algorithm using SMOTE

In [None]:
# Feature Scaling
from sklearn.preprocessing import StandardScaler
scaler= StandardScaler()
scaler.fit(X_train_sm)
scaled_X_train_sm= scaler.transform(X_train_sm)
scaled_X_test_sm= scaler.transform(X_test_sm)

# Training the model
from sklearn.neighbors import KNeighborsClassifier
knn_model= KNeighborsClassifier(n_neighbors=1)
knn_model.fit(scaled_X_train_sm, y_train_sm)
y_pred_sm= knn_model.predict(scaled_X_test_sm)

from sklearn.metrics import confusion_matrix,classification_report
print(classification_report(y_test_sm,y_pred_sm))

## 3. Logistic Regression Algorithm

#### Logistic Regression turns the output of a linear model into a probability for classification by passing it through the sigmoid (logistic) function:

#### $\sigma (x) = 1/(1 + e^{-x})$

#### Hence, the output always lies between 0 and 1.
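#### The mapping can be checked numerically. This is a minimal sketch: the `scores` array is a made-up set of linear-model outputs, and the 0.5 cut-off is the usual default threshold, assumed here for illustration.

```python
import numpy as np

def sigmoid(x):
    """Logistic function: maps any real number to the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

# Hypothetical linear scores (intercept plus weighted sum of features)
scores = np.array([-5.0, -1.0, 0.0, 1.0, 5.0])
probabilities = sigmoid(scores)
predicted_class = (probabilities >= 0.5).astype(int)  # threshold at 0.5

print(probabilities.round(3))
print(predicted_class)
```

#### Large negative scores map to probabilities near 0, large positive scores to probabilities near 1, and a score of 0 maps to exactly 0.5.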

### Regular Logistic Regression Algorithm

In [None]:
# Feature Scaling
from sklearn.preprocessing import StandardScaler
scaler= StandardScaler()
scaler.fit(X_train)
scaled_X_train= scaler.transform(X_train)
scaled_X_test= scaler.transform(X_test)

# Training the model
from sklearn.linear_model import LogisticRegression
log_model= LogisticRegression()
log_model.fit(scaled_X_train, y_train)
y_pred= log_model.predict(scaled_X_test)

from sklearn.metrics import confusion_matrix,classification_report
print(classification_report(y_test,y_pred))

# Discussion
    
#### As shown, we compared five classification setups, i.e., Decision Tree, Decision Tree using SMOTE, K Nearest Neighbors, K Nearest Neighbors using SMOTE, and Logistic Regression, in order to predict cancellation. The results show that the Decision Tree using SMOTE outperforms the other setups, reaching an accuracy of 0.84 for cancellation forecasting.
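#### Accuracy, the metric used in this comparison, is the fraction of predictions that match the true labels. A minimal sketch follows; the labels and predictions are made up and are not taken from the models above.

```python
import numpy as np

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    return float(np.mean(y_true == y_pred))

# Hypothetical labels and predictions, for illustration only
y_true_toy = [0, 0, 1, 1, 0, 1, 0, 0, 1, 0]
y_pred_toy = [0, 0, 1, 0, 0, 1, 1, 0, 1, 0]
acc_toy = accuracy(y_true_toy, y_pred_toy)
print(acc_toy)
```

#### Here 8 of the 10 toy predictions are correct, so the accuracy is 0.8.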

In [None]:
# creating the dataset
fig= plt.figure(figsize=(4,2), dpi=250)
plt.rcParams['font.size'] = '4'
data = {'Decision Tree Algorithm':0.82, 'Decision Tree Algorithm using SMOTE':0.84, 'K Nearest Neighbors':0.80,
        'K Nearest Neighbors using SMOTE':0.82, 'Logistic Regression':0.74}
Algorithms= list(data.keys())
Accuracy= list(data.values())

 
# creating the bar plot
plt.bar(Algorithms, Accuracy, color ='maroon',
        width = 0.7)
 
plt.xlabel("Classification machine learning algorithms")
plt.ylabel("Accuracy")
plt.xticks(rotation=90)
plt.title("A comparison among various classification machine learning algorithms")


In [None]:
Results = {'Accuracy':[0.82, 0.84, 0.80, 0.82, 0.74]}  
Dataframe_of_results = pd.DataFrame(Results, index =['Decision Tree Algorithm', 'Decision Tree Algorithm using SMOTE', 'K Nearest Neighbors', 'K Nearest Neighbors using SMOTE', 'Logistic Regression'])
Dataframe_of_results.sort_values('Accuracy', ascending=False)