In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

We start by importing the libraries used in the dataset preprocessing and then loading the data:

In [None]:
#Importing Packages
import pandas as pd 
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import datetime
from sklearn.preprocessing import LabelEncoder

In [None]:
#Importing Data
df = pd.read_csv('../input/hotel-booking-demand/hotel_bookings.csv')

After loading the data, we checked the number of rows and columns in the table. We then displayed the first few rows to get a general idea of the data and to see whether there are any missing values, which appear as NaN:

In [None]:
df.head()
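The row and column counts mentioned above can be read from `df.shape`. Here is a minimal sketch on a small made-up frame; `toy` and its values are illustrative assumptions, not the hotel data:

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the hotel data (values are made up for illustration)
toy = pd.DataFrame({
    "hotel": ["Resort Hotel", "City Hotel", "City Hotel"],
    "children": [0.0, np.nan, 2.0],
})

n_rows, n_cols = toy.shape      # (number of rows, number of columns)
print(f"{n_rows} rows x {n_cols} columns")
print(toy.isna().sum())         # number of NaN cells in each column
```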

**Verifying Missing Values**

We check whether there are missing values in our dataset, then count them per column:

In [None]:
#Verifying the existence of missing values
df.isnull().values.any()

In [None]:
#Counting the missing values in each column
df.isnull().sum()

In [None]:
#Computing the percentage of missing values in each column
percent_missing = df.isnull().sum() * 100 / len(df)
missing_value_df = pd.DataFrame({'percent_missing': percent_missing})
missing_value_df.sort_values('percent_missing', inplace=True)
print(missing_value_df)

In [None]:
#Dropping the rows with missing values in 'children' or 'country' ('company' and 'agent' are handled later)
df = df[df['children'].notna()]
df = df[df['country'].notna()]

In [None]:
df.shape

**Merging arrival year, month and day into arrival_date**

Then we merge the three columns 'arrival_date_year', 'arrival_date_month' and 'arrival_date_day_of_month' into a single column called 'arrival_date', containing the day, month and year of the client's arrival in datetime form. To do this, we run the following code:

In [None]:
#Converting the month name in arrival_date_month (e.g. 'July') to its month number
df["arrival_date_month"]=pd.to_datetime(df['arrival_date_month'],format='%B').dt.month

In [None]:
#Combining year, month and day into one datetime variable
#(the columns are passed as Series rather than .values, so the result keeps df's index after the rows dropped above)
df["arrival_date"]=pd.to_datetime({"year":df["arrival_date_year"],"month":df["arrival_date_month"],"day":df["arrival_date_day_of_month"]})
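Index alignment matters when assigning a built date column back to a frame whose rows were filtered earlier. The sketch below uses a made-up frame with gaps in its index (`toy`, `misaligned` and the values are illustrative assumptions). It compares building the dates from the columns themselves with building them from bare arrays:

```python
import pandas as pd

# Toy frame with gaps in its index, as left behind by dropping rows (values are made up)
toy = pd.DataFrame(
    {"year": [2015, 2016, 2017], "month": [7, 8, 9], "day": [1, 15, 30]},
    index=[0, 2, 5],
)

# Built from the columns: the result carries the same index as the frame,
# so every row receives its own date
toy["date"] = pd.to_datetime(toy[["year", "month", "day"]])

# Built from bare arrays: the result gets a fresh 0..n-1 index, and the
# assignment then aligns on labels instead of positions
misaligned = pd.to_datetime({
    "year": toy["year"].values,
    "month": toy["month"].values,
    "day": toy["day"].values,
})
toy["date_from_arrays"] = misaligned

print(toy)
```

In the second column, the row labelled 2 receives the third date and the row labelled 5 receives NaT, which is the kind of missing value that would otherwise need filling afterwards.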

In [None]:
#Dropping the year, month and day columns
df=df.drop(columns=['arrival_date_year','arrival_date_month','arrival_date_day_of_month'])

In [None]:
#Visualizing the shape of our dataframe (rows,columns) again
df.shape

In [None]:
#Visualizing a sample of 10 rows of our dataframe
df.sample(10)

**Verifying that reservation_status_date occurs on or after arrival_date**

In [None]:
#Visualizing the types of our dataframe's variables
df.dtypes

In [None]:
#Transforming the reservation_status_date variable type to Datetime 
df["reservation_status_date"]=pd.to_datetime(df["reservation_status_date"], format = '%Y-%m-%d')

In [None]:
#Visualizing the types of our dataframe's variables again
df.dtypes
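Once both columns are datetimes, the comparison named in the heading can be done directly. The cells above only convert the type, so here is one possible sketch of the check on made-up rows; the toy frame and the name `is_consistent` are illustrative assumptions:

```python
import pandas as pd

# Made-up bookings: the second one has a status date before the arrival date
toy = pd.DataFrame({
    "arrival_date": pd.to_datetime(["2016-07-01", "2016-07-10", "2016-08-05"]),
    "reservation_status_date": pd.to_datetime(["2016-07-03", "2016-07-08", "2016-08-05"]),
})

# True where the status date falls on or after the arrival date
is_consistent = toy["reservation_status_date"] >= toy["arrival_date"]

print("Rows violating the rule:", int((~is_consistent).sum()))
print(toy[~is_consistent])
```

A booking cancelled before arrival would naturally carry an earlier status date, so flagged rows are probably worth inspecting before deciding whether to drop them.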

**Preprocessing Extensions**

Cleaning: to treat the missing values, we propose filling each empty cell with the median of the column it belongs to. This solution can be extended by adding a second column holding True or False, to indicate whether each value is original or was imputed. We apply the fill to the two columns “agent” and “company”. Note that the code below uses the column mean rather than the median and does not add the indicator column:

In [None]:
#Filling null values in these two columns with the mean of values of each column
for column in ['agent','company']:
    df[column] =df[column].fillna(df[column].mean())
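The median-plus-indicator variant proposed in the text could look like the following sketch. It runs on a made-up column, and the `_was_imputed` suffix is a hypothetical naming choice, not something taken from the notebook:

```python
import numpy as np
import pandas as pd

# Made-up 'agent' column with two missing entries
toy = pd.DataFrame({"agent": [9.0, np.nan, 240.0, np.nan, 14.0]})

for column in ["agent"]:
    # Record which cells were missing before they are overwritten
    toy[column + "_was_imputed"] = toy[column].isna()
    # median() skips NaN by default, so it only uses the observed values
    toy[column] = toy[column].fillna(toy[column].median())

print(toy)
```

The flag column has to be created before the fill, since afterwards the imputed cells can no longer be told apart from the original ones.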

In [None]:
#Counting the missing values in each variable
df.isnull().sum()

In [None]:
#Filling any remaining null values in arrival_date with the mean date
df['arrival_date'] = df['arrival_date'].fillna(df['arrival_date'].mean())

In [None]:
#Counting the missing values in each variable again
df.isnull().sum()

We check for duplicate rows and delete any we find, using the following command:

In [None]:
#Dropping the duplicated rows
df.drop_duplicates(inplace=True)

Transformation: to treat categorical variables properly, one option is to create one column per category of each variable. Each such column is filled with the values 0 and 1, where 1 means that the row belongs to the category and 0 means that it does not (a NULL becomes 0 in every column). In our case, we first specify the categorical variables, then apply the simpler label encoding, which replaces each category with an integer code:

In [None]:
#Transformation of categorical variables to numerical variables
categoricalV=["hotel","meal","country","market_segment","distribution_channel","reserved_room_type","assigned_room_type","deposit_type","customer_type"]
#'hotel' (position 0) is skipped here; it gets its own encoded column, hotel_Num
df[categoricalV[1:]]=df[categoricalV[1:]].astype('category')

In [None]:
#Replacing each category with an integer code, column by column
df[categoricalV[1:]]=df[categoricalV[1:]].apply(lambda x:LabelEncoder().fit_transform(x))

In [None]:
df['hotel_Num']=LabelEncoder().fit_transform(df['hotel'])
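The one-column-per-category scheme described in this section is usually called one-hot encoding, whereas the cells above use label encoding. For comparison, here is a minimal sketch of the one-hot version with `pd.get_dummies` on a made-up 'meal' column; the frame and its values are illustrative assumptions:

```python
import pandas as pd

# Made-up 'meal' column with one missing entry
toy = pd.DataFrame({"meal": ["BB", "HB", None, "BB"]})

# One 0/1 column per category; a missing value gets 0 in every column
one_hot = pd.get_dummies(toy["meal"], prefix="meal", dtype=int)

print(one_hot)
```

Unlike integer codes, this representation does not suggest an ordering between categories, at the cost of one extra column per category.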

In [None]:
df.dtypes

**Exploratory Analysis**

Dataset summary statistics – Date variables

The first aim is to create summary statistics for the dataset. For date variables, we use the describe() method with the additional argument datetime_is_numeric=True, which makes the method treat the dates as numeric so that statistics such as the mean and the quartiles can be computed:

In [None]:
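#Create dataset summary statistics for Date variables
#(datetime_is_numeric exists in pandas 1.1 to 1.5; it was removed in pandas 2.0, where describe() always treats dates as numeric)
df[["reservation_status_date","arrival_date"]].describe(datetime_is_numeric=True)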

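Dataset summary statistics – Categorical variables

Secondly, we apply the describe() method to the categorical variables specified in the previous part. The output differs this time and includes the mean and std: these variables were label-encoded as integers above, so describe() treats them as numeric (the 'hotel' column, which is still text, is left out of the summary):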

In [None]:
#Create dataset summary statistics for Categorical variables
df[categoricalV].describe()

Dataset summary statistics – Integer and numeric variables

We specify the numeric variables, then we apply the describe() method again, as follows:

In [None]:
#Create dataset summary statistics for Integer and numeric variables
numericV=["lead_time","arrival_date_week_number","stays_in_weekend_nights","stays_in_week_nights","adults","children","babies","is_repeated_guest","previous_cancellations","previous_bookings_not_canceled","booking_changes","days_in_waiting_list","adr","required_car_parking_spaces","total_of_special_requests"]
#No datetime argument is needed here, since none of these columns holds dates
df[numericV].describe()

The distribution of hotel type for cancellation

The distribution is plotted twice: once for the hotel type split by cancellation status, and once for the number of adults split by cancellation status. The code for the two plots is as follows:

In [None]:
#Plotting the distribution of cancellation in each type of hotel
plt.rcParams['figure.figsize'] = [10, 7]
sns.set(style = 'white', font_scale = 1.3)
# Plot (x is passed by keyword, which recent seaborn versions require)
dist = sns.countplot(x = 'hotel', hue = 'is_canceled', data = df, palette = 'Set2');
dist.set(title = "Distribution of the hotel based on cancellation");

Distribution of cancellation and Number of Adults

In [None]:
#Checking the distribution of cancellation and Number of Adults
dist = sns.countplot(x='adults',data=df,hue='is_canceled');
dist.set(title = "Adults Cancellations");

**Modeling**

From a demographic perspective, given sufficiently precise data about a booking, we want to predict whether it was made at a resort hotel or a city hotel. Supervised learning techniques such as Logistic Regression, KNN and SVM can accomplish this task. In other words, this is purely a classification problem, which segments individuals according to the target variable hotel. It helps the hotel divide its guests into groups based on the hotel type, which can support revenue management and increase profits.

***Logistic Regression***

In [None]:
#Importing train_test_split for splitting the data
from sklearn.model_selection import train_test_split
#Importing datetime
import datetime as dt

In [None]:
#Transforming Datetime variables to numerical variables (proleptic Gregorian ordinal: day count from 0001-01-01)
df['numerical_arrival_date']=df['arrival_date'].map(dt.datetime.toordinal)
df['numerical_reservation_status_date']=df['reservation_status_date'].map(dt.datetime.toordinal)

In [None]:
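#Transforming is_canceled and reservation_status to numerical variables
#(the result is assigned back instead of calling replace(inplace=True) on a column selection)
df["is_canceled"] = df["is_canceled"].replace({'not canceled': 0,'canceled':1})
df["reservation_status"] = df["reservation_status"].replace({'Canceled': 0,'Check-Out':1,'No-Show':2})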

In [None]:
#Defining X (the feature columns) and Y (the target, hotel_Num)
useful_columns = df.columns.difference(['hotel','hotel_Num','arrival_date','reservation_status_date'])
X = df[useful_columns]
Y = df["hotel_Num"].astype(int)

In [None]:
#Splitting the data into train data and test data
X_train,X_test,Y_train,Y_test = train_test_split(X,Y,test_size = 0.3,random_state=150)

test_size = 0.3 means that 30% of the initial data is reserved for testing the model and the remaining 70% for training it.
random_state is the seed of the random number generator that shuffles the data before the split. Fixing it (here to 150) makes the split reproducible from one run to the next.
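The effect of these two parameters can be checked on a small example. This is a sketch on made-up arrays (`X_toy`, `y_toy` and the split names are illustrative assumptions), showing the 70/30 sizes and that a fixed seed yields the same split twice:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 10 made-up samples with 2 features each
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.arange(10)

# Two calls with the same seed
a_train, a_test, _, _ = train_test_split(X_toy, y_toy, test_size=0.3, random_state=150)
b_train, b_test, _, _ = train_test_split(X_toy, y_toy, test_size=0.3, random_state=150)

print(len(a_train), len(a_test))          # sizes of the training and test parts
print(np.array_equal(a_test, b_test))     # whether both calls selected the same test rows
```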

In [None]:
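#Importing the model and the metrics needed for evaluating it
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.metrics import recall_score
from sklearn.metrics import precision_score
from sklearn.metrics import f1_score
from sklearn.metrics import accuracy_score
try:
    from sklearn.metrics import plot_confusion_matrix
except ImportError:
    #plot_confusion_matrix was removed in scikit-learn 1.2; from_estimator takes the same (estimator, X, y) arguments
    from sklearn.metrics import ConfusionMatrixDisplay
    plot_confusion_matrix = ConfusionMatrixDisplay.from_estimator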

In [None]:
#Training our Logistic Regression model and predicting on the test and train sets
logisticR = LogisticRegression()
logisticR.fit(X_train,Y_train)
Y_pred= logisticR.predict(X_test)
Y_train_pred = logisticR.predict(X_train)

In [None]:
#metrics and accuracy score 
print('Recall Score :',recall_score(Y_test,Y_pred))
print('Precision Score :',precision_score(Y_test,Y_pred))
print('F1 Score :',f1_score(Y_test,Y_pred))
print('-----------------------------------------------')
print('Accuracy Score :',accuracy_score(Y_test,Y_pred))

The model reaches an accuracy of 87% on the test set, which represents a good performance.
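The four scores printed above all derive from the counts in the confusion matrix. As a worked sketch on made-up labels (`y_true` and `y_hat` are illustrative, not the notebook's predictions), they can be recomputed by hand as follows:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

y_true = np.array([1, 1, 1, 0, 0, 0, 1, 0])   # made-up true labels
y_hat = np.array([1, 0, 1, 0, 1, 0, 1, 0])    # made-up predictions

# For binary 0/1 labels, ravel() returns the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

precision = tp / (tp + fp)                      # share of predicted positives that are right
recall = tp / (tp + fn)                         # share of actual positives that are found
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + tn + fp + fn)      # share of all predictions that are right

print(precision, recall, f1, accuracy)
```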

In [None]:
#Plotting the confusion matrix
plot_confusion_matrix(logisticR,X_test,Y_test)

***SVM - Support Vector Machine***

In [None]:
#Importing the needed packages for the SVM algorithm
from sklearn import svm
from sklearn.svm import SVC

In [None]:
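#Scaling every feature to the range [-1, 1]; the scaler is fitted on the training set only
from sklearn.preprocessing import MinMaxScaler
scaling = MinMaxScaler(feature_range=(-1,1)).fit(X_train)
#The same transformation is then applied to both sets
X_train = scaling.transform(X_train)
X_test = scaling.transform(X_test)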

In [None]:
#Defining an SVM classifier with a linear kernel and training it
from sklearn.svm import SVC
svclassifier = SVC(kernel='linear')
svclassifier.fit(X_train, Y_train)

In [None]:
#Predicting on the test set
Y_pred = svclassifier.predict(X_test)

In [None]:
#metrics and accuracy scores
print('Recall Score :',recall_score(Y_test,Y_pred))
print('Precision Score :',precision_score(Y_test,Y_pred))
print('F1 Score :',f1_score(Y_test,Y_pred))
print('-----------------------------------------------')
print('Accuracy Score :',accuracy_score(Y_test,Y_pred))

This gives an accuracy of 90% on the test set, which we consider very good.

In [None]:
#Plotting the confusion matrix
plot_confusion_matrix(svclassifier,X_test,Y_test)

***KNN - k-Nearest Neighbors***

In [None]:
#Importing the needed packages for the KNN algorithm
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.metrics import mean_squared_error
from math import sqrt

In this section, we plot the mean error of the predictions on the test set for every value of K from 1 to 39. In each iteration, the mean error of the predicted values on the test set is calculated and the result is appended to the error list:

In [None]:
error = []

# Calculating the error for K values from 1 to 39 (range(1, 40) stops before 40)
for i in range(1, 40):
    knn = KNeighborsClassifier(n_neighbors=i)
    knn.fit(X_train, Y_train)
    pred_i = knn.predict(X_test)
    error.append(np.mean(pred_i != Y_test))

In [None]:
plt.figure(figsize=(12, 6))
plt.plot(range(1, 40), error, color='red', linestyle='dashed', marker='o',markerfacecolor='blue', markersize=10)
plt.title('Error Rate K Value')
plt.xlabel('K Value')
plt.ylabel('Mean Error')

Note that K = 1 gives the lowest error on the test set. To train the KNN algorithm, we rely on Scikit-Learn. The first step is to import the KNeighborsClassifier class from the sklearn.neighbors module. This class is instantiated with one parameter, n_neighbors, which is the value of K. We then fit the classifier on the training data, and the last step is to make predictions on our test data. To do this, we run the following script:

In [None]:
#Defining a KNN classifier, training the model and predicting on the test set
classifier = KNeighborsClassifier(n_neighbors=1)
classifier.fit(X_train, Y_train)
Y_pred = classifier.predict(X_test)
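The value of K above was chosen by comparing errors on the test set. A common alternative, shown here only as a sketch, is to choose K by cross-validation on the training data so that the test set stays untouched until the final evaluation. The example uses synthetic data from `make_classification` as a stand-in for the scaled training set; the data, the range of K and the names `cv_error` and `best_k` are all illustrative assumptions:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the scaled training set (not the hotel data)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

k_values = range(1, 16)
cv_error = []
for k in k_values:
    knn = KNeighborsClassifier(n_neighbors=k)
    # Mean accuracy over 5 folds, computed on the training data only
    scores = cross_val_score(knn, X_demo, y_demo, cv=5)
    cv_error.append(1 - scores.mean())

# K with the lowest cross-validated error
best_k = k_values[int(np.argmin(cv_error))]
print("K with the lowest cross-validated error:", best_k)
```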

In [None]:
#Showing the metrics and accuracy scores
print('Recall Score :',recall_score(Y_test,Y_pred))
print('Precision Score :',precision_score(Y_test,Y_pred))
print('F1 Score :',f1_score(Y_test,Y_pred))
print('-----------------------------------------------')
print('Accuracy Score :',accuracy_score(Y_test,Y_pred))

The results show that our KNN algorithm was able to classify the test set records with an accuracy of 94%, which is excellent given the high dimensionality of our dataset.

In [None]:
#Plotting the confusion matrix
plot_confusion_matrix(classifier,X_test,Y_test)

In terms of accuracy, we obtained 94% for KNN, 90% for SVM, and 86% for Logistic Regression. KNN scores highest: about 94% of its predictions on the test set are correct, so it is the best model to adopt here. This is consistent with the literature, which reports that KNN tends to do better than SVM when the number of training samples is much larger than the number of features. KNN is also easy to implement. It is slower to run than Logistic Regression, but not as slow as SVM is to train. From a broader perspective, KNN and SVM support nonlinear solutions and are non-parametric, whereas Logistic Regression is a parametric model that deals with linear solutions. Once trained, SVM is less computationally demanding than KNN at prediction time and is easier to interpret, but it can only identify a limited set of patterns. KNN, on the other hand, can capture very complex patterns, but its output is more difficult to interpret.

***Clustering with K-Means***

Having used supervised algorithms in the first part, we now consider an unsupervised problem: clustering with K-Means. We analyze the resulting clusters to identify the most profitable clients in our dataset based on lead time and ADR. The first challenge with K-Means is to determine the optimal number of clusters. To do so, we used the Elbow method:

In [None]:
import sklearn.cluster as cluster

In [None]:
df_Short = df[['lead_time','adr']]

In [None]:
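#Fitting K-Means for k = 1 to 11 and recording the within-cluster sum of squares (WSS)
K=range(1,12)
wss = []
for k in K:
    kmeans=cluster.KMeans(n_clusters=k,init="k-means++")
    kmeans=kmeans.fit(df_Short)
    #inertia_ holds the WSS of the fitted model
    wss_iter = kmeans.inertia_
    wss.append(wss_iter)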
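The quantity collected in `wss` is the sum of squared distances from each point to the centre of its own cluster. The following sketch on made-up 2-D points recomputes it by hand and compares it with `inertia_`; the data, `n_init=10` and `random_state=0` are illustrative assumptions:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Two made-up blobs of 50 points each in a 2-D space
points = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(8, 1, (50, 2))])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)

# Centre of the cluster each point was assigned to
own_centres = km.cluster_centers_[km.labels_]
# WSS: squared distance from every point to its own centre, summed over all points
wss_manual = ((points - own_centres) ** 2).sum()

print(wss_manual, km.inertia_)
```

Adding clusters can only lower this value, which is why the Elbow method looks for the point where the decrease flattens instead of looking for the minimum.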

In [None]:
mycenters = pd.DataFrame({'Clusters' : K, 'WSS' : wss})
mycenters

In [None]:
sns.scatterplot(x = 'Clusters', y = 'WSS', data = mycenters, marker="+")

To determine the optimal number of clusters, we select the value of k after which the distortion (WSS) begins to decrease linearly, i.e. the "elbow" of the curve. From the plot, we conclude that the optimal number of clusters for the data is 4.
We therefore ran the k-means algorithm on lead_time and ADR with 4 clusters and displayed the cluster centers:

In [None]:
kmeans = cluster.KMeans(n_clusters=4 ,init="k-means++")

In [None]:
kmeans = kmeans.fit(df[['lead_time','adr']])

In [None]:
kmeans.cluster_centers_

In [None]:
df['Clusters'] = kmeans.labels_

Then we displayed the number of observations belonging to each cluster:

In [None]:
df['Clusters'].value_counts()

Finally, we plotted the clusters:

In [None]:
sns.lmplot(x="lead_time", y="adr",hue = 'Clusters',  data=df)
plt.ylim(0, 600)
plt.xlim(0, 800)
plt.show()

The clients with the lowest lead time and the highest ADR, i.e. the clients in the green cluster, are considered the most profitable, while the red cluster shows the lowest ADR and the highest lead time (the least profitable).
As for unsupervised learning in general, it is important to remember that it is largely a method of exploratory analysis: the goal is not necessarily to predict, but rather to reveal information about the data that may not have been considered before. For example, after looking at the graph in our case, we can ask questions such as: why do some customers have a shorter lead time than others? Are customers in certain countries more likely to match this profile? And so on.
These are all questions that the k-means clustering algorithm may not answer for us directly, but reducing the data to separate clusters provides a solid baseline from which to ask them.
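One simple way to start on such questions is to summarise each cluster numerically. This is a sketch on made-up rows; the values, the cluster labels and the `n_bookings` column name are illustrative assumptions:

```python
import pandas as pd

# Made-up bookings that already carry a cluster label
toy = pd.DataFrame({
    "lead_time": [5, 10, 200, 220, 8, 210],
    "adr": [180.0, 170.0, 60.0, 55.0, 190.0, 65.0],
    "Clusters": [0, 0, 1, 1, 0, 1],
})

# Average lead time and ADR per cluster, plus the number of bookings in each
profile = toy.groupby("Clusters")[["lead_time", "adr"]].mean()
profile["n_bookings"] = toy.groupby("Clusters").size()

print(profile)
```

The same grouping can be extended with further columns, such as country, to see which kinds of customers fall into each cluster.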

**Conclusion**

In short, we learned three different ways to classify data with Python in the supervised setting (KNN, SVM, LR), and one unsupervised method. For the supervised models, we found that KNN performs best in our case. For the unsupervised part, K-means allowed us to visualize the most and the least profitable customers, based on two variables: lead_time (the number of days elapsed between the date the reservation was entered in the PMS and the client's arrival date) and adr (the average daily rate, defined as the sum of all accommodation transactions divided by the total number of nights). We used the result of this algorithm, which takes the form of a graph, to ask specific questions about how profitability varies across customers, in order to give the hotel manager ideas for making their customers more profitable.
Overall, the ideal machine learning model very often depends on the problem. There will be some datasets where KNN could fail miserably, so it is good practice to implement the other models for each problem as well, in order to judge the performance of each and choose the best one to adopt.
It all comes down to specifying the variables to be processed and choosing the right machine learning model.