## Importing Libraries: ##

First, all the necessary libraries are imported:
- **Matplotlib** is a comprehensive library for creating static, animated, and interactive visualizations in Python. Although there are other popular libraries like seaborn, I decided to stick with matplotlib for this basic project.
- **NumPy** provides a high-performance multidimensional array and basic tools to compute with and manipulate these arrays.
- **Pandas** is a fast, powerful, flexible, and easy-to-use open-source data analysis and manipulation tool. It is used here for dataframe operations and visualization.
- **Sklearn** (scikit-learn) is a machine learning library. Here, sklearn's support vector classifier, random forest classifier, MLP classifier, train_test_split, make_pipeline and StandardScaler are used.

In [None]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neural_network import MLPClassifier

Next, the dataset is imported. It is a hotel bookings dataset containing information about customers who booked either a resort hotel or a city hotel between 2015 and 2017. The dataset is in .csv format and is loaded into a pandas dataframe 'df'. The df.head() function returns the first few rows (5 by default) of the dataframe along with the column headers, which is very useful for a quick glimpse.

In [None]:
df = pd.read_csv("../input/hotel-booking-demand/hotel_bookings.csv")
df.head()

A quick summary of the dataframe can also be viewed with the df.describe() function. It summarises each column: for text columns, the number of unique values and the most frequent entry; for numeric columns, statistical parameters such as the mean and standard deviation.
Note that the include = 'all' parameter ensures that every column of the input is included in the output.

In [None]:
df.describe(include = 'all')
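
To make the effect of include = 'all' concrete, here is a minimal sketch on a tiny made-up frame (the values are invented for illustration and are not taken from the hotel dataset). Text columns fill the unique/top/freq rows, while numeric columns fill the mean/std/min/max rows; the cells that do not apply are left as NaN.

```python
import pandas as pd

# Toy frame (made-up values) with one text column and one numeric column
toy = pd.DataFrame({
    "hotel": ["City Hotel", "City Hotel", "Resort Hotel"],
    "lead_time": [10, 20, 30],
})

summary = toy.describe(include="all")

# Text columns get count / unique / top / freq
print(summary.loc[["unique", "top", "freq"], "hotel"])
# Numeric columns get mean / std / min / quartiles / max
print(summary.loc[["mean", "min", "max"], "lead_time"])
```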

To see which column contains which type of data, the df.dtypes attribute is used. It holds each column's datatype. In this dataset there are three types of data: int64, float64 and object.

In [None]:
df.dtypes

The dataset contains many empty cells ('NA' entries). These hinder further data analysis and classification, so they must be handled as a data preprocessing step. To first see how many of these 'NA' values each column contains, df.isna().sum() is used.

In [None]:
df.isna().sum()

Four columns (children, country, agent, company) contain 'NA' entries. These are handled differently here:
- 'NA' countries are replaced with the term 'NRF' (no record found)
- 'NA' entries in agent and company are simply replaced with 0
- 'NA' entries in children are replaced with the mean of the whole column, truncated to an integer
<br>Another call to df.isna().sum() validates that no 'NA' entries remain.

In [None]:
df['country']=df['country'].fillna('NRF')
df['agent'] = df['agent'].fillna(0.0)
df['company'] = df['company'].fillna(0.0)
df['children'] = df['children'].fillna(int(df['children'].mean()))
df.isna().sum()
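
The same per-column strategy can also be expressed as a single fillna() call with a dictionary of fill values. This is only a sketch on a small made-up frame; the name fill_values and the toy data are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame (made-up values) with one missing entry per column
toy = pd.DataFrame({
    "country": ["PRT", None, "GBR"],
    "agent": [9.0, np.nan, 240.0],
    "children": [0.0, 2.0, np.nan],
})

# One fill value per column; mean() skips NaN, and int() truncates the result
fill_values = {
    "country": "NRF",
    "agent": 0.0,
    "children": int(toy["children"].mean()),
}
filled = toy.fillna(value=fill_values)

print(filled)
print(filled.isna().sum())
```

Keeping the fill values in one dictionary makes it easy to see at a glance how each column is treated.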

df.corr() computes the pairwise correlation of columns. Only numeric columns can be correlated, so the result is a square matrix with one row and one column per numeric column, and each cell holds the correlation of the corresponding pair of columns. <br>
**Remember:** A positive correlation is a relationship between two variables in which both variables move in the same direction. A negative correlation is a relationship between two variables in which one variable increases as the other decreases, and vice versa.

In [None]:
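# Restrict to numeric columns; pandas 2.0+ raises an error on text columns otherwise
# (the numeric_only parameter needs pandas >= 1.5)
corr_matrix = df.corr(numeric_only=True)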
corr_matrix
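
As a small illustration of positive and negative correlation, the sketch below builds a made-up frame in which one column rises with x and another falls as x rises. The column names are invented for this example.

```python
import pandas as pd

toy = pd.DataFrame({"x": [1, 2, 3, 4, 5]})
toy["up"] = 2 * toy["x"] + 1      # rises with x
toy["down"] = -3 * toy["x"]       # falls as x rises
toy["label"] = list("abcde")      # text column, left out of the correlation

# numeric_only=True keeps only the numeric columns
corr = toy.corr(numeric_only=True)
print(corr.round(2))
```

Perfectly linear relationships give the extreme values of +1 and -1; real data usually falls somewhere in between.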

To see the column names again and decide which data to view and use for classification, the df.columns attribute can be used. It returns all the column labels of the DataFrame. For a description of what each column represents, see the original paper [1].

In [None]:
df.columns

Pandas also contains direct visualization methods that wrap plt.plot(). On a DataFrame, plot() is a convenient way to plot all of the columns with labels. Here a box plot over every numeric column is drawn to find the outliers in each column. As there are many columns in the dataset, the labels on the x-axis overlap. In such cases, unimportant columns can be excluded.

In [None]:
df.plot(kind = 'box', figsize = (9, 5))
plt.show()

## Data Visualization ##

Data visualization is the graphical representation of information and data. It provides a quick, clear understanding of the information. Graphic representations let us view large volumes of data in an understandable and coherent way, which in turn helps us comprehend the information and draw conclusions and insights.<br>
As the first step, we look at how the total number of customers varies by month for the years 2015, 2016 and 2017. The value_counts() method returns the number of bookings in each year. sort_values() sorts the resulting counts (for a dataframe with multiple columns, the 'by' parameter, which names the column to sort by, must be specified) and to_dict() converts the result to a Python dictionary.<br>
A loop runs over each year, and the dataframe containing arrival year and arrival month is split by year by combining [groupby()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.groupby.html?highlight=groupby#pandas.DataFrame.groupby) and [get_group()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.core.groupby.GroupBy.get_group.html?highlight=get_group#pandas.core.groupby.GroupBy.get_group). A matplotlib line plot is used, with 3 lines representing the 3 years.

In [None]:
fig = plt.figure(figsize=(12, 4))
year_freq = df['arrival_date_year'].value_counts().sort_values().to_dict()
year, freq = zip(*year_freq.items())
ytick = []
for i in range(len(year)):
    month_f = df[['arrival_date_year', 'arrival_date_month']]
    month_f = month_f.groupby(['arrival_date_year']).get_group(year[i])
    month_f = pd.Categorical(month_f['arrival_date_month'], categories = ['January', 'February', 'March', 'April', 'May', 'June', 'July', 'August', 'September', 'October', 'November', "December"])
    month_f = month_f.value_counts().to_dict()
    month, m_freq = zip(*month_f.items())
    ytick.append(m_freq)
    plt.plot(month, m_freq)
ytick = np.array(ytick).reshape(3,12)
for j in range(3):
    for i in range(12):
        if ytick[j, i]!=0:
            plt.text(i, ytick[j, i], str(ytick[j, i]))

plt.legend(year)
plt.show()
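
The month-by-year counts can also be built as a single table with pd.crosstab(), which may be easier to read than the loop. The following is a sketch on made-up data with a shortened month list; the names months and monthly are assumptions for illustration.

```python
import pandas as pd

months = ["January", "February", "March"]  # shortened list for the example
toy = pd.DataFrame({
    "arrival_date_year": [2015, 2015, 2016, 2016, 2016],
    "arrival_date_month": ["March", "January", "January", "March", "March"],
})

# Rows are months, columns are years, cells are booking counts.
# reindex() puts the months in calendar order and fills missing months with 0.
monthly = pd.crosstab(
    toy["arrival_date_month"], toy["arrival_date_year"]
).reindex(months, fill_value=0)
print(monthly)
```

Calling monthly.plot() on such a table would then draw one line per year.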

From the plot, it can be seen that the number of customers follows a similar pattern in the same months of different years. As the dataset has no data for early 2015 and late 2017, the trend cannot be defined exactly. However, bookings clearly increase from April to June and again in September and October.<br>
Now, to see the top 10 countries with the most bookings in this dataset, the value_counts() function is used again. This time a bar chart is used, with the plt.text() function annotating the exact number of bookings for each country.

In [None]:
country_freq = df['country'].value_counts().iloc[0:10].to_dict()
country, freq = zip(*country_freq.items()) 
plt.bar(country, freq)
plt.xticks(country)
peak = plt.plot(np.arange(10), freq, 'r.')
for i in range(len(freq)):
    plt.text(i, freq[i]+1000, str(freq[i]))
plt.show()

To visualize how long customers wait on the waiting list depending on their day of arrival, the waiting days are first summed for each day of the month and stored in a dictionary. A horizontal bar chart is used, with the y axis representing each day of the month. The totals may not be directly comparable between days (for example, day 31 does not occur in every month).

In [None]:
fig = plt.figure(figsize=(5,6))
wait = df[['days_in_waiting_list','arrival_date_day_of_month']]
wait = wait.groupby(['arrival_date_day_of_month']).sum().to_dict()
x, y = zip(*wait['days_in_waiting_list'].items())
plt.barh(x, y)
plt.yticks(x)
plt.show()
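
Because a sum grows with the number of bookings on a given day, a day with few bookings can look quiet even when each of its customers waited a long time. The sketch below, on made-up numbers, compares the sum with the mean and the booking count per day; using the mean here is a suggestion, not what the chart above plots.

```python
import pandas as pd

# Toy data (made-up): day 1 has four bookings, day 31 has only one
toy = pd.DataFrame({
    "arrival_date_day_of_month": [1, 1, 1, 1, 31],
    "days_in_waiting_list": [10, 10, 10, 10, 30],
})

by_day = (
    toy.groupby("arrival_date_day_of_month")["days_in_waiting_list"]
    .agg(["sum", "mean", "count"])
)
print(by_day)
```

In this toy case day 1 has the larger total, but day 31 has the longer average wait per booking.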

The frame contains information on bookings coming from different distribution channels. A pandas pie chart is drawn from the value_counts() of the 'distribution_channel' column. The chart shows that most bookings came from 'TA/TO', meaning “Travel Agents” and “Tour Operators”.

In [None]:
dist_chnl = df['distribution_channel'].value_counts()
dist_chnl.iloc[:4].plot.pie(figsize=(5,5), fontsize =10)

Finally, to see which countries have the highest cancellation rates, both the number of bookings and the number of cancellations per country are calculated. The rate is the number of cancellations for a country divided by the total number of entries for that country. As there are entries for 177 different countries, rates of 50% or less are excluded to avoid overcrowding the diagram.

In [None]:
fig = plt.figure(figsize=(60,7))
# Cancellations per country, sorted from most to fewest
cancelled = df.groupby('country')['is_canceled'].sum().sort_values(ascending=False)
# Total number of bookings per country
country_freq = df['country'].value_counts()
for ctr, cnc in cancelled.items():
    # Look up the country's total directly instead of scanning every country
    rate = cnc / country_freq[ctr]
    if rate > .5:
        plt.bar(ctr, rate)
        print(ctr, rate)
plt.show()
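
Since 'is_canceled' holds only 0 and 1, the rate described above is also the mean of that column within each country. A compact sketch on made-up data follows; the names rates and high are assumptions for illustration.

```python
import pandas as pd

# Toy data (made-up): PRT cancels 2 of 3, GBR 0 of 2, FRA 1 of 1
toy = pd.DataFrame({
    "country": ["PRT", "PRT", "PRT", "GBR", "GBR", "FRA"],
    "is_canceled": [1, 1, 0, 0, 0, 1],
})

# The mean of a 0/1 column is the share of ones, i.e. the cancellation rate
rates = toy.groupby("country")["is_canceled"].mean()
high = rates[rates > 0.5].sort_values(ascending=False)
print(high)
```

Note that a country with a single cancelled booking (FRA in this toy data) gets a rate of 100%, so it may be worth also requiring a minimum number of bookings before drawing conclusions.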

## Data Classification: ##

For classification, columns with the 'object' data type cannot be used directly. To transform these columns into a usable format, we first find all the columns with the 'object' data type.

In [None]:
# Collect the names of all columns whose dtype is 'object'
cat = [str(col) for col, dtype in df.dtypes.items() if dtype == object]
print(cat, len(cat))

Pandas provides a very useful tool for this transformation. Each target column is converted to the categorical type and then to its integer codes. Checking the df.dtypes attribute again confirms this.

In [None]:
for col in cat:
    df[col] = df[col].astype('category')
    df[col] = df[col].cat.codes
df.dtypes
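
To see what the integer codes mean, the mapping from code to original label can be read from .cat.categories before the labels are overwritten. This is a sketch on a made-up column; the variable names are assumptions for illustration.

```python
import pandas as pd

toy = pd.DataFrame(
    {"deposit_type": ["No Deposit", "Non Refund", "No Deposit", "Refundable"]}
)

as_cat = toy["deposit_type"].astype("category")
# Codes follow the sorted order of the category labels
mapping = dict(enumerate(as_cat.cat.categories))
toy["deposit_type"] = as_cat.cat.codes

print(mapping)
print(toy["deposit_type"].tolist())
```

The codes are assigned in sorted label order, so they imply an ordering that usually has no real meaning. Distance-based models such as SVC or MLP may read it as one; one-hot encoding is a common alternative in that case.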

Here, to determine whether a booking is likely to be canceled, we have to predict that from other attributes of the booking. Although feature selection is an important step for deciding which attributes are most significant in such a classification scenario, the columns here were chosen arbitrarily. The correlation matrix could also have been useful here. The Y data contains the target variable (the 'is_canceled' column here).

In [None]:
X = df[['hotel', 'lead_time','arrival_date_year', 'arrival_date_month', 'arrival_date_week_number', 'arrival_date_day_of_month', 'adults', 'children', 'babies',
       'country', 'is_repeated_guest', 'previous_cancellations', 'booking_changes', 'deposit_type', 'agent', 'company', 'days_in_waiting_list', 'customer_type']]
Y = df['is_canceled']

Let's take a glimpse at the X data with the head() function and check the shapes of both X and Y.

In [None]:
X.head()

In [None]:
print(X.shape)
print(Y.shape)

**train_test_split** is a function in sklearn's model_selection module for splitting data arrays into two subsets: one for training and one for testing. With this function, we don't need to divide the dataset manually. By default, train_test_split makes random partitions for the two subsets. Here 80% of the data is used for training and 20% for testing.

In [None]:
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = .2)
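
Because the partition is random, every run gives a different split and slightly different scores. The sketch below, on made-up arrays, shows two optional parameters of train_test_split: random_state makes the split repeatable, and stratify keeps the class proportions the same in both subsets. Using them here is a suggestion rather than part of the original experiment.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X_toy = np.arange(40).reshape(20, 2)       # 20 samples, 2 features
y_toy = np.array([0] * 15 + [1] * 5)       # imbalanced labels: 75% / 25%

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

print(X_tr.shape, X_te.shape)
print(y_tr.mean(), y_te.mean())  # share of class 1 in each subset
```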

It is always better to cross-check the shapes of the training and testing data. X and Y must have the same number of rows within each subset, and X_train and X_test must have the same number of columns.

In [None]:
print(X_train.shape)
print(X_test.shape)

print(Y_train.shape)
print(Y_test.shape)

The first classifier we are going to use is a support vector classifier. The make_pipeline() function builds a pipeline that sequentially applies a list of transforms and a final estimator. <br>
The purpose of the pipeline is to assemble several steps that can be cross-validated together while setting different parameters.
<br>StandardScaler() standardizes features by removing the mean and scaling to unit variance.

In [None]:
clf1 = make_pipeline(StandardScaler(), SVC(gamma='auto'))
clf1.fit(X_train, Y_train)
score1 = clf1.score(X_test, Y_test)

After the classifier is defined, it is fitted to the training data. The score() function on the testing data returns the accuracy of the classifier. The first classifier in this experiment reached 79.68% accuracy in this run.

In [None]:
score1
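
To see what the pipeline does internally, the fitted steps can be inspected through named_steps. In this sketch on made-up data, the scaler inside the pipeline has learned its mean from the training data only, and predict() reuses that same scaling on new data automatically.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Toy training data (made-up): two features on very different scales
X_tr = np.array([[0.0, 100.0], [1.0, 200.0], [2.0, 300.0], [3.0, 400.0]])
y_tr = np.array([0, 0, 1, 1])

pipe = make_pipeline(StandardScaler(), SVC(gamma="auto"))
pipe.fit(X_tr, y_tr)

# make_pipeline names each step after its lowercased class name
scaler = pipe.named_steps["standardscaler"]
print(scaler.mean_)  # per-feature mean learned from the training data

# New samples are scaled with the stored mean and variance before the SVC sees them
predictions = pipe.predict(np.array([[0.5, 150.0], [2.5, 350.0]]))
print(predictions)
```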

The multi-layer perceptron (MLP) classifier is a neural network that optimizes the log-loss function using LBFGS or stochastic gradient descent. As before, the classifier is trained and tested and its accuracy is observed. The MLP does better than the SVC, with ~80.96% accuracy.

In [None]:
clf2 =  make_pipeline(StandardScaler(), MLPClassifier(alpha=.001, max_iter=2000))
clf2.fit(X_train, Y_train)
score2 = clf2.score(X_test, Y_test)

In [None]:
score2

Finally, a random forest classifier is used as the third classifier. A random forest is a meta estimator that fits a number of decision tree classifiers on various sub-samples of the dataset and uses averaging to improve the predictive accuracy and control over-fitting. The default parameters are used here.

In [None]:
clf3 = RandomForestClassifier()
clf3.fit(X_train, Y_train)
score3 = clf3.score(X_test, Y_test)

In [None]:
score3
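
A fitted random forest also exposes feature_importances_, which can help with the feature selection step that was skipped earlier. The sketch below uses synthetic data in which one feature fully determines the label and the other is pure noise; the data and names are invented for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
signal = rng.normal(size=200)          # feature that determines the label
noise = rng.normal(size=200)           # feature unrelated to the label
X_toy = np.column_stack([signal, noise])
y_toy = (signal > 0).astype(int)

forest = RandomForestClassifier(n_estimators=50, random_state=0)
forest.fit(X_toy, y_toy)

# Importances sum to 1; larger values mean the feature was more useful for splitting
for name, importance in zip(["signal", "noise"], forest.feature_importances_):
    print(f"{name}: {importance:.2f}")
```

On the hotel data, the same attribute on clf3 paired with X.columns would give a rough ranking of the chosen columns.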

With 85.77% accuracy, the random forest classifier performs the best of the three in the presented scenario. So, this model can predict whether a booking will be cancelled correctly approximately 86 times out of 100, which can inform many important business decisions.<br>
In conclusion, this was a very basic demonstration meant to build some familiarity with basic matplotlib and pandas functions for data exploration. If there are any mistakes in the explanation or code, or any functions that are presented in an overly complicated way, please do let me know [here](mailto:raisul.inc@gmail.com).

References:<br>
- [1] Antonio, Nuno, Ana de Almeida, and Luis Nunes. "Hotel booking demand datasets." Data in brief 22 (2019): 41-49.