# 1. Business Understanding

An online travel booking company is losing revenue because its customers cancel bookings unpredictably. The company wants to know which customers will cancel. As data scientists, our task is to help the company predict whether a customer will cancel a booking. We have booking details such as arrival_date_year, stays_in_week_nights, and arrival_date_day_of_month for customers from various countries. We will first analyze the data to answer a set of business questions, and then build machine learning model(s) to predict whether a customer will cancel.

We will focus on Exploratory Data Analysis for answering business questions first and then move on to the prediction approach.

# 2. Import Packages and Load Dataset

In [None]:
import pandas as pd
import pandas_profiling as pp

import matplotlib.pyplot as plt
from matplotlib.ticker import FuncFormatter
import seaborn as sns
import plotly.express as px

# A Jupyter-specific magic command that lets you see the plots in the notebook itself.
%matplotlib inline  

import warnings
warnings.filterwarnings('ignore')

In [None]:
df = pd.read_csv("../input/hotel-booking-demand/hotel_bookings.csv")

df.head()

In [None]:
df.shape

In [None]:
df.describe()

# 3. Exploratory Data Analysis

### Q1: Read the dataset and visualize the target column (i.e. is_canceled). Is it imbalanced?

In [None]:
sns.countplot(data=df, x = 'is_canceled')
plt.show()

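The target is not severely imbalanced: as the chart above shows, both classes are well represented, although non-canceled bookings form the larger class.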
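The chart gives a visual impression only; the class shares can also be computed directly. Below is a minimal sketch, assuming the target column is named `is_canceled` as in the cells above. The helper `class_balance` and the `toy` frame are hypothetical, and the toy values are made up for illustration (they are not taken from the dataset).

```python
import pandas as pd


def class_balance(frame: pd.DataFrame, target: str = "is_canceled") -> pd.Series:
    """Return the share of each target class, in percent."""
    return frame[target].value_counts(normalize=True).mul(100).round(2)


# Toy stand-in for df with made-up values (NOT the real dataset).
toy = pd.DataFrame({"is_canceled": [0, 0, 0, 1, 1]})

balance = class_balance(toy)
print(balance)
```

On the real data the same helper could be called as `class_balance(df)`.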

In [None]:
pp.ProfileReport(df)

### Q2: Which type of hotel has the highest number of cancellations?

In [None]:
sns.countplot(data=df, x = 'hotel', hue='is_canceled')
plt.show()

So, City Hotel has the highest number of cancellations.
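A count plot compares absolute numbers, so it can help to look at the cancellation rate per hotel type as well. The following is a sketch on a small made-up frame (`toy` is a hypothetical stand-in for `df`); the output column names `cancellations`, `bookings` and `cancel_rate_pct` are chosen here for illustration.

```python
import pandas as pd

# Toy stand-in for df with made-up values (NOT the real dataset).
toy = pd.DataFrame({
    "hotel": ["City Hotel", "City Hotel", "City Hotel",
              "Resort Hotel", "Resort Hotel"],
    "is_canceled": [1, 1, 0, 1, 0],
})

# is_canceled is 0/1, so its sum is the number of cancellations
# and its count is the number of bookings.
summary = toy.groupby("hotel")["is_canceled"].agg(
    cancellations="sum", bookings="count"
)
summary["cancel_rate_pct"] = (
    summary["cancellations"] / summary["bookings"] * 100
).round(1)

# Hotel type with the largest absolute number of cancellations.
most_cancellations = summary["cancellations"].idxmax()
print(summary)
print(most_cancellations)
```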

### Q3: Report the name of the country that has the highest number of resort hotels and the country that has the highest number of city hotels?

In [None]:
counts = df['country'].value_counts()
counts

In [None]:
plt.subplots(figsize=(7,5))
sns.countplot(x='country', hue='hotel',  data=df[df['country'].isin(counts[counts > 2000].index)])
plt.show()

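So, Portugal (PRT) has the highest number of bookings for both city hotels and resort hotels. Note that the dataset records bookings, so the chart counts bookings by guests' country rather than hotels located in each country.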
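The same answer can be read off programmatically instead of from the chart. This is a minimal sketch, assuming the `hotel` and `country` columns used above; the `toy` frame is a hypothetical stand-in for `df` with made-up rows.

```python
import pandas as pd

# Toy stand-in for df with made-up values (NOT the real dataset).
toy = pd.DataFrame({
    "hotel": ["City Hotel", "City Hotel", "City Hotel",
              "Resort Hotel", "Resort Hotel", "Resort Hotel"],
    "country": ["PRT", "PRT", "FRA", "GBR", "PRT", "PRT"],
})

# Rows: hotel type, columns: country, values: number of bookings.
bookings = toy.groupby(["hotel", "country"]).size().unstack(fill_value=0)

# For each hotel type, the country with the most bookings.
top_country = bookings.idxmax(axis=1)
print(top_country)
```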

### Q4: Report the percentage of check-outs of hotels in India (IND)

In [None]:
india_specific_data = df[df["country"] == 'IND']

india_specific_data.head(10)

In [None]:
india_specific_data['reservation_status'].value_counts(normalize=True)

In [None]:
india_checkouts = india_specific_data['reservation_status'].value_counts()

# pie plot
fig = px.pie(india_checkouts,
             values=india_checkouts.values,
             names=india_checkouts.index,
             title="Hotel Checkouts in India",
             template="seaborn")
fig.update_traces(rotation=-90, textinfo="percent+label")
fig.show()

### Q5: Report the name of the country where the maximum number of BB meals have been booked?

Various Meal Types are:
* BB - Bed and Breakfast
* HB - Half Board
* SC - No Meal Package
* FB - Full Board
* Undefined - Undefined

In [None]:
df['meal'].value_counts()

In [None]:
df['meal'].value_counts(normalize=True)

In [None]:
plt.subplots(figsize=(7,5))
sns.countplot(x='country', hue='meal',  data=df[df['country'].isin(counts[counts > 2500].index)])
plt.show()

Portugal is the country where the maximum number of BB meals have been booked.
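The chart only covers countries above a booking threshold. As a cross-check, the BB bookings can be counted per country directly. A sketch on made-up data follows; `toy` is a hypothetical stand-in for `df`.

```python
import pandas as pd

# Toy stand-in for df with made-up values (NOT the real dataset).
toy = pd.DataFrame({
    "country": ["PRT", "PRT", "PRT", "GBR", "GBR", "ESP"],
    "meal": ["BB", "BB", "HB", "BB", "SC", "BB"],
})

# Keep only BB bookings, then count them per country.
bb_by_country = toy.loc[toy["meal"] == "BB", "country"].value_counts()

# Country with the most BB bookings.
top_bb_country = bb_by_country.idxmax()
print(bb_by_country)
print(top_bb_country)
```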

### Q6: Report the name of at least three countries where the number of SC meals is zero?

In [None]:
group_meal_data = df.groupby(['country','meal']).size().unstack(fill_value=0)

group_meal_data.shape

In [None]:
# There are 55 countries in total where the number of SC meals is zero
country_with_no_SC_meals = group_meal_data[group_meal_data["SC"] == 0]

country_with_no_SC_meals.shape

In [None]:
country_with_no_SC_meals.tail(20)

Uganda, Nepal, and Namibia are three countries picked arbitrarily from the roughly 55 countries where the number of SC meals is zero.

### Q7: It is said that if the deposit_type is “non-refund” then there are no cancelations. Could you prove/disprove this claim?

In [None]:
# group data for deposit_type:
deposit_cancel_data = df.groupby("deposit_type")["is_canceled"].describe()

#show figure:
plt.figure(figsize=(8, 6))
sns.barplot(x=deposit_cancel_data.index, y=deposit_cancel_data["mean"] * 100)
plt.title("Effect of deposit_type on cancelation", fontsize=16)
plt.xlabel("Deposit type", fontsize=16)
plt.ylabel("Cancelations [%]", fontsize=16)
plt.show()
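The claim can also be checked against raw counts rather than percentages. The sketch below uses `pd.crosstab` on a small made-up frame (`toy` is a hypothetical stand-in for `df`); the claim holds only if the "Non Refund" row has zero bookings with `is_canceled == 1`.

```python
import pandas as pd

# Toy stand-in for df with made-up values (NOT the real dataset).
toy = pd.DataFrame({
    "deposit_type": ["No Deposit", "No Deposit", "No Deposit",
                     "Non Refund", "Non Refund", "Refundable"],
    "is_canceled": [0, 0, 1, 1, 1, 0],
})

# Rows: deposit type, columns: is_canceled (0/1), values: booking counts.
table = pd.crosstab(toy["deposit_type"], toy["is_canceled"])

# Number of canceled bookings among "Non Refund" deposits.
non_refund_cancellations = int(table.loc["Non Refund", 1])
claim_holds = non_refund_cancellations == 0
print(table)
print(claim_holds)
```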

As the chart shows, the 'Non Refund' deposit type and the 'is_canceled' column are correlated in a counter-intuitive way: over 99% of people who paid the entire amount upfront canceled their hotel bookings, which disproves the claim. This raises the question of whether something is wrong with the data (or its description). What else stands out for Non Refund deposits? Here is a table of the mean values of the data, grouped by deposit type.

In [None]:
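# numeric_only=True skips text columns; recent pandas versions raise an error without it.
deposit_mean_data = df.groupby("deposit_type").mean(numeric_only=True)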
deposit_mean_data

Comparing the mean values for "Non Refund" to "No Deposit" shows the following:

* Non Refund deposits are characterized by > 2x longer lead_time
* is_repeated_guest is ~ 1/10th
* previous_cancellations is 9x higher
* previous_bookings_not_canceled is 1/15th
* required_car_parking_spaces is almost zero

Based on these findings, it seems that people who have not previously visited one of the hotels, in particular, book, pay, and cancel repeatedly. This is unusual behavior!
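Factors such as "2x" or "9x" in the comparison above are ratios of group means. A minimal sketch of that calculation on made-up numbers follows; `toy` is a hypothetical stand-in for `df` and only includes two of the numeric columns.

```python
import pandas as pd

# Toy stand-in for df with made-up values (NOT the real dataset).
toy = pd.DataFrame({
    "deposit_type": ["No Deposit", "No Deposit", "Non Refund", "Non Refund"],
    "lead_time": [50, 150, 200, 300],
    "previous_cancellations": [0, 1, 2, 4],
    "country": ["PRT", "GBR", "PRT", "PRT"],  # text column, skipped by the mean
})

# Mean of every numeric column per deposit type.
means = toy.groupby("deposit_type").mean(numeric_only=True)

# How many times larger each mean is for "Non Refund" than for "No Deposit".
ratio = means.loc["Non Refund"] / means.loc["No Deposit"]
print(ratio)
```

A ratio above 1 means the column's average is higher for "Non Refund" bookings; a ratio below 1 means it is lower.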

### Bookings per Distribution Channel

In [None]:
# total bookings per distribution channels (incl. canceled)
segments=df["distribution_channel"].value_counts()

# pie plot
fig = px.pie(segments,
             values=segments.values,
             names=segments.index,
             title="Bookings per distribution channels",
             template="seaborn")
fig.update_traces(rotation=-90, textinfo="percent+label")
fig.show()

In [None]:
fig = plt.figure(figsize=(12,4), dpi=150)

country_wise_guests = df[(df['is_canceled'] == 0)]['country'].value_counts().reset_index()
country_wise_guests.columns = ['country', 'No of guests']

country_wise_guests = country_wise_guests[country_wise_guests['No of guests'] > 200]

sns.barplot(data=country_wise_guests, x = 'country', y = 'No of guests')
plt.xticks(rotation=90,fontsize=11);

### How long do people stay at the hotels?

In [None]:
# Separate Resort and City hotel
# To know the actual visitor numbers, only bookings that were not canceled are included.
# .copy() makes each subset independent of df, so adding a column below is safe.
rh = df.loc[(df["hotel"] == "Resort Hotel") & (df["is_canceled"] == 0)].copy()
ch = df.loc[(df["hotel"] == "City Hotel") & (df["is_canceled"] == 0)].copy()

# Create a DataFrame with the relevant data:
rh["total_nights"] = rh["stays_in_weekend_nights"] + rh["stays_in_week_nights"]
ch["total_nights"] = ch["stays_in_weekend_nights"] + ch["stays_in_week_nights"]

num_nights_res = list(rh["total_nights"].value_counts().index)
num_bookings_res = list(rh["total_nights"].value_counts())
rel_bookings_res = rh["total_nights"].value_counts() / sum(num_bookings_res) * 100 # convert to percent

num_nights_cty = list(ch["total_nights"].value_counts().index)
num_bookings_cty = list(ch["total_nights"].value_counts())
rel_bookings_cty = ch["total_nights"].value_counts() / sum(num_bookings_cty) * 100 # convert to percent

res_nights = pd.DataFrame({"hotel": "Resort hotel",
                           "num_nights": num_nights_res,
                           "rel_num_bookings": rel_bookings_res})

cty_nights = pd.DataFrame({"hotel": "City hotel",
                           "num_nights": num_nights_cty,
                           "rel_num_bookings": rel_bookings_cty})

nights_data = pd.concat([res_nights, cty_nights], ignore_index=True)

In [None]:
#show figure:
plt.figure(figsize=(16, 8))
sns.barplot(x = "num_nights", y = "rel_num_bookings", hue="hotel", data=nights_data,
            hue_order = ["City hotel", "Resort hotel"])
plt.title("Length of stay", fontsize=16)
plt.xlabel("Number of nights", fontsize=16)
plt.ylabel("Guests [%]", fontsize=16)
plt.legend(loc="upper right")
plt.xlim(0,22)
plt.show()

In [None]:
avg_nights_res = sum(list((res_nights["num_nights"] * (res_nights["rel_num_bookings"]/100)).values))
avg_nights_cty = sum(list((cty_nights["num_nights"] * (cty_nights["rel_num_bookings"]/100)).values))
print(f"On average, guests of the City hotel stay {avg_nights_cty:.2f} nights, and {cty_nights['num_nights'].max()} at maximum.")
print(f"On average, guests of the Resort hotel stay {avg_nights_res:.2f} nights, and {res_nights['num_nights'].max()} at maximum.")
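The weighted sum used above (number of nights times relative frequency) is equal to the arithmetic mean of `total_nights`, so the same figures can be obtained with a single `groupby`. This is a sketch on made-up rows; `toy` is a hypothetical stand-in for `df`.

```python
import pandas as pd

# Toy stand-in for df with made-up values (NOT the real dataset).
toy = pd.DataFrame({
    "hotel": ["City Hotel", "City Hotel",
              "Resort Hotel", "Resort Hotel", "Resort Hotel"],
    "is_canceled": [0, 0, 0, 0, 1],
    "stays_in_weekend_nights": [0, 2, 2, 4, 1],
    "stays_in_week_nights": [2, 2, 5, 5, 1],
})

# Only bookings that were not canceled count as actual stays.
stays = toy.loc[toy["is_canceled"] == 0].copy()
stays["total_nights"] = (
    stays["stays_in_weekend_nights"] + stays["stays_in_week_nights"]
)

# Average and maximum length of stay per hotel type.
stay_stats = stays.groupby("hotel")["total_nights"].agg(["mean", "max"])
print(stay_stats)
```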

# Summary

We will work on the predictions as part of the next step.

Please feel free to provide feedback/questions that you may have.

A key aspect of the data understanding / EDA phase is answering business questions like the ones above. Doing so helps us understand the data and its patterns, and the resulting descriptive analytics support decision making.