This dataset contains 119390 observations for a City Hotel and a Resort Hotel. Each observation represents a hotel booking between the 1st of July 2015 and the 31st of August 2017, including bookings that effectively arrived and bookings that were canceled.

**hotel** - The dataset contains the booking information of two hotels. One of the hotels is a resort hotel and the other is a city hotel.

**is_canceled** - Value indicating if the booking was canceled (1) or not (0).

**lead_time** - Number of days that elapsed between the entering date of the booking into the PMS and the arrival date.

**arrival_date_year** - Year of arrival date.

**arrival_date_month** - Month of arrival date with 12 categories: “January” to “December”

**arrival_date_week_number** - Week number of the arrival date

**arrival_date_day_of_month** - Day of the month of the arrival date

**stays_in_weekend_nights** - Number of weekend nights (Saturday or Sunday) the guest stayed or booked to stay at the hotel

**stays_in_week_nights** - Number of week nights (Monday to Friday) the guest stayed or booked to stay at the hotel (calculated by counting the number of week nights)

**adults** - Number of adults

**children** - Number of children

**babies** - Number of babies

**meal** - Type of meal booked, e.g. BB – Bed & Breakfast

**country** - Country of origin.

**market_segment** - Market segment designation. 
* TA - Travel Agents 
* TO - Tour Operators

**distribution_channel** - Booking distribution channel. The term “TA” means “Travel Agents” and “TO” means “Tour Operators”

**is_repeated_guest** - Value indicating if the booking name was from a repeated guest (1) or not (0)

**previous_cancellations** - Number of previous bookings that were cancelled by the customer prior to the current booking

**previous_bookings_not_canceled** - Number of previous bookings not cancelled by the customer prior to the current booking

**reserved_room_type** - Code of room type reserved. Code is presented instead of designation for anonymity reasons

**assigned_room_type** - Code for the type of room assigned to the booking. Sometimes the assigned room type differs from the reserved room type due to hotel operation reasons (e.g. overbooking) or by customer request. Code is presented instead of designation for anonymity reasons

**booking_changes** - Number of changes/amendments made to the booking from the moment the booking was entered on the PMS until the moment of check-in or cancellation

 **deposit_type** 
 * No Deposit – no deposit was made; 
 * Non Refund – a deposit was made in the value of the total stay cost; 
 * Refundable – a deposit was made with a value under the total cost of stay.
 
**agent** - ID of the travel agency that made the booking

**company** - ID of the company/entity that made the booking or responsible for paying the booking. ID is presented instead of designation for anonymity reasons

**days_in_waiting_list** - Number of days the booking was in the waiting list before it was confirmed to the customer

**customer_type** 
* Group – when the booking is associated with a group; 
* Transient – when the booking is not part of a group or contract, and is not associated with another transient booking; 
* Transient-party – when the booking is transient, but is associated with at least one other transient booking

**adr** - Average Daily Rate (Calculated by dividing the sum of all lodging transactions by the total number of staying nights)
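
As a small worked example of the ADR definition above, the sketch below divides the sum of lodging transactions by the number of staying nights. The function name `average_daily_rate` and the sample amounts are made up for illustration and do not come from the dataset.

```python
def average_daily_rate(lodging_transactions, staying_nights):
    """ADR = sum of lodging transactions / number of staying nights."""
    if staying_nights <= 0:
        raise ValueError("staying_nights must be positive")
    return sum(lodging_transactions) / staying_nights


# hypothetical 3-night stay billed at 90, 110 and 100 per night
example_adr = average_daily_rate([90.0, 110.0, 100.0], 3)
print(example_adr)  # 100.0
```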

**required_car_parking_spaces** - Number of car parking spaces required by the customer

**total_of_special_requests** - Number of special requests made by the customer (e.g. twin bed or high floor)

**reservation_status** 
* Check-Out – customer has checked in but already departed; 
* No-Show – customer did not check-in and did inform the hotel of the reason why

**reservation_status_date** - Date at which the last status was set. This variable can be used in conjunction with reservation_status to understand when the booking was canceled or when the customer checked out of the hotel

In [None]:
import pandas as pd
import numpy as np

import warnings 
warnings.filterwarnings("ignore")
import calendar

import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

In [None]:
data = pd.read_csv("../input/hotel-booking/hotel_booking.csv")
data.head()

Firstly, I'm checking size and dropping duplicates

In [None]:
def size_of_df(data):
    print(f"| {data.shape[0]} rows, {data.shape[1]} columns \n| {data.size} elements in total \n| {round(data.memory_usage().sum()/1048576, 1)} Mb")

In [None]:
size_of_df(data)

In [None]:
data.drop_duplicates(inplace=True, ignore_index=True)
size_of_df(data)
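
Before dropping duplicates it can help to count them first. The sketch below does this on a toy frame with made-up values (the real file is not used here); `DataFrame.duplicated()` marks rows that repeat an earlier row, and `drop_duplicates(ignore_index=True)` renumbers the index afterwards.

```python
import pandas as pd

# toy frame (hypothetical values): rows 0 and 2 are exact duplicates
toy = pd.DataFrame({"hotel": ["City Hotel", "Resort Hotel", "City Hotel"],
                    "adr": [80.0, 120.0, 80.0]})

n_duplicates = int(toy.duplicated().sum())  # rows that repeat an earlier row
deduplicated = toy.drop_duplicates(ignore_index=True)

print(n_duplicates, len(deduplicated))
```

Whether identical rows are true duplicates or separate bookings that happen to match is a judgment call that depends on which columns the file contains.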

Defining function to display information about DF

In [None]:
def data_info(data, set_color=False):
    #  two most frequent values in each column of the DF
    unique_data = [data[x].value_counts().index[:2] for x in data.columns]
    unique_res_data = pd.DataFrame()
    for index, x in enumerate(data.columns):
        unique_res_data[x] = unique_data[index] 
        
    #  create new DF with all information about our DF - unique values, data types, num. unique values, num. NAs    
    df_info = pd.concat([unique_res_data, 
                     pd.DataFrame(data.dtypes).T, 
                     pd.DataFrame(data.nunique()).T, 
                     pd.DataFrame(data.isna().sum()).T], ignore_index=True)

    df_info.rename(index={0:"unique_value_1",
                          1:"unique_value_2",
                          2:"type", 
                          3:"num_unique", 
                          4:"is_na"}, inplace=True)
    
    # for convenience we will highlight the information in color
    if set_color:
        props = [("background-color", "lavender")]
        df_info = df_info.style.set_table_styles({"type": [{"selector": "", "props": props}],
                                          "num_unique": [{"selector": '', "props": props}],
                                          "is_na": [{"selector": "", "props": props}]}, 
                                          axis=1, 
                                          overwrite=False)
    return df_info

In [None]:
data_info(data).T

In [None]:
data.describe().T

In [None]:
# drop rows with missing "children" or "country", then keep only bookings
# with at least one adult and with "adults" strictly below the 0.99 quantile
data = data[~((data.children.isna()) | (data.country.isna()))]
data = data[data.adults>0]
data = data[data.adults<data.adults.quantile(0.99)]
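
The quantile filter above uses a strict `<`, so any row whose value equals the 0.99 quantile is removed as well. The sketch below shows the difference between `<` and `<=` on a made-up `adults` series; whether the real column behaves the same way depends on its distribution, which is an assumption here.

```python
import pandas as pd

# hypothetical adults column: the 0.99 quantile lands exactly on 3.0
adults = pd.Series([2] * 95 + [3] * 5)
q99 = adults.quantile(0.99)

kept_strict = adults[adults < q99]      # drops every booking with 3 adults
kept_inclusive = adults[adults <= q99]  # keeps them

print(q99, len(kept_strict), len(kept_inclusive))
```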

Which customer names appear most often at each hotel

In [None]:
pd.concat([data[data.hotel=='Resort Hotel'].name.value_counts().head(), 
           data[data.hotel=='City Hotel'].name.value_counts().head()], 
           axis=1, keys=("Resort Hotel", "City Hotel")).fillna("-")

Which countries guests most often come from, for each hotel

In [None]:
pd.concat([data[data.hotel=='Resort Hotel'].country.value_counts().head(), 
           data[data.hotel=='City Hotel'].country.value_counts().head()], 
           axis=1, keys=("Resort Hotel", "City Hotel")).fillna("-")

In [None]:
# add column babies to children
data["children"] = data.babies + data.children

# drop excess data
data.drop(["company", "name", "email", "phone-number", "agent", "credit_card", 
           "distribution_channel", "babies", "assigned_room_type", "reservation_status", "reservation_status_date"], 
          axis=1, inplace=True)

In [None]:
data_info(data, True)

In [None]:
#  change data types
cat_data = ["hotel", "is_canceled", "arrival_date_month", "meal", "market_segment", 
      "is_repeated_guest", "reserved_room_type", "country", 
      "deposit_type", "customer_type"]
int_data = ["children", "adr"]

# data[cat_data] = data[cat_data].astype("category")
data[int_data] = data[int_data].astype("int")
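
The commented-out `astype("category")` line is one way to shrink the frame further. The sketch below compares the memory of a made-up string column before and after the conversion. It uses `memory_usage(deep=True)` so the string contents are counted; the default call without `deep=True` may count only pointers for object columns.

```python
import pandas as pd

# hypothetical column with two distinct strings repeated many times
s_object = pd.Series(["City Hotel", "Resort Hotel"] * 50_000)
s_category = s_object.astype("category")

bytes_object = s_object.memory_usage(deep=True)
bytes_category = s_category.memory_usage(deep=True)

print(bytes_category < bytes_object)
```

The saving comes from storing small integer codes plus one copy of each distinct value, so it is largest for columns with few unique values.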

Check DF size after transformations (32.8 Mb before)

In [None]:
size_of_df(data)

In [None]:
data_info(data, True)

In [None]:
data.is_canceled.value_counts()

In [None]:
print(f"{round(data.is_canceled.value_counts()[1]/len(data.is_canceled)*100, 2)}% of bookings were canceled overall")

Let's check the relationships between "is_canceled" and other variables

In [None]:
columns_for_visual = data.nunique()[data.nunique()<10].drop(["is_canceled"]).index
fig = plt.figure(figsize=(17,15))
for index, col in enumerate(columns_for_visual):
    ax = fig.add_subplot(4, 3, index+1)
    ax.set_title(col,fontsize=15)
    ax.tick_params(labelrotation=20)
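    # pass the column as x= explicitly; newer seaborn versions treat the first positional argument as data
    sns.countplot(x=data[col], hue=data.is_canceled, ax=ax, palette="mako")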
plt.tight_layout(pad=3);

We can see some interesting columns - "hotel", "market_segment", "is_repeated_guest", "deposit_type", "customer_type", "total_of_special_requests"

Look at the data with percentages

In [None]:
def perc_canceled_for_cols(data, *cols):
    result_data = pd.DataFrame()
    for col in cols:
        temp_df = data.pivot_table(index="is_canceled", columns=col, aggfunc="count").adr\
                            .reset_index().drop("is_canceled", axis=1).apply(lambda x: round(x/x.sum()*100, 2))
        result_data = pd.concat([result_data, temp_df], axis=1)
    return result_data
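
The same per-category percentages can also be obtained with `pd.crosstab`. This is an alternative sketch, not the function used in this notebook, and the toy values are made up. With `normalize="columns"` each column sums to 1, so multiplying by 100 gives the share of canceled and not canceled bookings within each category.

```python
import pandas as pd

# toy data (hypothetical): 2 of 4 City Hotel and 1 of 4 Resort Hotel bookings canceled
toy = pd.DataFrame({
    "is_canceled": [0, 0, 1, 1, 0, 1, 0, 0],
    "hotel": ["City Hotel"] * 4 + ["Resort Hotel"] * 4,
})

# normalize="columns" makes each column sum to 1; scale to percent
pct = (pd.crosstab(toy["is_canceled"], toy["hotel"], normalize="columns") * 100).round(2)
print(pct)
```

Unlike `perc_canceled_for_cols`, this keeps `is_canceled` as the row index, which makes it clearer which row holds the cancellation share.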

In [None]:
hotel_market = perc_canceled_for_cols(data, "hotel", "market_segment")
hotel_market

In [None]:
repeat_deposit = perc_canceled_for_cols(data, "is_repeated_guest", "deposit_type")
repeat_deposit

In [None]:
customer_type_data = perc_canceled_for_cols(data, "customer_type")
customer_type_data

In [None]:
total_of_special_requests_data = perc_canceled_for_cols(data, "total_of_special_requests")
total_of_special_requests_data

In [None]:
print(f"The City Hotel has {round(hotel_market.iloc[1, 0] - hotel_market.iloc[1, 1], 2)} percentage points more cancellations than the Resort Hotel")
print("The market segment \"Groups\" has 2-3 times more cancellations than other segments")
print(f"If a guest has stayed at the hotel before, the cancellation rate is {round(repeat_deposit.iloc[1, 0] - repeat_deposit.iloc[1, 1], 2)} percentage points lower")
print("When customers choose the deposit type \"Non Refund\", almost all bookings are canceled")
print("The \"Group\" customer type has 3-4 times fewer cancellations than other types")

Look at the canceled data by each month and year

In [None]:
canceled = data[data['is_canceled'] == 1]
canceled_by_month_year = canceled.pivot_table(index="arrival_date_year", columns="arrival_date_month", aggfunc="count").fillna(0).adr.T
canceled_by_month_year 

In [None]:
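# sanity check: February 2015 should return no rows, because the data starts in July 2015
# (both conditions use "canceled" so the boolean masks share the same index)
canceled[(canceled.arrival_date_year==2015)&(canceled.arrival_date_month=="February")]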

Reindex data in the right order by month and visualize data

In [None]:
canceled_by_month_year = canceled_by_month_year.reindex(calendar.month_name[1:])
plt.figure(figsize=(15, 5));
plt.title("Count of canceled orders by month and year");
sns.lineplot(data=canceled_by_month_year, palette="mako_r", linewidth=2);

Look at the data by each month

In [None]:
plt.figure(figsize=(15, 5));
plt.title("Count of orders by month");
sns.countplot(x=data.arrival_date_month, hue=data.is_canceled, palette="mako", order=calendar.month_name[1:]);

In [None]:
perc_canceled_for_cols(data, "arrival_date_month")

April has the highest share of cancellations and January the lowest, but there is no data for the first half of 2015 or for the end of 2017, so let's look at 2016 only

In [None]:
perc_canceled_for_cols(data[data.arrival_date_year==2016], "arrival_date_month")

So January has fewer cancellations than the other months, and the months with the most cancellations are October, June and April

Look at the other columns in DF

In [None]:
columns_for_visual_2 = ["lead_time", "stays_in_weekend_nights", "stays_in_week_nights",
                        "previous_cancellations", "previous_bookings_not_canceled", 
                      "booking_changes", "days_in_waiting_list", "adr"]
fig = plt.figure(figsize=(17,15))
for index, col in enumerate(columns_for_visual_2):
    ax = fig.add_subplot(3, 3, index+1)
    ax.set_title(col,fontsize=15)
    ax.tick_params(labelrotation=20)
    sns.boxplot(x=data.is_canceled, y=data[col], ax=ax, palette="mako")
plt.tight_layout(pad=3);

The most interesting column - "lead_time"

In [None]:
print(f"Median of days (from booking to arrival) = {round(data[data.is_canceled==1].lead_time.median())} for canceled orders \n\
Median of days (from booking to arrival) = {round(data[data.is_canceled==0].lead_time.median())} for not canceled orders")

Also we can see strong outliers in columns "adr" and "days_in_waiting_list"

In [None]:
fig, ax = plt.subplots(1, 3, figsize=(15, 5));
sns.histplot(data.adr, ax=ax[0]);
sns.histplot(data[data.adr<data.adr.quantile(0.99)].adr, ax=ax[1]);
sns.histplot(data[data.adr<data.adr.quantile(0.9)].adr, ax=ax[2]);

In [None]:
# cap outliers at the 0.99 quantile and check data
# (assign the result back: an inplace clip on a selected column may not update the DF)
data["adr"] = data.adr.clip(0, data.adr.quantile(0.99))
data["days_in_waiting_list"] = data.days_in_waiting_list.clip(0, data.days_in_waiting_list.quantile(0.99))
fig, ax = plt.subplots(1, 2, figsize=(15,5))
sns.boxplot(x=data.is_canceled, y=data.adr, palette="mako", ax=ax[0]);
sns.boxplot(x=data.is_canceled, y=data.days_in_waiting_list, palette="mako", ax=ax[1]);
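
Note that `clip` caps extreme values rather than removing rows. The sketch below contrasts capping with filtering on a made-up `adr` series. It uses the 0.75 quantile only because the sample is tiny, while the cell above uses 0.99.

```python
import pandas as pd

adr = pd.Series([50.0, 80.0, 100.0, 120.0, 5400.0])  # hypothetical, one extreme value
cap = adr.quantile(0.75)

clipped = adr.clip(0, cap)   # same number of rows, extreme value capped
filtered = adr[adr <= cap]   # extreme row removed instead

print(cap, len(clipped), clipped.max(), len(filtered))
```

Capping keeps every booking for later modelling at the cost of distorting the largest values. Filtering keeps values intact but loses rows.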

In [None]:
# now we can drop duplicates again and apply a one-hot encoder or other tools to prepare the data
# for machine learning algorithms that classify canceled orders
data.drop_duplicates(inplace=True)
size_of_df(data)
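
As one possible next step, the sketch below one-hot encodes a categorical column with `pd.get_dummies`. This is a pandas alternative to a dedicated encoder class and is not part of this notebook; the toy values are made up.

```python
import pandas as pd

# toy frame (hypothetical values) with one categorical and one numeric column
toy = pd.DataFrame({
    "deposit_type": ["No Deposit", "Non Refund", "No Deposit"],
    "lead_time": [10, 200, 35],
})

# each category becomes its own 0/1 column; other columns are left unchanged
encoded = pd.get_dummies(toy, columns=["deposit_type"], dtype=int)
print(encoded.columns.tolist())
```

High-cardinality columns such as "country" would produce many indicator columns this way, so they may need grouping or a different encoding first.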

In [None]:
data_info(data, True)