# Introduction/Overview

The data is originally from the article [Hotel Booking Demand Datasets](https://www.sciencedirect.com/science/article/pii/S2352340918315191), written by Nuno Antonio, Ana Almeida, and Luis Nunes for Data in Brief, Volume 22, February 2019.

It consists of hotel booking data between July 2015 and August 2017 for two hotels in Portugal: a city hotel and a resort hotel. The data was collected from each hotel's reservation system, where it was stored in a total of 8 tables, before being joined together into a single CSV file.

In this notebook I will first clean the data, then visualize some of the booking patterns, before finally exploring the relationships between individual variables and whether a booking was canceled or not.

# Importing and Cleaning Data

In [None]:
import numpy as np
import pandas as pd 
import matplotlib.pyplot as plt
import seaborn as sns
import plotly as py
import plotly.graph_objs as go
import plotly.express as px
from plotly.subplots import make_subplots

import datetime as dt
from itertools import zip_longest
from sklearn import preprocessing

py.offline.init_notebook_mode(connected=True)

In [None]:
#importing data
df = pd.read_csv('../input/hotel-booking-demand/hotel_bookings.csv')

In [None]:
#sample of the data
df.head()

In [None]:
df.info()

We can see the dataframe has 32 columns and 119,390 rows. Are any of these rows duplicated?

In [None]:
df.duplicated().value_counts()

There are 31,994 rows that are duplicates, which accounts for more than 25% of the rows in the dataframe. As the dataframe was created by combining multiple tables from a reservation system, I wonder if the process of joining these tables created the large number of duplicates. Regardless, the duplicated rows will need to be dropped.

In [None]:
df.drop_duplicates(inplace=True)
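
As a small illustration of what `duplicated()` counts, the sketch below uses a made-up frame (the values are hypothetical, not taken from the dataset). With the default `keep='first'`, only the second and later copies of a row are flagged, so `drop_duplicates()` keeps one copy of each distinct row.

```python
import pandas as pd

# Hypothetical toy bookings; rows 0, 1 and 3 are identical
toy = pd.DataFrame({
    'hotel': ['City Hotel', 'City Hotel', 'Resort Hotel', 'City Hotel'],
    'lead_time': [10, 10, 45, 10],
    'adr': [80.0, 80.0, 120.0, 80.0],
})

# keep='first' (the default) flags every copy after the first occurrence
dup_mask = toy.duplicated()
n_duplicates = int(dup_mask.sum())
share_duplicates = dup_mask.mean() * 100

# One copy of each distinct row survives
deduped = toy.drop_duplicates()
print(n_duplicates, round(share_duplicates, 1), len(deduped))
```

Taking the mean of the boolean mask is a quick way to get the duplicate share without dividing counts by hand.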

Let's see if we have any missing values for each variable.

In [None]:
df.isnull().sum()

We have missing values for 4 variables, *children, country, agent,* and *company*. Given the large number of missing values for the *agent* and *company* columns I chose to drop these columns. For the *country* column I chose to drop the rows where values are missing. 

In [None]:
#dropping rows with no country information
df.dropna(subset=['country'],inplace=True)

#dropping agent and company columns
df.drop(['agent','company'], axis=1, inplace=True)
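
To make "large number of missing values" concrete, one option is to look at the share of missing values per column rather than the raw counts. The sketch below runs on a hypothetical toy frame, and the 50% cut-off is my own assumption for illustration, not a rule taken from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with the four columns that contain gaps
toy = pd.DataFrame({
    'children': [0.0, 1.0, np.nan, 0.0],
    'country': ['PRT', None, 'GBR', 'FRA'],
    'agent': [9.0, np.nan, np.nan, np.nan],
    'company': [np.nan, np.nan, np.nan, 40.0],
})

# Percentage of missing values per column, largest first
missing_pct = (toy.isnull().mean() * 100).sort_values(ascending=False)
print(missing_pct)

# Columns above an assumed 50% threshold are candidates for dropping
mostly_missing = missing_pct[missing_pct > 50].index.tolist()
print(mostly_missing)
```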

We have 4 missing values for the variable *children*. I am going to assume these missing values are bookings with 0 children.

In [None]:
#filling missing children values with 0 (assigning back avoids the chained inplace call, which newer pandas versions warn about)
df['children'] = df['children'].fillna(0)

#changing datatype of the children column to match adults and babies columns
df['children'] = df['children'].astype(int)

Next, let's rename some variables for easier reading and writing.

In [None]:
#renaming columns
df.rename(columns={'arrival_date_year':'year',
                   'arrival_date_month':'month',
                   'arrival_date_week_number':'week',
                   'arrival_date_day_of_month':'day'},inplace=True)

While the existing variables in the dataframe provide a good starting point, there are some additional calculated variables that are of interest to me. One of these is the day of arrival (Monday, Tuesday, etc.). We can find this information easily once the *year, month,* and *day* columns are combined into a single datetime column holding the arrival date.

In [None]:
name_to_num = {'January':1,
               'February':2,
               'March':3,
               'April':4,
               'May':5,
               'June':6, 
               'July':7,
               'August':8,
               'September':9,
               'October':10,
               'November':11,
               'December':12}


#converting month column to numerical value instead of string value
df['month'] = df['month'].map(name_to_num)

#converting columns year, month, day to create a datetime value for arrival date
df[['year','month','day']] = df[['year','month','day']].astype(str)
df['arrival_date'] = pd.to_datetime(df[['year','month','day']], errors='coerce')

#converting columns back to original datatype
df[['year','month','day']] = df[['year','month','day']].astype(int)

#ensuring no errors were created when producing the arrival_date column
df['arrival_date'].isnull().sum()
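
As an alternative sketch, the month names can also be parsed directly with the `%B` format code instead of mapping them to numbers first. The toy rows below are hypothetical, and `%B` assumes English month names, which is how they appear in the mapping above.

```python
import pandas as pd

# Hypothetical toy rows using the renamed columns from above
toy = pd.DataFrame({
    'year': [2015, 2016, 2017],
    'month': ['July', 'February', 'August'],
    'day': [1, 29, 31],
})

# Build strings such as '2015-July-1' and parse the full month name with %B
date_strings = toy['year'].astype(str) + '-' + toy['month'] + '-' + toy['day'].astype(str)
toy['arrival_date'] = pd.to_datetime(date_strings, format='%Y-%B-%d', errors='coerce')

# Name of the weekday for each arrival date
toy['day_of_week'] = toy['arrival_date'].dt.day_name()
print(toy)
```

With `errors='coerce'`, any date that cannot be parsed becomes `NaT`, so counting nulls afterwards works as a sanity check in the same way as in the cell above.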

Now we are ready to create some new variables. A total of 4 came to mind: total duration of stay, booking size, average daily rate per person, and day of arrival. You may notice below that I opted to exclude the variable *babies* from the calculation of the average daily rate per person. My reasoning is that babies are often free of charge when reserving a hotel room. Later in this notebook you will see that there is almost no correlation between the *adr* and *babies* variables.

In [None]:
#creating new variables
df['duration'] = df['stays_in_weekend_nights'] + df['stays_in_week_nights']
df['booking_size'] = (df['adults'] + df['children'] + df['babies'])
df['adr_pp'] = round(df['adr']/(df['adults'] + df['children']),2)
df['day_of_week'] = df['arrival_date'].dt.strftime('%A')

Let's take a look at some summary statistics for each numerical variable to see if any further adjustments are required.

In [None]:
df.describe()

There are several problems we need to address. First, notice that the mean and maximum values of *adr_pp* are infinite. Referring back to the definition of this variable, the problem is that some bookings have no adults or children, so the *adr* is being divided by 0. Let's see how many bookings have a *booking_size* of 0.

In [None]:
df[df['booking_size']==0].shape[0]

We have a total of 161 bookings where the *booking_size* is 0. In order to prevent the *adr_pp* from displaying this erroneous behavior I chose to drop all bookings with a *booking_size* of 0.

In [None]:
df.drop(df[df['booking_size']==0].index, inplace=True)
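
Because *adr_pp* divides by adults plus children while *booking_size* also counts babies, filtering on a *booking_size* of 0 removes every infinite value only if no booking consists of babies alone. Whether such rows exist in the real data is not shown here; the sketch below, on hypothetical toy rows, just shows how the two conditions can be compared.

```python
import numpy as np
import pandas as pd

# Hypothetical toy bookings; the last row has a baby but no adults or children
toy = pd.DataFrame({
    'adults':   [2, 0, 0],
    'children': [1, 0, 0],
    'babies':   [0, 0, 1],
    'adr':      [90.0, 50.0, 60.0],
})
toy['booking_size'] = toy['adults'] + toy['children'] + toy['babies']
toy['adr_pp'] = (toy['adr'] / (toy['adults'] + toy['children'])).round(2)

# Dividing a positive float by zero gives inf in pandas rather than raising
n_infinite = int(np.isinf(toy['adr_pp']).sum())

# Rows where the denominator of adr_pp is 0
no_adults_or_children = (toy['adults'] + toy['children']) == 0
# Rows with no guests at all
empty_bookings = toy['booking_size'] == 0

print(n_infinite, int(no_adults_or_children.sum()), int(empty_bookings.sum()))
```

If the last two counts differ on the real dataframe, filtering on `adults + children` instead of *booking_size* would be the stricter choice.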

The next issue to observe is that the minimum value of *duration* is 0. Let's see how many bookings we have with a *duration* of 0 nights.

In [None]:
df[df['duration']==0].shape[0]

A total of 586 bookings have a *duration* of 0 nights. While there may be an explanation for these bookings, the typical minimum stay at a hotel is 1 night, so they appear to be invalid. For the purpose of this analysis I chose to drop these rows.

In [None]:
df.drop(df[df['duration']==0].index, inplace=True)

The last issue concerns the *adr*. We have 1 booking with a negative *adr* and 1046 bookings with an *adr* of 0. More than half of the bookings with an *adr* of 0 have an assigned *market_segment* of Complementary, which suggests they do in fact have the correct *adr*. For the remaining bookings there is no simple explanation as to why the *adr* is 0, although some possibilities include hotel rewards program redemptions (if the hotels offer such a program), promotions, or corporate contracts. Rather than dropping these rows I will simply filter out the bookings with an *adr* of 0 when looking at *adr* trends. I will be dropping the single booking with the negative *adr*.

In [None]:
#Number of bookings with an adr of 0 by assigned market segment
df.loc[df['adr']==0,'market_segment'].value_counts()

In [None]:
df.drop(df[df['adr']<0].index, inplace=True)

Before moving on I would like to quickly address the issue of outliers in the dataframe. As you can see from the summary statistics, some variables clearly contain outliers. For example, *lead_time* has a maximum value of 737 days while the maximum number of *adults* on a booking is 55. There are many methods for identifying and removing outliers, but in some situations it can be better to keep them. In the context of this dataframe I believe it is better to keep them. Without understanding how the data was collected, it is possible that many perceived outliers are actually valid numbers under the restrictions of the reservation system. Of course, some outliers may be errors produced by manually entered booking details, but once again, without understanding how the data was collected we cannot know with any certainty.
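
For reference, one common way to flag (rather than drop) potential outliers is the 1.5 × IQR rule. The sketch below applies it to hypothetical lead times; it is not a step I apply to the dataframe in this notebook, and the 1.5 multiplier is the conventional default rather than anything derived from this data.

```python
import pandas as pd

# Hypothetical lead times in days, including one extreme value
lead_time = pd.Series([0, 3, 7, 14, 30, 45, 60, 90, 120, 737], name='lead_time')

# Interquartile range and the conventional 1.5 * IQR fences
q1, q3 = lead_time.quantile([0.25, 0.75])
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Flag values outside the fences without removing them
flagged = lead_time[(lead_time < lower_fence) | (lead_time > upper_fence)]
print(flagged.tolist())
```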

Lastly, let's define some frequently used subsets of the dataframe.

In [None]:
#completed booking for each hotel
city_completed = df.loc[(df['hotel']=='City Hotel') & (df['is_canceled']==0)]
resort_completed = df.loc[(df['hotel']=='Resort Hotel') & (df['is_canceled']==0)]

#all bookings for each hotel
city = df.loc[df['hotel']=='City Hotel']
resort = df.loc[df['hotel']=='Resort Hotel']

In the next section I will be referring to completed bookings frequently, so let's see how many we have for each hotel.

In [None]:
print('''Total number of completed bookings at the City Hotel: {:,}
Total number of completed bookings at the Resort Hotel: {:,}'''.format(city_completed.shape[0], resort_completed.shape[0]))

# Visualizing Booking Patterns

This section will focus on gaining insights into the booking patterns of the hotels' guests who **completed their stay**. These patterns should vary between the Resort Hotel and the City Hotel, as different booking costs, amenities, and locations will appeal to different types of guests. As such, all analysis in this section will group completed bookings by hotel type.

**Hotel Guests by Country of Origin**

In [None]:
print('''Guests staying at the City Hotel originated from {} unique countries compared to {} unique countries at the Resort Hotel.'''.format(city_completed['country'].nunique(), resort_completed['country'].nunique()))

In [None]:
resort_countries = resort_completed['country'].value_counts().rename_axis('country').reset_index(name='count')
#Countries with fewer than 100 completed bookings are grouped into the category 'Other'
resort_countries.loc[resort_countries['count'] < 100, 'country'] = 'Other'

resort_values = resort_countries['count'].tolist()
resort_labels = resort_countries['country'].tolist()

city_countries = city_completed['country'].value_counts().rename_axis('country').reset_index(name='count')
city_countries.loc[city_countries['count'] < 100, 'country'] = 'Other'

city_values = city_countries['count'].tolist()
city_labels = city_countries['country'].tolist()
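
The relabelling above leaves several rows that all carry the label 'Other'. If a single 'Other' row is preferred, the counts can be summed after relabelling. The sketch below uses hypothetical country counts and a threshold of 2 in place of 100 so that the toy data stays small.

```python
import pandas as pd

# Hypothetical completed bookings by country of origin
countries = pd.Series(['PRT'] * 5 + ['GBR'] * 3 + ['FRA'] + ['ESP'], name='country')
counts = countries.value_counts().rename_axis('country').reset_index(name='count')

# Assumed toy threshold; the notebook itself uses 100
threshold = 2
counts.loc[counts['count'] < threshold, 'country'] = 'Other'

# Collapse the repeated 'Other' labels into one row, keeping first-seen order
grouped = counts.groupby('country', sort=False)['count'].sum().reset_index()
print(grouped)
```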

In [None]:
specs = [[{'type':'domain'}, {'type':'domain'}]]

fig = make_subplots(1,2, specs=specs, subplot_titles=['Resort Hotel','City Hotel'])

fig.add_trace(go.Pie(name='Resort Hotel', labels=resort_labels, values=resort_values),1,1)
fig.add_trace(go.Pie(name='City Hotel', labels=city_labels, values=city_values),1,2)

fig.update_traces(textposition='inside', 
                  textinfo='label+percent+value', 
                  hovertemplate='Country: %{label} <br>Completed Bookings: %{value} <br>Percent: %{percent}')

fig.update_layout(title='Hotel Guests by Country of Origin',
                  template='seaborn')

Comparing the two hotels we can see that the Resort Hotel had a larger share of domestic visitors than the City Hotel. At the Resort Hotel roughly 1 in 3 guests were from Portugal compared to only 1 in 5 guests at the City Hotel. Despite this, the majority of guests at both hotels were foreign. Looking at individual countries we can see that German and French guests preferred the City Hotel to the Resort Hotel, while the opposite was true for guests from Great Britain and Ireland.

**Completed Bookings by Market Segment**

The dataset defined the following market segments: Online Travel Agents (TA), Offline Travel Agents / Tour Operators (TO), Direct with Hotel, Group, Corporate, Complementary, and Aviation.

In [None]:
city_values = city_completed['market_segment'].value_counts(normalize=True).apply(lambda x: x*100).rename_axis('segment').reset_index(name='city')
resort_values = resort_completed['market_segment'].value_counts(normalize=True).apply(lambda x: x*100).rename_axis('segment').reset_index(name='resort')

merged_hotels = pd.merge(city_values, resort_values, how='left', on='segment', sort=False).fillna(0)

city_values = merged_hotels['city'].tolist()
resort_values = merged_hotels['resort'].tolist()
market_segments = merged_hotels['segment'].tolist()
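#explicit order so the labels line up with the (resort_value, city_value) tuples used when plotting
hotels = ['Resort Hotel', 'City Hotel']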

In [None]:
fig = go.Figure()

trace=0

# adding a trace for each market segment by looping through a list of tuples structured like [(resort_value, city_value),..]
for segment_values in list(zip_longest(resort_values, city_values, fillvalue=0)):
    fig.add_trace(go.Bar(y=hotels, x=segment_values, name=market_segments[trace], orientation='h'))
    trace+=1

fig.update_traces(hovertemplate='(%{x:.1f}%, %{y})')

fig.update_xaxes(showticklabels=True,
                 tickmode = 'linear',
                 tick0 = 0,
                 dtick = 100,
                 showgrid=True, 
                 gridcolor='grey', 
                 zeroline=True, 
                 zerolinecolor='grey')

fig.update_layout(title = 'Completed Bookings by Market Segment', 
                  plot_bgcolor='white', barmode='stack',template='seaborn')

For both hotels the majority of completed bookings were made through either an Online TA or an Offline TA/TO. Booking direct with the hotel was more common at the Resort Hotel than at the City Hotel, as were, somewhat surprisingly, corporate bookings. There was also one market segment unique to the City Hotel: Aviation. These bookings were likely flight crews on short layovers.

**Completed Bookings by Duration of Stay**



In [None]:
print('''The mean duration of stay at the Resort Hotel was {:.1f} nights with the longest being {} nights.'''.format(resort_completed['duration'].mean(), resort_completed['duration'].max()))
print('''The mean duration of stay at the City Hotel was {:.1f} nights with the longest being {} nights.'''.format(city_completed['duration'].mean(), city_completed['duration'].max()))

In [None]:
city_duration = city_completed['duration'].tolist()
resort_duration = resort_completed['duration'].tolist()

In [None]:
fig = make_subplots(2,1)

fig.add_trace(go.Histogram(x=resort_duration, 
                           name='Resort Hotel', 
                           histnorm='percent',
                           marker_color='rgb(2,56,88)'),1,1)

fig.add_trace(go.Histogram(x=city_duration, 
                           name='City Hotel', 
                           histnorm='percent', 
                           marker_color='rgb(5,112,176)'),2,1)

fig.update_traces(hovertemplate='(%{x}, %{y:.1f}%)')

fig.update_xaxes(showticklabels=False,
                 range=[0.5,14.5], row=1, col=1)

fig.update_xaxes(title = 'Duration of Stay (Nights)',
                 tickmode = 'linear',
                 tick0 = 1,
                 dtick = 1,
                 range=[0.5,14.5], row=2, col=1)

fig.update_yaxes(title ='Percent (%)',
                 tickmode = 'linear',
                 tick0 = 0,
                 dtick = 5)

fig.update_layout(title='Completed Bookings by Duration of Stay')

For both hotels the majority of completed bookings had a duration of 7 nights or less. However, the Resort Hotel had a much higher share of completed bookings for 7 nights than the City Hotel. These bookings were likely 7-night-stay package vacations booked primarily via a Travel Agent or Tour Operator. The most common completed booking duration at the Resort Hotel was 1 night. For the City Hotel the most common completed booking duration was 3 nights although stays of 1-2 nights and 4 nights were also frequent.

In [None]:
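#restricting to completed bookings so the share is comparable to the completed-bookings average quoted below
week_long_resort = resort_completed.loc[resort_completed['duration']==7,'market_segment'].value_counts(normalize=True).apply(lambda x: x*100)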

print('''{:.1f}% of 7 night stays at the Resort Hotel were booked via an {} or an {} compared to an average of 65.4% across all completed Resort Hotel bookings.'''.format(week_long_resort[:2].sum(), week_long_resort.keys()[0], week_long_resort.keys()[1]))

**Completed Bookings by Booking Size**

In [None]:
print('''The mean booking size at the Resort Hotel was {:.1f} guests with the largest being {} guests.'''.format(resort_completed['booking_size'].mean(), resort_completed['booking_size'].max()))
print('''The mean booking size at the City Hotel was {:.1f} guests with the largest being {} guests.'''.format(city_completed['booking_size'].mean(), city_completed['booking_size'].max()))

Typically the maximum occupancy of a hotel room is 4 adults. Therefore, if we assume guests were unable to reserve more than one room on a single booking, we shouldn't see any bookings with more than 4 guests. This is a reasonable assumption, as more than 99.8% of completed bookings across both hotels had 4 or fewer guests. So what explains the ~0.2% of completed bookings that had more than 4 guests? Let's take a closer look at some of these bookings.

In [None]:
columns = ['adults','children','babies','booking_size']
df.loc[(df['is_canceled']==0) & (df['booking_size']>4), columns].sort_values('booking_size', ascending=False)

We can see there are several scenarios where a booking may have had more than 4 guests. Bookings of 5 guests made up of only adults and children appear only as 2 adults and 3 children or as 3 adults and 2 children. Alternatively, a booking may have up to a combined total of 4 adults and children plus any number of babies. This explains why there are bookings with 9 and 10 babies!

For the histogram below I chose to exclude bookings with more than 4 guests as they make up such a small share of the distribution that they won't be visible.  

In [None]:
city_booking_size = city_completed['booking_size'].tolist()
resort_booking_size = resort_completed['booking_size'].tolist()

In [None]:
fig = make_subplots(2,1)

fig.add_trace(go.Histogram(x=resort_booking_size, 
                           name='Resort Hotel', 
                           histnorm='percent',
                           marker_color='rgb(2,56,88)'),1,1)

fig.add_trace(go.Histogram(x=city_booking_size, 
                           name='City Hotel', 
                           histnorm='percent', 
                           marker_color='rgb(5,112,176)'),2,1)

fig.update_traces(hovertemplate='(%{x}, %{y:.1f}%)')

fig.update_xaxes(showticklabels=False,
                 range=[0.5,4.5], row=1, col=1)

fig.update_xaxes(title = 'Number of Guests',
                 tickmode = 'linear',
                 tick0 = 1,
                 dtick = 1,
                 range=[0.5,4.5], row=2, col=1)

fig.update_yaxes(title ='Percent (%)')

fig.update_layout(title='Completed Bookings by Booking Size')

The distribution of completed bookings by booking size is remarkably similar between the hotels, with roughly two-thirds of completed bookings having 2 guests. Hotel room inventory (i.e., the number of available 2-person and 4-person rooms) may be correlated with this distribution, although we unfortunately do not have access to this information.

**Completed Bookings by Day of Arrival**

In [None]:
days = ['Monday','Tuesday','Wednesday','Thursday','Friday','Saturday','Sunday']

city_labels = city_completed['day_of_week'].value_counts(normalize=True).reindex(days).keys().tolist()
city_values = city_completed['day_of_week'].value_counts(normalize=True).reindex(days).apply(lambda x: x*100).tolist()

resort_labels = resort_completed['day_of_week'].value_counts(normalize=True).reindex(days).keys().tolist()
resort_values = resort_completed['day_of_week'].value_counts(normalize=True).reindex(days).apply(lambda x: x*100).tolist()
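
One detail worth noting about `reindex(days)`: if a weekday never occurs in the data, its share becomes `NaN` rather than 0. The sketch below, on hypothetical arrivals, shows the same reordering with `fill_value=0` as a guard. The guard is my own addition and is only needed when some days are absent.

```python
import pandas as pd

days = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']

# Hypothetical arrival days; several weekdays are missing on purpose
arrivals = pd.Series(['Monday', 'Friday', 'Friday', 'Saturday'], name='day_of_week')

# Share of arrivals per weekday in calendar order, with absent days set to 0
share = (arrivals.value_counts(normalize=True)
                 .reindex(days, fill_value=0)
                 .mul(100))
print(share)
```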

In [None]:
fig = make_subplots(2,1)

fig.add_trace(go.Bar(x=resort_labels, 
                     y=resort_values,
                     name='Resort Hotel',
                     marker_color='rgb(2,56,88)'),1,1)

fig.add_trace(go.Bar(x=city_labels, 
                     y=city_values,
                     name='City Hotel',
                     marker_color='rgb(5,112,176)'),2,1)

fig.update_traces(textposition='inside',
                  texttemplate='%{value:.1f}%',
                  hovertemplate='Day: %{label} <br>Percent: %{value:.2f}%')

fig.update_xaxes(showticklabels=False, row=1, col=1)
fig.update_xaxes(title = 'Day of Week', row=2, col=1)

fig.update_yaxes(title ='Percent (%)')

fig.update_layout(title='Completed Bookings by Day of Arrival')

For both hotels there was minimal variation in the share of completed bookings by day of arrival. Monday, Thursday, and Saturday arrivals were slightly more common at the Resort Hotel, while Monday and Friday arrivals were slightly more common at the City Hotel. In general, Monday is a popular travel day for business travellers while Friday is often the starting point for weekend getaways. Both facts could explain why arrivals were slightly higher on these days at the City Hotel.

In [None]:
saturday_resort = resort_completed.loc[resort_completed['day_of_week']=='Saturday','duration'].value_counts(normalize=True).apply(lambda x: x*100)

print('''{:.1f}% of Saturday arrivals at the Resort Hotel stayed for 7 nights compared to an average of 16.1% across all completed Resort Hotel bookings. It is common for 7-night-stay vacation packages to begin on Saturday suggesting these bookings could be driving the slightly higher Saturday arrivals for the Resort Hotel.'''.format(saturday_resort[7]))

**Completed Bookings by Week**

To assess weekly changes in demand, a full calendar year's worth of booking data is required. Since the dataset only contains bookings between July 2015 and August 2017, we only have one full year (2016) to analyze. Fortunately, demand fluctuations by week tend to be relatively consistent year over year, as they are driven by factors such as school holidays, large events, and more general seasonal variations.

To make the graph shown below more intuitive to the reader, I restructured the way the week numbers were assigned to each day of the year. Originally the first week of the year contained only 2 days (Jan 1st - 2nd) and the last week (the 53rd) contained a full 7 days (Dec 25th - 31st). After the change, the first week of the year has a full 7 days (Jan 1st - 7th) and the last week only 2 days (Dec 30th - 31st). This adjustment should also better capture the demand around Christmas.

In [None]:
bookings_2016 = df.loc[df['year']==2016]

#original structure of days within each week
unique_days = bookings_2016.groupby('week')['arrival_date'].nunique()
unique_days.iloc[[0,-1]]

In [None]:
# Changing weeks so week 1 is January 1st - 7th
bookings_2016['week_2016'] = ((bookings_2016['arrival_date'] - dt.datetime(2016,1,1)).dt.days // 7) + 1

unique_days = bookings_2016.groupby('week_2016')['arrival_date'].nunique()
unique_days.iloc[[0,-1]]

Note that the warning above can be ignored: it is raised because *bookings_2016* is a filtered subset of `df`, and I am purposefully not adding the *week_2016* column to the original dataframe.
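
As a quick check of the week formula, the sketch below applies the same integer division to a handful of 2016 dates chosen by hand (the dates are my own picks for illustration). Since 2016 is a leap year with 366 days, 52 full weeks leave 2 days over for week 53.

```python
import pandas as pd

# Hand-picked boundary dates from the 2016 calendar year
dates = pd.Series(pd.to_datetime(['2016-01-01', '2016-01-07', '2016-01-08',
                                  '2016-12-29', '2016-12-30', '2016-12-31']))

# Days elapsed since January 1st, grouped into blocks of 7 and numbered from 1
week_2016 = ((dates - pd.Timestamp(2016, 1, 1)).dt.days // 7) + 1
print(week_2016.tolist())
```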

In [None]:
resort_week_2016 = list(bookings_2016.loc[(bookings_2016['is_canceled']==0) & (bookings_2016['hotel']=='Resort Hotel'), 'week_2016'])
city_week_2016 = list(bookings_2016.loc[(bookings_2016['is_canceled']==0) & (bookings_2016['hotel']=='City Hotel'), 'week_2016'])

In [None]:
fig = go.Figure()

fig.add_trace(go.Histogram(x=city_week_2016, 
                           name='City Hotel',
                           marker_color='rgb(5,112,176)'))

fig.add_trace(go.Histogram(x=resort_week_2016, name='Resort Hotel',
                           marker_color='rgb(2,56,88)'))

fig.update_layout(title='Completed Bookings by Week - 2016 Calendar Year', barmode='overlay')
fig.update_traces(opacity=0.8)

fig.update_xaxes(title = 'Week',
                 tickmode = 'linear',
                 tick0 = 0,
                 dtick = 5)

fig.update_yaxes(title = 'Completed Bookings')

fig.show()

The profiles of completed bookings for the two hotels generally mirror each other, with the fewest completed bookings from early December to early February and a notable exception around Christmas (especially for the City Hotel). On a week-to-week basis we can see that certain weeks had a high number of completed bookings compared to the week before or after. For example, in weeks 12 (late March) and 44 (late October) the Resort Hotel saw a spike in bookings which could have been related to school holidays.

**Average Daily Rate Per Person (*adr_pp*)**

As previously mentioned, more than 1000 bookings have an *adr* of 0. When looking at the *adr* over time I chose to filter the dataframe to select only bookings with a positive *adr* (i.e. paid bookings only). Additionally, I will be looking at the *adr_pp* not the *adr*. Since *adr* is correlated with *booking_size* we must look at *adr_pp* to accurately assess trends in hotel pricing over time.

**Note**: The chart below considers all bookings rather than just the completed bookings.

In [None]:
#filtering dataset for paid bookings only
adr_non_zero = df.loc[df['adr']>0]

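#numeric_only=True skips the text columns, which newer pandas versions refuse to average
adr_month = adr_non_zero.groupby(['hotel','year','month']).mean(numeric_only=True).reset_index().sort_values(['year','month'])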
adr_month['adr_pp'] = adr_month['adr_pp'].round(decimals=2)

years = adr_month['year'].unique().tolist()
hotels = adr_month['hotel'].unique().tolist()

In [None]:
fig = make_subplots(rows=2, cols=1, subplot_titles=['City Hotel','Resort Hotel'])

#adding traces to subplots by looping through hotel type and year
row = 1 
for hotel in hotels:
    for year in years:
        fig.add_trace(go.Scatter(x=list(adr_month.loc[(adr_month['year']==year) & (adr_month['hotel']==hotel),'month']), 
                                 y=list(adr_month.loc[(adr_month['year']==year) & (adr_month['hotel']==hotel),'adr_pp']), 
                                 name=str(year)), row=row, col=1)
    row += 1

fig.update_layout(title='Average Daily Rate Per Person by Month',
                  hovermode='x unified',
                  showlegend=False)

fig.update_xaxes(title = 'Month',
                 tickmode = 'linear',
                 tick0 = 1,
                 dtick = 1,
                 range =[0.5,12.5])

fig.update_yaxes(title='ADR Per Person (Euros)',
                 range=[20,100])

fig.show()

Seasonal effects are present for both hotels but to different degrees and with different profiles. For the City Hotel the months of May and September coincided with the highest *adr_pp*, while the lowest was during the middle of winter (Dec-Feb). Seasonal effects at the Resort Hotel are even more pronounced, with greater fluctuations in *adr_pp* throughout the year than at the City Hotel, a more defined peak in August, and a dip during the winter. Across both hotels we can also see a general rise in the *adr_pp* year over year.

**Completed Bookings by Customer Type**



In [None]:
city_values = city_completed['customer_type'].value_counts().tolist()
city_labels = city_completed['customer_type'].value_counts().keys().tolist()

resort_values = resort_completed['customer_type'].value_counts().tolist()
resort_labels = resort_completed['customer_type'].value_counts().keys().tolist()

In [None]:
specs = [[{'type':'domain'},{'type':'domain'}]]

fig = make_subplots(rows=1, cols=2, specs=specs, subplot_titles=['Resort Hotel','City Hotel'])

fig.add_trace(go.Pie(name='Resort Hotel', values=resort_values, labels=resort_labels), row=1, col=1)
fig.add_trace(go.Pie(name='City Hotel', values=city_values, labels=city_labels), row=1, col=2)

fig.update_layout(title='Completed Bookings by Customer Type',
                  template='seaborn')

The breakdown of completed bookings by customer type shows a very similar distribution between the Resort Hotel and City Hotel. The only notable difference is the higher percentage of contract bookings at the Resort Hotel.

# Exploring Relationships Between Variables

The focus of this section will be exploring the relationship between both numerical and categorical variables and whether a booking was canceled or not. Please note I am **not** trying to build a model to predict booking cancelations; rather, I am simply looking at which variables are correlated with booking cancelations. Once again I will be looking at both hotels separately.

## Numerical Variables

First, let's look at a list of numerical variables in the dataframe.

In [None]:
#collecting the names of all int64 and float64 columns
numerical_variables = df.select_dtypes(include=['int64','float64']).columns.tolist()

print(numerical_variables)

### Resort Hotel

This subsection will look at the relationships between numerical variables and booking cancelations at the Resort Hotel.

In [None]:
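#restricting to the numerical columns, since corr() cannot be computed on text columns
numericalMatrix = df.loc[df['hotel']=='Resort Hotel', numerical_variables].corr().round(decimals=2)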
mask = np.triu(np.ones_like(numericalMatrix))

sns.set(rc={'figure.figsize':(15.0,9.27)})
sns.heatmap(numericalMatrix, 
            annot=True,
            annot_kws={"size":10},
            linewidths=0.1,
            mask=mask,
            cmap='mako_r')

plt.title('Resort Hotel Correlation Matrix: Numerical Variables', size=18)

For the Resort Hotel the variables most correlated with *is_canceled* are *required_car_parking_spaces, lead_time, adr, booking_size, adr_pp, duration* and *year*. I will be looking at ***lead_time, booking_size, adr_pp, duration,*** and ***year***. In addition, I will also look at the relationship between *duration* and *lead_time*.
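
Reading the strongest correlations off a heatmap can be error-prone, so a ranked list is a useful cross-check. The sketch below ranks variables by the absolute value of their correlation with *is_canceled*. The toy values are hypothetical and chosen only so the ordering is easy to verify by hand.

```python
import pandas as pd

# Hypothetical toy bookings (values invented for illustration)
toy = pd.DataFrame({
    'is_canceled': [0, 0, 0, 1, 1, 1],
    'lead_time':   [5, 10, 20, 60, 90, 120],
    'babies':      [0, 1, 0, 0, 1, 0],
    'required_car_parking_spaces': [1, 1, 0, 0, 0, 0],
})

# Pearson correlation of every variable with the cancelation flag
corr_with_target = toy.corr()['is_canceled'].drop('is_canceled')

# Order by strength of the relationship while keeping the sign visible
order = corr_with_target.abs().sort_values(ascending=False).index
ranked = corr_with_target.reindex(order)
print(ranked.round(2))
```

Sorting by absolute value keeps strongly negative correlations, such as parking spaces in this toy example, near the top instead of at the bottom of the list.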

In [None]:
duration_resort = resort.loc[resort['duration']<15, ['duration','is_canceled']]
pivot = pd.pivot_table(duration_resort, index='duration', columns='is_canceled', aggfunc=len)
pivot['total'] = pivot[0] + pivot[1]
pivot['canceled_perc'] = round(pivot[1]/pivot['total']*100,1)
pivot['completed_perc'] = round(pivot[0]/pivot['total']*100,1)

duration = list(pivot.index)
canceled = list(pivot['canceled_perc'])
completed = list(pivot['completed_perc'])
count_stays = list(pivot['total'])

In [None]:
fig = make_subplots(rows=2, cols=1)

fig.add_trace(go.Bar(name='Canceled', 
                     x=duration, 
                     y=canceled,
                     marker_color='#EF553B'), row=1, col=1)

fig.add_trace(go.Bar(name='Completed', 
                     x=duration, 
                     y=completed,
                     marker_color='#636EFA'), row=1, col=1)

fig.add_trace(go.Bar(name='Total Bookings',
                     x=duration, 
                     y=count_stays, 
                     showlegend=False, 
                     marker_color='rgb(2,56,88)'), row=2, col=1)

fig.update_traces(texttemplate='%{value:.3s}%', 
                  textposition='auto',
                  textfont = dict(color='white', size=8), 
                  row=1, col=1)

fig.update_yaxes(title = 'Bookings (%)', 
                 showgrid=True, 
                 gridcolor='grey', 
                 gridwidth=1,
                 zeroline=True, 
                 zerolinecolor='grey',
                 zerolinewidth=1, 
                 row=1, col=1)

fig.update_xaxes(showticklabels=False,
                 row=1, col=1)

fig.update_yaxes(title = 'Total Bookings', 
                 showgrid=True, 
                 gridcolor='grey', 
                 gridwidth=1,
                 zeroline=True, 
                 zerolinecolor='grey',
                 zerolinewidth=1, 
                 row=2, col=1)

fig.update_xaxes(title = 'Duration of Stay (Nights)',
                 tickmode = 'linear',
                 tick0 = 0,
                 dtick = 1,
                 row=2, col=1)

fig.update_layout(title='Resort Hotel Cancelations by Duration of Stay',
                  barmode='stack',
                  plot_bgcolor='white')

fig.show()

For the Resort Hotel the overall trend is that longer stays (3+ nights) were more frequently canceled than shorter stays (1-2 nights).
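
The cancelation rates above come from a pivot table with `aggfunc=len`. As an alternative sketch, `pd.crosstab` produces the same kind of count table and fills combinations that never occur with 0 instead of `NaN`. The toy data below is hypothetical.

```python
import pandas as pd

# Hypothetical bookings: stay length in nights and cancelation flag
toy = pd.DataFrame({
    'duration':    [1, 1, 1, 1, 3, 3, 3, 3, 7, 7],
    'is_canceled': [0, 0, 0, 1, 0, 0, 1, 1, 1, 1],
})

# Count bookings for each duration and cancelation outcome
counts = pd.crosstab(toy['duration'], toy['is_canceled'])
# Make sure both outcome columns exist even if one never occurs
counts = counts.reindex(columns=[0, 1], fill_value=0)

counts['total'] = counts[0] + counts[1]
counts['canceled_perc'] = (counts[1] / counts['total'] * 100).round(1)
print(counts)
```

Because missing combinations are counted as 0, the `total` column stays numeric even for durations where every booking was canceled.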

In [None]:
resort_size = resort[['booking_size','is_canceled']]
pivot = pd.pivot_table(resort_size, index='booking_size', columns='is_canceled', aggfunc=len)
pivot['total'] = pivot[0] + pivot[1]
pivot['canceled_perc'] = round(pivot[1]/pivot['total']*100,1)
pivot['completed_perc'] = round(pivot[0]/pivot['total']*100,1)

guests = list(pivot.index)
canceled = list(pivot['canceled_perc'])
completed = list(pivot['completed_perc'])
count_stays = list(pivot['total'])

In [None]:
fig = make_subplots(rows=2, cols=1)

fig.add_trace(go.Bar(name='Canceled', 
                     x=guests, 
                     y=canceled,
                     marker_color='#EF553B'), row=1, col=1)

fig.add_trace(go.Bar(name='Completed', 
                     x=guests, 
                     y=completed,
                     marker_color='#636EFA'), row=1, col=1)

fig.add_trace(go.Bar(name='Total Bookings',
                     x=guests, 
                     y=count_stays, 
                     showlegend=False, 
                     marker_color='rgb(2,56,88)'), row=2, col=1)

fig.update_traces(texttemplate='%{value:.3s}%', 
                  textposition='auto',
                  textfont = dict(color='white', size=10), 
                  row=1, col=1)

fig.update_yaxes(title = 'Bookings (%)', 
                 showgrid=True, 
                 gridcolor='grey', 
                 gridwidth=1,
                 zeroline=True, 
                 zerolinecolor='grey',
                 zerolinewidth=1, 
                 row=1, col=1)

fig.update_xaxes(showticklabels=False,
                 range=[0.5,4.5],
                 row=1, col=1)

fig.update_yaxes(title = 'Total Bookings', 
                 showgrid=True, 
                 gridcolor='grey', 
                 gridwidth=1,
                 zeroline=True, 
                 zerolinecolor='grey',
                 zerolinewidth=1, 
                 row=2, col=1)

fig.update_xaxes(title = 'Number of Guests',
                 tickmode = 'linear',
                 tick0 = 0,
                 dtick = 1,
                 range=[0.5,4.5],
                 row=2, col=1)

fig.update_layout(title='Resort Hotel Cancelations by Booking Size',
                  barmode='stack',
                  plot_bgcolor='white')

fig.show()

For the Resort Hotel we can see that bookings with fewer guests typically had lower cancelation rates. Bookings with 4 guests (predominantly families?) were canceled at a rate nearly twice that of bookings with 2 guests.

In [None]:
adult_only = resort[(resort['booking_size']==4) & (resort['adults']==4)].shape[0]

print('''{:.1f}% of bookings with 4 guests at the Resort Hotel had either a child or a baby.'''.format((count_stays[3]-adult_only)*100/count_stays[3]))

### City Hotel

This subsection will look at relationships between numerical variables and booking cancelations at the City Hotel.

In [None]:
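#restricting to numeric columns so corr() does not fail on text columns in newer pandas versions
numericalMatrix = city.select_dtypes(include='number').corr().round(decimals=2)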
mask = np.triu(np.ones_like(numericalMatrix))

sns.set(rc={'figure.figsize':(15.0,9.27)})
sns.heatmap(numericalMatrix, 
            annot=True,
            annot_kws={"size":10},
            linewidths=0.1,
            mask=mask,
            cmap='mako_r')

plt.title('City Hotel Correlation Matrix: Numerical Variables', size=18)

For the City Hotel the variables most correlated with *is_canceled* are *lead_time, total_of_special_requests, required_car_parking_spaces, duration, booking_size, adr,* and *year*. I will be looking at the variables ***lead_time, duration, booking_size*** and ***year***. I chose to exclude the variable *adr* as I suspect much of its correlation with *is_canceled* is being driven by its strong relationship with *booking_size*. The variable *adr_pp* is a much better measure of hotel pricing; however, it is not correlated with booking cancelations for the City Hotel.
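
One way to probe the suspicion that a correlation is driven by a third variable is a partial correlation, which measures how two variables move together once the linear effect of the third has been removed. This was not part of the original analysis. The sketch below uses synthetic data in which `x` and `y` both depend on `z` but not on each other, standing in loosely for *adr*, *is_canceled* and *booking_size*. The function name `partial_corr` is hypothetical, and because *is_canceled* is binary the result on the real data would only be a rough linear approximation.

```python
import numpy as np

def partial_corr(x, y, z):
    """First-order partial correlation of x and y, controlling for z."""
    r_xy = np.corrcoef(x, y)[0, 1]
    r_xz = np.corrcoef(x, z)[0, 1]
    r_yz = np.corrcoef(y, z)[0, 1]
    return (r_xy - r_xz * r_yz) / np.sqrt((1 - r_xz**2) * (1 - r_yz**2))

# Synthetic data: x and y are both driven by z but not by each other.
rng = np.random.default_rng(0)
z = rng.normal(size=5000)
x = z + 0.5 * rng.normal(size=5000)
y = z + 0.5 * rng.normal(size=5000)

raw = np.corrcoef(x, y)[0, 1]
controlled = partial_corr(x, y, z)
print(f"raw correlation: {raw:.2f}, controlling for z: {controlled:.2f}")
```

If the partial correlation of *adr* and *is_canceled* given *booking_size* turned out to be close to zero on the real data, that would support excluding *adr*.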

In [None]:
duration_city = city.loc[city['duration']<15, ['duration','is_canceled']]
pivot = pd.pivot_table(duration_city, index='duration', columns='is_canceled', aggfunc=len)
pivot['total'] = pivot[0] + pivot[1]
pivot['canceled_perc'] = round(pivot[1]/pivot['total']*100,1)
pivot['completed_perc'] = round(pivot[0]/pivot['total']*100,1)

duration = list(pivot.index)
canceled = list(pivot['canceled_perc'])
completed = list(pivot['completed_perc'])
count_stays = list(pivot['total'])

In [None]:
fig = make_subplots(rows=2, cols=1)

fig.add_trace(go.Bar(name='Canceled', 
                     x=duration, 
                     y=canceled,
                     marker_color='#EF553B'), row=1, col=1)

fig.add_trace(go.Bar(name='Completed', 
                     x=duration, 
                     y=completed,
                     marker_color='#636EFA'), row=1, col=1)

fig.add_trace(go.Bar(name='Total Bookings',
                     x=duration, 
                     y=count_stays, 
                     showlegend=False, 
                     marker_color='rgb(5,112,176)'), row=2, col=1)

fig.update_traces(texttemplate='%{value:.3s}%', 
                  textposition='auto',
                  textfont = dict(color='white', size=8), 
                  row=1, col=1)

fig.update_yaxes(title = 'Bookings (%)', 
                 showgrid=True, 
                 gridcolor='grey', 
                 gridwidth=1,
                 zeroline=True, 
                 zerolinecolor='grey',
                 zerolinewidth=1, 
                 row=1, col=1)

fig.update_xaxes(showticklabels=False,
                 row=1, col=1)

fig.update_yaxes(title = 'Total Bookings', 
                 showgrid=True, 
                 gridcolor='grey', 
                 gridwidth=1,
                 zeroline=True, 
                 zerolinecolor='grey',
                 zerolinewidth=1, 
                 row=2, col=1)

fig.update_xaxes(title = 'Duration of Stay (Nights)',
                 tickmode = 'linear',
                 tick0 = 0,
                 dtick = 1,
                 row=2, col=1)

fig.update_layout(title='City Hotel Cancelations by Duration of Stay',
                  barmode='stack',
                  plot_bgcolor='white')

fig.show()

For a given duration of stay, cancelation rates at the City Hotel are higher than at the Resort Hotel. However, the general trend of longer stays being canceled more frequently than shorter stays remains consistent across both hotels.

In [None]:
city_size = city[['booking_size','is_canceled']]
pivot = pd.pivot_table(city_size, index='booking_size', columns='is_canceled', aggfunc=len)
pivot['total'] = pivot[0] + pivot[1]
pivot['canceled_perc'] = round(pivot[1]/pivot['total']*100,1)
pivot['completed_perc'] = round(pivot[0]/pivot['total']*100,1)

guests = list(pivot.index)
canceled = list(pivot['canceled_perc'])
completed = list(pivot['completed_perc'])
count_stays = list(pivot['total'])

In [None]:
fig = make_subplots(rows=2, cols=1)

fig.add_trace(go.Bar(name='Canceled', 
                     x=guests, 
                     y=canceled,
                     marker_color='#EF553B'), row=1, col=1)

fig.add_trace(go.Bar(name='Completed', 
                     x=guests, 
                     y=completed,
                     marker_color='#636EFA'), row=1, col=1)

fig.add_trace(go.Bar(name='Total Bookings',
                     x=guests, 
                     y=count_stays, 
                     showlegend=False, 
                     marker_color='rgb(5,112,176)'), row=2, col=1)

fig.update_traces(texttemplate='%{value:.3s}%', 
                  textposition='auto',
                  textfont = dict(color='white', size=10), 
                  row=1, col=1)

fig.update_yaxes(title = 'Bookings (%)', 
                 showgrid=True, 
                 gridcolor='grey', 
                 gridwidth=1,
                 zeroline=True, 
                 zerolinecolor='grey',
                 zerolinewidth=1, 
                 row=1, col=1)

fig.update_xaxes(showticklabels=False,
                 range=[0.5,4.5],
                 row=1, col=1)

fig.update_yaxes(title = 'Total Bookings', 
                 showgrid=True, 
                 gridcolor='grey', 
                 gridwidth=1,
                 zeroline=True, 
                 zerolinecolor='grey',
                 zerolinewidth=1, 
                 row=2, col=1)

fig.update_xaxes(title = 'Number of Guests',
                 tickmode = 'linear',
                 tick0 = 0,
                 dtick = 1,
                 range=[0.5,4.5],
                 row=2, col=1)

fig.update_layout(title='City Hotel Cancelations by Booking Size',
                  barmode='stack',
                  plot_bgcolor='white')

fig.show()

Similar to the Resort Hotel, City Hotel bookings were canceled less frequently when the booking had fewer guests. However, the disparity in cancelation rates across booking sizes was less pronounced than at the Resort Hotel.

### Grouped Relationship Charts - Resort Hotel and City Hotel

In [None]:
fig = px.box(df, x='hotel', y='lead_time', color='is_canceled', notched=True)

fig.update_layout(title='Booking Cancelations by Lead Time and Hotel',
                  xaxis_title='Hotel',
                  yaxis_title='Booking Lead Time (Days)')

fig.show()

It is clear that bookings made with longer lead times were more frequently canceled at both hotels. A possible explanation for this is that bookings made far ahead of the arrival date provide more time for the travel plans of the guest(s) to change. For example, someone who is booking a stay on Monday for Friday of the same week may be more likely to have definitive travel plans and follow through with the booking than someone booking 6 months in advance.

In [None]:
colors = {'City Hotel':'rgb(5,112,176)', 
          'Resort Hotel':'rgb(2,56,88)'}

fig = px.box(df, 
             x='duration',
             y='lead_time', 
             range_x=[0.5,14.5], 
             color='hotel', 
             notched=True, 
             color_discrete_map=colors)

fig.update_layout(title='Lead Time vs. Duration of Stay',
                  yaxis_title='Lead Time',
                  xaxis_title='Duration of Stay (Nights)')

fig.update_xaxes(tickmode = 'linear',
                 tick0 = 1,
                 dtick = 1)

fig.show()

The general trend is that the longer the stay, the longer the lead time. As we saw in the previous chart, shorter lead times are correlated with fewer cancelations, but so are shorter durations of stay. This raises the question of which of these variables, *duration* or *lead_time*, is the main predictor of cancelations. I'm inclined to think that *duration* is the better indicator and it just so happens that guests tend to make bookings with longer durations further in advance. Alternatively, it could be the other way around or, perhaps most likely, both variables have some influence.
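
A simple way to start separating the two effects is to compare cancelation rates for short and long lead times within each duration group. The sketch below does this on made-up toy data. The bucket boundaries (30 days, 2 nights) are arbitrary assumptions chosen for illustration, not values taken from the analysis.

```python
import numpy as np
import pandas as pd

# Toy bookings (made-up values) standing in for the cleaned dataframe.
toy = pd.DataFrame({
    "lead_time":   [2, 5, 10, 40, 90, 120, 3, 8, 60, 100, 150, 200],
    "duration":    [1, 2, 1, 2, 1, 2, 5, 6, 5, 7, 6, 5],
    "is_canceled": [0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 1],
})

# Bucket both variables so each cell of the table holds several bookings.
toy["lead_bucket"] = pd.cut(toy["lead_time"], bins=[-1, 30, np.inf],
                            labels=["0-30 days", "31+ days"])
toy["stay_bucket"] = pd.cut(toy["duration"], bins=[0, 2, np.inf],
                            labels=["1-2 nights", "3+ nights"])

# The mean of a 0/1 column is the cancelation rate within each (stay, lead) cell.
rates = (toy.groupby(["stay_bucket", "lead_bucket"], observed=True)["is_canceled"]
            .mean()
            .unstack("lead_bucket")
            .mul(100)
            .round(1))
print(rates)
```

Reading across a row holds the duration roughly fixed while the lead time changes, and reading down a column does the opposite. If the rate changes mostly across rows, *lead_time* is carrying information that *duration* alone does not, and vice versa.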

In [None]:
adr_month['cancelation_perc'] = (adr_month['is_canceled']*100).round(decimals=2)
adr_month[['year','month']] = adr_month[['year','month']].astype(str)

In [None]:
fig = px.scatter(adr_month,
                 x='adr_pp',
                 y='cancelation_perc', 
                 color='year',
                 facet_row='hotel',
                 hover_data=['month'])

fig.update_traces(marker=dict(size=16, line=dict(color='black', width=2)))
fig.update_layout(title='Booking Cancelations vs. Average Daily Rate per Person', 
                  xaxis_title='Average Daily Rate Per Person (Euros)')
fig.update_yaxes(title='Cancelations (%)')

fig.show()

For the City Hotel we can see there is no clear relationship between the *adr_pp* and the cancelation rate. On the other hand, it is apparent that as the *adr_pp* rises at the Resort Hotel, so does the cancelation rate. This suggests that guests of the Resort Hotel may be more price-sensitive than those at the City Hotel and, provided they are able to, will not hesitate to cancel a booking should they find a better rate elsewhere. In addition, we can also see that the cancelation rate seems to be rising year over year for both hotels.

On an unrelated note we have one outlier on the City Hotel subplot. For the month of July 2015 the cancelation rate was by far the highest of any month in the dataset despite the *adr_pp* being the lowest of any month. A possible explanation is that some external event in the city or region may have negatively impacted hotel demand and led to a large number of cancelations.

## Categorical Variables

The goal here was to construct a correlation matrix for each hotel containing only the categorical variables and the variable of interest, *is_canceled*. I wanted to achieve this without changing the original dataframe, so I created a copy of it.

In [None]:
df1 = df.copy()

There are several ways we can convert categorical variables to numerical. I chose to use sklearn's LabelEncoder. LabelEncoder works by assigning an integer to each unique string/category in an array. For example, in the code below all occurrences of 'City Hotel' and 'Resort Hotel' in the *hotel* variable were converted to 0 and 1, respectively. The same process is applied to each categorical variable in the dataframe.
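
To make the mapping concrete, here is a minimal standalone sketch of LabelEncoder on a short made-up list of hotel names. The list and variable names are only for illustration.

```python
from sklearn import preprocessing

le = preprocessing.LabelEncoder()
hotels = ["Resort Hotel", "City Hotel", "City Hotel", "Resort Hotel"]

# fit_transform learns the unique labels and returns one integer code per element.
codes = le.fit_transform(hotels)

# classes_ is sorted, so each integer code is the label's position in alphabetical order.
mapping = {str(label): int(code)
           for label, code in zip(le.classes_, le.transform(le.classes_))}
print(mapping)         # {'City Hotel': 0, 'Resort Hotel': 1}
print(codes.tolist())  # [1, 0, 0, 1]

# inverse_transform recovers the original labels from the codes.
restored = le.inverse_transform(codes).tolist()
print(restored)
```

Because the codes follow alphabetical order rather than any meaningful ranking, a correlation computed on a variable with more than two categories depends on that ordering and should be read as a rough screening tool only.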

In [None]:
le = preprocessing.LabelEncoder()

categorical_variables = []

#converting all categorical variables to numerical
for column in df1.columns:
    if df1.dtypes[column] == 'object':
        if column == 'hotel':
            le.fit(df1[column])
            le_name_mapping = dict(zip(le.classes_, le.transform(le.classes_)))
            print(le_name_mapping)
            df1[column] = le.transform(df1[column])
        else:
            categorical_variables.append(column)
            le.fit(df1[column])
            df1[column] = le.transform(df1[column])

Now that we've converted the categorical variables to numerical, let's look at a list of the categorical variables in the dataframe.

In [None]:
print(categorical_variables)

In [None]:
#inserting 'is_canceled' so it will be included in the categorical correlation matrices
categorical_variables.insert(0, 'is_canceled')

#creating a subset of the dataframe for each hotel that includes only categorical variables
resort_categorical = df1.loc[df1['hotel']==1, categorical_variables]
city_categorical = df1.loc[df1['hotel']==0, categorical_variables]

**Booking Cancelation Rate by Hotel**

You may have noticed that despite being a categorical variable, *hotel* was excluded from the list of categorical variables. This was deliberate, as *hotel* takes a single value within each of the two subsets used in the categorical correlation matrices, so it cannot tell us anything there. That leaves us with the question: was the hotel type correlated with booking cancelations? Let's find out.

In [None]:
df1.corr()['hotel']['is_canceled'].round(decimals=2)

We can see that *hotel* and *is_canceled* are weakly correlated.

In [None]:
#is_canceled is 0/1, so its mean is the share of canceled bookings
resort_cancel_rate = df.loc[df['hotel']=='Resort Hotel','is_canceled'].mean()*100
city_cancel_rate = df.loc[df['hotel']=='City Hotel','is_canceled'].mean()*100

print('''The overall booking cancelation rate for the Resort Hotel was {:.1f}% compared to {:.1f}% at the City Hotel.'''.format(resort_cancel_rate,city_cancel_rate))

### Resort Hotel 

This subsection will look at the relationships between categorical variables and booking cancelations for the Resort Hotel.

In [None]:
categoricalMatrix = resort_categorical.corr().round(decimals=2)
mask = np.triu(np.ones_like(categoricalMatrix))

sns.set(rc={'figure.figsize':(12.0,6.0)})
sns.heatmap(categoricalMatrix, 
            annot=True,
            linewidths=0.1,
            mask=mask,
            cmap='mako_r')

plt.title('Resort Hotel Correlation Matrix: Categorical Variables', size=18)

For the Resort Hotel we can see that the categorical variables most correlated with *is_canceled* are ***reservation_status***, ***country***, ***market_segment*** and ***distribution_channel***. Notice that *market_segment* and *distribution_channel* are nearly perfectly correlated, so we only need to look at one or the other. I chose to look at *market_segment*. *reservation_status* can also be ignored, as it is nearly perfectly correlated with *is_canceled*.

In [None]:
market_segment = resort[['market_segment','is_canceled']]
pivot = pd.pivot_table(market_segment, index='market_segment', columns='is_canceled', aggfunc=len)
pivot['total'] = pivot[0] + pivot[1]
pivot['canceled_perc'] = round(pivot[1]/pivot['total']*100,1)
pivot['completed_perc'] = round(pivot[0]/pivot['total']*100,1)

market_segment = list(pivot.index)
canceled = list(pivot['canceled_perc'])
completed = list(pivot['completed_perc'])
count_stays = list(pivot['total'])

In [None]:
fig = make_subplots(rows=2, cols=1)

fig.add_trace(go.Bar(name='Canceled', 
                     x=market_segment, 
                     y=canceled,
                     marker_color='#EF553B'), row=1, col=1)

fig.add_trace(go.Bar(name='Completed', 
                     x=market_segment, 
                     y=completed,
                     marker_color='#636EFA'), row=1, col=1)

fig.add_trace(go.Bar(name='Total Bookings',
                     x=market_segment, 
                     y=count_stays, 
                     showlegend=False, 
                     marker_color='rgb(2,56,88)'), row=2, col=1)

fig.update_traces(texttemplate='%{value:.3s}%', 
                  textposition='auto',
                  textfont = dict(color='white', size=10), 
                  row=1, col=1)

fig.update_yaxes(title = 'Bookings (%)', 
                 showgrid=True, 
                 gridcolor='grey', 
                 gridwidth=1,
                 zeroline=True, 
                 zerolinecolor='grey',
                 zerolinewidth=1, 
                 row=1, col=1)

fig.update_xaxes(showticklabels=False,
                 row=1, col=1)

fig.update_yaxes(title = 'Total Bookings', 
                 showgrid=True, 
                 gridcolor='grey', 
                 gridwidth=1,
                 zeroline=True, 
                 zerolinecolor='grey',
                 zerolinewidth=1, 
                 row=2, col=1)

fig.update_xaxes(title = 'Market Segment',
                 tickmode = 'linear',
                 tick0 = 0,
                 dtick = 1,
                 row=2, col=1)

fig.update_layout(title='Resort Hotel Cancelations by Market Segment',
                  barmode='stack',
                  plot_bgcolor='white')

fig.show()

Bookings made via an Online TA had by far the highest cancelation rate of all *market_segments* for the Resort Hotel while bookings made directly with the hotel or via an Offline TA/TO had among the lowest.

In [None]:
total_bookings = resort['country'].value_counts().rename_axis('Country').reset_index(name='Total Bookings')
canceled_bookings = resort.loc[resort['is_canceled']==1, 'country'].value_counts().rename_axis('Country').reset_index(name='Cancelations')

bookings = pd.merge(total_bookings, canceled_bookings, how='left', on='Country', sort=False).fillna(0)

bookings['Cancelation Rate (%)'] = round(bookings['Cancelations']*100/bookings['Total Bookings'],2)

In [None]:
fig = px.choropleth(bookings, 
                    title='Resort Hotel Cancelation Rate by Country of Origin',
                    locations='Country', 
                    color='Cancelation Rate (%)', 
                    color_continuous_scale=px.colors.sequential.Reds,
                    hover_data=['Total Bookings','Cancelations'])

fig.show()

Among the countries with the largest share of guests at the Resort Hotel, several stand out as having particularly high or low cancelation rates. Roughly 1 in 3 bookings from Portugal were canceled compared to only 1 in 9 from Great Britain and 1 in 5 from Spain.
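
When reading a map like this it helps to keep booking volume in mind, since a country with a single booking can only show a rate of 0% or 100%. As a rough sketch on made-up data, the per-country table could also be built in one `groupby` pass and filtered by a minimum number of bookings. The threshold `min_bookings` is a hypothetical value chosen for the toy data, and the three-letter country codes are assumed to match the format of the *country* column.

```python
import pandas as pd

# Toy bookings (made-up values); 'country' is assumed to hold three-letter codes.
toy = pd.DataFrame({
    "country":     ["PRT", "PRT", "PRT", "GBR", "GBR", "GBR", "ESP", "ESP", "FRA"],
    "is_canceled": [1, 0, 0, 0, 0, 0, 1, 0, 1],
})

# One pass: count bookings and cancelations per country, then derive the rate.
by_country = (toy.groupby("country")["is_canceled"]
                 .agg(total_bookings="count", cancelations="sum"))
by_country["cancelation_rate"] = (by_country["cancelations"] * 100
                                  / by_country["total_bookings"]).round(2)

# Hypothetical threshold: ignore countries with too few bookings to trust the rate.
min_bookings = 2
reliable = by_country[by_country["total_bookings"] >= min_bookings]
print(reliable.sort_values("cancelation_rate", ascending=False))
```

On the real data a much larger threshold would be needed before comparing countries.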

### City Hotel

This subsection will look at the relationships between categorical variables and booking cancelations for the City Hotel.

In [None]:
categoricalMatrix = city_categorical.corr().round(decimals=2)
mask = np.triu(np.ones_like(categoricalMatrix))

sns.set(rc={'figure.figsize':(12.0,6.0)})
sns.heatmap(categoricalMatrix, 
            annot=True,
            linewidths=0.1,
            mask=mask,
            cmap='mako_r')

plt.title('City Hotel Correlation Matrix: Categorical Variables', size=18)

For the City Hotel we can see the categorical variables most correlated with *is_canceled* are ***reservation_status***, ***market_segment***, ***distribution_channel***, and ***deposit_type***. For the reasons previously stated, *reservation_status* and *distribution_channel* will be ignored.

In [None]:
#we have two completed bookings with an undefined market segment
market_segment = city.loc[city['market_segment']!='Undefined', ['market_segment','is_canceled']]
pivot = pd.pivot_table(market_segment, index='market_segment', columns='is_canceled', aggfunc=len)
pivot['total'] = pivot[0] + pivot[1]
pivot['canceled_perc'] = round(pivot[1]/pivot['total']*100,1)
pivot['completed_perc'] = round(pivot[0]/pivot['total']*100,1)

market_segment = list(pivot.index)
canceled = list(pivot['canceled_perc'])
completed = list(pivot['completed_perc'])
count_stays = list(pivot['total'])

In [None]:
fig = make_subplots(rows=2, cols=1)

fig.add_trace(go.Bar(name='Canceled', 
                     x=market_segment, 
                     y=canceled,
                     marker_color='#EF553B'), row=1, col=1)

fig.add_trace(go.Bar(name='Completed', 
                     x=market_segment, 
                     y=completed,
                     marker_color='#636EFA'), row=1, col=1)

fig.add_trace(go.Bar(name='Total Bookings',
                     x=market_segment, 
                     y=count_stays, 
                     showlegend=False, 
                     marker_color='rgb(5,112,176)'), row=2, col=1)

fig.update_traces(texttemplate='%{value:.3s}%', 
                  textposition='auto',
                  textfont = dict(color='white', size=10), 
                  row=1, col=1)

fig.update_yaxes(title = 'Bookings (%)', 
                 showgrid=True, 
                 gridcolor='grey', 
                 gridwidth=1,
                 zeroline=True, 
                 zerolinecolor='grey',
                 zerolinewidth=1, 
                 row=1, col=1)

fig.update_xaxes(showticklabels=False,
                 row=1, col=1)

fig.update_yaxes(title = 'Total Bookings', 
                 showgrid=True, 
                 gridcolor='grey', 
                 gridwidth=1,
                 zeroline=True, 
                 zerolinecolor='grey',
                 zerolinewidth=1, 
                 row=2, col=1)

fig.update_xaxes(title = 'Market Segment',
                 tickmode = 'linear',
                 tick0 = 0,
                 dtick = 1,
                 row=2, col=1)

fig.update_layout(title='City Hotel Cancelations by Market Segment',
                  barmode='stack',
                  plot_bgcolor='white')

fig.show()

Similar to the Resort Hotel, City Hotel bookings made via an Online TA had a much higher cancelation rate compared to bookings made directly with the hotel or via an Offline TA/TO.

In [None]:
deposit = city[['deposit_type','is_canceled']]
pivot = pd.pivot_table(deposit, index='deposit_type', columns='is_canceled', aggfunc=len)
pivot['total'] = pivot[0] + pivot[1]
pivot['canceled_perc'] = round(pivot[1]/pivot['total']*100,1)
pivot['completed_perc'] = round(pivot[0]/pivot['total']*100,1)

deposit = list(pivot.index)
canceled = list(pivot['canceled_perc'])
completed = list(pivot['completed_perc'])
count_stays = list(pivot['total'])

In [None]:
fig = make_subplots(rows=2, cols=1)

fig.add_trace(go.Bar(name='Canceled', 
                     x=deposit, 
                     y=canceled,
                     marker_color='#EF553B'), row=1, col=1)

fig.add_trace(go.Bar(name='Completed', 
                     x=deposit, 
                     y=completed,
                     marker_color='#636EFA'), row=1, col=1)

fig.add_trace(go.Bar(name='Total Bookings',
                     x=deposit, 
                     y=count_stays, 
                     showlegend=False, 
                     marker_color='rgb(5,112,176)'), row=2, col=1)

fig.update_traces(texttemplate='%{value:.3s}%', 
                  textposition='auto',
                  textfont = dict(color='white', size=10), 
                  row=1, col=1)

fig.update_yaxes(title = 'Bookings (%)', 
                 showgrid=True, 
                 gridcolor='grey', 
                 gridwidth=1,
                 zeroline=True, 
                 zerolinecolor='grey',
                 zerolinewidth=1, 
                 row=1, col=1)

fig.update_xaxes(showticklabels=False,
                 row=1, col=1)

fig.update_yaxes(title = 'Total Bookings', 
                 showgrid=True, 
                 gridcolor='grey', 
                 gridwidth=1,
                 zeroline=True, 
                 zerolinecolor='grey',
                 zerolinewidth=1, 
                 row=2, col=1)

fig.update_xaxes(title = 'Deposit Type',
                 tickmode = 'linear',
                 tick0 = 0,
                 dtick = 1,
                 row=2, col=1)

fig.update_layout(title='City Hotel Cancelations by Deposit Type',
                  barmode='stack',
                  plot_bgcolor='white')

fig.show()

We can see that almost all bookings had no deposit. However, of the bookings that had a non-refundable deposit, nearly all were canceled. This cancelation rate is significantly higher than I would have expected for bookings of this type and leads me to question the validity of the data.

If you've made it this far, thanks for reading! To build on this analysis I may look to create some multiple regression models to predict cancelations, or dive into the topic of machine learning.