# SDAH MODUL Seminar 3 : Hotel Bookings Part 1 - Exploratory Analysis


This notebook, prepared for a tutorial, is inspired by and partly adapted from:
 - [Factors influencing Hotel Booking - Quick Study](https://www.kaggle.com/samiranbera/factors-influencing-hotel-booking-quick-study) by [SamiranBera](https://www.kaggle.com/samiranbera)
 - [EDA of bookings and ML to predict cancelations](https://www.kaggle.com/marcuswingen/eda-of-bookings-and-ml-to-predict-cancelations) by [Marcus Wingen](https://www.kaggle.com/marcuswingen)
 - [EDA of Hotel Bookings](https://www.kaggle.com/listonlt/eda-of-hotel-bookings) by [Liston Tellis](https://www.kaggle.com/listonlt)


## Dataset description and initial assessment

The [Hotel booking demand](https://www.kaggle.com/jessemostipak/hotel-booking-demand) dataset was originally described in [Antonio et al. (2019): Hotel booking demand datasets](https://doi.org/10.1016/j.dib.2018.11.126.). It was cleaned by Thomas Mock and Antoine Bichat for #TidyTuesday during the week of February 11th, 2020.

It contains booking data (31 variables) on two hotels in Portugal:
 - **H1:** a resort hotel in the Algarve (40,060 observations)
 - **H2:** a city hotel in Lisbon (79,330 observations)
 
Each observation represents a hotel booking (due to arrive between July 1, 2015 and August 31, 2017), including **bookings that effectively arrived and bookings that were canceled**. 

The data is from real hotel bookings, but all data pertaining to hotel or customer identification have been deleted.

### Preliminaries..

In [None]:
# Load common libraries:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import folium


# set some display options:
sns.set(style="whitegrid")
pd.set_option("display.max_columns", 36)

### Load the data

In [None]:
# load data:
file_path = "../input/hotel-booking-demand/hotel_bookings.csv"
full_data = pd.read_csv(file_path)

In [None]:
full_data.shape

In [None]:
full_data.head()
#full_data.tail()

These are the first 5 observations.

Hmm.. let's look at what columns we have available..

In [None]:
full_data.info()

In [None]:
full_data.hotel.unique()

Let's take a look at the variable descriptions from the paper:

- `hotel`: `Resort Hotel` or `City Hotel` *(Categorical)*
- `is_canceled` Value indicating if the booking was canceled (`1`) or not (`0`) *(Categorical)*



- `lead_time` Number of days that elapsed between the entering date of the booking into the PMS and the arrival date *(Integer)*
- `arrival_date_year` Year of arrival date *(Integer)*
- `arrival_date_month` Month of arrival date with 12 categories: `January` to `December` *(categorical)*
- `arrival_date_week_number` Week number of the arrival date *(Integer)*
- `arrival_date_day_of_month` Day of the month of the arrival date *(Integer)*



- `stays_in_weekend_nights` Number of weekend nights (Saturday or Sunday) the guest stayed or booked to stay at the hotel *(Integer)*
- `stays_in_week_nights` Number of week nights (Monday to Friday) the guest stayed or booked to stay at the hotel *(Integer)*



- `adults` Number of adults *(Integer)*
- `children` Number of children *(Integer)*
- `babies` Number of Babies *(Integer)*
- `meal` Type of meal booked. Categories are presented in standard hospitality meal packages: 
    - `Undefined/SC` - no meal package; 
    - `BB` – Bed & Breakfast; 
    - `HB` – Half board (breakfast and one other meal–usually dinner); 
    - `FB` – Full board (breakfast, lunch and dinner)
- `country` Country of origin. Categories are represented as three-letter ISO 3166 country codes *(Categorical)*

- `market_segment` Market segment designation 
    - `TA` means Travel Agents and 
    - `TO` means Tour Operators *(Categorical)*
- `distribution_channel` Booking distribution channel 
    - `TA` means Travel Agents and 
    - `TO` means Tour Operators *(Categorical)*



- `is_repeated_guest` Value indicating if the booking name was from a repeated guest (`1`) or not (`0`) *(Categorical)*
- `previous_cancellations` Number of previous bookings that were cancelled by the customer prior to the current booking *(Integer)*
- `previous_bookings_not_canceled` Number of previous bookings not cancelled by the customer prior to the current booking *(Integer)*


- `reserved_room_type` Code of room type reserved. Code is presented instead of designation for anonymity reasons *(Categorical)*
- `assigned_room_type` Code for the type of room assigned to the booking. Sometimes the assigned room type differs from the reserved room type due to hotel operation reasons (e.g. overbooking) or by customer request. Code is presented instead of designation for anonymity reasons *(Categorical)*


- `booking_changes` Number of changes/amendments made to the booking from the moment the booking was entered on the PMS until the moment of check-in or cancellation *(Integer)*
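- `deposit_type` Indication of whether the customer made a deposit to guarantee the booking (`No Deposit`, `Non Refund` or `Refundable`) *(Categorical)*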



- `agent` ID of the travel agency that made the booking *(Categorical)*
- `company` ID of the company/entity that made the booking or is responsible for paying the booking. ID is presented instead of designation for anonymity reasons *(Categorical)*

- `days_in_waiting_list` Number of days the booking was in the waiting list before it was confirmed to the customer *(Integer)*
- `customer_type` Type of booking, assuming one of four categories: 
    - `Contract` - when the booking has an allotment or other type of contract associated to it; 
    - `Group` – when the booking is associated to a group;
    - `Transient` – when the booking is not part of a group or contract, and is not associated to other transient booking;
    - `Transient-party` – when the booking is transient, but is associated to at least one other transient booking

- `adr` Average Daily Rate - Calculated by dividing the sum of all lodging transactions by the total number of staying nights *(Numeric)*
- `required_car_parking_spaces` Number of car parking spaces required by the customer *(Integer)*
- `total_of_special_requests` Number of special requests made by the customer (e.g. twin bed or high floor) *(Integer)*
- `reservation_status` Reservation last status, assuming one of three categories:
    - `Canceled` - booking was canceled by the customer;
    - `Check-Out` - customer has checked in but already departed;
    - `No-Show` - customer did not check-in and did inform the hotel of the reason why
- `reservation_status_date` Date at which the last status was set. This variable can be used in conjunction with the Reservation Status to understand when the booking was canceled or when the customer checked out of the hotel *(Date)*
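
To make the `adr` definition concrete, here is a minimal sketch with made-up numbers. The helper name `average_daily_rate` and the transaction amounts are illustrative assumptions and are not taken from the dataset or the paper.

```python
def average_daily_rate(lodging_transactions, staying_nights):
    """Sum of all lodging transactions divided by the number of staying nights."""
    if staying_nights <= 0:
        raise ValueError("staying_nights must be positive")
    return sum(lodging_transactions) / staying_nights

# Hypothetical booking: three nights billed at different rates
example_adr = average_daily_rate([90.0, 110.0, 100.0], staying_nights=3)
print(example_adr)
```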

**Note:** there are some differences between the original data set described in the paper and the dataset here:
 - in the paper, there is a separate data set for each hotel, which have been merged with an added column `hotel`
 - omission of redundant variables (e.g., Categorical and Integer versions of month)
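
The merge described in the note can be sketched roughly as follows. The two tiny frames `h1` and `h2` are invented stand-ins for the per-hotel files from the paper; only the idea of adding a `hotel` column before stacking the two is taken from the note.

```python
import pandas as pd

# Invented stand-ins for the two per-hotel datasets
h1 = pd.DataFrame({"is_canceled": [0, 1], "lead_time": [7, 120]})
h2 = pd.DataFrame({"is_canceled": [0, 0, 1], "lead_time": [3, 45, 200]})

# Label each frame with its hotel, then stack them into one table
h1 = h1.assign(hotel="Resort Hotel")
h2 = h2.assign(hotel="City Hotel")
merged = pd.concat([h1, h2], ignore_index=True)

print(merged["hotel"].value_counts())
```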
 

Let's see if that matches the data we have..

In [None]:
t = pd.DataFrame([[i,full_data[i].unique()] for i in full_data.columns])
t.columns = ['name','unique']
t   

In [None]:
full_data.describe(include='all')

**Phew, that's a lot of data and there are lots of things that we can ask about it..**

## Preprocessing

### Missing values

In [None]:
# check for missing values
full_data.isnull().sum()

**Looks like there is data missing for `children`, `country`, `agent`, and `company`**

This is quite common, and there are many more or less sophisticated strategies for dealing with missing data, such as:
 - dropping the observations that have missing values
 - replacing missing values with a specific value (e.g., the mean, or the mode, i.e., the most common value)
 - imputing missing values (e.g., carrying the last observation forward, hot deck methods)
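
As a rough illustration of the three strategies listed above, here is a small sketch on an invented toy frame. The column names mirror the dataset, but the values are made up, and which strategy is appropriate depends on the variable.

```python
import numpy as np
import pandas as pd

# Invented toy data with gaps
toy = pd.DataFrame({
    "children": [0.0, 2.0, np.nan, 1.0],
    "country": ["PRT", None, "PRT", "GBR"],
})

# 1) Drop every observation that has at least one missing value
dropped = toy.dropna()

# 2) Replace with a specific value: mean for numeric, mode for categorical
replaced = toy.fillna({
    "children": toy["children"].mean(),
    "country": toy["country"].mode()[0],
})

# 3) Simple imputation: carry the last observation forward
carried = toy.ffill()
```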


We have to make some assumptions here:
 - `agent`: Let's assume that when no agency is given, the booking was made without one.
 - `company`: Same here - if there is none given, it was most likely private
 
 - `children`: replace four missing values with `0`
 - `country`: add a category for `Unknown`
 

In [None]:
# Replace missing values:
nan_replacements = {"children": 0.0, "country": "Unknown", "agent": 0, "company": 0}
full_data_cln = full_data.fillna(nan_replacements)

# "meal" contains the value "Undefined", which is equivalent to "SC" (no meal package).
# Assign the result back instead of using inplace=True on a column selection:
full_data_cln["meal"] = full_data_cln["meal"].replace("Undefined", "SC")

In [None]:
# check for missing values
print('Remaining Missing Values = ',full_data_cln.isna().sum().sum())

That looks better.. anything else that may not be plausible? Let's check the guests..

In [None]:
zero_guests = list(full_data_cln.loc[full_data_cln["adults"]
                   + full_data_cln["children"]
                   + full_data_cln["babies"]==0].index)
zero_guests

Looks like we have "ghost bookings" (bookings for 0 adults, 0 children and 0 babies).. let's get rid of them..

In [None]:
# zero_guests holds index labels, so drop by label:
full_data_cln.drop(index=zero_guests, inplace=True)

In [None]:
# How much data is left?
full_data_cln.shape

### Check for outliers

In [None]:
# inspect the cleaned data (not the raw full_data):
t = pd.DataFrame([[i,full_data_cln[i].unique()] for i in full_data_cln.columns])
t.columns = ['name','unique']
t   

**9/10 babies; 10 children, 8 parking spaces seems a little excessive..**

In [None]:
full_data_cln[full_data_cln['babies'] > 8]

In [None]:
full_data_cln[full_data_cln['children'] > 8]

In [None]:
full_data_cln[full_data_cln['required_car_parking_spaces'] > 7]

**Hmm.. these seem exceptional and maybe they are accurate, so let's let that slide..**

### What about the continuous variables?

In [None]:
ax = sns.boxplot(x=full_data_cln['adr'])

Looks like an outlier, let's get rid of it...

In [None]:
# Deleting a record with ADR greater than 5000
full_data_cln = full_data_cln[full_data_cln['adr'] < 5000]
ax = sns.boxplot(x=full_data_cln['adr'])

This looks better..
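
The cut-off of 5000 was picked by eye from the boxplot. As a hedged alternative, the whisker rule that boxplots commonly use (1.5 times the interquartile range) can be applied directly. The prices below are invented, and whether such a rule is appropriate for `adr` remains a judgment call.

```python
import pandas as pd

# Invented prices with one obviously extreme value
prices = pd.Series([60.0, 75.0, 80.0, 95.0, 110.0, 130.0, 5400.0])

# Quartiles and interquartile range
q1, q3 = prices.quantile(0.25), prices.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Values outside the whiskers are flagged as potential outliers
outliers = prices[(prices < lower) | (prices > upper)]
print(outliers)
```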

## Exploratory Data Analysis

Let's start by formulating a few simple research questions:
1. Where do the guests come from?
1. When do they book?
1. How much do guests pay for a room per night?
1. How does the price per night vary over the year?
1. Which are the busiest months?
1. How long do people stay at the hotels?
1. Which are the most important market segments and booking channels?
1. Do guests come back?
1. How many bookings were canceled?
1. Which months have the highest number of cancelations?

In [None]:
full_data_cln['hotel'].value_counts()

Let's separate the Resort and City hotel.

For now, we are interested in the actual visitor numbers, so only bookings that were not canceled are included. 

In [None]:
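# .copy() gives independent frames, so columns can be added later without a SettingWithCopyWarning
rh = full_data_cln.loc[(full_data_cln["hotel"] == "Resort Hotel") & (full_data_cln["is_canceled"] == 0)].copy()
ch = full_data_cln.loc[(full_data_cln["hotel"] == "City Hotel") & (full_data_cln["is_canceled"] == 0)].copy()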

### Q1. Where do guests come from?

In [None]:
# get number of actual guests by country
# (naming the Series before to_frame() works regardless of what value_counts() calls its result)
country_data = (full_data_cln.loc[full_data_cln["is_canceled"] == 0, "country"]
                .value_counts()
                .rename("Number of Guests")
                .to_frame())
country_data.index.name = "country"
total_guests = country_data["Number of Guests"].sum()
country_data["Guests in %"] = round(country_data["Number of Guests"] / total_guests * 100, 2)
country_data.head()

In [None]:
# show on map
guest_map = px.choropleth(country_data,
                    locations=country_data.index,
                    color=country_data["Guests in %"], 
                    hover_name=country_data.index, 
                    color_continuous_scale=px.colors.sequential.Plasma,
                    title="Home country of guests")
guest_map.show()

**People from all over the world stay in these two hotels. Most guests are from Portugal and other countries in Europe.**
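
The percentage column can also be obtained in one step with `value_counts(normalize=True)`. A small sketch on invented country codes:

```python
import pandas as pd

# Invented home countries of eight guests
countries = pd.Series(["PRT", "PRT", "GBR", "FRA", "PRT", "GBR", "ESP", "PRT"])

# normalize=True returns relative frequencies instead of counts
share = (countries.value_counts(normalize=True) * 100).round(2)
print(share)
```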

### Q2. When do they book?

In [None]:
plt.rcParams['figure.figsize'] = 15,6
plt.hist(full_data_cln['lead_time'], bins=50)

plt.ylabel('Count')
plt.xlabel('Time (days)')
plt.title("Lead time distribution ", fontdict=None, position= [0.48,1.05])
plt.show()

### Q3. How much do they pay for a room per night?

Both hotels have different room types and different meal arrangements. Seasonal factors are also important. So the prices vary a lot.

In [None]:
rh["adr"].describe()

In [None]:
ch["adr"].describe()

**So the resort hotel seems to be slightly more expensive, let's take a look at the room categories..**

In [None]:
full_data_guests = full_data_cln.loc[full_data_cln["is_canceled"] == 0].copy() # only actual guests
room_prices = full_data_guests[["hotel", "reserved_room_type", "adr"]].sort_values("reserved_room_type")

# barplot with standard deviation:
plt.figure(figsize=(12, 8))
sns.barplot(x = "reserved_room_type", y="adr", hue="hotel", data=room_prices, 
            hue_order = ["City Hotel", "Resort Hotel"], ci="sd", errwidth=1, capsize=0.2)
plt.title("Price of room types per night", fontsize=16)
plt.xlabel("Room type", fontsize=16)
plt.ylabel("Price [EUR]", fontsize=16)
plt.legend(loc="upper right")
plt.show()

This figure shows the average price per room type, with error bars indicating the standard deviation.
Note that due to data anonymization rooms with the same type letter may not necessarily be the same across hotels.

##### Occupancy

In [None]:
full_data_guests['total_guests'] = full_data_guests['adults']+ full_data_guests['children']+ full_data_guests['babies']
plt.figure(figsize=(12,8))
ax = sns.countplot(x="total_guests", data = full_data_guests)
plt.title('Number of Guests')
plt.xlabel('total_guests')
plt.ylabel('Count')
for p in ax.patches:
    ax.annotate((p.get_height()),(p.get_x()+0.1 , p.get_height()+100)) 

In [None]:
# normalize price per night (adr) by the number of paying guests (babies are not counted):
full_data_cln["adr_pp"] = full_data_cln["adr"] / (full_data_cln["adults"] + full_data_cln["children"])
full_data_guests = full_data_cln.loc[full_data_cln["is_canceled"] == 0] # only actual guests
room_prices = full_data_guests[["hotel", "reserved_room_type", "adr_pp"]].sort_values("reserved_room_type")

# barplot with standard deviation:
plt.figure(figsize=(12, 8))
sns.barplot(x = "reserved_room_type", y="adr_pp", hue="hotel", data=room_prices, 
            hue_order = ["City Hotel", "Resort Hotel"], ci="sd", errwidth=1, capsize=0.2)
plt.title("Price of room types per night and person", fontsize=16)
plt.xlabel("Room type", fontsize=16)
plt.ylabel("Price [EUR]", fontsize=16)
plt.legend(loc="upper right")
plt.show()

### Q4. How does the price per night vary over the year?

To keep things simple, let's use the average ADR, regardless of the room type and meal.

In [None]:
# grab data:
room_prices_mothly = full_data_guests[["hotel", "arrival_date_month", "adr"]].sort_values("arrival_date_month")

# order by month:
ordered_months = ["January", "February", "March", "April", "May", "June", 
          "July", "August", "September", "October", "November", "December"]
room_prices_mothly["arrival_date_month"] = pd.Categorical(room_prices_mothly["arrival_date_month"], categories=ordered_months, ordered=True)

# lineplot with standard deviation:
plt.figure(figsize=(12, 6))
sns.lineplot(x = "arrival_date_month", y="adr", hue="hotel", data=room_prices_mothly, 
            hue_order = ["City Hotel", "Resort Hotel"], ci="sd", size="hotel", sizes=(2.5, 2.5))
plt.title("Room price per night over the year", fontsize=16)
plt.xlabel("Month", fontsize=16)
plt.xticks(rotation=45)
plt.ylabel("Price [EUR]", fontsize=16)
plt.show()

**This clearly shows that the prices in the Resort hotel are much higher during the summer (no surprise here).**

**The price of the city hotel varies less and is most expensive during spring and autumn.**
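
Month names sort alphabetically by default, which is why the month column is converted to an ordered categorical before plotting. A minimal sketch of the effect, using invented prices:

```python
import pandas as pd

ordered_months = ["January", "February", "March", "April", "May", "June",
                  "July", "August", "September", "October", "November", "December"]

toy_prices = pd.DataFrame({
    "arrival_date_month": ["March", "January", "August", "January"],
    "adr": [80.0, 50.0, 150.0, 70.0],
})

# Plain strings: groups come out in alphabetical order
alphabetical = toy_prices.groupby("arrival_date_month")["adr"].mean()

# Ordered categorical: groups come out in calendar order
toy_prices["arrival_date_month"] = pd.Categorical(
    toy_prices["arrival_date_month"], categories=ordered_months, ordered=True)
by_calendar = toy_prices.groupby("arrival_date_month", observed=True)["adr"].mean()

print(by_calendar)
```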

### Q5. Which are the busiest months?

In [None]:
# Create a DataFrame with the relevant data:
resort_guests_monthly = rh.groupby("arrival_date_month")["hotel"].count()
city_guests_monthly = ch.groupby("arrival_date_month")["hotel"].count()

resort_guest_data = pd.DataFrame({"month": list(resort_guests_monthly.index),
                    "hotel": "Resort hotel", 
                    "guests": list(resort_guests_monthly.values)})

city_guest_data = pd.DataFrame({"month": list(city_guests_monthly.index),
                    "hotel": "City hotel", 
                    "guests": list(city_guests_monthly.values)})
full_guest_data = pd.concat([resort_guest_data,city_guest_data], ignore_index=True)

# order by month:
ordered_months = ["January", "February", "March", "April", "May", "June", 
          "July", "August", "September", "October", "November", "December"]
full_guest_data["month"] = pd.Categorical(full_guest_data["month"], categories=ordered_months, ordered=True)

# Dataset contains July and August data from 3 years, the other months from 2 years. Normalize data:
full_guest_data.loc[(full_guest_data["month"] == "July") | (full_guest_data["month"] == "August"),
                    "guests"] /= 3
full_guest_data.loc[~((full_guest_data["month"] == "July") | (full_guest_data["month"] == "August")),
                    "guests"] /= 2

# show figure:
plt.figure(figsize=(12, 6))
sns.lineplot(x = "month", y="guests", hue="hotel", data=full_guest_data, 
             hue_order = ["City hotel", "Resort hotel"], size="hotel", sizes=(2.5, 2.5))
plt.title("Average number of hotel guests per month", fontsize=16)
plt.xlabel("Month", fontsize=16)
plt.xticks(rotation=45)
plt.ylabel("Number of guests", fontsize=16)
plt.show()

**Findings:**
- The City hotel has more guests during spring and autumn, when the prices are also highest. In July and August there are fewer visitors, although prices are lower.
- Guest numbers for the Resort hotel go down slightly from June to September, which is also when the prices are highest.
- Both hotels have the fewest guests during the winter.
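
The divisors 3 and 2 used for the normalization are hard-coded from the arrival window (July 2015 to August 2017). A hedged alternative is to count, per month, how many distinct years actually occur in the data. The toy frame below is invented; only the column names follow the dataset.

```python
import pandas as pd

# Invented arrivals: July appears in three years, March in two
toy = pd.DataFrame({
    "arrival_date_year":  [2015, 2016, 2017, 2016, 2017, 2017],
    "arrival_date_month": ["July", "July", "July", "March", "March", "March"],
})

# Per month: number of bookings and number of distinct years covered
per_month = toy.groupby("arrival_date_month").agg(
    guests=("arrival_date_year", "count"),
    years=("arrival_date_year", "nunique"),
)
per_month["guests_per_year"] = per_month["guests"] / per_month["years"]
print(per_month)
```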

In [None]:
from datetime import datetime

def month_converter(month):
    months = ['January', 'February', 'March', 'April', 'May', 'June','July', 'August', 'September', 'October', 'November', 'December']
    return months.index(month) + 1

rh_arr = rh.copy()  # work on a copy so the helper columns do not end up in rh
rh_arr['arrival_month'] = rh_arr['arrival_date_month'].apply(month_converter)
rh_arr['arrival_year_month'] = rh_arr['arrival_date_year'].astype(str) + " _ " + rh_arr['arrival_month'].astype(str)
# build a proper datetime from year, month and day of month:
rh_arr['Arrival Date'] = rh_arr.apply(lambda row: datetime.strptime(f"{int(row.arrival_date_year)}-{int(row.arrival_month)}-{int(row.arrival_date_day_of_month)}", '%Y-%m-%d'), axis=1)
rh_arr['arrival_day_of_week'] = rh_arr['Arrival Date'].dt.day_name()
weekdays = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]
rh_arr['arrival_day_of_week'] = pd.Categorical(rh_arr['arrival_day_of_week'],categories = weekdays)
# count arrivals per month (rows) and weekday (columns):
arrivals = pd.pivot_table(rh_arr,columns = 'arrival_day_of_week',index = 'arrival_month',values = 'reservation_status',aggfunc = 'count')

In [None]:
fig, ax = plt.subplots(figsize = (16,11))
ax = sns.heatmap(arrivals, annot=True, fmt="d", cmap = 'rocket_r')

### Q6. How long do people stay at the hotels?

In [None]:
# Create a DataFrame with the relevant data:
rh["total_nights"] = rh["stays_in_weekend_nights"] + rh["stays_in_week_nights"]
ch["total_nights"] = ch["stays_in_weekend_nights"] + ch["stays_in_week_nights"]

num_nights_res = list(rh["total_nights"].value_counts().index)
num_bookings_res = list(rh["total_nights"].value_counts())
rel_bookings_res = rh["total_nights"].value_counts() / sum(num_bookings_res) * 100 # convert to percent

num_nights_cty = list(ch["total_nights"].value_counts().index)
num_bookings_cty = list(ch["total_nights"].value_counts())
rel_bookings_cty = ch["total_nights"].value_counts() / sum(num_bookings_cty) * 100 # convert to percent

res_nights = pd.DataFrame({"hotel": "Resort hotel",
                           "num_nights": num_nights_res,
                           "rel_num_bookings": rel_bookings_res})

cty_nights = pd.DataFrame({"hotel": "City hotel",
                           "num_nights": num_nights_cty,
                           "rel_num_bookings": rel_bookings_cty})

nights_data = pd.concat([res_nights, cty_nights], ignore_index=True)

In [None]:
# show figure:
plt.figure(figsize=(16, 6))
sns.barplot(x = "num_nights", y = "rel_num_bookings", hue="hotel", data=nights_data,
            hue_order = ["City hotel", "Resort hotel"])
plt.title("Length of stay", fontsize=16)
plt.xlabel("Number of nights", fontsize=16)
plt.ylabel("Guests [%]", fontsize=16)
plt.legend(loc="upper right")
plt.show()

In [None]:
avg_nights_res = sum(list((res_nights["num_nights"] * (res_nights["rel_num_bookings"]/100)).values))
avg_nights_cty = sum(list((cty_nights["num_nights"] * (cty_nights["rel_num_bookings"]/100)).values))
print(f"On average, guests of the City hotel stay {avg_nights_cty:.2f} nights, and {cty_nights['num_nights'].max()} at maximum.")
print(f"On average, guests of the Resort hotel stay {avg_nights_res:.2f} nights, and {res_nights['num_nights'].max()} at maximum.")

- **For the city hotel there is a clear preference for 1-4 nights.**
- **For the resort hotel, 1-4 nights are also often booked, but 7 nights also clearly stand out as being very popular.**
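
The averages printed for Q6 are computed as a weighted mean over the distribution of stay lengths. The sketch below (invented numbers) suggests that this gives the same result as taking the mean of the column directly, which may be the simpler route.

```python
import pandas as pd

# Invented stay lengths in nights
total_nights = pd.Series([1, 2, 2, 3, 7, 7, 7, 3])

# Relative frequency of each stay length
rel = total_nights.value_counts(normalize=True)

# Weighted mean: sum over (nights * share of bookings)
weighted_avg = sum(nights * share for nights, share in rel.items())

# Plain mean of the raw column
plain_avg = total_nights.mean()

print(weighted_avg, plain_avg)
```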

### Q7. Which are the most important market segments and booking channels?

In [None]:
plt.figure(figsize=(12,6))
ax = sns.countplot(x="market_segment", data=full_data_cln, order = full_data_cln['market_segment'].value_counts().index)
plt.title('Market Segment')
plt.xlabel('market_segment')
plt.ylabel('Count')
for p in ax.patches:
    ax.annotate((p.get_height()),(p.get_x()+0.2 , p.get_height()+100)) 

In [None]:
plt.figure(figsize=(12,6))
ax = sns.countplot(x="distribution_channel", data=full_data_cln, order = full_data_cln['distribution_channel'].value_counts().index)
plt.title('Distribution Channel')
plt.xlabel('distribution_channel')
plt.ylabel('Count')

### Q8. Do guests come back?

In [None]:
plt.figure(figsize=(12,6))
ax = sns.countplot(x="is_repeated_guest", data = full_data_cln)
plt.title('Is Repeated Guest?')
plt.xlabel('is_repeated_guest')
plt.ylabel('Total Count')

### Q9. How many bookings were canceled?

In [None]:
# absolute cancelations:
total_cancelations = full_data_cln["is_canceled"].sum()
rh_cancelations = full_data_cln.loc[full_data_cln["hotel"] == "Resort Hotel"]["is_canceled"].sum()
ch_cancelations = full_data_cln.loc[full_data_cln["hotel"] == "City Hotel"]["is_canceled"].sum()

# as percent:
rel_cancel = total_cancelations / full_data_cln.shape[0] * 100
rh_rel_cancel = rh_cancelations / full_data_cln.loc[full_data_cln["hotel"] == "Resort Hotel"].shape[0] * 100
ch_rel_cancel = ch_cancelations / full_data_cln.loc[full_data_cln["hotel"] == "City Hotel"].shape[0] * 100

print(f"Total bookings canceled: {total_cancelations:,} ({rel_cancel:.0f} %)")
print(f"Resort hotel bookings canceled: {rh_cancelations:,} ({rh_rel_cancel:.0f} %)")
print(f"City hotel bookings canceled: {ch_cancelations:,} ({ch_rel_cancel:.0f} %)")

### Q10. Which months have the highest number of cancelations?

In [None]:
# Create a DataFrame with the relevant data:
res_book_per_month = full_data_cln.loc[(full_data_cln["hotel"] == "Resort Hotel")].groupby("arrival_date_month")["hotel"].count()
res_cancel_per_month = full_data_cln.loc[(full_data_cln["hotel"] == "Resort Hotel")].groupby("arrival_date_month")["is_canceled"].sum()

cty_book_per_month = full_data_cln.loc[(full_data_cln["hotel"] == "City Hotel")].groupby("arrival_date_month")["hotel"].count()
cty_cancel_per_month = full_data_cln.loc[(full_data_cln["hotel"] == "City Hotel")].groupby("arrival_date_month")["is_canceled"].sum()

res_cancel_data = pd.DataFrame({"Hotel": "Resort Hotel",
                                "Month": list(res_book_per_month.index),
                                "Bookings": list(res_book_per_month.values),
                                "Cancelations": list(res_cancel_per_month.values)})
cty_cancel_data = pd.DataFrame({"Hotel": "City Hotel",
                                "Month": list(cty_book_per_month.index),
                                "Bookings": list(cty_book_per_month.values),
                                "Cancelations": list(cty_cancel_per_month.values)})

full_cancel_data = pd.concat([res_cancel_data, cty_cancel_data], ignore_index=True)
full_cancel_data["cancel_percent"] = full_cancel_data["Cancelations"] / full_cancel_data["Bookings"] * 100

# order by month:
ordered_months = ["January", "February", "March", "April", "May", "June", 
          "July", "August", "September", "October", "November", "December"]
full_cancel_data["Month"] = pd.Categorical(full_cancel_data["Month"], categories=ordered_months, ordered=True)

# show figure:
plt.figure(figsize=(12, 8))
sns.barplot(x = "Month", y = "cancel_percent" , hue="Hotel",
            hue_order = ["City Hotel", "Resort Hotel"], data=full_cancel_data)
plt.title("Cancelations per month", fontsize=16)
plt.xlabel("Month", fontsize=16)
plt.ylabel("Cancelations [%]", fontsize=16)
plt.legend(loc="upper right")
plt.show()

- **For the City hotel the relative number of cancelations is around 40 % throughout the year.**
- **For the Resort hotel it is highest in the summer and lowest during the winter.**
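
Because `is_canceled` is coded 0/1, its mean within a group is the share of canceled bookings. The monthly percentages can therefore be sketched more compactly with a single `groupby`. The toy data below is invented.

```python
import pandas as pd

toy = pd.DataFrame({
    "hotel": ["City Hotel"] * 4 + ["Resort Hotel"] * 4,
    "arrival_date_month": ["July", "July", "March", "March"] * 2,
    "is_canceled": [1, 0, 1, 1, 0, 0, 1, 0],
})

# Mean of a 0/1 column = fraction of ones; times 100 gives percent
cancel_percent = (toy.groupby(["hotel", "arrival_date_month"])["is_canceled"]
                     .mean() * 100)
print(cancel_percent)
```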

# References
[Nuno Antonio, Ana de Almeida, Luis Nunes: Hotel booking demand datasets, Data in Brief, Volume 22, 2019, p. 41-49](https://doi.org/10.1016/j.dib.2018.11.126.)