## Overview 

This notebook investigates hotel booking data to find trends and determine relationships between variables of interest. The data was obtained from https://www.kaggle.com/jessemostipak/hotel-booking-demand

Goal - use the data to find trends and inform business decisions about where to invest more and what the patrons of the hotel services want more of. A further aim is to expose areas where money is being lost or profitability is not being maximized.

Due to the somewhat cyclical nature of the vacation industry, it is important to find ways to maximize income during the high seasons and minimize losses during the low seasons.

### Imports and Settings

In [None]:
import numpy as np 
import pandas as pd 
pd.set_option("display.max_columns",500)

import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns 
%matplotlib inline 

### Get the data

In [None]:
# load data:
file_path = "../input/hotel-booking-demand/hotel_bookings.csv"
data = pd.read_csv(file_path)

In [None]:
#take a look at the first 5 rows of the data 
data.head(5)

In [None]:
#let's describe the numerical data and see basic stats
data.describe()

In [None]:
#describe the categorical data and see basic stats
data.describe(include="O")

In [None]:
#get some basic info about the data contained
data.info()

In [None]:
#count the null values in each column
#null values are concentrated in the country, agent, and company columns
data.isnull().sum()
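
Raw null counts are easier to judge as a share of all rows. The following is a minimal sketch on made-up toy rows (the values are not taken from the real dataset); `toy` and `null_share` are hypothetical names used only for illustration.

```python
import numpy as np
import pandas as pd

# Toy rows, invented for illustration only
toy = pd.DataFrame({
    "country": ["PRT", None, "GBR", "ESP"],
    "agent": [9.0, np.nan, np.nan, 240.0],
    "adr": [75.0, 98.0, 110.0, 60.0],
})

# isnull() gives a boolean frame; the column mean is the fraction of nulls
null_share = toy.isnull().mean().sort_values(ascending=False)
print(null_share)
```

On the real data the same expression would be `data.isnull().mean()`. A column that is mostly null may be better dropped than imputed.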

### Visualize the data with some plots

**Let's investigate the finances of the hotel**

We will use columns such as average daily rate and create a new column called revenue as well

**ADR Analysis**

In [None]:
#For further analysis we will split the data into city and resort hotel 
city_data = data[data["hotel"]=="City Hotel"]
resort_data = data[data["hotel"]=="Resort Hotel"]

In [None]:
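#distribution of ADR for the city hotel, ignoring extreme values above 2000
#histplot replaces distplot, which is deprecated in recent seaborn releases
sns.histplot(city_data[city_data["adr"]<=2000]["adr"],bins=30,kde=True,stat="density")
plt.show()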

In [None]:
#same ADR distribution for the resort hotel
sns.histplot(resort_data[resort_data["adr"]<=2000]["adr"],bins=30,kde=True,stat="density")
plt.show()

In [None]:
#let's see monthly adr data
monc_adr = city_data.groupby("arrival_date_month")["adr"].describe()
monc_adr = monc_adr.reindex(["January","February","March","April","May","June","July","August","September","October",\
                           "November","December"])
monc_adr

In [None]:
#repeat for resort data
monr_adr = resort_data.groupby("arrival_date_month")["adr"].describe()
monr_adr = monr_adr.reindex(["January","February","March","April","May","June","July","August","September","October",\
                           "November","December"])
monr_adr

In [None]:
mon_adr = data.groupby("arrival_date_month")["adr"].describe()
mon_adr = mon_adr.reindex(["January","February","March","April","May","June","July","August","September","October",\
                           "November","December"])
mon_adr
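
The twelve-month list is retyped for every monthly summary in this notebook. A minimal sketch of one way to define it once and reuse it is given here. `MONTH_ORDER` and `monthly_summary` are hypothetical names that do not appear in the original notebook, and the toy rows are made up.

```python
import pandas as pd

# Calendar order, defined once and reused by every monthly summary
MONTH_ORDER = ["January", "February", "March", "April", "May", "June",
               "July", "August", "September", "October", "November", "December"]

def monthly_summary(df, value_col):
    """Describe value_col per arrival month, with rows in calendar order."""
    summary = df.groupby("arrival_date_month")[value_col].describe()
    # reindex sorts the rows by calendar month; absent months become NaN rows
    return summary.reindex(MONTH_ORDER)

# Toy rows, invented for illustration only
toy = pd.DataFrame({
    "arrival_date_month": ["March", "January", "March", "January"],
    "adr": [100.0, 60.0, 120.0, 80.0],
})
toy_summary = monthly_summary(toy, "adr")
print(toy_summary["mean"].dropna())
```

With this helper, `monthly_summary(city_data, "adr")` would stand in for the groupby and reindex pair used in the cells above.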

In [None]:
#city hotel: lower variability with min just over $80
ax1=sns.barplot(x=monc_adr["mean"],y=monc_adr.index,palette='muted')
ax1.set_xlabel("ADR")
ax1.set_ylabel("Month")
plt.show()

In [None]:
#resort hotel: higher variability with min just under $50
ax2=sns.barplot(x=monr_adr["mean"],y=monr_adr.index,palette='muted')
ax2.set_xlabel("ADR")
ax2.set_ylabel("Month")
plt.show()

*Here we see the resort hotel charges much higher rates during July and August, which may present opportunities to maximize revenue. As we will see later, the cancellation rate is high, so the hotel may not be capturing as much revenue as desired.*

### Cancellations 
**Below we will take a brief look at cancellations**<br/>
When we look at revenue and visitors below, we will need to filter the dataset so that it reflects the actual visitors <br/>
and the revenue obtained from them


In [None]:
data.is_canceled.value_counts()

*We see here that just over one third of our entries resulted in cancellations. If this is not taken into account when we look at revenue and visitor data, it will greatly skew any insights. When looking at other items, such as ADR trends and popular packages, the cancellations aren't as big a factor.* <br/>

*This is different from the no-show case, where it is assumed that the full payment for the visit has been remitted and not refunded. With cancellations, the revenue depends on the cancellation policy, under which a full or partial refund may be given.*
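
Because no-shows and cancellations are treated differently here, it helps to check how the two are recorded before filtering on `is_canceled`. The sketch below uses made-up toy rows. It assumes, without confirmation from this notebook, that a no-show row carries `is_canceled == 1` together with `reservation_status == "No-Show"`, which should be verified on the real data.

```python
import pandas as pd

# Toy rows, invented for illustration only.
# ASSUMPTION: no-shows are flagged is_canceled == 1 with status "No-Show".
toy = pd.DataFrame({
    "is_canceled": [0, 1, 1, 0, 0, 1],
    "reservation_status": ["Check-Out", "Canceled", "No-Show",
                           "Check-Out", "Check-Out", "Canceled"],
})

# The mean of a 0/1 flag is the share of flagged bookings
cancel_share = toy["is_canceled"].mean()

# Break the flagged bookings down by their recorded status
status_among_canceled = (
    toy.loc[toy["is_canceled"] == 1, "reservation_status"].value_counts()
)
print(cancel_share)
print(status_among_canceled)
```

If no-shows do turn up among the flagged rows, a filter on `is_canceled == 0` also removes them, even though their payment is assumed to be retained.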

In [None]:
#create ndata variable for new data not including the cancelled bookings 
ndata = data[data.is_canceled == 0].copy()
ndata.head()

**Revenue Analysis** <br/>

Revenue will be estimated by multiplying ADR by the duration of the stay. To do this, a duration column and a revenue column will be created.
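
As a quick check of the estimate, here is a small worked example on made-up toy rows (the values are not from the real dataset). It uses the same column names as the notebook.

```python
import pandas as pd

# Toy rows, invented for illustration only
toy = pd.DataFrame({
    "adr": [100.0, 80.0, 120.0],
    "stays_in_weekend_nights": [2, 0, 1],
    "stays_in_week_nights": [3, 1, 0],
})

# duration = total nights stayed
toy["duration"] = toy["stays_in_weekend_nights"] + toy["stays_in_week_nights"]

# estimated revenue = average daily rate x number of nights
toy["revenue"] = toy["adr"] * toy["duration"]
print(toy[["adr", "duration", "revenue"]])
```

The first row, for example, gives 100 x 5 nights = 500. A booking with zero recorded nights would contribute zero revenue under this estimate.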

In [None]:
#create duration column
ndata["duration"] =  ndata['stays_in_weekend_nights'] + ndata['stays_in_week_nights']

In [None]:
#create revenue column
ndata["revenue"] = ndata["adr"]*ndata["duration"]

In [None]:
#we split the data again to look at each individually  
city_data = ndata[ndata["hotel"]=="City Hotel"]
resort_data = ndata[ndata["hotel"]=="Resort Hotel"]

In [None]:
#revenue data and distribution
city_data["revenue"].describe()

In [None]:
city_data["revenue"].sum()

In [None]:
resort_data["revenue"].describe()

In [None]:
resort_data["revenue"].sum()

In [None]:
#let's see monthly revenue data for the city hotel
#select the column before summing so non-numeric columns are not aggregated
monc_rev = city_data.groupby("arrival_date_month")["revenue"].sum()
monc_rev = monc_rev.reindex(["January","February","March","April","May","June","July","August","September","October",\
                           "November","December"])
monc_rev

In [None]:
ax3 = monc_rev.plot.bar(rot=50)
ax3.set_xlabel("Month")
ax3.set_ylabel("Revenue")
plt.show()

In [None]:
#let's see monthly revenue data for the resort hotel
monr_rev = resort_data.groupby("arrival_date_month")["revenue"].sum()
monr_rev = monr_rev.reindex(["January","February","March","April","May","June","July","August","September","October",\
                           "November","December"])
monr_rev

In [None]:
ax4 = monr_rev.plot.bar(rot=50)
ax4.set_xlabel("Month")
ax4.set_ylabel("Revenue")
plt.show()

In [None]:
#revenue by channel
city_roi = city_data.groupby("distribution_channel")["revenue"].sum()
city_roi

In [None]:
city_roi.plot(kind="bar")
plt.show()

In [None]:
resort_roi = resort_data.groupby("distribution_channel")["revenue"].sum()
resort_roi

In [None]:
resort_roi.plot(kind="bar")
plt.show()

**Investigating Most Popular Packages, Room Types and Special Requests**

In [None]:
#distribution of special requests 
sns.countplot(x=data["total_of_special_requests"])
plt.show()

In [None]:
#most popular meal
sns.countplot(x=data["meal"])
plt.show()

In [None]:
#Most popular booking channel 
#Travel agents and tour operators are bringing in most of the visitors 
sns.countplot(x=data["distribution_channel"])
plt.show()

In [None]:
#Most popular market segment 
#travel agents are most represented among our visitors as well 
sns.countplot(x=data["market_segment"])
plt.xticks(rotation=50)
plt.show()

**Investigating the distribution and behaviour of people**

In [None]:
#top 10 countries of origin for city hotel visitors
#naming the columns directly works regardless of how value_counts labels its output
top10c = city_data["country"].value_counts().nlargest(10).rename_axis("Country").reset_index(name="Visitors")
top10c

In [None]:
#top 10 countries of origin for resort hotel visitors
top10r = resort_data["country"].value_counts().nlargest(10).rename_axis("Country").reset_index(name="Visitors")
top10r

In [None]:
#Average duration of stay per month
av_dur = ndata.groupby("arrival_date_month")["duration"].mean()
av_dur = av_dur.reindex(["January","February","March","April","May","June","July","August","September","October",\
                           "November","December"])
av_dur

In [None]:
ax5 = av_dur.plot.bar(rot=50)
ax5.set_xlabel("Month")
ax5.set_ylabel("Duration")
plt.show()

In [None]:
#lead time for booking (histplot replaces the deprecated distplot)
sns.histplot(data['lead_time'],bins=30,kde=True,stat="density")
plt.show()

In [None]:
#amount of cancellations per month
df_can = data[data["reservation_status"]=="Canceled"]
mon_can = df_can.groupby("arrival_date_month")["is_canceled"].sum()
mon_can = mon_can.reindex(["January","February","March","April","May","June","July","August","September","October",\
                           "November","December"])
mon_can

In [None]:
ax6 = mon_can.plot.bar(rot=50)
ax6.set_xlabel("Month")
ax6.set_ylabel("Cancellations")
plt.show()

### Investigating correlations in the data

In [None]:
data.describe(include="O").columns

In [None]:
corr_data = data.drop(['hotel', 'arrival_date_month', 'meal', 'country', 'market_segment',
       'distribution_channel', 'reserved_room_type', 'assigned_room_type',
       'deposit_type', 'customer_type', 'reservation_status',
       'reservation_status_date'],axis=1)


In [None]:
corr_data.corr()

In [None]:
plt.figure(figsize=(12,6))
sns.heatmap(corr_data.corr())

In [None]:
corr_data.corr()["is_canceled"].sort_values(ascending=False).to_frame()

In [None]:
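#hypothesis - the city hotel has higher lead times but also more cancellations; would reducing
#lead times lead to fewer cancellations and higher revenue?
#city_data was rebuilt from ndata above and excludes canceled bookings, so use all city bookings here
data[data["hotel"]=="City Hotel"]["lead_time"].mean()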

In [None]:
ocity_data = data[data["hotel"]=="City Hotel"]
oresort_data = data[data["hotel"]=="Resort Hotel"]

In [None]:
ocity_data["is_canceled"].sum()

In [None]:
city_can = ocity_data["is_canceled"].sum()/ocity_data.shape[0]
city_can

In [None]:
#in which channel are we seeing the most cancellations?
ocity_data.groupby("distribution_channel")["is_canceled"].sum().to_frame()

In [None]:
oresort_data["lead_time"].mean()

In [None]:
oresort_data["is_canceled"].sum()

In [None]:
resort_can = oresort_data["is_canceled"].sum()/oresort_data.shape[0]
resort_can

In [None]:
#in which channel are we seeing the most cancellations?
oresort_data.groupby("distribution_channel")["is_canceled"].sum().to_frame()
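
Raw cancellation counts favor the channels with the most bookings, so a rate per channel can be a fairer comparison. A minimal sketch on made-up toy rows follows (the values are not from the real dataset); `channel_can` is a hypothetical name.

```python
import pandas as pd

# Toy rows, invented for illustration only
toy = pd.DataFrame({
    "distribution_channel": ["TA/TO"] * 6 + ["Corporate"] * 3,
    "is_canceled": [1, 1, 1, 0, 0, 0, 1, 1, 0],
})

# Bookings, cancellations, and cancellation rate for each channel
channel_can = toy.groupby("distribution_channel")["is_canceled"].agg(
    bookings="count",
    cancellations="sum",
    cancel_rate="mean",
)
print(channel_can.sort_values("cancel_rate", ascending=False))
```

In this toy example TA/TO has more cancellations in absolute terms, yet Corporate has the higher cancellation rate. The same aggregation could be applied to `ocity_data` and `oresort_data`.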

### Recommendations and Future Work

**These are some suggestions based on the information provided. Additional work will need to be done to get more actionable insights.**

<ol>
<li>A cost-benefit analysis should be done to determine whether an appropriate return is being made on each distribution channel being used.</li>
<li>Cancellations during peak season for the resort may be more detrimental to revenue, as that period contributes the most to revenue. A recommendation would be to further investigate the connection between lead time and cancellation, and other factors in the causal chain, to minimize this. The resort could also run promotions and provide incentives to minimize cancellations.</li>
<li>It seems that BB, HB, and SC meal packages are the most popular. To save on costs, it may be good to phase out the other meal packages and focus on the best performing. Further analysis should be done to determine what makes these three so popular.</li>
</ol>
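
One possible starting point for the lead-time investigation suggested above is to bucket bookings by lead time and compare cancellation rates across buckets. The sketch below uses made-up toy rows, and the bucket edges are hypothetical cutoffs chosen only for illustration. It shows association only and says nothing about causation.

```python
import pandas as pd

# Toy rows, invented for illustration only
toy = pd.DataFrame({
    "lead_time": [2, 10, 45, 60, 120, 200, 300, 5],
    "is_canceled": [0, 0, 0, 1, 1, 1, 1, 0],
})

# Hypothetical bucket edges in days; the last bucket is open-ended
bins = [-1, 30, 90, float("inf")]
labels = ["0-30", "31-90", "91+"]
toy["lead_bucket"] = pd.cut(toy["lead_time"], bins=bins, labels=labels)

# The mean of the 0/1 flag within each bucket is its cancellation rate
rate_by_bucket = toy.groupby("lead_bucket", observed=True)["is_canceled"].mean()
print(rate_by_bucket)
```

Running the same grouping separately for the city and resort data would show whether the pattern differs between the two hotels.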