<h1> Hotel Booking Exploratory Data Analysis </h1>
   

<b> Summary </b>

The major questions answered in this EDA are as follows: 
<ul>
  <li>Distribution of guests in relation to both Hotels</li>
  <li>What factors influence the Average Daily Rate (ADR)</li>
  <li>Length of stay for guests</li>
</ul>
    This dataset covers two hotels based in Portugal, a Resort Hotel and a City Hotel, over the period from July 15, 2015 to August 31, 2017.
  
    

In [None]:
#setup imports

import os
import warnings

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.gridspec as gridspec
import seaborn as sns
%matplotlib inline

#silence library warnings to keep the notebook output readable
warnings.filterwarnings('ignore')

In [None]:
#there are two different hotels: Resort Hotel and City Hotel

Load the CSV file and check that the data has loaded correctly.

In [None]:
#read CSV
hotel = pd.read_csv('../input/hotel-booking-demand/hotel_bookings.csv')
hotel.head()

<B> Preprocessing Data </B>

In [None]:
#search for null values in the dataframe 
hotel.isnull().sum()

<br>

<br>

Several columns contain values that are not in a format suitable for exploratory analysis. To correct this, we replace the NaN values or drop unneeded columns: NaN in children is set to 0, NaN in agent is set to 0 (indicating no agent), and NaN in country is set to "Unknown" because that information is not available. According to the description that accompanies the input CSV, the meal type "Undefined" is interchangeable with "SC" (both signify no meal package). Since the other meal types in the dataset (such as BB and HB) are abbreviations, we convert "Undefined" to "SC" for clarity.

NOTE: Some variables have been anonymized for privacy.
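The missing-value handling described above can be sketched on a few made-up rows. The `toy` frame and its values are illustrative assumptions, not rows from the dataset:

```python
import numpy as np
import pandas as pd

# Made-up rows that mimic the missing-value patterns described above.
toy = pd.DataFrame({
    "country": ["PRT", np.nan, "GBR"],
    "children": [0.0, np.nan, 2.0],
    "agent": [9.0, np.nan, np.nan],
    "meal": ["BB", "Undefined", "SC"],
})

# Fill each column with its own default value.
toy = toy.fillna({"country": "Unknown", "children": 0, "agent": 0})

# Treat "Undefined" as "SC" (no meal package).
toy["meal"] = toy["meal"].replace("Undefined", "SC")

print(toy)
print(toy.isnull().sum().sum())  # total number of remaining missing values
```

Passing a dictionary to `fillna` lets each column receive a different default in a single call.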

In [None]:
#show every column when displaying a DataFrame (no truncation)
pd.set_option('display.max_columns', None)

#replace missing values with a default per column
Nan = {'country': 'Unknown', 'children': 0, 'agent': 0}
hotel = hotel.fillna(Nan)

#changing values from Undefined to SC in the meal column
hotel['meal'] = hotel['meal'].replace('Undefined', 'SC')

In [None]:
#dropping columns that have too many null values: the company column has 94% null values, so it is dropped for analysis
hotel = hotel.drop(columns=['company'])
hotel.head()

In [None]:
hotel.isnull().sum()

In [None]:
#describing categorical data 
hotel.describe(include=["O"])

In [None]:
#viewing summary statistics for the numeric columns
hotel.describe()

Viewing the data, we can see a few outliers that we presume are nonsensical entry errors. These outliers will be removed from the dataset.

In [None]:
#finding the anomaly: a negative ADR
hotel[hotel['adr'] == (-6.38)] 

In [None]:
#drop the anomalous rows by their index labels
hotel = hotel.drop([48515, 14969])

In [None]:
#there are a few records that have zero average daily rate (ADR) and are a no-show
#These records will be excluded from the dataset since they don't provide any insight.
hotel.loc[(hotel["adr"]==0) & (hotel["reservation_status"]=="No-Show")]

In [None]:
#checking how many bookings were canceled
hotel.is_canceled.value_counts()

We see here that just over one third of our entries resulted in cancellations. For the EDA it is important to remove these canceled bookings when analysing features that are strongly related to cancellations, as they can heavily skew the analysis (although for some features the impact is small).

Next we remove records that have an ADR of 0 and a reservation status of No-Show. No revenue was obtained and the guests did not stay, so these are poor data points.

In [None]:
#removing rows that have both an ADR of 0 and a reservation status of No-Show
df = hotel.loc[~((hotel["adr"]==0) & (hotel["reservation_status"]=="No-Show"))]
df
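When combining two conditions in a filter, dropping the rows where both hold is not the same as keeping the rows where neither holds. A minimal sketch on made-up rows (the `toy` frame and its values are hypothetical):

```python
import pandas as pd

# Four made-up bookings covering every combination of the two conditions.
toy = pd.DataFrame({
    "adr": [0.0, 0.0, 95.0, 120.0],
    "reservation_status": ["No-Show", "Check-Out", "No-Show", "Check-Out"],
})

zero_adr = toy["adr"] == 0
no_show = toy["reservation_status"] == "No-Show"

# Drop only the rows where BOTH conditions hold.
drop_both = toy[~(zero_adr & no_show)]

# Keep only the rows where NEITHER condition holds (this drops more rows).
keep_neither = toy[~zero_adr & ~no_show]

print(len(drop_both), len(keep_neither))
```

The first filter removes a single row, while the second also removes paying no-shows and zero-rate stays.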

In [None]:
#view the data to see what is left after preprocessing
df.shape

In [None]:
df.describe(include=["O"])

In [None]:
df = df.reset_index(drop=True)
df

In [None]:
#creating a DataFrame that excludes canceled bookings
#(.copy() so new columns can be added later without a SettingWithCopyWarning)
df2 = df[df.is_canceled == 0].copy()
df2

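July and August appear in three calendar years of the data, while every other month appears in only two, which can skew month-level counts. For all analysis involving this feature we will normalize the data by the number of years each month appears.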

In [None]:
df.groupby('arrival_date_month')['arrival_date_year'].unique()
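One way to normalize, sketched here on made-up counts, is to divide each month's total by the number of distinct years in which that month appears. The `toy` frame and its `bookings` column are hypothetical:

```python
import pandas as pd

# Each row is one month of one year with a made-up booking count.
toy = pd.DataFrame({
    "arrival_date_year": [2015, 2016, 2017, 2016, 2017, 2015, 2016],
    "arrival_date_month": ["July", "July", "July", "March", "March",
                           "October", "October"],
    "bookings": [30, 60, 90, 40, 20, 10, 50],
})

# Total bookings per month across all years.
totals = toy.groupby("arrival_date_month")["bookings"].sum()

# Number of distinct years in which each month appears.
years = toy.groupby("arrival_date_month")["arrival_date_year"].nunique()

# Average bookings per year for each month.
per_year = totals / years
print(per_year)
```

Deriving the divisor from the data avoids hard-coding which months appear three times.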

<h1> Exploratory Data Analysis </h1>

<b>What does the distribution of guests look like month to month?</b>

In [None]:
#getting the data
#create split DataFrames for the city and resort hotels
resortdf = df[df['hotel'] == 'Resort Hotel']
citydf = df[df['hotel'] == 'City Hotel']

#get total counts for each month for each hotel
mresortdata = resortdf.groupby('arrival_date_month')['hotel'].count()
mcitydata = citydf.groupby('arrival_date_month')['hotel'].count()

mresortdf = pd.DataFrame({'hotel': 'Resort Hotel', 'month': list(mresortdata.index),
                          'guests': list(mresortdata.values)})
mcitydf = pd.DataFrame({'hotel': 'City Hotel', 'month': list(mcitydata.index),
                        'guests': list(mcitydata.values)})
#concat to combine the two hotels' data for easy viewing
monthlydf = pd.concat([mresortdf, mcitydf], ignore_index=True)

#order the months chronologically for plotting
months = ["January", "February", "March", "April", "May", "June",
          "July", "August", "September", "October", "November", "December"]

monthlydf['month'] = pd.Categorical(monthlydf['month'], categories=months, ordered=True)

#normalizing the data: July and August appear in three years, every other month in two
#(the original second condition used | and matched every row, dividing all months by 2)
monthlydf['guests'] = monthlydf['guests'].astype(float)
is_summer = monthlydf['month'].isin(['July', 'August'])
monthlydf.loc[is_summer, 'guests'] /= 3
monthlydf.loc[~is_summer, 'guests'] /= 2

#graphing (the default sort=True draws the line in calendar order)
sns.set(style='darkgrid')
plt.figure(figsize=(10, 8))
sns.lineplot(x='month', y='guests', hue='hotel', data=monthlydf)
plt.xticks(rotation=50)
plt.legend(loc='upper right')
plt.title('Number of Guests Per Month')
plt.show()

<b>What does the distribution of Average Daily Rates look like?</b>

In [None]:
#distribution of average daily rates
print("Skewness: %.2f" % df['adr'].skew())
print("Kurtosis: %.2f" % df['adr'].kurt())
sns.set(style='darkgrid')
plt.figure(figsize=(10, 8))
#histplot with a KDE overlay replaces the deprecated distplot
sns.histplot(df['adr'], kde=True, stat='density')
plt.title('Average Daily Rate Distribution')
plt.xlabel('ADR (EUR€)')
plt.show()
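To help read the skewness and kurtosis figures printed above, here is a small sketch on two made-up samples. The series below are illustrative assumptions. Note that pandas' `kurt()` reports excess kurtosis, so a normal distribution scores 0:

```python
import pandas as pd

# A perfectly symmetric sample and a sample with one large value on the right.
symmetric = pd.Series([1, 2, 3, 4, 5], dtype=float)
right_skewed = pd.Series([1, 1, 2, 2, 3, 50], dtype=float)

# Skewness is zero for symmetric data and positive for a long right tail.
print("symmetric skew:    %.2f" % symmetric.skew())
print("right-skewed skew: %.2f" % right_skewed.skew())

# Positive excess kurtosis indicates heavier tails than a normal distribution.
print("right-skewed kurt: %.2f" % right_skewed.kurt())
```

A strongly positive skew for ADR would mean that a small number of very expensive bookings stretch the right tail.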

<b> How does customer type affect the Average Daily Rate (ADR) across the two hotels? </b>

In [None]:
sns.set(style="darkgrid")
#catplot creates its own figure, so no separate plt.figure call is needed
htc = sns.catplot(x="customer_type", y="adr", hue="hotel", data=df,
                  height=6, kind="bar", palette="muted")
htc.despine(left=True)
htc.set_ylabels("ADR (EUR€)")
htc.set_xlabels("Customer Type")
plt.title('Customer Type Prices')
plt.show()

This graph shows that transient type customers generate the highest average daily rate. 

<B> How does lead time influence the ADR? </B>

In [None]:
#jointplot creates its own figure, so the size is set through height
sns.jointplot(x="lead_time", y="adr", data=df, s=10, height=8)
plt.show()

The joint plot shows that increased lead time is associated with lower ADR. This suggests that people who book far in advance pay lower rates than those who book at short notice.
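A visual impression like this can be backed by a rank correlation. The sketch below computes Spearman's coefficient as the Pearson correlation of the ranks, using made-up values. The `toy` frame is hypothetical and the result says nothing about the real dataset:

```python
import pandas as pd

# Made-up bookings in which ADR mostly falls as lead time grows.
toy = pd.DataFrame({
    "lead_time": [1, 7, 30, 90, 180, 365],
    "adr": [150.0, 140.0, 120.0, 125.0, 90.0, 80.0],
})

# Spearman correlation equals the Pearson correlation of the ranked values.
spearman = toy["lead_time"].rank().corr(toy["adr"].rank())
print("Spearman correlation: %.3f" % spearman)
```

A value close to -1 indicates a consistently decreasing relationship; a rank-based measure is less sensitive to extreme ADR values than a plain Pearson correlation.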

<b> What is the average daily rate per month? </b>

In [None]:
months = ["January", "February", "March", "April", "May", "June",
          "July", "August", "September", "October", "November", "December"]

#ordered categorical so that the months plot in calendar order
month_revenue = pd.Categorical(df['arrival_date_month'], categories=months, ordered=True)
plt.figure(figsize=(10, 8))
#the default sort=True draws the line in calendar order
sns.lineplot(x=month_revenue, y='adr', hue='hotel', data=df)
plt.xticks(rotation=50)
plt.ylabel('ADR (EUR€)')
plt.title('Average Daily Rate Per Month')
plt.show()

The graph shows that the summer months have the highest average daily rates for bookings across both hotels.

It is important to note that when you compare this graph with the number-of-guests graph, which showed larger numbers for fall and spring, those same months show the lowest ADRs. This suggests that prices are lower during the fall and spring months, when the volume of guests is also highest.

<b>How does the number of special requests influence ADR?</b>

In [None]:
#barplot
sns.set(style="darkgrid", palette="pastel")
#catplot creates its own figure, so no separate plt.figure call is needed
htc = sns.catplot(x="total_of_special_requests", y="adr", hue="hotel", data=df,
                  height=6, kind="bar", palette="muted")
htc.despine(left=True)
htc.set_ylabels("ADR (EUR€)")
htc.set_xlabels("Number of Special Requests")
plt.show()

The graph indicates that as the number of special requests increases, so does the average daily rate. One possible explanation is that people who book better hotel rooms tend to make more special requests.

<b>Which room types command the highest Average Daily Rate?</b>

In [None]:
#reserved room type and average daily rate
sns.set(style="darkgrid")
#catplot creates its own figure, so no separate plt.figure call is needed
htc = sns.catplot(x="reserved_room_type", y="adr", hue="hotel", data=df,
                  height=6, kind="bar", palette="muted")
htc.despine(left=True)
htc.set_ylabels("ADR (EUR€)")
htc.set_xlabels("Reserved Room Type")
plt.title('Average Daily Rate of Room Types')
plt.show()

Room types G and F command the highest Average Daily Rate.

Note: the room types are anonymized for both types of hotels as mentioned before. 

<b>How are the cancellations distributed month to month?</b>

In [None]:
#monthly cancellation rate using df (the mean of is_canceled is the share of canceled bookings)
plt.figure(figsize=(10, 8))
sns.lineplot(x=month_revenue, y='is_canceled', hue='hotel', data=df)
plt.xticks(rotation=50)
plt.ylabel('Cancellation Rate')
plt.title('Cancellations Per Month')
plt.show()

Cancellations are highest in the summer months for both hotels, which are also the months with the lowest guest volume. In general, the City Hotel sees more cancellations than the Resort Hotel.
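Because `is_canceled` is a 0/1 flag, its mean within a group is that group's cancellation rate. A minimal sketch with made-up bookings (the `toy` frame is hypothetical) shows how such rates can be tabulated by month and hotel:

```python
import pandas as pd

# Eight made-up bookings: two per month for each hotel.
toy = pd.DataFrame({
    "hotel": ["City Hotel"] * 4 + ["Resort Hotel"] * 4,
    "arrival_date_month": ["July", "July", "March", "March"] * 2,
    "is_canceled": [1, 1, 0, 1, 1, 0, 0, 0],
})

# Mean of the 0/1 flag per group gives the share of canceled bookings.
rate = (
    toy.groupby(["hotel", "arrival_date_month"])["is_canceled"]
    .mean()
    .unstack("hotel")
)
print(rate)
```

A table like this makes it easier to compare the two hotels month by month than reading values off the line plot.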

In [None]:
#finding the top 10 visitor countries for both hotels
topc_resort = df[df['hotel']=="Resort Hotel"]["country"].value_counts().head(10)
topc_city = df[df['hotel']=="City Hotel"]["country"].value_counts().head(10)
topc = pd.concat([topc_city,topc_resort],axis=1)
topc.columns = ["city","resort"]
topc

In [None]:
new_topc = topc.rename_axis('country').reset_index()
#create df for resort values 
new_topr = new_topc.drop('city', 1)
new_topr.sort_values(['resort'], ascending=False, inplace= True)
new_topr.reset_index(drop=True)

<B> Where are the majority of the guests from? </B>

In [None]:
plt.figure(2, figsize=(20,15))
the_grid = gridspec.GridSpec(2, 2)

plt.subplot(the_grid[0, 1],  title='Top Visitors from City Hotel')
sns.barplot(x='country',y='city', data=new_topc, palette='Spectral')
plt.ylabel('Number of Visitors')

plt.subplot(the_grid[0, 0], title='Top Visitors from Resort Hotel',)
sns.barplot(x='country',y='resort', data=new_topr, palette='Spectral')
plt.ylabel('Number of Visitors')

plt.suptitle('Top Countries Visiting ', fontsize=16)
plt.show()

The majority of the guests are from Portugal, Great Britain, and Spain. (Both hotels are located in Portugal.)
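Shares by country of origin can also be read directly from `value_counts(normalize=True)`. The sketch below uses a made-up list of country codes, so the percentages are illustrative only:

```python
import pandas as pd

# Made-up country codes for eight bookings.
countries = pd.Series(["PRT", "PRT", "PRT", "GBR", "GBR", "ESP", "FRA", "PRT"])

# normalize=True returns proportions instead of raw counts.
share = countries.value_counts(normalize=True).mul(100).round(1)

# The two most common countries and their percentage of all bookings.
print(share.head(2))
```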

In [None]:
#share of bookings for the ten most common countries of origin
pie = df['country'].value_counts().head(10)

#take the labels from the data so they always match the wedges
labels = pie.index

fig, ax = plt.subplots(figsize=(10, 8))
ax.pie(pie, labels=labels, autopct='%1.1f%%', shadow=True)
plt.title('Top 10 Visitors')
plt.show()

<b> How long are people staying at the Hotels? </b>

In [None]:
#duration of stay  
df2['totalday'] = df2['stays_in_weekend_nights'] + df2['stays_in_week_nights']
df2.head()

In [None]:
print("Skewness: %.2f" % df2['totalday'].skew())
print("Kurtosis: %.2f" % df2['totalday'].kurt())
plt.figure(figsize=(10, 8))
#histplot with a KDE overlay replaces the deprecated distplot
sns.histplot(df2['totalday'], kde=True, stat='density')
plt.xlabel('Total Days Spent')
plt.show()

The vast majority of people stay fewer than 10 days at both types of hotels.

<b> Revenue </b>

In [None]:
#creating revenue column 
#average daily rate times days spent = room revenue 

df2['Revenue'] = df2.adr * df2.totalday
df2.groupby("hotel")["Revenue"].describe()
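The revenue column above is the average daily rate multiplied by the total number of nights. A small sketch with made-up bookings (the `toy` frame is hypothetical) makes the arithmetic explicit. It approximates room revenue only and assumes the ADR applies to every night of the stay:

```python
import pandas as pd

# Three made-up bookings that were not canceled.
toy = pd.DataFrame({
    "hotel": ["City Hotel", "City Hotel", "Resort Hotel"],
    "adr": [100.0, 80.0, 60.0],
    "stays_in_weekend_nights": [2, 0, 2],
    "stays_in_week_nights": [1, 3, 5],
})

# Total nights = weekend nights + week nights.
toy["totalday"] = toy["stays_in_weekend_nights"] + toy["stays_in_week_nights"]

# Room revenue per booking = average daily rate * total nights.
toy["Revenue"] = toy["adr"] * toy["totalday"]

# Total room revenue per hotel.
by_hotel = toy.groupby("hotel")["Revenue"].sum()
print(by_hotel)
```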

<b> How does length of stay influence ADR? </b>

In [None]:
#show figure (plt.subplots already creates the figure, so no extra plt.figure call)
f, ax = plt.subplots(figsize=(6.5, 6.5))
sns.scatterplot(x='totalday', y='adr', hue='hotel', palette="ch:r=-.2,d=.3_r",
                data=df2, ax=ax)
plt.xlabel('Total Days Spent')
plt.ylabel('ADR (EUR€)')
plt.show()

The graph shows that ADR is higher for shorter lengths of stay. This means that prices are more favourable for people who purchase longer stays than for those who purchase shorter ones.
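A scatter plot with many overlapping points can be hard to read, so one option is to group stays into bands and compare the mean ADR per band. The sketch below uses made-up values; the band edges and the `stay_band` column are assumptions chosen for illustration:

```python
import pandas as pd

# Made-up stays in which ADR falls as the stay gets longer.
toy = pd.DataFrame({
    "totalday": [1, 2, 3, 5, 7, 10, 14],
    "adr": [130.0, 120.0, 110.0, 100.0, 90.0, 85.0, 70.0],
})

# Bands are right-inclusive: (0, 3], (3, 7], (7, 30].
bins = [0, 3, 7, 30]
labels = ["1-3 nights", "4-7 nights", "8+ nights"]
toy["stay_band"] = pd.cut(toy["totalday"], bins=bins, labels=labels)

# Mean ADR within each band.
mean_adr = toy.groupby("stay_band", observed=True)["adr"].mean()
print(mean_adr)
```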

<b> How does assigned room type influence total revenue from a guest over length of stay? </b>

In [None]:
plt.figure(figsize=(10, 8))
#take both columns from df2 so that the rows line up
sns.lineplot(x='assigned_room_type', y='Revenue', hue='hotel', data=df2)
plt.xlabel('Assigned Room Type')
plt.show()

The highest revenue-generating room types are 'G' and 'H' for both the resort and city hotels.

---

<h1> Recommendations and Considerations </h1>

An important consideration regarding cancellations is that a considerable portion of the data consists of canceled bookings, and parts of the EDA were done with those bookings removed. Removing that data reduces the effectiveness of the EDA.

One recommendation would be to limit how far in advance a booking can be made, since the data shows that the vast majority of guests who booked far in advance canceled.
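A claim like this can be checked by computing the cancellation rate within lead-time buckets. The sketch below uses made-up bookings; the bucket edges and the `lead_bucket` column are assumptions, and the output does not describe the real dataset:

```python
import pandas as pd

# Eight made-up bookings with their lead time in days and cancellation flag.
toy = pd.DataFrame({
    "lead_time": [2, 10, 20, 45, 80, 200, 300, 400],
    "is_canceled": [0, 0, 1, 0, 1, 1, 1, 1],
})

# Buckets are right-inclusive: (-1, 30], (30, 180], (180, inf).
bins = [-1, 30, 180, float("inf")]
labels = ["0-30 days", "31-180 days", "181+ days"]
toy["lead_bucket"] = pd.cut(toy["lead_time"], bins=bins, labels=labels)

# Mean of the 0/1 flag gives the cancellation rate per bucket.
cancel_rate = toy.groupby("lead_bucket", observed=True)["is_canceled"].mean()
print(cancel_rate)
```

Running the same grouping on `df` would show whether the cancellation rate really rises with lead time before any booking limit is considered.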

<b> If you have any suggestions for what I should take a look at for this dataset please feel free to comment! </b>