### In our previous notebook, we cleaned some of the Hotel Bookings data set by exploring it and finding values that did not make logical sense.  In this notebook, we aren't going to use any of the changed data.  Instead, we will focus on a KPI to compare the performance of our two hotels.  We will work on the assumption that a resort hotel is geared toward extended stays, while a city hotel is more transient.  With that in mind, we will walk through a thought process for comparing the results of the two hotels with a more 'apples to apples' approach.

In [None]:
import warnings
warnings.filterwarnings("ignore")

import pandas as pd
import numpy as np

import seaborn as sns
import matplotlib.pyplot as plt 

import datetime
import calendar
import matplotlib.dates as mdates

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [None]:
df = pd.read_csv('/kaggle/input/prepped-data/prepped_hotel_booking_data.csv')

In [None]:
# There are too many columns to display in a regular view, so .T will be used to transpose.
df.head().T

In [None]:
#Arrival month is stored as the month name.  Let's create a column with the
#integer value so we can build an arrival date column.
d = {'January':1, 'February':2, 'March':3, 'April':4, 'May':5, 'June':6, \
     'July':7, 'August':8, 'September':9, 'October':10, 'November':11, 'December':12}
df['month'] = df.arrival_date_month.map(d)
df.head(3)

In [None]:
# Combine the year, month, and day to create an arrival date column and convert it to datetime.
df.rename(columns={'arrival_date_year': 'year', 'arrival_date_day_of_month': 'day'}, inplace=True)
df['arrival_date'] = pd.to_datetime(df[['year', 'month' , 'day']])
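
### The month mapping and date assembly above can be tried on a small frame first.  The sketch below is only an illustration: the `toy` rows are made up, and building the lookup from `calendar.month_name` assumes the default English locale (the hand-written dictionary above avoids that assumption).

```python
import calendar
import pandas as pd

# Toy rows standing in for the booking data (values are made up for illustration).
toy = pd.DataFrame({
    'year': [2015, 2016, 2017],
    'arrival_date_month': ['July', 'February', 'August'],
    'day': [1, 29, 31],
})

# calendar.month_name is ['', 'January', ..., 'December'] in the default locale,
# so enumerating it gives a month name -> month number lookup.
month_lookup = {name: num for num, name in enumerate(calendar.month_name) if name}
toy['month'] = toy['arrival_date_month'].map(month_lookup)

# pd.to_datetime can assemble a date from columns named year, month, and day.
toy['arrival_date'] = pd.to_datetime(toy[['year', 'month', 'day']])
print(toy[['arrival_date_month', 'month', 'arrival_date']])
```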

### We want to focus on hotel results based on actual stays, so we will keep only observations with a reservation status of 'Check-Out', which indicates an actual visit and stay in a room.

In [None]:
#Create a new df to house only actual visits
all_check_ins = df[['hotel', 'arrival_date']][df['reservation_status'] == 'Check-Out']

#Create a column of 1's to use later to aggregate and sum
all_check_ins['check_in_count'] = 1

In [None]:
#Create a year-month column
#This df is still at transaction/reservation level
all_check_ins['year_month'] = \
    all_check_ins['arrival_date'].dt.strftime('%Y-%m')

all_check_ins.head(3)

In [None]:
#Aggregate to day level.  There's a value that will be used
#from this table that we will apply later, so we will not go
#straight to year-month.
all_check_ins_by_day = \
    all_check_ins[['hotel', 'year_month', 'arrival_date', 'check_in_count']]\
        .groupby(['hotel', 'year_month', 'arrival_date'], as_index=False).sum()

all_check_ins_by_day.head(3)
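
### The column of 1's followed by a sum is one way to count rows per group.  As a hedged aside, the sketch below uses made-up rows to show that counting the rows directly with `size()` should give the same numbers.

```python
import pandas as pd

# Made-up check-ins: two City arrivals on Jan 1, one on Jan 2, one Resort arrival on Jan 1.
toy = pd.DataFrame({
    'hotel': ['City Hotel', 'City Hotel', 'Resort Hotel', 'City Hotel'],
    'arrival_date': pd.to_datetime(['2016-01-01', '2016-01-01', '2016-01-01', '2016-01-02']),
})

# Approach used in this notebook: a helper column of 1's, summed per group.
toy['check_in_count'] = 1
by_sum = toy.groupby(['hotel', 'arrival_date'], as_index=False)['check_in_count'].sum()

# Alternative: count the rows in each group directly.
by_size = toy.groupby(['hotel', 'arrival_date']).size().reset_index(name='check_in_count')

print(by_sum)
print(by_size)
```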

In [None]:
#A new df to aggregate to year-month
#With results spanning 26 consecutive months, 
#this is a good solution for visualizations.
all_check_ins_by_year_month = \
    all_check_ins_by_day[['hotel', 'year_month', 'check_in_count']]\
        .groupby(['hotel', 'year_month',], as_index=False).sum()

all_check_ins_by_year_month.head(3)

In [None]:
#Creating a new dataframe to house rows where the customer is listed
#as 'Check-Out', indicating that they did occupy a room during their stay.
arrival_dates_customers_stayed = df[['hotel', 'arrival_date', 'total_nights_stay', 'adr']]\
    [df['reservation_status'] == 'Check-Out'].reset_index()

#We will use the index as a customer/reservation number since each row is a single
#record of some customer's stay.  We don't have something to indicate
#that a set of transactions belong to a single customer.
arrival_dates_customers_stayed.rename(columns={'index':'cust_num'}, inplace=True)

display(arrival_dates_customers_stayed.head())

display(arrival_dates_customers_stayed.tail())

In [None]:
#Empty dataframe to house new data that will contain a row
#for each night a guest stayed during their reservation
all_dates = pd.DataFrame(columns=['cust_num','room_filled_dates'])

#In this code, we are taking each observation from the hotel check-in 
#data and creating a row for each night that the guest stayed in the hotel.
#We are using a timedelta and taking a day off of that so we do not
#count the check-out date as a night that a room was occupied.
#This approach is being used because simply applying all nights stayed
#to the month containing the arrival date could assign nights to the
#incorrect month: some guests will check in during one month,
#stay a few nights, and check out the next month.

#Collect each stay's frame in a list and concatenate once at the end.
#Calling pd.concat inside the loop re-copies all prior rows on every pass.
date_ranges = []

for index, row in arrival_dates_customers_stayed.iterrows():
    #This creates a date range for a customer's stay where the last date
    #is the final night occupied (the day before check-out).
    date_range = pd.DataFrame(pd.date_range(row['arrival_date'],
                                row['arrival_date'] + \
                                datetime.timedelta(days=row['total_nights_stay'] - 1)), \
                                  columns=['room_filled_dates'])

    date_range.insert(loc=0, column='cust_num', value=row['cust_num'])

    date_ranges.append(date_range)

all_dates = pd.concat([all_dates] + date_ranges)
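
### Looping over every reservation can be slow on a data set of this size.  The sketch below is an alternative, not the method used in this notebook: it repeats each reservation once per night and adds a night offset to the arrival date.  The `stays` rows are made up, and the column names are assumed to match the ones used above.

```python
import pandas as pd

# Made-up reservations: a 3-night stay that crosses a month boundary,
# a 1-night stay, and a 0-night stay.
stays = pd.DataFrame({
    'cust_num': [0, 1, 2],
    'arrival_date': pd.to_datetime(['2016-01-30', '2016-01-31', '2016-02-01']),
    'total_nights_stay': [3, 1, 0],
})

# Repeat each reservation once per night stayed (a 0-night stay yields no rows).
nights = stays.loc[stays.index.repeat(stays['total_nights_stay'])].reset_index(drop=True)

# Number the nights within each reservation: 0 for the arrival night, 1 for the next, ...
night_offset = nights.groupby('cust_num').cumcount()

# Arrival date plus the offset gives each occupied night; the check-out date is never included.
nights['room_filled_dates'] = nights['arrival_date'] + pd.to_timedelta(night_offset, unit='D')
nights = nights[['cust_num', 'room_filled_dates']]
print(nights)
```

### In this toy case, the 3-night stay arriving on 2016-01-30 contributes two January nights and one February night, which is the month-crossing behavior the loop is designed to capture.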


In [None]:
#Create a column that has only 1's in it.  These can be used
#when grouping by time intervals by summing to get a count of
#the number of rooms occupied for the night/month/etc.
all_dates['room_occupied'] = 1
display(all_dates.head())

In [None]:
#cust_num is an object in all_dates, changing to int to help the join coming up
all_dates['cust_num']= all_dates['cust_num'].astype(str).astype(int)

In [None]:
#all_dates df now has a row for each night a customer stays.  We will left
#join to bring in values like 'adr'.
all_dates = all_dates.merge\
        (arrival_dates_customers_stayed[['cust_num', 'hotel', 'total_nights_stay', 'adr']]\
        ,how='left' ,left_on='cust_num', right_on='cust_num', validate='many_to_one')
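
### The `validate='many_to_one'` argument asks pandas to confirm that each `cust_num` appears only once on the right side of the join.  A small sketch with made-up frames shows what happens when that assumption holds and when it does not.

```python
import pandas as pd

# Made-up frames: one row per night on the left, one row per reservation on the right.
nights = pd.DataFrame({'cust_num': [0, 0, 1], 'room_occupied': [1, 1, 1]})
stays = pd.DataFrame({
    'cust_num': [0, 1],
    'hotel': ['Resort Hotel', 'City Hotel'],
    'adr': [75.0, 98.0],
})

# Each night picks up the hotel and adr of its reservation.
merged = nights.merge(stays, how='left', on='cust_num', validate='many_to_one')
print(merged)

# A duplicated key on the right side breaks the many-to-one assumption, so pandas raises.
stays_with_dupe = pd.concat([stays, stays.iloc[[0]]])
try:
    nights.merge(stays_with_dupe, how='left', on='cust_num', validate='many_to_one')
    dupe_detected = False
except pd.errors.MergeError:
    dupe_detected = True
print('Duplicate key detected:', dupe_detected)
```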

In [None]:
#These are the first and last arrival (check-in) dates in the data.  We will
#adjust the date range used for room occupancy because rooms early in the
#range may be occupied by guests that arrived prior to 2015-07-01, who are
#not in our data.  Additionally, guests arriving near the end of our date
#range will show as occupying a room after it ends, but new guests
#arriving after that point will not be visible.
print(df.arrival_date.min())
print(df.arrival_date.max())

In [None]:
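#Keep occupied nights after 2015-07-07 (this drops the first week, 07-01 through 07-07)
#and up to and including 2017-08-31.
mask = (all_dates['room_filled_dates'] > '2015-07-07') \
    & (all_dates['room_filled_dates'] <= '2017-08-31')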

In [None]:
all_dates = all_dates.loc[mask]

#Verifying date range has changed:
print(all_dates.room_filled_dates.min())
print(all_dates.room_filled_dates.max())
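
### The two comparisons in the mask treat their boundaries differently: the start date is excluded (`>`), while the end date is included (`<=`).  A minimal sketch on made-up dates, using explicit `pd.Timestamp` bounds instead of strings:

```python
import pandas as pd

# Made-up dates sitting on and just past each boundary.
dates = pd.Series(pd.to_datetime(['2015-07-07', '2015-07-08', '2017-08-31', '2017-09-01']))

start, end = pd.Timestamp('2015-07-07'), pd.Timestamp('2017-08-31')

# Strictly after the start, up to and including the end.
mask = (dates > start) & (dates <= end)
kept = dates[mask]
print(kept)
```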

In [None]:
#To verify, below are the first 5 rows for customers that show as 'Check-Out'
#in the original dataframe, which we will compare to the new all_dates dataframe.
#The index below is what is used as 'cust_num' in all_dates.  Notice that the first 
#4 rows (0-3 index) show a single night stay, while the fifth (index 4), shows 2 nights.
display(df[['arrival_date', 'total_nights_stay', 'adr', 'hotel']]\
    [df['reservation_status'] == 'Check-Out'].head())

display(df[['arrival_date', 'total_nights_stay', 'adr', 'hotel']]\
    [df['reservation_status'] == 'Check-Out'].tail(2))

In [None]:
#As noted above, cust_num 4 now has 2 rows
#Additionally, in the tail, we see the customer that stayed 9 nights.
display(all_dates.head(10))
display(all_dates.tail(10))

In [None]:
all_hotel_visits_by_day = \
    all_dates[['hotel', 'room_filled_dates', 'room_occupied']]\
    .groupby(['hotel', 'room_filled_dates'], as_index=False).sum()

In [None]:
#Creating a year-month combo to use for visualizations
all_hotel_visits_by_day['year_month'] = \
    all_hotel_visits_by_day['room_filled_dates'].dt.strftime('%Y-%m')

In [None]:
#We are working with 2 different hotels, without details like the 
#number of rooms available.  We are going to use our best estimate
#for the number of available rooms.  We will aggregate the count
#of rooms occupied for each night.  The night with the most rooms
#in use will be our proxy for the maximum number of rooms
#in each hotel.

resort_max_occ = all_hotel_visits_by_day['room_occupied']\
    [all_hotel_visits_by_day['hotel'] == 'Resort Hotel'].max()

city_max_occ = all_hotel_visits_by_day['room_occupied']\
    [all_hotel_visits_by_day['hotel'] == 'City Hotel'].max()

print('Max Resort Occ: {} -- Max City Occ: {}'.format(resort_max_occ, city_max_occ))

In [None]:
#We will now apply each max amount to each row by hotel.
#np.where picks the City max for City Hotel rows and the Resort max otherwise.
all_hotel_visits_by_day['max_daily_occ'] = np.where(
    all_hotel_visits_by_day['hotel'] == 'City Hotel', city_max_occ, resort_max_occ)
    
all_hotel_visits_by_day.head(2)
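
### The per-hotel maximum can also be attached to every row without naming each hotel.  The sketch below uses made-up nightly counts and `groupby(...).transform('max')`; it is an alternative to the approach above and would extend to more than two hotels.

```python
import pandas as pd

# Made-up nightly room counts for the two hotels.
by_day = pd.DataFrame({
    'hotel': ['City Hotel', 'City Hotel', 'Resort Hotel', 'Resort Hotel'],
    'room_occupied': [120, 150, 90, 80],
})

# transform('max') returns one value per original row: the max within that row's hotel.
by_day['max_daily_occ'] = by_day.groupby('hotel')['room_occupied'].transform('max')
print(by_day)
```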

In [None]:
#Grouping the sum of rooms occupied by year-month
all_hotel_visits_by_month = \
    all_hotel_visits_by_day[['hotel', 'year_month', 'room_occupied','max_daily_occ']]\
    .groupby(['hotel', 'year_month'], as_index=False).sum()
display(all_hotel_visits_by_month.head(3))
display(all_hotel_visits_by_month.tail(3))

In [None]:
#You might have figured out where we're headed by now...
all_hotel_visits_by_month['occ_rate'] = \
    all_hotel_visits_by_month['room_occupied'] / all_hotel_visits_by_month['max_daily_occ']

all_hotel_visits_by_month.head(2)
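
### To make the arithmetic concrete, here is a small worked sketch with made-up numbers.  Over three nights the rooms occupied are 40, 50, and 30, so the proxy for capacity is 50 rooms per night.  The monthly rate is the sum of rooms occupied divided by the sum of nightly capacity: 120 / 150 = 0.8.

```python
import pandas as pd

# Made-up nightly counts for a single hotel over three nights.
by_day = pd.DataFrame({
    'hotel': ['Resort Hotel'] * 3,
    'room_filled_dates': pd.to_datetime(['2016-01-01', '2016-01-02', '2016-01-03']),
    'room_occupied': [40, 50, 30],
})

# The busiest night (50 rooms) stands in for the hotel's capacity on every night.
by_day['max_daily_occ'] = by_day['room_occupied'].max()
by_day['year_month'] = by_day['room_filled_dates'].dt.strftime('%Y-%m')

# Summing both columns gives room-nights sold and room-nights available for the month.
by_month = by_day.groupby(['hotel', 'year_month'], as_index=False)[
    ['room_occupied', 'max_daily_occ']].sum()
by_month['occ_rate'] = by_month['room_occupied'] / by_month['max_daily_occ']
print(by_month)
```

### One thing to keep in mind with this construction: a night with no occupied rooms would have no row at the day level, so it would add nothing to the denominator.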

# Let's take a look at check-ins only.  This is like buying a whole CD to get the single track that you like.

In [None]:
plt.figure(figsize=(35,8))
plt.xticks(fontsize=20, rotation=60)
plt.yticks(fontsize=30)
plt.title('Check-Ins by Year-Month', fontsize=60)
plt.plot('year_month', 'check_in_count', \
         data=all_check_ins_by_year_month[all_check_ins_by_year_month['hotel']=='City Hotel']\
         , color='skyblue', linewidth=4, linestyle='dashed', label='City: Check-In')
plt.plot('year_month', 'check_in_count', \
         data=all_check_ins_by_year_month[all_check_ins_by_year_month['hotel']=='Resort Hotel']\
         , color='r', linewidth=4, linestyle='dashed', label='Resort: Check-In')

plt.legend(prop={'size':30})

### Looking at the image above, one can't help but notice that the City hotel is dominating!

# Now, let's combine the number of rooms occupied.
# It's the remix! (One of those few times in history where the remix is better than the original.)

In [None]:
plt.figure(figsize=(35,14))
plt.xticks(fontsize=20, rotation=60)
plt.yticks(fontsize=30)
plt.title('Check-Ins and Rooms Occupied by Year-Month', fontsize=60)
plt.plot('year_month', 'room_occupied', \
         data=all_hotel_visits_by_month[all_hotel_visits_by_month['hotel']=='City Hotel']\
         , color='skyblue', linewidth=6, marker='o', markersize=12, label='City: Room Occ')
plt.plot('year_month', 'room_occupied', \
         data=all_hotel_visits_by_month[all_hotel_visits_by_month['hotel']=='Resort Hotel']\
         , color='r', linewidth=6, marker='o', markersize=12, label='Resort: Room Occ')

plt.plot('year_month', 'check_in_count', \
         data=all_check_ins_by_year_month[all_check_ins_by_year_month['hotel']=='City Hotel']\
         , color='skyblue', linewidth=4, linestyle='dashed', label='City: Check-In')
plt.plot('year_month', 'check_in_count', \
         data=all_check_ins_by_year_month[all_check_ins_by_year_month['hotel']=='Resort Hotel']\
         , color='r', linewidth=4, linestyle='dashed', label='Resort: Check-In')

plt.legend(prop={'size':30})

### Looking at this new image above, the story starts to change.  We can clearly see the check-ins haven't changed, but we start to see some gaps closing with the Resort hotel gaining ground.

# So, what might be a final KPI that can be used to compare these two hotels?  What would be something that a Regional VP of Operations might find useful?
## What? There's a bonus hidden track on this album? (I'm not sure if that's still a thing or not.)

In [None]:
plt.figure(figsize=(35,14))
plt.xticks(fontsize=20, rotation=60)
plt.yticks(fontsize=30)
plt.title('Occupancy Rate by Year-Month', fontsize=60)
plt.plot('year_month', 'occ_rate', \
         data=all_hotel_visits_by_month[all_hotel_visits_by_month['hotel']=='City Hotel']\
         , color='skyblue', linewidth=6, marker='o', markersize=12, label='City')
plt.plot('year_month', 'occ_rate', \
         data=all_hotel_visits_by_month[all_hotel_visits_by_month['hotel']=='Resort Hotel']\
         , color='r', linewidth=6, marker='o', markersize=12, label='Resort')
plt.legend(loc='lower right', prop={'size':40})


### Shazam!  The resort hotel just snagged the lead at the wire.  This visualization tells a vastly different story than what we started with.  This comparison wasn't immediately available, but through work and thought, we were able to find something pretty neat. 
### 1. The Resort hotel seems to be performing closer to capacity than the City hotel.  Again, this is using a proxy for what is considered max occupancy for the night based on the data available.
### 2. There is clearly some seasonality: the warmer months attract more guests, while the colder months show a dip.  (We're assuming the hotels are in the Northern Hemisphere based on countries like ESP, GBR, and IRL in the country column.)
### 3. We can see a trend in both hotels moving up, showing growth in the businesses.
### 4. Did the City hotel just open when the data collection began?  Were they finalizing an expansion to the building?  The max room value used is based on the highest day's room occupancy.  If there was some expansion, a different denominator might be needed for that time period.
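
### As a follow-up to point 4, one possible way to try a different denominator is to take the busiest night within each year rather than across the whole date range.  This is only a sketch on made-up numbers, and choosing the calendar year as the period is an assumption; an actual expansion date, if known, would be a better cut point.

```python
import pandas as pd

# Made-up nightly counts: a quieter first year followed by a busier second year.
by_day = pd.DataFrame({
    'hotel': ['City Hotel'] * 4,
    'room_filled_dates': pd.to_datetime(['2015-08-01', '2015-08-02', '2016-08-01', '2016-08-02']),
    'room_occupied': [60, 80, 150, 120],
})
by_day['year'] = by_day['room_filled_dates'].dt.year

# Denominator used in this notebook: the busiest night over the whole range.
by_day['max_occ_overall'] = by_day.groupby('hotel')['room_occupied'].transform('max')

# Alternative denominator: the busiest night within each year.
by_day['max_occ_by_year'] = by_day.groupby(['hotel', 'year'])['room_occupied'].transform('max')

by_day['occ_rate_overall'] = by_day['room_occupied'] / by_day['max_occ_overall']
by_day['occ_rate_by_year'] = by_day['room_occupied'] / by_day['max_occ_by_year']
print(by_day)
```

### In this toy case, the first night reads as 40% occupied against the overall max but 75% against its own year's max, which shows how much the choice of denominator can move the KPI.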