# Hotel Reservations

## Overview

## Business Understanding

With the ease of booking and canceling hotel reservations online, hotel cancellations and no-shows have drastically increased. This poses a significant problem for hotel revenue: hotels lose money when rooms sit vacant due to last-minute cancellations. To combat this issue, I am going to create a model that predicts whether a customer will cancel their reservation. This will allow the hotel to overbook an appropriate number of rooms, so that it does not lose money on vacant rooms while also not booking more rooms than the hotel has space for.

The stakeholders for this project are the hotel employees in charge of hotel bookings and operations, including the Reservations Manager, VP of Operations, and VP of Revenue Management.

This business problem is important to the stakeholders because it is crucial to increase revenue coming from hotel room bookings. Additionally, they need to accurately manage vacancies for guests, which also impacts the price of the rooms.

In order to solve this business problem, I will investigate the following 3 questions:
1. What factors contribute to **hotel cancellations**?
2. What factors contribute to **maintaining a hotel reservation**?
3. How can hotels strategically price rooms to **increase revenue**?

## Data Understanding

The [Hotel Reservations Dataset](https://www.kaggle.com/datasets/ahsan81/hotel-reservations-classification-dataset) extracted from Kaggle contains 36,275 entries of unique bookings ranging from 2017 to 2018. 

There are 19 columns, which are provided in the following data dictionary:

**Data Dictionary**

**Booking_ID**: unique identifier of each booking <br>
**no_of_adults**: Number of adults <br>
**no_of_children**: Number of children <br>
**no_of_weekend_nights**: Number of weekend nights (Saturday or Sunday) the guest stayed or booked to stay at the hotel <br>
**no_of_week_nights**: Number of week nights (Monday to Friday) the guest stayed or booked to stay at the hotel <br>
**type_of_meal_plan**: Type of meal plan booked by the customer <br>
**required_car_parking_space**: Does the customer require a car parking space? (0 - No, 1 - Yes) <br>
**room_type_reserved**: Type of room reserved by the customer. The values are ciphered (encoded) by INN Hotels. <br>
**lead_time**: Number of days between the date of booking and the arrival date <br>
**arrival_year**: Year of arrival date <br>
**arrival_month**: Month of arrival date <br>
**arrival_date**: Date of the month <br>
**market_segment_type**: Market segment designation <br>
**repeated_guest**: Is the customer a repeated guest? (0 - No, 1 - Yes) <br>
**no_of_previous_cancellations**: Number of previous bookings that were canceled by the customer prior to the current booking <br>
**no_of_previous_bookings_not_canceled**: Number of previous bookings not canceled by the customer prior to the current booking <br>
**avg_price_per_room**: Average price per day of the reservation; prices of the rooms are dynamic. (in euros) <br>
**no_of_special_requests**: Total number of special requests made by the customer (e.g. high floor, view from the room, etc) <br>
**booking_status**: Flag indicating if the booking was canceled or not <br>

The target variable will be `booking_status`.

First, I will import the libraries needed for the EDA and data preparation.

In [11]:
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder

%matplotlib inline

pd.options.mode.copy_on_write = True

# Suppress harmless warning for use_inf_as_na
import warnings
warnings.simplefilter(action="ignore", category=FutureWarning)

Next, I will load the dataset into the notebook.

In [2]:
data = pd.read_csv('data/hotel_reservations.csv')

## Data Preparation

In [15]:
# Preview the first 5 rows of the train data
data.head()

| | no_of_adults | no_of_children | no_of_weekend_nights | no_of_week_nights | type_of_meal_plan | required_car_parking_space | room_type_reserved | lead_time | arrival_year | arrival_month | arrival_date | market_segment_type | repeated_guest | no_of_previous_cancellations | no_of_previous_bookings_not_canceled | avg_price_per_room | no_of_special_requests |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2947 | 2 | 1 | 1 | 0 | Meal Plan 1 | 0 | Room_Type 1 | 0 | 2018 | 7 | 10 | Online | 0 | 0 | 0 | 15.0 | 0 |
| 3033 | 1 | 0 | 2 | 1 | Meal Plan 1 | 0 | Room_Type 1 | 1 | 2018 | 2 | 28 | Online | 0 | 0 | 0 | 60.0 | 0 |
| 30081 | 2 | 2 | 0 | 1 | Meal Plan 1 | 0 | Room_Type 2 | 111 | 2018 | 10 | 6 | Online | 0 | 0 | 0 | 221.4 | 1 |
| 21861 | 2 | 0 | 2 | 2 | Meal Plan 1 | 0 | Room_Type 1 | 28 | 2018 | 3 | 4 | Online | 0 | 0 | 0 | 73.88 | 0 |
| 11680 | 2 | 0 | 0 | 2 | Meal Plan 1 | 0 | Room_Type 4 | 14 | 2018 | 10 | 7 | Online | 0 | 0 | 0 | 170.0 | 2 |


In [5]:
# View the overall shape, dtypes and null counts for each column in train data
data.info()

<class 'pandas.core.frame.DataFrame'>
Index: 27206 entries, 2947 to 4089
Data columns (total 19 columns):
 #   Column                                Non-Null Count  Dtype  
---  ------                                --------------  -----  
 0   Booking_ID                            27206 non-null  object 
 1   no_of_adults                          27206 non-null  int64  
 2   no_of_children                        27206 non-null  int64  
 3   no_of_weekend_nights                  27206 non-null  int64  
 4   no_of_week_nights                     27206 non-null  int64  
 5   type_of_meal_plan                     27206 non-null  object 
 6   required_car_parking_space            27206 non-null  int64  
 7   room_type_reserved                    27206 non-null  object 
 8   lead_time                             27206 non-null  int64  
 9   arrival_year                          27206 non-null  int64  
 10  arrival_month                         27206 non-null  int64  
 11  arrival_date      

In [18]:
# Check if there are any duplicates in the train data
data.duplicated(subset='Booking_ID').value_counts()

False    36275
Name: count, dtype: int64

There are no null values in the dataset. `repeated_guest` is of type integer but is really categorical, with 0 corresponding to a first-time guest and 1 corresponding to a repeated guest.

In fact, our **categorical variables** are as follows: `type_of_meal_plan`, `required_car_parking_space`, `room_type_reserved`, `market_segment_type`, and `repeated_guest`. 

Our **quantitative variables** are `no_of_adults`, `no_of_children`, `no_of_weekend_nights`, `no_of_week_nights`, `lead_time`, `arrival_year`, `arrival_month`, `arrival_date`, `no_of_previous_cancellations`, `no_of_previous_bookings_not_canceled`, `avg_price_per_room`, and `no_of_special_requests`.

I will investigate the relationship between these features and the target, `booking_status`, to build a model that will predict whether a customer will cancel their reservation or not.

I am not using `Booking_ID` as a variable, as its purpose is just to confirm that each entry is a unique booking.
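
The grouping above can be kept in two lists so that later steps refer to it by name. The following is only a sketch: the list names `categorical_cols` and `quantitative_cols` are placeholders, and the small `toy` frame is made up to show how casting an integer flag to the `category` dtype keeps it out of numeric summaries.

```python
import pandas as pd

# Column groups as described above (list names are placeholders)
categorical_cols = ['type_of_meal_plan', 'required_car_parking_space',
                    'room_type_reserved', 'market_segment_type', 'repeated_guest']
quantitative_cols = ['no_of_adults', 'no_of_children', 'no_of_weekend_nights',
                     'no_of_week_nights', 'lead_time', 'arrival_year',
                     'arrival_month', 'arrival_date', 'no_of_previous_cancellations',
                     'no_of_previous_bookings_not_canceled', 'avg_price_per_room',
                     'no_of_special_requests']

# Toy frame with made-up values: an integer flag looks numeric until it is cast
toy = pd.DataFrame({'repeated_guest': [0, 1, 0], 'lead_time': [5, 120, 33]})
toy['repeated_guest'] = toy['repeated_guest'].astype('category')

# describe() on a mixed frame summarizes only the numeric columns by default
numeric_summary_cols = toy.describe().columns.tolist()
print(numeric_summary_cols)
```

Together the two lists cover the 17 feature columns that remain once `Booking_ID` and `booking_status` are set aside.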

In [7]:
# Generate descriptive statistics of numerical variables in the data
data.describe()

| | no_of_adults | no_of_children | no_of_weekend_nights | no_of_week_nights | required_car_parking_space | lead_time | arrival_year | arrival_month | arrival_date | repeated_guest | no_of_previous_cancellations | no_of_previous_bookings_not_canceled | avg_price_per_room | no_of_special_requests |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 27206.0 | 27206.0 | 27206.0 | 27206.0 | 27206.0 | 27206.0 | 27206.0 | 27206.0 | 27206.0 | 27206.0 | 27206.0 | 27206.0 | 27206.0 | 27206.0 |
| mean | 1.845769 | 0.105969 | 0.809086 | 2.206131 | 0.0315 | 84.895097 | 2017.820297 | 7.42215 | 15.596192 | 0.025436 | 0.021907 | 0.144417 | 103.224453 | 0.620709 |
| std | 0.517187 | 0.406051 | 0.873351 | 1.422998 | 0.174669 | 85.658628 | 0.383947 | 3.083752 | 8.714875 | 0.157447 | 0.348755 | 1.674084 | 35.110905 | 0.787821 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2017.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 2.0 | 0.0 | 0.0 | 1.0 | 0.0 | 17.0 | 2018.0 | 5.0 | 8.0 | 0.0 | 0.0 | 0.0 | 80.0 | 0.0 |
| 50% | 2.0 | 0.0 | 1.0 | 2.0 | 0.0 | 57.0 | 2018.0 | 8.0 | 16.0 | 0.0 | 0.0 | 0.0 | 99.405 | 0.0 |
| 75% | 2.0 | 0.0 | 2.0 | 3.0 | 0.0 | 126.0 | 2018.0 | 10.0 | 23.0 | 0.0 | 0.0 | 0.0 | 120.0 | 1.0 |
| max | 4.0 | 10.0 | 7.0 | 17.0 | 1.0 | 443.0 | 2018.0 | 12.0 | 31.0 | 1.0 | 13.0 | 58.0 | 540.0 | 5.0 |


I will convert the values in `booking_status` to 1 for `'Canceled'` and 0 for `'Not_Canceled'`.

In [8]:
# Convert values in the target column to 0's and 1's
data['booking_status'] = data['booking_status'].map({'Canceled': 1,
                                                       'Not_Canceled': 0})

# Confirm it has been done correctly
data['booking_status'].value_counts()

booking_status
0    18368
1     8838
Name: count, dtype: int64
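
The counts above show an imbalanced target: roughly a third of the bookings are canceled. As a minimal sketch, using only the two counts printed above (hard-coded here for illustration), the share of each class and the accuracy of always guessing the majority class can be computed as follows.

```python
import pandas as pd

# Class counts copied from the value_counts() output above
counts = pd.Series({0: 18368, 1: 8838}, name='count')

# Share of each class among all bookings
shares = counts / counts.sum()
print(shares.round(4))

# A model that always predicts the majority class (0, not canceled)
# would be correct for this fraction of the bookings
majority_baseline = shares.max()
print(f"Majority-class accuracy: {majority_baseline:.4f}")
```

Any model built later should beat this majority-class figure to be useful.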

I will now convert the remaining string-valued variables to numerical codes. I will then apply `OneHotEncoder` to the categorical variables to prepare for feature selection.

In [9]:
# Check how many value types are in each variable with string values
print(data['type_of_meal_plan'].value_counts())
print('\n')

print(data['room_type_reserved'].value_counts())
print('\n')

print(data['market_segment_type'].value_counts())

type_of_meal_plan
Meal Plan 1     20879
Not Selected     3845
Meal Plan 2      2477
Meal Plan 3         5
Name: count, dtype: int64


room_type_reserved
Room_Type 1    21108
Room_Type 4     4551
Room_Type 6      712
Room_Type 2      522
Room_Type 5      191
Room_Type 7      116
Room_Type 3        6
Name: count, dtype: int64


market_segment_type
Online           17428
Offline           7885
Corporate         1507
Complementary      295
Aviation            91
Name: count, dtype: int64


In [10]:
# Convert values in the type_of_meal_plan to numerical values
data['type_of_meal_plan'] = data['type_of_meal_plan'].map({'Not Selected': 0,
                                                             'Meal Plan 1': 1,
                                                             'Meal Plan 2': 2,
                                                             'Meal Plan 3': 3})

# Convert values in the room_type_reserved to numerical values
data['room_type_reserved'] = data['room_type_reserved'].map({'Room_Type 1': 1,
                                                             'Room_Type 2': 2,
                                                             'Room_Type 3': 3,
                                                             'Room_Type 4': 4,
                                                             'Room_Type 5': 5,
                                                             'Room_Type 6': 6,
                                                             'Room_Type 7': 7})


# Convert values in the market_segment_type to numerical values
data['market_segment_type'] = data['market_segment_type'].map({'Offline': 0,
                                                                 'Online': 1,
                                                                 'Corporate': 2,
                                                                 'Complementary': 3,
                                                                 'Aviation': 4})
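
One caveat with `Series.map` and a dictionary: any label missing from the dictionary silently becomes `NaN` instead of raising an error. Below is a small sketch of a guard, run on a toy series; the label `'Meal Plan 9'` is hypothetical and only there to trigger the check.

```python
import pandas as pd

meal_plan_codes = {'Not Selected': 0, 'Meal Plan 1': 1,
                   'Meal Plan 2': 2, 'Meal Plan 3': 3}

# Toy data; 'Meal Plan 9' is a made-up label that is not in the mapping
toy = pd.Series(['Meal Plan 1', 'Not Selected', 'Meal Plan 9'])

mapped = toy.map(meal_plan_codes)

# Labels that were present before mapping but are NaN afterwards
unmapped = toy[mapped.isna() & toy.notna()].unique().tolist()
print(unmapped)
```

An empty list here would mean every label was covered by the mapping.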

Before preparing my data, I will use `train_test_split` to split my data into a train set and a test set. This is to prevent data leakage. When I test my model on the test set, I want it to mimic unknown data as best as possible. 

In [None]:
# Split dataset into features and target

X = data.drop(columns=['booking_status', 'Booking_ID'])
y = data['booking_status']

In [None]:
# Split the data into a train set and a test set using default values where 75% of the data is train and the remaining
# 25% is the test data
# Random state used for reproducibility
X_train, X_test, y_train, y_test = train_test_split(X, y, shuffle=True, random_state=28)
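
Because canceled bookings are the minority class, one option (not used in the split above) is to pass `stratify=y` so that the train and test sets keep the same cancellation rate. The sketch below assumes a synthetic target with 30% cancellations and a placeholder feature column; none of these values come from the dataset.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 400 rows, 120 of them canceled (30%)
y_toy = pd.Series([1] * 120 + [0] * 280, name='booking_status')
X_toy = pd.DataFrame({'lead_time': np.arange(400)})

# stratify keeps the class proportions the same in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, stratify=y_toy, random_state=28)

print(len(X_tr), len(X_te))
print(y_tr.mean(), y_te.mean())
```

Without `stratify`, the two cancellation rates can drift apart by chance, which matters more for small or heavily imbalanced datasets.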

In [None]:
# Create dummy variables for categorical features
categorical_cols = ['type_of_meal_plan', 'required_car_parking_space', 'room_type_reserved',
                    'market_segment_type', 'repeated_guest']

ohe = OneHotEncoder(drop='first', sparse_output=False)

# Fit the encoder on the train set only, then build a dataframe of the encoded columns
categorical_train_ohe = pd.DataFrame(data=ohe.fit_transform(X_train[categorical_cols]),
                                     columns=ohe.get_feature_names_out(),
                                     index=X_train.index)

# Apply the already-fitted encoder to the test set (no refitting, to avoid data leakage)
categorical_test_ohe = pd.DataFrame(data=ohe.transform(X_test[categorical_cols]),
                                    columns=ohe.get_feature_names_out(),
                                    index=X_test.index)

In [9]:
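# booking_status must already be numeric (0/1) here; with the original
# string labels, np.corrcoef raises the TypeError shown below
np.corrcoef(data['no_of_adults'], data['booking_status'])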

TypeError: unsupported operand type(s) for /: 'str' and 'int'

In [None]:
# List correlations between the features and the target
target = data['booking_status']
features = data[['type_of_meal_plan', 'required_car_parking_space', 'room_type_reserved', 'arrival_year', 
                 'arrival_month', 'arrival_date', 'market_segment_type', 'repeated_guest', 'no_of_adults', 
                 'no_of_children', 'no_of_weekend_nights', 'no_of_week_nights', 'lead_time', 
                 'no_of_previous_cancellations', 'no_of_previous_bookings_not_canceled', 'avg_price_per_room', 
                 'no_of_special_requests']]

# Print the Pearson correlation between each feature and the target
for feature in features.columns:
    print(feature, np.corrcoef(data[feature], target)[0, 1])
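
An alternative to the loop, sketched here on toy data (the numbers are made up), is `DataFrame.corrwith`. It returns one Pearson coefficient per feature as a Series, which can then be sorted.

```python
import pandas as pd

# Toy data: long lead times and no special requests go with cancellations
toy = pd.DataFrame({'lead_time': [10, 200, 35, 150, 5, 300],
                    'no_of_special_requests': [2, 0, 1, 0, 3, 0],
                    'booking_status': [0, 1, 0, 1, 0, 1]})

# Correlate every feature column with the target in one call
correlations = (toy.drop(columns='booking_status')
                   .corrwith(toy['booking_status'])
                   .sort_values(ascending=False))
print(correlations)
```

Sorting makes it easy to see which features move most strongly with the target, in either direction.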

## Modeling

### Baseline Understanding

### First Model

### Modeling Iterations

### Final Model

## Conclusions