# Hotel Reservations

## Overview

## Business Understanding

With the ease of booking and canceling hotel reservations online, hotel cancellations and no-shows have increased drastically. This poses a significant revenue problem: hotels lose money when rooms sit vacant because of last-minute cancellations. To combat this issue, I am going to build a model that predicts whether a customer will cancel their reservation. This will allow the hotel to overbook an appropriate number of rooms, so that it does not lose revenue to vacant rooms while also not booking more rooms than it has available.

The stakeholders for this project are the hotel employees in charge of hotel bookings and operations, including the Reservations Manager, VP of Operations, and VP of Revenue Management.

This business problem is important to the stakeholders because it is crucial to increase revenue from hotel room bookings. Additionally, they need to accurately manage vacancies for guests, which also affects the price of the rooms.

In order to solve this business problem, I will investigate the following 3 questions:
1. What factors contribute to **hotel cancellations**?
2. What factors contribute to **maintaining a hotel reservation**?
3. How can hotels strategically price rooms to **increase revenue**?

## Data Understanding

The [Hotel Reservations Dataset](https://www.kaggle.com/datasets/ahsan81/hotel-reservations-classification-dataset) from Kaggle contains 36,275 unique bookings with arrival dates in 2017 and 2018.

There are 19 columns, which are provided in the following data dictionary:

**Data Dictionary**

**Booking_ID**: Unique identifier of each booking <br>
**no_of_adults**: Number of adults <br>
**no_of_children**: Number of children <br>
**no_of_weekend_nights**: Number of weekend nights (Saturday or Sunday) the guest stayed or booked to stay at the hotel <br>
**no_of_week_nights**: Number of week nights (Monday to Friday) the guest stayed or booked to stay at the hotel <br>
**type_of_meal_plan**: Type of meal plan booked by the customer <br>
**required_car_parking_space**: Does the customer require a car parking space? (0 - No, 1 - Yes) <br>
**room_type_reserved**: Type of room reserved by the customer. The values are ciphered (encoded) by INN Hotels. <br>
**lead_time**: Number of days between the date of booking and the arrival date <br>
**arrival_year**: Year of arrival date <br>
**arrival_month**: Month of arrival date <br>
**arrival_date**: Day of the month of the arrival date <br>
**market_segment_type**: Market segment designation <br>
**repeated_guest**: Is the customer a repeated guest? (0 - No, 1 - Yes) <br>
**no_of_previous_cancellations**: Number of previous bookings that were canceled by the customer prior to the current booking <br>
**no_of_previous_bookings_not_canceled**: Number of previous bookings not canceled by the customer prior to the current booking <br>
**avg_price_per_room**: Average price per day of the reservation, in euros; room prices are dynamic <br>
**no_of_special_requests**: Total number of special requests made by the customer (e.g. high floor, view from the room, etc) <br>
**booking_status**: Flag indicating if the booking was canceled or not <br>

The target variable will be `booking_status`.

First, I import the libraries that I will use for the EDA and data preparation.

In [1]:
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

%matplotlib inline

pd.options.mode.copy_on_write = True

# Suppress harmless warning for use_inf_as_na
import warnings
warnings.simplefilter(action="ignore", category=FutureWarning)

Next I will load the dataset into the notebook.

In [4]:
data = pd.read_csv('data/hotel_reservations.csv')

Before preparing my data, I will use `train_test_split` to split it into a train set and a test set. This is to prevent data leakage: when I evaluate my model on the test set, I want that set to mimic unseen data as closely as possible.

In [27]:
from sklearn.model_selection import train_test_split

# Split the data into a train set and a test set using the default values, where 75% of the data is train and the
# remaining 25% is test
train, test = train_test_split(data)
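
The split above relies on the defaults, so the rows that land in each set change from run to run. A minimal sketch of a reproducible variant follows. The toy DataFrame stands in for `data`, and both `random_state=42` and stratifying on `booking_status` are assumptions for illustration, not settings used in this notebook.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for `data` (assumption: same target column and labels as the real set)
toy = pd.DataFrame({
    "lead_time": range(80),
    "booking_status": ["Canceled", "Not_Canceled", "Not_Canceled", "Not_Canceled"] * 20,
})

# random_state fixes the shuffle so the split is repeatable;
# stratify keeps the share of cancellations equal in both sets.
# test_size=0.25 is the same proportion as the default.
train_toy, test_toy = train_test_split(
    toy, test_size=0.25, random_state=42, stratify=toy["booking_status"]
)

print(train_toy.shape, test_toy.shape)
print(test_toy["booking_status"].value_counts(normalize=True))
```

Stratifying matters most when one class is much rarer than the other, because a purely random split can then leave the test set with a noticeably different cancellation rate than the train set.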

## Data Preparation

In [28]:
# Preview the first 5 rows of the train data
train.head()

| | Booking_ID | no_of_adults | no_of_children | no_of_weekend_nights | no_of_week_nights | type_of_meal_plan | required_car_parking_space | room_type_reserved | lead_time | arrival_year | arrival_month | arrival_date | market_segment_type | repeated_guest | no_of_previous_cancellations | no_of_previous_bookings_not_canceled | avg_price_per_room | no_of_special_requests | booking_status |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 33859 | INN33860 | 1 | 0 | 2 | 1 | Meal Plan 1 | 0 | Room_Type 1 | 41 | 2018 | 3 | 13 | Corporate | 0 | 0 | 0 | 70.33 | 0 | Not_Canceled |
| 13127 | INN13128 | 1 | 0 | 0 | 2 | Meal Plan 1 | 0 | Room_Type 1 | 1 | 2018 | 1 | 27 | Online | 0 | 0 | 0 | 71.7 | 0 | Not_Canceled |
| 18108 | INN18109 | 1 | 0 | 1 | 0 | Meal Plan 1 | 0 | Room_Type 1 | 44 | 2018 | 3 | 7 | Offline | 0 | 0 | 0 | 65.4 | 0 | Canceled |
| 2296 | INN02297 | 1 | 0 | 6 | 14 | Meal Plan 1 | 0 | Room_Type 1 | 17 | 2018 | 11 | 20 | Online | 0 | 0 | 0 | 97.2 | 0 | Canceled |
| 26911 | INN26912 | 2 | 0 | 1 | 5 | Meal Plan 2 | 0 | Room_Type 1 | 138 | 2018 | 7 | 5 | Offline | 0 | 0 | 0 | 102.25 | 0 | Not_Canceled |


In [29]:
# View the overall shape, dtypes and null counts for each column in train data
train.info()

<class 'pandas.core.frame.DataFrame'>
Index: 27206 entries, 33859 to 27369
Data columns (total 19 columns):
 #   Column                                Non-Null Count  Dtype  
---  ------                                --------------  -----  
 0   Booking_ID                            27206 non-null  object 
 1   no_of_adults                          27206 non-null  int64  
 2   no_of_children                        27206 non-null  int64  
 3   no_of_weekend_nights                  27206 non-null  int64  
 4   no_of_week_nights                     27206 non-null  int64  
 5   type_of_meal_plan                     27206 non-null  object 
 6   required_car_parking_space            27206 non-null  int64  
 7   room_type_reserved                    27206 non-null  object 
 8   lead_time                             27206 non-null  int64  
 9   arrival_year                          27206 non-null  int64  
 10  arrival_month                         27206 non-null  int64  
 11  arrival_date    

In [30]:
# Check if there are any duplicates in the train data
train.duplicated(subset='Booking_ID').value_counts()

False    27206
Name: count, dtype: int64

There are no null values in the dataset, and no `Booking_ID` appears more than once in the train data. `repeated_guest` is of type integer but seems to be categorical, with 0 corresponding to not a repeated guest and 1 corresponding to a repeated guest.

In fact, our **categorical variables** are as follows: `type_of_meal_plan`, `required_car_parking_space`, `room_type_reserved`, `arrival_year`, `arrival_month`, `arrival_date`, `market_segment_type`, and `repeated_guest`. 

Our **numerical variables** are `no_of_adults`, `no_of_children`, `no_of_weekend_nights`, `no_of_week_nights`, `lead_time`, `no_of_previous_cancellations`, `no_of_previous_bookings_not_canceled`, `avg_price_per_room`, and `no_of_special_requests`.
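
One way to make this distinction explicit in pandas is to cast the integer-coded categorical columns to the `category` dtype, so that they are no longer treated as numbers by `describe()` or `select_dtypes`. The sketch below runs on a small toy frame that borrows a few column names from the dataset; `toy`, `categorical_cols`, and `numerical_cols` are hypothetical names, and this is an illustration rather than a step the notebook performs.

```python
import pandas as pd

# Toy frame with a few of the dataset's columns (values are illustrative)
toy = pd.DataFrame({
    "type_of_meal_plan": ["Meal Plan 1", "Meal Plan 2", "Meal Plan 1"],
    "required_car_parking_space": [0, 1, 0],
    "repeated_guest": [0, 0, 1],
    "arrival_month": [3, 1, 11],
    "lead_time": [41, 1, 17],
    "avg_price_per_room": [70.33, 71.7, 97.2],
})

# Columns that hold labels, even when they are stored as integers
categorical_cols = [
    "type_of_meal_plan",
    "required_car_parking_space",
    "repeated_guest",
    "arrival_month",
]

# Cast them to the category dtype in one call
toy = toy.astype({col: "category" for col in categorical_cols})

# Whatever is still numeric is a true numerical variable
numerical_cols = toy.select_dtypes(include="number").columns.tolist()
print(numerical_cols)
```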

I will investigate the relationship between these features and the target, `booking_status`, to build a model that will predict whether a customer will cancel their reservation or not.
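
For a categorical feature, a simple way to look at its relationship with the target is to compare the cancellation rate across its categories. The sketch below does this on toy data with `groupby`; the helper column `is_canceled` is a hypothetical name, and the values are made up for illustration, so the rates shown say nothing about the real dataset.

```python
import pandas as pd

# Toy data using the dataset's labels for segment and booking status
toy = pd.DataFrame({
    "market_segment_type": ["Online", "Online", "Offline", "Offline", "Corporate", "Online"],
    "booking_status": ["Canceled", "Not_Canceled", "Not_Canceled", "Canceled", "Not_Canceled", "Canceled"],
})

# Encode the string target as 1 (Canceled) / 0 (Not_Canceled)
toy["is_canceled"] = (toy["booking_status"] == "Canceled").astype(int)

# The mean of a 0/1 column is the share of canceled bookings in each group
cancel_rate = (
    toy.groupby("market_segment_type")["is_canceled"]
    .mean()
    .sort_values(ascending=False)
)
print(cancel_rate)
```

To stay consistent with the goal of avoiding data leakage, a summary like this would be computed on `train` rather than on the full dataset.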

I am not using `Booking_ID` as a variable, as its purpose is just to confirm that each entry is a unique booking.

In [31]:
# Generate descriptive statistics of numerical variables in the train data
train.describe()

| | no_of_adults | no_of_children | no_of_weekend_nights | no_of_week_nights | required_car_parking_space | lead_time | arrival_year | arrival_month | arrival_date | repeated_guest | no_of_previous_cancellations | no_of_previous_bookings_not_canceled | avg_price_per_room | no_of_special_requests |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 27206.0 | 27206.0 | 27206.0 | 27206.0 | 27206.0 | 27206.0 | 27206.0 | 27206.0 | 27206.0 | 27206.0 | 27206.0 | 27206.0 | 27206.0 | 27206.0 |
| mean | 1.84349 | 0.105124 | 0.804859 | 2.206278 | 0.030729 | 85.500074 | 2017.822429 | 7.390245 | 15.6088 | 0.025289 | 0.025766 | 0.154745 | 103.428318 | 0.621003 |
| std | 0.518986 | 0.404503 | 0.870455 | 1.408906 | 0.172584 | 86.229591 | 0.382158 | 3.074041 | 8.720781 | 0.157003 | 0.401657 | 1.781509 | 35.028382 | 0.785439 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2017.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 2.0 | 0.0 | 0.0 | 1.0 | 0.0 | 17.0 | 2018.0 | 5.0 | 8.0 | 0.0 | 0.0 | 0.0 | 80.3 | 0.0 |
| 50% | 2.0 | 0.0 | 1.0 | 2.0 | 0.0 | 57.0 | 2018.0 | 8.0 | 16.0 | 0.0 | 0.0 | 0.0 | 99.45 | 0.0 |
| 75% | 2.0 | 0.0 | 2.0 | 3.0 | 0.0 | 127.0 | 2018.0 | 10.0 | 23.0 | 0.0 | 0.0 | 0.0 | 120.4975 | 1.0 |
| max | 4.0 | 10.0 | 7.0 | 17.0 | 1.0 | 443.0 | 2018.0 | 12.0 | 31.0 | 1.0 | 13.0 | 58.0 | 540.0 | 5.0 |


In [26]:
# booking_status holds strings, so compare it to 'Canceled' to get a 0/1 series that np.corrcoef can use
np.corrcoef(data['no_of_adults'], (data['booking_status'] == 'Canceled').astype(int))

In [25]:
# List correlations between each numeric feature and the target
# booking_status holds strings, so encode it as 1 (Canceled) / 0 (Not_Canceled) first
target = (data['booking_status'] == 'Canceled').astype(int)
features = data[['type_of_meal_plan', 'required_car_parking_space', 'room_type_reserved', 'arrival_year', 
                 'arrival_month', 'arrival_date', 'market_segment_type', 'repeated_guest', 'no_of_adults', 
                 'no_of_children', 'no_of_weekend_nights', 'no_of_week_nights', 'lead_time', 
                 'no_of_previous_cancellations', 'no_of_previous_bookings_not_canceled', 'avg_price_per_room', 
                 'no_of_special_requests']]

# np.corrcoef only accepts numeric input, so skip the string-valued columns
for feature in features.select_dtypes(include='number'):
    # [0, 1] picks the feature-target coefficient out of the 2x2 correlation matrix
    print(feature, np.corrcoef(data[feature], target)[0, 1])

## Modeling

### Baseline Understanding

### First Model

### Modeling Iterations

### Final Model

## Conclusions