## Introduction

**Dataset from Kaggle**

This data set contains booking information for a city hotel and a resort hotel, and includes information such as when the booking was made, length of stay, the number of adults, children, and/or babies, and the number of available parking spaces, among other things. All personally identifying information has been removed from the data.

The data is originally from the article Hotel Booking Demand Datasets, written by Nuno Antonio, Ana Almeida, and Luis Nunes for Data in Brief, Volume 22, February 2019. The data was downloaded and cleaned by Thomas Mock and Antoine Bichat for #TidyTuesday during the week of February 11th, 2020.

License: Attribution 4.0 International (CC BY 4.0)

https://www.kaggle.com/jessemostipak/hotel-booking-demand

**Variables**
* **hotel:** Hotel (Resort Hotel or City Hotel)
* **is_canceled:** Value indicating if the booking was canceled (1) or not (0)
* **lead_time:** Number of days between the date the booking was entered into the PMS (property management system) and the arrival date
* **arrival_date_year:** Year of arrival date
* **arrival_date_month:** Month of arrival date
* **arrival_date_week_number:** Week number of year for arrival date
* **arrival_date_day_of_month:** Day of arrival date
* **stays_in_weekend_nights:** Number of weekend nights (Saturday or Sunday) the guest stayed or booked to stay at the hotel
* **stays_in_week_nights:** Number of week nights (Monday to Friday) the guest stayed or booked to stay at the hotel
* **adults:** Number of adults
* **children:** Number of children
* **babies:** Number of babies
* **meal:** Type of meal booked. Undefined/SC – no meal package; BB – Bed & Breakfast; HB – Half board (breakfast and one other meal); FB – Full board (breakfast, lunch and dinner)
* **country:** Country of origin (ISO 3166 country code)
* **market_segment:** Market segment designation. “TA” means “Travel Agents” and “TO” means “Tour Operators”
* **distribution_channel:** Booking distribution channel. The term “TA” means “Travel Agents” and “TO” means “Tour Operators”
* **is_repeated_guest:** Value indicating whether the booking was made by a repeat guest (1) or not (0)
* **previous_cancellations:** Number of previous bookings that were canceled by the customer prior to the current booking
* **previous_bookings_not_canceled:** Number of previous bookings not canceled by the customer prior to the current booking
* **reserved_room_type:** Code of room type reserved
* **assigned_room_type:** Code for the type of room assigned to the booking. Sometimes the assigned room type differs from the reserved room type for hotel operational reasons (e.g. overbooking) or by customer request.
* **booking_changes:** Number of changes/amendments made to the booking from the moment the booking was entered into the PMS until the moment of check-in or cancellation
* **deposit_type:** Indicates whether the customer made a deposit to guarantee the booking: No Deposit – no deposit was made; Non Refund – a deposit was made in the value of the total stay cost; Refundable – a deposit was made with a value under the total cost of stay.
* **agent:** ID of the travel agency that made the booking
* **company:** ID of the company/entity that made the booking or is responsible for paying for it
* **days_in_waiting_list:** Number of days the booking was on the waiting list before it was confirmed to the customer
* **customer_type:** Type of booking: Contract – when the booking has an allotment or other type of contract associated with it; Group – when the booking is associated with a group; Transient – when the booking is not part of a group or contract, and is not associated with any other transient booking; Transient-party – when the booking is transient, but is associated with at least one other transient booking
* **adr:** Average Daily Rate, defined as the sum of all lodging transactions divided by the total number of staying nights
* **required_car_parking_spaces:** Number of car parking spaces required by the customer
* **total_of_special_requests:** Number of special requests made by the customer (e.g. twin bed or high floor)
* **reservation_status:** Last status of the reservation: Canceled – booking was canceled by the customer; Check-Out – customer has checked in but already departed; No-Show – customer did not check in and did inform the hotel of the reason why
* **reservation_status_date:** Date at which the last status was set.
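
Several of the variables above are easier to work with once combined. The following is a minimal sketch, using a made-up two-row frame rather than the real data, of how the three arrival-date columns could be merged into a single datetime, how a booking date could be derived from `lead_time`, and how the two night counts could be summed. The names `arrival_date`, `booking_date`, and `total_nights` are illustrative and do not exist in the dataset.

```python
import pandas as pd

# Toy rows using column names from the variable list; the values are made up.
toy = pd.DataFrame({
    "arrival_date_year": [2015, 2016],
    "arrival_date_month": ["July", "February"],
    "arrival_date_day_of_month": [1, 29],
    "lead_time": [7, 30],
    "stays_in_weekend_nights": [0, 2],
    "stays_in_week_nights": [1, 3],
})

# Combine year, month name, and day into one datetime column.
# Assumes English month names, as in the dataset.
date_text = (
    toy["arrival_date_year"].astype(str)
    + "-" + toy["arrival_date_month"]
    + "-" + toy["arrival_date_day_of_month"].astype(str)
)
toy["arrival_date"] = pd.to_datetime(date_text, format="%Y-%B-%d")

# lead_time counts the days from booking entry to arrival, so subtract it.
toy["booking_date"] = toy["arrival_date"] - pd.to_timedelta(toy["lead_time"], unit="D")

# Total length of stay is the sum of weekend and week nights.
toy["total_nights"] = toy["stays_in_weekend_nights"] + toy["stays_in_week_nights"]

print(toy[["arrival_date", "booking_date", "total_nights"]])
```

A single datetime column makes it straightforward to sort or resample bookings by arrival date later on.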

**Objective**
* What is the best time to book?
* What is the optimal length of stay?
* What factors are associated with cancellations?

## 1. Data preparation

### 1.1 Loading dataset and libraries

In [1]:
# Importing libraries
import numpy as np
import pandas as pd
import matplotlib as mpl
import seaborn as sns

%matplotlib inline

In [2]:
# Reading data
df = pd.read_csv('hotel_bookings.csv')

In [3]:
# Exploring data frame
df.head()

|  | hotel | is_canceled | lead_time | arrival_date_year | arrival_date_month | arrival_date_week_number | arrival_date_day_of_month | stays_in_weekend_nights | stays_in_week_nights | adults | ... | deposit_type | agent | company | days_in_waiting_list | customer_type | adr | required_car_parking_spaces | total_of_special_requests | reservation_status | reservation_status_date |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | Resort Hotel | 0 | 342 | 2015 | July | 27 | 1 | 0 | 0 | 2 | ... | No Deposit |  |  | 0 | Transient | 0.0 | 0 | 0 | Check-Out | 2015-07-01 |
| 1 | Resort Hotel | 0 | 737 | 2015 | July | 27 | 1 | 0 | 0 | 2 | ... | No Deposit |  |  | 0 | Transient | 0.0 | 0 | 0 | Check-Out | 2015-07-01 |
| 2 | Resort Hotel | 0 | 7 | 2015 | July | 27 | 1 | 0 | 1 | 1 | ... | No Deposit |  |  | 0 | Transient | 75.0 | 0 | 0 | Check-Out | 2015-07-02 |
| 3 | Resort Hotel | 0 | 13 | 2015 | July | 27 | 1 | 0 | 1 | 1 | ... | No Deposit | 304.0 |  | 0 | Transient | 75.0 | 0 | 0 | Check-Out | 2015-07-02 |
| 4 | Resort Hotel | 0 | 14 | 2015 | July | 27 | 1 | 0 | 2 | 2 | ... | No Deposit | 240.0 |  | 0 | Transient | 98.0 | 0 | 1 | Check-Out | 2015-07-03 |


In [4]:
# Checking datatypes
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 119390 entries, 0 to 119389
Data columns (total 32 columns):
hotel                             119390 non-null object
is_canceled                       119390 non-null int64
lead_time                         119390 non-null int64
arrival_date_year                 119390 non-null int64
arrival_date_month                119390 non-null object
arrival_date_week_number          119390 non-null int64
arrival_date_day_of_month         119390 non-null int64
stays_in_weekend_nights           119390 non-null int64
stays_in_week_nights              119390 non-null int64
adults                            119390 non-null int64
children                          119386 non-null float64
babies                            119390 non-null int64
meal                              119390 non-null object
country                           118902 non-null object
market_segment                    119390 non-null object
distribution_channel              119390 n

In [5]:
# Viewing the first five rows transposed (one column per booking)
df.head().T

|  | 0 | 1 | 2 | 3 | 4 |
| --- | --- | --- | --- | --- | --- |
| hotel | Resort Hotel | Resort Hotel | Resort Hotel | Resort Hotel | Resort Hotel |
| is_canceled | 0 | 0 | 0 | 0 | 0 |
| lead_time | 342 | 737 | 7 | 13 | 14 |
| arrival_date_year | 2015 | 2015 | 2015 | 2015 | 2015 |
| arrival_date_month | July | July | July | July | July |
| arrival_date_week_number | 27 | 27 | 27 | 27 | 27 |
| arrival_date_day_of_month | 1 | 1 | 1 | 1 | 1 |
| stays_in_weekend_nights | 0 | 0 | 0 | 0 | 0 |
| stays_in_week_nights | 0 | 0 | 1 | 1 | 2 |
| adults | 2 | 2 | 1 | 1 | 2 |


### 1.2 Cleaning dataset

In [6]:
# Checking for missing values
df.isna().sum()

hotel                                  0
is_canceled                            0
lead_time                              0
arrival_date_year                      0
arrival_date_month                     0
arrival_date_week_number               0
arrival_date_day_of_month              0
stays_in_weekend_nights                0
stays_in_week_nights                   0
adults                                 0
children                               4
babies                                 0
meal                                   0
country                              488
market_segment                         0
distribution_channel                   0
is_repeated_guest                      0
previous_cancellations                 0
previous_bookings_not_canceled         0
reserved_room_type                     0
assigned_room_type                     0
booking_changes                        0
deposit_type                           0
agent                              16340
company         
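
Raw counts are hard to judge without the total number of rows next to them. As a rough sketch, a small helper (here called `missing_summary`, a hypothetical name) could report both the count and the percentage of missing values, restricted to columns that actually have gaps. It is shown on a made-up toy frame, not on the real dataset.

```python
import numpy as np
import pandas as pd

def missing_summary(frame):
    """Return count and percent of missing values for columns that have any."""
    counts = frame.isna().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(frame) * 100).round(2),
    })
    # Keep only columns with at least one missing value, largest first.
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)

# Toy frame with made-up values; column names mirror the dataset.
toy = pd.DataFrame({
    "children": [0.0, np.nan, 1.0, 2.0],
    "agent": [np.nan, np.nan, 304.0, 240.0],
    "adults": [2, 2, 1, 1],
})
result = missing_summary(toy)
print(result)
```

Calling `missing_summary(df)` on the full data frame would give the same kind of table for the real columns.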

In [7]:
# Creating 0/1 indicator columns: 1 if agent/company is present, 0 if missing
df['agent_c'] = np.where(df.agent.isna(), 0, 1)
df['company_c'] = np.where(df.company.isna(), 0, 1)
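
For comparison, the same kind of indicator can be written with pandas alone. This is a sketch on a made-up toy frame, and it is an alternative spelling rather than the approach used in this notebook.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up agent IDs; NaN means no agent was involved.
toy = pd.DataFrame({"agent": [np.nan, 304.0, 240.0]})

# notna() gives True/False; astype(int) turns that into 1/0.
toy["agent_c"] = toy["agent"].notna().astype(int)

# The np.where form used in the notebook, for comparison.
via_where = np.where(toy["agent"].isna(), 0, 1)

print(toy)
```

Both forms produce a 1 where an ID is present and a 0 where it is missing.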

In [8]:
# Replacing missing values in children with 0 (stored in a new column)
df['children_c'] = df.children.fillna(0)
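
Because `children` contains missing values, pandas stores it as `float64`. Once the gaps are filled, the column could optionally be cast back to integers. The sketch below uses made-up values and keeps the notebook's choice of treating a missing value as zero children.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values; the NaN forces a float dtype.
toy = pd.DataFrame({"children": [0.0, np.nan, 2.0]})

# Fill the gap with 0, then cast to int since counts are whole numbers.
toy["children_c"] = toy["children"].fillna(0).astype(int)

print(toy.dtypes)
```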

### 1.3 Unique values

In [9]:
df.arrival_date_year.unique()

array([2015, 2016, 2017], dtype=int64)

In [10]:
df.reservation_status.unique()

array(['Check-Out', 'Canceled', 'No-Show'], dtype=object)

In [11]:
df.market_segment.unique()

array(['Direct', 'Corporate', 'Online TA', 'Offline TA/TO',
       'Complementary', 'Groups', 'Undefined', 'Aviation'], dtype=object)

In [12]:
df.distribution_channel.unique()

array(['Direct', 'Corporate', 'TA/TO', 'Undefined', 'GDS'], dtype=object)
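
`unique()` lists the categories but not how often each occurs, which matters for values such as `'Undefined'`. A possible follow-up, sketched here on made-up toy data, is `value_counts()`, which returns the frequency of each category and, with `normalize=True`, its share.

```python
import pandas as pd

# Toy column with made-up values drawn from the categories listed above.
toy = pd.DataFrame({
    "distribution_channel": [
        "Direct", "TA/TO", "TA/TO", "Undefined", "Corporate", "TA/TO",
    ],
})

# Absolute frequency of each category (NaN included if present).
counts = toy["distribution_channel"].value_counts(dropna=False)

# Relative frequency of each category.
share = toy["distribution_channel"].value_counts(normalize=True, dropna=False)

print(counts)
print(share)
```

Running the same calls on `df.distribution_channel` or `df.market_segment` would show whether `'Undefined'` is rare enough to drop or relabel.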

## 2. Data exploration

## 3. Data modeling