### This first portfolio project focuses on the workflow structure of predictive analysis

#### Highlights of this project:
0. **Structured** problem solving
1. **DETAILS on the background thought process** from start to finish
2. Data preparation
3. Methods of performance evaluation
4. Methods of performance comparison and improvements

#### Citation:
The dataset is from Kaggle - Hotel booking demand

url: https://www.kaggle.com/jessemostipak/hotel-booking-demand/metadata


Original source: Nuno Antonio, Ana Almeida, Luis Nunes, "Hotel booking demand datasets", Data in Brief, 2019

In [1]:
#Load Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [2]:
#Import the data
hotel = pd.read_csv('Data/hotel_bookings.csv')
pd.set_option('display.max_columns', None)

hotel.head()

Unnamed: 0,hotel,is_canceled,lead_time,arrival_date_year,arrival_date_month,arrival_date_week_number,arrival_date_day_of_month,stays_in_weekend_nights,stays_in_week_nights,adults,children,babies,meal,country,market_segment,distribution_channel,is_repeated_guest,previous_cancellations,previous_bookings_not_canceled,reserved_room_type,assigned_room_type,booking_changes,deposit_type,agent,company,days_in_waiting_list,customer_type,adr,required_car_parking_spaces,total_of_special_requests,reservation_status,reservation_status_date
0,Resort Hotel,0,342,2015,July,27,1,0,0,2,0.0,0,BB,PRT,Direct,Direct,0,0,0,C,C,3,No Deposit,,,0,Transient,0.0,0,0,Check-Out,2015-07-01
1,Resort Hotel,0,737,2015,July,27,1,0,0,2,0.0,0,BB,PRT,Direct,Direct,0,0,0,C,C,4,No Deposit,,,0,Transient,0.0,0,0,Check-Out,2015-07-01
2,Resort Hotel,0,7,2015,July,27,1,0,1,1,0.0,0,BB,GBR,Direct,Direct,0,0,0,A,C,0,No Deposit,,,0,Transient,75.0,0,0,Check-Out,2015-07-02
3,Resort Hotel,0,13,2015,July,27,1,0,1,1,0.0,0,BB,GBR,Corporate,Corporate,0,0,0,A,A,0,No Deposit,304.0,,0,Transient,75.0,0,0,Check-Out,2015-07-02
4,Resort Hotel,0,14,2015,July,27,1,0,2,2,0.0,0,BB,GBR,Online TA,TA/TO,0,0,0,A,A,0,No Deposit,240.0,,0,Transient,98.0,0,1,Check-Out,2015-07-03


#### Let's determine our GOAL first: our goal is to predict whether a booking is for a resort hotel or a city hotel, given the other data for the same booking. I chose this goal because I expect the two hotels to show different booking patterns, and hopefully those patterns are distinctive enough to tell the hotel types apart.
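
Since the target is the `hotel` column, it can help to check how the two classes are split before modeling. The snippet below is a minimal sketch on a made-up toy frame (the `toy` name and its rows are assumptions for illustration, not the real data); on the real data the same call would be applied to `hotel['hotel']`.

```python
import pandas as pd

# Toy stand-in for the target column; the rows are made up for illustration.
toy = pd.DataFrame(
    {"hotel": ["Resort Hotel", "City Hotel", "City Hotel", "City Hotel"]}
)

# Share of each class in the target column (fractions that sum to 1).
class_share = toy["hotel"].value_counts(normalize=True)
print(class_share)
```

A heavily skewed split would be worth knowing about early, because it affects how accuracy should be read later on.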

In [3]:
#Let's take a look at the column info
hotel.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 119390 entries, 0 to 119389
Data columns (total 32 columns):
hotel                             119390 non-null object
is_canceled                       119390 non-null int64
lead_time                         119390 non-null int64
arrival_date_year                 119390 non-null int64
arrival_date_month                119390 non-null object
arrival_date_week_number          119390 non-null int64
arrival_date_day_of_month         119390 non-null int64
stays_in_weekend_nights           119390 non-null int64
stays_in_week_nights              119390 non-null int64
adults                            119390 non-null int64
children                          119386 non-null float64
babies                            119390 non-null int64
meal                              119390 non-null object
country                           118902 non-null object
market_segment                    119390 non-null object
distribution_channel              119390 n

#### Here, I would like to drop some columns that we are not going to use in this analysis. The dropped columns have low perceived usability. For example, some columns contain a lot of null values. As another example, consider the time series (date) columns: even though we might be able to perform some trend analysis on them, it is hard for some classification models to predict outcomes for a new date that is not in this dataset.

In [4]:
#Drop Columns
hotel.drop(['arrival_date_year', 'arrival_date_month', 'arrival_date_week_number', 'arrival_date_day_of_month', 
           'company', 'reservation_status_date'], axis = 1, inplace = True)
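
One way to back up the "lots of null values" criterion is to look at the share of missing values per column. This is only a sketch on a made-up toy frame: the column names are borrowed from the dataset, but the values and the 50% cutoff (`threshold`) are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for `hotel`; values are made up for illustration.
toy = pd.DataFrame({
    "lead_time": [342, 737, 7, 13],
    "company": [np.nan, np.nan, np.nan, 110.0],
    "children": [0.0, np.nan, 0.0, 1.0],
})

# Fraction of missing values per column, largest first.
null_share = toy.isna().mean().sort_values(ascending=False)

# Hypothetical cutoff: flag columns where more than half of the rows are missing.
threshold = 0.5
mostly_null = null_share[null_share > threshold].index.tolist()
print(null_share)
print(mostly_null)
```

On the real data, `hotel.isna().mean()` would give the same per-column view before deciding what to drop.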

#### Before we move on, let's do one more thing: find out how many unique values each column has

In [7]:
hotel.nunique()

hotel                                2
is_canceled                          2
lead_time                          479
stays_in_weekend_nights             17
stays_in_week_nights                35
adults                              14
children                             5
babies                               5
meal                                 5
country                            177
market_segment                       8
distribution_channel                 5
is_repeated_guest                    2
previous_cancellations              15
previous_bookings_not_canceled      73
reserved_room_type                  10
assigned_room_type                  12
booking_changes                     21
deposit_type                         3
agent                              333
days_in_waiting_list               128
customer_type                        4
adr                               8879
required_car_parking_spaces          5
total_of_special_requests            6
reservation_status       

#### If we focus just on the categorical variables, we find that the columns "country" and "agent" have a lot of categories. So I decided to remove them from our analysis for the following reasons:
1. Simplicity: if we were to one-hot encode those categorical variables, we would end up with hundreds of features, which might be okay for production use but would add complexity to this demonstration
2. Feature Balance: take "country" as an example - if a country has only a few matching booking records, then the predictive power of models on that country might be limited

In [8]:
hotel.drop(['country', 'agent'], axis = 1, inplace = True)
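
To make the "simplicity" argument concrete, here is a small sketch of how one-hot encoding widens a frame in proportion to the number of categories. The toy frame and its values are assumptions for illustration; `pd.get_dummies` is used here as one common way to one-hot encode, not necessarily the encoder used later in this project.

```python
import pandas as pd

# Toy frame with one low-cardinality and one high-cardinality column (made-up rows).
toy = pd.DataFrame({
    "meal": ["BB", "BB", "HB", "BB"],
    "country": ["PRT", "GBR", "ESP", "FRA"],
})

# Number of distinct values per non-numeric column.
cardinality = toy.select_dtypes(exclude="number").nunique()

# One-hot encoding creates one new column per distinct value.
encoded = pd.get_dummies(toy)
print(cardinality.to_dict())
print(encoded.shape[1])
```

Each distinct value becomes its own column, so a column with 177 distinct values (like "country" above) would add 177 features on its own.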

In [11]:
hotel.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 119390 entries, 0 to 119389
Data columns (total 24 columns):
hotel                             119390 non-null object
is_canceled                       119390 non-null int64
lead_time                         119390 non-null int64
stays_in_weekend_nights           119390 non-null int64
stays_in_week_nights              119390 non-null int64
adults                            119390 non-null int64
children                          119386 non-null float64
babies                            119390 non-null int64
meal                              119390 non-null object
market_segment                    119390 non-null object
distribution_channel              119390 non-null object
is_repeated_guest                 119390 non-null int64
previous_cancellations            119390 non-null int64
previous_bookings_not_canceled    119390 non-null int64
reserved_room_type                119390 non-null object
assigned_room_type                119390 n

#### Next, it seems the "children" column has 4 NA values, which we can easily fill with the column mean here

In [12]:
hotel.fillna({'children': np.mean(hotel['children'])}, inplace = True)

In [15]:
#Check if there are NA values in the entire dataframe
hotel.isna().any().any()

False
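
As a side note, because "children" is a count, a mean fill can produce a fractional value. The sketch below compares the mean fill used above with a median fill on a made-up toy column; the median variant is an alternative shown for comparison only, not what this notebook does.

```python
import numpy as np
import pandas as pd

# Toy column with one missing value (made-up numbers).
toy = pd.DataFrame({"children": [0.0, 0.0, 0.0, np.nan, 2.0]})

# Mean fill, as in the cell above: the gap becomes the average of the known values.
mean_filled = toy["children"].fillna(toy["children"].mean())

# Median fill (alternative): the gap becomes the middle of the known values.
median_filled = toy["children"].fillna(toy["children"].median())

print(mean_filled.tolist())
print(median_filled.tolist())
```

With only 4 missing rows out of 119390, either choice should have a negligible effect on the models.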