# Hotel Booking

Goal: Predict the chances of cancelation using customer data from multiple hotels.
The data is given in a tabular form of 32 columns (20 numerical + 12 objects).

As a first stage, the data was separated into training/validation/test sets.
To get the optimal model I investigated the features in the training set, starting with the numerical features that are most correlated with the target.
I proceeded with encoding the categorical features while monitoring the accuracy with a small subset of features to get an initial estimate of the achievable prediction.

Model selection was made by considering three models (Random Forest, Gradient Boosting and Logistic Regression) and evaluating them on the validation set. The effectiveness of dimensionality-reduction (using PCA) and model tuning (using Grid-Search) was assessed as well.

Content:
1. Import and read the data
2. Data Analysis
3. Model

# 1. Import and read data:
* Import libraries
* Read and explore data : remove unimportant & incomplete data
* Split into train-val-test

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn import preprocessing
from sklearn.model_selection import GridSearchCV

from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression


In [None]:
data = pd.read_csv('/kaggle/input/hotel-booking-demand/hotel_bookings.csv')
data.head()

In [None]:
data.info(), len(data.columns)

In [None]:
# numeric_only=True skips the object columns (needed with pandas >= 2.0)
np.abs(data.corr(numeric_only=True)['is_canceled']).sort_values(ascending=False)

In [None]:
data.isna().sum()

In [None]:
data = data.drop(labels = ['country', 'agent', 'company'], axis =1)
data.isna().sum()

In [None]:
# split the data to train-val-test
train_val, test = train_test_split(data, train_size=0.9, test_size=0.1, random_state=42)
train,val = train_test_split(train_val, train_size = 0.89, test_size=0.11, random_state=42)
len(data), len(train), len(val), len(test), len(train)/len(data)
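The two chained calls above give roughly an 80/10/10 split, since 0.9 × 0.89 ≈ 0.80 of the rows end up in the training set. A minimal sketch on synthetic data follows. The toy frame and the `stratify` argument are assumptions added for illustration (the notebook itself does not stratify); stratifying keeps the cancelation rate similar across the three sets.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the bookings table (hypothetical data, not the real set)
rng = np.random.default_rng(42)
toy = pd.DataFrame({
    "lead_time": rng.integers(0, 400, size=1000),
    "is_canceled": rng.integers(0, 2, size=1000),
})

# First carve out 10% for test, then 11% of the remainder for validation
toy_train_val, toy_test = train_test_split(
    toy, test_size=0.1, random_state=42, stratify=toy["is_canceled"])
toy_train, toy_val = train_test_split(
    toy_train_val, test_size=0.11, random_state=42,
    stratify=toy_train_val["is_canceled"])

# Fraction of the full table that lands in each set
fractions = {name: len(part) / len(toy)
             for name, part in [("train", toy_train), ("val", toy_val), ("test", toy_test)]}
print(fractions)
```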

# 2. Data analysis
Includes:
* Data visualization
* Data cleaning
* Feature Engineering

### Features

* is_canceled : Value indicating if the booking was canceled (1) or not (0)
* lead_time : Number of days that elapsed between the entering date of the booking into the PMS and the arrival date
* arrival_date_year : Year of arrival date

* arrival_date_week_number : Week number of year for arrival date
* arrival_date_day_of_month : Day of arrival date
* stays_in_weekend_nights : Number of weekend nights (Saturday or Sunday) the guest stayed or booked to stay at the hotel
* stays_in_week_nights : Number of week nights (Monday to Friday) the guest stayed or booked to stay at the hotel
* adults : Number of adults
* children : Number of children
* babies : Number of babies

* country : Country of origin. Categories are represented in the ISO 3155–3:2013 format

* is_repeated_guest : Value indicating if the booking name was from a repeated guest (1) or not (0)
* previous_cancellations : Number of previous bookings that were cancelled by the customer prior to the current booking
* previous_bookings_not_canceled : Number of previous bookings not cancelled by the customer prior to the current booking

* booking_changes : Number of changes/amendments made to the booking from the moment the booking was entered on the PMS until the moment of check-in or cancellation

* agent : ID of the travel agency that made the booking
* company : ID of the company/entity that made the booking or is responsible for paying the booking. ID is presented instead of designation for anonymity reasons
* days_in_waiting_list : Number of days the booking was in the waiting list before it was confirmed to the customer

* adr : Average Daily Rate as defined by dividing the sum of all lodging transactions by the total number of staying nights
* required_car_parking_spaces : Number of car parking spaces required by the customer
* total_of_special_requests : Number of special requests made by the customer (e.g. twin bed or high floor)

* reservation_status_date : Date at which the last status was set. This variable can be used in conjunction with reservation_status to understand when the booking was canceled or when the customer checked out of the hotel

## Top Correlated Features

In [None]:
# Check most correlated features (index 0 is the target itself, so it is skipped)
most_corr = np.abs(train.corr(numeric_only=True)['is_canceled']).sort_values(ascending=False)[1:7]
most_corr

In [None]:
#sns.pairplot(train[most_corr.index])

In [None]:
train['is_canceled'].value_counts(normalize = True)

### lead_time 
Number of days that elapsed between the entering date of the booking into the PMS and the arrival date.

We will use lead_time as a one-feature decision criterion.

In [None]:
train['lead_time'].plot(kind='kde')

In [None]:
train['lead_time'].quantile(np.arange(0.1,1,0.1))
#np.arange(0.1,1,0.1)

In [None]:
qlt = pd.qcut(train['lead_time'],10, labels=False)
qlt_df=pd.concat([qlt, train['is_canceled']], axis=1)
qlt_df

In [None]:
sns.barplot(x= 'lead_time', y='is_canceled', data=qlt_df, )

In [None]:
qlt_df['is_canceled'].groupby(qlt_df['lead_time']).value_counts(normalize=True)

In [None]:
qlt_df['lead_time'].groupby(qlt_df['is_canceled']).value_counts(normalize=True)

In [None]:
qlt_df['lead_time'].groupby(qlt_df['is_canceled']).value_counts(normalize=True)[1].plot(kind='bar')

### Preliminary Accuracy Estimation
We will create an accuracy dataframe to monitor the quality of prediction at every step

In [None]:
df_prediction = pd.DataFrame(columns = ['Score'])

#### One-feature-decision

In [None]:
qlt_df['Decision'] = 0
#cond1  = (qlt_df['lead_time']==(11.0, 26.0]) | (qlt_df['lead_time']==(2.0, 11.0]) | ( qlt_df['lead_time']==(-0.001, 2.0])
qlt_df.loc[qlt_df['lead_time']>=7, 'Decision'] = 1
qlt_df

In [None]:
qlt_df['Decision'].value_counts(normalize=True)

In [None]:
def print_score(prediction, validation, name):
    # Accuracy in percent, stored in the running summary table
    acc = np.round(accuracy_score(validation, prediction) * 100, 2)
    print('Accuracy Score : ', acc)
    df_prediction.loc[name, 'Score'] = acc
    # sklearn expects (y_true, y_pred): rows are true classes, columns are predictions
    cm = confusion_matrix(validation, prediction)
    cm_norm = cm / cm.sum()  # normalize so that all cells sum to 1
    return sns.heatmap(cm_norm, cmap='hot', annot=True)

In [None]:
print_score(qlt_df['Decision'],qlt_df['is_canceled'],'Top 1 Numeric Feat')

Intuitively, the probability of cancelation grows with the lead time.
* for the bottom decile (up to 2 days) the chances for cancelation are 8%
* for the 8th and 9th deciles (137-265 days) the chances for cancelation are ~50%
* for the upper decile (more than a year) the chances for cancelation are 67%
* Prediction based solely on lead_time is no better than a constant, single-class prediction (~65%)
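The decile-based rule can be sketched on synthetic data as follows. The toy data and the threshold are assumptions for illustration only; `pd.qcut` with `labels=False` returns the decile index (0 to 9), and the rule predicts a cancelation for the upper deciles.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 5000
# Hypothetical data: cancelation probability grows with lead time
lead_time = rng.integers(0, 500, size=n)
p_cancel = 0.1 + 0.6 * lead_time / 500
toy = pd.DataFrame({
    "lead_time": lead_time,
    "is_canceled": (rng.random(n) < p_cancel).astype(int),
})

# Decile index of each booking (0 = shortest lead times, 9 = longest)
toy["decile"] = pd.qcut(toy["lead_time"], 10, labels=False)

# Cancelation rate per decile: the mean of a 0/1 target is a rate
rate_per_decile = toy.groupby("decile")["is_canceled"].mean()

# One-feature rule: predict "canceled" for deciles at or above a threshold
threshold = 7  # assumed cut-off, same value as in the cell above
toy["decision"] = (toy["decile"] >= threshold).astype(int)
rule_accuracy = (toy["decision"] == toy["is_canceled"]).mean()

print(rate_per_decile.round(2))
print("rule accuracy:", round(rule_accuracy, 3))
```

Comparing `rule_accuracy` with the share of the majority class shows whether the single feature adds anything over a constant prediction.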

#### Top-5 Correlated Features Based Decision
We will check the accuracy achieved with a minimal number of features, without any feature engineering

In [None]:
train_5 = train[most_corr.index.values]
train_5.head()

In [None]:
train_5.info()

In [None]:
val_5 = val[most_corr.index.values]
val_5.info()

In [None]:
target_5 = train['is_canceled']
target_val_5 = val['is_canceled']

In [None]:
rf_5 = RandomForestClassifier()
rf_5.fit(train_5, target_5)

In [None]:
rf5_pred = rf_5.predict(val_5)
print_score(rf5_pred,target_val_5,'Top 5 Numeric Feat')

Preliminary accuracy estimates show that one feature is sufficient to reach an accuracy level of ~70%, and that five features can yield a prediction with ~75% accuracy. Involving more features through categorical encoding, feature engineering and hyperparameter optimization should give a better result.
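To judge these preliminary scores, it helps to compare them with a constant prediction of the majority class. A minimal sketch, assuming a synthetic imbalanced target (the 35% positive rate is an illustrative choice, not a figure from the notebook output); `DummyClassifier` is scikit-learn's baseline estimator.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(1)
# Hypothetical imbalanced target: about 35% positives
y = (rng.random(2000) < 0.35).astype(int)
X = np.zeros((len(y), 1))  # the dummy model ignores the features

# Always predict the most frequent class seen during fit
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X, y)
baseline_acc = accuracy_score(y, baseline.predict(X))
print("majority-class accuracy:", round(baseline_acc, 3))
```

Any model worth keeping should beat this number by a clear margin.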

### total_of_special_requests
Number of special requests made by the customer (e.g. twin bed or high floor)

In [None]:
train['total_of_special_requests'].value_counts(normalize=True).plot(kind='bar')

In [None]:
train['is_canceled'].groupby(train['total_of_special_requests']).value_counts(normalize=True)

In [None]:
sns.barplot(x='total_of_special_requests', y='is_canceled', data=train)

In [None]:
qlt_df['total_of_special_requests']=train['total_of_special_requests']
qlt_df.pivot_table(index='lead_time', columns='total_of_special_requests', values='is_canceled')

More special requests reduce the probability of cancelation 

### required_car_parking_spaces
The values of this feature raise several questions: 
* Is a parking space actually a type of special request, in the sense that it indicates planning and engagement?
* Are parking spaces correlated with the type of trip? For example, in business trips or resorts it is more common to travel without a car
* Does more than one parking space imply planning for many guests? This may increase the engagement but also the logistic complication

In [None]:
train['required_car_parking_spaces'].value_counts(normalize=True)

In [None]:
train['required_car_parking_spaces'].value_counts()

In [None]:
# Three or more parking spaces are too rare to form their own category,
# so every value above 1 is merged into category 2
train.loc[train['required_car_parking_spaces']>1,'required_car_parking_spaces']=2
val.loc[val['required_car_parking_spaces']>1,'required_car_parking_spaces']=2
test.loc[test['required_car_parking_spaces']>1,'required_car_parking_spaces']=2
train['required_car_parking_spaces'].value_counts()

In [None]:
train['is_canceled'].groupby(train['required_car_parking_spaces']).value_counts(normalize=True)

In [None]:
sns.barplot(x = 'required_car_parking_spaces',y ='is_canceled', data=train)

The only cancelations come from customers who do not ask for a parking space (93% of the population);
the 7% who ask for a parking space do not cancel

In [None]:
np.abs(train.corr(numeric_only=True)['required_car_parking_spaces']).sort_values(ascending=False)[1:7]

#### hotel

In [None]:
train['hotel'].groupby(train['required_car_parking_spaces']).value_counts()

In [None]:
train['required_car_parking_spaces'].groupby(train['hotel']).value_counts(normalize=True)

In [None]:
sns.barplot(x='required_car_parking_spaces', y='hotel', data=train)

In [None]:
# Check whether cancelation is correlated with the type of hotel, regardless of parking spaces
train['is_canceled'].groupby(train['hotel']).value_counts(normalize=True)

* Most reservations come without a parking space request (94%); requests for more than one parking space are rare, and more than two is an outlier
* Parking space requests are more common in resorts, however the correlation is not high
* The 7% who ask for a parking space do not cancel the booking

### booking_changes
Number of changes/amendments made to the booking from the moment the booking was entered on the PMS until the moment of check-in or cancellation

In [None]:
train['booking_changes'].value_counts(normalize=True).plot(kind='bar')

In [None]:
train['booking_changes'].value_counts(),train['booking_changes'].value_counts(normalize=True)

In [None]:
train.loc[train['booking_changes']>5,'booking_changes']=6
val.loc[val['booking_changes']>5,'booking_changes']=6
test.loc[test['booking_changes']>5,'booking_changes']=6
train['booking_changes'].value_counts(normalize=True)

In [None]:
train['is_canceled'].groupby(train['booking_changes']).value_counts(normalize=True)


In [None]:
sns.barplot(x='booking_changes',y='is_canceled',data=train)

It looks like there is no significant difference between one change and multiple changes as far as cancelation is concerned

In [None]:
train['is_change']=0
train.loc[train['booking_changes']>0,'is_change']=1
train[['is_change','booking_changes']][10:30]

In [None]:
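train.corr(numeric_only=True)[['is_change','booking_changes']]

In [None]:
# Features whose correlation differs most between is_change and booking_changes
corr_num = train.corr(numeric_only=True)
np.abs(corr_num['is_change']-corr_num['booking_changes']).sort_values(ascending=False)[2:]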

is_change is not very different from booking_changes; however, it is more correlated with cancelation and babies/adults and less correlated with stays_in_week_nights. It is worthwhile to check the correlation between adults/babies and week nights/weekend nights

In [None]:
# Replace booking_changes by is_change and apply changes to validation and test sets
val['is_change']=0
test['is_change']=0
val.loc[val['booking_changes']>0,'is_change']=1
test.loc[test['booking_changes']>0,'is_change']=1

train = train.drop('booking_changes',axis=1)
val = val.drop('booking_changes',axis=1)
test = test.drop('booking_changes',axis=1)

train.head()

In [None]:
train.corr(numeric_only=True)['is_change'].sort_values(ascending=False)[1:7]

In [None]:
train.corr(numeric_only=True)['is_change'].sort_values(ascending=True)[:7]

Changes are slightly positively correlated with babies, parking spaces and children, and negatively correlated with cancelations and adults.
It is worth checking whether the effect of changes on cancelation differs between bookings with and without babies

In [None]:
train.pivot_table(index='is_change', columns='babies',values='is_canceled')

Focusing on the population that makes changes, it does not matter if they have babies or not

### previous_cancellations
Number of previous bookings that were cancelled by the customer prior to the current booking

In [None]:
train['previous_cancellations'].value_counts().sort_index()

In [None]:
train['previous_cancellations'].value_counts().sort_index()[2:].plot()

In [None]:
train['previous_cancellations'].value_counts().sort_index()[3:].plot(kind='bar')

In [None]:
prc = train['previous_cancellations']
sns.barplot(x='previous_cancellations',y='is_canceled',data=train)

Cancelation clearly varies with previous_cancellations in a non-monotonic manner. It can be categorized into 4 populations: 
* 0 : with ~30% cancelations
* 1 : with ~90% cancelations! 
* 2 - 11 : with highly varying cancelations of 10%-50%
* more than 12 : with 100% cancelations

We could map the data accordingly to a new scale of categories (0, 1, 2, 3); however, the higher categories contain too few samples and the distribution does not make a lot of sense. It appears that another parameter may be involved.
We can engineer the feature so that it makes more sense by checking its correlation with other features

In [None]:
np.abs(train.corr(numeric_only=True)['previous_cancellations']).sort_values(ascending=False)[1:]

In [None]:
train['previous_bookings_not_canceled'].value_counts()

In [None]:
# we will create a new feature of the cancelation percentage
total_canc = train['previous_bookings_not_canceled'] + train['previous_cancellations']
train['previous_cancellation_per'] =  train['previous_cancellations'].div(total_canc)
train['previous_cancellation_per'] = train['previous_cancellation_per'].fillna(0)
train[['previous_cancellations','previous_bookings_not_canceled','previous_cancellation_per']]

In [None]:
train['previous_cancellation_per'].value_counts()

In [None]:
bins = [-1,0,0.5,1]
train['previous_cancellation_per'] =  pd.cut(train['previous_cancellation_per'], bins, labels = [0,1,2])
sns.barplot(x='previous_cancellation_per',y='is_canceled',data=train)

In [None]:
train['previous_cancellation_per'] = train['previous_cancellation_per'].astype('int64')
train.info()

In [None]:
np.abs(train.corr(numeric_only=True)[['previous_cancellation_per','previous_cancellations']]).sort_values(ascending=False,by='previous_cancellation_per')
train[['previous_cancellation_per','previous_cancellations','previous_bookings_not_canceled','is_canceled']].corr()

The new feature 'previous_cancellation_per', constructed from 'previous_cancellations' and 'previous_bookings_not_canceled', correlates much better with the target 'is_canceled'. This comes as no surprise, as its distribution in relation to cancelation makes more sense. 'previous_bookings_not_canceled' still holds some information, whereas 'previous_cancellations' can only confuse the model; therefore we will use 'previous_cancellation_per' in place of 'previous_cancellations'.
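A small worked example may help to see how the ratio and its three buckets behave. The booking histories below are hypothetical; the bin edges are the ones used in this notebook.

```python
import pandas as pd

# Hypothetical booking histories of five customers
hist = pd.DataFrame({
    "previous_cancellations":         [0, 0, 1, 1, 3],
    "previous_bookings_not_canceled": [0, 4, 3, 0, 1],
})

total = hist["previous_cancellations"] + hist["previous_bookings_not_canceled"]
# 0/0 gives NaN for customers without any history; treat them as ratio 0
ratio = hist["previous_cancellations"].div(total).fillna(0)

# Intervals: (-1, 0] -> 0, (0, 0.5] -> 1, (0.5, 1] -> 2
hist["previous_cancellation_per"] = pd.cut(
    ratio, [-1, 0, 0.5, 1], labels=[0, 1, 2]).astype("int64")
print(hist)
```

Customers with no history and customers who never canceled both land in bucket 0, so the feature separates "canceled sometimes" (1) from "canceled most of the time" (2).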

In [None]:
def add_previous_cancelation_per(df):
    total_canc = df['previous_bookings_not_canceled'] + df['previous_cancellations']
    df['previous_cancellation_per'] =  df['previous_cancellations'].div(total_canc)
    df['previous_cancellation_per'] = df['previous_cancellation_per'].fillna(0)
    bins = [-1,0,0.5,1]
    df['previous_cancellation_per'] =  pd.cut(df['previous_cancellation_per'], bins, labels = [0,1,2])
    df['previous_cancellation_per'] = df['previous_cancellation_per'].astype('int64')
    df = df.drop('previous_cancellations',axis=1)
    return df

val = add_previous_cancelation_per(val)
test = add_previous_cancelation_per(test)
train = train.drop('previous_cancellations',axis=1)

In [None]:
val['previous_cancellation_per'].value_counts()

### is_repeated_guest

In [None]:
train['is_repeated_guest'].value_counts(normalize=True)

In [None]:
train['is_canceled'].groupby(train['is_repeated_guest']).value_counts(normalize=True)

In [None]:
sns.barplot(x='is_repeated_guest',y='is_canceled',data=train)

Repeated guests tend to cancel less

## Other Numerical Features

In [None]:
num_feat = list(train.columns[(train.dtypes.values=='int64')|(train.dtypes.values=='float64')])
non_num_feat = list(train.columns[(train.dtypes.values!='int64')&(train.dtypes.values!='float64')])
num_feat, non_num_feat

In [None]:
if 'is_canceled' in num_feat:
    num_feat.remove('is_canceled')
if 'is_change' in num_feat:
    num_feat.remove('is_change')
if 'previous_cancellation_per' in num_feat:
    num_feat.remove('previous_cancellation_per')
if 'previous_bookings_not_canceled' in num_feat:
    num_feat.remove('previous_bookings_not_canceled')
    
for item in list(most_corr.index):
    if item in num_feat:
        num_feat.remove(item)
num_feat

In [None]:
train[num_feat]

The rest of the columns can be classified into four groups:
* date (year, week number, day)
* stays in week/weekend nights
* number of guests (adults/children/babies)
* adr (average daily rate)

In [None]:
train[num_feat].isna().sum()

The four missing values are insignificant; however, as they are the only missing values left to fill, I will demonstrate data cleaning on them by correlating with another feature

### Children

In [None]:
np.abs(train.corr(numeric_only=True)['children']).sort_values(ascending=False)[1:6]

Most correlated feature: adr
* adr : Average Daily Rate as defined by dividing the sum of all lodging transactions by the total number of staying nights

In [None]:
train['adr'].groupby(train['children']).median()

In [None]:
sns.barplot(x='children', y='adr', data=train)

In the absence of a better measure and given the fact that the 'adr' is extremely low we will predict children=0 for all na samples

In [None]:
train.loc[train['children'].isnull(),'children']=0
train.isna().sum()

## Non-numerical features
Non-numerical features were not considered when the most correlated features were calculated\
The category encoders that may be used:
* Simple encoding for ordinal features -> 0,1,2...
* get_dummies, split a category to binary features
* Target encoding - mean value of a category in relation to the target, to avoid overfitting we will use Leave-One-Out
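Leave-one-out target encoding is not implemented in this notebook, so a sketch may make the idea concrete: each row is encoded with the mean target of the other rows in the same category, so a row's own label never enters its encoding. The helper `loo_target_encode` and the toy data are hypothetical, written with plain pandas rather than a dedicated encoder library.

```python
import pandas as pd

def loo_target_encode(categories: pd.Series, target: pd.Series) -> pd.Series:
    """Mean target of all *other* rows sharing the row's category."""
    grp = target.groupby(categories)
    cat_sum = grp.transform("sum")
    cat_count = grp.transform("count")
    # Remove the row's own target from the category statistics
    encoded = (cat_sum - target) / (cat_count - 1)
    # Categories seen only once have no other rows: fall back to the global mean
    return encoded.fillna(target.mean())

toy = pd.DataFrame({
    "deposit_type": ["No Deposit", "No Deposit", "No Deposit",
                     "Non Refund", "Non Refund", "Refundable"],
    "is_canceled":  [0, 1, 0, 1, 1, 0],
})
toy["deposit_type_loo"] = loo_target_encode(toy["deposit_type"], toy["is_canceled"])
print(toy)
```

This form applies to training rows only; validation and test rows would typically receive the plain per-category mean computed on the training set.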

The features are:
* hotel : Hotel (H1 = Resort Hotel or H2 = City Hotel)
* arrival_date_month : Month of arrival date
* meal : Type of meal booked. Categories are presented in standard hospitality meal packages: Undefined/SC – no meal package; BB – Bed & Breakfast; HB – Half board (breakfast and one other meal – usually dinner); FB – Full board (breakfast, lunch and dinner)
* market_segment : Market segment designation. In categories, the term “TA” means “Travel Agents” and “TO” means “Tour Operators”
* distribution_channel: Booking distribution channel. The term “TA” means “Travel Agents” and “TO” means “Tour Operators”
* reserved_room_type : Code of room type reserved. Code is presented instead of designation for anonymity reasons.
* assigned_room_type : Code for the type of room assigned to the booking. Sometimes the assigned room type differs from the reserved room type due to hotel operation reasons (e.g. overbooking) or by customer request. Code is presented instead of designation for anonymity reasons.
* deposit_type : Indication on if the customer made a deposit to guarantee the booking. This variable can assume three categories: No Deposit – no deposit was made; Non Refund – a deposit was made in the value of the total stay cost; Refundable – a deposit was made with a value under the total cost of stay.
* customer_type : Type of booking, assuming one of four categories: Contract - when the booking has an allotment or other type of contract associated to it; Group – when the booking is associated to a group; Transient – when the booking is not part of a group or contract, and is not associated to other transient booking; Transient-party – when the booking is transient, but is associated to at least other transient booking
* reservation_status : Reservation last status, assuming one of three categories: Canceled – booking was canceled by the customer; Check-Out – customer has checked in but already departed; No-Show – customer did not check-in and did inform the hotel of the reason why

In [None]:
cat_flag = False
train[non_num_feat].head()

In [None]:
# Create a dataframe of all the unique categories
cat_df = pd.DataFrame(index = non_num_feat, columns = ['Unique Values', 'Number of Categories'])
for feature in non_num_feat:
    cat_df.loc[feature,'Unique Values'] = train[feature].unique()
    cat_df.loc[feature,'Number of Categories'] = len(train[feature].unique())
cat_df.drop('reservation_status_date', axis=0, inplace=True)
cat_df

In [None]:
train['deposit_type'].value_counts()

In [None]:
sns.barplot(x='deposit_type',y='is_canceled', data = train)

In [None]:
train['reservation_status'].value_counts()

In [None]:
sns.barplot(x='reservation_status', y='is_canceled', data = train)

The correlation between the target and reservation_status is much too high, suggesting that the feature is effectively a restatement of the target. Since it would cause data leakage, we will remove it from the data later
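One way to check such a suspicion is to cross-tabulate the feature against the target: if every category maps to a single target value, the feature determines the target. A minimal sketch on hypothetical rows:

```python
import pandas as pd

# Hypothetical rows, not taken from the real table
toy = pd.DataFrame({
    "reservation_status": ["Canceled", "Check-Out", "No-Show", "Check-Out", "Canceled"],
    "is_canceled":        [1, 0, 1, 0, 1],
})

# Rows: feature categories, columns: target values
table = pd.crosstab(toy["reservation_status"], toy["is_canceled"])

# A category is "pure" if all of its bookings share one target value
is_pure = (table > 0).sum(axis=1) == 1
print(table)
print("feature determines target:", bool(is_pure.all()))
```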

In [None]:
def dummies(df,var,prefix=None):
    dummies = pd.get_dummies(df[var], prefix = prefix)
    df = pd.concat([df, dummies], axis=1)
    df = df.drop(var, axis=1)
    return df

def set_cat_feat(df):

    #Ordinal/binary parameters   
    month = {
        'January':1, 'February':2, 'March':3, 'April':4, 'May':5, 'June':6, 'July':7,\
        'August':8, 'September':9, 'October':10, 'November':11, 'December':12}
    hotel = { 'Resort Hotel' : 0, 'City Hotel' : 1}
    df['arrival_date_month'] = df['arrival_date_month'].map(month)
    df['hotel'] = df['hotel'].map(hotel)
    
    df['assigned/reserved'] = 0
    df.loc[df['reserved_room_type']==df['assigned_room_type'],'assigned/reserved']=1
    df = df.drop('reserved_room_type', axis=1)
    df = df.drop('assigned_room_type', axis =1)
    
    df = df.drop('reservation_status', axis=1)
    
    dummy_feat = ['meal','market_segment','distribution_channel','customer_type','deposit_type']
    dummy_prefix=['meal','MS','DC','CT','DT']
    
    # get_dummies for parameters with a few categories, as we face classification problem no need to drop any category
    for i in range(len(dummy_feat)):
        df = dummies(df,dummy_feat[i],dummy_prefix[i])

    return df

In [None]:
temp = train.loc[:,non_num_feat+ ['is_canceled']]
temp

In [None]:
# Apply changes in all the datasets
if cat_flag == False:
    print('Perform encoding...')
    train = set_cat_feat(train)
    val = set_cat_feat(val)
    test = set_cat_feat(test)
    cat_flag=True

train.head(10)

In [None]:
train.info()

### Dates
reservation_status_date 
* reservation_status_date : Date at which the last status was set. It can be used together with reservation_status to understand when the booking was canceled or when the customer checked out

In [None]:
object_list = train.dtypes.index[train.dtypes=='object'].values
object_list

In [None]:
object_feat = object_list[0] # reservation_status_date
minibatch = train.loc[:,[object_feat]]
minibatch

In [None]:
minibatch[['res_year','res_month','res_day']] = minibatch[object_feat].str.split('-', expand=True).astype(int)
minibatch

In [None]:
minibatch['res_year'].value_counts(normalize=True)

In [None]:
train['arrival_date_year' ].value_counts(normalize=True)

In [None]:
def adjust_dates(df):
    object_feat = 'reservation_status_date'
    df[['res_year','res_month','res_day']] = df[object_feat].str.split('-', expand=True).astype(int)
    df = df.drop(object_feat, axis = 1)
    years_dict = {2014:1,2015:2,2016:3,2017:4}
    df['res_year'] = df['res_year'].map(years_dict)
    df['arrival_date_year'] = df['arrival_date_year'].map(years_dict)
    
    return df

In [None]:
train = adjust_dates(train)
val = adjust_dates(val)
test = adjust_dates(test) 

## Final Preprocessing

In [None]:
train.info()

In [None]:
# Columns present in train but absent from test / val (e.g. dummies of rare categories)
missing_test = [col for col in train.columns if col not in test.columns]
missing_val = [col for col in train.columns if col not in val.columns]
missing_test, missing_val

In [None]:
for col in missing_val:
    val[col]=0
for col in missing_test:
    test[col]=0

test.head()

In [None]:
# Re-sort the columns so that all three sets share the same column order
train = train.reindex(sorted(train.columns), axis=1)
val = val.reindex(sorted(train.columns), axis=1)
test = test.reindex(sorted(train.columns), axis=1)
train.head()

In [None]:
val.head()

In [None]:
np.abs(train.corr()['is_canceled']).sort_values(ascending = False)

In [None]:
df_prediction

The best correlation is with the deposit type (DT) dummies, where "Non Refund" and "No Deposit" are the most informative categories

In [None]:
score_deposit = np.round((np.sum(train['DT_Non Refund']==train['is_canceled'])/len(train))*100,2)
df_prediction.loc['Top-1 Cat. Feat','Score']=score_deposit
score_deposit

In [None]:
#feat_to_remove = train.columns[["RS" in x for x in train.columns]].values
train.columns.shape, val.columns.shape, test.columns.shape

Upon running the PCA for the first time, set 'n_components' to 'None' and then evaluate the 'explained_variance' variable for choosing the optimal number of n_components. In this case, 100 should be fine.
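As an alternative to reading the number of components off the curve by hand, scikit-learn's `PCA` accepts a fraction between 0 and 1 as `n_components` (with `svd_solver="full"`) and keeps the smallest number of components whose cumulative explained variance exceeds it. A minimal sketch on random data; the data and the 0.99 target are assumptions for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Hypothetical data: 20 features driven by 5 latent factors plus small noise
latent = rng.normal(size=(500, 5))
X_toy = latent @ rng.normal(size=(5, 20)) + 0.01 * rng.normal(size=(500, 20))

# Manual route: cumulative explained-variance curve
full = PCA(n_components=None).fit(X_toy)
curve = np.cumsum(full.explained_variance_ratio_)
n_manual = int(np.argmax(curve > 0.99)) + 1  # +1 converts an index to a count

# Built-in route: a float asks PCA to pick the count itself
auto = PCA(n_components=0.99, svd_solver="full").fit(X_toy)
print(n_manual, auto.n_components_)
```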

# 3. Model

In [None]:
def get_Xy(df,target):
    X = df.drop(target, axis=1) 
    y = df[target]
    return X,y

In [None]:
X_train, y_train = get_Xy(train,'is_canceled')
print('X dim = {}, y dim = {}'.format(X_train.shape, y_train.shape))
print(y_train[:5])

In [None]:
X_test, y_test = get_Xy(test, 'is_canceled')
X_val, y_val = get_Xy(val, 'is_canceled')

In [None]:
# Normalize input matrix
sc = preprocessing.StandardScaler()
X_train = sc.fit_transform(X_train)
X_val = sc.transform(X_val)
X_test = sc.transform(X_test)

In [None]:
X_test.shape, X_val.shape

### Model Selection

#### Random Forest

In [None]:
# We will start by running Random Forest with default hyperparameters
rf = RandomForestClassifier()
rf.fit(X_train, y_train)

In [None]:
rf_pred = rf.predict(X_val)
print_score(rf_pred, y_val, 'Random Forest')

#### Gradient Boosting

In [None]:
gb = GradientBoostingClassifier()
gb.fit(X_train, y_train)

In [None]:
gb_pred = gb.predict(X_val)
print_score(gb_pred, y_val, 'Gradient Boosting')

#### Logistic Regression

In [None]:
log = LogisticRegression()
log.fit(X_train, y_train)

In [None]:
log_pred = log.predict(X_val)
print_score(log_pred, y_val, 'Logistic Regression')

#### Dimensionality Reduction

In [None]:
from sklearn.decomposition import PCA

pca = PCA(n_components = None)
pca.fit(X_train)  # fit() returns the estimator itself, so there is nothing to assign

target_var = 0.99
explained_variance = pca.explained_variance_ratio_
ev_curve = np.cumsum(explained_variance)
plt.plot(ev_curve)
plt.plot(np.arange(len(explained_variance)),np.ones(len(explained_variance))*target_var, color='red')

In [None]:
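# ev_curve[i] is the variance explained by the first i+1 components,
# so the component count is the first index above the target plus one
n_components = int(np.argmax(ev_curve > target_var)) + 1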
n_components

In [None]:
pca = PCA(n_components = n_components)
Xp_train = pca.fit_transform(X_train)
Xp_test = pca.transform(X_test)
Xp_val = pca.transform(X_val)

In [None]:
rfp = RandomForestClassifier()
rfp.fit(Xp_train, y_train)

In [None]:
rfp_pred = rfp.predict(Xp_val)
print_score(rfp_pred, y_val, 'Random Forest (PCA)')

In [None]:
log_p = LogisticRegression()
log_p.fit(Xp_train, y_train)

In [None]:
log_p_pred = log_p.predict(Xp_val)
print_score(log_p_pred, y_val, 'Logistic Regression (PCA)')

At this point PCA doesn't seem very effective, especially for logistic regression, therefore we'll proceed with the original set of features

### Model Tuning

In [None]:
rf_model = RandomForestClassifier()
#Run a gridsearch
rf_params = {"max_depth": [10,20,30,40],
            "max_features": [10,20,35],
            "n_estimators": [10,500,1000],
            "min_samples_split": [2,5,10]}
            
rf_val = GridSearchCV(rf_model, 
                           rf_params, 
                           cv = 5, 
                           n_jobs = -1, 
                           verbose = 2) 

rf_val.fit(X_val, y_val)

In [None]:
rf_val.best_params_

In [None]:
rf_tuned = RandomForestClassifier(max_depth = rf_val.best_params_.get('max_depth'), 
                                  max_features = rf_val.best_params_.get('max_features'), 
                                  min_samples_split = rf_val.best_params_.get('min_samples_split'),
                                  n_estimators = rf_val.best_params_.get('n_estimators'))

rf_tuned.fit(X_train, y_train)

In [None]:
#Evaluation on Test set
rft_pred = rf_tuned.predict(X_test)
print_score(rft_pred,y_test,'Random Forest (tuned)')

In [None]:
C = np.logspace(2, 8, 4)
penalty = ['l1', 'l2']
max_iter = [100, 200, 500]
#log_params = dict(C=C, penalty=penalty, max_iter=max_iter) 
log_params = dict(C=C, penalty=['l2'], solver = ['lbfgs'], max_iter = max_iter) 
log_params

In [None]:
log_model = LogisticRegression()
#Run a gridsearch  
#log_val = GridSearchCV(log_model, log_params, cv=5, verbose=0)
log_val = GridSearchCV(log_model, log_params, cv=5, verbose=0)

log_val.fit(X_val, y_val)

In [None]:
log_val.best_params_

In [None]:
log_val.best_params_.get('C')

In [None]:
log_tuned = LogisticRegression(C=log_val.best_params_.get('C'), max_iter=log_val.best_params_.get('max_iter'), solver= 'lbfgs')

log_tuned.fit(X_train, y_train)

In [None]:
#Evaluation on Test set
log_t_pred = log_tuned.predict(X_test)
print_score(log_t_pred,y_test,'Logistic Regression (tuned)')

### Evaluation Summary

In [None]:
df_prediction

In [None]:
df_prediction.plot(kind='bar')

In [None]:
Importance = pd.DataFrame( {"Importance": rf_tuned.feature_importances_*100},
                         index = train.drop('is_canceled',axis=1).columns)
Importance.sort_values(by = "Importance", axis = 0, ascending = False)[:10].plot(kind ="barh", color = "r")

plt.xlabel("Variable Importance Level")

Check the wrongly classified examples

In [None]:
wrong = []
for i in np.arange(len(y_test)):
    if y_test.iloc[i]!=log_t_pred[i]:
        wrong.append(i)

In [None]:
feat_ranked = Importance.sort_values(by = "Importance", axis = 0, ascending = True).index.values
#pd.concat([test[feat_ranked],log_t_pred],axis=1)
df_wrong = test[feat_ranked].iloc[wrong].copy()  # copy so the new columns do not trigger SettingWithCopyWarning
df_wrong['Prediction'] = log_t_pred[wrong]
df_wrong['Target'] = test['is_canceled'].iloc[wrong]
df_wrong

### Conclusions
Logistic Regression was found to be the best model for cancellation prediction. Hyperparameter tuning provided additional improvement, while dimensionality reduction using PCA seemed to be counter-productive.
The model was selected using the validation set and training was performed on the training set; the test set was not used or analyzed until the final evaluation, which indicated more than 99.99% accuracy. The set of wrongly classified examples (7) contains mostly false positive predictions, in which none of the customers had a non-refundable deposit status (the most important feature) and the reservation was made on the day of arrival.
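
The split of errors into false positives and false negatives can be read directly from the confusion matrix. A minimal sketch with hypothetical labels: in scikit-learn's `confusion_matrix(y_true, y_pred)` rows are true classes and columns are predicted classes, so for a binary problem `ravel()` yields TN, FP, FN, TP in that order.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical labels: 1 = canceled, 0 = not canceled
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 0])
y_pred = np.array([0, 1, 0, 1, 1, 1, 0, 0])

# Flatten the 2x2 matrix row by row: [[TN, FP], [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"TN={tn} FP={fp} FN={fn} TP={tp}")
```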