# 1. Info

This notebook contains the feature importance analysis for the hotel cancelation dataset.

Before running this notebook, you should have run the "01_data_preparation.ipynb" notebook.

# 2. Feature importance analysis

## 2.1. Import Libraries

In [2]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import mutual_info_score
from IPython.display import display

## 2.2. Read the data

In [3]:
data = pd.read_csv('../data/Hotel_Cancelations.csv')

In [4]:
# Booking_ID is a unique identifier with no predictive value; `columns=` already implies axis=1.
data = data.drop(columns=['Booking_ID'])

## 2.3. Setting up the validation framework

In [5]:
categorical_variables = ['type_of_meal_plan','room_type_reserved','market_segment_type']
numerical_variables = ['no_of_adults','no_of_children','no_of_weekend_nights','no_of_week_nights','required_car_parking_space','lead_time','arrival_year','arrival_month','arrival_date','repeated_guest',
'no_of_previous_cancellations','no_of_previous_bookings_not_canceled','avg_price_per_room','no_of_special_requests']
target_variable = ['booking_status']


In [8]:
# Hold out 20% for testing, then take 25% of the remaining 80% for validation (60/20/20 overall).
df_full_train, df_test = train_test_split(data, test_size=0.2, random_state=1)
df_train, df_val = train_test_split(df_full_train, test_size=0.25, random_state=1)

print(len(df_train))
print(len(df_test))
print(len(df_val))

df_train = df_train.reset_index(drop=True)
df_val = df_val.reset_index(drop=True)
df_test = df_test.reset_index(drop=True)

y_train = df_train.booking_status.values
y_val = df_val.booking_status.values
y_test = df_test.booking_status.values

del df_train['booking_status']
del df_val['booking_status']
del df_test['booking_status']

21765
7255
7255
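
The two-step split above gives roughly 60/20/20 train/validation/test proportions, because 25% of the remaining 80% is 20% of the whole. The following is a minimal sketch that checks this arithmetic on a synthetic frame; the `toy` DataFrame and the `toy_*` names are hypothetical stand-ins, not the hotel data.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the real dataset: 1,000 rows, one column.
toy = pd.DataFrame({"x": range(1000)})

# Step 1: hold out 20% of the rows as the test set.
toy_full_train, toy_test = train_test_split(toy, test_size=0.2, random_state=1)

# Step 2: 25% of the remaining 80% equals 20% of the original rows.
toy_train, toy_val = train_test_split(toy_full_train, test_size=0.25, random_state=1)

split_sizes = (len(toy_train), len(toy_val), len(toy_test))
print(split_sizes)  # (600, 200, 200)
```

The same proportions applied to the 36,275 bookings give the 21,765 / 7,255 / 7,255 sizes printed above.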


## 2.4. Difference and risk ratio

In [9]:
global_cancelation = data.booking_status.mean()

In [11]:
for cat in categorical_variables:
    print(cat)
    df_group = data.groupby(cat).booking_status.agg(['mean', 'count'])
    df_group['diff'] = df_group['mean'] - global_cancelation
    df_group['risk_ratio'] = df_group['mean'] / global_cancelation
    display(df_group)
    print()
    print()

type_of_meal_plan


| type_of_meal_plan | mean | count | diff | risk_ratio |
|---|---|---|---|---|
| 0 | 0.311802 | 27835 | -0.015834 | 0.951671 |
| 1 | 0.331189 | 5130 | 0.003553 | 1.010844 |
| 2 | 0.455673 | 3305 | 0.128037 | 1.390791 |
| 3 | 0.2 | 5 | -0.127636 | 0.610433 |




room_type_reserved


| room_type_reserved | mean | count | diff | risk_ratio |
|---|---|---|---|---|
| 0 | 0.322503 | 28130 | -0.005133 | 0.984332 |
| 1 | 0.341588 | 6057 | 0.013952 | 1.042584 |
| 2 | 0.32948 | 692 | 0.001844 | 1.005627 |
| 3 | 0.42029 | 966 | 0.092654 | 1.282795 |
| 4 | 0.271698 | 265 | -0.055938 | 0.829268 |
| 5 | 0.227848 | 158 | -0.099788 | 0.69543 |
| 6 | 0.285714 | 7 | -0.041922 | 0.872048 |




market_segment_type


| market_segment_type | mean | count | diff | risk_ratio |
|---|---|---|---|---|
| 0 | 0.299487 | 10528 | -0.028149 | 0.914084 |
| 1 | 0.365081 | 23214 | 0.037445 | 1.114289 |
| 2 | 0.109073 | 2017 | -0.218563 | 0.332909 |
| 3 | 0.296 | 125 | -0.031636 | 0.903441 |
| 4 | 0.0 | 391 | -0.327636 | 0.0 |
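
To make the two columns easier to read: `diff` is the group's cancelation rate minus the global rate, and `risk_ratio` is the group's rate divided by the global rate. The sketch below reproduces the calculation on a tiny, made-up table; the `segment` column and its values are hypothetical, and it assumes `booking_status` is encoded as 1 for a canceled booking, as the use of its mean as a cancelation rate suggests.

```python
import pandas as pd

# Hypothetical toy data: 1 = canceled, 0 = not canceled (assumed encoding).
toy = pd.DataFrame({
    "segment": ["a", "a", "a", "a", "b", "b", "b", "b"],
    "booking_status": [1, 1, 1, 0, 0, 0, 0, 1],
})

global_rate = toy.booking_status.mean()  # 4 cancelations out of 8 bookings = 0.5

summary = toy.groupby("segment").booking_status.agg(["mean", "count"])
summary["diff"] = summary["mean"] - global_rate        # > 0: cancels more often than average
summary["risk_ratio"] = summary["mean"] / global_rate  # > 1: cancels more often than average
print(summary)
```

Here segment `a` cancels 75% of the time (diff 0.25, risk ratio 1.5) and segment `b` 25% of the time (diff -0.25, risk ratio 0.5). Groups with very few rows, such as meal plan 3 with 5 bookings, give unstable rates, so `count` should be read alongside the ratio.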






## 2.5. Mutual information (Categorical variables)

In [12]:
def mutual_info_cancelation_score(series):
    return mutual_info_score(series, df_full_train.booking_status)

In [13]:
mi = df_full_train[categorical_variables].apply(mutual_info_cancelation_score)
mi.sort_values(ascending=False)

market_segment_type    0.013890
type_of_meal_plan      0.003861
room_type_reserved     0.000769
dtype: float64

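The only categorical feature that will be considered for model selection is market_segment_type: its mutual information with the target (0.0139) is more than three times that of type_of_meal_plan, while room_type_reserved is close to zero.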
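
As a rough illustration of what these scores mean, the sketch below applies `mutual_info_score` to two made-up integer-coded features: one that fully determines the target and one that is independent of it. All arrays are hypothetical; only the function itself comes from the notebook.

```python
from sklearn.metrics import mutual_info_score

# Hypothetical binary target (1 = canceled).
target = [0, 0, 1, 1, 0, 0, 1, 1]

# This feature always takes the same value as the target, so it is fully informative.
informative = [0, 0, 1, 1, 0, 0, 1, 1]

# This feature is evenly spread across both target classes, so it carries no information.
uninformative = [0, 1, 0, 1, 0, 1, 0, 1]

mi_informative = mutual_info_score(informative, target)
mi_uninformative = mutual_info_score(uninformative, target)

print(mi_informative)    # about ln(2) = 0.693, the entropy of a balanced binary target
print(mi_uninformative)  # about 0
```

The score is measured in nats and is bounded above by the entropy of the target, so values such as 0.0139 indicate a weak dependence in absolute terms. The ranking between features is what drives the selection here.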

## 2.6 Correlation (Numerical variables)

In [14]:
numerical_corr = df_full_train[numerical_variables].corrwith(df_full_train.booking_status).abs().reset_index()
numerical_corr.columns = ['features','corr']
numerical_corr.sort_values(by=['corr'], ascending=False)

| | features | corr |
|---|---|---|
| 5 | lead_time | 0.438241 |
| 13 | no_of_special_requests | 0.251734 |
| 6 | arrival_year | 0.180817 |
| 12 | avg_price_per_room | 0.146501 |
| 9 | repeated_guest | 0.107673 |
| 0 | no_of_adults | 0.091589 |
| 3 | no_of_week_nights | 0.091316 |
| 4 | required_car_parking_space | 0.088054 |
| 2 | no_of_weekend_nights | 0.062028 |
| 11 | no_of_previous_bookings_not_canceled | 0.060505 |


In [15]:
numerical_corr.query(f"corr > {numerical_corr['corr'].mean()}")['features'].values

array(['lead_time', 'arrival_year', 'avg_price_per_room',
       'no_of_special_requests'], dtype=object)

lead_time, arrival_year, avg_price_per_room, and no_of_special_requests are the numerical features that will be considered for model selection, because their absolute correlation with the target is above the mean across all numerical features.
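
The selection rule used above (keep features whose absolute correlation with the target exceeds the mean absolute correlation) can be sketched on a small made-up frame. The column names `strong`, `weak`, and `inverse` are hypothetical and chosen only to show that taking the absolute value keeps negatively correlated features too.

```python
import pandas as pd

# Hypothetical toy data with a binary target.
toy = pd.DataFrame({
    "strong": [1, 2, 3, 4, 5, 6],    # rises with the target
    "weak": [1, 2, 1, 2, 1, 2],      # barely related to the target
    "inverse": [6, 5, 4, 3, 2, 1],   # falls as the target rises
    "booking_status": [0, 0, 0, 1, 1, 1],
})
features = ["strong", "weak", "inverse"]

# Absolute Pearson correlation of each feature with the target.
corr = toy[features].corrwith(toy.booking_status).abs().reset_index()
corr.columns = ["features", "corr"]

# Keep the features whose absolute correlation is above the mean.
threshold = corr["corr"].mean()
selected = corr.loc[corr["corr"] > threshold, "features"].tolist()
print(selected)  # ['strong', 'inverse']
```

Because the threshold is relative to the feature set, adding or removing columns changes which features pass. Pearson correlation also captures only linear association, so a feature below the threshold may still be useful to a non-linear model.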

End of notebook.