This notebook contains three parts:
* the state of hotels
* the analysis of variables related to cancelation
* modelling details

The dataset has 32 variables. The task is to predict whether a hotel booking will be canceled, which means that every predictor must be known before the customer checks in or cancels the booking, **so I drop the variables reservation_status and reservation_status_date.**

There are 4 variables with missing values. 

* For the children, agent and company variables, a missing value means that no child, agent or company is associated with the booking, so missing values can be filled with 0.
* For the country variable, the number of missing values is small, so the affected rows can be dropped.
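
The two rules above can be sketched on a small made-up frame. The `toy` table and its values are assumptions for illustration only, not rows from the real dataset:

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the bookings table (values are made up).
toy = pd.DataFrame({
    "children": [0.0, np.nan, 2.0],
    "agent": [9.0, np.nan, 240.0],
    "company": [np.nan, 40.0, np.nan],
    "country": ["PRT", None, "GBR"],
})

# A missing child/agent/company is read as "none", so fill with 0.
fill_cols = ["children", "agent", "company"]
toy[fill_cols] = toy[fill_cols].fillna(0)

# Rows without a country are rare, so they are dropped.
toy = toy.dropna(subset=["country"]).reset_index(drop=True)

print(toy)
```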

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
from matplotlib.ticker import MultipleLocator
import seaborn as sns


# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
file = pd.read_csv('/kaggle/input/hotel-booking-demand/hotel_bookings.csv')

file.info()

data = file.copy()
data = data.drop(['reservation_status','reservation_status_date'], axis = 1)
data['agent'] = data['agent'].fillna(0)
data['children'] = data['children'].fillna(0) # fill up children value
data = data.dropna(subset=['country']) # drop country rows
data = data.reset_index(drop=True)


# 1. State of hotels

In this part, I'll analyze the state of the Resort Hotel and the City Hotel in 3 ways:
* the total number of bookings.
* the sum of all lodging transactions. It can be calculated from the definition of the adr variable (adr multiplied by the number of nights stayed).
* the ratio of canceled bookings.
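
As a rough sketch of these three metrics, a single named aggregation per month can compute them together. The toy rows and the `nights` column name are assumptions; the notebook builds the equivalent columns from the real data:

```python
import pandas as pd

# Toy bookings; 'time' mimics a year+month key such as '201507'.
toy = pd.DataFrame({
    "time": ["201507", "201507", "201508", "201508", "201508"],
    "is_canceled": [0, 1, 0, 0, 1],
    "adr": [100.0, 80.0, 120.0, 90.0, 110.0],
    "nights": [2, 3, 1, 4, 2],
})

# adr is an average daily rate, so adr * nights approximates the booking value.
toy["total_transactions"] = toy["adr"] * toy["nights"]

monthly = toy.groupby("time").agg(
    total=("is_canceled", "size"),            # number of bookings
    cancelation=("is_canceled", "mean"),      # share of canceled bookings
    total_transactions=("total_transactions", "sum"),
)
print(monthly)
```

Because `is_canceled` is a 0/1 flag, its mean is the cancelation ratio, which avoids building separate count columns for each class.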

From the figures, it can be concluded that:
* Overall, the City Hotel's total number of bookings is higher than the Resort Hotel's, and its inverted U shape is more pronounced. It has also increased compared with the same period of the previous year.
* The Resort Hotel's sum of all lodging transactions follows a clear inverted V shape. I guess there is a big price difference between busy and idle periods in the Resort Hotel, while the price in the City Hotel is stable.
* The City Hotel's cancelation ratio is higher than the Resort Hotel's. It has also increased compared with the same period of the previous year.


**For the hotels, the increase in the total number of bookings and in the sum of all lodging transactions is good, but the high cancelation ratio is a problem.**

Next, I'll analyze variables related to the cancelation.

In [None]:
data['new_month'] = data['arrival_date_month'].map({'July':'07', 'August':'08', 'September':'09', 'October':'10', 'November':'11', 'December':'12',
       'January':'01', 'February':'02', 'March':'03', 'April':'04', 'May':'05', 'June':'06'})
data['total_stay_night'] = data['stays_in_weekend_nights']+data['stays_in_week_nights']
data['total_transactions'] = data['adr']*data['total_stay_night']
data['time'] = data['arrival_date_year'].apply(str)+data['new_month']

def HotelSituation(dfvalue):
    temp_df1 = dfvalue.groupby(['time'])['is_canceled'].value_counts().unstack()
    temp_df1['total'] = temp_df1[0]+temp_df1[1] # the total number of booking
    temp_df1['cancelation'] = temp_df1[1]/temp_df1['total'] # the ratio of booking's cancelation
    temp_df2 = dfvalue.groupby(['time'])[['total_transactions']].sum() #the sum of all lodging transactions
    temp_df = pd.concat([temp_df1,temp_df2],axis=1)
    return temp_df

city_hotel_data = data[data['hotel']=='City Hotel']
resort_hotel_data = data[data['hotel']=='Resort Hotel']
city_hotel_data_df = HotelSituation(city_hotel_data)
resort_hotel_data_df = HotelSituation(resort_hotel_data)

# plot
k = 1
for column in ('total','total_transactions','cancelation'):
    y1 = [z for z in city_hotel_data_df[column].values]
    y2 = [z for z in resort_hotel_data_df[column].values]
    x = [str(z) for z in city_hotel_data_df.index.values]
    ax = plt.subplot(3,1,k)
    rects1 = ax.plot(x, y1, label='City Hotel')
    rects1 = ax.plot(x, y2, label='Resort Hotel')
    ax.set_ylabel(column)
    ax.set_title(column+' variable changes as time goes by', fontdict={'weight': 'normal', 'size': 8})
    plt.xticks(fontsize='x-small',rotation=30)
    plt.subplots_adjust(hspace=1,right=0.8)
    plt.legend(loc=(1.02,0))
    k = k+1

plt.show()

# 2. Analysis of different variables

## 2.1 Variables about hotels

For hotels, the variables can be sorted into 3 groups:
* room type. According to the definitions, I think assigned_room_type is closer to customers' actual situation than reserved_room_type. Also, based on the adr and assigned_room_type variables, the average transaction per room type can be calculated (average transaction per room type = sum of adr*total_stay_night over the bookings of a room type / the number of bookings of that room type).
* booking response. The days_in_waiting_list variable can indicate the hotels' booking response speed.
* distribution channel. It is the distribution_channel variable in the dataset.

### 2.1.1 Room Type

It can be concluded that:
* The cancelation ratio and the average transaction differ clearly between room types within the same hotel, and between the two hotels for the same room type. **I think the F room type in the City Hotel and the G room type in the Resort Hotel have a high price and a high number of bookings, but they also have a high cancelation ratio.**
* The A room type has the highest number of bookings and the highest number of room type changes in both hotels. **I think it's because the A room type is a special offer to attract customers.** The D room type has a low cancelation ratio, a high price and a high number of bookings, **so it might be the most profitable room type for the City Hotel.**

In [None]:
# canceled_ratio by assigned_room_type and hotel
temp_df = data.groupby(['hotel','assigned_room_type'])['is_canceled'].value_counts().unstack().unstack(level=0)
temp_df['city_total'] = temp_df[0]['City Hotel'].fillna(0)+ temp_df[1]['City Hotel'].fillna(0)
temp_df['city_cancelation'] = temp_df[1]['City Hotel'].fillna(0)/temp_df['city_total']
temp_df['resort_total'] = temp_df[0]['Resort Hotel'].fillna(0)+ temp_df[1]['Resort Hotel'].fillna(0)
temp_df['resort_cancelation'] = temp_df[1]['Resort Hotel'].fillna(0)/temp_df['resort_total']
y1 = [z for z in temp_df['city_cancelation']]
y2 = [z for z in temp_df['city_total']]
y3 = [z for z in temp_df['resort_cancelation']]
y4 = [z for z in temp_df['resort_total']]
labels = [z for z in temp_df.index.values]
x = np.arange(len(labels))  # the label locations
width = 0.35  # the width of the bars
ax = plt.subplot(2,1,1)
rects1 = ax.bar(x - width/2, y1, width, label='City Hotel')
rects2 = ax.bar(x + width/2, y3, width, label='Resort Hotel')
ax.set_ylabel('canceled_ratio')
ax.set_xlabel('assigned_room_type')
ax.set_title('canceled_ratio by assigned_room_type and hotel', fontdict={'weight': 'normal', 'size': 8})
ax.set_xticks(x)
ax.set_xticklabels(labels)
plt.legend(loc=(1.02,0))
plt.subplots_adjust(right=0.8,hspace=0.5)

# total_transactions by assigned_room_type and hotel
temp_df = data.groupby(['hotel','assigned_room_type'])['total_transactions'].mean().unstack(level=0)
y1 = [z for z in temp_df['City Hotel']]
y2 = [z for z in temp_df['Resort Hotel']]
labels = [z for z in temp_df.index.values]
x = np.arange(len(labels))  # the label locations
width = 0.35  # the width of the bars
ax = plt.subplot(2,1,2)
rects1 = ax.bar(x - width/2, y1, width, label='City Hotel')
rects2 = ax.bar(x + width/2, y2, width, label='Resort Hotel')
ax.set_ylabel('mean total_transactions')
ax.set_xlabel('assigned_room_type')
ax.set_title('total_transactions by assigned_room_type and hotel', fontdict={'weight': 'normal', 'size': 8})
ax.set_xticks(x)
ax.set_xticklabels(labels)
plt.legend(loc=(1.02,0))
plt.subplots_adjust(right=0.8)
plt.show()


# heatmap of change details
temp_df11 = data[data['hotel']=='City Hotel'].groupby('assigned_room_type')['reserved_room_type'].value_counts().unstack().fillna(0)
temp_df21 = data[data['hotel']=='Resort Hotel'].groupby('assigned_room_type')['reserved_room_type'].value_counts().unstack().fillna(0)

temp_df12 = data[(data['is_canceled']==1)&(data['hotel']=='City Hotel')].groupby('assigned_room_type')['reserved_room_type'].value_counts().unstack().fillna(0)
temp_df22 = data[(data['is_canceled']==1)&(data['hotel']=='Resort Hotel')].groupby('assigned_room_type')['reserved_room_type'].value_counts().unstack().fillna(0)

temp_df13 = temp_df12/temp_df11 # cancelation
temp_df23 = temp_df22/temp_df21 # cancelation

k = 1
for df_t in (temp_df11, temp_df21, temp_df13, temp_df23):
    ax = plt.subplot(2,2,k)
    cmap = sns.cubehelix_palette(start = 1.5, rot = 3, gamma=0.8, as_cmap = True)
    sns.heatmap(df_t, linewidths = 0.05, ax = ax, vmax=df_t.values.max(), vmin=df_t.values.min(), cmap=cmap, robust=True) 
    if k==1:
        ax.set_title('The total by room type in City Hotel', fontdict={'weight': 'normal', 'size': 8})
    elif k==2:
        ax.set_title('The total by room type in Resort Hotel', fontdict={'weight': 'normal', 'size': 8})
    elif k==3:
        ax.set_title('The canceled ratio by room type in City Hotel', fontdict={'weight': 'normal', 'size': 8})
    else:
        ax.set_title('The canceled ratio by room type in Resort Hotel', fontdict={'weight': 'normal', 'size': 8})
    plt.subplots_adjust(hspace=0.5,wspace=0.5)
    k = k+1

plt.show()


When assigned_room_type differs from reserved_room_type, the ratio of cancelation is low. Thus, I'll construct a new variable named room_change.

In [None]:
### room_change
# Flag bookings whose assigned room type differs from the reserved one.
# Assigning on the full frame avoids pandas' SettingWithCopyWarning on slices.
data['room_change'] = np.where(data['reserved_room_type']==data['assigned_room_type'], 'unchanged', 'changed')
data = data.reset_index(drop=True)

### 2.1.2 Booking Response

The value range of days_in_waiting_list is large, and most values are 0. I'll construct a new variable named days_in_waiting_list_new.
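
A vectorized way to build such a 0/1 flag, shown here on a made-up series (the values are assumptions), is to compare against zero and cast the result to integers:

```python
import pandas as pd

# Toy waiting-list values; most bookings never wait.
waiting = pd.Series([0, 0, 3, 0, 120], name="days_in_waiting_list")

# 0 stays 0, any other value becomes 1.
flag = waiting.ne(0).astype(int)
print(flag.tolist())
```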

In [None]:
### days_in_waiting_list_new

def fun1(value):
    if value == 0:
        return 0 
    else:
        return 1

data['days_in_waiting_list_new'] = data['days_in_waiting_list'].apply(fun1)


temp_df = data.groupby('days_in_waiting_list_new')['is_canceled'].value_counts().unstack()
temp_df['total'] = temp_df[0]+temp_df[1]
temp_df['cancelation'] = temp_df[1]/temp_df['total']

temp_df

### 2.1.3 Distribution Channel 

It can be concluded that 
* The distribution channel with the highest number of booking is TA/TO, which also has the highest ratio of booking cancelation.
* For GDS, the City Hotel's ratio of cancelation is much higher than the Resort Hotel's. But GDS's number of booking is low.

In [None]:
temp_df = data.groupby(['distribution_channel','hotel'])['is_canceled'].value_counts().unstack().unstack()
temp_df['city_total'] = temp_df[0]['City Hotel'].fillna(0)+temp_df[1]['City Hotel'].fillna(0)
temp_df['resort_total'] = temp_df[0]['Resort Hotel'].fillna(0)+temp_df[1]['Resort Hotel'].fillna(0)
temp_df['city_cancelation'] = temp_df[1]['City Hotel'].fillna(0)/temp_df['city_total']
temp_df['resort_cancelation'] = temp_df[1]['Resort Hotel'].fillna(0)/temp_df['resort_total']

y = [z for z in temp_df['city_total'].values]
x = [z for z in temp_df.index]
plt.subplot(2,2,1)
plt.pie(y,startangle=90)
plt.axis('equal')
plt.title('The booking percentage in\n City Hotel', fontdict={'weight': 'normal', 'size': 8})
plt.legend(loc='best',ncol=2,fontsize='xx-small',labels=x)

y = [z for z in temp_df['resort_total'].values]
x = [z for z in temp_df.index]
plt.subplot(2,2,2)
plt.pie(y,startangle=90)
plt.axis('equal')
plt.title('The booking percentage in\n Resort Hotel', fontdict={'weight': 'normal', 'size': 8})
plt.legend(loc='best',ncol=2,fontsize='xx-small',labels=x)

y1 = [z for z in temp_df['city_cancelation'].values]
y2 = [z for z in temp_df['resort_cancelation'].values]
labels = [z for z in temp_df.index.values]
x = np.arange(len(labels))  # the label locations
width = 0.35  # the width of the bars
ax = plt.subplot(2,1,2)
rects1 = ax.bar(x - width/2, y1, width, label='City Hotel')
rects2 = ax.bar(x + width/2, y2, width, label='Resort Hotel')
ax.set_ylabel('canceled_ratio')
ax.set_xlabel('distribution_channel')
ax.set_title('canceled_ratio by distribution_channel and hotel', fontdict={'weight': 'normal', 'size': 8})
ax.set_xticks(x)
ax.set_xticklabels(labels)
plt.legend(loc=(1.02,0))
plt.subplots_adjust(right=0.8)

plt.show()

## 2.2 Variables about customers

For customers, the variables can be sorted into 2 parts:
* Basic information. It contains the number of customers, country information and customers' type.
* Behavior. It contains customers' previous bookings, the need for meals, the deposit type, special requests, the number of days booked in advance and the number of nights stayed.

### 2.2.1 Basic information

#### 2.2.1.1 the number of customers

I find many unreasonable records and delete them.
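
One possible way to express "unreasonable" is a single boolean mask. This is a sketch on made-up rows, and the three rules are assumptions that mirror my reading of the filter: no guests at all, children and babies without any adult, or more than four minors per adult.

```python
import pandas as pd

# Toy guest counts (made up for illustration).
toy = pd.DataFrame({
    "adults":   [0, 0, 1, 2, 2],
    "children": [0, 2, 5, 1, 0],
    "babies":   [0, 1, 0, 0, 0],
})

minors = toy["children"] + toy["babies"]

no_guests = (toy["adults"] == 0) & (toy["children"] == 0) & (toy["babies"] == 0)
unattended = (toy["adults"] == 0) & (toy["children"] > 0) & (toy["babies"] > 0)
# minors > 4 * adults is the same test as minors / adults > 4,
# but it never divides by zero.
too_many_minors = (toy["adults"] > 0) & (minors > 4 * toy["adults"])

bad = no_guests | unattended | too_many_minors
clean = toy[~bad].reset_index(drop=True)
print(clean)
```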


In [None]:
# unreasonable records
df1 = data[(data['adults']==0)&(data['children']==0)&(data['babies']==0)]
df2 = data[(data['adults']==0)&(data['children']>0)&(data['babies']>0)]
df3 = data[((data['babies']+ data['children'])/data['adults']>4)&(data['adults']>0)]
temp_df = pd.concat([df1,df2,df3])
data = data.drop(temp_df.index)
data = data.reset_index(drop=True)

data['total_customer'] = data['adults']+data['children']+data['babies']


It can be concluded that:
* Usually, the number of adults varies from 1 to 3, and the number of children varies from 0 to 2. The number of babies varies from 0 to 1.
* When the number of adults is higher than a particular value, the cancelation ratio is 1.
* Bookings with babies have a lower cancelation ratio than bookings without a baby.

In [None]:
k = 1
for column in ['adults','children','babies']:
    temp_df = data.groupby(column)['is_canceled'].value_counts().unstack()
    temp_df['total'] = temp_df[0].fillna(0)+ temp_df[1].fillna(0)
    temp_df['cancelation'] = temp_df[1].fillna(0)/temp_df['total']
    y1 = [z for z in temp_df['cancelation']]
    y2 = [z for z in temp_df['total']]
    x = [int(z) for z in temp_df.index.values]
    ax = plt.subplot(2,3,k)
    plt.plot(x,y1,linestyle='--',alpha=0.5,color='r')
    plt.title(column)
    plt.ylabel('canceled_ratio')
    plt.subplots_adjust(wspace=0.5)
    if k == 1:
        xmajorLocator = MultipleLocator(10)
    else:
        xmajorLocator = MultipleLocator(1)
    ax.xaxis.set_major_locator(xmajorLocator)
    plt.subplot(2,3,k+3)
    plt.pie(y2)
    plt.axis('equal')
    plt.legend(loc='best',ncol=2,fontsize='xx-small',labels=x)
    k = k+1

plt.show()


Thus, I construct a new variable named family.
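
The seven family classes can also be written as one `np.select` call, which keeps the row order and avoids assigning to slices. This is a sketch on made-up rows; the class rules below are my restatement of the conditions and the `-1` default is an assumption that marks rows matching no rule:

```python
import numpy as np
import pandas as pd

# One toy row per family class (values are made up).
toy = pd.DataFrame({
    "adults":   [2, 0, 1, 2, 3, 1, 3],
    "children": [0, 2, 0, 0, 0, 1, 1],
    "babies":   [1, 0, 0, 0, 0, 0, 0],
})
a, c, b = toy["adults"], toy["children"], toy["babies"]
no_kids = (c == 0) & (b == 0)

conditions = [
    b > 0,                                  # 0: any booking with a baby
    (a == 0) & (b == 0),                    # 1: no adults, no babies
    (a == 1) & no_kids,                     # 2: single adult
    (a == 2) & no_kids,                     # 3: two adults
    (a > 2) & no_kids,                      # 4: more than two adults
    a.isin([1, 2]) & (c > 0) & (b == 0),    # 5: 1-2 adults with children
    (a > 2) & (c > 0) & (b == 0),           # 6: 3+ adults with children
]
toy["family"] = np.select(conditions, list(range(7)), default=-1)
print(toy)
```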

In [None]:
## family ##
df1 = data[data['babies']>0]
df1['family'] = 0
df2 = data[(data['adults']==0)&(data['babies']==0)]
df2['family'] = 1
df3 = data[(data['adults']==1)&(data['children']==0)&(data['babies']==0)]
df3['family'] = 2
df4 = data[(data['adults']==2)&(data['children']==0)&(data['babies']==0)]
df4['family'] = 3
df5 = data[(data['adults']>2)&(data['children']==0)&(data['babies']==0)]
df5['family'] = 4
df6 = data[((data['adults']==2)|(data['adults']==1))&(data['children']>0)&(data['babies']==0)]
df6['family'] = 5
df7 = data[(data['adults']>2)&(data['children']>0)&(data['babies']==0)]
df7['family'] = 6
data = pd.concat([df1,df2,df3,df4,df5,df6,df7])
data = data.reset_index(drop=True)

In [None]:
## figure of family ##
temp_df = data.groupby('family')['is_canceled'].value_counts().unstack()
temp_df['total'] = temp_df[0].fillna(0)+ temp_df[1].fillna(0)
temp_df['cancelation'] = temp_df[1].fillna(0)/temp_df['total']
y1 = [z for z in temp_df['cancelation']]
y2 = [z for z in temp_df['total']]
x = [z for z in temp_df.index.values]
ax = plt.subplot(2,1,1)
plt.bar(x,y1,alpha=0.4, color='b')
plt.ylabel('canceled_ratio')
plt.subplots_adjust(wspace=0.4)
plt.title('family')
xmajorLocator = MultipleLocator(1)
ax.xaxis.set_major_locator(xmajorLocator)
plt.subplot(2,1,2)
plt.pie(y2)
plt.axis('equal')
plt.legend(loc='best',ncol=2,fontsize='xx-small',labels=x)
plt.show()

#### 2.2.1.2 country information

The country variable has 177 values, and the top 20 countries account for 94.21% of all bookings.

In [None]:
temp_df = data.groupby(['country'])['is_canceled'].value_counts().unstack()
temp_df['total'] = temp_df[0].fillna(0)+temp_df[1].fillna(0)
temp_df['percentage'] = temp_df['total']/temp_df['total'].sum()
temp_df['cancelation'] = temp_df[1].fillna(0)/temp_df['total']
temp_df = temp_df.sort_values('total',ascending=False)

temp_df

As the cancelation ratio varies between countries, I construct a new variable named country_new.
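
The idea of grouping a high-cardinality variable by booking count and cancelation ratio can be sketched as a small helper. The function name `bucket_by_cancel_ratio`, its parameters and the toy data are hypothetical; the default thresholds (more than 100 bookings, ratio edges 0.1, 0.3 and 0.5) are the ones I use for country_new:

```python
import numpy as np
import pandas as pd

def bucket_by_cancel_ratio(df, key, min_total=100, edges=(0.1, 0.3, 0.5)):
    """Class 0 for rare values of `key`, otherwise 1 + the cancel-ratio bin."""
    stats = df.groupby(key)["is_canceled"].agg(total="size", cancelation="mean")
    # right=True puts x in bin i when edges[i-1] < x <= edges[i].
    ratio_bin = 1 + np.digitize(stats["cancelation"], bins=list(edges), right=True)
    stats["cls"] = np.where(stats["total"] > min_total, ratio_bin, 0)
    # Values never seen in `stats` fall back to class 0.
    return df[key].map(stats["cls"]).fillna(0).astype(int)

# Toy data with a tiny threshold so every class rule is exercised.
toy = pd.DataFrame({
    "country": ["PRT"] * 4 + ["GBR"] * 3 + ["FRA"] * 3 + ["ESP"],
    "is_canceled": [1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 1],
})
toy["country_new"] = bucket_by_cancel_ratio(toy, "country", min_total=2)
print(toy.drop_duplicates("country"))
```

Note that the classes are derived from the target itself, so in a modelling setting they would ideally be computed on the training split only.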

In [None]:
temp_df = data.groupby('country')['is_canceled'].value_counts().unstack()
temp_df['total'] = temp_df[0].fillna(0)+ temp_df[1].fillna(0)
temp_df['cancelation'] = temp_df[1].fillna(0)/temp_df['total']

temp_df1 = temp_df[(temp_df['total']<=100)]
class1 = [z for z in temp_df1.index.values]
temp_df2 = temp_df[(temp_df['total']>100)&(temp_df['cancelation']<=0.1)]
class2 = [z for z in temp_df2.index.values]
temp_df3 = temp_df[(temp_df['total']>100)&(temp_df['cancelation']>0.1)&(temp_df['cancelation']<=0.3)]
class3 = [z for z in temp_df3.index.values]
temp_df4 = temp_df[(temp_df['total']>100)&(temp_df['cancelation']>0.3)&(temp_df['cancelation']<=0.5)]
class4 = [z for z in temp_df4.index.values]
temp_df5 = temp_df[(temp_df['total']>100)&(temp_df['cancelation']>0.5)]
class5 = [z for z in temp_df5.index.values]


def fun3(values):
    if values in class1:
        return 0
    elif values in class2:
        return 1
    elif values in class3:
        return 2
    elif values in class4:
        return 3
    elif values in class5:
        return 4
    else:
        return 0

data['country_new'] = data['country'].apply(fun3)

In [None]:
temp_df = data.groupby('country_new')['is_canceled'].value_counts().unstack()
temp_df['total'] = temp_df[0].fillna(0)+ temp_df[1].fillna(0)
temp_df['cancelation'] = temp_df[1].fillna(0)/temp_df['total']
y1 = [z for z in temp_df['cancelation']]
y2 = [z for z in temp_df['total']]
labels = ['class1','class2','class3','class4','class5']
x = [z for z in temp_df.index.values]
ax = plt.subplot(2,1,1)
ax.bar(x,y1,alpha=0.4, color='b')
ax.set_ylabel('canceled_ratio')
ax.set_xticks(x)
ax.set_xticklabels(labels)
ax.set_title('canceled_ratio by country_new')
plt.subplots_adjust(hspace=0.5)

ax = plt.subplot(2,1,2)
ax.pie(y2, startangle=90)
ax.axis('equal')
ax.set_title('The percentage of different classes in country_new variable')
ax.legend(loc='best',ncol=2,fontsize='xx-small',labels=labels)
plt.show()


#### 2.2.1.3 customers' type

In this dataset, 4 variables can be used to describe customers' type. They are customer_type, market_segment, agent and company.

For customer_type and market_segment, it can be concluded that:
* The Transient and Transient-Party customer types have far more bookings than the Group and Contract types. The top 3 markets in terms of number of bookings are TA/TO (Online TA and Offline TA/TO included), Direct and Groups.
* Regarding the cancelation ratio, there are two conclusions: (1) the cancelation ratio of Transient and Contract customers from the Groups market exceeds 95%. (2) the cancelation ratio of customers from the Complementary, Corporate and Direct markets is low.


In [None]:
temp_df = data.groupby(['customer_type'])['market_segment'].value_counts().unstack().fillna(0)

temp_df11 = data.groupby(['customer_type'])['market_segment'].value_counts().unstack().fillna(0)
temp_df12 = data[data['is_canceled']==1].groupby('customer_type')['market_segment'].value_counts().unstack().fillna(0)
temp_df1 = temp_df12/temp_df11

temp_df1

Based on these conclusions, I construct a new variable named customer_type_new.

In [None]:
df1 = data[(data['market_segment']=='Groups')&((data['customer_type']=='Contract')|(data['customer_type']=='Transient'))]
df1['customer_type_new'] = 0
df2 = data[(data['market_segment']=='Groups')&(data['customer_type']=='Transient-Party')]
df2['customer_type_new'] = 1
df3 = data[(data['market_segment']=='Complementary')|(data['market_segment']=='Corporate')|(data['market_segment']=='Direct')]
df3['customer_type_new'] = 2
df4 = data[((data['market_segment']=='Online TA')|(data['market_segment']=='Offline TA/TO'))&(data['customer_type']=='Transient')]
df4['customer_type_new'] = 3
df5 = data[(data['market_segment']=='Online TA')&(data['customer_type']=='Transient-Party')]
df5['customer_type_new'] = 4

temp_df = pd.concat([df1,df2,df3,df4,df5])
df6 = data.drop(temp_df.index)
df6['customer_type_new'] = 5

data = pd.concat([temp_df,df6])
data = data.reset_index(drop=True)


In [None]:
#### figure

temp_df = data.groupby('customer_type_new')['is_canceled'].value_counts().unstack()
temp_df['total'] = temp_df[0].fillna(0)+ temp_df[1].fillna(0)
temp_df['cancelation'] = temp_df[1].fillna(0)/temp_df['total']
y1 = [z for z in temp_df['cancelation']]
y2 = [z for z in temp_df['total']]
labels = ['class1','class2','class3','class4','class5','class6']
x = [z for z in temp_df.index.values]
ax = plt.subplot(2,1,1)
ax.bar(x,y1,alpha=0.4, color='b')
ax.set_ylabel('canceled_ratio')
ax.set_xticks(x)
ax.set_xticklabels(labels)
ax.set_title('canceled_ratio by customer_type_new')
plt.subplots_adjust(hspace=0.5)

ax = plt.subplot(2,1,2)
ax.pie(y2, startangle=90)
ax.axis('equal')
ax.set_title('The percentage of different classes in customer_type_new variable')
ax.legend(loc='best',ncol=2,fontsize='xx-small',labels=labels)
plt.show()


For the agent and company variables, the value ranges are large, and the number of bookings and the cancelation ratio vary across values. I construct two new variables named agent_new and company_new.

In [None]:
def fun2(values):
    if values == 0:
        return 0
    else:
        return 1

data['company_new'] = data['company'].fillna(0).apply(fun2)

## agent_new
temp_df = data[data['agent']>0].groupby('agent')['is_canceled'].value_counts().unstack()
temp_df['total'] = temp_df[0].fillna(0)+ temp_df[1].fillna(0)
temp_df['cancelation'] = temp_df[1].fillna(0)/temp_df['total']

temp_df1 = temp_df[(temp_df['total']<=100)]
class1 = [z for z in temp_df1.index.values]
temp_df2 = temp_df[(temp_df['total']>100)&(temp_df['cancelation']<=0.1)]
class2 = [z for z in temp_df2.index.values]
temp_df3 = temp_df[(temp_df['total']>100)&(temp_df['cancelation']>0.1)&(temp_df['cancelation']<=0.3)]
class3 = [z for z in temp_df3.index.values]
temp_df4 = temp_df[(temp_df['total']>100)&(temp_df['cancelation']>0.3)&(temp_df['cancelation']<=0.5)]
class4 = [z for z in temp_df4.index.values]
temp_df5 = temp_df[(temp_df['total']>100)&(temp_df['cancelation']>0.5)&(temp_df['cancelation']<=0.8)]
class5 = [z for z in temp_df5.index.values]
temp_df6 = temp_df[(temp_df['total']>100)&(temp_df['cancelation']>0.8)]
class6 = [z for z in temp_df6.index.values]

def fun3(values):
    if values == 0:
        return 0
    elif values in class1:
        return 1
    elif values in class2:
        return 2
    elif values in class3:
        return 3
    elif values in class4:
        return 4
    elif values in class5:
        return 5
    elif values in class6:
        return 6
    else:
        return 1

data['agent_new'] = data['agent'].apply(fun3)


In [None]:
k = 1
for column in ['company_new','agent_new']:
    temp_df = data.groupby(column)['is_canceled'].value_counts().unstack()
    temp_df['total'] = temp_df[0].fillna(0)+ temp_df[1].fillna(0)
    temp_df['cancelation'] = temp_df[1].fillna(0)/temp_df['total']
    y1 = [z for z in temp_df['cancelation']]
    y2 = [z for z in temp_df['total']]
    if k == 1:
        labels = ['c1','c2']
    else:
        labels = ['c1','c2','c3','c4','c5','c6','c7']
    x = [z for z in temp_df.index.values]
    ax = plt.subplot(2,2,k)
    ax.bar(x,y1,alpha=0.4, color='b')
    ax.set_ylabel('canceled_ratio', fontdict={'weight': 'normal', 'size': 8})
    ax.set_xticks(x)
    ax.set_xticklabels(labels, fontdict={'weight': 'normal', 'size': 8})
    ax.set_title('canceled_ratio by '+column, fontdict={'weight': 'normal', 'size': 8})
    plt.subplots_adjust(hspace=0.5, wspace=0.5)
    ax = plt.subplot(2,2,k+2)
    ax.pie(y2, startangle=90)
    ax.axis('equal')
    ax.set_title('The percentage of different classes in\n '+column+' variable', fontdict={'weight': 'normal', 'size': 8})
    ax.legend(loc='best',ncol=2,fontsize='xx-small',labels=labels)
    k = k+1

plt.show()


### 2.2.2 The behavior

#### 2.2.2.1 customers' previous booking

In the dataset, is_repeated_guest, previous_cancellations and previous_bookings_not_canceled can be used to describe customers' previous booking. It can be concluded that:
* Most bookings come from new customers, and new customers' cancelation ratio is much higher than repeated customers'.
* If the previous_cancellations variable is equal to 0, the cancelation ratio is low, while if the previous_bookings_not_canceled variable is equal to 0, the cancelation ratio is high.

In [None]:
######### is_repeated_guest
temp_df = data.groupby('is_repeated_guest')['is_canceled'].value_counts().unstack()
temp_df['total'] = temp_df[0].fillna(0)+ temp_df[1].fillna(0)
temp_df['cancelation'] = temp_df[1].fillna(0)/temp_df['total']
y1 = [z for z in temp_df['cancelation']]
y2 = [z for z in temp_df['total']]
x = [z for z in temp_df.index.values]
ax = plt.subplot(2,3,1)
plt.bar(x,y1,alpha=0.4, color='b')
plt.ylabel('canceled_ratio')
plt.subplots_adjust(wspace=0.5)
plt.title('is_repeated_guest', fontdict={'weight': 'normal', 'size': 8})
xmajorLocator = MultipleLocator(1)
ax.xaxis.set_major_locator(xmajorLocator)
plt.subplot(2,3,4)
plt.bar(x,y2)
plt.ylabel('total_num')

######### previous_cancellations + previous_bookings_not_canceled
k = 2
for column in ['previous_cancellations','previous_bookings_not_canceled']:
    temp_df = data.groupby(column)['is_canceled'].value_counts().unstack()
    temp_df['total'] = temp_df[0].fillna(0)+ temp_df[1].fillna(0)
    temp_df['cancelation'] = temp_df[1].fillna(0)/temp_df['total']
    y1 = [z for z in temp_df['cancelation']]
    y2 = [z for z in temp_df['total']]
    x = [z for z in temp_df.index.values]
    ax = plt.subplot(2,3,k)
    plt.plot(x,y1,alpha=0.4, color='b')
    plt.subplots_adjust(wspace=0.5)
    plt.title(column, fontdict={'weight': 'normal', 'size': 8})
    xmajorLocator = MultipleLocator(10)
    ax.xaxis.set_major_locator(xmajorLocator)
    plt.subplot(2,3,k+3)
    plt.bar(x,y2)
    k = k+1

plt.show()


So I construct a new variable named new_previous_cancellations.
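
Since the new variable only depends on whether each of the two counts is zero, the four classes can be encoded arithmetically. A minimal sketch on made-up rows (the toy values and helper names are assumptions):

```python
import pandas as pd

# One toy row per combination (values are made up).
toy = pd.DataFrame({
    "previous_bookings_not_canceled": [0, 0, 3, 1],
    "previous_cancellations":         [0, 2, 0, 1],
})

has_kept = (toy["previous_bookings_not_canceled"] > 0).astype(int)
has_canceled = (toy["previous_cancellations"] > 0).astype(int)

# 0: no history, 1: only cancelations, 2: only kept bookings, 3: both.
toy["new_previous_cancellations"] = 2 * has_kept + has_canceled
print(toy)
```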

In [None]:
df1 = data[(data['previous_bookings_not_canceled']==0)&(data['previous_cancellations']==0)]
df1['new_previous_cancellations'] = 0
df2 = data[(data['previous_bookings_not_canceled']==0)&(data['previous_cancellations']>0)]
df2['new_previous_cancellations'] = 1
df3 = data[(data['previous_bookings_not_canceled']>0)&(data['previous_cancellations']==0)]
df3['new_previous_cancellations'] = 2
df4 = data[(data['previous_bookings_not_canceled']>0)&(data['previous_cancellations']>0)]
df4['new_previous_cancellations'] = 3
data = pd.concat([df1,df2,df3,df4])


In [None]:
temp_df = data.groupby('new_previous_cancellations')['is_canceled'].value_counts().unstack()
temp_df['total'] = temp_df[0].fillna(0)+ temp_df[1].fillna(0)
temp_df['cancelation'] = temp_df[1].fillna(0)/temp_df['total']
y1 = [z for z in temp_df['cancelation']]
y2 = [z for z in temp_df['total']]
labels = ['class1','class2','class3','class4']
x = [z for z in temp_df.index.values]
ax = plt.subplot(2,1,1)
ax.bar(x,y1,alpha=0.4, color='b')
ax.set_ylabel('canceled_ratio')
ax.set_xticks(x)
ax.set_xticklabels(labels)
ax.set_title('canceled_ratio by new_previous_cancellations')
plt.subplots_adjust(hspace=0.5)
ax = plt.subplot(2,1,2)
ax.pie(y2, startangle=90)
ax.axis('equal')
ax.set_title('The percentage of different classes in new_previous_cancellations variable')
ax.legend(loc='best',ncol=2,fontsize='xx-small',labels=labels)

plt.show()


#### 2.2.2.2 the need for meal

According to the definition, the SC and Undefined values of the meal variable should be combined into the same class. It can be concluded that:
* FB has the highest cancelation ratio, but it also has a low number of bookings. **So the meal variable has little relation to the cancelation ratio.**
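
Merging the two labels can be done in one step with `Series.replace`. The series below is made up for illustration:

```python
import pandas as pd

# Toy meal codes; 'Undefined' is treated as 'SC' (no meal package).
meal = pd.Series(["BB", "SC", "Undefined", "HB", "Undefined", "FB"], name="meal")

meal_clean = meal.replace("Undefined", "SC")
print(meal_clean.value_counts())
```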

In [None]:
def fun4(value):
    if value=='Undefined':
        return 'SC'
    else:
        return value

data['meal'] = data['meal'].apply(fun4)


#### figure
temp_df = data.groupby('meal')['is_canceled'].value_counts().unstack()
temp_df['total'] = temp_df[0].fillna(0)+ temp_df[1].fillna(0)
temp_df['cancelation'] = temp_df[1].fillna(0)/temp_df['total']
y1 = [z for z in temp_df['cancelation']]
y2 = [z for z in temp_df['total']]
x = [z for z in temp_df.index.values]
ax = plt.subplot(2,1,1)
ax.bar(x,y1,alpha=0.4, color='b')
ax.set_ylabel('canceled_ratio')
ax.set_xticks(x)
ax.set_xticklabels(x)
ax.set_title('canceled_ratio by meal')
plt.subplots_adjust(hspace=0.5)
ax = plt.subplot(2,1,2)
ax.pie(y2, startangle=90)
ax.axis('equal')
ax.set_title('The percentage of different classes in meal variable')
ax.legend(loc='best',ncol=2,fontsize='xx-small',labels=x)

plt.show()


#### 2.2.2.3 deposit type

It can be found that Non Refund has the highest ratio of cancelation.

In [None]:
temp_df = data.groupby('deposit_type')['is_canceled'].value_counts().unstack()
temp_df['total'] = temp_df[0].fillna(0)+ temp_df[1].fillna(0)
temp_df['cancelation'] = temp_df[1].fillna(0)/temp_df['total']
y1 = [z for z in temp_df['cancelation']]
y2 = [z for z in temp_df['total']]
x = [z for z in temp_df.index.values]
ax = plt.subplot(2,1,1)
ax.bar(x,y1,alpha=0.4, color='b')
ax.set_ylabel('canceled_ratio')
ax.set_xticks(x)
ax.set_xticklabels(x)
ax.set_title('canceled_ratio by deposit_type')
plt.subplots_adjust(hspace=0.5)
ax = plt.subplot(2,1,2)
ax.pie(y2, startangle=90)
ax.axis('equal')
ax.set_title('The percentage of different classes in deposit_type variable')
ax.legend(loc='best',ncol=2,fontsize='xx-small',labels=x)

plt.show()

#### 2.2.2.4 special requests

In this dataset, booking_changes, required_car_parking_spaces and total_of_special_requests can be used to describe customers' special requests. It can be found that:
* Most bookings have no special requests and haven't been changed. Once a booking has special requests or has been changed, its cancelation ratio is low.
* The ratio of cancelation of bookings which require car parking spaces is 0.


In [None]:
k = 1
for column in ['booking_changes','required_car_parking_spaces','total_of_special_requests']:
    temp_df = data.groupby(column)['is_canceled'].value_counts().unstack()
    temp_df['total'] = temp_df[0].fillna(0)+ temp_df[1].fillna(0)
    temp_df['cancelation'] = temp_df[1].fillna(0)/temp_df['total']
    y1 = [z for z in temp_df['cancelation']]
    y2 = [z for z in temp_df['total']]
    x = [z for z in temp_df.index.values]
    ax = plt.subplot(2,3,k)
    ax.plot(x,y1,alpha=0.4, color='b')
    xmajorLocator = MultipleLocator(2)
    ax.xaxis.set_major_locator(xmajorLocator)
    plt.subplots_adjust(wspace=0.5)
    plt.title(column, fontdict={'weight': 'normal', 'size': 8})
    plt.subplot(2,3,k+3)
    plt.bar(x,y2)
    k = k+1

plt.show()


Thus, I construct two new variables named new_required_car_parking_spaces and booking_changes_class.
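
The same two variables can also be built without a row-wise apply. The sketch below is one possible vectorized version, assuming both counts are non-negative integers; the `toy` frame and its values are hypothetical:

```python
import numpy as np
import pandas as pd

# Hypothetical sample values standing in for the two original columns.
toy = pd.DataFrame({
    'required_car_parking_spaces': [0, 1, 2, 0],
    'booking_changes': [0, 3, 5, 6],
})

# 1 if any parking space was requested, else 0.
toy['new_required_car_parking_spaces'] = (toy['required_car_parking_spaces'] > 0).astype(int)

# 0 = no change, 1 = one to five changes, 2 = more than five changes.
# np.select takes the first matching condition, so "<= 5" only sees values above 0
# (this relies on the assumption that the counts are never negative).
changes = toy['booking_changes']
toy['booking_changes_class'] = np.select([changes == 0, changes <= 5], [0, 1], default=2)
print(toy)
```

On a table of this size the difference is mostly about readability, but vectorized expressions are usually faster than apply with a Python function.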

In [None]:
### new_required_car_parking_spaces
def fun51(value):
    if value > 0:
        return 1
    else: 
        return 0

data['new_required_car_parking_spaces'] = data['required_car_parking_spaces'].apply(fun51)

### booking_changes_class
def fun52(value):
    if value == 0:
        return 0
    elif value > 0 and value <= 5:
        return 1
    elif value > 5:
        return 2

data['booking_changes_class'] = data['booking_changes'].apply(fun52)


In [None]:
temp_df = data.groupby('booking_changes_class')['is_canceled'].value_counts().unstack()
temp_df['total'] = temp_df[0].fillna(0)+ temp_df[1].fillna(0)
temp_df['cancelation'] = temp_df[1].fillna(0)/temp_df['total']
y1 = [z for z in temp_df['cancelation']]
y2 = [z for z in temp_df['total']]
x = [z for z in temp_df.index.values]
ax = plt.subplot()
ax.bar(x,y1,alpha=0.4, color='b')
ax.set_ylabel('canceled_ratio')
ax.set_xticks(x)
ax.set_xticklabels(x)
ax.set_title('canceled_ratio by booking_changes_class')

plt.show()


#### 2.2.2.5 number of days booked in advance

The lead_time variable describes how many days in advance a booking was made. The cancelation ratio tends to rise with lead_time, i.e. the two are positively correlated.
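
A correlation like this can be quantified directly instead of being read off a scatter plot. The sketch below uses invented values (the `toy` frame is an assumption, built so that long lead times are canceled more often) and computes both the Pearson coefficient and a rank-based (Spearman) coefficient:

```python
import pandas as pd

# Invented sample: longer lead times are canceled more often here by construction.
toy = pd.DataFrame({
    'lead_time':   [0, 3, 10, 45, 120, 200, 300, 400],
    'is_canceled': [0, 0, 0, 1, 0, 1, 1, 1],
})

# Pearson correlation between a numeric column and a 0/1 flag
# (also known as the point-biserial correlation).
r = toy['lead_time'].corr(toy['is_canceled'])

# Spearman correlation = Pearson correlation of the ranks;
# it is less sensitive to a few very long lead times.
rho = toy['lead_time'].rank().corr(toy['is_canceled'].rank())

print(round(r, 3), round(rho, 3))
```

Applied to the real columns, a positive value of either coefficient would support the statement above; the numbers printed here only describe the toy data.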

In [None]:
temp_df = data.groupby('lead_time')['is_canceled'].value_counts().unstack()
temp_df['total'] = temp_df[0].fillna(0)+temp_df[1].fillna(0)
temp_df['cancelation'] = temp_df[1].fillna(0)/temp_df['total']
y1 = [z for z in temp_df['cancelation']]
y2 = [z for z in temp_df['total']]
x = [z for z in temp_df.index.values]
fig = plt.figure()
ax = plt.subplot(2,1,1)
rects1 = ax.scatter(x, y1)
ax.set_ylabel('cancel_ratio')
ax.set_xlabel('lead_time')
plt.subplots_adjust(wspace=0.4,hspace=0.6)
ax = plt.subplot(2,1,2)
rects1 = ax.bar(x, y2)
ax.set_ylabel('total')
ax.set_xlabel('lead_time')
plt.subplots_adjust(wspace=0.4)

plt.show()

I construct a new variable named new_lead_time that groups lead_time into eight classes.
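
As an alternative sketch, the same kind of binning can be written with pd.cut. The bin edges below are the interval boundaries used by the helper function in the next cell; the sample values are illustrative only:

```python
import numpy as np
import pandas as pd

# Intervals: 0, (0, 7], (7, 14], (14, 30], (30, 60], (60, 90], (90, 220], above 220.
# The leading -1 makes the first right-closed bin (-1, 0] catch a lead time of exactly 0.
edges = [-1, 0, 7, 14, 30, 60, 90, 220, np.inf]

lead_time = pd.Series([0, 7, 8, 30, 61, 220, 221])  # illustrative values

# labels=False returns the integer position of the bin instead of an interval label.
new_lead_time = pd.cut(lead_time, bins=edges, labels=False)
print(new_lead_time.tolist())  # [0, 1, 2, 3, 5, 6, 7]
```

Keeping the edges in one list makes it easy to see and change the class boundaries in a single place.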

In [None]:
def fun6(value):
    if value == 0:
        return 0
    elif value > 0 and value <= 7:
        return 1
    elif value > 7 and value <= 14:
        return 2
    elif value > 14 and value <= 30:
        return 3
    elif value > 30 and value <= 60:
        return 4
    elif value > 60 and value <= 90:
        return 5
    elif value > 90 and value <= 220:
        return 6
    else:
        return 7
         

data['new_lead_time'] = data['lead_time'].apply(fun6)


In [None]:
temp_df = data.groupby('new_lead_time')['is_canceled'].value_counts().unstack()
temp_df['total'] = temp_df[0].fillna(0)+ temp_df[1].fillna(0)
temp_df['cancelation'] = temp_df[1].fillna(0)/temp_df['total']
y1 = [z for z in temp_df['cancelation']]
y2 = [z for z in temp_df['total']]
labels = ['class1','class2','class3','class4','class5','class6','class7','class8']
x = [z for z in temp_df.index.values]
ax = plt.subplot(2,1,1)
ax.bar(x,y1,alpha=0.4, color='b')
ax.set_ylabel('canceled_ratio')
ax.set_xticks(x)
ax.set_xticklabels(labels)
ax.set_title('canceled_ratio by new_lead_time')
plt.subplots_adjust(hspace=0.5)
ax = plt.subplot(2,1,2)
ax.pie(y2, startangle=90)
ax.axis('equal')
ax.set_title('The percentage of different classes in new_lead_time variable')
ax.legend(loc='best',ncol=2,fontsize='xx-small',labels=labels)

plt.show()


#### 2.2.2.6 number of stay nights

The number of stay nights can be calculated with the formula: total_stay_night = stays_in_weekend_nights + stays_in_week_nights.

In [None]:

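# Total nights stayed; recomputing is harmless if the column already exists.
data['total_stay_night'] = data['stays_in_weekend_nights'] + data['stays_in_week_nights']
k = 1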
for column in ['total_stay_night','stays_in_weekend_nights','stays_in_week_nights']:
    temp_df = data.groupby(column)['is_canceled'].value_counts().unstack()
    temp_df['total'] = temp_df[0].fillna(0)+ temp_df[1].fillna(0)
    temp_df['cancelation'] = temp_df[1].fillna(0)/temp_df['total']
    y1 = [z for z in temp_df['cancelation']]
    y2 = [z for z in temp_df['total']]
    x = [z for z in temp_df.index.values]
    ax = plt.subplot(2,3,k)
    ax.plot(x,y1,alpha=0.4, color='b')
    xmajorLocator = MultipleLocator(5)
    ax.xaxis.set_major_locator(xmajorLocator)
    plt.subplots_adjust(wspace=0.6)
    plt.title(column, fontdict={'weight': 'normal', 'size': 8})
    plt.subplot(2,3,k+3)
    plt.bar(x,y2)
    k = k+1

plt.show()



# 3. Modelling Details

Based on the analysis above, I model first with the original variables only and then with the newly constructed variables added.

## 3.1 Model with original variables

I split the original variables into qualitative and quantitative variables. The quantitative variables are rescaled with MinMaxScaler.
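
With its default settings, MinMaxScaler maps each column to the range [0, 1] via (x - min) / (max - min). A small check of that formula, using made-up values (`demo_col` is a hypothetical single column):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Illustrative values for one quantitative column.
demo_col = np.array([[0.0], [10.0], [40.0], [50.0]])

# Scale with scikit-learn and with the formula written out by hand.
scaled = MinMaxScaler().fit_transform(demo_col)
manual = (demo_col - demo_col.min(axis=0)) / (demo_col.max(axis=0) - demo_col.min(axis=0))

print(scaled.ravel())
print(np.allclose(scaled, manual))
```

Tree-based models are largely insensitive to this kind of monotonic rescaling. For models that are sensitive to it, fitting the scaler on the training rows only and reusing it with transform on the test rows avoids leaking test-set ranges into training.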

In [None]:
qualitative_data = pd.DataFrame()
qualitative_data['is_canceled'] = data['is_canceled']
qualitative_data['hotel'] = data['hotel'].map({'Resort Hotel':1, 'City Hotel':0})
qualitative_data['arrival_date_month_new'] = data['arrival_date_month'].map({'July':6, 'August':7, 'September':8, 'October':9, 'November':10, 'December':11,
       'January':0, 'February':1, 'March':2, 'April':3, 'May':4, 'June':5})
qualitative_data['arrival_date_week_number'] = data['arrival_date_week_number']
qualitative_data['arrival_date_day_of_month'] = data['arrival_date_day_of_month']
qualitative_data['market_segment'] = data['market_segment'].map({'Direct':0, 'Corporate':1, 'Online TA':2, 'Offline TA/TO':3,
       'Complementary':4, 'Groups':5, 'Undefined':6, 'Aviation':7})
qualitative_data['distribution_channel'] = data['distribution_channel'].map({'Direct':0, 'Corporate':1, 'TA/TO':2, 'Undefined':3, 'GDS':4})
qualitative_data['is_repeated_guest'] = data['is_repeated_guest']
qualitative_data['assigned_room_type'] = data['assigned_room_type'].map({'C':0, 'A':1, 'D':2, 'E':3, 'G':4, 'F':5, 'I':10, 'B':8, 'H':6, 'L':7, 'K':11, 'P':9})
qualitative_data['deposit_type'] = data['deposit_type'].map({'No Deposit':0, 'Refundable':1, 'Non Refund':2})
qualitative_data['customer_type'] = data['customer_type'].map({'Transient':0, 'Contract':1, 'Transient-Party':2, 'Group':3})

quantitative_data = pd.DataFrame()
quantitative_data['lead_time'] = data['lead_time']
quantitative_data['stays_in_weekend_nights'] = data['stays_in_weekend_nights']
quantitative_data['stays_in_week_nights'] = data['stays_in_week_nights']
quantitative_data['adults'] = data['adults']
quantitative_data['children'] = data['children']
quantitative_data['babies'] = data['babies']
quantitative_data['previous_cancellations'] = data['previous_cancellations']
quantitative_data['previous_bookings_not_canceled'] = data['previous_bookings_not_canceled']
quantitative_data['booking_changes'] = data['booking_changes']
quantitative_data['days_in_waiting_list'] = data['days_in_waiting_list']
quantitative_data['adr'] = data['adr']
quantitative_data['required_car_parking_spaces'] = data['required_car_parking_spaces']
quantitative_data['total_of_special_requests'] = data['total_of_special_requests']

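from sklearn.preprocessing import MinMaxScaler
# Keep the original row index so that the concat below aligns rows correctly,
# even if rows were dropped from data earlier.
quantitative_data_minmaxscaler = pd.DataFrame(
    MinMaxScaler().fit_transform(quantitative_data),
    index=quantitative_data.index,
    columns=quantitative_data.columns,
)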


new_data1 = pd.concat([qualitative_data, quantitative_data_minmaxscaler],axis=1)


The best of the three models is the RandomForest model, with an accuracy of 0.84.
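
Accuracy can be measured in two common ways: cross-validation on the training split, or a single score on the held-out test split. The sketch below contrasts them on synthetic data from `make_classification`. The names `X_demo`, `y_demo` and `model` are illustrative assumptions, and the printed numbers say nothing about the hotel data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic data, used only to show the two evaluation styles.
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

model = RandomForestClassifier(n_estimators=50, random_state=0)

# Cross-validation on the training split: the estimator is cloned and refit per fold.
cv_scores = cross_val_score(model, X_tr, y_tr, cv=5, scoring='accuracy')

# Held-out accuracy: fit once on the training split, score on unseen test rows.
model.fit(X_tr, y_tr)
test_acc = accuracy_score(y_te, model.predict(X_te))

print("CV accuracy: %0.4f (+/- %0.4f)" % (cv_scores.mean(), cv_scores.std()))
print("Held-out accuracy: %0.4f" % test_acc)
```

Note that cross_val_score never uses an estimator that was fitted beforehand: it always works on fresh clones, so its result depends only on the data passed to it.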

In [None]:
#######################   DataModel  #######################
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn import tree
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score

y = new_data1['is_canceled'].values
X = new_data1.drop(['is_canceled'], axis = 1).values

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


#########  DecisionTree  ########
clf_tree = tree.DecisionTreeClassifier()
clf_tree = clf_tree.fit(X_train, y_train)
clf_tree_predict = clf_tree.predict(X_test)

#########  RandomForest  ########
clf_randomtree = RandomForestClassifier(n_estimators=100)
clf_randomtree = clf_randomtree.fit(X_train, y_train)
clf_randomtree_predict = clf_randomtree.predict(X_test)

#########  GradientBoosting  ########
clf_GradientBoosting = GradientBoostingClassifier(n_estimators=100, learning_rate=1.0,max_depth=1, random_state=0)
clf_GradientBoosting = clf_GradientBoosting.fit(X_train, y_train)
clf_GradientBoosting_predict = clf_GradientBoosting.predict(X_test)


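for clf, predictions, label in zip([clf_tree, clf_randomtree, clf_GradientBoosting],
                                   [clf_tree_predict, clf_randomtree_predict, clf_GradientBoosting_predict],
                                   ['Decision Tree', 'Random Forest', 'Gradient Boosting']):
    # cross_val_score clones and refits each estimator on folds of the test split
    scores = cross_val_score(clf, X_test, y_test, cv=5, scoring='accuracy')
    print("Accuracy: %0.4f (+/- %0.4f) [%s]" % (scores.mean(), scores.std(), label))
    # accuracy of the model fitted on the training split, scored on the held-out test split
    print("Held-out accuracy: %0.4f [%s]" % (accuracy_score(y_test, predictions), label))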


## 3.2 Model with new variables

Models trained with the new variables added perform better than the ones above.

**For the RandomForest model, accuracy increases from 0.84 to 0.89.**

Good!

In [None]:
new_variable_data = pd.DataFrame()
new_variable_data['room_change'] = data['room_change'].map({'changed':1,'unchanged':0})
new_variable_data['days_in_waiting_list_new'] = data['days_in_waiting_list_new']
new_variable_data['family'] = data['family']
new_variable_data['country_new'] = data['country_new']
new_variable_data['customer_type_new'] = data['customer_type_new']
new_variable_data['company_new'] = data['company_new']
new_variable_data['agent_new'] = data['agent_new']
new_variable_data['new_previous_cancellations'] = data['new_previous_cancellations']
new_variable_data['booking_changes_class'] = data['booking_changes_class']
new_variable_data['new_required_car_parking_spaces'] = data['new_required_car_parking_spaces']
new_variable_data['new_lead_time'] = data['new_lead_time']

new_data2 = pd.concat([qualitative_data, quantitative_data_minmaxscaler, new_variable_data],axis=1)


In [None]:
y = new_data2['is_canceled'].values
X = new_data2.drop(['is_canceled'], axis = 1).values

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


#########  DecisionTree  ########
clf_tree = tree.DecisionTreeClassifier()
clf_tree = clf_tree.fit(X_train, y_train)
clf_tree_predict = clf_tree.predict(X_test)

#########  RandomForest  ########
clf_randomtree = RandomForestClassifier(n_estimators=100)
clf_randomtree = clf_randomtree.fit(X_train, y_train)
clf_randomtree_predict = clf_randomtree.predict(X_test)

#########  GradientBoosting  ########
clf_GradientBoosting = GradientBoostingClassifier(n_estimators=100, learning_rate=1.0,max_depth=1, random_state=0)
clf_GradientBoosting = clf_GradientBoosting.fit(X_train, y_train)
clf_GradientBoosting_predict = clf_GradientBoosting.predict(X_test)


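for clf, predictions, label in zip([clf_tree, clf_randomtree, clf_GradientBoosting],
                                   [clf_tree_predict, clf_randomtree_predict, clf_GradientBoosting_predict],
                                   ['Decision Tree', 'Random Forest', 'Gradient Boosting']):
    # cross_val_score clones and refits each estimator on folds of the test split
    scores = cross_val_score(clf, X_test, y_test, cv=5, scoring='accuracy')
    print("Accuracy: %0.4f (+/- %0.4f) [%s]" % (scores.mean(), scores.std(), label))
    # accuracy of the model fitted on the training split, scored on the held-out test split
    print("Held-out accuracy: %0.4f [%s]" % (accuracy_score(y_test, predictions), label))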
