# Analysis of Hotel Demand and Booking Cancellations

Hotel booking cancellations make it harder to forecast and optimize occupancy
accurately, which in turn results in revenue loss. The goal of this project is to
predict in advance whether a hotel customer will cancel a booking.
Predicting future cancellations can help hotels plan cancellation and refund
policies and staffing schedules, and target customers with offers and
discounts. It is also important to understand the key factors behind cancellations
and how those factors relate to the outcome.

## Import Libraries

In [1]:
import numpy as np 
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
from matplotlib import pyplot

In [2]:
# modeling imports
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_score, recall_score, accuracy_score, roc_auc_score
from sklearn.metrics import confusion_matrix
from sklearn.preprocessing import StandardScaler

In [3]:
from sklearn import preprocessing
from sklearn import metrics
from sklearn.model_selection import RandomizedSearchCV,GridSearchCV
from sklearn.model_selection import cross_val_score,StratifiedKFold,cross_validate 
from sklearn.metrics import confusion_matrix,roc_curve,auc
#from pycaret.classification import *,

In [4]:
# additional classifiers (LogisticRegression and RandomForestClassifier are already imported above)
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import BernoulliNB
from sklearn.svm import LinearSVC
from sklearn.svm import SVC

## Load dataset

I used the "Hotel booking demand" dataset available on Kaggle (https://www.kaggle.com/jessemostipak/hotel-booking-demand).

In [5]:
df_booking = pd.read_csv('hotel_bookings.csv')

In [6]:
df_booking.shape

(119390, 32)

In [7]:
df_booking.head()

Unnamed: 0,hotel,is_canceled,lead_time,arrival_date_year,arrival_date_month,arrival_date_week_number,arrival_date_day_of_month,stays_in_weekend_nights,stays_in_week_nights,adults,...,deposit_type,agent,company,days_in_waiting_list,customer_type,adr,required_car_parking_spaces,total_of_special_requests,reservation_status,reservation_status_date
0,Resort Hotel,0,342,2015,July,27,1,0,0,2,...,No Deposit,,,0,Transient,0.0,0,0,Check-Out,2015-07-01
1,Resort Hotel,0,737,2015,July,27,1,0,0,2,...,No Deposit,,,0,Transient,0.0,0,0,Check-Out,2015-07-01
2,Resort Hotel,0,7,2015,July,27,1,0,1,1,...,No Deposit,,,0,Transient,75.0,0,0,Check-Out,2015-07-02
3,Resort Hotel,0,13,2015,July,27,1,0,1,1,...,No Deposit,304.0,,0,Transient,75.0,0,0,Check-Out,2015-07-02
4,Resort Hotel,0,14,2015,July,27,1,0,2,2,...,No Deposit,240.0,,0,Transient,98.0,0,1,Check-Out,2015-07-03


In [8]:
df_booking.columns

Index(['hotel', 'is_canceled', 'lead_time', 'arrival_date_year',
       'arrival_date_month', 'arrival_date_week_number',
       'arrival_date_day_of_month', 'stays_in_weekend_nights',
       'stays_in_week_nights', 'adults', 'children', 'babies', 'meal',
       'country', 'market_segment', 'distribution_channel',
       'is_repeated_guest', 'previous_cancellations',
       'previous_bookings_not_canceled', 'reserved_room_type',
       'assigned_room_type', 'booking_changes', 'deposit_type', 'agent',
       'company', 'days_in_waiting_list', 'customer_type', 'adr',
       'required_car_parking_spaces', 'total_of_special_requests',
       'reservation_status', 'reservation_status_date'],
      dtype='object')

The dataset consists of 119,390 observations and 32 columns: the target
is_canceled plus 31 features. The unit of analysis in this project is a single
booking made by a hotel customer. The features describe the booking and include
the arrival date, lead time, number of adults, children, and babies, deposit type,
and previous cancellations.

### Change column order

In [9]:
cols=['is_canceled','hotel','lead_time', 'arrival_date_year',
       'arrival_date_month', 'arrival_date_week_number',
       'arrival_date_day_of_month', 'stays_in_weekend_nights',
       'stays_in_week_nights', 'adults', 'children', 'babies', 'meal',
       'country', 'market_segment', 'distribution_channel',
       'is_repeated_guest', 'previous_cancellations',
       'previous_bookings_not_canceled', 'reserved_room_type',
       'assigned_room_type', 'booking_changes', 'deposit_type', 'agent',
       'company', 'days_in_waiting_list', 'customer_type', 'adr',
       'required_car_parking_spaces', 'total_of_special_requests',
       'reservation_status', 'reservation_status_date']
df_booking=df_booking[cols] 

## EDA

In [10]:
from pandas_profiling import ProfileReport
#profile = ProfileReport(df_booking)
#profile



## Exploring the Null values


In [11]:
df_booking.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
is_canceled,119390.0,0.370416,0.482918,0.0,0.0,0.0,1.0,1.0
lead_time,119390.0,104.011416,106.863097,0.0,18.0,69.0,160.0,737.0
arrival_date_year,119390.0,2016.156554,0.707476,2015.0,2016.0,2016.0,2017.0,2017.0
arrival_date_week_number,119390.0,27.165173,13.605138,1.0,16.0,28.0,38.0,53.0
arrival_date_day_of_month,119390.0,15.798241,8.780829,1.0,8.0,16.0,23.0,31.0
stays_in_weekend_nights,119390.0,0.927599,0.998613,0.0,0.0,1.0,2.0,19.0
stays_in_week_nights,119390.0,2.500302,1.908286,0.0,1.0,2.0,3.0,50.0
adults,119390.0,1.856403,0.579261,0.0,2.0,2.0,2.0,55.0
children,119386.0,0.10389,0.398561,0.0,0.0,0.0,0.0,10.0
babies,119390.0,0.007949,0.097436,0.0,0.0,0.0,0.0,10.0


#### We can see that adr has an outlier

In [12]:
plt.boxplot(df_booking['adr'])
plt.title("Detecting outliers in adr")

Text(0.5, 1.0, 'Detecting outliers in adr')

In [13]:
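# Chained indexing assigns to a temporary copy, so df_booking is left unchanged (see the warning below)
df_booking[df_booking['adr']>1000].adr=df_booking['adr'].mean()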

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self[name] = value


In [14]:
# Replace adr values above 1000 with the column mean; .loc modifies df_booking itself
df_booking.loc[df_booking['adr'] > 1000, 'adr'] = df_booking['adr'].mean()

In [15]:
df_booking[df_booking['adr']>1000]

Unnamed: 0,is_canceled,hotel,lead_time,arrival_date_year,arrival_date_month,arrival_date_week_number,arrival_date_day_of_month,stays_in_weekend_nights,stays_in_week_nights,adults,...,deposit_type,agent,company,days_in_waiting_list,customer_type,adr,required_car_parking_spaces,total_of_special_requests,reservation_status,reservation_status_date


In [16]:
plt.boxplot(df_booking['adr'])
plt.title("Detecting outliers in adr")

Text(0.5, 1.0, 'Detecting outliers in adr')
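
The cutoff of 1000 used above was read off the boxplot by eye. As a hedged alternative, the sketch below flags outliers with the interquartile-range rule that boxplot whiskers are based on. The helper name `iqr_outlier_mask`, the multiplier `k=1.5`, and the toy values are illustrative assumptions and do not come from the original notebook.

```python
import pandas as pd

def iqr_outlier_mask(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Return True where a value lies outside [Q1 - k*IQR, Q3 + k*IQR].

    Hypothetical helper for illustration only.
    """
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)

# Toy stand-in for the adr column (made-up values)
toy_adr = pd.Series([75.0, 98.0, 80.0, 110.0, 95.0, 5400.0])
mask = iqr_outlier_mask(toy_adr)

# Replace flagged values with the median, which is less sensitive to extremes than the mean
cleaned = toy_adr.mask(mask, toy_adr.median())
print(cleaned)
```

On a skewed column such as adr, the IQR rule may flag many more rows than a hand-picked cutoff, so the choice of `k` deserves a look before applying it to the real data.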

In [17]:
df_booking.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 119390 entries, 0 to 119389
Data columns (total 32 columns):
 #   Column                          Non-Null Count   Dtype  
---  ------                          --------------   -----  
 0   is_canceled                     119390 non-null  int64  
 1   hotel                           119390 non-null  object 
 2   lead_time                       119390 non-null  int64  
 3   arrival_date_year               119390 non-null  int64  
 4   arrival_date_month              119390 non-null  object 
 5   arrival_date_week_number        119390 non-null  int64  
 6   arrival_date_day_of_month       119390 non-null  int64  
 7   stays_in_weekend_nights         119390 non-null  int64  
 8   stays_in_week_nights            119390 non-null  int64  
 9   adults                          119390 non-null  int64  
 10  children                        119386 non-null  float64
 11  babies                          119390 non-null  int64  
 12  meal            

In [18]:
# Check for nulls 
 
print ("Top Columns having missing values") 
 
missmap = df_booking.isnull().sum().to_frame() 
missmap = missmap.sort_values(0, ascending = False) 
missmap.head()

Top Columns having missing values


Unnamed: 0,0
company,112593
agent,16340
country,488
children,4
reserved_room_type,0


In [19]:
# drop the company column as it is dominated by null values
df_booking.drop(columns=['company'], inplace=True)

# drop the rows with missing children values (only 4 rows)
df_booking.dropna(subset=['children'], inplace=True)

# fill missing countries with the most frequent country
df_booking['country'] = df_booking['country'].fillna(df_booking['country'].mode().values[0])

# fill missing agent IDs with the largest existing agent ID as a placeholder
df_booking['agent'] = df_booking['agent'].fillna(df_booking['agent'].max())

In [20]:
# Check for nulls 
 
print ("Top Columns having missing values") 
 
missmap = df_booking.isnull().sum().to_frame() 
missmap = missmap.sort_values(0, ascending = False) 
missmap.head()

Top Columns having missing values


Unnamed: 0,0
is_canceled,0
is_repeated_guest,0
reservation_status,0
total_of_special_requests,0
required_car_parking_spaces,0
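
Filling missing agent IDs with an ID that already exists makes those bookings indistinguishable from bookings that really went through that agent. One possible alternative, sketched below on made-up rows, keeps an explicit missing-value flag and fills with a sentinel that does not occur in the column. The column name `agent_missing` and the sentinel `0` are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Made-up rows standing in for the agent column
toy = pd.DataFrame({'agent': [304.0, np.nan, 240.0, np.nan]})

# Record which rows had no agent before filling anything
toy['agent_missing'] = toy['agent'].isna().astype(int)

# Assumed sentinel; verify it is not already used as a real agent ID
SENTINEL = 0
assert not (toy['agent'] == SENTINEL).any()

toy['agent'] = toy['agent'].fillna(SENTINEL).astype(int)
print(toy)
```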


## Exploring data types and values

In [21]:

columns=df_booking.columns

for col in columns:
    print('{} Possible Values:{}'.format(col,df_booking[col].unique()))


is_canceled Possible Values:[0 1]
hotel Possible Values:['Resort Hotel' 'City Hotel']
lead_time Possible Values:[342 737   7  13  14   0   9  85  75  23  35  68  18  37  12  72 127  78
  48  60  77  99 118  95  96  69  45  40  15  36  43  70  16 107  47 113
  90  50  93  76   3   1  10   5  17  51  71  63  62 101   2  81 368 364
 324  79  21 109 102   4  98  92  26  73 115  86  52  29  30  33  32   8
 100  44  80  97  64  39  34  27  82  94 110 111  84  66 104  28 258 112
  65  67  55  88  54 292  83 105 280 394  24 103 366 249  22  91  11 108
 106  31  87  41 304 117  59  53  58 116  42 321  38  56  49 317   6  57
  19  25 315 123  46  89  61 312 299 130  74 298 119  20 286 136 129 124
 327 131 460 140 114 139 122 137 126 120 128 135 150 143 151 132 125 157
 147 138 156 164 346 159 160 161 333 381 149 154 297 163 314 155 323 340
 356 142 328 144 336 248 302 175 344 382 146 170 166 338 167 310 148 165
 172 171 145 121 178 305 173 152 354 347 158 185 349 183 352 177 200 192
 361 207 174

In [22]:
df_booking.dtypes

is_canceled                         int64
hotel                              object
lead_time                           int64
arrival_date_year                   int64
arrival_date_month                 object
arrival_date_week_number            int64
arrival_date_day_of_month           int64
stays_in_weekend_nights             int64
stays_in_week_nights                int64
adults                              int64
children                          float64
babies                              int64
meal                               object
country                            object
market_segment                     object
distribution_channel               object
is_repeated_guest                   int64
previous_cancellations              int64
previous_bookings_not_canceled      int64
reserved_room_type                 object
assigned_room_type                 object
booking_changes                     int64
deposit_type                       object
agent                             

### Change types of arrival_date_month, children, and agent

In [23]:
df_booking['arrival_date_month']=df_booking['arrival_date_month'].astype(str)

df_booking['children']=df_booking['children'].astype(int)
df_booking['agent']=df_booking['agent'].astype(int)


In [24]:
df_booking.is_canceled.value_counts()


0    75166
1    44220
Name: is_canceled, dtype: int64

### Create new booking_date Column

In [25]:
# Collect the arrival year, month, and day fields so they can be combined into a single datetime column

df=pd.DataFrame({
    'year':df_booking.arrival_date_year,
    'month': df_booking.arrival_date_month,
    'day': df_booking.arrival_date_day_of_month
})
#df['month'] = pd.to_datetime(df.month, format='%B').dt.month
#df_booking["booking_date"] = pd.to_datetime(df)


In [26]:
#df_booking["booking_date"]
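
The commented-out lines above point towards building one datetime column from the year, month-name, and day columns. A minimal sketch of that step follows, on made-up rows. Since these columns describe the arrival, the result is called `arrival_date` here, which is an assumed name; the explicit month list avoids relying on locale-dependent month parsing.

```python
import pandas as pd

MONTHS = ['January', 'February', 'March', 'April', 'May', 'June', 'July',
          'August', 'September', 'October', 'November', 'December']
MONTH_NUMBER = {name: number for number, name in enumerate(MONTHS, start=1)}

# Made-up rows with the same column names as the dataset
toy = pd.DataFrame({
    'arrival_date_year': [2015, 2016],
    'arrival_date_month': ['July', 'February'],
    'arrival_date_day_of_month': [1, 29],
})

# pd.to_datetime can assemble dates from columns named year, month, and day
date_parts = pd.DataFrame({
    'year': toy['arrival_date_year'],
    'month': toy['arrival_date_month'].map(MONTH_NUMBER),
    'day': toy['arrival_date_day_of_month'],
})

# errors='coerce' turns impossible dates into NaT instead of raising
toy['arrival_date'] = pd.to_datetime(date_parts, errors='coerce')
print(toy['arrival_date'])
```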

### Identify hotel type with highest number of cancellation

In [27]:
sns.set_palette("ch:s=.25,rot=-.25")
plt.figure(figsize = (8,6))
sns.countplot(x = 'hotel', data = df_booking, hue = 'is_canceled')
plt.legend(labels=['Confirmed', 'Canceled'])
plt.title("Number of Reservation Status by Hotel type", weight='bold')
plt.xlabel('Hotel type')
plt.ylabel('Number of Reservation')
plt.show()

### Identify year with highest number of confirmation/cancellation

In [28]:
sns.set_palette("ch:s=.25,rot=-.25")
plt.figure(figsize = (8,6))
sns.countplot(x = 'arrival_date_year', data = df_booking, hue = 'is_canceled')

plt.legend(labels=['Confirmed', 'Canceled'])
plt.title("Number of Reservation Status by Year", weight='bold')
plt.xlabel('Arrival Year')
plt.ylabel('Number of Reservation')
plt.xticks(rotation=90)
plt.show()

### Identify months with highest number of confirmation/cancellation

In [29]:
sns.set_palette("ch:s=.25,rot=-.25")
plt.figure(figsize = (8,6))
sns.countplot(x = 'arrival_date_month', data = df_booking, hue = 'is_canceled')

plt.legend(labels=['Confirmed', 'Canceled'])
plt.title("Number of Reservation Status by Month", weight='bold')
plt.xlabel('Arrival Month')
plt.ylabel('Number of Reservation')
plt.xticks(rotation=90)
plt.show()

In [30]:
# number of bookings and number of cancelled bookings per arrival month
booking_month=df_booking.groupby('arrival_date_month')[['is_canceled']].count()
canceled_month=df_booking.groupby('arrival_date_month')[['is_canceled']].sum()

# put the months in calendar order
months=['January', 'February', 'March', 'April', 'May', 'June', 'July', 'August', 'September', 'October', 'November', 'December']
booking_month=booking_month.reindex(months)
canceled_month=canceled_month.reindex(months)

# each month's share of all bookings and of all cancellations
booking_month['booking_percentage'] = (booking_month['is_canceled']/booking_month['is_canceled'].sum() )*100
canceled_month['canceled_percentage'] = (canceled_month['is_canceled']/canceled_month['is_canceled'].sum() )*100

booking_month.reset_index(level=0, inplace=True)
canceled_month.reset_index(level=0, inplace=True)

In [31]:
booking_month

Unnamed: 0,arrival_date_month,is_canceled,booking_percentage
0,January,5929,4.966244
1,February,8068,6.757911
2,March,9794,8.203642
3,April,11089,9.288359
4,May,11791,9.876367
5,June,10939,9.162716
6,July,12661,10.605096
7,August,13873,11.62029
8,September,10508,8.801702
9,October,11160,9.34783


In [32]:
canceled_month

Unnamed: 0,arrival_date_month,is_canceled,canceled_percentage
0,January,1807,4.086386
1,February,2696,6.096789
2,March,3149,7.121212
3,April,4524,10.230665
4,May,4677,10.576662
5,June,4535,10.25554
6,July,4742,10.723654
7,August,5235,11.838535
8,September,4116,9.308005
9,October,4246,9.60199


In [33]:
sns.set_palette("ch:s=.25,rot=-.25")
plt.figure(figsize = (12,6))
plt.xticks(rotation=90)
sns.lineplot(x = 'arrival_date_month',y='booking_percentage', data = booking_month,sort=False)
sns.lineplot(x ='arrival_date_month',y='canceled_percentage', data = canceled_month,sort=False)
plt.legend(labels=['Booking', 'Cancellation'])
plt.title("Percentage of Booking/Cancellation per Month", weight='bold')
plt.xlabel('Arrival Month')
plt.ylabel('Percentage of Reservation')

plt.show()
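
The plot above compares each month's share of all bookings with its share of all cancellations. A more direct view is the cancellation rate within each month, that is, cancellations divided by bookings. The sketch below computes it on made-up rows; applying the same `groupby` to `df_booking` is left as an assumption about how it would be used.

```python
import pandas as pd

# Made-up rows with the same column names as the dataset
toy = pd.DataFrame({
    'arrival_date_month': ['January', 'January', 'July', 'July', 'July', 'July'],
    'is_canceled':        [0,         1,         1,      1,      0,      1],
})

# The mean of a 0/1 target is the fraction of cancelled bookings in each group
monthly = (toy.groupby('arrival_date_month')['is_canceled']
              .agg(bookings='count', cancellations='sum', cancel_rate='mean'))
print(monthly)
```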


In [34]:
# Histograms of lead_time by outcome, with shared bin edges so the per-bin ratio is valid
fig, ax = plt.subplots(3, 1, figsize=(10, 12))  # 3 Rows, 1 Col

bins = np.histogram_bin_edges(df_booking['lead_time'], bins=10)
count0, _, _ = ax[0].hist(df_booking.loc[(df_booking['is_canceled']==0),'lead_time'], bins=bins)
count1, _, _ = ax[1].hist(df_booking.loc[(df_booking['is_canceled']==1),'lead_time'], bins=bins)
# share of cancelled bookings in each bin, plotted at the bin centers
ax[2].plot((bins[:-1]+bins[1:])/2,count1/(count1 + count0));
plt.title("Lead Time Combined Histograms ",)

Text(0.5, 1.0, 'Lead Time Combined Histograms ')

In [35]:
# Histograms of adr by outcome, with shared bin edges so the per-bin ratio is valid
fig, ax = plt.subplots(3, 1, figsize=(10, 12))  # 3 Rows, 1 Col

bins = np.histogram_bin_edges(df_booking['adr'], bins=10)
count0, _, _ = ax[0].hist(df_booking.loc[(df_booking['is_canceled']==0),'adr'], bins=bins)
count1, _, _ = ax[1].hist(df_booking.loc[(df_booking['is_canceled']==1),'adr'], bins=bins)
# share of cancelled bookings in each bin, plotted at the bin centers
ax[2].plot((bins[:-1]+bins[1:])/2,count1/(count1 + count0));
plt.title("Adr Combined Histograms ",)

Text(0.5, 1.0, 'Adr Combined Histograms ')
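
The third panel in the two figures above is meant to show the share of cancelled bookings in each bin. The same quantity can be produced as a table by cutting the feature into intervals and averaging the target per interval. A small sketch with made-up values follows; the bin edges are arbitrary assumptions.

```python
import pandas as pd

# Made-up rows with the same column names as the dataset
toy = pd.DataFrame({
    'lead_time':   [3, 10, 45, 60, 150, 200, 320, 400],
    'is_canceled': [0, 0,  0,  1,  1,   0,   1,   1],
})

# Assumed bin edges in days; include_lowest keeps a lead time of 0 in the first bin
edges = [0, 30, 100, 250, 500]
toy['lead_time_bin'] = pd.cut(toy['lead_time'], bins=edges, include_lowest=True)

# Mean of the 0/1 target per interval = cancellation rate per interval
rate_by_bin = toy.groupby('lead_time_bin', observed=False)['is_canceled'].mean()
print(rate_by_bin)
```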

In [36]:
sns.set_palette("ch:s=.25,rot=-.25")
plt.figure(figsize = (8,6))
sns.countplot(x = 'market_segment', data = df_booking, hue = 'is_canceled',order = df_booking['market_segment'].value_counts().index)

plt.legend(labels=['Confirmed', 'Canceled'])
plt.title("Number of Reservation Status by Market Segment", weight='bold')
plt.xlabel('Market Segment')
plt.ylabel('Number of Reservation')
plt.xticks(rotation=90)
plt.show()

In [37]:
sns.set_palette("ch:s=.25,rot=-.25")
plt.figure(figsize = (8,6))
sns.countplot(x = 'distribution_channel', data = df_booking, hue = 'is_canceled', order = df_booking['distribution_channel'].value_counts().index)

plt.legend(labels=['Confirmed', 'Canceled'])
plt.title("Number of Reservation Status by Distribution Channel", weight='bold')
plt.xlabel('Distribution Channel')
plt.ylabel('Number of Reservation')
plt.xticks(rotation=90)
plt.show()

In [38]:

plt.figure(figsize=(8,6))
topcountries=df_booking['country'].value_counts().iloc[:10]
sns.countplot(x='country', data=df_booking, 
              order=topcountries.index, palette="ch:s=.25,rot=-.25")
plt.title('Top Countries of Reservation', weight='bold')
plt.xlabel('Country', fontsize=12)
plt.ylabel('Number of Reservation', fontsize=12)

Text(0, 0.5, 'Number of Reservation')

In [39]:
sns.set_palette("ch:s=.25,rot=-.25")
plt.figure(figsize = (8,6))
sns.countplot(x = 'is_repeated_guest', data = df_booking, hue = 'is_canceled')

plt.legend(labels=['Confirmed', 'Canceled'])
plt.title("Number of Reservation Status by Type of Guest", weight='bold')
plt.xlabel('Type of Guest')
plt.ylabel('Number of Reservation')
plt.xticks([0,1],['New','Repeated'])

plt.show()

In [40]:
sns.set_palette("ch:s=.25,rot=-.25")
plt.figure(figsize = (8,6))
sns.countplot(x = 'total_of_special_requests', data = df_booking, hue = 'is_canceled')

plt.legend(labels=['Confirmed', 'Canceled'])
plt.title("Number of Reservation Status by Number of Special Requests", weight='bold')
plt.xlabel('Number of Special Requests')
plt.ylabel('Number of Reservation')

plt.show()

In [41]:
d=df_booking.copy()

### Compute and display correlation matrix

In [42]:


#sns.set_theme(style="white",font_scale = 1)

# Pearson correlation between the numeric columns only
corr = d.select_dtypes(include='number').corr()


f, ax = plt.subplots(figsize=(20, 20))

cmap = sns.diverging_palette(230, 90, as_cmap=True)


heat_plot =sns.heatmap(corr, cmap=cmap, vmin=-1, vmax=1, annot=True, center=0,
             linewidths=1, )

fig = heat_plot.get_figure()
plt.title('Heatmap of Hotel Demand Dataset',fontsize = 25,weight='bold')
plt.xticks(weight='bold',fontsize = 20,color='green')
plt.yticks( weight='bold',fontsize = 20,color='green')
plt.tight_layout() 


fig.savefig('heat_plot.jpg') 
 

In [43]:
# absolute correlation of each numeric column with the target
cor = d.select_dtypes(include='number').corr().abs()
cor_mat = cor["is_canceled"].sort_values(ascending=True)
cor_mat*100

stays_in_weekend_nights             0.178338
children                            0.504779
arrival_date_day_of_month           0.608413
arrival_date_week_number            0.813199
arrival_date_year                   1.673249
stays_in_week_nights                2.477143
babies                              3.248845
adr                                 4.877698
days_in_waiting_list                5.419315
previous_bookings_not_canceled      5.735537
adults                              5.999027
is_repeated_guest                   8.478795
previous_cancellations             11.014047
agent                              12.823315
booking_changes                    14.437057
required_car_parking_spaces        19.549246
total_of_special_requests          23.470590
lead_time                          29.317738
is_canceled                       100.000000
Name: is_canceled, dtype: float64


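The correlation matrix covers the numeric columns only. Among them, lead_time has the strongest absolute correlation with the target (about 0.29), followed by total_of_special_requests (about 0.23) and required_car_parking_spaces (about 0.20). Categorical columns such as reservation_status, country, and deposit_type are not part of this matrix, so their relationship with the target has to be checked separately. Checking the values of reservation_status shows that it records the final outcome of the booking, so it is almost the same as the target and should be excluded from further analysis.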
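
The matrix returned by `corr()` is a Pearson correlation over numeric columns, so the association between a categorical feature such as deposit_type and the target needs a different measure. One common choice is Cramér's V, which is derived from the chi-squared statistic of the contingency table and ranges from 0 (no association) to 1 (perfect association). The sketch below is a plain version without bias correction; the helper name `cramers_v` and the toy data are assumptions for illustration.

```python
import numpy as np
import pandas as pd

def cramers_v(x: pd.Series, y: pd.Series) -> float:
    """Cramér's V between two categorical series (no bias correction).

    Hypothetical helper for illustration only.
    """
    observed = pd.crosstab(x, y).to_numpy(dtype=float)
    n = observed.sum()
    # Expected counts under independence: row total * column total / n
    expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / n
    chi2 = ((observed - expected) ** 2 / expected).sum()
    min_dim = min(observed.shape) - 1
    if min_dim == 0:
        return 0.0
    return float(np.sqrt(chi2 / (n * min_dim)))

# Made-up examples: one feature that determines the target, one unrelated to it
target = pd.Series([0, 0, 1, 1])
perfect = pd.Series(['No Deposit', 'No Deposit', 'Non Refund', 'Non Refund'])
unrelated = pd.Series(['A', 'B', 'A', 'B'])

v_perfect = cramers_v(perfect, target)
v_unrelated = cramers_v(unrelated, target)
print(v_perfect, v_unrelated)
```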

### Feature Engineering

In [44]:
d.columns

Index(['is_canceled', 'hotel', 'lead_time', 'arrival_date_year',
       'arrival_date_month', 'arrival_date_week_number',
       'arrival_date_day_of_month', 'stays_in_weekend_nights',
       'stays_in_week_nights', 'adults', 'children', 'babies', 'meal',
       'country', 'market_segment', 'distribution_channel',
       'is_repeated_guest', 'previous_cancellations',
       'previous_bookings_not_canceled', 'reserved_room_type',
       'assigned_room_type', 'booking_changes', 'deposit_type', 'agent',
       'days_in_waiting_list', 'customer_type', 'adr',
       'required_car_parking_spaces', 'total_of_special_requests',
       'reservation_status', 'reservation_status_date'],
      dtype='object')

### Add Features

In [45]:
## Make a new column which contains 1 if the guest got the same room type that was reserved
d['got_required__room'] = 0
d.loc[ d['reserved_room_type'] == d['assigned_room_type'] , 'got_required__room'] = 1


In [46]:


sns.set_palette("ch:s=.25,rot=-.25")
plt.figure(figsize = (8,6))
sns.countplot(x = 'got_required__room', data = d, hue = 'is_canceled')

plt.legend(labels=['Confirmed', 'Canceled'])
plt.title("Number of Reservations that Got the Required Room ", weight='bold')
plt.xlabel('Got Required Room')
plt.ylabel('Number of Reservations')
plt.xticks([0,1],['NO','Yes'])


plt.show()

### Change Features

In [47]:
d['babies'] = np.where(d['babies']>= 1, 1, 0)


In [48]:

sns.set_palette("ch:s=.25,rot=-.25")
plt.figure(figsize = (8,6))
sns.countplot(x = 'babies', data = d, hue = 'is_canceled')

plt.legend(labels=['Confirmed', 'Canceled'])
plt.title("Number of Reservations with Babies", weight='bold')
plt.xlabel('With Babies')
plt.ylabel('Number of Reservations')
plt.xticks([0,1],['NO','Yes'])


plt.show()

In [49]:
d['children'] = np.where(d['children']>= 1, 1, 0)

In [50]:
sns.set_palette("ch:s=.25,rot=-.25")
plt.figure(figsize = (8,6))
sns.countplot(x = 'children', data = d, hue = 'is_canceled')

plt.legend(labels=['Confirmed', 'Canceled'])
plt.title("Number of Reservations with Children ", weight='bold')
plt.xlabel('With Children')
plt.ylabel('Number of Reservations')
plt.xticks([0,1],['NO','Yes'])


plt.show()

In [51]:
d['previous_cancellations'] = np.where(d['previous_cancellations']>= 1, 1, 0)

In [52]:
sns.set_palette("ch:s=.25,rot=-.25")
plt.figure(figsize = (8,6))
sns.countplot(x = 'previous_cancellations', data = d, hue = 'is_canceled')

plt.legend(labels=['Confirmed', 'Canceled'])
plt.title("Number of Reservations with Previous Cancellations", weight='bold')
plt.xlabel('Has Previous Cancellations')
plt.ylabel('Number of Reservations')
plt.xticks([0,1],['NO','Yes'])


plt.show()

In [53]:
d['booking_changes'] = np.where(d['booking_changes']>= 1, 1, 0)

In [54]:
sns.set_palette("ch:s=.25,rot=-.25")
plt.figure(figsize = (8,6))
sns.countplot(x = 'booking_changes', data = d, hue = 'is_canceled')

plt.legend(labels=['Confirmed', 'Canceled'])
plt.title("Number of Reservations with Booking Changes", weight='bold')
plt.xlabel('Has Booking Changes')
plt.ylabel('Number of Reservations')
plt.xticks([0,1],['NO','Yes'])


plt.show()

In [55]:
d.iloc[:,25:]

Unnamed: 0,customer_type,adr,required_car_parking_spaces,total_of_special_requests,reservation_status,reservation_status_date,got_required__room
0,Transient,0.00,0,0,Check-Out,2015-07-01,1
1,Transient,0.00,0,0,Check-Out,2015-07-01,1
2,Transient,75.00,0,0,Check-Out,2015-07-02,0
3,Transient,75.00,0,0,Check-Out,2015-07-02,1
4,Transient,98.00,0,1,Check-Out,2015-07-03,1
...,...,...,...,...,...,...,...
119385,Transient,96.14,0,0,Check-Out,2017-09-06,1
119386,Transient,225.43,0,2,Check-Out,2017-09-07,1
119387,Transient,157.71,0,4,Check-Out,2017-09-07,1
119388,Transient,104.40,0,0,Check-Out,2017-09-07,1


In [56]:
d['required_car_parking_spaces'] = np.where(d['required_car_parking_spaces']>= 1, 1, 0)

In [57]:
sns.set_palette("ch:s=.25,rot=-.25")
plt.figure(figsize = (8,6))
sns.countplot(x = 'required_car_parking_spaces', data = d, hue = 'is_canceled')

plt.legend(labels=['Confirmed', 'Canceled'])
plt.title("Number of Reservations with Required Car Parking Spaces", weight='bold')
plt.xlabel('Has Required Car Parking Spaces')
plt.ylabel('Number of Reservations')
plt.xticks([0,1],['NO','Yes'])


plt.show()

In [58]:
d['total_of_special_requests'] = np.where(d['total_of_special_requests']>= 1, 1, 0)

In [59]:
sns.set_palette("ch:s=.25,rot=-.25")
plt.figure(figsize = (8,6))

sns.countplot(x = 'total_of_special_requests', data = d, hue = 'is_canceled')

plt.legend(labels=['Confirmed', 'Canceled'])
plt.title("Number of Reservations with Special Requests", weight='bold')
plt.xlabel('Has Special Requests')
plt.ylabel('Number of Reservations')
plt.xticks([0,1],['NO','Yes'])


plt.show()

### Feature Selection based on correlation matrix

In [60]:
# reservation_status carries almost the same information as is_canceled, so drop both reservation_status and reservation_status_date
d.drop(['reservation_status','reservation_status_date'],axis=1,inplace=True)
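
One way to confirm that a column leaks the target is to check whether every value of the column co-occurs with a single value of the target. A minimal sketch on made-up rows follows; the helper `determines_target` is hypothetical, and the toy status values are assumed to mirror the categories in the dataset.

```python
import pandas as pd

def determines_target(frame: pd.DataFrame, feature: str, target: str) -> bool:
    """True if each distinct value of `feature` co-occurs with exactly one value of `target`."""
    return bool((frame.groupby(feature)[target].nunique() == 1).all())

# Made-up rows with the same column names as the dataset
toy = pd.DataFrame({
    'reservation_status': ['Check-Out', 'Canceled', 'No-Show', 'Check-Out', 'Canceled'],
    'deposit_type':       ['No Deposit', 'No Deposit', 'Non Refund', 'No Deposit', 'Non Refund'],
    'is_canceled':        [0, 1, 1, 0, 1],
})

leaky = determines_target(toy, 'reservation_status', 'is_canceled')
not_leaky = determines_target(toy, 'deposit_type', 'is_canceled')
print(leaky, not_leaky)
```

A feature that passes this check is only known after the outcome, so keeping it would let the model read the answer instead of predicting it.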

In [61]:
# since we added got_required__room, we can remove reserved_room_type and assigned_room_type
d.drop(['reserved_room_type','assigned_room_type'],axis=1,inplace=True)

In [62]:
d.drop(['children','babies','meal','stays_in_weekend_nights','stays_in_week_nights'],axis=1,inplace=True)

In [63]:
#d.drop(['arrival_date_year','arrival_date_month','arrival_date_day_of_month','arrival_date_week_number'],axis=1,inplace=True)

In [64]:
d.shape

(119386, 23)

## Build Model

### Stratified split of the data into 80% Training and 20% Testing

In [65]:
X_train, X_test, y_train, y_test = train_test_split(d.iloc[:, 1:], d.iloc[:, 0], 
                                                    test_size = 0.2, random_state=42, stratify=d.iloc[:, 0])
train_df = X_train.copy()
train_df['is_canceled'] = y_train
X_train.head()

Unnamed: 0,hotel,lead_time,arrival_date_year,arrival_date_month,arrival_date_week_number,arrival_date_day_of_month,adults,country,market_segment,distribution_channel,...,previous_bookings_not_canceled,booking_changes,deposit_type,agent,days_in_waiting_list,customer_type,adr,required_car_parking_spaces,total_of_special_requests,got_required__room
5734,Resort Hotel,236,2016,May,20,12,2,PRT,Groups,TA/TO,...,0,0,No Deposit,315,0,Transient-Party,48.0,0,0,1
102208,City Hotel,1,2016,November,48,20,2,IRL,Online TA,TA/TO,...,0,0,No Deposit,9,0,Transient,91.38,0,1,1
24582,Resort Hotel,97,2016,May,22,22,2,CN,Online TA,TA/TO,...,0,0,No Deposit,240,0,Transient,66.0,1,1,1
89383,City Hotel,28,2016,May,21,19,2,PRT,Corporate,Corporate,...,0,0,No Deposit,535,0,Transient-Party,100.0,0,0,1
47127,City Hotel,20,2016,February,7,8,1,AGO,Online TA,TA/TO,...,0,0,No Deposit,9,0,Transient,86.0,0,1,1
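
Stratifying on the target keeps the share of cancelled bookings roughly equal in the training and test sets. The sketch below checks this on a synthetic target whose positive share (37%) is chosen to resemble the dataset; the arrays and variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic target: 37 cancelled and 63 confirmed bookings
y = np.array([1] * 37 + [0] * 63)
X = np.arange(len(y)).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

# The three positive rates should be close to each other
print(y.mean(), y_tr.mean(), y_te.mean())
```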


### Encode categorical features as numbers

In [66]:
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, MinMaxScaler
from sklearn.pipeline import Pipeline

# numeric features
numeric_features = X_train.select_dtypes(include='number').columns.tolist()
print(numeric_features)

# categorical features
categorical_features = X_train.select_dtypes(exclude='number').columns.tolist()
print(categorical_features)

# build pipeline for numeric features
numeric_pipeline = Pipeline(steps=[
    ('impute', SimpleImputer(strategy='mean')),
    ('scale', MinMaxScaler())])
# build pipeline for categorical features
categorical_pipeline = Pipeline(steps=[
    ('impute', SimpleImputer(strategy='most_frequent')),
    # dense output; in scikit-learn >= 1.2 this argument is named sparse_output
    ('onehot', OneHotEncoder(handle_unknown='ignore', sparse=False))])

from sklearn.compose import ColumnTransformer

data_pipeline = ColumnTransformer(transformers=[
    ('numeric', numeric_pipeline, numeric_features),
    ('categorical', categorical_pipeline, categorical_features)
])

X_train_transformed=data_pipeline.fit_transform(X_train)
X_test_transformed=data_pipeline.transform(X_test)   

['lead_time', 'arrival_date_year', 'arrival_date_week_number', 'arrival_date_day_of_month', 'adults', 'is_repeated_guest', 'previous_cancellations', 'previous_bookings_not_canceled', 'booking_changes', 'agent', 'days_in_waiting_list', 'adr', 'required_car_parking_spaces', 'total_of_special_requests', 'got_required__room']
['hotel', 'arrival_date_month', 'country', 'market_segment', 'distribution_channel', 'deposit_type', 'customer_type']
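
In the cell above, the imputers, scaler, and encoder are fitted on the whole training set before cross-validation, so each validation fold has already influenced the preprocessing. A common way to avoid this is to place the preprocessing and the classifier in one `Pipeline` and cross-validate that object, so every fold refits the preprocessing on its own training part. The sketch below uses synthetic data; the column values, the target rule, and `max_iter=1000` are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

rng = np.random.default_rng(0)
n = 120
X_toy = pd.DataFrame({
    'lead_time': rng.integers(0, 300, size=n),
    'deposit_type': rng.choice(['No Deposit', 'Non Refund'], size=n),
})
# Synthetic target: long lead times are labelled as cancellations
y_toy = (X_toy['lead_time'] > 150).astype(int)

preprocess = ColumnTransformer(transformers=[
    ('numeric', MinMaxScaler(), ['lead_time']),
    ('categorical', OneHotEncoder(handle_unknown='ignore'), ['deposit_type']),
])

# Preprocessing and model are fitted together inside each cross-validation fold
model = Pipeline(steps=[
    ('preprocess', preprocess),
    ('classifier', LogisticRegression(max_iter=1000)),
])

cv_scores = cross_validate(model, X_toy, y_toy, cv=5,
                           scoring=['precision', 'recall', 'f1_macro'])
print(cv_scores['test_f1_macro'].mean())
```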


## Training, Fine-Tuning, and Evaluation of Models Using K-Fold Cross-Validation

### Train and fine-tune logistic regression classifier

In [67]:
scoring = ['precision', 'recall', 'f1_macro']


In [131]:

lr = LogisticRegression()

scores = cross_validate (lr, X_train_transformed, y_train, cv=5,scoring=scoring, verbose=2,n_jobs = -1)
print("%0.2f F_macro , %0.2f Precision,  %0.2f Recall" % (scores['test_f1_macro'].mean(), scores['test_precision'].mean(),scores['test_recall'].mean()))


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 12 concurrent workers.
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(

0.80 F_macro , 0.81 Precision,  0.68 Recall


[Parallel(n_jobs=-1)]: Done   5 out of   5 | elapsed:   10.4s finished
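
The repeated lbfgs message means the solver stopped at its default limit of 100 iterations before converging. As the warning itself suggests, one option is to raise `max_iter`; the fitted attribute `n_iter_` then shows how many iterations were actually used. The sketch below runs on synthetic data, and the value 1000 is an assumption that has not been tuned for this dataset.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_toy = rng.random((200, 5))
# Synthetic target that depends on the first two features
y_toy = (X_toy[:, 0] + X_toy[:, 1] > 1.0).astype(int)

clf = LogisticRegression(max_iter=1000)
clf.fit(X_toy, y_toy)

# n_iter_ holds the number of iterations the solver actually ran
iterations_used = int(clf.n_iter_[0])
converged = iterations_used < clf.max_iter
print(iterations_used, converged)
```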


In [235]:

# grid of inverse regularization strengths to try
param_grid = {'C':[.01,.1,1,10,100,1000,]}

lr = LogisticRegression(random_state=42,n_jobs=-1)
lr = GridSearchCV(estimator = lr, param_grid = param_grid, cv = 5, verbose=2,n_jobs = -1)
# Fit the grid search (candidates are ranked by the estimator's default score, accuracy)
lr.fit(X_train_transformed, y_train)

Fitting 5 folds for each of 6 candidates, totalling 30 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 12 concurrent workers.
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
[Parallel(n_jobs=-1)]: Done  23 out of  30 | elapsed:   47.6s remaining:   14.5s

GridSearchCV(cv=5, estimator=LogisticRegression(n_jobs=-1, random_state=42),
             n_jobs=-1, param_grid={'C': [0.01, 0.1, 1, 10, 100, 1000]},
             verbose=2)

In [236]:
lr.best_estimator_

LogisticRegression(C=1000, n_jobs=-1, random_state=42)

In [199]:

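# NOTE: the grid search above selected C=1000; this cell keeps C=1.0,
# which is the setting that produced the scores reported below.
lr = LogisticRegression(C=1.0, class_weight=None, dual=False,
                        fit_intercept=True, intercept_scaling=1,
                        l1_ratio=None, max_iter=100, multi_class='auto',
                        n_jobs=-1, penalty='l2', random_state=42,
                        solver='lbfgs', tol=0.0001, verbose=0,
                        warm_start=False)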

scores = cross_validate (lr, X_train_transformed, y_train, cv=5,scoring=scoring, verbose=2,n_jobs = -1)
print("%0.2f F_macro , %0.2f Precision,  %0.2f Recall" % (scores['test_f1_macro'].mean(), scores['test_precision'].mean(),scores['test_recall'].mean()))


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 12 concurrent workers.
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(

0.80 F_macro , 0.81 Precision,  0.68 Recall


[Parallel(n_jobs=-1)]: Done   5 out of   5 | elapsed:   12.5s finished


In [191]:

lr.fit(X_train_transformed, y_train)

trainscore=lr.score(X_train_transformed, y_train)
testscore=lr.score(X_test_transformed, y_test)
print('Training Accuracy:{},Testing Accuracy:{}'.format(trainscore,testscore))
ytest_predict = lr.predict(X_test_transformed) 
print(metrics.classification_report(y_test,ytest_predict))

Training Accuracy:0.8232189973614775,Testing Accuracy:0.822388809783064
              precision    recall  f1-score   support

           0       0.83      0.91      0.87     15034
           1       0.81      0.68      0.74      8844

    accuracy                           0.82     23878
   macro avg       0.82      0.79      0.80     23878
weighted avg       0.82      0.82      0.82     23878




lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
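
The repeated `lbfgs failed to converge` warnings above point to two remedies: raising `max_iter` and scaling the inputs. The following is a minimal sketch of both on synthetic data. `make_classification` stands in for `X_train_transformed`, and `max_iter=1000` is an illustrative assumption rather than a value taken from this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# synthetic stand-in for the booking features (assumption, not the real data)
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=42)
X_demo[:, 0] *= 1000.0  # put one feature on a much larger scale

max_iter = 1000  # illustrative budget, larger than the default of 100
clf = make_pipeline(StandardScaler(),
                    LogisticRegression(max_iter=max_iter, random_state=42))
clf.fit(X_demo, y_demo)

# n_iter_ holds the iterations actually used; a value below max_iter
# means lbfgs stopped because it converged, not because it ran out of budget
n_iter_used = int(clf.named_steps['logisticregression'].n_iter_[0])
print(n_iter_used, n_iter_used < max_iter)
```

Whether this changes the scores on the booking data is not something the notebook establishes; the sketch only shows how to check that the solver converged.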



### Training and Fine Tuning of K Neighbors Classifier

In [237]:
knn = KNeighborsClassifier(n_jobs=-1)

scores = cross_validate (knn, X_train_transformed, y_train, cv=5,scoring=scoring, verbose=2,n_jobs = -1)
print("%0.2f F_macro , %0.2f Precision,  %0.2f Recall" % (scores['test_f1_macro'].mean(), scores['test_precision'].mean(),scores['test_recall'].mean()))


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 12 concurrent workers.
[Parallel(n_jobs=-1)]: Done   3 out of   5 | elapsed:  2.5min remaining:  1.7min


0.82 F_macro , 0.79 Precision,  0.76 Recall


[Parallel(n_jobs=-1)]: Done   5 out of   5 | elapsed:  2.6min finished



In [238]:
from sklearn.neighbors import KNeighborsClassifier


# search space: leaf_size affects tree construction and query speed, not predictions
random_grid = {'leaf_size': list(range(1, 30)), 'n_neighbors': list(range(1, 30)), 'p': [1, 2]}
print(random_grid)

knn = KNeighborsClassifier(n_jobs=-1)
# randomized search: samples 10 candidates (the default n_iter) with 5-fold CV
knn = RandomizedSearchCV(estimator=knn, param_distributions=random_grid, cv=5, verbose=2, random_state=42, n_jobs=-1)
knn.fit(X_train_transformed, y_train)

{'leaf_size': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29], 'n_neighbors': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29], 'p': [1, 2]}
Fitting 5 folds for each of 10 candidates, totalling 50 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 12 concurrent workers.
[Parallel(n_jobs=-1)]: Done  17 tasks      | elapsed: 16.8min
[Parallel(n_jobs=-1)]: Done  50 out of  50 | elapsed: 33.5min finished


RandomizedSearchCV(cv=5, estimator=KNeighborsClassifier(n_jobs=-1), n_jobs=-1,
                   param_distributions={'leaf_size': [1, 2, 3, 4, 5, 6, 7, 8, 9,
                                                      10, 11, 12, 13, 14, 15,
                                                      16, 17, 18, 19, 20, 21,
                                                      22, 23, 24, 25, 26, 27,
                                                      28, 29],
                                        'n_neighbors': [1, 2, 3, 4, 5, 6, 7, 8,
                                                        9, 10, 11, 12, 13, 14,
                                                        15, 16, 17, 18, 19, 20,
                                                        21, 22, 23, 24, 25, 26,
                                                        27, 28, 29],
                                        'p': [1, 2]},
                   random_state=42, verbose=2)

In [240]:
knn.best_estimator_

KNeighborsClassifier(leaf_size=9, n_jobs=-1, n_neighbors=2, p=1)
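
One way to avoid retyping the selected hyperparameters by hand is to unpack `best_params_` into a fresh estimator. The sketch below does this on synthetic data with a deliberately small search. The data, the reduced search space, and the names `search` and `knn_best` are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from sklearn.neighbors import KNeighborsClassifier

# synthetic stand-in for the training data (assumption)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=42)

# reduced, illustrative search space
param_distributions = {'n_neighbors': list(range(1, 15)), 'p': [1, 2]}
search = RandomizedSearchCV(KNeighborsClassifier(), param_distributions,
                            n_iter=5, cv=3, random_state=42)
search.fit(X_demo, y_demo)

# rebuild the estimator directly from the selected parameters,
# so every value matches what the search reported
knn_best = KNeighborsClassifier(**search.best_params_)
print(search.best_params_)
```

With the default `refit=True`, `search.best_estimator_` is also already fitted on the full training data and can be used as is.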

In [241]:
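# NOTE: the search above selected p=1, but this cell sets p=2;
# the scores reported below were produced with p=2.
knn = KNeighborsClassifier(algorithm='auto', leaf_size=9, metric='minkowski',
                     metric_params=None, n_jobs=-1, n_neighbors=2, p=2,
                     weights='uniform')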

scores = cross_validate (knn, X_train_transformed, y_train, cv=5,scoring=scoring, verbose=2,n_jobs = -1)
print("%0.2f F_macro , %0.2f Precision,  %0.2f Recall" % (scores['test_f1_macro'].mean(), scores['test_precision'].mean(),scores['test_recall'].mean()))


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 12 concurrent workers.
[Parallel(n_jobs=-1)]: Done   3 out of   5 | elapsed:  1.9min remaining:  1.3min


0.82 F_macro , 0.87 Precision,  0.68 Recall


[Parallel(n_jobs=-1)]: Done   5 out of   5 | elapsed:  1.9min finished


In [242]:
knn.fit(X_train_transformed, y_train)

trainscore=knn.score(X_train_transformed, y_train)
testscore=knn.score(X_test_transformed, y_test)
print('Training Accuracy:{},Testing Accuracy:{}'.format(trainscore,testscore))
ytest_predict = knn.predict(X_test_transformed) 
print(metrics.classification_report(y_test,ytest_predict))

Training Accuracy:0.9260690203961972,Testing Accuracy:0.8493173632632549
              precision    recall  f1-score   support

           0       0.84      0.94      0.89     15034
           1       0.87      0.69      0.77      8844

    accuracy                           0.85     23878
   macro avg       0.86      0.82      0.83     23878
weighted avg       0.85      0.85      0.84     23878


### Training of Naive Bayes Classifier

In [243]:
nb = BernoulliNB()

scores = cross_validate (nb, X_train_transformed, y_train, cv=5,scoring=scoring, verbose=2,n_jobs = -1)
print("%0.2f F_macro , %0.2f Precision,  %0.2f Recall" % (scores['test_f1_macro'].mean(), scores['test_precision'].mean(),scores['test_recall'].mean()))


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 12 concurrent workers.
[Parallel(n_jobs=-1)]: Done   3 out of   5 | elapsed:    8.8s remaining:    5.9s


0.74 F_macro , 0.80 Precision,  0.54 Recall


[Parallel(n_jobs=-1)]: Done   5 out of   5 | elapsed:    9.2s finished


In [135]:
from sklearn.naive_bayes import BernoulliNB


# empty grid: GridSearchCV evaluates only the default BernoulliNB settings
random_grid = {}
print(random_grid)

nb = BernoulliNB()
nb = GridSearchCV(estimator=nb, param_grid=random_grid, cv=5, verbose=2, n_jobs=-1)
nb.fit(X_train_transformed, y_train)

{}
Fitting 5 folds for each of 1 candidates, totalling 5 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 12 concurrent workers.
[Parallel(n_jobs=-1)]: Done   3 out of   5 | elapsed:    0.8s remaining:    0.5s
[Parallel(n_jobs=-1)]: Done   5 out of   5 | elapsed:    0.8s finished


GridSearchCV(cv=5, estimator=BernoulliNB(), n_jobs=-1, param_grid={}, verbose=2)

In [136]:
nb.best_estimator_

BernoulliNB()
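
Because the grid above is empty, the search can only return the default estimator. The following sketch shows what a non-empty grid could look like. `alpha` and `binarize` are real `BernoulliNB` parameters, but the candidate values and the synthetic data are assumptions chosen only for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import BernoulliNB

# synthetic stand-in for the training data (assumption)
X_demo, y_demo = make_classification(n_samples=400, n_features=10, random_state=42)

# illustrative values: alpha is the smoothing strength,
# binarize is the threshold used to map each feature to 0/1
nb_grid = {'alpha': [0.01, 0.1, 1.0, 10.0], 'binarize': [0.0, 0.5]}
nb_search = GridSearchCV(BernoulliNB(), param_grid=nb_grid, cv=5, scoring='f1_macro')
nb_search.fit(X_demo, y_demo)

print(nb_search.best_params_, round(nb_search.best_score_, 3))
```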

In [248]:
nb = BernoulliNB(alpha=1.0, class_prior=None, fit_prior=True)

scores = cross_validate(nb, X_train_transformed, y_train, cv=5,scoring=scoring, verbose=2,n_jobs = -1)
print("%0.2f F_macro , %0.2f Precision,  %0.2f Recall" % (scores['test_f1_macro'].mean(), scores['test_precision'].mean(),scores['test_recall'].mean()))


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 12 concurrent workers.


0.74 F_macro , 0.80 Precision,  0.54 Recall


[Parallel(n_jobs=-1)]: Done   3 out of   5 | elapsed:    0.9s remaining:    0.6s
[Parallel(n_jobs=-1)]: Done   5 out of   5 | elapsed:    0.9s finished


In [159]:
nb.fit(X_train_transformed, y_train)

trainscore=nb.score(X_train_transformed, y_train)
testscore=nb.score(X_test_transformed, y_test)
print('Training Accuracy:{},Testing Accuracy:{}'.format(trainscore,testscore))
ytest_predict = nb.predict(X_test_transformed) 
print(metrics.classification_report(y_test,ytest_predict))

Training Accuracy:0.7805524144574276,Testing Accuracy:0.7848228494848815
              precision    recall  f1-score   support

           0       0.77      0.93      0.84     15034
           1       0.82      0.54      0.65      8844

    accuracy                           0.78     23878
   macro avg       0.80      0.73      0.75     23878
weighted avg       0.79      0.78      0.77     23878



### Training and Fine Tuning of SVM Classifier

In [137]:
svm =  LinearSVC()
scores = cross_validate (svm, X_train_transformed, y_train, cv=5,scoring=scoring, verbose=2,n_jobs = -1)
print("%0.2f F_macro , %0.2f Precision,  %0.2f Recall" % (scores['test_f1_macro'].mean(), scores['test_precision'].mean(),scores['test_recall'].mean()))


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 12 concurrent workers.


0.80 F_macro , 0.82 Precision,  0.67 Recall


[Parallel(n_jobs=-1)]: Done   3 out of   5 | elapsed:    6.6s remaining:    4.4s
[Parallel(n_jobs=-1)]: Done   5 out of   5 | elapsed:    6.6s finished


In [None]:
from sklearn.svm import LinearSVC


# 'kernel' is not a LinearSVC parameter and max_iter must be positive,
# so the search space is limited to valid LinearSVC settings
random_grid = {'C': [0.1, 1, 10, 100], 'max_iter': [10, 100, 200, 1000]}

svm = LinearSVC()
svm = RandomizedSearchCV(estimator=svm, param_distributions=random_grid, cv=5, verbose=2, random_state=42, n_jobs=-1)
# fit the random search model (left disabled, as in the original run)
#svm.fit(X_train_transformed, y_train)

In [None]:
#svm.best_estimator_
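
A search grid can be checked against an estimator before any fitting starts. The sketch below uses a hypothetical helper, `invalid_grid_keys` (not part of scikit-learn), that compares grid keys with `get_params()`. It checks parameter names only, not whether each value is acceptable.

```python
from sklearn.svm import LinearSVC, SVC


def invalid_grid_keys(estimator, grid):
    """Return the grid keys that are not parameters of the estimator."""
    valid = estimator.get_params().keys()
    return sorted(key for key in grid if key not in valid)


# the grid as originally written in this notebook
grid = {'C': [0.1, 1, 10, 100],
        'max_iter': [10, 100, 200, -1],
        'kernel': ['linear', 'rbf']}

# LinearSVC has no 'kernel' parameter, while SVC accepts all three keys
print(invalid_grid_keys(LinearSVC(), grid))
print(invalid_grid_keys(SVC(), grid))
```

If kernels are to be compared, `SVC` is the estimator that exposes `kernel`; it is typically much slower than `LinearSVC` on a training set of this size.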

In [138]:
svm= LinearSVC(C=1.0,penalty='l2',max_iter=1000, random_state=42, verbose=2)

scores = cross_validate (svm, X_train_transformed, y_train, cv=5,scoring=scoring, verbose=2,n_jobs = -1)
print("%0.2f F_macro , %0.2f Precision,  %0.2f Recall" % (scores['test_f1_macro'].mean(), scores['test_precision'].mean(),scores['test_recall'].mean()))


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 12 concurrent workers.


......................................................................................................................................................*..*.
optimization finished, #iter = 312
Objective value = -36810.149058
nSV = 51160
..................****
optimization finished, #iter = 353
Objective value = -36741.790200
nSV = 51036
***
optimization finished, #iter = 357
Objective value = -36757.622481
nSV = 51058

optimization finished, #iter = 359
Objective value = -36755.374811
nSV = 51085
*.*
optimization finished, #iter = 362
Objective value = -36652.834631
nSV = 51009
0.80 F_macro , 0.82 Precision,  0.67 Recall


[Parallel(n_jobs=-1)]: Done   3 out of   5 | elapsed:    6.7s remaining:    4.5s
[Parallel(n_jobs=-1)]: Done   5 out of   5 | elapsed:    6.8s finished


In [166]:
svm.fit(X_train_transformed, y_train)
trainscore=svm.score(X_train_transformed, y_train)
testscore=svm.score(X_test_transformed, y_test)
print('Training Accuracy:{},Testing Accuracy:{}'.format(trainscore,testscore))
ytest_predict = svm.predict(X_test_transformed) 
print(metrics.classification_report(y_test,ytest_predict))

Training Accuracy:0.8233969929220589,Testing Accuracy:0.8213418209230254
              precision    recall  f1-score   support

           0       0.82      0.91      0.87     15034
           1       0.82      0.66      0.73      8844

    accuracy                           0.82     23878
   macro avg       0.82      0.79      0.80     23878
weighted avg       0.82      0.82      0.82     23878


### Training and Fine Tuning of Random Forest Classifier

In [95]:
# baseline random forest with default hyperparameters
rf = RandomForestClassifier()


scores = cross_validate (rf, X_train_transformed, y_train, cv=5,scoring=scoring, verbose=2,n_jobs = -1)
print("%0.2f F_macro , %0.2f Precision,  %0.2f Recall" % (scores['test_f1_macro'].mean(), scores['test_precision'].mean(),scores['test_recall'].mean()))


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 12 concurrent workers.


0.88 F_macro , 0.88 Precision,  0.81 Recall


[Parallel(n_jobs=-1)]: Done   3 out of   5 | elapsed:   26.9s remaining:   17.9s
[Parallel(n_jobs=-1)]: Done   5 out of   5 | elapsed:   27.0s finished


In [96]:
from sklearn.ensemble import RandomForestClassifier

n_estimators = [int(x) for x in np.linspace(start = 200, stop = 2000, num = 10)]

max_features = ['auto', 'sqrt']
# Maximum number of levels in tree
max_depth = [int(x) for x in np.linspace(10, 110, num = 11)]
max_depth.append(None)

min_samples_split = [2, 5, 10]

min_samples_leaf = [1, 2, 4]
bootstrap = [True, False]
# Create the random grid
random_grid = {'n_estimators': n_estimators,
               'max_features': max_features,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf,
               'bootstrap': bootstrap}
print(random_grid)

rf = RandomForestClassifier(max_depth=None, random_state=42,n_jobs=-1)
rf_random = RandomizedSearchCV(estimator = rf, param_distributions = random_grid, cv = 5, verbose=10,random_state=42, n_jobs = -1)# Fit the random search model
#rf_random.fit(X_train_transformed, y_train)


{'n_estimators': [200, 400, 600, 800, 1000, 1200, 1400, 1600, 1800, 2000], 'max_features': ['auto', 'sqrt'], 'max_depth': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, None], 'min_samples_split': [2, 5, 10], 'min_samples_leaf': [1, 2, 4], 'bootstrap': [True, False]}


In [97]:
#rf_random.best_estimator_
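
To get a feel for the cost of this search, the size of the grid can be counted without fitting anything. The sketch below rebuilds the same value lists with plain `range` calls (an equivalent of the `np.linspace` expressions above) and assumes the `RandomizedSearchCV` default of `n_iter=10`.

```python
from math import prod

# same candidate values as the random grid printed above
grid = {
    'n_estimators': list(range(200, 2001, 200)),
    'max_features': ['auto', 'sqrt'],
    'max_depth': list(range(10, 111, 10)) + [None],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'bootstrap': [True, False],
}

# number of distinct hyperparameter combinations in the full grid
n_combinations = prod(len(values) for values in grid.values())

# a randomized search samples n_iter of them and fits each one cv times
n_iter, cv = 10, 5
print(n_combinations, n_iter * cv)
```

The randomized search would therefore fit only a small sample of the grid, with each candidate training between 200 and 2000 trees per fold.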

In [68]:
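# random forest with explicit hyperparameters; the randomized search above was
# not run, so these are the library defaults apart from n_jobs and random_state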
rf = RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=None, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=100,
                       n_jobs=-1, oob_score=False, random_state=42, verbose=0,
                       warm_start=False)


scoring = ['precision', 'recall', 'f1_macro']
scores = cross_validate (rf, X_train_transformed, y_train, cv=5,scoring=scoring, verbose=2,n_jobs = -1)
print("%0.2f F_macro , %0.2f Precision,  %0.2f Recall" % (scores['test_f1_macro'].mean(), scores['test_precision'].mean(),scores['test_recall'].mean()))


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 12 concurrent workers.
[Parallel(n_jobs=-1)]: Done   3 out of   5 | elapsed:   25.6s remaining:   17.1s
[Parallel(n_jobs=-1)]: Done   5 out of   5 | elapsed:   25.8s finished


0.88 F_macro , 0.88 Precision,  0.81 Recall


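The random forest has the best cross-validated performance, with a macro F1 of 0.88 against 0.80 for logistic regression, an improvement of 8 percentage points. Note that the randomized search was not run, so the model uses the hyperparameters listed above rather than tuned ones.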

### Evaluation on Holdout

In [69]:
rf.fit(X_train_transformed, y_train)

trainscore=rf.score(X_train_transformed, y_train)
testscore=rf.score(X_test_transformed, y_test)
print('Training Accuracy:{},Testing Accuracy:{}'.format(trainscore,testscore))
ytest_predict = rf.predict(X_test_transformed) 
print(metrics.classification_report(y_test,ytest_predict))

Training Accuracy:0.9955919922938392,Testing Accuracy:0.892202026970433
              precision    recall  f1-score   support

           0       0.90      0.93      0.92     15034
           1       0.88      0.82      0.85      8844

    accuracy                           0.89     23878
   macro avg       0.89      0.88      0.88     23878
weighted avg       0.89      0.89      0.89     23878



In [70]:

sns.set_palette("ch:s=.25,rot=-.25")
plt.figure(figsize = (8,6))
 
cf_matrix = confusion_matrix(y_test,ytest_predict)
print(cf_matrix)
sns.heatmap(cf_matrix, annot=True,fmt='2d', cmap='Greens')
plt.xlabel('Predicted Class',weight='bold')
plt.ylabel('True Class',weight='bold')

[[14039   995]
 [ 1579  7265]]




Text(51.0, 0.5, 'True Class')
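
The figures in the classification report can be recovered from the confusion matrix by hand. The short check below uses the four counts printed above, with class 1 (canceled) as the positive class.

```python
# confusion matrix printed above: rows are true classes, columns are predictions
tn, fp = 14039, 995
fn, tp = 1579, 7265

precision = tp / (tp + fp)                   # share of predicted cancellations that were real
recall = tp / (tp + fn)                      # share of real cancellations that were caught
accuracy = (tp + tn) / (tp + tn + fp + fn)   # share of all bookings classified correctly
f1 = 2 * precision * recall / (precision + recall)

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f} accuracy={accuracy:.4f}")
```

In business terms, 1579 cancellations in the holdout set were missed and 995 bookings were flagged as cancellations that did not happen.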

In [74]:
sns.set_palette("ch:s=.25,rot=-.25")
plt.figure(figsize = (8,6))
rf_probs = rf.predict_proba(X_test_transformed)
# keep probabilities for the positive outcome only
rf_probs = rf_probs[:, 1]
# calculate scores
ns_probs = [0 for _ in range(len(y_test))]
ns_auc = roc_auc_score(y_test, ns_probs)
rf_auc = roc_auc_score(y_test, rf_probs)
# summarize scores
print('No Skill: ROC AUC=%.3f' % (ns_auc))
print('Random Forest: ROC AUC=%.3f' % (rf_auc))
# calculate roc curves
ns_fpr, ns_tpr, _ = roc_curve(y_test, ns_probs)
rf_fpr, rf_tpr, _ = roc_curve(y_test, rf_probs)
# plot the roc curve for the model
pyplot.plot(ns_fpr, ns_tpr, linestyle='--', label='No Skill',color='green')
pyplot.plot(rf_fpr, rf_tpr, marker='.', label='Random Forest',color='green',)
plt.legend(labels=['No Skill:{:.2f}'.format(ns_auc), 'Random Forest: {:.2f}'.format(rf_auc)])
plt.title('Roc Curve of Random Forest')

No Skill: ROC AUC=0.500
Random Forest: ROC AUC=0.958


Text(0.5, 1.0, 'Roc Curve of Random Forest')

In [None]:
import plotly.express as px
y_score = rf.predict_proba(X_test_transformed)[:, 1]
fpr, tpr, thresholds = roc_curve(y_test, y_score)

# The histogram of scores compared to true labels
fig_hist = px.histogram(
    x=y_score, color=y_test, nbins=50,
    labels=dict(color='True Labels', x='Score')
)

fig_hist.show()


# Evaluating model performance at various thresholds
df = pd.DataFrame({
    'False Positive Rate': fpr,
    'True Positive Rate': tpr
}, index=thresholds)
df.index.name = "Thresholds"
df.columns.name = "Rate"

fig_thresh = px.line(
    df, title='TPR and FPR at every threshold',
    width=700, height=500
)

fig_thresh.update_yaxes(scaleanchor="x", scaleratio=1)
fig_thresh.update_xaxes(range=[0, 1], constrain='domain')
fig_thresh.show()
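
The TPR and FPR arrays returned by `roc_curve` can also be used to pick an operating threshold. One common heuristic, sketched below on a tiny hand-made example, is Youden's J statistic (TPR minus FPR). The labels and scores are made up for illustration, and choosing J as the criterion is an assumption; a hotel might instead weight missed cancellations and false alarms by their actual costs.

```python
import numpy as np
from sklearn.metrics import roc_curve

# toy labels and scores (assumption, not the model's real output)
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1])
y_score = np.array([0.1, 0.25, 0.3, 0.6, 0.2, 0.7, 0.8, 0.9])

fpr, tpr, thresholds = roc_curve(y_true, y_score)

# Youden's J: the threshold where TPR exceeds FPR by the largest margin
j_stat = tpr - fpr
best = int(np.argmax(j_stat))
best_threshold = float(thresholds[best])

print(best_threshold, tpr[best], fpr[best])
```

Predictions at the chosen threshold would then be `y_score >= best_threshold` instead of the default 0.5 cut-off used by `predict`.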

In [150]:
### handling imbalanced data

In [151]:
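def train_RandomForestClassifier(X_train_smote, y_train_smote,X_test, y_test):
    # NOTE: despite its name, this helper trains and evaluates a logistic
    # regression on the (resampled) training data and the untouched test set.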

    lr_cor = LogisticRegression(C=1.0, class_weight=None, dual=False,
                                              fit_intercept=True,
                                              intercept_scaling=1, l1_ratio=None,
                                              max_iter=100, multi_class='auto',
                                              n_jobs=-1, penalty='l2',
                                              random_state=42, solver='lbfgs',
                                              tol=0.0001, verbose=0,
                                              warm_start=False)

    scores = cross_validate (lr_cor, X_train_smote, y_train_smote, cv=5,scoring=scoring, verbose=2,n_jobs = -1)
    print("%0.2f F_macro , %0.2f Precision,  %0.2f Recall" % (scores['test_f1_macro'].mean(), scores['test_precision'].mean(),scores['test_recall'].mean()))

    lr_cor.fit( X_train_smote, y_train_smote)

    trainscore=lr_cor.score(X_train_smote, y_train_smote)
    testscore=lr_cor.score(X_test, y_test)
    print('Training Accuracy:{},Testing Accuracy:{}'.format(trainscore,testscore))
    ytest_predict = lr_cor.predict(X_test) 
    print(metrics.classification_report(y_test,ytest_predict))

In [228]:
### with Over Sampling

In [229]:
import imblearn.over_sampling
# set up the sampling_strategy argument of RandomOverSampler
n_pos = np.sum(y_train == 1)
n_neg = np.sum(y_train == 0)
ratio = {1 : n_pos * 2, 0 : n_neg} 

# randomly oversample positive samples: create 2x as many
ROS = imblearn.over_sampling.RandomOverSampler(sampling_strategy = ratio, random_state=42) 
X_train_resampled, y_train_resampled = ROS.fit_resample(X_train_transformed, y_train)

train_RandomForestClassifier(X_train_resampled, y_train_resampled,X_test_transformed, y_test)



After over-sampling, the number of samples (70752) in class 1 will be larger than the number of samples in the majority class (class #0 -> 60132)

[Parallel(n_jobs=-1)]: Using backend LokyBackend with 12 concurrent workers.
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(

0.81 F_macro , 0.83 Precision,  0.83 Recall




Training Accuracy:0.8151645732098652,Testing Accuracy:0.8097411843537985
              precision    recall  f1-score   support

           0       0.88      0.80      0.84     15034
           1       0.71      0.82      0.76      8844

    accuracy                           0.81     23878
   macro avg       0.80      0.81      0.80     23878
weighted avg       0.82      0.81      0.81     23878


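Oversampling does not improve overall performance. Note that the helper above fits a logistic regression rather than a random forest: compared with the logistic regression trained without resampling, recall on cancellations rises from 0.68 to 0.82, precision drops from 0.81 to 0.71, and test accuracy falls from 0.82 to 0.81.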
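
The imblearn warning in the output above reports that doubling the positive class overshoots the majority class. The arithmetic below reproduces that from the two counts in the warning and shows, as an assumption about what may be preferable, a strategy that simply matches the majority count.

```python
# class counts implied by the imblearn warning above
n_neg = 60132
n_pos = 70752 // 2  # the warning reports 2 * n_pos = 70752

# strategy used above: the minority class ends up larger than the majority
doubled = {1: n_pos * 2, 0: n_neg}
# alternative (assumption): oversample only up to the majority count
balanced = {1: n_neg, 0: n_neg}

print(n_pos, doubled[1] > doubled[0], balanced[1] == balanced[0])
```

Leaving `sampling_strategy` at its default of `'auto'` in `RandomOverSampler` has the same effect as the `balanced` dictionary, since it resamples the minority class up to the size of the majority class.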

In [223]:
### with under Sampling

In [224]:
from imblearn.under_sampling import RandomUnderSampler
X_under, y_under = RandomUnderSampler(random_state=42).fit_resample(X_train_transformed,y_train)
train_RandomForestClassifier(X_under, y_under,X_test_transformed, y_test)


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 12 concurrent workers.
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(

0.81 F_macro , 0.82 Precision,  0.79 Recall




Training Accuracy:0.8122031886024423,Testing Accuracy:0.8167350699388558
              precision    recall  f1-score   support

           0       0.87      0.83      0.85     15034
           1       0.73      0.79      0.76      8844

    accuracy                           0.82     23878
   macro avg       0.80      0.81      0.81     23878
weighted avg       0.82      0.82      0.82     23878



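Undersampling does not improve overall performance either. For the logistic regression fitted by the helper, recall on cancellations rises from 0.68 to 0.79 while precision drops from 0.81 to 0.73, and test accuracy stays at about 0.82.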

In [225]:
# with SMOTE

In [226]:
smote = imblearn.over_sampling.SMOTE(sampling_strategy=ratio, random_state = 42)
    
X_train_smote, y_train_smote = smote.fit_resample(X_train_transformed, y_train)

train_RandomForestClassifier(X_train_smote, y_train_smote,X_test_transformed, y_test)



After over-sampling, the number of samples (70752) in class 1 will be larger than the number of samples in the majority class (class #0 -> 60132)

[Parallel(n_jobs=-1)]: Using backend LokyBackend with 12 concurrent workers.
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(

0.82 F_macro , 0.83 Precision,  0.83 Recall




Training Accuracy:0.8173573546040769,Testing Accuracy:0.8097830639082
              precision    recall  f1-score   support

           0       0.88      0.80      0.84     15034
           1       0.71      0.82      0.76      8844

    accuracy                           0.81     23878
   macro avg       0.80      0.81      0.80     23878
weighted avg       0.82      0.81      0.81     23878



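SMOTE gives the same picture as random oversampling. For the logistic regression fitted by the helper, recall on cancellations rises from 0.68 to 0.82 while precision drops from 0.81 to 0.71, and test accuracy falls from 0.82 to 0.81.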
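
In the cells above, `cross_validate` runs on data that was resampled beforehand, so copies of the same row can land in both the training and validation folds, which may make the cross-validated scores optimistic. The sketch below resamples inside each fold instead, using only scikit-learn utilities on synthetic data. The data, the manual loop, and the use of `sklearn.utils.resample` in place of imblearn are all assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import StratifiedKFold
from sklearn.utils import resample

# synthetic imbalanced data standing in for the booking features (assumption)
X_demo, y_demo = make_classification(n_samples=600, n_features=10,
                                     weights=[0.7, 0.3], random_state=42)

fold_scores = []
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for train_idx, val_idx in cv.split(X_demo, y_demo):
    X_tr, y_tr = X_demo[train_idx], y_demo[train_idx]
    # oversample the minority class using the training fold only
    X_min, X_maj = X_tr[y_tr == 1], X_tr[y_tr == 0]
    X_min_up = resample(X_min, replace=True, n_samples=len(X_maj), random_state=42)
    X_bal = np.vstack([X_maj, X_min_up])
    y_bal = np.concatenate([np.zeros(len(X_maj), dtype=int),
                            np.ones(len(X_min_up), dtype=int)])
    model = LogisticRegression(max_iter=1000).fit(X_bal, y_bal)
    # the validation fold keeps its original class distribution
    y_val_pred = model.predict(X_demo[val_idx])
    fold_scores.append(f1_score(y_demo[val_idx], y_val_pred, average='macro'))

print(np.round(fold_scores, 3))
```

The same effect can be obtained more compactly by placing the sampler in an `imblearn.pipeline.Pipeline` and passing that pipeline to `cross_validate`.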

In [None]:
#s = setup(train_df, target = 'is_canceled', train_size = 0.99,session_id = 123)

In [None]:
#%%time
#best = compare_models()