This dataset contains **119390** observations for a City Hotel and a Resort Hotel. Each observation represents a hotel booking between the 1st of July 2015 and the 31st of August 2017, including bookings that effectively arrived and bookings that were canceled.

# columns:

**hotel**:The dataset contains the booking information of two hotels. One of the hotels is a resort hotel and the other is a city hotel.

**is_canceled**:Value indicating if the booking was canceled (1) or not (0).

**lead_time**:Number of days that elapsed between the entering date of the booking into the PMS and the arrival date.

**arrival_date_year**:Year of arrival date

**arrival_date_month**:Month of arrival date with 12 categories: “January” to “December”

**arrival_date_week_number**:Week number of the arrival date.

**arrival_date_day_of_month**:Day of the month of the arrival date.

**stays_in_weekend_nights**:Number of weekend nights (Saturday or Sunday) the guest stayed or booked to stay at the hotel

**stays_in_week_nights**:Number of week nights (Monday to Friday) the guest stayed or booked to stay at the hotel (calculated by counting the number of week nights)

**adults**:Number of adults

**children**:Number of children

**babies**:Number of babies

**meal**:Type of meal booked, e.g. BB – Bed & Breakfast, HB – Half Board

**country**:Country of origin.

**market_segment**:Market segment designation. In categories, the term “TA” means “Travel Agents” and “TO” means “Tour Operators”

**distribution_channel**:Booking distribution channel. The term “TA” means “Travel Agents” and “TO” means “Tour Operators”

**is_repeated_guest**:Value indicating if the booking name was from a repeated guest (1) or not (0)

**previous_cancellations**:Number of previous bookings that were cancelled by the customer prior to the current booking

**previous_bookings_not_canceled**:Number of previous bookings not cancelled by the customer prior to the current booking

**reserved_room_type**:Code of room type reserved. Code is presented instead of designation for anonymity reasons

**assigned_room_type**:Code for the type of room assigned to the booking. Sometimes the assigned room type differs from the reserved room type due to hotel operation reasons (e.g. overbooking) or by customer request. Code is presented instead of designation for anonymity reasons

**booking_changes**:Number of changes/amendments made to the booking from the moment the booking was entered on the PMS until the moment of check-in or cancellation

**deposit_type**:No Deposit – no deposit was made; Non Refund – a deposit was made in the value of the total stay cost; Refundable – a deposit was made with a value under the total cost of stay.

**agent**:ID of the travel agency that made the booking

**company**:ID of the company/entity that made the booking or responsible for paying the booking. ID is presented instead of designation for anonymity reasons

**days_in_waiting_list**:Number of days the booking was in the waiting list before it was confirmed to the customer

**customer_type**:Group – when the booking is associated with a group; Transient – when the booking is not part of a group or contract, and is not associated with any other transient booking; Transient-party – when the booking is transient, but is associated with at least one other transient booking

**adr**:Average Daily Rate (Calculated by dividing the sum of all lodging transactions by the total number of staying nights)

**required_car_parking_spaces**:Number of car parking spaces required by the customer

**total_of_special_requests**:Number of special requests made by the customer (e.g. twin bed or high floor)

**reservation_status**:Check-Out – customer has checked in but already departed; No-Show – customer did not check-in and did inform the hotel of the reason why

**reservation_status_date**:Date at which the last status was set. This variable can be used in conjunction with reservation_status to understand when the booking was canceled or when the customer checked out of the hotel

**name**:Name of the Guest (Not Real)

**email**:Email (Not Real)

**phone-number**:Phone number (not real)

**credit_card**:Credit Card Number (not Real)
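
As a rough illustration of the `adr` definition above, the sketch below divides the total lodging revenue of one booking by its number of staying nights. The helper name `average_daily_rate` and the amounts are made up for the example; they are not taken from the dataset.

```python
def average_daily_rate(lodging_transactions, staying_nights):
    """Sum of lodging transactions divided by the number of staying nights."""
    if staying_nights <= 0:
        raise ValueError("staying_nights must be positive")
    return sum(lodging_transactions) / staying_nights

# Hypothetical booking: three nights billed at 90, 110 and 100.
example_adr = average_daily_rate([90.0, 110.0, 100.0], staying_nights=3)
print(example_adr)  # 100.0
```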


In [None]:
#import all necessary libraries:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [None]:
#import the Dataset:
df=pd.read_csv("../input/hotel-booking/hotel_booking.csv")

2-Dataset overview:

In [None]:
df.head()

In [None]:
df.info()

In [None]:
#Summary statistics:
df.describe()

3-number of rows:

In [None]:
len(df.index)


In [None]:
df.shape

In [None]:
df.head()

4-Number of missing values (and the column with the most):

In [None]:
missing = df.isnull().sum().sort_values()
missing

In [None]:
print(missing.idxmax())
missing.max()

In [None]:
#Function to calculate the percentage of missing data in each column (feature), sorted in ascending order
def missing_percent(df):
    nan_percent= 100*(df.isnull().sum()/len(df))
    nan_percent= nan_percent[nan_percent>0].sort_values()
    return nan_percent
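
To check what `missing_percent` returns, it can be run on a tiny frame where the answer is easy to verify by hand. The toy values below are assumptions made for the example, not rows from the hotel data; the function is repeated so the snippet runs on its own.

```python
import numpy as np
import pandas as pd

def missing_percent(df):
    """Percentage of missing values per column, keeping only columns with gaps."""
    nan_percent = 100 * df.isnull().sum() / len(df)
    return nan_percent[nan_percent > 0].sort_values()

# Toy data (hypothetical values): 'company' is mostly empty, 'adr' is complete.
toy = pd.DataFrame({
    "adr": [80.0, 95.5, 120.0, 60.0],
    "agent": [9.0, np.nan, 240.0, 9.0],
    "company": [np.nan, np.nan, np.nan, 40.0],
})
toy_missing = missing_percent(toy)
print(toy_missing)
```

Columns without any missing value (here `adr`) are filtered out, so an empty result means the frame is complete.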

In [None]:
nan_percent= missing_percent(df)
plt.figure(figsize=(8,6))
sns.barplot(x=nan_percent.index,y=nan_percent)
plt.xticks(rotation=90)

5-remove the 'company' column:

In [None]:
df.drop('company', axis=1, inplace=True)

6-Countries with the most reservations:

In [None]:
df['country'].value_counts()[:5]

7-The guest with the highest ADR (Average Daily Rate):

In [None]:
print(df.loc[df['adr'].idxmax(),'name'])
df['adr'].max()

8-Average ADR (2 decimals)

In [None]:
round(df['adr'].mean(),2)

9-Average number of stay nights (2 decimals)

In [None]:
df['stay']=df['stays_in_weekend_nights']+df['stays_in_week_nights']
round(df['stay'].mean(),2)

In [None]:
plt.figure(figsize=(7,5))
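#distplot is deprecated in recent seaborn releases; histplot with kde=True draws a comparable plot
sns.histplot(x=df['stay'],kde=True,stat='density',edgecolor='yellow',linewidth=3)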

10-Name and E-mail of people with 5 special requests

In [None]:
five_req=df[df['total_of_special_requests']==5 ]
five_req[['name','email']]

11-The most repeated Family name

In [None]:
def family(s):
    s=str(s)
    b=s.split(' ')
    return b[-1]
df['family']=df['name'].apply(family)
df['family'].value_counts()[:5]
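
The same family-name extraction can probably be written without `apply`, using the pandas string accessor. This is only a sketch on made-up names. One difference from the `family` function: missing names stay `NaN` here instead of becoming the string `'nan'`.

```python
import pandas as pd

# Hypothetical guest names, only for illustration.
names = pd.Series(["Ernest Barnes", "Andrea Baker", "Rebecca Parker", "Laura Murray Baker"])

# rsplit with n=1 splits once from the right; .str[-1] keeps the last piece.
family_names = names.str.rsplit(" ", n=1).str[-1]
print(family_names.value_counts().head(1))
```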


12-People with the most kids:

In [None]:
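#Total kids per booking (children + babies)
df['kids']=df['children']+df['babies']
#Name of the guest on the booking with the most kids
df.loc[df['kids'].idxmax(),'name']
df.sort_values('kids',ascending=False)
#Count bookings per family name; numeric_only avoids summing text columns
df['count_family']=1
w=df.groupby('family').sum(numeric_only=True)
h=w.sort_values('count_family',ascending=False)
w['count_family'].idxmax()
h.iloc[:5]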

In [None]:
def codes(j):
    t= str(j)
    return t[:3]
df['codes']=df['phone-number'].apply(codes)
df['codes'].value_counts()[:3]

In [None]:
df.head()

In [None]:
df['number of people']=df['kids']+df['adults']
fp=df.pivot_table(index='arrival_date_month',columns='arrival_date_year',values='number of people')
sns.heatmap(fp,cmap='gray' ,linecolor='blue',linewidth=2,annot=True)
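
In the pivot table above the months are plain strings, so the rows come out in alphabetical order (April, August, ...). A possible way to get calendar order is to turn the month column into an ordered categorical first. The sketch below uses a small made-up frame; `month_order` and the toy values are assumptions for the example.

```python
import pandas as pd

month_order = ["January", "February", "March", "April", "May", "June",
               "July", "August", "September", "October", "November", "December"]

# Toy bookings (hypothetical values).
toy = pd.DataFrame({
    "arrival_date_year": [2016, 2016, 2016, 2017, 2017],
    "arrival_date_month": ["July", "April", "July", "April", "January"],
    "number of people": [2, 3, 4, 1, 2],
})

# An ordered categorical makes pivot_table sort months by calendar position.
toy["arrival_date_month"] = pd.Categorical(
    toy["arrival_date_month"], categories=month_order, ordered=True
)
fp_toy = toy.pivot_table(
    index="arrival_date_month",
    columns="arrival_date_year",
    values="number of people",
    aggfunc="mean",   # the pivot_table default, written out for clarity
    observed=True,    # keep only the months that actually occur
)
print(fp_toy)
```

The resulting frame can be passed to `sns.heatmap` in the same way as `fp`.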


In [None]:
df1=df[['hotel','arrival_date_month','stay','number of people']]
df1

In [None]:
#Dealing with missing data:
nan_percent= missing_percent(df)
plt.figure(figsize=(8,6))
sns.barplot(x=nan_percent.index,y=nan_percent)
plt.xticks(rotation=90)

In [None]:
#Most frequent agent ID (idxmax would return the row label of the largest ID instead)
df['agent'].mode()[0]

In [None]:
df['agent'].value_counts().sort_values()

The most frequent agent is 9, so I replace the missing values with it.

In [None]:
df['agent']=df['agent'].fillna(9)

In [None]:
nan_percent[nan_percent<1]

In [None]:
df['country'].value_counts().sort_values()

PRT is the most frequent country, so I replace the missing values with it.

In [None]:
df['country']=df['country'].fillna('PRT')
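
Replacing missing values with the most frequent value, as done for `agent` and `country`, can also be wrapped in a small helper so the value does not have to be typed by hand. `fill_with_mode` is a hypothetical name and the toy frame is made up; this is a sketch, not part of the original analysis.

```python
import numpy as np
import pandas as pd

def fill_with_mode(series):
    """Return a copy of `series` with NaN replaced by its most frequent value."""
    most_frequent = series.mode(dropna=True)[0]
    return series.fillna(most_frequent)

# Toy columns (hypothetical values).
toy = pd.DataFrame({
    "agent": [9.0, np.nan, 240.0, 9.0],
    "country": ["PRT", "GBR", None, "PRT"],
})
toy["agent"] = fill_with_mode(toy["agent"])
toy["country"] = fill_with_mode(toy["country"])
print(toy)
```

When several values tie for most frequent, `mode()` returns them sorted and the helper takes the first one.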

In [None]:
nan_percent= missing_percent(df)

The columns children, kids and number of people have less than 1 percent missing values, so I fill them with 0.

In [None]:
df['children']=df['children'].fillna(0)
df['kids']=df['kids'].fillna(0)
df['number of people']=df['number of people'].fillna(0)


In [None]:
nan_percent= missing_percent(df)
nan_percent

In [None]:
df.info()

There is now no more missing data.

# feature engineering:

Changing integer to object

In [None]:
#Convert to String:
df['is_canceled']= df['is_canceled'].apply(str)

In [None]:
df['is_repeated_guest'].nunique()

In [None]:
#Convert to String:
df['is_repeated_guest']= df['is_repeated_guest'].apply(str)

In [None]:
#Convert to String:
df['agent']=df['agent'].apply(str)

Creating Dummy Variables:

In [None]:
df=df.drop(['name','email','phone-number','credit_card','codes','family','reservation_status','reservation_status_date'],axis=1)

In [None]:
df_num= df.select_dtypes(exclude='object')
df_obj= df.select_dtypes(include='object')

In [None]:
df_obj=pd.get_dummies(df_obj.drop('is_canceled',axis=1), drop_first=True)
final_df= pd.concat([df_num, df_obj], axis=1)
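
To see what `drop_first=True` does, here is a small sketch on made-up data. As a variation on the cells above, it names the categorical columns through the `columns=` argument instead of splitting the frame by dtype first. The toy values are assumptions for the example.

```python
import pandas as pd

# Toy frame (hypothetical values) with one numeric and two categorical columns.
toy = pd.DataFrame({
    "lead_time": [10, 45, 3],
    "hotel": ["City Hotel", "Resort Hotel", "City Hotel"],
    "meal": ["BB", "HB", "SC"],
})

# drop_first removes the first category of each column, which then acts as baseline.
toy_final = pd.get_dummies(toy, columns=["hotel", "meal"], drop_first=True)
print(toy_final.columns.tolist())
```

`hotel` has two categories and yields one dummy column; `meal` has three and yields two. The dropped categories (`City Hotel`, `BB`) are represented by all dummies being zero.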

# Starting decision Tree Model:

In [None]:
#Determine the Features & Target Variable
X =final_df
y = df['is_canceled']


In [None]:
sns.histplot(data=df,x=y)

In [None]:
#Split the Dataset to Train & Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=101)
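
If the histogram of the target shows clearly unequal classes, one option is to pass `stratify` to `train_test_split` so both parts keep the same class ratio. The notebook itself does not do this. The sketch below uses a made-up target with 70 non-canceled and 30 canceled bookings.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy imbalanced target (hypothetical): 70 non-canceled, 30 canceled bookings.
X_toy = pd.DataFrame({"lead_time": range(100)})
y_toy = pd.Series(["0"] * 70 + ["1"] * 30)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=101, stratify=y_toy
)
# With stratify, the 70/30 class ratio is preserved in both parts.
print(y_tr.value_counts().to_dict(), y_te.value_counts().to_dict())
```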

In [None]:
#Train the Model
from sklearn.tree import DecisionTreeClassifier
model = DecisionTreeClassifier()
model.fit(X_train,y_train)

In [None]:
#Predicting Test Data
y_pred = model.predict(X_test)

In [None]:
#Evaluating the Model
from sklearn.metrics import confusion_matrix,classification_report
confusion_matrix(y_test,y_pred)

In [None]:
print(classification_report(y_test,y_pred))
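
The precision and recall in the classification report can be recomputed by hand from the confusion matrix. The labels and predictions below are invented for the example (string classes, as in this notebook) and treat `'1'` (canceled) as the positive class.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels and predictions, using string classes as in the notebook.
y_true = ["0", "0", "0", "0", "1", "1", "1", "1", "1", "0"]
y_hat = ["0", "0", "1", "0", "1", "1", "0", "1", "1", "0"]

# Rows are true classes, columns are predicted classes, in the order of `labels`.
cm = confusion_matrix(y_true, y_hat, labels=["0", "1"])
tn, fp, fn, tp = cm.ravel()

precision = tp / (tp + fp)  # share of predicted cancellations that were real
recall = tp / (tp + fn)     # share of real cancellations that were found
print(cm)
print(round(precision, 2), round(recall, 2))
```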

In [None]:
model.feature_importances_

In [None]:
pd.DataFrame(index=X.columns,data=model.feature_importances_,columns=['Feature Importance'])
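
With several hundred dummy columns the importance table is hard to read unsorted. A possible follow-up is to sort it and look only at the top rows. The sketch below uses synthetic data from `make_classification` and placeholder column names, not the hotel features.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data; the column names are placeholders.
X_arr, y_arr = make_classification(
    n_samples=300, n_features=6, n_informative=3, random_state=0
)
X_toy = pd.DataFrame(X_arr, columns=[f"feature_{i}" for i in range(6)])

toy_model = DecisionTreeClassifier(random_state=0).fit(X_toy, y_arr)

# Importances sum to 1; sorting puts the most used features first.
importances = pd.Series(
    toy_model.feature_importances_, index=X_toy.columns, name="Feature Importance"
).sort_values(ascending=False)
print(importances.head(3))
```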

**Visualize the Tree**

In [None]:
from sklearn.tree import plot_tree
plt.figure(figsize=(12,8))
#Pass a plain list; some scikit-learn versions reject a pandas Index here
plot_tree(model,filled=True,feature_names=list(X.columns));

# Reporting Model Results

To begin experimenting with hyperparameters, let's create a function that reports back classification results and plots out the tree.

In [None]:
def report_model(model):
    model_preds = model.predict(X_test)
    print(classification_report(y_test,model_preds))
    print('\n')
    plt.figure(figsize=(8,6),dpi=100)
    #Pass a plain list; some scikit-learn versions reject a pandas Index here
    plot_tree(model,filled=True,feature_names=list(X.columns));

**Changing Hyperparameters**

In [None]:
#Max Depth
pruned_tree = DecisionTreeClassifier(max_depth=8)
pruned_tree.fit(X_train,y_train)

In [None]:
report_model(pruned_tree)

In [None]:
#Max Leaf Nodes
pruned_tree = DecisionTreeClassifier(max_leaf_nodes=5)
pruned_tree.fit(X_train,y_train)

In [None]:
report_model(pruned_tree)

In [None]:
#Criterion
entropy_tree = DecisionTreeClassifier(criterion='entropy')
entropy_tree.fit(X_train,y_train)

In [None]:
report_model(entropy_tree)
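
Instead of refitting one tree per cell, the hyperparameter experiments can be collected in a loop that records train and test accuracy side by side, which makes overfitting easier to spot. This is a sketch on synthetic data; the depth values and the `results` dictionary are assumptions for the example, not results from the hotel dataset.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data.
X_arr, y_arr = make_classification(n_samples=500, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_arr, y_arr, test_size=0.3, random_state=101
)

results = {}
for depth in [2, 4, 8, None]:  # None lets the tree grow until leaves are pure
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_tr, y_tr)
    results[depth] = {
        "train_acc": tree.score(X_tr, y_tr),
        "test_acc": tree.score(X_te, y_te),
        "leaves": tree.get_n_leaves(),
    }

for depth, row in results.items():
    print(depth, row)
```

A large gap between `train_acc` and `test_acc` for the unrestricted tree would suggest that limiting `max_depth` or `max_leaf_nodes` is worth trying.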