# **Project Name - Exploratory Data Analysis Of Hotel Bookings**

##### **Project Type**    - EDA
##### **Contribution**    - Individual
##### **Name -** Ratul Dutta


# **Project Summary -**

This is the AlmaBetter Hotel Booking EDA project, written in Python. It analyses hotel reservations for two kinds of hotel: city hotels and resort hotels. The dataset has 32 columns and 119390 rows. The workflow is divided into three stages: data collection, data cleaning and manipulation, and EDA (Exploratory Data Analysis). During data collection we take a first look at the columns, including hotel, is_canceled, lead_time, arrival_date_year, arrival_date_month, arrival_date_week_number, arrival_date_day_of_month, and stays_in_weekend_nights, using head(), tail(), info(), describe(), columns, and similar methods. We then identify the distinct values of each column, list them in tabular form, and check each column's data type. Some columns have inappropriate data types, which are corrected afterwards. During data cleaning we also find duplicate rows and drop them, which leaves 87396 rows in the dataset.

Before visualizing anything, the data has to be cleaned. We check every column for null values. A column with a high percentage of nulls is a candidate for removal with the 'drop' method; the "company" column falls into this category (in the code below its nulls are filled with 0 instead). Where a column has only a few nulls, we fill them with a suitable value using fillna(). Finally, a range of charts is used to visualize the data and to support the business objective.

# **GitHub Link -**

https://github.com/ratul837/almabetter-HotelEDA.git

# **Problem Statement**


Have you ever wondered when the best time of year to book a hotel room is? Or what the optimal length of stay is to get the best daily rate? What if you wanted to predict whether a hotel was likely to receive a disproportionately high number of special requests? This hotel booking dataset can help you explore those questions! It contains booking information for a city hotel and a resort hotel, including details such as when the booking was made, the length of stay, the number of adults, children, and/or babies, and the number of parking spaces required. All personally identifying information has been removed from the data. Explore and analyse the data to discover the important factors that govern bookings.

#### **Define Your Business Objective?**

Analyse the reservation data for the City Hotel and Resort Hotel to learn more about the various elements that influence a reservation. This is being done as a solo project.

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure that each chart answers the following questions.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
*   Will the gained insights help create a positive business impact? Are there any insights that lead to negative growth? Justify with a specific reason.

5. You have to create at least 20 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# import libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import missingno as msno
import itertools
import calendar
from plotnine import *

### Dataset Loading

In [None]:
# Load Dataset
from google.colab import drive
drive.mount('/content/drive')
dataset = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/Hotel Bookings.csv')
df=dataset.copy()

### Dataset First View

In [None]:
# Dataset First Look
df.head()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
n_rows, n_cols = df.shape
print(n_rows)
print(n_cols)

### Dataset Information

In [None]:
# Dataset Info
df.info(verbose=True)  # verbose=True prints the full per-column summary

#### Duplicate Values

In [None]:
# returns the dataframe without duplicate rows
#df.drop_duplicates().shape
# returns the only duplicated rows
df[df.duplicated()]

In [None]:
# Dataset Duplicate Value Count
df.duplicated().value_counts()

In [None]:
# drop the duplicate rows
df.drop_duplicates(inplace=True)

In [None]:
df.duplicated().value_counts()
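
The duplicate check above can be tried on a small frame first. The sketch below is only an illustration: the `toy` DataFrame and its values are made up and are not taken from the hotel dataset.

```python
import pandas as pd

# Hypothetical toy data: rows 0 and 2 are identical
toy = pd.DataFrame({
    "hotel": ["City Hotel", "Resort Hotel", "City Hotel"],
    "lead_time": [10, 20, 10],
})

# duplicated() marks every repeat of an earlier row as True
n_duplicates = int(toy.duplicated().sum())

# drop_duplicates() keeps the first occurrence of each row
deduped = toy.drop_duplicates().reset_index(drop=True)

print(n_duplicates)
print(deduped.shape)
```

Counting with `duplicated().sum()` before dropping makes it easy to confirm that the number of rows removed matches the number of duplicates reported.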

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
print(df.isnull().values.sum())

In [None]:
# column wise null value
df.isnull().sum()

In [None]:
# Visualizing the missing values
msno.bar(df)

In [None]:
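# handling missing values
# Assign the result back instead of calling fillna(inplace=True) on a column
# selection, which is deprecated in recent pandas versions and may not modify df
df['country'] = df['country'].fillna('missing')
df['agent'] = df['agent'].fillna(0)
df['company'] = df['company'].fillna(0)
df['children'] = df['children'].fillna(0)
print(df.isnull().sum())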
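
Whether a column should be dropped or filled depends on how large its share of nulls is. A minimal sketch of that check is shown below, assuming a hypothetical helper `null_report` and a made-up `toy` frame; the 50% cut-off is an assumption for illustration, not a value used in this notebook.

```python
import numpy as np
import pandas as pd

def null_report(frame):
    """Return the count and percentage of nulls per column, largest first."""
    counts = frame.isnull().sum()
    percent = (counts / len(frame) * 100).round(2)
    report = pd.DataFrame({"null_count": counts, "null_percent": percent})
    return report.sort_values("null_percent", ascending=False)

# Hypothetical toy data
toy = pd.DataFrame({
    "company": [np.nan, np.nan, np.nan, 40.0],
    "children": [0.0, np.nan, 1.0, 2.0],
    "adults": [2, 2, 1, 2],
})

report = null_report(toy)

# THRESHOLD is an assumed cut-off, not a value taken from the notebook
THRESHOLD = 50
mostly_null = report.index[report["null_percent"] > THRESHOLD].tolist()

print(report)
print(mostly_null)
```

Columns listed in `mostly_null` are candidates for `drop`; the remaining columns with nulls can be filled with `fillna`.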

### What did you know about your dataset?

This dataset is a single file that compares booking details for two hotels: a city hotel and a resort hotel. It includes details such as the date the reservation was made, the length of stay, the number of adults, children, and/or babies, and the number of parking spaces required, among other things. The full dataset has 32 columns and 119390 rows. It contains 31994 duplicate rows, which are removed, leaving 87396 rows. Every column has a data type such as integer, float, or object (text). Some of these data types are not appropriate and are corrected afterwards. We also find the distinct values of each column, which show the actual values each column takes.

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
df.columns

In [None]:
# Dataset Describe
df.describe(include='all').T

### Variables Description

The columns and the data they represent are listed below:

1. hotel: Name of the hotel (Resort Hotel or City Hotel).

2. is_canceled: If the booking was canceled (1) or not (0).

3. lead_time: Number of days between the booking date and the arrival date.

4. arrival_date_year: Year of arrival date.

5. arrival_date_month: Month of arrival date.

6. arrival_date_week_number: Week number of year for arrival date.

7. arrival_date_day_of_month: Day of arrival date.

8. stays_in_weekend_nights: Number of weekend nights (Saturday or Sunday) spent at the hotel by the guests.

9. stays_in_week_nights: Number of weeknights (Monday to Friday) spent at the hotel by the guests.

10. adults: Number of adults among guests.

11. children: Number of children among guests.

12. babies: Number of babies among guests.

13. meal: Type of meal booked.

14. country: Country of guests.

15. market_segment: Designation of market segment.

16. distribution_channel: Name of booking distribution channel.

17. is_repeated_guest: If the booking was from a repeated guest (1) or not (0).

18. previous_cancellations: Number of previous bookings that were cancelled by the customer prior to the current booking.

19. previous_bookings_not_canceled: Number of previous bookings not cancelled by the customer prior to the current booking.

20. reserved_room_type: Code of room type reserved.

21. assigned_room_type: Code of room type assigned.

22. booking_changes: Number of changes/amendments made to the booking.

23. deposit_type: Type of the deposit made by the guest.

24. agent: ID of travel agent who made the booking.

25. company: ID of the company that made the booking.

26. days_in_waiting_list: Number of days the booking was on the waiting list.

27. customer_type: Type of customer, assuming one of four categories.

28. adr: Average Daily Rate, as defined by dividing the sum of all lodging transactions by the total number of staying nights.

29. required_car_parking_spaces: Number of car parking spaces required by the customer.

30. total_of_special_requests: Number of special requests made by the customer.

31. reservation_status: Reservation status (Canceled, Check-Out or No-Show).

32. reservation_status_date: Date at which the last reservation status was updated.
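
To make the adr definition in item 28 concrete, here is a small worked example. The transaction amounts and the variable names are made up for illustration and do not come from the dataset.

```python
# Hypothetical lodging transactions for one booking (made-up amounts)
lodging_transactions = [120.0, 120.0, 90.0]
staying_nights = 3

# ADR = sum of all lodging transactions / total number of staying nights
adr = sum(lodging_transactions) / staying_nights

# Multiplying ADR by the number of nights gives back the lodging revenue
revenue = adr * staying_nights

print(adr)
print(revenue)
```

This is why multiplying adr by the number of nights stayed can serve as an estimate of the revenue of a booking.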

## **3. Data Wrangling**

In [None]:
# Write your code to make your dataset analysis ready.
# calculate the total night stays to get the total revenue
df['total_stay_in_nights'] = df['stays_in_week_nights']+df['stays_in_weekend_nights']
df['revenue'] = df['total_stay_in_nights']*df['adr']
df['revenue']

In [None]:
# calculating total guests for each booking
df['total_guest']=df['adults'] + df['children']+df['babies']
df['total_guest']

In [None]:
# for easier reading, replace the values (0, 1) in 'is_canceled' with 'not canceled' / 'is canceled'
df['is_canceled'] = df['is_canceled'].replace([0,1],['not canceled','is canceled'])
df['is_canceled']

In [None]:
# for easier reading, replace the values (0, 1) in 'is_repeated_guest' with 'not repeated' / 'repeated'
df['is_repeated_guest'] = df['is_repeated_guest'].replace([0,1],['not repeated','repeated'])
df['is_repeated_guest']

In [None]:
hotel_wise_revenue = df.groupby('hotel')['revenue'].sum()
hotel_wise_revenue

### What all manipulations have you done and insights you found?

A few additional columns are needed for the analysis, and some existing columns need to be adjusted:
1. Number of guests: the total number of guests per booking is obtained by summing the number of adults, children, and babies.
2. Revenue: ADR is multiplied by the total number of nights stayed (week nights plus weekend nights) to get the revenue per booking. This column is used to examine each hotel's revenue.

3. Replaced values in the columns is_canceled and is_repeated_guest: these columns only hold the values 0 and 1. In is_canceled, 0 and 1 are replaced with "not canceled" and "is canceled". In the same way, 0 and 1 in is_repeated_guest are replaced with "not repeated" and "repeated". The new values are easier to read in the visualisations.

4. Changed data types: the agent and children columns hold float values, which does not make sense because they represent an agent ID and a guest count. These columns are therefore converted from float to integer.

5. Removed duplicate rows and handled null values: data wrangling must be done before the data can be visualised. We examined the null values of each column. A column with a high percentage of nulls can be dropped with the 'drop' method; otherwise the nulls are filled with a suitable value using the fillna() function. We also checked for duplicated data and found that some rows were duplicates, so we removed them with the drop_duplicates() method.

By doing this, we have eliminated any unnecessary data.
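
The wrangling steps listed above can be sketched on a small frame. Everything below is illustrative: the `toy` data is made up, and `is_canceled_label` is a hypothetical column name chosen so that the numeric 0/1 flag stays available alongside the readable label.

```python
import numpy as np
import pandas as pd

# Hypothetical toy bookings
toy = pd.DataFrame({
    "adults": [2, 1],
    "children": [1.0, np.nan],
    "babies": [0, 0],
    "agent": [9.0, np.nan],
    "stays_in_week_nights": [2, 3],
    "stays_in_weekend_nights": [1, 0],
    "adr": [100.0, 80.0],
    "is_canceled": [0, 1],
})

# Fill nulls first: a float column that still holds NaN cannot be cast to int
toy["children"] = toy["children"].fillna(0).astype(int)
toy["agent"] = toy["agent"].fillna(0).astype(int)

# Derived columns
toy["total_guest"] = toy["adults"] + toy["children"] + toy["babies"]
toy["total_stay_in_nights"] = (
    toy["stays_in_week_nights"] + toy["stays_in_weekend_nights"]
)
toy["revenue"] = toy["total_stay_in_nights"] * toy["adr"]

# Readable labels go into a new column, leaving the 0/1 flag untouched
toy["is_canceled_label"] = toy["is_canceled"].map(
    {0: "not canceled", 1: "is canceled"}
)

print(toy[["total_guest", "revenue", "is_canceled_label"]])
```

Keeping the numeric flag is a design choice: it lets the same column be used later for rates and correlations, while the label column is used for chart legends.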

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

### **Univariate Analysis**

Univariate analysis is a statistical analysis technique that involves analyzing and describing a single variable in a dataset.
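
Most univariate charts in this section follow the same pattern: count the rows per category, convert the counts to percentages, and print the percentage above each bar. A minimal sketch of that pattern is shown below; `category_share` and `plot_share` are hypothetical helper names and the `toy_hotel` values are made up.

```python
import matplotlib.pyplot as plt
import pandas as pd

def category_share(series):
    """Return counts and percentage share for each distinct value."""
    counts = series.value_counts()
    percent = (counts / counts.sum() * 100).round(2)
    return pd.DataFrame({"count": counts, "percent": percent})

def plot_share(series, title=""):
    """Draw a bar chart with the percentage printed above each bar."""
    share = category_share(series)
    fig, ax = plt.subplots()
    bars = ax.bar(share.index.astype(str), share["count"])
    for bar, pct in zip(bars, share["percent"]):
        ax.text(bar.get_x() + bar.get_width() / 2, bar.get_height() * 1.01,
                f"{pct}%", ha="center", weight="bold")
    ax.set_title(title)
    return ax

# Hypothetical toy column
toy_hotel = pd.Series(["City Hotel", "City Hotel", "City Hotel", "Resort Hotel"])
share = category_share(toy_hotel)
print(share)
```

Calling `plot_share(df['hotel'], 'Bookings per hotel type')` followed by `plt.show()` would draw the annotated chart. Because `value_counts()` keeps each label together with its count, labels and bars cannot get out of step.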

#### **1. Hotel Preference By the Guest**

In [None]:
#hotel
hotel_types=[];coustomer_in_hotel=[]
hotel_types=df['hotel'].unique()
for x in hotel_types:
  coustomer_in_hotel.append(df[df['hotel']==x].shape[0])
percentage=[]
for x in coustomer_in_hotel:
  y=(x/sum(coustomer_in_hotel)*100)
  percentage.append(round(y,2))
graph=plt.bar(hotel_types,coustomer_in_hotel,color=['Blue','Orange'])
i = 0
for p in graph:
    width = p.get_width()
    height = p.get_height()
    x, y = p.get_xy()
    plt.text(x+width/2,y+height*1.01,str(percentage[i])+'%',ha='center',weight='bold')
    i+=1
plt.show()

**Observation**

City Hotels are the hotels most preferred by guests.

#### **2. Most bookings by the agents**

In [None]:
#agent
# count bookings per agent and keep the 8 agents with the most bookings;
# value_counts() keeps each agent ID with its count, so agents with equal counts are not mixed up
# (agent 0 stands for bookings whose agent was missing and was filled with 0)
top_agents = df['agent'].value_counts().head(8)
# convert the agent IDs to strings so they are plotted as categories
graph = plt.bar(top_agents.index.astype(int).astype(str), top_agents.values)
plt.show()

**Observation**

Agent No. 9 made the most bookings.

#### **3. Percentage Of Booking Cancellation**

In [None]:
#is_canceled
# NOTE: this cell uses the raw 'dataset' (duplicates included, 0/1 flags), not the cleaned 'df'
canceled_list=[];cancel=[];percentage=[];
canceled_list_string=['Not Canceled Bookings','Canceled Booking']
canceled_list=dataset['is_canceled'].unique()
for x in canceled_list:
  cancel.append(dataset[dataset['is_canceled']==x].shape[0])
for x in cancel:
  y=(x/sum(cancel))*100
  percentage.append(round(y,2))
graph=plt.bar(canceled_list_string,cancel,color=['Red','Green'])
i=0
for p in graph:
  width=p.get_width()
  height=p.get_height()
  x,y=p.get_xy()
  plt.text(x+width/2,y+height*1.01,str(percentage[i])+'%',ha='center',weight='bold')
  i+=1
plt.show()

**Observation**

About 37% of all bookings in the raw dataset (before duplicates are removed) were canceled.

#### **4. Percentage of repeated guest**

In [None]:
#is_repeated_guest
value_list=df['is_repeated_guest'].unique()
repeated_guest_list=[];guest_stayed=['Not Repeated Guest','Repeated Guest']
for x in value_list:
  repeated_guest_list.append(df[df['is_repeated_guest']==x].shape[0])
percentage=[]
for x in repeated_guest_list:
  y=(x/sum(repeated_guest_list)*100)
  percentage.append(round(y,2))
graph=plt.bar(guest_stayed,repeated_guest_list,color=['Blue','Orange'])
i = 0
for p in graph:
    width = p.get_width()
    height = p.get_height()
    x, y = p.get_xy()
    plt.text(x+width/2,y+height*1.01,str(percentage[i])+'%',ha='center',weight='bold')
    i+=1
plt.show()

**Observation**

Repeated guests are very few, around 3.91%.
To win repeat customers, management needs to collect feedback from customers and improve its services.

#### **5. Bookings Made By Customer Type**

In [None]:
#customer_type
coustomer=[];coustomer_type=[];percentage=[];car_list_string=[]
coustomer=df['customer_type'].unique()
for x in coustomer:
  coustomer_type.append(df[df['customer_type']==x].shape[0])
for x in coustomer_type:
  y=(x/sum(coustomer_type))*100
  percentage.append(round(y,2))
graph=plt.bar(coustomer,coustomer_type,color=['Red','Green','Blue','Purple'])
i=0
for p in graph:
  width=p.get_width()
  height=p.get_height()
  x,y=p.get_xy()
  plt.text(x+width/2,y+height*1.01,str(percentage[i])+'%',ha='center',weight='bold')
  i+=1
plt.show()

**Observation**

Transient customers made the most bookings, approx. 82.37% of all bookings.

#### **6. Car Parking Preference**

In [None]:
#required_car_parking_spaces
car_list=[];parking_spaces=[];percentage=[];car_list_string=[]
car_list=df['required_car_parking_spaces'].unique()
for x in car_list:
  parking_spaces.append(df[df['required_car_parking_spaces']==x].shape[0])
for x in parking_spaces:
  y=(x/sum(parking_spaces))*100
  percentage.append(round(y,2))
car_list_string= list(map(str,car_list))
graph=plt.bar(car_list_string,parking_spaces,color=['Red','Green','Blue'])
i=0
for p in graph:
  width=p.get_width()
  height=p.get_height()
  x,y=p.get_xy()
  plt.text(x+width/2,y+height*1.01,str(percentage[i])+'%',ha='center',weight='bold')
  i+=1
plt.show()

**Observation**

Most customers do not require car parking spaces.

#### **7. Booking changes made by the Customer**

In [None]:
#booking_changes
booking=[];booking_changes=[];percentage=[];changes_string=[]
booking=sorted(list(df['booking_changes'].unique()))
for x in booking:
  booking_changes.append(df[df['booking_changes']==x].shape[0])
for x in booking_changes:
  y=(x/sum(booking_changes))*100
  percentage.append(round(y,2))
plt.figure(figsize=(14,5))
changes_string=list(map(str,booking))
graph=plt.bar(changes_string,booking_changes)
i=0
for p in graph:
  width=p.get_width()
  height=p.get_height()
  x,y=p.get_xy()
  plt.text(x+width/2,y+height*1.01,str(percentage[i])+'%',ha='center',weight='bold')
  i+=1
plt.show()

**Observation**

Most customers (approx. 81.8%) made no changes to their booking.

#### **8. Percentage distribution of "Deposit Type"**

In [None]:
#deposit_type
deposite_value=[];deposite_value_list=[];percentage=[]
deposite_value=df['deposit_type'].unique()
for x in deposite_value:
  deposite_value_list.append(df[df['deposit_type']==x].shape[0])
for x in deposite_value_list:
  y=x/sum(deposite_value_list)*100
  percentage.append(round(y,2))
graph=plt.bar(deposite_value,deposite_value_list)
i=0
for p in graph:
  width=p.get_width()
  height=p.get_height()
  x,y=p.get_xy()
  plt.text(x+width/2,y+height*1.01,str(percentage[i])+'%',ha='center',weight='bold')
  i+=1
plt.show()

**Observation**

Most customers pay no deposit and settle the bill after arriving at the hotel. Offering a payment option such as UPI could therefore be worthwhile.

#### **9. Food Preferences by the Customer**

In [None]:
#meal
meal_value=[];meal_value_list=[];percentage=[]
meal_value=df['meal'].unique()
for x in meal_value:
  meal_value_list.append(df[df['meal']==x].shape[0])
for x in meal_value_list:
  y=x/sum(meal_value_list)*100
  percentage.append(round(y,2))
graph=plt.bar(meal_value,meal_value_list,color=['Red','Blue','Green','Yellow','Orange'])
i=0
for p in graph:
  width=p.get_width()
  height=p.get_height()
  x,y=p.get_xy()
  plt.text(x+width/2,y+height*1.01,str(percentage[i])+'%',ha='center',weight='bold')
  i+=1
plt.show()

**Observation**

Customers prefer the BB (Bed & Breakfast) meal type.

#### **10. Country wise customers**

In [None]:
# countries
# count bookings per country and keep the 8 countries with the most bookings;
# value_counts() keeps each country with its count, so countries with equal counts do not overwrite each other
top_countries = df['country'].value_counts().head(8)
plt.bar(top_countries.index, top_countries.values)
plt.show()

**Observation**

Most customers come from Portugal (country code "PRT").

#### **11. Room Preference by Customer**

In [None]:
#reserved_room_type
room_value=[];room_type_list=[];percentage=[]
room_value=sorted(list(df['reserved_room_type'].unique()))
for x in room_value:
  room_type_list.append(df[df['reserved_room_type']==x].shape[0])
for x in room_type_list:
  y=x/sum(room_type_list)*100
  percentage.append(round(y,2))
graph=plt.bar(room_value,room_type_list)
i=0
for p in graph:
  width=p.get_width()
  height=p.get_height()
  x,y=p.get_xy()
  plt.text(x+width/2,y+height*1.01,str(percentage[i])+'%',ha='center',weight='bold')
  i+=1
plt.title("Room Type Reserved by Customers")
plt.show()

#assigned_room_type
room_value=[];room_type_list=[];percentage=[]
room_value=sorted(list(df['assigned_room_type'].unique()))
for x in room_value:
  room_type_list.append(df[df['assigned_room_type']==x].shape[0])
for x in room_type_list:
  y=x/sum(room_type_list)*100
  percentage.append(round(y,2))
graph=plt.bar(room_value,room_type_list)
i=0
for p in graph:
  width=p.get_width()
  height=p.get_height()
  x,y=p.get_xy()
  plt.text(x+width/2,y+height*1.01,str(percentage[i])+'%',ha='center',weight='bold')
  i+=1
plt.title("Assigned Room type by the Hotel Management")
plt.show()

**Observation**

Room types "A" and "D" are the most in demand among customers and are also the types most often assigned by the hotel management.

#### **12. Month Wise Hotel Booking**

In [None]:
#arrival_date_month
month_list_name=list(calendar.month_name)
month_value=[];month_value_list=[];percentage=[];new_dict={};month_dict={}
month_value=df['arrival_date_month'].unique()
for x in month_value:
  month_value_list.append(df[df['arrival_date_month']==x].shape[0])
for x in month_value_list:
  y=x/sum(month_value_list)*100
  percentage.append(round(y,2))
new_dict=dict(zip(month_value,month_value_list))
for x in month_list_name[1:]:
  month_dict[x]=new_dict[x]
plt.figure(figsize=(12,5))
plt.plot(list(month_dict.keys()),list(month_dict.values()))

**Observation**

July and August had the most bookings. Summer vacation may be the reason.
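
A month-wise chart only reads correctly if the months appear in calendar order rather than in order of first appearance. A minimal sketch of one way to do that is shown below, assuming English month names (the default locale of `calendar.month_name`); the `toy` frame is made up.

```python
import calendar
import pandas as pd

# Hypothetical toy data with months in arbitrary order
toy = pd.DataFrame({
    "arrival_date_month": ["July", "January", "July", "August", "January", "July"],
})

# calendar.month_name[0] is an empty string, so skip it
month_order = list(calendar.month_name)[1:]

# Count bookings per month, then reindex into calendar order;
# months that do not occur in the data get a count of 0
monthly_bookings = (
    toy["arrival_date_month"]
    .value_counts()
    .reindex(month_order, fill_value=0)
)

print(monthly_bookings)
```

The resulting Series can be passed straight to `plt.plot(monthly_bookings.index, monthly_bookings.values)`.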

#### **13. Year wise hotel booking**

In [None]:
month_dict={};month_list_name=[];month_list_number=[]
#month_list_name=['January', 'February', 'March', 'April', 'May', 'June', 'July', 'August', 'September', 'October', 'November', 'December']
month_list_name=list(calendar.month_name)
month_list_number=[0,1,2,3,4,5,6,7,8,9,10,11,12]
month_dict=dict(zip(month_list_name,month_list_number))
# map each month name to its number with a vectorised lookup
df['arrival_date_month_number']=df['arrival_date_month'].map(month_dict)
years=df['arrival_date_year'].unique()
coustomers=[];dataset_list=[]
df_2015=pd.DataFrame()
df_2016=pd.DataFrame()
df_2017=pd.DataFrame()
df_2015=df[df['arrival_date_year']==2015]
df_2016=df[df['arrival_date_year']==2016]
df_2017=df[df['arrival_date_year']==2017]
dataset_list=[df_2015,df_2016,df_2017]
for x in dataset_list:
  coustomers.append(x.shape[0])
years_string= list(map(str,years))
percentage=[]
for x in coustomers:
  y=((x/sum(coustomers))*100)
  percentage.append(round(y,2))
plt.figure(figsize=(10,5))
graph=plt.bar(years_string,coustomers,color=['Red','Blue','Green'])
i = 0
for p in graph:
    width = p.get_width()
    height = p.get_height()
    x, y = p.get_xy()
    plt.text(x+width/2,y+height*1.01,str(percentage[i])+'%',ha='center',weight='bold')
    i+=1
plt.show()

**Observation**

2016 has the highest percentage of bookings

#### **14. Distribution Channel Preferred by the Customers**

In [None]:
#distribution_channel
channel_value=[];channel_value_list=[];percentage=[]
channel_value=df['distribution_channel'].unique()
for x in channel_value:
  channel_value_list.append(df[df['distribution_channel']==x].shape[0])
for x in channel_value_list:
  y=x/sum(channel_value_list)*100
  percentage.append(round(y,2))
graph=plt.bar(channel_value,channel_value_list)
i=0
for p in graph:
  width=p.get_width()
  height=p.get_height()
  x,y=p.get_xy()
  plt.text(x+width/2,y+height*1.01,str(percentage[i])+'%',ha='center',weight='bold')
  i+=1
plt.show()

**Observation**

"TA/TO" is the distribution channel most preferred by customers for hotel bookings.

### **Bivariate and Multivariate Analysis**

Bivariate analysis is a statistical analysis technique that involves analyzing the relationship between two variables in a dataset. The goal of bivariate analysis is to better understand the association or correlation between the two variables and to identify any patterns or trends in the data.

#### **1. Hotel Types with Highest adr**

In [None]:
# adr
resort_df=df[df['hotel']=='Resort Hotel']
resort_df=resort_df.groupby(['hotel','arrival_date_month'])['adr'].mean().reset_index()
city_df=df[df['hotel']=='City Hotel']
city_df=city_df.groupby(['hotel','arrival_date_month'])['adr'].mean().reset_index()
new_dict={};new_dict_2={}
#print(resort_df)
#print(city_df)
new_list=list(month_dict.keys())
for x in new_list[1:]:
  # .iloc[0] extracts the single monthly mean before it is converted to int
  new_dict[x]=int(resort_df[resort_df['arrival_date_month']==x]['adr'].iloc[0])
  new_dict_2[x]=int(city_df[city_df['arrival_date_month']==x]['adr'].iloc[0])

#print(new_dict)
#print(new_dict_2)
plt.figure(figsize=(12,5))
plt.plot(list(new_dict.keys()),list(new_dict.values()),label="Resort Hotel")
plt.plot(list(new_dict_2.keys()),list(new_dict_2.values()),label="City Hotel")
plt.legend()
plt.show()

**Observation**

From June to August the Resort Hotel has a higher ADR than the City Hotel; in the remaining months its ADR is lower than the City Hotel's.

#### **2. Lead time of Hotel Types**

In [None]:
# lead_time
lead_df=pd.DataFrame()
lead_df=df.groupby(['hotel'])['lead_time'].mean().reset_index()
lead_df.plot(kind='bar',x='hotel',y='lead_time',color='green')

**Observation**

The Resort Hotel has a slightly higher average lead time.

#### **3. Booking Cancellation Rate For Hotel Type**

In [None]:
# cancellation
df['test'] = 1
cancellation=pd.DataFrame();
cancellation=df.groupby(['hotel','is_canceled'])['test'].sum().reset_index()
plt.figure(figsize=(6,6))
sns.barplot(x='hotel',y='test',hue='is_canceled',data=cancellation)

**Observation**

The City Hotel has more cancelled bookings than the Resort Hotel, and also more bookings overall.

#### **4. Waiting Days for Hotel Type**

In [None]:
waiting=pd.DataFrame();
waiting=df.groupby(['hotel'])['days_in_waiting_list'].sum().reset_index()
plt.figure(figsize=(6,6))
sns.barplot(x='hotel',y='days_in_waiting_list',data=waiting)

**Observation**

The City Hotel is in demand among customers, who are willing to wait longer to stay there.

#### **5. Optimal Stay Length in Hotels**

In [None]:
df['total_stays'] = df['stays_in_weekend_nights']+df['stays_in_week_nights']
stays_df=df.groupby(['hotel','total_stays'])['test'].sum().reset_index()
plt.figure(figsize=(12,5))
sns.barplot(x='total_stays',y='test',hue='hotel',data=stays_df)

**Observation**

Customers preferred to stay in the City Hotel rather than the Resort Hotel.

#### **6. Contribution of The Distribution channel to Companies Revenue**

In [None]:
# distribution channel
distribution =df.groupby(['hotel','distribution_channel'])['test'].sum().reset_index()
plt.figure(figsize=(12,5))
sns.barplot(x='distribution_channel',y='test',hue='hotel',data = distribution)

**Observation**

"TA/TO" is the distribution channel with the most bookings for both hotel types, which suggests it contributes the most to the company's revenue.

#### **7. Cancellation Rate across all Distribution Channel**

In [None]:
# distribution channel
cancel = df[df['is_canceled']=='is canceled']
cancel = cancel.groupby(['distribution_channel','hotel']).size().reset_index().rename(columns={0:'counts'})
cancel
plt.figure(figsize=(10,8))
sns.barplot(x='distribution_channel',y='counts',hue='hotel',data = cancel)
plt.title('Cancellation Rate Among The Distribution Channel')

**Observation**

1. For "TA/TO", the City Hotel has the highest number of cancellations.
2. For "Direct", both hotels have about the same number of cancellations.
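
The chart above plots cancellation counts. A cancellation rate divides those counts by the total number of bookings in each group. The sketch below shows one way to compute it, assuming `is_canceled` is still a numeric 0/1 flag; the `toy` data and the output column names are made up for illustration.

```python
import pandas as pd

# Hypothetical toy bookings; is_canceled is kept as a numeric 0/1 flag here
toy = pd.DataFrame({
    "distribution_channel": ["TA/TO"] * 8 + ["Direct"] * 2,
    "hotel": ["City Hotel"] * 8 + ["Resort Hotel"] * 2,
    "is_canceled": [1, 1, 0, 0, 0, 0, 0, 0, 1, 0],
})

# The mean of a 0/1 flag is the share of cancelled bookings in each group
summary = (
    toy.groupby(["distribution_channel", "hotel"])["is_canceled"]
    .agg(bookings="count", cancellations="sum", cancellation_rate="mean")
    .reset_index()
)
summary["cancellation_rate"] = (summary["cancellation_rate"] * 100).round(2)

print(summary)
```

In this toy example "TA/TO" has more cancellations than "Direct" but a lower cancellation rate, which is why counts and rates can lead to different conclusions.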

#### **8. Cancellation rate for each Market Segment**

In [None]:
# market_segment
market =df [df['is_canceled']=='is canceled']
market= market.groupby(['hotel','market_segment'])['test'].sum().reset_index()
plt.figure(figsize=(12,5))
sns.barplot(x='market_segment',y='test',hue='hotel',data=market)

**Observation**

"Online TA" has the most cancellations for both types of hotel. To reduce booking cancellations, the hotels need to revisit their refund policies.

#### **9. ADR Across Market Segment**

In [None]:
# market_segment
# NOTE: only cancelled bookings are included in this average
adr_df = df[df['is_canceled']=='is canceled']
adr_df = adr_df.groupby(['market_segment'])['adr'].mean().reset_index()
plt.figure(figsize=(12,5))
sns.barplot(x='market_segment',y='adr',data= adr_df)

**Observation**

"Direct" and "Online TA" have the highest ADR among all market segments.

#### **10. Relationship Between Repeated guests and First time booking**

In [None]:
plt.figure(figsize=(10,8))
sns.barplot(x=df['is_repeated_guest'],y=df['previous_bookings_not_canceled'])
#plt.xticks([0,1])
plt.title("Relationship Between Repeated guests and First time booking")
plt.show()

**Observation**

Guests who are not repeated (first-time guests) are more likely to cancel their booking.

#### **11. Relationship Between ADR and Total Number of People**

In [None]:
df[df['adr']==df['adr'].max()]

In [None]:
plt.figure(figsize=(8,6))
new_df = df[(df['total_guest']<6) & (df['adr']<5400)]
sns.boxplot(x = new_df['total_guest'], y=new_df['adr'])
plt.title('ADR vs Total number of People')

**Observation**

As the total number of people increases, the ADR also increases.

#### **Correlation Heat Map**

In [None]:
plt.figure(figsize=(18,10))
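# numeric_only=True is needed in pandas 2.x, where corr() no longer skips text columns silently
sns.heatmap(df.corr(numeric_only=True),annot=True)
plt.title("Correlation Heatmap")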

1. is_canceled and same_room_alloted_or_not are negatively correlated. That means a customer is unlikely to cancel a booking because the assigned room differs from the reserved one. (Both columns must be numeric to appear in the heatmap; same_room_alloted_or_not is not created in the code above.)

2. lead_time and total stay are positively correlated. That means the longer the customer's stay, the longer the lead time tends to be.
3. adults, children, and babies are correlated with each other. The more people in a booking, the higher the adr tends to be.
4. is_repeated_guest and previous_bookings_not_canceled are strongly correlated. Repeated guests may be less likely to cancel their bookings.
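
How the correlation matrix behind the heatmap treats text and numeric columns can be checked on a small frame. The sketch below uses made-up `toy` data and assumes pandas 1.5 or later, where `corr` accepts `numeric_only`.

```python
import pandas as pd

# Hypothetical toy data: one text column and three numeric columns
toy = pd.DataFrame({
    "hotel": ["City Hotel", "Resort Hotel", "City Hotel", "Resort Hotel"],
    "total_guest": [1, 2, 3, 4],
    "adr": [50.0, 100.0, 150.0, 200.0],
    "lead_time": [40, 30, 20, 10],
})

# numeric_only=True skips text columns such as 'hotel'
corr = toy.corr(numeric_only=True)

print(corr.round(2))
```

Only columns that are numeric at the time `corr` is called show up in the matrix, so a flag that has been replaced with text labels is left out of the heatmap.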

## **5. Conclusion**

1. The guests showed a preference for City hotels, making it the busiest type of hotel.

2. About 37% of all bookings were cancelled.

3. Only 3.9% of guests revisited the hotels, indicating a low retention rate.

4. Over 81.8% of bookings had 0 changes made, while around 12.47% had single changes made.

5. The majority of customers (91.6%) did not require car parking spaces.

6. About 79.1% of bookings were made through travel agents/tour operators.

7. Bed & Breakfast (BB) was the most preferred meal type among guests.

8. More than 25,000 guests were from Portugal, making it the country with the highest number of guests.

9. Most bookings for City and Resort hotels were made in 2016.

10. City hotels generated more revenue than Resort hotels, with higher average ADR.

11. City hotels had a higher booking cancellation rate.

12. Resort hotels had a higher average lead time.

13. Waiting time was higher for City hotels compared to Resort hotels, indicating City hotels were busier.

14. Resort hotels had the highest number of repeated guests.

15. The optimal stay for both types of hotels was less than 7 days; most guests stayed for a week or less.