# **Project Name**    - Hotel Booking Analysis



##### **Project Type**    - EDA
##### **Contribution**    - Individual


# **Project Summary -**


Introduction:

The Hotel Booking Analysis EDA project examines the volatile hotel industry, focusing on booking patterns, cancellations, and the underlying factors that influence customer behavior. The dataset covers two hotel types, a Resort Hotel and a City Hotel, and includes booking dates, length of stay, and demographic details.

Objective:

The primary goal is to analyze factors affecting hotel bookings, unveiling trends for reporting and predicting future bookings. Through exploratory data analysis (EDA), the project aims to provide insights for strategic decision-making.

Analysis Approach:

The EDA is structured into three phases:

Univariate Analysis:

Examining individual variables, this phase offers a foundational understanding of the dataset.

Key variables explored include booking patterns, length of stay, and demographic factors.

Bivariate Analysis:

Comparing two variables, this phase uncovers relationships between them.

Focus areas include the impact of booking lead time on cancellations and the correlation between stay length and customer demographics.

Multivariate Analysis:

Expanding the scope to more than two variables, this phase deepens the exploration.

Analysis involves understanding the interplay between factors such as booking source, seasonality, and customer preferences.



# **GitHub Link -**

Provide your GitHub Link here.

# **Problem Statement**


**Write Problem Statement Here.**

#### **Define Your Business Objective?**

Answer Here.

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure that, for each chart, the following format is answered.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 20 logical & meaningful charts having important insights.


[ Hints : - Do the visualization in a structured way while following the "UBM" rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

### Dataset Loading

In [None]:
# Load Dataset
from google.colab import drive
drive.mount('/content/drive')

In [None]:
data_path = "/content/drive/MyDrive/Hotel Bookings.csv"
df = pd.read_csv(data_path)
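
The guidelines above ask for exception handling, so the loading step could be wrapped in a small helper. The following is a minimal sketch only; `load_bookings` is a hypothetical name that does not appear in the notebook, and the check it performs is an assumption about what a useful failure message would look like.

```python
from pathlib import Path

import pandas as pd


def load_bookings(csv_path):
    """Read the bookings CSV, failing early with a clear message if it is missing.

    Hypothetical helper for illustration; not part of the original notebook.
    """
    path = Path(csv_path)
    if not path.is_file():
        # Raise before pandas does, so the message names the expected location
        raise FileNotFoundError(f"Dataset not found at {path}")
    return pd.read_csv(path)
```

In the notebook this would be called as `df = load_bookings(data_path)`, which keeps the rest of the cells unchanged.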

### Dataset First View

In [None]:
# Dataset First Look
df.head()

In [None]:
df.tail()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
print(f"This data set has a total number of {df.shape[0]} rows and {df.shape[1]} columns.")

### Dataset Information

In [None]:
# Dataset Info
df.info()

#### Duplicate Values

In [None]:
df1=df.copy()
df1['reservation_status_date'] = pd.to_datetime(df1['reservation_status_date'], format = '%Y-%m-%d')


In [None]:
# Dataset Duplicate Value Count
df1.duplicated().value_counts()

In [None]:
# Visualizing duplicates through a count plot
plt.figure(figsize=(4,4))
sns.countplot(x=df1.duplicated())

Hence there are 31,994 duplicate rows in our dataset.

In [None]:
# To remove the duplicate rows
df1 = df1.drop_duplicates()
df1.shape
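
To make the duplicate check concrete, here is a small sketch on a toy frame. The values are made up purely for illustration and are not taken from the dataset; `toy`, `n_duplicates`, and `deduped` are hypothetical names.

```python
import pandas as pd

# Toy frame: rows 1 and 3 repeat row 0 exactly
toy = pd.DataFrame({
    "hotel": ["City Hotel", "City Hotel", "Resort Hotel", "City Hotel"],
    "lead_time": [10, 10, 45, 10],
})

# duplicated() marks every repeat of an earlier row as True
n_duplicates = int(toy.duplicated().sum())

# drop_duplicates() keeps the first occurrence of each distinct row
deduped = toy.drop_duplicates().reset_index(drop=True)

print(n_duplicates, deduped.shape)
```

Note that `duplicated()` counts only the repeats, not the first occurrence, which is why the row count after `drop_duplicates()` falls by exactly that number.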

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
df1.isna().sum().sort_values(ascending = False)[:5].reset_index().rename(columns={'index':'column',0:'Null Values'})

In [None]:
# Visualizing the missing values
plt.figure(figsize=(15, 8))
sns.heatmap(df1.isnull(), cbar=False, yticklabels=False,cmap='viridis')
plt.xlabel("Name Of Columns")
plt.title("Places of missing values in column")

In [None]:
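# Fill null values with 0 in the agent, children and company columns.
# Assigning the result back avoids in-place fillna on a column selection,
# which newer pandas versions warn about or ignore.
null_columns = ['agent', 'children', 'company']
for col in null_columns:
  df1[col] = df1[col].fillna(0)

# Replace missing country values with 'others'
df1['country'] = df1['country'].fillna('others')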

### What did you know about your dataset?

The columns company, agent, country, and children contain null values.

* For company and agent, I fill the missing values with 0.
* For country, I fill the missing values with the string 'others' (assuming the country was not found while collecting the data, so the user selected the 'Others' option).
* As the children column has only 4 missing values, I replace them with 0, assuming no children.
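
The filling rules described above can also be expressed as a single mapping from column to fill value. This is a sketch on made-up toy data; the `fill_values` dictionary simply restates the choices in the text, and passing it to `DataFrame.fillna` is one possible way to apply them, not necessarily how the notebook does it.

```python
import numpy as np
import pandas as pd

# Toy frame with one missing value in each affected column
toy = pd.DataFrame({
    "agent": [9.0, np.nan],
    "company": [np.nan, 40.0],
    "children": [np.nan, 2.0],
    "country": ["PRT", np.nan],
})

# One fill value per column, as stated in the text
fill_values = {"agent": 0, "company": 0, "children": 0, "country": "others"}

# fillna with a dict fills each listed column with its own value
filled = toy.fillna(value=fill_values)

print(filled.isna().sum())
```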

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
df1.columns

In [None]:
# Dataset Describe
df1.describe()

### Variables Description

Answer Here:

hotel : Hotel(Resort Hotel or City Hotel)

is_canceled : Value indicating if the booking was canceled (1) or not (0)

lead_time : Number of days that elapsed between the entering date of the booking into the PMS and the arrival date

arrival_date_year : Year of arrival date

arrival_date_month : Month of arrival date

arrival_date_week_number : Week number of year for arrival date

arrival_date_day_of_month : Day of arrival date

stays_in_weekend_nights : Number of weekend nights (Saturday or Sunday) the guest stayed or booked to stay at the hotel

stays_in_week_nights : Number of week nights (Monday to Friday) the guest stayed or booked to stay at the hotel

adults : Number of adults

children : Number of children

babies : Number of babies

meal : Type of meal booked. Categories are presented in standard hospitality meal packages:

country : Country of origin.

market_segment : Market segment designation. In categories, the term “TA” means “Travel Agents” and “TO” means “Tour Operators”

distribution_channel : Booking distribution channel. The term “TA” means “Travel Agents” and “TO” means “Tour Operators”

is_repeated_guest : Value indicating if the booking name was from a repeated guest (1) or not (0)

previous_cancellations : Number of previous bookings that were cancelled by the customer prior to the current booking

previous_bookings_not_canceled : Number of previous bookings not cancelled by the customer prior to the current booking

reserved_room_type : Code of room type reserved. Code is presented instead of designation for anonymity reasons.

assigned_room_type : Code for the type of room assigned to the booking.

booking_changes : Number of changes/amendments made to the booking from the moment the booking was entered on the PMS until the moment of check-in or cancellation

deposit_type : Indication on if the customer made a deposit to guarantee the booking.

agent : ID of the travel agency that made the booking

company : ID of the company/entity that made the booking or responsible for paying the booking.

days_in_waiting_list : Number of days the booking was in the waiting list before it was confirmed to the customer

customer_type : Type of booking, assuming one of four categories

adr : Average Daily Rate as defined by dividing the sum of all lodging transactions by the total number of staying nights

required_car_parking_spaces : Number of car parking spaces required by the customer

total_of_special_requests : Number of special requests made by the customer (e.g. twin bed or high floor)

reservation_status : Reservation last status, assuming one of three categories:

* Canceled – booking was canceled by the customer
* Check-Out – customer has checked in but already departed
* No-Show – customer did not check-in and did inform the hotel of the reason why

reservation_status_date : Date at which the last status was set. This variable can be used in conjunction with reservation_status to understand when the booking was canceled or when the customer checked out of the hotel
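
As a worked example of the `adr` definition above (sum of all lodging transactions divided by the total number of staying nights), here is a minimal sketch. The function name `average_daily_rate` and the transaction amounts are hypothetical and chosen only to show the arithmetic.

```python
def average_daily_rate(lodging_transactions, staying_nights):
    """Sum of lodging transactions divided by the number of staying nights."""
    if staying_nights <= 0:
        # The ratio is undefined for a stay of zero nights
        raise ValueError("staying_nights must be positive")
    return sum(lodging_transactions) / staying_nights


# Three nights billed at 120, 95 and 85 give an average of 100 per night
example_adr = average_daily_rate([120.0, 95.0, 85.0], 3)
print(example_adr)
```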




### Check Unique Values for each variable.

In [None]:
# Checking the unique values in categorical columns.
categorical_cols=list(set(df1.drop(columns=['reservation_status_date','country','arrival_date_month']).columns)-set(df1.describe()))
for col in categorical_cols:
  print(f'Unique values in column {col} are: {df1[col].unique()}')
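
An alternative way to pick out the non-numeric columns is `select_dtypes`, sketched below on a made-up toy frame. This is not the approach used in the cell above; in the real frame it would also return the datetime and high-cardinality columns, which would still need to be dropped as the cell does.

```python
import pandas as pd

toy = pd.DataFrame({
    "hotel": ["City Hotel", "Resort Hotel", "City Hotel"],
    "meal": ["BB", "HB", "BB"],
    "lead_time": [10, 45, 3],
})

# Keep every column whose dtype is not numeric
categorical_cols = toy.select_dtypes(exclude="number").columns.tolist()

# Collect the distinct values of each categorical column
unique_values = {col: sorted(toy[col].unique()) for col in categorical_cols}

for col, values in unique_values.items():
    print(f"Unique values in column {col} are: {values}")
```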

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Write your code to make your dataset analysis ready.

df1['total_people'] = df1['adults'] + df1['babies'] + df1['children']
df1['total_stay'] = df1['stays_in_weekend_nights'] + df1['stays_in_week_nights']

### What all manipulations have you done and insights you found?

1. Removed null values.
2. Checked unique values in categorical columns.
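
The two derived columns from the wrangling cell can be checked on a toy frame as sketched below. The data is made up, and the zero-guest check at the end is an added sanity step that the notebook itself does not perform.

```python
import pandas as pd

toy = pd.DataFrame({
    "adults": [2, 1, 0],
    "children": [1.0, 0.0, 0.0],
    "babies": [0, 0, 0],
    "stays_in_weekend_nights": [2, 0, 1],
    "stays_in_week_nights": [3, 1, 0],
})

# Same derived columns as in the wrangling cell
toy["total_people"] = toy["adults"] + toy["babies"] + toy["children"]
toy["total_stay"] = toy["stays_in_weekend_nights"] + toy["stays_in_week_nights"]

# Hypothetical sanity check: bookings that list no guests at all
no_guest_rows = toy[toy["total_people"] == 0]

print(toy[["total_people", "total_stay"]])
print(len(no_guest_rows))
```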

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

Univariate Analysis

Univariate analysis is a statistical analysis technique that involves analyzing and describing a single variable in a dataset.

1) Which type of hotel is most preferred by the guests?

In [None]:
# Chart - 1 visualization code
# Visualizing the share of each hotel type with a pie chart.
df1['hotel'].value_counts().plot.pie(explode=[0.05, 0.05], autopct='%1.1f%%', shadow=True, figsize=(10,8),fontsize=20)
plt.title('Most Preferred Hotel')

##### 1. Why did you pick the specific chart?

A pie chart is a useful visualization for hotel booking analysis when you want to represent the composition of a whole in terms of its parts, here the share of bookings for each hotel type.

##### 2. What is/are the insight(s) found from the chart?

City Hotel is the hotel most preferred by guests, so city hotels have the most bookings.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes. By acting on factors such as booking source, cancellation rate, room type, meal plan, and customer segments, the hotels can drive positive growth.

#### Chart - 2

2) Which Agent made the most bookings?

In [None]:
# Chart - 2 visualization code
# Count bookings per agent and sort from most to fewest
highest_bookings = df1.groupby('agent').size().reset_index(name='Most_Bookings').sort_values(by='Most_Bookings', ascending=False)

# Agent 0 is the placeholder we used for missing (NaN) agent values, so drop it
highest_bookings = highest_bookings[highest_bookings['agent'] != 0]

# Take the top 10 agents by number of bookings
top_ten_highest_bookings = highest_bookings[:10]

top_ten_highest_bookings

plt.figure(figsize=(18,8))
sns.barplot(x=top_ten_highest_bookings['agent'],y=top_ten_highest_bookings['Most_Bookings'],order=top_ten_highest_bookings['agent'])
plt.xlabel('Agent No')
plt.ylabel('Number of Bookings')
plt.title("Most Bookings Made by the agent")

##### 1. Why did you pick the specific chart?




A bar chart is well suited to comparing counts across categories.


##### 2. What is/are the insight(s) found from the chart?

Agent ID 9 made the most bookings.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 3

3) What is the percentage of cancellation?

In [None]:
# Chart - 3 visualization code

df1['is_canceled'].value_counts().plot.pie(explode=[0.05, 0.05], autopct='%1.1f%%', shadow=True, figsize=(10,8),fontsize=20)
plt.title("Cancellation and non Cancellation")

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

27.5% of the bookings were cancelled.
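
The percentage shown on the pie chart can also be computed directly, which is handy for quoting the figure in text. The sketch below uses a made-up toy column, so its result is not the dataset's cancellation rate; `cancel_share` and `cancel_rate` are hypothetical names.

```python
import pandas as pd

# Toy column: one cancelled booking out of four
toy = pd.DataFrame({"is_canceled": [0, 0, 0, 1]})

# normalize=True returns fractions instead of counts; scale to percent
cancel_share = toy["is_canceled"].value_counts(normalize=True).mul(100).round(1)

# The share for label 1 is the cancellation rate
cancel_rate = cancel_share.loc[1]
print(cancel_rate)
```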

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 4

4) What is the Percentage of repeated guests?

In [None]:
# Chart - 4 visualization code


df1['is_repeated_guest'].value_counts().plot.pie(explode=(0.05,0.05),autopct='%1.1f%%',shadow=True,figsize=(12,8),fontsize=20)

plt.title(" Percentage (%) of repeated guests")

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Repeated guests are very few, only 3.9%.
To retain guests, management should collect feedback from guests and try to improve the services.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 5

5) What is the percentage distribution of "Customer Type"?

In [None]:
# Chart - 5 visualization code
df1['customer_type'].value_counts().plot.pie(explode=[0.05]*4,shadow=True,autopct='%1.1f%%',figsize=(12,8),fontsize=15,labels=None)


labels=df1['customer_type'].value_counts().index.tolist()
plt.title('% Distribution of Customer Type')
plt.legend(bbox_to_anchor=(0.85, 1), loc='upper left', labels=labels)


##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Transient is the most common customer type at 82.4%. The percentage of bookings associated with a Group is very low.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 6

6) What is the percentage distribution of required_car_parking_spaces?

In [None]:
# Chart - 6 visualization code

df1['required_car_parking_spaces'].value_counts().plot.pie(explode=[0.05]*5, autopct='%1.1f%%',shadow=False,figsize=(12,8),fontsize=15,labels=None)

labels=df1['required_car_parking_spaces'].value_counts().index
plt.title('% Distribution of required car parking spaces')
plt.legend(bbox_to_anchor=(0.85, 1), loc='upper left', labels=labels)

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

91.6% of guests did not require a parking space. Only 8.3% of guests required one parking space.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 7

7) What is the percentage of booking changes made by the customer?

In [None]:
# Chart - 7 visualization code
# Name the index and the counts explicitly so the column names do not depend on the pandas version
booking_changes_df = df1['booking_changes'].value_counts().rename_axis('number_booking_changes').reset_index(name='Counts')

plt.figure(figsize=(12,8))
sns.barplot(x=booking_changes_df['number_booking_changes'],y=booking_changes_df['Counts']*100/df1.shape[0])
plt.title("% of Booking change")
plt.xlabel('Number of booking changes')
plt.ylabel('Percentage(%)')

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Almost 82% of the bookings were not changed by guests.
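
The count table behind this chart can be built so that its column names are the same across pandas versions, since the default names produced by `value_counts().reset_index()` changed in pandas 2.0. The sketch below uses made-up toy data and the same column names as the chart cell.

```python
import pandas as pd

toy = pd.DataFrame({"booking_changes": [0, 0, 0, 1, 2]})

# rename_axis names the index; reset_index(name=...) names the counts column
counts = (
    toy["booking_changes"]
    .value_counts()
    .rename_axis("number_booking_changes")
    .reset_index(name="Counts")
)

# Express each count as a percentage of all bookings
counts["Percentage"] = counts["Counts"] * 100 / len(toy)
print(counts)
```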

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 8

8) What is the percentage distribution of deposit type?

In [None]:
# Chart - 8 visualization code
df1['deposit_type'].value_counts().plot.pie(explode=(0.5,0.5,0.05),autopct='%1.1f%%',shadow=False,figsize=(14,8),fontsize=20,labels=None)
plt.title("% Distribution of deposit type")
labels=df1['deposit_type'].value_counts().index.tolist()
plt.legend(bbox_to_anchor=(0.85, 1), loc='upper left', labels=labels)



##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

98.7% of guests chose the "No Deposit" deposit type.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 9

Which type of food is mostly preferred by the guests?

In [None]:
# Chart - 9 visualization code

# df1['meal'].value_counts().plot.pie(explode=[0.05, 0.05,0.05,0.05,0.05], autopct='%1.1f%%', shadow=True, figsize=(20,15),fontsize=20)
plt.figure(figsize=(18,8))
sns.countplot(x=df1['meal'])
plt.xlabel('Meal Type')
plt.ylabel('Count')
plt.title("Preferred Meal Type")


##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

* The meal type most preferred by guests is BB (Bed and Breakfast).
* HB (Half Board) and SC (Self Catering) are about equally preferred.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 10

 From which country the most guests are coming?

In [None]:
# Chart - 10 visualization code

# Count the guests from each country and keep the top 10.
# Naming the index and counts explicitly keeps the column names stable across pandas versions.
country_df = df1['country'].value_counts().rename_axis('country').reset_index(name='count of guests')[:10]

# Visualizing by  plotting the graph
plt.figure(figsize=(20,8))
sns.barplot(x=country_df['country'],y=country_df['count of guests'])
plt.xlabel('Country')
plt.ylabel('Number of guests',fontsize=12)
plt.title("Number of guests from different Countries")


##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Most guests come from Portugal: more than 25,000 guests are from Portugal.

Abbreviations for countries:

* PRT - Portugal
* GBR - United Kingdom
* FRA - France
* ESP - Spain
* DEU - Germany
* ITA - Italy
* IRL - Ireland
* BEL - Belgium
* BRA - Brazil
* NLD - Netherlands

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 11

Which is the most preferred room type by the customers?

In [None]:
# Chart - 11 visualization code

#set plotsize
plt.figure(figsize=(18,8))

#plotting
sns.countplot(x=df1['assigned_room_type'],order=df1['assigned_room_type'].value_counts().index)
#  set xlabel for the plot
plt.xlabel('Room Type')
# set y label for the plot
plt.ylabel('Count of Room Type')
#set title for the plot
plt.title("Most preferred Room type")

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

The most preferred room type is "A".

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 12

In which month most of the bookings happened?

In [None]:
# Chart - 12 visualization code

# groupby arrival_date_month and taking the hotel count
bookings_by_months_df=df1.groupby(['arrival_date_month'])['hotel'].count().reset_index().rename(columns={'hotel':"Counts"})
# Create list of months in order
months = ['January', 'February', 'March', 'April', 'May', 'June', 'July', 'August', 'September', 'October', 'November', 'December']
# creating df which will map the order of above months list without changing its values.
bookings_by_months_df['arrival_date_month']=pd.Categorical(bookings_by_months_df['arrival_date_month'],categories=months,ordered=True)
# sorting by arrival_date_month
bookings_by_months_df=bookings_by_months_df.sort_values('arrival_date_month')

bookings_by_months_df


In [None]:
plt.figure(figsize=(20,8))

# plotting lineplot with x = months and y = booking counts
sns.lineplot(x=bookings_by_months_df['arrival_date_month'],y=bookings_by_months_df['Counts'])

# set title for the plot
plt.title('Number of bookings across each month')
#set x label
plt.xlabel('Month')
#set y label
plt.ylabel('Number of bookings')

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

July and August had the most bookings. Summer vacation may be the reason.
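
The chart relies on the months appearing in calendar order rather than alphabetical order. The cells above do this with an ordered `pd.Categorical`; the sketch below shows an alternative that reindexes the counts against the month list, which also keeps months that have no bookings. The toy data is made up and `MONTHS` and `monthly` are hypothetical names.

```python
import pandas as pd

MONTHS = ["January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December"]

toy = pd.DataFrame({
    "arrival_date_month": ["March", "January", "March", "December"],
})

# Count bookings per month, then reorder to calendar order;
# months absent from the data get a count of 0
monthly = toy["arrival_date_month"].value_counts().reindex(MONTHS, fill_value=0)

# Month with the highest number of bookings
peak_month = monthly.idxmax()
print(monthly)
print(peak_month)
```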

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 13

Which Distribution channel is mostly used for hotel bookings?

In [None]:
# Chart - 13 visualization code

# Visualizing the distribution channels with a pie chart.


#Creating labels
labels=df1['distribution_channel'].value_counts().index.tolist()

# creating new df of distribution channel
# (index and counts are named explicitly so the column names do not depend on the pandas version)
distribution_channel_df = df1['distribution_channel'].value_counts().rename_axis('distribution_channel').reset_index(name='count')

#adding percentage columns to the distribution_channel_df
distribution_channel_df['percentage']=round(distribution_channel_df['count']*100/df1.shape[0],1)

#Creating list of percentage
sizes=distribution_channel_df['percentage'].values.tolist()

#plotting the pie chart
df1['distribution_channel'].value_counts().plot.pie(explode=[0.05, 0.05,0.05,0.05,0.05], shadow=False, figsize=(15,8),fontsize=10,labels=None)

# setting legends with the percentage values
labels = [f'{l}, {s}%' for l, s in zip(labels, sizes)]
plt.legend(bbox_to_anchor=(0.85, 1), loc='upper left', labels=labels)
plt.title(' Mostly Used Distribution Channel for Hotel Bookings ')

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

'TA/TO' is the most used channel for booking hotels (79.1%).
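
The legend labels that pair each channel with its share can be built in a couple of lines. This sketch uses a made-up toy column, so the percentages below are not the dataset's; `shares` and `labels` mirror the role of the variables in the chart cell but are computed in a slightly different way.

```python
import pandas as pd

toy = pd.DataFrame({
    "distribution_channel": ["TA/TO", "TA/TO", "TA/TO", "Direct"],
})

# Share of each channel in percent, most frequent first
shares = toy["distribution_channel"].value_counts(normalize=True).mul(100).round(1)

# One legend entry per channel, e.g. "TA/TO, 75.0%"
labels = [f"{name}, {pct}%" for name, pct in shares.items()]
print(labels)
```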

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 14 - Correlation Heatmap

In [None]:
# Correlation Heatmap visualization code

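# Restrict the correlation to numeric columns; recent pandas versions raise an error on text columns otherwise
correlation_matrix = df1.corr(numeric_only=True)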

plt.figure(figsize=(10, 8))  # Adjust the size if needed
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt=".2f", linewidths=.5)
plt.title('Correlation Heatmap')
plt.show()




##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

1) is_canceled and same_room_alloted_or_not are negatively correlated. That means a customer is unlikely to cancel a booking if they do not get the same room as the one reserved.
2) lead_time and total_stay are positively correlated. That means the longer the customer's stay, the longer the lead time.
3) adults, children and babies are correlated with each other. That means the more people, the higher the adr.
4) is_repeated_guest and previous_bookings_not_canceled are strongly correlated. Repeated guests may be less likely to cancel their bookings.
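
The first insight refers to a flag comparing the reserved and assigned room types. One way such a flag could be derived and correlated with cancellations is sketched below. This is an assumption: the column name `room_changed` is hypothetical, the flag is taken to be 1 when the assigned room differs from the reserved one, and the toy values are made up only to exercise the code.

```python
import pandas as pd

toy = pd.DataFrame({
    "hotel": ["City Hotel"] * 3 + ["Resort Hotel"] * 3,
    "reserved_room_type": ["A", "A", "D", "E", "A", "D"],
    "assigned_room_type": ["A", "D", "D", "F", "A", "D"],
    "is_canceled": [1, 0, 1, 0, 1, 0],
})

# Hypothetical flag: 1 when the assigned room differs from the reserved one
toy["room_changed"] = (
    toy["reserved_room_type"] != toy["assigned_room_type"]
).astype(int)

# numeric_only=True leaves the text columns out of the correlation matrix
corr = toy.corr(numeric_only=True)
room_vs_cancel = corr.loc["is_canceled", "room_changed"]
print(corr)
```

On this toy data the coefficient is negative simply because of how the rows were chosen; it says nothing about the real dataset.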

#### Chart - 15 - Pair Plot

In [None]:
# Pair Plot visualization code

numeric_columns = df1.select_dtypes(include=[np.number])

sns.pairplot(numeric_columns)
plt.suptitle('Pair Plot of Numeric Variables', y=1.02)
plt.show()

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

## **5. Solution to Business Objective**

#### What do you suggest the client to achieve Business Objective ?
Explain Briefly.

1. City Hotel is the most preferred, so stakeholders can offer discounts on Resort Hotel to increase its bookings.

2. Around 27.5% of bookings are cancelled (Chart 3), so the hotels can offer a loyalty discount to guests who do not cancel their booking.
3. The hotels can stock raw materials for BB meals in advance to avoid delays, as BB is the most preferred meal.
4. The hotels should increase the number of rooms in City Hotel to decrease the waiting time.
5. TA has the highest number of bookings among market segments, so the hotels could run offers to attract more bookings from other segments.
6. Room type A is the most preferred by guests, so the hotels should increase the number of type A rooms.
7. The number of repeated guests is low, which indicates that there is something guests do not like about the hotels; this needs to be fixed to increase the number of repeated guests.
8. The waiting period for City Hotel is high compared to Resort Hotel. That means City Hotel is much busier than Resort Hotel.
9. The optimal stay in both hotel types is less than 7 days. People usually stay for a week, so the hotels need to take some action to improve their performance.
10. The maximum number of guests were from Portugal.



# **Conclusion**

Recommendations:

Based on the analysis, the project recommends that hotels consider tailored strategies to address their unique challenges. Proactive measures, such as personalized promotions and direct booking incentives, can mitigate cancellations and foster customer satisfaction.

Conclusion:

In conclusion, the Hotel Booking Analysis EDA project provides actionable insights for hotels aiming to navigate the dynamic landscape of the hospitality industry. By leveraging data-driven strategies, hotels can optimize their operations, enhance customer experiences, and drive sustainable revenue growth.

### ***Hurrah! You have successfully completed your EDA Capstone Project !!!***