<a href="https://colab.research.google.com/github/saketvaibhav7114/EDA-on-Hotel-Booking-Analysis/blob/main/EDA_Capstone_Project.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name**    -



##### **Project Type**    - EDA
##### **Contribution**    - Individual
##### **Name -** Saket Vaibhav

# **Project Summary -**

This project aims to perform Exploratory Data Analysis (EDA) on a dataset containing hotel booking information to gain insights and understand patterns in customer behaviour, booking trends, and other relevant factors that impact the hospitality industry. By conducting a comprehensive EDA, we intend to extract meaningful information that can guide decision-making, improve customer experience, and optimise hotel management strategies.

---











**Objectives:**

**Data Cleaning and Preprocessing:-** The first step involves data cleaning and preprocessing to handle missing values, outliers, and inconsistencies in the dataset. This ensures the data's quality and reliability before proceeding with analysis.

**Exploratory Data Analysis (EDA):-** Through a variety of statistical and visual techniques, we will explore the dataset to gain insight into various trends & patterns in hotel booking.

**Customer Segmentation:-** Using the results of the analysis, we will attempt to segment customers into groups according to their booking behaviour, demographics, and preferences. This can help in tailoring marketing strategies and service offerings to specific customer segments.

# **GitHub Link -**

https://github.com/saketvaibhav7114/EDA-on-Hotel-Booking-Analysis

# **Problem Statement**


*   What are the seasonal booking patterns and trends?
*   Which types of rooms are most popular?
*   What is the average length of stay of guests?
*   Is there any correlation between customer demographics and booking behavior?
*   Is there any correlation between previous cancellation data & re-booking?
*   What are the cancellation rates?
*   What are the reasons for cancellation?
*   What are the customer preferences for booking channels (online vs. offline)?
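
As a rough illustration of how the cancellation-rate question can be approached, the sketch below uses a small made-up frame (`toy`, not the real data) with the same `hotel` and `is_canceled` column names as the dataset. Because `is_canceled` is coded 0/1, its mean is the share of cancelled bookings.

```python
import pandas as pd

# Toy stand-in for the bookings data (hypothetical values, real column names)
toy = pd.DataFrame({
    "hotel": ["City Hotel", "City Hotel", "Resort Hotel", "Resort Hotel", "City Hotel"],
    "is_canceled": [1, 0, 0, 0, 1],
})

# is_canceled is 0/1, so its mean is the fraction of cancelled bookings
overall_rate = toy["is_canceled"].mean() * 100

# The same calculation per hotel type
rate_by_hotel = toy.groupby("hotel")["is_canceled"].mean().mul(100).round(2)

print(f"Overall cancellation rate: {overall_rate:.2f}%")
print(rate_by_hotel)
```

The same two lines should carry over to `hotel_booking_df` once it is loaded, assuming the columns are unchanged.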

#### **Define Your Business Objective?**

Through this project, we anticipate discovering valuable insights that can benefit the hospitality industry in multiple ways:

*   Identifying peak booking periods to optimize pricing and resource allocation.
*   Understanding customer preferences to enhance personalized offerings.
*   Developing strategies to reduce booking cancellations and enhance customer loyalty.
*   Gaining a competitive advantage by leveraging data-driven decision-making.


# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure each chart follows the format below and answers the questions that come after it.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 20 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following the "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

Why do we import these libraries?

*   **pandas:-** for analysing and manipulating the dataset.
*   **NumPy:-** for working with numerical values.
*   **matplotlib:-** for data visualization and graphical plotting in Python.
*   **datetime:-** for manipulating date and time objects.
*   **seaborn:-** for plotting statistical graphs in Python.


In [None]:
# Import Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
from datetime import datetime
import warnings
warnings.filterwarnings('ignore')

### Dataset Loading

In [None]:
# Load Dataset
hotel_booking_df=pd.read_csv('/content/Hotel Bookings.csv')

### Dataset First View

In [None]:
# Dataset First Look
pd.set_option("display.max_columns",None)     # Display all the columns
hotel_booking_df.head()                       # Display top 5 rows

In [None]:
hotel_booking_df.tail()       # Display last 5 rows

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
hotel_booking_df.shape

### Dataset Information

In [None]:
# Dataset Info
hotel_booking_df.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
hotel_booking_df.duplicated().sum()

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
missing_values= hotel_booking_df.isnull().sum()
missing_values

In [None]:
# Visualizing the missing values

columns_with_missing_values = missing_values[missing_values > 0]      #  Filter columns with missing values

# Calculate the percentage of missing values in each column
total_rows = len(hotel_booking_df)
percentage_missing = (columns_with_missing_values / total_rows) * 100

# Create a bar chart
plt.figure(figsize=(10, 6))
bar_plot = columns_with_missing_values.plot(kind='bar', color='lightcoral')
plt.xlabel('Columns')
plt.ylabel('Number of Missing Values')
plt.title('Number of Missing Values in Columns')
plt.xticks(rotation=45, ha='center')
plt.grid(axis='y', linestyle='--', alpha=0.7)

# Display the percentage of missing values on top of each bar
for index, value in enumerate(columns_with_missing_values):
    # Use .iloc for positional access; the Series is indexed by column name, not by integer
    plt.text(index, value, f'{percentage_missing.iloc[index]:.2f}%', ha='center', va='bottom')

plt.show()

### What did you know about your dataset?

The dataset used for this analysis is a collection of hotel booking records, including customer demographics, booking dates, room types, duration of stay, booking cancellations, etc. It covers bookings for two hotel types (City Hotel and Resort Hotel) over multiple arrival years, with guests from many countries.

The dataset contains 119390 rows and 32 columns. It has duplicated rows, and some columns contain missing values, so both need to be treated.

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
hotel_booking_df.columns

In [None]:
# Dataset Describe
hotel_booking_df.describe().T

### Variables Description

1. hotel: Type of hotel (City Hotel or Resort Hotel)
2. is_canceled: Whether the booking was cancelled (1) or not (0)
3. lead_time: Number of days between the booking date and the arrival date
4. arrival_date_year: Year of arrival date
5. arrival_date_month: Month of arrival date
6. arrival_date_week_number: Week number of the year of arrival date
7. arrival_date_day_of_month: Day of the month of arrival date
8. stays_in_weekend_nights: Number of weekend nights (Saturday & Sunday) the guest stayed or booked to stay at the hotel
9. stays_in_week_nights: Number of weeknights (Monday to Friday) the guest stayed or booked to stay at the hotel
10. adults: Number of adults among guests
11. children: Number of children among guests
12. babies: Number of babies among guests
13. meal: Meal option booked by the guest
14. country: Country code (guest's country of origin)
15. market_segment: Market segment the booking belongs to
16. distribution_channel: Booking distribution channel. 'TA' means 'Travel Agents' and 'TO' means 'Tour Operators'
17. is_repeated_guest: Whether the booking is from a repeated guest (1) or not (0)
18. previous_cancellations: Number of previous bookings cancelled by the customer prior to the current booking
19. previous_bookings_not_canceled: Number of previous bookings not cancelled by the customer prior to the current booking
20. reserved_room_type: Code of reserved room type
21. assigned_room_type: Code of assigned room type
22. booking_changes: Number of changes made to the booking
23. deposit_type: Type of deposit made by the guest
24. agent: ID of the travel agency that made the booking
25. company: ID of the company that made the booking
26. days_in_waiting_list: Number of days the booking was on the waiting list
27. customer_type: Type of customer, assuming one of four categories
28. adr: Average daily rate
29. required_car_parking_spaces: Number of car parking spaces required by the customer
30. total_of_special_requests: Number of special requests made by the customer
31. reservation_status: Status of reservation
32. reservation_status_date: The date on which the reservation status was last updated


### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
for column in hotel_booking_df.columns:
    print('\033[1m {}:\033[0m {}'.format(column, hotel_booking_df[column].unique()[:12]))

In [None]:
# Check count of Unique Values for each variable.
for column in hotel_booking_df.columns:
    print('\033[1m {}:\033[0m {}'.format(column, hotel_booking_df[column].nunique()))

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Dropping the company column since more than 90% of its values are missing
hotel_booking_df = hotel_booking_df.drop(columns='company')

In [None]:
# Filling the Missing value
hotel_booking_df[['agent','children']]= hotel_booking_df[['agent','children']].fillna(0)
hotel_booking_df['country'] = hotel_booking_df['country'].fillna('other')

In [None]:
# Checking if all the null values are removed or not
hotel_booking_df.isnull().sum()

In [None]:
hotel_booking_df.shape

In [None]:
# Checking rows having duplicate values
hotel_booking_df.duplicated().sum()

In [None]:
# Dropping duplicated rows
hotel_booking_df= hotel_booking_df.drop_duplicates(keep='first')

In [None]:
# Mapping the 'arrival_date_month' columns with its numerical equivalent
month_map = {
    'January': 1, 'February': 2, 'March': 3, 'April': 4, 'May': 5, 'June': 6,
    'July': 7, 'August': 8, 'September': 9, 'October': 10, 'November': 11, 'December': 12
}

hotel_booking_df['arrival_date_month'] = hotel_booking_df['arrival_date_month'].map(month_map)


In [None]:
# Custom function to merge the year, month and day columns into a single 'year-month-day' string
def merge_columns(row):
    return '-'.join([str(row['arrival_date_year']), str(row['arrival_date_month']), str(row['arrival_date_day_of_month'])])

# Apply the custom function to merge the columns
hotel_booking_df['arrival_date'] = hotel_booking_df.apply(merge_columns, axis=1)

# Drop the day-of-month column, which is no longer required (the year and month columns are kept)
hotel_booking_df.drop(columns=['arrival_date_day_of_month'], inplace=True)

In [None]:
# Changing datatype of 'reservation_status_date' from object to datetime
hotel_booking_df['reservation_status_date'] = pd.to_datetime(hotel_booking_df['reservation_status_date'],format = '%Y-%m-%d')

# Changing datatype of 'arrival_date' from object to datetime
hotel_booking_df['arrival_date'] = pd.to_datetime(hotel_booking_df['arrival_date'],format = '%Y-%m-%d')

# Convert 'children' column to int data type
hotel_booking_df['children'] = hotel_booking_df['children'].astype(int)

In [None]:
# Creating a new column 'total_stay' by adding 'stays_in_weekend_nights' and 'stays_in_week_nights'
hotel_booking_df['total_stay'] = hotel_booking_df['stays_in_weekend_nights']+hotel_booking_df['stays_in_week_nights']


# Adding a 'total_guests' column: number of adults + number of children + number of babies
hotel_booking_df['total_guests'] = hotel_booking_df['adults'] + hotel_booking_df['children']+hotel_booking_df['babies']

# Now dropping the individual columns
hotel_booking_df.drop(columns=['stays_in_weekend_nights', 'stays_in_week_nights','adults','children','babies'], inplace=True)

In [None]:
hotel_booking_df.info()

### What all manipulations have you done and insights you found?



1. The company column is missing more than 90% of its values, so it is not feasible to keep it. We removed this column.
2. The children, country, and agent columns also had some missing values.
    * For the children and agent columns, we filled the missing values with 0.
    * For the country column, we filled the missing values with 'other'.
3. After handling the missing values, we checked for duplicate rows and found 32001 of them. We dropped these rows, keeping only the first occurrence of each.
4. After treating the duplicated rows, we did some feature engineering to modify our dataset.
    * The arrival year, arrival month, and arrival day-of-month columns are combined into a single arrival_date column. Before that, we converted the month names into their numeric equivalents.
    * Similarly, the weekend-night and week-night stay columns are combined into a total_stay column.
    * The adults, children, and babies columns are combined into a total_guests column.
5. Finally, we corrected the datatype of a few columns.
    * reservation_status_date column from object to datetime datatype.
    * arrival_date column from object to datetime datatype.
    * children column from float to int datatype.
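
The date-merging step described above can also be done without a row-wise `apply`. The following is a minimal sketch on a made-up frame (`toy`), assuming the three arrival columns keep their original names; it is an alternative for illustration, not the code this notebook ran. `pd.to_datetime` can assemble dates directly from a frame whose columns are named `year`, `month` and `day`.

```python
import pandas as pd

# Toy stand-in with the three original arrival columns (hypothetical values)
toy = pd.DataFrame({
    "arrival_date_year": [2015, 2016],
    "arrival_date_month": ["July", "February"],
    "arrival_date_day_of_month": [1, 29],
})

# Map month names to month numbers, as the notebook does with month_map
month_names = ["January", "February", "March", "April", "May", "June",
               "July", "August", "September", "October", "November", "December"]
month_map = {name: number for number, name in enumerate(month_names, start=1)}

# pd.to_datetime builds dates from columns named year / month / day
parts = pd.DataFrame({
    "year": toy["arrival_date_year"],
    "month": toy["arrival_date_month"].map(month_map),
    "day": toy["arrival_date_day_of_month"],
})
toy["arrival_date"] = pd.to_datetime(parts)

print(toy[["arrival_date"]])
```

This produces a datetime column in one step, so the later object-to-datetime conversion of the arrival date would not be needed with this approach.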









## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1:- Comparison between City Hotel & Resort Hotel

In [None]:
# Get the counts of each hotel type
hotel_counts = hotel_booking_df['hotel'].value_counts()

# Plot the pie chart
plt.figure(figsize=(8, 6))
plt.pie(hotel_counts, labels=hotel_counts.index, autopct='%1.2f%%', startangle=45, explode=(0, 0.05), colors=['skyblue', 'lightgreen'],textprops={'fontsize': 13})
plt.title('Hotel Type Distribution')
plt.axis('equal')  # Equal aspect ratio ensures the pie chart is circular.
plt.legend(title='Hotel Type', loc='upper right')  # Legend entries come from the pie labels, so they always match the slices
plt.show()

##### 1. Why did you pick the specific chart?

Pie charts are effective for comparing the parts (individual categories) to the whole (total count of hotels). Each slice of the pie represents a proportion of the whole, allowing us to visualize the relative frequencies of 'City Hotel' and 'Resort Hotel.'

##### 2. What is/are the insight(s) found from the chart?

1) The pie chart shows that 'City Hotel' accounts for 61.1% of the total bookings, while 'Resort Hotel' accounts for 38.9%. City Hotel bookings are therefore more common than Resort Hotel bookings.

2) City hotels are in higher demand. The chart alone does not explain why; possible reasons include better customer service or more advertising to attract guests. Guests have shown less interest in the resort hotel, which suggests it could improve its customer service, meal options, and advertising. There is scope for growth for resort hotels if they update their services and provide vacation offers.
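
The percentages shown on the pie chart can also be checked numerically. This is a small sketch on a made-up series (`toy_hotels`), assuming only the `hotel` column name; `value_counts(normalize=True)` returns each category's share of the total.

```python
import pandas as pd

# Toy stand-in for the 'hotel' column (hypothetical values)
toy_hotels = pd.Series(["City Hotel"] * 3 + ["Resort Hotel"] * 2, name="hotel")

# normalize=True returns fractions of the total; convert to percentages
share = toy_hotels.value_counts(normalize=True).mul(100).round(1)

print(share)
```

Applied to `hotel_booking_df['hotel']`, the same call should give the figures that `autopct` prints on the chart.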

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

1) Insights from the chart can influence business decisions, such as resource allocation, marketing efforts, or expansion plans based on the popularity of each hotel type.

2) Negative growth in Resort Hotel bookings could result from changes in customer preferences and travel behaviour that shift demand toward City Hotels.


#### Chart - 2: Room Type Preference

In [None]:
# Count the number of bookings for each assigned room type
room_type_counts = hotel_booking_df['assigned_room_type'].value_counts()

# Plot the bar chart
plt.figure(figsize=(14, 8))

colors = ['skyblue', 'lightgreen','black', 'red', 'lightcoral', 'gold']
bars=plt.bar(room_type_counts.index, room_type_counts.values,color=colors)
plt.xlabel('Room Type',fontsize=12)
plt.ylabel('Number of Bookings',fontsize=15)
plt.title('Booking Preference by Room Types',fontsize=12)
plt.xticks(rotation=0, ha='center',fontsize=12)   # Centre the unrotated labels under their bars
plt.tick_params(axis='y', labelsize=10)
# plt.tight_layout()  # To prevent clipping of labels


# Add the value on top of each bar
for bar in bars:
    height = bar.get_height()
    plt.annotate(f'{height}', xy=(bar.get_x() + bar.get_width() / 2, height), xytext=(0, 3),
                 textcoords="offset points", ha='center', va='bottom', fontsize=10, color='black')
plt.show()

##### 1. Why did you pick the specific chart?

Bar charts work well when there are a limited number of distinct categories. In this case, each room type is a separate category, making it easy to compare the number of bookings for each room type.

##### 2. What is/are the insight(s) found from the chart?

The bar chart shows the number of bookings for each room type, which indicates the most popular room types among customers. Type 'A' rooms are the most preferred among guests, while Type 'L' rooms are the least preferred.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The plot can help determine the demand for each room type, allowing hotel management to allocate resources and pricing strategies accordingly. Also, observing trends in room type bookings over time might reveal changes in customer preferences, such as a shift towards more luxurious or budget-friendly options.

If there are significant pricing differences between room types within the same hotel, customers might opt for more affordable options, leading to negative growth in bookings for higher-priced rooms. Also, negative publicity, customer reviews, or incidents of poor customer experience can damage a hotel's reputation, leading to a decrease in bookings for a particular room type.
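
One way to look at room type bookings over time, as suggested above, is a year-by-room-type table. The sketch below uses a made-up frame (`toy`) and assumes the `arrival_date_year` and `assigned_room_type` columns are still present after wrangling.

```python
import pandas as pd

# Toy stand-in for the bookings data (hypothetical values, real column names)
toy = pd.DataFrame({
    "arrival_date_year": [2015, 2015, 2016, 2016, 2016],
    "assigned_room_type": ["A", "D", "A", "A", "D"],
})

# Number of bookings per year and room type
counts = pd.crosstab(toy["arrival_date_year"], toy["assigned_room_type"])

# Share of each room type within a year, in percent (each row sums to about 100)
share = pd.crosstab(
    toy["arrival_date_year"], toy["assigned_room_type"], normalize="index"
).mul(100).round(1)

print(counts)
print(share)
```

Comparing the rows of `share` across years would show whether preferences are shifting between room types.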

#### Chart - 3

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 4

In [None]:
# Chart - 4 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 5

In [None]:
# Chart - 5 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 6

In [None]:
# Chart - 6 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 7

In [None]:
# Chart - 7 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 8

In [None]:
# Chart - 8 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 9

In [None]:
# Chart - 9 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 10

In [None]:
# Chart - 10 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 11

In [None]:
# Chart - 11 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 12

In [None]:
# Chart - 12 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 13

In [None]:
# Chart - 13 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 14 - Correlation Heatmap

In [None]:
# Correlation Heatmap visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

#### Chart - 15 - Pair Plot

In [None]:
# Pair Plot visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

## **5. Solution to Business Objective**

#### What do you suggest the client to achieve Business Objective ?
Explain Briefly.

Answer Here.

# **Conclusion**

Write the conclusion here.

### ***Hurrah! You have successfully completed your EDA Capstone Project !!!***