<a href="https://colab.research.google.com/github/vignesh312000/eda/blob/main/EDA_my_project.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

##### **Project Type**    - EDA
##### **Contribution**    - Individual

# **Project Summary -**

My EDA is on a dataset containing information about hotel bookings. This analysis aimed to gain insights into booking trends, cancellation patterns, and various factors that influence hotel reservations.I started by examining the dataset, which included 35 columns (after data wrangling) providing information on booking details, customer demographics, and hotel-related factors.
The dataset encompassed diverse features, such as booking date, guest demographics, room types, and booking status (canceled or not canceled).To understand booking trends over time, I plotted the number of bookings by year and month. This allowed me to identify peak booking periods and seasonal variations.
Additionally, I examined the distribution of bookings across different days of the week.A significant part of my analysis focused on cancellations. I calculated the cancellation rate and explored how it varied based on different factors.
Using bar plots , I visualized the distribution of canceled and non-canceled bookings.
Using Box plot,I also investigated how lead time, booking changes, and other factors correlated with booking cancellations.Customer segmentation played a crucial role in the analysis. I explored how customer type, meal preferences, and other categorical features influenced booking behaviors.

Grouped bar plots and other visualizations helped identify patterns among different customer segments.
My analysis extended to operational aspects, such as the number of days in the waiting list. These insights can assist in optimizing hotel operations.Finally, I explored the dataset's potential for predictive modeling. I identified features that might be valuable for predicting booking cancellations or other business-related outcomes.

# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

### Dataset Loading

In [None]:
# mounting Gdrive
from google.colab import drive
drive.mount('/content/drive')


In [None]:
#the dataset of hotel bookings is initialized to the variable named hotel_data.
hotel_data =pd.read_csv('/content/drive/MyDrive/Colab Notebooks/colab_csv/Hotel Bookings.csv')

### Dataset First View

In [None]:
# Dataset First Look
hotel_data.head()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
hotel_data.shape

So in here we have got 119390 rows and 32 columns of dataset involved with hotel_data







### Dataset Information

In [None]:
# Dataset Info
hotel_data.info()

In here we are looking into the non-null values of the each and every columns(features)of the dataset we are working with.
And also the datatype of the columns.

In [None]:
#desribing the dataset
hotel_data.describe()

'''*In* here we will seeing the non_null and also the null type values excluding the values which are represented in the string format.'''  

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count'
count_of_dup=len(hotel_data[hotel_data.duplicated()])
print(count_of_dup)

Here the duplicate values ie.the values which acts as clone will be resulted in FALSE so the values which are returning false will be taken in account and reflected in the output result.It results in 31994 duplicated values.

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
hotel_data.isnull().sum().sort_values(ascending=False)

*It show the sum of all null values of individual columns so there are nearly 3 to 4 columns holding null values.*

In [None]:
# Visualizing the missing values
plt.title("Missing values")
sns.heatmap(hotel_data.isnull(), cmap='cividis',cbar=False)

*Heatmaps are the better way to look into the missing values because they reflect the nature of each cell using colors.However,Heatmaps visualize the data in a very good manner.*

### What did you know about your dataset?

*So far,the hotel booking dataset is containing the data of the hotel and their characteristis and features which is a huge data of 119390 rows and 32 columns.*
*So by viewing the dataset as a Data Analyst the role is to provide deep insights and also the predictons for the marketing strategies.*
*About the description,the dataset is conceived with minimal amount of null and missing values.*

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
hotel_data.columns.tolist()

In [None]:
# Dataset Describe
hotel_data.describe(include='all')

*In here the (include='all') parameter include bothe numeric and nobn-numeric values and shows their statistical representation*.
*ie.the index values represents the ideology of the statictics function*.

### Variables Description

•	**Hotel-**H1= Resort Hotel;
H2=City Hotel

•	**Is_cancelled-**If the booking was cancelled(1) or
not(0)

•	**Lead_time-**Number of days that elapsed between
                the entering date of the booking into the
                PMS and the arrival date

•	**Arrival_date_year**- Year of arrival date

•	**Arrival_date_month-**Month of arrival date

•	**Arrival_date_week_number**-Week number for arrival date

•	**Arrival_date_day**-Day of arrival date

•	**Stays_in_weekend_nights**-Number of weekend nights (Saturday or Sunday) the
                              guest stayed or booked to stay at the hotel


•	**Stays_in_week_nights-**Number of week nights (Monday to Friday) the guest
                         stayed or booked to stay at the hotel

•	**Adults** -Number of adults

• **Children**- Number of children

• **Babies**-Number of babies

• **Meal**-Kind of meal opted for


• **Country**-Country code

•	**Distribution _channel**-How the customer accessed the stay-
corporate booking/Direct/TA.TO

•	**Is_repeated_guest**-Guest coming for first time or not

•	**Previous_cancellation**-Was there a cancellation before

•	**Previous_bookings**-Count of previous bookings

•	**Reserved_room_type**-Type of room reserved

• **Assigned_room_type**-Type of room assigned

• **Booking_changes**-Count of changes made to booking

• **Deposit_type**-Deposit type

•	**Agent**-Booked through agent

•	**Days_in_waiting_list**-Number of days in waiting list

•	**Customer_type**-Type of customer

•	**Required_car_parking**-If car parking is required

•	**Total_of_special_req-**Number of additional special
	                     requirements

•	**Reservation_status**-Reservation of status

•	**Reservation_status_date-**Date of the specific status















The above shown format is the variable and its description on which it works.

### Check Unique Values for each variable.

This is because this ensure that the original data is unaffected with the wrangling and the modification we are going to do.

In [None]:
# Check Unique Values for each variable.
column_names=hotel_data.columns.tolist()
#iterating through every element in the list ie.the columns names to extract the unique elements.
for i in column_names:
  unique_values=hotel_data[i].unique()
  print(i.upper(),unique_values)
  print('\n')


*Iterating through every element in the list ie.the columns names to extract the unique elements.*

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
#making a copy of dataset hotel_data.
data=hotel_data.copy()


# Write your code to make your dataset analysis ready.
data[data.duplicated()].shape                           #Represent how many rows are infused with the duiplicate  values.

#dropping
data.drop_duplicates(inplace=True)                            #So the duplicates should be dropped from the data set,to prevent bias in analysis.
data.shape                                                    #After the drop , the dataset is shaped to 87396 rows and 32 columns





#missing_values
#the null values should be identified and replaced with the "mean" of the related values.
data.isnull().sum().sort_values(ascending=False)              #this shows the sum of null values arranged in the format of high to low.
#this missing values are because the customer haven't used these features.So we need to handle these cases by replacing with appropriate values
'''Here it shows the null value are with 4 columns

*Company 112593

Agent 16340

Country 488

Children 4*'''




#1st the columns with higher missing values to b solved because we can't replace it with the mean because of huge missings.
#as these columns are consider as "int type" so we are going to replace those values with numeric value of 0.
#filling the company and agent with 0.
data[['company','agent']]=data[['company','agent']].fillna(0)
data[['company','agent']]




data['country']=data['country'].fillna('not_mentionedd')
#we have filled the null values of country with "not_mentioned" keyword.
data['country']




mean_child=data['children'].mean
data['children']=data['children'].fillna(mean_child)
#in here we have replace the null values of the children column with the mean values of the same column.
#because of the rate of missing value in children column is low compared to other columns.
data['children']



#replacing the column name
#the 2nd column which contains the values 0 and 1 replacing it with no and yes respectively.
# data ['is_canceled'] = data['is_canceled'].replace({0: 'No', 1: 'Yes'})



#now cchecking for null values.
data.isnull().sum().sort_values(ascending=False)
#so the null values have been optimised.



###############################################################################
'''The below procees is going to rename name the column holding the values of year,month and the day and the they will will be combine into single column so the

   storage of data will be optimised aswell and we can seee the customer checked in status which makes the data analysis process much more better'''

# # Create a dictionary to map month names to numerical values
month_map = {
    'January': '01', 'February': '02', 'March': '03', 'April': '04', 'May': '05', 'June': '06',
    'July': '07', 'August': '08', 'September': '09', 'October': '10', 'November': '11', 'December':'12'
}
# Map month names to numerical values as strings
data['arrival_date_month'] = data['arrival_date_month'].map(month_map)

# Rename columns to match [year, month, day] order
data.rename(columns={
    'arrival_date_year': 'year',
    'arrival_date_month': 'month',
    'arrival_date_day_of_month': 'day'
}, inplace=True)
# Rearrange the order of columns for 'pd.to_datetime()' (year, month, day)
data['dates'] = pd.to_datetime(data[['year', 'month', 'day']])

# col_to_drop=['year', 'month', 'day']
# data=data.drop(columns=col_to_drop,axis=1)                                      #here the 3 columns represent the YEAR,MONTH,DAY are dropped off.


###############################################################################
values=data['dates']
# Specify the position where you want to insert the new column (index starts at 0)
position = 7  # Inserting the new column after the 4th column

# Insert the new column with the default value at the specified position
data.insert(position, 'checked_in_status',values)

data=data.drop(data.columns[-1],axis=1)
# Now, 'date' column contains the combined dates
###############################################################################

#the no.of stays are calculateed into single column.
total_nyts=data['stays_in_weekend_nights']+data['stays_in_week_nights']
data.insert(8,'total_nights',total_nyts)

#creating new column for no.of member in a stay.
total_members=data[['adults','children','babies']].sum(axis=1)
data.insert(11,'total_members',total_members)


data.head()

In [None]:
data.columns.tolist()

### What all manipulations have you done and insights you found?

In the segment of Data Wrangling,we have made a copy of the dataset in order to not to affect the original dataset.The duplicates are modified by dropping.
Then the null and missing values were replaced with apropriate values of dtype.

And also I have created a new columns which may help much more better in working on this project.The procees renamed name the column holding the values of year,month and the day and the they will will be combine into single column so the storage of data will be optimised aswell and we can seee the customer checked in status which makes the data analysis process much more better.

So very well the data is optimised.


I have analyzed the dataset to identify the times of the year when hotel bookings are at their peak and when they are at their lowest.
I'd look into the data to determine which days of the week are the most popular among guests for making hotel reservations.
Booking Lead Time:
I'd examine the data to understand how far in advance guests typically make their hotel reservations.

I have examined the dataset to understand the distribution of bookings across various channels, such as online travel agencies, direct bookings, and phone reservations.
I'd assess the dataset to determine which booking channels generate the highest revenue and evaluate their profitability.


I have computed the cancellation rates by analyzing the dataset to determine the percentage of bookings that have been canceled.
I'd delve into the data to find common reasons for booking cancellations and understand the factors driving these cancellations.

These are the insights from my side which will be reflected in the visualizations of this dataset.

## ***4. Data Vizualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1--Booking Distribution over years(Bivariate distribution)


In [None]:
# Chart - 1 visualization code
# Assuming your dataset has a 'year' column
yearly_distribution = data.groupby('year')['month'].value_counts().unstack(fill_value=0)
yearly_distribution_normalized = yearly_distribution.div(yearly_distribution.sum(axis=1), axis=0)
plt.figure(figsize=(12, 6))
sns.heatmap(yearly_distribution_normalized, cmap='Blues', annot=True, fmt='.2f',square=True)

plt.title('Booking Distribution Heatmap')
plt.xlabel('Month')
plt.ylabel('Year')
xticks=(range(0, 12), ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec'])
plt.show()

##### 1. Why did you pick the specific chart?

A heatmap can be used to visualize the distribution of bookings across months and years. Darker colors can represent higher proportions.Data distributed across two dimensions: months and years. Heatmaps are excellent for visualizing such two-dimensional data because they allow to represent values as colors in a grid. Each cell in the heatmap corresponds to a combination of a month and a year, making it suitable for this kind of data.Heatmaps make it easy to compare values across months and years. You can quickly spot patterns and trends, such as seasonality or changes in booking behavior over time. The annotations in the cells provide precise values, aiding in quantitative analysis.

##### 2. What is/are the insight(s) found from the chart?

Heatmaps can reveal seasonal patterns in the booking data. Look for rows (years) where certain months consistently have darker colors, indicating higher booking proportions. This could suggest that certain times of the year are more popular for bookings. The darkest colors consistently across all years could be the peak booking months, and you may want to focus marketing efforts or allocate resources accordingly during those times. The months with consistently lighter colors may represent low-booking periods, and might consider strategies to boost bookings during these times.

Certain months or years are entirely blank, it could indicate missing data or a lack of bookings during those periods.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive Growth:

Discovering the peak booking months can be highly beneficial. It allows businesses to allocate resources, staff, and marketing efforts more effectively during these high-demand periods, potentially increasing revenue.

Understanding seasonal trends can help businesses plan targeted marketing campaigns and promotions during off-peak months to stimulate bookings and counteract low-season trends.

Insights about booking behavior can inform customer engagement strategies. For example, businesses can create loyalty programs, special offers, or events during historically slow months to keep customers engaged.

Negative Growth:

Overreacting to seasonal trends can result in resource imbalances. For example, hiring too many staff during peak months can lead to increased labor costs during slow periods, negatively impacting profitability.

Assuming that historical trends will continue indefinitely can be risky. If external factors (e.g., economic changes, competition) shift, a business that doesn't adapt its strategy may face negative growth during previously strong months.

#### Chart - 2--Visualizing the distribution of bookings between different hotel types and  the count of bookings for each room type.

In [None]:
# Chart - 2 visualization code
booking_counts = data['hotel'].value_counts()

# Create a bar plot
plt.figure(figsize=(8, 6))
plt.bar(booking_counts.index, booking_counts.values, color=['blue', 'green'])  # Specify colors for each hotel type
plt.xlabel('Hotel Type')
plt.ylabel('Number of Bookings')
plt.title('Distribution of Bookings by Hotel Type')
plt.show()

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

*italicized text*#### Chart - 3--The distribution of the Average Daily Rate (ADR) and the count of bookings for each room type

In [None]:
# Chart - 3 visualization code
fig, axes = plt.subplots(1, 2, figsize=(14,7))

# Subplot 1: Count of Bookings by Room Type
sns.countplot(ax=axes[0], x='assigned_room_type', data=data, palette='Set1')
axes[0].set_xlabel('Assigned Room Type')
axes[0].set_ylabel('Number of Bookings')
axes[0].set_title('Count of Bookings by Room Type')
axes[0].tick_params(axis='x')

# Subplot 2: ADR Distribution by Room Type
sns.boxplot(ax=axes[1], x='assigned_room_type', y='adr', data=data, palette='Set2')
axes[1].set_xlabel('Assigned Room Type')
axes[1].set_ylabel('ADR (Average Daily Rate)')
axes[1].set_title('ADR Distribution by Room Type')
axes[1].tick_params(axis='x')

plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart?

Countplot:

I used a countplot (bar chart) to visualize the count of bookings for each room type.
Countplots are effective for showing the distribution of categorical data, in this case, the distribution of room types.
It helps answer questions like "Which room type is most in demand?"

Boxplot:

I used a boxplot to visualize the distribution of the Average Daily Rate (ADR) for each room type.They are suitable for displaying the distribution, spread, and outliers of numerical data within different categories.
It helps answer questions like "Which room type tends to have the highest ADR, and how variable are the ADR values within each room type?"

##### 2. What is/are the insight(s) found from the chart?



The first subplot, "Count of Bookings by Room Type," shows the distribution of bookings across different room types.
From this subplot, you can observe which room types are more popular or in higher demand based on the number of bookings.
Insight: It appears that certain room types have a significantly higher number of bookings compared to others. This information can be valuable for hotel management to understand which room types are preferred by guests.


The second subplot, "ADR Distribution by Room Type," displays boxplots that represent the distribution of Average Daily Rate (ADR) for each room type.
Boxplots provide insights into the spread and central tendency of ADR values for each room type.
Insight: Some room types have a wider range of ADR values, indicating variability in pricing. Additionally, the position of the boxplots' medians provides information about the central tendency of ADR for each room type. This insight can be used to assess pricing strategies and identify potential opportunities to optimize ADR.Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive growth:


If the analysis shows that a particular room type is in high demand, the hotel management can allocate more resources and marketing efforts to promote that room type. This could lead to increased revenue and positive growth.


Identifying the room type with the highest ADR allows the hotel to set pricing strategies. If guests are willing to pay more for a specific room type, the hotel can adjust prices accordingly, potentially increasing revenue and profitability.


Negative Growth:
  
  
  Paradoxically, if the most in-demand room type is also the one with the lowest ADR, the hotel might be missing out on revenue. This could happen if the hotel is not optimizing its pricing strategy for the high-demand room type.

  Paradoxically, if the most in-demand room type is also the one with the lowest ADR, the hotel might be missing out on revenue. This could happen if the hotel is not optimizing its pricing strategy for the high-demand room type.

#### Chart - 4--Time series plot with individual days of the month.Comparing trends over time.

In [None]:
# Chart - 4 visualization code
grouped_data = data.groupby(['month', 'day'])['days_in_waiting_list'].mean()

# Create a time series plot with individual days of the month
plt.figure(figsize=(12, 6))  # Adjust figure size if needed
grouped_data.plot()

# Add labels and title
plt.xlabel('Day of the Month')
plt.ylabel('Mean Days in Waiting List')
plt.title('Mean Days in Waiting List Over Time by Day of the Month')

# Display the plot

plt.grid(True)  # Add grid lines if desired
plt.xticks(rotation=45)  # Rotate x-axis labels for readability
plt.tight_layout()  # Ensure labels are not cut off
plt.show()


##### 1. Why did you pick the specific chart?

The data involves time-based information, specifically mean waiting times, which can change over time. Time series plots are particularly well-suited for visualizing such data because they emphasize the chronological order of observations.The data can be grouped by different months, creating multiple categories. A line plot allows you to represent these categories (months) with separate lines, making it easy to compare trends over time.

Line plots facilitate comparisons between different months and enable the identification of patterns, seasonality, and trends over time.



##### 2. What is/are the insight(s) found from the chart?

This can help you identify specific days of the month that consistently have higher or lower waiting times.

might notice that waiting times tend to be higher during certain months, possibly due to seasonal factors or events.

Any sudden spikes or dips in the mean waiting times for specific days or months may indicate outliers or unusual events. Investigating these anomalies can help you understand their causes.

 Seasonal variations can provide insights into the impact of holidays or specific seasons on waiting times.Understanding waiting time variations allows you to consider the customer experience. Lower waiting times on specific days can lead to improved customer satisfaction.



##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive Growth:

Insights that lead to better management of peak demand periods and reductions in waiting times can enhance the customer experience. This improvement can result in increased customer satisfaction, positive word-of-mouth, and repeat business.

Identifying patterns and allocating resources more effectively based on insights can lead to cost savings and operational efficiencies. For example,we can adjust staffing levels to meet demand fluctuations more efficiently.

Long-term trend analysis can inform strategic decisions. For instance, recognizing a consistent decrease in waiting times may indicate successful operational improvements, contributing to positive business growth.

Insights into seasonal trends can inform marketing strategies and pricing decisions. Businesses can capitalize on peak seasons by adjusting marketing campaigns and pricing structures.

Negative Growth:

Failure to address high waiting times during peak periods can result in lost revenue opportunities. Customers may choose competitors with shorter wait times, leading to revenue decline.

If insights are not acted upon and waiting times remain consistently high or exhibit undesirable patterns, it can result in dissatisfied customers. This can lead to negative online reviews, decreased customer loyalty, and harm to the brand's reputation.



#### Chart - 5--Calculating the percentage of unique IDs

In [None]:
# Chart - 5 visualization code

# Group the data by 'distribution_channel' and calculate the percentage of unique IDs
group_by_dc = data.groupby('distribution_channel')
df = pd.DataFrame(((group_by_dc.size() / data.shape[0]) * 100)).reset_index().rename(columns={0: 'Booking_%'})

# Create a horizontal bar chart
plt.figure(figsize=(8, 6))
plt.barh(df['distribution_channel'], df['Booking_%'], color='skyblue')
plt.xlabel('Booking Percentage (%)')
plt.ylabel('Distribution Channel')
plt.title('Distribution of Booking Percentage by Channel')
plt.show()


##### 1. Why did you pick the specific chart?

The "distribution_channel" column contains categorical data, which makes a pie chart a suitable choice for showing the relative proportions of different categories.


 Horizontal bar chart is effective when you need to compare categories (in this case, distribution channels) because it allows for a straightforward visual comparison of values.

 Horizontal bar charts are often easier to read than pie charts, especially when you have multiple categories or want to show precise values. In a pie chart, it can be challenging to compare the sizes of slices accurately.



##### 2. What is/are the insight(s) found from the chart?



Can easily see which distribution channels are the most popular or frequently used for bookings. The higher the percentage of bookings through that channel.

Can quickly see if one channel dominates the others or if bookings are spread relatively evenly among multiple channels. If one or more categories are significantly larger than the others, it indicates that certain distribution channels play a dominant role in generating bookings. This information can be valuable for marketing and business strategies.



##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive growth:

Insights into distribution channel performance can help tailor marketing strategies. Online travel agencies (OTAs) are driving a large portion of bookings, the business can invest in optimizing its presence on these platforms.


Understanding the cost-effectiveness of different channels is crucial. If certain channels have a lower cost per booking, the business can focus on maximizing returns from those channels, which can improve profitability.


Negative Growth:

Overdependence on a Single Channel: If the chart shows that the majority of bookings come from a single channel, this can be risky. Any disruptions or changes in that channel (e.g., policy changes by an OTA) can significantly impact the business. Overreliance on one channel can lead to vulnerability.



#### Chart - 6--visualizing the customer types and their contribution.

In [None]:
# Chart - 6 visualization code

group_by_dc = data.groupby('customer_type')


df= pd.DataFrame(((group_by_dc.size() / data.shape[0])*100)).reset_index().rename(columns={0: 'Booking_%'})
plt.figure(figsize=(8, 8))
d1= df['Booking_%']
labels = df['customer_type']
explode = [0.05] * len(labels)  # Adjust the amount of explosion as needed

# Create the pie chart
plt.pie(x=d1, autopct="%.2f%%", explode=explode,labels=labels, pctdistance=0.75)  # Adjust pctdistance as needed

plt.title("contribution of customer types", fontsize=14)
plt.axis('equal')  # Ensure the pie chart is circular

plt.show()

##### 1. Why did you pick the specific chart?

Pie charts are commonly used when you want to show the composition of a whole in relation to its parts.

Pie charts are effective for displaying data as percentages of a whole. Each slice of the pie represents a percentage, and the whole pie represents 100%.

Pie charts can be visually appealing and are often used in reports, presentations, and dashboards to provide a quick overview of data distribution.

##### 2. What is/are the insight(s) found from the chart?

The most significant insight is that the 'Transient' customer type represents the majority of bookings, accounting for approximately 82.37% of all bookings. This indicates that a large portion of the hotel's customers fall into the 'Transient' category.

The 'Contract' and 'Transient-Party' customer types also contribute to bookings, but to a much lesser extent. 'Contract' accounts for about 3.59% of bookings, while 'Transient-Party' accounts for around 13.42%. These two categories.

The chart highlights an imbalance in the distribution of booking types. The 'Transient' category dominates the bookings, while the other categories are relatively small in comparison.

Given that 'Transient' customers are the primary source of bookings, the hotel should continue to cater to their needs and provide exceptional service to maintain a high booking rate from this segment.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive Growth:


The insight that 'Transient' customers represent the majority of bookings suggests that catering to the needs and preferences of this segment can have a positive impact. This could include tailoring marketing strategies, improving customer service, and optimizing pricing to attract and retain transient customers.

The hotel can work on strategies to attract more 'Contract' and 'Group' customers, such as offering special packages or discounts for group bookings or establishing partnerships with organizations for contract bookings.

Negative Growth:

While 'Transient' bookings are a significant source of revenue, overreliance on a single customer segment can be risky. Economic fluctuations, changes in customer preferences, or external factors (e.g., travel restrictions) can impact the volume of transient bookings. If the hotel doesn't diversify its customer base, it may be vulnerable to such fluctuations.

The hotel's revenue may be sensitive to changes in the 'Transient' market. Economic downturns or shifts in travel trends can impact transient bookings, potentially leading to periods of reduced revenue.

#### Chart - 7--Univariate distribution to visualize the distribution of bookings across months(histogram).

In [None]:
# Chart - 7 visualization code
# Create a histogram to visualize the distribution of bookings across months
plt.figure(figsize=(10, 6))  # Set the figure size
sns.histplot(data['month'], bins=12, kde=True, color='skyblue')  # Create a histogram with 12 bins (one for each month)
plt.title('Distribution of Bookings Across Months')  # Set the title of the plot
plt.xlabel('Month')  # Label the x-axis as 'Month'
plt.ylabel('Count')  # Label the y-axis as 'Count'
plt.xticks(range(0, 12), ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec'])
# Set x-axis labels to month abbreviations

plt.show()

##### 1. Why did you pick the specific chart?

The choice of a histogram chart in this context is based on the nature of the data and the specific analysis goal.

The data in question represents the number of bookings for each month of the year. This data is essentially a univariate distribution, to understand how the bookings are distributed across the months. A histogram is an appropriate choice for visualizing such distributions.


A histogram allows you to divide the data into bins (in this case, one bin for each month), and it counts the number of data points falling into each bin. This helps in visualizing the frequency or count distribution across discrete categories (months in this case).

Histograms are effective for identifying patterns and trends in data. You can observe if there are any months with significantly higher or lower booking counts, seasonal patterns, or irregularities.

##### 2. What is/are the insight(s) found from the chart?

Check if there are clear peaks or valleys in the histogram. For example, we might see higher booking counts in the summer months and lower counts in the winter months, indicating a seasonal pattern.

Jan,Feb are the months when the hotels are typically busiest, and you might want to allocate more resources or plan promotions during these times.

May,Jun,Jul are potentially the hotel's low-season months, and you might consider offering special deals or marketing campaigns to attract guests during these times.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive Growth:

Can allocate more marketing budget and staff during high-season months to maximize revenue. This targeted approach can lead to increased bookings and revenue during peak times.

Low-occupancy months can prompt you to create special promotions or packages to attract guests. Discounted rates, bundled services, or unique experiences can entice visitors during these periods.With insights into booking patterns, you can enhance the guest experience. During high-season months, focus on efficient check-ins and service delivery to handle the increased volume. During low-season months, provide personalized experiences to make guests feel valued.


Negative Growth:


If the hotel becomes overly reliant on revenue generated during high-season months, it may struggle to cover operational costs during low-season periods. This can lead to financial instability.


Overcrowding during peak months can negatively impact customer satisfaction. Guests may have a less enjoyable experience if the hotel is too crowded, leading to negative reviews and reduced repeat business.

#### Chart - 8--Grouped bar plot for is_canceled against categorical feature like customertype to see cancellation patterns.

In [None]:
# Chart - 8 visualization code
# Assuming you have your dataset loaded as a DataFrame df
plt.figure(figsize=(12, 6))

# Choose a categorical feature to analyze (e.g., 'customer_type' or 'meal')
categorical_feature = 'customer_type'

# Create a grouped bar plot
sns.countplot(data=data, x=categorical_feature, hue='is_canceled', palette=['green', 'red'])
plt.xlabel(categorical_feature.capitalize())
plt.ylabel('Count')
plt.title(f'Distribution of Cancellation Status by {categorical_feature.capitalize()}')
plt.legend(title='Cancellation Status', labels=['Not Canceled', 'Canceled'])
plt.xticks(rotation=45)  # Rotate x-axis labels for better readability if needed

plt.show()

##### 1. Why did you pick the specific chart?

Grouped bar plots are particularly useful to compare categorical data (e.g., customer types or meal options) across multiple categories.To understand how different categorical factors influence booking cancellations. A grouped bar plot allows to see, for each category, the proportion of bookings that were canceled and those that were not canceled. This helps in identifying which categories have higher or lower cancellation rates.

##### 2. What is/are the insight(s) found from the chart?

Business travelers might have a lower cancellation rate compared to transient or group travelers.

Can see there are specific meal options (e.g., breakfast included, half-board, full-board) that are associated with higher or lower cancellation rates. This insight could help in menu planning or pricing strategies.Understanding cancellation patterns can inform marketing efforts. For instance, finding that leisure travelers tend to cancel more often when breakfast is included, you might tailor promotions or incentives for this group.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive Growth:

Understanding which customer segments are less likely to cancel can help in pricing optimization.Can adjust prices or offer discounts to incentivize bookings from these segments, potentially increasing revenue.

 By aligning theofferings with customer preferences and behaviors, can improve the overall customer experience. This can lead to positive reviews, repeat business, and positive word-of-mouth marketing.


Negative Growth:


Failing to optimize pricing based on cancellation patterns may result in missed revenue opportunities. If you consistently underprice or overprice certain customer segments, it can impact the bottom line.May waste resources targeting customer segments with high cancellation rates, leading to a lower return on investment.

#### Chart - 9

In [None]:
# Chart - 9 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 10

In [None]:
# Chart - 10 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 11

In [None]:
# Chart - 11 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 12

In [None]:
# Chart - 12 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 13

In [None]:
# Chart - 13 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 14 - Correlation Heatmap

In [None]:
# Correlation Heatmap visualization code
#correleation plot
correlated_data=data[['lead_time','previous_cancellations','previous_bookings_not_canceled','booking_changes','days_in_waiting_list','total_members','total_of_special_requests','required_car_parking_spaces','adr','total_nights']]
corr=correlated_data.corr()
fig,ax=plt.subplots(figsize=(12,6))
sns.heatmap(corr, annot=True, fmt='.2f', annot_kws={'size': 7},square=True)
plt.title("Correlations ")
plt.show()

##### 1. Why did you pick the specific chart?

*In here the aim is to know is there any correlation between the columns in the dataset.So, to get the results,only the numerical values ie.which columns falls* *into the numerical order are taken in account.So,the "correlated_data" variable is used to store the data of the numerical data.*

*In this case the Heatmaps are the appropriate ones for the execution and visulization of the correlated data.*

##### 2. What is/are the insight(s) found from the chart?

Same features of the data are always fall under the value of 1 as correlated value.

Apart from that the lead_time and total_nights are correlated to each other.It could suggest that for extended hotel stays, individuals tend to make their reservations relatively close to their actual check-in dates.

Then,the adr and total_member are also correlated to each other,which shows more the number of people stay in and the more number of revene the hotel can make

#### Chart - 15 - Pair Plot

In [None]:
# Pair Plot visualization code
columns_of_interest = ['lead_time', 'total_nights','total_members']

# Create a DataFrame with the selected columns
subset_df = data[columns_of_interest]

# Create the pair plot
sns.pairplot(subset_df, diag_kind='kde')
plt.show()








##### 1. Why did you pick the specific chart?

I recommended creating a pair plot for the selected columns 'lead_time', 'total_nights', and 'total_members' because a pair plot is a suitable choice for visualizing relationships and distributions between multiple numerical variables.

 A pair plot includes scatterplots for each combination of variables, which allows visually assess how two variables are related. Scatterplots are valuable for identifying patterns, trends, and potential correlations between variables.Pair plots are relatively easy to interpret, making them suitable for exploratory data analysis.

##### 2. What is/are the insight(s) found from the chart?

Lead Time vs. Total Nights:

Positive Relationship: If you observe an upward trend in the scatterplot between 'lead_time' and 'total_nights,' it suggests that as lead time increases, guests tend to stay for a longer duration.

Total Nights vs. Total Members:

Accommodation Size: The scatterplot between 'total_nights' and 'total_members' can reveal insights about the size of accommodations booked. For example, you may notice that longer stays tend to involve more total members, indicating larger group bookings.

Lead Time vs. Total Members:

Group Bookings: Explore whether there is a relationship between lead time and the number of members. You might find that group bookings tend to have longer lead times compared to individual bookings.

While pair plots do not provide correlation coefficients, you can visually assess the strength and direction of associations between variables.

## **5. Solution to Business Objective**

#### What do you suggest the client to achieve Business Objective ?
Explain Briefly.


•	Certainly, here are the key recommendations in bullet points for achieving the business objectives based on the hotel bookings data analysis:

•	Implement targeted marketing strategies by leveraging customer segmentation insights.

•	Adopt dynamic pricing strategies based on demand and customer segments.

•	Focus on enhancing customer satisfaction by addressing specific pain points like room preferences and meal choices.

•	Optimize operational processes, including efficient handling of waiting lists and parking allocation.

•	Deploy predictive modeling to forecast booking cancellations and allocate resources proactively.

•	Conduct ongoing competitive analysis to stay competitive in the market.

•	Establish a feedback loop with customers to adapt to evolving needs and preferences.

•	Prioritize data governance and security to protect customer data and ensure compliance.

•	Invest in staff training and development to improve service quality.

•	Regularly monitor performance to assess the effectiveness of strategies.





# **Conclusion**

In conclusion, my exploratory data analysis (EDA) of hotel bookings revealed vital insights into booking patterns and customer behaviors. By carefully examining a dataset containing 35 columns of booking information, I uncovered significant trends and factors that influence hotel reservations. This analysis illuminated the busiest booking periods and highlighted specific days of the week when reservations are most common. Moreover, a close examination of booking cancellations, customer types, meal preferences, and operational aspects provided valuable knowledge that can inform strategic decision-making. Additionally, the potential for predictive modeling within the dataset hints at exciting opportunities for the future. Overall, the outcomes of this EDA offer actionable insights that can guide pricing strategies, marketing efforts, and enhancements to customer satisfaction, ultimately contributing to the hotel's success and operational efficiency.

(1) Around 60% bookings are for City hotel and 40% bookings are for Resort hotel, therefore City Hotel is busier than Resort hotel. Also the overall adr of City hotel is slightly higher than Resort hotel.

(2) Mostly guests stay for less than 5 days in hotel and for longer stays Resort hotel is preferred.

(3) Both hotels have significantly higher booking cancellation rates and very few guests less than 3 % return for another booking in City hotel. 5% guests return for stay in Resort hotel.

(4) Most of the guests came from european countries, with most of guests coming from Portugal.

(5) Guests use different channels for making bookings out of which most preferred way is TA/TO.

(6) For hotels higher adr deals come via GDS channel, so hotels should increase their popularity on this channel.

(7) Almost 30% of bookings via TA/TO are cancelled.

(8) Not getting same room as reserved, longer lead time and waiting time do not affect cancellation of bookings. Although different room allotment do lowers the adr.

(9) July- August are the most busier and profitable months for both of hotels


(10) For customers, generally the longer stays (more than 15 days) can result in better deals in terms of low adr.


### ***Hurrah! You have successfully completed your EDA Capstone Project !!!***