<a href="https://colab.research.google.com/github/vijaytiramale/Project_EDA_Hotel_Booking_Analysis/blob/main/Hotel_Booking_Anslysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name**    - **Hotel Booking Analysis**



##### **Project Type**    - EDA
##### **Contribution**    - Individual/Team
##### **Team Member 1 -**
##### **Team Member 2 -**
##### **Team Member 3 -**
##### **Team Member 4 -**

# **Project Summary -**

**Have you ever wondered when the best time of year to book a hotel room is? Or the optimal length of stay in order to get the best daily rate? What if you wanted to predict whether or not a hotel was likely to receive a disproportionately high number of special requests? This hotel booking dataset can help you explore those questions!
This data set contains booking information for a city hotel and a resort hotel, and includes information such as when the booking was made, length of stay, the number of adults, children, and/or babies, and the number of available parking spaces, among other things. All personally identifying information has been removed from the data.
Explore and analyze the data to discover important factors that govern the bookings.**

# **GitHub Link -**

Provide your GitHub Link here.

# **Problem Statement**


**The hotel industry is constantly facing changes in demand for bookings, influenced by various factors such as seasonality, location, and customer demographics. Understanding and predicting these changes is crucial for hotels to manage their resources and pricing strategies effectively.**

#### **Define Your Business Objective?**

**To analyze and understand the patterns and trends in hotel booking demand to inform hotel industry stakeholders and support data-driven decision-making. This includes identifying key drivers of hotel bookings, exploring the relationships between various features and booking demand, and forecasting future demand based on historical data. The ultimate goal is to maximize revenue and occupancy rates for hotels through improved pricing and marketing strategies.**

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required. 
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits. 
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure that each chart answers the following questions:
        

            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 20 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following the "UBM" Rule. 

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]
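Guideline 2 asks for exception handling so that the notebook runs in one go. A minimal sketch of what that could look like for the loading step is shown here; `load_bookings` is a hypothetical helper name, not part of the original notebook, and the error messages are illustrative.

```python
from pathlib import Path

import pandas as pd


def load_bookings(csv_path):
    """Hypothetical helper: read the bookings CSV or fail with a clear message."""
    path = Path(csv_path)
    if not path.is_file():
        # Fail early with the offending path instead of a long pandas traceback.
        raise FileNotFoundError(f"Bookings file not found: {path}")
    try:
        return pd.read_csv(path)
    except pd.errors.EmptyDataError as exc:
        # pandas raises EmptyDataError when the file has no columns to parse.
        raise ValueError(f"Bookings file is empty: {path}") from exc
```

In the notebook this would replace the bare `pd.read_csv(file_path)` call, for example `df = load_bookings(file_path)`.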





# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns


### Dataset Loading

In [None]:
# Load Dataset
from google.colab import drive     
drive.mount('/content/drive')

In [None]:
#csv file location
file_path= "/content/drive/MyDrive/Hotel Booking Data Set/Hotel Bookings.csv" 

### Dataset First View

In [None]:
# Dataset First Look
df = pd.read_csv(file_path)

In [None]:
df.head()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
print(f'We have a total of {df.shape[0]} rows and {df.shape[1]} columns.')

### Dataset Information

In [None]:
# Dataset Info
df.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
df1=df.copy()
df1.duplicated().value_counts()    #true means duplicate rows

In [None]:
#dropping the duplicate rows
df1= df1.drop_duplicates()
df1.shape

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
df1.isna().sum().sort_values(ascending=False)[:6].reset_index().rename(columns={'index':'Columns',0:'Null values'})

In [None]:
# Visualizing the missing values
plt.figure(figsize=(25, 10))
sns.heatmap(df1.isnull(), cbar=False, yticklabels=False,cmap='viridis')
plt.xlabel("Name Of Columns")
plt.title("Places of missing values in column")

In [None]:
# Fill null values: 0 for agent, children and company; 'others' for country.
# Reassigning the result avoids calling fillna(inplace=True) on a column selection,
# which newer pandas versions warn about and which may leave df1 unchanged.
df1 = df1.fillna({'agent': 0, 'children': 0, 'company': 0, 'country': 'others'})

# Confirm that no null values remain.
df1.isna().sum().sort_values(ascending=False)[:6].reset_index().rename(columns={'index':'Columns',0:'Null values'})

### What did you know about your dataset?

This dataset is a single file of booking records for two hotels: a city hotel and a resort hotel. It includes information such as when the booking was made, the length of stay, the number of adults, children, and/or babies, and the number of available parking spaces, among other things. The dataset contains a total of 119390 rows and 32 columns. It contains 31944 duplicate rows, which were removed above. The columns hold integer, float, and string (object) values, and some columns are not stored in the most suitable type; for example, `children` is read as a float because it contains missing values. The columns `agent`, `children`, `company`, and `country` contain null values, which were filled above. The unique values of each column are inspected in the next section to see what each column actually holds.
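The duplicate count and the data-type observation above can be reproduced with a few pandas calls. The sketch below uses a small made-up frame (`toy`) in place of the real data, so the numbers it prints are illustrative only.

```python
import pandas as pd

# Toy frame standing in for the bookings data (values are made up).
toy = pd.DataFrame({
    "hotel": ["City Hotel", "City Hotel", "Resort Hotel", "City Hotel"],
    "lead_time": [10, 10, 3, 25],
    "children": [0.0, 0.0, None, 1.0],
})

# duplicated() flags every row that repeats an earlier row exactly.
n_duplicates = int(toy.duplicated().sum())
deduped = toy.drop_duplicates()
print(f"{n_duplicates} duplicate row(s); {len(deduped)} rows remain")

# A missing value forces the whole-number 'children' column to float.
print(toy.dtypes)
```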

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
df1.columns

In [None]:
# Dataset Describe
df1.describe()

In [None]:
df1.shape

**Variables Description**

## Description of individual Variable
  
**The columns and the data it represents are listed below:**

1. **hotel :** Name of the hotel (Resort Hotel or City Hotel)

2. **is_canceled :** If the booking was canceled (1) or not (0)

3. **lead_time :** Number of days between the booking date and the arrival date

4. **arrival_date_year :** Year of arrival date

5. **arrival_date_month :** Month of arrival date

6. **arrival_date_week_number :** Week number of year for arrival date

7. **arrival_date_day_of_month :** Day of arrival date

8. **stays_in_weekend_nights :** Number of weekend nights (Saturday or Sunday) spent at the hotel by the guests.

9. **stays_in_week_nights :** Number of weeknights (Monday to Friday) spent at the hotel by the guests.

10. **adults :** Number of adults among guests

11. **children :** Number of children among guests

12. **babies :** Number of babies among guests

13. **meal :** Type of meal booked

14. **country :** Country of guests

15. **market_segment :** Designation of market segment

16. **distribution_channel :** Name of booking distribution channel

17. **is_repeated_guest :** If the booking was from a repeated guest (1) or not (0)

18. **previous_cancellations :** Number of previous bookings that were cancelled by the customer prior to the current booking

19. **previous_bookings_not_canceled :** Number of previous bookings not cancelled by the customer prior to the current booking

20. **reserved_room_type :** Code of room type reserved

21. **assigned_room_type :** Code of room type assigned

22. **booking_changes :** Number of changes/amendments made to the booking

23. **deposit_type :** Type of the deposit made by the guest

24. **agent :** ID of travel agent who made the booking

25. **company :** ID of the company that made the booking

26. **days_in_waiting_list :** Number of days the booking was in the waiting list

27. **customer_type :** Type of customer, assuming one of four categories

28. **adr :** Average Daily Rate, as defined by dividing the sum of all lodging transactions by the total number of staying nights

29. **required_car_parking_spaces :** Number of car parking spaces required by the customer

30. **total_of_special_requests :** Number of special requests made by the customer

31. **reservation_status :** Reservation status (Canceled, Check-Out or No-Show)

32. **reservation_status_date :** Date at which the last reservation status was updated
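As a worked example of the `adr` definition in item 28, the sketch below divides the sum of lodging charges by the number of nights stayed. The nightly charges are made-up values for a single hypothetical booking.

```python
# Made-up lodging charges for one booking, one entry per night stayed.
nightly_charges = [120.0, 120.0, 95.0, 95.0]

total_nights = len(nightly_charges)

# ADR = sum of all lodging transactions / total number of staying nights.
adr = sum(nightly_charges) / total_nights
print(f"ADR over {total_nights} nights = {adr:.2f}")
```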

### Check Unique Values for each variable.

In [None]:
# Drop the 166 rows in which adults + children + babies is 0; such bookings have no guests.
no_guests = df1['adults'] + df1['babies'] + df1['children'] == 0
print(f'Rows with no guests: {no_guests.sum()}')
df1.drop(df1[no_guests].index, inplace=True)

# Check unique values for each categorical variable.
categorical_cols=list(set(df1.drop(columns=['reservation_status_date','country','arrival_date_month']).columns)-set(df1.describe()))
for col in categorical_cols:
  print(f'Unique values in column {col} are: {df1[col].unique()}')

## 3. ***Data Wrangling***

In [None]:
# lets add some new columns

df1['total_people'] = df1['adults'] + df1['babies'] + df1['children']   
df1['total_stay'] = df1['stays_in_weekend_nights'] + df1['stays_in_week_nights']   

In [None]:
df1.shape

### What all manipulations have you done and insights you found?

**Addition of columns**
Two columns needed for the analysis were derived from existing ones: `total_people` (adults + children + babies) and `total_stay` (weekend nights + week nights).

**Handled null values & removed duplicate entries**
Before visualizing the data, we wrangled it. We dropped the duplicate rows, checked every column for null values, filled the nulls in `agent`, `children`, and `company` with 0 and the nulls in `country` with 'others', and dropped the rows in which the total number of guests was 0.
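One way to decide how to treat a column with null values is to look at the share of nulls it contains. The sketch below computes that share on a small made-up frame; the column names mirror the dataset, but the values and the resulting percentages are illustrative only.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the bookings data (values are made up).
toy = pd.DataFrame({
    "company": [np.nan, np.nan, np.nan, 110.0],
    "agent": [9.0, np.nan, 240.0, 9.0],
    "adults": [2, 1, 2, 2],
})

# isna().mean() is the fraction of nulls per column; scale it to a percentage.
null_share = toy.isna().mean().mul(100).sort_values(ascending=False)
print(null_share)

# Fill the ID columns with 0, meaning "no company" / "no agent".
filled = toy.fillna({"company": 0, "agent": 0})
remaining_nulls = int(filled.isna().sum().sum())
print(f"Nulls remaining after filling: {remaining_nulls}")
```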

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

In [None]:
# Chart - 1 visualization code
# Pie chart of the share of bookings per hotel.
df1['hotel'].value_counts().plot.pie(explode=[0.05, 0.05], autopct='%1.1f%%', shadow=True, figsize=(10,8),fontsize=20)   
plt.title('Pie Chart for Most Preferred Hotel')

##### 1. Why did you pick the specific chart?

**To show which hotel received more bookings.**

##### 2. What is/are the insight(s) found from the chart?

**The City Hotel received more bookings (61.12%) than the Resort Hotel (38.87%), so the City Hotel is in higher demand.**

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

**Yes, the insight can create a positive business impact for both hotels:**

**City Hotel: provide more services to attract more guests and increase revenue.**

**Resort Hotel: find ways to attract guests, for example by studying what the City Hotel does to attract them.**
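The percentages shown on the pie chart can also be read as numbers. The sketch below shows the idea on a made-up series of hotel labels, so the shares it prints are not the real ones.

```python
import pandas as pd

# Toy series standing in for df1['hotel'] (values are made up).
hotels = pd.Series(["City Hotel"] * 3 + ["Resort Hotel"] * 2, name="hotel")

# normalize=True returns fractions instead of counts; scale to percent.
share_pct = hotels.value_counts(normalize=True).mul(100).round(2)
print(share_pct)
```

Applied to the notebook's data, the same call would be `df1['hotel'].value_counts(normalize=True)`.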

#### Chart - 2

In [None]:
# Chart - 2 visualization code
#set plotsize
plt.figure(figsize=(18,8))

#plotting 
sns.countplot(x=df1['assigned_room_type'],order=df1['assigned_room_type'].value_counts().index)
#  set xlabel for the plot
plt.xlabel('Room Type')
# set y label for the plot
plt.ylabel('Count of Room Type')
#set title for the plot
plt.title("Most preferred Room type")

##### 1. Why did you pick the specific chart?

**To show by volume which room types are allotted to guests.**

##### 2. What is/are the insight(s) found from the chart?

**Room type 'A' is assigned most often, which suggests it is the type guests prefer most.**

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

**Yes, the impact is positive: room types 'A', 'D', and 'E' are the most preferred by guests, possibly because of the services these room types offer.**

#### Chart - 3

In [None]:
# Chart - 3 visualization code
# Count the guests from each country and keep the top 10.
# rename_axis/reset_index(name=...) gives the same column names on pandas 1.x and 2.x.
country_df=df1['country'].value_counts().head(10).rename_axis('country').reset_index(name='count of guests')

# Visualizing by plotting the graph
plt.figure(figsize=(20,8))
sns.barplot(x=country_df['country'],y=country_df['count of guests'])
plt.xlabel('Country')
plt.ylabel('Number of guests',fontsize=12)
plt.title("Number of guests from different Countries")
print("\n\nPRT = Portugal\nGBR = Great Britain & Northern Ireland\nFRA = France\nESP = Spain\nDEU = Germany\nITA = Italy\nIRL = Ireland\nBRA = Brazil\nBEL = Belgium\nNLD = Netherlands")

##### 1. Why did you pick the specific chart?

**To see which countries most guests come from.**

***The chart shows the top 10 countries.***

##### 2. What is/are the insight(s) found from the chart?

**Most guests come from Portugal.**

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

**We can advertise more and provide attractive offers to guests from Portugal to increase customer volume.**

#### Chart - 4

In [None]:
# Chart - 4 visualization code
df1['is_canceled'].value_counts().plot.pie(explode=[0.05, 0.05], autopct='%1.1f%%', shadow=True, figsize=(10,8),fontsize=20)
plt.title("Cancellation and non Cancellation")

##### 1. Why did you pick the specific chart?

**This chart presents the share of cancelled versus non-cancelled bookings.**

##### 2. What is/are the insight(s) found from the chart?

**Overall, more than 25% of bookings were cancelled.**

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

**A cancellation share of more than 27% means lost revenue, so this insight points to negative growth.**


**Solution: investigate the reasons for cancellations and address them at the business level.**
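Because `is_canceled` is coded as 0/1, its mean is the cancellation rate, and grouping by hotel gives a rate per hotel. The sketch below illustrates this on made-up rows; the rates it prints are not results from the real dataset.

```python
import pandas as pd

# Toy frame standing in for df1 (values are made up).
toy = pd.DataFrame({
    "hotel": ["City Hotel"] * 4 + ["Resort Hotel"] * 2,
    "is_canceled": [1, 0, 0, 0, 1, 0],
})

# The mean of a 0/1 flag is the share of rows where the flag is 1.
overall_rate = toy["is_canceled"].mean()
rate_by_hotel = toy.groupby("hotel")["is_canceled"].mean()

print(f"Overall cancellation rate: {overall_rate:.1%}")
print(rate_by_hotel)
```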

#### Chart - 5

In [None]:
# Chart - 5 visualization code
df1['is_repeated_guest'].value_counts().plot.pie(explode=(0.05,0.05),autopct='%1.1f%%',shadow=True,figsize=(12,8),fontsize=20)

plt.title("Percentage (%) of repeated guests")

##### 1. Why did you pick the specific chart?

**To show the percentage share of repeated & non-repeated guests.**

##### 2. What is/are the insight(s) found from the chart?

**The number of repeated guests is very small compared with the total number of guests.**

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

**We can make attractive offers to first-time customers during off-seasons to bring them back and increase revenue.**

#### Chart - 6

In [None]:
# Chart - 6 visualization code
# Keep only the cancelled bookings (is_canceled == 1).
market_segment_df=df1[df1['is_canceled']==1]
# Count the cancelled bookings per market segment and hotel.
market_segment_df=market_segment_df.groupby(['market_segment','hotel']).size().reset_index().rename(columns={0:'counts'})

market_segment_df




In [None]:
plt.figure(figsize=(20,8))
sns.barplot(x='market_segment',y='counts',hue="hotel",data= market_segment_df)

# set labels
plt.xlabel('market_segment')
plt.ylabel('Counts')
plt.title('Cancelled bookings by market_segment and hotel')

##### 1. Why did you pick the specific chart?

**This chart shows how the cancelled bookings are distributed across market segments for each hotel.**

##### 2. What is/are the insight(s) found from the chart?

**Among the cancelled bookings, Online TA is the most frequent market segment.**

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

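**Online TA is the segment guests use most, which is positive for reach. The chart counts cancellations, however, so a high Online TA count can reflect both the segment's popularity and a tendency to cancel; comparing cancellation rates per segment would separate the two.**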
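The chart above plots the number of cancelled bookings, which is not the same as a cancellation rate. A minimal sketch of the difference, on made-up rows, is shown here; the aggregated column names `cancelled`, `bookings`, and `rate` are chosen for illustration.

```python
import pandas as pd

# Toy frame standing in for df1 (values are made up).
toy = pd.DataFrame({
    "market_segment": ["Online TA"] * 6 + ["Groups"] * 2,
    "is_canceled": [1, 1, 1, 0, 0, 0, 1, 1],
})

# Count of cancellations, number of bookings, and rate per segment.
summary = toy.groupby("market_segment")["is_canceled"].agg(
    cancelled="sum", bookings="count", rate="mean"
)
print(summary)
```

In this toy data Online TA has more cancellations in absolute terms, while Groups has the higher rate, which is why the two measures can lead to different conclusions.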

#### Chart - 7

In [None]:
# Chart - 7 visualization code
# groupby arrival_date_month and taking the hotel count
bookings_by_months_df=df1.groupby(['arrival_date_month'])['hotel'].count().reset_index().rename(columns={'hotel':"Counts"})
# Create list of months in order
months = ['January', 'February', 'March', 'April', 'May', 'June', 'July', 'August', 'September', 'October', 'November', 'December']
# creating df which will map the order of above months list without changing its values.
bookings_by_months_df['arrival_date_month']=pd.Categorical(bookings_by_months_df['arrival_date_month'],categories=months,ordered=True)
# sorting by arrival_date_month
bookings_by_months_df=bookings_by_months_df.sort_values('arrival_date_month')

bookings_by_months_df

In [None]:
# set plot size
plt.figure(figsize=(20,8))

#plotting lineplot on x- months & y- booking counts
sns.lineplot(x=bookings_by_months_df['arrival_date_month'],y=bookings_by_months_df['Counts'])

# set title for the plot
plt.title('Number of bookings across each month')
#set x label
plt.xlabel('Month')
#set y label
plt.ylabel('Number of bookings')

##### 1. Why did you pick the specific chart?

**To find the months in which most bookings happened.**

##### 2. What is/are the insight(s) found from the chart?

**July and August had the most bookings.**

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

 **Summer vacations are the likely reason for the July and August peak, so the hotels can prepare for higher demand in these months.**

#### Chart - 8

In [None]:
# Chart - 8 visualization code
# group by hotel
grup_by_hotel=df1.groupby('hotel')

# mean ADR for each hotel type
highest_adr=grup_by_hotel['adr'].mean().reset_index()

#set plot size
plt.figure(figsize=(10,8))

# plot the graph first, because seaborn sets its own axis labels
sns.barplot(x=highest_adr['hotel'],y=highest_adr['adr'])

# set labels after plotting so they are not overwritten
plt.xlabel('Hotel type')
plt.ylabel('ADR')
plt.title("Avg ADR of each Hotel type")

##### 1. Why did you pick the specific chart?

**To find which hotel type has the highest ADR.**

##### 2. What is/are the insight(s) found from the chart?

**The City Hotel has the higher average ADR, so it earns more per occupied room-night than the Resort Hotel. A higher ADR alone does not guarantee higher total revenue, which also depends on the number of nights sold.**

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

**Advertising the City Hotel more could bring in more customers at its higher ADR, which should increase profit.**
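To check whether a higher ADR also means higher revenue, ADR can be combined with the length of stay. The sketch below assumes a rough estimate, `revenue_estimate = adr * total_stay`, which is not defined in the original notebook and ignores cancellations; the rows are made up.

```python
import pandas as pd

# Toy frame standing in for df1 (values are made up).
toy = pd.DataFrame({
    "hotel": ["City Hotel", "City Hotel", "Resort Hotel", "Resort Hotel"],
    "adr": [120.0, 110.0, 90.0, 100.0],
    "total_stay": [1, 2, 5, 4],
})

# Assumption: revenue per booking is approximated by ADR times nights stayed.
toy["revenue_estimate"] = toy["adr"] * toy["total_stay"]

by_hotel = toy.groupby("hotel").agg(
    mean_adr=("adr", "mean"),
    total_revenue=("revenue_estimate", "sum"),
)
print(by_hotel)
```

In this toy data the City Hotel has the higher mean ADR but the lower estimated revenue, so the two measures are worth checking separately on the real data.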

#### Chart - 9

In [None]:
# Chart - 9 visualization code
# Drop the extreme outliers (adr above 5000) so they do not distort the scatter plot.
df1.drop(df1[df1['adr'] > 5000].index, inplace = True)

In [None]:
plt.figure(figsize=(16,8))
sns.scatterplot(x=df1['total_stay'],y=df1['adr'])
plt.title('Relationship between  adr and total stay')

##### 1. Why did you pick the specific chart?

**To compare the total length of stay with ADR and show how one affects the other.**

##### 2. What is/are the insight(s) found from the chart?

**ADR tends to be higher for shorter stays and lower for longer stays.**
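The pattern read off the scatter plot can be backed by two numbers: the correlation between `total_stay` and `adr`, and the median ADR per stay length. The sketch below uses made-up rows, so its output only illustrates the calculation.

```python
import pandas as pd

# Toy frame standing in for df1 (values are made up).
toy = pd.DataFrame({
    "total_stay": [1, 1, 2, 3, 7, 7],
    "adr": [150.0, 130.0, 120.0, 110.0, 80.0, 70.0],
})

# Pearson correlation; a negative value means ADR falls as stays get longer.
corr = toy["total_stay"].corr(toy["adr"])

# The median is less sensitive to outliers than the mean.
median_adr = toy.groupby("total_stay")["adr"].median()

print(f"Correlation between total_stay and adr: {corr:.2f}")
print(median_adr)
```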

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 10

In [None]:
# Chart - 10 visualization code
plt.figure(figsize=(18,10))
# numeric_only=True keeps corr() from failing on the text columns in pandas 2.x
sns.heatmap(df1.corr(numeric_only=True),annot=True)
plt.title('Correlation of the columns')

##### 1. Why did you pick the specific chart?

**To understand the correlation between variables.**

##### 2. What is/are the insight(s) found from the chart?

**There is a negative correlation between `is_canceled` and `same_room_alloted_or_not` (a derived flag for whether the assigned room type matches the reserved one, which has to be added to `df1` for it to appear in the heatmap), meaning customers are less likely to cancel when they receive the room type they reserved.
`lead_time` and `total_stay` have a positive correlation, meaning longer stays tend to be booked further in advance.
The numbers of adults, children, and babies are positively correlated with `adr`, meaning a higher number of people tends to come with a higher average daily rate.
There is a strong correlation between `is_repeated_guest` and `previous_bookings_not_canceled`, which suggests that repeat guests are less likely to cancel their bookings.**
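The `same_room_alloted_or_not` column mentioned above is not created anywhere in this notebook. A minimal sketch of how such a flag could be derived, assuming it means "assigned room type equals reserved room type", is shown below on made-up rows; the correlation it prints says nothing about the real data.

```python
import pandas as pd

# Toy frame standing in for df1 (values are made up).
toy = pd.DataFrame({
    "reserved_room_type": ["A", "A", "D", "E", "A", "D"],
    "assigned_room_type": ["A", "D", "D", "E", "A", "A"],
    "is_canceled": [0, 1, 0, 0, 0, 1],
})

# Assumption: 1 when the guest got the reserved room type, else 0.
toy["same_room_alloted_or_not"] = (
    toy["reserved_room_type"] == toy["assigned_room_type"]
).astype(int)

# Correlation between the flag and the cancellation indicator.
corr = toy["is_canceled"].corr(toy["same_room_alloted_or_not"])
print(f"Correlation: {corr:.2f}")
```

Adding the flag to `df1` before calling `df1.corr(numeric_only=True)` would make it appear in the heatmap.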


## **5. Solution to Business Objective**

#### What do you suggest the client to achieve Business Objective ? 
Explain Briefly.

**The success of a hotel business depends on several factors, including high revenue generation, customer satisfaction, and employee retention.**

**By using a bar chart to display the most reserved room types and the months with the highest visitor numbers, revenue can be increased.**

**Preparing in advance with this information can minimize customer grievances and enhance hospitality for the long term.**

**A scatter plot can highlight where longer stays coincide with a lower average daily rate (ADR), allowing the client to focus on bulk bookings during off-seasons for additional revenue.**

**The trend of visitor arrivals can be monitored to engage visitors in advance for entertainment and leisure activities.**

**The correlation between various values can be displayed to show the maximum and minimum percentages, enabling the client to focus on areas that need improvement to increase high-performing metrics.**

# **Conclusion**

1. Travelers seem to prefer City Hotels, which generate higher revenue and profits.

2. July and August see the highest number of bookings compared to other months.

3. Room Type A is the most sought-after among travelers.

4. Portugal and Great Britain are the top sources of bookings.

5. City Hotels have a higher retention rate for guests.

6. Approximately one-fourth of all bookings get cancelled, with more cancellations from City Hotels.

7. New guests tend to cancel bookings more frequently than repeat customers.

8. The length of the waitlist or assignment of the reserved room does not impact the cancellation of bookings.

9. Corporate clients have the highest percentage of repeat guests, while TA/TO has the lowest. However, in terms of cancelled bookings, TA/TO has the highest percentage, while Corporate has the lowest.

10. As the Average Daily Rate (ADR) increases, the length of stay decreases, likely due to cost considerations.

### ***Hurrah! You have successfully completed your EDA Capstone Project !!!***