<a href="https://colab.research.google.com/github/vsbagal/Hotel-Booking-EDA/blob/main/Hotel_Booking_EDA.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name**    - Hotel Booking Analysis





##### **Project Type**    - Exploratory Data Analysis
##### **Contribution**    - Individual
##### **Member**   -  Vishal Suresh Bagal 


# **Project Summary -** 

### The purpose of this project is to perform an exploratory data analysis (EDA) on a dataset of hotel bookings. The dataset contains information on bookings made at two hotels over a period of time, including information on the guests, the booking, and the hotel itself. The dataset also includes information on cancellations and no-shows.

### The goal of the EDA is to gain insights into the data and answer questions such as:

### What are the most popular months for bookings? How long do guests typically stay? What factors are most strongly associated with cancellations? Are there any patterns in the data that suggest certain types of guests are more likely to cancel or no-show?

### The EDA will begin with a data cleaning process, where missing or erroneous data will be identified and addressed. The data will then be explored through a variety of visualizations, including scatter plots, histograms, and heatmaps, to identify patterns and relationships in the data.

### The analysis will also include statistical tests and machine learning models to identify correlations and make predictions. For example, a logistic regression model could be used to identify factors that are strongly associated with cancellations.

### The final deliverables for this project will include a report summarizing the findings, as well as visualizations and code used in the EDA. This report will be useful for stakeholders such as hotel managers, who can use the insights gained to improve their operations and make data-driven decisions.

# **GitHub Link -**

[Github link](https://github.com/vsbagal/Hotel-Booking-EDA)

# **Problem Statement**


### The hospitality industry is highly competitive, and hotels need to continuously improve their services to stay ahead of the competition. One way to achieve this is by analyzing customer data and behavior to identify trends and patterns that can be used to improve customer satisfaction and increase revenue. 
### The problem we aim to solve in this project is to analyze the data of hotel bookings and cancellations to identify factors that affect customer bookings and cancellations. We will use exploratory data analysis (EDA) techniques to identify patterns in the data that can help hotels optimize their operations and improve customer satisfaction.

### The goal of this project is to provide insights that can be used to improve hotel booking systems, enhance the customer experience, and increase revenue. We will explore the data to answer questions such as: What are the most common reasons for booking cancellations? What factors influence customers to book a particular hotel? How can hotels use this information to improve their services and attract more customers?

### The insights gained from this analysis can help hotels optimize their pricing, improve customer service, and offer personalized experiences to their customers. Additionally, it can provide valuable information to other stakeholders such as travel agencies, online travel agents, and hotel industry researchers.



### **Define Your Business Objective?**

### The business objective of the hotel booking analysis is to provide actionable insights to the hotel industry stakeholders, such as hotel managers, marketers, and other decision-makers, that can help them improve their operations and enhance the customer experience. Specifically, the analysis aims to:

### 1.Identify the factors that affect hotel booking and cancellation rates

### 2.Discover patterns in the data that can help optimize pricing strategies and occupancy rates

### 3.Analyze customer preferences and behavior to personalize hotel offerings and enhance customer satisfaction

### 4.Provide insights into the competitive landscape and help hotels differentiate themselves from their competitors

### 5.Optimize marketing strategies to attract more customers and increase revenue.

### By achieving these objectives, the hotel industry stakeholders can make informed decisions and take action to improve their operations, increase revenue, and provide a better customer experience. Ultimately, the goal is to use the insights gained from the analysis to stay competitive in the market and achieve long-term success.

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required. 
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits. 
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many number of charts you want. Make Sure for each and every chart the following format should be answered.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 20 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following "UBM" Rule. 

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





# **Let's Begin !**

## **1. Know Your Data**

### Import Libraries

In [None]:
# Import Libraries
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

### Dataset Loading

In [None]:
# Load Dataset
from google.colab import drive
drive.mount('/content/drive')

In [None]:
# Dataset path
path = '/content/drive/MyDrive/DATA SCIENCE/Hotel Bookings.csv'
hotel_booking = pd.read_csv(path)


### Dataset First View

In [None]:
# Dataset First Look
hotel_booking.head()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
hotel_booking.shape

### Dataset Information

In [None]:
# Dataset Info
hotel_booking.info()

#### Duplicate Values

In [None]:

# Count the number of duplicate rows based on all columns
num_duplicates = hotel_booking.duplicated().sum()
num_duplicates


#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
hotel_booking.isnull().sum().sort_values()


In [None]:
# Visualizing the missing values
# Create the figure first so that figsize applies to the heatmap itself
plt.figure(figsize=(15,15))
sns.heatmap(hotel_booking.isnull(), cbar=False, cmap='viridis')
plt.title("Missing values in dataset")


### What did you know about your dataset?

* The dataset has 119390 rows and 32 columns, with some duplicate rows and null values as described below.

* 31994 rows are duplicates. Missing values occur in 4 columns: 4 in children, 488 in country, 16340 in agent, and 112593 in company.

* The dataset contains 3 different data types: float64 (4 columns), int64 (16 columns), and object (12 columns).
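
The figures in this summary can be gathered in one place. The following is a minimal sketch, assuming a small made-up frame `toy` in place of `hotel_booking` and a hypothetical helper `summarize_quality` that is not part of the original notebook:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for hotel_booking (values are made up).
toy = pd.DataFrame({
    "hotel": ["City Hotel", "Resort Hotel", "City Hotel", "City Hotel"],
    "children": [0.0, np.nan, 1.0, 0.0],
    "company": [np.nan, np.nan, 40.0, np.nan],
})

def summarize_quality(df):
    """Return duplicate-row count, per-column missing counts/shares and dtype counts."""
    missing = df.isnull().sum()
    missing_table = pd.DataFrame({
        "missing": missing,
        "missing_pct": (missing / len(df) * 100).round(2),
    }).sort_values("missing", ascending=False)
    return {
        "duplicate_rows": int(df.duplicated().sum()),
        "missing": missing_table,
        "dtype_counts": df.dtypes.astype(str).value_counts(),
    }

summary = summarize_quality(toy)
print(summary["duplicate_rows"])
print(summary["missing"])
print(summary["dtype_counts"])
```

Applied to the real data, `summarize_quality(hotel_booking)` should reproduce the counts listed above. Reporting the missing share as a percentage makes it easier to see that a column such as company is almost entirely empty.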

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
hotel_booking.columns

In [None]:
# Dataset Describe
hotel_booking.describe()

### Variables Description 

1.hotel : resort hotel(h1), city hotel(h2)

2.is_canceled : If the booking was cancelled (1) or not (0)

3.lead_time: Number of days that elapsed between the entering date of the booking into the PMS and the arrival date

4.arrival_date_year : Year of arrival date

5.arrival_date_month : Month of arrival date

6.arrival_date_week_number : Week number for arrival date

7.arrival_date_day_of_month : Day of arrival date

8.stays_in_weekend_nights : Number of weekend nights (Saturday or Sunday) the guest stayed or booked to stay at the hotel

9.stays_in_week_nights : Number of weeknights (Monday to Friday) the guest stayed or booked to stay at the hotel

10.adults : Number of adults among guests

11.children : Number of children among guests

12.babies : Number of babies among guests

13.meal : kind of meal opted for

14.country : Country code

15.market_segment : which segment customer belongs to

16.distribution_channel : Name of booking distribution channel. The term “TA” means “Travel Agents” and “TO” means “Tour Operators”

17.is_repeated_guest : If the booking was done from a repeated guest (1) or not (0)

18.previous_cancellations : Number of previous bookings that were cancelled by the customer prior to the current booking

19.previous_bookings_not_canceled : Number of previous bookings not cancelled by the customer prior to the current booking

20.reserved_room_type : Code of room type reserved

21.assigned_room_type : Code of room type assigned

22.booking_changes : Number of changes/amendments made to the booking

23.deposit_type : Type of the deposit made by the guest

24.agent : ID of travel agent who made the booking

25.company : ID of the company that made the booking

26.days_in_waiting_list : Number of days the booking was in the waiting list

27.customer_type : Type of customer, assuming one of four categories

28.adr : Average Daily Rate, as defined by dividing the sum of all lodging transactions by the total number of staying nights

29.required_car_parking_spaces : Number of car parking spaces required by the customer

30.total_of_special_requests : Number of special requests made by the customer

31.reservation_status : Reservation status (Cancelled, Check-Out or No-Show)

32.reservation_status_date : Date at which the last reservation status was updated


### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
for elem in hotel_booking.columns.tolist():
  print("No. of unique values in ",elem,"is",hotel_booking[elem].nunique() )

In [None]:
# checking unique values in different columns
# unique values in hotel column
hotel_booking['hotel'].unique()

In [None]:
# unique values in arrival_date_year column
hotel_booking['arrival_date_year'].unique()

In [None]:
# unique values in is_canceled column
hotel_booking['is_canceled'].unique()

In [None]:
# unique values in meal column
hotel_booking['meal'].unique()

In [None]:
# unique values in 'distribution_channel' column
hotel_booking['distribution_channel'].unique()

In [None]:
# unique values in 'market_segment' column
hotel_booking['market_segment'].unique()

In [None]:
# unique values in 'children' column
hotel_booking['children'].unique()

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Write your code to make your dataset analysis ready.
# missing values in Columns .
hotel_booking.isnull().sum().sort_values(ascending = False)[:6]

In [None]:
#Replacing null values of company,agent and children columns with value 0 and replacing null values of country column with other
hotel_booking[['company','agent','children']] = hotel_booking[['company','agent','children']].fillna(0)
hotel_booking[['country']] = hotel_booking[['country']].fillna('other')

# Checking if all null values are removed
hotel_booking.isnull().sum().sort_values(ascending = False)[:6]

In [None]:
# Number of duplicate rows counted earlier
num_duplicates

In [None]:
# Dropping the duplicates
hotel_booking = hotel_booking.drop_duplicates()

In [None]:
# Checking the number of rows and columns after dropping duplicates
hotel_booking.shape

In [None]:
# Checking the shape of the rows where the combined value of adults, babies and children is 0
hotel_booking[hotel_booking['adults']+hotel_booking['babies']+hotel_booking['children'] == 0].shape

In [None]:
# Dropping the rows where the combined value of adults, babies and children is 0, because a booking with no guests is not a real booking
hotel_booking.drop(hotel_booking[hotel_booking['adults']+hotel_booking['babies']+hotel_booking['children'] == 0].index, inplace = True)

## converting columns to appropriate data types

In [None]:
# changing datatype of column 'reservation_status_date' from object to date_type.
hotel_booking['reservation_status_date'] = pd.to_datetime(hotel_booking['reservation_status_date'], format = '%Y-%m-%d')

In [None]:
# Adding total staying days in hotels
hotel_booking['total_stay'] = hotel_booking['stays_in_weekend_nights']+hotel_booking['stays_in_week_nights']


# Adding the total number of people as a column: total_people = adults + children + babies
hotel_booking['total_people'] = hotel_booking['adults']+hotel_booking['children']+hotel_booking['babies']

In [None]:
#Checking the final no of rows and columns
hotel_booking.shape

### What all manipulations have you done and insights you found?

We performed the following manipulations and found the following insights:

* Four columns contained null values: company, agent, country and children.
* For company and agent, we filled the missing values with 0.
* For country, we filled the missing values with the string 'other' (assuming that the country was not found while collecting the data, so the user selected an 'Others' option).
* The children column had only 4 missing values, so they were replaced with 0, assuming no children.
* The dataset also contained duplicate rows, which were dropped.
* In some rows the combined value of adults, babies and children was 0, which means there were no guests and so no booking was actually made. We dropped these rows.
* The 'reservation_status_date' column was of object type, so it was converted to a datetime type for easier use.
* Two new columns were added: 'total_people' and 'total_stay'.
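
The wrangling steps described above can also be collected into a single reusable function. This is only a sketch under stated assumptions: `clean_bookings` and the made-up frame `toy` are hypothetical names, and the function mirrors the steps in the cells above rather than replacing them.

```python
import numpy as np
import pandas as pd

def clean_bookings(df):
    """Apply the cleaning steps from this section and return a new DataFrame."""
    out = df.copy()
    # Fill missing IDs and child counts with 0, missing country with 'other'
    out[["company", "agent", "children"]] = out[["company", "agent", "children"]].fillna(0)
    out["country"] = out["country"].fillna("other")
    # Drop exact duplicate rows
    out = out.drop_duplicates()
    # Drop bookings with no guests at all
    guests = out["adults"] + out["children"] + out["babies"]
    out = out[guests != 0].copy()
    # Convert the status date and add the two derived columns
    out["reservation_status_date"] = pd.to_datetime(
        out["reservation_status_date"], format="%Y-%m-%d"
    )
    out["total_stay"] = out["stays_in_weekend_nights"] + out["stays_in_week_nights"]
    out["total_people"] = out["adults"] + out["children"] + out["babies"]
    return out.reset_index(drop=True)

# Hypothetical toy data: rows 0 and 1 are duplicates, row 2 has no guests.
toy = pd.DataFrame({
    "adults": [2, 2, 0, 1],
    "children": [np.nan, np.nan, 0.0, 1.0],
    "babies": [0, 0, 0, 0],
    "country": ["PRT", "PRT", None, None],
    "agent": [9.0, 9.0, np.nan, np.nan],
    "company": [np.nan, np.nan, np.nan, 40.0],
    "stays_in_weekend_nights": [1, 1, 0, 2],
    "stays_in_week_nights": [2, 2, 1, 3],
    "reservation_status_date": ["2016-07-01", "2016-07-01", "2016-07-02", "2016-07-05"],
})

cleaned = clean_bookings(toy)
print(cleaned[["country", "total_stay", "total_people"]])
```

Working on a copy keeps the raw frame untouched, so the cleaning can be re-run from the top of the notebook without reloading the CSV.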

## ***4. Data Vizualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

# Chart - 1 Pie chart for most preferred hotel (Univariate)

In [None]:
# Chart - 1 visualization code
hotel_booking['hotel'].value_counts().plot.pie(explode=[0.06, 0.06], autopct='%1.1f%%', shadow=True, figsize=(10,10),fontsize=20)   
plt.title('Pie Chart for City & Resort Hotels',fontsize=20)

##### 1. Why did you pick the specific chart?

A pie chart expresses a part-to-whole relationship in the data. It makes percentage comparisons easy to read through the area each category covers in the circle, so it is frequently used wherever shares are compared. We used a pie chart here because it shows the split of bookings between the two hotel types clearly.

##### 2. What is/are the insight(s) found from the chart?

From the above chart, we found that City Hotel is the most preferred hotel type and has the most bookings: 61.1% of guests preferred the city hotel, while only 38.9% chose the resort hotel.

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.


Yes, this chart has useful business implications for both types of hotel.

City hotels are doing well and attract the majority of bookings. Guests showed less interest in resort hotels, so resort hotels need to find ways to attract more guests and study what city hotels have done to attract theirs. There is considerable scope for growth in resort hotels if they upgrade their services, learn from the strategies of city hotels, and add new ideas of their own.

# Chart - 2 Hotel type with highest ADR (Bivariate with Categorical - Numerical)

In [None]:
# Chart - 2 visualization code
# group by hotel
group_by_hotel=hotel_booking.groupby('hotel')

In [None]:
#grouping by hotel adr
highest_adr =group_by_hotel['adr'].mean().reset_index()

#set plot size
plt.figure(figsize=(10,8))

# set labels
plt.xlabel('Hotel type',fontsize=15)
plt.ylabel('ADR',fontsize=15)
plt.title("Average ADR of each Hotel",fontsize=15)

#plot the graph
sns.barplot(x=highest_adr['hotel'],y=highest_adr['adr'])


##### 1. Why did you pick the specific chart?

Bar charts typically show the frequency counts of values for the different levels of a categorical or nominal variable. They can also show other statistics, such as percentages or means.

We used a bar chart to show the average ADR of each hotel type clearly.

##### 2. What is/are the insight(s) found from the chart?

City hotel has the highest average ADR. This means city hotels earn more per occupied room night than resort hotels; for the same number of nights sold, a higher ADR means higher revenue.

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

City hotels already generate a high ADR, so more advertising to bring in additional customers would add to their revenue. Resort hotels have a lower ADR, so there is more scope for ADR improvement there.

# Chart - 3 Relationship between ADR and Total Stay (Bivariate with Numerical-Numerical)

In [None]:
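# Chart - 3 visualization code
# Group by total_stay, adr and hotel, counting the bookings in each group
adr_vs_stay = hotel_booking.groupby(['total_stay', 'adr','hotel']).agg('count').reset_index()
# Keep the three grouping columns plus the first count column (is_canceled)
adr_vs_stay = adr_vs_stay.iloc[:, :4]
adr_vs_stay = adr_vs_stay.rename(columns={'is_canceled':'Number of stays'})
# Keep only the first 18000 rows (sorted by total_stay) for plotting
adr_vs_stay=adr_vs_stay[:18000] 
adr_vs_stay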

In [None]:
#plotting the graph in line chart
plt.figure(figsize=(22,6))
sns.lineplot(x='total_stay',y='adr',data=adr_vs_stay)
plt.xlabel('Total_stay',fontsize=15)
plt.ylabel('adr',fontsize=15)
plt.title('Relationship between adr and total stay',fontsize=15)

##### 1. Why did you pick the specific chart?

A line chart helps show small shifts that may be hard to spot in other graphs, and it shows trends across a range of values. It is easy to read, so we can track the rise and fall of ADR as the stay length changes.

##### 2. What is/are the insight(s) found from the chart?

From this line chart, we found that ADR tends to increase as the total stay increases, so the two appear to be positively related.
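
The relationship read off the line chart can also be checked numerically. The following is a minimal sketch, assuming a small made-up frame `toy` with the same column names as the notebook; it computes the mean ADR per stay length and the Pearson correlation between the two columns.

```python
import pandas as pd

# Hypothetical toy data (values are made up for illustration).
toy = pd.DataFrame({
    "total_stay": [1, 1, 2, 3, 3, 4, 7, 7],
    "adr": [80.0, 90.0, 95.0, 100.0, 110.0, 105.0, 120.0, 100.0],
})

# Mean ADR for each stay length (what the line plot draws as its central line)
mean_adr_by_stay = toy.groupby("total_stay")["adr"].mean()

# Pearson correlation: +1 is a perfect increasing linear relation, 0 is none
pearson_r = toy["total_stay"].corr(toy["adr"])

print(mean_adr_by_stay)
print(round(pearson_r, 3))
```

On the real data, the same two lines applied to `hotel_booking` would show how strong the relationship is; a coefficient close to 0 would suggest that the upward trend in the chart is weak or driven by a few long stays.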

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

Hotels should focus on increasing their ADR. More advertising, better facilities, and attractive offers can encourage guests to stay longer; since ADR appears to rise with stay length, longer stays should raise ADR and ultimately revenue.

# Chart - 4 Percentage of repeated guests

In [None]:
# Chart - 4 visualization code
hotel_booking['is_repeated_guest'].value_counts().plot.pie(explode=(0.05,0.05),autopct='%1.1f%%',shadow=True,figsize=(12,8),fontsize=20)

plt.title("Percentage (%) of repeated guests",fontsize =20)

##### 1. Why did you pick the specific chart?

Pie charts represent proportional or relative data in a single chart, with each slice showing the share of one category in the whole. We used a pie chart to show the percentage of repeated and non-repeated guests (where 0 is a non-repeated guest and 1 is a repeated guest).

##### 2. What is/are the insight(s) found from the chart?

Repeated guests are very few: only 3.9%, while 96.1% of guests did not return to the same hotel. This calls for steps to increase the number of repeated guests. To retain guests, management should collect feedback from them and try to improve the services.

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.


Yes. The proportion of repeated guests is very low, so improving it would boost revenue. Hotels can give attractive offers to non-repeat customers during off seasons, collect feedback, resolve customer problems promptly, and offer their best deals to returning customers.

# Chart - 5 percentage distribution of required_car_parking_spaces ( Univariate )

In [None]:
# Chart - 5 visualization code
hotel_booking['required_car_parking_spaces'].value_counts().plot.pie(explode=[0.05]*5, autopct='%1.1f%%',shadow=True, figsize=(12,10), fontsize=15,labels=None)

labels=hotel_booking['required_car_parking_spaces'].value_counts().index
plt.title('%Distribution of required car parking spaces', fontsize=15)
plt.legend(bbox_to_anchor=(0.85,1), loc='upper left', labels = labels)


##### 1. Why did you pick the specific chart?

We used a pie chart here because it presents the result in an easily understood way: the two dominant colors clearly reflect the demand for car parking spaces by guests. Other charts could also be used, but we found this one most relevant here.

##### 2. What is/are the insight(s) found from the chart?

This chart shows that 91.6% of guests did not require a parking space, and 8.3% of guests required only 1 parking space.

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, the insights gained here can help hotels provide better services. Hotels need to invest less in car parking, since only 8.3% of guests required a parking space. It is better to keep a limited number of parking spaces and use the remaining space for other purposes rather than leaving it unused, and to focus on other areas that raise the quality of the hotel. The low demand for parking might be because many guests prefer to use public transport.

# Chart - 6 Meal type Distribution (Univariate)

In [None]:
# Chart - 6 visualization code
plt.figure(figsize=(16,7))
sns.countplot(x=hotel_booking['meal'])
plt.xlabel('Meal Type',fontsize =15)
plt.ylabel('Count',fontsize=15)
plt.title("Preferred Meal Type",fontsize=15)

##### 1. Why did you pick the specific chart?

We used a count plot because it shows the count of observations in each categorical bin using bars. Bar plots look similar to count plots, but instead of the count of observations in each category, they show the mean of a quantitative variable among observations in each category. To get clear insights about the counts of the different meal types, we used this count plot.

##### 2. What is/are the insight(s) found from the chart?

From the above graph, we found that the meal type most preferred by guests is BB (Bed and Breakfast), while HB (Half Board) and SC (Self Catering) are about equally preferred. The meal types in the hotels are as follows:

* BB - Bed and Breakfast
* HB - Half Board
* FB - Full Board
* SC - Self Catering

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

These insights also have a positive impact: hotels should focus most on the BB meal type so that the majority of customers are satisfied, while still giving the other meal types due attention and managing food services properly so as to offer the best service to all customers.

# Chart - 7 Pie chart for Mostly used Distribution channel and Relationship of Distribution channel and adr

In [None]:
# Chart - 7 visualization code
#using  group by on distribution channel and hotel
distribution_channel_df=hotel_booking.groupby(['distribution_channel','hotel'])['adr'].mean().reset_index()

# set plot size and plot barchart
plt.figure(figsize=(16,8))
sns.barplot(x='distribution_channel', y='adr', data=distribution_channel_df, hue='hotel')
plt.title('ADR across Distribution channel')
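
The section title and the discussion below refer to a pie chart of the most used distribution channels, but the code for it does not appear in this cell. The following is a minimal sketch of the shares such a chart would display, assuming a small made-up frame `toy`; it is a reconstruction for illustration, not the original code.

```python
import pandas as pd

# Hypothetical toy data: 10 bookings across three channels (made-up values).
toy = pd.DataFrame({
    "distribution_channel": ["TA/TO"] * 7 + ["Direct"] * 2 + ["Corporate"],
})

# Percentage share of bookings per channel, largest first
channel_share = (
    toy["distribution_channel"].value_counts(normalize=True) * 100
).round(1)

print(channel_share)
```

Calling `channel_share.plot.pie(autopct='%1.1f%%')`, in the same way as Chart 1, would draw the corresponding pie chart.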

##### 1. Why did you pick the specific chart?

In the first visualization, we used a pie chart to get a clear understanding of the most used booking distribution channels and the share of bookings for each channel. A pie chart expresses a part-to-whole relationship in the data and makes percentage comparisons easy to read through the area each category covers in the circle.

In the second visualization, we used a bar chart. Bar charts typically show the frequency counts of values for the different levels of a categorical or nominal variable, but they can also show other statistics, such as percentages or means. We used one to show the mean ADR for each distribution channel, which helps us understand how the channels differ in ADR and where revenue can be increased.

##### 2. What is/are the insight(s) found from the chart?

From the first chart, we found that 'TA/TO' is the most used channel for booking hotels (79.1%). Direct accounts for 14.9%, Corporate for 5.8%, GDS for only 0.2%, and the remaining undefined bookings for about 0%.

From the second chart, it is clear that 'Direct' and 'TA/TO' have almost equal ADR in both types of hotel, i.e. 'City Hotel' and 'Resort Hotel', while GDS has a high ADR in the 'City Hotel' type. GDS needs to increase Resort Hotel bookings. The definitions of the abbreviations used in this graph are as follows:

1. GDS - A global distribution system (GDS) is a worldwide link between travel bookers and suppliers, such as hotels and other accommodation providers. It communicates live product, price and availability data to travel agents and online booking engines, and allows for automated transactions.
2. Corporate - Bookings made through corporate hotel booking companies.
3. Direct - Bookings made directly with the respective hotels.
4. TA/TO - Bookings made through travel agents or tour operators.
5. Undefined - Bookings with no recorded channel. This may be because customers made their bookings on arrival.

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

From the first graph, we see that 'TA/TO' is the leading channel, while Direct and Corporate have the potential to grow because they hold a much smaller share. A good relationship between hotels and 'TA/TO' partners will add to growth and revenue.

Yes, the insights from the second visualization can help the distribution channels work on the areas where they are lagging. GDS leads in ADR for City hotels but lags in the Resort hotel category. Taking steps in the right direction there will help increase overall revenue.

# Chart - 8 Bookings by month and Optimal Stay length in hotels

In [None]:
# Chart - 8 visualization code
# using groupby on arrival_date_month and taking the hotel count
bookings_by_months_df=hotel_booking.groupby(['arrival_date_month'])['hotel'].count().reset_index().rename(columns={'hotel':"Counts"})
# Create list of months in order
months = ['January', 'February', 'March', 'April', 'May', 'June', 'July', 'August', 'September', 'October', 'November', 'December']

# creating df which will map the order of above months list without changing its values.
bookings_by_months_df['arrival_date_month']=pd.Categorical(bookings_by_months_df['arrival_date_month'],categories=months,ordered=True)

# sorting by arrival_date_month
bookings_by_months_df=bookings_by_months_df.sort_values('arrival_date_month')

bookings_by_months_df


In [None]:
# set plot size
plt.figure(figsize=(20,8))

#plotting lineplot on x- months & y- booking counts
sns.lineplot(x=bookings_by_months_df['arrival_date_month'],y=bookings_by_months_df['Counts'])

# set title for the plot
plt.title('Number of bookings across each month',fontsize=15)

#set x label
plt.xlabel('Month',fontsize=15)

#set y label
plt.ylabel('Number of bookings',fontsize=15)

In [None]:
#using groupby function  on total stay and hotel
stay = hotel_booking.groupby(['total_stay', 'hotel']).agg('count').reset_index() 
# taking only first three columns  
stay = stay.iloc[:, :3] 
#Renaming the columns                                                  
stay = stay.rename(columns={'is_canceled':'Number of stays'})   

In [None]:
# setting plot size and plot barchart
plt.figure(figsize=(18,10))
sns.barplot(x='total_stay',y='Number of stays',hue='hotel',data=stay)

#set labels
plt.title('Optimal Stay Length in Both hotel types',fontsize=15)
plt.ylabel('count of stays',fontsize=15)
plt.xlabel('total_stay(days)',fontsize=15)

##### 1. Why did you pick the specific chart?

For the first chart, we picked a line chart because it helps show small shifts that may be hard to spot in other graphs and shows trends across periods. It is easy to read, so we can track the change in the number of bookings from month to month.

For the second chart, we used a bar plot to get a clear view of the relation between the total stay in days and the count of stays (the number of customers who stayed that long).

##### 2. What is/are the insight(s) found from the chart?

From the first chart, we found that July and August had the most bookings. July and August fall in or near the summer vacation, so the summer vacation may be the reason for the higher bookings.

The second chart gives a different insight: the optimal stay in both hotel types is less than 7 days. Beyond that, the number of stays declines drastically.
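
The observation that most stays are shorter than 7 days can be put into numbers. The following is a minimal sketch, assuming a small made-up frame `toy` with the notebook's `hotel` and `total_stay` columns; it computes the percentage of stays under 7 days for each hotel type.

```python
import pandas as pd

# Hypothetical toy data (values are made up for illustration).
toy = pd.DataFrame({
    "hotel": ["City Hotel"] * 5 + ["Resort Hotel"] * 5,
    "total_stay": [1, 2, 3, 3, 8, 2, 4, 7, 7, 14],
})

# The mean of a boolean column is the fraction of True values,
# so this gives the share of stays shorter than 7 days per hotel type.
short_stay_share = (toy["total_stay"] < 7).groupby(toy["hotel"]).mean() * 100

print(short_stay_share)
```

Running the same expression on `hotel_booking` would show how large the under-7-day share actually is for each hotel type, which the bar chart only suggests visually.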

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

Yes. The first chart shows that hotels should be well prepared for July and August, since most bookings take place in these months. Better preparation and a good approach will add to the growth of hotels.

The second chart also has a positive impact. Hotels can work on increasing the length of stay to increase their revenue. Customers usually prefer a stay of up to one week, so hotels need to work efficiently during those seven days so that customers return to the same hotel, which will increase revenue.



# Chart - 9 Plotting Histogram

In [None]:
# Chart - 9 visualization code
hotel_booking.hist(figsize=(23,18))
plt.show()
     

##### 1. Why did you pick the specific chart?

To understand the data clearly, we used histograms here. A histogram is a popular graphing tool that summarizes discrete or continuous data measured on an interval scale. It illustrates the major features of a distribution in a convenient form, works well with large data sets (more than 100 observations), and can help detect unusual observations (outliers) or gaps in the data. We therefore used histograms to analyse the distribution of each variable over the whole dataset, including whether it is symmetric or not.
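
Symmetry can also be quantified alongside the histograms. The sketch below uses the pandas `skew()` method on a toy frame; the values and the `day_of_month` column name are made up for illustration. A skewness near 0 indicates a roughly symmetric distribution, while a large positive value indicates a long right tail.

```python
import pandas as pd

# Toy data (made up): one right-skewed column and one symmetric column
demo = pd.DataFrame({
    "lead_time": [0, 1, 2, 3, 5, 8, 40, 200],          # long right tail
    "day_of_month": [1, 5, 10, 15, 16, 21, 26, 30],    # symmetric around 15.5
})

# Sample skewness of every numeric column
skewness = demo.skew()

print(skewness.sort_values(ascending=False))
```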

##### 2. What is/are the insight(s) found from the chart?

We can see that the most guests came in 2016.

Arrivals peak at week number 30.

Arrivals are highest toward the end of the month.

Most guests come without children.

Very few bookings require car parking spaces.

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

Most customer arrivals happen at month end, and there is very little demand for car parking. Hotels can therefore focus more on their services at month end, or add staff to serve guests better. The low demand for car parking means that space can instead be used to make the hotels more spacious and attractive.

# Chart - 10 Year and Hotel wise confirmed bookings and cancellation distribution

In [None]:
# Chart - 10 visualization code
# Finding the counts and percentages of confirmed and canceled bookings
# Plotting a count plot with seaborn for the counts of confirmed and canceled bookings
plt.figure(figsize=(12,6))
sns.countplot(x= 'hotel', hue='is_canceled',palette='Set2',  data= hotel_booking)
plt.legend(['Confirmed', 'Canceled'])
plt.title("Hotel wise confirmation and cancellation of the bookings", fontsize = 18)
plt.ylabel("Count of confirmations and cancellations",fontsize = 15)
plt.xlabel("Hotel",fontsize = 15)

In [None]:
# Plotting a pie chart with matplotlib for the percentage of confirmed and canceled bookings of Resort Hotel
resort_hotel = hotel_booking.loc[(hotel_booking["hotel"] == "Resort Hotel")]
# sort_index() puts 0 (confirmed) before 1 (canceled) so the labels below always match
resort_hotel_checkin_cancel = resort_hotel['is_canceled'].value_counts().sort_index()
mylabels = ["Confirmed", "Canceled"]
myexplode = [0.2, 0]
resort_hotel_cancelation = plt.pie(resort_hotel_checkin_cancel, labels = mylabels, explode = myexplode, autopct='%1.1f%%',)
plt.title('Resort Hotel Confirmed and Cancellations')
resort_hotel_checkin_cancel

In [None]:
# Removing the canceled bookings from the data and creating a new dataframe
data_not_canceled = hotel_booking[hotel_booking['is_canceled']==0]
#Year wise Bookings of hotels  
sns.set_style(style='darkgrid')
plt.figure(figsize=(12,5))
sns.countplot(x= 'arrival_date_year', hue='hotel',palette='tab10',  data= data_not_canceled)
plt.legend(['Resort Hotel', 'City Hotel'])
plt.title("Year wise bookings of hotels ", fontsize = 18)
plt.ylabel("Number of bookings",fontsize = 15)
plt.xlabel("Year",fontsize = 15)
     

##### 1. Why did you pick the specific chart?

We picked the count plot and the pie chart to get clear insights into hotel-wise cancellation and confirmation of bookings.

##### 2. What is/are the insight(s) found from the chart?

We can clearly deduce from the above graphs that the City Hotel has more bookings than the Resort Hotel, but the City Hotel also has a higher cancellation percentage.

The graphs also show that in 2016 both hotels saw a massive increase in bookings, and 2016 is by far the year with the most bookings for both hotels. In each year (2015, 2016 and 2017), the City Hotel has the highest number of bookings.

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

Overall the graphs show a positive outcome, but the cancellation graph is a matter of deep concern: more than one quarter of all bookings were canceled. We need to look into this problem by checking the reasons for cancellation and addressing them as soon as possible at the business level, before the problem grows.
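
The cancellation percentage per hotel can be read directly from the data rather than from the charts. This is a minimal sketch on a made-up toy frame that reuses the `hotel` and `is_canceled` column names; because `is_canceled` is 0 or 1, its mean per group is the cancellation rate.

```python
import pandas as pd

# Toy data (made up): 5 City Hotel bookings and 4 Resort Hotel bookings
demo = pd.DataFrame({
    "hotel": ["City Hotel"] * 5 + ["Resort Hotel"] * 4,
    "is_canceled": [1, 1, 0, 0, 0, 1, 0, 0, 0],
})

# Mean of a 0/1 column is the fraction of canceled bookings; convert to percent
cancel_rate = demo.groupby("hotel")["is_canceled"].mean().mul(100).round(1)

print(cancel_rate)
```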

# Chart - 11  ADR across different months

In [None]:
# Chart - 11 visualization code
#Using groupby function 
bookings_by_months_df=hotel_booking.groupby(['arrival_date_month','hotel'])['adr'].mean().reset_index()

#create month list
months = ['January', 'February', 'March', 'April', 'May', 'June', 'July', 'August', 'September', 'October', 'November', 'December']

# it will take the order of the month list in the df along with values
bookings_by_months_df['arrival_date_month']=pd.Categorical(bookings_by_months_df['arrival_date_month'],categories=months,ordered=True)

#sorting the values
bookings_by_months_df=bookings_by_months_df.sort_values('arrival_date_month')
bookings_by_months_df
     

In [None]:
# setting plot size and plotting the line
plt.figure(figsize=(20,8))
sns.lineplot(x=bookings_by_months_df['arrival_date_month'],y=bookings_by_months_df['adr'],hue=bookings_by_months_df['hotel'])

# setting labels
plt.title('ADR across each month',fontsize=15)
plt.xlabel('Month',fontsize=15)
plt.ylabel('ADR',fontsize=15)

##### 1. Why did you pick the specific chart?

We picked a line chart here to get clear insights into the ADR of City and Resort Hotels across each month. A line chart is useful because it reveals small shifts that can be hard to spot in other graphs, shows trends across periods, and is easy to understand. To compare data, more than one line can be plotted on the same axes.

##### 2. What is/are the insight(s) found from the chart?

For the Resort Hotel, ADR is higher than for the City Hotel in June, July and August. The reason may be that people want to spend their summer vacation in resort hotels.

The best time for guests to visit either hotel is January, February, March, April, October, November and December, as the average daily rate in these months is very low, which makes a stay more affordable.

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

The higher the ADR, the higher the revenue, so this is a good sign. Hotels should work to raise their ADR in other periods as well, by offering good schemes to attract customers during the winter vacation and other holidays.
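
To name the peak and low-rate months for each hotel without reading them off the line chart, the monthly averages can be pivoted into one column per hotel. The sketch below assumes the same `arrival_date_month`, `hotel` and `adr` columns; the toy values are made up.

```python
import pandas as pd

# Toy data (made up): one ADR observation per month and hotel
demo = pd.DataFrame({
    "arrival_date_month": ["January", "January", "July", "July", "August", "August"],
    "hotel": ["City Hotel", "Resort Hotel"] * 3,
    "adr": [80.0, 50.0, 110.0, 150.0, 115.0, 180.0],
})

# Rows are months, columns are hotels, cells are mean ADR
adr_table = demo.pivot_table(
    index="arrival_date_month", columns="hotel", values="adr", aggfunc="mean"
)

# Month with the highest and lowest mean ADR for each hotel
peak_month = adr_table.idxmax()
low_month = adr_table.idxmin()

print(peak_month)
print(low_month)
```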

# Chart - 12 Weekly stay distribution and Calculation of Cancellation and non-cancellation

In [None]:
# Chart - 12 visualization code
# The column "total_stay" was already created above; it is recomputed here for clarity
# Total staying days in hotels
hotel_booking['total_stay'] = hotel_booking['stays_in_weekend_nights']+hotel_booking['stays_in_week_nights']

# Using a violin plot to see in which weeks visitors stay the longest
plt.figure(figsize =(18,8))
sns.violinplot(x="arrival_date_week_number", y="total_stay",palette="Set2", data=hotel_booking)

plt.title("Week wise number of stays", fontsize = 18)
plt.ylabel("Total stay (days)",fontsize = 15)
plt.xlabel("Week number",fontsize = 15)

##### 1. Why did you pick the specific chart?

We used the violin plot to see the relation between the week number and the length of stay. Violin plots are used to observe the distribution of numeric data, and are especially useful for comparing distributions between multiple groups. The peaks, valleys, and tails of each group's density curve can be compared to see where groups are similar or different.

We picked the pie chart because it gives a precise and clear view of how a whole splits into parts. As we can see, 27.5% of bookings were canceled. Here, 0 denotes not canceled and 1 denotes canceled. A pie chart represents data visually as fractional parts of a whole, which makes it an effective communication tool even for an uninformed audience, since it lets them compare the data at a glance and understand the information quickly.

##### 2. What is/are the insight(s) found from the chart?

From the violin plot above, we found that weeks 28 to 31 show the highest days of stay, weeks 1 to 11 show a very steady trend in the number of stays, and weeks 18 to 22 show the lowest number of stays, aggregated over all three years (2015, 2016 and 2017).

From the graph, we found that more than one quarter of all bookings, approximately 27.5%, were canceled.

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

Yes. Based on these outcomes, the client can plan better services for guests so that revenue can grow.

As we can see, more than 27% of bookings were canceled, which is a matter of deep concern. We need to look into this problem by checking the reasons for cancellation and addressing them as soon as possible at the business level, before the problem grows.
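
A violin plot with one violin per week can be hard to read, so a per-week numeric summary may help when comparing weeks. The following is a rough sketch on made-up values, reusing the `arrival_date_week_number` and `total_stay` column names.

```python
import pandas as pd

# Toy data (made up): a few bookings in weeks 1, 20 and 30
demo = pd.DataFrame({
    "arrival_date_week_number": [1, 1, 1, 20, 20, 30, 30, 30],
    "total_stay": [2, 3, 4, 1, 2, 5, 7, 9],
})

# Number of bookings plus median and mean stay length for each week
weekly = demo.groupby("arrival_date_week_number")["total_stay"].agg(
    ["count", "median", "mean"]
)

# Week with the longest typical (median) stay
longest_week = weekly["median"].idxmax()

print(weekly)
print("Week with the longest median stay:", longest_week)
```

The `count` column separates how many bookings arrive in a week from how long those bookings last, two quantities that the plot title and axis label can easily blur.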

# Chart - 13 Room type preference and Customer types

In [None]:
# Chart - 13 visualization code
#set plotsize
plt.figure(figsize=(18,8))

#plotting 
sns.countplot(x=hotel_booking['assigned_room_type'],order=hotel_booking['assigned_room_type'].value_counts().index)

#  set xlabel for the plot
plt.xlabel('Room Type',fontsize=15)

# set y label for the plot
plt.ylabel('Count of Room Type',fontsize=15)

#set title for the plot
plt.title("Most preferred Room type",fontsize=15)

##### 1. Why did you pick the specific chart?

For the first visualization, we picked a bar chart to show the distribution by volume (count of rooms) of which room type is allotted. A bar graph summarises a large set of data in a simple visual form, displays each category of the frequency distribution, and makes the pattern clearer than a table would.

The second visualisation is a count plot, because it gives clear insight into the total number of guests of each type. We used it to learn about the types of guests.

##### 2. What is/are the insight(s) found from the chart?

From the above chart, we found that the most preferred room type is "A": the majority of guests have shown interest in this room type.

From the above graph, it can be summarised that Transient customers visit the most, whereas customers who come in groups are the least frequent visitors.

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

From the graph, the impact is positive: room types 'A', 'D' and 'E' are preferred more by guests, possibly due to better services offered in those room types. Every booking benefits the hotel regardless of room type, but hotels should also look into the factors behind the lower preference for some room types. If the other room types also gain popularity, hotels will receive more bookings and, ultimately, more revenue.

Of course, a better understanding of the different types of guests will help in taking the right steps on services, facilities, requirements and offers, which will directly support business growth.
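
The guest-type breakdown discussed above can be tabulated as follows. This is only a sketch: the `customer_type` column name and the category values other than Transient and Group are assumptions, and the toy counts are made up.

```python
import pandas as pd

# Toy data (made up); "customer_type" is an assumed column name
demo = pd.DataFrame({
    "customer_type": ["Transient"] * 6 + ["Transient-Party"] * 2 + ["Contract", "Group"]
})

# Percentage share of each customer type, largest first
type_share = demo["customer_type"].value_counts(normalize=True).mul(100).round(1)

print(type_share)
```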

# Chart - 14 - Correlation Heatmap

In [None]:
# Correlation Heatmap visualization code
plt.figure(figsize=(15,12))
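# numeric_only=True restricts the correlation to numeric columns (needed on pandas >= 2.0)
sns.heatmap(hotel_booking.corr(numeric_only=True),annot=True)
plt.title('Correlation of the columns',fontsize=15)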

##### 1. Why did you pick the specific chart?

A correlation matrix is a table showing correlation coefficients between variables. Each cell in the table shows the correlation between two variables. A correlation matrix is used to summarize data, as an input into a more advanced analysis, and as a diagnostic for advanced analyses. The range of correlation is [-1,1].

Thus, to see the correlation between all the variables along with the correlation coefficients, we used a correlation heatmap.

##### 2. What is/are the insight(s) found from the chart?

1. In the heatmap, is_canceled and total_stay are negatively correlated. This suggests that bookings with longer stays are slightly less likely to be canceled.

2. lead_time and total_stay are positively correlated. This means that the longer the stay, the longer the lead time tends to be.

3. adults, children and babies are correlated with each other. This suggests that the more people in a booking, the higher the ADR tends to be.

4. is_repeated_guest and previous_bookings_not_canceled have a strong correlation. This may be because repeated guests are less inclined to cancel their bookings.

These are some of the useful insights found from the correlation heatmap.
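
Individual coefficients are easier to compare when they are ranked instead of read from the heatmap cells. The sketch below, on made-up values, ranks the correlations with `is_canceled` by absolute strength. It assumes pandas 1.5 or later, where `DataFrame.corr` accepts `numeric_only`.

```python
import pandas as pd

# Toy data (made up): one text column and three numeric columns
demo = pd.DataFrame({
    "hotel": ["City Hotel", "Resort Hotel"] * 3,
    "is_canceled": [0, 0, 0, 1, 1, 1],
    "lead_time": [5, 10, 20, 60, 90, 120],
    "total_stay": [3, 4, 2, 3, 2, 4],
})

# Pearson correlation over numeric columns only; the text column is skipped
corr = demo.corr(numeric_only=True)

# Correlations with is_canceled, strongest (by absolute value) first
ranked = corr["is_canceled"].drop("is_canceled").sort_values(key=abs, ascending=False)

print(ranked)
```

A correlation coefficient only measures linear association, so any causal reading of a ranked value remains a hypothesis to check.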

# Chart - 15 - Pair Plot

In [None]:
# Pair Plot visualization code
sns.pairplot(hotel_booking, hue="is_repeated_guest")

##### 1. Why did you pick the specific chart?

Pair plot is used to understand the best set of features to explain a relationship between two variables or to form the most separated clusters. It also helps to form some simple classification models by drawing some simple lines or make linear separation in our data-set.

Thus, we used the pair plot to analyse patterns in the data and relationships between the features. It complements the correlation heatmap by showing the pairwise relationships graphically instead of as coefficients.

##### 2. What is/are the insight(s) found from the chart?

We found the relationship of "is_repeated_guest" with the other columns. In general, this chart shows how a particular column relates to all other columns.

## **5. Solution to Business Objective**

#### What do you suggest the client to achieve Business Objective ? 
Explain Briefly.

The business objective can be attained as follows:

To attain high growth and more success, the hotel business needs to flourish, and for that we need to consider high revenue generation, customer satisfaction and employee retention.

We achieve this by showing the client, through various charts and distributions, which months generate the most revenue.

Revenue can be enhanced using the bar chart distributions of which room types are most preferred and reserved, and which months are most suitable for visitors.

We have also found various preferences in different categories, such as the most liked meal type, the optimal stay length, and facilities required by customers such as car parking spaces. All these insights support better planning for growth and higher revenue.

By understanding and using these outcomes, the client can prepare well in advance, so that customers face minimal grievances in the long run, which would further enhance hospitality and service.

Ask guests visiting the hotels for feedback often, so that quality can be raised to the next level and attract more guests.

Periodically run offers to attract past customers and so increase the number of repeated guests.

# **Conclusion**

* City hotels are the hotel type most preferred by guests, so city hotels are busier than resort hotels.

* The average ADR of city hotels is higher than that of resort hotels, which suggests that city hotels generate more revenue than resort hotels.

* The ADR increases with the total stay of guests, so the longer the stay, the higher the ADR and the revenue.

* The percentage of repeated guests is very low. Only 3.9% of guests had revisited the hotels, and the remaining 96.1% were new guests, so the retention rate is low.

* Most customers (91.6%) do not require car parking spaces, so a limited number of car parking spaces does not affect the business much.

* Among the different meal types, BB (Bed & Breakfast) is the one most preferred by guests.

* 'Direct' and 'TA/TO' contribute almost equally to ADR in both hotel types ('City Hotel' and 'Resort Hotel'), while GDS contributes strongly to ADR in the 'City Hotel' type.

* The optimal stay length in both hotel types (City and Resort Hotel) is less than 7 days. People usually stay for up to a week, and beyond one week the number of stays declines drastically.

* The most bookings take place in July and August, which are the favourite months for guests to travel.

* The most used distribution channel for booking is "TA/TO": 79.1% of bookings were made through TA/TO (travel agents/tour operators).

* Across the months, ADR for the Resort Hotel is higher than for the City Hotel in June, July and August.

* More than one quarter of all bookings are canceled: approximately 27.5% of all bookings.

* Majority of the guests have shown interest in the room type 'A'. Room type 'A' is the most preferred room type.
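
Percentage figures like those in the list above can be reproduced with `value_counts(normalize=True)`. The sketch below shows the pattern for the share of repeated guests; the toy values are made up and do not reproduce the reported 3.9%.

```python
import pandas as pd

# Toy data (made up): 24 new guests and 1 repeated guest
demo = pd.DataFrame({"is_repeated_guest": [0] * 24 + [1]})

# Percentage of bookings per category (0 = new guest, 1 = repeated guest)
guest_share = demo["is_repeated_guest"].value_counts(normalize=True).mul(100).round(1)

print(guest_share)
```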

# Capstone Project Successfully Completed!!!