# **Project Name**    - Hotel Booking EDA


##### **Project Type -**     EDA
##### **Contribution -**     Individual
##### **Team Member 1 -**    Rohit Sonawane


# **Project Summary -**

The dataset contains hotel booking records. I explored and analyzed it to identify important factors in hotel bookings. I first loaded the data into a Colab notebook and reviewed its variables. I then cleaned the data frame during data wrangling, using functions such as drop(), fillna(), and isna()/isnull(). Next, I added the columns required for the analysis and removed data that was not required. Using the cleaned data frame, I explored the variables with pie charts, count plots, bar plots, and heatmaps, and drew out insights on the important factors. From these insights, I can suggest some business objectives to clients.

# **GitHub Link -**

GitHub Link :- https://github.com/rohit-sonawane9/EDA-capstone-project

# **Problem Statement**


Develop a predictive model to analyze the booking data for a city hotel and a resort hotel, identify significant factors that influence bookings, and predict the likelihood of receiving a disproportionately high number of special requests.

#### **Define Your Business Objective?**

The primary objective is to gain insights from the data to optimize hotel bookings and operations. By understanding the factors that govern bookings and predicting special request patterns, the hotel management can make informed decisions and take actions to improve customer satisfaction, revenue, and operational efficiency.

*****Specific goals and objectives related to this problem statement could include:*****

1.  *****Identify key factors influencing hotel bookings:***** Analyze the data to identify the most important factors that influence the number of bookings for both the city hotel and resort hotel. This can involve exploring variables such as booking lead time, length of stay, number of adults/children/babies, and other relevant features.

2.  *****Determine the best time of year to book a hotel room:***** Analyze the data to identify patterns and trends in booking dates and times. Determine if there are specific seasons or months when booking rates are higher or lower, and provide recommendations on the optimal timing for customers to make hotel reservations.

3.  *****Optimize pricing and length of stay:***** Investigate the relationship between length of stay and the daily rate to determine the optimal length of stay that provides the best value for customers. Analyze pricing strategies and their impact on bookings to identify opportunities for optimizing pricing models and revenue management.

4.  *****Predict special request patterns:***** Build a predictive model that analyzes various factors related to special requests made by guests. Predict whether a hotel is likely to receive a disproportionately high number of special requests based on factors such as room type, length of stay, number of guests, or any other relevant variables. This can help the hotel staff better allocate resources and provide a personalized experience for guests.

     The overall objective is to leverage the data to gain insights, make data-driven decisions, and enhance the hotel's performance in terms of bookings, customer satisfaction, and operational efficiency.

# **General Guidelines** : -

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.

     The additional credits will have advantages over other students during Star Student selection.

             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure that the following format is answered for each and every chart.


```
# Chart visualization code
```


*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 20 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following the "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import pandas as pd
import numpy as np

# Import Visualization Libraries
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

# Import warnings
import warnings
warnings.filterwarnings('ignore')

### Dataset Loading

In [None]:
# Load Dataset
url = "https://raw.githubusercontent.com/rohit-sonawane9/EDA-capstone-project/main/Hotel%20Bookings.csv"
d = pd.read_csv(url)
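
Since the guidelines ask for exception handling, the loading step could be wrapped as in the sketch below. This is only an illustration: `load_dataset` is a hypothetical helper that is not part of the notebook, and the in-memory CSV stands in for the real file.

```python
import io
import pandas as pd

def load_dataset(source):
    """Return a DataFrame read from `source`, or None if reading fails.

    `load_dataset` is a hypothetical helper, not part of the original notebook.
    """
    try:
        return pd.read_csv(source)
    except (OSError, pd.errors.EmptyDataError, pd.errors.ParserError) as err:
        # OSError covers missing files and network errors (URLError subclasses it)
        print(f"Could not load dataset: {err}")
        return None

# Usage with a made-up in-memory CSV and with a path that does not exist
sample = load_dataset(io.StringIO("hotel,adr\nCity Hotel,100.0\nResort Hotel,80.5\n"))
missing = load_dataset("this_file_does_not_exist.csv")
```

Returning `None` keeps the notebook running so that a later cell can decide how to react. Re-raising the error would be an equally reasonable choice.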

### Dataset First View

In [None]:
# View top 5 rows of the dataset
d.head()

In [None]:
# View last 5 rows of the dataset
d.tail()

Create copy of dataset

In [None]:
data = d.copy()

### Dataset Rows & Columns count

In [None]:
# Checking number of rows and columns of the dataset using shape
print("Number of rows are: ",data.shape[0])
print("Number of columns are: ",data.shape[1])

### Dataset Information

In [None]:
# Checking information about the dataset using info
data.info()

#### Duplicate Values

In [None]:
# Checking duplicated rows count
data.duplicated().sum()

In [None]:
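# Preview the data without duplicate rows.
# Note: drop_duplicates() returns a new DataFrame, so `data` itself is unchanged
# unless the result is assigned back (data = data.drop_duplicates()).
data.drop_duplicates()

In [None]:
# Show the shape (rows, columns) of the duplicated rows
data[data.duplicated()].shape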
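
The difference between previewing de-duplicated data and actually keeping it can be seen in the small sketch below. The `toy` frame and its values are made up for illustration.

```python
import pandas as pd

# Made-up frame with one exact duplicate row
toy = pd.DataFrame({
    "hotel": ["City Hotel", "City Hotel", "Resort Hotel"],
    "adr": [100.0, 100.0, 80.0],
})

# Rows that are identical to an earlier row
n_duplicates = int(toy.duplicated().sum())

# The result is discarded here, so `toy` still has all of its rows
toy.drop_duplicates()
rows_without_assignment = len(toy)

# Keeping the returned frame is what actually removes the duplicates
deduplicated = toy.drop_duplicates().reset_index(drop=True)
```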



#### Missing Values/Null Values

In [None]:
# Checking missing values/null values count for each column
data.isnull().sum().sort_values(ascending =False)[:6]

In [None]:
# Visualizing the missing values
# Checking Null Value by plotting Heatmap
sns.heatmap(data.isnull(), cbar=False)

### What did you know about your dataset?

This dataset contains hotel bookings. It has 119390 rows and 32 features/columns. Only the 'company', 'agent', 'country' and 'children' columns have null values; the other columns have none.

The company and agent columns hold company numbers and agent numbers. When a customer did not book the hotel through an agent or a company, the value in these columns can be null.
We will replace the null values in these columns with 0.

In [None]:
data[['company','agent']] = data[['company','agent']].fillna(0)

In [None]:
data['children'].unique()

The 'children' column already contains 0, which means that no children were in the group of customers who made that booking.
So the 'nan' values are missing values caused by errors in recording the data.

We will replace the null values in this column with the mean value of 'children'.

In [None]:
# Assign the result back instead of calling fillna(inplace=True) on a selected column
data['children'] = data['children'].fillna(data['children'].mean())

The next column with missing values is 'country'. This column represents the customer's country of origin.
Since this column holds strings, we will replace the missing values with the placeholder 'others'.

In [None]:
# Assign the result back instead of calling fillna(inplace=True) on a selected column
data['country'] = data['country'].fillna('others')

In [None]:
# Checking if all null values are removed
data.isnull().sum().sort_values(ascending = False)[:6]
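
The same filling rules can also be expressed in a single `fillna` call with one value per column. The sketch below uses a made-up `toy` frame, not the real dataset.

```python
import numpy as np
import pandas as pd

# Made-up rows with the same four columns that contain nulls in the dataset
toy = pd.DataFrame({
    "company": [np.nan, 40.0, np.nan],
    "agent": [9.0, np.nan, 240.0],
    "children": [0.0, np.nan, 2.0],
    "country": ["PRT", None, "GBR"],
})

# One fill value per column, mirroring the rules described above
fill_values = {
    "company": 0,                        # no company involved
    "agent": 0,                          # no agent involved
    "children": toy["children"].mean(),  # mean of the recorded values
    "country": "others",                 # placeholder label
}
filled = toy.fillna(value=fill_values)

# Total number of nulls left after filling
remaining_nulls = int(filled.isnull().sum().sum())
```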

There are some rows where the total number of adults, children and babies is zero. We will remove these rows.

In [None]:
data[data['adults']+data['babies']+data['children'] == 0].shape

In [None]:
data.drop(data[data['adults']+data['babies']+data['children'] == 0].index, inplace = True)

## ***2. Understanding Your Variables***

In [None]:
# Data info
data.info()

In [None]:
# Dataset Columns
data.columns

In [None]:
# Dataset Describe
data.describe()

### Variables Description

* **hotel** - H1 - Resort Hotel, H2 - City Hotel
* **is_canceled** - If the booking was canceled (1) or not (0)
* **lead_time** - Number of days that elapsed between the entering date of the booking into the PMS and the arrival date
* **arrival_date_year** - Year of arrival date
* **arrival_date_month** - Month of arrival date
* **arrival_date_week_number** - Week number for arrival date
* **arrival_date_day_of_month** - Day of arrival date
* **stays_in_weekend_nights** - Number of weekend nights (Saturday or Sunday) the guest stayed or booked to stay at the hotel
* **stays_in_week_nights** - Number of week nights (Monday to Friday) the guest stayed or booked to stay at the hotel
* **adults** - Number of adults
* **children** - Number of children
* **babies** - Number of babies
* **meal** - Kind of meal opted for
* **country** - Country code
* **market_segment** - Which segment the customer belongs to
* **distribution_channel** - How the customer accessed the stay: Corporate booking/Direct/TA/TO
* **is_repeated_guest** - Whether the guest is a repeated guest (1) or is coming for the first time (0)
* **previous_cancellations** - Number of previous cancellations
* **previous_bookings_not_canceled** - Count of previous bookings that were not canceled
* **reserved_room_type** - Type of room reserved
* **assigned_room_type** - Type of room assigned
* **booking_changes** - Count of changes made to booking
* **deposit_type** - Deposit type
* **agent** - Number of the agent through which the booking was made
* **company** - Number of the company through which the booking was made
* **days_in_waiting_list** - Number of days in waiting list
* **customer_type** - Type of customer
* **adr** - Average daily rate of revenue generated
* **required_car_parking_spaces** - Number of car parking spaces required
* **total_of_special_requests** - Number of additional special requirements
* **reservation_status** - Status of reservation
* **reservation_status_date** - Date of the specific status







### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
print(data.apply(lambda col: col.unique()))

## 3. ***Data Wrangling***

### Data Wrangling Code

#### Converting columns to appropriate datatypes.

In [None]:
# Converting datatype of columns 'children', 'company' and 'agent' from float to int.
data[['children', 'company', 'agent']] = data[['children', 'company', 'agent']].astype('int64')

In [None]:
# Changing datatype of column 'reservation_status_date' to datetime.
data['reservation_status_date'] = pd.to_datetime(data['reservation_status_date'], format = '%Y-%m-%d')

#### Adding important columns.

In [None]:
# Adding total staying days in hotels
data['total_stay'] = data['stays_in_weekend_nights']+data['stays_in_week_nights']

# Adding total people num as column, i.e. total people num = num of adults + children + babies
data['total_people'] = data['adults']+data['children']+data['babies']

We are adding these columns so that we can analyse the stay length and the group size at hotels.
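
A minimal sketch of the two derived columns on a made-up `toy` frame is shown below. It also shows how rows without any guests can be filtered with a boolean mask. The values are illustrative only.

```python
import pandas as pd

# Made-up bookings; the second row has no guests at all
toy = pd.DataFrame({
    "stays_in_weekend_nights": [2, 0, 1],
    "stays_in_week_nights": [3, 1, 0],
    "adults": [2, 0, 1],
    "children": [1, 0, 0],
    "babies": [0, 0, 0],
})

# Total nights = weekend nights + week nights
toy["total_stay"] = toy["stays_in_weekend_nights"] + toy["stays_in_week_nights"]

# Total people = adults + children + babies, summed row by row
toy["total_people"] = toy[["adults", "children", "babies"]].sum(axis=1)

# Keep only rows with at least one guest
with_guests = toy[toy["total_people"] > 0]
```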

In [None]:
# view of data with added column
data.head()

The rows where the total number of adults, children and babies is zero, which means no booking was made, were dropped earlier. The check below confirms that none remain.

In [None]:
# Shape of the rows that have no guests
data[data['adults']+data['babies']+data['children'] == 0].shape

In [None]:
# Rows are dropped here using the drop function
data.drop(data[data['adults']+data['babies']+data['children'] == 0].index, inplace = True)

### What all manipulations have you done and insights you found?

* **There were 31994 duplicate rows in the given dataframe. `drop_duplicates()` was called without assigning its result back, so these rows remain in `data`.**
* **Four columns had missing values: 'company', 'agent', 'country' and 'children'. The missing values were replaced with 0 in 'company' and 'agent', with the mean in 'children', and with 'others' in 'country'.**
* **Two columns, total_stay and total_people, were added to the dataframe.**
* **Rows where 'adults', 'children' and 'babies' were all zero mean that no booking was made, so these rows were removed.**

### EDA -

* Let's first find the correlation between the numerical data.


* Columns like 'is_canceled', 'arrival_date_year', 'arrival_date_week_number', 'arrival_date_day_of_month', 'is_repeated_guest', 'company' and 'agent' are categorical data stored as numbers, so we won't need to check them for correlation.

* Also, we have added the total_stay and total_people columns, so we can leave out the adults, children, babies, stays_in_weekend_nights and stays_in_week_nights columns.

In [None]:
data_of_num = data[["lead_time","previous_cancellations","previous_bookings_not_canceled","booking_changes","days_in_waiting_list","adr","required_car_parking_spaces","total_of_special_requests","total_stay","total_people"]]

In [None]:
#correlation matrix
corrmat = data_of_num.corr()
f, ax = plt.subplots(figsize=(12, 7))
sns.set_palette("husl")
sns.heatmap(corrmat,annot = True,fmt='.2f', annot_kws={'size': 10},  vmax=.8, square=True);

1. Total stay length and lead time are slightly correlated. This may mean that people generally plan longer hotel stays a little further ahead of the actual arrival.

2. adr is slightly correlated with total_people, which makes sense: more people means more revenue and therefore a higher adr.
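
The idea of correlating only the truly numerical features can be sketched on a small made-up frame. The `toy` values below are invented and do not reproduce the coefficients of the real dataset.

```python
import pandas as pd

# Made-up bookings for illustration
toy = pd.DataFrame({
    "lead_time": [10, 40, 90, 200],
    "total_stay": [1, 2, 4, 7],
    "total_people": [1, 2, 2, 3],
    "adr": [60.0, 95.0, 110.0, 150.0],
    "is_canceled": [0, 1, 0, 1],  # categorical flag stored as a number
})

# Only the truly numerical features are passed to corr()
numeric_features = ["lead_time", "total_stay", "total_people", "adr"]
corr = toy[numeric_features].corr()  # Pearson correlation by default
```

The result is a square, symmetric table with 1.0 on the diagonal, which is the table the heatmap colours.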

Let's see whether the length of stay affects the adr.

In [None]:
plt.figure(figsize = (12,6))
sns.scatterplot(y = 'adr', x = 'total_stay', data = data)
plt.show()

We notice that there is an outlier in adr, so we will remove it to get a better scatter plot.

In [None]:
data.drop(data[data['adr'] > 5000].index, inplace = True)

In [None]:
plt.figure(figsize = (12,6))
sns.scatterplot(y = 'adr', x = 'total_stay', data = data)
plt.show()

From the scatter plot we can see that as total_stay increases, the adr decreases. This means that a better deal can be finalised for customers who stay longer.
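
A numerical cross-check of this reading is to drop the outlier and compare the mean adr for each stay length. The sketch below uses made-up numbers that were chosen to show a falling trend, so it illustrates the method and not the real result.

```python
import pandas as pd

# Made-up bookings; the last row plays the role of the adr outlier
toy = pd.DataFrame({
    "total_stay": [1, 1, 3, 3, 7, 7, 2],
    "adr": [120.0, 110.0, 100.0, 90.0, 70.0, 80.0, 5400.0],
})

# Same threshold as used above: rows with adr above 5000 are discarded
cleaned = toy[toy["adr"] <= 5000]

# Mean adr for each stay length
mean_adr_by_stay = cleaned.groupby("total_stay")["adr"].mean()
```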

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

### **Univariate Analysis** ###

#### Chart - 1

1) Which type of hotel is most preferred?

In [None]:
# Chart - 1 visualization code
hotel_counts = data['hotel'].value_counts()
hotel_counts

In [None]:
colors = ['#D62728','#9467BD', '#8C564B']  # Add more colors if needed

hotel_counts.plot.pie(explode=[0.03, 0.03], autopct='%1.2f%%', shadow=True, figsize=(10, 7), fontsize=20, colors=colors)
plt.title('Pie Chart for Most Preferred Hotel', fontsize=20)
plt.show()

##### 1. Why did you pick the specific chart?

**I used a pie chart because it gives a simple, easy-to-understand picture of which hotel has more bookings.**

##### 2. What is/are the insight(s) found from the chart?

**The City Hotel has more bookings (66.41%) and the Resort Hotel has fewer bookings (33.59%).**

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Yes, the gained insights help create a positive business impact.**

**The City Hotel can add more services to attract more guests and increase revenue.**

**The Resort Hotel can look for ways to attract guests, and can also find out which facilities the City Hotel provides to attract them.**

#### Chart - 2

2) Which agent made the most bookings?

In [None]:
# Chart - 2 visualization code
# rename_axis() and reset_index(name=...) give the same column names on old and new pandas versions
top_bookings_by_agent = data['agent'].value_counts().rename_axis('agent').reset_index(name='num_of_bookings')[:10]
top_bookings_by_agent

In [None]:
# barplot is used for visualization

plt.figure(figsize=(14, 7))
sns.barplot(x=top_bookings_by_agent['agent'], y=top_bookings_by_agent['num_of_bookings'], order=top_bookings_by_agent['agent'])
plt.title('Most bookings by the agent', fontsize=20)
plt.ylabel('Number of bookings', fontsize=15)
plt.xlabel('Agent number', fontsize=15)
plt.show()

##### 1. Why did you pick the specific chart?

**I chose a bar plot here because it presents the data in pictorial form, which makes comparison easy.**

##### 2. What is/are the insight(s) found from the chart?

**The insight found here is that Agent no. 9 made the most bookings.**

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

* **Yes, Agents no. 9 and 240 have the most bookings, which has a positive impact.**
* **Agents no. 1 and 6 have fewer bookings, which has a negative impact.**
* **Bookings made by agents no. 1 and 6 are about 4.27% of those made by agent no. 9, which has the highest number of bookings.**
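
A comparison such as "x% of the top agent's bookings" can be computed directly from the counts. The sketch below uses made-up agent numbers and counts, so the percentages are not those of the real data.

```python
import pandas as pd

# Made-up agent column: agent 9 appears four times, 240 twice, 1 once
agents = pd.Series([9, 9, 9, 9, 240, 240, 1], name="agent")

# Count bookings per agent, most frequent first
top_agents = (
    agents.value_counts()
    .rename_axis("agent")
    .reset_index(name="num_of_bookings")
    .head(3)
)

# Express every agent's bookings as a percentage of the top agent's bookings
top_count = top_agents.loc[0, "num_of_bookings"]
share_of_top = (top_agents["num_of_bookings"] / top_count * 100).round(2)
```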

#### Chart - 3

3) What is the percentage of repeated guests?

In [None]:
# Chart - 3 visualization code
repeated_guests_count = data['is_repeated_guest'].value_counts()
repeated_guests_count

In [None]:
# pie chart is used for visualization
sns.set_palette("plasma")

repeated_guests_count.plot.pie(explode=[0.03, 0.03], autopct='%1.2f%%', shadow=True, figsize=(10,7),fontsize=20)
plt.title('Percentage of repeated guests ',fontsize = 20)
plt.show()

##### 1. Why did you pick the specific chart?

**I used a pie chart because it gives a simple, easy-to-understand picture of how many guests book a particular hotel repeatedly.**

##### 2. What is/are the insight(s) found from the chart?

**The insight found from the chart is that very few guests book the same hotel again.**

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Yes, the gained insights help create a positive business impact. For example, hotels that are not booked repeatedly can collect feedback from guests and try to improve their services.**

#### Chart - 4

Q.4 What is the most preferred room type by the customers?

In [None]:
# Chart - 4 visualization code
room_type = data['assigned_room_type'].value_counts()
room_type

In [None]:
# countplot is used for visualization of most preferred room type
plt.figure(figsize=(14,7))
sns.countplot(x=data['assigned_room_type'],order=data['assigned_room_type'].value_counts().index)
plt.title("Most preferred Room type", fontsize = 20)
plt.xlabel('Type of the Room', fontsize = 15)
plt.ylabel('Room type count', fontsize = 15)
plt.show()

##### 1. Why did you pick the specific chart?

**I chose a count plot to visualize the most preferred room type because it displays the number of observations in each category, and here we need to show room type against room type count.**

##### 2. What is/are the insight(s) found from the chart?

**The insight found from the chart is that type A rooms are the most preferred, with a count of 74019, followed by type D rooms, with a count of 25309.**

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.



*   **Type A rooms are the most preferred rooms. This has a positive impact on business.**
*   **Type H, I, K and L rooms are less preferred. This insight has a negative impact.**
*   **This is because type A rooms have 46283 bookings and the type L room has only one booking.**





#### Chart - 5

Q.5 What type of food is most preferred by the guests?

In [None]:
# Chart - 5 visualization code
preferred_food = data['meal'].value_counts()
preferred_food

In [None]:
# Visualization of most preferred food using countplot
plt.figure(figsize=(14,7))
sns.countplot(x=data['meal'],order=data['meal'].value_counts().index)
plt.title("Most preferred Food", fontsize = 20)
plt.xlabel('Type of the food', fontsize = 15)
plt.ylabel('Food type count', fontsize = 15)

##### 1. Why did you pick the specific chart?

**I chose a count plot to visualize the most preferred food because it displays the number of observations in each category, and here we need to show food type against food type count.**

##### 2. What is/are the insight(s) found from the chart?

**The insight found here is that BB type food is the most preferred and FB type food is less preferred.**

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

*  **BB type food is the most preferred food. This has a positive impact on business.**
*  **Undefined and FB type food are less preferred. This insight has a negative impact on business.**
* **BB type food is preferred by 92235 guests and FB type food is preferred by only 798 guests.**

#### Chart - 6

Q.6 In which month most of the bookings happened?

In [None]:
# Chart - 6 visualization code
bookings_by_months=data.groupby(['arrival_date_month'])['hotel'].count().reset_index().rename(columns={'hotel':"Counts of booking"})
bookings_by_months

In [None]:
bookings_by_months=data.groupby(['arrival_date_month'])['hotel'].count().reset_index().rename(columns={'hotel':"Counts of booking"})
sequence_of_months = ['January', 'February', 'March', 'April', 'May', 'June', 'July', 'August', 'September', 'October', 'November', 'December']
bookings_by_months['arrival_date_month']=pd.Categorical(bookings_by_months['arrival_date_month'],categories=sequence_of_months,ordered=True)
bookings_by_months=bookings_by_months.sort_values('arrival_date_month')
bookings_by_months

In [None]:
# barplot for visualization of month in which most booking happened.
plt.figure(figsize=(14,7))
sns.barplot(data=bookings_by_months, x="arrival_date_month", y="Counts of booking")
plt.title("Number of Bookings in Months", fontsize = 20)
plt.xlabel('Month', fontsize = 15)
plt.ylabel('Number of Bookings', fontsize = 15)

##### 1. Why did you pick the specific chart?

**I chose a bar plot here because it presents the data in pictorial form, which makes comparison easy.**

##### 2. What is/are the insight(s) found from the chart?

**The insight found from the chart is that August has the maximum number of bookings.**

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

*    **July and August have the most bookings, which is a positive impact on business.**
*    **November, December and January have fewer bookings, which is a negative impact.**
*   **July and August have bookings above the average, while November, December and January have bookings below the average.**
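
The comparison against the average can be made explicit once the months are in calendar order. The sketch below uses invented monthly counts, so the resulting list of months is illustrative only.

```python
import pandas as pd

# Calendar order, written out explicitly
month_order = ["January", "February", "March", "April", "May", "June", "July",
               "August", "September", "October", "November", "December"]

# Made-up monthly counts in alphabetical order, as a groupby would return them
counts_alpha = pd.Series({
    "April": 3, "August": 6, "December": 1, "February": 2,
    "January": 1, "July": 5, "June": 4, "March": 2,
    "May": 4, "November": 1, "October": 3, "September": 4,
})

# Reorder to calendar order, then split around the monthly average
counts = counts_alpha.reindex(month_order)
average = counts.mean()
above_average = counts[counts > average].index.tolist()
below_average = counts[counts < average].index.tolist()
```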

#### Chart - 7

Q.7 Which distribution channel is most used for hotel booking?

In [None]:
# Chart - 7 visualization code
# distribution channel value count
distribution_channel_counts = data['distribution_channel'].value_counts()
distribution_channel_counts

In [None]:
# number of rows in the dataframe
d3 = data.shape[0]
d3

In [None]:
# distribution channel count in data format
distribution_channel_data = data['distribution_channel'].value_counts().rename_axis('distribution_channel').reset_index(name='count')
distribution_channel_data

In [None]:
# booking by distribution channel in percent
distribution_channel_data_percent = (distribution_channel_counts / d3 * 100).round(2).rename_axis('distribution_channel').reset_index(name='% booking')
distribution_channel_data_percent

In [None]:
#Visualization of mostly used distribution channels using barplot
plt.figure(figsize=(14,7))
sns.barplot(data=distribution_channel_data_percent, x="distribution_channel", y="% booking")
plt.title("Mostly used distribution Channels", fontsize = 20)
plt.xlabel('Distribution Channel', fontsize = 15)
plt.ylabel('Booking by distribution channel in percent', fontsize = 15)

##### 1. Why did you pick the specific chart?

**Because a bar plot gives a simple and easy-to-understand pictorial chart.**

##### 2. What is/are the insight(s) found from the chart?

**The most used distribution channel is TA/TO. Its total booking count is 97749, which is 82.0% of the bookings.**

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

*  **The TA/TO distribution channel is the most used channel, which has a positive impact.**
* **The GDS and Undefined distribution channels are less used, which has a negative impact.**
* **TA/TO accounts for 82.0% of bookings and GDS for 0.16%.**
* **Other channels could offer the facilities that the TA/TO channel provides.**
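
Percentage shares like these can also be obtained in one step with `value_counts(normalize=True)`. The sketch below uses a made-up channel column, so its percentages differ from the real ones.

```python
import pandas as pd

# Made-up channel column: 7 TA/TO, 2 Direct, 1 Corporate
channels = pd.Series(
    ["TA/TO"] * 7 + ["Direct"] * 2 + ["Corporate"],
    name="distribution_channel",
)

# normalize=True returns fractions of the total instead of raw counts
share = (channels.value_counts(normalize=True) * 100).round(2)

# Same information as a two-column table, ready for a bar plot
share_table = share.rename_axis("distribution_channel").reset_index(name="% booking")
```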

#### Chart - 8

Q.8 Which year had highest bookings?

In [None]:
# Chart - 8 visualization code
year_count = data['arrival_date_year'].value_counts().sort_index()
year_count

In [None]:
# Visualization of year wise booking using countplot chart
plt.figure(figsize=(14,7))
sns.countplot(x=data['arrival_date_year'],hue=data['hotel'])
plt.title('Year wise Bookings', fontsize = 20)
plt.xlabel('Arrival_date_year', fontsize = 15)
plt.ylabel('Count of bookings', fontsize = 15)

##### 1. Why did you pick the specific chart?

**I chose a count plot because it clearly shows which type of hotel was preferred in the given years.**

##### 2. What is/are the insight(s) found from the chart?

**2016 had the highest number of bookings and 2015 had the lowest.**

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

*  **2016 had the highest number of bookings, which has a positive impact.**
* **2015 had the lowest number of bookings, which has a negative impact.**
* **In 2016 there were 56622 bookings and in 2015 there were 21967 bookings.**

##  **Bivariate And Multivariate Analysis**

#### Chart - 9

Q.1 Which hotel type has the highest ADR?

In [None]:
# Chart - 9 visualization code
highest_adr = data.groupby('hotel')['adr'].mean().reset_index()
highest_adr

In [None]:
# Visualization of highest adr using barplot
plt.figure(figsize=(14,7))
sns.barplot(x=highest_adr['hotel'],y=highest_adr['adr'])
plt.title('Average ADR for each Hotel type', fontsize=20)
plt.xlabel('Type of hotel',fontsize=15)
plt.ylabel('ADR', fontsize=15)
plt.show()

##### 1. Why did you pick the specific chart?

**I chose a bar plot because it gives a simple pictorial diagram that is easy to understand.**

##### 2. What is/are the insight(s) found from the chart?

**The insight found from the chart is that the City Hotel has the highest adr, which means the City Hotel generates more revenue.**

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

* **The City Hotel has a high adr, which has a positive impact.**
* **The Resort Hotel has a lower adr compared to the City Hotel, which has a negative impact.**
* **The City Hotel has an adr of 105.43, which means more revenue, and the Resort Hotel has an adr of 94.98, which means less revenue than the City Hotel.**
* **The Resort Hotel should improve its facilities to increase revenue.**

#### Chart - 10

Q.2 Which hotel has longer waiting time?


In [None]:
# Chart - 10 visualization code
Waiting_time = data.groupby('hotel')['days_in_waiting_list'].mean().reset_index()
Waiting_time

In [None]:
# Visualization of hotel which has longer waiting time by using barplot
plt.figure(figsize=(14,7))
sns.barplot(x=Waiting_time['hotel'],y=Waiting_time['days_in_waiting_list'])
plt.title('Waiting time for each hotel type', fontsize=20)
plt.xlabel('Type of hotel',fontsize=15)
plt.ylabel('Waiting time in days', fontsize=15)

##### 1. Why did you pick the specific chart?

**I chose a bar plot because it gives an easy-to-understand pictorial diagram of which hotel has the longer waiting time.**

##### 2. What is/are the insight(s) found from the chart?

**The City Hotel has a longer waiting time. Therefore the City Hotel is much busier than the Resort Hotel.**

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

* **The City Hotel has a longer waiting time, which has a positive impact on business.**
* **The Resort Hotel has a shorter waiting time, which has a negative impact on business.**
* **The mean number of days in the waiting list is about 3.23 days for the City Hotel and about 0.52 days for the Resort Hotel.**
* **The Resort Hotel needs to improve its facilities so that its bookings increase.**

#### Chart - 11

Q.3 Which distribution channel contributed more to adr in order to increase the income?

In [None]:
# Chart - 11 visualization code
distribution_channel = data.groupby(['distribution_channel','hotel'])['adr'].mean().reset_index()
distribution_channel

In [None]:
# Visualization of contribution of distribution channel in adr using barplot
plt.figure(figsize=(14,7))
sns.barplot(x='distribution_channel',y='adr', data=distribution_channel,hue='hotel')
plt.title('ADR across Distribution channel', fontsize=20)
plt.xlabel('Distribution channel',fontsize=15)
plt.ylabel('ADR', fontsize=15)

##### 1. Why did you pick the specific chart?

**I chose a bar plot to visualise ADR across distribution channels because it gives an easy-to-understand visualization of large data.**

##### 2. What is/are the insight(s) found from the chart?

**The insight found from the above chart is that the GDS channel contributed the most to ADR in the City Hotel, and that Direct and TA/TO have nearly equal contributions to adr in both hotel types.**

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

* **The GDS distribution channel contributed more to adr for the City Hotel and the Undefined distribution channel contributed more to adr for the Resort Hotel, which has a positive impact.**
* **The GDS distribution channel has no contribution to adr for the Resort Hotel and the Undefined distribution channel contributed less to adr for the City Hotel, which has a negative impact.**
* **The GDS distribution channel should increase its bookings for the Resort Hotel, and the Undefined distribution channel should increase its bookings for the City Hotel. Their contribution to adr would then increase, and so would income.**
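
Channel and hotel combinations with no bookings at all, such as GDS for the Resort Hotel, show up as missing values once the grouped means are pivoted into a table. The sketch below uses made-up adr values for illustration.

```python
import pandas as pd

# Made-up bookings; there is no GDS booking for the Resort Hotel
toy = pd.DataFrame({
    "distribution_channel": ["Direct", "Direct", "TA/TO", "TA/TO", "GDS"],
    "hotel": ["City Hotel", "Resort Hotel", "City Hotel", "Resort Hotel", "City Hotel"],
    "adr": [100.0, 90.0, 110.0, 95.0, 120.0],
})

# Mean adr per channel (rows) and hotel (columns)
adr_table = (
    toy.groupby(["distribution_channel", "hotel"])["adr"]
    .mean()
    .unstack("hotel")
)

# Combinations without any booking appear as NaN in the table
missing_pairs = [
    (channel, hotel)
    for channel in adr_table.index
    for hotel in adr_table.columns
    if pd.isna(adr_table.loc[channel, hotel])
]
```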

#### Chart - 12

Q.4 What is optimal stay length in both types of hotel?

In [None]:
# Chart - 12 visualization code
# Count the bookings for each combination of stay length and hotel type
stay_length = data.groupby(['total_stay','hotel']).size().reset_index(name='Number of stays')
stay_length

In [None]:
# Barplot is used for visualization of optimal stay length in hotel type
plt.figure(figsize=(14,7))
sns.barplot(x='total_stay',y='Number of stays', data=stay_length,hue='hotel')
plt.title('Optimal Stay Length in Both hotel types', fontsize=20)
plt.xlabel('total_stay in days',fontsize=15)
plt.ylabel('count of stays', fontsize=15)
plt.show()

##### 1. Why did you pick the specific chart?

**Because a bar plot gives a simple visualization.**

##### 2. What is/are the insight(s) found from the chart?

**The optimal stay length in both hotel types is less than 7 days.**
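
If "optimal" is read as "most frequent", which is an assumption made here for illustration, the most common stay length for each hotel can be picked from the counts. The `toy` frame below is made up.

```python
import pandas as pd

# Made-up stays: five for each hotel type
toy = pd.DataFrame({
    "hotel": ["City Hotel"] * 5 + ["Resort Hotel"] * 5,
    "total_stay": [1, 2, 2, 3, 2, 7, 7, 1, 7, 3],
})

# Number of bookings for each hotel and stay length
stay_counts = (
    toy.groupby(["hotel", "total_stay"])
    .size()
    .reset_index(name="Number of stays")
)

# For each hotel, keep the row with the highest count
most_common = stay_counts.loc[stay_counts.groupby("hotel")["Number of stays"].idxmax()]
```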

#### Chart - 13

Q.5 Relationship between the repeated guests and previous bookings not canceled?

In [None]:
# Chart - 13 visualization code
repeated_guests = data[data['is_repeated_guest']==1]
repeated_guests_1 = data[data['is_repeated_guest']==0]

In [None]:
plt.figure(figsize=(14,7))
sns.barplot(x=data['is_repeated_guest'],y= data['previous_bookings_not_canceled'])
plt.xticks([0,1],['Not_repeated_guests','repeated_guests'],fontsize=15)
plt.xlabel('Is Repeated guests',fontsize=15)
plt.ylabel('Previous booking not cancelled', fontsize=15)
plt.title('Relationship between repeated guests and previous bookings not cancelled', fontsize=20)
plt.show()

##### 1. Why did you pick the specific chart?

**Because a bar plot is easy to understand.**

##### 2. What is/are the insight(s) found from the chart?

**The insight found from this chart is that repeated guests have, on average, more previous bookings that were not canceled than guests who are not repeated.**

#### Chart - 14

Q.6 What is the relationship between ADR and the total number of people?

In [None]:
# Chart - 14 visualization code
number_of_people = data[data['total_people']<5]
number_of_people

In [None]:
plt.figure(figsize=(14,7))
sns.barplot(x=number_of_people['total_people'],y= number_of_people['adr'])
plt.title('ADR and total number of people', fontsize=20)
plt.xlabel('Total people',fontsize=15)
plt.ylabel('adr', fontsize=15)
plt.show()

##### 1. Why did you pick the specific chart?

**I chose a bar plot because it gives a simple visualization of the data.**

##### 2. What is/are the insight(s) found from the chart?

**The insight from the above plot is that as the number of people increases, ADR also increases.**
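
By default, `sns.barplot` draws the mean of `y` for each value of `x`, so the bar heights can be reproduced as a table with a `groupby`. The sketch below uses a hypothetical sample (made-up ADR values) to show the idea; `mean_adr` is a name introduced here.

```python
import pandas as pd

# Hypothetical bookings using the dataset's column names (values are made up).
sample = pd.DataFrame({
    'total_people': [1, 1, 2, 2, 3, 3, 6],
    'adr': [60.0, 80.0, 90.0, 110.0, 140.0, 160.0, 300.0],
})

# Same filter as the chart: keep bookings with fewer than 5 people
number_of_people = sample[sample['total_people'] < 5]

# Mean ADR per group size, which is what each bar in the plot represents
mean_adr = number_of_people.groupby('total_people')['adr'].mean()
print(mean_adr)

# True when the mean ADR never decreases as the group size grows
print(mean_adr.is_monotonic_increasing)
```

On the real data, the same table gives the exact mean ADR behind each bar, which makes the upward trend easier to verify than reading it off the chart.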

#### Chart - 15 - Correlation Heatmap

In [None]:
data.head(2)

In [None]:
# Correlation Heatmap visualization code
plt.figure(figsize=(20,10))
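# Restrict to numeric columns; pandas 2.0 and later raise an error if corr() meets text columns
sns.heatmap(data.select_dtypes(include='number').corr(), annot=True)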
plt.title('Correlation of the columns')
plt.show()

##### 1. Why did you pick the specific chart?

**I chose a heatmap because it gives a more generalized view of numeric values and uses a color-coded scale.**

##### 2. What is/are the insight(s) found from the chart?

* **arrival_date_year and arrival_date_week_number have a negative correlation of -0.54.**
* **stays_in_week_nights and total_stay have a positive correlation of 0.94.**
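
Reading the strongest pairs off a large annotated heatmap is error-prone, so it can help to rank them programmatically. The sketch below is an illustration on a hypothetical sample: the values are made up, and it assumes `total_stay` is the sum of `stays_in_week_nights` and `stays_in_weekend_nights`, which is not shown in this section.

```python
import numpy as np
import pandas as pd

# Hypothetical sample (made-up values). Assumption: total_stay is the sum of
# week-night and weekend-night stays.
sample = pd.DataFrame({
    'stays_in_week_nights': [1, 2, 3, 4, 5],
    'stays_in_weekend_nights': [0, 2, 0, 2, 1],
    'hotel': ['City Hotel', 'Resort Hotel', 'City Hotel', 'Resort Hotel', 'City Hotel'],
})
sample['total_stay'] = sample['stays_in_week_nights'] + sample['stays_in_weekend_nights']

# Correlation matrix over numeric columns only
corr = sample.select_dtypes(include='number').corr()

# Keep only the upper triangle (k=1 skips the diagonal) so each pair appears once
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)

# Flatten to (column, column) pairs and sort by absolute correlation
pairs = corr.where(mask).stack().dropna().sort_values(key=abs, ascending=False)
print(pairs)
```

Sorting by absolute value keeps strong negative correlations, such as the one between arrival_date_year and arrival_date_week_number, near the top alongside strong positive ones.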

## **5. Solution to Business Objective**

#### What do you suggest the client do to achieve the Business Objective?
Explain briefly.

* **1) To grow the hotel business, factors such as high revenue generation, customer satisfaction, and the facilities provided by the hotel are important.**
* **2) I address these by showing the client which hotel is most preferred, the percentage of repeated guests, the food most preferred by guests, which hotel has the highest ADR, and so on.**
* **3) The count plot shows the most preferred room type, so the client can prepare well in advance; this insight helps the client further enhance their hospitality.**
* **4) I show which food type is most preferred, so the client can offer it to guests.**
* **5) The bar plot shows the most preferred months, so the client can prepare well in advance and face as few grievances as possible.**
* **6) The bar plot shows which hotel type has the higher ADR, so the client can analyze which hotel earns more income.**
* **7) I show which hotel is the busiest, so the client can make relevant changes to the facilities of the less busy hotel type.**
* **8) I show the relationship between repeated guests and previous bookings not cancelled, so the client can give preference to repeated guests.**
* **9) The bar plot shows the relationship between ADR and the total number of people, so the client can give preference to bookings with more people.**

# **Conclusion**

* **1) City Hotel is the hotel most preferred by guests.**
* **2) Agent no. 9 made the most bookings.**
* **3) The percentage of repeated guests is low, at 3.15%.**
* **4) Room type A is the most preferred room type.**
* **5) The most preferred food type is BB.**
* **6) August has the most bookings, followed by July.**
* **7) The TA/TO distribution channel is used the most, with a share of 82.00%.**
* **8) City Hotel has the highest ADR. A higher ADR means more revenue.**
* **9) 2016 had the highest number of bookings, at 56622.**
* **10) City Hotel has a longer waiting time, which means it is the busier hotel.**
* **11) The GDS distribution channel contributed the most to ADR in the city hotel but made no contribution in the resort hotel.**
* **12) The optimal stay length in both hotel types is less than 7 days.**
* **13) Repeated guests do not cancel their bookings, whereas non-repeated guests do.**
* **14) As the number of people increases, ADR also increases, which means revenue increases.**
* **15) arrival_date_year and arrival_date_week_number have a negative correlation of -0.54.**
* **16) stays_in_week_nights and total_stay have a positive correlation of 0.94.**
### ***Hurrah! You have successfully completed your EDA Capstone Project !!!***