<a href="https://colab.research.google.com/github/sandeeps02/Hotel-Booking-Project--Almabetter/blob/main/Hotel_Booking_Analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name**    - Hotel Booking Analysis



##### **Project Type**    - EDA
##### **Contribution**    - Individual

# **Project Summary -**

This project aims to provide insights into consumer preferences, trends, and patterns in the hospitality industry. By analyzing hotel booking data, we can learn which factors matter most to guests, which types of rooms are most popular, and how hotel bookings vary over time.

The data needs to be cleaned and preprocessed to remove duplicates, missing values, and outliers. This also involves transforming some variables and creating new variables to make the data more useful for analysis.

After the data is cleaned, the next step is to perform exploratory data analysis using tools such as box plots, scatter plots, etc. This helps us understand the distribution of the data, identify patterns, trends and outliers, and visualize relationships between variables.

Finally, we can use data visualization tools such as graphs, charts, and tables to communicate our findings and insights to stakeholders, such as hotel managers, investors, or marketing teams. This can help them to make more informed decisions about pricing, marketing, and other business strategies.

# **GitHub Link -**

https://github.com/sandeeps02/Hotel-Booking-Project--Almabetter/blob/main/Copy_of_Sample_EDA_Submission_Template.ipynb


# **Problem Statement**


**Write Problem Statement Here.**

#### **Define Your Business Objective?**

Answer Here.

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required. 
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits. 
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure the following format is answered for each and every chart.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 20 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following the "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Importing Libraries
import numpy as np
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

### Dataset Loading

In [None]:
# Load Dataset
from google.colab import drive
drive.mount('/content/drive')

#storing path of csv file
dir_path="/content/drive/MyDrive/Colab Notebooks/Hotel Booking File/Hotel Bookings.csv"

#creating Data Frame using pandas
hotel_df=pd.read_csv(dir_path,parse_dates=['reservation_status_date'])

#creating copy of data frame
df1 = hotel_df.copy()

### Dataset First View

In [None]:
# Dataset First Look
pd.set_option('display.max_columns',32)
df1.head()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
row_count = df1.shape[0]    # row count using shape (count() would skip null values)
column_count = df1.shape[1] # column count using shape
print('Row count :',row_count)
print('Column count :',column_count)

### Dataset Information

In [None]:
# Dataset Info
df1.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
df1[df1.duplicated()].shape

In [None]:
#Dropping all duplicate values from data frame
df1.drop_duplicates(inplace = True) 
df1.shape

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
df1.isna().sum()

In [None]:
# Visualizing the missing values
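# displot is a figure-level function and creates its own figure,
# so the size is set through height and aspect instead of plt.figure
sns.displot(
    data=df1.isna().melt(value_name="missing"),
    y="variable",
    hue="missing",
    multiple="fill",
    height=6,
    aspect=1.25
)
plt.savefig("visualizing_missing_data_with_barplot_Seaborn_distplot.png", dpi=100)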

In [None]:
#filling null values in country column by 'other'
df1['country']=df1['country'].fillna('other')
#filling null values of company,agent and children column with 0
df1[['company','agent','children']]=df1[['company','agent','children']].fillna(0.0)

In [None]:
# Additionally, we look for rows where children, adults and babies sum to 0 (bookings with no guests)
df1[df1['children']+df1['adults']+df1['babies']==0].shape #showing the number of those rows


In [None]:
df1.drop(df1[df1['adults']+df1['babies']+df1['children'] == 0].index, inplace = True) # to drop those rows

### What did you know about your dataset?

### **This is a hotel booking dataset with 32 columns and 119390 rows, of which 31944 rows are duplicates. The columns span several data types (datetime64[ns], float64(4), int64(16), object(11)), and four columns contain null values (country, children, agent and company).**
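
The checks summarised above can be collected in one small helper. The following is a minimal sketch on a made-up toy frame, not the real bookings file; the `audit` function and the `toy` data are assumptions introduced only for illustration.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the bookings table (values are made up for illustration).
toy = pd.DataFrame({
    "hotel": ["City Hotel", "City Hotel", "Resort Hotel", "Resort Hotel"],
    "country": ["PRT", "PRT", None, "GBR"],
    "children": [0.0, 0.0, np.nan, 1.0],
    "agent": [9.0, 9.0, np.nan, 240.0],
})

def audit(frame: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: one row per column with its dtype and null count."""
    return pd.DataFrame({
        "dtype": frame.dtypes.astype(str),
        "nulls": frame.isna().sum(),
    })

# Rows that are identical to an earlier row
n_duplicates = int(toy.duplicated().sum())

report = audit(toy)
# Names of the columns that contain at least one null value
null_columns = report.index[report["nulls"] > 0].tolist()

print("duplicate rows:", n_duplicates)
print(report)
print("columns with nulls:", null_columns)
```

Running the same helper on `df1` before and after cleaning would be one way to confirm that the duplicate and null handling did what was intended.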

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
df1.columns

In [None]:
# Dataset Describe
df1.describe()

### Variables Description 

- hotel: Name of hotel ( City or Resort)
- is_canceled: Whether the booking is canceled or not (0 for not canceled and 1 for canceled)
- lead_time: time (in days) between booking transaction and actual arrival.
- arrival_date_year: Year of arrival
- arrival_date_month: month of arrival
- arrival_date_week_number: week number of arrival date.
- arrival_date_day_of_month: Day of month of arrival date
- stays_in_weekend_nights: No. of weekend nights spent in a hotel
- stays_in_week_nights: No. of weeknights spent in a hotel
- adults: No. of adults in single booking record.
- children: No. of children in single booking record.
- babies: No. of babies in single booking record. 
- meal: Type of meal chosen 
- country: Country of origin of customers (as mentioned by them)
- market_segment: Market segment through which the booking was made and for what purpose.
- distribution_channel: Via which medium booking was made.
- is_repeated_guest: Whether the customer has made any booking before(0 for No and 1 for yes)
- previous_cancellations: No. of previous canceled bookings.
- previous_bookings_not_canceled: No. of previous non-canceled bookings.
- reserved_room_type: Room type reserved by a customer.
- assigned_room_type: Room type assigned to the customer.
- booking_changes: No. of booking changes done by customers
- deposit_type: Type of deposit at the time of making a booking (No deposit/ Refundable/ No refund)
- agent: Id of agent for booking
- company: Id of the company making a booking
- days_in_waiting_list: No. of days on waiting list.
- customer_type: Type of customer(Transient, Group, etc.)
- adr: Average Daily rate.
- required_car_parking_spaces: No. of car parking asked in booking
- total_of_special_requests: total no. of special request.
- reservation_status: Whether a customer has checked out, canceled, or not shown up
- reservation_status_date: Date on which the reservation status was set.

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
df1.nunique()

In [None]:
# creating a function that returns the unique values of one column of df1
def unique_val(column):
   return df1[column].unique()
  

In [None]:
list_variables=list(df1.columns)                            # collecting the variable names in a list
lst=[]                                                      # empty list to store the unique values of each variable
for i in list_variables:
  lst.append(unique_val(i))                                 # appending the unique values to lst
for j in range(0,len(list_variables)):
  print(f"UNIQUE VALUES OF {list_variables[j]} : {lst[j]}")  # printing the unique values of each variable

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# changing float data type to int64
df1[['children','agent','company']]=df1[['children','agent','company']].astype('int64')

In [None]:
#adding two important columns to the data set (total_stay, total_people)
df1['total_stay']=df1['stays_in_week_nights']+df1['stays_in_weekend_nights']  #adding total stay column
df1['total_people']=df1['children']+df1['adults']+df1['babies']               #adding total people column

In [None]:
df1.describe()

**Detecting and removing outliers from the data set**
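
The cells in this section use fixed cut-offs read off the box plots (lead_time above 365, adr below 0 or above 400). As a hedged alternative, the 1.5 × IQR rule derives the cut-offs from the data itself. The sketch below uses made-up values; `iqr_bounds` is a hypothetical helper and is not part of this notebook's cleaning steps.

```python
import pandas as pd

def iqr_bounds(values: pd.Series, k: float = 1.5):
    """Hypothetical helper: lower and upper fences of the k * IQR rule."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return float(q1 - k * iqr), float(q3 + k * iqr)

# Made-up lead times with one extreme value
toy_lead_time = pd.Series([0, 10, 20, 30, 40, 50, 60, 700], name="lead_time")

low, high = iqr_bounds(toy_lead_time)

# Keep only the values inside the fences (both ends inclusive)
kept = toy_lead_time[toy_lead_time.between(low, high)]

print("fences:", low, high)
print("kept values:", kept.tolist())
```

The IQR rule can be quite aggressive on skewed columns such as lead_time, so a domain-based cut-off like the one-year limit used here may still be the more defensible choice.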

In [None]:
plt.figure(figsize=(20,10))
sns.boxplot(x=df1['lead_time'])  # pass the column as x explicitly for a horizontal box plot

In [None]:
#removing outlier from lead_time
indices_to_drop=df1[df1['lead_time']>365].index
df1.drop(indices_to_drop,inplace=True)

In [None]:
plt.figure(figsize=(20,10))
sns.boxplot(x=df1['adr'])  # adr on the x axis, so that plt.xlim below applies to it
plt.xlim(0,600)

In [None]:
#removing outlier from adr
indices_to_drop=df1[(df1['adr']<0) | (df1['adr']>400)].index
df1.drop(indices_to_drop,inplace=True)

In [None]:
sns.boxplot(x=df1['total_stay'])

In [None]:
#removing rows with no guests (adults + babies + children == 0)
df1.drop(df1[df1['adults']+df1['babies']+df1['children'] == 0].index, inplace = True)

### What all manipulations have you done and insights you found?

1.   Created a copy of the data frame (df1).
2.   Identified the duplicate rows and dropped them.
3.   Replaced null values with specific values ('other' for country, 0 for company, agent and children).
4.   Changed the float64 columns children, agent and company to int64.
5.   Added two important columns to the data set (total_stay, total_people).
6.   Identified outliers and removed them from df1.
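
The steps above can be sketched as a single function so that the whole cleaning pass is repeatable. This is a minimal sketch under stated assumptions: `clean_bookings` and the toy rows are hypothetical, and the thresholds (365 days, adr between 0 and 400) are the ones used in the cells above.

```python
import numpy as np
import pandas as pd

def clean_bookings(raw: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical wrapper around the wrangling steps listed above."""
    df = raw.drop_duplicates().copy()                      # steps 1-2: de-duplicated copy
    df["country"] = df["country"].fillna("other")          # step 3: fill nulls
    id_cols = ["company", "agent", "children"]
    df[id_cols] = df[id_cols].fillna(0)
    df = df.astype({col: "int64" for col in id_cols})      # step 4: float -> int
    df["total_stay"] = df["stays_in_week_nights"] + df["stays_in_weekend_nights"]  # step 5
    df["total_people"] = df["adults"] + df["children"] + df["babies"]
    keep = (
        (df["total_people"] > 0)          # drop bookings with no guests
        & (df["lead_time"] <= 365)        # step 6: lead_time outliers
        & df["adr"].between(0, 400)       # step 6: adr outliers
    )
    return df[keep].reset_index(drop=True)

# Made-up rows: a valid booking, its duplicate, a booking with nulls,
# a lead_time outlier and a booking with zero guests.
toy_raw = pd.DataFrame({
    "country": ["PRT", "PRT", None, "GBR", "FRA"],
    "company": [np.nan, np.nan, np.nan, 40.0, np.nan],
    "agent": [9.0, 9.0, np.nan, np.nan, 9.0],
    "children": [0.0, 0.0, np.nan, 1.0, 0.0],
    "adults": [2, 2, 1, 2, 0],
    "babies": [0, 0, 0, 0, 0],
    "stays_in_week_nights": [2, 2, 5, 1, 1],
    "stays_in_weekend_nights": [1, 1, 2, 0, 0],
    "lead_time": [30, 30, 10, 500, 5],
    "adr": [100.0, 100.0, 80.0, 90.0, 60.0],
})

cleaned = clean_bookings(toy_raw)
print(cleaned[["country", "agent", "total_stay", "total_people"]])
```

Keeping the steps in one function leaves the raw frame untouched and makes it easier to re-run the notebook from top to bottom without relying on `inplace=True` side effects.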



## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

## Q1.  Which agent makes the most bookings?

In [None]:
# Chart - 1 visualization code 
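# Creating a dataframe of booking counts per agent
# (rename_axis + reset_index(name=...) gives the same column names on older and newer pandas versions)
d1 = df1['agent'].value_counts().rename_axis('agent').reset_index(name='num_of_bookings')
# dropping agent 0, which stands for bookings without an agent (nulls were filled with 0)
d1.drop(d1[d1['agent']==0].index,inplace=True)
# top 10 agents (value_counts already sorts in descending order)
d1=d1[:10]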

plt.figure(figsize=(10,5))
sns.barplot(x='agent',y='num_of_bookings',data=d1,order=d1.sort_values(by='num_of_bookings',ascending=False).agent)

##### 1. Why did you pick the specific chart?

A bar plot is a useful way to visualize the distribution of a categorical variable and to compare the booking frequency of different agents.

##### 2. What is/are the insight(s) found from the chart?

From the chart we can clearly see that agent "9" made the most bookings: around 30000 bookings were made by agent 9.

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

Knowing the number of bookings made by each agent helps us identify the most experienced agents with a good booking rate.

Such agents can deal with customers easily and provide rooms according to their requirements.

We can offer some perks to that agent as an incentive, which also encourages healthy competition among the agents.

#### Chart - 2

## Q2. Which room type is in most demand and which room type generates the highest adr?

In [None]:
# Chart - 2 visualization code
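# number of bookings per assigned room type
# (stored in its own variable instead of being written into the agent dataframe d1)
grp_by_room=df1.groupby('assigned_room_type')
room_counts=grp_by_room.size()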
fig, axes = plt.subplots(1, 2, figsize=(18, 8))
sns.countplot(ax = axes[0], x = df1['assigned_room_type'])
sns.boxplot(ax = axes[1], x = df1['assigned_room_type'], y = df1['adr'])


##### 1. Why did you pick the specific chart?

A count plot is best for comparing counts across different categories.
Second, we use a box plot for comparing adr across different room types.

##### 2. What is/are the insight(s) found from the chart?

From the count plot we can clearly say that type "A" rooms are in higher demand than the others.

But from the box plot, the room type that generates the highest adr is type "H".

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

Hotels could increase the number of type A and type H rooms to maximise revenue.
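
To back the room-type reading with numbers rather than a visual impression, demand and rate can be summarised side by side. The sketch below uses made-up rows; `room_summary`, `most_booked` and `highest_adr` are names introduced here for illustration.

```python
import pandas as pd

# Made-up bookings for three room types
toy = pd.DataFrame({
    "assigned_room_type": ["A", "A", "A", "D", "D", "H"],
    "adr": [80.0, 90.0, 100.0, 110.0, 130.0, 200.0],
})

# One row per room type: how often it is booked and its median adr
room_summary = (
    toy.groupby("assigned_room_type")["adr"]
    .agg(num_of_bookings="count", median_adr="median")
    .sort_values("num_of_bookings", ascending=False)
)

most_booked = room_summary["num_of_bookings"].idxmax()
highest_adr = room_summary["median_adr"].idxmax()

print(room_summary)
print("most booked:", most_booked, "| highest median adr:", highest_adr)
```

A room type with a high median adr but very few bookings is weaker evidence than one with many bookings, so the count column is worth reading together with the rate.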

#### Chart - 3

## Q3) Which meal type is most preferred by customers?

In [None]:
# Chart - 3 visualization code
plt.figure(figsize=(10,8))
sns.countplot(x=df1['meal'])  # pass the column as x explicitly
plt.show()

##### 1. Why did you pick the specific chart?

A count plot is a useful way to visualize the distribution of a categorical variable and to compare the frequency of different meals.

##### 2. What is/are the insight(s) found from the chart?

The BB meal type is the most preferred meal among customers.

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

Always keep the BB meal type available, because it is the customers' most preferred meal.

#### Chart - 4

##  Q4) What is percentage of bookings in each hotel?

In [None]:
# Chart - 4 visualization code
grouped_by_hotel=df1.groupby('hotel')
d2 = pd.DataFrame((grouped_by_hotel.size()/df1.shape[0])*100).reset_index().rename(columns = {0:'Booking %'})
plt.figure(figsize=(8,5))
plt.pie(d2['Booking %'], labels=d2['hotel'], autopct='%.0f%%')
plt.show()

##### 1. Why did you pick the specific chart?

Pie charts work well for analysing percentage values.

##### 2. What is/are the insight(s) found from the chart?

From the chart we can clearly see that the City hotel has more bookings than the Resort hotel.

Around 61% of bookings are City hotel bookings.
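
The percentage shown on the pie chart can also be read directly as a table. A minimal sketch on made-up rows follows; `booking_share` is a name introduced here.

```python
import pandas as pd

# Made-up bookings: three for the City hotel, two for the Resort hotel
toy = pd.DataFrame({"hotel": ["City Hotel"] * 3 + ["Resort Hotel"] * 2})

# normalize=True returns fractions, which are scaled to percentages here
booking_share = toy["hotel"].value_counts(normalize=True).mul(100).round(1)

print(booking_share)
```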

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

Given this preference, we can suggest that our client invest more in the City hotel to gain more profit.

#### Chart - 5

## Q5) Which hotel has higher lead time?

In [None]:
# Chart - 5 visualization code
# median lead time per hotel
d3=grouped_by_hotel['lead_time'].median().reset_index().rename(columns={'lead_time':'median_lead_time'})
plt.figure(figsize = (8,5))
sns.barplot(x = d3['hotel'], y = d3['median_lead_time'] )

##### 1. Why did you pick the specific chart?

A bar plot is a useful way to compare a single value between two entities.

##### 2. What is/are the insight(s) found from the chart?

City hotel has slightly higher median lead time.

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

The median lead time is high for both hotels, which means customers generally plan their hotel visits well in advance.

#### Chart - 6

## Q6) What is preferred stay length in each hotel?

In [None]:
# Chart - 6 visualization code
not_canceled = df1[df1['is_canceled'] == 0]
s1 = not_canceled[not_canceled['total_stay'] < 15]
plt.figure(figsize = (10,5))
sns.countplot(x = s1['total_stay'], hue = s1['hotel'])
plt.show()

##### 1. Why did you pick the specific chart?

A count plot with a hue shows how often each stay length occurs in each hotel, which makes the two hotels easy to compare.

##### 2. What is/are the insight(s) found from the chart?

Most common stay length is less than 4 days.

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

Generally people prefer the City hotel for short stays, but for long stays the Resort hotel is preferred.
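
Because the two hotels have different booking volumes, raw counts can hide this pattern; comparing the share of each stay-length bucket within each hotel is one hedged way to check it. The bucket edges and the names `stay_bucket` and `bucket_share` below are assumptions chosen for illustration, and the rows are made up.

```python
import pandas as pd

# Made-up stay lengths for both hotels
toy = pd.DataFrame({
    "hotel": ["City Hotel"] * 4 + ["Resort Hotel"] * 4,
    "total_stay": [1, 2, 3, 8, 2, 7, 7, 10],
})

# Right-closed bins: (0, 3], (3, 7], (7, 14]; stays of 0 nights fall outside and become NaN
toy["stay_bucket"] = pd.cut(
    toy["total_stay"],
    bins=[0, 3, 7, 14],
    labels=["1-3 nights", "4-7 nights", "8-14 nights"],
)

# Percentage of each hotel's bookings that falls into each bucket (rows sum to 100)
bucket_share = (
    pd.crosstab(toy["hotel"], toy["stay_bucket"], normalize="index")
    .mul(100)
    .round(1)
)

print(bucket_share)
```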

#### Chart - 7

## Q7) Which is the most common channel for booking hotels and which channel is mostly used for early booking of hotels?

In [None]:
# Chart - 7 visualization code
# For most common channel for booking hotels
group_by_dc = df1.groupby('distribution_channel')
d1 = pd.DataFrame(round((group_by_dc.size()/df1.shape[0])*100,2)).reset_index().rename(columns = {0: 'Booking_%'})
plt.figure(figsize = (8,8))
data = d1['Booking_%']
labels = d1['distribution_channel']
# explode must have one entry per slice, so its length follows the number of channels in d1
plt.pie(x=data, autopct="%.2f%%", explode=[0.05]*len(d1), labels=labels, pctdistance=0.5)
plt.title("Booking % by distribution channels", fontsize=14);

In [None]:
# For which channel is mostly used for early booking of hotels
group_by_dc = df1.groupby('distribution_channel')
d2 = pd.DataFrame(round(group_by_dc['lead_time'].median(),2)).reset_index().rename(columns = {'lead_time': 'median_lead_time'})
plt.figure(figsize = (7,5))
sns.barplot(x = d2['distribution_channel'], y = d2['median_lead_time'])
plt.show()

##### 1. Why did you pick the specific chart?


Pie charts work well for comparing percentage shares, and a bar plot is a simple way to compare the median lead time across channels.

##### 2. What is/are the insight(s) found from the chart?

TA/TO is mostly used for planning Hotel visits ahead of time. But for sudden visits other mediums are most preferred.

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

TA/TO is used for advance bookings, so the hotel gets more time to prepare according to the customers' needs.

#### Chart - 8

## Q8) Which channel has longer average waiting time and  which distribution channel brings better revenue generating deals for hotels?

In [None]:
# Chart - 8 visualization code
# For which channel has longer average waiting time
d4 = pd.DataFrame(round((group_by_dc['days_in_waiting_list']).mean(),2)).reset_index().rename(columns = {'days_in_waiting_list': 'avg_waiting_time'})
plt.figure(figsize = (7,5))
sns.barplot(x = d4['distribution_channel'], y = d4['avg_waiting_time'])
plt.show()

In [None]:
# For which distribution channel brings better revenue generating deals for hotels
group_by_dc_hotel = df1.groupby(['distribution_channel', 'hotel'])
d5 = pd.DataFrame(round((group_by_dc_hotel['adr']).mean(),2)).reset_index().rename(columns = {'adr': 'avg_adr'})  # .mean() avoids the deprecation warning for agg(np.mean) on newer pandas
plt.figure(figsize = (7,5))
sns.barplot(x = d5['distribution_channel'], y = d5['avg_adr'], hue = d5['hotel'])
plt.ylim(40,140)
plt.show()

##### 1. Why did you pick the specific chart?

For this data, bar plots are a simple and useful way to compare averages across distribution channels.


##### 2. What is/are the insight(s) found from the chart?

When booking via TA/TO, one may have to wait a little longer for the room booking to be confirmed.


The GDS channel brings higher revenue-generating deals for the City hotel, even though most bookings come via TA/TO.

The Resort hotel gets more revenue-generating deals through the Direct and TA/TO channels.

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

TA/TO has the longest waiting time, so we should reduce it to improve the customer experience.

City Hotel can work to increase outreach on GDS channels to get more high revenue-generating deals.

Resort Hotel needs to increase outreach on the GDS channel to increase revenue.
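
The channel comparison can also be laid out as a table, which makes gaps visible (for example a channel with no bookings for one hotel). The sketch below uses made-up rows; `avg_adr` and `best_channel` are names introduced here.

```python
import pandas as pd

# Made-up bookings; note that GDS has no Resort hotel booking in this toy data
toy = pd.DataFrame({
    "distribution_channel": ["TA/TO", "TA/TO", "Direct", "Direct", "GDS", "TA/TO"],
    "hotel": ["City Hotel", "Resort Hotel", "City Hotel", "Resort Hotel", "City Hotel", "City Hotel"],
    "adr": [100.0, 90.0, 110.0, 95.0, 120.0, 110.0],
})

# Rows: channels, columns: hotels, cells: average adr (NaN where no booking exists)
avg_adr = toy.pivot_table(
    index="distribution_channel",
    columns="hotel",
    values="adr",
    aggfunc="mean",
).round(2)

# Channel with the highest average adr for each hotel (NaN cells are skipped)
best_channel = avg_adr.idxmax()

print(avg_adr)
print(best_channel)
```

An average built from very few bookings is unstable, so adding a booking count per cell before acting on the ranking would be a sensible next check.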

#### Chart - 9

## Q9. What is the adr per person month wise for different hotel types?


In [None]:
# Chart - 9 visualization code
reindex = ['January', 'February','March','April','May','June','July','August','September','October','November','December']
df1['arrival_date_month'] = pd.Categorical(df1['arrival_date_month'],categories=reindex,ordered=True)
plt.figure(figsize = (15,8))
sns.boxplot(x = df1['arrival_date_month'],y = df1['adr'])
plt.show()

##### 1. Why did you pick the specific chart?

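A box plot shows the spread of adr within each month, so seasonal changes in both the typical rate and the extreme values are visible.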

##### 2. What is/are the insight(s) found from the chart?

Average adr rises from the beginning of the year up to the middle of the year, peaks in August, and then falls towards the end of the year. However, hotels still make some good deals with high adr at the end of the year.
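
The question asks for adr per person and per hotel type, while the box plot shows adr per booking for both hotels together. A minimal sketch of the per-person view follows, using made-up rows; `adr_per_person` and `monthly` are names introduced here, and `total_people` is assumed to be non-zero, as it is after the cleaning steps in this notebook.

```python
import pandas as pd

month_order = ["January", "February", "March", "April", "May", "June", "July",
               "August", "September", "October", "November", "December"]

# Made-up bookings for two months and two hotels
toy = pd.DataFrame({
    "hotel": ["City Hotel", "City Hotel", "Resort Hotel", "Resort Hotel"],
    "arrival_date_month": ["August", "January", "August", "January"],
    "adr": [200.0, 90.0, 300.0, 60.0],
    "total_people": [2, 1, 3, 2],
})

# Ordered categorical so that months sort by calendar order, not alphabetically
toy["arrival_date_month"] = pd.Categorical(
    toy["arrival_date_month"], categories=month_order, ordered=True
)

# Rate per guest; assumes total_people > 0
toy["adr_per_person"] = toy["adr"] / toy["total_people"]

# Average adr per person: one row per month, one column per hotel
monthly = (
    toy.groupby(["arrival_date_month", "hotel"], observed=True)["adr_per_person"]
    .mean()
    .unstack("hotel")
)

print(monthly)
```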

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 10

## Q10. How does the length of a customer's stay vary by market segment and hotel type?

In [None]:
plt.figure(figsize=(20,10))
sns.boxplot(x='hotel', y='total_stay', hue='market_segment', data=df1)

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 11

## Q11) What is the number of arrivals by day of the month?

In [None]:
# Chart - 11 visualization code
plt.figure(figsize=(15,6))

sns.countplot(data = df1, x = 'arrival_date_day_of_month', hue='hotel', palette='Paired')
plt.show()

##### 1. Why did you pick the specific chart?

A count plot works best for this type of case.

##### 2. What is/are the insight(s) found from the chart?

At the start, middle and end of the month, arrivals are lower compared to other days.

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

Resort hotel arrivals are almost the same on every day, but City hotel arrivals vary more.

#### Chart - 12

## Q12) How long do people stay at the hotels?

In [None]:
# Chart - 12 visualization code
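# keeping only bookings that were not canceled
# (.copy() avoids a SettingWithCopyWarning when columns are added later; the name no longer shadows the built-in filter)
not_canceled_mask = df1['is_canceled'] == 0
data = df1[not_canceled_mask].copy()
data.head()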

In [None]:
data['total_nights'] = data['stays_in_weekend_nights'] + data['stays_in_week_nights']
data.head()

In [None]:
# counting bookings per stay length and hotel
# (size() gives the count directly instead of relying on column positions)
stay = data.groupby(['total_nights', 'hotel']).size().reset_index(name='Number of stays')
stay

In [None]:
plt.figure(figsize = (10,5))
sns.barplot(x = 'total_nights', y = 'Number of stays',data= stay,hue='hotel')

##### 1. Why did you pick the specific chart?

A bar plot works well for comparing two groups (the hotels) across one variable (stay length).

##### 2. What is/are the insight(s) found from the chart?

Most people prefer to stay at the hotels for 5 days or fewer.

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

Stay length is higher in the Resort hotel than in the City hotel, so we can say that long-stay customers prefer the Resort hotel.

#### Chart - 13

##  Q13) Where do most guests come from?

In [None]:
# Chart - 13 visualization code
country_wise_guests = df1[df1['is_canceled'] == 0]['country'].value_counts().reset_index()
country_wise_guests.columns = ['country', 'No of guests']
country_wise_guests

In [None]:
grouped_by_country = df1.groupby('country')
d1 = pd.DataFrame(grouped_by_country.size()).reset_index().rename(columns = {0:'Count'}).sort_values('Count', ascending = False)[:10]
sns.barplot(x = d1['country'], y  = d1['Count'])
plt.show()

##### 1. What is/are the insight(s) found from the chart?

Most guests are from Portugal and other European countries.

##### 2. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 14 - Correlation Heatmap

In [None]:
# Correlation Heatmap visualization code
num_df1=df1[['lead_time','previous_cancellations','previous_bookings_not_canceled','booking_changes','days_in_waiting_list','adr','required_car_parking_spaces','total_of_special_requests','total_stay','total_people']]
#Correlation matrix
corrmat = num_df1.corr()
#Setting chart size 
f, ax = plt.subplots(figsize=(12, 7))
# code for heatmap chart
sns.heatmap(corrmat,annot=True,fmt='.2f',annot_kws={'size':10},vmax=.8,square=True)

##### 1. Why did you pick the specific chart?

A heatmap is a good way to analyse the correlation between variables.

##### 2. What is/are the insight(s) found from the chart?

*   Total stay length and lead time are slightly correlated. This may mean that people generally plan longer hotel stays a little further ahead of the actual arrival.
*   adr is slightly correlated with total_people, which makes sense: more people means more revenue and therefore a higher adr.
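
Reading pairs off a heatmap by eye gets harder as the number of columns grows. One hedged way to list the strongest pairs is to keep the upper triangle of the correlation matrix and sort it by absolute value. The sketch below uses made-up numbers; `upper` and `pairs` are names introduced here.

```python
import numpy as np
import pandas as pd

# Made-up numeric columns
toy = pd.DataFrame({
    "lead_time": [10, 20, 30, 40, 50],
    "total_stay": [1, 2, 2, 4, 5],
    "adr": [100.0, 80.0, 120.0, 90.0, 110.0],
})

corr = toy.corr()

# Keep only the cells above the diagonal so each pair appears once
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
upper = corr.where(mask)

# Long format: one row per pair, strongest absolute correlation first
pairs = upper.stack().dropna().sort_values(key=abs, ascending=False)

print(pairs)
```

Applied to `corrmat` from the heatmap cell, the same three lines would rank every pair of numerical variables in the notebook.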







#### Chart - 15 - Pair Plot 

In [None]:
sns.pairplot(num_df1,kind="hist")

##### 1. What is/are the insight(s) found from the chart?

The pair plot shows the pairwise relationships between all the numerical variables at a glance.

## **5. Solution to Business Objective**

#### What do you suggest the client to achieve Business Objective ? 
Explain Briefly.

Answer Here.

# **Conclusion**

Write the conclusion here.

### ***Hurrah! You have successfully completed your EDA Capstone Project !!!***