<a href="https://colab.research.google.com/github/vishwapv/hotel-booking-analysis/blob/main/final_Copy_of_Sample_EDA_Submission_individual_Template.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name**    - 



##### **Project Type**    - EDA
##### **Contribution**    - Individual


# **Project Summary -**

The dataset contains hotel booking data for two hotels: a resort hotel and a city hotel. It has 31 variables describing 40,060 observations for the resort hotel and 79,330 observations for the city hotel. Each observation represents a hotel booking. The dataset covers bookings due to arrive between 1 July 2015 and 31 August 2017, including bookings that effectively arrived and bookings that were canceled. Since this is real hotel data, all data elements pertaining to hotel or customer identification were deleted. The problem statement was to identify what impacts booking cancellation, which countries most guests come from, who made the bookings, and whether customers repeat their bookings. The first step in the analysis involved taking an initial look at the data, checking for missing (null) values, and handling them.

The second step involved analyzing the numerical features with visualization techniques such as heatmaps, distribution plots, bar graphs, boxplots, and pie charts, finding the correlation between variables, and identifying the features that had the most impact on booking cancellations. The third step involved analyzing categorical variables such as hotel, arrival_date_month, country, reserved_room_type, reservation_status, deposit_type, distribution_channel, and market_segment to find any underlying pattern that affects the cancellation rate. The final step was to note down the insights developed during the analysis. Some of the observations drawn were: an increase in lead time increases the rate of booking cancellation; an increase in ADR also increases the rate of booking cancellation; a non-refund deposit policy also increases the rate of booking cancellation; the majority of guests are from Western Europe; mostly couples booked the hotels; and the majority of customers do not repeat their bookings.



# **GitHub Link -**

https://github.com/vishwapv

# **Problem Statement**


The problem is to analyze hotel data using data science techniques in order to gain valuable insights and make data-driven decisions to improve various aspects of hotel operations and customer experience. The dataset includes information such as customer demographics, booking details, room types, pricing, customer reviews, and other relevant variables.

** **

#### **Define Your Business Objective?**

Have you ever wondered when the best time of year to book a hotel room is? Or the optimal length of stay in order to get the best daily rate? What if you wanted to predict whether or not a hotel was likely to receive a disproportionately high number of special requests? This hotel booking dataset can help you explore those questions! This data set contains booking information for a city hotel and a resort hotel, and includes information such as when the booking was made, length of stay, the number of adults, children, and/or babies, and the number of available parking spaces, among other things. All personally identifying information has been removed from the data. Explore and analyze the data to discover important factors that govern the bookings.

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required. 
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits. 
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure that for each chart the following format is answered.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 20 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following the "UBM" Rule. 

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

### Dataset Loading

In [None]:

from google.colab import drive
drive.mount('/content/drive')

In [None]:
# Load Dataset
df =pd.read_csv('/content/drive/MyDrive/Colab Notebooks/module/data visualization/Hotel Bookings.csv')

### Dataset First View

In [None]:
# Dataset First Look
df.head(10)

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
df.shape

### Dataset Information

In [None]:
# Dataset Info
df.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
len(df[df.duplicated()])

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
print(df.isnull().sum())

In [None]:
# Visualizing the missing values
# checking null values by plotting heat map
fig,axes = plt.subplots(1,1,figsize=(20,10))
sns.heatmap(df.isna())
plt.show()

### What did you know about your dataset?

From this data we learned that the columns 'country', 'agent', and 'company' have missing values. The heatmap makes the missing (null) values clearly visible.

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
df.columns

In [None]:
# Dataset Describe
df.describe(include='all')

### Variables Description 

- hotel - Type of hotel (resort hotel or city hotel)

- is_canceled - 1 if the booking was canceled, 0 if it was not canceled

- lead_time - Time (in days) between booking and check-in

- arrival_date_year - Year in which the customer arrived at the hotel

- arrival_date_month - Month in which the customer arrived at the hotel

- arrival_date_week_number - Week of the year in which the customer arrived at the hotel

- arrival_date_day_of_month - Day of the month on which the customer arrived at the hotel

- stays_in_weekend_nights - Number of weekend nights the customer stayed

- stays_in_week_nights - Number of week nights the customer stayed

- adults - Number of adults in the booking

- deposit_type - Indicates whether the customer made a deposit to guarantee the booking. Three categories: No-deposit, Non-Refund, Refundable

- adr - Average Daily Rate, defined as the average rental revenue earned for an occupied room per day.

- required_car_parking_spaces - Number of car parking spaces required by the customer.

- previous_cancellations - Number of previous bookings that were canceled by the customer prior to the current booking.

- reserved_room_type - Code of the room type reserved

- reservation_status_date - Date at which the last status was set. This variable can be used in conjunction with reservation_status to understand when the booking was canceled or when the customer checked out of the hotel


### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
for i in df.columns.tolist():
  print("number of unique value in ",i,"is",df[i].nunique())


## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Write your code to make your dataset analysis ready.
# checking for null values in each columns
df.isnull().sum().sort_values(ascending=False)

In [None]:
# percentage of null values in each columns
print(100*(df.isnull().sum()/len(df.index)).sort_values(ascending=False))

### What all manipulations have you done and insights you found?

**1. The columns that contain null values are 'agent', 'company', 'children', and 'country'.**

**2. About 94% of the 'company' column and 13% of the 'agent' column are null.**
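
The count and percentage checks above can be combined into a single table. The following is a minimal sketch, not the notebook's own method: `missing_summary` is a hypothetical helper and `sample` is a made-up frame standing in for `df`.

```python
import pandas as pd

def missing_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Count and percentage of nulls per column, for columns that have any."""
    counts = frame.isnull().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(frame) * 100).round(2),
    })
    # Keep only columns with at least one null, worst first
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)

# Made-up sample (assumption): not the real hotel data
sample = pd.DataFrame({
    "company": [None, None, None, 1.0],
    "agent": [9.0, None, 3.0, 1.0],
    "adults": [2, 2, 1, 2],
})
summary = missing_summary(sample)
print(summary)
```

On the real data the same call would be `missing_summary(df)`.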

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

In [None]:
#creating a copy of data set
hotel_df = df.copy()

In [None]:
# Replacing null values of column agent and company with 0.
hotel_df[['agent','company']]=hotel_df[['agent','company']].fillna(0.0)

#### Chart - 1

In [None]:
# Chart - 1 visualization code
# Heat map of the remaining missing values
fig, axes = plt.subplots(1, 1, figsize=(20, 10))
sns.heatmap(hotel_df.isna(), ax=axes)
plt.show()

##### 1. Why did you pick the specific chart?

I picked this chart because a heatmap of the null values shows at a glance which columns still contain missing values.

##### 2. What is/are the insight(s) found from the chart?

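I found that the 'agent' and 'company' columns no longer show missing values after the fill. The few remaining nulls in 'children' and 'country' are handled in the cells that follow.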

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

If there are no missing values in the data, it is easier to analyse in the later steps.

#### Chart - 2

In [None]:
# Replace missing values in 'children' with the rounded mean, since the column holds a count of children
hotel_df['children'] = hotel_df['children'].fillna(round(hotel_df['children'].mean()))

In [None]:
# Replace missing values in 'country' with the mode (the most frequent country code)
hotel_df['country'] = hotel_df['country'].fillna(hotel_df['country'].mode()[0])

In [None]:
# Drop rows where adults, children and babies add up to 0 (bookings with no guests)
hotel_df = hotel_df.drop(hotel_df[(hotel_df.adults + hotel_df.children + hotel_df.babies) == 0].index)

In [None]:
# lets check the shape of the data frame
hotel_df.shape

In [None]:
# Chart - 2 visualization code
# Correlation heat map of the numeric columns
plt.figure(figsize=(20,10))
sns.heatmap(hotel_df.corr(numeric_only=True), cmap="coolwarm", annot=True)
plt.show()

##### 1. Why did you pick the specific chart?

To find the correlation between the variables, using a seaborn heatmap.

##### 2. What is/are the insight(s) found from the chart?

'is_canceled' is more positively correlated with 'lead_time' than with any other feature.

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

Yes. The positive correlation between 'is_canceled' and 'lead_time' points to lead time as a factor to watch when managing cancellations.
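
Reading one row of a large heatmap can be error-prone, so it may help to list the correlations with `is_canceled` directly. The sketch below is only an illustration: `rank_correlations` is a hypothetical helper, and `sample` is a made-up frame standing in for `hotel_df`.

```python
import pandas as pd

def rank_correlations(frame: pd.DataFrame, target: str) -> pd.Series:
    """Pearson correlation of each numeric column with `target`, strongest first."""
    numeric = frame.select_dtypes(include="number")
    corr = numeric.corr()[target].drop(target)
    # Order by absolute strength but keep the sign of each correlation
    order = corr.abs().sort_values(ascending=False).index
    return corr.loc[order]

# Made-up sample (assumption): not the real hotel data
sample = pd.DataFrame({
    "is_canceled": [0, 0, 1, 1, 0, 1],
    "lead_time": [5, 20, 200, 150, 10, 300],
    "adults": [2, 1, 2, 2, 1, 2],
    "hotel": ["City Hotel"] * 3 + ["Resort Hotel"] * 3,
})
ranking = rank_correlations(sample, "is_canceled")
print(ranking)
```

Non-numeric columns such as `hotel` are skipped, which matches what the heatmap can display.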
 

#### Chart - 3

In [None]:
# Chart - 3 visualization code
# Pie plot to show types of hotels.

labels = hotel_df['hotel'].value_counts().index.tolist()
sizes = hotel_df['hotel'].value_counts().tolist()
explode = (0, 0.05)
colors = ['red', 'blue']

plt.pie(sizes, explode=explode, labels=labels, colors=colors, autopct='%1.1f%%',startangle =90, textprops={'fontsize': 14})
plt.show()

##### 1. Why did you pick the specific chart?

To check what percentage of customers chose each hotel.

##### 2. What is/are the insight(s) found from the chart?

Most of the customers chose the city hotel.

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, this helps predict a positive impact, since it shows which hotel attracts most of the bookings.

#### Chart - 4

In [None]:
# Chart - 4 visualization code
# Bar plot showing count of canceled and not canceled bookings in both hotels.

plt.figure(figsize=(10,5))
sns.set_theme(style="whitegrid")
ax = sns.countplot(x="is_canceled", hue ='hotel',data=hotel_df)

##### 1. Why did you pick the specific chart?

A bar plot makes it easy to compare the number of canceled and non-canceled bookings for each hotel.

##### 2. What is/are the insight(s) found from the chart?

Most of the cancellations are at the city hotel, where the count is more than 40,000.


##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

This points to a negative impact: most customers prefer the city hotel, yet it also accounts for most of the canceled bookings.
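
Because the city hotel also has more bookings overall, raw counts alone can overstate the difference, and a cancellation rate per hotel may be a fairer comparison. The following is a minimal sketch under that assumption: `cancellation_rate_by` is a hypothetical helper and `sample` is made-up data, not the real bookings.

```python
import pandas as pd

def cancellation_rate_by(frame: pd.DataFrame, column: str) -> pd.DataFrame:
    """Bookings, cancellations and cancellation rate (%) for each value of `column`."""
    summary = frame.groupby(column)["is_canceled"].agg(bookings="count", canceled="sum")
    summary["cancel_rate_pct"] = (summary["canceled"] / summary["bookings"] * 100).round(1)
    return summary.sort_values("cancel_rate_pct", ascending=False)

# Made-up sample (assumption): 5 city bookings and 4 resort bookings
sample = pd.DataFrame({
    "hotel": ["City Hotel"] * 5 + ["Resort Hotel"] * 4,
    "is_canceled": [1, 1, 0, 0, 0, 1, 0, 0, 0],
})
rates = cancellation_rate_by(sample, "hotel")
print(rates)
```

The same helper could be pointed at other categorical columns, for example `cancellation_rate_by(hotel_df, "deposit_type")`.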

#### Chart - 5

In [None]:
# Chart - 5 visualization code
# value_counts(normalize=True) gives the share of bookings for each month

hotel_df.arrival_date_month.value_counts(normalize=True)

In [None]:
# Number of non-canceled bookings per month (rename_axis/reset_index(name=...) works across pandas versions)
month_df = hotel_df[hotel_df['is_canceled']==0]['arrival_date_month'].value_counts().rename_axis('month').reset_index(name='number_of_bookings')

In [None]:
# Barplot of number of bookings in each month

plt.figure(figsize=(15,10))
ax = sns.barplot(x="month", y="number_of_bookings", data = month_df)

##### 1. Why did you pick the specific chart?

To check the number of bookings in each month.

##### 2. What is/are the insight(s) found from the chart?

We found that August has the most bookings.

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

From this graph we can conclude that August is the busiest month, so more offers can be provided during August to bring more customers to the hotel.

#### Chart - 6

In [None]:
month_hotel_type = hotel_df[hotel_df['is_canceled']==0].groupby(['arrival_date_month','hotel'])['hotel'].count().unstack()

In [None]:
# Chart - 6 visualization code
# Barplot of number of bookings in each month for both hotels.

ax = month_hotel_type.plot.bar(figsize = (15,10),fontsize = 14)

##### 1. Why did you pick the specific chart?

This is the same as the previous chart in a different format: the bookings per month are split by hotel.

##### 2. What is/are the insight(s) found from the chart?

It is found that August is the most occupied month with 11.65% of bookings and January is the least occupied month with 4.94% of bookings.

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, the higher number of bookings in August leads to business growth.
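
Month names sort alphabetically or by frequency by default, which can make seasonal patterns harder to read. As a hedged sketch, the monthly shares can be put in calendar order before plotting. `monthly_share` and `MONTH_ORDER` are hypothetical names, and `sample` is made-up data.

```python
import pandas as pd

MONTH_ORDER = [
    "January", "February", "March", "April", "May", "June",
    "July", "August", "September", "October", "November", "December",
]

def monthly_share(frame: pd.DataFrame, month_col: str = "arrival_date_month") -> pd.Series:
    """Percentage of bookings per month, in calendar order (0 for absent months)."""
    share = frame[month_col].value_counts(normalize=True)
    return (share.reindex(MONTH_ORDER, fill_value=0.0) * 100).round(2)

# Made-up sample (assumption): six bookings over three months
sample = pd.DataFrame({
    "arrival_date_month": ["August", "August", "August", "January", "July", "July"],
})
share = monthly_share(sample)
print(share)
```

Calling `share.plot.bar()` on the result would then draw the months from January to December.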

#### Chart - 7

In [None]:
# Chart - 7 visualization code
# Value counts of top 10 countries from where maximum number of bookings happened

top_10_countries = hotel_df[hotel_df['is_canceled']==0]['country'].value_counts()[:10]

In [None]:
top_10_countries = top_10_countries.rename_axis('country').reset_index(name='number_of_bookings')
# The percentage is each country's share within the top 10, not of all bookings
top_10_countries['percentage'] = (top_10_countries['number_of_bookings']/top_10_countries['number_of_bookings'].sum())*100

In [None]:
# Bar plot of top 10 countries

plt.figure(figsize=(15,10))
ax = sns.barplot(x="country", y="percentage", data=top_10_countries)

##### 1. Why did you pick the specific chart?

To check which countries account for the largest percentage of hotel bookings.


##### 2. What is/are the insight(s) found from the chart?

PRT (Portugal) accounts for more bookings than any other country.

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, we can provide more offers for 'PRT' so that more customers come from that country.

#### Chart - 8

In [None]:
# Chart - 8 visualization code
# First we will take only not cancelled booking values

not_canceled_data = hotel_df[hotel_df['is_canceled']==0]

In [None]:
# Total number of week night stay.

week_nights = not_canceled_data[not_canceled_data['hotel'] == 'Resort Hotel']['stays_in_week_nights'].sum()

In [None]:
# Total number of weekend night stay.

weekend_nights = not_canceled_data[not_canceled_data['hotel'] == 'Resort Hotel']['stays_in_weekend_nights'].sum()

In [None]:
# Bar plot showing week nights and weekend nights stay for Resort hotels.

plt.figure(figsize=(10,10))
plt.bar(x=['Week nights','Weekend nights'],height = [week_nights,weekend_nights], color = ['red','blue'])
plt.xlabel('Night stay')
plt.ylabel('Total nights stayed')
plt.title('Total week nights and weekend nights stayed at the Resort Hotel')
plt.show()

##### 1. Why did you pick the specific chart?

To check whether customers book rooms for week nights or for weekend nights.

##### 2. What is/are the insight(s) found from the chart?

Customers book more week nights than weekend nights.

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, there will be more profit from week nights.

#### Chart - 9

In [None]:
# Chart - 9 visualization code
# Use the not_canceled_data dataframe created earlier (non-canceled bookings only) and look at complementary bookings with an adr of 0.

not_canceled_data[(not_canceled_data['adr'] == 0) & (not_canceled_data['market_segment'] == 'Complementary')].head()

In [None]:
# Let's filter our copied dataset and remove anomalies (adr of 0 outside the Complementary segment).

hotel_df= hotel_df.drop(hotel_df[(hotel_df['adr'] == 0) & (hotel_df['market_segment'] != 'Complementary')].index)

In [None]:
# Let's check distribution of adr column.
# sns.distplot is deprecated; histplot with kde=True gives a similar plot.

plt.figure(figsize=(10,5))
ax = sns.histplot(hotel_df[hotel_df['is_canceled']==0]['adr'], kde=True)

##### 1. Why did you pick the specific chart?

I used this chart to check the distribution of the average daily rate (ADR).

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 10

In [None]:
# Chart - 10 visualization code
# Boxplot of adr column.

sns.set_theme(style="whitegrid")
ax = sns.boxplot(x=hotel_df[hotel_df['is_canceled']==0]['adr'])

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 11

In [None]:
# Chart - 11 visualization code
# Lineplot of the adr for different hotel types.

plt.figure(figsize=(20,10))
sns.lineplot(x='arrival_date_month', y='adr', hue='hotel', data= hotel_df)
plt.show()

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 12

In [None]:
# Chart - 12 visualization code
# Countplot of number of customers repeated their bookings.

plt.figure(figsize=(12,8))
sns.countplot(data = hotel_df, x = 'is_repeated_guest').set_title('Graph showing whether guest is repeated guest', fontsize = 20)
plt.show()

##### 1. Why did you pick the specific chart?

A bar chart (count plot) was chosen to compare the number of repeated and non-repeated guests.

##### 2. What is/are the insight(s) found from the chart?

We found that very few guests are repeated guests; most have not booked with the hotel before.

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 13

In [None]:
# Chart - 13 visualization code
# Filter the non-canceled bookings into individual, couple and family.

not_canceled = hotel_df[hotel_df['is_canceled'] == 0]
individual = not_canceled[(not_canceled['adults'] == 1) & (not_canceled['children'] == 0) & (not_canceled['babies'] == 0)]
couple = not_canceled[(not_canceled['adults'] == 2) & (not_canceled['children'] == 0) & (not_canceled['babies'] == 0)]
family = not_canceled[(not_canceled['adults'] + not_canceled['children'] + not_canceled['babies']) > 2]

In [None]:
# Shape of dataset containing only not cancelled bookings.

total_count = hotel_df[(hotel_df['is_canceled']==0)].shape[0]

In [None]:
# Calculating the percentage of bookings for each type of accommodation.

percentage = [round(len(item)/total_count * 100) for item in [individual,couple,family]]

In [None]:
types_of_accomodation = ['Individual','Couple','Family']

In [None]:
# Dictionary to store types of accommodation and their percentage of bookings.

dict(zip(types_of_accomodation,percentage))

In [None]:
# Creating dataframe

data = pd.DataFrame({'types_of_accomodation':types_of_accomodation,'percentage':percentage})
data

In [None]:
# Barplot of different types of accommodation.

plt.figure(figsize=(15,10))
ax = sns.barplot(x="types_of_accomodation", y="percentage", data = data)

##### 1. Why did you pick the specific chart?

To compare the share of bookings made by individuals, couples and families.

##### 2. What is/are the insight(s) found from the chart?

Couples make more bookings than individuals or families.

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here
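
The three filters used for this chart do not cover every booking: one adult travelling with one child, for instance, falls into none of them, so the percentages need not add up to 100. The sketch below labels each booking in one pass and keeps an explicit "Other" group. `guest_type` and the "Other" label are assumptions for illustration, and `sample` is made-up data.

```python
import numpy as np
import pandas as pd

def guest_type(frame: pd.DataFrame) -> pd.Series:
    """Label each booking as Individual, Couple, Family or Other."""
    adults = frame["adults"]
    kids = frame["children"] + frame["babies"]
    conditions = [
        (adults == 1) & (kids == 0),   # one adult, no children or babies
        (adults == 2) & (kids == 0),   # two adults, no children or babies
        (adults + kids) > 2,           # more than two guests in total
    ]
    labels = ["Individual", "Couple", "Family"]
    # np.select takes the first matching condition; unmatched rows get "Other"
    values = np.select(conditions, labels, default="Other")
    return pd.Series(values, index=frame.index, name="guest_type")

# Made-up sample (assumption): not the real hotel data
sample = pd.DataFrame({
    "adults": [1, 2, 2, 3, 1, 0],
    "children": [0, 0, 1, 0, 1, 2],
    "babies": [0, 0, 0, 0, 0, 0],
})
types = guest_type(sample)
print(types.value_counts(normalize=True) * 100)
```

Applied to the non-canceled bookings, `value_counts(normalize=True)` on the result gives shares that always sum to 100%.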

#### Chart - 14 - Market Segment Pie Chart

In [None]:
# Chart - 14 visualization code
# Total bookings per market segment.

plt.figure(figsize = (20,10))
segments=not_canceled_data["market_segment"].value_counts()

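# pie plot

# plt.pie returns the wedge patches, the label texts and the percentage texts
patches, texts, autotexts = plt.pie(segments,
             labels=segments.index,
             autopct='%1.1f%%',
             shadow=True, startangle=90
             )

# Pass only the wedge patches as legend handles
plt.legend(patches, segments.index, loc="best")
plt.show()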

##### 1. Why did you pick the specific chart?

To check the total bookings per market segment.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

#### Chart - 15 - Deposit Type Bar Plot

In [None]:
# Chart - 15 visualization code
# Barplot to show which deposit type affects cancellation more.

deposit_df = hotel_df.groupby('deposit_type')['is_canceled'].describe()
plt.figure(figsize=(12, 8))
sns.barplot(x=deposit_df.index, y=deposit_df["mean"].values * 100)
plt.title("Effect of deposit on cancelation", fontsize=16)
plt.xlabel("Deposit Type", fontsize=16)
plt.ylabel("Cancelations [%]", fontsize=16)

##### 1. Why did you pick the specific chart?

To check how the deposit type relates to the rate of booking cancellation.

##### 2. What is/are the insight(s) found from the chart?


We found that bookings with the Non-Refund deposit type have a higher cancellation rate.

## **5. Solution to Business Objective**

#### What do you suggest the client to achieve Business Objective ? 
Explain Briefly.

To achieve the business objective, advertising the hotels plays an important role, and for more profit it is most important to target the months between May and August. The majority of bookings are at the city hotel, so more focus should go to the city hotel. Showing good hospitality on a customer's first visit may improve the rate of repeat customers. The number of bookings is greater for couples, so more focus should go to special requests made by couples.

# **Conclusion**

1. The majority of bookings are for the city hotel.
2. Non-refund policies lead to higher cancellation rates.
3. Target the months from May to August. These are peak months due to the summer period.
4. The majority of guests are from Western Europe, so target this area for advertisements.
5. Since there are very few repeated guests, the focus should be on retaining customers after their first visit.
6. An increase in lead time increases the rate of cancellation.
7. An increase in ADR also increases the rate of cancellation.
8. Customers should book during the months from November to January, because in these months both hotels have a cheaper average daily rate.
9. The number of bookings is greater for couples, so the hotels should focus on special requests made by couples.
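
The link between lead time and cancellation in point 6 can be checked by grouping bookings into lead-time ranges and comparing the cancellation rate in each. This is a minimal sketch: the bin edges, the labels and the helper name `cancel_rate_by_lead_time` are assumptions chosen for illustration, and `sample` is made-up data.

```python
import pandas as pd

# Hypothetical lead-time ranges in days (assumption, not taken from the analysis)
LEAD_BINS = [0, 30, 90, 180, 365, float("inf")]
LEAD_LABELS = ["0-30", "31-90", "91-180", "181-365", "365+"]

def cancel_rate_by_lead_time(frame: pd.DataFrame) -> pd.Series:
    """Cancellation rate (%) for each lead-time range."""
    buckets = pd.cut(frame["lead_time"], bins=LEAD_BINS, labels=LEAD_LABELS, include_lowest=True)
    # The mean of a 0/1 column is the share of canceled bookings
    return (frame.groupby(buckets, observed=False)["is_canceled"].mean() * 100).round(1)

# Made-up sample (assumption): not the real hotel data
sample = pd.DataFrame({
    "lead_time": [0, 10, 25, 60, 80, 120, 200, 400],
    "is_canceled": [0, 0, 1, 0, 1, 1, 1, 1],
})
lead_rates = cancel_rate_by_lead_time(sample)
print(lead_rates)
```

If the rate rises from one range to the next on the real data, that would support the conclusion; a flat or irregular pattern would suggest the correlation is weaker than the heatmap implies.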

### ***Hurrah! You have successfully completed your EDA Capstone Project !!!***