# **Project Name**    - Hotel Booking Analysis



##### **Project Type**    - EDA
##### **Contribution**    - Individual

# **Project Summary -**

Nowadays people spend more money on their vacations, and a comfortable stay in a fully serviced hotel is crucial to enjoying them. It is therefore in the hotel's interest to know more about its customers so that it can serve them better. The dataset provided has 32 attributes and more than 1 lakh (100,000) observations for the year 2015. We analyse this dataset to extract insights that give the hotel management team a clearer understanding of the business and help them serve customers better.

# **GitHub Link -**

https://github.com/sambitpani-ds/Hotel-Booking-Analysis-AL.git

# **Problem Statement**


**Help the management team to understand the customer better based on the data.**

#### **Define Your Business Objective?**

A business is built on its customers and its crew, so strengthening the crew automatically strengthens the relationship between the two. The main objective of this project is to equip the crew with insights from the data that benefit the company and help generate more revenue.

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required. 
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits. 
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure the following format is answered for each and every chart.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 20 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following "UBM" Rule. 

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
import seaborn as sns
%matplotlib inline

### Dataset Loading

In [None]:
# Load Dataset
df = pd.read_csv('C:\\Users\\SAMBIT\\Desktop\\Python\\Hotel Bookings.csv')

### Dataset First View

In [None]:
# Dataset First Look
df.head()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
df.shape

### Dataset Information

In [None]:
# Dataset Info
df.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
df.duplicated().sum()

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
df.isna().sum()

In [None]:
# Visualizing the missing values
sns.heatmap(df.isnull(), cbar=False)

### What did you know about your dataset?

The dataset comes from a hotel booking company. It holds the booking details of customers from all over the world who reserved hotel rooms. From this data we can infer customer needs and satisfaction, which helps maintain a constant flow of guests to the hotel.
The dataset contains 119390 rows and 32 columns, of which 31994 rows are duplicates. The columns children, country, agent and company contain null values.
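
The row, duplicate and null counts quoted above can be gathered in one place with a small helper. This is only a minimal sketch: `summarize_quality` is a hypothetical helper name, and the tiny frame below is made-up illustrative data, not the hotel dataset.

```python
import numpy as np
import pandas as pd

def summarize_quality(frame):
    """Return row/column counts, duplicate-row count and per-column null counts."""
    null_counts = frame.isna().sum()
    return {
        "rows": frame.shape[0],
        "columns": frame.shape[1],
        "duplicate_rows": int(frame.duplicated().sum()),
        # keep only the columns that actually contain nulls
        "null_columns": null_counts[null_counts > 0].to_dict(),
    }

# Made-up illustrative data (not the hotel dataset)
toy = pd.DataFrame({
    "hotel": ["City Hotel", "City Hotel", "Resort Hotel", "City Hotel"],
    "children": [0.0, 0.0, np.nan, 1.0],
    "country": ["PRT", "PRT", None, "FRA"],
})
summary = summarize_quality(toy)
print(summary)
```

On the real data, calling the same helper on `df` right after loading should report the figures listed in this section.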

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
df.columns

In [None]:
# Dataset Describe
df.describe()

### Variables Description 

1. hotel : Type of hotel i.e. City or Resort
2. is_canceled : Booking cancelled or not
3. lead_time : Number of days between the booking date and the arrival date
4. arrival_date_year : arrival year
5. arrival_date_month : arrival month
6. arrival_date_week_number : arrival week number
7. arrival_date_day_of_month : arrival day
8. stays_in_weekend_nights : Weekend nights stay
9. stays_in_week_nights  : Week nights stay
10. adults : No of adult persons
11. children : No of children
12. babies : No of babies
13. meal : Kind of meal preferred
14. country : Guests are from which Country 
15. market_segment : Customer type segment
16. distribution_channel : Channel through which the customer booked, e.g. Corporate, Direct, TA/TO
17. is_repeated_guest : New guest or Previous guest
18. previous_cancellations : Previously cancelled count
19. previous_bookings_not_canceled : previous not cancelled count
20. reserved_room_type : Room type reserved
21. assigned_room_type : Room type assigned
22. booking_changes : Times of booking changes made
23. deposit_type : Deposit type
24. agent : Booked through agent
25. company : Company name
26. days_in_waiting_list : No of days in waiting list
27. customer_type : Type of customer
28. adr : Average Daily Rate
29. required_car_parking_spaces : Required car parking or not
30. total_of_special_requests : Additional special request
31. reservation_status : Status of reservation
32. reservation_status_date : Above status date

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
for i in df.columns.tolist():
  print("No. of unique values in ",i,"is",df[i].nunique(),".")

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Write your code to make your dataset analysis ready.
# 1. Removing Duplicates
df.drop_duplicates(inplace = True)
df

In [None]:
# 2. Filling null values with 0
df.fillna(0,inplace=True)
df.isna().sum()

In [None]:
# 3. Create two new columns for total person stayed and total days stay
df['total_days'] = df['stays_in_week_nights'] + df['stays_in_weekend_nights']
df['total_guests'] = df['babies'] + df['children'] + df['adults']

In [None]:
# 4. Check for valid stay
df.drop(df[df['total_days'] == 0].index,inplace=True)
df[df['total_days'] == 0]

In [None]:
# 5. Changing the datatype to appropriate
int_cols = ['children', 'company', 'agent', 'total_guests']
df[int_cols] = df[int_cols].astype('int64')
# Parse the reservation status date strings into datetime values
df['reservation_status_date'] = pd.to_datetime(df['reservation_status_date'], format='%Y-%m-%d')

### What all manipulations have you done and insights you found?

1. We removed the duplicate rows so that every booking in the data is unique.
2. We checked for null values and found them in children, company, agent and country. We set children to 0 on the assumption that an empty value means there were no children. We set agent, company and country to 0 so that missing values fall into a separate category.
3. We created two new columns, total_days and total_guests, which are useful later on.
4. We checked the validity of each stay and dropped the bookings with zero total nights (total_days equal to 0).
5. We changed the columns to appropriate data types.
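
The five steps above can also be sketched as a single function that returns a cleaned copy instead of modifying the frame in place. This is an illustrative sketch under assumptions: `prepare_bookings` is a hypothetical name, the toy frame is made-up data with only the columns the steps need, and only `children` is filled here to keep the example small.

```python
import numpy as np
import pandas as pd

def prepare_bookings(raw):
    """Apply the wrangling steps listed above and return a cleaned copy."""
    out = raw.drop_duplicates().copy()             # 1. remove duplicate rows
    out = out.fillna({"children": 0})              # 2. missing children -> 0
    # 3. derived columns
    out["total_days"] = out["stays_in_week_nights"] + out["stays_in_weekend_nights"]
    out["total_guests"] = out["adults"] + out["children"] + out["babies"]
    out = out[out["total_days"] > 0]               # 4. drop zero-night bookings
    # 5. appropriate dtypes
    out = out.astype({"children": "int64", "total_guests": "int64"})
    out["reservation_status_date"] = pd.to_datetime(
        out["reservation_status_date"], format="%Y-%m-%d"
    )
    return out.reset_index(drop=True)

# Made-up illustrative data (not the hotel dataset)
toy = pd.DataFrame({
    "stays_in_week_nights": [2, 2, 0, 3],
    "stays_in_weekend_nights": [1, 1, 0, 0],
    "adults": [2, 2, 1, 2],
    "children": [0.0, 0.0, 0.0, np.nan],
    "babies": [0, 0, 0, 1],
    "reservation_status_date": ["2015-07-01", "2015-07-01", "2015-07-02", "2015-07-03"],
})
clean = prepare_bookings(toy)
print(clean)
```

Returning a copy keeps the raw frame untouched, which makes it easier to re-run the notebook from any cell.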

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

In [None]:
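# Chart - 1 visualization code
plt.rcParams['figure.figsize'] = (15,5)
# Keep the months in calendar order so the line shows the trend through the year
month_order = ['January','February','March','April','May','June','July','August','September','October','November','December']
df_monthly_customers = df.groupby('arrival_date_month')['arrival_date_month'].count().reindex(month_order)
plt.plot(df_monthly_customers,'b--')
plt.title('Count of booking month wise')
plt.ylabel('Count of booking')
# Horizontal line at the average monthly booking count
plt.axhline(df_monthly_customers.mean(),color='r',label='{:5.0f}'.format(np.mean(df_monthly_customers)))
plt.legend(loc=0)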

##### 1. Why did you pick the specific chart?

This line chart shows during which part of the year more guests come to stay.

##### 2. What is/are the insight(s) found from the chart?

Guest arrivals start rising in March, keep increasing over the following months and peak in August. After that, the number of guests falls below average.

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

The crew can prepare inventory and management for these busy months. They will also need more staff to serve the larger pool of guests.

#### Chart - 2

In [None]:
# Chart - 2 visualization code
sns.scatterplot(y = 'adr', x = 'total_days', data = df[df['adr']<1000])
plt.axvline(df['total_days'].mean(),color='red',ls='--',label='{:5.0f}'.format(np.mean(df['total_days'])))
plt.axhline(df['adr'].mean(),color='green',ls='--',label='{:5.0f}'.format(np.mean(df['adr'])))
plt.legend(loc=0)

##### 1. Why did you pick the specific chart?

A scatter plot shows the average total length of stay and how the adr varies with the length of stay.

##### 2. What is/are the insight(s) found from the chart?

The chart shows that, on average, guests stay for about 4 days. Longer stays also tend to have a lower adr.

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

We can reduce the adr for stays longer than 5-6 days. It would also be better to have offers for full-week stays.
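
One way to check the "longer stays have a lower adr" reading numerically, rather than by eye, is to compare the mean adr across stay-length buckets. The sketch below uses made-up numbers, and the bucket edges are an assumption chosen only for illustration.

```python
import pandas as pd

# Made-up illustrative data (not the hotel dataset)
toy = pd.DataFrame({
    "total_days": [1, 2, 3, 4, 7, 8, 10, 14],
    "adr":        [120.0, 110.0, 100.0, 100.0, 80.0, 90.0, 70.0, 60.0],
})

# Bucket edges are an assumption; intervals are (0, 3], (3, 7], (7, 31]
bins = [0, 3, 7, 31]
labels = ["1-3 nights", "4-7 nights", "8+ nights"]
toy["stay_bucket"] = pd.cut(toy["total_days"], bins=bins, labels=labels)

# Mean adr per bucket
adr_by_bucket = toy.groupby("stay_bucket", observed=True)["adr"].mean()
print(adr_by_bucket)
```

If the mean adr in the real data falls from bucket to bucket in the same way, that would support offering full-week deals.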

#### Chart - 3

In [None]:
# Chart - 3 visualization code
sns.barplot(y = 'lead_time', x = 'total_days',hue='is_canceled' ,data = df[(df['lead_time']<650)&(df['total_days']<31)])
plt.axhline(df['lead_time'].mean(),color='red',ls='--',label='{:5.0f}'.format(np.mean(df['lead_time'])))
plt.legend(loc=0)

##### 1. Why did you pick the specific chart?

To compare the average lead time of bookings against the total number of days stayed.

##### 2. What is/are the insight(s) found from the chart?

The average lead time is about 80 days, which is a good amount of time, so the management team has to plan accordingly.

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

Lead time is low for one- or two-day bookings, so the hotel could charge more for such short stays.

#### Chart - 4

In [None]:
# Chart - 4 visualization code
df_adr_month = df.groupby('arrival_date_day_of_month')['adr'].mean()
df_guest_month = df.groupby('arrival_date_day_of_month')['total_guests'].mean()
fig,ax = plt.subplots(2,1)
ax[0].plot(df_adr_month)
ax[0].set_title('Day wise adr vs total persons')
ax[0].set_ylabel('adr')
ax[1].plot(df_guest_month)
ax[1].set_ylabel('total avg guests')

##### 1. Why did you pick the specific chart?

I chose this chart to check the adr and the average number of guests across the days of the month.

##### 2. What is/are the insight(s) found from the chart?

More guests arrive in the first and last weeks of the month. The adr is also high in the last week.

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

We can charge more during the first and last weeks of the month.

#### Chart - 5

In [None]:
# Chart - 5 visualization code
df_country = df.groupby('country')['total_guests'].sum().sort_values(ascending=False)
df_country[0:10].plot.bar()
plt.ylabel('Total customers')

##### 1. Why did you pick the specific chart?

It is a bar plot that shows which countries the most customers come from.

##### 2. What is/are the insight(s) found from the chart?

Portugal, the US, France and Spain are the countries from which the most guests come.

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

The management team can plan cultural festivals of these countries to attract more guests from them. They can also stock their inventories with items preferred by those guests.

#### Chart - 6

In [None]:
# Chart - 6 visualization code
df_meal_pref = df.groupby(['meal','country'])['country'].count().sort_values(ascending=False)
df_meal_pref[0:10].plot(kind='bar')

##### 1. Why did you pick the specific chart?

To find out the most preferred meal plan for different countries.

##### 2. What is/are the insight(s) found from the chart?

BB is the most preferred meal plan across all the countries with the most travellers.

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

We can set BB as the default meal plan. The required materials and inventory can also be planned around the BB meal plan.

#### Chart - 7

In [None]:
# Chart - 7 visualization code
# Count bookings per distribution channel, ignoring the 'Undefined' channel if it is present
df_category = df.groupby('distribution_channel')['is_canceled'].count().drop(index='Undefined', errors='ignore')
df_category_percent = df_category.to_frame()
# Convert the counts to a percentage share of all bookings
df_category_percent['%age'] = df_category_percent['is_canceled']/df_category_percent['is_canceled'].sum()*100
df_category_percent.drop(columns='is_canceled').plot(kind='bar')

##### 1. Why did you pick the specific chart?

We create this chart to see which channels customers book through.

##### 2. What is/are the insight(s) found from the chart?

Most customers fall under TA/TO (Travel Agent/Tour Operator), so most bookings come through agents.

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

This mainly tells us which channels customers are booking from.
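
The percentage share per channel can also be computed in one step with `value_counts(normalize=True)`. This is a sketch on made-up channel labels, not the hotel dataset.

```python
import pandas as pd

# Made-up illustrative data (not the hotel dataset)
channels = pd.Series(
    ["TA/TO", "TA/TO", "Direct", "TA/TO", "Corporate", "Undefined", "Direct", "TA/TO"],
    name="distribution_channel",
)

# Leave out the 'Undefined' channel, then convert counts to percentages
known = channels[channels != "Undefined"]
share_pct = known.value_counts(normalize=True).mul(100).round(1)
print(share_pct)
```

The result is already sorted from the largest share to the smallest, so it can be passed directly to `.plot(kind='bar')`.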

#### Chart - 8

In [None]:
# Chart - 8 visualization code
df_category_adr = df.groupby(['distribution_channel', 'hotel'])
d5 = pd.DataFrame(round((df_category_adr['adr']).mean(),2)).reset_index().rename(columns = {'adr': 'avg_adr'})
plt.figure(figsize = (7,5))
sns.barplot(x = d5['distribution_channel'], y = d5['avg_adr'], hue = d5['hotel'])
plt.ylim(40,140)
plt.show()

##### 1. Why did you pick the specific chart?

To study the average adr across different distribution channels and hotel types.

##### 2. What is/are the insight(s) found from the chart?

The GDS channel has a higher average adr for resorts than the other channels. Resorts also have higher revenue than city hotels.

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

In terms of adr, resorts are a better option than city hotels.

#### Chart - 9

In [None]:
# Chart - 9 visualization code
df_room = df.groupby(['hotel','distribution_channel','reserved_room_type'])['is_canceled'].count()
df_room.sort_values(ascending=False)[0:5].plot(kind='bar')

##### 1. Why did you pick the specific chart?

To find the most preferred room type for the customers in terms of bookings.

##### 2. What is/are the insight(s) found from the chart?

City hotel room type 'A' is the most preferred, followed by resort hotel room type 'A'. Room type 'D' is the second most preferred room type.

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

More rooms of these types should be constructed to attract more customers.

#### Chart - 10

In [None]:
# Chart - 10 visualization code
# Total nights booked per month for the two most booked room types (A and D)
df_room_person = df[df['reserved_room_type'].isin(['A', 'D'])].groupby(['arrival_date_month','reserved_room_type'])
d6 = pd.DataFrame(df_room_person['total_days'].sum()).reset_index().sort_values(by='reserved_room_type')
plt.figure(figsize = (15,5))
sns.barplot(x=d6['arrival_date_month'],y=d6['total_days'],hue=d6['reserved_room_type'])

##### 1. Why did you pick the specific chart?

To compare the bookings of the two most booked room types.

##### 2. What is/are the insight(s) found from the chart?

Comparing room types A and D, room A is booked throughout the year with only slight variation, whereas bookings for room D build up gradually over the year.

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

Room type A should be constructed in preference to type D.
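
Month names sort alphabetically by default, which can make a month-wise chart harder to read. A possible way to keep the months in calendar order is an ordered categorical, sketched below on made-up data.

```python
import pandas as pd

month_order = ["January", "February", "March", "April", "May", "June", "July",
               "August", "September", "October", "November", "December"]

# Made-up illustrative data (not the hotel dataset)
toy = pd.DataFrame({
    "arrival_date_month": ["March", "January", "August", "January", "August", "August"],
    "reserved_room_type": ["A", "A", "D", "D", "A", "A"],
    "total_days": [2, 3, 4, 1, 5, 2],
})

# An ordered categorical makes groupby (and plots) follow calendar order
toy["arrival_date_month"] = pd.Categorical(
    toy["arrival_date_month"], categories=month_order, ordered=True
)
nights = (
    toy.groupby(["arrival_date_month", "reserved_room_type"], observed=True)["total_days"]
    .sum()
    .reset_index()
)
print(nights)
```

With `observed=True`, months that do not appear in the data are left out instead of showing up as empty groups.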

#### Chart - 11

In [None]:
# Chart - 11 visualization code
d7 = df[df['reserved_room_type']!=df['assigned_room_type']].groupby(['is_canceled','hotel']).count().reset_index()
plt.figure(figsize = (15,5))
sns.barplot(x=d7['hotel'],y=d7['total_days'],hue=d7['is_canceled'])

##### 1. Why did you pick the specific chart?

To study whether cancellations are caused by the assigned room differing from the reserved room.

##### 2. What is/are the insight(s) found from the chart?

Cancellation does not appear to depend on whether the assigned room type matches the reserved one.

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

Assigning a room different from the one reserved does not impact cancellation.
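
Raw counts can hide differences when the groups are of very different sizes, so it may help to compare cancellation rates instead. The sketch below uses made-up data, and the `room_changed` column is a hypothetical helper column introduced only for this example.

```python
import pandas as pd

# Made-up illustrative data (not the hotel dataset)
toy = pd.DataFrame({
    "reserved_room_type": ["A", "A", "D", "A", "D", "A"],
    "assigned_room_type": ["A", "D", "D", "A", "A", "A"],
    "is_canceled":        [1,   0,   0,   1,   0,   0],
})

# Flag the bookings where the assigned room differs from the reserved one
mismatch = toy["reserved_room_type"] != toy["assigned_room_type"]
toy["room_changed"] = mismatch.map({True: "changed", False: "as reserved"})

# The mean of a 0/1 column is the share of cancelled bookings in each group
cancel_rate = toy.groupby("room_changed")["is_canceled"].mean()
print(cancel_rate)
```

Comparing the two rates on the real data is a more direct test of whether a room change goes together with cancellation.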

#### Chart - 12

In [None]:
# Chart - 12 visualization code
d8 = df.groupby(['customer_type','hotel']).count().reset_index()
plt.figure(figsize = (15,5))
sns.barplot(x=d8['customer_type'],y=d8['total_days'],hue=d8['hotel'])

##### 1. Why did you pick the specific chart?

To understand the customer type.

##### 2. What is/are the insight(s) found from the chart?

Transient customers outnumber all the other customer types.

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

Advertising should target transient customers so that more of them come for exciting offers. City hotels should also be designed with the comfort of transient guests in mind.

#### Chart - 13

In [None]:
# Chart - 13 visualization code
d9 = df.groupby(['booking_changes','is_canceled']).count().reset_index()
plt.figure(figsize = (15,5))
sns.barplot(x=d9['booking_changes'],y=d9['total_days'],hue=d9['is_canceled'])


##### 1. Why did you pick the specific chart?

To understand whether changing a booking affects cancellation.

##### 2. What is/are the insight(s) found from the chart?

Booking changes are very rare, so they could be limited to two per booking.

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

Booking changes should be limited to two per booking.

#### Chart - 14

In [None]:
# Chart - 14 visualization code
# Count bookings for each combination of special requests and cancellation status
d10 = df.groupby(['total_of_special_requests','is_canceled']).count().reset_index()
plt.figure(figsize = (15,5))
sns.barplot(x=d10['total_of_special_requests'],y=d10['total_guests'],hue=d10['is_canceled'])

##### 1. Why did you pick the specific chart?

To understand cancellation better.

##### 2. What is/are the insight(s) found from the chart?

The cancellation rate is lower when a booking has at least one special request.

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

One strategy could be to give every 100th customer a free special request of the kind most guests prefer.
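
The chart above plots counts, so the cancellation rate itself has to be read off by comparing bar heights. A row-normalised cross-tabulation gives the rate directly. This is a sketch on made-up data.

```python
import pandas as pd

# Made-up illustrative data (not the hotel dataset)
toy = pd.DataFrame({
    "total_of_special_requests": [0, 0, 0, 0, 1, 1, 2, 1],
    "is_canceled":               [1, 1, 0, 0, 0, 1, 0, 0],
})

# normalize="index" makes every row sum to 1, so column 1 is the cancellation rate
rates = pd.crosstab(
    toy["total_of_special_requests"], toy["is_canceled"], normalize="index"
)
cancel_rate = rates[1]
print(cancel_rate)
```

On the real data, a falling `cancel_rate` as the number of special requests grows would back up the insight stated above.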

#### Chart - 15 - Correlation Heatmap

In [None]:
# Correlation Heatmap visualization code
df_new = df[['is_canceled','lead_time','arrival_date_week_number','arrival_date_day_of_month','total_days','total_guests','is_repeated_guest','booking_changes','adr','total_of_special_requests']]
df_corr = df_new.corr()
sns.heatmap(df_corr,vmin=-1,annot=True)

##### 1. Why did you pick the specific chart?

It is needed to see how each column correlates with every other column.

##### 2. What is/are the insight(s) found from the chart?

Total guests and adr have a positive correlation of about 0.39, and total days and lead time have a positive correlation of about 0.32.
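
Instead of scanning the heatmap by eye, the strongest pairs can be listed by keeping the upper triangle of the correlation matrix and sorting it. The sketch below runs on made-up numbers, so its values are not the ones quoted above.

```python
import numpy as np
import pandas as pd

# Made-up illustrative data (not the hotel dataset)
toy = pd.DataFrame({
    "lead_time":    [10, 50, 90, 130, 170],
    "total_days":   [1, 2, 4, 5, 7],
    "total_guests": [2, 1, 2, 3, 2],
    "adr":          [80.0, 60.0, 95.0, 120.0, 90.0],
})
corr = toy.corr()

# Keep only the cells above the diagonal so each pair appears exactly once
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack().dropna().sort_values(ascending=False)
print(pairs)
```

Running the same lines on `df_new.corr()` gives a ranked list of column pairs to focus on.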

#### Chart - 16 - Pair Plot 

In [None]:
# Pair Plot visualization code
pair = df[(df['total_guests']<10) & (df['lead_time']<365) & (df['adr']<1000) & (df['total_days']<31)]
sns.pairplot(pair[['lead_time','adr','total_days','total_guests']])

##### 1. Why did you pick the specific chart?

To find the relationship between pair of attributes/columns.

##### 2. What is/are the insight(s) found from the chart?

The pair plot does not show any clear relationship beyond what the correlation heatmap already indicates.

## **5. Solution to Business Objective**

#### What do you suggest the client to achieve Business Objective ? 
Explain Briefly.

*   More crew should be hired on a temporary basis from March to August to manage the larger pool of customers.
*   Reduce the adr for stays longer than 5-6 days. It would also be better to have offers for full-week stays.
*   Lead time is low for one- or two-day bookings, so charge more for shorter stays.
*   Charge more in the first and last weeks of the month.
*   The management team can plan cultural festivals of the top guest countries to attract more guests from them. They can also stock their inventories with items preferred by those guests.
*   Set BB as the default meal plan. The required materials and inventory can also be planned around the BB meal plan.
*   In terms of adr, resorts are a better option than city hotels.
*   More rooms of the most preferred types should be constructed to attract more customers.
*   Room type A should be constructed in preference to type D. We can also work out the ratio between these room types.
*   Assigning a room different from the one reserved does not impact cancellation.
*   Advertising should target transient customers so that more of them come for exciting offers. City hotels should also be designed with the comfort of transient guests in mind.
*   Booking changes should be limited to two per booking.
*   One strategy could be to give every 100th customer a free special request of the kind most guests prefer.


# **Conclusion**

*   Guest arrivals start rising in March, keep increasing over the following months and peak in August. After that, the number of guests falls below average.
*   On average, guests stay for about 4 days. Longer stays also tend to have a lower adr.
*   The average lead time is about 80 days, which is a good amount of time, so the management team has to plan accordingly.
*   More guests arrive in the first and last weeks of the month. The adr is also high in the last week.
*   Portugal, the US, France and Spain are the countries from which the most guests come.
*   BB is the most preferred meal plan across all the countries with the most travellers.
*   Most customers fall under TA/TO (Travel Agent/Tour Operator), so most bookings come through agents.
*   The GDS channel has a higher average adr for resorts than the other channels. Resorts also have higher revenue than city hotels.
*   City hotel room type 'A' is the most preferred, followed by resort hotel room type 'A'. Room type 'D' is the second most preferred room type.
*   Comparing room types A and D, room A is booked throughout the year with only slight variation, whereas bookings for room D build up gradually over the year.
*   Cancellation does not appear to depend on whether the assigned room type matches the reserved one.
*   Transient customers outnumber all the other customer types.
*   Booking changes are very rare, so they could be limited to two per booking.
*   The cancellation rate is lower when a booking has at least one special request.
*   Total guests and adr have a positive correlation of about 0.39, and total days and lead time have a positive correlation of about 0.32.

### ***Hurrah! You have successfully completed your EDA Capstone Project !!!***