<a href="https://colab.research.google.com/github/thepankaj018/EDA-Hotel-Booking/blob/main/EDA_Hotel_Booking.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name**    - Hotel Booking Analysis



##### **Project Type**    - EDA
##### **Contribution**    - Individual


# **Project Summary -**

Write the summary here within 500-600 words.

# **GitHub Link -**

Provide your GitHub Link here.

# **Problem Statement**


## <b> Have you ever wondered when the best time of year to book a hotel room is? Or the optimal length of stay in order to get the best daily rate? What if you wanted to predict whether or not a hotel was likely to receive a disproportionately high number of special requests? This hotel booking dataset can help you explore those questions! </b>

## <b>This data set contains booking information for a city hotel and a resort hotel, and includes information such as when the booking was made, length of stay, the number of adults, children, and/or babies, and the number of available parking spaces, among other things. All personally identifying information has been removed from the data. </b>



#### **Define Your Business Objective?**

### Explore and analyze the data to discover important factors that govern the bookings.

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required. 
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits. 
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure that, for each chart, the following format is answered.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 20 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following the "UBM" Rule. 

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
plt.style.use('fivethirtyeight')
import warnings
warnings.filterwarnings('ignore')


In [None]:
pd.set_option('display.max_columns',None)

### Dataset Loading

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
# Load Dataset
file_path = '/content/drive/MyDrive/hotel_bookings.csv'
df = pd.read_csv(file_path)


### Dataset First View

In [None]:
# Dataset First Look
df.head()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
print(f'Number of rows are {df.shape[0]}')
print(f'Number of columns are {df.shape[1]}')

### Dataset Information

In [None]:
# Dataset Info
df.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
print(f'Number of duplicate rows in df: {df.duplicated().sum()}')

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
df.isnull().sum()

In [None]:
# Checking percent NULL Values
(df.isnull().sum()*100/len(df)).to_frame().sort_values(by = 0,ascending = False).rename(columns = {0:'Percent NULL Values'})

In [None]:
# Visualizing the missing values
plt.figure(figsize=(25, 10))
sns.heatmap(df.isnull(), cbar=False, yticklabels=False,cmap = 'plasma')
plt.xlabel("Name Of Columns")
plt.title("Places of missing values in column");
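The null-count and null-percentage checks above can be folded into one summary table. The sketch below is a minimal illustration on a small made-up frame; the helper name `missing_summary` and the toy values are assumptions for illustration, not part of the original notebook.

```python
import numpy as np
import pandas as pd


def missing_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Return null counts and null percentages per column, largest first."""
    summary = pd.DataFrame({
        "null_count": frame.isnull().sum(),
        # The mean of a boolean mask is the share of True values.
        "null_percent": frame.isnull().mean() * 100,
    })
    return summary.sort_values("null_percent", ascending=False)


# Toy data standing in for the hotel bookings frame (values are made up).
toy = pd.DataFrame({
    "company": [np.nan, np.nan, np.nan, 40.0],
    "agent": [9.0, np.nan, 14.0, 9.0],
    "hotel": ["City Hotel", "Resort Hotel", "City Hotel", "City Hotel"],
})
missing = missing_summary(toy)
print(missing)
```

Calling `missing_summary(df)` on the bookings frame would give the counts and percentages side by side, which makes it easier to decide which columns to drop and which to impute.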

### What did you know about your dataset?

#### 1. The dataset has 119390 rows and 32 columns.
#### 2. The dataset has 31994 duplicate rows.
#### 3. The COMPANY feature has the highest share of NULL values (around 94%), followed by AGENT with around 14% NULL values.

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
df.columns

In [None]:
# Dataset Describe
df.describe().T.style.background_gradient(cmap='RdPu')

### Variables Description 



1. **hotel** : *Hotel(Resort Hotel or City Hotel)* 

2. **is_canceled** : *Value indicating if the booking was canceled (1) or not (0)*

3. **lead_time** : *Number of days that elapsed between the entering date of the booking into the PMS and the arrival date*

4. **arrival_date_year** : *Year of arrival date*

5. **arrival_date_month** : *Month of arrival date*

6. **arrival_date_week_number** : *Week number of year for arrival date*

7. **arrival_date_day_of_month** : *Day of arrival date*

8. **stays_in_weekend_nights** : *Number of weekend nights (Saturday or Sunday) the guest stayed or booked to stay at the hotel*

9. **stays_in_week_nights** : *Number of week nights (Monday to Friday) the guest stayed or booked to stay at the hotel*

10. **adults** : *Number of adults*

11. **children** : *Number of children*

12. **babies** : *Number of babies*

13. **meal** : *Type of meal booked. Categories are presented in standard hospitality meal packages:*

14. **country** : *Country of origin.*

15. **market_segment** : *Market segment designation. In categories, the term “TA” means “Travel Agents” and “TO” means “Tour Operators”*

16. **distribution_channel** : *Booking distribution channel. The term “TA” means “Travel Agents” and “TO” means “Tour Operators”*

17. **is_repeated_guest** : *Value indicating if the booking name was from a repeated guest (1) or not (0)*

18. **previous_cancellations** : *Number of previous bookings that were cancelled by the customer prior to the current booking*

19. **previous_bookings_not_canceled** : *Number of previous bookings not cancelled by the customer prior to the current booking*

20. **reserved_room_type** : *Code of room type reserved. Code is presented instead of designation for anonymity reasons.*

21. **assigned_room_type** : *Code for the type of room assigned to the booking.* 

22. **booking_changes** : *Number of changes/amendments made to the booking from the moment the booking was entered on the PMS until the moment of check-in or cancellation*

23. **deposit_type** : *Indication on if the customer made a deposit to guarantee the booking.*

24. **agent** : *ID of the travel agency that made the booking*

25. **company** : *ID of the company/entity that made the booking or responsible for paying the booking.* 

26. **days_in_waiting_list** : *Number of days the booking was in the waiting list before it was confirmed to the customer*

27. **customer_type** : *Type of booking, assuming one of four categories*


28. **adr** : *Average Daily Rate as defined by dividing the sum of all lodging transactions by the total number of staying nights*

29. **required_car_parking_spaces** : *Number of car parking spaces required by the customer*

30. **total_of_special_requests** : *Number of special requests made by the customer (e.g. twin bed or high floor)*

31. **reservation_status** : *Reservation last status, assuming one of three categories*
* Canceled – booking was canceled by the customer
* Check-Out – customer has checked in but already departed
* No-Show – customer did not check-in and did inform the hotel of the reason why





32. **reservation_status_date** : *Date at which the last status was set. This variable can be used in conjunction with the ReservationStatus to understand when was the booking canceled or when did the customer checked-out of the hotel*

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
for elem in df.columns:
  print(f'Unique Values present in {elem} are')
  print("-"*50)
  print(df[elem].unique())
  print("*"*100)

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Removing duplicate rows present in df
df.drop_duplicates(inplace = True)

In [None]:
print(f'Number of duplicate rows is now {df.duplicated().sum()}')

In [None]:
# Dropping the COMPANY column, which has about 94% NULL values.
df.drop(columns = ['company'],inplace = True)

In [None]:
# Filling NULL values in column AGENT with 0.
# 0 indicates the booking was not made through an agent but through some other channel.
# Assigning the result back avoids in-place modification of a column selection, which newer pandas versions warn about.
df['agent'] = df['agent'].fillna(0)

In [None]:
# Filling NULL values in column CHILDREN with 0.
# 0 indicates no children present.
df['children'] = df['children'].fillna(0)

In [None]:
# Dropping all rows where the sum of adults, children and babies is 0, i.e. bookings with no guests.
# .copy() gives an independent frame so that later column assignments do not act on a slice.
df = df[df['adults'] + df['children'] + df['babies'] != 0].copy()

In [None]:
# Filling NULL values in column COUNTRY with the mode.
# Replacing with the mode on the assumption that the data is missing completely at random.
df['country'] = df['country'].fillna(df['country'].mode()[0])

In [None]:
# Extracting categorical (non-numeric) columns
categorical = df.select_dtypes(exclude = 'number').columns.tolist()
categorical

In [None]:
# Extracting numerical columns
numerical = df.select_dtypes(include = 'number').columns.tolist()
numerical

In [None]:
df.shape[1]

In [None]:
len(categorical) + len(numerical)

In [None]:
# Checking the unique values in categorical columns.
for elem in categorical:
  if elem in ['arrival_date_month','reservation_status_date','country']:
    continue
  else:
    print(f'Unique values in {elem} are {df[elem].unique()}')

In [None]:
# Now doing some feature construction.
df['total_people'] = df['adults'] + df['children'] + df['babies']
df['total_stay'] = df['stays_in_week_nights'] + df['stays_in_weekend_nights']

In [None]:
df.shape

### What all manipulations have you done and insights you found?

#### The following data manipulations were done:
1. Removal of duplicate rows.

2. The column with a very large share of NULL values was dropped, and features with a smaller share of NULL values were imputed with appropriate values.

3. Separation of categorical and numerical features.

4. Feature construction.
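The wrangling steps can also be collected into a single function so that the cleaning is repeatable. The sketch below is one possible arrangement on a made-up frame; the function name `clean_bookings` and the toy rows are assumptions for illustration, and the fill values follow the choices made in this notebook.

```python
import numpy as np
import pandas as pd


def clean_bookings(raw: pd.DataFrame) -> pd.DataFrame:
    """Apply the wrangling steps from this section and return a new frame."""
    out = raw.drop_duplicates().copy()
    # Drop the mostly empty column if it is present.
    out = out.drop(columns=["company"], errors="ignore")
    # Missing agent / children are treated as 0, as in the notebook.
    out["agent"] = out["agent"].fillna(0)
    out["children"] = out["children"].fillna(0)
    # Missing country is replaced by the most frequent country.
    out["country"] = out["country"].fillna(out["country"].mode()[0])
    # Remove bookings with no guests at all.
    guests = out["adults"] + out["children"] + out["babies"]
    out = out[guests != 0].copy()
    # Constructed features.
    out["total_people"] = out["adults"] + out["children"] + out["babies"]
    out["total_stay"] = out["stays_in_week_nights"] + out["stays_in_weekend_nights"]
    return out


# Toy rows (made up): row 1 duplicates row 0, row 2 has no guests.
toy = pd.DataFrame({
    "adults": [2, 2, 0, 1, 3],
    "children": [0.0, 0.0, 0.0, np.nan, 0.0],
    "babies": [0, 0, 0, 1, 0],
    "agent": [9.0, 9.0, np.nan, np.nan, 14.0],
    "company": [np.nan] * 5,
    "country": ["PRT", "PRT", "GBR", np.nan, "PRT"],
    "stays_in_week_nights": [2, 2, 1, 3, 3],
    "stays_in_weekend_nights": [1, 1, 0, 2, 1],
})
cleaned = clean_bookings(toy)
print(cleaned)
```

Keeping the steps in one function leaves the raw frame untouched, so the cleaning can be re-run after any change without reloading the CSV.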

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

## UNIVARIATE ANALYSIS

#### Chart - 1

In [None]:
# Chart - 1 visualization code
colors = ['#ff9999','#66b3ff']
df['hotel'].value_counts().plot.pie(explode = [0,0.1],autopct='%1.1f%%',shadow=True, startangle=90,colors = colors,figsize=(12,8),fontsize=15,labels=None)
labels = df['hotel'].value_counts().index
plt.title('% Distribution of Hotel Preference')
plt.legend(bbox_to_anchor=(0.85, 1), loc='upper left', labels=labels)
plt.show()

##### 1. Why did you pick the specific chart?

I used a pie chart because the selected feature has only two unique values, and the chart shows each value's percentage share, which makes the comparison easy.

##### 2. What is/are the insight(s) found from the chart?

City Hotel has the larger number of bookings, which makes it the more preferred of the two hotels.

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

City Hotel already holds a good market share. Resort Hotel needs to attract more guests to bring its share closer to that of City Hotel.

#### Chart - 2

In [None]:
# Chart - 2 visualization code
sns.catplot(data = df,x = 'meal',kind = 'count',aspect = 2.5,height = 9,palette = 'husl')
plt.title("Preferred Meal Type")
plt.show()

Types of meal in hotels:
* BB - (Bed and Breakfast)
* HB- (Half Board)
* FB- (Full Board)
* SC- (Self Catering)



##### 1. Why did you pick the specific chart?

I used a count plot because it shows the frequency of each value in a column, which helps identify the most frequent categories.

##### 2. What is/are the insight(s) found from the chart?


##### The meal type most preferred by guests is BB (Bed and Breakfast).
##### HB (Half Board) and SC (Self Catering) are equally preferred.


##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

##### BB (Bed and Breakfast) is preferred by most guests, so hotels should offer it.

#### Chart - 3

In [None]:
df['country'].nunique()

In [None]:
# Chart - 3 visualization code
# Plotting the top 20 countries that most guests come from
plt.figure(figsize = (15,6))
df['country'].value_counts()[:20].plot(kind = 'bar')
plt.title('Country Wise Visitors')
plt.show()

##### 1. Why did you pick the specific chart?

A bar chart is useful for comparing values across a larger number of categories. Here the x-axis holds categorical data and the y-axis holds numerical data, so a bar chart is the best fit.

##### 2. What is/are the insight(s) found from the chart?

Most visitors are from Portugal (PRT), more than 25000, followed by the United Kingdom (GBR) with around 10000.

* PRT - Portugal
* GBR - United Kingdom
* FRA - France
* ESP - Spain
* DEU - Germany
* ITA - Italy
* IRL - Ireland
* BEL - Belgium
* BRA - Brazil
* NLD - Netherlands
* USA - United States
* CHE - Switzerland
* CN - China
* AUT - Austria
* SWE - Sweden
* CHN - China
* POL - Poland
* RUS - Russia
* NOR - Norway
* ROU - Romania






##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

The hotel industry should try to attract guests from the countries that rank lower in the above chart.

#### Chart - 4

In [None]:
df['market_segment'].unique()

In [None]:
# rename_axis + reset_index(name=...) gives the same column names regardless of how the pandas version names value_counts() output.
prefer_dist_channel = df['distribution_channel'].value_counts().rename_axis('distribution_channel').reset_index(name = 'counts')
# Adding a percentage column to prefer_dist_channel.
prefer_dist_channel['count_percent'] = round(prefer_dist_channel['counts']*100/len(df),2)
prefer_dist_channel

In [None]:
# Chart - 4 visualization code
# Creating labels
labels = prefer_dist_channel['distribution_channel'].values.tolist()
# Creating sizes
sizes = prefer_dist_channel['count_percent'].values.tolist()

colors = ['#ff9999','#66b3ff','#99ff99','#ffcc99','#ff9900']

# Creating pie chart; explode is sized to the number of channels so the call does not fail if a category is missing
prefer_dist_channel['count_percent'].plot.pie(explode=[0.05]*len(sizes), colors=colors[:len(sizes)], shadow=True, figsize=(15,8), fontsize=10, labels=None)

# Setting legends with the percentage values
labels = [f'{l}, {s}%' for l, s in zip(labels, sizes)]

plt.legend(bbox_to_anchor=(0.85, 1), loc='upper left', labels=labels)

plt.title('Distribution Channels Used for Hotel Bookings')
plt.show()


##### 1. Why did you pick the specific chart?

A pie chart is useful for comparing categorical data in percentage terms when the number of categories is small.

##### 2. What is/are the insight(s) found from the chart?

'TA/TO' is the most used distribution channel for hotel bookings.

#### Chart - 5

In [None]:
df['reserved_room_type'].unique()

In [None]:
# Chart - 5 visualization code
sns.catplot(data = df,x = 'reserved_room_type',kind = 'count',height = 11,aspect = 2.5)
plt.title('Most Preferred Room Types')
plt.show()

##### 2. What is/are the insight(s) found from the chart?

The most preferred room type is "A".

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

Hotels should keep more rooms of type "A", because guests reserve this room type most often.

#### Chart - 6

In [None]:
df['customer_type'].unique()

In [None]:
# Chart - 6 visualization code
sns.catplot(data = df,x = 'customer_type',kind = 'count',height = 9,aspect = 1.5)
plt.title('Bar plot of customer_type')
plt.show()

**1. Contract** 
>when the booking has an allotment or other type of contract associated to it

**2. Group**
> when the booking is associated to a group

**3. Transient**
>when the booking is not part of a group or contract, and is not associated to other transient booking

**4. Transient-party**
>when the booking is transient, but is associated to at least other transient booking



##### 2. What is/are the insight(s) found from the chart?

Transient is the most common customer type, while Group is the least common.

#### Chart - 7

In [None]:
df['required_car_parking_spaces'].unique()

In [None]:
# Chart - 7 visualization code
sns.catplot(data = df,x = 'required_car_parking_spaces',kind = 'count',height = 6,aspect = 1.5)
plt.title('Bar plot of Car Parking Space')
plt.show()

##### 2. What is/are the insight(s) found from the chart?

Most guests do not need a parking space at all; only a few guests asked for a parking space for one car.

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

Since most guests do not request parking, the lack of a parking space appears to have little effect on a hotel's business.

#### Chart - 8

In [None]:
df['booking_changes'].unique()

In [None]:
# rename_axis + reset_index(name=...) gives the same column names regardless of how the pandas version names value_counts() output.
booking_change_df = df['booking_changes'].value_counts().rename_axis('number_booking_change').reset_index(name = 'counts')
booking_change_df['counts_percent'] = booking_change_df['counts']*100/len(df)
booking_change_df

In [None]:
# Chart - 8 visualization code
plt.figure(figsize = (12,8))
sns.barplot(x = booking_change_df['number_booking_change'], y = booking_change_df['counts_percent'])
plt.title("% of Booking change")
plt.show()


### 0 means no changes were made to the booking
### 1 means 1 change was made to the booking
### 2 means 2 changes were made to the booking, and so on.

##### 2. What is/are the insight(s) found from the chart?


Around 80% of guests did not make any changes to their booking, and around 12% of guests made one change.
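Percentage shares like these can also be obtained directly with `value_counts(normalize=True)`, without dividing by the frame length by hand. A minimal sketch on made-up values follows; the toy series is an assumption for illustration only.

```python
import pandas as pd

# Made-up booking_changes values standing in for the real column.
booking_changes = pd.Series([0, 0, 0, 0, 1, 0, 2, 0, 1, 0], name="booking_changes")

# normalize=True returns shares instead of raw counts; multiply by 100 for percent.
change_percent = (
    booking_changes.value_counts(normalize=True)
    .mul(100)
    .rename_axis("number_booking_change")
    .reset_index(name="counts_percent")
)
print(change_percent)
```

The resulting frame has the same two columns that the bar plot for this chart expects.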

## BIVARIATE ANALYSIS

#### Chart - 9

In [None]:
# Creating a dataframe where the booking is canceled by the guests.
canceled_df = df[df['is_canceled']==1][['hotel','is_canceled']]
canceled_df.head()

In [None]:
# Counting how many bookings were cancelled at each hotel.
canceled_count = canceled_df.groupby('hotel')['is_canceled'].sum().reset_index().rename(columns = {'is_canceled':'cancellation_count'})
canceled_count

In [None]:
# Counting how many times each hotel was booked.
total_booking = df.groupby('hotel').size().reset_index(name = 'total_booking_counts')
total_booking

In [None]:
# Now merging the above two dataframes on hotel.
concatenated_df = pd.merge(canceled_count,total_booking,on = 'hotel')
concatenated_df

In [None]:
# Now adding percent_canceled feature in concatenated_df
concatenated_df['percent_canceled'] = (concatenated_df['cancellation_count']*100)/concatenated_df['total_booking_counts']
concatenated_df

In [None]:
df.shape

In [None]:
# Chart - 9 visualization code
sns.catplot(data = concatenated_df, x = 'hotel', y = 'percent_canceled', kind = 'bar' ,height = 7 , aspect = 1.5)
plt.title("Percentage of booking cancellation")
plt.show()

In [None]:
# Overall cancellation vs non-cancellation
colors = ['#ff9999','#66b3ff']
df['is_canceled'].value_counts().plot.pie(explode = [0,0.1],autopct='%1.1f%%',shadow=True, startangle=90,colors = colors,figsize=(12,8),fontsize=15,labels=None)
labels = df['is_canceled'].value_counts().index
plt.title('Cancellation vs Non_cancellation')
plt.legend(bbox_to_anchor=(0.85, 1), loc='upper left', labels=labels)
plt.show()

### 0 = Not Cancelled
### 1 = Cancelled

##### 2. What is/are the insight(s) found from the chart?

In total, 27.5% of bookings were cancelled. The cancellation rate is about 30% for City Hotel and about 24% for Resort Hotel, so City Hotel has the higher cancellation rate.
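Because `is_canceled` is a 0/1 flag, its mean within a group is already the cancelled share, so a per-group cancellation rate does not need a separate count, total and merge. The sketch below shows this on made-up rows; the helper name `cancellation_rate_by` and the toy data are assumptions for illustration.

```python
import pandas as pd


def cancellation_rate_by(frame: pd.DataFrame, column: str) -> pd.Series:
    """Percentage of cancelled bookings within each group of `column`."""
    # is_canceled is 0/1, so its mean within a group is the cancelled share.
    return frame.groupby(column)["is_canceled"].mean().mul(100).round(2)


# Toy rows (made up): 2 of 5 city bookings and 1 of 4 resort bookings cancelled.
toy = pd.DataFrame({
    "hotel": ["City Hotel"] * 5 + ["Resort Hotel"] * 4,
    "is_canceled": [1, 1, 0, 0, 0, 1, 0, 0, 0],
})
rates = cancellation_rate_by(toy, "hotel")
print(rates)
```

The same helper could be pointed at any other categorical column, such as `distribution_channel` or `market_segment`.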

#### Chart - 10

In [None]:
# Chart - 10 visualization code
plt.figure(figsize = (10,8))
sns.barplot(x = df.groupby('hotel')['lead_time'].mean().index,y = df.groupby('hotel')['lead_time'].mean().values)
plt.title("Average lead time of each hotel type")
plt.show()

### Booking or Reservation Lead Time is the period of time (most typically measured in calendar days) between when a guest makes the reservation and the actual check-in/arrival date.

##### 2. What is/are the insight(s) found from the chart?

Resort Hotel has a slightly higher average lead time than City Hotel.

#### Chart - 11

In [None]:
df['arrival_date_year'].unique()

In [None]:
# Chart - 11 visualization code
sns.catplot(data = df,x = 'arrival_date_year',kind = 'count',height = 11,aspect = 2.5)
plt.title('Number of bookings across Year')
plt.show()


sns.catplot(data = df,x = 'arrival_date_year',kind = 'count',hue = 'hotel',height = 11,aspect = 2.5)
plt.title('Number of bookings across Year')
plt.show()

##### 2. What is/are the insight(s) found from the chart?

1. 2016 was the best year for these hotels in terms of bookings.

2. In 2015, City Hotel and Resort Hotel had about the same number of bookings, but after 2015 City Hotel had more bookings.

3. In 2016, City Hotel had more than 25000 bookings while Resort Hotel had around 15000; in 2017, City Hotel had around 20000 bookings while Resort Hotel had around 12000.

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

In 2015 Resort Hotel had about the same number of bookings as City Hotel, but after 2015 it had fewer bookings than City Hotel, although its bookings also showed a positive rate of change.

#### Chart - 12

In [None]:
# Applying groupby on month & Hotels so that we can find total number of bookings in each month from each hotel
month_df = df.groupby(['arrival_date_month','hotel']).size().reset_index().rename(columns = {0:'Bookings'})

# Sorting the months in calendar order
months = ['January', 'February', 'March', 'April', 'May', 'June', 'July', 'August', 'September', 'October', 'November', 'December']

month_df['arrival_date_month'] = pd.Categorical(month_df['arrival_date_month'], categories=months, ordered=True)
month_df = month_df.sort_values(by = 'arrival_date_month')
month_df

In [None]:
# applying groupby on month 
month_wise_booking = month_df.groupby('arrival_date_month')['Bookings'].sum().reset_index()
month_wise_booking

In [None]:
# Chart - 12 visualization code
sns.set(font_scale=2)
sns.relplot(data = month_wise_booking,x = 'arrival_date_month',y = 'Bookings',kind = 'line',height = 11,aspect = 2.5,lw = 4)  # lw = line_width
plt.title('Number of bookings across each month')
plt.show()

sns.relplot(data = month_df,x = 'arrival_date_month',y = 'Bookings',kind = 'line',hue = 'hotel',height = 11,aspect = 2.5,lw = 4) 
plt.title('Number of bookings across each month')
plt.show()

##### 1. Why did you pick the specific chart?

I picked a line chart because it is well suited to showing trends in the data.

##### 2. What is/are the insight(s) found from the chart?

1. July and August had the most bookings.

2. Both hotels follow a similar monthly trend: when City Hotel's bookings rise, Resort Hotel's bookings rise too, and when City Hotel's bookings fall, Resort Hotel's fall as well, although the changes are larger for City Hotel.

#### Chart - 13

In [None]:
# Creating a new DataFrame 
df_1 = df[['hotel','is_repeated_guest']]
df_1.head()

0-New Guest

1-Repeated Guest

In [None]:
# applying groupby on hotel & taking sum which will give number of repeated guest.
repeat_guest_df = df_1.groupby('hotel').sum().reset_index().rename(columns = {'is_repeated_guest':'repeated guest_booking'})
repeat_guest_df


In [None]:
# Fetching the total number of bookings of each hotel
total_booking = df.groupby('hotel').size().reset_index(name = 'total_booking_counts')
total_booking

In [None]:
# merging above two dataframe
merge_df = pd.merge(repeat_guest_df,total_booking,how = 'inner',on = 'hotel')
# adding a new column which will show percentage of repeated guest
merge_df['% repeated guest'] = merge_df['repeated guest_booking']*100/merge_df['total_booking_counts']
merge_df

In [None]:
# Chart - 13 visualization code
sns.catplot(data = merge_df,x = 'hotel', y = '% repeated guest',kind = 'bar',aspect = 1.5,height = 7,palette = "husl")
plt.title('Bar Chart of Repeated guest')
plt.show()

In [None]:
# Visualization of overall repeated guests
colors = ['#ff9999','#66b3ff']
df['is_repeated_guest'].value_counts().plot.pie(explode = [0,0.1],autopct='%1.1f%%',shadow=True, startangle=90,colors = colors,figsize=(12,8),fontsize=15,labels=None)
labels = df['is_repeated_guest'].value_counts().index
plt.title('Percentage (%) of repeated guests')
plt.legend(bbox_to_anchor=(0.85, 1), loc='upper left', labels=labels)
plt.show()

0-New Guest

1-Repeated Guest

##### 2. What is/are the insight(s) found from the chart?

Both hotels have a low retention rate: 3.9% overall, with 3.11% for City Hotel and 5.02% for Resort Hotel.

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

Both hotels need to look into why their retention rate is so low and develop a strategy that encourages guests to return.

#### Chart - 14

In [None]:
# Applying groupby on arrival_date_month & hotel and evaluating Mean on adr. 
bookings_by_months_df = df.groupby(['arrival_date_month','hotel'])['adr'].mean().reset_index()

#create month list
month = ['January', 'February', 'March', 'April', 'May', 'June', 'July', 'August', 'September', 'October', 'November', 'December']

# Ordering the months according to the month list.
bookings_by_months_df['arrival_date_month'] = pd.Categorical(bookings_by_months_df['arrival_date_month'],categories = month,ordered= True)

# Now framing the dataframe according to the order of months
bookings_by_months_df = bookings_by_months_df.sort_values('arrival_date_month')
bookings_by_months_df

In [None]:
# Chart - 14 visualization code
sns.relplot(data = bookings_by_months_df,x = 'arrival_date_month' , y = 'adr',hue = 'hotel',kind = 'line',height = 9,aspect = 2.5,lw = 3,palette = "husl")
plt.title('ADR across each month')
plt.show()


##### 2. What is/are the insight(s) found from the chart?

For Resort Hotel, ADR starts to increase in May and rises until July, after which it starts to decrease.

For City Hotel, ADR starts to increase in March and rises until April, after which it starts to decrease.

#### Chart - 15

In [None]:
# Chart - 15 Visualisation Code
temp_df = df.groupby('lead_time')['is_canceled'].describe()
sns.relplot(data = temp_df,x = temp_df.index,y = temp_df['mean']*100,kind = 'scatter',aspect =1.5,height = 6)
plt.show()

##### 1. Why did you pick the specific chart?

A scatter plot is used to examine the relationship between two numerical variables.

##### 2. What is/are the insight(s) found from the chart?

Lead time is positively correlated with the mean cancellation rate: as lead time increases, the chance of a booking being cancelled also increases.
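A scatter of one point per distinct lead time gives rarely seen lead times the same visual weight as common ones. One way to check the trend is to group lead time into a few bins and compare the cancellation rate per bin. The sketch below does this on made-up rows; the bin edges, labels and toy values are assumptions chosen only for illustration.

```python
import pandas as pd

# Toy rows (made up) standing in for the lead_time and is_canceled columns.
toy = pd.DataFrame({
    "lead_time": [2, 5, 20, 45, 60, 120, 200, 300, 15, 400],
    "is_canceled": [0, 0, 0, 1, 0, 1, 1, 1, 0, 1],
})

# Bin edges are arbitrary and chosen only for illustration.
bins = [-1, 30, 90, 180, 365, float("inf")]
labels = ["0-30", "31-90", "91-180", "181-365", "365+"]
toy["lead_time_bin"] = pd.cut(toy["lead_time"], bins=bins, labels=labels)

# Mean of the 0/1 flag per bin, expressed as a percentage.
binned_rate = (
    toy.groupby("lead_time_bin", observed=True)["is_canceled"].mean().mul(100)
)
print(binned_rate)

# Pearson correlation between lead time and the 0/1 cancellation flag.
lead_cancel_corr = toy["lead_time"].corr(toy["is_canceled"])
print(lead_cancel_corr)
```

On the real data, a rate that rises from bin to bin would support the positive relationship read off the scatter plot.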

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

Hotels should try to reduce lead time to limit booking cancellations by guests.

#### Chart - 16

In [None]:
# Chart 16 Visualisation Code
sns.catplot(data = df,x = 'deposit_type',kind = 'count',hue = 'is_canceled',height = 7,aspect = 1.5,palette = "husl")
plt.show()

##### 2. What is/are the insight(s) found from the chart?

Around 25% of bookings made with no deposit were cancelled. That share is large if the hotels cannot replace the cancelled bookings in time, and it suggests that guests who pay no deposit account for a large number of cancelled reservations. It is also interesting that non-refundable deposits had more cancellations than refundable deposits. One would have expected refundable deposits to have more cancellations, since hotel rates are usually higher for refundable rooms and customers pay more in anticipation of a possible cancellation.
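A count plot shows absolute numbers, so a small category can hide a high cancellation rate. A row-normalized cross-tabulation makes the rate per deposit type explicit. The sketch below uses made-up rows; the toy counts are assumptions and do not reflect the real dataset.

```python
import pandas as pd

# Toy rows (made up) using the deposit_type labels from the dataset.
toy = pd.DataFrame({
    "deposit_type": ["No Deposit"] * 4 + ["Non Refund"] * 2 + ["Refundable"] * 2,
    "is_canceled": [0, 0, 0, 1, 1, 1, 0, 1],
})

# normalize="index" turns each row into shares that sum to 1.
deposit_rates = pd.crosstab(
    toy["deposit_type"], toy["is_canceled"], normalize="index"
).mul(100)
print(deposit_rates)
```

Column `1` of the result holds the percentage of cancelled bookings for each deposit type, which can be compared directly across categories of very different size.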

#### Chart - 17

In [None]:
# Applying groupby on distribution_channel and taking the sum of 'is_canceled'
canceled_df = df.groupby(['distribution_channel'])['is_canceled'].sum().reset_index().rename(columns = {'is_canceled':"total_canceled"})
canceled_df

In [None]:
# Applying groupby on distribution_channel to find the number of bookings made through each channel.
book_df = df.groupby('distribution_channel').size().reset_index().rename(columns = {0:'bookings'})
book_df

In [None]:
# Merging above two dataframe
merged_df = pd.merge(canceled_df,book_df,how = 'inner',on = 'distribution_channel')
# Adding a column % cancellation
merged_df['% cancellation'] = merged_df['total_canceled']*100/merged_df['bookings']
merged_df

In [None]:
# Chart 17 Visualisation Code
sns.catplot(data = df,x = 'distribution_channel',kind = 'count',hue = 'is_canceled',height = 11,aspect = 2.5,palette = "husl")
plt.show()

##### 2. What is/are the insight(s) found from the chart?

A total of 69028 bookings were made through TA/TO, of which 21400 were cancelled, which is approximately 31%.

#### Chart - 18

In [None]:
# Applying groupby on market_segment and taking the sum of 'is_canceled'.
canceled_df_ = df.groupby(['market_segment'])['is_canceled'].sum().reset_index().rename(columns = {'is_canceled':"total_canceled"})
canceled_df_

In [None]:
# Applying groupby on market_segment to find the number of bookings made in each market segment.
book_df_ = df.groupby('market_segment').size().reset_index().rename(columns = {0:'bookings'})
book_df_

In [None]:
# Merging above two dataframe
merged_df_ = pd.merge(canceled_df_,book_df_,how = 'inner',on = 'market_segment')
# Adding a column % cancellation
merged_df_['% cancellation'] = merged_df_['total_canceled']*100/merged_df_['bookings']
merged_df_

In [None]:
# Chart 18 Visualisation Code
sns.catplot(data = df,x = 'market_segment',kind = 'count',hue = 'is_canceled',height = 11,aspect = 2.5,palette = "husl")
plt.show()

##### 2. What is/are the insight(s) found from the chart?

Online TA has a cancellation rate of around 35%, followed by Groups with around 27%.

#### Chart - 19 - Correlation Heatmap

In [None]:
# Correlation Heatmap visualization code
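A correlation heatmap starts from a correlation matrix of the numeric columns. The sketch below builds one on randomly generated toy data; the column names are borrowed from this notebook, but the values are made up and carry no meaning. The resulting `corr` frame could then be passed to `sns.heatmap(corr, annot=True)` for display.

```python
import numpy as np
import pandas as pd

# Randomly generated toy data (an assumption, not the real bookings).
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "lead_time": rng.integers(0, 300, size=200),
    "adr": rng.normal(100, 20, size=200),
    "total_stay": rng.integers(1, 10, size=200),
    "hotel": rng.choice(["City Hotel", "Resort Hotel"], size=200),
})

# Only numeric columns enter the correlation matrix; 'hotel' is excluded.
corr = toy.select_dtypes(include="number").corr()
print(corr.round(2))
```

Selecting numeric columns first avoids errors from text columns such as `hotel` when computing correlations.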

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

#### Chart - 20 - Pair Plot

In [None]:
# Pair Plot visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

## **5. Solution to Business Objective**

#### What do you suggest the client to achieve Business Objective ? 
Explain Briefly.

Answer Here.

# **Conclusion**

Write the conclusion here.

### ***Hurrah! You have successfully completed your EDA Capstone Project !!!***