# Description

This data set contains booking information for a city hotel and a resort hotel, and includes information such as when the booking was made, length of stay, the number of adults, children, and/or babies, and the number of available parking spaces, among other things.

All personally identifying information has been removed from the data.

## Gathering

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

df=pd.read_csv('../input/hotel-booking-demand/hotel_bookings.csv')
df.head()

In [None]:
df.shape

## Assessing

In [None]:
# Check for null values
df.isnull().sum()

In [None]:
df.iloc[0]

In [None]:
df.customer_type.value_counts()

In [None]:
df.dtypes

In [None]:
df.company.value_counts()

In [None]:
# A booking cannot have zero adults, zero children, and zero babies at the same time,
# so inspect the values these columns take
print(df.adults.unique())
print(df.children.unique())
print(df.babies.unique())


In [None]:
df.columns

## Assessing Results

- There are null values
- The is_canceled column should be categorical, not integer
- The arrival_date_week_number column should be removed, as it is not needed for this analysis
- agent should be a string, not a number
- adults, children, and babies cannot all be zero at the same time

# Cleaning

In [None]:
df_copy=df.copy()

## Define

There are null values in the country, agent, and company columns (and a few in children).

- When agent is null, the booking was made without the help of a travel agency
- When company is null, the booking may be private; since 90% of the values are null, we will drop the column
- The null values in country will be changed to 'not mentioned'
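
The decision to drop a column depends on how much of it is missing. A minimal sketch of that check, using a small made-up frame (`toy` and its values are assumptions, not the real bookings data):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the bookings data (values are made up).
toy = pd.DataFrame({
    "country": ["PRT", "GBR", None, "ESP"],
    "agent": [9.0, np.nan, 240.0, np.nan],
    "company": [np.nan, np.nan, np.nan, 110.0],
})

# Percentage of missing values per column, largest first.
null_share = toy.isnull().mean().mul(100).sort_values(ascending=False)
print(null_share)
```

Running the same expression on `df_copy` would show which columns are mostly empty and are candidates for dropping.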


In [None]:
df_copy.isnull().sum()

## Code

In [None]:
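# Assign the results back instead of calling fillna(inplace=True) on a column,
# which may not modify df_copy in recent pandas versions
df_copy['agent']=df_copy['agent'].fillna('0')
df_copy=df_copy.drop('company',axis=1)
df_copy['country']=df_copy['country'].fillna('not mentioned')
df_copy['children']=df_copy['children'].fillna(df_copy['children'].median())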


## Test

In [None]:
df_copy.isnull().sum()

## Define

- adults, children, and babies cannot all be zero at the same time

## Code

In [None]:
Filter= (df_copy.adults==0) & (df_copy.children==0) & (df_copy.babies==0) 
df_copy[Filter]

In [None]:
# Drop the rows with no guests from the working copy
df_copy=df_copy[~Filter]

## Test

In [None]:
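# Recompute the filter on the current frame; the result should be empty
df_copy[(df_copy.adults==0) & (df_copy.children==0) & (df_copy.babies==0)]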
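
The same test can be written as an assertion rather than a visual check. This is only a sketch on a made-up frame (`toy` is an assumption); it relies on guest counts never being negative, so a row sum of zero means all three columns are zero:

```python
import pandas as pd

# Toy bookings (made-up values); the second row has no guests at all.
toy = pd.DataFrame({
    "adults":   [2, 0, 1, 0],
    "children": [0, 0, 1, 0],
    "babies":   [0, 0, 0, 1],
})

# With non-negative counts, a row sum of 0 means all three columns are 0.
no_guests = toy[["adults", "children", "babies"]].sum(axis=1) == 0

# Keep only bookings with at least one guest, then verify none are left.
cleaned = toy[~no_guests]
remaining = cleaned[["adults", "children", "babies"]].sum(axis=1) == 0
assert not remaining.any()
```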

## Define

- The is_canceled column should be categorical, not integer
- The arrival_date_week_number column should be removed, as it is not needed for this analysis
- agent should be a string, not a number
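
One detail to watch when turning numeric IDs into strings: if the IDs are stored as floats, a direct cast keeps the decimal part. A minimal sketch on made-up agent IDs, assuming a missing agent is coded as 0:

```python
import numpy as np
import pandas as pd

# Toy agent IDs (made-up values); NaN means no travel agent was involved.
agent = pd.Series([9.0, np.nan, 240.0])

# Casting the float column directly keeps the decimal part, e.g. '9.0'.
as_str_direct = agent.astype(str)

# Filling with 0, casting to int and then to str gives clean labels: '9', '0', '240'.
as_str_clean = agent.fillna(0).astype(int).astype(str)
print(as_str_clean.tolist())
```

Whether the clean labels matter depends on how the column is used later; for grouping alone, either form works.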

## Code

In [None]:
df_copy.is_canceled=df_copy.is_canceled.astype('category')
df_copy.agent=df_copy.agent.astype(str)
df_copy.drop('arrival_date_week_number',axis=1,inplace=True)

## Test

In [None]:
df_copy.info()

In [None]:
df_copy.dtypes

# Analysis

## What type of Hotel has more bookings?

In [None]:
df_copy['hotel'].value_counts().index.tolist()

In [None]:
df_copy['hotel'].value_counts().plot(kind='pie',figsize=(6,6),fontsize=13,autopct='%1.1f%%',explode=(0, 0.1));

#### City hotels clearly account for the majority of bookings

## Where do the guests come from?

In [None]:
#This package imports definitions for all of Plotly's graph objects.
#the module graph_objs is to provide a clearer API for users.
import plotly.graph_objs as go 

#Plotly Offline allows you to create graphs offline and save them locally.
#Instead of saving the graphs to a server, your data and graphs will remain in your local system.
from plotly.offline import plot, iplot, init_notebook_mode
init_notebook_mode(connected=True)
#Plotly Express is the easy-to-use, high-level interface to Plotly
import plotly.express as px 
    

In [None]:
#Reset the index of the DataFrame, and use the default one instead.
countries_data=df_copy[df_copy['is_canceled']==0]['country'].value_counts().reset_index()
countries_data.columns=['country','no of guests']
countries_data.head()

In [None]:
#A Choropleth Map is a map composed of colored polygons. 
#It is used to represent spatial variations of a quantity

px.choropleth(countries_data,locations=countries_data['country'],
              color=countries_data['no of guests'],
             hover_name=countries_data['country'],
             title='Home country of guests')

In [None]:
df_copy['hotel'].value_counts()

In [None]:
Resort=df_copy[(df_copy['hotel']=='Resort Hotel') & (df_copy['is_canceled']== 0)]
City=df_copy[(df_copy['hotel']=='City Hotel') & (df_copy['is_canceled']== 0)]
print(Resort.shape)
print(City.shape)

In [None]:
Resort['country'].value_counts()[:15].plot(kind='bar');

In [None]:
City['country'].value_counts()[:15].plot(kind='bar') 


**From the map and the bar plots above, we can conclude that most guests come from Europe, especially Portugal (PRT).**

## How much do guests pay for a room per night?

In [None]:
plt.figure(figsize=(12,8))
sns.boxplot(x='reserved_room_type',y='adr',
            data=df_copy[df_copy.is_canceled == 0] ,hue='hotel')
plt.title('Price of the room per night',fontsize=15)
plt.xlabel('Room type',fontsize=12)
plt.ylabel('price in Euro',fontsize=12)


**Room type A has the highest single price (its largest outlier is the highest of all room types), while type G rooms are clearly more costly overall than the other types.**
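
A box plot mixes two ideas: the typical price (the median line) and the extreme price (the outliers). A small sketch that separates them, on made-up prices (`toy` is an assumption, not the real data):

```python
import pandas as pd

# Toy prices (made-up values): type A has one extreme outlier,
# type G is consistently expensive.
toy = pd.DataFrame({
    "reserved_room_type": ["A", "A", "A", "G", "G", "G"],
    "adr": [80.0, 90.0, 500.0, 160.0, 170.0, 180.0],
})

# Median shows the typical price; max shows the largest outlier.
summary = toy.groupby("reserved_room_type")["adr"].agg(["median", "max"])
print(summary)
```

In this toy example the room type with the highest maximum is not the one with the highest median, which is the distinction to keep in mind when reading the plot.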

## How does the room price vary across the year?

In [None]:
!pip install sort-dataframeby-monthorweek


In [None]:
!pip install sorted-months-weekdays


In [None]:
Resort.head()

In [None]:
City.head()

In [None]:
Resort_Hotel=Resort.groupby('arrival_date_month')['adr'].mean().reset_index()
City_Hotel=City.groupby('arrival_date_month')['adr'].mean().reset_index()

In [None]:
Resort_Hotel.head()

In [None]:
City_Hotel.head()

In [None]:
final=Resort_Hotel.merge(City_Hotel,on='arrival_date_month')
final.columns=['month','resort_price','city_price']
final

In [None]:
# Sort the months into calendar order
import sort_dataframeby_monthorweek as sd

final=sd.Sort_Dataframeby_Month(final,'month')
final.head()

In [None]:
# A line chart connects the monthly average prices with a continuous line,
# which makes the seasonal pattern easy to see.
px.line(final,x='month',y=['resort_price','city_price'],title='Room price per night over the year')

**The room price at the Resort hotel peaks in August, while the City hotel room price peaks in August and May.**
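
The calendar ordering of months can also be done with pandas alone, without an extra package. This is an alternative sketch, not the method used above; `toy` and its prices are made up:

```python
import pandas as pd

# Toy monthly averages (made-up values), in alphabetical order as groupby returns them.
toy = pd.DataFrame({
    "month": ["August", "January", "May"],
    "resort_price": [180.0, 50.0, 75.0],
})

# English month names in calendar order, matching the spelling in the data.
month_order = ["January", "February", "March", "April", "May", "June",
               "July", "August", "September", "October", "November", "December"]

# An ordered categorical makes sort_values follow calendar order instead of A-Z.
toy["month"] = pd.Categorical(toy["month"], categories=month_order, ordered=True)
toy_sorted = toy.sort_values("month").reset_index(drop=True)
print(toy_sorted)
```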

## What do guests prefer?

In [None]:
df_copy['meal'].value_counts()

In [None]:
px.pie(df_copy,values=df_copy['meal'].value_counts(),names=df_copy['meal'].value_counts().index)

**We can conclude that most customers prefer BB (Bed & Breakfast).**

In [None]:
df_copy.total_of_special_requests

In [None]:
# countplot() shows the number of observations
# in each categorical bin using bars
sns.countplot(x=df_copy['total_of_special_requests']);

**The largest share of customers (nearly 50%) make no special requests.**
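
A percentage read off a bar chart is easier to verify as a number. A minimal sketch using `value_counts(normalize=True)` on made-up request counts (the series below is an assumption, so its percentages say nothing about the real data):

```python
import pandas as pd

# Toy special-request counts per booking (made-up values).
requests = pd.Series([0, 0, 1, 0, 2, 1, 0, 0], name="total_of_special_requests")

# Share of bookings for each number of requests, in percent.
share = requests.value_counts(normalize=True).sort_index().mul(100)
print(share)
```

Applied to `df_copy['total_of_special_requests']`, the first entry gives the exact share of bookings without any special request.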

## Which month is the busiest?

In [None]:
rush_resort=Resort.arrival_date_month.value_counts().reset_index()
rush_resort.columns=['month','no of guests']
rush_resort.head()

In [None]:
rush_city=City.arrival_date_month.value_counts().reset_index()
rush_city.columns=['month','no of guests']
rush_city.head()

In [None]:
final_rush=rush_resort.merge(rush_city,on='month')
final_rush.columns=['month','num of Resort guests','num of City guests']
final_rush.head()

In [None]:
# Sort the months into calendar order
import sort_dataframeby_monthorweek as sd

final_rush=sd.Sort_Dataframeby_Month(final_rush,'month')
final_rush.head()

In [None]:
px.line(final_rush,x='month',y=['num of Resort guests','num of City guests'])

**The City hotel has more guests during spring and autumn, even though prices are also high then.**

**At the Resort hotel, visitor numbers dip slightly in June and September, and the busiest months are July and August.**

**August has the most visitors and the highest prices at both hotels.**

## Bookings by market segment

In [None]:
the_filter=df_copy['is_canceled']==0
clean_data=df_copy[the_filter]

In [None]:
clean_data.is_canceled.unique()

In [None]:
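# Total number of nights per booking (weekend nights + week nights).
# assign() returns a new frame, which avoids writing into a slice of df_copy.
clean_data=clean_data.assign(Total_nights=clean_data['stays_in_weekend_nights']+clean_data['stays_in_week_nights'])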

In [None]:
clean_data.market_segment.value_counts()

In [None]:
px.pie(clean_data,values=clean_data.market_segment.value_counts(),
       names=clean_data.market_segment.value_counts().index,
      title='bookings by market segment')

**About 47.5% of bookings are made through online TA (travel agents).**

## Analysis of cancellation

In [None]:
df_copy['customer_type'].value_counts()

In [None]:
df_copy['is_canceled'].value_counts()

In [None]:
#barplot shows the relationship between a numeric and a categoric variable. 
sns.barplot(x=df_copy['customer_type'],y=df_copy['is_canceled'].astype('int64'))

**Transient customers are more likely to cancel their booking than the other customer types.**
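
The bar heights in this kind of plot are the mean of `is_canceled` per group, that is, the cancellation rate. The same numbers can be computed directly, together with the group sizes. A sketch on made-up rows (`toy` is an assumption), with `is_canceled` stored as a category as in the cleaned data:

```python
import pandas as pd

# Toy bookings (made-up values).
toy = pd.DataFrame({
    "customer_type": ["Transient", "Transient", "Transient", "Transient",
                      "Group", "Group", "Contract", "Contract"],
    "is_canceled": [1, 1, 0, 0, 0, 0, 1, 0],
})
toy["is_canceled"] = toy["is_canceled"].astype("category")

# Cast the category back to int, then take the mean (rate) and count per type.
cancel_rate = (
    toy.assign(canceled=toy["is_canceled"].astype(int))
       .groupby("customer_type")["canceled"]
       .agg(["mean", "count"])
)
print(cancel_rate)
```

The `count` column helps judge how much weight a rate deserves: a rate based on a handful of bookings is less reliable than one based on thousands.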

In [None]:
sns.barplot(y=df_copy['days_in_waiting_list'],x=df_copy['is_canceled'].astype('int64'))

**Cancelled bookings have a higher average number of days in the waiting list, which suggests that longer waits go along with more cancellations.**
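
To look at this relationship from the other direction, the waiting time can be binned and the cancellation rate computed per bin. A minimal sketch on made-up rows; the bin edges and labels are assumptions chosen only for illustration:

```python
import pandas as pd

# Toy bookings (made-up values).
toy = pd.DataFrame({
    "days_in_waiting_list": [0, 0, 0, 0, 5, 20, 45, 120],
    "is_canceled": [0, 0, 1, 0, 1, 0, 1, 1],
})

# Bin the waiting time; intervals are right-closed, so (-1, 0] holds exactly 0 days.
bins = [-1, 0, 30, 90, float("inf")]
labels = ["0", "1-30", "31-90", "90+"]
toy["wait_bin"] = pd.cut(toy["days_in_waiting_list"], bins=bins, labels=labels)

# Cancellation rate within each waiting-time bin.
rate_by_wait = toy.groupby("wait_bin", observed=True)["is_canceled"].mean()
print(rate_by_wait)
```

If the rate rises from bin to bin on the real data, that supports the conclusion more directly than comparing average waiting days.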

In [None]:
sns.barplot(x=df_copy['deposit_type'],y=df_copy['is_canceled'].astype('int64'))

**Bookings with the Non Refund deposit type are more likely to be cancelled.**