# Introduction
## Dataset Introduction

According to the [Hotel booking demand datasets article](https://www.sciencedirect.com/science/article/pii/S2352340918315191), the two hotels are located in Portugal: H1 in the resort region of Algarve and H2 in the city of Lisbon.

The dataset contains 119,390 rows: 40,060 for the resort hotel and 79,330 for the city hotel.

The following analysis is based on the five factors in the table below:

| Factors                         | Attributes                                                                                           |
|---------------------------------|------------------------------------------------------------------------------------------------------|
| Category                        | 'hotel', 'is_canceled', 'reservation_status'                                                         |
| Time                            | 'arrival_date_year', 'arrival_date_day_of_month', 'stays_in_weekend_nights', 'stays_in_week_nights'  |
| Demographic                     | 'adults', 'children', 'babies', 'country'                                                            |
| Marketing and Customer Behavior | 'market_segment', 'distribution_channel', 'is_repeated_guest'                                        |
| Hotel Services                  | 'meal', 'required_car_parking_spaces'                                                                |

## Sections of Data Analysis

### **Overall View**
1. Input Data
2. Category
3. Demographic

### **Marketing Strategy**
4. Marketing
5. Time
6. Services

7. Conclusion

# Input Data

In [None]:
import os
import pandas as pd
import plotly.express as px

pd.set_option("display.precision", 2)

# Required for Plotly figures to display on Kaggle
from plotly.offline import init_notebook_mode
import plotly.graph_objs as go
init_notebook_mode(connected=True)


for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))
ds=pd.read_csv('/kaggle/input/hotel-booking-demand/hotel_bookings.csv')



print('ds.info')
print('='*30)
ds.info()
print('ds.describe')
print('='*30)
ds.describe() 
 

# Missing value check: the country, children, agent and company columns contain NULL values
# print('Missing value')
# print('='*30)
# df_missing_value=ds.isnull()
# for i in df_missing_value.columns:
#     d=df_missing_value[i].value_counts()
 
#     print(d)
#     print()
#     print('-'*30)
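
A more compact way to run the same missing-value check is sketched below. It is only an illustration: the `toy` frame and its values are hypothetical stand-ins for `ds`, and the idea is simply to count nulls per column and keep the columns that have any.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for `ds`; the values are made up.
toy = pd.DataFrame({
    "hotel": ["City Hotel", "Resort Hotel", "City Hotel"],
    "country": ["PRT", None, "GBR"],
    "children": [0.0, np.nan, 1.0],
})

# Count missing values per column, then keep only columns with at least one.
missing = toy.isnull().sum()
missing = missing[missing > 0]
print(missing)
```

Applied to the real data, the same two lines on `ds` would list only the columns that need attention.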



# Category

The category section gives an overall view of reservation status, first for the whole dataset and then broken down by hotel.

## Total Number of Guests by Reservation Status

The first bar chart shows that check-out is the largest group, with 75,166 guests: 46,228 from the city hotel and 28,938 from the resort hotel. 'No-Show' (guests who did not show up and gave no reason) is the smallest group. Canceled bookings total 43,017: 32,186 from the city hotel and 10,831 from the resort hotel.

In conclusion, the city hotel accounts for the larger share of both canceled and checked-out bookings, while the resort hotel has a lower cancellation rate than the city hotel.
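
The comparison of cancellation rates can be checked with a little arithmetic. The sketch below uses only the counts quoted in this notebook; the per-hotel no-show counts are not quoted anywhere, so they are derived here by subtraction and should be read as an assumption that every booking falls into one of the three statuses.

```python
# Booking counts quoted in this notebook.
total = {"City Hotel": 79330, "Resort Hotel": 40060}
check_out = {"City Hotel": 46228, "Resort Hotel": 28938}
canceled = {"City Hotel": 32186, "Resort Hotel": 10831}

cancel_rate = {}
no_show = {}
for hotel in total:
    # Assumption: whatever is neither checked out nor canceled is a no-show.
    no_show[hotel] = total[hotel] - check_out[hotel] - canceled[hotel]
    # Cancellation rate as a percentage of all bookings for that hotel.
    cancel_rate[hotel] = 100 * canceled[hotel] / total[hotel]
    print(f"{hotel}: canceled {cancel_rate[hotel]:.1f}%, no-show {no_show[hotel]}")
```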

In [None]:
# Per-hotel subsets, kept for reference; the charts below use the grouped counts
df_city=ds.loc[ds['hotel'] == 'City Hotel']
df_resort=ds.loc[ds['hotel'] == 'Resort Hotel']

# Count rows per reservation status, for the whole dataset and per hotel
df_whole=ds.groupby(['reservation_status']).size().reset_index(name='guest_count')
df_guest_count=ds.groupby(['hotel', 'reservation_status']).size().reset_index(name='guest_count')

fig_w = px.bar(df_whole, 
             x="reservation_status", 
             y="guest_count", 
             title='Total Number of Guests by Reservation Status',
             text='guest_count',
            )
fig_w.update_traces(textposition='outside')

fig_w.show()

fig_h= px.bar(df_guest_count, 
             x="reservation_status", 
             y="guest_count", 
             color='hotel',
             title='Total Number of Guests by Reservation Status and Hotels',
             text='guest_count',
            )
fig_h.update_traces(textposition='outside')

fig_h.show()






# Demographic

This section analyzes reservation status by country and the total number of guests by age group.

## Country 

The dataset contains 176 countries in total.

### Bubble Map 

The bubble map below shows the total guest count per country for each reservation status: canceled, check-out and no-show (guests who did not show up and gave no reason). Use the slider to switch between the views. In the canceled view, the largest bubble (shown in green) is Portugal, which has the most canceled bookings at 26,756. In the check-out view there are five large bubbles: Portugal (21,071), United Kingdom (9,676), France (8,481), Spain (6,391) and Germany (6,069). These are the top five countries by checked-out guests.



In [None]:
import pycountry

df_country=ds.groupby([ 'reservation_status','country']).size().reset_index().rename(columns={0:'guest_count'})

list_alpha_2 = [i.alpha_2 for i in list(pycountry.countries)]
list_alpha_3 = [i.alpha_3 for i in list(pycountry.countries)]    

def country_flag(df):
    if (len(df['country'])==2 and df['country'] in list_alpha_2):
        return pycountry.countries.get(alpha_2=df['country']).name
    elif (len(df['country'])==3 and df['country'] in list_alpha_3):
        return pycountry.countries.get(alpha_3=df['country']).name
    else:
        return 'Invalid Code'

df_country['country_name']=df_country.apply(country_flag, axis = 1)

fig = px.scatter_geo(
    df_country,
    locations="country",
    color="country",
    hover_name="country_name",
    size="guest_count",
    animation_frame="reservation_status",
    projection="natural earth",
)

fig.update_layout(
    title_text='Total Guest Count By Country',
    showlegend=True,
    margin=dict(t=0, l=0, r=0, b=0),
)

fig.show()
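
The code-to-name lookup above runs a Python function once per row. As a possible alternative, the sketch below maps the codes in one step with `Series.map` and a dictionary built up front. The `CODE_TO_NAME` table is a hypothetical four-entry stand-in; in the notebook it would be built from `pycountry`.

```python
import pandas as pd

# Hypothetical, tiny lookup table mixing 3-letter and 2-letter codes.
CODE_TO_NAME = {
    "PRT": "Portugal",
    "GBR": "United Kingdom",
    "FRA": "France",
    "CN": "China",
}

codes = pd.Series(["PRT", "GBR", "CN", "XYZ"])

# Codes missing from the table become NaN, which is then labeled explicitly.
names = codes.map(CODE_TO_NAME).fillna("Invalid Code")
print(names.tolist())
```

Building the dictionary once keeps the mapping readable and avoids repeated membership tests against the full code lists.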


### Sunburst chart: Segmentation of Countries and Reservation Status By Hotel
- Click a segment of the sunburst chart to see further detail.

The sunburst chart below shows the segmentation of countries and reservation status by hotel.

According to the chart, the city hotel holds the larger share of bookings, 67% against 33% for the resort hotel. Within the city hotel, check-outs and cancellations split at roughly 6:4 (46,228 check-outs against 32,186 cancellations). Portugal accounts for 61% of the city hotel's canceled bookings but only 24% of its check-outs, which indicates that guests from Portugal have a high probability of canceling their booking.

The resort hotel shows the opposite picture: only 27% of its bookings are canceled and 72% end in check-out. In one respect the result matches the city hotel: guests from Portugal again have a high probability of canceling their booking.

In [None]:
# Sunburst chart for countries
df_country_hotel=ds.groupby([ 'hotel','reservation_status','country']).size().reset_index().rename(columns={0:'guest_count'})
df_country_hotel['country_name']=df_country_hotel.apply(country_flag, axis = 1)

fig =px.sunburst(
    df_country_hotel,
    path=['hotel','reservation_status', 'country_name'],
    values='guest_count',
    color_continuous_scale='RdBu',
    color='guest_count',

)
fig.update_layout(
    margin = dict(t=10, l=10, r=10, b=10)
)



fig.update_traces(hovertemplate='<b>%{label} </b> <br><br>%{value:,.0f}', textinfo='label+percent parent')



fig.show()
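
The percentages on the sunburst come from `textinfo='label+percent parent'`, so each segment is shown as a share of the segment one level up. The sketch below reproduces that calculation with a grouped `transform`. The counts are hypothetical and chosen only to keep the arithmetic easy to follow.

```python
import pandas as pd

# Hypothetical counts, made up for illustration only.
toy = pd.DataFrame({
    "hotel": ["City Hotel"] * 4,
    "reservation_status": ["Canceled", "Canceled", "Check-Out", "Check-Out"],
    "country": ["PRT", "GBR", "PRT", "GBR"],
    "guest_count": [60, 40, 25, 75],
})

# Total of the parent segment (hotel + reservation status) repeated on each row.
parent_total = toy.groupby(["hotel", "reservation_status"])["guest_count"].transform("sum")

# Share of each country within its parent segment, in percent.
toy["percent_parent"] = 100 * toy["guest_count"] / parent_total
print(toy)
```

Read this way, a country's percentage under "Canceled" is its share of that hotel's cancellations, not the share of that country's bookings that were canceled.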

## Age Group

### Pie Chart: Age Group Distribution

The pie chart shows that most guests are adults.

In [None]:
adults = ds['adults'].sum()
children=ds['children'].sum()
babies=ds['babies'].sum()


age_group={'age_group':['adults','children','babies'],
           'counts':[adults,children,babies]}
df_age_group=pd.DataFrame(age_group,columns=['age_group','counts'])


fig = px.pie(df_age_group, 
             values='counts', 
             names='age_group',
             title='Guest by Age Groups',
             hover_data=['age_group'], labels={'age_group':'Age Group'}
            
            )
fig.update_traces(textposition='inside', textinfo='percent+label')
fig.update_layout(uniformtext_minsize=12, uniformtext_mode='hide')
fig.show()


# Marketing

This section presents the analysis of loyal (repeated) guests by country, market segmentation by hotel and reservation status, and distribution channels.

## Sunburst: Segmentation of Countries Repeat Guest

According to the chart below, the city hotel appears to be doing better at customer relationship management than the resort hotel: the sunburst attributes 67% (79,306) to the city hotel and only 33% (39,596) to the resort hotel. Note that the chart is built with `.size()`, which counts every booking with a known country rather than only repeated guests, so these shares largely mirror each hotel's booking volume.

For the city hotel, the largest segments are Portugal, France, Germany, the UK and Spain. For the resort hotel, they are Portugal, the UK, Spain, Ireland and France.




In [None]:

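# Note: .size() counts every booking in each hotel/country group,
# not only the rows where is_repeated_guest == 1
df_repeated_guest=ds.groupby([ 'hotel','country'])['is_repeated_guest'].size().reset_index()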



df_repeated_guest['country_name']=df_repeated_guest.apply(country_flag, axis = 1)

fig=px.sunburst(
    df_repeated_guest,
    path=['hotel','country_name'],
    values='is_repeated_guest',
    color_continuous_scale='RdBu',
    color='is_repeated_guest',
    maxdepth=2
)
fig.update_traces(hovertemplate='<b>%{label} </b> <br><br>%{value:,.0f}', textinfo='label+percent parent')
fig.update_layout(

    margin = dict(t=0, l=0, r=0, b=0)
)

fig.show()
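
Because `is_repeated_guest` is a 0/1 flag, the choice of aggregation matters: `.size()` counts all rows in a group, `.sum()` counts the repeated guests, and `.mean()` gives the repeat rate. The sketch below shows the difference on a hypothetical toy frame. The values are made up and are not results from the dataset.

```python
import pandas as pd

# Hypothetical bookings: 4 at the city hotel (1 repeated), 2 at the resort hotel (1 repeated).
toy = pd.DataFrame({
    "hotel": ["City Hotel"] * 4 + ["Resort Hotel"] * 2,
    "is_repeated_guest": [0, 0, 0, 1, 0, 1],
})

by_hotel = toy.groupby("hotel")["is_repeated_guest"]
summary = pd.DataFrame({
    "bookings": by_hotel.size(),        # all rows in the group
    "repeated_guests": by_hotel.sum(),  # rows where the flag is 1
    "repeat_rate": by_hotel.mean(),     # share of rows where the flag is 1
})
print(summary)
```

In this toy example the city hotel has more bookings but the lower repeat rate, which is why a chart based on `.size()` alone cannot rank the hotels on guest loyalty.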

## Market Segment

Market segment designation. In the categories, "TA" means "Travel Agents" and "TO" means "Tour Operators".

## Sunburst Chart: Segmentation of Distribution Channel by Reservation Status and Hotel

According to the two sunburst charts below, the main distribution channel used by both hotels is TA/TO.

In [None]:
df_market_segment=ds.groupby(['hotel','market_segment','reservation_status']).size().reset_index().rename(columns={0:'guest_count'})
df_distribution_channel=ds.groupby([ 'hotel','reservation_status','distribution_channel']).size().reset_index().rename(columns={0:'guest_count'})


fig_m=px.sunburst(
    df_market_segment,
    path=['reservation_status','hotel','market_segment'],
    values='guest_count',
    color_continuous_scale='RdBu',
    color='guest_count',
    maxdepth=3
)


fig_d=px.sunburst(
    df_distribution_channel,
    path=['reservation_status','hotel','distribution_channel'],
    values='guest_count',
    color_continuous_scale='RdBu',
    color='guest_count',
    maxdepth=3,
    
)

fig_m.update_traces(hovertemplate='<b>%{label} </b> <br><br>%{value:,.0f}', textinfo='label+percent parent')
fig_d.update_traces(hovertemplate='<b>%{label} </b> <br><br>%{value:,.0f}', textinfo='label+percent parent')

fig_m.update_layout(margin=dict(t=0, l=0, r=0, b=0))
fig_d.update_layout(margin=dict(t=0, l=0, r=0, b=0))

fig_m.show()
fig_d.show()

# Analysis by Time Factor

## Line Graph: Total Actual Guests over Time

The line graph shows the total number of actual guests (bookings that were not canceled) per year from 2015 to 2017. The trend is uneven: it starts at its lowest point of 13,854 in 2015, rises sharply to 36,370 (36.37k) in 2016, then drops to 24,942 in 2017. Note that the dataset covers arrivals from July 2015 to August 2017, so 2015 and 2017 are partial years.

Overall, the peak is in 2016 with 36.37k guests.

## Pie Chart: Percentage of Weekday and Weekend Stays by Total Actual Guests

The pie chart shows that most nights are spent on weekdays: 72.6%, compared with 27.4% on weekends. The gap is not small. One possible explanation, which this analysis does not test, is that room rates are lower on weekdays, so guests prefer to stay then.
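
To put the 72.6% figure in context, it helps to compare it with a simple baseline. The sketch below assumes that week nights are Monday to Friday and weekend nights are Saturday and Sunday (an assumption about the column definitions, not something computed in this notebook) and asks what share of nights would fall on weekdays if stays were spread evenly across the week.

```python
# Assumption: 5 week nights (Mon-Fri) and 2 weekend nights (Sat-Sun) per week.
week_nights, weekend_nights = 5, 2

# Share of nights that would be week nights under an even spread, in percent.
baseline_weekday = 100 * week_nights / (week_nights + weekend_nights)

# Weekday share reported by the pie chart.
observed_weekday = 72.6

# Difference between the observed share and the baseline, in percentage points.
difference = observed_weekday - baseline_weekday
print(f"baseline {baseline_weekday:.1f}%, observed {observed_weekday}%, difference {difference:.1f} points")
```

Under this assumption the observed share sits close to the even-spread baseline, so any price-based explanation would need room-rate data to support it.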


## Bar Chart: Total Actual Guests by Length of Stay on Weekends and Weekdays

The two bar charts show that most guests stay for one or two nights, whether on weekends or on weekdays.

In [None]:
from plotly.subplots import make_subplots
import plotly.graph_objects as go

# Actual Guest by year, weekday and weekend 

# Weekend and weekday
df_actual_guest=ds[ds['is_canceled']==0]
weekend_count=df_actual_guest['stays_in_weekend_nights'].sum()
weekday_count=df_actual_guest['stays_in_week_nights'].sum()

df_stay_length_wk=df_actual_guest.groupby('stays_in_weekend_nights').size().reset_index().rename(columns={0:'guest_count'})
df_stay_length_wd=df_actual_guest.groupby('stays_in_week_nights').size().reset_index().rename(columns={0:'guest_count'})

df_actualguest_by_year=df_actual_guest.groupby(['is_canceled','arrival_date_year']).size().reset_index().rename(columns={0:'guest_count'})


fig = make_subplots(rows=1, cols=2, specs=[[{}, {"type": "pie"}]])

# Line graph: actual guests per arrival year
fig.add_trace(
    go.Scatter(
        x=list(df_actualguest_by_year.arrival_date_year),
        y=list(df_actualguest_by_year.guest_count),
    ),
    row=1, col=1,
)

# Pie chart: week nights versus weekend nights
fig.add_trace(
    go.Pie(
        values=[weekday_count, weekend_count],
        labels=['Stay in weekday', 'Stay in weekend'],
    ),
    row=1, col=2,
)

fig.update_layout(title_text="Total Actual Guest by Time Factor")
fig.show()

# Bar charts: number of guests by length of stay
fig_length_stay = make_subplots(rows=1, cols=2, specs=[[{"type": "xy"}, {"type": "xy"}]])

fig_length_stay.add_trace(
    go.Bar(
        x=list(df_stay_length_wk.stays_in_weekend_nights),
        y=list(df_stay_length_wk.guest_count),
        name='Total Count of Stay in Weekend Nights',
    ),
    row=1, col=1,
)

fig_length_stay.add_trace(
    go.Bar(
        x=list(df_stay_length_wd.stays_in_week_nights),
        y=list(df_stay_length_wd.guest_count),
        name='Total Count of Stay in Week Nights',
    ),
    row=1, col=2,
)

fig_length_stay.update_layout(title_text="Total Actual Guest by Length Stay in Weekend and Weekday")
fig_length_stay.show()



# Services

## Meal

Type of meal booked. Categories are presented as standard hospitality meal packages: Undefined/SC: no meal package; BB: Bed & Breakfast; HB: Half board (breakfast and one other meal, usually dinner); FB: Full board (breakfast, lunch and dinner).

## Bar Chart: Popular Meal Selection By Hotel

According to the bar chart below, most guests choose BB (the Bed & Breakfast package) at both the city hotel and the resort hotel.

The second most popular choice at the city hotel is no meal package, while at the resort hotel it is half board (breakfast and one other meal, usually dinner).



In [None]:
df_meal=ds.groupby(['hotel','meal']).size().reset_index().rename(columns={0:'guest_count'})


fig = px.bar(df_meal, x="hotel", y="guest_count", color='meal', barmode='group',
             height=400,text='guest_count'
         )
fig.update_layout(title_text='Popular Meal Package Selection By Hotel')
fig.show()

## Bar Chart: Number of Car Parking Spaces Required By Hotel

According to the bar chart below, most guests at both the city hotel and the resort hotel require no car parking space. Only a small number of guests need one car parking space.

In [None]:
df_parking_lot=ds.groupby(['hotel','required_car_parking_spaces']).size().reset_index().rename(columns={0:'guest_count'})


fig = px.bar(df_parking_lot, x="required_car_parking_spaces", y="guest_count", color='hotel',
             height=500,text='guest_count'
         )
fig.update_layout(title_text='Number of Car Parking Spaces Requirement By Hotel',barmode='group')
fig.show()

# Conclusion



In conclusion, both hotels mainly use the same distribution channel, TA/TO, yet the city hotel is the first choice of guests. However, the city hotel has a higher cancellation rate than the resort hotel, whose cancellations are low relative to its check-outs. In terms of demographics, most guests come from Portugal (21,071), United Kingdom (9,676), France (8,481), Spain (6,391) and Germany (6,069). Guests from Portugal have a high probability of canceling their booking at both the city hotel and the resort hotel. Furthermore, the largest proportion of guests are adults. This suggests that most guests are couples or tour groups, and that families with children are the smallest group.

For marketing strategy, the city hotel appears to be doing better at customer relationship management than the resort hotel, because the repeat guest chart attributes more loyal customers to it. Most nights are spent on weekdays (72.6%) rather than on weekends (27.4%), which is not a small gap. In addition, guests at both hotels mostly stay for one or two nights, whether on weekends or on weekdays. For meal packages, most guests choose BB (the Bed & Breakfast package) at both hotels. The second most popular choice at the city hotel is no meal package, while at the resort hotel it is half board (breakfast and one other meal, usually dinner). Most guests at both hotels require no car parking space, and only a small number need one.


Lastly, the city hotel should work on reducing its higher cancellation rate, while the resort hotel should invest in customer relationship management to keep its guests loyal. In addition, this data analysis is not yet sufficient to analyze booking cancellations and room rates.



