In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
hotel_data = pd.read_csv("../input/hotel-booking-demand/hotel_bookings.csv")

In [None]:
hotel_data.head()

In [None]:
type(hotel_data)

In [None]:
hotel_data.shape

In [None]:
hotel_data.describe()

In [None]:
hotel_data.info()

**Missing values appear in children (float), country (object), agent (float) and company (float).**

# Data Cleaning

In [None]:
hotel_data.isnull().sum()

**Dropping the company column because more than 90% of its values are missing, and arrival_date_week_number because it is not needed for this analysis.**

In [None]:
# Drop both columns in a single call
hotel_data.drop(columns=['company', 'arrival_date_week_number'], inplace=True)
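
The "more than 90% missing" criterion can also be checked numerically instead of being read off the `isnull().sum()` output. Below is a minimal sketch on a made-up toy frame (the values and the `threshold` cut-off are assumptions for illustration, not taken from the real dataset):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for hotel_data; all values are made up.
toy = pd.DataFrame({
    "company": [np.nan] * 19 + [40.0],
    "children": [0.0] * 18 + [np.nan, 1.0],
})

# Share of missing values per column (mean of the boolean mask).
missing_share = toy.isnull().mean()

# Hypothetical cut-off: drop columns with more than 90% missing values.
threshold = 0.9
cols_to_drop = missing_share[missing_share > threshold].index.tolist()

cleaned = toy.drop(columns=cols_to_drop)
print(missing_share)
print(cols_to_drop)
```

On the real data the same two lines (`isnull().mean()` and the threshold filter) could be applied to `hotel_data` directly.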

In [None]:
hotel_data.shape

# Data Imputation and Manipulation

**Filling children and agent missing values with median values**

In [None]:
def impute_median(series):
    return series.fillna(series.median())

In [None]:
hotel_data.children = hotel_data['children'].transform(impute_median)
hotel_data.agent = hotel_data['agent'].transform(impute_median)
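
To see what `impute_median` does, here is a small check on a made-up series (the numbers are illustrative only). The median is computed from the non-missing values and then used to fill the gap:

```python
import numpy as np
import pandas as pd

def impute_median(series):
    # Replace NaN with the median of the non-missing values.
    return series.fillna(series.median())

# Made-up values standing in for the children column.
toy_children = pd.Series([0.0, 0.0, 2.0, np.nan, 1.0, 0.0])

filled = impute_median(toy_children)
print(filled.tolist())
```

With an even number of non-missing values the median can fall between two observations, so a count column such as children may end up with a fractional fill value.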

In [None]:
hotel_data.isnull().sum()

**Filling country missing values with mode**

In [None]:
print(hotel_data['country'].mode())

In [None]:
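# Assign the result back instead of calling fillna(inplace=True) on a column selection,
# which may not update hotel_data in newer pandas versions
hotel_data['country'] = hotel_data['country'].fillna(hotel_data['country'].mode()[0])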

In [None]:
hotel_data.isnull().sum()

Done with the missing values.

**I want the arrival year to be categorical**

In [None]:
hotel_data['arrival_date_year'] = hotel_data['arrival_date_year'].astype(str)

# Data Visualization

**We will try to answer the following questions:** 
* What type of hotel has more bookings?
* Which months are the busiest?
* Cancellation rates in the two types of hotels.
* Types of visitors? (No. of adults, children, babies)
* Repeated guests.

# 1. What type of hotel has more bookings?

In [None]:
# Enlarging the pie chart
plt.rcParams['figure.figsize'] = 8,8

# assigning labels and converting them to list 
labels = hotel_data['hotel'].value_counts().index.tolist()

# assigning magnitude and converting to list
sizes = hotel_data['hotel'].value_counts().tolist()

# assigning pie chart color
colors = ["darkorange","lightskyblue"]

# creating pie chart
# autopct formats the percentage label with a Python format string; '%1.1f%%' shows one decimal place followed by a percent sign.
# startangle rotates the start of the first wedge counterclockwise from the positive x-axis by the given number of degrees (90 puts it at the top).
# Wedges are then drawn counterclockwise in the order the values are given.
# textprops sets the text properties, here the font size
plt.pie(sizes,labels=labels,colors=colors,autopct='%1.1f%%',startangle=90, textprops={'fontsize': 14})

**To answer our question, the majority of bookings were made in city hotels. This could be because city hotels tend to be cheaper, more accessible, and better suited to individuals or small groups of visitors.**

# 2. Which months are the busiest?

In [None]:
# We can simply use a countplot as we are visualising categorical data
plt.figure(figsize=(20,5))

# data we will use in a list
l1 = ['hotel','arrival_date_month']

# plotting
month_order = ["January", "February", "March", "April", "May", "June",
               "July", "August", "September", "October", "November", "December"]
sns.countplot(data=hotel_data[l1], x="arrival_date_month", hue="hotel",
              order=month_order).set_title('Illustration of Number of Visitors Each Month')
plt.xlabel('Month')
plt.ylabel('Count')

**From the visualisation, August is the busiest month for both city and resort hotels, whereas bookings are lowest in January for both types of hotels.
This could be due to the weather, as people prefer to go on vacation during more comfortable seasons such as spring and summer rather than winter.**
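
The busiest month can also be read from a table rather than from bar heights. The sketch below uses a made-up toy frame (not the real bookings) and an ordered categorical so that months sort in calendar order instead of alphabetically:

```python
import pandas as pd

month_order = ["January", "February", "March", "April", "May", "June",
               "July", "August", "September", "October", "November", "December"]

# Toy bookings standing in for hotel_data; all rows are made up.
toy = pd.DataFrame({
    "hotel": ["City Hotel", "Resort Hotel", "City Hotel",
              "City Hotel", "Resort Hotel", "City Hotel"],
    "arrival_date_month": ["August", "August", "January",
                           "August", "July", "July"],
})

# An ordered categorical makes groupby sort months by calendar position.
toy["arrival_date_month"] = pd.Categorical(
    toy["arrival_date_month"], categories=month_order, ordered=True
)

# Bookings per month and hotel, one column per hotel type.
monthly_counts = (
    toy.groupby(["arrival_date_month", "hotel"], observed=True)
       .size()
       .unstack(fill_value=0)
)

# Month with the most bookings across both hotel types.
busiest = monthly_counts.sum(axis=1).idxmax()
print(monthly_counts)
print(busiest)
```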

# 3. Cancellation rates in the two types of hotels.

In [None]:
# First we will check proportion of bookings that were cancelled

# Replacing the 1s and 0s in the is_canceled column with "Cancelled" and "Not Cancelled".
hotel_data['is_canceled'] = hotel_data.is_canceled.replace([1,0],["Cancelled","Not Cancelled"])
cancelled_data = hotel_data['is_canceled']

# Plotting a countplot
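# Pass the Series as x explicitly; newer seaborn versions treat the first positional argument as data
sns.countplot(x=cancelled_data).set_title("Cancellation Overview")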
plt.xlabel("Bookings Cancelled")

**We can see that more than 60% of the bookings were not cancelled.**
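
A countplot shows raw counts, so a percentage like this is easier to confirm numerically. Here is a minimal sketch using `value_counts(normalize=True)` on a made-up series (the ten values are illustrative, not the real column):

```python
import pandas as pd

# Made-up 0/1 flags standing in for the original is_canceled column.
toy_is_canceled = pd.Series([0, 1, 0, 0, 1, 0, 0, 1, 0, 0])

# Same relabelling idea as in the notebook, done with map.
labels = toy_is_canceled.map({1: "Cancelled", 0: "Not Cancelled"})

# normalize=True returns shares that sum to 1 instead of counts.
shares = labels.value_counts(normalize=True)
print(shares)
```

In the notebook, the equivalent call would be `hotel_data['is_canceled'].value_counts(normalize=True)`.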

In [None]:
# Let's look at how many bookings were cancelled in each type of hotel
lst1 = ['is_canceled', 'hotel']
type_of_hotel_canceled = hotel_data[lst1]
canceled_hotel = type_of_hotel_canceled[type_of_hotel_canceled['is_canceled'] == 'Cancelled'].groupby(['hotel']).size().reset_index(name = 'count')
canceled_hotel
#sns.barplot(data = canceled_hotel, x = 'hotel', y = 'count').set_title('Graph showing cancellation rates in city and resort hotel')

**We can see that city hotels have nearly three times as many cancellations as resort hotels. This is partly because city hotels have more bookings, as we saw earlier.**
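
Because the raw counts depend on how many bookings each hotel type receives, a cancellation rate per hotel gives a fairer comparison. The following is a sketch on made-up data using `pd.crosstab` with `normalize="index"`, which divides each row by its own total:

```python
import pandas as pd

# Toy bookings standing in for hotel_data; all rows are made up.
toy = pd.DataFrame({
    "hotel": ["City Hotel"] * 6 + ["Resort Hotel"] * 4,
    "is_canceled": ["Cancelled", "Cancelled", "Cancelled",
                    "Not Cancelled", "Not Cancelled", "Not Cancelled",
                    "Cancelled", "Not Cancelled", "Not Cancelled", "Not Cancelled"],
})

# Each row sums to 1: the share of cancelled and kept bookings per hotel.
rates = pd.crosstab(toy["hotel"], toy["is_canceled"], normalize="index")
print(rates)
```

In this toy example the city hotel has three times as many cancellations (3 vs 1) but only twice the cancellation rate (50% vs 25%), which is the distinction between counts and rates.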

# 4. Types of visitors? (No. of adults, children, babies)

In [None]:
# We will just look at number of adults that visit each hotel. We will use a countplot as data is categorical.
sns.countplot(data=hotel_data,x='adults',hue='hotel').set_title("Illustration of number of adults visiting each hotel")

In [None]:
# We'll do the same for children and babies as adults
sns.countplot(data=hotel_data,x='children',hue='hotel').set_title("Illustration of number of children")

In [None]:
sns.countplot(data=hotel_data,x='babies',hue='hotel').set_title("Illustration of number of babies")

**From the three plots, bookings for two adults (couples or pairs) are the most common in each hotel. For both hotels, guests usually do not bring children or babies along, and when they do, it is at most 1-2 children or 1 baby.**

# 5. Repeated guests.

In [None]:
# We will again use a countplot, as we only want to see how many guests came back to each hotel.
sns.countplot(data=hotel_data,x="is_repeated_guest",hue="hotel").set_title("Illustration of number of repeated guests")

**0 means not repeated and 1 means repeated. So we can see that most guests did not return for another visit.**

# Final Analysis

**Here's what the hotels can do to improve business in the future:**

* Resort hotels tend to have fewer bookings than city hotels, so they need to work on their marketing strategy and promote the hotels more, especially on social media.
* Resort hotels could also reduce prices to increase booking rates.
* May-August are the busiest months, so the hotels should target more customers and try to do more business during this period.
* Although city hotels have more bookings, they also have more cancellations. To prevent this, they could take an advance payment for bookings during vacation periods, which would discourage most cancellations. They could also apply no-refund policies or make refund policies stricter so that customers choose not to cancel.
* Most customers travel in pairs and rarely bring children or babies along, so the hotels could advertise in ways that attract more couples as well as business travellers.
* Most guests do not return, but because these customers have already visited once, advertisements should be targeted to encourage them to come back the next time they travel. Returning customers could also be offered special benefits.