# Introduction
> The hotel-booking dataset contains 119,390 observations for a City Hotel and a Resort Hotel. Each observation represents a hotel booking between the 1st of July 2015 and the 31st of August 2017, including bookings that effectively arrived and bookings that were canceled. <br>
The data is originally from the article Hotel Booking Demand Datasets, written by Nuno Antonio, Ana Almeida, and Luis Nunes for Data in Brief, Volume 22, February 2019.

In this notebook, I work through some exercises on this dataset, taken from [Filoger](filoger.com)'s AI Bootcamp, which I had signed up for earlier.

# Dataset's Content
> Since this is real hotel data, all data elements pertaining to hotel or customer identification were deleted.
Four columns, 'name', 'email', 'phone-number' and 'credit_card', have been artificially created and added to the dataset.

In [None]:
import numpy as np
import pandas as pd

# Q1 - Reading the dataset's CSV file

In [None]:
df = pd.read_csv("../input/hotel-booking/hotel_booking.csv")

The pandas read_csv() function reads the dataset's CSV file and returns a DataFrame, which is stored in a variable named 'df'.

# Q2 - Dataset's Base information

In [None]:
df.info()

The info() method displays the basic information about the DataFrame: its class, the total number of records and the index range, the column names, the number of non-null values and the data type of each column, and the memory usage of the DataFrame.


# Q3 - Dataset's number of rows

In [None]:
pd.DataFrame([df.shape],columns=['number_of_rows','number_of_columns'],index=['0'])

The shape attribute holds the total number of rows and columns as a tuple, in that order. <br>
I decided to show them as tabular data, so I put them in a DataFrame.
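As a small illustration of the tuple that shape returns, here is a minimal sketch on a made-up two-column frame. The name `toy` and its values are hypothetical and are not taken from the hotel-booking dataset.

```python
import pandas as pd

# Hypothetical toy data, only used to illustrate the shape tuple
toy = pd.DataFrame({"adults": [2, 1, 2], "children": [0, 1, 0]})

# shape is (number_of_rows, number_of_columns), so it can be unpacked directly
n_rows, n_cols = toy.shape

print(n_rows, n_cols)
```

Unpacking the tuple this way is handy when only one of the two numbers is needed later on.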

# Q4 - Does the dataset have any missing data? If so, which column has the most missing data?
Solution source: [yun.ir/dxig38](yun.ir/dxig38)

In [None]:
# Q4-1 - Checking missing data
checkMD = df.isnull().values.any()

# Q4-2 - The column with most missing data
columnMD = df.isnull().sum().idxmax()

# Get the output
pd.DataFrame({
    'check_missing_data_existence': checkMD,
    'column_with_maximum_missing_data': columnMD.upper()
}, index=['0'])

To check whether any record has missing data, I used the 'isnull()' function. It returns a DataFrame of booleans in which a cell is True where the original value is missing (NaN, "Not a Number") and False otherwise. The 'values' attribute then exposes that result as a NumPy array. <br>
At this point every missing cell is flagged, but a single one is enough to confirm that missing data exists. So I used the 'any()' function, which returns True if there is at least one missing value and False otherwise. Finally, the result is stored in the 'checkMD' variable.

To find the column with the most missing data, I used 'isnull()' again, but this time I counted the missing values in each column with the 'sum()' function. To get the name of the column that has the most missing data, I used the 'idxmax()' function.
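The steps described above can be traced on a tiny frame. This is only a sketch: the frame `toy` and its values are invented for illustration and do not come from the hotel-booking dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with a few missing cells
toy = pd.DataFrame({
    "adults": [2, 1, 2, 3],
    "children": [0.0, np.nan, 1.0, 0.0],
    "company": [np.nan, np.nan, 40.0, np.nan],
})

mask = toy.isnull()                     # boolean DataFrame, True where a value is missing
has_missing = bool(mask.values.any())   # True if at least one cell is missing
missing_per_column = mask.sum()         # True counts as 1, giving missing cells per column
worst_column = missing_per_column.idxmax()  # label of the column with the highest count

print(has_missing)
print(missing_per_column.to_dict())
print(worst_column)
```

Looking at `missing_per_column` before calling 'idxmax()' also shows how far ahead the worst column is, which the single label alone does not reveal.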

# Q5 - Drop 'company' column from DataFrame.

In [None]:
df.drop('company',axis=1).columns.tolist()

The 'drop()' function removes the specified column (here, the 'company' column) and returns the result as a new DataFrame. The column could be removed from 'df' itself by passing 'inplace=True' to 'drop()', but I wanted to keep it. <br>
I added the 'columns' attribute to display the column labels that remain once the 'company' column is removed, and the 'tolist()' function to display them as a plain list.
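To make the difference between the returned copy and the original frame concrete, here is a minimal sketch on a hypothetical one-row frame (`toy` and its values are made up for illustration).

```python
import pandas as pd

# Hypothetical toy data, only used to show that drop() returns a new frame
toy = pd.DataFrame({"hotel": ["City Hotel"], "company": [40.0], "adr": [95.5]})

# Without inplace=True, drop() leaves 'toy' untouched and returns a copy
without_company = toy.drop("company", axis=1)

remaining = without_company.columns.tolist()
original = toy.columns.tolist()

print(remaining)
print(original)
```

Assigning the result to a new name, as done here, is one way to keep both versions around without modifying the source frame.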

# Q6 - Which country has the most passengers? Define the top 5 countries.

In [None]:
df['total_passengers'] = (df['adults'] + df['children'] + df['babies']) - df['is_canceled']

df[['country','adults','children','babies','is_canceled','total_passengers']].groupby('country').sum().nlargest(5,'total_passengers')

The question asks for the top 5 countries with the most total passengers. <br>
First, I added the 'adults', 'children' and 'babies' columns together. Since some bookings in the dataset were canceled, I subtracted the 'is_canceled' column from the sum. Note that 'is_canceled' is a 0/1 flag, so this subtracts one passenger per canceled booking rather than excluding the whole booking. The result of the calculation is stored in a new column named 'total_passengers'.

Then, before displaying the results, the rows are grouped by the 'country' column and the columns are summed with the 'sum()' function.

The 'max()' function would not help to display the top 5 countries: it returns the maximum of each column separately rather than ranking the rows by one chosen column. Instead, I used the 'nlargest()' function to request the *top 5* records sorted by the *'total_passengers'* column.

Here we go. The top 5 countries with the most total passengers are displayed in a dataframe.
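A different reading of the question is to count only the guests of bookings that were not canceled. The sketch below shows that variant on invented data. It is not the method used in this notebook, the frame `toy` is hypothetical, and treating a missing 'children' value as zero is an assumption made only for this example.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not taken from the hotel-booking dataset
toy = pd.DataFrame({
    "country": ["PRT", "PRT", "GBR", "GBR", "FRA"],
    "adults": [2, 2, 1, 2, 2],
    "children": [1.0, 0.0, np.nan, 0.0, 2.0],
    "babies": [0, 0, 0, 1, 0],
    "is_canceled": [0, 1, 0, 0, 1],
})

# Keep only bookings that were not canceled
arrived = toy[toy["is_canceled"] == 0].copy()

# Assumption: a missing 'children' value is treated as zero children
arrived["total_passengers"] = (
    arrived["adults"] + arrived["children"].fillna(0) + arrived["babies"]
)

# Sum per country and keep the (up to) 5 largest totals
top = arrived.groupby("country")["total_passengers"].sum().nlargest(5)

print(top)
```

With this variant a canceled booking contributes nothing to its country's total, so the ranking can differ from the one obtained by subtracting 'is_canceled'.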

# Q7 - Define the name of the passenger who has the maximum average daily rate (ADR). What is the price?

In [None]:
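# Look up the name on the row that holds the maximum ADR
name = df.loc[df['adr'].idxmax(), 'name']
price = df['adr'].max()

# Avoid naming the variable 'dict', which would shadow the built-in type
result = {
    'name': name,
    'price': price
}

pd.DataFrame(result, index=['Max(ADR)'])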

The question asks who has the maximum ADR. To answer it, I first found the index label of the row with the maximum ADR using the 'idxmax()' function. Then I looked up the name of that passenger in the 'name' column and stored it in the 'name' variable. <br>
The question also asks for the price, so I took the maximum ADR and displayed it as the price.

Finally, I presented the results in a DataFrame.
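One detail worth illustrating is that 'idxmax()' returns an index label, not a row position. The sketch below uses a hypothetical frame (`toy`, with invented names and rates) whose index does not start at 0, so the difference is visible.

```python
import pandas as pd

# Hypothetical toy data with a non-default index
toy = pd.DataFrame(
    {"name": ["Ann Lee", "Bob Ray", "Cy Dunn"], "adr": [80.0, 210.5, 99.9]},
    index=[10, 20, 30],
)

top_label = toy["adr"].idxmax()         # index label of the row with the highest ADR
top_name = toy.loc[top_label, "name"]   # label-based lookup with .loc
top_price = toy["adr"].max()

print(top_label, top_name, top_price)
```

Because the result is a label, pairing it with label-based access such as '.loc' keeps the lookup correct even when the index is not a simple 0..n-1 range.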

# Q8 - Average of total ADRs with 2 decimals.

In [None]:
df['adr'].mean().round(2)

# Q9 - Define the average number of nights stayed.

In [None]:
df['total_stays_in_nights'] = df['stays_in_week_nights'] + df['stays_in_weekend_nights']

df['total_stays_in_nights'].mean().round(2)

# Q10 - Define the name and e-mail of people who had 5 special requests.

In [None]:
df[df['total_of_special_requests'] == 5][['name','email']]

# Q11 - Which last names are the most frequent? Define the 5 most frequent family names.

In [None]:
df['name'].apply(lambda lname:lname.split()[1]).value_counts().head().to_frame('value_counts')
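The cell above takes the second word of each name as the last name and counts how often each one occurs. Below is a minimal sketch of the same idea using the pandas string accessor, on invented names. It assumes every name consists of exactly a first name and a last name, which may not hold for every row of the real data.

```python
import pandas as pd

# Hypothetical names, each assumed to be "<first> <last>"
names = pd.Series(["Ann Lee", "Bob Lee", "Cy Dunn", "Di Lee", "Ed Dunn"])

# Split on whitespace and take the second token as the last name
last_names = names.str.split().str[1]

# Count occurrences; head() keeps the 5 most frequent by default
counts = last_names.value_counts().head()

print(counts)
```

'value_counts()' already sorts in descending order, so 'head()' is enough to get the most frequent entries.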

# Q12 - Define the person who reserved a hotel with the largest number of babies and children.

In [None]:
df['total_babies_and_children'] = df['babies'] + df['children']

# idxmax() returns an index label, so use label-based .loc for the lookup
df.loc[df['total_babies_and_children'].idxmax(), ['name','email','phone-number','total_babies_and_children']].to_frame('content')

# Q13 - Define the phone-number area codes of the regions that had the most reservations.

In [None]:
df['phone-number'].apply(lambda phone_code:phone_code[:3]).value_counts().nlargest(3).to_frame('value_counts')