# Hotel Booking Demand (Data Cleaning)

In [None]:
import numpy as np
import pandas as pd 

In [None]:
# Importing the dataset
missing_value=["Undefined"]
data_path = "../input/hotel-booking-demand/hotel_bookings.csv"
hotel = pd.read_csv(data_path, na_values=missing_value)
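
As a small illustration of what `na_values` does here, the sketch below reads a made-up three-row CSV (not the real dataset) and shows that the string "Undefined" is parsed as a missing value. The column contents are invented for the example.

```python
import io

import pandas as pd

# Toy CSV with hypothetical values; not taken from hotel_bookings.csv
csv_text = "meal,adults\nBB,2\nUndefined,1\nHB,2\n"

toy = pd.read_csv(io.StringIO(csv_text), na_values=["Undefined"])

# "Undefined" is read as NaN, so the meal column has one missing value
missing_per_column = toy.isnull().sum()
print(missing_per_column)
```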

In [None]:
hotel.head()

In [None]:
hotel.shape

In [None]:
hotel.info()

**Converting the object datatype of "reservation_status_date" to datetime datatype**

In [None]:
hotel['reservation_status_date'] = pd.to_datetime(hotel['reservation_status_date'])

In [None]:
type(hotel['reservation_status_date'][0])

**Creating a new column by combining the year, month and day of arrival.**

In [None]:
hotel['arrival_date'] = pd.to_datetime(hotel.arrival_date_year.astype(str) + '/' + hotel.arrival_date_month.astype(str) + '/' + hotel.arrival_date_day_of_month.astype(str))
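
A minimal sketch of the same combination on a toy frame, this time passing an explicit `format` so pandas does not have to guess how to parse strings such as `2015/July/1`. The values are hypothetical, and `%B` assumes English month names in the active locale.

```python
import pandas as pd

# Toy frame with hypothetical values, mirroring the three arrival columns
toy = pd.DataFrame({
    "arrival_date_year": [2015, 2016],
    "arrival_date_month": ["July", "February"],
    "arrival_date_day_of_month": [1, 29],
})

date_text = (
    toy["arrival_date_year"].astype(str) + "/"
    + toy["arrival_date_month"] + "/"
    + toy["arrival_date_day_of_month"].astype(str)
)

# %Y = four-digit year, %B = full month name, %d = day of month
toy["arrival_date"] = pd.to_datetime(date_text, format="%Y/%B/%d")
print(toy["arrival_date"])
```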

In [None]:
hotel['arrival_date']

In [None]:
hotel.shape

In [None]:
# Checking for the converted datatype
hotel['arrival_date'][0]

**1) Finding the number of missing values**

**Checking how many missing values each column contains**

In [None]:
hotel.isnull().sum()

**2) Finding the indexes of the missing values**

**E.g., finding the indexes of the 4 missing values in the children column**

In [None]:
# NaN is not equal to itself, so this mask is True only for missing values
hotel.children[hotel.children != hotel.children].index.values
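
The self-comparison works because NaN is the only float value that compares unequal to itself. The sketch below, on a toy series with made-up values, shows that it selects the same positions as the more explicit `isnull()`.

```python
import numpy as np
import pandas as pd

# Toy series with hypothetical values
children = pd.Series([0.0, np.nan, 2.0, np.nan])

# The mask is True only where the value is NaN
by_self_comparison = children[children != children].index.values

# Same result, stated explicitly
by_isnull = children.index[children.isnull()].values

print(by_self_comparison, by_isnull)
```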

**3) Removing the unwanted columns**

As a rule of thumb, if more than 70% of the values in a column are missing and there is no sensible way to fill them in, the column can be dropped from the dataset. Here, 70% of 119390 rows is 83573.

In [None]:
# Drop every column in which more than 70% of the values are missing
threshold = hotel.shape[0] * 0.7
for col in hotel.columns:
    if hotel[col].isnull().sum() > threshold:
        hotel.drop(columns=col, inplace=True)
print(hotel.shape)
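
The same threshold rule can also be written without a loop. This is only a sketch on a toy frame with invented values: `isnull().mean()` gives the fraction of missing values per column, which is then compared with 0.7 directly.

```python
import numpy as np
import pandas as pd

# Toy frame with hypothetical values: "company" is 80% missing, "agent" 20%
toy = pd.DataFrame({
    "company": [np.nan, np.nan, np.nan, np.nan, 40.0],
    "agent": [9.0, np.nan, 240.0, 9.0, 14.0],
    "adults": [2, 2, 1, 2, 3],
})

# Fraction of missing values in each column
missing_fraction = toy.isnull().mean()
cols_to_drop = missing_fraction[missing_fraction > 0.7].index.tolist()

toy = toy.drop(columns=cols_to_drop)
print(cols_to_drop, toy.shape)
```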

The "arrival_date_week_number" column is redundant, as the year, month and day columns already give us
the date of arrival. Also, as we have created a new column holding the full date, we no longer need the 3 separate
columns. Hence, let's remove these columns.

In [None]:
hotel.drop(columns=["arrival_date_week_number", "arrival_date_year", "arrival_date_month", "arrival_date_day_of_month"],
           inplace=True)


In [None]:
hotel.shape

**4) Removing the unwanted rows**

Depending on what values we are predicting, we can either remove the entire agent id column or remove the rows
that have empty values in it.
As I have already shown how to remove a column, let's remove all rows that have a missing value in the agent column.

In [None]:
hotel.dropna(subset=["agent"], inplace=True)
hotel.shape
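
A short sketch of how `subset` behaves, using a toy frame with invented values: only the listed column decides whether a row is dropped, so a row with a missing value in another column is kept.

```python
import numpy as np
import pandas as pd

# Toy frame with hypothetical values
toy = pd.DataFrame({
    "agent": [9.0, np.nan, 240.0],
    "company": [np.nan, np.nan, 40.0],
})

# Row 1 is dropped (agent is missing); row 0 stays although company is missing
kept = toy.dropna(subset=["agent"])
print(kept)
```

Note that the original index labels are preserved, so the result has gaps in its index unless `reset_index(drop=True)` is called afterwards.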

**5) Filling the missing values of columns**

First, let us fill the children column.
Here I have used the mean of the column to replace the missing values.
As the mean can be a float, I have rounded it down to the nearest integer using the floor method.

In [None]:
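# Assign the result back: inplace=True on a selected column may not update the DataFrame in newer pandas
hotel["children"] = hotel["children"].fillna(hotel["children"].mean())
hotel["children"] = hotel["children"].apply(np.floor)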
print(f"Total missing values in children column after filling = {np.sum(hotel.children.isnull())}")
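
To make the arithmetic concrete, here is a sketch on a toy series with made-up values. The mean of the known values is 1.25, so the fill value after flooring is 1.0. If the mean of a column is below 1, the same procedure fills with 0.

```python
import numpy as np
import pandas as pd

# Toy series with hypothetical values; mean of the known values is 5 / 4 = 1.25
children = pd.Series([0.0, 2.0, np.nan, 1.0, 2.0])

# Round the mean down so the fill value is a whole number of children
fill_value = np.floor(children.mean())
filled = children.fillna(fill_value)

print(fill_value, filled.tolist())
```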


Now, let's fill market_segment, distribution_channel, meal and country.
As the same filling method (backward fill) is used for all of them, a for loop is used.

In [None]:
arr = ["market_segment", "distribution_channel", "meal", "country"]
print("No. of missing values are")
for x in arr:
    # Backward fill: each missing value takes the next valid value below it
    hotel[x] = hotel[x].bfill()
    print(f"{x}: {hotel[x].isnull().sum()}")
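
One caveat of backward filling is that a missing value in the last row has nothing below it to copy from. The sketch below, on a toy series with invented values, shows this and one possible fallback (a forward fill afterwards); whether that fallback suits the real data is an assumption.

```python
import numpy as np
import pandas as pd

# Toy series with hypothetical values; the last entry is missing
meal = pd.Series(["BB", np.nan, "HB", np.nan])

# Backward fill copies the next valid value upwards; the trailing NaN remains
back_filled = meal.bfill()

# A forward fill afterwards covers the trailing gap
fully_filled = back_filled.ffill()

print(back_filled.tolist(), fully_filled.tolist())
```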

Checking if the columns contain any null value

In [None]:
hotel.isnull().sum()

*Thus, we altered some columns, filled in the missing values, removed some unwanted rows and converted some datatypes to more appropriate ones. Our data is now reasonably clean and ready to be fed to our model.*