<div style="
    font-weight: bold; 
    font-size: 28px;
    ">
    Booking Cancellation Confirmation
</div>

<br>

<div style="font-size:15px; line-height:1.8">
    
**Name**: Salman Siddiqui <br>
**Email**: SalmanSiddiqui172002@gmail.com <br>

</div>

---

<div style="font-size:15px; line-height:1.8;">

**Description**

...

**Table of Contents**

<ol>
    <li style="margin-bottom: 8px; font-weight: bold;">Exploratory Data Analysis (EDA)</li>
    <li style="margin-bottom: 8px; font-weight: bold;">Preprocessing Steps</li>
    <p>
        A. Feature Engineering <br> 
        B. Handle Missing Values <br>
        C. Handle Noisy Data <br>
        D. Handle Categorical Variables <br>
    </p>
    <li style="margin-bottom: 8px; font-weight: bold;">Model Building & Comparison</li>
    <p>
        A. Decision Trees <br> 
        B. Random Forests <br>
        C. XGBoost <br>
        D. Compare Models <br>
    </p>
    <li style="margin-bottom: 8px; font-weight: bold;">Pipeline Building</li>
</ol>

</div>

---

<div style="font-size:15px; line-height:1.8;">

First, we'll start with some standard imports and add any extras as we go.

</div>

In [1]:
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns

<div style="font-size:15px; line-height:1.8;">

We can now load our dataset and check out the features we'll be working with.

</div>

In [2]:
data_raw = pd.read_csv("../data/data.csv")

data_raw.sample(3)

Unnamed: 0,hotel,is_canceled,lead_time,arrival_date_year,arrival_date_month,arrival_date_week_number,arrival_date_day_of_month,stays_in_weekend_nights,stays_in_week_nights,adults,...,customer_type,adr,required_car_parking_spaces,total_of_special_requests,reservation_status,reservation_status_date,name,email,phone-number,credit_card
108320,City Hotel,0,116,2017,March,12,24,0,2,2,...,Transient-Party,100.0,0,1,Check-Out,2017-03-26,Alyssa Lee,ALee@yahoo.com,432-726-1377,************5634
97108,City Hotel,0,28,2016,September,38,11,1,0,2,...,Transient,139.0,0,1,Check-Out,2016-09-12,Virginia Munoz,Virginia.M@mail.com,875-049-4403,************2441
93926,City Hotel,0,130,2016,July,31,26,0,3,3,...,Transient,96.9,0,0,Check-Out,2016-07-29,Colin Lee,Lee_Colin62@protonmail.com,692-894-3455,************9741


<div style="font-size:15px; line-height:1.8;">

Here’s a table that summarizes the features and their meanings.

</div>

<div style="font-size:15px; line-height: 1.8;">

<table>
    <tr>
        <th>Feature</th>
        <th>Description</th>
    </tr>
    <tbody>
        <tr>
            <td>hotel</td>
            <td>Type of hotel (Resort Hotel, City Hotel)</td>
        </tr>
        <tr>
            <td>is_canceled</td>
            <td>
                Reservation cancellation status <br> 
                - 0: not canceled <br>
                - 1: canceled <br>
            </td>
        </tr>
        <tr>
            <td>...</td>
            <td>...</td>
        </tr>
    </tbody>
</table>
    
</div>

<h2 style="font-weight: bold; font-size: 22px;">1. Exploratory Data Analysis (EDA)</h2>

---

<div style="font-size:15px; line-height:1.8;">

Let's begin with a quick overview of our data.

</div>

In [3]:
data_raw.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 119390 entries, 0 to 119389
Data columns (total 36 columns):
 #   Column                          Non-Null Count   Dtype  
---  ------                          --------------   -----  
 0   hotel                           119390 non-null  object 
 1   is_canceled                     119390 non-null  int64  
 2   lead_time                       119390 non-null  int64  
 3   arrival_date_year               119390 non-null  int64  
 4   arrival_date_month              119390 non-null  object 
 5   arrival_date_week_number        119390 non-null  int64  
 6   arrival_date_day_of_month       119390 non-null  int64  
 7   stays_in_weekend_nights         119390 non-null  int64  
 8   stays_in_week_nights            119390 non-null  int64  
 9   adults                          119390 non-null  int64  
 10  children                        119386 non-null  float64
 11  babies                          119390 non-null  int64  
 12  meal            

<div style="font-size:15px; line-height:1.8;">

**Inferences**

<ul>
    <li style="margin-bottom: 8px; font-weight: bold;"># of Entries & Features</li>
    <p>
        - The dataset consists of 119,390 entries. <br> 
        - The dataset consists of 36 columns, including the target <code style='font-size:14px; font-weight:bold;'>is_canceled</code>. <br>
    </p>
    <li style="margin-bottom: 8px; font-weight: bold;">Data Types</li>
    <p>
        - 16 columns are of type <code style='font-size:14px; font-weight:bold;'>object</code>, representing strings or categorical data. <br>
        - 16 columns are of type <code style='font-size:14px; font-weight:bold;'>int64</code>, representing integer values. <br>
        - 4 columns are of type <code style='font-size:14px; font-weight:bold;'>float64</code>, representing decimal values. <br>
    </p>
    <li style="margin-bottom: 8px; font-weight: bold;">Missing Values</li>
    <p>
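        - The column <code style='font-size:14px; font-weight:bold;'>children</code> is missing 4 values. <br>
        - The column <code style='font-size:14px; font-weight:bold;'>country</code> is missing 488 values. <br>
        - The column <code style='font-size:14px; font-weight:bold;'>agent</code> is missing 16,340 values. <br>
        - The column <code style='font-size:14px; font-weight:bold;'>company</code> is missing 112,593 values. <br>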
    </p>
</ul>

</div>
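
<div style="font-size:15px; line-height:1.8;">

The counts above were read off the <code style='font-size:14px; font-weight:bold;'>info()</code> output. As a rough cross-check, the same numbers can be computed directly. The sketch below is only an illustration: the helper name <code style='font-size:14px; font-weight:bold;'>summarize_frame</code> and the toy frame are assumptions, not part of the original notebook, and the toy values are made up.

</div>

```python
import numpy as np
import pandas as pd


def summarize_frame(df: pd.DataFrame) -> tuple[pd.Series, pd.Series]:
    """Return (number of columns per dtype, missing counts for columns with gaps)."""
    # Count how many columns share each dtype (dtype names as plain strings).
    dtype_counts = df.dtypes.astype(str).value_counts()
    # Count missing values per column and keep only columns that have any.
    missing = df.isna().sum()
    missing = missing[missing > 0].sort_values()
    return dtype_counts, missing


# Toy stand-in for data_raw; the values are invented for illustration only.
toy = pd.DataFrame(
    {
        "hotel": ["City Hotel", "Resort Hotel", "City Hotel"],
        "lead_time": [116, 28, 130],
        "children": [0.0, np.nan, 1.0],
    }
)

dtype_counts, missing = summarize_frame(toy)
print(dtype_counts)
print(missing)
```

<div style="font-size:15px; line-height:1.8;">

Calling <code style='font-size:14px; font-weight:bold;'>summarize_frame(data_raw)</code> in the notebook should reproduce the dtype and missing-value counts listed above.

</div>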

<div style="font-size:15px; line-height:1.8;">

**Note**: Based on the inferences above and the feature descriptions, the following columns are categorical in nature. We need to ensure they have the correct data type before moving forward.

<ul>
    <li><code style='font-size:14px; font-weight:bold;'>hotel</code></li>
    <li><code style='font-size:14px; font-weight:bold;'>...</code></li>
    <li><code style='font-size:14px; font-weight:bold;'>...</code></li>
    <li><code style='font-size:14px; font-weight:bold;'>...</code></li>
    <li><code style='font-size:14px; font-weight:bold;'>...</code></li>
    <li><code style='font-size:14px; font-weight:bold;'>...</code></li>
</ul>

</div>
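
<div style="font-size:15px; line-height:1.8;">

One possible way to apply the correct data type is pandas' <code style='font-size:14px; font-weight:bold;'>category</code> dtype. The sketch below assumes that approach; the helper <code style='font-size:14px; font-weight:bold;'>to_categorical</code> and the toy frame are hypothetical, and <code style='font-size:14px; font-weight:bold;'>categorical_cols</code> contains only <code style='font-size:14px; font-weight:bold;'>hotel</code> because it is the only column named in the list above.

</div>

```python
import pandas as pd

# Placeholder list: only `hotel` is named in the note above. The remaining
# categorical columns would be appended here.
categorical_cols = ["hotel"]


def to_categorical(df: pd.DataFrame, cols: list[str]) -> pd.DataFrame:
    """Return a copy of df with the given columns cast to the category dtype."""
    out = df.copy()  # leave the caller's frame untouched
    for col in cols:
        out[col] = out[col].astype("category")
    return out


# Toy stand-in for data_raw; the values are invented for illustration only.
toy = pd.DataFrame(
    {
        "hotel": ["City Hotel", "Resort Hotel", "City Hotel"],
        "lead_time": [116, 28, 130],
    }
)

converted = to_categorical(toy, categorical_cols)
print(converted.dtypes)
```

<div style="font-size:15px; line-height:1.8;">

Working on a copy keeps <code style='font-size:14px; font-weight:bold;'>data_raw</code> unchanged, so the raw data remains available for comparison in later steps.

</div>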