# Data Cleaning Project

# **Data Cleaning Process**

 1. Import Libraries

 2. Readability of Dataset

 3. Drop Duplicates

 4. Reset Index

 5. Finding Missing Values

 6. Dealing with Missing Values

 7. Uniformity of Data

 8. Delete Unnecessary Columns

 9. Time Series Formatting

--- 

## **1. Import Libraries**
Import necessary Python libraries such as **Pandas**, **NumPy**, and others to work with the dataset.  
This step ensures that we have the required tools for data manipulation and cleaning.

---

In [5]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import warnings
import os

---

## **2. Readability of Dataset**
Examine the dataset to understand its structure, format, and basic information before proceeding with cleaning.  
Use methods like `.head()`, `.info()`, and `.describe()` to get an overview of the data.

---

In [6]:
# Read the dataset
data = pd.read_csv(r"C:\Users\rahim\OneDrive\Desktop\Hotel dataset\Manipulated Data\hotel_bookings for explaination.csv")

# Display basic info about the dataset
data.info()

df = data 

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 119390 entries, 0 to 119389
Data columns (total 32 columns):
 #   Column                          Non-Null Count   Dtype  
---  ------                          --------------   -----  
 0   hotel                           119390 non-null  object 
 1   is_canceled                     119390 non-null  int64  
 2   lead_time                       119390 non-null  int64  
 3   arrival_date_year               119390 non-null  int64  
 4   arrival_date_month              119390 non-null  object 
 5   arrival_date_week_number        119390 non-null  int64  
 6   arrival_date_day_of_month       119390 non-null  int64  
 7   stays_in_weekend_nights         119390 non-null  int64  
 8   stays_in_week_nights            119390 non-null  int64  
 9   adults                          119390 non-null  int64  
 10  children                        119386 non-null  float64
 11  babies                          119390 non-null  int64  
 12  meal            

In [7]:
# Print the first 10 rows to check the data
print(df.head(10))

# Print the data types of columns to ensure correct types
print(df.dtypes)

# Print all column names to ensure they are correct
print(df.columns)
df.describe()

          hotel  is_canceled  lead_time  arrival_date_year arrival_date_month  \
0  Resort Hotel            0        342               2015               July   
1  Resort Hotel            0        737               2015               July   
2  Resort Hotel            0          7               2015               July   
3  Resort Hotel            0         13               2015               July   
4  Resort Hotel            0         14               2015               July   
5  Resort Hotel            0         14               2015               July   
6  Resort Hotel            0          0               2015               July   
7  Resort Hotel            0          9               2015               July   
8  Resort Hotel            1         85               2015               July   
9  Resort Hotel            1         75               2015               July   

   arrival_date_week_number  arrival_date_day_of_month  \
0                        27                       

Unnamed: 0,is_canceled,lead_time,arrival_date_year,arrival_date_week_number,arrival_date_day_of_month,stays_in_weekend_nights,stays_in_week_nights,adults,children,babies,is_repeated_guest,previous_cancellations,previous_bookings_not_canceled,booking_changes,agent,company,days_in_waiting_list,adr,required_car_parking_spaces,total_of_special_requests
count,119390.0,119390.0,119390.0,119390.0,119390.0,119390.0,119390.0,119390.0,119386.0,119390.0,119390.0,119390.0,119390.0,119390.0,103050.0,6797.0,119390.0,119390.0,119390.0,119390.0
mean,0.370416,104.011416,2016.156554,27.165173,15.798241,0.927599,2.500302,1.856403,0.10389,0.007949,0.031912,0.087118,0.137097,0.221124,86.693382,189.266735,2.321149,101.831122,0.062518,0.571363
std,0.482918,106.863097,0.707476,13.605138,8.780829,0.998613,1.908286,0.579261,0.398561,0.097436,0.175767,0.844336,1.497437,0.652306,110.774548,131.655015,17.594721,50.53579,0.245291,0.792798
min,0.0,0.0,2015.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,6.0,0.0,-6.38,0.0,0.0
25%,0.0,18.0,2016.0,16.0,8.0,0.0,1.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,9.0,62.0,0.0,69.29,0.0,0.0
50%,0.0,69.0,2016.0,28.0,16.0,1.0,2.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,14.0,179.0,0.0,94.575,0.0,0.0
75%,1.0,160.0,2017.0,38.0,23.0,2.0,3.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,229.0,270.0,0.0,126.0,0.0,1.0
max,1.0,737.0,2017.0,53.0,31.0,19.0,50.0,55.0,10.0,10.0,1.0,26.0,72.0,21.0,535.0,543.0,391.0,5400.0,8.0,5.0


---

## **3. Drop Duplicates**
Remove any **duplicate rows** from the dataset to ensure that the data contains only **unique records**.  
Duplicates can inflate results, so it’s essential to clean them before further analysis.

---

In [8]:
df  = df.drop_duplicates()

df

Unnamed: 0,hotel,is_canceled,lead_time,arrival_date_year,arrival_date_month,arrival_date_week_number,arrival_date_day_of_month,stays_in_weekend_nights,stays_in_week_nights,adults,...,deposit_type,agent,company,days_in_waiting_list,customer_type,adr,required_car_parking_spaces,total_of_special_requests,reservation_status,reservation_status_date
0,Resort Hotel,0,342,2015,July,27,1,0,0,2,...,No Deposit,,,0,Transient,0.00,0,0,Check-Out,01/07/2015
1,Resort Hotel,0,737,2015,July,27,1,0,0,2,...,No Deposit,,,0,Transient,0.00,0,0,Check-Out,01/07/2015
2,Resort Hotel,0,7,2015,July,27,1,0,1,1,...,No Deposit,,,0,Transient,75.00,0,0,Check-Out,02/07/2015
3,Resort Hotel,0,13,2015,July,27,1,0,1,1,...,No Deposit,304.0,,0,Transient,75.00,0,0,Check-Out,02/07/2015
4,Resort Hotel,0,14,2015,July,27,1,0,2,2,...,No Deposit,240.0,,0,Transient,98.00,0,1,Check-Out,03/07/2015
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
119385,City Hotel,0,23,2017,August,35,30,2,5,2,...,No Deposit,394.0,,0,Transient,96.14,0,0,Check-Out,06/09/2017
119386,City Hotel,0,102,2017,August,35,31,2,5,3,...,No Deposit,9.0,,0,Transient,225.43,0,2,Check-Out,07/09/2017
119387,City Hotel,0,34,2017,August,35,31,2,5,2,...,No Deposit,9.0,,0,Transient,157.71,0,4,Check-Out,07/09/2017
119388,City Hotel,0,109,2017,August,35,31,2,5,2,...,No Deposit,89.0,,0,Transient,104.40,0,0,Check-Out,07/09/2017



---

## **4. Reset Index**
**Reset the index** of the DataFrame after dropping duplicates to ensure the indices are **sequential**.  
This avoids keeping any unnecessary indices from the previous dataset.

---

In [9]:
# Reset the index and drop the old index
df.reset_index(drop=True, inplace=True)

# Display the first few rows to confirm the reset
df.head()

# Check the updated shape
df.shape

(87396, 32)

In [10]:
df

Unnamed: 0,hotel,is_canceled,lead_time,arrival_date_year,arrival_date_month,arrival_date_week_number,arrival_date_day_of_month,stays_in_weekend_nights,stays_in_week_nights,adults,...,deposit_type,agent,company,days_in_waiting_list,customer_type,adr,required_car_parking_spaces,total_of_special_requests,reservation_status,reservation_status_date
0,Resort Hotel,0,342,2015,July,27,1,0,0,2,...,No Deposit,,,0,Transient,0.00,0,0,Check-Out,01/07/2015
1,Resort Hotel,0,737,2015,July,27,1,0,0,2,...,No Deposit,,,0,Transient,0.00,0,0,Check-Out,01/07/2015
2,Resort Hotel,0,7,2015,July,27,1,0,1,1,...,No Deposit,,,0,Transient,75.00,0,0,Check-Out,02/07/2015
3,Resort Hotel,0,13,2015,July,27,1,0,1,1,...,No Deposit,304.0,,0,Transient,75.00,0,0,Check-Out,02/07/2015
4,Resort Hotel,0,14,2015,July,27,1,0,2,2,...,No Deposit,240.0,,0,Transient,98.00,0,1,Check-Out,03/07/2015
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
87391,City Hotel,0,23,2017,August,35,30,2,5,2,...,No Deposit,394.0,,0,Transient,96.14,0,0,Check-Out,06/09/2017
87392,City Hotel,0,102,2017,August,35,31,2,5,3,...,No Deposit,9.0,,0,Transient,225.43,0,2,Check-Out,07/09/2017
87393,City Hotel,0,34,2017,August,35,31,2,5,2,...,No Deposit,9.0,,0,Transient,157.71,0,4,Check-Out,07/09/2017
87394,City Hotel,0,109,2017,August,35,31,2,5,2,...,No Deposit,89.0,,0,Transient,104.40,0,0,Check-Out,07/09/2017


---

## **5. Finding Missing Values**
Identify columns with **missing data** using functions like `.isna()` or `.isnull()`.  
Finding missing values is crucial for addressing gaps in the dataset, ensuring it’s ready for analysis.

---

In [11]:
df_missing = df.isnull().sum()
type(df_missing)


pandas.core.series.Series

In [12]:
df_missing.tail(10) # Display the last 10 rows of missing values

deposit_type                       0
agent                          12193
company                        82137
days_in_waiting_list               0
customer_type                      0
adr                                0
required_car_parking_spaces        0
total_of_special_requests          0
reservation_status                 0
reservation_status_date            0
dtype: int64

In [13]:
df_missing

hotel                                 0
is_canceled                           0
lead_time                             0
arrival_date_year                     0
arrival_date_month                    0
arrival_date_week_number              0
arrival_date_day_of_month             0
stays_in_weekend_nights               0
stays_in_week_nights                  0
adults                                0
children                              4
babies                                0
meal                                  0
country                             452
market_segment                        0
distribution_channel                  0
is_repeated_guest                     0
previous_cancellations                0
previous_bookings_not_canceled        0
reserved_room_type                    0
assigned_room_type                    0
booking_changes                       0
deposit_type                          0
agent                             12193
company                           82137



---

## **6. Dealing with Missing Values**
Handle missing values in the following columns:
- **Children (4)**: Replace with the **mode** (most frequent value) to maintain consistency in demographic data.
- **Country (452)**: Replace missing values with the **mode** to avoid losing demographic information, ensuring the dataset remains useful.
- **Agent (12k)**: Since using the **mode** might distort analysis (e.g., agent 240 having the most sales), **create a new agent number** for missing values, treating them as anomalies to be handled separately.
- **Company (87k)**: **Remove the column** entirely as it has too many missing values and offers no utility for analysis.

---

##

In [14]:
children_mode = df['children'].mode()[0]  # Get the mode of the 'children' column
# Fill missing values in 'children' column with the mode

df.loc[:, 'children'] = df['children'].fillna(children_mode)

In [15]:
type(df)

pandas.core.frame.DataFrame

This code is for making country full name from abbrevation

In [16]:
children_mode = df['country'].mode()[0] 
# Get the mode of the 'country' column
# Fill missing values in 'country' column with the mode

df.loc[:, 'country'] = df['country'].fillna(children_mode)

In [17]:
df.loc[:, 'agent'] = df['agent'].fillna(9999)

# To confirm the changes, let's display the first few rows of the DataFrame
df.head()

# Optionally, you can track how many rows had their 'agent' value replaced by the new agent (255)
missing_agent_count = df[df['agent'] == 9999].shape[0]
print(f"Number of missing 'agent' values replaced: {missing_agent_count}")
df.head()

Number of missing 'agent' values replaced: 12193


Unnamed: 0,hotel,is_canceled,lead_time,arrival_date_year,arrival_date_month,arrival_date_week_number,arrival_date_day_of_month,stays_in_weekend_nights,stays_in_week_nights,adults,...,deposit_type,agent,company,days_in_waiting_list,customer_type,adr,required_car_parking_spaces,total_of_special_requests,reservation_status,reservation_status_date
0,Resort Hotel,0,342,2015,July,27,1,0,0,2,...,No Deposit,9999.0,,0,Transient,0.0,0,0,Check-Out,01/07/2015
1,Resort Hotel,0,737,2015,July,27,1,0,0,2,...,No Deposit,9999.0,,0,Transient,0.0,0,0,Check-Out,01/07/2015
2,Resort Hotel,0,7,2015,July,27,1,0,1,1,...,No Deposit,9999.0,,0,Transient,75.0,0,0,Check-Out,02/07/2015
3,Resort Hotel,0,13,2015,July,27,1,0,1,1,...,No Deposit,304.0,,0,Transient,75.0,0,0,Check-Out,02/07/2015
4,Resort Hotel,0,14,2015,July,27,1,0,2,2,...,No Deposit,240.0,,0,Transient,98.0,0,1,Check-Out,03/07/2015


In [18]:
# Removing the 'company' column
df.drop(columns=['company'], inplace=True)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df.drop(columns=['company'], inplace=True)


In [19]:
type(df)

pandas.core.frame.DataFrame

In [20]:
df_check = df.isnull().sum()
df_check

hotel                             0
is_canceled                       0
lead_time                         0
arrival_date_year                 0
arrival_date_month                0
arrival_date_week_number          0
arrival_date_day_of_month         0
stays_in_weekend_nights           0
stays_in_week_nights              0
adults                            0
children                          0
babies                            0
meal                              0
country                           0
market_segment                    0
distribution_channel              0
is_repeated_guest                 0
previous_cancellations            0
previous_bookings_not_canceled    0
reserved_room_type                0
assigned_room_type                0
booking_changes                   0
deposit_type                      0
agent                             0
days_in_waiting_list              0
customer_type                     0
adr                               0
required_car_parking_spaces 


---

## **7. Uniformity of Data**
**Standardize** data formats and ensure **consistency** across the dataset, such as ensuring consistent **case formatting** or removing extra spaces.

---


In [21]:
country_map = {
    'AFG': 'Afghanistan',
    'ALB': 'Albania',
    'DZA': 'Algeria',
    'AND': 'Andorra',
    'AGO': 'Angola',
    'ATG': 'Antigua and Barbuda',
    'ARG': 'Argentina',
    'ARM': 'Armenia',
    'AUS': 'Australia',
    'AUT': 'Austria',
    'AZE': 'Azerbaijan',
    'BHS': 'Bahamas',
    'BHR': 'Bahrain',
    'BGD': 'Bangladesh',
    'BRB': 'Barbados',
    'BLR': 'Belarus',
    'BEL': 'Belgium',
    'BLZ': 'Belize',
    'BEN': 'Benin',
    'BTN': 'Bhutan',
    'BOL': 'Bolivia',
    'BES': 'Bonaire, Sint Eustatius and Saba',
    'BIH': 'Bosnia and Herzegovina',
    'BWA': 'Botswana',
    'BVT': 'Bouvet Island',
    'BRA': 'Brazil',
    'BRN': 'Brunei Darussalam',
    'BGR': 'Bulgaria',
    'BFA': 'Burkina Faso',
    'BDI': 'Burundi',
    'KHM': 'Cambodia',
    'CMR': 'Cameroon',
    'CAN': 'Canada',
    'CPV': 'Cape Verde',
    'CYM': 'Cayman Islands',
    'CAF': 'Central African Republic',
    'TCD': 'Chad',
    'CHL': 'Chile',
    'CHN': 'China',
    'CN': 'China',
    'CXR': 'Christmas Island',
    'CCK': 'Cocos (Keeling) Islands',
    'COL': 'Colombia',
    'COM': 'Comoros',
    'COG': 'Congo',
    'COD': 'Congo (Democratic Republic of the)',
    'COK': 'Cook Islands',
    'CRI': 'Costa Rica',
    'CIV': 'Côte d\'Ivoire',
    'HRV': 'Croatia',
    'CUB': 'Cuba',
    'CUW': 'Curaçao',
    'CYP': 'Cyprus',
    'CZE': 'Czech Republic',
    'DNK': 'Denmark',
    'DJI': 'Djibouti',
    'DMA': 'Dominica',
    'DOM': 'Dominican Republic',
    'ECU': 'Ecuador',
    'EGY': 'Egypt',
    'SLV': 'El Salvador',
    'GNQ': 'Equatorial Guinea',
    'ERI': 'Eritrea',
    'EST': 'Estonia',
    'SWZ': 'Eswatini',
    'ETH': 'Ethiopia',
    'FLK': 'Falkland Islands (Malvinas)',
    'FRO': 'Faroe Islands',
    'FJI': 'Fiji',
    'FIN': 'Finland',
    'FRA': 'France',
    'GAB': 'Gabon',
    'GMB': 'Gambia',
    'GEO': 'Georgia',
    'DEU': 'Germany',
    'GHA': 'Ghana',
    'GIB': 'Gibraltar',
    'GRC': 'Greece',
    'GRL': 'Greenland',
    'GRD': 'Grenada',
    'GUM': 'Guam',
    'GTM': 'Guatemala',
    'GIN': 'Guinea',
    'GNB': 'Guinea-Bissau',
    'GUY': 'Guyana',
    'HTI': 'Haiti',
    'HMD': 'Heard Island and McDonald Islands',
    'HND': 'Honduras',
    'HKG': 'Hong Kong',
    'HUN': 'Hungary',
    'ISL': 'Iceland',
    'IND': 'India',
    'IDN': 'Indonesia',
    'IRN': 'Iran',
    'IRQ': 'Iraq',
    'IRL': 'Ireland',
    'ISR': 'Israel',
    'ITA': 'Italy',
    'JAM': 'Jamaica',
    'JPN': 'Japan',
    'JOR': 'Jordan',
    'KAZ': 'Kazakhstan',
    'KEN': 'Kenya',
    'KIR': 'Kiribati',
    'KWT': 'Kuwait',
    'KGZ': 'Kyrgyzstan',
    'LAO': 'Lao People\'s Democratic Republic',
    'LVA': 'Latvia',
    'LBN': 'Lebanon',
    'LSO': 'Lesotho',
    'LBR': 'Liberia',
    'LBY': 'Libya',
    'LIE': 'Liechtenstein',
    'LTU': 'Lithuania',
    'LUX': 'Luxembourg',
    'MAC': 'Macao',
    'MDG': 'Madagascar',
    'MWI': 'Malawi',
    'MYS': 'Malaysia',
    'MDV': 'Maldives',
    'MLI': 'Mali',
    'MLT': 'Malta',
    'MHL': 'Marshall Islands',
    'MTQ': 'Martinique',
    'MRT': 'Mauritania',
    'MUS': 'Mauritius',
    'MYT': 'Mayotte',
    'MEX': 'Mexico',
    'FSM': 'Micronesia (Federated States of)',
    'MDA': 'Moldova (Republic of)',
    'MCO': 'Monaco',
    'MNG': 'Mongolia',
    'MNE': 'Montenegro',
    'MSR': 'Montserrat',
    'MAR': 'Morocco',
    'MOZ': 'Mozambique',
    'MMR': 'Myanmar',
    'NAM': 'Namibia',
    'NRU': 'Nauru',
    'NPL': 'Nepal',
    'NLD': 'Netherlands',
    'NCL': 'New Caledonia',
    'NZL': 'New Zealand',
    'NIC': 'Nicaragua',
    'NER': 'Niger',
    'NGA': 'Nigeria',
    'NIU': 'Niue',
    'NFK': 'Norfolk Island',
    'NOR': 'Norway',
    'OMN': 'Oman',
    'PAK': 'Pakistan',
    'PLW': 'Palau',
    'PSE': 'Palestine, State of',
    'PAN': 'Panama',
    'PNG': 'Papua New Guinea',
    'PRY': 'Paraguay',
    'PER': 'Peru',
    'PHL': 'Philippines',
    'PCN': 'Pitcairn Islands',
    'POL': 'Poland',
    'PRT': 'Portugal',
    'PRI': 'Puerto Rico',
    'QAT': 'Qatar',
    'ROU': 'Romania',
    'RUS': 'Russian Federation',
    'RWA': 'Rwanda',
    'REU': 'Réunion',
    'KNA': 'Saint Kitts and Nevis',
    'LCA': 'Saint Lucia',
    'VCT': 'Saint Vincent and the Grenadines',
    'WSM': 'Samoa',
    'SMR': 'San Marino',
    'STP': 'Sao Tome and Principe',
    'SAU': 'Saudi Arabia',
    'SEN': 'Senegal',
    'SYC': 'Seychelles',
    'SLE': 'Sierra Leone',
    'SGP': 'Singapore',
    'SXM': 'Sint Maarten (Dutch part)',
    'SVK': 'Slovakia',
    'SVN': 'Slovenia',
    'SLB': 'Solomon Islands',
    'SOM': 'Somalia',
    'ZAF': 'South Africa',
    'SGS': 'South Georgia and the South Sandwich Islands',
    'SSD': 'South Sudan',
    'ESP': 'Spain',
    'LKA': 'Sri Lanka',
    'SDN': 'Sudan',
    'SUR': 'Suriname',
    'SJM': 'Svalbard and Jan Mayen',
    'SWE': 'Sweden',
    'CHE': 'Switzerland',
    'SYR': 'Syrian Arab Republic',
    'TWN': 'Taiwan',
    'TJK': 'Tajikistan',
    'TZA': 'Tanzania (United Republic of)',
    'THA': 'Thailand',
    'TGO': 'Togo',
    'TKL': 'Tokelau',
    'TON': 'Tonga',
    'TTO': 'Trinidad and Tobago',
    'TUN': 'Tunisia',
    'TUR': 'Turkey',
    'TKM': 'Turkmenistan',
    'TCA': 'Turks and Caicos Islands',
    'TUV': 'Tuvalu',
    'UGA': 'Uganda',
    'UKR': 'Ukraine',
    'ARE': 'United Arab Emirates',
    'GBR': 'United Kingdom',
    'USA': 'United States of America',
    'URY': 'Uruguay',
    'UZB': 'Uzbekistan',
    'VUT': 'Vanuatu',
    'VEN': 'Venezuela (Bolivarian Republic of)',
    'VNM': 'Viet Nam',
    'VGB': 'Virgin Islands (British)',
    'VIR': 'Virgin Islands (U.S.)',
    'WLF': 'Wallis and Futuna',
    'ESH': 'Western Sahara',
    'YEM': 'Yemen',
    'ZMB': 'Zambia',
    'ZWE': 'Zimbabwe',
    'SBR': 'Government of Serbia',
}


In [22]:
df.loc[:, 'country'] = df['country'].map(country_map).fillna(df['country'])


---

## **8. Delete Unnecessary Columns**
**Drop columns** that are not useful or relevant to the analysis, such as "company" in this case, to streamline the dataset.

---

In [23]:
df = df.drop(columns=['days_in_waiting_list', "previous_bookings_not_canceled", "previous_cancellations", "is_repeated_guest", 'required_car_parking_spaces','booking_changes', 'agent'])

In [24]:
type(df)


pandas.core.frame.DataFrame


---

## **9. Time Series Formatting**
Ensure that **date and time data** are in the correct format for time-based analysis, making it easier to conduct time series operations.

---

In [25]:
# Convert 'reservation_status_date' to datetime format
df['reservation_status_date'] = pd.to_datetime(df['reservation_status_date'], errors='coerce')  # errors='coerce' will handle invalid date formats
df.loc[:, 'reservation_status_date'] = df['reservation_status_date'].dt.strftime('%Y-%m-%d')

In [26]:
df.loc[:, 'arrival_date'] = pd.to_datetime(df['arrival_date_year'].astype(str) + '-' + 
                                            df['arrival_date_month'] + '-' + 
                                            df['arrival_date_day_of_month'].astype(str), 
                                            errors='coerce')

# Format the 'arrival_date' column to 'YYYY-MM-DD'
df.loc[:, 'arrival_date'] = df['arrival_date'].dt.strftime('%Y-%m-%d')

In [27]:
# Drop the individual arrival-related columns (since we now have 'arrival_date')
df = df.drop(columns=['arrival_date_year', 'arrival_date_month', 'arrival_date_week_number', 'arrival_date_day_of_month'])

In [28]:
df.columns

Index(['hotel', 'is_canceled', 'lead_time', 'stays_in_weekend_nights',
       'stays_in_week_nights', 'adults', 'children', 'babies', 'meal',
       'country', 'market_segment', 'distribution_channel',
       'reserved_room_type', 'assigned_room_type', 'deposit_type',
       'customer_type', 'adr', 'total_of_special_requests',
       'reservation_status', 'reservation_status_date', 'arrival_date'],
      dtype='object')


---

## **10. Data into CSV**
Ensure that **date and time data** are in the correct format for time-based analysis, making it easier to conduct time series operations.

---

In [29]:
df.dtypes

hotel                                object
is_canceled                           int64
lead_time                             int64
stays_in_weekend_nights               int64
stays_in_week_nights                  int64
adults                                int64
children                            float64
babies                                int64
meal                                 object
country                              object
market_segment                       object
distribution_channel                 object
reserved_room_type                   object
assigned_room_type                   object
deposit_type                         object
customer_type                        object
adr                                 float64
total_of_special_requests             int64
reservation_status                   object
reservation_status_date      datetime64[ns]
arrival_date                 datetime64[ns]
dtype: object

Everything is fine All numerical data is either float or integer

In [30]:
df

Unnamed: 0,hotel,is_canceled,lead_time,stays_in_weekend_nights,stays_in_week_nights,adults,children,babies,meal,country,...,distribution_channel,reserved_room_type,assigned_room_type,deposit_type,customer_type,adr,total_of_special_requests,reservation_status,reservation_status_date,arrival_date
0,Resort Hotel,0,342,0,0,2,0.0,0,BB,Portugal,...,Direct,C,C,No Deposit,Transient,0.00,0,Check-Out,2015-01-07,2015-07-01
1,Resort Hotel,0,737,0,0,2,0.0,0,BB,Portugal,...,Direct,C,C,No Deposit,Transient,0.00,0,Check-Out,2015-01-07,2015-07-01
2,Resort Hotel,0,7,0,1,1,0.0,0,BB,United Kingdom,...,Direct,A,C,No Deposit,Transient,75.00,0,Check-Out,2015-02-07,2015-07-01
3,Resort Hotel,0,13,0,1,1,0.0,0,BB,United Kingdom,...,Corporate,A,A,No Deposit,Transient,75.00,0,Check-Out,2015-02-07,2015-07-01
4,Resort Hotel,0,14,0,2,2,0.0,0,BB,United Kingdom,...,TA/TO,A,A,No Deposit,Transient,98.00,1,Check-Out,2015-03-07,2015-07-01
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
87391,City Hotel,0,23,2,5,2,0.0,0,BB,Belgium,...,TA/TO,A,A,No Deposit,Transient,96.14,0,Check-Out,2017-06-09,2017-08-30
87392,City Hotel,0,102,2,5,3,0.0,0,BB,France,...,TA/TO,E,E,No Deposit,Transient,225.43,2,Check-Out,2017-07-09,2017-08-31
87393,City Hotel,0,34,2,5,2,0.0,0,BB,Germany,...,TA/TO,D,D,No Deposit,Transient,157.71,4,Check-Out,2017-07-09,2017-08-31
87394,City Hotel,0,109,2,5,2,0.0,0,BB,United Kingdom,...,TA/TO,A,A,No Deposit,Transient,104.40,0,Check-Out,2017-07-09,2017-08-31


In [31]:
df.to_csv(r'C:\Users\rahim\OneDrive\Desktop\Hotel dataset\cleaned_hotel_bookings.csv', index=False)
