# **Room Pricing Model Booking Data Preprocessing**



First, import the Python libraries required for this project.

In [92]:
import os
import pandas as pd
import zipfile
import warnings

In [93]:
# Silence all warnings; this already covers FutureWarning, so one call is enough.
warnings.filterwarnings("ignore")

Then extract all files from the `bookings.zip` archive into the `book_data` folder.

In [94]:
with zipfile.ZipFile('bookings.zip', 'r') as zip_ref:
    zip_ref.extractall('book_data')

Next, print the path of every CSV file in the extracted folder.

In [95]:
dataframes=[]
folder_path = 'book_data'
for filename in os.listdir(folder_path):
    if filename.endswith('.csv'):
        file_path = os.path.join(folder_path, filename)
        print(file_path)

book_data/bookings_2024_february.csv
book_data/booking_2024_april.csv
book_data/bookings_2023_august.csv
book_data/bookings_2023_december.csv
book_data/bookings_2023_october.csv
book_data/bookings_2023_july.csv
book_data/bookings_2023_june.csv
book_data/bookings_2024_january.csv
book_data/bookings_2023_september.csv
book_data/bookings_2024_march.csv
book_data/bookings_2023_november.csv


The `book_data` folder contains monthly booking data from June 2023 to April 2024, one CSV file per month, and each file will go through the preprocessing stages.
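
`os.listdir` returns the files in arbitrary order, and the April file is named `booking_2024_april.csv` rather than `bookings_...`. The following is a minimal sketch, not part of the original notebook, of how the paths could be sorted chronologically while tolerating both prefixes. The helper names `period_key` and `sort_chronologically` are hypothetical.

```python
import re

# Month names as they appear in the file names listed above.
MONTHS = ["january", "february", "march", "april", "may", "june",
          "july", "august", "september", "october", "november", "december"]

# Accept both "bookings_" and "booking_" (the April file lacks the "s").
FILENAME_PATTERN = re.compile(r"bookings?_(\d{4})_([a-z]+)\.csv$")

def period_key(path):
    """Return (year, month_number) parsed from a booking file path."""
    match = FILENAME_PATTERN.search(path.lower())
    if match is None:
        raise ValueError(f"Unexpected file name: {path}")
    year, month_name = match.groups()
    return int(year), MONTHS.index(month_name) + 1

def sort_chronologically(paths):
    """Sort booking file paths from the earliest to the latest month."""
    return sorted(paths, key=period_key)

example_paths = [
    "book_data/bookings_2024_february.csv",
    "book_data/booking_2024_april.csv",
    "book_data/bookings_2023_june.csv",
]
print(sort_chronologically(example_paths))
```

Sorting by a parsed `(year, month)` key avoids relying on alphabetical order, which would place April before February.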

# **Preprocess Booking Data June 2023**

Read the June booking data with pandas and display the first five records using the `head()` function. The `skiprows=1` argument skips the first line of the file, so the second line is used as the header row.

In [69]:
june_data = pd.read_csv("book_data/bookings_2023_june.csv", skiprows=1)
june_data.head()

Unnamed: 0,Listing ID,Listing title,Internal name,Region,Currency,Bookings,Bookings YoY,Booking value,Booking value YoY,Nights booked,...,Average daily rate,Average daily rate YoY,Average length of stay,Average length of stay YoY,Average booking window,Average booking window YoY,View to contact rate,View to contact rate YoY,Contact to book rate,Contact to book rate YoY
0,696637522460154617,Mock Property Room 2 Test,,Yogyakarta,SGD,0,,0.0,,0,...,,,0.0,,,,,,,
1,44106402,Surf's Retreat: Minutes from Bingin Beach,Bingin Sun & Moon Villas - (Standard),Bali,SGD,0,,0.0,,0,...,,,0.0,,,,,,0%,
2,39637584,Sunny Uluwatu Cottages with Fast Wifi + Fresh ...,Uluwatu Kayana Bungalows - 2,Bali,SGD,0,,0.0,,0,...,,,0.0,,,,,,,
3,41945741,Digital Nomad Room by Bukit Vista | sterilized,Asri Village-2,Bali,SGD,0,,0.0,,0,...,,,0.0,,,,,,,
4,47416028,ATRA Bambulogy • Alluring Bamboo Villa for Group,"OFFBOARD Atra 5BR Vit2,Ar,At",Bali,SGD,0,,0.0,,0,...,,,0.0,,,,,,,


The next step is to assess the data to understand the information it contains, then check the number of duplicates in the data, and describe the data using the `describe()` function.

In [70]:
print("Data Info:")
print(june_data.info())

num_duplicates = june_data.duplicated().sum()
print("\nNumber of Duplicates:", num_duplicates)

print("\nData Description:")
print(june_data.describe())

Data Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1100 entries, 0 to 1099
Data columns (total 21 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   Listing ID                  1100 non-null   int64  
 1   Listing title               1100 non-null   object 
 2   Internal name               1077 non-null   object 
 3   Region                      1084 non-null   object 
 4   Currency                    1100 non-null   object 
 5   Bookings                    1100 non-null   int64  
 6   Bookings YoY                247 non-null    object 
 7   Booking value               1100 non-null   float64
 8   Booking value YoY           279 non-null    object 
 9   Nights booked               1100 non-null   int64  
 10  Nights booked YoY           279 non-null    object 
 11  Average daily rate          200 non-null    float64
 12  Average daily rate YoY      86 non-null     object 
 13  Average length of stay

According to this output, the June booking dataset contains no duplicate rows and has 21 columns in total. Several of them, namely the year-over-year (YoY) comparisons, the booking window, and the conversion rates, are not needed here, so they are removed.
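
Note that `duplicated()` with no arguments only flags rows that match in every column. As a hedged sketch on toy data (not the real export), the same method can also check whether a listing appears more than once by passing `subset`:

```python
import pandas as pd

# Toy data: listing 101 appears twice, but with different booking counts.
toy = pd.DataFrame({
    "Listing ID": [101, 102, 101],
    "Bookings": [0, 3, 1],
})

# Rows identical in every column.
full_row_duplicates = int(toy.duplicated().sum())
# Rows whose Listing ID has already appeared earlier in the frame.
id_duplicates = int(toy.duplicated(subset=["Listing ID"]).sum())

print(full_row_duplicates, id_duplicates)
```

Whether repeated listing IDs matter depends on how the export is built; the notebook itself only checks full-row duplicates.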

In [71]:
# Columns that are not needed: year-over-year comparisons, booking window, and conversion rates.
columns = ["Bookings YoY", "Booking value YoY", "Nights booked YoY", "Average length of stay YoY", "Average booking window", "Average booking window YoY", "Average daily rate YoY", "View to contact rate", "View to contact rate YoY", "Contact to book rate", "Contact to book rate YoY"]
june_data = june_data.drop(columns=columns)
june_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1100 entries, 0 to 1099
Data columns (total 10 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   Listing ID              1100 non-null   int64  
 1   Listing title           1100 non-null   object 
 2   Internal name           1077 non-null   object 
 3   Region                  1084 non-null   object 
 4   Currency                1100 non-null   object 
 5   Bookings                1100 non-null   int64  
 6   Booking value           1100 non-null   float64
 7   Nights booked           1100 non-null   int64  
 8   Average daily rate      200 non-null    float64
 9   Average length of stay  1100 non-null   float64
dtypes: float64(3), int64(3), object(4)
memory usage: 86.1+ KB
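
Because every year-over-year column ends in " YoY", the drop list could also be built from the column names instead of being typed out. This is a minimal sketch on an empty toy frame that carries only a few of the June column names; it is an alternative, not the notebook's own method.

```python
import pandas as pd

# Toy frame with a handful of the column names from the June export.
toy = pd.DataFrame(columns=[
    "Listing ID", "Bookings", "Bookings YoY",
    "Average daily rate", "Average daily rate YoY",
    "Average booking window", "View to contact rate", "Contact to book rate",
])

# Every year-over-year column ends in " YoY".
yoy_columns = [name for name in toy.columns if name.endswith(" YoY")]
# The remaining unwanted metrics are listed by name.
other_columns = ["Average booking window", "View to contact rate", "Contact to book rate"]

# errors="ignore" skips names that are absent instead of raising KeyError.
trimmed = toy.drop(columns=yoy_columns + other_columns, errors="ignore")
print(list(trimmed.columns))
```

Building the list from a suffix avoids accidental repeats or omissions when the same step is applied to every month.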


After dropping these columns, 10 features remain. The `info()` output shows that the "Internal name", "Region", and "Average daily rate" columns still contain missing values, so the next step is to count the missing values in each column.

In [72]:
june_data.isna().sum()

Listing ID                  0
Listing title               0
Internal name              23
Region                     16
Currency                    0
Bookings                    0
Booking value               0
Nights booked               0
Average daily rate        900
Average length of stay      0
dtype: int64
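
Expressed as shares of the 1,100 rows, the counts above are roughly 2.1% for "Internal name", 1.5% for "Region", and 81.8% for "Average daily rate". The sketch below shows one way to produce such a table; the helper name `missing_summary` is hypothetical and the data is a toy example.

```python
import pandas as pd

def missing_summary(df):
    """Return missing counts and percentages for columns that have gaps."""
    counts = df.isna().sum()
    share = (counts / len(df) * 100).round(1)
    summary = pd.DataFrame({"missing": counts, "percent": share})
    # Keep only columns with at least one missing value, largest first.
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)

toy = pd.DataFrame({
    "Internal name": ["A", None, "C", "D"],
    "Region": ["Bali", None, "Bali", "Bali"],
    "Average daily rate": [None, None, None, 120.0],
    "Bookings": [0, 0, 0, 2],
})
print(missing_summary(toy))
```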

The next step is to inspect the rows that have missing values for "Internal name" and "Region".

In [73]:
missing_internal_name = june_data[june_data['Internal name'].isnull()]
print("Data with missing 'Internal name':")
missing_internal_name.head(23)

Data with missing 'Internal name':


Unnamed: 0,Listing ID,Listing title,Internal name,Region,Currency,Bookings,Booking value,Nights booked,Average daily rate,Average length of stay
0,696637522460154617,Mock Property Room 2 Test,,Yogyakarta,SGD,0,0.0,0,,0.0
170,1128515481162622561,Private room in Kecamatan Kuta Selatan,,Bali,SGD,0,0.0,0,,0.0
259,1128521303534453919,in Indonesia,,,SGD,0,0.0,0,,0.0
328,746627653405405488,in Indonesia,,,SGD,0,0.0,0,,0.0
349,1128517652026827480,Entire home/apt in Indonesia,,,SGD,0,0.0,0,,0.0
408,1084367686268950206,in Indonesia,,,SGD,0,0.0,0,,0.0
414,565343273164277913,Mock Property Room 1 TESTING,,Bali,SGD,0,0.0,0,,0.0
473,696639001174912560,Mock Property Room 3,,Jakarta,SGD,0,0.0,0,,0.0
487,664020938214053986,Flawless Villa for Group w/ Insane Ocean View ...,,Bali,SGD,0,0.0,0,,0.0
552,1128519127519650932,in Indonesia,,,SGD,0,0.0,0,,0.0


In [74]:
missing_region = june_data[june_data['Region'].isnull()]
print("Data with missing 'Region':")
missing_region.head(16)

Data with missing 'Region':


Unnamed: 0,Listing ID,Listing title,Internal name,Region,Currency,Bookings,Booking value,Nights booked,Average daily rate,Average length of stay
259,1128521303534453919,in Indonesia,,,SGD,0,0.0,0,,0.0
328,746627653405405488,in Indonesia,,,SGD,0,0.0,0,,0.0
349,1128517652026827480,Entire home/apt in Indonesia,,,SGD,0,0.0,0,,0.0
408,1084367686268950206,in Indonesia,,,SGD,0,0.0,0,,0.0
552,1128519127519650932,in Indonesia,,,SGD,0,0.0,0,,0.0
578,1128520621533774906,in Indonesia,,,SGD,0,0.0,0,,0.0
583,993261305858911431,in Indonesia,,,SGD,0,0.0,0,,0.0
656,943861352150397215,in Indonesia,,,SGD,0,0.0,0,,0.0
694,970575070818084733,in Indonesia,,,SGD,0,0.0,0,,0.0
767,1129831168233843711,in Indonesia,,,SGD,0,0.0,0,,0.0


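Only 23 of the 1,100 rows lack an "Internal name", and the rows shown above are mostly mock, test, or placeholder listings with no bookings, so these rows are dropped. As the `info()` output below shows, this also removes all 16 rows with a missing "Region".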

In [75]:
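# .copy() yields an independent frame, so later column assignments do not act on a view of june_data.
june_data_cleaned = june_data.dropna(subset=['Internal name']).copy()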
print("Shape of dataframe after dropping rows with missing 'Internal name':", june_data_cleaned.shape)
june_data_cleaned.info()

Shape of dataframe after dropping rows with missing 'Internal name': (1077, 10)
<class 'pandas.core.frame.DataFrame'>
Index: 1077 entries, 1 to 1099
Data columns (total 10 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   Listing ID              1077 non-null   int64  
 1   Listing title           1077 non-null   object 
 2   Internal name           1077 non-null   object 
 3   Region                  1077 non-null   object 
 4   Currency                1077 non-null   object 
 5   Bookings                1077 non-null   int64  
 6   Booking value           1077 non-null   float64
 7   Nights booked           1077 non-null   int64  
 8   Average daily rate      200 non-null    float64
 9   Average length of stay  1077 non-null   float64
dtypes: float64(3), int64(3), object(4)
memory usage: 92.6+ KB
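
After the drop, "Region" has 1,077 non-null values, which means every row with a missing "Region" also had a missing "Internal name". A small sketch of how that overlap could be confirmed before dropping, using toy rows rather than the real data:

```python
import pandas as pd

toy = pd.DataFrame({
    "Listing title": ["Mock Property", "in Indonesia", "Beach Villa", "Private room"],
    "Internal name": [None, None, "Villa-1", None],
    "Region": ["Yogyakarta", None, "Bali", "Bali"],
})

missing_name = toy["Internal name"].isna()
missing_region = toy["Region"].isna()

# True when every row lacking a Region also lacks an Internal name.
region_covered = bool((missing_region & ~missing_name).sum() == 0)
print(region_covered)

# Dropping on Internal name alone then leaves no missing Region values.
cleaned = toy.dropna(subset=["Internal name"]).copy()
print(int(cleaned["Region"].isna().sum()))
```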


Next, check how many currencies are used. If there is only one, the "Average daily rate" values are directly comparable without conversion, so the column can be kept.

In [77]:
currency_count = june_data_cleaned['Currency'].value_counts()

print("Number of different currencies:", len(currency_count))
print("\nList of currencies and their counts:")
print(currency_count)

Number of different currencies: 1

List of currencies and their counts:
Currency
SGD    1077
Name: count, dtype: int64


The next step is to fill the null values in the "Average daily rate" column with the mean of the existing values.

In [78]:
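average_daily_rate_mean = june_data_cleaned['Average daily rate'].mean()
# Assign the result back instead of using inplace=True on a column selection, which pandas 2.x flags as chained assignment.
june_data_cleaned['Average daily rate'] = june_data_cleaned['Average daily rate'].fillna(average_daily_rate_mean)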
june_data_cleaned.info()

<class 'pandas.core.frame.DataFrame'>
Index: 1077 entries, 1 to 1099
Data columns (total 10 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   Listing ID              1077 non-null   int64  
 1   Listing title           1077 non-null   object 
 2   Internal name           1077 non-null   object 
 3   Region                  1077 non-null   object 
 4   Currency                1077 non-null   object 
 5   Bookings                1077 non-null   int64  
 6   Booking value           1077 non-null   float64
 7   Nights booked           1077 non-null   int64  
 8   Average daily rate      1077 non-null   float64
 9   Average length of stay  1077 non-null   float64
dtypes: float64(3), int64(3), object(4)
memory usage: 92.6+ KB
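
In June only 200 of the 1,077 remaining rows have an observed rate, so 877 rows receive the mean. One possible refinement, sketched here on toy data, is to record which rows were imputed before filling them. The flag column name `ADR imputed` is hypothetical and does not appear in the notebook.

```python
import pandas as pd

toy = pd.DataFrame({
    "Listing ID": [1, 2, 3, 4],
    "Average daily rate": [100.0, None, 140.0, None],
})

# Hypothetical flag column: remember which rates were missing before filling.
toy["ADR imputed"] = toy["Average daily rate"].isna()

# mean() skips missing values, so this is the mean of 100.0 and 140.0.
mean_rate = toy["Average daily rate"].mean()

# Assign the filled column back rather than filling in place.
toy["Average daily rate"] = toy["Average daily rate"].fillna(mean_rate)
print(toy)
```

With the flag in place, later steps can still tell observed rates from filled ones.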


Now, let's check the June booking data to ensure that all data has been cleaned.

In [79]:
june_data_cleaned.head()

Unnamed: 0,Listing ID,Listing title,Internal name,Region,Currency,Bookings,Booking value,Nights booked,Average daily rate,Average length of stay
1,44106402,Surf's Retreat: Minutes from Bingin Beach,Bingin Sun & Moon Villas - (Standard),Bali,SGD,0,0.0,0,107.71125,0.0
2,39637584,Sunny Uluwatu Cottages with Fast Wifi + Fresh ...,Uluwatu Kayana Bungalows - 2,Bali,SGD,0,0.0,0,107.71125,0.0
3,41945741,Digital Nomad Room by Bukit Vista | sterilized,Asri Village-2,Bali,SGD,0,0.0,0,107.71125,0.0
4,47416028,ATRA Bambulogy • Alluring Bamboo Villa for Group,"OFFBOARD Atra 5BR Vit2,Ar,At",Bali,SGD,0,0.0,0,107.71125,0.0
5,743045318174537800,Nusa Dua's Tranquil Sanctuary with Immense Garden,Green D'Mel Nusa Dua - Suite 303,Bali,SGD,0,0.0,0,107.71125,0.0
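
The same sequence of steps is applied to every month, so it could be collected into one function. The following is a rough sketch under that assumption; the names `clean_month` and `UNUSED_COLUMNS` are hypothetical, and the function is shown on a toy frame rather than the real files.

```python
import pandas as pd

# Columns removed in the June walkthrough above.
UNUSED_COLUMNS = [
    "Bookings YoY", "Booking value YoY", "Nights booked YoY",
    "Average daily rate YoY", "Average length of stay YoY",
    "Average booking window", "Average booking window YoY",
    "View to contact rate", "View to contact rate YoY",
    "Contact to book rate", "Contact to book rate YoY",
]

def clean_month(raw):
    """Apply the June cleaning steps to one month's raw DataFrame."""
    # Drop unneeded columns; ignore any that are absent.
    cleaned = raw.drop(columns=UNUSED_COLUMNS, errors="ignore")
    # Remove rows without an internal name.
    cleaned = cleaned.dropna(subset=["Internal name"]).copy()
    # Fill missing rates with the mean of the observed rates.
    mean_rate = cleaned["Average daily rate"].mean()
    cleaned["Average daily rate"] = cleaned["Average daily rate"].fillna(mean_rate)
    return cleaned

toy_raw = pd.DataFrame({
    "Listing ID": [1, 2, 3],
    "Internal name": ["Villa-1", None, "Villa-3"],
    "Bookings YoY": [None, None, "-100%"],
    "Average daily rate": [90.0, 50.0, None],
})
toy_clean = clean_month(toy_raw)
print(toy_clean)
```

For a real month, the input would come from a call such as `pd.read_csv("book_data/bookings_2023_july.csv", skiprows=1)`. The sketch skips the inspection steps (`info()`, duplicate and currency checks), which remain useful to run per month.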


# **Preprocess Booking Data July 2023**

Read the July booking data with pandas and display the first five records using the `head()` function.

In [59]:
july_data = pd.read_csv("book_data/bookings_2023_july.csv", skiprows=1)
july_data.head()

Unnamed: 0,Listing ID,Listing title,Internal name,Region,Currency,Bookings,Bookings YoY,Booking value,Booking value YoY,Nights booked,...,Average daily rate,Average daily rate YoY,Average length of stay,Average length of stay YoY,Average booking window,Average booking window YoY,View to contact rate,View to contact rate YoY,Contact to book rate,Contact to book rate YoY
0,696637522460154617,Mock Property Room 2 Test,,Yogyakarta,SGD,0,,0.0,,0,...,,,0.0,,,,,,,
1,44106402,Surf's Retreat: Minutes from Bingin Beach,Bingin Sun & Moon Villas - (Standard),Bali,SGD,0,,0.0,,0,...,,,0.0,,,,,,93.55%,
2,39637584,Sunny Uluwatu Cottages with Fast Wifi + Fresh ...,Uluwatu Kayana Bungalows - 2,Bali,SGD,0,,0.0,,0,...,,,0.0,,,,,,,
3,41945741,Digital Nomad Room by Bukit Vista | sterilized,Asri Village-2,Bali,SGD,0,'-100%,0.0,'-100%,0,...,,,0.0,'-100%,,,,,,
4,47416028,ATRA Bambulogy • Alluring Bamboo Villa for Group,"OFFBOARD Atra 5BR Vit2,Ar,At",Bali,SGD,0,,0.0,,0,...,,,0.0,,,,,,,


The next step is to assess the data to understand the information it contains, then check the number of duplicates in the data, and describe the data using the `describe()` function.

In [81]:
print("Data Info:")
print(july_data.info())

num_duplicates = july_data.duplicated().sum()
print("\nNumber of Duplicates:", num_duplicates)

print("\nData Description:")
print(july_data.describe())

Data Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1100 entries, 0 to 1099
Data columns (total 21 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   Listing ID                  1100 non-null   int64  
 1   Listing title               1100 non-null   object 
 2   Internal name               1077 non-null   object 
 3   Region                      1084 non-null   object 
 4   Currency                    1100 non-null   object 
 5   Bookings                    1100 non-null   int64  
 6   Bookings YoY                334 non-null    object 
 7   Booking value               1100 non-null   float64
 8   Booking value YoY           345 non-null    object 
 9   Nights booked               1100 non-null   int64  
 10  Nights booked YoY           345 non-null    object 
 11  Average daily rate          202 non-null    float64
 12  Average daily rate YoY      96 non-null     object 
 13  Average length of stay

According to this output, the July booking dataset contains no duplicate rows and has 21 columns in total. Several of them, namely the year-over-year (YoY) comparisons, the booking window, and the conversion rates, are not needed here, so they are removed.

In [82]:
# Columns that are not needed: year-over-year comparisons, booking window, and conversion rates.
columns = ["Bookings YoY", "Booking value YoY", "Nights booked YoY", "Average length of stay YoY", "Average booking window", "Average booking window YoY", "Average daily rate YoY", "View to contact rate", "View to contact rate YoY", "Contact to book rate", "Contact to book rate YoY"]
july_data = july_data.drop(columns=columns)
july_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1100 entries, 0 to 1099
Data columns (total 10 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   Listing ID              1100 non-null   int64  
 1   Listing title           1100 non-null   object 
 2   Internal name           1077 non-null   object 
 3   Region                  1084 non-null   object 
 4   Currency                1100 non-null   object 
 5   Bookings                1100 non-null   int64  
 6   Booking value           1100 non-null   float64
 7   Nights booked           1100 non-null   int64  
 8   Average daily rate      202 non-null    float64
 9   Average length of stay  1100 non-null   float64
dtypes: float64(3), int64(3), object(4)
memory usage: 86.1+ KB


After dropping these columns, 10 features remain. The `info()` output shows that the "Internal name", "Region", and "Average daily rate" columns still contain missing values, so the next step is to count the missing values in each column.

In [83]:
july_data.isna().sum()

Listing ID                  0
Listing title               0
Internal name              23
Region                     16
Currency                    0
Bookings                    0
Booking value               0
Nights booked               0
Average daily rate        898
Average length of stay      0
dtype: int64

The next step is to inspect the rows that have missing values for "Internal name" and "Region".

In [84]:
missing_internal_name = july_data[july_data['Internal name'].isnull()]
print("Data with missing 'Internal name':")
missing_internal_name.head(23)

Data with missing 'Internal name':


Unnamed: 0,Listing ID,Listing title,Internal name,Region,Currency,Bookings,Booking value,Nights booked,Average daily rate,Average length of stay
0,696637522460154617,Mock Property Room 2 Test,,Yogyakarta,SGD,0,0.0,0,,0.0
170,1128515481162622561,Private room in Kecamatan Kuta Selatan,,Bali,SGD,0,0.0,0,,0.0
259,1128521303534453919,in Indonesia,,,SGD,0,0.0,0,,0.0
328,746627653405405488,in Indonesia,,,SGD,0,0.0,0,,0.0
349,1128517652026827480,Entire home/apt in Indonesia,,,SGD,0,0.0,0,,0.0
408,1084367686268950206,in Indonesia,,,SGD,0,0.0,0,,0.0
414,565343273164277913,Mock Property Room 1 TESTING,,Bali,SGD,0,0.0,0,,0.0
473,696639001174912560,Mock Property Room 3,,Jakarta,SGD,0,0.0,0,,0.0
487,664020938214053986,Flawless Villa for Group w/ Insane Ocean View ...,,Bali,SGD,0,0.0,0,,0.0
552,1128519127519650932,in Indonesia,,,SGD,0,0.0,0,,0.0


In [86]:
missing_region = july_data[july_data['Region'].isnull()]
print("Data with missing 'Region':")
missing_region.head(16)

Data with missing 'Region':


Unnamed: 0,Listing ID,Listing title,Internal name,Region,Currency,Bookings,Booking value,Nights booked,Average daily rate,Average length of stay
259,1128521303534453919,in Indonesia,,,SGD,0,0.0,0,,0.0
328,746627653405405488,in Indonesia,,,SGD,0,0.0,0,,0.0
349,1128517652026827480,Entire home/apt in Indonesia,,,SGD,0,0.0,0,,0.0
408,1084367686268950206,in Indonesia,,,SGD,0,0.0,0,,0.0
552,1128519127519650932,in Indonesia,,,SGD,0,0.0,0,,0.0
578,1128520621533774906,in Indonesia,,,SGD,0,0.0,0,,0.0
583,993261305858911431,in Indonesia,,,SGD,0,0.0,0,,0.0
656,943861352150397215,in Indonesia,,,SGD,0,0.0,0,,0.0
694,970575070818084733,in Indonesia,,,SGD,0,0.0,0,,0.0
767,1129831168233843711,in Indonesia,,,SGD,0,0.0,0,,0.0


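Only 23 of the 1,100 rows lack an "Internal name", and the rows shown above are mostly mock, test, or placeholder listings with no bookings, so these rows are dropped. As the `info()` output below shows, this also removes all 16 rows with a missing "Region".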

In [87]:
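# .copy() yields an independent frame, so later column assignments do not act on a view of july_data.
july_data_cleaned = july_data.dropna(subset=['Internal name']).copy()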
print("Shape of dataframe after dropping rows with missing 'Internal name':", july_data_cleaned.shape)
july_data_cleaned.info()

Shape of dataframe after dropping rows with missing 'Internal name': (1077, 10)
<class 'pandas.core.frame.DataFrame'>
Index: 1077 entries, 1 to 1099
Data columns (total 10 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   Listing ID              1077 non-null   int64  
 1   Listing title           1077 non-null   object 
 2   Internal name           1077 non-null   object 
 3   Region                  1077 non-null   object 
 4   Currency                1077 non-null   object 
 5   Bookings                1077 non-null   int64  
 6   Booking value           1077 non-null   float64
 7   Nights booked           1077 non-null   int64  
 8   Average daily rate      202 non-null    float64
 9   Average length of stay  1077 non-null   float64
dtypes: float64(3), int64(3), object(4)
memory usage: 92.6+ KB


Next, check how many currencies are used. If there is only one, the "Average daily rate" values are directly comparable without conversion, so the column can be kept.

In [88]:
currency_count = july_data_cleaned['Currency'].value_counts()

print("Number of different currencies:", len(currency_count))
print("\nList of currencies and their counts:")
print(currency_count)

Number of different currencies: 1

List of currencies and their counts:
Currency
SGD    1077
Name: count, dtype: int64


The next step is to fill the null values in the "Average daily rate" column with the mean of the existing values.

In [90]:
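average_daily_rate_mean = july_data_cleaned['Average daily rate'].mean()
# Assign the result back instead of using inplace=True on a column selection, which pandas 2.x flags as chained assignment.
july_data_cleaned['Average daily rate'] = july_data_cleaned['Average daily rate'].fillna(average_daily_rate_mean)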
july_data_cleaned.info()

<class 'pandas.core.frame.DataFrame'>
Index: 1077 entries, 1 to 1099
Data columns (total 10 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   Listing ID              1077 non-null   int64  
 1   Listing title           1077 non-null   object 
 2   Internal name           1077 non-null   object 
 3   Region                  1077 non-null   object 
 4   Currency                1077 non-null   object 
 5   Bookings                1077 non-null   int64  
 6   Booking value           1077 non-null   float64
 7   Nights booked           1077 non-null   int64  
 8   Average daily rate      1077 non-null   float64
 9   Average length of stay  1077 non-null   float64
dtypes: float64(3), int64(3), object(4)
memory usage: 92.6+ KB


Now, let's check the July booking data to ensure that all data has been cleaned.

In [91]:
july_data_cleaned.head()

Unnamed: 0,Listing ID,Listing title,Internal name,Region,Currency,Bookings,Booking value,Nights booked,Average daily rate,Average length of stay
1,44106402,Surf's Retreat: Minutes from Bingin Beach,Bingin Sun & Moon Villas - (Standard),Bali,SGD,0,0.0,0,123.783267,0.0
2,39637584,Sunny Uluwatu Cottages with Fast Wifi + Fresh ...,Uluwatu Kayana Bungalows - 2,Bali,SGD,0,0.0,0,123.783267,0.0
3,41945741,Digital Nomad Room by Bukit Vista | sterilized,Asri Village-2,Bali,SGD,0,0.0,0,123.783267,0.0
4,47416028,ATRA Bambulogy • Alluring Bamboo Villa for Group,"OFFBOARD Atra 5BR Vit2,Ar,At",Bali,SGD,0,0.0,0,123.783267,0.0
5,743045318174537800,Nusa Dua's Tranquil Sanctuary with Immense Garden,Green D'Mel Nusa Dua - Suite 303,Bali,SGD,0,0.0,0,123.783267,0.0


# **Preprocess Booking Data August 2023**

Read the August booking data with pandas and display the first five records using the `head()` function.

In [96]:
august_data = pd.read_csv("book_data/bookings_2023_august.csv", skiprows=1)
august_data.head()

Unnamed: 0,Listing ID,Listing title,Internal name,Region,Currency,Bookings,Bookings YoY,Booking value,Booking value YoY,Nights booked,...,Average daily rate,Average daily rate YoY,Average length of stay,Average length of stay YoY,Average booking window,Average booking window YoY,View to contact rate,View to contact rate YoY,Contact to book rate,Contact to book rate YoY
0,696637522460154617,Mock Property Room 2 Test,,Yogyakarta,SGD,0,,0.0,,0,...,,,0.0,,,,,,,
1,44106402,Surf's Retreat: Minutes from Bingin Beach,Bingin Sun & Moon Villas - (Standard),Bali,SGD,0,,0.0,,0,...,,,0.0,,,,,,221.43%,247.96%
2,39637584,Sunny Uluwatu Cottages with Fast Wifi + Fresh ...,Uluwatu Kayana Bungalows - 2,Bali,SGD,0,,0.0,,0,...,,,0.0,,,,,,,
3,41945741,Digital Nomad Room by Bukit Vista | sterilized,Asri Village-2,Bali,SGD,0,'-100%,0.0,'-100%,0,...,,,0.0,'-100%,,,,,,
4,47416028,ATRA Bambulogy • Alluring Bamboo Villa for Group,"OFFBOARD Atra 5BR Vit2,Ar,At",Bali,SGD,0,,0.0,,0,...,,,0.0,,,,,,,


The next step is to assess the data to understand the information it contains, then check the number of duplicates in the data, and describe the data using the `describe()` function.

In [97]:
print("Data Info:")
print(august_data.info())

num_duplicates = august_data.duplicated().sum()
print("\nNumber of Duplicates:", num_duplicates)

print("\nData Description:")
print(august_data.describe())

Data Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1100 entries, 0 to 1099
Data columns (total 21 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   Listing ID                  1100 non-null   int64  
 1   Listing title               1100 non-null   object 
 2   Internal name               1077 non-null   object 
 3   Region                      1084 non-null   object 
 4   Currency                    1100 non-null   object 
 5   Bookings                    1100 non-null   int64  
 6   Bookings YoY                356 non-null    object 
 7   Booking value               1100 non-null   float64
 8   Booking value YoY           379 non-null    object 
 9   Nights booked               1100 non-null   int64  
 10  Nights booked YoY           379 non-null    object 
 11  Average daily rate          201 non-null    float64
 12  Average daily rate YoY      97 non-null     object 
 13  Average length of stay

According to this output, the August booking dataset contains no duplicate rows and has 21 columns in total. Several of them, namely the year-over-year (YoY) comparisons, the booking window, and the conversion rates, are not needed here, so they are removed.

In [98]:
# Columns that are not needed: year-over-year comparisons, booking window, and conversion rates.
columns = ["Bookings YoY", "Booking value YoY", "Nights booked YoY", "Average length of stay YoY", "Average booking window", "Average booking window YoY", "Average daily rate YoY", "View to contact rate", "View to contact rate YoY", "Contact to book rate", "Contact to book rate YoY"]
august_data = august_data.drop(columns=columns)
august_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1100 entries, 0 to 1099
Data columns (total 10 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   Listing ID              1100 non-null   int64  
 1   Listing title           1100 non-null   object 
 2   Internal name           1077 non-null   object 
 3   Region                  1084 non-null   object 
 4   Currency                1100 non-null   object 
 5   Bookings                1100 non-null   int64  
 6   Booking value           1100 non-null   float64
 7   Nights booked           1100 non-null   int64  
 8   Average daily rate      201 non-null    float64
 9   Average length of stay  1100 non-null   float64
dtypes: float64(3), int64(3), object(4)
memory usage: 86.1+ KB


After dropping these columns, 10 features remain. The `info()` output shows that the "Internal name", "Region", and "Average daily rate" columns still contain missing values, so the next step is to count the missing values in each column.



In [99]:
august_data.isna().sum()

Listing ID                  0
Listing title               0
Internal name              23
Region                     16
Currency                    0
Bookings                    0
Booking value               0
Nights booked               0
Average daily rate        899
Average length of stay      0
dtype: int64

The next step is to inspect the rows that have missing values for "Internal name" and "Region".

In [101]:
missing_internal_name = august_data[august_data['Internal name'].isnull()]
print("Data with missing 'Internal name':")
missing_internal_name.head(23)

Data with missing 'Internal name':


Unnamed: 0,Listing ID,Listing title,Internal name,Region,Currency,Bookings,Booking value,Nights booked,Average daily rate,Average length of stay
0,696637522460154617,Mock Property Room 2 Test,,Yogyakarta,SGD,0,0.0,0,,0.0
170,1128515481162622561,Private room in Kecamatan Kuta Selatan,,Bali,SGD,0,0.0,0,,0.0
259,1128521303534453919,in Indonesia,,,SGD,0,0.0,0,,0.0
328,746627653405405488,in Indonesia,,,SGD,0,0.0,0,,0.0
349,1128517652026827480,Entire home/apt in Indonesia,,,SGD,0,0.0,0,,0.0
408,1084367686268950206,in Indonesia,,,SGD,0,0.0,0,,0.0
414,565343273164277913,Mock Property Room 1 TESTING,,Bali,SGD,0,0.0,0,,0.0
473,696639001174912560,Mock Property Room 3,,Jakarta,SGD,0,0.0,0,,0.0
487,664020938214053986,Flawless Villa for Group w/ Insane Ocean View ...,,Bali,SGD,0,0.0,0,,0.0
552,1128519127519650932,in Indonesia,,,SGD,0,0.0,0,,0.0


In [102]:
missing_region = august_data[august_data['Region'].isnull()]
print("Data with missing 'Region':")
missing_region.head(16)

Data with missing 'Region':


Unnamed: 0,Listing ID,Listing title,Internal name,Region,Currency,Bookings,Booking value,Nights booked,Average daily rate,Average length of stay
259,1128521303534453919,in Indonesia,,,SGD,0,0.0,0,,0.0
328,746627653405405488,in Indonesia,,,SGD,0,0.0,0,,0.0
349,1128517652026827480,Entire home/apt in Indonesia,,,SGD,0,0.0,0,,0.0
408,1084367686268950206,in Indonesia,,,SGD,0,0.0,0,,0.0
552,1128519127519650932,in Indonesia,,,SGD,0,0.0,0,,0.0
578,1128520621533774906,in Indonesia,,,SGD,0,0.0,0,,0.0
583,993261305858911431,in Indonesia,,,SGD,0,0.0,0,,0.0
656,943861352150397215,in Indonesia,,,SGD,0,0.0,0,,0.0
694,970575070818084733,in Indonesia,,,SGD,0,0.0,0,,0.0
767,1129831168233843711,in Indonesia,,,SGD,0,0.0,0,,0.0


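Only 23 of the 1,100 rows lack an "Internal name", and the rows shown above are mostly mock, test, or placeholder listings with no bookings, so these rows are dropped. As the `info()` output below shows, this also removes all 16 rows with a missing "Region".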



In [103]:
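# .copy() yields an independent frame, so later column assignments do not act on a view of august_data.
august_data_cleaned = august_data.dropna(subset=['Internal name']).copy()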
print("Shape of dataframe after dropping rows with missing 'Internal name':", august_data_cleaned.shape)
august_data_cleaned.info()

Shape of dataframe after dropping rows with missing 'Internal name': (1077, 10)
<class 'pandas.core.frame.DataFrame'>
Index: 1077 entries, 1 to 1099
Data columns (total 10 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   Listing ID              1077 non-null   int64  
 1   Listing title           1077 non-null   object 
 2   Internal name           1077 non-null   object 
 3   Region                  1077 non-null   object 
 4   Currency                1077 non-null   object 
 5   Bookings                1077 non-null   int64  
 6   Booking value           1077 non-null   float64
 7   Nights booked           1077 non-null   int64  
 8   Average daily rate      201 non-null    float64
 9   Average length of stay  1077 non-null   float64
dtypes: float64(3), int64(3), object(4)
memory usage: 92.6+ KB


Next, check how many currencies are used. If there is only one, the "Average daily rate" values are directly comparable without conversion, so the column can be kept.

In [104]:
currency_count = august_data_cleaned['Currency'].value_counts()

print("Number of different currencies:", len(currency_count))
print("\nList of currencies and their counts:")
print(currency_count)

Number of different currencies: 1

List of currencies and their counts:
Currency
SGD    1077
Name: count, dtype: int64


The next step is to fill the null values in the "Average daily rate" column with the mean of the existing values.

In [105]:
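average_daily_rate_mean = august_data_cleaned['Average daily rate'].mean()
# Assign the result back instead of using inplace=True on a column selection, which pandas 2.x flags as chained assignment.
august_data_cleaned['Average daily rate'] = august_data_cleaned['Average daily rate'].fillna(average_daily_rate_mean)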
august_data_cleaned.info()

<class 'pandas.core.frame.DataFrame'>
Index: 1077 entries, 1 to 1099
Data columns (total 10 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   Listing ID              1077 non-null   int64  
 1   Listing title           1077 non-null   object 
 2   Internal name           1077 non-null   object 
 3   Region                  1077 non-null   object 
 4   Currency                1077 non-null   object 
 5   Bookings                1077 non-null   int64  
 6   Booking value           1077 non-null   float64
 7   Nights booked           1077 non-null   int64  
 8   Average daily rate      1077 non-null   float64
 9   Average length of stay  1077 non-null   float64
dtypes: float64(3), int64(3), object(4)
memory usage: 92.6+ KB


Now, let's check the August booking data to ensure that all data has been cleaned.

In [106]:
august_data_cleaned.head()

Unnamed: 0,Listing ID,Listing title,Internal name,Region,Currency,Bookings,Booking value,Nights booked,Average daily rate,Average length of stay
1,44106402,Surf's Retreat: Minutes from Bingin Beach,Bingin Sun & Moon Villas - (Standard),Bali,SGD,0,0.0,0,130.937264,0.0
2,39637584,Sunny Uluwatu Cottages with Fast Wifi + Fresh ...,Uluwatu Kayana Bungalows - 2,Bali,SGD,0,0.0,0,130.937264,0.0
3,41945741,Digital Nomad Room by Bukit Vista | sterilized,Asri Village-2,Bali,SGD,0,0.0,0,130.937264,0.0
4,47416028,ATRA Bambulogy • Alluring Bamboo Villa for Group,"OFFBOARD Atra 5BR Vit2,Ar,At",Bali,SGD,0,0.0,0,130.937264,0.0
5,743045318174537800,Nusa Dua's Tranquil Sanctuary with Immense Garden,Green D'Mel Nusa Dua - Suite 303,Bali,SGD,0,0.0,0,130.937264,0.0


# **Preprocess Booking Data September 2023**

First, read the September booking data with pandas and display the first five records of the September CSV file using the head() function.

In [107]:
september_data = pd.read_csv("book_data/bookings_2023_september.csv", skiprows=1)
september_data.head()

Unnamed: 0,Listing ID,Listing title,Internal name,Region,Currency,Bookings,Bookings YoY,Booking value,Booking value YoY,Nights booked,...,Average daily rate,Average daily rate YoY,Average length of stay,Average length of stay YoY,Average booking window,Average booking window YoY,View to contact rate,View to contact rate YoY,Contact to book rate,Contact to book rate YoY
0,696637522460154617,Mock Property Room 2 Test,,Yogyakarta,SGD,0,,0.0,,0,...,,,0.0,,,,,,,
1,44106402,Surf's Retreat: Minutes from Bingin Beach,Bingin Sun & Moon Villas - (Standard),Bali,SGD,0,,0.0,,0,...,,,0.0,,,,,,,
2,39637584,Sunny Uluwatu Cottages with Fast Wifi + Fresh ...,Uluwatu Kayana Bungalows - 2,Bali,SGD,0,,0.0,,0,...,,,0.0,,,,,,,
3,41945741,Digital Nomad Room by Bukit Vista | sterilized,Asri Village-2,Bali,SGD,0,'-100%,0.0,'-100%,0,...,,,0.0,'-100%,,,,,,
4,47416028,ATRA Bambulogy • Alluring Bamboo Villa for Group,"OFFBOARD Atra 5BR Vit2,Ar,At",Bali,SGD,0,,0.0,,0,...,,,0.0,,,,,,,


The next step is to assess the data to understand the information it contains, then check the number of duplicates in the data, and describe the data using the describe() function.

In [108]:
print("Data Info:")
print(september_data.info())

num_duplicates = september_data.duplicated().sum()
print("\nNumber of Duplicates:", num_duplicates)

print("\nData Description:")
print(september_data.describe())

Data Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1100 entries, 0 to 1099
Data columns (total 21 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   Listing ID                  1100 non-null   int64  
 1   Listing title               1100 non-null   object 
 2   Internal name               1077 non-null   object 
 3   Region                      1084 non-null   object 
 4   Currency                    1100 non-null   object 
 5   Bookings                    1100 non-null   int64  
 6   Bookings YoY                362 non-null    object 
 7   Booking value               1100 non-null   float64
 8   Booking value YoY           386 non-null    object 
 9   Nights booked               1100 non-null   int64  
 10  Nights booked YoY           386 non-null    object 
 11  Average daily rate          209 non-null    float64
 12  Average daily rate YoY      103 non-null    object 
 13  Average length of stay

The next step is to drop the unused columns.

In [109]:
columns = ["Bookings YoY","Booking value YoY","Nights booked YoY","Average length of stay YoY","Average booking window","Average booking window YoY","Average daily rate YoY","View to contact rate","View to contact rate YoY","Contact to book rate","Contact to book rate YoY"]
september_data = september_data.drop(columns,axis=1)
september_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1100 entries, 0 to 1099
Data columns (total 10 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   Listing ID              1100 non-null   int64  
 1   Listing title           1100 non-null   object 
 2   Internal name           1077 non-null   object 
 3   Region                  1084 non-null   object 
 4   Currency                1100 non-null   object 
 5   Bookings                1100 non-null   int64  
 6   Booking value           1100 non-null   float64
 7   Nights booked           1100 non-null   int64  
 8   Average daily rate      209 non-null    float64
 9   Average length of stay  1100 non-null   float64
dtypes: float64(3), int64(3), object(4)
memory usage: 86.1+ KB
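
An alternative to listing the columns to drop is to select the ten columns to keep. The sketch below uses an empty toy frame; the constant name `KEEP` is an assumption for illustration, while the column names are the ones shown in the info() output.

```python
import pandas as pd

# The ten columns that remain after the drop step
KEEP = ["Listing ID", "Listing title", "Internal name", "Region", "Currency",
        "Bookings", "Booking value", "Nights booked",
        "Average daily rate", "Average length of stay"]

# Empty toy frame with the kept columns plus a few of the dropped ones
toy = pd.DataFrame(
    columns=KEEP + ["Bookings YoY", "Average booking window", "View to contact rate"]
)

# Selecting by name raises KeyError if an expected column is missing from the export
trimmed = toy[KEEP]
print(trimmed.shape[1])
```

Selecting by name has one practical difference from drop(): an export with an unexpected extra column still yields the same ten columns.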


After the drop, 10 columns remain. The next step is to check for missing values; the info() output above already shows that the "Internal name", "Region", and "Average daily rate" columns contain nulls.


In [110]:
september_data.isna().sum()

Listing ID                  0
Listing title               0
Internal name              23
Region                     16
Currency                    0
Bookings                    0
Booking value               0
Nights booked               0
Average daily rate        891
Average length of stay      0
dtype: int64
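
Before dropping rows, it can help to confirm that the rows missing 'Region' are a subset of the rows missing 'Internal name'. A minimal sketch on toy data (values are made up, and the variable names are illustrative):

```python
import pandas as pd

# Toy frame: every row missing Region also lacks an Internal name
toy = pd.DataFrame({
    "Internal name": [None, "Villa A", None, "Villa B"],
    "Region": [None, "Bali", "Yogyakarta", "Bali"],
})

missing_name = toy["Internal name"].isna()
missing_region = toy["Region"].isna()

# Rows missing Region but NOT Internal name; zero means one dropna covers both
region_only = int((missing_region & ~missing_name).sum())
print(region_only)
```

A result of zero means that dropping on 'Internal name' alone leaves no nulls in 'Region'.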

Because the pattern matches the previous months, we drop the rows with a missing value in the 'Internal name' column. The info() output below shows that this also removes every row with a missing 'Region'.

In [112]:
september_data_cleaned = september_data.dropna(subset=['Internal name'])
print("Shape of dataframe after dropping rows with missing 'Internal name':", september_data_cleaned.shape)
september_data_cleaned.info()

Shape of dataframe after dropping rows with missing 'Internal name': (1077, 10)
<class 'pandas.core.frame.DataFrame'>
Index: 1077 entries, 1 to 1099
Data columns (total 10 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   Listing ID              1077 non-null   int64  
 1   Listing title           1077 non-null   object 
 2   Internal name           1077 non-null   object 
 3   Region                  1077 non-null   object 
 4   Currency                1077 non-null   object 
 5   Bookings                1077 non-null   int64  
 6   Booking value           1077 non-null   float64
 7   Nights booked           1077 non-null   int64  
 8   Average daily rate      209 non-null    float64
 9   Average length of stay  1077 non-null   float64
dtypes: float64(3), int64(3), object(4)
memory usage: 92.6+ KB


The next step is to fill in missing values for the average daily rate using the mean value.

In [113]:
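average_daily_rate_mean = september_data_cleaned['Average daily rate'].mean()
# Assign the result back; fillna(inplace=True) on a selected column may not update the DataFrame
september_data_cleaned = september_data_cleaned.fillna({'Average daily rate': average_daily_rate_mean})
september_data_cleaned.info()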

<class 'pandas.core.frame.DataFrame'>
Index: 1077 entries, 1 to 1099
Data columns (total 10 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   Listing ID              1077 non-null   int64  
 1   Listing title           1077 non-null   object 
 2   Internal name           1077 non-null   object 
 3   Region                  1077 non-null   object 
 4   Currency                1077 non-null   object 
 5   Bookings                1077 non-null   int64  
 6   Booking value           1077 non-null   float64
 7   Nights booked           1077 non-null   int64  
 8   Average daily rate      1077 non-null   float64
 9   Average length of stay  1077 non-null   float64
dtypes: float64(3), int64(3), object(4)
memory usage: 92.6+ KB


Now, let's check the September booking data to ensure that all data has been cleaned.

In [114]:
september_data_cleaned.head()

Unnamed: 0,Listing ID,Listing title,Internal name,Region,Currency,Bookings,Booking value,Nights booked,Average daily rate,Average length of stay
1,44106402,Surf's Retreat: Minutes from Bingin Beach,Bingin Sun & Moon Villas - (Standard),Bali,SGD,0,0.0,0,115.131579,0.0
2,39637584,Sunny Uluwatu Cottages with Fast Wifi + Fresh ...,Uluwatu Kayana Bungalows - 2,Bali,SGD,0,0.0,0,115.131579,0.0
3,41945741,Digital Nomad Room by Bukit Vista | sterilized,Asri Village-2,Bali,SGD,0,0.0,0,115.131579,0.0
4,47416028,ATRA Bambulogy • Alluring Bamboo Villa for Group,"OFFBOARD Atra 5BR Vit2,Ar,At",Bali,SGD,0,0.0,0,115.131579,0.0
5,743045318174537800,Nusa Dua's Tranquil Sanctuary with Immense Garden,Green D'Mel Nusa Dua - Suite 303,Bali,SGD,0,0.0,0,115.131579,0.0


# **Preprocess Booking Data October 2023**

First, read the October booking data with pandas and display the first five records of the October CSV file using the head() function.

In [115]:
october_data = pd.read_csv("book_data/bookings_2023_october.csv", skiprows=1)
october_data.head()

Unnamed: 0,Listing ID,Listing title,Internal name,Region,Currency,Bookings,Bookings YoY,Booking value,Booking value YoY,Nights booked,...,Average daily rate,Average daily rate YoY,Average length of stay,Average length of stay YoY,Average booking window,Average booking window YoY,View to contact rate,View to contact rate YoY,Contact to book rate,Contact to book rate YoY
0,696637522460154617,Mock Property Room 2 Test,,Yogyakarta,SGD,0,,0.0,,0,...,,,0.0,,,,,,,
1,44106402,Surf's Retreat: Minutes from Bingin Beach,Bingin Sun & Moon Villas - (Standard),Bali,SGD,12,1100%,2467.95,2708.2%,33,...,74.79,131.47%,3.0,'-95.08%,0.12,'-99.48%,4.88%,47.75%,139.47%,1573.68%
2,39637584,Sunny Uluwatu Cottages with Fast Wifi + Fresh ...,Uluwatu Kayana Bungalows - 2,Bali,SGD,0,'-100%,0.0,'-100%,0,...,,,0.0,'-100%,,,,,,
3,41945741,Digital Nomad Room by Bukit Vista | sterilized,Asri Village-2,Bali,SGD,0,'-100%,0.0,'-100%,0,...,,,0.0,'-100%,,,,,,
4,47416028,ATRA Bambulogy • Alluring Bamboo Villa for Group,"OFFBOARD Atra 5BR Vit2,Ar,At",Bali,SGD,0,,0.0,,0,...,,,0.0,,,,,,,


The next step is to assess the data to understand the information it contains, then check the number of duplicates in the data, and describe the data using the describe() function.

In [118]:
print("Data Info:")
print(october_data.info())

num_duplicates = october_data.duplicated().sum()
print("\nNumber of Duplicates:", num_duplicates)

print("\nData Description:")
print(october_data.describe())

Data Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1100 entries, 0 to 1099
Data columns (total 21 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   Listing ID                  1100 non-null   int64  
 1   Listing title               1100 non-null   object 
 2   Internal name               1077 non-null   object 
 3   Region                      1084 non-null   object 
 4   Currency                    1100 non-null   object 
 5   Bookings                    1100 non-null   int64  
 6   Bookings YoY                381 non-null    object 
 7   Booking value               1100 non-null   float64
 8   Booking value YoY           398 non-null    object 
 9   Nights booked               1100 non-null   int64  
 10  Nights booked YoY           398 non-null    object 
 11  Average daily rate          199 non-null    float64
 12  Average daily rate YoY      108 non-null    object 
 13  Average length of stay

The next step is to drop the unused columns.

In [121]:
columns = ["Bookings YoY","Booking value YoY","Nights booked YoY","Average length of stay YoY","Average booking window","Average booking window YoY","Average daily rate YoY","View to contact rate","View to contact rate YoY","Contact to book rate","Contact to book rate YoY"]
october_data = october_data.drop(columns,axis=1)
october_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1100 entries, 0 to 1099
Data columns (total 10 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   Listing ID              1100 non-null   int64  
 1   Listing title           1100 non-null   object 
 2   Internal name           1077 non-null   object 
 3   Region                  1084 non-null   object 
 4   Currency                1100 non-null   object 
 5   Bookings                1100 non-null   int64  
 6   Booking value           1100 non-null   float64
 7   Nights booked           1100 non-null   int64  
 8   Average daily rate      199 non-null    float64
 9   Average length of stay  1100 non-null   float64
dtypes: float64(3), int64(3), object(4)
memory usage: 86.1+ KB


After the drop, 10 columns remain. The next step is to check for missing values; the info() output above already shows that the "Internal name", "Region", and "Average daily rate" columns contain nulls.




In [124]:
october_data.isna().sum()

Listing ID                  0
Listing title               0
Internal name              23
Region                     16
Currency                    0
Bookings                    0
Booking value               0
Nights booked               0
Average daily rate        901
Average length of stay      0
dtype: int64

Because the pattern matches the previous months, we drop the rows with a missing value in the 'Internal name' column. The info() output below shows that this also removes every row with a missing 'Region'.

In [127]:
october_data_cleaned = october_data.dropna(subset=['Internal name'])
print("Shape of dataframe after dropping rows with missing 'Internal name':", october_data_cleaned.shape)
october_data_cleaned.info()

Shape of dataframe after dropping rows with missing 'Internal name': (1077, 10)
<class 'pandas.core.frame.DataFrame'>
Index: 1077 entries, 1 to 1099
Data columns (total 10 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   Listing ID              1077 non-null   int64  
 1   Listing title           1077 non-null   object 
 2   Internal name           1077 non-null   object 
 3   Region                  1077 non-null   object 
 4   Currency                1077 non-null   object 
 5   Bookings                1077 non-null   int64  
 6   Booking value           1077 non-null   float64
 7   Nights booked           1077 non-null   int64  
 8   Average daily rate      199 non-null    float64
 9   Average length of stay  1077 non-null   float64
dtypes: float64(3), int64(3), object(4)
memory usage: 92.6+ KB


The next step is to fill in missing values for the average daily rate using the mean value.

In [130]:
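average_daily_rate_mean = october_data_cleaned['Average daily rate'].mean()
# Assign the result back; fillna(inplace=True) on a selected column may not update the DataFrame
october_data_cleaned = october_data_cleaned.fillna({'Average daily rate': average_daily_rate_mean})
october_data_cleaned.info()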

<class 'pandas.core.frame.DataFrame'>
Index: 1077 entries, 1 to 1099
Data columns (total 10 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   Listing ID              1077 non-null   int64  
 1   Listing title           1077 non-null   object 
 2   Internal name           1077 non-null   object 
 3   Region                  1077 non-null   object 
 4   Currency                1077 non-null   object 
 5   Bookings                1077 non-null   int64  
 6   Booking value           1077 non-null   float64
 7   Nights booked           1077 non-null   int64  
 8   Average daily rate      1077 non-null   float64
 9   Average length of stay  1077 non-null   float64
dtypes: float64(3), int64(3), object(4)
memory usage: 92.6+ KB


Now, let's check the October booking data to ensure that all data has been cleaned.

In [133]:
october_data_cleaned.head()

Unnamed: 0,Listing ID,Listing title,Internal name,Region,Currency,Bookings,Booking value,Nights booked,Average daily rate,Average length of stay
1,44106402,Surf's Retreat: Minutes from Bingin Beach,Bingin Sun & Moon Villas - (Standard),Bali,SGD,12,2467.95,33,74.79,3.0
2,39637584,Sunny Uluwatu Cottages with Fast Wifi + Fresh ...,Uluwatu Kayana Bungalows - 2,Bali,SGD,0,0.0,0,107.659296,0.0
3,41945741,Digital Nomad Room by Bukit Vista | sterilized,Asri Village-2,Bali,SGD,0,0.0,0,107.659296,0.0
4,47416028,ATRA Bambulogy • Alluring Bamboo Villa for Group,"OFFBOARD Atra 5BR Vit2,Ar,At",Bali,SGD,0,0.0,0,107.659296,0.0
5,743045318174537800,Nusa Dua's Tranquil Sanctuary with Immense Garden,Green D'Mel Nusa Dua - Suite 303,Bali,SGD,0,0.0,0,107.659296,0.0
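
Since each month goes through the same steps, they could be bundled into one function. The following is a sketch under stated assumptions, not the notebook's own code: the function name `preprocess_month` and the constant `KEEP` are hypothetical, and the in-memory CSV is made-up data that imitates an export whose first line is skipped.

```python
import io
import pandas as pd

KEEP = ["Listing ID", "Listing title", "Internal name", "Region", "Currency",
        "Bookings", "Booking value", "Nights booked",
        "Average daily rate", "Average length of stay"]

def preprocess_month(source):
    """Apply the per-month steps to one CSV export (path or file-like object)."""
    df = pd.read_csv(source, skiprows=1)         # skip the first line, as the notebook does
    df = df[KEEP]                                # keep the ten columns used later
    df = df.dropna(subset=["Internal name"])     # drop rows without an internal name
    rate_mean = df["Average daily rate"].mean()  # mean of the observed rates only
    return df.fillna({"Average daily rate": rate_mean})

# Tiny in-memory stand-in for a monthly CSV (all values are made up)
toy_csv = io.StringIO(
    "report title line\n"
    "Listing ID,Listing title,Internal name,Region,Currency,Bookings,Booking value,"
    "Nights booked,Average daily rate,Average length of stay,Bookings YoY\n"
    "1,Test room,,Yogyakarta,SGD,0,0.0,0,,0.0,\n"
    "2,Room A,Villa A,Bali,SGD,2,200.0,4,50.0,2.0,100%\n"
    "3,Room B,Villa B,Bali,SGD,0,0.0,0,,0.0,\n"
)
cleaned = preprocess_month(toy_csv)
print(cleaned["Average daily rate"].tolist())
```

With the real files, a call such as `preprocess_month("book_data/bookings_2023_october.csv")` would stand in for the individual cells of a month, assuming every export has the same layout.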


# **Preprocess Booking Data November 2023**

First, read the November booking data with pandas and display the first five records of the November CSV file using the head() function.

In [116]:
november_data = pd.read_csv("book_data/bookings_2023_november.csv", skiprows=1)
november_data.head()

Unnamed: 0,Listing ID,Listing title,Internal name,Region,Currency,Bookings,Bookings YoY,Booking value,Booking value YoY,Nights booked,...,Average daily rate,Average daily rate YoY,Average length of stay,Average length of stay YoY,Average booking window,Average booking window YoY,View to contact rate,View to contact rate YoY,Contact to book rate,Contact to book rate YoY
0,696637522460154617,Mock Property Room 2 Test,,Yogyakarta,SGD,0,,0.0,,0,...,,,0.0,,,,,,,
1,44106402,Surf's Retreat: Minutes from Bingin Beach,Bingin Sun & Moon Villas - (Standard),Bali,SGD,24,,4845.77,267.59%,73,...,66.38,105.46%,3.04,,0.14,,3.85%,84.64%,50.69%,246.41%
2,39637584,Sunny Uluwatu Cottages with Fast Wifi + Fresh ...,Uluwatu Kayana Bungalows - 2,Bali,SGD,0,'-100%,0.0,'-100%,0,...,,,0.0,'-100%,,,,,,
3,41945741,Digital Nomad Room by Bukit Vista | sterilized,Asri Village-2,Bali,SGD,0,'-100%,0.0,'-100%,0,...,,,0.0,'-100%,,,,,,
4,47416028,ATRA Bambulogy • Alluring Bamboo Villa for Group,"OFFBOARD Atra 5BR Vit2,Ar,At",Bali,SGD,0,,0.0,,0,...,,,0.0,,,,,,,


The next step is to assess the data to understand the information it contains, then check the number of duplicates in the data, and describe the data using the describe() function.

In [119]:
print("Data Info:")
print(november_data.info())

num_duplicates = november_data.duplicated().sum()
print("\nNumber of Duplicates:", num_duplicates)

print("\nData Description:")
print(november_data.describe())

Data Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1100 entries, 0 to 1099
Data columns (total 21 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   Listing ID                  1100 non-null   int64  
 1   Listing title               1100 non-null   object 
 2   Internal name               1077 non-null   object 
 3   Region                      1084 non-null   object 
 4   Currency                    1100 non-null   object 
 5   Bookings                    1100 non-null   int64  
 6   Bookings YoY                324 non-null    object 
 7   Booking value               1100 non-null   float64
 8   Booking value YoY           354 non-null    object 
 9   Nights booked               1100 non-null   int64  
 10  Nights booked YoY           354 non-null    object 
 11  Average daily rate          181 non-null    float64
 12  Average daily rate YoY      86 non-null     object 
 13  Average length of stay

The next step is to drop the unused columns.

In [122]:
columns = ["Bookings YoY","Booking value YoY","Nights booked YoY","Average length of stay YoY","Average booking window","Average booking window YoY","Average daily rate YoY","View to contact rate","View to contact rate YoY","Contact to book rate","Contact to book rate YoY"]
november_data = november_data.drop(columns,axis=1)
november_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1100 entries, 0 to 1099
Data columns (total 10 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   Listing ID              1100 non-null   int64  
 1   Listing title           1100 non-null   object 
 2   Internal name           1077 non-null   object 
 3   Region                  1084 non-null   object 
 4   Currency                1100 non-null   object 
 5   Bookings                1100 non-null   int64  
 6   Booking value           1100 non-null   float64
 7   Nights booked           1100 non-null   int64  
 8   Average daily rate      181 non-null    float64
 9   Average length of stay  1100 non-null   float64
dtypes: float64(3), int64(3), object(4)
memory usage: 86.1+ KB


After the drop, 10 columns remain. The next step is to check for missing values; the info() output above already shows that the "Internal name", "Region", and "Average daily rate" columns contain nulls.



In [125]:
november_data.isna().sum()

Listing ID                  0
Listing title               0
Internal name              23
Region                     16
Currency                    0
Bookings                    0
Booking value               0
Nights booked               0
Average daily rate        919
Average length of stay      0
dtype: int64

Because the pattern matches the previous months, we drop the rows with a missing value in the 'Internal name' column. The info() output below shows that this also removes every row with a missing 'Region'.

In [128]:
november_data_cleaned = november_data.dropna(subset=['Internal name'])
print("Shape of dataframe after dropping rows with missing 'Internal name':", november_data_cleaned.shape)
november_data_cleaned.info()

Shape of dataframe after dropping rows with missing 'Internal name': (1077, 10)
<class 'pandas.core.frame.DataFrame'>
Index: 1077 entries, 1 to 1099
Data columns (total 10 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   Listing ID              1077 non-null   int64  
 1   Listing title           1077 non-null   object 
 2   Internal name           1077 non-null   object 
 3   Region                  1077 non-null   object 
 4   Currency                1077 non-null   object 
 5   Bookings                1077 non-null   int64  
 6   Booking value           1077 non-null   float64
 7   Nights booked           1077 non-null   int64  
 8   Average daily rate      181 non-null    float64
 9   Average length of stay  1077 non-null   float64
dtypes: float64(3), int64(3), object(4)
memory usage: 92.6+ KB


The next step is to fill in missing values for the average daily rate using the mean value.

In [131]:
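average_daily_rate_mean = november_data_cleaned['Average daily rate'].mean()
# Assign the result back; fillna(inplace=True) on a selected column may not update the DataFrame
november_data_cleaned = november_data_cleaned.fillna({'Average daily rate': average_daily_rate_mean})
november_data_cleaned.info()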

<class 'pandas.core.frame.DataFrame'>
Index: 1077 entries, 1 to 1099
Data columns (total 10 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   Listing ID              1077 non-null   int64  
 1   Listing title           1077 non-null   object 
 2   Internal name           1077 non-null   object 
 3   Region                  1077 non-null   object 
 4   Currency                1077 non-null   object 
 5   Bookings                1077 non-null   int64  
 6   Booking value           1077 non-null   float64
 7   Nights booked           1077 non-null   int64  
 8   Average daily rate      1077 non-null   float64
 9   Average length of stay  1077 non-null   float64
dtypes: float64(3), int64(3), object(4)
memory usage: 92.6+ KB


Now, let's check the November booking data to ensure that all data has been cleaned.

In [134]:
november_data_cleaned.head()

Unnamed: 0,Listing ID,Listing title,Internal name,Region,Currency,Bookings,Booking value,Nights booked,Average daily rate,Average length of stay
1,44106402,Surf's Retreat: Minutes from Bingin Beach,Bingin Sun & Moon Villas - (Standard),Bali,SGD,24,4845.77,73,66.38,3.04
2,39637584,Sunny Uluwatu Cottages with Fast Wifi + Fresh ...,Uluwatu Kayana Bungalows - 2,Bali,SGD,0,0.0,0,99.380331,0.0
3,41945741,Digital Nomad Room by Bukit Vista | sterilized,Asri Village-2,Bali,SGD,0,0.0,0,99.380331,0.0
4,47416028,ATRA Bambulogy • Alluring Bamboo Villa for Group,"OFFBOARD Atra 5BR Vit2,Ar,At",Bali,SGD,0,0.0,0,99.380331,0.0
5,743045318174537800,Nusa Dua's Tranquil Sanctuary with Immense Garden,Green D'Mel Nusa Dua - Suite 303,Bali,SGD,0,0.0,0,99.380331,0.0


# **Preprocess Booking Data December 2023**

First, read the December booking data with pandas and display the first five records of the December CSV file using the head() function.

In [117]:
december_data = pd.read_csv("book_data/bookings_2023_december.csv", skiprows=1)
december_data.head()

Unnamed: 0,Listing ID,Listing title,Internal name,Region,Currency,Bookings,Bookings YoY,Booking value,Booking value YoY,Nights booked,...,Average daily rate,Average daily rate YoY,Average length of stay,Average length of stay YoY,Average booking window,Average booking window YoY,View to contact rate,View to contact rate YoY,Contact to book rate,Contact to book rate YoY
0,696637522460154617,Mock Property Room 2 Test,,Yogyakarta,SGD,0,,0.0,,0,...,,,0.0,,,,,,,
1,44106402,Surf's Retreat: Minutes from Bingin Beach,Bingin Sun & Moon Villas - (Standard),Bali,SGD,25,2400%,4941.97,249.6%,75,...,65.89,96.53%,2.92,46%,0.16,'-99.48%,2.4%,'-13.57%,87.21%,'-7.16%
2,39637584,Sunny Uluwatu Cottages with Fast Wifi + Fresh ...,Uluwatu Kayana Bungalows - 2,Bali,SGD,0,'-100%,0.0,'-100%,0,...,,,0.0,'-100%,,,,,,
3,41945741,Digital Nomad Room by Bukit Vista | sterilized,Asri Village-2,Bali,SGD,0,'-100%,0.0,'-100%,0,...,,,0.0,'-100%,,,,,,
4,47416028,ATRA Bambulogy • Alluring Bamboo Villa for Group,"OFFBOARD Atra 5BR Vit2,Ar,At",Bali,SGD,0,,0.0,,0,...,,,0.0,,,,,,,


The next step is to assess the data to understand the information it contains, then check the number of duplicates in the data, and describe the data using the describe() function.

In [120]:
print("Data Info:")
print(december_data.info())

num_duplicates = december_data.duplicated().sum()
print("\nNumber of Duplicates:", num_duplicates)

print("\nData Description:")
print(december_data.describe())

Data Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1100 entries, 0 to 1099
Data columns (total 21 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   Listing ID                  1100 non-null   int64  
 1   Listing title               1100 non-null   object 
 2   Internal name               1077 non-null   object 
 3   Region                      1084 non-null   object 
 4   Currency                    1100 non-null   object 
 5   Bookings                    1100 non-null   int64  
 6   Bookings YoY                356 non-null    object 
 7   Booking value               1100 non-null   float64
 8   Booking value YoY           372 non-null    object 
 9   Nights booked               1100 non-null   int64  
 10  Nights booked YoY           372 non-null    object 
 11  Average daily rate          192 non-null    float64
 12  Average daily rate YoY      91 non-null     object 
 13  Average length of stay

The next step is to drop the unused columns.

In [123]:
columns = ["Bookings YoY","Booking value YoY","Nights booked YoY","Average length of stay YoY","Average booking window","Average booking window YoY","Average daily rate YoY","View to contact rate","View to contact rate YoY","Contact to book rate","Contact to book rate YoY"]
december_data = december_data.drop(columns,axis=1)
december_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1100 entries, 0 to 1099
Data columns (total 10 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   Listing ID              1100 non-null   int64  
 1   Listing title           1100 non-null   object 
 2   Internal name           1077 non-null   object 
 3   Region                  1084 non-null   object 
 4   Currency                1100 non-null   object 
 5   Bookings                1100 non-null   int64  
 6   Booking value           1100 non-null   float64
 7   Nights booked           1100 non-null   int64  
 8   Average daily rate      192 non-null    float64
 9   Average length of stay  1100 non-null   float64
dtypes: float64(3), int64(3), object(4)
memory usage: 86.1+ KB


After the drop, 10 columns remain. The next step is to check for missing values; the info() output above already shows that the "Internal name", "Region", and "Average daily rate" columns contain nulls.



In [126]:
december_data.isna().sum()

Listing ID                  0
Listing title               0
Internal name              23
Region                     16
Currency                    0
Bookings                    0
Booking value               0
Nights booked               0
Average daily rate        908
Average length of stay      0
dtype: int64

Because the pattern matches the previous months, we drop the rows with a missing value in the 'Internal name' column. The info() output below shows that this also removes every row with a missing 'Region'.

In [129]:
december_data_cleaned = december_data.dropna(subset=['Internal name'])
print("Shape of dataframe after dropping rows with missing 'Internal name':", december_data_cleaned.shape)
december_data_cleaned.info()

Shape of dataframe after dropping rows with missing 'Internal name': (1077, 10)
<class 'pandas.core.frame.DataFrame'>
Index: 1077 entries, 1 to 1099
Data columns (total 10 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   Listing ID              1077 non-null   int64  
 1   Listing title           1077 non-null   object 
 2   Internal name           1077 non-null   object 
 3   Region                  1077 non-null   object 
 4   Currency                1077 non-null   object 
 5   Bookings                1077 non-null   int64  
 6   Booking value           1077 non-null   float64
 7   Nights booked           1077 non-null   int64  
 8   Average daily rate      192 non-null    float64
 9   Average length of stay  1077 non-null   float64
dtypes: float64(3), int64(3), object(4)
memory usage: 92.6+ KB


The next step is to fill in missing values for the average daily rate using the mean value.

In [132]:
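average_daily_rate_mean = december_data_cleaned['Average daily rate'].mean()
# Assign the result back; fillna(inplace=True) on a selected column may not update the DataFrame
december_data_cleaned = december_data_cleaned.fillna({'Average daily rate': average_daily_rate_mean})
december_data_cleaned.info()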

<class 'pandas.core.frame.DataFrame'>
Index: 1077 entries, 1 to 1099
Data columns (total 10 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   Listing ID              1077 non-null   int64  
 1   Listing title           1077 non-null   object 
 2   Internal name           1077 non-null   object 
 3   Region                  1077 non-null   object 
 4   Currency                1077 non-null   object 
 5   Bookings                1077 non-null   int64  
 6   Booking value           1077 non-null   float64
 7   Nights booked           1077 non-null   int64  
 8   Average daily rate      1077 non-null   float64
 9   Average length of stay  1077 non-null   float64
dtypes: float64(3), int64(3), object(4)
memory usage: 92.6+ KB


Now, let's check the December booking data to ensure that all data has been cleaned.

In [135]:
december_data_cleaned.head()

Unnamed: 0,Listing ID,Listing title,Internal name,Region,Currency,Bookings,Booking value,Nights booked,Average daily rate,Average length of stay
1,44106402,Surf's Retreat: Minutes from Bingin Beach,Bingin Sun & Moon Villas - (Standard),Bali,SGD,25,4941.97,75,65.89,2.92
2,39637584,Sunny Uluwatu Cottages with Fast Wifi + Fresh ...,Uluwatu Kayana Bungalows - 2,Bali,SGD,0,0.0,0,119.955677,0.0
3,41945741,Digital Nomad Room by Bukit Vista | sterilized,Asri Village-2,Bali,SGD,0,0.0,0,119.955677,0.0
4,47416028,ATRA Bambulogy • Alluring Bamboo Villa for Group,"OFFBOARD Atra 5BR Vit2,Ar,At",Bali,SGD,0,0.0,0,119.955677,0.0
5,743045318174537800,Nusa Dua's Tranquil Sanctuary with Immense Garden,Green D'Mel Nusa Dua - Suite 303,Bali,SGD,0,0.0,0,119.955677,0.0
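
In the December output above, the listings with zero bookings have no observed rate and all receive the monthly mean of 119.955677. If a later step needs to tell observed rates from filled ones, one option is to record a flag before filling. This is only a sketch on toy data; the column name `ADR imputed` is hypothetical and not part of the notebook.

```python
import pandas as pd

# Toy frame (values are made up)
toy = pd.DataFrame({
    "Bookings": [12, 0, 0],
    "Average daily rate": [74.79, None, None],
})

# Record which rates were missing before they are filled
toy["ADR imputed"] = toy["Average daily rate"].isna()

# Fill the missing rates with the mean of the observed ones
toy = toy.fillna({"Average daily rate": toy["Average daily rate"].mean()})
print(toy["ADR imputed"].tolist())
```

The flag keeps the information that a rate was not observed, which the fill step alone would erase.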


# **Preprocess Booking Data January 2024**

Read the January booking data with pandas, then display the first 5 records of the January CSV file using the head() function.

In [138]:
january_data = pd.read_csv("book_data/bookings_2024_january.csv", skiprows=1)
january_data.head()

Unnamed: 0,Listing ID,Listing title,Internal name,Region,Currency,Bookings,Bookings YoY,Booking value,Booking value YoY,Nights booked,...,Average daily rate,Average daily rate YoY,Average length of stay,Average length of stay YoY,Average booking window,Average booking window YoY,View to contact rate,View to contact rate YoY,Contact to book rate,Contact to book rate YoY
0,696637522460154617,Mock Property Room 2 Test,,Yogyakarta,SGD,0,,0.0,,0,...,,,0.0,,,,,,,
1,44106402,Surf's Retreat: Minutes from Bingin Beach,Bingin Sun & Moon Villas - (Standard),Bali,SGD,1,,186.77,,2,...,93.38,,5.0,,116.0,,1.47%,,100%,
2,39637584,Sunny Uluwatu Cottages with Fast Wifi + Fresh ...,Uluwatu Kayana Bungalows - 2,Bali,SGD,0,,0.0,,0,...,,,0.0,,,,,,,
3,41945741,Digital Nomad Room by Bukit Vista | sterilized,Asri Village-2,Bali,SGD,0,,0.0,'-100%,0,...,,,0.0,,,,,,,
4,47416028,ATRA Bambulogy • Alluring Bamboo Villa for Group,"OFFBOARD Atra 5BR Vit2,Ar,At",Bali,SGD,0,,0.0,,0,...,,,0.0,,,,,,,


The next step is to inspect the data with info(), count the duplicate rows, and summarize the numeric columns with describe().

In [143]:
print("Data Info:")
january_data.info()  # info() prints its report directly and returns None

num_duplicates = january_data.duplicated().sum()
print("\nNumber of Duplicates:", num_duplicates)

print("\nData Description:")
print(january_data.describe())

Data Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1100 entries, 0 to 1099
Data columns (total 21 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   Listing ID                  1100 non-null   int64  
 1   Listing title               1100 non-null   object 
 2   Internal name               1077 non-null   object 
 3   Region                      1084 non-null   object 
 4   Currency                    1100 non-null   object 
 5   Bookings                    1100 non-null   int64  
 6   Bookings YoY                33 non-null     object 
 7   Booking value               1100 non-null   float64
 8   Booking value YoY           94 non-null     object 
 9   Nights booked               1100 non-null   int64  
 10  Nights booked YoY           94 non-null     object 
 11  Average daily rate          106 non-null    float64
 12  Average daily rate YoY      33 non-null     object 
 13  Average length of stay

The next step is to drop the unused columns.

In [147]:
# Each unused column is listed once (the original list repeated "View to contact rate YoY").
columns = ["Bookings YoY", "Booking value YoY", "Nights booked YoY", "Average daily rate YoY", "Average length of stay YoY", "Average booking window", "Average booking window YoY", "View to contact rate", "View to contact rate YoY", "Contact to book rate", "Contact to book rate YoY"]
january_data = january_data.drop(columns=columns)
january_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1100 entries, 0 to 1099
Data columns (total 10 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   Listing ID              1100 non-null   int64  
 1   Listing title           1100 non-null   object 
 2   Internal name           1077 non-null   object 
 3   Region                  1084 non-null   object 
 4   Currency                1100 non-null   object 
 5   Bookings                1100 non-null   int64  
 6   Booking value           1100 non-null   float64
 7   Nights booked           1100 non-null   int64  
 8   Average daily rate      106 non-null    float64
 9   Average length of stay  1100 non-null   float64
dtypes: float64(3), int64(3), object(4)
memory usage: 86.1+ KB
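The drop list above is written out by hand for every month. A minimal sketch of building it from the column names instead, assuming the year-over-year columns always end in " YoY"; the toy frame `report` and its values are made up for illustration.

```python
import pandas as pd

# Toy frame with a few of the 21 report columns (values are made up).
report = pd.DataFrame({
    "Listing ID": [1, 2],
    "Bookings": [3, 0],
    "Bookings YoY": ["10%", None],
    "Average booking window": [0.4, None],
    "View to contact rate": ["1.5%", None],
    "View to contact rate YoY": [None, None],
    "Contact to book rate": ["50%", None],
})

# Year-over-year columns share the " YoY" suffix, so they can be collected by name.
yoy_columns = [name for name in report.columns if name.endswith(" YoY")]
# The remaining unused metrics have no common suffix and are listed explicitly.
other_unused = ["Average booking window", "View to contact rate", "Contact to book rate"]

trimmed = report.drop(columns=yoy_columns + other_unused)
print(trimmed.columns.tolist())
```

Collecting the names this way avoids accidental repeats in a long hand-typed list.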


After dropping these columns, 10 features remain. The info() output above shows that the "Internal name", "Region", and "Average daily rate" columns still contain missing values, so the next step is to count them per column.

In [151]:
january_data.isna().sum()

Listing ID                  0
Listing title               0
Internal name              23
Region                     16
Currency                    0
Bookings                    0
Booking value               0
Nights booked               0
Average daily rate        994
Average length of stay      0
dtype: int64
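To put these counts in proportion, the sketch below converts them to percentages of the 1100 rows reported by info(). The counts are copied by hand from the output above; on the real frame, `january_data.isna().mean() * 100` should give the same shares.

```python
import pandas as pd

# Missing counts copied from the January isna().sum() output above; 1100 rows in total.
missing = pd.Series({"Internal name": 23, "Region": 16, "Average daily rate": 994})
total_rows = 1100

# Share of missing values per column, in percent.
missing_share = (missing / total_rows * 100).round(2)
print(missing_share)
```

Roughly nine out of ten rates are missing in January, so most values in that column will be imputed rather than observed.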

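Since the missing-value pattern matches the previous months, the next step is to drop the rows with a missing 'Internal name'. As the output below shows, this also removes every row with a missing 'Region'.

In [156]:
# .copy() makes the result an independent DataFrame for the column assignment that follows.
january_data_cleaned = january_data.dropna(subset=['Internal name']).copy()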
print("Shape of dataframe after dropping rows with missing 'Internal name':", january_data_cleaned.shape)
january_data_cleaned.info()

Shape of dataframe after dropping rows with missing 'Internal name': (1077, 10)
<class 'pandas.core.frame.DataFrame'>
Index: 1077 entries, 1 to 1099
Data columns (total 10 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   Listing ID              1077 non-null   int64  
 1   Listing title           1077 non-null   object 
 2   Internal name           1077 non-null   object 
 3   Region                  1077 non-null   object 
 4   Currency                1077 non-null   object 
 5   Bookings                1077 non-null   int64  
 6   Booking value           1077 non-null   float64
 7   Nights booked           1077 non-null   int64  
 8   Average daily rate      106 non-null    float64
 9   Average length of stay  1077 non-null   float64
dtypes: float64(3), int64(3), object(4)
memory usage: 92.6+ KB
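The output above shows "Region" fully populated after rows were dropped on "Internal name" alone, which suggests the rows missing a region are among those missing an internal name. A small sketch of how that could be checked explicitly, using a made-up frame `listings` rather than the real data:

```python
import pandas as pd

# Made-up listings; None marks a missing value.
listings = pd.DataFrame({
    "Internal name": [None, "Villa A", "Villa B", None],
    "Region": ["Yogyakarta", "Bali", "Bali", None],
})

# True when every row lacking a Region also lacks an Internal name.
region_gap_covered = bool(listings.loc[listings["Region"].isna(), "Internal name"].isna().all())

# Drop on "Internal name" only, then confirm no Region gaps remain.
cleaned = listings.dropna(subset=["Internal name"]).copy()
print(region_gap_covered, cleaned.shape, int(cleaned["Region"].isna().sum()))
```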


The next step is to fill in missing values for the average daily rate using the mean value.

In [161]:
average_daily_rate_mean = january_data_cleaned['Average daily rate'].mean()
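# Assign the result back instead of using inplace=True on a column selection (chained assignment).
january_data_cleaned['Average daily rate'] = january_data_cleaned['Average daily rate'].fillna(average_daily_rate_mean)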
january_data_cleaned.info()

<class 'pandas.core.frame.DataFrame'>
Index: 1077 entries, 1 to 1099
Data columns (total 10 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   Listing ID              1077 non-null   int64  
 1   Listing title           1077 non-null   object 
 2   Internal name           1077 non-null   object 
 3   Region                  1077 non-null   object 
 4   Currency                1077 non-null   object 
 5   Bookings                1077 non-null   int64  
 6   Booking value           1077 non-null   float64
 7   Nights booked           1077 non-null   int64  
 8   Average daily rate      1077 non-null   float64
 9   Average length of stay  1077 non-null   float64
dtypes: float64(3), int64(3), object(4)
memory usage: 92.6+ KB
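After the fill, observed and imputed rates can no longer be told apart. One option, sketched here on made-up numbers, is to record which rows were missing before filling; the `rate_imputed` column is a hypothetical addition, not part of the original notebook.

```python
import pandas as pd

# Made-up rates; None marks a missing value.
rates = pd.DataFrame({"Average daily rate": [93.38, None, 150.0, None]})

# Record which rows were missing before filling (hypothetical helper column).
rates["rate_imputed"] = rates["Average daily rate"].isna()

# mean() skips NaN by default, so only the two observed rates contribute.
observed_mean = rates["Average daily rate"].mean()
rates["Average daily rate"] = rates["Average daily rate"].fillna(observed_mean)
print(observed_mean, int(rates["rate_imputed"].sum()))
```

Keeping the flag makes it possible to exclude or down-weight imputed rows later without re-reading the raw file.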


Finally, preview the cleaned January booking data. Rows without a recorded rate now show the imputed mean (120.885943).

In [165]:
january_data_cleaned.head()

Unnamed: 0,Listing ID,Listing title,Internal name,Region,Currency,Bookings,Booking value,Nights booked,Average daily rate,Average length of stay
1,44106402,Surf's Retreat: Minutes from Bingin Beach,Bingin Sun & Moon Villas - (Standard),Bali,SGD,1,186.77,2,93.38,5.0
2,39637584,Sunny Uluwatu Cottages with Fast Wifi + Fresh ...,Uluwatu Kayana Bungalows - 2,Bali,SGD,0,0.0,0,120.885943,0.0
3,41945741,Digital Nomad Room by Bukit Vista | sterilized,Asri Village-2,Bali,SGD,0,0.0,0,120.885943,0.0
4,47416028,ATRA Bambulogy • Alluring Bamboo Villa for Group,"OFFBOARD Atra 5BR Vit2,Ar,At",Bali,SGD,0,0.0,0,120.885943,0.0
5,743045318174537800,Nusa Dua's Tranquil Sanctuary with Immense Garden,Green D'Mel Nusa Dua - Suite 303,Bali,SGD,0,0.0,0,120.885943,0.0


# **Preprocess Booking Data February 2024**

Read the February booking data with pandas, then display the first 5 records of the February CSV file using the head() function.

In [139]:
february_data = pd.read_csv("book_data/bookings_2024_february.csv", skiprows=1)
february_data.head()

Unnamed: 0,Listing ID,Listing title,Internal name,Region,Currency,Bookings,Bookings YoY,Booking value,Booking value YoY,Nights booked,...,Average daily rate,Average daily rate YoY,Average length of stay,Average length of stay YoY,Average booking window,Average booking window YoY,View to contact rate,View to contact rate YoY,Contact to book rate,Contact to book rate YoY
0,696637522460154617,Mock Property Room 2 Test,,Yogyakarta,SGD,0,,0.0,,0,...,,,0.0,,,,,,,
1,44106402,Surf's Retreat: Minutes from Bingin Beach,Bingin Sun & Moon Villas - (Standard),Bali,SGD,29,2800%,5464.99,1344.63%,84,...,65.06,17.4%,3.03,'-39.31%,0.33,'-99.58%,2.79%,'-73.48%,53.05%,653.29%
2,39637584,Sunny Uluwatu Cottages with Fast Wifi + Fresh ...,Uluwatu Kayana Bungalows - 2,Bali,SGD,0,,0.0,,0,...,,,0.0,,,,,,,
3,41945741,Digital Nomad Room by Bukit Vista | sterilized,Asri Village-2,Bali,SGD,0,'-100%,0.0,'-100%,0,...,,,0.0,'-100%,,,,,,
4,47416028,ATRA Bambulogy • Alluring Bamboo Villa for Group,"OFFBOARD Atra 5BR Vit2,Ar,At",Bali,SGD,0,,0.0,,0,...,,,0.0,,,,,,,


The next step is to inspect the data with info(), count the duplicate rows, and summarize the numeric columns with describe().

In [144]:
print("Data Info:")
february_data.info()  # info() prints its report directly and returns None

num_duplicates = february_data.duplicated().sum()
print("\nNumber of Duplicates:", num_duplicates)

print("\nData Description:")
print(february_data.describe())

Data Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1100 entries, 0 to 1099
Data columns (total 21 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   Listing ID                  1100 non-null   int64  
 1   Listing title               1100 non-null   object 
 2   Internal name               1077 non-null   object 
 3   Region                      1084 non-null   object 
 4   Currency                    1100 non-null   object 
 5   Bookings                    1100 non-null   int64  
 6   Bookings YoY                206 non-null    object 
 7   Booking value               1100 non-null   float64
 8   Booking value YoY           230 non-null    object 
 9   Nights booked               1100 non-null   int64  
 10  Nights booked YoY           230 non-null    object 
 11  Average daily rate          183 non-null    float64
 12  Average daily rate YoY      66 non-null     object 
 13  Average length of stay

The next step is to drop the unused columns.

In [148]:
# Each unused column is listed once (the original list repeated "View to contact rate YoY").
columns = ["Bookings YoY", "Booking value YoY", "Nights booked YoY", "Average daily rate YoY", "Average length of stay YoY", "Average booking window", "Average booking window YoY", "View to contact rate", "View to contact rate YoY", "Contact to book rate", "Contact to book rate YoY"]
february_data = february_data.drop(columns=columns)
february_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1100 entries, 0 to 1099
Data columns (total 10 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   Listing ID              1100 non-null   int64  
 1   Listing title           1100 non-null   object 
 2   Internal name           1077 non-null   object 
 3   Region                  1084 non-null   object 
 4   Currency                1100 non-null   object 
 5   Bookings                1100 non-null   int64  
 6   Booking value           1100 non-null   float64
 7   Nights booked           1100 non-null   int64  
 8   Average daily rate      183 non-null    float64
 9   Average length of stay  1100 non-null   float64
dtypes: float64(3), int64(3), object(4)
memory usage: 86.1+ KB


After dropping these columns, 10 features remain. The info() output above shows that the "Internal name", "Region", and "Average daily rate" columns still contain missing values, so the next step is to count them per column.

In [152]:
february_data.isna().sum()

Listing ID                  0
Listing title               0
Internal name              23
Region                     16
Currency                    0
Bookings                    0
Booking value               0
Nights booked               0
Average daily rate        917
Average length of stay      0
dtype: int64

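Since the missing-value pattern matches the previous months, the next step is to drop the rows with a missing 'Internal name'. As the output below shows, this also removes every row with a missing 'Region'.

In [157]:
# .copy() makes the result an independent DataFrame for the column assignment that follows.
february_data_cleaned = february_data.dropna(subset=['Internal name']).copy()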
print("Shape of dataframe after dropping rows with missing 'Internal name':", february_data_cleaned.shape)
february_data_cleaned.info()

Shape of dataframe after dropping rows with missing 'Internal name': (1077, 10)
<class 'pandas.core.frame.DataFrame'>
Index: 1077 entries, 1 to 1099
Data columns (total 10 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   Listing ID              1077 non-null   int64  
 1   Listing title           1077 non-null   object 
 2   Internal name           1077 non-null   object 
 3   Region                  1077 non-null   object 
 4   Currency                1077 non-null   object 
 5   Bookings                1077 non-null   int64  
 6   Booking value           1077 non-null   float64
 7   Nights booked           1077 non-null   int64  
 8   Average daily rate      183 non-null    float64
 9   Average length of stay  1077 non-null   float64
dtypes: float64(3), int64(3), object(4)
memory usage: 92.6+ KB


The next step is to fill in missing values for the average daily rate using the mean value.

In [162]:
average_daily_rate_mean = february_data_cleaned['Average daily rate'].mean()
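# Assign the result back instead of using inplace=True on a column selection (chained assignment).
february_data_cleaned['Average daily rate'] = february_data_cleaned['Average daily rate'].fillna(average_daily_rate_mean)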
february_data_cleaned.info()

<class 'pandas.core.frame.DataFrame'>
Index: 1077 entries, 1 to 1099
Data columns (total 10 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   Listing ID              1077 non-null   int64  
 1   Listing title           1077 non-null   object 
 2   Internal name           1077 non-null   object 
 3   Region                  1077 non-null   object 
 4   Currency                1077 non-null   object 
 5   Bookings                1077 non-null   int64  
 6   Booking value           1077 non-null   float64
 7   Nights booked           1077 non-null   int64  
 8   Average daily rate      1077 non-null   float64
 9   Average length of stay  1077 non-null   float64
dtypes: float64(3), int64(3), object(4)
memory usage: 92.6+ KB



Finally, preview the cleaned February booking data. Rows without a recorded rate now show the imputed mean (206.590546).

In [166]:
february_data_cleaned.head()

Unnamed: 0,Listing ID,Listing title,Internal name,Region,Currency,Bookings,Booking value,Nights booked,Average daily rate,Average length of stay
1,44106402,Surf's Retreat: Minutes from Bingin Beach,Bingin Sun & Moon Villas - (Standard),Bali,SGD,29,5464.99,84,65.06,3.03
2,39637584,Sunny Uluwatu Cottages with Fast Wifi + Fresh ...,Uluwatu Kayana Bungalows - 2,Bali,SGD,0,0.0,0,206.590546,0.0
3,41945741,Digital Nomad Room by Bukit Vista | sterilized,Asri Village-2,Bali,SGD,0,0.0,0,206.590546,0.0
4,47416028,ATRA Bambulogy • Alluring Bamboo Villa for Group,"OFFBOARD Atra 5BR Vit2,Ar,At",Bali,SGD,0,0.0,0,206.590546,0.0
5,743045318174537800,Nusa Dua's Tranquil Sanctuary with Immense Garden,Green D'Mel Nusa Dua - Suite 303,Bali,SGD,0,0.0,0,206.590546,0.0


# **Preprocess Booking Data March 2024**

Read the March booking data with pandas, then display the first 5 records of the March CSV file using the head() function.

In [140]:
march_data = pd.read_csv("book_data/bookings_2024_march.csv", skiprows=1)
march_data.head()

Unnamed: 0,Listing ID,Listing title,Internal name,Region,Currency,Bookings,Bookings YoY,Booking value,Booking value YoY,Nights booked,...,Average daily rate,Average daily rate YoY,Average length of stay,Average length of stay YoY,Average booking window,Average booking window YoY,View to contact rate,View to contact rate YoY,Contact to book rate,Contact to book rate YoY
0,696637522460154617,Mock Property Room 2 Test,,Yogyakarta,SGD,0,,0.0,,0,...,,,0.0,,,,,,,
1,44106402,Surf's Retreat: Minutes from Bingin Beach,Bingin Sun & Moon Villas - (Standard),Bali,SGD,22,,4418.95,1218.3%,75,...,58.92,19.99%,3.0,,0.42,,1.96%,'-52.14%,49.35%,133.47%
2,39637584,Sunny Uluwatu Cottages with Fast Wifi + Fresh ...,Uluwatu Kayana Bungalows - 2,Bali,SGD,0,'-100%,0.0,'-100%,0,...,,,0.0,'-100%,,,,,,
3,41945741,Digital Nomad Room by Bukit Vista | sterilized,Asri Village-2,Bali,SGD,0,'-100%,0.0,'-100%,0,...,,,0.0,'-100%,,,,,,
4,47416028,ATRA Bambulogy • Alluring Bamboo Villa for Group,"OFFBOARD Atra 5BR Vit2,Ar,At",Bali,SGD,0,,0.0,,0,...,,,0.0,,,,,,,


The next step is to inspect the data with info(), count the duplicate rows, and summarize the numeric columns with describe().

In [145]:
print("Data Info:")
march_data.info()  # info() prints its report directly and returns None

num_duplicates = march_data.duplicated().sum()
print("\nNumber of Duplicates:", num_duplicates)

print("\nData Description:")
print(march_data.describe())

Data Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1100 entries, 0 to 1099
Data columns (total 21 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   Listing ID                  1100 non-null   int64  
 1   Listing title               1100 non-null   object 
 2   Internal name               1077 non-null   object 
 3   Region                      1084 non-null   object 
 4   Currency                    1100 non-null   object 
 5   Bookings                    1100 non-null   int64  
 6   Bookings YoY                269 non-null    object 
 7   Booking value               1100 non-null   float64
 8   Booking value YoY           301 non-null    object 
 9   Nights booked               1100 non-null   int64  
 10  Nights booked YoY           301 non-null    object 
 11  Average daily rate          176 non-null    float64
 12  Average daily rate YoY      71 non-null     object 
 13  Average length of stay
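The duplicate check in the cell above compares entire rows. Whether the same listing can appear twice with different figures is not something the source says; if that is a concern, the check can be restricted to the identifier column. A minimal sketch on a made-up frame `toy`:

```python
import pandas as pd

# Made-up rows: listing 102 appears twice with different booking counts.
toy = pd.DataFrame({
    "Listing ID": [101, 102, 102, 103],
    "Bookings": [1, 0, 2, 0],
})

# duplicated() compares whole rows; no two rows here are identical.
full_row_duplicates = int(toy.duplicated().sum())
# Restricting to "Listing ID" flags the second occurrence of listing 102.
id_duplicates = int(toy.duplicated(subset=["Listing ID"]).sum())
print(full_row_duplicates, id_duplicates)
```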

The next step is to drop the unused columns.

In [149]:
# Each unused column is listed once (the original list repeated "View to contact rate YoY").
columns = ["Bookings YoY", "Booking value YoY", "Nights booked YoY", "Average daily rate YoY", "Average length of stay YoY", "Average booking window", "Average booking window YoY", "View to contact rate", "View to contact rate YoY", "Contact to book rate", "Contact to book rate YoY"]
march_data = march_data.drop(columns=columns)
march_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1100 entries, 0 to 1099
Data columns (total 10 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   Listing ID              1100 non-null   int64  
 1   Listing title           1100 non-null   object 
 2   Internal name           1077 non-null   object 
 3   Region                  1084 non-null   object 
 4   Currency                1100 non-null   object 
 5   Bookings                1100 non-null   int64  
 6   Booking value           1100 non-null   float64
 7   Nights booked           1100 non-null   int64  
 8   Average daily rate      176 non-null    float64
 9   Average length of stay  1100 non-null   float64
dtypes: float64(3), int64(3), object(4)
memory usage: 86.1+ KB


After dropping these columns, 10 features remain. The info() output above shows that the "Internal name", "Region", and "Average daily rate" columns still contain missing values, so the next step is to count them per column.

In [153]:
march_data.isna().sum()

Listing ID                  0
Listing title               0
Internal name              23
Region                     16
Currency                    0
Bookings                    0
Booking value               0
Nights booked               0
Average daily rate        924
Average length of stay      0
dtype: int64

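Since the missing-value pattern matches the previous months, the next step is to drop the rows with a missing 'Internal name'. As the output below shows, this also removes every row with a missing 'Region'.

In [159]:
# .copy() makes the result an independent DataFrame for the column assignment that follows.
march_data_cleaned = march_data.dropna(subset=['Internal name']).copy()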
print("Shape of dataframe after dropping rows with missing 'Internal name':", march_data_cleaned.shape)
march_data_cleaned.info()

Shape of dataframe after dropping rows with missing 'Internal name': (1077, 10)
<class 'pandas.core.frame.DataFrame'>
Index: 1077 entries, 1 to 1099
Data columns (total 10 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   Listing ID              1077 non-null   int64  
 1   Listing title           1077 non-null   object 
 2   Internal name           1077 non-null   object 
 3   Region                  1077 non-null   object 
 4   Currency                1077 non-null   object 
 5   Bookings                1077 non-null   int64  
 6   Booking value           1077 non-null   float64
 7   Nights booked           1077 non-null   int64  
 8   Average daily rate      176 non-null    float64
 9   Average length of stay  1077 non-null   float64
dtypes: float64(3), int64(3), object(4)
memory usage: 92.6+ KB


The next step is to fill in missing values for the average daily rate using the mean value.

In [163]:
average_daily_rate_mean = march_data_cleaned['Average daily rate'].mean()
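# Assign the result back instead of using inplace=True on a column selection (chained assignment).
march_data_cleaned['Average daily rate'] = march_data_cleaned['Average daily rate'].fillna(average_daily_rate_mean)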
march_data_cleaned.info()

<class 'pandas.core.frame.DataFrame'>
Index: 1077 entries, 1 to 1099
Data columns (total 10 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   Listing ID              1077 non-null   int64  
 1   Listing title           1077 non-null   object 
 2   Internal name           1077 non-null   object 
 3   Region                  1077 non-null   object 
 4   Currency                1077 non-null   object 
 5   Bookings                1077 non-null   int64  
 6   Booking value           1077 non-null   float64
 7   Nights booked           1077 non-null   int64  
 8   Average daily rate      1077 non-null   float64
 9   Average length of stay  1077 non-null   float64
dtypes: float64(3), int64(3), object(4)
memory usage: 92.6+ KB



Finally, preview the cleaned March booking data. Rows without a recorded rate now show the imputed mean (98.057443).

In [167]:
march_data_cleaned.head()

Unnamed: 0,Listing ID,Listing title,Internal name,Region,Currency,Bookings,Booking value,Nights booked,Average daily rate,Average length of stay
1,44106402,Surf's Retreat: Minutes from Bingin Beach,Bingin Sun & Moon Villas - (Standard),Bali,SGD,22,4418.95,75,58.92,3.0
2,39637584,Sunny Uluwatu Cottages with Fast Wifi + Fresh ...,Uluwatu Kayana Bungalows - 2,Bali,SGD,0,0.0,0,98.057443,0.0
3,41945741,Digital Nomad Room by Bukit Vista | sterilized,Asri Village-2,Bali,SGD,0,0.0,0,98.057443,0.0
4,47416028,ATRA Bambulogy • Alluring Bamboo Villa for Group,"OFFBOARD Atra 5BR Vit2,Ar,At",Bali,SGD,0,0.0,0,98.057443,0.0
5,743045318174537800,Nusa Dua's Tranquil Sanctuary with Immense Garden,Green D'Mel Nusa Dua - Suite 303,Bali,SGD,0,0.0,0,98.057443,0.0


# **Preprocess Booking Data April 2024**

Read the April booking data with pandas, then display the first 5 records of the April CSV file using the head() function. Note that the April file is named booking_2024_april.csv, without the "s" used in the other file names.

In [142]:
april_data = pd.read_csv("book_data/booking_2024_april.csv", skiprows=1)
april_data.head()

Unnamed: 0,Listing ID,Listing title,Internal name,Region,Currency,Bookings,Bookings YoY,Booking value,Booking value YoY,Nights booked,...,Average daily rate,Average daily rate YoY,Average length of stay,Average length of stay YoY,Average booking window,Average booking window YoY,View to contact rate,View to contact rate YoY,Contact to book rate,Contact to book rate YoY
0,696637522460154617,Mock Property Room 2 Test,,Yogyakarta,SGD,0,,0.0,,0,...,,,0.0,,,,,,,
1,44106402,Surf's Retreat: Minutes from Bingin Beach,Bingin Sun & Moon Villas - (Standard),Bali,SGD,23,475%,5696.8,471.34%,89,...,64.01,49%,4.3,'-21.74%,0.42,'-83.77%,2.27%,2.8%,66.92%,482.57%
2,39637584,Sunny Uluwatu Cottages with Fast Wifi + Fresh ...,Uluwatu Kayana Bungalows - 2,Bali,SGD,0,'-100%,0.0,'-100%,0,...,,,0.0,'-100%,,,,,,
3,41945741,Digital Nomad Room by Bukit Vista | sterilized,Asri Village-2,Bali,SGD,0,'-100%,0.0,'-100%,0,...,,,0.0,'-100%,,,,,,
4,47416028,ATRA Bambulogy • Alluring Bamboo Villa for Group,"OFFBOARD Atra 5BR Vit2,Ar,At",Bali,SGD,0,,0.0,,0,...,,,0.0,,,,,,,
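The April file uses the prefix "booking" while the other months use "bookings", which is easy to miss when paths are typed by hand. A small sketch of parsing the year and month from the file names with a pattern that tolerates both spellings; the regular expression is an assumption based on the names printed by the directory listing at the start of the notebook.

```python
import re

# Two of the file names printed by the directory listing earlier in the notebook.
file_names = ["bookings_2024_march.csv", "booking_2024_april.csv"]

# "bookings?" matches both the plural prefix and the singular one used for April.
pattern = re.compile(r"^bookings?_(\d{4})_([a-z]+)\.csv$")

parsed = {}
for name in file_names:
    match = pattern.match(name)
    if match:
        # Store (year, month name) for each recognised file.
        parsed[name] = (int(match.group(1)), match.group(2))
print(parsed)
```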


The next step is to inspect the data with info(), count the duplicate rows, and summarize the numeric columns with describe().

In [146]:
print("Data Info:")
april_data.info()  # info() prints its report directly and returns None

num_duplicates = april_data.duplicated().sum()
print("\nNumber of Duplicates:", num_duplicates)

print("\nData Description:")
print(april_data.describe())

Data Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1100 entries, 0 to 1099
Data columns (total 21 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   Listing ID                  1100 non-null   int64  
 1   Listing title               1100 non-null   object 
 2   Internal name               1077 non-null   object 
 3   Region                      1084 non-null   object 
 4   Currency                    1100 non-null   object 
 5   Bookings                    1100 non-null   int64  
 6   Bookings YoY                346 non-null    object 
 7   Booking value               1100 non-null   float64
 8   Booking value YoY           366 non-null    object 
 9   Nights booked               1100 non-null   int64  
 10  Nights booked YoY           366 non-null    object 
 11  Average daily rate          179 non-null    float64
 12  Average daily rate YoY      83 non-null     object 
 13  Average length of stay

The next step is to drop the unused columns.

In [150]:
# Each unused column is listed once (the original list repeated "View to contact rate YoY").
columns = ["Bookings YoY", "Booking value YoY", "Nights booked YoY", "Average daily rate YoY", "Average length of stay YoY", "Average booking window", "Average booking window YoY", "View to contact rate", "View to contact rate YoY", "Contact to book rate", "Contact to book rate YoY"]
april_data = april_data.drop(columns=columns)
april_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1100 entries, 0 to 1099
Data columns (total 10 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   Listing ID              1100 non-null   int64  
 1   Listing title           1100 non-null   object 
 2   Internal name           1077 non-null   object 
 3   Region                  1084 non-null   object 
 4   Currency                1100 non-null   object 
 5   Bookings                1100 non-null   int64  
 6   Booking value           1100 non-null   float64
 7   Nights booked           1100 non-null   int64  
 8   Average daily rate      179 non-null    float64
 9   Average length of stay  1100 non-null   float64
dtypes: float64(3), int64(3), object(4)
memory usage: 86.1+ KB


After dropping these columns, 10 features remain. The info() output above shows that the "Internal name", "Region", and "Average daily rate" columns still contain missing values, so the next step is to count them per column.

In [154]:
april_data.isna().sum()

Listing ID                  0
Listing title               0
Internal name              23
Region                     16
Currency                    0
Bookings                    0
Booking value               0
Nights booked               0
Average daily rate        921
Average length of stay      0
dtype: int64

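Since the missing-value pattern matches the previous months, the next step is to drop the rows with a missing 'Internal name'. As the output below shows, this also removes every row with a missing 'Region'.

In [160]:
# .copy() makes the result an independent DataFrame for the column assignment that follows.
april_data_cleaned = april_data.dropna(subset=['Internal name']).copy()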
print("Shape of dataframe after dropping rows with missing 'Internal name':", april_data_cleaned.shape)
april_data_cleaned.info()

Shape of dataframe after dropping rows with missing 'Internal name': (1077, 10)
<class 'pandas.core.frame.DataFrame'>
Index: 1077 entries, 1 to 1099
Data columns (total 10 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   Listing ID              1077 non-null   int64  
 1   Listing title           1077 non-null   object 
 2   Internal name           1077 non-null   object 
 3   Region                  1077 non-null   object 
 4   Currency                1077 non-null   object 
 5   Bookings                1077 non-null   int64  
 6   Booking value           1077 non-null   float64
 7   Nights booked           1077 non-null   int64  
 8   Average daily rate      179 non-null    float64
 9   Average length of stay  1077 non-null   float64
dtypes: float64(3), int64(3), object(4)
memory usage: 92.6+ KB


The next step is to fill in missing values for the average daily rate using the mean value.

In [164]:
average_daily_rate_mean = april_data_cleaned['Average daily rate'].mean()
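# Assign the result back instead of using inplace=True on a column selection (chained assignment).
april_data_cleaned['Average daily rate'] = april_data_cleaned['Average daily rate'].fillna(average_daily_rate_mean)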
april_data_cleaned.info()

<class 'pandas.core.frame.DataFrame'>
Index: 1077 entries, 1 to 1099
Data columns (total 10 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   Listing ID              1077 non-null   int64  
 1   Listing title           1077 non-null   object 
 2   Internal name           1077 non-null   object 
 3   Region                  1077 non-null   object 
 4   Currency                1077 non-null   object 
 5   Bookings                1077 non-null   int64  
 6   Booking value           1077 non-null   float64
 7   Nights booked           1077 non-null   int64  
 8   Average daily rate      1077 non-null   float64
 9   Average length of stay  1077 non-null   float64
dtypes: float64(3), int64(3), object(4)
memory usage: 92.6+ KB



Finally, preview the cleaned April booking data. Rows without a recorded rate now show the imputed mean (110.214804).

In [168]:
april_data_cleaned.head()

Unnamed: 0,Listing ID,Listing title,Internal name,Region,Currency,Bookings,Booking value,Nights booked,Average daily rate,Average length of stay
1,44106402,Surf's Retreat: Minutes from Bingin Beach,Bingin Sun & Moon Villas - (Standard),Bali,SGD,23,5696.8,89,64.01,4.3
2,39637584,Sunny Uluwatu Cottages with Fast Wifi + Fresh ...,Uluwatu Kayana Bungalows - 2,Bali,SGD,0,0.0,0,110.214804,0.0
3,41945741,Digital Nomad Room by Bukit Vista | sterilized,Asri Village-2,Bali,SGD,0,0.0,0,110.214804,0.0
4,47416028,ATRA Bambulogy • Alluring Bamboo Villa for Group,"OFFBOARD Atra 5BR Vit2,Ar,At",Bali,SGD,0,0.0,0,110.214804,0.0
5,743045318174537800,Nusa Dua's Tranquil Sanctuary with Immense Garden,Green D'Mel Nusa Dua - Suite 303,Bali,SGD,0,0.0,0,110.214804,0.0
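Every month above goes through the same three steps: drop the unused columns, drop rows without an internal name, and fill the missing rates with the mean. A minimal sketch of wrapping them in one function follows. The name `clean_bookings`, the constant `UNUSED_METRICS`, and the toy input are hypothetical, and the " YoY" suffix rule is an assumption about the column names.

```python
import pandas as pd

# Unused metrics without a common suffix (hypothetical constant name).
UNUSED_METRICS = ["Average booking window", "View to contact rate", "Contact to book rate"]

def clean_bookings(raw: pd.DataFrame) -> pd.DataFrame:
    """Drop unused columns, drop listings without an internal name, mean-fill the rate."""
    yoy = [name for name in raw.columns if name.endswith(" YoY")]
    # errors="ignore" skips listed columns that are absent from a given report.
    cleaned = raw.drop(columns=yoy + UNUSED_METRICS, errors="ignore")
    cleaned = cleaned.dropna(subset=["Internal name"]).copy()
    rate = cleaned["Average daily rate"]
    cleaned["Average daily rate"] = rate.fillna(rate.mean())
    return cleaned

# Toy input standing in for one monthly report (made-up values).
toy_month = pd.DataFrame({
    "Listing ID": [1, 2, 3],
    "Internal name": [None, "Villa A", "Villa B"],
    "Bookings YoY": [None, "10%", None],
    "Average booking window": [None, 0.4, None],
    "Average daily rate": [None, 80.0, None],
})
result = clean_bookings(toy_month)
print(result)
```

On the real files this could be called once per month, for example `clean_bookings(pd.read_csv("book_data/bookings_2024_march.csv", skiprows=1))`, instead of repeating the cells.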


***Note: The preprocessing stage may need a thorough review next week, since the data may still contain unclean records or outliers.***
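One possible starting point for that review, sketched under assumptions, is to compare the mean rates that were used for imputation in each month. The values below are copied by hand from the cleaned previews above (the figure repeated in the rows without bookings), and the 1.5x threshold is an arbitrary illustration rather than a rule from the source.

```python
import pandas as pd

# Imputed means copied from the cleaned previews above.
monthly_mean_rate = pd.Series({
    "December 2023": 119.955677,
    "January 2024": 120.885943,
    "February 2024": 206.590546,
    "March 2024": 98.057443,
    "April 2024": 110.214804,
})

# Hypothetical rule of thumb: flag a month whose mean exceeds 1.5x the median of the monthly means.
threshold = 1.5 * monthly_mean_rate.median()
flagged = monthly_mean_rate[monthly_mean_rate > threshold].index.tolist()
print(flagged)
```

A flagged month does not prove that outliers exist; it only indicates where a closer look at the individual rates may be worthwhile.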