# Assignment 2: Instructions

In this assignment, you will take on a prediction competition for Airbnb bookings. Here, as opposed to predicting prices as we have been doing so far, you will use a variety of information provided to you to __predict the number of days a given listing will be booked in the next 30 days__. 

You have been provided with real listings data from Los Angeles, but you only have the actual realized bookings for a small subset of the listings, which you can use as your training data. (The column `availability_30` represents current bookings, but due to cancellations and future bookings, it's only a very noisy proxy to actual bookings). You can find the data dictionary [here](https://docs.google.com/spreadsheets/d/1iWCNJcSutYqpULSQHlNyGInUvHg2BoUGoNRIGa6Szc4/edit?gid=1322284596#gid=1322284596)

This assignment will be graded as a competition. We have a test set that only the grader has access to. In order to do well in this task, you will have to use everything you have learned in the class so far, including feature engineering and hyperparameter search via cross-validation.

## Write Up (8 pts)
    
You will need to turn in your code along with a short write-up, which you can include in your notebook. You will need to address the following components:

1. (3 pts) Explain how you constructed and / or preprocessed features to help with prediction, and why.


2. (3 pts) Explain what decisions you made using cross-validation, and how well you believe your final model will perform.
3. (2 pts) Explain what features you found to be important using the feature importance tools we discussed in class.

This write-up, along with your code, will be worth 8 points out of 15. These answers can be short: 1-3 sentences each + supporting tables or plots. 

## Performance (7 pts)
The remainder of your grade will be based on your predictive accuracy, as measured in terms of $R^2$. You will recieve one point for each percentage point of test $R^2$ you achieve over 15%, rounded down, up to a maximum of 7 -- So if your test $R^2$ is 21.9% you will recieve 6 points. To recieve full credit, you will need to achieve an $R^2$ of at least 22% on the test set.

In addition to this, there will be __5 points of extra credit__ available to each of the top 5 most accurate models across the whole in the class! You may use any method of your choice, even those that we have not covered, but be sure to explain it in your write-up. Also, to keep things well-scoped, __you may not pull in any datasets__ other than the one we are loading for you in the notebook (although this is a really good idea in practice!)

## Submission
You will need to submit two files: 

1.  Your predictions, in `.csv` format, which must have two columns: `id` and `prediction`
2.  Your code and write up, which should be provided together as an `.ipynb` notebook.

The provided notebook will get you started with loading data, and provide some checks to help you make sure your submission has the correct format. You can download `y_train.parquet` from canvas.

### Key Step for Preprocessing Features

| Step | Description                                                                                                           | Processed Columns                                                   |
|------|-----------------------------------------------------------------------------------------------------------------------|----------------------------------------------------------------------|
| 1    | Dropped irrelevant columns to remove unnecessary data that does not contribute to the analysis.                      | Irrelevant columns                                                  |
| 2    | Encoded the presence of `license` and `neighborhood` as 1 and absence as 0 to help the model distinguish whether the listing have liscense and neighborhood like or not. | `license`, `neighborhood`                                           |
| 3    | Grouped infrequent catergories in `property type` into "others" to prevent excessive sparsity in the one-hot encoding, I only keep those with frequency larger than 10. | `property type`                                                     |
| 4    | Split the `amenity` list into individual amenities and applied one-hot encoding to ensure better feature representation, here I also set a threshold to only encode those frequent amenities. | `amenity`                                                           |
| 5    | Converted all time-related data into timestamps to preserve temporal distance and help the model understand the time sequence. | `host_since`, `first_review`, `last_review`                         |
| 6    | Converted all boolean columns to 1 and 0.                                                                             | `host_is_superhost`, `host_has_profile_pic`, `host_identity_verified`, `has_availability`, `instant_bookable` |
| 7    | One-hot encoding to all categorical columns and scaled all numerical columns for consistency and improved model performance. | Normal categorical columns and numerical columns                    |



2. (3 pts) Explain what decisions you made using cross-validation, and how well you believe your final model will perform.
### Model Training process

| Step | Description                                                                                                 |
|------|-------------------------------------------------------------------------------------------------------------|
| 1    | Split the training set into training, testing, and validation subsets.                                      |
| 2    | Performed grid search cross-validation to identify the best hyperparameters for various models,  I tested Random Forest, Ridge, Lasso, ElasticNet, SVM, MLP, and CatBoost             |
| 3    | Compare the performane of multiple models under best hyperparameters.                                                        |
| 4    | Selected CatBoost as the best-performing model with an R² of 0.27 on the validation set for further tuning. The futher tuning model also have R² of 0.27, this is how well my model will predict |






### Model Performance

| Model           | Performance (R² on Validation Set) | Notes                             |
|------------------|------------------------------------|-----------------------------------|
| CatBoost        | 0.27                               | Best-performing model, chosen for further tuning. |
| Random Forest   | 0.26                      | Performed well but not as good as CatBoost. |
| Ridge           | 0.19                       | Tested but not optimal.           |
| Lasso           | 0.21                             | Tested but not optimal.           |
| ElasticNet      | 0.21                             | Tested but not optimal.           |
| SVM             | 0.23                                  | Tested but not optimal.           |
| MLP             | 0.25                                  | Tested but not optimal.           |

3. (2 pts) Explain what features you found to be important using the feature importance tools we discussed in class.

### Methods Used to Determine Feature Importance


| Method                  | Description                                                                                 |
|-------------------------|---------------------------------------------------------------------------------------------|
| Model's Feature Importance | Used the built-in `feature_importance` parameter from the model to evaluate feature relevance. |
| Permutation Importance   | Applied permutation importance to assess the effect of shuffling each feature on model performance. |


### Key Insights from Feature importance

| Feature                       | Insight                                                                                          |
|-------------------------------|--------------------------------------------------------------------------------------------------|
| `availability_60`, `availability_90`, `availability_30` | These availability metrics are among the most important indicators for predicting bookings.         |
| `last_review_timestamp`       | Represents the room's popularity, showing that listings with recent activity attract more bookings. |
| `number_of_reviews`,   `number_of_reviews_ltm`       | Another key indicator of a listing's popularity and trustworthiness.                          |
| Specific `amenities`          | Amenities like Self check-in, Bed linens, Refrigerator, Shower gel, Free dryer, TV, and Stove are significant in driving bookings, highlighting guest preferences for listings with these features. |
| `latitude` `longitude`          | Location is important for customers to choose a lsiting |


In [4]:
import pandas as pd
import numpy as np

def basic_preprocess(df):
    df["price"] = df["price"].str.replace("$", "").str.replace(",", "").astype(float)
    df = df.dropna(subset=["price"])
    return df

x_df = basic_preprocess(
    pd.read_csv(
        "https://data.insideairbnb.com/united-states/ca/los-angeles/2024-09-04/data/listings.csv.gz"
    )
)

# Grab this from canvas and save it in this directory
y_df = pd.read_parquet("y_train.parquet")

train_df = x_df.merge(y_df, on="id")

outer = x_df.merge(y_df, how='outer', indicator=True)
test_df = outer[(outer._merge=='left_only')].drop('_merge', axis=1)

In [5]:
y_df

Unnamed: 0,id,days_booked
0,38045411,0
1,976269998358038273,18
2,1224752567418481636,0
3,1027902058690487046,11
4,44537027,20
...,...,...
3724,37487138,13
3725,665354602920872350,27
3726,1029137238283418523,0
3727,32901700,14


In [6]:
test_df

Unnamed: 0,id,listing_url,scrape_id,last_scraped,source,name,description,neighborhood_overview,picture_url,host_id,...,review_scores_location,review_scores_value,license,instant_bookable,calculated_host_listings_count,calculated_host_listings_count_entire_homes,calculated_host_listings_count_private_rooms,calculated_host_listings_count_shared_rooms,reviews_per_month,days_booked
0,109,https://www.airbnb.com/rooms/109,20240904164210,2024-09-05,city scrape,Amazing bright elegant condo park front *UPGRA...,"*** Unit upgraded with new bamboo flooring, ne...",,https://a0.muscache.com/pictures/miso/Hosting-...,521,...,5.00,4.00,,f,1,1,0,0,0.01,
1,2708,https://www.airbnb.com/rooms/2708,20240904164210,2024-09-05,city scrape,Runyon Canyon | Beau Furn Mirror Mini-Suite Fi...,"Run Runyon Canyon, Our Gym & Sauna Open <br />...","Walk and run to Runyon Canyon, it is open!<br ...",https://a0.muscache.com/pictures/miso/Hosting-...,3008,...,4.95,4.86,,t,2,0,2,0,0.34,
2,2732,https://www.airbnb.com/rooms/2732,20240904164210,2024-09-05,city scrape,Zen Life at the Beach,An oasis of tranquility awaits you.,"This is the best part of Santa Monica. Quiet, ...",https://a0.muscache.com/pictures/1082974/0f74c...,3041,...,4.91,4.22,228269,f,2,1,1,0,0.15,
3,6931,https://www.airbnb.com/rooms/6931,20240904164210,2024-09-05,city scrape,"RUN Runyon, Beau Furn Rms w/ Stunning Terrace ...",Run Runyon Canyon and Views<br /><br />Gym & S...,We are in the middle of one of the great citie...,https://a0.muscache.com/pictures/miso/Hosting-...,3008,...,4.68,4.74,,t,2,0,2,0,0.18,
4,7992,https://www.airbnb.com/rooms/7992,20240904164210,2024-09-05,city scrape,Quiet/Cozy/Clean/Walkable Quaint Area,"Hello, Traveler. Please inquire about dates be...",Atwater Village has a variety of great shops a...,https://a0.muscache.com/pictures/d46d140a-58db...,22363,...,4.93,4.89,HSR19-003514,f,2,2,0,0,1.87,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
37291,1237730130488617440,https://www.airbnb.com/rooms/1237730130488617440,20240904164210,2024-09-05,city scrape,2 Queen Size beds 1BD -Prime Loc,The whole group will enjoy easy access to ever...,,https://a0.muscache.com/pictures/miso/Hosting-...,543161576,...,,,Exempt - This listing is a hotel or motel,t,9,8,1,0,,
37292,1237742134920717900,https://www.airbnb.com/rooms/1237742134920717900,20240904164210,2024-09-05,city scrape,Spacious & Modern 3BR/2BA Unit,ATTENTION THIS IS NOT BEVERLY HILLS. THIS IS W...,ATTENTION THIS IS NOT BEVERLY HILLS. THIS IS W...,https://a0.muscache.com/pictures/miso/Hosting-...,449460327,...,,,,f,77,77,0,0,,
37293,1237885576982715423,https://www.airbnb.com/rooms/1237885576982715423,20240904164210,2024-09-05,city scrape,Comfortable Pasadena Retreat with good Location,Comfortable Pasadena Retreat with good Location,,https://a0.muscache.com/pictures/prohost-api/H...,344390054,...,,,,f,3,3,0,0,,
37294,1238216498268704313,https://www.airbnb.com/rooms/1238216498268704313,20240904164210,2024-09-05,city scrape,"Blueground | Mid-Wilshire, pool & a/c, nr retail",Feel at home wherever you choose to live with ...,This furnished rental is located in Mid-Wilshi...,https://a0.muscache.com/pictures/prohost-api/H...,107434423,...,,,,t,569,569,0,0,,


In [7]:
train_df

Unnamed: 0,id,listing_url,scrape_id,last_scraped,source,name,description,neighborhood_overview,picture_url,host_id,...,review_scores_location,review_scores_value,license,instant_bookable,calculated_host_listings_count,calculated_host_listings_count_entire_homes,calculated_host_listings_count_private_rooms,calculated_host_listings_count_shared_rooms,reviews_per_month,days_booked
0,1158417037056953812,https://www.airbnb.com/rooms/1158417037056953812,20240904164210,2024-09-05,city scrape,Lovely High Rise in Ktown,You'll have a great time at this comfortable p...,,https://a0.muscache.com/pictures/miso/Hosting-...,485962662,...,,,,f,2,2,0,0,,0
1,53405110,https://www.airbnb.com/rooms/53405110,20240904164210,2024-09-05,city scrape,Downtown Loft,,,https://a0.muscache.com/pictures/f0b186c4-2a75...,416339920,...,,,,f,4,2,2,0,,0
2,1099697666994383724,https://www.airbnb.com/rooms/1099697666994383724,20240904164210,2024-09-05,city scrape,Chic Apartment in the Heart of Beverly Hills #8,Welcome to our beautifully appointed apartment...,,https://a0.muscache.com/pictures/miso/Hosting-...,367097591,...,4.83,4.67,,t,4,4,0,0,1.14,24
3,1039777404032438158,https://www.airbnb.com/rooms/1039777404032438158,20240904164210,2024-09-05,city scrape,"Blueground | NoHo, walk to restaurants & art","Discover the best of Los Angeles, with this th...",This furnished apartment is located in North H...,https://a0.muscache.com/pictures/prohost-api/H...,107434423,...,,,,f,569,569,0,0,,28
4,53633927,https://www.airbnb.com/rooms/53633927,20240904164210,2024-09-05,city scrape,Venice Flower Cottage: Ideal WFH + dry sauna,Welcome to the Flower Cottage Venice - your ve...,Venice is an absolute dreamland. From the gorg...,https://a0.muscache.com/pictures/miso/Hosting-...,265269171,...,4.86,4.86,,t,2,2,0,0,0.23,30
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3724,49468898,https://www.airbnb.com/rooms/49468898,20240904164210,2024-09-05,city scrape,Lovely + Newly Renovated Apt in the heart of LA,Newly renovated unit with all amenities one wo...,,https://a0.muscache.com/pictures/99003433-1782...,10087763,...,4.33,4.61,HSR21-001476,f,2,2,0,0,0.48,30
3725,1172781581966387312,https://www.airbnb.com/rooms/1172781581966387312,20240904164210,2024-09-05,city scrape,Stunning Skyview 2B/2B Suite in K-Town,Luxurious High-Rise 2B/2B Suite in Mid-Wilshir...,,https://a0.muscache.com/pictures/prohost-api/H...,72656625,...,4.50,3.00,Exempt,t,4,4,0,0,0.79,9
3726,1057046437127462500,https://www.airbnb.com/rooms/1057046437127462500,20240904164210,2024-09-06,city scrape,Colonial mansion fully remodeled,,,https://a0.muscache.com/pictures/miso/Hosting-...,39661267,...,,,,f,1,1,0,0,,1
3727,1076056030435406530,https://www.airbnb.com/rooms/1076056030435406530,20240904164210,2024-09-05,city scrape,Mini Glam Pad | Near LAX,Soak up the sun & live in luxury with modern a...,,https://a0.muscache.com/pictures/hosting/Hosti...,262298948,...,4.33,4.00,,t,5,5,0,0,0.83,30


## 1.View Data Overall Info

In [8]:
train_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3729 entries, 0 to 3728
Data columns (total 76 columns):
 #   Column                                        Non-Null Count  Dtype  
---  ------                                        --------------  -----  
 0   id                                            3729 non-null   int64  
 1   listing_url                                   3729 non-null   object 
 2   scrape_id                                     3729 non-null   int64  
 3   last_scraped                                  3729 non-null   object 
 4   source                                        3729 non-null   object 
 5   name                                          3729 non-null   object 
 6   description                                   3635 non-null   object 
 7   neighborhood_overview                         1839 non-null   object 
 8   picture_url                                   3729 non-null   object 
 9   host_id                                       3729 non-null   i

## 2.Drop and processing Columns with Limited Usefulness
1. These columns are URL links that are useless, which provide little analytical value and may not be relevant:
    - listing_url, picture_url, host_url, host_thumbnail_url, host_picture_url:
2. These are links with nature language, cause I am not familiar with nlp, I would drop them
    - host_about, neighborhood_overview, description,name,host_name:
3. These columns contains single value or not critical values
    - source,"scrape_id","calendar_last_scraped",calendar_updated,
    - bathroom text, the numbers of bathroom have already been extracted to another column, drop the original columns then.
4. Location，host_neighbourhood:
    As the location information can be telled from latitude and longtitude, and there are too many missing value and 239 unique value, if I do one hot encoding for it, it will make the data too spare
5. Drop neighbourhood_cleansed and do one-hot encoding with neighbourhood_group_cleansed
6. license: For license, encode it, if have license num will be marked as 1, without license will be marked as 0

In [9]:
dropping_list = [
    "listing_url", "picture_url", "host_url", "host_thumbnail_url", "host_picture_url",
    "host_about", "neighborhood_overview", "description", "name", "host_name",
    "host_location", "host_neighbourhood", "neighbourhood_cleansed", "source","scrape_id","calendar_last_scraped","calendar_updated",
    "bathrooms_text","last_scraped",'host_id'
]

# drop no use features
train_df = train_df.drop(columns=dropping_list, axis=1)
test_df=test_df.drop(columns=dropping_list, axis=1)

train_df

Unnamed: 0,id,host_since,host_response_time,host_response_rate,host_acceptance_rate,host_is_superhost,host_listings_count,host_total_listings_count,host_verifications,host_has_profile_pic,...,review_scores_location,review_scores_value,license,instant_bookable,calculated_host_listings_count,calculated_host_listings_count_entire_homes,calculated_host_listings_count_private_rooms,calculated_host_listings_count_shared_rooms,reviews_per_month,days_booked
0,1158417037056953812,2022-11-01,a few days or more,30%,17%,f,2.0,2.0,"['email', 'phone']",t,...,,,,f,2,2,0,0,,0
1,53405110,2021-08-01,,,,f,4.0,6.0,['phone'],t,...,,,,f,4,2,2,0,,0
2,1099697666994383724,2020-09-11,within an hour,100%,91%,f,7.0,77.0,"['email', 'phone']",t,...,4.83,4.67,,t,4,4,0,0,1.14,24
3,1039777404032438158,2016-12-16,within an hour,100%,96%,f,4494.0,4784.0,"['email', 'phone', 'work_email']",t,...,,,,f,569,569,0,0,,28
4,53633927,2019-05-30,within an hour,100%,98%,t,5.0,28.0,"['email', 'phone', 'work_email']",t,...,4.86,4.86,,t,2,2,0,0,0.23,30
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3724,49468898,2013-11-17,within a day,50%,,f,3.0,3.0,['phone'],t,...,4.33,4.61,HSR21-001476,f,2,2,0,0,0.48,30
3725,1172781581966387312,2016-05-18,within an hour,98%,100%,f,4.0,7.0,"['email', 'phone']",t,...,4.50,3.00,Exempt,t,4,4,0,0,0.79,9
3726,1057046437127462500,2015-07-27,within a day,100%,,f,1.0,3.0,"['email', 'phone', 'work_email']",t,...,,,,f,1,1,0,0,,1
3727,1076056030435406530,2019-05-16,within an hour,100%,100%,f,5.0,10.0,"['email', 'phone']",t,...,4.33,4.00,,t,5,5,0,0,0.83,30


## 3. Other Columns processing
- Step 1. For the `property_type`:Before performing one-hot encoding on this column, I noticed that it contains too many unique values. Directly applying one-hot encoding would result in a very sparse representation. Therefore, I performed some pre-processing by combining values with particularly low proportions into an "others" category.
- Step 2. For columns `license`, replace anyvalue with 1,and missing value with 0
        For columns `neighbourhood`, replace neighbourhood highlights with 1,and missing value with 0
- Step 3. For columns `first_review` and `last_review`, which record info related to review, there are many missing values. Based on my observation, the missing values are due to the lack of recent reviews. Therefore, I will fill these missing values with 0.


In [10]:
#STEP 1
value_distribution = train_df['property_type'].value_counts()
value_distribution

property_type
Entire home                           1069
Entire rental unit                     969
Private room in home                   443
Entire guesthouse                      246
Entire condo                           160
Private room in rental unit            149
Entire guest suite                     121
Entire villa                            67
Entire townhouse                        65
Entire bungalow                         45
Room in hotel                           43
Entire loft                             39
Entire serviced apartment               35
Private room in townhouse               29
Private room in condo                   29
Private room in villa                   21
Shared room in home                     21
Entire cottage                          15
Private room in bed and breakfast       15
Private room in guesthouse              13
Room in boutique hotel                  13
Tiny home                               12
Shared room in rental unit              

In [11]:
# Set a threshold （keep values with frequency > 9)
threshold = 9
to_keep = value_distribution[value_distribution > threshold].index  # Keep values that exceed the threshold

# Replace low-frequency values with "others"
train_df['property_type_cleaned'] = train_df['property_type'].apply(lambda x: x if x in to_keep else 'others')

# Display the frequency distribution of the cleaned feature
print(train_df['property_type_cleaned'].value_counts())

train_df=train_df.drop(columns='property_type',axis=1)

property_type_cleaned
Entire home                          1069
Entire rental unit                    969
Private room in home                  443
Entire guesthouse                     246
Entire condo                          160
Private room in rental unit           149
Entire guest suite                    121
Entire villa                           67
others                                 66
Entire townhouse                       65
Entire bungalow                        45
Room in hotel                          43
Entire loft                            39
Entire serviced apartment              35
Private room in townhouse              29
Private room in condo                  29
Shared room in home                    21
Private room in villa                  21
Private room in bed and breakfast      15
Entire cottage                         15
Room in boutique hotel                 13
Private room in guesthouse             13
Shared room in rental unit             12
Tiny home   

In [12]:
# Get the list of unique values in the training set
train_unique_values = train_df['property_type_cleaned'].unique()

# Update the test set: keep only values present in the training set, 
# and replace all other values with "others"
test_df['property_type_cleaned'] = test_df['property_type'].apply(lambda x: x if x in train_unique_values else 'others')
test_df=test_df.drop(columns='property_type',axis=1)


In [13]:
test_df_distrubution = test_df['property_type_cleaned'].value_counts()
test_df_distrubution

property_type_cleaned
Entire home                          9926
Entire rental unit                   8570
Private room in home                 4155
Entire guesthouse                    2237
Entire condo                         1289
Private room in rental unit          1201
Entire guest suite                   1020
Entire villa                          577
others                                574
Entire townhouse                      560
Entire bungalow                       384
Entire serviced apartment             370
Room in hotel                         357
Private room in condo                 318
Private room in townhouse             297
Entire loft                           228
Shared room in home                   227
Private room in villa                 175
Private room in guest suite           153
Shared room in rental unit            151
Tiny home                             142
Entire cottage                        133
Private room in bed and breakfast     124
Room in bout

In [14]:
#Step 2
train_df["license"] = train_df["license"].notna().astype(int)
test_df["license"] = test_df["license"].notna().astype(int)

train_df["neighbourhood"] = train_df["neighbourhood"].notna().astype(int)
test_df["neighbourhood"] = test_df["neighbourhood"].notna().astype(int)

In [15]:
test_df_copy=test_df

## 3.Identify categorical features
Evaluate the distribution of categorical features using describe function.

In [16]:
train_df_cate_features = train_df.select_dtypes(include=['object']).columns
train_df_cate_features



Index(['host_since', 'host_response_time', 'host_response_rate',
       'host_acceptance_rate', 'host_is_superhost', 'host_verifications',
       'host_has_profile_pic', 'host_identity_verified',
       'neighbourhood_group_cleansed', 'room_type', 'amenities',
       'has_availability', 'first_review', 'last_review', 'instant_bookable',
       'property_type_cleaned'],
      dtype='object')

In [17]:
train_df

Unnamed: 0,id,host_since,host_response_time,host_response_rate,host_acceptance_rate,host_is_superhost,host_listings_count,host_total_listings_count,host_verifications,host_has_profile_pic,...,review_scores_value,license,instant_bookable,calculated_host_listings_count,calculated_host_listings_count_entire_homes,calculated_host_listings_count_private_rooms,calculated_host_listings_count_shared_rooms,reviews_per_month,days_booked,property_type_cleaned
0,1158417037056953812,2022-11-01,a few days or more,30%,17%,f,2.0,2.0,"['email', 'phone']",t,...,,0,f,2,2,0,0,,0,Entire rental unit
1,53405110,2021-08-01,,,,f,4.0,6.0,['phone'],t,...,,0,f,4,2,2,0,,0,Private room in bed and breakfast
2,1099697666994383724,2020-09-11,within an hour,100%,91%,f,7.0,77.0,"['email', 'phone']",t,...,4.67,0,t,4,4,0,0,1.14,24,Entire rental unit
3,1039777404032438158,2016-12-16,within an hour,100%,96%,f,4494.0,4784.0,"['email', 'phone', 'work_email']",t,...,,0,f,569,569,0,0,,28,Entire townhouse
4,53633927,2019-05-30,within an hour,100%,98%,t,5.0,28.0,"['email', 'phone', 'work_email']",t,...,4.86,0,t,2,2,0,0,0.23,30,Entire cottage
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3724,49468898,2013-11-17,within a day,50%,,f,3.0,3.0,['phone'],t,...,4.61,1,f,2,2,0,0,0.48,30,Entire rental unit
3725,1172781581966387312,2016-05-18,within an hour,98%,100%,f,4.0,7.0,"['email', 'phone']",t,...,3.00,1,t,4,4,0,0,0.79,9,Entire rental unit
3726,1057046437127462500,2015-07-27,within a day,100%,,f,1.0,3.0,"['email', 'phone', 'work_email']",t,...,,0,f,1,1,0,0,,1,Entire home
3727,1076056030435406530,2019-05-16,within an hour,100%,100%,f,5.0,10.0,"['email', 'phone']",t,...,4.00,0,t,5,5,0,0,0.83,30,Entire rental unit


### 3.1 One-hot Encoding for amenities
Because each row of amenities have a list that show all the amentites the listing have, so we need to do a special one-hot encoding for amenities, we need to take the values out of the list as columns

In [18]:
from sklearn.preprocessing import MultiLabelBinarizer
from collections import Counter

# Step 1: Convert the string-format lists into actual Python lists
import ast
train_df["amenities"] = train_df["amenities"].apply(ast.literal_eval)

# Step 2: Flatten the list of amenities to count occurrences
amenities_counts = Counter([amenity for amenities in train_df["amenities"] for amenity in amenities])

# Step 3: Filter amenities that appear more than 10 times
threshold = 50
frequent_amenities = {amenity for amenity, count in amenities_counts.items() if count > threshold}

# Step 4: Filter out low-frequency amenities in the original DataFrame
train_df["filtered_amenities"] = train_df["amenities"].apply(
    lambda amenities: [amenity for amenity in amenities if amenity in frequent_amenities]
)

# Step 5: Use MultiLabelBinarizer for encoding the filtered amenities
mlb = MultiLabelBinarizer()
amenities_encoded = mlb.fit_transform(train_df["filtered_amenities"])

# Step 6: Create a DataFrame for the encoded data
encoded_amenities_df = pd.DataFrame(amenities_encoded, columns=mlb.classes_)

# Step 7: Merge the encoded data back into the original DataFrame
train_df = pd.concat([train_df.drop(columns=["amenities", "filtered_amenities"]), encoded_amenities_df], axis=1)

# Output the processed DataFrame
print(train_df)

                       id  host_since  host_response_time host_response_rate  \
0     1158417037056953812  2022-11-01  a few days or more                30%   
1                53405110  2021-08-01                 NaN                NaN   
2     1099697666994383724  2020-09-11      within an hour               100%   
3     1039777404032438158  2016-12-16      within an hour               100%   
4                53633927  2019-05-30      within an hour               100%   
...                   ...         ...                 ...                ...   
3724             49468898  2013-11-17        within a day                50%   
3725  1172781581966387312  2016-05-18      within an hour                98%   
3726  1057046437127462500  2015-07-27        within a day               100%   
3727  1076056030435406530  2019-05-16      within an hour               100%   
3728  1224871364424033713  2014-11-04      within an hour               100%   

     host_acceptance_rate host_is_super

In [19]:
# processing test set using the same method

# Step A: Convert string-format lists to Python lists in test set
test_df["amenities"] = test_df["amenities"].apply(ast.literal_eval)

# Step B: only keep the same colums with train set
test_df["filtered_amenities"] = test_df["amenities"].apply(
    lambda amenities: [amenity for amenity in amenities if amenity in frequent_amenities]
)

# Step C: Use mlb to transform the test_df
test_amenities_encoded = mlb.transform(test_df["filtered_amenities"])

# Step D: change to DataFrame
test_encoded_amenities_df = pd.DataFrame(test_amenities_encoded, columns=mlb.classes_)
test_df = pd.concat([test_df.drop(columns=["amenities", "filtered_amenities"]),
                     test_encoded_amenities_df], axis=1)


print(test_df)



           id  host_since host_response_time host_response_rate  \
0       109.0  2008-06-27                NaN                NaN   
1      2708.0  2008-09-16     within an hour               100%   
2      2732.0  2008-09-17     within an hour               100%   
3      6931.0  2008-09-16     within an hour               100%   
4      7992.0  2009-06-19     within an hour               100%   
...       ...         ...                ...                ...   
33502     NaN         NaN                NaN                NaN   
33531     NaN         NaN                NaN                NaN   
33536     NaN         NaN                NaN                NaN   
33553     NaN         NaN                NaN                NaN   
33559     NaN         NaN                NaN                NaN   

      host_acceptance_rate host_is_superhost  host_listings_count  \
0                      NaN                 f                  1.0   
1                     100%                 t             

In [20]:
test_df = test_df[test_df['id'].isin(test_df_copy['id'])]
test_df = test_df.set_index('id').reindex(test_df_copy.set_index('id').index).reset_index()

print(test_df)

                        id  host_since host_response_time host_response_rate  \
0                      109  2008-06-27                NaN                NaN   
1                     2708  2008-09-16     within an hour               100%   
2                     2732  2008-09-17     within an hour               100%   
3                     6931  2008-09-16     within an hour               100%   
4                     7992  2009-06-19     within an hour               100%   
...                    ...         ...                ...                ...   
33562  1237730130488617440  2023-10-23     within an hour               100%   
33563  1237742134920717900  2022-03-13     within an hour               100%   
33564  1237885576982715423  2020-04-21     within an hour               100%   
33565  1238216498268704313  2016-12-16     within an hour               100%   
33566  1238216815913275595  2016-12-16     within an hour               100%   

      host_acceptance_rate host_is_supe

## 4.Transforming data

### 4.1 Transform date
Use the to_datetime() function (pandas.to_datetime()) of the pandas library to convert the Date attribute to datetime objects. 

And fill in the Nan date with default value, although the Nan date means that during that period, there is no review, we still need to fill the missing value with a default date to make sure my model works. 

I would use a impossible date as default date

In [21]:

# Fill NaT values with the default Unix 0 time
default_date = pd.to_datetime("1970-01-01")


train_df["host_since"] = pd.to_datetime(train_df["host_since"], infer_datetime_format=True).fillna(default_date)
train_df["host_since_timestamp"] = train_df["host_since"].view('int64') // 1_000_000_000

train_df["first_review"] = pd.to_datetime(train_df["first_review"], infer_datetime_format=True).fillna(default_date)
train_df["first_review_timestamp"] = train_df["first_review"].view('int64') // 1_000_000_000

train_df["last_review"] = pd.to_datetime(train_df["last_review"], infer_datetime_format=True).fillna(default_date)
train_df["last_review_timestamp"] = train_df["last_review"].view('int64') // 1_000_000_000

train_df["last_review_timestamp"]

  train_df["host_since"] = pd.to_datetime(train_df["host_since"], infer_datetime_format=True).fillna(default_date)
  train_df["host_since_timestamp"] = train_df["host_since"].view('int64') // 1_000_000_000
  train_df["first_review"] = pd.to_datetime(train_df["first_review"], infer_datetime_format=True).fillna(default_date)
  train_df["first_review_timestamp"] = train_df["first_review"].view('int64') // 1_000_000_000
  train_df["last_review"] = pd.to_datetime(train_df["last_review"], infer_datetime_format=True).fillna(default_date)
  train_df["last_review_timestamp"] = train_df["last_review"].view('int64') // 1_000_000_000


0                0
1                0
2       1720915200
3                0
4       1720569600
           ...    
3724    1647216000
3725    1720310400
3726             0
3727    1722384000
3728             0
Name: last_review_timestamp, Length: 3729, dtype: int64

In [22]:
# Same date processing for test set
# Fill NaT values with the default Unix 0 time
default_date = pd.to_datetime("1970-01-01")


test_df["host_since"] = pd.to_datetime(test_df["host_since"], infer_datetime_format=True).fillna(default_date)
test_df["host_since_timestamp"] = test_df["host_since"].view('int64') // 1_000_000_000

test_df["first_review"] = pd.to_datetime(test_df["first_review"], infer_datetime_format=True).fillna(default_date)
test_df["first_review_timestamp"] = test_df["first_review"].view('int64') // 1_000_000_000

test_df["last_review"] = pd.to_datetime(test_df["last_review"], infer_datetime_format=True).fillna(default_date)
test_df["last_review_timestamp"] = test_df["last_review"].view('int64') // 1_000_000_000

test_df["last_review_timestamp"]

  test_df["host_since"] = pd.to_datetime(test_df["host_since"], infer_datetime_format=True).fillna(default_date)
  test_df["host_since_timestamp"] = test_df["host_since"].view('int64') // 1_000_000_000
  test_df["first_review"] = pd.to_datetime(test_df["first_review"], infer_datetime_format=True).fillna(default_date)
  test_df["first_review_timestamp"] = test_df["first_review"].view('int64') // 1_000_000_000
  test_df["last_review"] = pd.to_datetime(test_df["last_review"], infer_datetime_format=True).fillna(default_date)
  test_df["last_review_timestamp"] = test_df["last_review"].view('int64') // 1_000_000_000


0        1463270400
1        1722643200
2        1661040000
3        1723507200
4        1724457600
            ...    
33562             0
33563             0
33564             0
33565             0
33566             0
Name: last_review_timestamp, Length: 33567, dtype: int64

### 4.2 Transform boolean columns
For columns like `host_is_superhost`,`host_has_profile_pic`,`host_identity_verified`,`has_availability`,`instant_bookable`,
I will change its t and f to 1 and 0 instead

In [23]:
# For train set
boolean_columns=['host_is_superhost','host_has_profile_pic','host_identity_verified','has_availability','instant_bookable']
for column in boolean_columns:
    train_df[column] = train_df[column].apply(lambda x: 1 if x == 't' else 0)


# Same thing for test set
boolean_columns=['host_is_superhost','host_has_profile_pic','host_identity_verified','has_availability','instant_bookable']
for column in boolean_columns:
    test_df[column] = test_df[column].apply(lambda x: 1 if x == 't' else 0)



In [24]:
categorical_cols = train_df.select_dtypes(include=["object", "category"]).columns
categorical_cols

Index(['host_response_time', 'host_response_rate', 'host_acceptance_rate',
       'host_verifications', 'neighbourhood_group_cleansed', 'room_type',
       'property_type_cleaned'],
      dtype='object')

In [25]:
numerical_cols = train_df.select_dtypes(include=["number"]).columns
numerical_cols

Index(['id', 'host_is_superhost', 'host_listings_count',
       'host_total_listings_count', 'host_has_profile_pic',
       'host_identity_verified', 'neighbourhood', 'latitude', 'longitude',
       'accommodates',
       ...
       'Valley view', 'Washer', 'Waterfront', 'Wifi', 'Window AC unit',
       'Window guards', 'Wine glasses', 'host_since_timestamp',
       'first_review_timestamp', 'last_review_timestamp'],
      dtype='object', length=227)

### 4.3 Transform other catergorical and numerical columns
I would use `StandardScaler` and `OneHotEncoder` to process other normal columns

In [26]:
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.pipeline import Pipeline

def create_column_transformer(data):
    numeric_features = data.select_dtypes(include=["int64", "float64"]).columns
    binary_features = numeric_features[data[numeric_features].nunique() == 2]  
    numeric_features = numeric_features.difference(binary_features)  
    categorical_features = data.select_dtypes(include=["object", "category"]).columns
    
    numeric_transformer = Pipeline(steps=[
        ('imputer', SimpleImputer(strategy='constant', fill_value=0)),
        ('scaler', StandardScaler())
    ])
    
    
    binary_transformer = Pipeline(steps=[
        ('imputer', SimpleImputer(strategy='constant', fill_value=0))
    ])
    
    categorical_transformer = Pipeline(steps=[
        ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
        ('onehot', OneHotEncoder(handle_unknown='ignore'))
    ])
    
    transformer = ColumnTransformer(
        transformers=[
            ('num', numeric_transformer, numeric_features),
            ('binary', binary_transformer, binary_features),
            ('cat', categorical_transformer, categorical_features)
        ]
    )
    return transformer



In [27]:
train_y = train_df["days_booked"] 
train_x = train_df.drop(columns=["days_booked"]) 
transformer = create_column_transformer(train_x)  
X_train= transformer.fit_transform(train_x)


In [28]:
X_train

array([[-0.73949657,  1.08996865,  0.33521151, ...,  0.        ,
         0.        ,  0.        ],
       [-0.73949657,  1.08996865,  1.16962475, ...,  0.        ,
         0.        ,  0.        ],
       [-0.73949657, -0.22770401,  0.24737854, ...,  0.        ,
         0.        ,  0.        ],
       ...,
       [ 1.31968155, -1.54537668, -2.02749545, ...,  0.        ,
         0.        ,  0.        ],
       [-0.73949657,  0.3872099 ,  0.27372843, ...,  0.        ,
         0.        ,  0.        ],
       [-0.73949657, -0.05201432, -0.56068481, ...,  0.        ,
         0.        ,  0.        ]])

## 5. Split the train set in to three subset(train, validation,test sets)

In [29]:
from sklearn.model_selection import train_test_split

X_train_val, X_test, y_train_val, y_test = train_test_split(
    X_train, train_y, test_size=0.2, random_state=125
)

# Step 3: Further split train+validation into train and validation sets (80% train, 20% validation)
X_train, X_val, y_train, y_val = train_test_split(
    X_train_val, y_train_val, test_size=0.2, random_state=125
)

## 6. Model Training

### 6.1 RandomForest

In [30]:

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import make_scorer, r2_score

# Initialize the Random Forest Regressor model
rf = RandomForestRegressor(random_state=125)

# Define the hyperparameter grid
param_grid = {
    'n_estimators': [50, 100, 200],  # Number of trees in the forest
    'max_depth': [None, 10, 20, 30],  # Maximum depth of the trees
    'min_samples_split': [2, 5, 10],  # Minimum number of samples required to split an internal node
    'min_samples_leaf': [1, 2, 4]  # Minimum number of samples required to be at a leaf node
}

# Define R² as the scoring metric
scorer = make_scorer(r2_score)

# Perform Grid Search with Cross-Validation
grid_search = GridSearchCV(
    estimator=rf,  # Random Forest model
    param_grid=param_grid,  # Hyperparameter grid
    scoring=scorer,  # Use R² as the evaluation metric
    cv=5,  # Perform 5-fold cross-validation
    n_jobs=-1,  # Use all available CPU cores for parallelization
    verbose=2  # Print detailed output during the grid search process
)

# Fit the model to the training data
grid_search.fit(X_train, y_train)

# Output the best hyperparameters and the best R² score from cross-validation
print("Best Parameters:", grid_search.best_params_)
print("Best R² Score:", grid_search.best_score_)

# Use the best model to predict on the test set and evaluate the R² score
best_rf = grid_search.best_estimator_
test_r2 = r2_score(y_test, best_rf.predict(X_test))
print("Test R² Score:", test_r2)


Fitting 5 folds for each of 108 candidates, totalling 540 fits
[CV] END max_depth=None, min_samples_leaf=1, min_samples_split=2, n_estimators=50; total time=   2.7s
[CV] END max_depth=None, min_samples_leaf=1, min_samples_split=2, n_estimators=50; total time=   2.6s
[CV] END max_depth=None, min_samples_leaf=1, min_samples_split=2, n_estimators=50; total time=   2.7s
[CV] END max_depth=None, min_samples_leaf=1, min_samples_split=2, n_estimators=50; total time=   2.7s
[CV] END max_depth=None, min_samples_leaf=1, min_samples_split=2, n_estimators=50; total time=   2.6s
[CV] END max_depth=None, min_samples_leaf=1, min_samples_split=2, n_estimators=100; total time=   6.2s
[CV] END max_depth=None, min_samples_leaf=1, min_samples_split=2, n_estimators=100; total time=   6.7s
[CV] END max_depth=None, min_samples_leaf=1, min_samples_split=2, n_estimators=100; total time=   6.6s
[CV] END max_depth=None, min_samples_leaf=1, min_samples_split=2, n_estimators=100; total time=   6.3s
[CV] END max_de

### 6.2 Ridge, Lasso, ElasticNet

In [31]:
from sklearn.linear_model import Ridge, Lasso, ElasticNet
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import make_scorer, r2_score
import numpy as np

# Define the R^2 scorer
r2_scorer = make_scorer(r2_score)

# Define parameter grids for each model
ridge_params = {"alpha": np.logspace(-4, 2, 10)}
lasso_params = {"alpha": np.logspace(-4, 2, 10)}
elasticnet_params = {
    "alpha": np.logspace(-4, 2, 10),
    "l1_ratio": [0.1, 0.3, 0.5, 0.7, 0.9]  # Mixing parameter for ElasticNet
}

# Initialize models
ridge = Ridge(random_state=125)
lasso = Lasso(random_state=125, max_iter=10000)
elasticnet = ElasticNet(random_state=125, max_iter=10000)

# Perform GridSearchCV for each model
ridge_search = GridSearchCV(ridge, ridge_params, cv=5, scoring=r2_scorer, n_jobs=-1)
lasso_search = GridSearchCV(lasso, lasso_params, cv=5, scoring=r2_scorer, n_jobs=-1)
elasticnet_search = GridSearchCV(elasticnet, elasticnet_params, cv=5, scoring=r2_scorer, n_jobs=-1)

# Fit models
print("Tuning Ridge Regression...")
ridge_search.fit(X_train, y_train)
print("Tuning Lasso Regression...")
lasso_search.fit(X_train, y_train)
print("Tuning ElasticNet Regression...")
elasticnet_search.fit(X_train, y_train)

# Extract the best models and scores
ridge_best_model = ridge_search.best_estimator_
ridge_best_score = ridge_search.best_score_

lasso_best_model = lasso_search.best_estimator_
lasso_best_score = lasso_search.best_score_

elasticnet_best_model = elasticnet_search.best_estimator_
elasticnet_best_score = elasticnet_search.best_score_

# Print the results
print("\nBest Ridge Model:", ridge_best_model)
print("Best Ridge R^2 Score:", ridge_best_score)

print("\nBest Lasso Model:", lasso_best_model)
print("Best Lasso R^2 Score:", lasso_best_score)

print("\nBest ElasticNet Model:", elasticnet_best_model)
print("Best ElasticNet R^2 Score:", elasticnet_best_score)


Tuning Ridge Regression...
Tuning Lasso Regression...


  model = cd_fast.enet_coordinate_descent(
  model = cd_fast.enet_coordinate_descent(
  model = cd_fast.enet_coordinate_descent(


Tuning ElasticNet Regression...


  model = cd_fast.enet_coordinate_descent(
  model = cd_fast.enet_coordinate_descent(
  model = cd_fast.enet_coordinate_descent(
  model = cd_fast.enet_coordinate_descent(
  model = cd_fast.enet_coordinate_descent(
  model = cd_fast.enet_coordinate_descent(
  model = cd_fast.enet_coordinate_descent(
  model = cd_fast.enet_coordinate_descent(
  model = cd_fast.enet_coordinate_descent(
  model = cd_fast.enet_coordinate_descent(
  model = cd_fast.enet_coordinate_descent(
  model = cd_fast.enet_coordinate_descent(
  model = cd_fast.enet_coordinate_descent(
  model = cd_fast.enet_coordinate_descent(
  model = cd_fast.enet_coordinate_descent(
  model = cd_fast.enet_coordinate_descent(



Best Ridge Model: Ridge(alpha=100.0, random_state=125)
Best Ridge R^2 Score: 0.19213394105268827

Best Lasso Model: Lasso(alpha=0.21544346900318823, max_iter=10000, random_state=125)
Best Lasso R^2 Score: 0.21221147524202694

Best ElasticNet Model: ElasticNet(alpha=0.21544346900318823, max_iter=10000, random_state=125)
Best ElasticNet R^2 Score: 0.21530479832463384


### 6.3 SVM

In [32]:
from sklearn.svm import SVR
from sklearn.model_selection import RandomizedSearchCV
from sklearn.metrics import make_scorer, r2_score
import numpy as np

# Define the SVM model
svm = SVR()

# Define the parameter distribution
param_distributions = {
    'C': np.logspace(-2, 2, 10),  # Regularization parameter: 0.01 to 100 (log scale)
    'kernel': ['linear', 'rbf', 'poly'],  # Kernel types
    'gamma': ['scale', 'auto', 0.01, 0.1, 1],  # Kernel coefficient
    'degree': [2, 3, 4]  # Degree of polynomial kernel (only for 'poly')
}

# Define the scoring metric as r2
scorer = make_scorer(r2_score)

# Initialize RandomizedSearchCV
random_search = RandomizedSearchCV(
    estimator=svm,
    param_distributions=param_distributions,
    n_iter=20,  # Number of random combinations to try
    scoring=scorer,
    cv=5,  # 5-fold cross-validation
    verbose=2,
    n_jobs=-1,  # Use all available cores
    random_state=125  # For reproducibility
)

# Fit the random search to the data
random_search.fit(X_train, y_train)

# Best parameters and best score
print("Best parameters:", random_search.best_params_)
print("Best R2 score on validation data:", random_search.best_score_)

# Evaluate the best model on the test set
best_model = random_search.best_estimator_
test_r2_score = r2_score(y_test, best_model.predict(X_test))
print("R2 score on test data:", test_r2_score)


Fitting 5 folds for each of 20 candidates, totalling 100 fits
[CV] END C=0.027825594022071243, degree=4, gamma=auto, kernel=rbf; total time=   2.0s
[CV] END C=35.93813663804626, degree=4, gamma=auto, kernel=poly; total time=   1.9s
[CV] END C=0.027825594022071243, degree=4, gamma=auto, kernel=rbf; total time=   2.1s
[CV] END C=35.93813663804626, degree=4, gamma=auto, kernel=poly; total time=   2.0s
[CV] END C=35.93813663804626, degree=4, gamma=auto, kernel=poly; total time=   2.1s
[CV] END C=0.027825594022071243, degree=4, gamma=auto, kernel=rbf; total time=   2.1s
[CV] END C=0.027825594022071243, degree=4, gamma=auto, kernel=rbf; total time=   2.1s
[CV] END C=0.027825594022071243, degree=4, gamma=auto, kernel=rbf; total time=   2.2s
[CV] END C=35.93813663804626, degree=4, gamma=auto, kernel=poly; total time=   1.8s
[CV] END C=35.93813663804626, degree=4, gamma=auto, kernel=poly; total time=   1.8s
[CV] END C=0.5994842503189409, degree=2, gamma=0.1, kernel=rbf; total time=   1.9s
[CV] 

### 6.3 MLPRegressor

In [33]:
from sklearn.neural_network import MLPRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import r2_score

# Define the MLP model
mlp = MLPRegressor(max_iter=1000, random_state=125, early_stopping=True, validation_fraction=0.2)

# Set up the hyperparameter grid for a two-layer MLP
param_grid = {
    "hidden_layer_sizes": [
        (40, 20),  # Smaller two-layer network
        (50, 30), # Medium two-layer network
        (60, 35), # Larger two-layer network
    ],
    "activation": ["relu", "tanh"],  # Activation functions
    "alpha": [0.0001, 0.001, 0.01],  # L2 regularization strength
    "learning_rate": ["constant", "adaptive"],  # Learning rate strategies
    "solver": ["adam"],  # Use Adam optimizer
}

# GridSearchCV setup
grid_search = GridSearchCV(
    estimator=mlp,
    param_grid=param_grid,
    scoring="r2",  # Optimize for R² score
    cv=5,  # 5-fold cross-validation
    n_jobs=-1,  # Use all available processors
    verbose=2  # Print progress logs
)

# Fit the grid search on the training data
grid_search.fit(X_train, y_train)

# Output the best parameters and the corresponding R² score
print("Best parameters:", grid_search.best_params_)
print("Best R² score on training data:", grid_search.best_score_)

# Evaluate the best model on the validation set
best_mlp = grid_search.best_estimator_
y_val_pred = best_mlp.predict(X_val)
r2_val = r2_score(y_val, y_val_pred)
print("R² score on validation data:", r2_val)


Fitting 5 folds for each of 36 candidates, totalling 180 fits
[CV] END activation=relu, alpha=0.0001, hidden_layer_sizes=(40, 20), learning_rate=constant, solver=adam; total time=   0.6s
[CV] END activation=relu, alpha=0.0001, hidden_layer_sizes=(40, 20), learning_rate=adaptive, solver=adam; total time=   0.6s
[CV] END activation=relu, alpha=0.0001, hidden_layer_sizes=(40, 20), learning_rate=adaptive, solver=adam; total time=   0.7s
[CV] END activation=relu, alpha=0.0001, hidden_layer_sizes=(40, 20), learning_rate=constant, solver=adam; total time=   0.7s
[CV] END activation=relu, alpha=0.0001, hidden_layer_sizes=(40, 20), learning_rate=constant, solver=adam; total time=   0.8s
[CV] END activation=relu, alpha=0.0001, hidden_layer_sizes=(40, 20), learning_rate=adaptive, solver=adam; total time=   0.8s
[CV] END activation=relu, alpha=0.0001, hidden_layer_sizes=(40, 20), learning_rate=constant, solver=adam; total time=   0.8s
[CV] END activation=relu, alpha=0.0001, hidden_layer_sizes=(40,

In [34]:
from catboost import CatBoostRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import r2_score

# Define the CatBoost model
catboost = CatBoostRegressor(verbose=0, random_state=125)

# Define the reduced hyperparameter grid
param_grid = {
    "iterations": [100, 200, 300],   # Reduced number of boosting iterations
    "depth": [4, 6, 8],             # Depth of the tree
    "learning_rate": [0.05, 0.1, 0.2],  # Learning rate
    "l2_leaf_reg": [1, 3],          # Reduced L2 regularization values
    "bagging_temperature": [0.0, 1.0],  # Sampling temperature
}

# Set up GridSearchCV
grid_search = GridSearchCV(
    estimator=catboost,
    param_grid=param_grid,
    scoring="r2",  # Optimize for R² score
    cv=5,          # 5-fold cross-validation
    n_jobs=-1,      # Use all processors
    verbose=2       # Print logs for progress
)

# Fit the grid search on the training data
grid_search.fit(X_train, y_train)

# Output the best parameters and the corresponding R² score
print("Best parameters for CatBoost:", grid_search.best_params_)
print("Best R² score on training data:", grid_search.best_score_)

# Evaluate the best model on the validation set
best_catboost = grid_search.best_estimator_
y_val_pred = best_catboost.predict(X_val)
r2_val = r2_score(y_val, y_val_pred)
print("R² score of CatBoost on validation data:", r2_val)


Fitting 5 folds for each of 108 candidates, totalling 540 fits
[CV] END bagging_temperature=0.0, depth=4, iterations=100, l2_leaf_reg=1, learning_rate=0.05; total time=   0.9s
[CV] END bagging_temperature=0.0, depth=4, iterations=100, l2_leaf_reg=1, learning_rate=0.05; total time=   0.9s
[CV] END bagging_temperature=0.0, depth=4, iterations=100, l2_leaf_reg=1, learning_rate=0.05; total time=   0.9s
[CV] END bagging_temperature=0.0, depth=4, iterations=100, l2_leaf_reg=1, learning_rate=0.05; total time=   0.9s
[CV] END bagging_temperature=0.0, depth=4, iterations=100, l2_leaf_reg=1, learning_rate=0.05; total time=   0.9s
[CV] END bagging_temperature=0.0, depth=4, iterations=100, l2_leaf_reg=1, learning_rate=0.1; total time=   1.0s
[CV] END bagging_temperature=0.0, depth=4, iterations=100, l2_leaf_reg=1, learning_rate=0.1; total time=   1.0s
[CV] END bagging_temperature=0.0, depth=4, iterations=100, l2_leaf_reg=1, learning_rate=0.1; total time=   1.0s
[CV] END bagging_temperature=0.0, de

In [35]:
from catboost import CatBoostRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import r2_score

# Define the CatBoost model
catboost = CatBoostRegressor(verbose=0, random_state=125)

# Define a finer hyperparameter grid
param_grid = {
    "iterations": [80, 100, 150],  # Fine-tune around 100
    "depth": [7, 8, 9],            # Test nearby depths
    "learning_rate": [0.03, 0.05, 0.07],  # Fine-tune learning rate
    "l2_leaf_reg": [2, 3, 4],      # Fine-tune L2 regularization
    "bagging_temperature": [0.0, 0.5],  # Fine-tune sampling temperature
}

# Set up GridSearchCV
grid_search = GridSearchCV(
    estimator=catboost,
    param_grid=param_grid,
    scoring="r2",  # Optimize for R² score
    cv=5,          # 5-fold cross-validation
    n_jobs=-1,      # Use all processors
    verbose=2       # Print logs for progress
)

# Fit the grid search on the training data
grid_search.fit(X_train, y_train)

# Output the best parameters and the corresponding R² score
print("Best parameters for CatBoost after fine-tuning:", grid_search.best_params_)
print("Best R² score on training data:", grid_search.best_score_)

# Evaluate the best model on the validation set
best_catboost = grid_search.best_estimator_
y_val_pred = best_catboost.predict(X_val)
r2_val = r2_score(y_val, y_val_pred)
print("R² score of CatBoost on validation data after fine-tuning:", r2_val)


Fitting 5 folds for each of 162 candidates, totalling 810 fits
[CV] END bagging_temperature=0.0, depth=7, iterations=80, l2_leaf_reg=2, learning_rate=0.05; total time=   2.8s
[CV] END bagging_temperature=0.0, depth=7, iterations=80, l2_leaf_reg=2, learning_rate=0.03; total time=   3.0s
[CV] END bagging_temperature=0.0, depth=7, iterations=80, l2_leaf_reg=2, learning_rate=0.03; total time=   3.0s
[CV] END bagging_temperature=0.0, depth=7, iterations=80, l2_leaf_reg=2, learning_rate=0.05; total time=   3.0s
[CV] END bagging_temperature=0.0, depth=7, iterations=80, l2_leaf_reg=2, learning_rate=0.03; total time=   3.1s
[CV] END bagging_temperature=0.0, depth=7, iterations=80, l2_leaf_reg=2, learning_rate=0.03; total time=   3.2s
[CV] END bagging_temperature=0.0, depth=7, iterations=80, l2_leaf_reg=2, learning_rate=0.05; total time=   3.2s
[CV] END bagging_temperature=0.0, depth=7, iterations=80, l2_leaf_reg=2, learning_rate=0.03; total time=   3.3s
[CV] END bagging_temperature=0.0, depth=7

## Feature Importance

In [36]:
from catboost import CatBoostRegressor
from sklearn.metrics import r2_score
import matplotlib.pyplot as plt
import numpy as np

# Step 1: Extract feature names from the transformer
try:
    feature_names = transformer.get_feature_names_out()
except AttributeError:
    print("Transformer does not support get_feature_names_out. Falling back to manual names.")
    feature_names = [f"feature_{i}" for i in range(X_train.shape[1])]  # Fallback for manual naming

# Step 2: Initialize the CatBoost model with best parameters
best_catboost = CatBoostRegressor(
    bagging_temperature=0.0,
    depth=8,
    iterations=100,
    l2_leaf_reg=3,
    learning_rate=0.05,
    verbose=0,  # Suppress training logs
    random_state=125
)

# Step 3: Fit the model on the training data
best_catboost.fit(X_train, y_train)

# Step 4: Predict on validation data and evaluate
y_val_pred = best_catboost.predict(X_val)
r2_val = r2_score(y_val, y_val_pred)
print("R² score on validation data:", r2_val)

# Step 5: Print Feature Importance
feature_importances = best_catboost.get_feature_importance()

# Sort features by importance
sorted_idx = np.argsort(feature_importances)[::-1]
sorted_importances = feature_importances[sorted_idx]
sorted_features = np.array(feature_names)[sorted_idx]

print("Feature Importance (Sorted):")
for name, importance in zip(sorted_features, sorted_importances):
    print(f"{name}: {importance:.2f}")




R² score on validation data: 0.2708526795131443
Feature Importance (Sorted):
num__availability_60: 7.02
num__availability_90: 5.90
num__availability_30: 4.20
num__last_review_timestamp: 3.89
binary__Self check-in: 2.63
num__longitude: 2.57
num__review_scores_accuracy: 1.79
num__number_of_reviews: 1.79
num__latitude: 1.71
binary__Microwave: 1.68
num__availability_365: 1.67
num__reviews_per_month: 1.65
num__id: 1.60
num__review_scores_rating: 1.56
num__minimum_nights: 1.54
num__maximum_nights_avg_ntm: 1.53
num__review_scores_cleanliness: 1.52
num__maximum_minimum_nights: 1.48
num__maximum_nights: 1.35
binary__Bed linens: 1.32
binary__Shower gel: 1.31
binary__Drying rack for clothing: 1.26
num__review_scores_value: 1.25
num__calculated_host_listings_count_entire_homes: 1.21
num__first_review_timestamp: 1.20
num__review_scores_communication: 1.20
num__review_scores_location: 1.19
num__number_of_reviews_ltm: 1.18
num__review_scores_checkin: 1.18
num__price: 1.08
cat__room_type_Entire home/a

## Permutation_importance

In [37]:
from sklearn.inspection import permutation_importance
import numpy as np
import matplotlib.pyplot as plt

# Train the CatBoost model with the best parameters
best_catboost = CatBoostRegressor(
    bagging_temperature=0.0,
    depth=8,
    iterations=100,
    l2_leaf_reg=3,
    learning_rate=0.05,
    verbose=0,
    random_state=125
)
best_catboost.fit(X_train, y_train)

# Step 1: Compute Permutation Importance
# Permutation Importance measures the decrease in model performance when a feature is randomly shuffled.
print("Calculating Permutation Importance...")
perm_importance = permutation_importance(
    best_catboost, X_val, y_val, n_repeats=30, random_state=125, scoring="r2"
)
perm_importances = perm_importance.importances_mean
perm_sorted_idx = np.argsort(perm_importances)[::-1]

# Step 2: Extract feature names (try to get from transformer or fallback to generic names)
try:
    feature_names = transformer.get_feature_names_out()
except AttributeError:
    feature_names = [f"feature_{i}" for i in range(X_train.shape[1])]

sorted_features_perm = np.array(feature_names)[perm_sorted_idx]

# Step 3: Print sorted Permutation Importance
print("\nPermutation Importance (Sorted):")
for name, importance in zip(sorted_features_perm, perm_importances[perm_sorted_idx]):
    print(f"{name}: {importance:.4f}")




Calculating Permutation Importance...

Permutation Importance (Sorted):
num__last_review_timestamp: 0.0214
num__availability_60: 0.0164
num__availability_90: 0.0140
num__availability_30: 0.0102
num__reviews_per_month: 0.0090
num__number_of_reviews_ltm: 0.0086
num__number_of_reviews: 0.0085
num__review_scores_accuracy: 0.0060
num__review_scores_location: 0.0057
num__review_scores_rating: 0.0051
num__review_scores_cleanliness: 0.0044
num__review_scores_value: 0.0043
num__longitude: 0.0038
num__review_scores_communication: 0.0038
num__maximum_minimum_nights: 0.0037
binary__Self check-in: 0.0037
num__minimum_nights: 0.0033
num__minimum_nights_avg_ntm: 0.0029
num__minimum_maximum_nights: 0.0022
num__maximum_nights: 0.0022
cat__host_response_rate_100%: 0.0022
binary__Drying rack for clothing: 0.0021
binary__BBQ grill: 0.0019
binary__Toaster: 0.0019
binary__Microwave: 0.0018
binary__Dishes and silverware: 0.0016
num__accommodates: 0.0016
num__price: 0.0015
num__first_review_timestamp: 0.0014


## Final Prediction


In [38]:


# Drop the NAn column `days_booked` and transform the test data
test_x = test_df.drop(columns=["days_booked"])  
X_test = transformer.transform(test_x) 
# using the best performance model to predict
test_y_pred = best_catboost.predict(X_test) 
test_df["prediction"] = test_y_pred 


print(test_df.head())  


     id host_since host_response_time host_response_rate host_acceptance_rate  \
0   109 2008-06-27                NaN                NaN                  NaN   
1  2708 2008-09-16     within an hour               100%                 100%   
2  2732 2008-09-17     within an hour               100%                  78%   
3  6931 2008-09-16     within an hour               100%                 100%   
4  7992 2009-06-19     within an hour               100%                  99%   

   host_is_superhost  host_listings_count  host_total_listings_count  \
0                  0                  1.0                        3.0   
1                  1                  2.0                        3.0   
2                  0                  2.0                        2.0   
3                  1                  2.0                        3.0   
4                  1                  2.0                        2.0   

   host_verifications  host_has_profile_pic  ...  Washer  Waterfront Wifi  \
0  

In [41]:
test_df = test_df[['id', 'prediction']]


In [42]:
# Check to make sure that test_df has correct format and write to csv
from hashlib import md5
import numpy as np
# Checks that you have predictions for the expected listing ids 
assert md5(np.sort(test_df.id.values)).hexdigest() == '87ed95adc911aad0ed9ef119a7a3315d', "Your listing ids are incorrect; you may need to regenerate test_df in the first cell"
assert "prediction" in test_df.columns, "You need to have a column named `prediction` in your output"
# Submit this CSV on canvas
test_df.to_csv("predictions.csv", index=False)

## Write Up

1. (3 pts) Explain how you constructed and / or preprocessed features to help with prediction, and why.

2. (3 pts) Explain what decisions you made using cross-validation, and how well you believe your final model will perform.

3. (2 pts) Explain what features you found to be important using the feature importance tools we discussed in class.


### Key Step for Preprocessing Features

| Step | Description                                                                                                           | Processed Columns                                                   |
|------|-----------------------------------------------------------------------------------------------------------------------|----------------------------------------------------------------------|
| 1    | Dropped irrelevant columns to remove unnecessary data that does not contribute to the analysis.                      | Irrelevant columns                                                  |
| 2    | Encoded the presence of `license` and `neighborhood` as 1 and absence as 0 to help the model distinguish whether the listing have liscense and neighborhood like or not. | `license`, `neighborhood`                                           |
| 3    | Grouped infrequent catergories in `property type` into "others" to prevent excessive sparsity in the one-hot encoding, I only keep those with frequency larger than 10. | `property type`                                                     |
| 4    | Split the `amenity` list into individual amenities and applied one-hot encoding to ensure better feature representation, here I also set a threshold to only encode those frequent amenities. | `amenity`                                                           |
| 5    | Converted all time-related data into timestamps to preserve temporal distance and help the model understand the time sequence. | `host_since`, `first_review`, `last_review`                         |
| 6    | Converted all boolean columns to 1 and 0.                                                                             | `host_is_superhost`, `host_has_profile_pic`, `host_identity_verified`, `has_availability`, `instant_bookable` |
| 7    | One-hot encoding to all categorical columns and scaled all numerical columns for consistency and improved model performance. | Normal categorical columns and numerical columns                    |




2. (3 pts) Explain what decisions you made using cross-validation, and how well you believe your final model will perform.
### Model Training process

| Step | Description                                                                                                 |
|------|-------------------------------------------------------------------------------------------------------------|
| 1    | Split the training set into training, testing, and validation subsets.                                      |
| 2    | Performed grid search cross-validation to identify the best hyperparameters for various models,  I tested Random Forest, Ridge, Lasso, ElasticNet, SVM, MLP, and CatBoost             |
| 3    | Compare the performane of multiple models under best hyperparameters.                                                        |
| 4    | Selected CatBoost as the best-performing model with an R² of 0.27 on the validation set for further tuning. The futher tuning model also have R² of 0.27, this is how well my model will predict |






### Model Performance

| Model           | Performance (R² on Validation Set) | Notes                             |
|------------------|------------------------------------|-----------------------------------|
| CatBoost        | 0.27                               | Best-performing model, chosen for further tuning. |
| Random Forest   | 0.26                      | Performed well but not as good as CatBoost. |
| Ridge           | 0.19                       | Tested but not optimal.           |
| Lasso           | 0.21                             | Tested but not optimal.           |
| ElasticNet      | 0.21                             | Tested but not optimal.           |
| SVM             | 0.23                                  | Tested but not optimal.           |
| MLP             | 0.25                                  | Tested but not optimal.           |

3. (2 pts) Explain what features you found to be important using the feature importance tools we discussed in class.

### Methods Used to Determine Feature Importance


| Method                  | Description                                                                                 |
|-------------------------|---------------------------------------------------------------------------------------------|
| Model's Feature Importance | Used the built-in `feature_importance` parameter from the model to evaluate feature relevance. |
| Permutation Importance   | Applied permutation importance to assess the effect of shuffling each feature on model performance. |


### Key Insights from Feature importance

| Feature                       | Insight                                                                                          |
|-------------------------------|--------------------------------------------------------------------------------------------------|
| `availability_60`, `availability_90`, `availability_30` | These availability metrics are among the most important indicators for predicting bookings.         |
| `last_review_timestamp`       | Represents the room's popularity, showing that listings with recent activity attract more bookings. |
| `number_of_reviews`,   `number_of_reviews_ltm`       | Another key indicator of a listing's popularity and trustworthiness.                          |
| Specific `amenities`          | Amenities like Self check-in, Bed linens, Refrigerator, Shower gel, Free dryer, TV, and Stove are significant in driving bookings, highlighting guest preferences for listings with these features. |
| `latitude` `longitude`          | Location is important for customers to choose a lsiting |