# Building a prediction model for Airbnb prices
Data Analysis 3 - Assignment 1  
Submitted by: Zariza Chowdhury (ID: 2500086)    
Deadline: 2 February 2026

## Business Case
My business case is to operate a chain of Airbnbs.      
The task is to build a pricing model.

## Part I. Modelling

### Step 0: Setup
Import the necessary libraries

In [135]:
# Core libraries
import numpy as np
import pandas as pd
import time
import warnings
warnings.filterwarnings("ignore")

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Scikit-learn: preprocessing
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

# Scikit-learn: models
from sklearn.linear_model import LinearRegression, Lasso
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor

# Scikit-learn: evaluation
from sklearn.metrics import mean_squared_error

# Interpretable ML
import shap

### Step 1: Data Selection, Wrangling and Feature Engineering

##### A. Dataset Selection:
- **Source**: Inside Airbnb - *https://insideairbnb.com/get-the-data/*
- **Dataset**: *listings.csv* (loaded directly from GitHub Repo)
- **City, Country**: Tokyo, Japan
- **Time Period**: Q4 2024
- **Reproducibility**: Data is uploaded and stored in a public GitHub repo and loaded directly via a raw URL

Load the Dataset

In [136]:
# Load dataset
url = "https://raw.githubusercontent.com/zarizachow/Data-Analysis-3/refs/heads/main/Assignment-1/Data/Raw/Tokyo_listings/Tokyo_2024-30-Dec/listings.csv"
df = pd.read_csv(url)

# Basic inspection
print("Shape (rows, columns):", df.shape)

print("\nColumn names:")
print(df.columns)

print("\nData types and missing values:")
df.info()

print("\nFirst 5 rows:")
df.head()

Shape (rows, columns): (21058, 75)

Column names:
Index(['id', 'listing_url', 'scrape_id', 'last_scraped', 'source', 'name',
       'description', 'neighborhood_overview', 'picture_url', 'host_id',
       'host_url', 'host_name', 'host_since', 'host_location', 'host_about',
       'host_response_time', 'host_response_rate', 'host_acceptance_rate',
       'host_is_superhost', 'host_thumbnail_url', 'host_picture_url',
       'host_neighbourhood', 'host_listings_count',
       'host_total_listings_count', 'host_verifications',
       'host_has_profile_pic', 'host_identity_verified', 'neighbourhood',
       'neighbourhood_cleansed', 'neighbourhood_group_cleansed', 'latitude',
       'longitude', 'property_type', 'room_type', 'accommodates', 'bathrooms',
       'bathrooms_text', 'bedrooms', 'beds', 'amenities', 'price',
       'minimum_nights', 'maximum_nights', 'minimum_minimum_nights',
       'maximum_minimum_nights', 'minimum_maximum_nights',
       'maximum_maximum_nights', 'minimum_nig

Unnamed: 0,id,listing_url,scrape_id,last_scraped,source,name,description,neighborhood_overview,picture_url,host_id,...,review_scores_communication,review_scores_location,review_scores_value,license,instant_bookable,calculated_host_listings_count,calculated_host_listings_count_entire_homes,calculated_host_listings_count_private_rooms,calculated_host_listings_count_shared_rooms,reviews_per_month
0,197677,https://www.airbnb.com/rooms/197677,20241230011552,2024-12-30,city scrape,Oshiage Holiday Apartment,,,https://a0.muscache.com/pictures/38437056/d27f...,964081,...,4.84,4.56,4.8,M130003350,f,1,1,0,0,1.13
1,776070,https://www.airbnb.com/rooms/776070,20241230011552,2024-12-30,city scrape,Kero-kero house room 1,We have been in airbnb since 2011 and it has g...,We love Nishinippori because is nearer to Toky...,https://a0.muscache.com/pictures/efd9f039-dbd2...,801494,...,4.98,4.84,4.92,M130000243,f,1,0,1,0,1.79
2,905944,https://www.airbnb.com/rooms/905944,20241230011552,2024-12-30,city scrape,4F Spacious Apartment in Shinjuku / Shibuya Tokyo,NEWLY RENOVATED property entirely for you & yo...,Hatagaya is a great neighborhood located 4 min...,https://a0.muscache.com/pictures/miso/Hosting-...,4847803,...,4.91,4.78,4.78,Hotels and Inns Business Act | 渋谷区保健所長 | 31渋健生...,t,8,8,0,0,1.69
3,1016831,https://www.airbnb.com/rooms/1016831,20241230011552,2024-12-30,city scrape,5 mins Shibuya Cat modern sunny Shimokita,"Hi there, I am Wakana and I live with my two f...",The location is walkable distance to famous Sh...,https://a0.muscache.com/pictures/airflow/Hosti...,5596383,...,4.98,4.92,4.9,M130001107,f,1,0,1,0,1.9
4,1196177,https://www.airbnb.com/rooms/1196177,20241230011552,2024-12-30,city scrape,Homestay at Host's House - Senju-Ohashi Station,Our accommodation offers: <br /><br />1. **Gr...,There are shopping mall near Senjuohashi stati...,https://a0.muscache.com/pictures/72890882/05ec...,5686404,...,4.92,4.74,4.82,M130007760,f,1,0,1,0,0.97


##### B. Data Wrangling and Feature Engineering

**Handle Missing Values**

- Identify missing values across numeric and categorical variables - to ensure that all models can be estimated without errors
- Impute numeric variables with the column median - the median is robust to outliers, so imputed values are not distorted by extreme observations
- Treat missing categorical values as a separate `"missing"` category - to preserve information and avoid dropping observations

In [137]:
# Handle missing values

# Separate numeric and categorical columns
numeric_cols = df.select_dtypes(include=[np.number]).columns
categorical_cols = df.select_dtypes(exclude=[np.number]).columns

# Impute numeric variables with median
for col in numeric_cols:
    df[col] = df[col].fillna(df[col].median())

# Impute categorical variables with explicit category
for col in categorical_cols:
    df[col] = df[col].fillna("missing")

# Sanity check (columns that are entirely NaN have no median, so they remain missing)
df.isnull().sum().sort_values(ascending=False).head()

neighbourhood_group_cleansed    21058
calendar_updated                21058
minimum_nights_avg_ntm              0
availability_365                    0
availability_90                     0
dtype: int64
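
The two columns that remain fully missing have no median to impute with. A minimal sketch, on a hypothetical toy frame (`toy` and its values are illustrative, not taken from the listings data), of reporting the missing share per column and flagging entirely empty columns before imputation:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one partly missing column, one fully missing column
toy = pd.DataFrame({
    "beds": [1.0, np.nan, 3.0, 2.0],
    "calendar_updated": [np.nan, np.nan, np.nan, np.nan],
    "room_type": ["Entire home/apt", None, "Private room", "Private room"],
})

# Share of missing values per column, largest first
missing_share = toy.isna().mean().sort_values(ascending=False)
print(missing_share)

# Columns with no observed values cannot be imputed with a median
all_missing = missing_share[missing_share == 1.0].index.tolist()
print("Entirely missing:", all_missing)
```

Columns flagged this way are candidates for dropping rather than imputing.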

**Clean and Standardize Numeric Variables**

- Inspect numeric variables for unrealistic or extreme values – to identify potential data quality issues
- Apply simple cleaning rules and transformations where needed – to ensure the variables are consistent
- Standardize the format of the numeric variables – to support estimation across different predictive models

In [138]:
# Clean and standardize numeric variables

# Fix price variable (it may contain strings or missing)
df["price"] = df["price"].replace("missing", np.nan)

df["price"] = (
    df["price"]
    .astype(str)
    .str.replace("$", "", regex=False)
    .str.replace(",", "", regex=False)
)

df["price"] = pd.to_numeric(df["price"], errors="coerce")

# Impute missing price values with median
df["price"] = df["price"].fillna(df["price"].median())

# Update numeric columns after cleaning price
numeric_cols = df.select_dtypes(include=[np.number]).columns

# Inspect summary statistics
df[numeric_cols].describe().T

# Cap extreme values (simple winsorization)
for col in numeric_cols:
    lower = df[col].quantile(0.01)
    upper = df[col].quantile(0.99)
    df[col] = df[col].clip(lower, upper)

# Sanity check after cleaning
df[numeric_cols].describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
id,21058.0,7.505109e+17,5.214894e+17,8998859.0,46321860.0,9.859087e+17,1.188786e+18,1.313531e+18
scrape_id,21058.0,20241230000000.0,10.35962,20241230000000.0,20241230000000.0,20241230000000.0,20241230000000.0,20241230000000.0
host_id,21058.0,332607600.0,198555500.0,6648140.0,154225900.0,330316900.0,527232500.0,661916700.0
host_listings_count,21058.0,24.29951,30.15113,1.0,4.0,11.0,32.0,141.0
host_total_listings_count,21058.0,31.46866,41.09053,1.0,6.0,15.0,39.0,214.91
neighbourhood_group_cleansed,0.0,,,,,,,
latitude,21058.0,35.69799,0.04157748,35.55644,35.68777,35.70394,35.72249,35.77726
longitude,21058.0,139.738,0.06482399,139.4746,139.6993,139.7276,139.7923,139.8767
accommodates,21058.0,4.436984,2.955584,1.0,2.0,4.0,6.0,16.0
bathrooms,21058.0,1.130331,0.3829521,0.5,1.0,1.0,1.0,3.0


In [None]:
# Ensure target variable (price) is numeric (avoid errors in models)

df["price"] = (
    df["price"]
    .astype(str)
    .str.replace("$", "", regex=False)
    .str.replace(",", "", regex=False)
)

df["price"] = pd.to_numeric(df["price"], errors="coerce")

# Drop rows where price is missing
df = df.dropna(subset=["price"])

# Sanity check
df["price"].dtype, df["price"].isna().sum(), df.shape

(dtype('float64'), np.int64(0), (21058, 75))

In [140]:
# Drop unusable variables (except id)

# Columns with all missing values
all_missing_cols = [
    "neighbourhood_group_cleansed",
    "calendar_updated"
]

# System-generated id columns
id_cols = [
    "scrape_id"
]

cols_to_drop = [col for col in all_missing_cols + id_cols if col in df.columns]
df = df.drop(columns=cols_to_drop)

# Sanity check
print("Dropped columns:", cols_to_drop)
print("New shape:", df.shape)

Dropped columns: ['neighbourhood_group_cleansed', 'calendar_updated', 'scrape_id']
New shape: (21058, 72)


**Variable Selection for Modelling**

- Exclude id, URLs, dates, and free-text fields that are not useful for prediction  
- Keep structured listing, host, location, and amenity variables  
- Use the same variables across all datasets for out-of-sample comparison
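
A minimal sketch of how such an exclusion rule could be written down explicitly. The toy frame and the three exclusion lists are assumptions for illustration; the notebook itself restricts the columns later, just before the LASSO model:

```python
import pandas as pd

# Hypothetical toy frame using a few column names from the listings schema
toy = pd.DataFrame({
    "id": [1, 2],
    "listing_url": ["https://www.airbnb.com/rooms/1", "https://www.airbnb.com/rooms/2"],
    "host_since": ["2011-01-01", "2015-06-30"],
    "description": ["Sunny flat", "Quiet room"],
    "room_type": ["Entire home/apt", "Private room"],
    "accommodates": [4, 2],
    "price": [12000.0, 6000.0],
})

# Assumed rules: URL columns by suffix, plus hand-picked date and free-text columns
url_cols = [c for c in toy.columns if c.endswith("_url")]
date_cols = ["host_since"]
text_cols = ["description"]

# The identifier and the target are also kept out of the feature list
exclude = set(url_cols + date_cols + text_cols + ["id", "price"])
feature_cols = [c for c in toy.columns if c not in exclude]
print(feature_cols)
```

Keeping the resulting list in one place makes it easier to reuse the same variables on the other datasets.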

**Extract Amenities**

- Parse the amenities text field into structured variables – to make the data usable  
- Create binary indicators for selected amenities – to capture key listing features  
- Use amenity features as additional inputs in the models

In [141]:
# Extract amenities

# Convert amenities column to string
df["amenities"] = df["amenities"].astype(str)

# List of selected amenities to extract
amenities_list = [
    "Wifi",
    "Kitchen",
    "Air conditioning",
    "Heating",
    "Washer",
    "Dryer",
    "Elevator",
    "TV"
]

# Create binary indicators for each amenity
for amenity in amenities_list:
    df[f"amenity_{amenity.lower().replace(' ', '_')}"] = (
        df["amenities"].str.contains(amenity, case=False, regex=False).astype(int)
    )

# Drop original amenities text field
df = df.drop(columns=["amenities"])

# Sanity check
df.filter(like="amenity_").head()

Unnamed: 0,amenity_wifi,amenity_kitchen,amenity_air_conditioning,amenity_heating,amenity_washer,amenity_dryer,amenity_elevator,amenity_tv
0,1,1,1,1,1,1,0,1
1,1,0,1,1,0,1,0,1
2,1,1,0,1,0,1,0,1
3,1,1,0,1,1,1,0,1
4,1,0,0,1,1,1,0,1
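
Substring matching flags `Dryer` for a listing that only offers "Hair dryer", and `TV` for any amenity containing those letters. A minimal sketch of exact matching, assuming the raw `amenities` field is a JSON-encoded list of strings (an assumption about the file format; `parse_amenities` is a hypothetical helper, not part of the notebook):

```python
import json

def parse_amenities(raw):
    """Return the set of lower-cased amenity names, or an empty set if parsing fails."""
    try:
        items = json.loads(raw)
    except (TypeError, ValueError):
        return set()
    if not isinstance(items, list):
        return set()
    return {str(item).strip().lower() for item in items}

selected = ["Wifi", "Dryer", "TV"]
raw = '["Wifi", "Hair dryer", "HDTV with Netflix"]'

parsed = parse_amenities(raw)

# Exact match against the parsed list vs. substring match on the raw text
exact = {name: int(name.lower() in parsed) for name in selected}
substring = {name: int(name.lower() in raw.lower()) for name in selected}

print("exact:    ", exact)
print("substring:", substring)
```

Whether exact or substring matching is preferable depends on how the amenity labels are worded in the data.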


**Save Cleaned Dataset**

In [None]:
# Save cleaned Tokyo Q4 2024 dataset

output_path = "Data/Cleaned/Tokyo_listings/tokyo_listings_q4_2024_clean.csv"

# Overwrite file if it already exists
df.to_csv(output_path, index=False)

print("Cleaned dataset saved (overwritten if existed):")
print(output_path)

Cleaned dataset saved (overwritten if existed):
Data/Cleaned/Tokyo_listings/tokyo_listings_q4_2024_clean.csv


**Encode Categorical Variables**

- Select relevant categorical variables  
- Convert categorical variables into numeric form  
- Use the same encoding across all datasets

In [None]:
# Encode categorical variables

# Select categorical variables
categorical_cols = df.select_dtypes(exclude=[np.number]).columns.tolist()

# Exclude variables not used for modelling
categorical_cols = [
    col for col in categorical_cols
    if col not in ["id"]  # keep id as primary key, but not as a feature
]

# Sanity check
categorical_cols

['listing_url',
 'last_scraped',
 'source',
 'name',
 'description',
 'neighborhood_overview',
 'picture_url',
 'host_url',
 'host_name',
 'host_since',
 'host_location',
 'host_about',
 'host_response_time',
 'host_response_rate',
 'host_acceptance_rate',
 'host_is_superhost',
 'host_thumbnail_url',
 'host_picture_url',
 'host_neighbourhood',
 'host_verifications',
 'host_has_profile_pic',
 'host_identity_verified',
 'neighbourhood',
 'neighbourhood_cleansed',
 'property_type',
 'room_type',
 'bathrooms_text',
 'has_availability',
 'calendar_last_scraped',
 'first_review',
 'last_review',
 'license',
 'instant_bookable']

### Step 2: Build Predictive Models

##### A. OLS Model

- Define target variable (`price`) and feature matrix (`X`)  
- Split data into training and test sets  
- Fit an OLS baseline model and generate predictions

In [144]:
# OLS model (with imputation in pipeline)

from sklearn.impute import SimpleImputer

# Define target and features
y = df["price"]
X = df.drop(columns=["price", "id"])

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Identify numeric and categorical features
numeric_features = X_train.select_dtypes(include=[np.number]).columns
categorical_features = X_train.select_dtypes(exclude=[np.number]).columns

# Preprocessing pipelines
numeric_transformer = Pipeline(
    steps=[
        ("imputer", SimpleImputer(strategy="median")),
        ("scaler", StandardScaler())
    ]
)

categorical_transformer = Pipeline(
    steps=[
        ("imputer", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore"))
    ]
)

preprocessor = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, numeric_features),
        ("cat", categorical_transformer, categorical_features)
    ]
)

# OLS pipeline
ols_model = Pipeline(
    steps=[
        ("preprocessor", preprocessor),
        ("model", LinearRegression())
    ]
)

# Fit model
ols_model.fit(X_train, y_train)

# Predictions
y_pred_ols = ols_model.predict(X_test)

# Evaluation (RMSE)
rmse_ols = np.sqrt(mean_squared_error(y_test, y_pred_ols))
print("OLS Model RMSE:", rmse_ols)

OLS Model RMSE: 12740.712267742509


##### B. LASSO Model

- Use the same target variable as in the OLS model, with the categorical features restricted to a short list to avoid a huge dummy matrix  
- Fit a LASSO model to allow coefficient shrinkage  
- Predict prices on the test set and evaluate model performance

In [145]:
# Restrict categorical variables before LASSO to avoid huge dummy matrix

# Define categorical variables to keep (class-style)
categorical_keep = [
    "room_type",
    "property_type",
    "neighbourhood_cleansed",
    "host_is_superhost",
    "instant_bookable"
]

# Keep only those that exist in the dataset
categorical_keep = [c for c in categorical_keep if c in df.columns]

# Define target and features
y = df["price"]
X = df.drop(columns=["price", "id"])

# Drop all other object/string columns
object_cols = X.select_dtypes(exclude=[np.number]).columns.tolist()
drop_cols = [c for c in object_cols if c not in categorical_keep]

X = X.drop(columns=drop_cols)

# Sanity check
print("Categorical variables kept:", categorical_keep)
print("Number of object/string columns dropped:", len(drop_cols))
print("Final feature shape:", X.shape)

Categorical variables kept: ['room_type', 'property_type', 'neighbourhood_cleansed', 'host_is_superhost', 'instant_bookable']
Number of object/string columns dropped: 28
Final feature shape: (21058, 49)


In [146]:
# LASSO model

# Train-test split (using the restricted X, y)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Identify feature types
numeric_features = X_train.select_dtypes(include=[np.number]).columns
categorical_features = X_train.select_dtypes(exclude=[np.number]).columns

# Preprocessing
numeric_transformer = Pipeline(
    steps=[
        ("imputer", SimpleImputer(strategy="median")),
        ("scaler", StandardScaler())
    ]
)

categorical_transformer = Pipeline(
    steps=[
        ("imputer", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore"))
    ]
)

preprocessor_lasso = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, numeric_features),
        ("cat", categorical_transformer, categorical_features)
    ]
)

# LASSO pipeline
lasso_model = Pipeline(
    steps=[
        ("preprocessor", preprocessor_lasso),
        ("model", Lasso(alpha=1.0, max_iter=3000))
    ]
)

# Fit model
lasso_model.fit(X_train, y_train)

# Predictions
y_pred_lasso = lasso_model.predict(X_test)

# Evaluation (RMSE)
rmse_lasso = np.sqrt(mean_squared_error(y_test, y_pred_lasso))
print("LASSO Model RMSE:", rmse_lasso)

LASSO Model RMSE: 15064.244563773585
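
The model above uses a fixed `alpha=1.0`. A minimal sketch of how the penalty could instead be chosen by cross-validation with `GridSearchCV`, run on synthetic data; the alpha grid and the data are assumptions for illustration, not the notebook's specification:

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data standing in for the listings features
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(200, 5))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(scale=0.5, size=200)

pipe = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("model", Lasso(max_iter=3000)),
])

# Hypothetical alpha grid; a sensible range depends on the scale of the target
param_grid = {"model__alpha": [0.001, 0.01, 0.1, 1.0, 10.0]}

# 5-fold cross-validation, scored by (negated) RMSE
search = GridSearchCV(pipe, param_grid, scoring="neg_root_mean_squared_error", cv=5)
search.fit(X_demo, y_demo)

print("Best alpha:", search.best_params_["model__alpha"])
print("CV RMSE:", -search.best_score_)
```

With prices in the tens of thousands, a useful grid for the listings data would likely need much larger alpha values than the ones shown here.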


##### C. Random Forest Model

- Use the same target variable and working feature set (`X`, `y`)  
- Fit a Random Forest model and generate predictions  
- Evaluate performance on the test set (RMSE)

In [150]:
# Random Forest model

# Train-test split (using the same working X, y)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Identify feature types
numeric_features = X_train.select_dtypes(include=[np.number]).columns
categorical_features = X_train.select_dtypes(exclude=[np.number]).columns

# Preprocessing
numeric_transformer = Pipeline(
    steps=[("imputer", SimpleImputer(strategy="median"))]
)

categorical_transformer = Pipeline(
    steps=[
        ("imputer", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore"))
    ]
)

preprocessor_rf = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, numeric_features),
        ("cat", categorical_transformer, categorical_features)
    ]
)

# Random Forest pipeline (final, faster spec)
rf_model = Pipeline(
    steps=[
        ("preprocessor", preprocessor_rf),
        ("model", RandomForestRegressor(
            n_estimators=100,
            max_depth=20,
            min_samples_leaf=5,
            max_features="sqrt",
            random_state=42,
            n_jobs=-1
        ))
    ]
)

# Fit model
rf_model.fit(X_train, y_train)

# Predictions
y_pred_rf = rf_model.predict(X_test)

# Evaluation (RMSE)
rmse_rf = np.sqrt(mean_squared_error(y_test, y_pred_rf))
print("Random Forest Model RMSE:", rmse_rf)

Random Forest Model RMSE: 13867.28701676302


##### D. Boosting Model (Gradient Boosting)

- **Chosen model**: Gradient Boosting
- **Reason for choosing this model**: This is a common boosting method used in class that performs well on tabular data and allows feature importance analysis

In [148]:
# Gradient Boosting model

# Train-test split (using the same working X, y)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Identify feature types
numeric_features = X_train.select_dtypes(include=[np.number]).columns
categorical_features = X_train.select_dtypes(exclude=[np.number]).columns

# Preprocessing (same as Random Forest)
numeric_transformer = Pipeline(
    steps=[
        ("imputer", SimpleImputer(strategy="median"))
    ]
)

categorical_transformer = Pipeline(
    steps=[
        ("imputer", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore"))
    ]
)

preprocessor_gb = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, numeric_features),
        ("cat", categorical_transformer, categorical_features)
    ]
)

# Gradient Boosting pipeline
gb_model = Pipeline(
    steps=[
        ("preprocessor", preprocessor_gb),
        ("model", GradientBoostingRegressor(random_state=42))
    ]
)

# Fit model
gb_model.fit(X_train, y_train)

# Predictions
y_pred_gb = gb_model.predict(X_test)

# Evaluation (RMSE)
rmse_gb = np.sqrt(mean_squared_error(y_test, y_pred_gb))
print("Gradient Boosting Model RMSE:", rmse_gb)

Gradient Boosting Model RMSE: 14187.44720172246


##### E. Decision Tree

- **Chosen model**: Decision Tree (CART)  
- **Reason for choosing the model**: This is a simple tree-based model which provides a clear baseline to compare with ensemble methods

In [149]:
# Decision Tree model (CART)

# Train-test split (using the same working X, y)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Identify feature types
numeric_features = X_train.select_dtypes(include=[np.number]).columns
categorical_features = X_train.select_dtypes(exclude=[np.number]).columns

# Preprocessing (same as RF and Boosting)
numeric_transformer = Pipeline(
    steps=[
        ("imputer", SimpleImputer(strategy="median"))
    ]
)

categorical_transformer = Pipeline(
    steps=[
        ("imputer", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore"))
    ]
)

preprocessor_dt = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, numeric_features),
        ("cat", categorical_transformer, categorical_features)
    ]
)

# Decision Tree pipeline
dt_model = Pipeline(
    steps=[
        ("preprocessor", preprocessor_dt),
        ("model", DecisionTreeRegressor(random_state=42))
    ]
)

# Fit model
dt_model.fit(X_train, y_train)

# Predictions
y_pred_dt = dt_model.predict(X_test)

# Evaluation (RMSE)
rmse_dt = np.sqrt(mean_squared_error(y_test, y_pred_dt))
print("Decision Tree Model RMSE:", rmse_dt)

Decision Tree Model RMSE: 18161.323384531686


### Step 3: Compare the Models in Terms of Fit and Time

##### A. Horserace Table

The horserace table compares the 5 models in terms of predictive accuracy and computation time.

- **RMSE** is used to measure out-of-sample prediction error. Lower RMSE values indicate better predictive performance.
- **Time** captures total model training and prediction time
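- All models use the same target and the same train–test split (`test_size=0.2`, `random_state=42`)
- OLS is fitted on the full feature set, while the other four models use the restricted set of categorical variables, so the comparison with OLS is not fully like-for-like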

| Model                    | RMSE (Test Set) | Runtime (seconds) |
|--------------------------|----------------|-------------------|
| OLS                      | 12,740.71      | 4.5               |
| LASSO                    | 15,064.24      | 1.7               |
| Random Forest            | 13,867.29      | 0.9               |
| Gradient Boosting        | 14,187.45      | 10.3              |
| Decision Tree (CART)     | 18,161.32      | 0.8               |
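
A minimal sketch of how RMSE and runtime could be measured together for one model. `timed_rmse` is a hypothetical helper and the data are synthetic; the notebook does not show how the runtimes in the table were recorded:

```python
import time

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

def timed_rmse(model, X_train, y_train, X_test, y_test):
    """Fit, predict, and return (test RMSE, elapsed seconds for fit + predict)."""
    start = time.perf_counter()
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    elapsed = time.perf_counter() - start
    rmse = float(np.sqrt(mean_squared_error(y_test, y_pred)))
    return rmse, elapsed

# Synthetic data: linear signal plus small noise
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
y_demo = X_demo @ np.array([1.0, 2.0, 3.0]) + rng.normal(scale=0.1, size=100)

rmse, seconds = timed_rmse(
    LinearRegression(), X_demo[:80], y_demo[:80], X_demo[80:], y_demo[80:]
)
print(f"RMSE: {rmse:.3f}, runtime: {seconds:.4f} s")
```

Calling the same helper for each pipeline would give one row of the table per model; runtimes vary between runs and machines.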

##### B. Discussion of Performance

The horserace table shows clear differences across the models in terms of fit and runtime.

- **OLS**  
  OLS achieves the lowest RMSE of the five models (12,740.71), but its runtime (4.5 seconds) is the second highest.

- **LASSO**  
  LASSO has a higher RMSE than OLS. Because it is fitted on the restricted feature set with a fixed `alpha=1.0`, the gap may reflect the dropped features or the untuned penalty rather than shrinkage itself.

- **Random Forest**  
  Random Forest performs better than LASSO and the single tree, but it is not the best model here. In this run it was also faster than OLS (0.9 vs 4.5 seconds).

- **Gradient Boosting**  
  Gradient Boosting performs similarly to Random Forest and improves over a single decision tree, but it does not beat the simpler OLS baseline in this case. It is also the slowest model (10.3 seconds).

- **Decision Tree (CART)**  
  CART performs the worst, which is expected because a single tree can overfit and does not generalize as well as ensemble models.

Overall, the results show that more complex models do not automatically guarantee better accuracy.  
In this dataset, OLS has the lowest test RMSE, while the tree-based models add complexity without improving on it.

### Step 4: Analyzing Random Forest Model and Gradient Boosting Model

##### A. Feature Importance of Random Forest
- Feature importance shows which variables contribute most to predicting prices in this model
- Since this is a black-box model without an interpretable formula, feature importance helps interpret its behaviour
- Feature importance values are impurity-based: they measure how much splits on each feature reduce prediction error, averaged across the trees and normalized to sum to 1
- The focus is to identify the most important features, not the exact numerical values

In [151]:
# Random Forest Feature Importance

# Get fitted Random Forest model
rf_estimator = rf_model.named_steps["model"]

# Get feature names after preprocessing
preprocessor = rf_model.named_steps["preprocessor"]

# Numeric feature names
num_features = preprocessor.named_transformers_["num"].feature_names_in_

# Categorical feature names (after one-hot encoding)
cat_ohe = preprocessor.named_transformers_["cat"].named_steps["onehot"]
cat_features = cat_ohe.get_feature_names_out(
    preprocessor.named_transformers_["cat"].feature_names_in_
)

# Combine all feature names
feature_names = np.concatenate([num_features, cat_features])

# Extract feature importances
importances = rf_estimator.feature_importances_

# Create importance table
rf_importance_df = (
    pd.DataFrame({
        "feature": feature_names,
        "importance": importances
    })
    .sort_values("importance", ascending=False)
)

# Show top 10 features
rf_importance_df.head(10)

Unnamed: 0,feature,importance
5,accommodates,0.17182
8,beds,0.118331
7,bedrooms,0.085578
6,bathrooms,0.055443
3,latitude,0.033426
4,longitude,0.027054
104,property_type_Entire home,0.025415
0,host_id,0.023638
19,availability_90,0.02121
32,calculated_host_listings_count_entire_homes,0.020079


**Top 10 Features – Random Forest**

| Feature                          | Importance |
|----------------------------------|------------|
| accommodates                      | 0.171820   |
| beds                              | 0.118331   |
| bedrooms                          | 0.085578   |
| bathrooms                         | 0.055443   |
| latitude                          | 0.033426   |
| longitude                         | 0.027054   |
| property_type_Entire home         | 0.025415   |
| host_id                           | 0.023638   |
| availability_90                   | 0.021210   |
| calculated_host_listings_count_entire_homes | 0.020079   |

The Random Forest model is mainly driven by **capacity** (`accommodates`) and **property size** (`beds`, `bedrooms`, `bathrooms`).  
**Location** (`latitude`, `longitude`) and **property type** (`property_type_Entire home`) also matter for predicting price.

##### B. Feature Importance of Gradient Boosting

This section examines which features are most important in the Gradient Boosting model.

- Feature importance highlights which variables drive price predictions
- Gradient Boosting is also a black-box model, so feature importance helps with interpretation
- Importance values are impurity-based: they measure how much splits on each feature reduce impurity across the boosted trees
- The focus is on identifying the most important features rather than exact numerical values

In [152]:
# Gradient Boosting Feature Importance

# Get fitted Gradient Boosting model
gb_estimator = gb_model.named_steps["model"]

# Get preprocessor
preprocessor = gb_model.named_steps["preprocessor"]

# Numeric feature names
num_features = preprocessor.named_transformers_["num"].feature_names_in_

# Categorical feature names (after one-hot encoding)
cat_ohe = preprocessor.named_transformers_["cat"].named_steps["onehot"]
cat_features = cat_ohe.get_feature_names_out(
    preprocessor.named_transformers_["cat"].feature_names_in_
)

# Combine all feature names
feature_names = np.concatenate([num_features, cat_features])

# Extract feature importances
importances = gb_estimator.feature_importances_

# Create importance table
gb_importance_df = (
    pd.DataFrame({
        "feature": feature_names,
        "importance": importances
    })
    .sort_values("importance", ascending=False)
)

# Show top 10 features
gb_importance_df.head(10)

Unnamed: 0,feature,importance
5,accommodates,0.516623
8,beds,0.090508
6,bathrooms,0.06338
3,latitude,0.061143
7,bedrooms,0.052976
4,longitude,0.029512
88,neighbourhood_cleansed_Shinjuku Ku,0.019219
19,availability_90,0.018191
0,host_id,0.013844
86,neighbourhood_cleansed_Shibuya Ku,0.010434


**Top 10 Features – Gradient Boosting**

| Feature                                | Importance |
|----------------------------------------|------------|
| accommodates                           | 0.516623   |
| beds                                   | 0.090508   |
| bathrooms                              | 0.063380   |
| latitude                               | 0.061143   |
| bedrooms                               | 0.052976   |
| longitude                              | 0.029512   |
| neighbourhood_cleansed_Shinjuku Ku     | 0.019219   |
| availability_90                        | 0.018191   |
| host_id                                | 0.013844   |
| neighbourhood_cleansed_Shibuya Ku      | 0.010434   |

The Gradient Boosting model is mainly driven by capacity (`accommodates`) and property size (`beds`, `bathrooms`, `bedrooms`).      
Location (`latitude`, `longitude`, and neighbourhood) and availability also contribute to price prediction.

##### C. Compare the 10 Most Important Features of Random Forest and Gradient Boosting Models

Both models rely on similar core drivers of price, but they emphasize different aspects of the listing.

**Similarities: Features appearing in the top 10 of both models**
- Capacity: `accommodates`
- Property size: `beds`, `bathrooms`, `bedrooms`
- Location: `latitude`, `longitude`
- Availability: `availability_90`
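- Host-related: `host_id` (an identifier rather than a substantive host characteristic, so its importance should be read with caution)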

**Differences**
- **Random Forest** places more weight on host portfolio size and listing structure, such as  
  `property_type_Entire home` and `calculated_host_listings_count_entire_homes`.
- **Gradient Boosting** gives more importance to neighbourhood-specific variables, such as  
  `neighbourhood_cleansed_Shinjuku Ku` and `neighbourhood_cleansed_Shibuya Ku`.

Overall, both models confirm that **capacity and location** are the main drivers of Airbnb prices.  
However, Gradient Boosting captures neighbourhood effects more strongly, while Random Forest focuses more on host and property structure.
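
The overlap can be checked mechanically. A short sketch using the feature names copied from the two top-10 tables above:

```python
# Top-10 feature names copied from the Random Forest and Gradient Boosting tables
rf_top10 = [
    "accommodates", "beds", "bedrooms", "bathrooms", "latitude", "longitude",
    "property_type_Entire home", "host_id", "availability_90",
    "calculated_host_listings_count_entire_homes",
]
gb_top10 = [
    "accommodates", "beds", "bathrooms", "latitude", "bedrooms", "longitude",
    "neighbourhood_cleansed_Shinjuku Ku", "availability_90", "host_id",
    "neighbourhood_cleansed_Shibuya Ku",
]

# Preserve the Random Forest ranking order when listing shared features
shared = [f for f in rf_top10 if f in gb_top10]
rf_only = [f for f in rf_top10 if f not in gb_top10]
gb_only = [f for f in gb_top10 if f not in rf_top10]

print("Shared:", len(shared), shared)
print("Random Forest only:", rf_only)
print("Gradient Boosting only:", gb_only)
```

Eight of the ten features are shared; each model has two features that the other does not rank in its top 10.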

## Part II. Validity

This section tests how well the models perform on new/'live' datasets.

Two additional datasets are used:
- A **later time period** for the same city (Tokyo) - `Q3 2025`
- A **different city** from the same region - 

The same data wrangling steps and the same 5 predictive models from Part I are applied to these new datasets.       
Model performance is then compared to assess how well the models generalize to new data and settings.

### Step 5: Adding 2 'Live' Datasets

##### A. Later Date: Tokyo (Q3 2025)

- **City, Country**: Tokyo, Japan  
- **Dataset**: *listings.csv* (loaded directly from GitHub Repo)
- **Time Period**: Q3 2025  
- **Purpose**: Evaluate how well models trained on earlier data perform on a later time period

The same data wrangling and feature engineering steps from Part I are applied to this dataset.
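
One way to guarantee that the steps are identical across datasets is to wrap them in a function. A minimal sketch covering a subset of the Part I steps (price parsing, imputation, amenity flags); `clean_listings` is a hypothetical helper and the toy frame is illustrative, not how the notebook is organised:

```python
import numpy as np
import pandas as pd

def clean_listings(raw, amenities_list):
    """Parse price, impute missing values and add amenity flags on a copy of `raw`."""
    df = raw.copy()

    # Parse price strings such as "$12,000.00" into floats
    df["price"] = pd.to_numeric(
        df["price"].astype(str)
        .str.replace("$", "", regex=False)
        .str.replace(",", "", regex=False),
        errors="coerce",
    )

    # Median for numeric columns, explicit "missing" category for the rest
    for col in df.columns:
        if pd.api.types.is_numeric_dtype(df[col]):
            df[col] = df[col].fillna(df[col].median())
        else:
            df[col] = df[col].fillna("missing")

    # Binary amenity indicators (substring match, as in Part I)
    for amenity in amenities_list:
        flag = f"amenity_{amenity.lower().replace(' ', '_')}"
        df[flag] = df["amenities"].str.contains(amenity, case=False, regex=False).astype(int)

    return df.drop(columns=["amenities"])

# Hypothetical toy input with missing price, beds and amenities
toy = pd.DataFrame({
    "price": ["$12,000.00", None, "$6,500.00"],
    "beds": [2.0, np.nan, 1.0],
    "amenities": ['["Wifi", "Kitchen"]', '["Wifi"]', None],
})
clean = clean_listings(toy, ["Wifi", "Kitchen"])
print(clean)
```

The same call could then be applied to each raw listings file in turn.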

Load the Dataset

In [153]:
# Load dataset
url_tokyo_q3_2025 = "https://raw.githubusercontent.com/zarizachow/Data-Analysis-3/refs/heads/main/Assignment-1/Data/Raw/Tokyo_listings/Tokyo_2025_29_Sep/listings.csv"
df_tokyo_q3_2025 = pd.read_csv(url_tokyo_q3_2025)

# Basic inspection
print("Shape (rows, columns):", df_tokyo_q3_2025.shape)

print("\nColumn names:")
print(df_tokyo_q3_2025.columns)

print("\nData types and missing values:")
df_tokyo_q3_2025.info()

print("\nFirst 5 rows:")
df_tokyo_q3_2025.head()

Shape (rows, columns): (27945, 79)

Column names:
Index(['id', 'listing_url', 'scrape_id', 'last_scraped', 'source', 'name',
       'description', 'neighborhood_overview', 'picture_url', 'host_id',
       'host_url', 'host_name', 'host_since', 'host_location', 'host_about',
       'host_response_time', 'host_response_rate', 'host_acceptance_rate',
       'host_is_superhost', 'host_thumbnail_url', 'host_picture_url',
       'host_neighbourhood', 'host_listings_count',
       'host_total_listings_count', 'host_verifications',
       'host_has_profile_pic', 'host_identity_verified', 'neighbourhood',
       'neighbourhood_cleansed', 'neighbourhood_group_cleansed', 'latitude',
       'longitude', 'property_type', 'room_type', 'accommodates', 'bathrooms',
       'bathrooms_text', 'bedrooms', 'beds', 'amenities', 'price',
       'minimum_nights', 'maximum_nights', 'minimum_minimum_nights',
       'maximum_minimum_nights', 'minimum_maximum_nights',
       'maximum_maximum_nights', 'minimum_nig

Unnamed: 0,id,listing_url,scrape_id,last_scraped,source,name,description,neighborhood_overview,picture_url,host_id,...,review_scores_communication,review_scores_location,review_scores_value,license,instant_bookable,calculated_host_listings_count,calculated_host_listings_count_entire_homes,calculated_host_listings_count_private_rooms,calculated_host_listings_count_shared_rooms,reviews_per_month
0,197677,https://www.airbnb.com/rooms/197677,20250929042135,2025-09-30,city scrape,Oshiage Holiday Apartment,,,https://a0.muscache.com/pictures/38437056/d27f...,964081,...,4.84,4.58,4.8,M130003350,f,1,1,0,0,1.11
1,776070,https://www.airbnb.com/rooms/776070,20250929042135,2025-09-29,city scrape,Kero-kero room 1F,We have been in airbnb since 2011 and it has g...,We love Nishinippori because is nearer to Toky...,https://a0.muscache.com/pictures/efd9f039-dbd2...,801494,...,4.98,4.85,4.92,M130000243,f,1,0,1,0,1.74
2,905944,https://www.airbnb.com/rooms/905944,20250929042135,2025-09-29,city scrape,4F Spacious Apartment in Shinjuku / Shibuya Tokyo,NEWLY RENOVATED property entirely for you & yo...,Hatagaya is a great neighborhood located 4 min...,https://a0.muscache.com/pictures/hosting/Hosti...,4847803,...,4.93,4.81,4.81,Hotels and Inns Business Act | 渋谷区保健所長 | 31渋健生...,t,9,9,0,0,1.85
3,1016831,https://www.airbnb.com/rooms/1016831,20250929042135,2025-09-29,city scrape,5 mins Shibuya Cat modern sunny Shimokita,"Hi there, I am Wakana and I live with my two f...",The location is walkable distance to famous Sh...,https://a0.muscache.com/pictures/airflow/Hosti...,5596383,...,4.98,4.93,4.89,M130001107,f,1,0,1,0,1.87
4,1196177,https://www.airbnb.com/rooms/1196177,20250929042135,2025-09-29,city scrape,Homestay at Host's House - Senju-Ohashi Station,Our accommodation offers: <br /><br />1. **Gr...,There are shopping mall near Senjuohashi stati...,https://a0.muscache.com/pictures/72890882/05ec...,5686404,...,4.91,4.75,4.83,M130007760,f,1,0,1,0,1.01


**Handle Missing Values**

- Identify missing values across numeric and categorical variables to avoid model errors  
- Impute numeric variables using median values  
- Treat missing categorical values as `"missing"` to avoid dropping observations

In [154]:
# Handle missing values

# Separate numeric and categorical columns
numeric_cols = df_tokyo_q3_2025.select_dtypes(include=[np.number]).columns
categorical_cols = df_tokyo_q3_2025.select_dtypes(exclude=[np.number]).columns

# Impute numeric variables with median
for col in numeric_cols:
    df_tokyo_q3_2025[col] = df_tokyo_q3_2025[col].fillna(df_tokyo_q3_2025[col].median())

# Impute categorical variables with explicit category
for col in categorical_cols:
    df_tokyo_q3_2025[col] = df_tokyo_q3_2025[col].fillna("missing")

# Sanity check
df_tokyo_q3_2025.isnull().sum().sort_values(ascending=False).head()

neighbourhood_group_cleansed    27945
calendar_updated                27945
id                                  0
number_of_reviews_ltm               0
number_of_reviews                   0
dtype: int64
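
The two columns that still show 27,945 missing values contain no observed values at all. A minimal sketch on a toy frame (the column names `beds` and `empty_col` are made up for illustration) suggests why median imputation leaves such columns unchanged: the median of an all-missing column is itself missing.

```python
import numpy as np
import pandas as pd

# Toy frame: "empty_col" mimics a column with no observed values
toy = pd.DataFrame({
    "beds": [1.0, np.nan, 3.0],
    "empty_col": [np.nan, np.nan, np.nan],
})

for col in toy.columns:
    # median() skips NaN; for an all-NaN column the median is itself NaN,
    # so fillna() has nothing to fill with
    toy[col] = toy[col].fillna(toy[col].median())

remaining_missing = toy.isnull().sum()
print(remaining_missing)
```

Columns like this carry no information, which is why they are dropped in a later cell.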

**Clean and Standardize Numeric Variables**

- Clean the `price` column to ensure it is numeric  
- Winsorize numeric variables (1st–99th percentile) to reduce the impact of extreme outliers

In [155]:
# Clean and standardize numeric variables

# Fix price variable (it may contain strings or missing)
df_tokyo_q3_2025["price"] = df_tokyo_q3_2025["price"].replace("missing", np.nan)

df_tokyo_q3_2025["price"] = (
    df_tokyo_q3_2025["price"]
    .astype(str)
    .str.replace("$", "", regex=False)
    .str.replace(",", "", regex=False)
)

df_tokyo_q3_2025["price"] = pd.to_numeric(df_tokyo_q3_2025["price"], errors="coerce")

# Impute missing price values with median
df_tokyo_q3_2025["price"] = df_tokyo_q3_2025["price"].fillna(df_tokyo_q3_2025["price"].median())

# Update numeric columns after cleaning price
numeric_cols = df_tokyo_q3_2025.select_dtypes(include=[np.number]).columns

# Cap extreme values (simple winsorization)
for col in numeric_cols:
    lower = df_tokyo_q3_2025[col].quantile(0.01)
    upper = df_tokyo_q3_2025[col].quantile(0.99)
    df_tokyo_q3_2025[col] = df_tokyo_q3_2025[col].clip(lower, upper)

# Sanity check
df_tokyo_q3_2025[numeric_cols].describe().T.head()

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
id,27945.0,9.498094e+17,5.379486e+17,10516390.0,7.624382e+17,1.169462e+18,1.362098e+18,1.508563e+18
scrape_id,27945.0,20250930000000.0,9.773612,20250930000000.0,20250930000000.0,20250930000000.0,20250930000000.0,20250930000000.0
host_id,27945.0,378910900.0,217677800.0,7927902.0,189502000.0,425423500.0,557366500.0,711472700.0
host_listings_count,27945.0,30.10864,40.01565,1.0,5.0,14.0,40.0,194.0
host_total_listings_count,27945.0,38.6531,57.28311,1.0,6.0,17.0,45.0,342.0
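
Because the loop above runs over every numeric column, identifier columns such as `id` and `host_id` are capped as well. The sketch below shows one possible way to winsorize while leaving identifiers untouched. The helper `winsorize_columns` and its `skip` argument are hypothetical names introduced here, not part of the original workflow.

```python
import pandas as pd

def winsorize_columns(df, skip=(), lower_q=0.01, upper_q=0.99):
    """Clip numeric columns to the [lower_q, upper_q] quantile range, leaving `skip` untouched."""
    out = df.copy()
    numeric = out.select_dtypes(include="number").columns
    for col in numeric:
        if col in skip:
            continue  # identifiers are labels, not measurements
        lower = out[col].quantile(lower_q)
        upper = out[col].quantile(upper_q)
        out[col] = out[col].clip(lower, upper)
    return out

# Toy data: 101 listings, one with an extreme price
toy = pd.DataFrame({
    "id": range(1, 102),
    "price": [100] * 100 + [1_000_000],
})
capped = winsorize_columns(toy, skip=("id",))
print(capped.describe().T)
```

In this toy example the extreme price is capped while all 101 identifiers stay distinct.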


In [156]:
# Ensure target variable (price) is numeric

df_tokyo_q3_2025["price"] = (
    df_tokyo_q3_2025["price"]
    .astype(str)
    .str.replace("$", "", regex=False)
    .str.replace(",", "", regex=False)
)

df_tokyo_q3_2025["price"] = pd.to_numeric(df_tokyo_q3_2025["price"], errors="coerce")

# Drop rows where price is still missing (none remain after the median imputation above)
df_tokyo_q3_2025 = df_tokyo_q3_2025.dropna(subset=["price"])

# Sanity check
df_tokyo_q3_2025["price"].dtype, df_tokyo_q3_2025["price"].isna().sum(), df_tokyo_q3_2025.shape

(dtype('float64'), np.int64(0), (27945, 79))

In [157]:
# Drop unusable variables (except id)

all_missing_cols = [
    "neighbourhood_group_cleansed",
    "calendar_updated"
]

id_cols = [
    "scrape_id"
]

cols_to_drop = [col for col in all_missing_cols + id_cols if col in df_tokyo_q3_2025.columns]
df_tokyo_q3_2025 = df_tokyo_q3_2025.drop(columns=cols_to_drop)

print("Dropped columns:", cols_to_drop)
print("New shape:", df_tokyo_q3_2025.shape)

Dropped columns: ['neighbourhood_group_cleansed', 'calendar_updated', 'scrape_id']
New shape: (27945, 76)


**Variable Selection for Modelling**

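- Exclude the target (`price`) and `id` from the feature set in every model cell  
- For LASSO, also drop URLs, dates, and free-text fields by keeping only five structured categorical variables (see Step 6)  
- Keep structured listing, host, location, and amenity variables  
- Use the same variables across all datasets for out-of-sample comparison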

**Extract Amenities**

- Convert the amenities text field into binary indicators  
- Use the same amenity list as Part I for consistency

In [158]:
# Extract amenities

# Convert amenities column to string
df_tokyo_q3_2025["amenities"] = df_tokyo_q3_2025["amenities"].astype(str)

# Same selected amenities list as Part I
amenities_list = [
    "Wifi",
    "Kitchen",
    "Air conditioning",
    "Heating",
    "Washer",
    "Dryer",
    "Elevator",
    "TV"
]

# Create binary indicators for each amenity
for amenity in amenities_list:
    df_tokyo_q3_2025[f"amenity_{amenity.lower().replace(' ', '_')}"] = (
        df_tokyo_q3_2025["amenities"].str.contains(amenity, case=False, regex=False).astype(int)
    )

# Drop original amenities text field
df_tokyo_q3_2025 = df_tokyo_q3_2025.drop(columns=["amenities"])

# Sanity check
df_tokyo_q3_2025.filter(like="amenity_").head()

Unnamed: 0,amenity_wifi,amenity_kitchen,amenity_air_conditioning,amenity_heating,amenity_washer,amenity_dryer,amenity_elevator,amenity_tv
0,1,1,1,1,1,1,0,1
1,1,0,1,1,0,1,0,1
2,1,1,0,1,0,1,0,1
3,1,1,0,1,1,1,0,1
4,1,0,0,1,1,1,0,1
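
Substring matching is simple, but it can also flag related amenities: `"Dryer"` matches `"Hair dryer"` and `"Washer"` matches `"Dishwasher"`. The sketch below compares substring matching with exact matching on a toy frame. It assumes the `amenities` field is a JSON-formatted list stored as a string. The helper `parse_amenities` is a hypothetical name.

```python
import json
import pandas as pd

def parse_amenities(raw):
    """Return a set of lower-cased amenity names; empty set if `raw` is not a JSON list."""
    try:
        items = json.loads(raw)
    except (TypeError, ValueError):
        return set()
    if not isinstance(items, list):
        return set()
    return {str(item).strip().lower() for item in items}

toy = pd.DataFrame({
    "amenities": [
        '["Wifi", "Hair dryer", "Dishwasher"]',
        '["Dryer", "Washer", "TV"]',
        "missing",  # placeholder used for missing values earlier in the notebook
    ]
})

parsed = toy["amenities"].apply(parse_amenities)

for amenity in ["Washer", "Dryer", "TV"]:
    col = f"amenity_{amenity.lower().replace(' ', '_')}"
    # Substring match, as in the cell above
    toy[f"{col}_substring"] = (
        toy["amenities"].str.contains(amenity, case=False, regex=False).astype(int)
    )
    # Exact match against the parsed list
    toy[f"{col}_exact"] = parsed.apply(lambda s: int(amenity.lower() in s))

print(toy.filter(like="amenity_"))
```

Exact matching is stricter and would miss longer variants of an amenity name, so which rule is preferable depends on how the indicator is meant to be read.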


**Save Cleaned Dataset**

In [159]:
# Save cleaned Tokyo Q3 2025 dataset (overwrite if exists)

output_path = "Data/Cleaned/Tokyo_listings/tokyo_listings_q3_2025_clean.csv"
df_tokyo_q3_2025.to_csv(output_path, index=False)

print("Cleaned dataset saved (overwritten if existed):")
print(output_path)

Cleaned dataset saved (overwritten if existed):
Data/Cleaned/Tokyo_listings/tokyo_listings_q3_2025_clean.csv


**Encoding of Categorical Variables**

Categorical variables are one-hot encoded inside the model pipelines (Step 6), so the same preprocessing is applied to every dataset.

##### B. Another City in the Same Region: Hong Kong

- **City, Country**: Hong Kong, China  
- **Dataset**: `listings.csv` (loaded directly from the GitHub repository)  
- **Time period**: Latest available data (scraped 23–24 September 2025)  
- **Purpose**: Test geographic validity by applying the models to another city in the same broader region (Asia)
- **Reasons for Choosing Hong Kong**
    - *Same region*: Hong Kong and Tokyo are both located in East Asia, allowing for a regional validity check
    - *Highly urban and dense*: Both cities are large, high-density urban areas where Airbnb listings are mostly apartments
    - *Strong short-term rental demand*: Tourism and business travel play an important role in both markets, making pricing more comparable

The same data wrangling and feature engineering steps from Part I are applied to this dataset before the five core models are re-estimated and evaluated on it in Step 6.

Load the Dataset

In [160]:
# Load dataset
url_hk_latest = "https://raw.githubusercontent.com/zarizachow/Data-Analysis-3/refs/heads/main/Assignment-1/Data/Raw/HongKong_listings/listings.csv"
df_hk_latest = pd.read_csv(url_hk_latest)

# Basic inspection
print("Shape (rows, columns):", df_hk_latest.shape)

print("\nColumn names:")
print(df_hk_latest.columns)

print("\nData types and missing values:")
df_hk_latest.info()

print("\nFirst 5 rows:")
df_hk_latest.head()

Shape (rows, columns): (6801, 79)

Column names:
Index(['id', 'listing_url', 'scrape_id', 'last_scraped', 'source', 'name',
       'description', 'neighborhood_overview', 'picture_url', 'host_id',
       'host_url', 'host_name', 'host_since', 'host_location', 'host_about',
       'host_response_time', 'host_response_rate', 'host_acceptance_rate',
       'host_is_superhost', 'host_thumbnail_url', 'host_picture_url',
       'host_neighbourhood', 'host_listings_count',
       'host_total_listings_count', 'host_verifications',
       'host_has_profile_pic', 'host_identity_verified', 'neighbourhood',
       'neighbourhood_cleansed', 'neighbourhood_group_cleansed', 'latitude',
       'longitude', 'property_type', 'room_type', 'accommodates', 'bathrooms',
       'bathrooms_text', 'bedrooms', 'beds', 'amenities', 'price',
       'minimum_nights', 'maximum_nights', 'minimum_minimum_nights',
       'maximum_minimum_nights', 'minimum_maximum_nights',
       'maximum_maximum_nights', 'minimum_nigh

Unnamed: 0,id,listing_url,scrape_id,last_scraped,source,name,description,neighborhood_overview,picture_url,host_id,...,review_scores_communication,review_scores_location,review_scores_value,license,instant_bookable,calculated_host_listings_count,calculated_host_listings_count_entire_homes,calculated_host_listings_count_private_rooms,calculated_host_listings_count_shared_rooms,reviews_per_month
0,103760,https://www.airbnb.com/rooms/103760,20250923203502,2025-09-24,city scrape,Central Centre 5 min walk to/from Central MTR,"Located right in the heart of Central, this 2 ...",,https://a0.muscache.com/pictures/815221/056993...,304876,...,4.63,4.72,4.41,,f,1,1,0,0,1.82
1,248140,https://www.airbnb.com/rooms/248140,20250923203502,2025-09-24,city scrape,Bright Studio - Soho - Central HK,Our bright studio apartment is perfect for vis...,The local neighbourhood is quiet and relaxing....,https://a0.muscache.com/pictures/4e5463bc-38be...,1300549,...,4.98,4.76,4.8,,f,1,1,0,0,1.35
2,263081,https://www.airbnb.com/rooms/263081,20250923203502,2025-09-23,city scrape,"3 睡房, 2500 平方呎 有工人 Family Friendly! 在中環半山 有車","Mid Level on the top of central, Next to the m...","We are in the most luxury area of HK, conduit ...",https://a0.muscache.com/pictures/7348025/73278...,1370155,...,,,,,f,1,0,1,0,
3,306515,https://www.airbnb.com/rooms/306515,20250923203502,2025-09-24,city scrape,"HongKong,Central Bright Double Room",,,https://a0.muscache.com/pictures/3335548/00a8d...,1576511,...,4.84,4.89,4.79,,f,1,1,0,0,0.11
4,378047,https://www.airbnb.com/rooms/378047,20250923203502,2025-09-24,previous scrape,LUXURY HONG KONG COUNTRY COW SHED,Your bijoux getaway is a one story cottage in ...,"We are in the South Lantau foot hills, 20 min ...",https://a0.muscache.com/pictures/hosting/Hosti...,1805628,...,4.94,4.52,4.62,,f,1,1,0,0,0.51


**Handle Missing Values**

- Identify missing values across numeric and categorical variables to avoid model errors  
- Impute numeric variables using median values  
- Treat missing categorical values as `"missing"` to avoid dropping observations

In [161]:
# Handle missing values

# Separate numeric and categorical columns
numeric_cols = df_hk_latest.select_dtypes(include=[np.number]).columns
categorical_cols = df_hk_latest.select_dtypes(exclude=[np.number]).columns

# Impute numeric variables with median
for col in numeric_cols:
    df_hk_latest[col] = df_hk_latest[col].fillna(df_hk_latest[col].median())

# Impute categorical variables with explicit category
for col in categorical_cols:
    df_hk_latest[col] = df_hk_latest[col].fillna("missing")

# Sanity check
df_hk_latest.isnull().sum().sort_values(ascending=False).head()

calendar_updated                6801
license                         6801
neighbourhood_group_cleansed    6801
has_availability                   0
number_of_reviews                  0
dtype: int64

**Clean and Standardize Numeric Variables**

- Clean the `price` column to ensure it is numeric  
- Winsorize numeric variables (1st–99th percentile) to reduce the impact of extreme outliers

In [162]:
# Clean and standardize numeric variables

# Fix price variable (it may contain strings or missing)
df_hk_latest["price"] = df_hk_latest["price"].replace("missing", np.nan)

df_hk_latest["price"] = (
    df_hk_latest["price"]
    .astype(str)
    .str.replace("$", "", regex=False)
    .str.replace(",", "", regex=False)
)

df_hk_latest["price"] = pd.to_numeric(df_hk_latest["price"], errors="coerce")

# Impute missing price values with median
df_hk_latest["price"] = df_hk_latest["price"].fillna(df_hk_latest["price"].median())

# Update numeric columns after cleaning price
numeric_cols = df_hk_latest.select_dtypes(include=[np.number]).columns

# Cap extreme values (simple winsorization)
for col in numeric_cols:
    lower = df_hk_latest[col].quantile(0.01)
    upper = df_hk_latest[col].quantile(0.99)
    df_hk_latest[col] = df_hk_latest[col].clip(lower, upper)

# Sanity check
df_hk_latest[numeric_cols].describe().T.head()

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
id,6801.0,6.270046e+17,5.680543e+17,1910193.0,32996790.0,8.266702e+17,1.131615e+18,1.492672e+18
scrape_id,6801.0,20250920000000.0,1.738409,20250920000000.0,20250920000000.0,20250920000000.0,20250920000000.0,20250920000000.0
host_id,6801.0,186793500.0,208110900.0,1654196.0,17232270.0,97240130.0,353326600.0,689219700.0
host_listings_count,6801.0,112.2978,143.4581,1.0,4.0,30.0,217.0,517.0
host_total_listings_count,6801.0,145.2341,197.5859,1.0,7.0,38.0,309.0,885.0


In [163]:
# Ensure target variable (price) is numeric

df_hk_latest["price"] = (
    df_hk_latest["price"]
    .astype(str)
    .str.replace("$", "", regex=False)
    .str.replace(",", "", regex=False)
)

df_hk_latest["price"] = pd.to_numeric(df_hk_latest["price"], errors="coerce")

# Drop rows where price is still missing (none remain after the median imputation above)
df_hk_latest = df_hk_latest.dropna(subset=["price"])

# Sanity check
df_hk_latest["price"].dtype, df_hk_latest["price"].isna().sum(), df_hk_latest.shape

(dtype('float64'), np.int64(0), (6801, 79))

In [164]:
# Drop unusable variables (except id)

all_missing_cols = [
    "neighbourhood_group_cleansed",
    "calendar_updated"
]

id_cols = [
    "scrape_id"
]

cols_to_drop = [col for col in all_missing_cols + id_cols if col in df_hk_latest.columns]
df_hk_latest = df_hk_latest.drop(columns=cols_to_drop)

print("Dropped columns:", cols_to_drop)
print("New shape:", df_hk_latest.shape)

Dropped columns: ['neighbourhood_group_cleansed', 'calendar_updated', 'scrape_id']
New shape: (6801, 76)


**Variable Selection for Modelling**

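Variables that are not useful for prediction are removed in the model cells (Step 6), not in this wrangling step.

- `price` (target) and `id` are dropped from the feature set for every model  
- The LASSO cell additionally drops URLs, dates, and free-text fields by keeping only five structured categorical variables  
- Structured listing, host, and location variables are kept for modelling consistency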

**Extract Amenities**

- Convert the amenities text field into binary indicators  
- Use the same amenity list as Part I for consistency

In [165]:
# Extract amenities

# Convert amenities column to string
df_hk_latest["amenities"] = df_hk_latest["amenities"].astype(str)

# Same selected amenities list as Part I
amenities_list = [
    "Wifi",
    "Kitchen",
    "Air conditioning",
    "Heating",
    "Washer",
    "Dryer",
    "Elevator",
    "TV"
]

# Create binary indicators for each amenity
for amenity in amenities_list:
    df_hk_latest[f"amenity_{amenity.lower().replace(' ', '_')}"] = (
        df_hk_latest["amenities"].str.contains(amenity, case=False, regex=False).astype(int)
    )

# Drop original amenities text field
df_hk_latest = df_hk_latest.drop(columns=["amenities"])

# Sanity check
df_hk_latest.filter(like="amenity_").head()

Unnamed: 0,amenity_wifi,amenity_kitchen,amenity_air_conditioning,amenity_heating,amenity_washer,amenity_dryer,amenity_elevator,amenity_tv
0,1,1,0,1,1,1,0,1
1,1,1,0,1,0,1,1,1
2,1,1,1,1,1,1,1,1
3,1,1,1,0,1,1,1,1
4,1,1,1,1,1,1,0,0


**Save Cleaned Dataset**

In [166]:
# Save cleaned Hong Kong dataset (overwrite if exists)

output_path = "Data/Cleaned/HongKong_listings/hongkong_listings_latest_clean.csv"
df_hk_latest.to_csv(output_path, index=False)

print("Cleaned dataset saved (overwritten if existed):")
print(output_path)

Cleaned dataset saved (overwritten if existed):
Data/Cleaned/HongKong_listings/hongkong_listings_latest_clean.csv


**Encoding of Categorical Variables**

Categorical variables are one-hot encoded inside the model pipelines (Step 6), so the same preprocessing is applied to every dataset.
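
A stricter external-validity check would score a pipeline fitted on the core data directly on a live dataset, without refitting. The sketch below illustrates that idea on simulated listings. The data generator `make_toy_listings`, the frames `df_core` and `df_live`, and the price shift are all assumptions made for the example, not results from the Airbnb data.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

rng = np.random.default_rng(42)

def make_toy_listings(n, price_shift=0.0):
    """Simulate listings; `price_shift` mimics a market-wide price change."""
    accommodates = rng.integers(1, 7, size=n)
    room_type = rng.choice(["Entire home/apt", "Private room"], size=n)
    price = (
        50 * accommodates
        + 100 * (room_type == "Entire home/apt")
        + price_shift
        + rng.normal(0, 10, size=n)
    )
    return pd.DataFrame(
        {"accommodates": accommodates, "room_type": room_type, "price": price}
    )

df_core = make_toy_listings(500)                   # stands in for the Part I data
df_live = make_toy_listings(200, price_shift=40)   # stands in for a later or other market

features = ["accommodates", "room_type"]
preprocessor = ColumnTransformer([
    ("num", StandardScaler(), ["accommodates"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["room_type"]),
])
model = Pipeline([("preprocessor", preprocessor), ("model", LinearRegression())])

# Fit once on the core data only
model.fit(df_core[features], df_core["price"])

# Score on both datasets with the same fitted pipeline (no refit on live data)
rmse_core = np.sqrt(mean_squared_error(df_core["price"], model.predict(df_core[features])))
rmse_live = np.sqrt(mean_squared_error(df_live["price"], model.predict(df_live[features])))

print("RMSE on core data (in-sample):", round(rmse_core, 2))
print("RMSE on live data (no refit):", round(rmse_live, 2))
```

In this simulation the error on the live data is larger because prices shifted after the model was fitted, which is the kind of drift such a check is meant to reveal.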

### Step 6: Use the 5 Core Models on the 'Live' Datasets

##### A. Later Date: Tokyo (Q3 2025)

**OLS Model**

In [167]:
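# OLS model on Tokyo Q3 2025 dataset

# SimpleImputer is used in the pipelines below but is not part of the Step 0 imports
from sklearn.impute import SimpleImputer
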
# Define target and features
y_tokyo_q3_2025 = df_tokyo_q3_2025["price"]
X_tokyo_q3_2025 = df_tokyo_q3_2025.drop(columns=["price", "id"])

# Train-test split (same random state for consistency)
X_train, X_test, y_train, y_test = train_test_split(
    X_tokyo_q3_2025, y_tokyo_q3_2025, test_size=0.2, random_state=42
)

# Identify numeric and categorical features
numeric_features = X_train.select_dtypes(include=[np.number]).columns
categorical_features = X_train.select_dtypes(exclude=[np.number]).columns

# Preprocessing pipelines
numeric_transformer = Pipeline(
    steps=[
        ("imputer", SimpleImputer(strategy="median")),
        ("scaler", StandardScaler())
    ]
)

categorical_transformer = Pipeline(
    steps=[
        ("imputer", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore"))
    ]
)

preprocessor = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, numeric_features),
        ("cat", categorical_transformer, categorical_features)
    ]
)

# OLS pipeline
ols_tokyo_q3_2025 = Pipeline(
    steps=[
        ("preprocessor", preprocessor),
        ("model", LinearRegression())
    ]
)

# Fit + predict with timing
start_time = time.time()

ols_tokyo_q3_2025.fit(X_train, y_train)
y_pred = ols_tokyo_q3_2025.predict(X_test)

runtime_ols_tokyo_q3_2025 = time.time() - start_time

# RMSE
rmse_ols_tokyo_q3_2025 = np.sqrt(mean_squared_error(y_test, y_pred))

print("Tokyo Q3 2025 - OLS RMSE:", rmse_ols_tokyo_q3_2025)
print("Tokyo Q3 2025 - OLS Runtime (s):", round(runtime_ols_tokyo_q3_2025, 2))

Tokyo Q3 2025 - OLS RMSE: 11416.208545812342
Tokyo Q3 2025 - OLS Runtime (s): 6.97
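
The fit, predict, time, and RMSE pattern is repeated for every model in this step. One way to keep the cells shorter is a small helper, sketched below on synthetic data. The function name `fit_and_score` and the `results` dictionary are hypothetical and not part of the original notebook.

```python
import time

import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

def fit_and_score(model, X_train, X_test, y_train, y_test):
    """Fit `model`, predict on the test set, and return test RMSE and runtime in seconds."""
    start = time.perf_counter()
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    runtime = time.perf_counter() - start
    rmse = float(np.sqrt(mean_squared_error(y_test, y_pred)))
    return {"rmse": rmse, "runtime_s": round(runtime, 2)}

# Synthetic regression data stands in for the listings data
X, y = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

results = {
    "OLS": fit_and_score(LinearRegression(), X_train, X_test, y_train, y_test),
    "LASSO": fit_and_score(Lasso(alpha=1.0, max_iter=3000), X_train, X_test, y_train, y_test),
}
print(results)
```

The same helper would accept any scikit-learn estimator or `Pipeline`, so each model cell could reduce to building the pipeline and calling the function.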


**LASSO Model**

In [168]:
# LASSO on Tokyo Q3 2025 dataset

# Restrict categorical variables (same as Part I)
categorical_keep = [
    "room_type",
    "property_type",
    "neighbourhood_cleansed",
    "host_is_superhost",
    "instant_bookable"
]

categorical_keep = [c for c in categorical_keep if c in df_tokyo_q3_2025.columns]

# Define target and features
y_tokyo_q3_2025 = df_tokyo_q3_2025["price"]
X_tokyo_q3_2025 = df_tokyo_q3_2025.drop(columns=["price", "id"])

# Drop other object/string columns
object_cols = X_tokyo_q3_2025.select_dtypes(exclude=[np.number]).columns.tolist()
drop_cols = [c for c in object_cols if c not in categorical_keep]
X_tokyo_q3_2025 = X_tokyo_q3_2025.drop(columns=drop_cols)

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X_tokyo_q3_2025, y_tokyo_q3_2025, test_size=0.2, random_state=42
)

# Identify feature types
numeric_features = X_train.select_dtypes(include=[np.number]).columns
categorical_features = X_train.select_dtypes(exclude=[np.number]).columns

# Preprocessing
numeric_transformer = Pipeline(
    steps=[
        ("imputer", SimpleImputer(strategy="median")),
        ("scaler", StandardScaler())
    ]
)

categorical_transformer = Pipeline(
    steps=[
        ("imputer", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore"))
    ]
)

preprocessor_lasso = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, numeric_features),
        ("cat", categorical_transformer, categorical_features)
    ]
)

# LASSO pipeline
lasso_tokyo_q3_2025 = Pipeline(
    steps=[
        ("preprocessor", preprocessor_lasso),
        ("model", Lasso(alpha=1.0, max_iter=3000))
    ]
)

# Fit + predict with timing
start_time = time.time()

lasso_tokyo_q3_2025.fit(X_train, y_train)
y_pred = lasso_tokyo_q3_2025.predict(X_test)

runtime_lasso_tokyo_q3_2025 = time.time() - start_time

# RMSE
rmse_lasso_tokyo_q3_2025 = np.sqrt(mean_squared_error(y_test, y_pred))

print("Tokyo Q3 2025 - LASSO RMSE:", rmse_lasso_tokyo_q3_2025)
print("Tokyo Q3 2025 - LASSO Runtime (s):", round(runtime_lasso_tokyo_q3_2025, 2))


Tokyo Q3 2025 - LASSO RMSE: 12501.681388469251
Tokyo Q3 2025 - LASSO Runtime (s): 14.15


**Random Forest Model**

In [169]:
# Random Forest model on Tokyo Q3 2025 dataset

# Define target and features
y_tokyo_q3_2025 = df_tokyo_q3_2025["price"]
X_tokyo_q3_2025 = df_tokyo_q3_2025.drop(columns=["price", "id"])

# Train-test split (same random state for consistency)
X_train, X_test, y_train, y_test = train_test_split(
    X_tokyo_q3_2025, y_tokyo_q3_2025, test_size=0.2, random_state=42
)

# Identify feature types
numeric_features = X_train.select_dtypes(include=[np.number]).columns
categorical_features = X_train.select_dtypes(exclude=[np.number]).columns

# Preprocessing (same as Part I)
numeric_transformer = Pipeline(
    steps=[("imputer", SimpleImputer(strategy="median"))]
)

categorical_transformer = Pipeline(
    steps=[
        ("imputer", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore"))
    ]
)

preprocessor_rf = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, numeric_features),
        ("cat", categorical_transformer, categorical_features)
    ]
)

# Random Forest pipeline (same settings as Part I)
rf_tokyo_q3_2025 = Pipeline(
    steps=[
        ("preprocessor", preprocessor_rf),
        ("model", RandomForestRegressor(
            n_estimators=100,
            max_depth=20,
            min_samples_leaf=5,
            max_features="sqrt",
            random_state=42,
            n_jobs=-1
        ))
    ]
)

# Fit + predict with timing
start_time = time.time()

rf_tokyo_q3_2025.fit(X_train, y_train)
y_pred = rf_tokyo_q3_2025.predict(X_test)

runtime_rf_tokyo_q3_2025 = time.time() - start_time

# RMSE
rmse_rf_tokyo_q3_2025 = np.sqrt(mean_squared_error(y_test, y_pred))

print("Tokyo Q3 2025 - Random Forest RMSE:", rmse_rf_tokyo_q3_2025)
print("Tokyo Q3 2025 - Random Forest Runtime (s):", round(runtime_rf_tokyo_q3_2025, 2))

Tokyo Q3 2025 - Random Forest RMSE: 16035.611614317326
Tokyo Q3 2025 - Random Forest Runtime (s): 1.31


**Gradient Boosting Model**

In [170]:
# Gradient Boosting on Tokyo Q3 2025

# Define target and features
y_tokyo_q3_2025 = df_tokyo_q3_2025["price"]
X_tokyo_q3_2025 = df_tokyo_q3_2025.drop(columns=["price", "id"])

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X_tokyo_q3_2025, y_tokyo_q3_2025, test_size=0.2, random_state=42
)

# Identify feature types
numeric_features = X_train.select_dtypes(include=[np.number]).columns
categorical_features = X_train.select_dtypes(exclude=[np.number]).columns

# Preprocessing
numeric_transformer = Pipeline(
    steps=[("imputer", SimpleImputer(strategy="median"))]
)

categorical_transformer = Pipeline(
    steps=[
        ("imputer", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore"))
    ]
)

preprocessor_gb = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, numeric_features),
        ("cat", categorical_transformer, categorical_features)
    ]
)

# Gradient Boosting pipeline
gb_tokyo_q3_2025 = Pipeline(
    steps=[
        ("preprocessor", preprocessor_gb),
        ("model", GradientBoostingRegressor(random_state=42))
    ]
)

# Fit + predict with timing
start_time = time.time()

gb_tokyo_q3_2025.fit(X_train, y_train)
y_pred = gb_tokyo_q3_2025.predict(X_test)

runtime_gb_tokyo_q3_2025 = time.time() - start_time

# RMSE
rmse_gb_tokyo_q3_2025 = np.sqrt(mean_squared_error(y_test, y_pred))

print("Tokyo Q3 2025 - Gradient Boosting RMSE:", rmse_gb_tokyo_q3_2025)
print("Tokyo Q3 2025 - Gradient Boosting Runtime (s):", round(runtime_gb_tokyo_q3_2025, 2))

Tokyo Q3 2025 - Gradient Boosting RMSE: 10117.57589237642
Tokyo Q3 2025 - Gradient Boosting Runtime (s): 27.15


**Decision Tree (CART)**

In [171]:
# Decision Tree (CART) on Tokyo Q3 2025

# Define target and features
y_tokyo_q3_2025 = df_tokyo_q3_2025["price"]
X_tokyo_q3_2025 = df_tokyo_q3_2025.drop(columns=["price", "id"])

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X_tokyo_q3_2025, y_tokyo_q3_2025, test_size=0.2, random_state=42
)

# Identify feature types
numeric_features = X_train.select_dtypes(include=[np.number]).columns
categorical_features = X_train.select_dtypes(exclude=[np.number]).columns

# Preprocessing (same as RF/GB)
numeric_transformer = Pipeline(
    steps=[("imputer", SimpleImputer(strategy="median"))]
)

categorical_transformer = Pipeline(
    steps=[
        ("imputer", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore"))
    ]
)

preprocessor_dt = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, numeric_features),
        ("cat", categorical_transformer, categorical_features)
    ]
)

# Decision Tree pipeline
dt_tokyo_q3_2025 = Pipeline(
    steps=[
        ("preprocessor", preprocessor_dt),
        ("model", DecisionTreeRegressor(random_state=42))
    ]
)

# Fit + predict with timing
start_time = time.time()

dt_tokyo_q3_2025.fit(X_train, y_train)
y_pred = dt_tokyo_q3_2025.predict(X_test)

runtime_dt_tokyo_q3_2025 = time.time() - start_time

# RMSE
rmse_dt_tokyo_q3_2025 = np.sqrt(mean_squared_error(y_test, y_pred))

print("Tokyo Q3 2025 - Decision Tree RMSE:", rmse_dt_tokyo_q3_2025)
print("Tokyo Q3 2025 - Decision Tree Runtime (s):", round(runtime_dt_tokyo_q3_2025, 2))

Tokyo Q3 2025 - Decision Tree RMSE: 10271.0395070093
Tokyo Q3 2025 - Decision Tree Runtime (s): 28.28


##### Horserace Table (Tokyo Q3 2025)

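The horserace table compares the 5 models in terms of predictive accuracy and computation time.

- **RMSE** measures prediction error on the 20% hold-out test set of the Tokyo Q3 2025 data. Lower RMSE values indicate better predictive performance.
- **Runtime** captures total model fitting and prediction time in seconds
- All models are re-estimated on the same rows with the same train–test split (`random_state=42`)
- LASSO uses a restricted set of categorical features, while the other four models use all remaining columns, so its result is not a like-for-like comparison on features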

| Model                    | RMSE (Test Set) | Runtime (seconds) |
|--------------------------|----------------:|------------------:|
| OLS                      | 11,416.21       | 6.97              |
| LASSO                    | 12,501.68       | 14.15             |
| Random Forest            | 16,035.61       | 1.31              |
| Gradient Boosting        | 10,117.58       | 27.15             |
| Decision Tree (CART)     | 10,271.04       | 28.28             |
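
The ranking can also be derived programmatically. The sketch below rebuilds the table from the reported values (typed in by hand here so the block is self-contained; in the notebook the `rmse_*` and `runtime_*` variables would be used instead) and sorts it by RMSE.

```python
import pandas as pd

# Values copied from the horserace table above
horserace = pd.DataFrame(
    {
        "Model": ["OLS", "LASSO", "Random Forest", "Gradient Boosting", "Decision Tree (CART)"],
        "RMSE": [11416.21, 12501.68, 16035.61, 10117.58, 10271.04],
        "Runtime (s)": [6.97, 14.15, 1.31, 27.15, 28.28],
    }
).set_index("Model")

ranked = horserace.sort_values("RMSE")            # lowest test RMSE first
best_model = ranked.index[0]
slowest_model = horserace["Runtime (s)"].idxmax()

print(ranked)
print("Lowest RMSE:", best_model)
print("Longest runtime:", slowest_model)
```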

##### Discussion of Performance (Tokyo Q3 2025)

The horserace table shows clear differences across the models in terms of fit and runtime.

- **OLS**  
  OLS is a solid baseline: its RMSE (11,416) is the third lowest and its runtime is moderate (6.97 s).

- **LASSO**  
  LASSO performs worse than OLS (12,502 vs 11,416). Because `alpha=1.0` is not tuned and LASSO uses a restricted set of categorical variables, this suggests that shrinkage/regularisation in this setup does not improve prediction here.

- **Random Forest**  
  Random Forest is the fastest model in this run (1.31 s), but it has the highest RMSE (16,036), meaning it performs the worst on Tokyo Q3 2025.

- **Gradient Boosting**  
  Gradient Boosting has the lowest RMSE (10,118), so it predicts best here, but it is one of the two slowest models (27.15 s).

- **Decision Tree (CART)**  
  CART performs almost as well as Gradient Boosting in RMSE (10,271), but it has the longest runtime (28.28 s) and can be less stable because it is a single tree.

Overall, the results show a trade-off between accuracy and computation time.  
In this dataset, **Gradient Boosting gives the best accuracy**, while **OLS remains a reasonable and simpler baseline**.

##### B. Another City in the Same Region: Hong Kong

**OLS Model**

In [None]:
# OLS Model on Hong Kong

# Define target and features
y_hk = df_hk_latest["price"]
X_hk = df_hk_latest.drop(columns=["price", "id"])

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X_hk, y_hk, test_size=0.2, random_state=42
)

# Identify numeric and categorical features
numeric_features = X_train.select_dtypes(include=[np.number]).columns
categorical_features = X_train.select_dtypes(exclude=[np.number]).columns

# Preprocessing pipelines
numeric_transformer = Pipeline(
    steps=[
        ("imputer", SimpleImputer(strategy="median")),
        ("scaler", StandardScaler())
    ]
)

categorical_transformer = Pipeline(
    steps=[
        ("imputer", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore"))
    ]
)

preprocessor = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, numeric_features),
        ("cat", categorical_transformer, categorical_features)
    ]
)

# OLS pipeline
ols_hk = Pipeline(
    steps=[
        ("preprocessor", preprocessor),
        ("model", LinearRegression())
    ]
)

# Fit + predict with timing
start_time = time.time()

ols_hk.fit(X_train, y_train)
y_pred = ols_hk.predict(X_test)

runtime_ols_hk = time.time() - start_time
rmse_ols_hk = np.sqrt(mean_squared_error(y_test, y_pred))

print("Hong Kong - OLS RMSE:", rmse_ols_hk)
print("Hong Kong - OLS Runtime (s):", round(runtime_ols_hk, 2))

Hong Kong - OLS RMSE: 455.9624123770325
Hong Kong - OLS Runtime (s): 1.38


**LASSO Model**

In [None]:
# LASSO Model on Hong Kong

# Restrict categorical variables (same as Part I)
categorical_keep = [
    "room_type",
    "property_type",
    "neighbourhood_cleansed",
    "host_is_superhost",
    "instant_bookable"
]
categorical_keep = [c for c in categorical_keep if c in df_hk_latest.columns]

# Define target and features
y_hk = df_hk_latest["price"]
X_hk = df_hk_latest.drop(columns=["price", "id"])

# Drop other object/string columns (keep only selected categoricals)
object_cols = X_hk.select_dtypes(exclude=[np.number]).columns.tolist()
drop_cols = [c for c in object_cols if c not in categorical_keep]
X_hk = X_hk.drop(columns=drop_cols)

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X_hk, y_hk, test_size=0.2, random_state=42
)

# Identify feature types
numeric_features = X_train.select_dtypes(include=[np.number]).columns
categorical_features = X_train.select_dtypes(exclude=[np.number]).columns

# Preprocessing
numeric_transformer = Pipeline(
    steps=[
        ("imputer", SimpleImputer(strategy="median")),
        ("scaler", StandardScaler())
    ]
)

categorical_transformer = Pipeline(
    steps=[
        ("imputer", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore"))
    ]
)

preprocessor_lasso = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, numeric_features),
        ("cat", categorical_transformer, categorical_features)
    ]
)

# LASSO pipeline
lasso_hk = Pipeline(
    steps=[
        ("preprocessor", preprocessor_lasso),
        ("model", Lasso(alpha=1.0, max_iter=3000))
    ]
)

# Fit + predict with timing
start_time = time.time()

lasso_hk.fit(X_train, y_train)
y_pred = lasso_hk.predict(X_test)

runtime_lasso_hk = time.time() - start_time
rmse_lasso_hk = np.sqrt(mean_squared_error(y_test, y_pred))

print("Hong Kong - LASSO RMSE:", rmse_lasso_hk)
print("Hong Kong - LASSO Runtime (s):", round(runtime_lasso_hk, 2))

Hong Kong - LASSO RMSE: 546.482964740583
Hong Kong - LASSO Runtime (s): 0.61


**Random Forest Model**

In [175]:
# Random Forest Model on Hong Kong

# Define target and features
y_hk = df_hk_latest["price"]
X_hk = df_hk_latest.drop(columns=["price", "id"])

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X_hk, y_hk, test_size=0.2, random_state=42
)

# Identify feature types
numeric_features = X_train.select_dtypes(include=[np.number]).columns
categorical_features = X_train.select_dtypes(exclude=[np.number]).columns

# Preprocessing
numeric_transformer = Pipeline(
    steps=[("imputer", SimpleImputer(strategy="median"))]
)

categorical_transformer = Pipeline(
    steps=[
        ("imputer", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore"))
    ]
)

preprocessor_rf = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, numeric_features),
        ("cat", categorical_transformer, categorical_features)
    ]
)

# Random Forest pipeline (same as Part I)
rf_hk = Pipeline(
    steps=[
        ("preprocessor", preprocessor_rf),
        ("model", RandomForestRegressor(
            n_estimators=100,
            max_depth=20,
            min_samples_leaf=5,
            max_features="sqrt",
            random_state=42,
            n_jobs=-1
        ))
    ]
)

# Fit + predict with timing
start_time = time.time()

rf_hk.fit(X_train, y_train)
y_pred = rf_hk.predict(X_test)

runtime_rf_hk = time.time() - start_time
rmse_rf_hk = np.sqrt(mean_squared_error(y_test, y_pred))

print("Hong Kong - Random Forest RMSE:", rmse_rf_hk)
print("Hong Kong - Random Forest Runtime (s):", round(runtime_rf_hk, 2))

Hong Kong - Random Forest RMSE: 636.6375035558883
Hong Kong - Random Forest Runtime (s): 0.43


**Gradient Boosting Model**

In [176]:
# Gradient Boosting Model on Hong Kong

# Define target and features
y_hk = df_hk_latest["price"]
X_hk = df_hk_latest.drop(columns=["price", "id"])

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X_hk, y_hk, test_size=0.2, random_state=42
)

# Identify feature types
numeric_features = X_train.select_dtypes(include=[np.number]).columns
categorical_features = X_train.select_dtypes(exclude=[np.number]).columns

# Preprocessing
numeric_transformer = Pipeline(
    steps=[("imputer", SimpleImputer(strategy="median"))]
)

categorical_transformer = Pipeline(
    steps=[
        ("imputer", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore"))
    ]
)

preprocessor_gb = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, numeric_features),
        ("cat", categorical_transformer, categorical_features)
    ]
)

# Gradient Boosting pipeline
gb_hk = Pipeline(
    steps=[
        ("preprocessor", preprocessor_gb),
        ("model", GradientBoostingRegressor(random_state=42))
    ]
)

# Fit + predict with timing
start_time = time.time()

gb_hk.fit(X_train, y_train)
y_pred = gb_hk.predict(X_test)

runtime_gb_hk = time.time() - start_time
rmse_gb_hk = np.sqrt(mean_squared_error(y_test, y_pred))

print("Hong Kong - Gradient Boosting RMSE:", rmse_gb_hk)
print("Hong Kong - Gradient Boosting Runtime (s):", round(runtime_gb_hk, 2))

Hong Kong - Gradient Boosting RMSE: 466.0048116110818
Hong Kong - Gradient Boosting Runtime (s): 5.14


**Decision Tree (CART)**

In [177]:
# Decision Tree (CART) on Hong Kong

# Define target and features
y_hk = df_hk_latest["price"]
X_hk = df_hk_latest.drop(columns=["price", "id"])

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X_hk, y_hk, test_size=0.2, random_state=42
)

# Identify feature types
numeric_features = X_train.select_dtypes(include=[np.number]).columns
categorical_features = X_train.select_dtypes(exclude=[np.number]).columns

# Preprocessing (same as RF/GB)
numeric_transformer = Pipeline(
    steps=[("imputer", SimpleImputer(strategy="median"))]
)

categorical_transformer = Pipeline(
    steps=[
        ("imputer", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore"))
    ]
)

preprocessor_dt = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, numeric_features),
        ("cat", categorical_transformer, categorical_features)
    ]
)

# Decision Tree pipeline
dt_hk = Pipeline(
    steps=[
        ("preprocessor", preprocessor_dt),
        ("model", DecisionTreeRegressor(random_state=42))
    ]
)

# Fit + predict with timing
start_time = time.time()

dt_hk.fit(X_train, y_train)
y_pred = dt_hk.predict(X_test)

runtime_dt_hk = time.time() - start_time
rmse_dt_hk = np.sqrt(mean_squared_error(y_test, y_pred))

print("Hong Kong - Decision Tree RMSE:", rmse_dt_hk)
print("Hong Kong - Decision Tree Runtime (s):", round(runtime_dt_hk, 2))

Hong Kong - Decision Tree RMSE: 499.90527905145404
Hong Kong - Decision Tree Runtime (s): 1.92


##### Horserace Table

The horserace table compares the 5 models in terms of predictive accuracy and computation time.

- **RMSE** is used to measure out-of-sample prediction error. Lower RMSE values indicate better predictive performance.
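- **Runtime** captures total model training and prediction time, in seconds.
- All models use the same Hong Kong data and the same train–test split (`test_size=0.2`, `random_state=42`). LASSO additionally keeps only a selected subset of the categorical variables, whereas the Random Forest, Gradient Boosting and Decision Tree cells keep all remaining non-numeric columns.
- The shared split makes the RMSE values directly comparable across models.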

| Model                    | RMSE (Test Set) | Runtime (seconds) |
|--------------------------|----------------:|------------------:|
| OLS                      | 455.96          | 1.38              |
| LASSO                    | 546.48          | 0.61              |
| Random Forest            | 636.64          | 0.43              |
| Gradient Boosting        | 466.00          | 5.14              |
| Decision Tree (CART)     | 499.91          | 1.92              |
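
A table like this can also be produced by a single loop over the models, which guarantees that every entry shares the same split and the same timing logic. The following is a minimal sketch under assumptions: it uses synthetic data from `make_regression` rather than the Hong Kong listings, only three of the five models are included to keep it short, and the helper name `run_horserace` is hypothetical (it does not appear in the notebook).

```python
import time

import numpy as np
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Lasso
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor


def run_horserace(models, X_train, X_test, y_train, y_test):
    """Fit each model on the same split; return test RMSE and runtime per model."""
    rows = []
    for name, model in models.items():
        start = time.perf_counter()
        model.fit(X_train, y_train)
        y_pred = model.predict(X_test)
        runtime = time.perf_counter() - start  # training + prediction time
        rmse = np.sqrt(mean_squared_error(y_test, y_pred))
        rows.append({"Model": name, "RMSE": rmse, "Runtime (s)": runtime})
    return pd.DataFrame(rows).set_index("Model")


# Synthetic stand-in data (assumption: NOT the Airbnb listings)
X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

models = {
    "OLS": LinearRegression(),
    "LASSO": Lasso(alpha=1.0, max_iter=3000),
    "Decision Tree (CART)": DecisionTreeRegressor(random_state=42),
}

horserace = run_horserace(models, X_train, X_test, y_train, y_test)
print(horserace.round(2))
```

In the notebook itself, the dictionary values would be the full preprocessing pipelines (for example `lasso_hk` or `rf_hk`) rather than bare estimators.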

##### Discussion of Performance

The horserace table shows clear differences across the models in terms of fit and runtime.

- **OLS**  
  OLS performs best here (lowest RMSE), while staying fast. It is a strong baseline on the Hong Kong dataset.

- **LASSO**  
  LASSO performs worse than OLS (RMSE 546.48 vs. 455.96), suggesting that regularisation with the fixed penalty `alpha=1.0` does not improve prediction in this setup.

- **Random Forest**  
  Random Forest performs the worst in terms of RMSE, even though it runs quickly in this version. This suggests the model settings may be underfitting, or the signal is better captured by simpler structure.

- **Gradient Boosting**  
  Gradient Boosting performs close to OLS (second-best RMSE), but takes much longer to run.

- **Decision Tree (CART)**  
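  CART performs better than LASSO and Random Forest, but worse than OLS and Gradient Boosting. Falling behind Gradient Boosting is expected, since a single tree can be unstable. Beating Random Forest is less typical and is consistent with the Random Forest settings being too restrictive here.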

Overall, the results show that simple models generalize well in this dataset.   
OLS gives the best accuracy with low runtime. Gradient Boosting comes close (RMSE 466.00 vs. 455.96) but does not beat it, and it takes the longest to run (5.14 s vs. 1.38 s).
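
To make "close to OLS" concrete, the gap of each model to the best RMSE can be expressed in percent. The sketch below simply re-uses the numbers from the horserace table above. The column names `rmse`, `runtime_s` and `gap_to_best_pct` are illustrative choices, not names used elsewhere in the notebook.

```python
import pandas as pd

# RMSE and runtime values copied from the horserace table above
results = pd.DataFrame(
    {
        "rmse": [455.96, 546.48, 636.64, 466.00, 499.91],
        "runtime_s": [1.38, 0.61, 0.43, 5.14, 1.92],
    },
    index=[
        "OLS",
        "LASSO",
        "Random Forest",
        "Gradient Boosting",
        "Decision Tree (CART)",
    ],
)

# Percentage by which each model's RMSE exceeds the best (lowest) RMSE
best_rmse = results["rmse"].min()
results["gap_to_best_pct"] = (results["rmse"] / best_rmse - 1) * 100

ranked = results.sort_values("rmse")
print(ranked.round(1))
```

With the reported values, Gradient Boosting is roughly 2% behind OLS, while Random Forest is roughly 40% behind.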

##### Experience of Running Models on 'Live' Datasets

While building the models, I faced runtime issues, mainly with the Random Forest model.

**Key Learning Points**

- Handling Runtime Issue in Original Model      
    In **Part I**, the original Random Forest with `n_estimators = 500` took a very long time to run.   
    So, I first reduced the number of trees to `200`, which reduced the runtime.    

- Applying Random Forest Model to Live Dataset  
    I then applied this model, with `n_estimators = 200`, to the live Tokyo Q3 2025 dataset in Part II.     
    The model still did not finish running on the live dataset, even after more than 6 minutes.     
    To keep the modelling process consistent and workable across all datasets, I went back to the original Random Forest model in Part I and reduced `n_estimators` further, to `100`.      

    This allowed the model to run properly on both the original and live datasets.  

- Balancing RMSE and Runtime        
    As a result, the RMSE increased slightly, but the model became usable and comparable across datasets.   
    Finding the balance between RMSE and runtime through trial and error was crucial for making my models workable.    

- Experience with other models      
    No such changes were needed for the other models.
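
The trial-and-error over `n_estimators` described above can be organised as a small sweep that records RMSE and runtime for each setting. This is only a sketch under assumptions: it runs on small synthetic data from `make_regression` (not the Tokyo listings), so the absolute runtimes will be far shorter than those I saw, and the list name `sweep` is hypothetical. The other Random Forest settings mirror the ones used in the Hong Kong cell.

```python
import time

import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption: NOT the Tokyo listings)
X, y = make_regression(n_samples=400, n_features=10, noise=10.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

sweep = []  # one (n_estimators, rmse, runtime_in_seconds) tuple per setting
for n_trees in [100, 200, 500]:
    rf = RandomForestRegressor(
        n_estimators=n_trees,
        max_depth=20,
        min_samples_leaf=5,
        max_features="sqrt",
        random_state=42,
        n_jobs=-1,
    )
    start = time.perf_counter()
    rf.fit(X_train, y_train)
    y_pred = rf.predict(X_test)
    runtime = time.perf_counter() - start
    rmse = np.sqrt(mean_squared_error(y_test, y_pred))
    sweep.append((n_trees, rmse, runtime))
    print(f"n_estimators={n_trees}: RMSE={rmse:.2f}, runtime={runtime:.2f}s")
```

Running such a sweep on a subsample first gives a rough idea of how runtime grows with the number of trees before committing to a full run on the live dataset.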

**Overall Learning**
- Scalability       
    This experience showed me the importance of considering runtime and scalability, not just predictive accuracy.      

- Objective of Horserace Table      
    It made me understand the need for the horserace table to compare models and their performance not only on the base dataset, but also on 'live' datasets.  

- Structure and Clear Documentation       
    This trial-and-error process also taught me the value of a well-structured notebook, which made it easier to adjust individual models when needed.  
    Fixing individual models took much less effort because I had a clear idea of how I wanted to structure my Jupyter notebook.   
    Clear documentation also proved to be extremely valuable to me here.