# Building a prediction model on Airbnb prices
Data Analysis 3 - Assignment 1  
Submitted by: Zariza Chowdhury (ID: 2500086)    
Deadline: 2 February 2026

## Business Case
My business case is to operate a chain of Airbnbs.      
The task is to build a pricing model.

## Part I. Modelling

### Step 0: Setup
Import the necessary libraries

In [40]:
# Core libraries
import numpy as np
import pandas as pd
import time
import warnings
warnings.filterwarnings("ignore")

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Scikit-learn: preprocessing
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

# Scikit-learn: models
from sklearn.linear_model import LinearRegression, Lasso
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor

# Scikit-learn: evaluation
from sklearn.metrics import mean_squared_error

# Interpretable ML
import shap

### Step 1: Data Selection, Wrangling and Feature Engineering

##### A. Dataset Selection:
- **Source**: Inside Airbnb - *https://insideairbnb.com/get-the-data/*
- **Dataset**: *listings.csv* (loaded directly from GitHub Repo)
- **City, Country**: Tokyo, Japan
- **Time Period**: Q4 2024
- **Reproducibility**: Data is uploaded and stored in a public GitHub repo and loaded directly via a raw URL

Load the Dataset

In [41]:
# Load dataset
url = "https://raw.githubusercontent.com/zarizachow/Data-Analysis-3/refs/heads/main/Assignment-1/Data/Raw/Tokyo_listings/Tokyo_2024-30-Dec/listings.csv"
df = pd.read_csv(url)

# Basic inspection
print("Shape (rows, columns):", df.shape)

print("\nColumn names:")
print(df.columns)

print("\nData types and missing values:")
df.info()

print("\nFirst 5 rows:")
df.head()

Shape (rows, columns): (21058, 75)

Column names:
Index(['id', 'listing_url', 'scrape_id', 'last_scraped', 'source', 'name',
       'description', 'neighborhood_overview', 'picture_url', 'host_id',
       'host_url', 'host_name', 'host_since', 'host_location', 'host_about',
       'host_response_time', 'host_response_rate', 'host_acceptance_rate',
       'host_is_superhost', 'host_thumbnail_url', 'host_picture_url',
       'host_neighbourhood', 'host_listings_count',
       'host_total_listings_count', 'host_verifications',
       'host_has_profile_pic', 'host_identity_verified', 'neighbourhood',
       'neighbourhood_cleansed', 'neighbourhood_group_cleansed', 'latitude',
       'longitude', 'property_type', 'room_type', 'accommodates', 'bathrooms',
       'bathrooms_text', 'bedrooms', 'beds', 'amenities', 'price',
       'minimum_nights', 'maximum_nights', 'minimum_minimum_nights',
       'maximum_minimum_nights', 'minimum_maximum_nights',
       'maximum_maximum_nights', 'minimum_nig

Unnamed: 0,id,listing_url,scrape_id,last_scraped,source,name,description,neighborhood_overview,picture_url,host_id,...,review_scores_communication,review_scores_location,review_scores_value,license,instant_bookable,calculated_host_listings_count,calculated_host_listings_count_entire_homes,calculated_host_listings_count_private_rooms,calculated_host_listings_count_shared_rooms,reviews_per_month
0,197677,https://www.airbnb.com/rooms/197677,20241230011552,2024-12-30,city scrape,Oshiage Holiday Apartment,,,https://a0.muscache.com/pictures/38437056/d27f...,964081,...,4.84,4.56,4.8,M130003350,f,1,1,0,0,1.13
1,776070,https://www.airbnb.com/rooms/776070,20241230011552,2024-12-30,city scrape,Kero-kero house room 1,We have been in airbnb since 2011 and it has g...,We love Nishinippori because is nearer to Toky...,https://a0.muscache.com/pictures/efd9f039-dbd2...,801494,...,4.98,4.84,4.92,M130000243,f,1,0,1,0,1.79
2,905944,https://www.airbnb.com/rooms/905944,20241230011552,2024-12-30,city scrape,4F Spacious Apartment in Shinjuku / Shibuya Tokyo,NEWLY RENOVATED property entirely for you & yo...,Hatagaya is a great neighborhood located 4 min...,https://a0.muscache.com/pictures/miso/Hosting-...,4847803,...,4.91,4.78,4.78,Hotels and Inns Business Act | 渋谷区保健所長 | 31渋健生...,t,8,8,0,0,1.69
3,1016831,https://www.airbnb.com/rooms/1016831,20241230011552,2024-12-30,city scrape,5 mins Shibuya Cat modern sunny Shimokita,"Hi there, I am Wakana and I live with my two f...",The location is walkable distance to famous Sh...,https://a0.muscache.com/pictures/airflow/Hosti...,5596383,...,4.98,4.92,4.9,M130001107,f,1,0,1,0,1.9
4,1196177,https://www.airbnb.com/rooms/1196177,20241230011552,2024-12-30,city scrape,Homestay at Host's House - Senju-Ohashi Station,Our accommodation offers: <br /><br />1. **Gr...,There are shopping mall near Senjuohashi stati...,https://a0.muscache.com/pictures/72890882/05ec...,5686404,...,4.92,4.74,4.82,M130007760,f,1,0,1,0,0.97


##### B. Data Wrangling and Feature Engineering

**Handle Missing Values**

- Identify missing values across numeric and categorical variables, so that all models can be estimated without errors
- Impute numeric variables with the column median, a simple choice that is robust to outliers
- Treat missing categorical values as a separate category (`"missing"`) to preserve information and avoid dropping observations

In [42]:
# Handle missing values

# Separate numeric and categorical columns
numeric_cols = df.select_dtypes(include=[np.number]).columns
categorical_cols = df.select_dtypes(exclude=[np.number]).columns

# Impute numeric variables with median
for col in numeric_cols:
    df[col] = df[col].fillna(df[col].median())

# Impute categorical variables with explicit category
for col in categorical_cols:
    df[col] = df[col].fillna("missing")

# Sanity check
df.isnull().sum().sort_values(ascending=False).head()

neighbourhood_group_cleansed    21058
calendar_updated                21058
minimum_nights_avg_ntm              0
availability_365                    0
availability_90                     0
dtype: int64

**Clean and Standardize Numeric Variables**

- Inspect numeric variables for unrealistic or extreme values – to identify potential data quality issues
- Apply simple cleaning rules and transformations where needed – to ensure the variables are consistent
- Standardize the format of the numeric variables – to support estimation across different predictive models

In [43]:
# Clean and standardize numeric variables

# Fix price variable (it may contain strings or missing)
df["price"] = df["price"].replace("missing", np.nan)

df["price"] = (
    df["price"]
    .astype(str)
    .str.replace("$", "", regex=False)
    .str.replace(",", "", regex=False)
)

df["price"] = pd.to_numeric(df["price"], errors="coerce")

# Impute missing price values with median
df["price"] = df["price"].fillna(df["price"].median())

# Update numeric columns after cleaning price
numeric_cols = df.select_dtypes(include=[np.number]).columns

# Inspect summary statistics
df[numeric_cols].describe().T

# Cap extreme values (simple winsorization)
for col in numeric_cols:
    lower = df[col].quantile(0.01)
    upper = df[col].quantile(0.99)
    df[col] = df[col].clip(lower, upper)

# Sanity check after cleaning
df[numeric_cols].describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
id,21058.0,7.505109e+17,5.214894e+17,8998859.0,46321860.0,9.859087e+17,1.188786e+18,1.313531e+18
scrape_id,21058.0,20241230000000.0,10.35962,20241230000000.0,20241230000000.0,20241230000000.0,20241230000000.0,20241230000000.0
host_id,21058.0,332607600.0,198555500.0,6648140.0,154225900.0,330316900.0,527232500.0,661916700.0
host_listings_count,21058.0,24.29951,30.15113,1.0,4.0,11.0,32.0,141.0
host_total_listings_count,21058.0,31.46866,41.09053,1.0,6.0,15.0,39.0,214.91
neighbourhood_group_cleansed,0.0,,,,,,,
latitude,21058.0,35.69799,0.04157748,35.55644,35.68777,35.70394,35.72249,35.77726
longitude,21058.0,139.738,0.06482399,139.4746,139.6993,139.7276,139.7923,139.8767
accommodates,21058.0,4.436984,2.955584,1.0,2.0,4.0,6.0,16.0
bathrooms,21058.0,1.130331,0.3829521,0.5,1.0,1.0,1.0,3.0
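
The winsorization above clips every numeric column, identifiers and coordinates included, using quantiles from the full sample. As a hedged alternative, the sketch below clips only selected columns and estimates the bounds on training rows alone, so that the test set does not influence the cleaning rule. The helper name `winsorize_with_train_bounds` and the toy data are assumptions made for illustration; they are not part of the notebook.

```python
import pandas as pd

def winsorize_with_train_bounds(train, test, cols, lower_q=0.01, upper_q=0.99):
    """Clip `cols` in both frames using quantiles estimated on `train` only."""
    train, test = train.copy(), test.copy()  # leave the inputs untouched
    for col in cols:
        lower = train[col].quantile(lower_q)
        upper = train[col].quantile(upper_q)
        train[col] = train[col].clip(lower, upper)
        test[col] = test[col].clip(lower, upper)
    return train, test

# Toy data (made up for illustration, not taken from the listings file)
train_demo = pd.DataFrame({"id": range(101), "beds": list(range(100)) + [1000]})
test_demo = pd.DataFrame({"id": [500, 501], "beds": [2, 5000]})

# Only `beds` is clipped; the identifier column is left as it is
train_w, test_w = winsorize_with_train_bounds(train_demo, test_demo, cols=["beds"])
print(train_w["beds"].max(), test_w["beds"].tolist())
```

Passing an explicit column list keeps identifiers such as `id` and `host_id` out of the clipping step, which is a design choice rather than something the original cell does.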


In [44]:
# Ensure target variable (price) is numeric (avoid errors in models)

df["price"] = (
    df["price"]
    .astype(str)
    .str.replace("$", "", regex=False)
    .str.replace(",", "", regex=False)
)

df["price"] = pd.to_numeric(df["price"], errors="coerce")

# Drop rows where price is missing (target must be observed)
df = df.dropna(subset=["price"])

# Sanity check
df["price"].dtype, df["price"].isna().sum(), df.shape

(dtype('float64'), np.int64(0), (21058, 75))

In [46]:
# Drop unusable variables (except id)

# Columns with all missing values
all_missing_cols = [
    "neighbourhood_group_cleansed",
    "calendar_updated"
]

# System-generated id columns
id_cols = [
    "scrape_id"
]

cols_to_drop = [col for col in all_missing_cols + id_cols if col in df.columns]
df = df.drop(columns=cols_to_drop)

# Sanity check
print("Dropped columns:", cols_to_drop)
print("New shape:", df.shape)

Dropped columns: []
New shape: (21058, 72)


**Variable Selection for Modelling**

- Exclude id, URLs, dates, and free-text fields that are not useful for prediction  
- Keep structured listing, host, location, and amenity variables  
- Use the same variables across all datasets for out-of-sample comparison
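
The selection rules above can be sketched as a small filter on column names. This is only an illustration: the helper `select_model_columns` and the exclusion patterns are assumptions based on the column names printed earlier, and `host_id` is deliberately left in because the models in this notebook use it.

```python
# Name patterns for columns that are not useful as predictors.
# These patterns are assumptions derived from the printed column list.
EXCLUDE_EXACT = {"id", "scrape_id", "name", "description",
                 "neighborhood_overview", "host_name", "host_about"}
EXCLUDE_SUFFIXES = ("_url",)
EXCLUDE_DATE_HINTS = ("last_scraped", "host_since", "first_review", "last_review")

def select_model_columns(columns):
    """Return the columns left after dropping ids, URLs, dates and free text."""
    kept = []
    for col in columns:
        if col in EXCLUDE_EXACT:
            continue  # identifiers and free-text fields
        if col.endswith(EXCLUDE_SUFFIXES):
            continue  # URL fields
        if any(hint in col for hint in EXCLUDE_DATE_HINTS):
            continue  # date fields
        kept.append(col)
    return kept

example_columns = ["id", "listing_url", "last_scraped", "name", "host_since",
                   "room_type", "accommodates", "calendar_last_scraped", "price"]
print(select_model_columns(example_columns))
# → ['room_type', 'accommodates', 'price']
```

Applied to a DataFrame, the same helper could be used as `df[select_model_columns(df.columns)]`, which makes it easy to reuse one rule across several datasets.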

**Extract Amenities**

- Parse the amenities text field into structured variables – to make the data usable  
- Create binary indicators for selected amenities – to capture key listing features  
- Use amenity features as additional inputs in the models

In [47]:
# Extract amenities

# Convert amenities column to string
df["amenities"] = df["amenities"].astype(str)

# List of selected amenities to extract
amenities_list = [
    "Wifi",
    "Kitchen",
    "Air conditioning",
    "Heating",
    "Washer",
    "Dryer",
    "Elevator",
    "TV"
]

# Create binary indicators for each amenity
for amenity in amenities_list:
    df[f"amenity_{amenity.lower().replace(' ', '_')}"] = (
        df["amenities"].str.contains(amenity, case=False, regex=False).astype(int)
    )

# Drop original amenities text field
df = df.drop(columns=["amenities"])

# Sanity check
df.filter(like="amenity_").head()

Unnamed: 0,amenity_wifi,amenity_kitchen,amenity_air_conditioning,amenity_heating,amenity_washer,amenity_dryer,amenity_elevator,amenity_tv
0,1,1,1,1,1,1,0,1
1,1,0,1,1,0,1,0,1
2,1,1,0,1,0,1,0,1
3,1,1,0,1,1,1,0,1
4,1,0,0,1,1,1,0,1
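
Substring matching can over-count: a listing with "Hair dryer" is flagged for `Dryer`, and "Dishwasher" is flagged for `Washer`. A minimal sketch of a stricter alternative follows, assuming the `amenities` field is stored as a JSON-encoded list of strings. The helper names `parse_amenities` and `has_amenity` are hypothetical.

```python
import json

def parse_amenities(raw):
    """Parse a JSON-style amenities string into a set of lower-cased names."""
    try:
        items = json.loads(raw)
    except (TypeError, ValueError):
        return set()  # missing or malformed entries count as "no amenities"
    if not isinstance(items, list):
        return set()
    return {str(item).strip().lower() for item in items}

def has_amenity(raw, amenity):
    """Return 1 if `amenity` is listed as its own entry, else 0."""
    return int(amenity.lower() in parse_amenities(raw))

example = '["Wifi", "Hair dryer", "Dishwasher"]'
print(has_amenity(example, "Wifi"))    # → 1
print(has_amenity(example, "Dryer"))   # → 0 (only "Hair dryer" is listed)
print(has_amenity(example, "Washer"))  # → 0 (only "Dishwasher" is listed)
```

Exact matching has its own cost: entries that describe a television with extra detail would no longer count as `TV`, so a mapping from raw entries to amenity groups may be needed in practice.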


**Save Cleaned Dataset**

In [50]:
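# Save cleaned Tokyo Q4 2024 dataset
import os

output_path = "Data/Cleaned/Tokyo_listings/tokyo_listings_q4_2024_clean.csv"

# Create the output folder if it does not exist, then overwrite any existing file
os.makedirs(os.path.dirname(output_path), exist_ok=True)
df.to_csv(output_path, index=False)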

print("Cleaned dataset saved (overwritten if existed):")
print(output_path)

Cleaned dataset saved (overwritten if existed):
Data/Cleaned/Tokyo_listings/tokyo_listings_q4_2024_clean.csv


**Encode Categorical Variables**

- Select relevant categorical variables  
- Convert categorical variables into numeric form  
- Use the same encoding across all datasets

In [51]:
# Encode categorical variables

# Identify categorical variables
categorical_cols = df.select_dtypes(exclude=[np.number]).columns.tolist()

# Drop id if it appears here; it is numeric in this dataset, so this is only a safeguard
categorical_cols = [col for col in categorical_cols if col != "id"]

# Sanity check
categorical_cols

['listing_url',
 'last_scraped',
 'source',
 'name',
 'description',
 'neighborhood_overview',
 'picture_url',
 'host_url',
 'host_name',
 'host_since',
 'host_location',
 'host_about',
 'host_response_time',
 'host_response_rate',
 'host_acceptance_rate',
 'host_is_superhost',
 'host_thumbnail_url',
 'host_picture_url',
 'host_neighbourhood',
 'host_verifications',
 'host_has_profile_pic',
 'host_identity_verified',
 'neighbourhood',
 'neighbourhood_cleansed',
 'property_type',
 'room_type',
 'bathrooms_text',
 'has_availability',
 'calendar_last_scraped',
 'first_review',
 'last_review',
 'license',
 'instant_bookable']

### Step 2: Build Predictive Models

##### A. OLS Model

- Define target variable (`price`) and feature matrix (`X`)  
- Split data into training and test sets  
- Fit an OLS baseline model and generate predictions

In [52]:
# OLS model (with imputation in pipeline)

from sklearn.impute import SimpleImputer

# Define target and features
y = df["price"]
X = df.drop(columns=["price", "id"])

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Identify numeric and categorical features
numeric_features = X_train.select_dtypes(include=[np.number]).columns
categorical_features = X_train.select_dtypes(exclude=[np.number]).columns

# Preprocessing pipelines
numeric_transformer = Pipeline(
    steps=[
        ("imputer", SimpleImputer(strategy="median")),
        ("scaler", StandardScaler())
    ]
)

categorical_transformer = Pipeline(
    steps=[
        ("imputer", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore"))
    ]
)

preprocessor = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, numeric_features),
        ("cat", categorical_transformer, categorical_features)
    ]
)

# OLS pipeline
ols_model = Pipeline(
    steps=[
        ("preprocessor", preprocessor),
        ("model", LinearRegression())
    ]
)

# Fit model
ols_model.fit(X_train, y_train)

# Predictions
y_pred_ols = ols_model.predict(X_test)

# Evaluation (RMSE)
rmse_ols = np.sqrt(mean_squared_error(y_test, y_pred_ols))
print("OLS Model RMSE:", rmse_ols)

OLS Model RMSE: 12740.712267742509


##### B. LASSO Model

- Use the same target variable as in the OLS model, but keep only a small set of categorical variables to avoid a very large dummy matrix  
- Fit a LASSO model to allow coefficient shrinkage  
- Predict prices on the test set and evaluate model performance

In [54]:
# Restrict categorical variables before LASSO to avoid huge dummy matrix

# Define categorical variables to keep (class-style)
categorical_keep = [
    "room_type",
    "property_type",
    "neighbourhood_cleansed",
    "host_is_superhost",
    "instant_bookable"
]

# Keep only those that exist in the dataset
categorical_keep = [c for c in categorical_keep if c in df.columns]

# Define target and features
y = df["price"]
X = df.drop(columns=["price", "id"])

# Drop all other object/string columns
object_cols = X.select_dtypes(exclude=[np.number]).columns.tolist()
drop_cols = [c for c in object_cols if c not in categorical_keep]

X = X.drop(columns=drop_cols)

# Sanity check
print("Categorical variables kept:", categorical_keep)
print("Number of object/string columns dropped:", len(drop_cols))
print("Final feature shape:", X.shape)

Categorical variables kept: ['room_type', 'property_type', 'neighbourhood_cleansed', 'host_is_superhost', 'instant_bookable']
Number of object/string columns dropped: 28
Final feature shape: (21058, 49)


In [55]:
# LASSO model

# Train-test split (using the restricted X, y)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Identify feature types
numeric_features = X_train.select_dtypes(include=[np.number]).columns
categorical_features = X_train.select_dtypes(exclude=[np.number]).columns

# Preprocessing
numeric_transformer = Pipeline(
    steps=[
        ("imputer", SimpleImputer(strategy="median")),
        ("scaler", StandardScaler())
    ]
)

categorical_transformer = Pipeline(
    steps=[
        ("imputer", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore"))
    ]
)

preprocessor_lasso = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, numeric_features),
        ("cat", categorical_transformer, categorical_features)
    ]
)

# LASSO pipeline
lasso_model = Pipeline(
    steps=[
        ("preprocessor", preprocessor_lasso),
        ("model", Lasso(alpha=1.0, max_iter=3000))
    ]
)

# Fit model
lasso_model.fit(X_train, y_train)

# Predictions
y_pred_lasso = lasso_model.predict(X_test)

# Evaluation (RMSE)
rmse_lasso = np.sqrt(mean_squared_error(y_test, y_pred_lasso))
print("LASSO Model RMSE:", rmse_lasso)

LASSO Model RMSE: 15064.244563773585
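
The LASSO above uses a fixed penalty (`alpha=1.0`). A hedged sketch of how the penalty could instead be chosen by cross-validation with `GridSearchCV` is shown below. It runs on synthetic data so that it is self-contained, and the candidate grid is an arbitrary assumption, not a recommendation for the Tokyo data.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data, used only to keep the example self-contained
X_demo, y_demo = make_regression(
    n_samples=200, n_features=10, n_informative=4, noise=10.0, random_state=42
)

lasso_pipe = Pipeline(
    steps=[
        ("scaler", StandardScaler()),
        ("model", Lasso(max_iter=10000)),
    ]
)

# Candidate penalties; the grid itself is an arbitrary choice for illustration
param_grid = {"model__alpha": [0.01, 0.1, 1.0, 10.0, 100.0]}

# 5-fold cross-validation, scored by (negative) RMSE
search = GridSearchCV(
    lasso_pipe, param_grid, scoring="neg_root_mean_squared_error", cv=5
)
search.fit(X_demo, y_demo)

best_alpha = search.best_params_["model__alpha"]
cv_rmse = -search.best_score_  # flip the sign back to a positive RMSE
print("Best alpha:", best_alpha)
print("Cross-validated RMSE:", cv_rmse)
```

In the notebook, the same idea would apply to `lasso_model` with the training split, leaving the test set untouched until the final evaluation.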


##### C. Random Forest Model

- Use the same target variable and working feature set (`X`, `y`)  
- Fit a Random Forest model and generate predictions  
- Evaluate performance on the test set (RMSE)

In [56]:
# Random Forest model

# Train-test split (using the same working X, y)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Identify feature types
numeric_features = X_train.select_dtypes(include=[np.number]).columns
categorical_features = X_train.select_dtypes(exclude=[np.number]).columns

# Preprocessing (class-style)
numeric_transformer = Pipeline(
    steps=[
        ("imputer", SimpleImputer(strategy="median"))
    ]
)

categorical_transformer = Pipeline(
    steps=[
        ("imputer", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore"))
    ]
)

preprocessor_rf = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, numeric_features),
        ("cat", categorical_transformer, categorical_features)
    ]
)

# Random Forest pipeline
rf_model = Pipeline(
    steps=[
        ("preprocessor", preprocessor_rf),
        ("model", RandomForestRegressor(
            n_estimators=500,
            random_state=42,
            n_jobs=-1
        ))
    ]
)

# Fit model
rf_model.fit(X_train, y_train)

# Predictions
y_pred_rf = rf_model.predict(X_test)

# Evaluation (RMSE)
rmse_rf = np.sqrt(mean_squared_error(y_test, y_pred_rf))
print("Random Forest Model RMSE:", rmse_rf)

Random Forest Model RMSE: 12215.790144828907


##### D. Boosting Model (Gradient Boosting)

- **Chosen model**: Gradient Boosting
- **Reason for choosing this model**: This is a common boosting method used in class that performs well on tabular data and allows feature importance analysis

In [57]:
# Gradient Boosting model

# Train-test split (using the same working X, y)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Identify feature types
numeric_features = X_train.select_dtypes(include=[np.number]).columns
categorical_features = X_train.select_dtypes(exclude=[np.number]).columns

# Preprocessing (same as Random Forest)
numeric_transformer = Pipeline(
    steps=[
        ("imputer", SimpleImputer(strategy="median"))
    ]
)

categorical_transformer = Pipeline(
    steps=[
        ("imputer", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore"))
    ]
)

preprocessor_gb = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, numeric_features),
        ("cat", categorical_transformer, categorical_features)
    ]
)

# Gradient Boosting pipeline
gb_model = Pipeline(
    steps=[
        ("preprocessor", preprocessor_gb),
        ("model", GradientBoostingRegressor(random_state=42))
    ]
)

# Fit model
gb_model.fit(X_train, y_train)

# Predictions
y_pred_gb = gb_model.predict(X_test)

# Evaluation (RMSE)
rmse_gb = np.sqrt(mean_squared_error(y_test, y_pred_gb))
print("Gradient Boosting Model RMSE:", rmse_gb)

Gradient Boosting Model RMSE: 14187.44720172246


##### E. Decision Tree

- **Chosen model**: Decision Tree (CART)  
- **Reason for choosing the model**: This is a simple tree-based model which provides a clear baseline to compare with ensemble methods

In [58]:
# Decision Tree model (CART)

# Train-test split (using the same working X, y)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Identify feature types
numeric_features = X_train.select_dtypes(include=[np.number]).columns
categorical_features = X_train.select_dtypes(exclude=[np.number]).columns

# Preprocessing (same as RF and Boosting)
numeric_transformer = Pipeline(
    steps=[
        ("imputer", SimpleImputer(strategy="median"))
    ]
)

categorical_transformer = Pipeline(
    steps=[
        ("imputer", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore"))
    ]
)

preprocessor_dt = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, numeric_features),
        ("cat", categorical_transformer, categorical_features)
    ]
)

# Decision Tree pipeline
dt_model = Pipeline(
    steps=[
        ("preprocessor", preprocessor_dt),
        ("model", DecisionTreeRegressor(random_state=42))
    ]
)

# Fit model
dt_model.fit(X_train, y_train)

# Predictions
y_pred_dt = dt_model.predict(X_test)

# Evaluation (RMSE)
rmse_dt = np.sqrt(mean_squared_error(y_test, y_pred_dt))
print("Decision Tree Model RMSE:", rmse_dt)

Decision Tree Model RMSE: 18161.323384531686


### Step 3: Compare the Models in Terms of Fit and Time

##### A. Horserace Table

The horserace table compares the 5 models in terms of predictive accuracy and computation time.

- **RMSE** is used to measure out-of-sample prediction error. Lower RMSE values indicate better predictive performance.
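- **Time** captures total model training and prediction time.
- All models use the same observations and the same train–test split (`test_size=0.2`, `random_state=42`).
- OLS is fitted on all non-numeric columns, while the other four models use the restricted set of categorical variables, so the comparison with OLS is not fully like-for-like.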
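
For reference, the RMSE reported here corresponds to `np.sqrt(mean_squared_error(y_test, y_pred))` in the model cells. Written out, with $y_i$ the observed price of test listing $i$, $\hat{y}_i$ the predicted price, and $n$ the number of test listings:

$$
\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}
$$

Because the errors are squared before averaging, a few large misses can dominate the value. RMSE is expressed in the same units as `price`.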

| Model                    | RMSE (Test Set) | Runtime (seconds) |
|--------------------------|----------------|-------------------|
| OLS                      | 12,740.71      | 4.0               |
| LASSO                    | 15,064.24      | 2.0               |
| Random Forest            | 12,215.79      | 50.3              |
| Gradient Boosting        | 14,187.45      | 10.4              |
| Decision Tree (CART)     | 18,161.32      | 0.8               |
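
The model cells do not show how the runtimes were recorded. One way such figures could be measured is sketched below with `time.perf_counter`. The helper `timed_fit_predict` and the stand-in workload are hypothetical and serve only to make the example runnable on its own.

```python
import time

def timed_fit_predict(fit, predict):
    """Run `fit()` then `predict()` and return (predictions, elapsed_seconds)."""
    start = time.perf_counter()
    fit()
    predictions = predict()
    elapsed = time.perf_counter() - start
    return predictions, elapsed

# Stand-in workload so the example runs without any model or data
state = {}

def demo_fit():
    state["mean"] = sum(range(1000)) / 1000  # "training": store a mean

def demo_predict():
    return [state["mean"]] * 3  # "prediction": repeat the stored mean

preds, seconds = timed_fit_predict(demo_fit, demo_predict)
print(f"Runtime: {seconds:.4f} seconds")
```

With the notebook's objects, a call could look like `timed_fit_predict(lambda: rf_model.fit(X_train, y_train), lambda: rf_model.predict(X_test))`. Measured times depend on hardware and on settings such as `n_jobs`.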

##### B. Discussion of Performance

The horserace table shows clear differences across the models in terms of fit and runtime.

- **Random Forest**: has the lowest RMSE (12,215.79), so it gives the most accurate predictions, but it is by far the slowest model (50.3 seconds)
- **OLS**: performs well given how simple it is, with an RMSE (12,740.71) close to Random Forest at a fraction of the runtime (4.0 seconds)
- **Gradient Boosting**: improves over a single decision tree, but performs worse than both Random Forest and OLS
- **LASSO**: performs worse than OLS. It was fitted on a restricted set of categorical variables with a fixed penalty (`alpha=1.0`), so the gap may reflect the smaller feature set and the untuned penalty rather than regularisation itself
- **Decision Tree (CART)**: performs the worst, which might be due to overfitting and limited generalization

Overall, only Random Forest improves on the OLS baseline, and it requires much more computation time.   
Random Forest performs best, but OLS is a strong and efficient baseline.

### Step 4: Analyzing Random Forest Model and Gradient Boosting Model

##### A. Feature Importance of Random Forest
- Feature importance shows which variables contribute most to predicting prices in this model
- Since this is a black-box model without an interpretable formula, feature importance helps interpret its behaviour
- Feature importance values are impurity-based: they measure how much splits on each feature reduce prediction error, averaged across the trees
- The focus is to identify the most important features, not the exact numerical values

In [59]:
# Random Forest Feature Importance

# Get fitted Random Forest model
rf_estimator = rf_model.named_steps["model"]

# Get feature names after preprocessing
preprocessor = rf_model.named_steps["preprocessor"]

# Numeric feature names
num_features = preprocessor.named_transformers_["num"].feature_names_in_

# Categorical feature names (after one-hot encoding)
cat_ohe = preprocessor.named_transformers_["cat"].named_steps["onehot"]
cat_features = cat_ohe.get_feature_names_out(
    preprocessor.named_transformers_["cat"].feature_names_in_
)

# Combine all feature names
feature_names = np.concatenate([num_features, cat_features])

# Extract feature importances
importances = rf_estimator.feature_importances_

# Create importance table
rf_importance_df = (
    pd.DataFrame({
        "feature": feature_names,
        "importance": importances
    })
    .sort_values("importance", ascending=False)
)

# Show top 10 features
rf_importance_df.head(10)

Unnamed: 0,feature,importance
5,accommodates,0.324738
3,latitude,0.070139
4,longitude,0.048778
6,bathrooms,0.048505
0,host_id,0.035186
8,beds,0.030867
20,availability_365,0.029629
19,availability_90,0.027112
35,reviews_per_month,0.020788
2,host_total_listings_count,0.015519


**Top 10 Features - Random Forest**

| Feature                     | Importance |
|----------------------------|------------|
| accommodates                | 0.324738   |
| latitude                    | 0.070139   |
| longitude                   | 0.048778   |
| bathrooms                   | 0.048505   |
| host_id                     | 0.035186   |
| beds                        | 0.030867   |
| availability_365            | 0.029629   |
| availability_90             | 0.027112   |
| reviews_per_month           | 0.020788   |
| host_total_listings_count   | 0.015519   |

The Random Forest model is mainly driven by capacity (`accommodates`) and location (`latitude` and `longitude`).        
Size of the property (`bathrooms`, `beds`) and availability variables also play an important role in price prediction.
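
Impurity-based importances from `feature_importances_` are computed on the training data and can favour high-cardinality features, which may be one reason an identifier like `host_id` ranks highly. As a hedged cross-check, the sketch below uses scikit-learn's `permutation_importance` on held-out data. It runs on synthetic data with a small forest, so the numbers say nothing about the Tokyo listings.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data: 5 features, of which 2 are informative
X_demo, y_demo = make_regression(
    n_samples=300, n_features=5, n_informative=2, noise=5.0, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

forest = RandomForestRegressor(n_estimators=50, random_state=42)
forest.fit(X_tr, y_tr)

# Shuffle each feature on the test set and record the drop in the model score
result = permutation_importance(forest, X_te, y_te, n_repeats=5, random_state=42)

# Rank features from most to least important
ranking = np.argsort(result.importances_mean)[::-1]
for idx in ranking:
    print(f"feature_{idx}: {result.importances_mean[idx]:.3f}")
```

The same call accepts a fitted pipeline, so in the notebook it could be applied to `rf_model` with `X_test` and `y_test`, which would report importances per original column rather than per dummy variable.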

##### B. Feature Importance of Gradient Boosting

This section examines which features are most important in the Gradient Boosting model.

- Feature importance highlights which variables drive price predictions
- Gradient Boosting is also a black-box model, so feature importance helps with interpretation
- Importance values reflect how much each feature contributes to improving predictions
- The focus is on identifying the most important features rather than exact numerical values

In [60]:
# Gradient Boosting Feature Importance

# Get fitted Gradient Boosting model
gb_estimator = gb_model.named_steps["model"]

# Get preprocessor
preprocessor = gb_model.named_steps["preprocessor"]

# Numeric feature names
num_features = preprocessor.named_transformers_["num"].feature_names_in_

# Categorical feature names (after one-hot encoding)
cat_ohe = preprocessor.named_transformers_["cat"].named_steps["onehot"]
cat_features = cat_ohe.get_feature_names_out(
    preprocessor.named_transformers_["cat"].feature_names_in_
)

# Combine all feature names
feature_names = np.concatenate([num_features, cat_features])

# Extract feature importances
importances = gb_estimator.feature_importances_

# Create importance table
gb_importance_df = (
    pd.DataFrame({
        "feature": feature_names,
        "importance": importances
    })
    .sort_values("importance", ascending=False)
)

# Show top 10 features
gb_importance_df.head(10)

Unnamed: 0,feature,importance
5,accommodates,0.516623
8,beds,0.090508
6,bathrooms,0.06338
3,latitude,0.061143
7,bedrooms,0.052976
4,longitude,0.029512
88,neighbourhood_cleansed_Shinjuku Ku,0.019219
19,availability_90,0.018191
0,host_id,0.013844
86,neighbourhood_cleansed_Shibuya Ku,0.010434


**Top 10 Features – Gradient Boosting**

| Feature                                | Importance |
|----------------------------------------|------------|
| accommodates                           | 0.516623   |
| beds                                   | 0.090508   |
| bathrooms                              | 0.063380   |
| latitude                               | 0.061143   |
| bedrooms                               | 0.052976   |
| longitude                              | 0.029512   |
| neighbourhood_cleansed_Shinjuku Ku     | 0.019219   |
| availability_90                        | 0.018191   |
| host_id                                | 0.013844   |
| neighbourhood_cleansed_Shibuya Ku      | 0.010434   |

The Gradient Boosting model is mainly driven by capacity (`accommodates`) and property size (`beds`, `bathrooms`, `bedrooms`).      
Location (`latitude`, `longitude`, and neighbourhood) and availability also contribute to price prediction.

##### C. Compare the 10 Most Important Features of Random Forest and Gradient Boosting Models

Both models rely on similar core drivers of price, but they place different weights on each feature.

**Similarities: features that appear in the top 10 of both models**:
- `accommodates`
- `latitude`, `longitude`
- `bathrooms`, `beds`
- `availability_90`
- `host_id`

**Differences**:
- Random Forest puts more weight on availability, reviews and host-related variables (`availability_365`, `reviews_per_month`, `host_total_listings_count`).
- Gradient Boosting puts more weight on property size variables (`bedrooms`) and neighbourhood variables (`neighbourhood_cleansed_Shinjuku Ku`, `neighbourhood_cleansed_Shibuya Ku`).
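
The overlap and differences listed above can be reproduced directly from the two top-10 tables. The short sketch below hard-codes the feature names from those tables; the variable names are chosen for illustration.

```python
# Top-10 feature names copied from the two importance tables
rf_top10 = [
    "accommodates", "latitude", "longitude", "bathrooms", "host_id",
    "beds", "availability_365", "availability_90", "reviews_per_month",
    "host_total_listings_count",
]
gb_top10 = [
    "accommodates", "beds", "bathrooms", "latitude", "bedrooms",
    "longitude", "neighbourhood_cleansed_Shinjuku Ku", "availability_90",
    "host_id", "neighbourhood_cleansed_Shibuya Ku",
]

# List comprehensions keep the ranking order of the source list
shared = [f for f in rf_top10 if f in gb_top10]
rf_only = [f for f in rf_top10 if f not in gb_top10]
gb_only = [f for f in gb_top10 if f not in rf_top10]

print("In both top 10s:", shared)
print("Random Forest only:", rf_only)
print("Gradient Boosting only:", gb_only)
```

In the notebook, the two lists could be taken from `rf_importance_df["feature"].head(10).tolist()` and `gb_importance_df["feature"].head(10).tolist()` instead of being typed by hand.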

Overall, both models show that capacity and location are the main predictors of price.      
However, Gradient Boosting emphasizes neighbourhood variables more, and Random Forest prioritises reviews and host-related variables.

## Part II. Validity

This section tests how well the models perform on new/'live' datasets.

Two additional datasets are used:
- A **later time period** for the same city (Tokyo) - `Q3 2025`
- A **different city** from the same region - 

The same data wrangling steps and the same 5 predictive models from Part I are applied to these new datasets.       
Model performance is then compared to assess how well the models generalize to new data and settings.

### Step 5: Adding 2 'Live' Datasets