# Airbnb Listings & Reviews — End-to-End KDD/EDA & Modeling

**Author:** Your Name • **Last updated:** 2025-11-11

This notebook walks through a principled, **textbook-quality** KDD process on the Kaggle dataset **“Airbnb Listings & Reviews”**: data understanding, cleaning/preprocessing, exploratory analysis, outlier handling, feature selection, clustering, and supervised modeling to **predict monthly listing revenue**. We fit models on `log1p(monthly_revenue)` for stability, report RMSE and MAE in $ on the natural scale, and compare simple baselines (global mean, city mean) against Ridge and tree ensembles.

> **Compute note:** The notebook is designed to run on Colab with **limited compute**. We use sampling for heavy steps (text, permutation importance, clustering diagnostics) and keep models efficient.


## 0. Environment setup & data access

You can provide data in one of three ways:
1. **Kaggle API** (recommended): upload your `kaggle.json` (Account → Create New API Token) to Colab `/root/.kaggle/kaggle.json`.
2. **Manually**: upload the extracted dataset folder to `/content/data/airbnb-listings-reviews`.
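3. **Google Drive**: mount Drive and point `DATA_DIR` accordingly.

> This notebook looks for a fixed list of candidate file names (for example `listings.csv`, `reviews.csv`, and `calendar.csv`) inside `DATA_DIR`. If your files are named differently, rename them or extend the candidate list in Section 3.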

In [None]:

# If using Kaggle API, run this cell once you have uploaded kaggle.json via the Colab file browser.
# It will download the dataset into /content/data/airbnb-listings-reviews
# You may skip/ignore if you plan to provide files manually.
import os, subprocess, json, pathlib, shutil

DATA_DIR = pathlib.Path("/content/data/airbnb-listings-reviews")  # <-- Change if needed
DATA_DIR.mkdir(parents=True, exist_ok=True)

def setup_kaggle_and_download():
    kaggle_json = "/root/.kaggle/kaggle.json"
    if not os.path.exists(kaggle_json):
        print("kaggle.json not found at /root/.kaggle/kaggle.json — please upload it first.")
        return
    os.chmod("/root/.kaggle/kaggle.json", 0o600)
    print("Installing Kaggle and downloading dataset...")
    subprocess.run(["pip","install","-q","kaggle"], check=False)
    # Dataset: https://www.kaggle.com/datasets/mysarahmadbhat/airbnb-listings-reviews
    subprocess.run([
        "kaggle","datasets","download","-d","mysarahmadbhat/airbnb-listings-reviews",
        "-p", str(DATA_DIR)
    ], check=False)
    # Unzip
    for f in os.listdir(DATA_DIR):
        if f.endswith(".zip"):
            subprocess.run(["unzip","-o", str(DATA_DIR/f), "-d", str(DATA_DIR)], check=False)

# Uncomment to use:
# setup_kaggle_and_download()

print("DATA_DIR =", DATA_DIR)
print("Place the dataset CSV files inside the path above if you are not using Kaggle API.")


In [None]:

# Core packages (Colab has most preinstalled; -U upgrades them to the latest release, no versions are pinned)
%pip -q install -U pandas numpy scikit-learn matplotlib joblib textblob


## 1. Imports & global configuration

In [None]:

import os, gc, re, json, math, warnings, textwrap, random
warnings.filterwarnings("ignore")

import numpy as np
import pandas as pd
from pathlib import Path
from collections import Counter

import matplotlib.pyplot as plt
plt.rcParams['figure.figsize'] = (7,4)
plt.rcParams['axes.grid'] = True

from sklearn.model_selection import train_test_split, GroupKFold, KFold
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.linear_model import Ridge
from sklearn.ensemble import RandomForestRegressor, HistGradientBoostingRegressor
from sklearn.inspection import permutation_importance
from sklearn.decomposition import PCA
from sklearn.cluster import MiniBatchKMeans
from sklearn.feature_selection import mutual_info_regression
from sklearn.ensemble import IsolationForest

SEED = 42
rnd = np.random.RandomState(SEED)
np.random.seed(SEED)
random.seed(SEED)

DATA_DIR = Path("/content/data/airbnb-listings-reviews")  # adjust if needed
print("Using DATA_DIR:", DATA_DIR)


## 2. Helper functions

In [None]:

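def first_existing(dirpath: Path, names):
    """Return the first candidate file found under dirpath, or None.

    Matching ignores case and also searches subfolders, because extracted
    archives may nest the CSVs or capitalize their names.
    """
    by_name = {p.name.lower(): p for p in sorted(dirpath.rglob("*.csv"))}
    for n in names:
        if n.lower() in by_name:
            return by_name[n.lower()]
    return None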

def to_money(s):
    if pd.isna(s): return np.nan
    return pd.to_numeric(str(s).replace("$","").replace(",","").strip(), errors="coerce")

def schema_report(df, name):
    print(f"==== {name} SHAPE {df.shape} ====")
    head = df.head(3)
    cols = pd.DataFrame({
        "column": df.columns,
        "dtype": [df[c].dtype for c in df.columns],
        "null_pct": [df[c].isna().mean().round(4) for c in df.columns],
        "n_unique": [df[c].nunique(dropna=True) for c in df.columns]
    })
    return head, cols.sort_values(["null_pct","n_unique"], ascending=[False, True])

def pct_out_of_bounds(s, low, high):
    return round(100*((s < low) | (s > high)).mean(),3)

def winsorize(s, p=0.01):
    lo, hi = s.quantile([p, 1-p])
    return s.clip(lo, hi)

def iqr_cap(s, k=3.0):
    q1, q3 = s.quantile([.25,.75])
    lo, hi = q1 - k*(q3-q1), q3 + k*(q3-q1)
    return s.clip(lo, hi)
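
To see how the two capping helpers differ, here is a small self-contained sketch on made-up numbers (the toy series is an assumption, not data from the dataset). Both helpers are repeated from the cell above so the snippet runs on its own.

```python
import pandas as pd

def winsorize(s, p=0.01):
    lo, hi = s.quantile([p, 1 - p])
    return s.clip(lo, hi)

def iqr_cap(s, k=3.0):
    q1, q3 = s.quantile([.25, .75])
    lo, hi = q1 - k * (q3 - q1), q3 + k * (q3 - q1)
    return s.clip(lo, hi)

# Toy nightly prices: 1..99 plus one extreme value (illustrative only)
toy = pd.Series(list(range(1, 100)) + [10_000], dtype=float)

w = winsorize(toy, p=0.01)   # caps at the 1st and 99th percentiles
q = iqr_cap(toy, k=3.0)      # caps at Q1 - 3*IQR and Q3 + 3*IQR

print("raw max    :", toy.max())
print("winsorized :", w.max())
print("IQR-capped :", q.max())
```

Percentile capping always trims both tails by the same share of rows, while the IQR rule only touches values that are far from the bulk of the distribution, so on this toy series it leaves the lower tail unchanged.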


## 3. File discovery & initial load (lightweight)

In [None]:

candidates = {
    "listings": ["listings.csv","listings_clean.csv","listings_summary.csv","listings_cities.csv"],
    "reviews":  ["reviews.csv","reviews_clean.csv"],
    "calendar": ["calendar.csv","calendar_summary.csv"]
}

fp_listings = first_existing(DATA_DIR, candidates["listings"])
fp_reviews  = first_existing(DATA_DIR, candidates["reviews"])
fp_calendar = first_existing(DATA_DIR, candidates["calendar"])

print("Found files:")
print(" listings:", fp_listings)
print(" reviews :", fp_reviews)
print(" calendar:", fp_calendar)

assert fp_listings is not None, "Listings CSV not found. Please check DATA_DIR and file names."

preview = pd.read_csv(fp_listings, nrows=500, low_memory=False)
print("Listings columns preview:", preview.columns.tolist())
del preview; gc.collect();


## 4. Load listings & basic typing / normalization

In [None]:

listings = pd.read_csv(fp_listings, low_memory=False)

cast_cats = ["city","neighbourhood","room_type","property_type","instant_bookable","host_is_superhost"]
money_cols = ["price","cleaning_fee","security_deposit"]

if "id" in listings.columns:
    listings["id"] = pd.to_numeric(listings["id"], errors="coerce").astype("Int64")

for c in cast_cats:
    if c in listings.columns:
        listings[c] = listings[c].astype("category")

for c in money_cols:
    if c in listings.columns:
        listings[c+"_num"] = listings[c].apply(to_money)

for c in ["instant_bookable","host_is_superhost"]:
    if c in listings.columns:
        listings[c] = listings[c].astype(str).str.lower().map({"t":True,"true":True,"f":False,"false":False}).astype("boolean")

for c in ["last_review","host_since","first_review","last_scraped"]:
    if c in listings.columns:
        listings[c] = pd.to_datetime(listings[c], errors="coerce")

listings = listings.dropna(subset=["id"]).copy()
listings["id"] = listings["id"].astype(int)
if "last_scraped" in listings.columns:
    listings = listings.sort_values(["id","last_scraped"], ascending=[True, False])
listings = listings.drop_duplicates(subset=["id"], keep="first").reset_index(drop=True)

head_listings, dict_listings = schema_report(listings, "LISTINGS")
display(head_listings)
display(dict_listings.head(20))


## 5. Sanity checks (ranges & keys)

In [None]:

num_report = {}

if {"latitude","longitude"}.issubset(listings.columns):
    num_report["lat_oob%"] = pct_out_of_bounds(listings["latitude"], -90, 90)
    num_report["lon_oob%"] = pct_out_of_bounds(listings["longitude"], -180, 180)

if "price" in listings.columns and "price_num" not in listings.columns:
    listings["price_num"] = listings["price"].apply(to_money)
elif "price_num" not in listings.columns and "price" not in listings.columns:
    listings["price_num"] = np.nan

bounds = {
    "price_num": (0, 10000),
    "minimum_nights": (1, 365),
    "accommodates": (1, 16)
}

for col, (lo, hi) in bounds.items():
    if col in listings.columns:
        num_report[f"{col}_oob%"] = pct_out_of_bounds(listings[col], lo, hi)

num_report


## 6. Optional light load for reviews/calendar

In [None]:

reviews = None
calendar = None

if fp_reviews is not None:
    try:
        reviews = pd.read_csv(fp_reviews, usecols=[c for c in ["listing_id","date"] if c], parse_dates=["date"], low_memory=False)
        print("Loaded reviews shape:", reviews.shape)
    except Exception as e:
        print("Skipping reviews load:", e)

if fp_calendar is not None:
    try:
        calendar = pd.read_csv(fp_calendar, usecols=[c for c in ["listing_id","date","available","price"] if c], parse_dates=["date"], low_memory=False)
        print("Loaded calendar shape:", calendar.shape)
    except Exception as e:
        print("Skipping calendar load:", e)

if reviews is not None and "listing_id" in reviews and "id" in listings:
    orphan_r = (~reviews["listing_id"].isin(listings["id"])).mean()
    print("Orphan reviews (%):", round(100*orphan_r,3))

if calendar is not None and "listing_id" in calendar and "id" in listings:
    orphan_c = (~calendar["listing_id"].isin(listings["id"])).mean()
    print("Orphan calendar (%):", round(100*orphan_c,3))


## 7. Target creation: monthly revenue (calendar if present, else availability proxy)

In [None]:

target = None

if calendar is not None and set(["listing_id","date","available"]).issubset(calendar.columns):
    cal = calendar.copy()
    if "price" in cal.columns:
        cal["price_num"] = cal["price"].apply(to_money)
    else:
        cal = cal.merge(listings[["id","price_num"]].rename(columns={"id":"listing_id"}), on="listing_id", how="left")
        cal["price_num"] = cal["price_num"].fillna(cal["price_num"].median())

    cal["booked"] = (cal["available"].astype(str).str.lower() == "f").astype(int)
    end_date = cal["date"].max()
    start_date = end_date - pd.Timedelta(days=29)
    cal_30 = cal[(cal["date"]>=start_date)&(cal["date"]<=end_date)].copy()
    cal_30["rev_day"] = cal_30["booked"] * cal_30["price_num"]
    target = (cal_30.groupby("listing_id")["rev_day"].sum()
              .rename("monthly_revenue").reset_index()
              .rename(columns={"listing_id":"id"}))

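elif {"price_num","availability_30"}.issubset(listings.columns):
    # Proxy: treat every unavailable night in the next 30 days as booked at the listed price.
    # price_num and availability_30 then determine the target exactly, so keeping them
    # as model features leaks the target.
    proxy = listings[["id","price_num","availability_30"]].copy()
    proxy["availability_30"] = proxy["availability_30"].clip(0,30)
    proxy["monthly_revenue"] = proxy["price_num"] * (30 - proxy["availability_30"])
    target = proxy[["id","monthly_revenue"]]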

if target is None:
    raise RuntimeError("Could not construct target. Need calendar or (price_num & availability_30).")

listings = listings.merge(target, on="id", how="left")
listings["y_log"] = np.log1p(listings["monthly_revenue"])

print("Target coverage (non-null y):", (~listings["y_log"].isna()).mean())
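
The calendar-based target can be traced by hand on a tiny example. The sketch below uses a made-up two-listing calendar (all values are assumptions) with the same column names as the cell above, and sums booked nights times price over the last 30 days of the calendar.

```python
import numpy as np
import pandas as pd

# Toy calendar: listing 1 is booked on 2 of 3 days, listing 2 on none (made-up values)
cal = pd.DataFrame({
    "listing_id": [1, 1, 1, 2, 2, 2],
    "date": pd.to_datetime(["2024-01-01", "2024-01-02", "2024-01-03"] * 2),
    "available": ["f", "f", "t", "t", "t", "t"],
    "price_num": [100.0, 120.0, 100.0, 80.0, 80.0, 80.0],
})

# A night counts as booked when it is not available
cal["booked"] = (cal["available"] == "f").astype(int)

# Keep the last 30 days covered by the calendar
end_date = cal["date"].max()
start_date = end_date - pd.Timedelta(days=29)
cal_30 = cal[cal["date"].between(start_date, end_date)].copy()

cal_30["rev_day"] = cal_30["booked"] * cal_30["price_num"]
monthly = cal_30.groupby("listing_id")["rev_day"].sum().rename("monthly_revenue")
y_log = np.log1p(monthly)  # log1p keeps zero-revenue listings finite

print(monthly)
```

Unavailable nights may also be blocked by the host rather than booked, so this quantity is better read as an estimate of revenue than as observed income.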


## 8. Imputation & outlier handling (light-touch)

In [None]:

num_cols = [c for c in [
    "accommodates","bathrooms","bedrooms","beds","price_num","cleaning_fee_num",
    "security_deposit_num","minimum_nights","availability_30","availability_60","availability_365",
    "number_of_reviews","review_scores_rating"
] if c in listings.columns]

for c in ["review_scores_rating"]:
    if c in listings.columns:
        listings[c+"_was_missing"] = listings[c].isna().astype(int)

if "cleaning_fee_num" in listings.columns:
    listings["cleaning_fee_num"] = listings["cleaning_fee_num"].fillna(0)

for c in num_cols:
    if c in listings.columns:
        listings[c] = listings[c].fillna(listings[c].median())

for c in ["price_num","minimum_nights","accommodates","beds","bedrooms","bathrooms","cleaning_fee_num","number_of_reviews"]:
    if c in listings.columns:
        listings[c] = winsorize(listings[c], 0.01)


## 9. Amenities & lightweight text features

In [None]:

top_amenities = []
if "amenities" in listings.columns:
    def split_amen(s):
        if pd.isna(s): return []
        s = re.sub(r'[\{\}\[\]\"]','', str(s))
        return [t.strip().lower() for t in s.split(",") if t.strip()]

    amen_lists = listings["amenities"].map(split_amen)
    top_amenities = pd.Series(Counter([a for lst in amen_lists for a in lst])).sort_values(ascending=False).head(30).index.tolist()
    for a in top_amenities:
        col = "amen__" + re.sub('[^a-z0-9]+','_',a)
        listings[col] = amen_lists.apply(lambda lst: int(a in lst))

for c in ["name","description"]:
    if c in listings.columns:
        listings[f"{c}_len_chars"] = listings[c].astype(str).str.len()
        listings[f"{c}_len_words"] = listings[c].astype(str).str.split().map(len)
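
As a quick check of the amenity parsing, the sketch below runs the same splitting and column-naming logic on three made-up amenity strings (the strings are assumptions chosen to cover brace-style, bracket-style, and missing values).

```python
import re
from collections import Counter
import pandas as pd

def split_amen(s):
    """Strip braces, brackets and quotes, then split on commas."""
    if pd.isna(s):
        return []
    s = re.sub(r'[\{\}\[\]\"]', '', str(s))
    return [t.strip().lower() for t in s.split(",") if t.strip()]

# Made-up amenity strings in two encodings, plus a missing value
raw = pd.Series(['{"Wifi","Air conditioning",Kitchen}', '["Wifi", "Washer"]', None])
amen_lists = raw.map(split_amen)

# Rank amenities by frequency and keep the two most common
counts = Counter(a for lst in amen_lists for a in lst)
top = [a for a, _ in counts.most_common(2)]

# One 0/1 column per kept amenity, with a sanitized column name
dummies = pd.DataFrame({
    "amen__" + re.sub('[^a-z0-9]+', '_', a): amen_lists.apply(lambda lst, a=a: int(a in lst))
    for a in top
})
print(amen_lists.tolist())
print(dummies)
```

Binding `a=a` in the lambda is a defensive habit when building functions in a loop; the original cell is safe without it because each lambda is applied immediately.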


## 10. Geography bins (coarse)

In [None]:

if {"latitude","longitude"}.issubset(listings.columns):
    listings["lat_bin"] = listings["latitude"].round(2).astype(str)
    listings["lon_bin"] = listings["longitude"].round(2).astype(str)


## 11. Build modeling table

In [None]:

features_num = [c for c in [
    "accommodates","bathrooms","bedrooms","beds","price_num","cleaning_fee_num",
    "security_deposit_num","minimum_nights","availability_30","availability_60","availability_365",
    "number_of_reviews","review_scores_rating","name_len_chars","name_len_words",
    "description_len_chars","description_len_words"
] if c in listings.columns]

amen_cols = [c for c in listings.columns if c.startswith("amen__")]
geo_cols  = [c for c in ["lat_bin","lon_bin"] if c in listings.columns]
cat_cols  = [c for c in ["city","neighbourhood","room_type","property_type","instant_bookable","host_is_superhost"] if c in listings.columns] + geo_cols

model_cols = ["id","monthly_revenue","y_log"] + features_num + cat_cols + amen_cols
model_df = listings[[c for c in model_cols if c in listings.columns]].dropna(subset=["y_log"]).reset_index(drop=True)
print("Modeling table shape:", model_df.shape)


## 12. Quick EDA snapshots (lightweight)

In [None]:

eda_df = model_df.sample(min(len(model_df), 150_000), random_state=42)
print("Zero-share (revenue == 0):", (eda_df["monthly_revenue"]<=0).mean())

plt.hist(eda_df["y_log"], bins=50)
plt.title("Distribution of log1p(monthly_revenue)")
plt.xlabel("y_log")
plt.ylabel("count")
plt.show()

num_cols_show = [c for c in ["price_num","accommodates","minimum_nights","beds","bedrooms",
                             "number_of_reviews","review_scores_rating","cleaning_fee_num",
                             "availability_30","availability_60","availability_365"]
                 if c in eda_df.columns]
corrs = eda_df[num_cols_show + ["y_log"]].corr(numeric_only=True)["y_log"].sort_values(ascending=False)
display(corrs.to_frame("corr_with_ylog").head(15))


## 13. Optional: multi-dimensional outlier flag

In [None]:

num_probe = [c for c in ["price_num","accommodates","minimum_nights","beds","bedrooms","bathrooms",
                         "cleaning_fee_num","number_of_reviews","availability_30","availability_60","availability_365"]
             if c in model_df.columns]
try:
    sub = model_df.sample(min(30000, len(model_df)), random_state=42)
    X_iso = sub[[c for c in num_probe if model_df[c].dtype.kind in "fi"]].fillna(0)
    iso = IsolationForest(n_estimators=100, contamination=0.01, random_state=42)
    sub["iso_flag"] = (iso.fit_predict(X_iso) == -1).astype(int)
    iso_map = dict(zip(sub["id"], sub["iso_flag"]))
    model_df["is_outlier_modelspace"] = model_df["id"].map(iso_map).fillna(0).astype(int)
except Exception as e:
    print("Skipping IsolationForest:", e)
    model_df["is_outlier_modelspace"] = 0


## 14. Feature selection (filter → MI → permutation)

In [None]:

# Low-variance amenity dummies
low_var_drop = []
for c in [col for col in model_df.columns if col.startswith("amen__")]:
    p = model_df[c].mean()
    if p < 0.001 or p > 0.999:
        low_var_drop.append(c)

model_df = model_df.drop(columns=low_var_drop)
amen_cols = [c for c in model_df.columns if c.startswith("amen__")]

mi_cols = [c for c in model_df.columns if (c in num_probe) or (c in amen_cols) or (c=="is_outlier_modelspace")]
X_mi = model_df[mi_cols].fillna(0).astype(float)
y_mi = model_df["y_log"].values
mi = mutual_info_regression(X_mi, y_mi, random_state=42)
mi_rank = pd.DataFrame({"feature": mi_cols, "mi": mi}).sort_values("mi", ascending=False)
display(mi_rank.head(25))

keep_mi = set(mi_rank.head(50)["feature"])

cat_cols  = [c for c in ["city","neighbourhood","room_type","property_type","instant_bookable","host_is_superhost","lat_bin","lon_bin"] if c in model_df.columns]
num_cols_final = sorted(set(num_probe + ["description_len_chars","description_len_words","name_len_chars","name_len_words","is_outlier_modelspace"]).intersection(model_df.columns))

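# Keep numeric features that pass the MI filter; features that were never MI-scored (e.g., text lengths) are kept as-is
FINAL_NUM = [c for c in num_cols_final if (c in keep_mi) or (c not in mi_cols)]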
FINAL_CAT = cat_cols
top_amen_by_mi = mi_rank[mi_rank["feature"].isin(amen_cols)].head(20)["feature"].tolist()
FINAL_BIN = top_amen_by_mi

FINAL_FEATURES = FINAL_NUM + FINAL_CAT + FINAL_BIN
print("Final features (#):", len(FINAL_FEATURES))
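
To illustrate why mutual information is used as a filter alongside plain correlations, here is a minimal sketch on synthetic data (the variables `x_signal` and `x_noise` are hypothetical). The target depends on one feature through a U-shaped relationship that a linear correlation largely misses.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_regression

rng = np.random.RandomState(0)
n = 2000
x_signal = rng.uniform(-2, 2, n)   # drives the target non-linearly
x_noise = rng.normal(size=n)       # unrelated to the target
y = x_signal ** 2 + 0.1 * rng.normal(size=n)

X = np.column_stack([x_signal, x_noise])
mi = mutual_info_regression(X, y, random_state=0)

# Pearson correlation is close to zero for a symmetric U-shaped dependence
corr = np.corrcoef(x_signal, y)[0, 1]
print("MI  :", dict(zip(["x_signal", "x_noise"], mi.round(3))))
print("corr:", round(float(corr), 3))
```

Mutual information scores each feature on its own, so it does not account for redundancy between features; the permutation step later in the notebook complements it.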


## 15. Clustering for segments (MiniBatchKMeans)

In [None]:

basic_amen = ["amen__wifi","amen__kitchen","amen__washer","amen__air_conditioning","amen__heating","amen__tv","amen__self_check_in"]
for a in basic_amen:
    if a not in model_df.columns: model_df[a] = 0
model_df["amen_count_basic"] = model_df[basic_amen].sum(axis=1)

cluster_num = [c for c in [
    "price_num","accommodates","bedrooms","bathrooms","minimum_nights",
    "availability_30","number_of_reviews","review_scores_rating",
    "cleaning_fee_num","amen_count_basic"
] if c in model_df.columns]

from sklearn.preprocessing import StandardScaler
Xc = model_df[cluster_num].copy().fillna(model_df[cluster_num].median())
scaler_c = StandardScaler()
Z = scaler_c.fit_transform(Xc)

from sklearn.metrics import silhouette_score
idx = rnd.choice(len(Z), size=min(50000, len(Z)), replace=False)
Z_sub = Z[idx]

cand_K = [3,4,5,6,8,10]
sil = {}; inertia = {}
for k in cand_K:
    km = MiniBatchKMeans(n_clusters=k, random_state=42, batch_size=2048)
    km.fit(Z_sub)
    inertia[k] = km.inertia_
    s_idx = rnd.choice(len(Z_sub), size=min(15000, len(Z_sub)), replace=False)
    sil[k] = silhouette_score(Z_sub[s_idx], km.predict(Z_sub[s_idx]))

best_k = max(sil, key=sil.get)
print("Silhouette by K:", sil)
print("Chosen K:", best_k)

kmeans = MiniBatchKMeans(n_clusters=best_k, random_state=42, batch_size=4096)
model_df["cluster_id"] = kmeans.fit_predict(Z)

def profile_clusters(df, by="cluster_id"):
    num_stats = (df.groupby(by)[cluster_num + ["monthly_revenue","y_log"]]
                   .median()
                   .add_prefix("med_"))
    prop_room = (pd.crosstab(df[by], df.get("room_type","Unknown"), normalize="index")
                   .add_prefix("prop_room__"))
    n = df.groupby(by).size().to_frame("n")
    return pd.concat([n, num_stats, prop_room], axis=1).sort_values("med_monthly_revenue", ascending=False)

cluster_profile = profile_clusters(model_df)
display(cluster_profile)
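
The K-selection loop can be sanity-checked on data where the right answer is known. This sketch uses three well-separated synthetic blobs (made-up data, not Airbnb listings) and picks K by silhouette score, mirroring the logic above at a smaller scale.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Three well-separated synthetic blobs in two dimensions
centers = [(-10, -10), (0, 10), (10, -10)]
Z, _ = make_blobs(n_samples=1500, centers=centers, cluster_std=1.0, random_state=0)

sil = {}
for k in [2, 3, 4, 5]:
    km = MiniBatchKMeans(n_clusters=k, random_state=0, batch_size=256, n_init=3)
    labels = km.fit_predict(Z)
    sil[k] = silhouette_score(Z, labels)

best_k = max(sil, key=sil.get)
print({k: round(float(v), 3) for k, v in sil.items()})
print("Chosen K:", best_k)
```

On real listings the silhouette curve is usually much flatter than on blobs, so it is worth reading the cluster profiles before trusting the chosen K.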


## 16. Preprocessors (linear vs tree-friendly)

In [None]:

pre_linear = ColumnTransformer([
    ("num", StandardScaler(), [c for c in FINAL_NUM + FINAL_BIN if c in model_df.columns]),
    ("cat", OneHotEncoder(handle_unknown="ignore", min_frequency=50), [c for c in FINAL_CAT if c in model_df.columns]),
], remainder="drop")

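# HistGradientBoostingRegressor does not accept sparse input, so sparse_threshold=0
# forces a dense array here (this costs memory when there are many one-hot columns).
pre_tree = ColumnTransformer([
    ("num", "passthrough", [c for c in FINAL_NUM + FINAL_BIN if c in model_df.columns]),
    ("cat", OneHotEncoder(handle_unknown="ignore", min_frequency=50), [c for c in FINAL_CAT if c in model_df.columns]),
], remainder="drop", sparse_threshold=0)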


## 17. Modeling & evaluation (baselines → Ridge → HGB → RF)

In [None]:

target = "y_log"
groups = model_df["city"].astype(str) if "city" in model_df.columns else None

X_cols = [c for c in FINAL_FEATURES if c in model_df.columns]
X = model_df[X_cols]; y = model_df[target]
ids = model_df["id"]

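def eval_metrics(y_true_log, y_pred_log, y_true_nat):
    """Dollar-scale RMSE/MAE after back-transforming with expm1, plus R2 on the log scale."""
    y_pred_nat = np.expm1(y_pred_log)
    # np.sqrt(MSE) instead of squared=False, which recent scikit-learn releases no longer accept
    rmse = np.sqrt(mean_squared_error(y_true_nat, y_pred_nat))
    mae  = mean_absolute_error(y_true_nat, y_pred_nat)
    r2   = r2_score(y_true_log, y_pred_log)
    return {"RMSE_$": rmse, "MAE_$": mae, "R2_log": r2}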

gkf = GroupKFold(n_splits=5) if groups is not None else KFold(n_splits=5, shuffle=True, random_state=42)
results = []

# Baselines
for fold, (tr, va) in enumerate(gkf.split(X, y, groups=groups)):
    y_tr, y_va = y.iloc[tr], y.iloc[va]
    y_va_nat = np.expm1(y_va)

    pred_null = np.full_like(y_va, y_tr.mean(), dtype=float)
    res_null = eval_metrics(y_va, pred_null, y_va_nat)
    results.append({"model":"NULL_global","fold":fold, **res_null})

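    if groups is not None:
        # With GroupKFold on city, validation cities are absent from the training fold,
        # so the map yields NaN and this baseline falls back to the global training mean.
        city_tr_mean = y.iloc[tr].groupby(groups.iloc[tr]).mean()
        pred_city = groups.iloc[va].map(city_tr_mean).fillna(y_tr.mean()).values
        res_city = eval_metrics(y_va, pred_city, y_va_nat)
        results.append({"model":"BASE_city_mean","fold":fold, **res_city})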

# Ridge
ridge = Pipeline([("pre", pre_linear), ("est", Ridge(alpha=1.0, random_state=42))])
for fold, (tr, va) in enumerate(gkf.split(X, y, groups=groups)):
    ridge.fit(X.iloc[tr], y.iloc[tr])
    pred = ridge.predict(X.iloc[va])
    res = eval_metrics(y.iloc[va], pred, np.expm1(y.iloc[va]))
    results.append({"model":"Ridge(a=1.0)","fold":fold, **res})

# HGB
hgb = Pipeline([("pre", pre_tree), ("est", HistGradientBoostingRegressor(
    learning_rate=0.08, max_leaf_nodes=31, random_state=42))])
for fold, (tr, va) in enumerate(gkf.split(X, y, groups=groups)):
    hgb.fit(X.iloc[tr], y.iloc[tr])
    pred = hgb.predict(X.iloc[va])
    res = eval_metrics(y.iloc[va], pred, np.expm1(y.iloc[va]))
    results.append({"model":"HGB","fold":fold, **res})

# RF
rf = Pipeline([("pre", pre_tree), ("est", RandomForestRegressor(
    n_estimators=200, max_depth=18, min_samples_leaf=5, n_jobs=-1, random_state=42))])
for fold, (tr, va) in enumerate(gkf.split(X, y, groups=groups)):
    rf.fit(X.iloc[tr], y.iloc[tr])
    pred = rf.predict(X.iloc[va])
    res = eval_metrics(y.iloc[va], pred, np.expm1(y.iloc[va]))
    results.append({"model":"RF","fold":fold, **res})

res_df = pd.DataFrame(results)
display(res_df.groupby("model")[["RMSE_$","MAE_$","R2_log"]].agg(["mean","std"]).sort_values(("RMSE_$","mean")))
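
Grouping the folds by city changes what the scores mean: each validation fold contains only cities the model has never seen. The sketch below shows this on twelve made-up rows in four hypothetical cities, and also shows what happens to a per-city mean baseline under that split.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import GroupKFold

# Made-up log-revenue values for listings in four hypothetical cities
groups = pd.Series(["paris"] * 3 + ["rome"] * 3 + ["sydney"] * 3 + ["bangkok"] * 3)
y = pd.Series(np.arange(12, dtype=float))
X = pd.DataFrame({"x": np.arange(12)})

overlaps = []
for fold, (tr, va) in enumerate(GroupKFold(n_splits=4).split(X, y, groups=groups)):
    train_cities = set(groups.iloc[tr])
    valid_cities = set(groups.iloc[va])
    overlaps.append(train_cities & valid_cities)

    # A per-city mean learned on the training fold has no entry for held-out cities,
    # so after fillna it equals the global training mean
    city_mean = y.iloc[tr].groupby(groups.iloc[tr]).mean()
    pred_city = groups.iloc[va].map(city_mean).fillna(y.iloc[tr].mean())
    print(fold, sorted(valid_cities), pred_city.unique())
```

These scores therefore estimate performance on a new city. If the goal is to score new listings in cities already covered, a plain `KFold` (or grouping by host) would be the more relevant split.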


## 18. Out-of-fold predictions & by-city report

In [None]:

# OOF with HGB
oof_pred = np.zeros(len(model_df))
for tr, va in gkf.split(X, y, groups=groups):
    hgb.fit(X.iloc[tr], y.iloc[tr])
    oof_pred[va] = hgb.predict(X.iloc[va])

oof = pd.DataFrame({
    "id": ids,
    "city": groups if groups is not None else "ALL",
    "y_log_true": y,
    "y_log_pred": oof_pred,
    "rev_true": np.expm1(y),
    "rev_pred": np.expm1(oof_pred),
})
city_report = (oof.groupby("city")
                  .apply(lambda g: pd.Series({
                      "RMSE_$": np.sqrt(mean_squared_error(g["rev_true"], g["rev_pred"])),
                      "MAE_$": mean_absolute_error(g["rev_true"], g["rev_pred"]),
                      "R2_log": r2_score(g["y_log_true"], g["y_log_pred"]),
                      "n": len(g)
                  }))
                  .sort_values("RMSE_$"))
display(city_report.head(20))

oof["resid_log"] = oof["y_log_true"] - oof["y_log_pred"]
print("Mean residual (log-scale):", oof["resid_log"].mean())
plt.hist(oof["resid_log"], bins=50); plt.title("Residuals (log-scale)"); plt.show()


## 19. Permutation importance (snapshot)

In [None]:

val_idx = rnd.choice(len(X), size=min(20000, len(X)), replace=False)
train_mask = ~np.isin(np.arange(len(X)), val_idx)
hgb.fit(X.iloc[train_mask], y.iloc[train_mask])
pi = permutation_importance(hgb, X.iloc[val_idx], y.iloc[val_idx], n_repeats=5, random_state=42)
# Importances are computed per raw input column of the pipeline, not per one-hot output column
perm_df = pd.DataFrame({"feature": X.columns, "pi": pi.importances_mean}).sort_values("pi", ascending=False).head(30)
display(perm_df)


## 20. Persist artifacts (optional)

In [None]:

from joblib import dump
ARTIFACTS = Path("/content/artifacts"); ARTIFACTS.mkdir(exist_ok=True, parents=True)

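# Refit on all rows first: hgb was last trained on the permutation-importance split only
hgb.fit(X, y)
dump(hgb, ARTIFACTS/"final_hgb_model.joblib")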
dump(pre_linear, ARTIFACTS/"pre_linear.joblib")
dump(pre_tree, ARTIFACTS/"pre_tree.joblib")
oof.to_parquet(ARTIFACTS/"oof_predictions.parquet")
city_report.to_csv(ARTIFACTS/"city_report.csv")

print("Saved artifacts to:", ARTIFACTS)


## 21. Executive summary & recommendations

**Key findings (confirm with your run):**
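- **Availability** (e.g., `availability_30`) is a strong negative driver of revenue. When the target comes from the availability proxy in Section 7, this relationship is built into the target by construction, so treat it as a sanity check rather than a finding.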
- **City** and **room_type** show large fixed effects; keep as categoricals.
- **Accommodates/bedrooms** help with diminishing returns; **price** is nonlinear.
- **Amenities** such as wifi, AC, washer, and self check-in tend to be associated with higher revenue, so adding them where they are missing is often worthwhile.
- Unsupervised **clusters** yield actionable segments for targeted pricing/amenity strategies.

**Recommendations:**
1. For high-availability/low-revenue listings, test targeted **price reductions** and add **must-have amenities**.
2. Lower **minimum_nights** to unlock weekend/short-stay demand where appropriate.
3. Use **cluster_id** and city-level reports to prioritize experimentation.
4. Extend with **text features** (TF‑IDF on descriptions/reviews) and/or dedicated **gradient boosting libraries** (e.g., XGBoost, LightGBM) if compute allows.

**Limitations & next steps:** The models ignore seasonality and event effects, prices are not normalized across currencies, and many listings have zero revenue (zero-inflation), for which a two-stage model is worth considering.
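
One possible shape for the two-stage idea is sketched below on synthetic data. This is an assumption about how such a model could be set up, not something fitted in this notebook: a classifier estimates the probability of any revenue, a regressor models log-revenue on the positive rows, and the two are multiplied.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression, Ridge

rng = np.random.RandomState(0)
n = 1000
X = rng.normal(size=(n, 3))                           # made-up features
has_rev = rng.rand(n) < 1 / (1 + np.exp(-X[:, 0]))    # whether a listing earns anything
log_rev = 6 + 0.5 * X[:, 1] + 0.2 * rng.normal(size=n)
revenue = np.where(has_rev, np.expm1(log_rev), 0.0)

# Stage 1: probability that revenue is positive
clf = LogisticRegression().fit(X, revenue > 0)

# Stage 2: log-revenue model fitted on positive-revenue rows only
pos = revenue > 0
reg = Ridge(alpha=1.0).fit(X[pos], np.log1p(revenue[pos]))

# Combine: P(revenue > 0) times the back-transformed conditional prediction
p_pos = clf.predict_proba(X)[:, 1]
pred = p_pos * np.expm1(reg.predict(X))

print("zero share:", round(float((revenue == 0).mean()), 3))
print("mean predicted vs. actual:", round(float(pred.mean()), 1), round(float(revenue.mean()), 1))
```

Back-transforming a log-scale prediction with `expm1` targets something closer to the conditional median than the mean, so the combined prediction will tend to understate average revenue unless a correction is applied.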