In [None]:
1) SQL data preparation
- Built base view with consistent hotelid (text), price, rating, city, and amenities.
- Created price tier per city via quartiles (Budget/Mid/Luxury).
- Computed city landmark features (near_poi_share within 5 km, min_poi_km).
- Aggregated reviews to review_count only (no average rating for Option B).
- Assembled final training view with target = HotelData.rating and features = price, price_tier, city, amenity flags, review_count, near_poi_share, min_poi_km.
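
The per-city quartile tiering in step 1 was done in SQL, which is not shown here. As a rough pandas sketch of the same idea, the snippet below assumes Budget means at or below the city's first quartile, Luxury means at or above the third quartile, and Mid is everything in between. Those cutoffs, the helper name add_price_tier, and the demo data are assumptions, not the original view's definition.

```python
import pandas as pd

def add_price_tier(df: pd.DataFrame) -> pd.DataFrame:
    """Label each row Budget/Mid/Luxury from its city's price quartiles (assumed cutoffs)."""
    by_city = df.groupby("city")["price"]
    q1 = by_city.transform(lambda s: s.quantile(0.25))  # per-city 25th percentile
    q3 = by_city.transform(lambda s: s.quantile(0.75))  # per-city 75th percentile
    tier = pd.Series("Mid", index=df.index)
    tier = tier.mask(df["price"] <= q1, "Budget")
    tier = tier.mask(df["price"] >= q3, "Luxury")
    return df.assign(price_tier=tier)

# Hypothetical demo data: one city with five listings
demo = pd.DataFrame({"city": ["a"] * 5, "price": [100, 200, 300, 400, 500]})
tiered = add_price_tier(demo)
print(tiered)
```

With a single listing in a city both quartiles coincide, so the row ends up Luxury under this sketch; a real view would likely need a minimum-count rule.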
2) Data loading in Python
- Connected to PostgreSQL and loaded the final feature view into a pandas DataFrame.
- Split into train/test sets for evaluation.
3) Modeling pipeline
- Preprocessing with ColumnTransformer:
  - Numeric: StandardScaler (with_mean=False) for numeric columns.
  - Categorical: OneHotEncoder(handle_unknown="ignore") for city and price_tier.
- Model: RandomForestRegressor inside a Pipeline.
4) Evaluation
- Trained the pipeline and computed metrics: RMSE and R² on the test set.
- Trained a simple baseline (price-only LinearRegression) for sanity check.
5) Troubleshooting handled
- Fixed SQL identifier/type mismatches (hotelid as text; rating cast to float).
- Resolved cast errors for text ratings in reviews; cleaned numeric strings where needed.
- Corrected Python f-string credentials and sklearn version differences (RMSE computation).
- Fixed permutation importance by running it on the transformed matrix and aligning feature names.
6) Model interpretation
- Computed permutation importance on transformed features to rank top drivers among numeric and one-hot categorical levels.
7) Artifacts and inference
- Saved the trained pipeline to model_rating_rf.pkl.
- Saved the feature schema (numeric/categorical column lists) to model_schema.pkl.
- Implemented an inference helper that:
  - Ensures missing columns are added with safe defaults.
  - Produces rating predictions for new batches.
8) Current outcome
- End-to-end workflow is complete: data prepared in SQL, model trained and evaluated, importances computed, artifacts stored, and batch inference verified.
9) Optional next steps
- Add K-fold cross-validation and small hyperparameter search (n_estimators, max_depth, min_samples_split).
- Create a metrics JSON and top-20 features CSV for reporting.
- Package an API endpoint (FastAPI/Flask) or CLI for batch scoring.
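
Step 9 proposes K-fold cross-validation with a small hyperparameter search. A minimal sketch of that idea follows, using synthetic stand-in data (an assumption) instead of the hotel features, and deliberately small grid values so it runs quickly. When searching over the full Pipeline instead of the bare regressor, the grid keys would need the step prefix, for example model__n_estimators.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, KFold

# Synthetic stand-in data (assumption): 200 rows, 3 numeric features
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(200, 3))
y_demo = 2.0 * X_demo[:, 0] + rng.normal(scale=0.1, size=200)

# Small illustrative grid over the three parameters named in step 9
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [None, 5],
    "min_samples_split": [2, 4],
}

search = GridSearchCV(
    RandomForestRegressor(random_state=42),
    param_grid,
    cv=KFold(n_splits=5, shuffle=True, random_state=42),
    scoring="neg_root_mean_squared_error",  # scores are negated RMSE
)
search.fit(X_demo, y_demo)

cv_rmse = -search.best_score_  # flip the sign back to a plain RMSE
print(search.best_params_, round(cv_rmse, 3))
```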


In [3]:
!pip install sqlalchemy psycopg2-binary

from sqlalchemy import create_engine
import pandas as pd



In [1]:
!pip install sqlalchemy psycopg2-binary scikit-learn pandas numpy joblib

import pandas as pd
import numpy as np
from sqlalchemy import create_engine
from sklearn.model_selection import train_test_split
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
import joblib



In [5]:
y = df["y_rating"].astype(float)

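# Option B: reviews contribute review_count only, so avg_review_rating is not
# in ml_rating_features_b and must not be requested here
num_features = [
    "price", "review_count",
    "near_poi_share", "min_poi_km"
]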
cat_features = ["city", "price_tier"]

# Binary flags are already numeric; add them to numeric block
binary_flags = ["has_wifi", "has_breakfast", "has_pool", "has_parking", "has_ac"]
num_features_all = num_features + binary_flags

X = df[num_features_all + cat_features].copy()


In [42]:
# Load the dataset from PostgreSQL
USER = "postgres"      
PWD  = "1234"
HOST = "localhost"
PORT = 5432
DB   = "goibibo"

engine = create_engine(f"postgresql+psycopg2://{USER}:{PWD}@{HOST}:{PORT}/{DB}")

df = pd.read_sql("SELECT * FROM ml_rating_features_b", con=engine)
print(df.shape)
df.head()


(8040, 13)


Unnamed: 0,hotelid,y_rating,price,city,price_tier,has_wifi,has_breakfast,has_pool,has_parking,has_ac,review_count,near_poi_share,min_poi_km
0,628752c1d04899399ca38ad5,4.3,1785.0,deoghar,Mid,1,0,0,1,1,0,0.917431,0.072
1,628752c1d04899399ca3919e,4.2,6237.0,lonavala,Luxury,1,0,1,1,1,0,0.736715,0.001
2,628752bfd04899399ca37cf2,4.1,1357.0,pondicherry,Mid,1,0,1,1,1,0,0.695652,1.1
3,628752c1d04899399ca38915,3.9,4081.0,darjeeling,Luxury,0,0,0,0,0,0,0.761878,0.001
4,628752c0d04899399ca3874b,4.4,935.0,madikeri,Budget,0,0,0,1,0,0,0.885612,0.009


In [20]:
df = pd.read_sql("SELECT * FROM ml_rating_features_v2", con=engine)


In [6]:
#Train/test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)


In [8]:
#Build preprocessing + model pipeline

numeric_transformer = Pipeline(steps=[
    ("scaler", StandardScaler(with_mean=False))
])

categorical_transformer = Pipeline(steps=[
    ("onehot", OneHotEncoder(handle_unknown="ignore"))
])

preprocess = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, num_features_all),
        ("cat", categorical_transformer, cat_features)
    ]
)

model = RandomForestRegressor(
    n_estimators=400,
    max_depth=None,
    min_samples_split=4,
    random_state=42,
    n_jobs=-1
)

pipe = Pipeline(steps=[("preprocess", preprocess), ("model", model)])


In [13]:
#Train and evaluate
from sklearn.metrics import mean_squared_error, r2_score
pipe.fit(X_train, y_train)  # fit preprocessing + model on the training split
preds = pipe.predict(X_test)

mse = mean_squared_error(y_test, preds)
rmse = mse ** 0.5
r2   = r2_score(y_test, preds)
print({"RMSE": rmse, "R2": r2})




{'RMSE': 0.7717361295083767, 'R2': -0.003763453482726442}
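
The troubleshooting notes mention scikit-learn version differences in the RMSE computation; the squared=False shortcut of mean_squared_error is not available in every release, which is why the cell above takes the square root by hand. The sketch below is a NumPy-only version of both metrics (helper names rmse and r2 are hypothetical). It also shows that a constant mean predictor scores R² = 0, which is the reference point for reading the slightly negative R² printed above.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error, independent of the scikit-learn version."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)

# Hypothetical ratings and a constant predictor equal to their mean
y_demo = [4.0, 3.0, 5.0]
mean_pred = [4.0, 4.0, 4.0]
print(rmse(y_demo, mean_pred), r2(y_demo, mean_pred))
```

An R² at or below zero means the model does no better on the test split than predicting the mean rating for every hotel.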


In [15]:
!pip install -U scikit-learn



In [18]:
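#Simple baseline for sanity: price-only LinearRegression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

base_model = LinearRegression()
base_model.fit(X_train[["price"]], y_train)
base_preds = base_model.predict(X_test[["price"]])
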
mse_base = mean_squared_error(y_test, base_preds)
rmse_base = mse_base ** 0.5
r2_base = r2_score(y_test, base_preds)
print({"Baseline_RMSE": rmse_base, "Baseline_R2": r2_base})



{'Baseline_RMSE': 0.7586202574063992, 'Baseline_R2': 0.03006510309764976}


In [36]:
pre = pipe.named_steps["preprocess"]
Xt_test = pre.transform(X_test)

# Get transformed feature names robustly
if hasattr(pre, "get_feature_names_out"):
    feature_names = pre.get_feature_names_out()
else:
    # manual fallback
    tx = {n: (est, cols) for n, est, cols in pre.transformers_ if n != "remainder"}
    num_names = list(tx["num"][1])
    ohe = tx["cat"][0].named_steps["onehot"]
    cat_levels = list(ohe.get_feature_names_out(tx["cat"][1]))
    feature_names = np.array(num_names + cat_levels, dtype=object)


In [38]:
#Feature importance (permutation)
from sklearn.inspection import permutation_importance
import numpy as np
import pandas as pd

pre = pipe.named_steps["preprocess"]
Xt_test = pre.transform(X_test)

# Dense matrix for permutation_importance
Xt_test_dense = Xt_test.toarray() if hasattr(Xt_test, "toarray") else Xt_test

# Get feature names
if hasattr(pre, "get_feature_names_out"):
    feature_names = pre.get_feature_names_out()
else:
    tx = {n: (est, cols) for n, est, cols in pre.transformers_ if n != "remainder"}
    num_names = list(tx["num"][1])
    ohe = tx["cat"][0].named_steps["onehot"]
    cat_levels = list(ohe.get_feature_names_out(tx["cat"][1]))
    feature_names = np.array(num_names + cat_levels, dtype=object)

est = pipe.named_steps["model"]

r = permutation_importance(
    est, Xt_test_dense, y_test, n_repeats=5, random_state=42, n_jobs=-1
)

print(len(r.importances_mean), len(feature_names))  # should match
importances = pd.Series(r.importances_mean, index=feature_names).sort_values(ascending=False)
importances.head(20)


83 83


num__price                0.092072
cat__city_mumbai          0.015803
num__has_parking          0.008914
cat__city_auli            0.005787
cat__city_ranchi          0.005614
num__has_pool             0.004296
cat__price_tier_Budget    0.003756
cat__city_manali          0.003335
num__has_wifi             0.002721
cat__city_madikeri        0.002350
cat__city_almora          0.002216
cat__city_munnar          0.001662
cat__city_pushkar         0.001603
cat__city_mathura         0.001321
cat__city_abu             0.001080
cat__city_dhanaulti       0.001075
cat__city_ajmer           0.000941
cat__city_gangtok         0.000739
cat__city_khajuraho       0.000288
cat__city_hampi           0.000162
dtype: float64
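
Because the importances above are reported per transformed column, the city signal is spread over many one-hot levels. One rough way to compare original columns is to sum the levels back together, sketched below with a hypothetical helper group_importances and a few values copied from the output above. Summing per-level permutation importances is only an approximation of permuting the original column as a whole.

```python
import pandas as pd

def group_importances(importances: pd.Series, cat_features: list) -> pd.Series:
    """Sum transformed-feature importances back onto their source columns."""
    def base_name(name: str) -> str:
        col = name.split("__", 1)[-1]  # strip the "num__" / "cat__" prefix
        for cat in cat_features:
            if col.startswith(cat + "_"):  # one-hot level such as "city_mumbai"
                return cat
        return col
    return importances.groupby(base_name).sum().sort_values(ascending=False)

# A few rows copied from the permutation-importance output above
demo_importances = pd.Series({
    "num__price": 0.092072,
    "cat__city_mumbai": 0.015803,
    "num__has_parking": 0.008914,
    "cat__city_auli": 0.005787,
    "cat__price_tier_Budget": 0.003756,
})
grouped = group_importances(demo_importances, ["city", "price_tier"])
print(grouped)
```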

In [39]:
#Save artifacts

joblib.dump(pipe, "model_rating_rf.pkl")
# Save columns to ensure consistent inference schema later
schema = {
    "num_features": num_features_all,
    "cat_features": cat_features
}
joblib.dump(schema, "model_schema.pkl")


['model_schema.pkl']

In [40]:
# Inference function example
def predict_rating(batch_df: pd.DataFrame):
    mdl = joblib.load("model_rating_rf.pkl")
    sch = joblib.load("model_schema.pkl")
    cols = sch["num_features"] + sch["cat_features"]
    batch_df = batch_df.copy()  # avoid mutating the caller's DataFrame
    # Add any missing columns with safe defaults (0 for numeric, "unknown" for categorical)
    for c in cols:
        if c not in batch_df.columns:
            batch_df[c] = 0 if c in sch["num_features"] else "unknown"
    return mdl.predict(batch_df[cols])

# Example usage
sample = X_test.iloc[:5].copy()
predict_rating(sample)


array([3.85285506, 3.72338726, 4.48290476, 3.92524786, 4.34992143])

In [59]:
#Review Sentiment Classification

#Goal: Predict sentiment category (Positive/Neutral/Negative) from review text (if available)
#Models: Naive Bayes / Logistic Regression, evaluation via accuracy and F1-score

Objective
- Predict sentiment class (Positive/Neutral/Negative) from review_text to support downstream analytics and QA.
Data & Labels
- Built SQL view ml_review_text with columns: review_text and sentiment_label (derived from numeric rating: ≥4.0 Positive, ≤2.5 Negative, else Neutral).
Models Trained
- Pipeline A: TF‑IDF + Logistic Regression (class_weight="balanced", max_iter=200).
- Pipeline B: TF‑IDF + Multinomial Naive Bayes.
Evaluation (held‑out test split)
- Logistic Regression: accuracy ≈ 0.774, weighted F1 ≈ 0.792. Class-wise F1: Negative ≈ 0.78, Neutral ≈ 0.43, Positive ≈ 0.87.
- Multinomial NB: accuracy ≈ 0.818, weighted F1 ≈ 0.784. Class-wise F1: Negative ≈ 0.77, Neutral ≈ 0.22, Positive ≈ 0.90.
Choice
- Selected Logistic Regression as the final model because it keeps a better balance across classes,
  notably a substantially higher Neutral-class F1 than NB (0.43 vs 0.22), while retaining high Positive performance.
Artifacts
- Saved model: review_sentiment_tfidf.pkl (TF‑IDF + Logistic Regression).
- Saved labels: review_sentiment_labels.pkl (class order for consistent inference).
Inference
- predict_sentiment(texts) loads artifacts and returns class predictions (and probabilities if available).
  Verified examples produce sensible outputs (e.g., "helpful staff" → Positive; "dirty sheets" → Negative).
Monitoring & reporting
- Persist classification reports and confusion matrix each run; track per-class F1, especially Neutral.
Deployment
- Wrap predict_sentiment in an API for batch scoring; version artifacts (e.g., v1) and log model metadata (vocab_size, classes).
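
The labelling rule stated under Data & Labels (≥4.0 Positive, ≤2.5 Negative, else Neutral) was applied in the SQL view. A minimal Python equivalent, with a hypothetical function name, might look like this:

```python
def rating_to_sentiment(rating: float) -> str:
    """Map a numeric review rating to the 3-class sentiment label."""
    if rating >= 4.0:
        return "Positive"
    if rating <= 2.5:
        return "Negative"
    return "Neutral"

# Hypothetical ratings covering both thresholds and the middle band
demo_ratings = [5.0, 4.0, 3.9, 3.0, 2.6, 2.5, 1.0]
demo_labels = [rating_to_sentiment(r) for r in demo_ratings]
print(list(zip(demo_ratings, demo_labels)))
```

The Neutral band is narrow (ratings strictly between 2.5 and 4.0), which may partly explain why it is the hardest class in the reports.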


In [45]:
USER = "postgres"        
PWD  = "1234"           
HOST = "localhost"
PORT = 5432
DB   = "goibibo"

conn_str = f"postgresql+psycopg2://{USER}:{PWD}@{HOST}:{PORT}/{DB}"
engine = create_engine(conn_str)

In [46]:
import pandas as pd
from sqlalchemy import create_engine
from sklearn.model_selection import train_test_split

engine = create_engine(conn_str)  # reuse from earlier
df = pd.read_sql("SELECT review_text, sentiment_label FROM ml_review_text", con=engine)

# Drop rows with missing text or label (the view already restricts labels to 3 classes)
df = df.dropna(subset=["review_text", "sentiment_label"])
X_text = df["review_text"].astype(str)
y = df["sentiment_label"].astype("category")

X_train, X_test, y_train, y_test = train_test_split(
    X_text, y, test_size=0.2, random_state=42, stratify=y
)


In [60]:
df = pd.read_sql("SELECT review_text, sentiment_label FROM ml_review_text LIMIT 50", con=engine)
display(df)         
print(df.shape)

Unnamed: 0,review_text,sentiment_label
0,Good and excellent hotel with budget. Very nic...,Positive
1,"nice n neat rooms, good services, not having r...",Positive
2,very nice hotel to stay at any time..nice room...,Positive
3,"nice hotel in vizag,and value of money, friend...",Positive
4,very nice and very comfartable and very good ...,Positive
5,"Good Hotel, very good room neat and clean ,Hop...",Positive
6,srives Vere poor mayenetenas good staf peopl...,Positive
7,Hotel location is good but service is very poo...,Negative
8,Hotel is good is staff are very friendly good ...,Positive
9,"Best Hotel , rooms is very good , very good s...",Positive


(50, 2)


In [None]:
#Step 3 — Build TF‑IDF + model pipelines

#Option A: Logistic Regression (strong baseline).

#Option B: Multinomial Naive Bayes.

In [48]:
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB

# Common vectorizer
tfidf = TfidfVectorizer(
    lowercase=True,
    stop_words="english",
    ngram_range=(1,2),
    max_df=0.9,
    min_df=5
)

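from sklearn.base import clone

# Give each pipeline its own unfitted copy of the vectorizer so that fitting
# one pipeline does not refit the vectorizer used by the other
logreg_pipe = Pipeline([
    ("tfidf", clone(tfidf)),
    ("clf", LogisticRegression(max_iter=200, n_jobs=-1, class_weight="balanced"))
])

nb_pipe = Pipeline([
    ("tfidf", clone(tfidf)),
    ("clf", MultinomialNB())
])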


In [49]:
from sklearn.metrics import accuracy_score, f1_score, classification_report, confusion_matrix

def train_eval(pipe, name):
    pipe.fit(X_train, y_train)
    preds = pipe.predict(X_test)
    acc = accuracy_score(y_test, preds)
    f1w = f1_score(y_test, preds, average="weighted")
    print(name, {"accuracy": acc, "f1_weighted": f1w})
    print(classification_report(y_test, preds))
    return pipe

logreg_model = train_eval(logreg_pipe, "LogisticRegression")
nb_model = train_eval(nb_pipe, "MultinomialNB")


LogisticRegression {'accuracy': 0.7744827948102379, 'f1_weighted': 0.7922559944874592}
              precision    recall  f1-score   support

    Negative       0.77      0.79      0.78     27715
     Neutral       0.35      0.56      0.43     20771
    Positive       0.93      0.81      0.87    103968

    accuracy                           0.77    152454
   macro avg       0.68      0.72      0.69    152454
weighted avg       0.82      0.77      0.79    152454

MultinomialNB {'accuracy': 0.8182664934996786, 'f1_weighted': 0.7843372942296177}
              precision    recall  f1-score   support

    Negative       0.74      0.80      0.77     27715
     Neutral       0.55      0.14      0.22     20771
    Positive       0.85      0.96      0.90    103968

    accuracy                           0.82    152454
   macro avg       0.71      0.63      0.63    152454
weighted avg       0.79      0.82      0.78    152454
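
The macro and weighted averages in the two reports above can be reproduced from the per-class F1 and support values, which makes the model choice easier to see: weighted F1 is dominated by the large Positive class, while macro F1 treats the three classes equally. The sketch below copies the rounded per-class figures from the reports; the helper names are hypothetical.

```python
# Per-class (F1, support) pairs copied from the classification reports above
reports = {
    "LogisticRegression": {
        "Negative": (0.78, 27715), "Neutral": (0.43, 20771), "Positive": (0.87, 103968),
    },
    "MultinomialNB": {
        "Negative": (0.77, 27715), "Neutral": (0.22, 20771), "Positive": (0.90, 103968),
    },
}

def macro_f1(per_class):
    """Unweighted mean of per-class F1: every class counts equally."""
    return sum(f1 for f1, _ in per_class.values()) / len(per_class)

def weighted_f1(per_class):
    """Support-weighted mean of per-class F1: large classes dominate."""
    total = sum(n for _, n in per_class.values())
    return sum(f1 * n for f1, n in per_class.values()) / total

for name, per_class in reports.items():
    print(name, round(macro_f1(per_class), 3), round(weighted_f1(per_class), 3))
```

Small differences from the printed weighted F1 come from using per-class values already rounded to two decimals. On macro F1 Logistic Regression is clearly ahead, even though Naive Bayes has the higher accuracy.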



In [51]:
#Pick the best and persist
import joblib

best_model = logreg_model  # or nb_model if it wins
joblib.dump(best_model, "review_sentiment_tfidf.pkl")

# Save label order for consistent inference
label_order = list(best_model.classes_)
joblib.dump(label_order, "review_sentiment_labels.pkl")


['review_sentiment_labels.pkl']

In [52]:
#Inference helper
import numpy as np

def predict_sentiment(texts):
    mdl = joblib.load("review_sentiment_tfidf.pkl")
    labels = joblib.load("review_sentiment_labels.pkl")
    probs = mdl.predict_proba(texts) if hasattr(mdl, "predict_proba") else None
    preds = mdl.predict(texts)
    return preds, probs, labels

# Example
preds, probs, labels = predict_sentiment(pd.Series([
    "Room was clean and staff were very helpful",
    "Noisy AC, dirty sheets, terrible experience"
]))
print(preds.tolist())


['Positive', 'Negative']


In [None]:
#Emerging Location Clustering

#Goal: Identify clusters of similar-performing cities to guide expansion
#Technique: K-means or hierarchical clustering on listing growth, price level, rating trends

In [61]:
import pandas as pd
from sqlalchemy import create_engine
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

engine = create_engine(conn_str)
city = pd.read_sql("SELECT * FROM ml_city_features", con=engine)

# Keep available numeric features (no dates => no growth/trend)
city2 = city.copy()
city2 = city2.dropna(subset=["price_level"])  # ensure price exists
city2["listings_90d"] = city2["listings_90d"].fillna(0)

X = city2[["listings_90d","price_level"]].astype(float)

scaler = StandardScaler()
Xs = scaler.fit_transform(X)

In [64]:
df = pd.read_sql("SELECT * FROM ml_city_features LIMIT 10;", con=engine)
display(df)         
print(df.shape)

Unnamed: 0,city,listings_90d,listings_prev90d,listing_growth,price_level,rating_trend
0,abu,105,,,2400.0,0.0
1,agra,298,,,1226.5,0.0
2,ajmer,123,,,1297.0,0.0
3,alleppey,156,,,2337.0,0.0
4,almora,25,,,2543.0,0.0
5,auli,43,,,3622.0,0.0
6,bankura,9,,,1837.0,0.0
7,binsar,20,,,3136.5,0.0
8,chandigarh,300,,,1464.0,0.0
9,cherrapunji,7,,,2249.0,0.0


(10, 6)


In [66]:
#Pick k by silhouette and fit
ks = [3,4,5,6]
scores = {}
for k in ks:
    km = KMeans(n_clusters=k, random_state=42, n_init=10)
    labels = km.fit_predict(Xs)
    scores[k] = silhouette_score(Xs, labels)
best_k = max(scores, key=scores.get)
print("silhouette:", scores, "best_k:", best_k)

km = KMeans(n_clusters=best_k, random_state=42, n_init=10)
city2["cluster"] = km.fit_predict(Xs)


silhouette: {3: 0.4595548318325597, 4: 0.393625176279665, 5: 0.4278450652851638, 6: 0.4280500043515687} best_k: 3


In [67]:
#Summarize clusters and export
summary = city2.groupby("cluster")[["listings_90d","price_level"]].median().round(2)
display(summary)

city2.to_csv("city_clusters.csv", index=False)

import joblib
joblib.dump({"scaler": scaler, "kmeans": km, "features": ["listings_90d","price_level"]},
            "city_clustering_kmeans.pkl")


cluster,listings_90d,price_level
0,45.5,2175.5
1,229.5,2104.75
2,6.0,8423.75


['city_clustering_kmeans.pkl']

In [68]:
summary.to_csv("city_cluster_summary.csv")
city2.to_csv("city_clusters.csv", index=False)


In [69]:
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine(conn_str)
city = pd.read_sql("SELECT * FROM ml_city_features", con=engine)

# Filter: ensure growth is defined and enough volume in either window
city = city.dropna(subset=["listing_growth","price_level","rating_trend"])
city = city[(city["listings_90d"] >= 10) | (city["listings_prev90d"] >= 10)]


In [70]:
df = pd.read_sql("SELECT * FROM public.ml_city_features LIMIT 10 ", con=engine)
display(df)         
print(df.shape)

Unnamed: 0,city,listings_90d,listings_prev90d,listing_growth,price_level,rating_trend
0,abu,60,45,0.333333,2409.0,-7.141686e-09
1,agra,166,132,0.257576,1206.5,2.086595e-08
2,ajmer,62,61,0.016393,1233.0,-3.871784e-09
3,alleppey,72,84,-0.142857,2636.0,1.127695e-08
4,almora,17,8,1.125,2543.0,-9.350026e-08
5,auli,18,25,-0.28,3622.0,6.616023e-08
6,bankura,7,2,2.5,1837.0,4.09203e-08
7,binsar,13,7,0.857143,3382.0,1.571005e-08
8,chandigarh,157,143,0.097902,1484.0,-6.180459e-09
9,cherrapunji,3,4,-0.25,2249.0,-7.644415e-09


(10, 6)


In [71]:
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X = city[["listing_growth","price_level","rating_trend"]].astype(float)

scaler = StandardScaler()
Xs = scaler.fit_transform(X)

scores = {}
for k in [3,4,5,6]:
    km = KMeans(n_clusters=k, random_state=42, n_init=10)
    labels = km.fit_predict(Xs)
    scores[k] = silhouette_score(Xs, labels)
best_k = max(scores, key=scores.get)
print("silhouette:", scores, "best_k:", best_k)

km = KMeans(n_clusters=best_k, random_state=42, n_init=10)
city["cluster"] = km.fit_predict(Xs)


silhouette: {3: 0.2863863411537589, 4: 0.27793215328953397, 5: 0.2809281747653934, 6: 0.26948262525259375} best_k: 3


In [72]:
summary = city.groupby("cluster")[["listing_growth","price_level","rating_trend"]].median().round(3)
display(summary)

city.to_csv("city_clusters_v2.csv", index=False)
summary.to_csv("city_cluster_summary_v2.csv")

import joblib
joblib.dump({"scaler": scaler, "kmeans": km, "features": ["listing_growth","price_level","rating_trend"]},
            "city_clustering_kmeans_v2.pkl")

cluster,listing_growth,price_level,rating_trend
0,0.048,1580.0,-0.0
1,-0.1,2636.0,0.0
2,1.125,2543.0,0.0


['city_clustering_kmeans_v2.pkl']
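
The saved artifact is a dict holding the scaler, the fitted K-means model, and the feature list, so scoring a new city means scaling with the stored scaler and then calling predict. The sketch below shows that flow with a hypothetical helper assign_clusters. To stay self-contained it builds a small stand-in artifact in memory from made-up rows instead of loading city_clustering_kmeans_v2.pkl.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

def assign_clusters(artifact: dict, rows: list) -> np.ndarray:
    """Scale new rows with the stored scaler, then predict their cluster labels."""
    X_new = np.array(
        [[row[f] for f in artifact["features"]] for row in rows], dtype=float
    )
    return artifact["kmeans"].predict(artifact["scaler"].transform(X_new))

# Stand-in artifact (assumption): same dict layout as the saved one, toy data
features = ["listing_growth", "price_level", "rating_trend"]
X_fit = np.array([
    [0.0, 1500.0, 0.0],
    [0.1, 1600.0, 0.0],
    [1.0, 2500.0, 0.0],
    [1.2, 2600.0, 0.0],
])
scaler = StandardScaler().fit(X_fit)
km = KMeans(n_clusters=2, random_state=42, n_init=10).fit(scaler.transform(X_fit))
artifact = {"scaler": scaler, "kmeans": km, "features": features}

new_cities = [
    {"listing_growth": 0.05, "price_level": 1550.0, "rating_trend": 0.0},
    {"listing_growth": 1.10, "price_level": 2550.0, "rating_trend": 0.0},
]
labels = assign_clusters(artifact, new_cities)
print(labels)
```

Reading the feature order from the artifact rather than hard-coding it keeps the column order at inference identical to the order used in training.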

In [None]:
Objective
- Identify clusters of similar-performing cities to guide expansion, using three features: listing_growth, price_level, rating_trend.
Data view
- Built SQL view ml_city_features with:
  - listings_90d, listings_prev90d (recent vs prior 90-day counts).
  - listing_growth = (listings_90d - listings_prev90d) / listings_prev90d.
  - price_level = median recent price.
  - rating_trend = monthly rating slope (proxy from temporal ordering).
- Applied support filter in Python: keep cities where max(listings_90d, listings_prev90d) ≥ 10 and drop rows with nulls.
Modeling
- Features used for clustering: ["listing_growth", "price_level", "rating_trend"].
- Standardized features with StandardScaler.
- Chose number of clusters by silhouette over k ∈ {3,4,5,6}; best_k = 3 on this data.
- Trained K-means (random_state=42, n_init=10) and assigned cluster labels to cities.
Results (cluster medians)
- Cluster 0: listing_growth ≈ 0.048, price_level ≈ 1580, rating_trend ≈ 0.
- Cluster 1: listing_growth ≈ -0.100, price_level ≈ 2636, rating_trend ≈ 0.
- Cluster 2: listing_growth ≈ 1.125, price_level ≈ 2543, rating_trend ≈ 0.
- Silhouette scores by k: {3: 0.286, 4: 0.278, 5: 0.281, 6: 0.269}; selected k=3.
Interpretation
- Cluster 2 (high growth, mid–high price): expansion priority; add supply and marketing.
- Cluster 0 (modest growth, low–mid price): selective growth; maintain competitive pricing.
- Cluster 1 (flat/negative growth, high price): cautious investment; focus on quality/price alignment.
Deliverables saved
- City assignments: city_clusters_v2.csv
- Cluster summary (medians): city_cluster_summary_v2.csv
- Model artifacts: city_clustering_kmeans_v2.pkl
- Meta: the feature list is stored inside the model artifact; silhouette scores (v2) were printed during the run, not saved to a file.
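
The listing_growth formula and the support filter described above can be checked by hand against a few rows of the v2 table. The sketch below uses hypothetical helper names; returning None when the prior count is zero or missing is an assumption, since the SQL view's handling of that case is not shown.

```python
def listing_growth(recent, previous):
    """(listings_90d - listings_prev90d) / listings_prev90d, or None if undefined."""
    if not previous:  # None or 0: growth is undefined (assumed handling)
        return None
    return (recent - previous) / previous

def has_support(recent, previous, min_listings=10):
    """Keep a city if either 90-day window has at least min_listings listings."""
    return max(recent or 0, previous or 0) >= min_listings

# (city, listings_90d, listings_prev90d) rows copied from the v2 table
rows = [("abu", 60, 45), ("almora", 17, 8), ("bankura", 7, 2), ("cherrapunji", 3, 4)]
growth = {city: listing_growth(cur, prev) for city, cur, prev in rows}
kept = [city for city, cur, prev in rows if has_support(cur, prev)]
print(growth, kept)
```

Low-volume cities such as bankura show very large growth ratios from tiny counts, which is the reason for applying the support filter before clustering.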

In [None]:
Rating Prediction Model

Built features from price tier, amenities, city/landmark metrics, and review_count; trained a RandomForestRegressor in a Pipeline; evaluated with RMSE and R²; saved the trained pipeline and schema; added batch inference helper.

Review Sentiment Classification

Prepared review_text with 3-class labels; trained TF‑IDF + Logistic Regression and TF‑IDF + Multinomial Naive Bayes; selected Logistic Regression for better class balance; evaluated with accuracy and weighted F1; saved model and labels; added inference helper.

Emerging Location Clustering

Created ml_city_features with listing_growth, price_level, rating_trend; standardized features; chose k by silhouette (best_k = 3); trained K‑means; exported city assignments, cluster medians, and model artifact.