# Price per m² prediction: Apartments 2020 (clean version)

**Objective:** train and evaluate regression models that predict `prix_m2` from well-understood variables (surface area, number of rooms, geolocation, etc.), with a territorial feature, `nb_ventes_commune` (number of sales per commune), computed **without data leakage**.

- Dataset: `../data/prod/df_model_appart_2020.parquet.gz`
- Target: `prix_m2`
- Models: RandomForest, GradientBoosting, LightGBM (if available)


In [1]:
# 📦 Imports & settings
import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split, KFold, cross_val_score
from sklearn.metrics import mean_absolute_error, root_mean_squared_error, r2_score
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor

RANDOM_STATE = 42
DATA_PATH = "../data/prod/df_model_appart_2020.parquet.gz"

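pd.set_option("display.max_columns", None)
# Note: this global format shows every float without decimals, so R², feature
# importances and coordinates appear rounded in the DataFrame/Series displays below.
pd.options.display.float_format = "{:,.0f}".format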

# LightGBM (optional)
try:
    from lightgbm import LGBMRegressor
    HAS_LGBM = True
except Exception as e:
    HAS_LGBM = False
    LGBMRegressor = None
    print("⚠️ LightGBM not available (continuing without it). Details:", repr(e))


⚠️ LightGBM not available (continuing without it). Details: ModuleNotFoundError("No module named 'lightgbm'")


## 1) Data loading + sanity checks

In [2]:
df = pd.read_parquet(DATA_PATH, engine="pyarrow")
print("Shape:", df.shape)
display(df.head(3))
display(df["prix_m2"].describe())


Shape: (190522, 7)


| | surface_reelle_bati | nombre_pieces_principales | latitude | longitude | has_dependance | nom_commune | prix_m2 |
|---|---|---|---|---|---|---|---|
| 0 | 62 | 3 | 46 | 5 | True | Bourg-en-Bresse | 2194 |
| 1 | 47 | 2 | 46 | 5 | True | Saint-Laurent-sur-Saône | 1532 |
| 2 | 46 | 2 | 46 | 5 | False | Bourg-en-Bresse | 1522 |


count   190,522
mean      3,688
std       2,595
min         465
25%       1,950
50%       2,963
75%       4,462
max      14,167
Name: prix_m2, dtype: float64
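
Beyond `shape`, `head` and `describe`, a few more checks are cheap to run at this stage. The following is a minimal sketch on a toy frame; the helper name `sanity_report` and the toy values are illustrative assumptions, not part of the original notebook.

```python
import numpy as np
import pandas as pd


def sanity_report(df: pd.DataFrame, target: str) -> dict:
    """Collect a few cheap data-quality indicators (hypothetical helper)."""
    return {
        "n_rows": len(df),
        "n_duplicates": int(df.duplicated().sum()),
        "n_missing_target": int(df[target].isna().sum()),
        "n_non_positive_target": int((df[target] <= 0).sum()),
        "missing_by_column": df.isna().sum().to_dict(),
    }


# Toy data standing in for the real parquet file (values are made up).
toy = pd.DataFrame(
    {
        "surface_reelle_bati": [62.0, 47.0, 47.0, 30.0],
        "nom_commune": ["Bourg-en-Bresse", "Lyon", "Lyon", None],
        "prix_m2": [2194.0, 1532.0, 1532.0, np.nan],
    }
)

report = sanity_report(toy, target="prix_m2")
print(report)
```

On the real data, the same call would be `sanity_report(df, target="prix_m2")`, run before the cleaning step so that dropped rows can be accounted for.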

## 2) Preparation: target, features, minimal cleaning
We do **the bare minimum** here: drop rows whose target is missing or non-positive.

In [3]:
# --- Minimal cleaning ---
df = df.dropna(subset=["prix_m2"]).copy()
df = df[df["prix_m2"] > 0].copy()

# --- Boolean encoding (as in your notebook) ---
if "has_dependance" in df.columns:
    df["has_dependance"] = df["has_dependance"].astype(int)

# --- Define base features and target (same names as in your notebook) ---
FEATURES_BASE = [
    "surface_reelle_bati",
    "nombre_pieces_principales",
    "latitude",
    "longitude",
    "has_dependance",
]
TARGET = "prix_m2"

# Keep the commune name for feature engineering (and possibly to encode it later)
X = df[FEATURES_BASE + ["nom_commune"]].copy()
y = df[TARGET].copy()

display(X.head(3))


| | surface_reelle_bati | nombre_pieces_principales | latitude | longitude | has_dependance | nom_commune |
|---|---|---|---|---|---|---|
| 0 | 62 | 3 | 46 | 5 | 1 | Bourg-en-Bresse |
| 1 | 47 | 2 | 46 | 5 | 1 | Saint-Laurent-sur-Saône |
| 2 | 46 | 2 | 46 | 5 | 0 | Bourg-en-Bresse |


## 3) Train/test split (BEFORE feature engineering)
Important: `nb_ventes_commune` is computed **on the training set only**, then applied to the test set (anti-leakage).

In [4]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=RANDOM_STATE
)

# --- Build the nb_ventes_commune feature on TRAIN only ---
commune_sales = (
    X_train.groupby("nom_commune")
        .size()
        .rename("nb_ventes_commune")
)

median_sales = commune_sales.median()

# --- Apply to train and test (note: merge() returns a fresh 0..n-1 index, row order is kept) ---
X_train = X_train.merge(commune_sales, on="nom_commune", how="left")
X_test  = X_test.merge(commune_sales, on="nom_commune", how="left")

# Unknown values (communes never seen in train); a no-op on the training set itself
X_train["nb_ventes_commune"] = X_train["nb_ventes_commune"].fillna(median_sales)
X_test["nb_ventes_commune"]  = X_test["nb_ventes_commune"].fillna(median_sales)

# Final features (same logic as in your notebook)
FEATURES_FINAL = FEATURES_BASE + ["nb_ventes_commune"]

X_train_final = X_train[FEATURES_FINAL].copy()
X_test_final  = X_test[FEATURES_FINAL].copy()

print("Train:", X_train_final.shape, "Test:", X_test_final.shape)
display(X_train_final.head(3))


Train: (152417, 6) Test: (38105, 6)


| | surface_reelle_bati | nombre_pieces_principales | latitude | longitude | has_dependance | nb_ventes_commune |
|---|---|---|---|---|---|---|
| 0 | 73 | 4 | 51 | 3 | 1 | 50 |
| 1 | 52 | 3 | 43 | -2 | 0 | 171 |
| 2 | 26 | 1 | 47 | 3 | 0 | 53 |
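
The "fit on train, apply everywhere" logic can also be wrapped in two small functions. This is a sketch under assumptions: the function names `fit_commune_counts` and `add_commune_counts` are hypothetical, and `Series.map` is swapped in for `merge` because it keeps the original row index (the merged frames above restart at 0, so they no longer share an index with `y_train` / `y_test`).

```python
import pandas as pd


def fit_commune_counts(train: pd.DataFrame, col: str = "nom_commune"):
    """Learn per-commune sale counts and a fallback value from training rows only."""
    counts = train[col].value_counts()
    return counts, float(counts.median())


def add_commune_counts(frame: pd.DataFrame, counts: pd.Series, fallback: float,
                       col: str = "nom_commune") -> pd.DataFrame:
    """Attach nb_ventes_commune with map(), which preserves the frame's index."""
    out = frame.copy()
    out["nb_ventes_commune"] = out[col].map(counts).fillna(fallback)
    return out


# Toy frames with non-default indexes, as produced by train_test_split.
train = pd.DataFrame({"nom_commune": ["A", "A", "A", "B"]}, index=[10, 11, 12, 13])
test = pd.DataFrame({"nom_commune": ["A", "C"]}, index=[20, 21])

counts, fallback = fit_commune_counts(train)
train_fe = add_commune_counts(train, counts, fallback)
test_fe = add_commune_counts(test, counts, fallback)  # "C" is unseen: gets the fallback
print(test_fe)
```

Because the counts and the fallback are returned explicitly, the same pair can be reused later on any other subset (for example a city-level slice) without touching the test data.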


## 4) Baseline (median) + reporting function

In [5]:
def regression_report(y_true, y_pred, label="model"):
    rmse = root_mean_squared_error(y_true, y_pred)
    mae = mean_absolute_error(y_true, y_pred)
    r2 = r2_score(y_true, y_pred)
    print(f"{label:>18} | RMSE: {rmse:,.0f} | MAE: {mae:,.0f} | R²: {r2:,.3f}")
    return {"rmse": rmse, "mae": mae, "r2": r2}

# Baseline = median of the training target
y_pred_base = np.full(shape=len(y_test), fill_value=float(y_train.median()))
baseline_metrics = regression_report(y_test, y_pred_base, "Baseline(median)")


  Baseline(median) | RMSE: 2,681 | MAE: 1,772 | R²: -0.075
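
A negative R² for the median baseline is expected rather than a bug: R² compares the model with a constant prediction equal to the mean of `y_true`, and on right-skewed prices the median sits below that mean. The sketch below checks this on made-up toy prices; the helper `r2` is a plain NumPy re-implementation written for illustration, and the last line is only a rough plausibility check using the `describe()` figures of the full dataset (not the test split).

```python
import numpy as np


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot


# Right-skewed toy prices: the median (2500) is below the mean (3600).
y = np.array([1500.0, 2000.0, 2500.0, 3000.0, 9000.0])
r2_mean = r2(y, np.full_like(y, y.mean()))
r2_median = r2(y, np.full_like(y, np.median(y)))

# For any constant prediction c: R² = -(c - mean)² / variance, hence R² <= 0.
closed_form = -((np.median(y) - y.mean()) ** 2) / y.var()

# Rough check with the full-dataset figures (median 2963, mean 3688, std 2595).
approx_notebook = -((2963 - 3688) / 2595) ** 2

print(r2_mean, r2_median, closed_form, approx_notebook)
```

The approximation lands close to the reported -0.075, which suggests the baseline score mainly reflects the gap between the median and the mean of the price distribution.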


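## 5) Models: training + evaluation
We keep the models simple and robust, and compare them on the same split. In the summary table, the `r2` column is displayed as 1 because of the global integer float format set in the imports cell; the printed lines give the actual R² values.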

In [6]:
results = []

# 1) Random Forest
rf_model = RandomForestRegressor(
    n_estimators=400,
    random_state=RANDOM_STATE,
    n_jobs=-1,
    min_samples_leaf=2,
)
rf_model.fit(X_train_final, y_train)
y_pred_rf = rf_model.predict(X_test_final)
m = regression_report(y_test, y_pred_rf, "RandomForest")
results.append({"model": "RandomForest", **m})

# 2) Gradient Boosting
gbr_model = GradientBoostingRegressor(random_state=RANDOM_STATE)
gbr_model.fit(X_train_final, y_train)
y_pred_gbr = gbr_model.predict(X_test_final)
m = regression_report(y_test, y_pred_gbr, "GradBoosting")
results.append({"model": "GradBoosting", **m})

# 3) LightGBM (optional)
if HAS_LGBM:
    lgb_model = LGBMRegressor(
        random_state=RANDOM_STATE,
        n_estimators=1200,
        learning_rate=0.05,
        num_leaves=31,
        subsample=0.9,
        colsample_bytree=0.9,
    )
    lgb_model.fit(X_train_final, y_train)
    y_pred_lgb = lgb_model.predict(X_test_final)
    m = regression_report(y_test, y_pred_lgb, "LightGBM")
    results.append({"model": "LightGBM", **m})
else:
    lgb_model = None

results_df = pd.DataFrame(results).sort_values("rmse")
display(results_df)


      RandomForest | RMSE: 1,023 | MAE: 640 | R²: 0.844
      GradBoosting | RMSE: 1,360 | MAE: 968 | R²: 0.723


| | model | rmse | mae | r2 |
|---|---|---|---|---|
| 0 | RandomForest | 1023 | 640 | 1 |
| 1 | GradBoosting | 1360 | 968 | 1 |
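
To put these scores in perspective, the error reduction relative to the median baseline can be computed directly from the printed metrics. The snippet is a small arithmetic sketch; the numbers are copied from the outputs above (rounded as printed), and the column names `rmse_reduction_pct` / `mae_reduction_pct` are illustrative.

```python
import pandas as pd

# Metrics as printed in the notebook outputs (rounded values).
baseline = {"rmse": 2681, "mae": 1772}
models = pd.DataFrame(
    {
        "model": ["RandomForest", "GradBoosting"],
        "rmse": [1023, 1360],
        "mae": [640, 968],
    }
).set_index("model")

# Percentage of the baseline error removed by each model.
gain = pd.DataFrame(
    {
        "rmse_reduction_pct": 100 * (1 - models["rmse"] / baseline["rmse"]),
        "mae_reduction_pct": 100 * (1 - models["mae"] / baseline["mae"]),
    }
).round(1)

print(gain.to_string())
```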


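## 6) Feature importances (quick read)
Caution: with `latitude`/`longitude` available, the model mostly captures geography, which is expected in real estate. The importances below are rounded to integers by the global float format set in the imports cell, so only the dominant feature shows as 1.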

In [7]:
importances_rf = pd.Series(getattr(rf_model, "feature_importances_", np.nan), index=FEATURES_FINAL)
print("RandomForest importances:")
display(importances_rf.sort_values(ascending=False))

importances_gbr = pd.Series(getattr(gbr_model, "feature_importances_", np.nan), index=FEATURES_FINAL)
print("GradBoosting importances:")
display(importances_gbr.sort_values(ascending=False))

if lgb_model is not None:
    importances_lgb = pd.Series(getattr(lgb_model, "feature_importances_", np.nan), index=FEATURES_FINAL)
    print("LightGBM importances:")
    display(importances_lgb.sort_values(ascending=False))


RandomForest importances:


latitude                    1
longitude                   0
nb_ventes_commune           0
surface_reelle_bati         0
nombre_pieces_principales   0
has_dependance              0
dtype: float64

GradBoosting importances:


latitude                    1
longitude                   0
nb_ventes_commune           0
surface_reelle_bati         0
nombre_pieces_principales   0
has_dependance              0
dtype: float64
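
The 1/0 values above are a display effect of the global `float_format`, not the actual importances. One way to see decimals without changing the global setting is a temporary `pd.option_context`. The sketch below uses made-up importances for illustration only; they are not the notebook's values.

```python
import pandas as pd

# Same global setting as in the imports cell.
pd.options.display.float_format = "{:,.0f}".format

# Made-up importances, for illustration only.
importances = pd.Series(
    {"latitude": 0.55, "longitude": 0.25, "surface_reelle_bati": 0.20}
)

# Rendered with the global format: decimals are lost.
rounded_view = importances.to_string()

# Rendered with a temporary override: three decimals, global setting untouched.
with pd.option_context("display.float_format", "{:.3f}".format):
    detailed_view = importances.to_string()

print(rounded_view)
print(detailed_view)

# Restore the pandas default so this example leaves no side effect.
pd.reset_option("display.float_format")
```

In the notebook, wrapping the `display(importances_rf.sort_values(ascending=False))` call in the same `with` block should have the equivalent effect.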

## 7) Cross-validation (on the best model from the table)
Goal: check that performance is stable (without over-tuning).

In [8]:
from sklearn.model_selection import KFold, cross_val_score

cv = KFold(n_splits=5, shuffle=True, random_state=RANDOM_STATE)

rf_cv = RandomForestRegressor(
    n_estimators=400,
    random_state=RANDOM_STATE,
    n_jobs=-1,          # the model parallelizes internally
    min_samples_leaf=2
)

cv_rmse = -cross_val_score(
    rf_cv,
    X_train_final,
    y_train,
    scoring="neg_root_mean_squared_error",
    cv=cv,
    n_jobs=1            # IMPORTANT: no parallelism on the CV side
)

print(cv_rmse.mean(), cv_rmse.std())


1051.0830831913622 11.134887179932901
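
One caveat: `nb_ventes_commune` was computed on the whole training set before cross-validation, so each validation fold contributed to its own counts. For a simple count feature the effect is probably small, but a stricter variant recomputes the feature inside every fold. The sketch below illustrates that pattern on synthetic data; the data, the reduced `n_estimators` and the manual RMSE are assumptions made to keep the example short.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)
n = 300

# Synthetic stand-in for the training set (values are made up).
toy = pd.DataFrame(
    {
        "surface_reelle_bati": rng.uniform(20, 120, n),
        "nom_commune": rng.choice(["A", "B", "C", "D"], size=n, p=[0.5, 0.3, 0.15, 0.05]),
    }
)
y = 2000 + 10 * toy["surface_reelle_bati"] + rng.normal(0, 100, n)

cv = KFold(n_splits=5, shuffle=True, random_state=42)
cols = ["surface_reelle_bati", "nb_ventes_commune"]
fold_rmse = []

for fit_idx, val_idx in cv.split(toy):
    fit_part = toy.iloc[fit_idx].copy()
    val_part = toy.iloc[val_idx].copy()

    # Counts and fallback are learned on the fitting part of the fold only.
    counts = fit_part["nom_commune"].value_counts()
    fallback = counts.median()
    for part in (fit_part, val_part):
        part["nb_ventes_commune"] = part["nom_commune"].map(counts).fillna(fallback)

    model = RandomForestRegressor(n_estimators=50, min_samples_leaf=2, random_state=42)
    model.fit(fit_part[cols], y.iloc[fit_idx])
    pred = model.predict(val_part[cols])

    # RMSE computed by hand to avoid depending on a specific scikit-learn version.
    errors = y.iloc[val_idx].to_numpy() - pred
    fold_rmse.append(float(np.sqrt(np.mean(errors ** 2))))

print(np.mean(fold_rmse), np.std(fold_rmse))
```

If the mean RMSE obtained this way on the real data stays close to the value reported above, the original shortcut can be considered harmless for this feature.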


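## 8) Paris focus: predicting on Paris communes (defensive filter)
We reuse your idea: isolate the communes whose name contains 'Paris' while excluding false positives such as 'Pariset'. Note that `df_paris` is drawn from the full dataset, so it mixes rows the model was trained on with held-out rows; the metrics below are therefore not a pure out-of-sample estimate.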

In [10]:
# Defensive filter (same intent as in your notebook: avoid Pariset/Parisis/etc.)
pattern_exclude = "Seyssinet-Pariset|Le Touq|Fontenay-en-Parisis|Cormeilles-en-Parisis|Villeparisis"

df_paris = df[
    df["nom_commune"].str.contains("Paris", case=False, na=False)
    & ~df["nom_commune"].str.contains(pattern_exclude, case=False, na=False)
].copy()

print("Paris rows:", len(df_paris))
display(df_paris[["nom_commune", "prix_m2"]].head(5))

# Prepare X for Paris with the same nb_ventes_commune feature (computed on train).
# The cast below is a no-op here, since df was already cast in section 2.
df_paris["has_dependance"] = df_paris["has_dependance"].astype(int)

X_paris = df_paris[FEATURES_BASE + ["nom_commune"]].copy()
X_paris = X_paris.merge(commune_sales, on="nom_commune", how="left")
X_paris["nb_ventes_commune"] = X_paris["nb_ventes_commune"].fillna(median_sales)

X_paris_final = X_paris[FEATURES_FINAL]

# Prediction with the RF model (as in your notebook).
# If you would rather use best_model_name instead, this can be adapted.
y_pred_paris = rf_model.predict(X_paris_final)

df_result_paris = df_paris.copy()
df_result_paris["prix_m2_pred"] = y_pred_paris

print("Paris: preview")
display(df_result_paris[[
    "nom_commune",
    "surface_reelle_bati",
    "nombre_pieces_principales",
    "prix_m2",
    "prix_m2_pred"
]].head())

print("Paris only (RF)")
regression_report(df_result_paris["prix_m2"], y_pred_paris, "RF on Paris")


Paris rows: 12126


| | nom_commune | prix_m2 |
|---|---|---|
| 178396 | Paris 3e Arrondissement | 12333 |
| 178397 | Paris 1er Arrondissement | 10000 |
| 178398 | Paris 1er Arrondissement | 13238 |
| 178399 | Paris 1er Arrondissement | 13917 |
| 178400 | Paris 1er Arrondissement | 13090 |


Paris: preview


| | nom_commune | surface_reelle_bati | nombre_pieces_principales | prix_m2 | prix_m2_pred |
|---|---|---|---|---|---|
| 178396 | Paris 3e Arrondissement | 12 | 1 | 12333 | 11332 |
| 178397 | Paris 1er Arrondissement | 27 | 2 | 10000 | 10290 |
| 178398 | Paris 1er Arrondissement | 84 | 4 | 13238 | 8882 |
| 178399 | Paris 1er Arrondissement | 120 | 5 | 13917 | 12456 |
| 178400 | Paris 1er Arrondissement | 24 | 1 | 13090 | 12230 |


Paris only (RF)
       RF on Paris | RMSE: 1,502 | MAE: 1,007 | R²: 0.605


{'rmse': 1501.7750997777282,
 'mae': 1006.9958070152929,
 'r2': 0.6048612299596574}
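
To score Paris on held-out rows only, one option is to apply the same name filter to the test split and reuse the test predictions. The sketch below shows the idea on toy stand-ins; `X_test_toy`, `y_test_toy` and `y_pred_toy` are hypothetical placeholders for `X_test`, `y_test` and `y_pred_rf`. The mask is applied positionally because, after `merge`, the test features and the test target no longer share an index.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for X_test (after the merge), y_test and the test predictions.
X_test_toy = pd.DataFrame(
    {
        "nom_commune": [
            "Paris 3e Arrondissement",
            "Villeparisis",
            "Lyon 1er Arrondissement",
            "Paris 1er Arrondissement",
        ]
    }
)
y_test_toy = pd.Series([12000.0, 3000.0, 5000.0, 13000.0], index=[7, 42, 99, 3])
y_pred_toy = np.array([11000.0, 3200.0, 4800.0, 12500.0])

pattern_exclude = "Seyssinet-Pariset|Le Touq|Fontenay-en-Parisis|Cormeilles-en-Parisis|Villeparisis"
names = X_test_toy["nom_commune"]
is_paris = (
    names.str.contains("Paris", case=False, na=False)
    & ~names.str.contains(pattern_exclude, case=False, na=False)
).to_numpy()

# Positional masking: row i of the features matches element i of target and predictions.
y_true_paris_test = y_test_toy.to_numpy()[is_paris]
y_pred_paris_test = y_pred_toy[is_paris]

rmse = float(np.sqrt(np.mean((y_true_paris_test - y_pred_paris_test) ** 2)))
mae = float(np.mean(np.abs(y_true_paris_test - y_pred_paris_test)))
print(int(is_paris.sum()), rmse, mae)
```

In the notebook, the resulting arrays could be passed to `regression_report` to obtain a Paris score that is directly comparable with the global test metrics.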