# House Prices Regression
This notebook shows how to predict house prices using a Kaggle dataset downloaded with `kagglehub`. We use both `scikit-learn` and a custom (from-scratch) implementation of linear regression.

In [1]:
import kagglehub
import pandas as pd

# Download the latest version of the dataset
path = kagglehub.dataset_download("rishitaverma02/house-prices-advanced-regression-techniques")

print("Path to the dataset files:", path)

Path to the dataset files: C:\Users\saade\.cache\kagglehub\datasets\rishitaverma02\house-prices-advanced-regression-techniques\versions\1
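
The file name inside the download directory is not guaranteed, so it can help to list what was actually fetched before calling `pd.read_csv`. A minimal sketch, assuming the CSV files sit somewhere under the returned directory; `list_csv_files` is a hypothetical helper, not part of `kagglehub`, and a temporary directory stands in for the real download:

```python
import tempfile
from pathlib import Path


def list_csv_files(dataset_dir):
    """Return the CSV files found under dataset_dir, sorted by path."""
    return sorted(Path(dataset_dir).rglob("*.csv"))


# Stand-in directory so the sketch runs without a Kaggle download.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "train (1).csv").write_text("Id,SalePrice\n1,208500\n")
    (Path(tmp) / "notes.txt").write_text("not a csv")
    for csv_path in list_csv_files(tmp):
        print(csv_path.name)
```

In the notebook, the same helper could be called on `path` to confirm the file name used in the loading cell.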


In [2]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import StandardScaler

## 🔹 Loading the dataset

In [3]:
df = pd.read_csv(path + "/train (1).csv")
df.head()

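|  | Id | MSSubClass | MSZoning | LotFrontage | LotArea | Street | Alley | LotShape | LandContour | Utilities | ... | PoolArea | PoolQC | Fence | MiscFeature | MiscVal | MoSold | YrSold | SaleType | SaleCondition | SalePrice |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 60 | RL | 65.0 | 8450 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 2 | 2008 | WD | Normal | 208500 |
| 1 | 2 | 20 | RL | 80.0 | 9600 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 5 | 2007 | WD | Normal | 181500 |
| 2 | 3 | 60 | RL | 68.0 | 11250 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 9 | 2008 | WD | Normal | 223500 |
| 3 | 4 | 70 | RL | 60.0 | 9550 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 2 | 2006 | WD | Abnorml | 140000 |
| 4 | 5 | 60 | RL | 84.0 | 14260 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 12 | 2008 | WD | Normal | 250000 |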


## 🔹 Selecting relevant features

In [4]:
# Select a few simple numeric features
df = df[["GrLivArea", "OverallQual", "TotalBsmtSF", "SalePrice"]].dropna()

X = df[["GrLivArea", "OverallQual", "TotalBsmtSF"]]
y = df["SalePrice"]

## 🔹 Splitting the data into training and test sets

In [5]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

## 🔹 Linear regression with `scikit-learn`

In [6]:
model = LinearRegression()
model.fit(X_train, y_train)

print("Intercept:", model.intercept_)
print("Coefficients:", model.coef_)

y_pred_sklearn = model.predict(X_test)
mse_sklearn = mean_squared_error(y_test, y_pred_sklearn)
print("MSE with sklearn:", round(mse_sklearn, 2))

Intercept: -98427.32987875512
Coefficients: [   47.13601888 28203.61010365    33.17354342]
MSE with sklearn: 1667657527.16
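
An MSE of this size is hard to interpret because it is expressed in squared price units. Taking the square root gives the RMSE, which is in the same unit as `SalePrice`. A small sketch using the MSE value printed above (the variable `rmse_sklearn` is introduced here for illustration):

```python
import math

# Test MSE reported by the scikit-learn model above (squared price units).
mse_sklearn = 1667657527.16

# RMSE is in the same unit as SalePrice.
rmse_sklearn = math.sqrt(mse_sklearn)
print(f"RMSE with sklearn: {rmse_sklearn:,.0f}")
```

This puts the typical prediction error at roughly 40,800 in the units of `SalePrice`.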


## 🔹 Standardizing the data for gradient descent

In [7]:
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42)

X1_train, X2_train, X3_train = X_train[:, 0], X_train[:, 1], X_train[:, 2]
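
Note that the cell above fits the scaler on the full dataset before splitting, so the test rows contribute to the mean and standard deviation. A common alternative is to split first and fit the scaler on the training rows only. A minimal sketch of that variant, using synthetic data in place of the house-price features (the `_demo` arrays and their values are assumptions for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the three selected features and the target.
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=[1500.0, 6.0, 1000.0], scale=[500.0, 1.5, 400.0], size=(200, 3))
y_demo = 50.0 * X_demo[:, 0] + 20000.0 * X_demo[:, 1] + rng.normal(scale=1000.0, size=200)

# Split first, then learn the scaling statistics from the training rows only.
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)
scaler_demo = StandardScaler().fit(X_tr)
X_tr_scaled = scaler_demo.transform(X_tr)
X_te_scaled = scaler_demo.transform(X_te)

# Training columns have mean ~0 and std ~1; test columns are only close to that.
print(X_tr_scaled.mean(axis=0).round(3), X_tr_scaled.std(axis=0).round(3))
```

For ordinary least squares the fitted predictions do not depend on how the features are scaled, so this choice should change the results here only marginally; the habit matters more for models with regularization.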

## 🔹 Implementing linear regression from scratch
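
The loop in the next cell minimizes the mean squared error by batch gradient descent. For reference, with $n$ training rows, predictions $\hat{y}_i = b + w_1 x_{i1} + w_2 x_{i2} + w_3 x_{i3}$ and learning rate $\alpha$ (the symbol names are chosen here for illustration), the quantities it computes are:

```latex
\begin{aligned}
\mathrm{MSE}(b, w) &= \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 \\
\frac{\partial\, \mathrm{MSE}}{\partial b} &= -\frac{2}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i) \\
\frac{\partial\, \mathrm{MSE}}{\partial w_j} &= -\frac{2}{n} \sum_{i=1}^{n} x_{ij}\,(y_i - \hat{y}_i), \qquad j = 1, 2, 3 \\
b &\leftarrow b - \alpha \frac{\partial\, \mathrm{MSE}}{\partial b}, \qquad
w_j \leftarrow w_j - \alpha \frac{\partial\, \mathrm{MSE}}{\partial w_j}
\end{aligned}
```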

In [8]:
n = len(y_train)      # number of training rows
learning_rate = 0.01  # step size for each update
epochs = 1000         # number of passes over the training set

# Start from an all-zero model
bias = 0
w1, w2, w3 = 0, 0, 0

for i in range(epochs):
    # Predictions with the current parameters
    yhat = bias + w1*X1_train + w2*X2_train + w3*X3_train
    mse = mean_squared_error(y_train, yhat)
    if i % 100 == 0:
        print(f"Epoch {i}: MSE = {round(mse, 2)}")

    # Gradients of the MSE with respect to each parameter
    grad_b = -(2/n) * np.sum(y_train - yhat)
    grad_w1 = -(2/n) * np.sum(X1_train * (y_train - yhat))
    grad_w2 = -(2/n) * np.sum(X2_train * (y_train - yhat))
    grad_w3 = -(2/n) * np.sum(X3_train * (y_train - yhat))

    # Gradient descent step
    bias -= learning_rate * grad_b
    w1 -= learning_rate * grad_w1
    w2 -= learning_rate * grad_w2
    w3 -= learning_rate * grad_w3

Epoch 0: MSE = 38885583525.71
Epoch 100: MSE = 2206370574.29
Epoch 200: MSE = 1636708059.91
Epoch 300: MSE = 1624190461.73
Epoch 400: MSE = 1623478312.98
Epoch 500: MSE = 1623367446.67
Epoch 600: MSE = 1623345654.31
Epoch 700: MSE = 1623341248.49
Epoch 800: MSE = 1623340353.52
Epoch 900: MSE = 1623340171.43
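
The four per-parameter updates can also be written as a single matrix expression, which generalizes to any number of features. The sketch below is one possible vectorized variant, not the notebook's own code; it runs on synthetic standardized data and compares the result with NumPy's least-squares solver. The function name `gradient_descent` and the synthetic data are assumptions for illustration:

```python
import numpy as np


def gradient_descent(X, y, learning_rate=0.01, epochs=1000):
    """Fit y ~ bias + X @ w by batch gradient descent on the MSE."""
    n, d = X.shape
    bias, w = 0.0, np.zeros(d)
    for _ in range(epochs):
        residual = y - (bias + X @ w)
        # Same gradients as the scalar loop, computed for all weights at once.
        bias -= learning_rate * (-(2 / n) * residual.sum())
        w -= learning_rate * (-(2 / n) * (X.T @ residual))
    return bias, w


# Synthetic standardized features with a known linear relationship.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(500, 3))
y_demo = 10.0 + X_demo @ np.array([3.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=500)

bias_gd, w_gd = gradient_descent(X_demo, y_demo)

# Reference solution: least squares on the design matrix [1, X].
A = np.column_stack([np.ones(len(X_demo)), X_demo])
theta, *_ = np.linalg.lstsq(A, y_demo, rcond=None)

print("gradient descent:", np.round(np.r_[bias_gd, w_gd], 3))
print("least squares:   ", np.round(theta, 3))
```

On well-conditioned, standardized inputs like these, the two parameter vectors should agree to several decimal places after 1000 epochs.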


In [9]:
X1_test, X2_test, X3_test = X_test[:, 0], X_test[:, 1], X_test[:, 2]
y_pred_scratch = bias + w1*X1_test + w2*X2_test + w3*X3_test
mse_scratch = mean_squared_error(y_test, y_pred_scratch)
print("MSE from scratch:", round(mse_scratch, 2))

MSE from scratch: 1667634641.01
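
Because the from-scratch weights were learned on standardized features, they are not directly comparable with the `scikit-learn` coefficients printed earlier, which are in raw feature units. Since `StandardScaler` maps each feature to `(x - mean) / scale`, a standardized weight `w` corresponds to `w / scale` in raw units, and the intercept shifts by the sum of `w * mean / scale`. A minimal sketch of that conversion, with made-up numbers standing in for the learned parameters; `to_original_units` is a hypothetical helper:

```python
import numpy as np


def to_original_units(bias, weights, mean, scale):
    """Convert parameters fitted on (x - mean) / scale back to raw-feature units."""
    weights = np.asarray(weights, dtype=float)
    coef = weights / scale
    intercept = bias - np.sum(weights * mean / scale)
    return intercept, coef


# Illustrative values only (not the notebook's fitted parameters).
mean = np.array([1500.0, 6.0, 1000.0])
scale = np.array([500.0, 1.5, 400.0])
bias_std, w_std = 180000.0, np.array([25000.0, 40000.0, 15000.0])

intercept, coef = to_original_units(bias_std, w_std, mean, scale)

# Both parameterizations give the same prediction for a raw feature vector.
x_raw = np.array([1710.0, 7.0, 856.0])
pred_std = bias_std + w_std @ ((x_raw - mean) / scale)
pred_raw = intercept + coef @ x_raw
print(np.isclose(pred_std, pred_raw))
```

In the notebook, `mean` and `scale` would come from `scaler.mean_` and `scaler.scale_`, and the converted values could then be compared with `model.intercept_` and `model.coef_`.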


## ✅ Conclusion
- The `sklearn` model provides a fast, optimized regression.
- The manual (from-scratch) approach with gradient descent helps in understanding the mathematics behind linear regression.
- The two test MSEs are close (1667657527.16 with `sklearn`, 1667634641.01 from scratch), which indicates that the from-scratch implementation converges to essentially the same model.