# House Price Prediction Assignment

**Objective:**
In this notebook, we predict house prices with **Linear Regression** and a **Random Forest Regressor**. We evaluate both models on a held-out test set using $R^2$, MAE, MSE, and RMSE, then sanity-check a few individual predictions.

In [8]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

## 1. Load Dataset

We use the cleaned house dataset (`clean_house_l5_dataset.csv`).

In [9]:
CSV_PATH = "../../../dataset/clean_house_l5_dataset.csv"
df = pd.read_csv(CSV_PATH)
df.head()

|  | Size_sqft | Bedrooms | Bathrooms | YearBuilt | Price | Location_City | Location_Rural | Location_Suburb | HouseAge | Rooms_per_1000sqft | Size_per_Bedroom | Is_City | LogPrice |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 1.030281 | -1.463643 | 0.088986 | -1.279342 | 812100.0 | 1 | 0 | 0 | 1.279342 | -1.061465 | 3.123085 | 1 | 13.60738 |
| 1 | -0.482463 | -1.463643 | 1.347506 | 1.326476 | 547000.0 | 1 | 0 | 0 | -1.326476 | -0.265637 | 1.30952 | 1 | 13.212206 |
| 2 | 0.468877 | 0.00743 | -1.169534 | -1.339942 | 693700.0 | 1 | 0 | 0 | 1.339942 | -0.689547 | -0.16397 | 1 | 13.449796 |
| 3 | 1.079817 | 0.742966 | 1.347506 | -0.91574 | 848300.0 | 1 | 0 | 0 | 0.91574 | -0.199111 | -0.307614 | 1 | 13.650991 |
| 4 | 0.788954 | 1.478502 | -1.169534 | 0.962873 | 806000.0 | 0 | 0 | 1 | -0.962873 | -0.311002 | -0.610027 | 0 | 13.59984 |


## 2. Prepare Features & Target

`Price` is the target, and `LogPrice` is derived directly from it, so we exclude both from the features to prevent data leakage.

In [10]:
y = df["Price"]
X = df.drop(columns=["Price", "LogPrice"])
print(f"Features used: {list(X.columns)}")

Features used: ['Size_sqft', 'Bedrooms', 'Bathrooms', 'YearBuilt', 'Location_City', 'Location_Rural', 'Location_Suburb', 'HouseAge', 'Rooms_per_1000sqft', 'Size_per_Bedroom', 'Is_City']
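To see why `LogPrice` would leak the target, the sketch below reuses the five `Price` and `LogPrice` values shown by `df.head()`. The notebook does not state how `LogPrice` was built; the assumption here is that it is a natural-log transform of `Price`, which the displayed rows agree with to within rounding.

```python
import numpy as np

# Price and LogPrice values copied from the five rows shown by df.head().
price = np.array([812100.0, 547000.0, 693700.0, 848300.0, 806000.0])
log_price = np.array([13.60738, 13.212206, 13.449796, 13.650991, 13.59984])

# Assumption: LogPrice is a natural-log transform of Price.
# Measure how far the displayed values are from log(Price).
max_gap = np.max(np.abs(np.log(price) - log_price))
print(f"largest |log(Price) - LogPrice|: {max_gap:.2e}")

# If the gap is tiny, exponentiating LogPrice hands the target straight back.
recovered = np.exp(log_price)
rel_err = np.max(np.abs(recovered - price) / price)
print(f"largest relative error of exp(LogPrice): {rel_err:.2e}")
```

A model given `LogPrice` as a feature could therefore reconstruct `Price` almost exactly, and its test metrics would say nothing about real predictive power.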


## 3. Split Data

We split the data into 80% training and 20% testing sets using `random_state=42` for reproducibility.

In [11]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print(f"Training set size: {len(X_train)}")
print(f"Testing set size: {len(X_test)}")

Training set size: 79
Testing set size: 20
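The printed sizes are not an exact 80/20 split because 99 rows cannot be divided that way. The arithmetic below is a minimal sketch that assumes the test count is rounded up and the training set takes the remainder, which is consistent with the sizes printed above; the variable names are illustrative.

```python
import math

n_rows = 79 + 20       # total rows implied by the printed split sizes
test_fraction = 0.2    # same value as test_size in the cell above

# Assumption: round the test count up, give the remaining rows to training.
n_test = math.ceil(test_fraction * n_rows)   # ceil(19.8)
n_train = n_rows - n_test

print(n_train, n_test)                        # 79 20
print(f"effective test share: {n_test / n_rows:.3f}")
```

With only 20 test rows, each row carries about 5% of the weight in every test metric, so small differences between models should be read cautiously.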


## 4. Train Models

We train a simple **Linear Regression** model and a more complex **Random Forest Regressor** with 100 estimators.

In [12]:
# Linear Regression
lr = LinearRegression()
lr.fit(X_train, y_train)
lr_pred = lr.predict(X_test)

# Random Forest Regressor
rf = RandomForestRegressor(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)
rf_pred = rf.predict(X_test)
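As an illustration of what "100 estimators" means, the sketch below fits a forest on synthetic stand-in data (not the house dataset) and checks that the forest's prediction matches the plain average of its individual trees' predictions. The names `X_demo`, `y_demo`, and `forest` are hypothetical and exist only for this example.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in data (NOT the house dataset): 60 rows, 3 features.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(60, 3))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(scale=0.1, size=60)

forest = RandomForestRegressor(n_estimators=100, random_state=42)
forest.fit(X_demo, y_demo)

# Each fitted tree is available in `estimators_`; average their outputs by hand.
per_tree = np.stack([tree.predict(X_demo[:5]) for tree in forest.estimators_])
manual_mean = per_tree.mean(axis=0)

# Compare the hand-computed average with the forest's own prediction.
print(per_tree.shape)
print(np.allclose(manual_mean, forest.predict(X_demo[:5])))
```

Averaging many trees smooths out the errors of any single tree, which is the main reason a forest is usually more stable than one decision tree.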

## 5. Evaluate Performance

We use a helper function to print evaluation metrics for both models.

In [13]:
def evaluate_model(name, y_true, y_pred):
    """Print R², MAE, MSE, and RMSE for one model's test-set predictions."""
    r2 = r2_score(y_true, y_pred)
    mae = mean_absolute_error(y_true, y_pred)
    mse = mean_squared_error(y_true, y_pred)
    rmse = np.sqrt(mse)  # same units as Price, unlike MSE

    print(f"{name} Performance:")
    print(f"  R² Score : {r2:.4f}")
    print(f"  MAE      : {mae:,.2f}")
    print(f"  MSE      : {mse:,.2f}")
    print(f"  RMSE     : {rmse:,.2f}\n")

evaluate_model("Linear Regression", y_test, lr_pred)
evaluate_model("Random Forest", y_test, rf_pred)

Linear Regression Performance:
  R² Score : 0.8478
  MAE      : 63,085.84
  MSE      : 5,718,940,940.60
  RMSE     : 75,623.68

Random Forest Performance:
  R² Score : 0.8594
  MAE      : 52,523.85
  MSE      : 5,283,317,454.95
  RMSE     : 72,686.43
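For reference, the four metrics can be written out by hand. The sketch below uses a small made-up example (the `y_true` and `y_pred` values are illustrative, not taken from the dataset) and the standard definitions: MAE is the mean absolute residual, MSE the mean squared residual, RMSE its square root, and $R^2 = 1 - SS_{res}/SS_{tot}$.

```python
import numpy as np

# Made-up values for illustration only.
y_true = np.array([100.0, 200.0, 300.0, 400.0])
y_pred = np.array([110.0, 190.0, 320.0, 380.0])

residuals = y_true - y_pred

mae = np.mean(np.abs(residuals))      # mean absolute error
mse = np.mean(residuals ** 2)         # mean squared error
rmse = np.sqrt(mse)                   # root of MSE, back in target units

ss_res = np.sum(residuals ** 2)                   # unexplained variation
ss_tot = np.sum((y_true - y_true.mean()) ** 2)    # total variation around the mean
r2 = 1.0 - ss_res / ss_tot

print(f"MAE={mae:.2f}  MSE={mse:.2f}  RMSE={rmse:.2f}  R2={r2:.4f}")
# MAE=15.00  MSE=250.00  RMSE=15.81  R2=0.9800
```

Because MSE and RMSE square the residuals, they penalize large misses more heavily than MAE does. In the results above, Random Forest's MAE is about 17% lower than Linear Regression's, while its RMSE is only about 4% lower, which suggests the forest still makes a few large errors.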



## 6. Single-row Sanity Check

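We pick three arbitrary rows from the test set (positions 5, 12, and 18) and compare the actual price with the predictions from both models. The row numbers are positions within `X_test`, not index labels from the original DataFrame.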

In [14]:
for i in [5, 12, 18]:
    sample_x = X_test.iloc[[i]]
    sample_y = y_test.iloc[i]

    pred_lr = lr.predict(sample_x)[0]
    pred_rf = rf.predict(sample_x)[0]

    print(f"--- Sanity Check for Row {i} ---")
    print(f"Actual Price: ${sample_y:,.2f}")
    print(f"Linear Regression Prediction: ${pred_lr:,.2f}")
    print(f"Random Forest Prediction: ${pred_rf:,.2f}\n")

--- Sanity Check for Row 5 ---
Actual Price: $419,200.00
Linear Regression Prediction: $411,139.22
Random Forest Prediction: $297,368.00

--- Sanity Check for Row 12 ---
Actual Price: $345,900.00
Linear Regression Prediction: $420,030.94
Random Forest Prediction: $393,837.00

--- Sanity Check for Row 18 ---
Actual Price: $315,000.00
Linear Regression Prediction: $258,460.20
Random Forest Prediction: $330,577.00
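To put the three checks on one scale, the sketch below recomputes the mean absolute error over just these rows, using the values printed above. It is only a tally of the displayed numbers, not a new evaluation.

```python
import numpy as np

# Values copied from the sanity-check output for rows 5, 12, and 18.
actual = np.array([419200.00, 345900.00, 315000.00])
pred_lr = np.array([411139.22, 420030.94, 258460.20])
pred_rf = np.array([297368.00, 393837.00, 330577.00])

# Mean absolute error of each model over these three rows only.
mae_lr = np.mean(np.abs(actual - pred_lr))
mae_rf = np.mean(np.abs(actual - pred_rf))

print(f"Linear Regression MAE on 3 rows: {mae_lr:,.2f}")   # 46,243.84
print(f"Random Forest MAE on 3 rows:     {mae_rf:,.2f}")   # 61,782.00
```

On these three rows Linear Regression comes out ahead, even though Random Forest has the lower MAE on the full test set. A handful of rows is useful for confirming that predictions are in a plausible range, but not for ranking the models.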

