<a href="https://colab.research.google.com/github/tc-wandering/house_price_prediction/blob/main/house_price_prediction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [2]:
!pip install pandas numpy matplotlib seaborn scikit-learn
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import SVR
from sklearn.metrics import mean_absolute_percentage_error

dataset = pd.read_excel("/content/HousePricePrediction.xlsx")

display(dataset.head())
display(dataset.shape)




Unnamed: 0,Id,MSSubClass,MSZoning,LotArea,LotConfig,BldgType,OverallCond,YearBuilt,YearRemodAdd,Exterior1st,BsmtFinSF2,TotalBsmtSF,SalePrice
0,0,60,RL,8450,Inside,1Fam,5,2003,2003,VinylSd,0.0,856.0,208500.0
1,1,20,RL,9600,FR2,1Fam,8,1976,1976,MetalSd,0.0,1262.0,181500.0
2,2,60,RL,11250,Inside,1Fam,5,2001,2002,VinylSd,0.0,920.0,223500.0
3,3,70,RL,9550,Corner,1Fam,5,1915,1970,Wd Sdng,0.0,756.0,140000.0
4,4,60,RL,14260,FR2,1Fam,5,2000,2000,VinylSd,0.0,1145.0,250000.0


(2919, 13)

Data cleaning: Drop irrelevant or identifier columns, such as `Id`, which carries no information about price. Then handle missing values: fill missing `SalePrice` entries with the column mean and drop any remaining rows that contain nulls.

In [10]:
dataset = pd.read_excel("/content/HousePricePrediction.xlsx")

dataset.drop(['Id'], axis=1, inplace=True)
dataset['SalePrice'] = dataset['SalePrice'].fillna(dataset['SalePrice'].mean())
dataset = dataset.dropna()

display(dataset.isnull().sum())

Unnamed: 0,0
MSSubClass,0
MSZoning,0
LotArea,0
LotConfig,0
BldgType,0
OverallCond,0
YearBuilt,0
YearRemodAdd,0
Exterior1st,0
BsmtFinSF2,0
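
Before imputing, it can help to see how many values are missing in each column. The sketch below uses a small made-up frame (not the housing file) to show a per-column missing-value report. It also shows one possible alternative to mean-filling the target: keeping only rows whose `SalePrice` is known. Mean-filling assigns the same price to every unlabeled row, which may affect the error metrics computed later. The names `toy`, `missing`, and `labeled` are illustrative.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the housing data (values are made up).
toy = pd.DataFrame({
    "LotArea": [8450, 9600, 11250, 9550],
    "MSZoning": ["RL", None, "RL", "RL"],
    "SalePrice": [208500.0, np.nan, 223500.0, np.nan],
})

# Count and share of missing values per column, before any imputation.
missing = pd.DataFrame({
    "n_missing": toy.isnull().sum(),
    "share": toy.isnull().mean(),
})
print(missing)

# Alternative to mean-filling the target: keep only rows with a known SalePrice.
labeled = toy.dropna(subset=["SalePrice"])
print(labeled.shape)
```

Whether dropping or imputing is appropriate depends on what the rows without a price are meant to be used for.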


Categorical encoding: Use one-hot encoding to turn categorical features into numeric columns that the models can use. First identify the object-type columns, then apply scikit-learn’s `OneHotEncoder`.

In [12]:
cat_cols = dataset.select_dtypes(include=['object']).columns

from sklearn.preprocessing import OneHotEncoder

OH_encoder = OneHotEncoder(sparse_output=False, handle_unknown='ignore')
OH_cols = pd.DataFrame(OH_encoder.fit_transform(dataset[cat_cols]))
OH_cols.index = dataset.index
OH_cols.columns = OH_encoder.get_feature_names_out()
dataset_num = dataset.drop(cat_cols, axis=1)
df_final = pd.concat([dataset_num, OH_cols], axis=1)
display(df_final.head())

Unnamed: 0,MSSubClass,LotArea,OverallCond,YearBuilt,YearRemodAdd,BsmtFinSF2,TotalBsmtSF,SalePrice,MSZoning_C (all),MSZoning_FV,...,Exterior1st_CemntBd,Exterior1st_HdBoard,Exterior1st_ImStucc,Exterior1st_MetalSd,Exterior1st_Plywood,Exterior1st_Stone,Exterior1st_Stucco,Exterior1st_VinylSd,Exterior1st_Wd Sdng,Exterior1st_WdShing
0,60,8450,5,2003,2003,0.0,856.0,208500.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
1,20,9600,8,1976,1976,0.0,1262.0,181500.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
2,60,11250,5,2001,2002,0.0,920.0,223500.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
3,70,9550,5,1915,1970,0.0,756.0,140000.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
4,60,14260,5,2000,2000,0.0,1145.0,250000.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0


Train-test split: Separate the features from the target (`SalePrice`), then hold out 20% of the rows as a test set.

In [18]:
X = df_final.drop('SalePrice', axis=1)
y = df_final['SalePrice']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

Modeling: Since this is a regression problem, I'm going to compare three models: SVR, Random Forest Regressor, and Linear Regression.

In [19]:
from sklearn.svm import SVR
svr = SVR()
svr.fit(X_train, y_train)
pred_svr = svr.predict(X_test)
print("SVR MAPE:", mean_absolute_percentage_error(y_test, pred_svr))

rf = RandomForestRegressor(n_estimators=10, random_state=0)
rf.fit(X_train, y_train)
pred_rf = rf.predict(X_test)
print("Random Forest MAPE:", mean_absolute_percentage_error(y_test, pred_rf))

lr = LinearRegression()
lr.fit(X_train, y_train)
pred_lr = lr.predict(X_test)
print("Linear Regression MAPE:", mean_absolute_percentage_error(y_test, pred_lr))

SVR MAPE: 0.1870512931870423
Random Forest MAPE: 0.18496844723976405
Linear Regression MAPE: 0.1874168384159986


SVR gave about 0.187 MAPE, Random Forest 0.185, and Linear Regression 0.187. Lower error is better. Here Random Forest had the lowest error (~0.185), making it the best of the three in this run, although the margin over the other two is small.
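
For reference, MAPE is the mean of the absolute errors divided by the absolute true values, and scikit-learn reports it as a fraction rather than a percentage (so 0.187 means roughly 18.7%). The small sketch below, on made-up prices, checks a manual computation against `mean_absolute_percentage_error`.

```python
import numpy as np
from sklearn.metrics import mean_absolute_percentage_error

# Made-up prices for illustration only.
y_true = np.array([200000.0, 150000.0, 100000.0])
y_pred = np.array([180000.0, 165000.0, 100000.0])

# MAPE = mean(|y_true - y_pred| / |y_true|), a fraction (not a percent).
manual = np.mean(np.abs(y_true - y_pred) / np.abs(y_true))
library = mean_absolute_percentage_error(y_true, y_pred)

print(manual, library)  # per-row errors are 0.10, 0.10, 0.00
```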

# Task
Explain the provided code, which includes a `RandomForestRegressor` and a `GradientBoostingRegressor`, and then improve the model's performance. The dataset is located at "/content/HousePricePrediction.xlsx".

## Hyperparameter tuning for random forest

### Subtask:
Tune the hyperparameters of the RandomForestRegressor model using GridSearchCV to find the best combination of parameters for improved performance.


**Reasoning**:
Run a 5-fold grid search over `n_estimators` and `max_depth`, scored with negative MAPE, and keep the combination with the best cross-validated score.



In [20]:
from sklearn.model_selection import GridSearchCV

param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 10, 20, 30]
}

rf = RandomForestRegressor(random_state=0)

grid_search = GridSearchCV(estimator=rf, param_grid=param_grid, cv=5, scoring='neg_mean_absolute_percentage_error')

grid_search.fit(X_train, y_train)

print("Best parameters:", grid_search.best_params_)
print("Best cross-validation score (neg_MAPE):", grid_search.best_score_)

Best parameters: {'max_depth': 10, 'n_estimators': 200}
Best cross-validation score (neg_MAPE): -0.18590708512361626
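
Beyond the single best score, `GridSearchCV` stores every combination's result in `cv_results_`, which shows how close the candidates are to each other. The sketch below is self-contained and uses synthetic data from `make_regression` with a deliberately small grid, so its numbers say nothing about the housing dataset. The names `X_demo`, `y_demo`, `search`, and `results` are illustrative.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic data so the sketch runs on its own; not the housing dataset.
X_demo, y_demo = make_regression(n_samples=120, n_features=5, noise=10.0, random_state=0)
y_demo = y_demo + 1000.0  # shift the target well away from zero, where MAPE is unstable

search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid={"n_estimators": [10, 20], "max_depth": [None, 5]},
    cv=3,
    scoring="neg_mean_absolute_percentage_error",
)
search.fit(X_demo, y_demo)

# One row per parameter combination, best (rank 1) first.
results = pd.DataFrame(search.cv_results_)[
    ["param_n_estimators", "param_max_depth",
     "mean_test_score", "std_test_score", "rank_test_score"]
].sort_values("rank_test_score")
print(results)
```

Comparing `mean_test_score` differences against `std_test_score` gives a rough sense of whether the top-ranked setting is meaningfully better than the rest.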


## Introduce a more powerful model (gradient boosting)

### Subtask:
Introduce and train a Gradient Boosting Regressor, a more powerful model, to see if it can outperform the tuned Random Forest.


**Reasoning**:
Introduce and train a Gradient Boosting Regressor, make predictions, and evaluate its performance using MAPE.



In [21]:
from sklearn.ensemble import GradientBoostingRegressor

gbr = GradientBoostingRegressor(random_state=0)
gbr.fit(X_train, y_train)
pred_gbr = gbr.predict(X_test)
print("Gradient Boosting Regressor MAPE:", mean_absolute_percentage_error(y_test, pred_gbr))

Gradient Boosting Regressor MAPE: 0.1878335615561105


## Train and evaluate the new models

### Subtask:
Train and evaluate the new models, including the tuned Random Forest and the Gradient Boosting model, to compare their performance.


**Reasoning**:
I will now train and evaluate the tuned Random Forest and Gradient Boosting models as requested. I will start by getting the best estimator from the `grid_search` object, making predictions with it, and printing its MAPE. Then, I will instantiate and train a `GradientBoostingRegressor`, make predictions, and print its MAPE.



In [22]:
best_rf = grid_search.best_estimator_
pred_best_rf = best_rf.predict(X_test)
print("Tuned Random Forest MAPE:", mean_absolute_percentage_error(y_test, pred_best_rf))

gbr = GradientBoostingRegressor(random_state=0)
gbr.fit(X_train, y_train)
pred_gbr = gbr.predict(X_test)
print("Gradient Boosting Regressor MAPE:", mean_absolute_percentage_error(y_test, pred_gbr))

Tuned Random Forest MAPE: 0.18402043088121728
Gradient Boosting Regressor MAPE: 0.1878335615561105


## Summary:

### Data Analysis Key Findings

*   `GridSearchCV` selected `{'max_depth': 10, 'n_estimators': 200}` for the `RandomForestRegressor`, with a best cross-validation negative Mean Absolute Percentage Error (MAPE) of -0.1859 (a MAPE of approximately 0.186, or 18.6%).
*   After tuning, the `RandomForestRegressor` achieved a MAPE of approximately 0.184 on the test set, compared with about 0.185 for the untuned 10-tree forest.
*   The `GradientBoostingRegressor` with default hyperparameters achieved a MAPE of approximately 0.188 on the test set.
*   The tuned `RandomForestRegressor` slightly outperformed the default `GradientBoostingRegressor` on the test set, by roughly 0.004 in MAPE.

### Insights or Next Steps

*   The next logical step is to perform hyperparameter tuning on the `GradientBoostingRegressor` to see if its performance can be improved and potentially surpass the tuned `RandomForestRegressor`.
*   Consider exploring other ensemble methods or more advanced modeling techniques that might be suitable for this regression task.
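
As a rough sketch of the first next step, a grid search over a `GradientBoostingRegressor` could look like the following. The grid values are hypothetical starting points rather than recommendations, and the data is synthetic (`make_regression`) so the block runs on its own. With the notebook's data, `X_train`, `y_train`, `X_test`, and `y_test` would take the place of the demo arrays.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_percentage_error
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data; not the housing dataset.
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
y_demo = y_demo + 1000.0  # keep targets well away from zero for MAPE

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

# Hypothetical grid: small on purpose so the search finishes quickly.
gbr_grid = {
    "n_estimators": [50, 100],
    "learning_rate": [0.05, 0.1],
    "max_depth": [2, 3],
}
gbr_search = GridSearchCV(
    GradientBoostingRegressor(random_state=0),
    param_grid=gbr_grid,
    cv=3,
    scoring="neg_mean_absolute_percentage_error",
)
gbr_search.fit(X_tr, y_tr)

# Evaluate the refit best estimator on the held-out split.
test_mape = mean_absolute_percentage_error(y_te, gbr_search.best_estimator_.predict(X_te))
print(gbr_search.best_params_, test_mape)
```

Using the same `scoring` and the same train/test split as the Random Forest search keeps the two tuned models directly comparable.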
