
Let us work with the dataset stored in [**house_prices.csv**](https://raw.githubusercontent.com/zhouy185/BUS_O712/refs/heads/main/Data/house_prices.csv) (click to download the file). The dataset records the features of houses and the price at which each was sold in the current year (2024).
It includes the following variables:
* **Size (sq ft)**: The total area of the house.
* **Number of Rooms**: The total number of bedrooms in the house.
* **Neighborhood**: The type of neighborhood the house is in.
* **Year Built**: The year in which the house was built.
* **Price**: The price at which the house was sold.

In this exercise, we will use a linear regression model to predict the price.

First, load the data and replace 'Year Built' with the age of the house (as of 2025):

In [5]:
import pandas as pd

df = pd.read_csv('https://raw.githubusercontent.com/zhouy185/BUS_O712/refs/heads/main/Data/house_prices.csv')

df['Age'] = 2025 - df['Year Built']

y = df['Price']

x = df.drop(columns=['Price', 'Year Built'])

x, y

(     Size (sq ft)  Number of Rooms Neighborhood  Age
 0            3532                4       Suburb   49
 1            3407                5     Downtown   15
 2            2453                5  Countryside   57
 3            1635                3     Downtown   39
 4            1563                2       Suburb   55
 ..            ...              ...          ...  ...
 495          2668                2  Countryside    8
 496          2098                3       Suburb    4
 497          3074                5  Countryside   13
 498          2049                1     Downtown    3
 499          2763                1     Downtown   38
 
 [500 rows x 4 columns],
 0      1195126.0
 1      1412375.0
 2       797476.0
 3       523051.0
 4       532291.0
          ...    
 495     791199.0
 496     717297.0
 497    1426623.0
 498     694656.0
 499     859475.0
 Name: Price, Length: 500, dtype: float64)

Then, visualize the correlation between columns.
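
No code accompanies this step, so the following is only a minimal sketch of one way to do it. It assumes a plain correlation matrix is enough and uses the first five rows shown above as a stand-in for the full data (the name `sample` is hypothetical). 'Neighborhood' is text, so only the numeric columns are correlated.

```python
import pandas as pd

# Stand-in data: the first five rows printed above (features plus Price).
sample = pd.DataFrame({
    "Size (sq ft)": [3532, 3407, 2453, 1635, 1563],
    "Number of Rooms": [4, 5, 5, 3, 2],
    "Neighborhood": ["Suburb", "Downtown", "Countryside", "Downtown", "Suburb"],
    "Age": [49, 15, 57, 39, 55],
    "Price": [1195126.0, 1412375.0, 797476.0, 523051.0, 532291.0],
})

# Keep only numeric columns, then compute pairwise Pearson correlations.
corr = sample.select_dtypes("number").corr()

print(corr.round(2))
```

On the real data, the same two lines would be applied to `df` after adding 'Age'. The resulting matrix could also be handed to a plotting library to draw a heatmap, if one is available.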

Next, split the data into training (80%) and test (20%) sets.

In [7]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state = 42)

In [8]:
X_train

     Size (sq ft)  Number of Rooms Neighborhood  Age
249          1570                4     Downtown   32
433          3864                4  Countryside   46
19           2501                3       Suburb   51
322          1356                4  Countryside   55
332          2891                1     Downtown   10
..            ...              ...          ...  ...
106          2499                2       Suburb   23
270          1229                2  Countryside   47
348          2954                3       Suburb   45
435          1304                2  Countryside   58


Next, set up the preprocessing step: one-hot encode 'Neighborhood' and pass the remaining numeric columns through unchanged. This transformer will be the first step of the pipeline.

In [12]:
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

ct = ColumnTransformer(
    transformers = [
        ('cat', OneHotEncoder(sparse_output = False, drop = 'first'), ['Neighborhood'])
    ],

    remainder = 'passthrough'
)
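
To illustrate what `drop = 'first'` does, here is a small sketch on a toy column. It uses `pd.get_dummies` as a pandas stand-in for `OneHotEncoder`, not the encoder itself, and the name `toy` is hypothetical. The categories are sorted alphabetically and the first one ('Countryside') is dropped, so it becomes the baseline that is encoded as all zeros.

```python
import pandas as pd

# Hypothetical toy column with the three neighborhood types from the dataset.
toy = pd.DataFrame({"Neighborhood": ["Suburb", "Downtown", "Countryside", "Downtown"]})

# drop_first=True mirrors drop='first': the alphabetically first category
# ('Countryside') gets no column of its own.
encoded = pd.get_dummies(toy, columns=["Neighborhood"], drop_first=True, dtype=int)

print(encoded)
```

Dropping one category avoids perfect collinearity between the dummy columns and the model's intercept, which is why it is commonly done for linear regression.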

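Finally, combine the preprocessing step and a linear regression model into a pipeline, fit it on the training data, and use the fitted pipeline for prediction.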

In [22]:
from sklearn.linear_model import LinearRegression

from sklearn.pipeline import Pipeline

model = LinearRegression()

pipe = Pipeline(
    steps = [
        ('preprocess', ct),
        ('lin_reg', model)
    ]
)

pipe.fit(X_train, y_train)

y_pred = pipe.predict(X_test)

In [37]:
# Predict prices for two new houses; the columns must match those of X_train.
new_house = pd.DataFrame([[3000,2,'Downtown',50],[2000,3,'Suburb',30]], columns = X_train.columns)

pipe.predict(new_house)

array([944280.76222759, 719408.92644035])

In [32]:
from sklearn.metrics import mean_squared_error, r2_score
import math

# Evaluate the fitted pipeline on the held-out test set.
y_pred = pipe.predict(X_test)

# RMSE is in the same units as Price.
mse = mean_squared_error(y_test, y_pred)
rmse = math.sqrt(mse)

# R^2 is the share of the variance in Price explained by the model.
r2 = r2_score(y_test, y_pred)

print(rmse, r2)


102502.39434535184 0.8936273625137274
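
The output above says that test-set predictions are off by roughly 102,500 price units (RMSE) and that the model explains about 89% of the variance in price (R²). For reference, the sketch below computes both metrics by hand from their standard definitions. The helper `rmse_and_r2` and the toy numbers are hypothetical and only illustrate the arithmetic.

```python
import math

def rmse_and_r2(y_true, y_pred):
    """Return (RMSE, R^2) computed from their textbook definitions."""
    n = len(y_true)
    mean_true = sum(y_true) / n
    # Residual sum of squares: squared prediction errors.
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Total sum of squares: squared deviations from the mean of y_true.
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    rmse = math.sqrt(ss_res / n)
    r2 = 1 - ss_res / ss_tot
    return rmse, r2

# Toy values: every prediction is off by exactly 10.
rmse_toy, r2_toy = rmse_and_r2([100.0, 200.0, 300.0], [110.0, 190.0, 310.0])
print(rmse_toy, r2_toy)
```

Here the residual sum of squares is 300 and the total sum of squares is 20000, so the RMSE is 10 and R² is 1 - 300/20000 = 0.985.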
