## 1. Problem framing
Our goal is to find the best set of features and preprocessing steps that can succesfully predict property prices in Ireland. 

**target variable** "price"

**success metrics** 
- MSE (Mean Squared Error)
    - Average of squared errors
    - penalises large errors heavily, returns answer in original units squared (so price squared) makking it harder to interpret.
- RMSE 
    - square root of mse, same units as target (price)
    - still sensitive to outliers, but more interpretable
- MAE (Mean Absolute Error)
    - Average of (absolute (y - averagey)) (y - y-hat)
    - More robust to outliers than MSE/RMSE, easier to explain on average we are wrong by x euro

**Other options**
- R^2 - Measures how much of the variance in price is explained by the model; 1 is perfect, 0 means “no better than predicting the mean”
- AIC/BIC, log-likelihood (for linear models) - More about model selection and fit vs complexity balance, less about “how wrong is the prediction in euros”.

# 2. Data Description
- The dataset we are using is availble on Kaggle [here](https://www.kaggle.com/datasets/eavannan/daftie-house-price-data)
    - We use the 'daft_ie_v1.csv' file
        - It has 3967 rows, only one column ('propertySize') with nulls (we might need to add some in, so we can say we did cleaning on na's)
            - propertySize has 355 nulls
        - columns: ['id', 'title', 'featuredLevel', 'publishDate', 'price', 'numBedrooms', 'numBathrooms', 'propertyType', 'propertySize', 'category', 'AMV_price', 'sellerId', 'seller_name', 'seller_branch', 'sellerType', 'm_totalImages', 'm_hasVideo', 'm_hasVirtualTour', 'm_hasBrochure', 'ber_rating', 'longitude', 'latitude']

In [1]:
# Library import
import pandas as pd
import plotly.express as px
import numpy as np

df = pd.read_csv("daft_ie_v1.csv")

# for linear regression imputation later on
original_df = df.copy()

### 3. EDA & Preprocessing

#### 3.1 Remove outlier record in longitude and latitude

In [3]:
# scatter of longitude and altitude showed a value that was definitely an outlier
px.scatter(df, x='longitude', y='latitude', trendline='ols',
           title='longitude vs latitude').show()

# this value lines up wiht somewhere in America so we will remove
df_clean = df[~df['latitude'].isin([39.78373])]

In [4]:
df_clean.propertyType.value_counts()
# removing low count values that would interfer with our model
df_clean = df_clean[~df_clean['propertyType'].isin(['Townhouse', 'Duplex','Site','House','Studio'])]
df_clean.propertyType.value_counts()
print(len(df_clean))

3831


### 3.2 Histograms - log-transforming

What we're looking for
- Original scale: if the histogram is strongly skewed, a log transform often helps.
- Log scale: if the histogram of the log values looks more symmetric, bell shaped, it is a sign that log‑transforming for regression is useful (errors closer to normal, linearity easier).

In [5]:
# price and size log-transform
df_clean['log_price'] = np.log1p(df_clean['price'])          # log(1 + x) helps to avoid log(0)
df_clean['log_size']  = np.log1p(df_clean['propertySize'])

# histograms before transformation
px.histogram(df_clean, x='price', nbins=50, title='Price').show()
px.histogram(df_clean, x='log_price', nbins=50, title='log(price)').show()

#log- scale histograms
px.histogram(df_clean, x='propertySize', nbins=50, title='Property size').show()
px.histogram(df_clean, x='log_size', nbins=50, title='log(size)').show()

### 3.3 Feature Engineering
#### Splitting address to get town and county
Title would be useful as it contains information on location, rather than just longitude and latitude to access the county and other location information we need to split the title up

Unfortunately, we found this to not add anythign to our model and longitude and latitude were more useful in our model.

In [8]:
# uncomment to look at pattern in title column
# print(df_clean['title'].value_counts())

# regex pattern to get Co. xxx and Dublin 15 etc.
county_pattern = r'(Co\.\s+\w+|Dublin\s+\d+)$'
df_clean['county'] = df_clean['title'].str.extract(county_pattern)

# collecting the rest of the address (less the county)
df_clean['no_county'] = df_clean['title'].str.replace(
    r',?\s*(Co\.\s+\w+|Dublin\s+\d+)$', 
    '', 
    regex=True
)

# Without county, town will be next comma, leaving address (estate or house 1 or 2 lines) to left
tmp = df_clean['no_county'].str.rsplit(',', n=1, expand=True)
df_clean['address_part'] = tmp[0].str.strip()
df_clean['town'] = tmp[1].str.strip()

# drop helper column
df_clean = df_clean.drop(columns=['no_county'])
print(len(df_clean.columns))

27


### 3.4 Identifying outliers
#### Boxplots can be very useful to identify outliers and further investigate distributions alongside histograms

In [9]:
import plotly.express as px

# define which columns to invetigat
# This could be updated and rerun after any changes were made
outlier_cols = ['price', 'numBedrooms', 'numBathrooms', 'propertySize']

for col in outlier_cols:
    fig = px.box(df_clean, y=col, points='suspectedoutliers', title=f"{col} - Boxplot")
    fig.show()

In [10]:
# cell to investigate outliers, left examples here to observe thought process
### noticing outliers in boxplots and investiating further
df_clean.loc[df_clean['price'].isin([4500000])] # Not an outlier but will it distort our model?
df_clean.loc[df_clean['numBedrooms'].isin([23])] # Can not find it existing online, may be an outlier
df_clean.loc[df_clean['propertySize'].isin([8600, 8094])] # Not an outlier but will it distort our model?

Unnamed: 0,id,title,featuredLevel,publishDate,price,numBedrooms,numBathrooms,propertyType,propertySize,category,...,m_hasVirtualTour,m_hasBrochure,ber_rating,longitude,latitude,log_price,log_size,county,address_part,town
2259,3676488,"2719 Dara Park, Newbridge, Co. Kildare",standard,2022-01-19,180000,3,1,End of Terrace,8600.0,Buy,...,False,False,XXX,-6.68781,52.67001,12.100718,9.059634,Co. Kildare,2719 Dara Park,Newbridge
2463,3673436,"Glebe, Bunclody, Co. Wexford",standard,2022-01-30,85000,1,1,Detached,8600.0,Buy,...,False,False,E1,-6.258609,53.324768,11.350418,9.059634,Co. Wexford,Glebe,Bunclody
3442,3654681,"13 Donomore Crescent, Tallaght, Dublin 24",standard,2022-01-29,199000,3,1,Terrace,8094.0,Buy,...,False,False,SI_666,-7.133225,52.340573,12.201065,8.999002,Dublin 24,13 Donomore Crescent,Tallaght
3682,3649830,"Rathlikeen, Mullinavat, Co. Kilkenny",standard,2022-01-13,130000,1,1,Detached,8094.0,Buy,...,False,False,C2,-7.523661,53.530574,11.775297,8.999002,Co. Kilkenny,Rathlikeen,Mullinavat


#### 3.5 Why it is good to do IQR before training a model
- The IQR rule is about spotting values that are far from the bulk of the data in the original units.

- Linear regression is sensitive to outliers
    - It fits by minimising squared errors (MSE) so a few very extreme points (like 4.5m houses) can dominate the loss.
    - The line will bend to fit those rare extremes, making predictions worse for the majority of “normal” houses (e.g. 190k–360k).
- Better generalisation to future data
    - Most future properties you care about are likely in the typical price range, not in the tiny set of extreme mansions.
    - Training on a distribution cleansed of extreme noise/outliers makes the model’s bias/variance trade‑off better for that typical range.

In [11]:
# Define function to remove outliers
def remove_outliers(df, col):
    Q1 = df[col].quantile(0.25)
    Q3 = df[col].quantile(0.75)

    IQR = Q3 - Q1

    # define lower bound and upper bound
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR

    # Filter the DF to keep outliers (remove only between lower_bound and upper_bound)
    df_clean = df[(df[col] >= lower_bound) & (df[col] <= upper_bound)]
    return df_clean

# From boxplot we know what outliers we need to remove
df_clean = remove_outliers(df_clean, 'price')
df_clean = remove_outliers(df_clean, 'numBedrooms')
df_clean = remove_outliers(df_clean, 'numBathrooms')
df_clean = remove_outliers(df_clean, 'propertySize')

# viewing boxplots again after IQR transformations
for col in outlier_cols:
    fig = px.box(df_clean, y=col, points='suspectedoutliers', title=f"{col} - Boxplot")
    fig.show()

In [12]:
# From boxplot above notice very small propertySize squared
# defining propertySize to be at least 20m^2
df_clean = df_clean[df_clean['propertySize'] >= 20]

### 3.6 Relationships with price
- price vs propertySize
    - raw scale, dominated by outliers
    - on log-log relationship is much clearer and roughly linear, use log_price as target and log_size as one of predictors

- bedrooms and bathrooms
    - both have an upward relationship with price 

In [13]:
# price vs propertySize
px.scatter(df_clean, x='propertySize', y='price', trendline='ols',
           title='Price vs Property size').show()

# price and size log-transform
df_clean['log_price'] = np.log1p(df_clean['price'])          # log(1 + x) helps to avoid log(0)
df_clean['log_size']  = np.log1p(df_clean['propertySize'])

px.scatter(df_clean, x='log_size', y='log_price', trendline='ols',
           title='log Price vs log Property size').show()

# price vs numBedrooms
px.scatter(df_clean, x='numBedrooms', y='price', trendline='ols',
           title='Price vs Bedrooms').show()

# price vs numBathrooms
px.scatter(df_clean, x='numBathrooms', y='price', trendline='ols',
           title='Price vs Bathrooms').show()

# price vs numBathrooms
px.scatter(df_clean, x='numBathrooms', y='numBedrooms', trendline='ols',
           title='Bedrooms vs Bathrooms').show()

### 3.7 Investigating null values
- All NA's exist in the propertySize column, with over 3.5k rows to train a model with this seems like a good scenario to train a model to impute these values.
- The easiest option would be to remove all null values but we wanted to try something more complex.
- We trained a Linear Regression model to impute the missing propertySize values.
- Other options could be to use measures of central tendency such as mean, mode and median.

#### Option 1: drop all null values

In [None]:
df_nona = df.dropna()
df_nona

Unnamed: 0,id,title,featuredLevel,publishDate,price,numBedrooms,numBathrooms,propertyType,propertySize,category,...,m_hasVirtualTour,m_hasBrochure,ber_rating,longitude,latitude,log_price,log_size,county,address_part,town
0,3626025,"11 Chestnut Crescent, Bridgemount, Carrigaline...",featured,2022-01-28,290000,3,3,End of Terrace,96.0,Buy,...,False,False,C2,-8.382500,51.822940,12.577640,4.574711,Co. Cork,"11 Chestnut Crescent, Bridgemount",Carrigaline
1,3675175,"58 The Glen, Kilnacourt Woods, Portarlington, ...",featured,2022-01-28,225000,3,2,Semi-D,93.0,Buy,...,False,False,C1,-7.177098,53.157465,12.323860,4.543295,Co. Laois,"58 The Glen, Kilnacourt Woods",Portarlington
2,3673450,"16 Dodderbrook Park, Ballycullen, Dublin 24",featured,2022-01-27,575000,4,3,Semi-D,162.0,Buy,...,True,False,A3,-6.342763,53.269493,13.262127,5.093750,Dublin 24,16 Dodderbrook Park,Ballycullen
4,3643947,"5 Columba Terrace, Kells, Co. Meath",featured,2022-01-28,120000,3,1,Terrace,68.0,Buy,...,False,False,G,-6.879797,53.728601,11.695255,4.234107,Co. Meath,5 Columba Terrace,Kells
5,3598816,"75 The Lawn, Coolroe Meadows, Ballincollig, Co...",featured,2022-01-30,400000,4,3,Semi-D,113.0,Buy,...,False,False,C1,-8.614786,51.883612,12.899222,4.736198,Co. Cork,"75 The Lawn, Coolroe Meadows",Ballincollig
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3960,3644577,"Apartment 41, Penrose Court, Co. Waterford",standard,2022-01-24,115000,2,2,Apartment,63.0,Buy,...,False,False,A3,-6.183774,53.267151,11.652696,4.158883,Co. Waterford,Apartment 41,Penrose Court
3961,3644422,"8 Sliabh Cairbe, Drumlish, Co. Longford",standard,2021-12-13,185000,4,3,Semi-D,125.0,Buy,...,False,False,A3,-8.315556,51.849705,12.128117,4.836282,Co. Longford,8 Sliabh Cairbe,Drumlish
3963,3644275,"8 Thomas Street, Castlebar, Co. Mayo",standard,2022-01-30,149500,3,1,Bungalow,82.0,Buy,...,False,False,A3,-6.753848,54.115088,11.915058,4.418841,Co. Mayo,8 Thomas Street,Castlebar
3965,3644099,"School Land, Ballinalee, Co. Longford",standard,2021-12-04,170000,4,2,Detached,128.0,Buy,...,True,False,A2,-8.652927,52.664558,12.043560,4.859812,Co. Longford,School Land,Ballinalee


In [29]:
# Create a mask for rows with any NA values to investigate further
mask = df.isna().any(axis=1)
df[mask].head()

Unnamed: 0,id,title,featuredLevel,publishDate,price,numBedrooms,numBathrooms,propertyType,propertySize,category,...,m_hasVirtualTour,m_hasBrochure,ber_rating,longitude,latitude,log_price,log_size,county,address_part,town
3,3649708,"31 Lissanalta Drive, Dooradoyle, Co. Limerick",featured,2022-01-28,299000,3,3,Semi-D,,Buy,...,False,False,C2,-8.640716,52.629588,12.608202,,Co. Limerick,31 Lissanalta Drive,Dooradoyle
9,3486462,"Ballykeeran, Co. Westmeath",featured,2022-01-30,695000,7,6,Detached,400.0,Buy,...,False,False,C2,-7.899131,53.446741,13.451669,5.993961,Co. Westmeath,Ballykeeran,
11,3636266,"14 Ballinakill Avenue, Ballinakill, Co. Waterford",featured,2022-01-12,450000,4,2,Detached,,Buy,...,False,False,D1,-7.067098,52.242731,13.017005,,Co. Waterford,14 Ballinakill Avenue,Ballinakill
13,3655680,"Knockaneasy, Our Ladys Island",featured,2022-01-14,395000,3,2,Detached,204.0,Buy,...,False,False,C2,-6.37534,52.205808,12.886644,5.32301,,Knockaneasy,Our Ladys Island
17,3476643,"Seamount Rise, Malahide, Co. Dublin",featured,2022-01-11,625000,3,3,Terrace,,New Homes,...,True,False,SI_666,-7.107895,52.255349,13.345509,,Co. Dublin,Seamount Rise,Malahide


#### Option 2: Imputation by Linear Regression

In [14]:
## Option 2: Imputation by Linear Regression
from sklearn.linear_model import LinearRegression

lrdf = original_df.copy()

# drop the target column and keep only numeric columns
X = lrdf.drop(columns=['propertySize']).select_dtypes(include=[np.number])
y = lrdf['propertySize']

# rows where propertySize is not missing mask
mask = y.notna()

# Training data (only rows with propertySize available)
X_train, y_train = X[mask], y[mask]

# Build and fit the Linear Regression model
model = LinearRegression()
model.fit(X_train, y_train)

# Rows where propertySize is missing
X_missing = X[~mask]

# Predict missing values if there are any
if not X_missing.empty:
    y_pred_missing = model.predict(X_missing)
    lrdf.loc[~mask, 'propertySize'] = y_pred_missing
else:
    print("No Missing Values")

#### 3.8 Add log_size amd log_price to this linear regression imputed dataframe
We will use lrdf dataframe from now as it as no null values and has been cleaned in previous steps

In [None]:
lrdf['log_size']  = np.log1p(lrdf['propertySize'])
lrdf['log_price']  = np.log1p(lrdf['price'])

id                  0
title               0
featuredLevel       0
publishDate         0
price               0
numBedrooms         0
numBathrooms        0
propertyType        0
propertySize        0
category            0
AMV_price           0
sellerId            0
seller_name         0
seller_branch       0
sellerType          0
m_totalImages       0
m_hasVideo          0
m_hasVirtualTour    0
m_hasBrochure       0
ber_rating          0
longitude           0
latitude            0
log_size            0
log_price           0
dtype: int64

### 4 Linear Regression: Baseline Model

- Including every numeric feature in the dataframe, not expecting good results and extreme overfitting

In [29]:
import statsmodels.api as sm
import numpy as np

features = ['title', 'featuredLevel', 'publishDate', 'numBedrooms',
 'numBathrooms', 'propertyType', 'propertySize', 'category', 'AMV_price',
 'sellerId', 'seller_name', 'seller_branch', 'sellerType', 'm_totalImages',
 'm_hasVideo', 'm_hasVirtualTour', 'm_hasBrochure', 'ber_rating', 'longitude',
 'latitude']

x = lrdf[features].select_dtypes(include=[np.number])

y = lrdf['price']

x = sm.add_constant(x)
model = sm.OLS(y,x).fit()

print(model.summary())

                            OLS Regression Results                            
Dep. Variable:                  price   R-squared:                       0.184
Model:                            OLS   Adj. R-squared:                  0.183
Method:                 Least Squares   F-statistic:                     111.9
Date:                Fri, 28 Nov 2025   Prob (F-statistic):          4.31e-169
Time:                        14:54:22   Log-Likelihood:                -54841.
No. Observations:                3967   AIC:                         1.097e+05
Df Residuals:                    3958   BIC:                         1.098e+05
Df Model:                           8                                         
Covariance Type:            nonrobust                                         
                    coef    std err          t      P>|t|      [0.025      0.975]
---------------------------------------------------------------------------------
const          1.243e+05   3.25e+05      0.383

#### Results explanation


`features = ['numBedrooms','numBathrooms', 'propertyType', 'property_size','longitude','latitude']`
- standard size vs log size
    - R^2: 0.171 vs. 0.201
- numBedrooms and numBathroom only variable statistically significant so will remove longitude and latitude and try again


In [None]:
import statsmodels.api as sm
import numpy as np

# features_reg_size = ['numBedrooms','numBathrooms', 'propertySize','longitude','latitude']
# features_log_size = ['numBedrooms','numBathrooms', 'log_size','longitude','latitude']
features_bed_bath = ['numBedrooms','numBathrooms', 'log_size']

# change value in first square bracket with version you want to see results for
x = lrdf[features_bed_bath].select_dtypes(include=[np.number])

y = lrdf['price']

x = sm.add_constant(x)
model = sm.OLS(y,x).fit()

print(model.summary())

                            OLS Regression Results                            
Dep. Variable:                  price   R-squared:                       0.201
Model:                            OLS   Adj. R-squared:                  0.201
Method:                 Least Squares   F-statistic:                     333.1
Date:                Fri, 28 Nov 2025   Prob (F-statistic):          6.74e-193
Time:                        14:47:36   Log-Likelihood:                -54800.
No. Observations:                3967   AIC:                         1.096e+05
Df Residuals:                    3963   BIC:                         1.096e+05
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
                   coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------
const        -3.744e+05   3.81e+04     -9.833   

## Comparing different transformations and testing removing features below

In [20]:
from sklearn.linear_model import LinearRegression
import numpy as np
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.metrics import r2_score

features_size = [
 'title','featuredLevel','publishDate','numBedrooms','numBathrooms',
 'propertyType','propertySize','category','AMV_price','seller_name',
 'seller_branch','sellerType','m_hasVideo','m_hasVirtualTour',
 'm_hasBrochure','ber_rating','longitude','latitude'
]

features_log_size = [
    'title','featuredLevel','publishDate','numBedrooms','numBathrooms',
 'propertyType','log_size','category','AMV_price','seller_name',
 'seller_branch','sellerType','m_hasVideo','m_hasVirtualTour',
 'm_hasBrochure','ber_rating','longitude','latitude'
]

features_new = [
 'title','featuredLevel','publishDate','numBedrooms','numBathrooms',
 'propertyType','propertySize','category','AMV_price','seller_name',
 'seller_branch','sellerType','m_hasVideo','m_hasVirtualTour',
 'm_hasBrochure','ber_rating','latitude'
]

x_s  = lrdf[features_size].select_dtypes(include=[np.number])
x_ls = lrdf[features_log_size].select_dtypes(include=[np.number])
x_n = lrdf[features_new].select_dtypes(include=[np.number])

y_p  = lrdf['price']
y_lp = lrdf['log_price']

score_p_s  = cross_val_score(LinearRegression(), x_s,  y_p,  cv=10)
score_p_ls = cross_val_score(LinearRegression(), x_ls, y_p,  cv=10)
score_lp_s = cross_val_score(LinearRegression(), x_s,  y_lp, cv=10)
score_lp_ls = cross_val_score(LinearRegression(), x_ls, y_lp, cv=10)
score_n = cross_val_score(LinearRegression(), x_n, y_p, cv=10)

# Predictions for each model
X_train_s, X_test_s, y_train_p, y_test_p = train_test_split(x_s, y_p, test_size=0.2)
m_p_s = LinearRegression().fit(X_train_s, y_train_p)
y_pred_p_s = m_p_s.predict(X_test_s)

X_train_ls, X_test_ls, y_train_p2, y_test_p2 = train_test_split(x_ls, y_p, test_size=0.2)
m_p_ls = LinearRegression().fit(X_train_ls, y_train_p2)
y_pred_p_ls = m_p_ls.predict(X_test_ls)

X_train_s2, X_test_s2, y_train_lp, y_test_lp = train_test_split(x_s, y_lp, test_size=0.2)
m_lp_s = LinearRegression().fit(X_train_s2, y_train_lp)
y_pred_lp_s = m_lp_s.predict(X_test_s2)

X_train_ls2, X_test_ls2, y_train_lp2, y_test_lp2 = train_test_split(x_ls, y_lp, test_size=0.2)
m_lp_ls = LinearRegression().fit(X_train_ls2, y_train_lp2)
y_pred_lp_ls = m_lp_ls.predict(X_test_ls2)

X_train_n, X_test_n, y_train_n, y_test_n = train_test_split(x_n, y_p, test_size=0.2)
m_n = LinearRegression().fit(X_train_n, y_train_n)
y_pred_n = m_n.predict(X_test_n)

print("Price and Size\n====================")
# print("accuracy score:", accuracy_score(y_test_p, y_pred_p_s))
print("r2 score:", r2_score(y_test_p, y_pred_p_s))
print("model score:", m_p_s.score(X_test_s, y_test_p))
print("===")
# print("cross validation scores:", score_p_s)
# print("mean:", score_p_s.mean())
# print("std:", score_p_s.std())
print()

print("Price and Log Size\n====================")
# print("accuracy score:", accuracy_score(y_test_p2, y_pred_p_ls))
print("r2 score:", r2_score(y_test_p2, y_pred_p_ls))
print("model score:", m_p_ls.score(X_test_ls, y_test_p2))
print("===")
# print("cross validation scores:", score_p_ls)
# print("mean:", score_p_ls.mean())
# print("std:", score_p_ls.std())
print()

print("Log Price and Size\n====================")
# print("accuracy score:", accuracy_score(y_test_lp, y_pred_lp_s))
print("r2 score:", r2_score(y_test_lp, y_pred_lp_s))
print("model score:", m_lp_s.score(X_test_s, y_test_lp))
print("===")
# print("cross validation scores:", score_lp_s)
# print("mean:", score_lp_s.mean())
# print("std:", score_lp_s.std())
print()

print("Log Price and Log Size\n====================")
# print("accuracy score:", accuracy_score(y_test_lp2, y_pred_lp_ls))
print("r2 score:", r2_score(y_test_lp2, y_pred_lp_ls))
print("model score:", m_lp_ls.score(X_test_ls, y_test_lp2))
print("===")
# print("cross validation scores:", score_lp_ls)
# print("mean:", score_lp_ls.mean())
# print("std:", score_lp_ls.std())
print()

print("Price and Size (without longitude)\n====================")
# print("accuracy score:", accuracy_score(y_test_p, y_pred_p_s))
print("r2 score:", r2_score(y_test_n, y_pred_n))
print("model score:", m_n.score(X_test_n, y_test_n))
print("===")
# print("cross validation scores:", score_p_s)
# print("mean:", score_p_s.mean())
# print("std:", score_p_s.std())
print()

# print(score_p_s)
# print(score_p_ls)
# print(score_lp_s)
# print(score_lp_ls)

import numpy as np
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error

# use model_lp_s already fitted:
# y_test_lp = actual log(price)
# y_pred_lp_s = predicted log(price)

y_pred_price = np.exp(y_pred_lp_s)
y_actual_price = np.exp(y_test_lp)

r2_real  = r2_score(y_actual_price, y_pred_price)
mae_real = mean_absolute_error(y_actual_price, y_pred_price)
rmse_real = np.sqrt(mean_squared_error(y_actual_price, y_pred_price))

print(f"r2_real: {r2_real}, mae_real: {mae_real}, rmse_real: {rmse_real}")

Price and Size
r2 score: 0.189270961867671
model score: 0.189270961867671
===

Price and Log Size
r2 score: 0.21154476640478548
model score: 0.21154476640478548
===

Log Price and Size
r2 score: 0.24580355625311823
model score: -0.1562560788354319
===

Log Price and Log Size
r2 score: 0.2440280501869414
model score: -0.2427818735022178
===

Price and Size (without longitude)
r2 score: 0.15890016782704985
model score: 0.15890016782704985
===

r2_real: 0.07300810768798449, mae_real: 142382.11359806452, rmse_real: 252878.34516871115


# Pretty good results below, high chance of overfitting
### - That is exactly what is happening here, 99.8% on model accuracy is because of log_price.
### - Log Transformation can cause linearity, which may falsely "improve" the model and spike the r squared
### - However this is false in actual price space.

In [21]:
# Pretty good results, high chance of overfitting
# That is exactly what is happening here, 99.8% on model accuracy is because of log_price.
# Log Transformation can cause linearity, which may falsely "improve" the model and spike the r squared
# However this is false in actual price space.

import statsmodels.api as sm
import numpy as np

features = ['title', 'featuredLevel', 'publishDate', 'numBedrooms',
 'numBathrooms', 'propertyType', 'log_size', 'category', 'AMV_price',
 'sellerId', 'seller_name', 'seller_branch', 'sellerType', 'm_totalImages',
 'm_hasVideo', 'm_hasVirtualTour', 'm_hasBrochure', 'ber_rating', 'longitude',
 'latitude']

x = lrdf[features].select_dtypes(include=[np.number])

y = lrdf['log_price']

x = sm.add_constant(x)
model = sm.OLS(y,x).fit()

print(model.summary())

# We can remove seller ID and images, as they dont effect house prices at all

                            OLS Regression Results                            
Dep. Variable:              log_price   R-squared:                       0.257
Model:                            OLS   Adj. R-squared:                  0.255
Method:                 Least Squares   F-statistic:                     171.0
Date:                Fri, 28 Nov 2025   Prob (F-statistic):          1.46e-248
Time:                        13:44:21   Log-Likelihood:                -3062.4
No. Observations:                3967   AIC:                             6143.
Df Residuals:                    3958   BIC:                             6199.
Df Model:                           8                                         
Covariance Type:            nonrobust                                         
                    coef    std err          t      P>|t|      [0.025      0.975]
---------------------------------------------------------------------------------
const            11.8829      0.702     16.935

In [None]:
## Less features and check performance
## removing longitude and latitude reduces R^2 to 0.989 from 0.998 can jsut drop title and extra county/town info

import statsmodels.api as sm
import numpy as np

features = ['numBedrooms', 'numBathrooms', 'log_size', 'longitude', 'latitude']

x = lrdf[features].select_dtypes(include=[np.number])

y = lrdf['log_price']

x = sm.add_constant(x)
model = sm.OLS(y,x).fit()

print(model.summary())

                            OLS Regression Results                            
Dep. Variable:              log_price   R-squared:                       0.200
Model:                            OLS   Adj. R-squared:                  0.199
Method:                 Least Squares   F-statistic:                     198.5
Date:                Fri, 28 Nov 2025   Prob (F-statistic):          3.03e-189
Time:                        13:44:21   Log-Likelihood:                -3207.9
No. Observations:                3967   AIC:                             6428.
Df Residuals:                    3961   BIC:                             6466.
Df Model:                           5                                         
Covariance Type:            nonrobust                                         
                   coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------
const           11.7188      0.726     16.146   

### 5. Models and Training
#### Random forest (because it can capture non linear relationships, and there are decreased overfitting risks)

#### 5.1 train_test_split

In [64]:
from sklearn.model_selection import train_test_split
df = lrdf.copy()
x = lrdf[['numBedrooms', 'numBathrooms', 'log_size', 'longitude', 'latitude']]
y = lrdf['price']

# Train/test split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)

#### 5.2 Training and evaluating different models


In [51]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score, mean_squared_error

model = LinearRegression()
model.fit(x_train, y_train)

y_pred = model.predict(x_test)

print("r2_score:", r2_score(y_test, y_pred))
print("MAE: €{:.2f}".format(mean_absolute_error(y_test, y_pred)))
print("RMSE: €{:.2f}".format(np.sqrt(mean_squared_error(y_test, y_pred))))

r2_score: 0.21972541263927048
MAE: €154862.93
RMSE: €253196.50


#### Interpretation of linear regression results

R-squared - 0.21 is very low for our case. The model is not capturing much of the variability in our housing data. Similar to the OLS model we trained.

MAE: €154862.93 - on average this is how much the model’s predictions are off by

RMSE: €253196.50 - the model’s predictions will fall within about this amount from the true price.

#### Random forest regressor hyperparamter tuning

In [56]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, r2_score, mean_squared_error

model = RandomForestRegressor(n_estimators=150, max_depth=10, min_samples_split=5, random_state=42)
model.fit(x_train, y_train)

y_pred = model.predict(x_test)

print("r2_score:", r2_score(y_test, y_pred))
print("MAE: €{:.2f}".format(mean_absolute_error(y_test, y_pred)))
print("RMSE: €{:.2f}".format(np.sqrt(mean_squared_error(y_test, y_pred))))

r2_score: 0.3055140818841162
MAE: €149659.09
RMSE: €238872.25


In [57]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, r2_score, mean_squared_error

model = RandomForestRegressor(n_estimators=200, max_depth=15, min_samples_split=3, random_state=42)
model.fit(x_train, y_train)

y_pred = model.predict(x_test)

print("r2_score:", r2_score(y_test, y_pred))
print("MAE: €{:.2f}".format(mean_absolute_error(y_test, y_pred)))
print("RMSE: €{:.2f}".format(np.sqrt(mean_squared_error(y_test, y_pred))))

r2_score: 0.2967810312242717
MAE: €152432.76
RMSE: €240369.44


In [58]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, r2_score, mean_squared_error

model = RandomForestRegressor(n_estimators=300, max_depth=20, min_samples_split=2, random_state=42)
model.fit(x_train, y_train)

y_pred = model.predict(x_test)

print("r2_score:", r2_score(y_test, y_pred))
print("MAE: €{:.2f}".format(mean_absolute_error(y_test, y_pred)))
print("RMSE: €{:.2f}".format(np.sqrt(mean_squared_error(y_test, y_pred))))

r2_score: 0.28454579957726744
MAE: €154395.39
RMSE: €242451.51


### Interpretation
The random forest model that performed the best was with the following hyperparameters `RandomForestRegressor(n_estimators=150, max_depth=10, min_samples_split=5, random_state=42)`

#### 6: Cross validation + hyperparameter tuning

In [None]:
from sklearn.model_selection import train_test_split, RandomizedSearchCV
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error
import numpy as np
import pandas as pd
from scipy.stats import randint

# Parameter combinations to try
param_dist = {
    "max_depth": [3, 5, 10, None],
    "min_samples_split": randint(2, 20),   # random int between 2 and 20
    "min_samples_leaf": randint(1, 10),    # random int between 1 and 10
    "max_features": [None, "sqrt", "log2"]
}

# Train/test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# RandomizedSearchCV
rand_search = RandomizedSearchCV(
    RandomForestRegressor(random_state=42),
    param_distributions=param_dist,
    n_iter=30,          # number of random combinations to try
    cv=5,
    scoring="r2",
    random_state=42
)

rand_search.fit(X_train, y_train)

# Predictions and metrics
y_pred = rand_search.predict(X_test)

print("Best params:", rand_search.best_params_)
print("Best CV R²:", rand_search.best_score_)
print("Test R²:", r2_score(y_test, y_pred))
print("MAE:", mean_absolute_error(y_test, y_pred))
print("RMSE:", np.sqrt(mean_squared_error(y_test, y_pred)))

# Results table
results = pd.DataFrame(rand_search.cv_results_)
params_df = results["params"].apply(pd.Series)
table = pd.concat([params_df, results["mean_test_score"]], axis=1)
table = table.sort_values(by="mean_test_score", ascending=False).reset_index(drop=True)
table["mean_test_score"] = table["mean_test_score"].round(3)

print("\nTop 10 parameter sets:\n")
print(table.head(10).to_string(index=False))

Best params: {'max_depth': 10, 'max_features': None, 'min_samples_leaf': 3, 'min_samples_split': 6}
Best CV R²: 0.9863294075056828
Test R²: 0.9989334199273571
MAE: 989.0750079972113
RMSE: 9361.182304473967

Top 10 parameter sets:

 max_depth max_features  min_samples_leaf  min_samples_split  mean_test_score
      10.0          NaN               3.0                2.0            0.986
      10.0          NaN               3.0                6.0            0.986
      10.0          NaN               4.0                9.0            0.984
       NaN          NaN               1.0               13.0            0.983
      10.0          NaN               5.0               11.0            0.980
      10.0          NaN               8.0                4.0            0.973
      10.0          NaN               8.0                8.0            0.973
      10.0          NaN               8.0               17.0            0.973
       3.0          NaN               1.0                8.0       

### 7. Evaluation of our models: use appropriate metrics (e.g., accuracy/F1/, MAE/RMSE), with tables/plots. F1: MAE:

In [65]:
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score, mean_absolute_percentage_error

# Train Test Split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)

# Random Forest Model
modelR = RandomForestRegressor(random_state=42, max_depth=10, min_samples_split=15, min_samples_leaf=4, max_features='sqrt')
modelR.fit(x_train, y_train)
y_predR = modelR.predict(x_test)

# Decision Tree Model
modelD = DecisionTreeRegressor(random_state=42, max_depth=3, min_samples_split=8, min_samples_leaf=1, max_features=None)
modelD.fit(x_train, y_train)
y_predD = modelD.predict(x_test)

# Linear Regression Model
modelL = LinearRegression()

modelL.fit(x_train, y_train)
y_predL = modelL.predict(x_test)

# Our Metrics: Random Forest
R_MSE = mean_squared_error(y_test, y_predR)
R_RMSE = np.sqrt(R_MSE)
R_MAE = mean_absolute_error(y_test, y_predR)
R_R2 = r2_score(y_test, y_predR)
R_MAPE = mean_absolute_percentage_error(y_test, y_predR)
R_SMAPE = np.mean(
    2.0 * np.abs(y_predR - y_test) / (np.abs(y_test) + np.abs(y_predR) + 1e-8)
)

# Calculations for log_likelihood
n = len(y_test)
p = x_test.shape[1]
residuals = y_test - y_predL
RSS = np.sum(residuals**2)
sigma2 = RSS / n
log_likelihood = -0.5 * n * (np.log(2 * np.pi * sigma2) + 1)

# Use log_likelihood to get AIC and BIC
R_AIC = 2 * p - 2 * log_likelihood
R_BIC = p * np.log(n) - 2 * log_likelihood

print("RandomForest: Our Best Model!")
print(f"R²: [{R_R2}] - Measures how much of the variance in price is explained by the model.")
print(f"AIC: [{R_AIC}] - (More about model selection and fit vs complexity balance).")
print(f"BIC: [{R_BIC}] - (More about model selection and fit vs complexity balance).")
print("=====")
print(f"MSE: [{R_MSE}] - bigger the MSE, the further away predictions are from actual data.")
print(f"RMSE: [{R_RMSE}] - square root of mse, same units as target (price).")
print(f"MAE: [{R_MAE}] - More robust to outliers than MSE/RMSE, easier to explain on average we are wrong by x euro.")
print(f"MAPE: [{R_MAPE}] - How much our model under-estimates.")
print(f"SMAPE: [{R_SMAPE}] - tries to fix MAPE’s asymmetry; often used in forecasting tasks.")

# Decision Tree Metrics
D_MSE = mean_squared_error(y_test, y_predD)
D_RMSE = np.sqrt(D_MSE)
D_MAE = mean_absolute_error(y_test, y_predD)
D_R2 = r2_score(y_test, y_predD)
D_MAPE = mean_absolute_percentage_error(y_test, y_predD)
D_SMAPE = np.mean(
    2.0 * np.abs(y_predD - y_test) / (np.abs(y_test) + np.abs(y_predD) + 1e-8)
)

# Calculations for log_likelihood
D_n = len(y_test)
D_p = x_test.shape[1]
D_res = y_test - y_predD
D_RSS = np.sum(D_res**2)
D_sigma2 = D_RSS / D_n
D_logL = -0.5 * D_n * (np.log(2 * np.pi * D_sigma2) + 1)

# Use log_likelihood to get AIC and BIC
D_AIC = 2 * D_p - 2 * D_logL
D_BIC = D_p * np.log(D_n) - 2 * D_logL

# Linear Regression Metrics
L_MSE = mean_squared_error(y_test, y_predL)
L_RMSE = np.sqrt(L_MSE)
L_MAE = mean_absolute_error(y_test, y_predL)
L_R2 = r2_score(y_test, y_predL)
L_MAPE = mean_absolute_percentage_error(y_test, y_predL)
L_SMAPE = np.mean(
    2.0 * np.abs(y_predL - y_test) / (np.abs(y_test) + np.abs(y_predL) + 1e-8)
)

# Calculations for log_likelihood
L_n = len(y_test)
L_p = x_test.shape[1]
L_res = y_test - y_predL
L_RSS = np.sum(L_res**2)
L_sigma2 = L_RSS / L_n
L_logL = -0.5 * L_n * (np.log(2 * np.pi * L_sigma2) + 1)

# Use log_likelihood to get AIC and BIC
L_AIC = 2 * L_p - 2 * L_logL
L_BIC = L_p * np.log(L_n) - 2 * L_logL

results_table = pd.DataFrame([
    {
        "Model": "Random Forest",
        "R²": R_R2,
        "AIC": R_AIC,
        "BIC": R_BIC,
        "MSE": R_MSE,
        "RMSE": R_RMSE,
        "MAE": R_MAE,
        "MAPE": R_MAPE,
        "SMAPE": R_SMAPE
    },
    {
        "Model": "Decision Tree",
        "R²": D_R2,
        "AIC": D_AIC,
        "BIC": D_BIC,
        "MSE": D_MSE,
        "RMSE": D_RMSE,
        "MAE": D_MAE,
        "MAPE": D_MAPE,
        "SMAPE": D_SMAPE
    },
    {
        "Model": "Linear Regression",
        "R²": L_R2,
        "AIC": L_AIC,
        "BIC": L_BIC,
        "MSE": L_MSE,
        "RMSE": L_RMSE,
        "MAE": L_MAE,
        "MAPE": L_MAPE,
        "SMAPE": L_SMAPE
    }
])

print("\nResults Table:")
print(results_table.to_string(index=False))

RandomForest: Our Best Model!
R²: [0.3084406496485531] - Measures how much of the variance in price is explained by the model.
AIC: [22021.04518408867] - (More about model selection and fit vs complexity balance).
BIC: [22044.430601394906] - (More about model selection and fit vs complexity balance).
=====
MSE: [56819498774.433075] - bigger the MSE, the further away predictions are from actual data.
RMSE: [238368.40976612878] - square root of mse, same units as target (price).
MAE: [146821.61776806245] - More robust to outliers than MSE/RMSE, easier to explain on average we are wrong by x euro.
MAPE: [0.5796257539405477] - How much our model under-estimates.
SMAPE: [0.4093380152211956] - tries to fix MAPE’s asymmetry; often used in forecasting tasks.

Results Table:
            Model       R²          AIC          BIC          MSE          RMSE           MAE     MAPE    SMAPE
    Random Forest 0.308441 22021.045184 22044.430601 5.681950e+10 238368.409766 146821.617768 0.579626 0.409338

## Lastly, testing out our models!

In [66]:
# Example single input (replace with realistic values for your dataset)
sample = pd.DataFrame({
    "numBedrooms": [2],
    "numBathrooms": [2],
    "log_size": [1.9731278536],
    "longitude": [-6.15],
    "latitude": [53.40]
})

# Use Random Forest model to predict
pred1 = modelR.predict(sample)
pred2 = modelD.predict(sample)
pred3 = modelL.predict(sample)

print(f"Predicted price for sample house ({sample["numBedrooms"][0]} beds, {sample["numBathrooms"][0]} baths, {np.square(np.exp(sample["log_size"][0])):,.2f} square footage, and in Dublin 13) using Random Forest: €{pred1[0]:,.2f}\n")

# Compare against our bad models
print("Compared with Decision Tree and Linear Regression:")
print(f"Decision Tree -> €{pred2[0]:,.2f}")
print(f"Linear Regression -> €{pred3[0]:,.2f}")

Predicted price for sample house (2 beds, 2 baths, 51.74 square footage, and in Dublin 13) using Random Forest: €262,572.63

Compared with Decision Tree and Linear Regression:
Decision Tree -> €302,346.24
Linear Regression -> €24,710.04
