What to do:
1. Problem framing: goal, target variable(s), success metric(s).
2. Data description: provenance, size, features, licensing, any cleaning you performed.
3. EDA & preprocessing: summaries, visuals, handling missing/outliers, feature engineering.
4. Baseline model: a simple benchmark (e.g., majority class, linear model).
5. Models & training: at least one ML approach suitable for the task (justify your choice).
6. Validation: proper split/cross-validation; tune hyperparameters.
7. Evaluation: use appropriate metrics (e.g., accuracy/F1/AUC, MAE/RMSE), with tables/plots.
8. Error analysis & insights: where the model fails/succeeds;
9. Limitations & future work: data quality, bias, generalisability, next steps.
10. Ethics & data protection: consent/licensing, GDPR considerations, bias if any.

## 1. Problem framing
Our goal is to find the set of features and preprocessing steps that most successfully predicts property prices in Ireland.

**Target variable:** `price`

**Success metrics**
- MSE (Mean Squared Error)
    - Average of the squared errors, i.e. the mean of (y - y-hat)^2
    - Penalises large errors heavily; the result is in squared units of the target (price squared), making it harder to interpret.
- RMSE (Root Mean Squared Error)
    - Square root of MSE, so it is in the same units as the target (price)
    - Still sensitive to outliers, but more interpretable
- MAE (Mean Absolute Error)
    - Average of the absolute errors, i.e. the mean of |y - y-hat|, where y-hat is the predicted price
    - More robust to outliers than MSE/RMSE and easier to explain: "on average we are wrong by x euro"
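
To make the three definitions concrete, here is a minimal NumPy sketch. The helper name `regression_errors` and the three prices are made up for illustration; they are not taken from the dataset.

```python
import numpy as np

def regression_errors(y_true, y_pred):
    """Return MSE, RMSE and MAE for two equal-length sequences (hypothetical helper)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred            # y - y-hat
    mse = np.mean(residuals ** 2)          # units: euro squared
    rmse = np.sqrt(mse)                    # units: euro
    mae = np.mean(np.abs(residuals))       # units: euro
    return {"mse": mse, "rmse": rmse, "mae": mae}

# Made-up prices in euro, purely illustrative
y_true = [200_000, 300_000, 400_000]
y_pred = [210_000, 290_000, 430_000]

errors = regression_errors(y_true, y_pred)
print(errors)
```

The single 30k miss pulls RMSE above MAE, which is the "penalises large errors heavily" behaviour described in the list.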

**Other options**
- R^2 - Measures how much of the variance in price is explained by the model; 1 is perfect, 0 means “no better than predicting the mean”
- AIC/BIC, log-likelihood (for linear models) - More about model selection and fit vs complexity balance, less about “how wrong is the prediction in euros”.
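- MAPE (Mean Absolute Percentage Error) - Interpretable as “average percentage error”, but breaks when actual prices can be zero/very small, and it is asymmetric: for non-negative predictions an under-estimate can never exceed 100% error while an over-estimate is unbounded, so over-estimates are penalised more heavily.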
- SMAPE (Symmetric MAPE) - tries to fix MAPE’s asymmetry; often used in forecasting tasks.
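
A rough sketch of three of these alternatives on made-up numbers. The function names are ours, and SMAPE has several variants; the one assumed here is 2|y - y-hat| / (|y| + |y-hat|). The example compares predicting double and predicting half of a true value of 100.

```python
import numpy as np

def r2(y_true, y_pred):
    """1 - SS_res / SS_tot (centered R^2)."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent (undefined if y_true has zeros)."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return np.mean(np.abs(y_true - y_pred) / np.abs(y_true)) * 100

def smape(y_true, y_pred):
    """Symmetric MAPE in percent, using the 2|e| / (|y| + |y_hat|) variant."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return np.mean(2 * np.abs(y_true - y_pred) / (np.abs(y_true) + np.abs(y_pred))) * 100

# Predicting the mean everywhere gives R^2 = 0
r2_mean_model = r2([1, 2, 3], [2, 2, 2])

# True value 100: predicting double vs predicting half
mape_double, mape_half = mape([100], [200]), mape([100], [50])
smape_double, smape_half = smape([100], [200]), smape([100], [50])
print(r2_mean_model, mape_double, mape_half, smape_double, smape_half)
```

MAPE scores the doubled prediction at 100% and the halved one at 50%, while this SMAPE variant scores both the same.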

## 2. Data description
- The dataset we are using is available on Kaggle [here](https://www.kaggle.com/datasets/eavannan/daftie-house-price-data)
    - We use the 'daft_ie_v1.csv' file
        - It has 3967 rows; only one column ('propertySize') contains nulls (we might need to add some in, so we can say we did cleaning on NAs)
            - propertySize has 355 nulls
        - columns: ['id', 'title', 'featuredLevel', 'publishDate', 'price', 'numBedrooms', 'numBathrooms', 'propertyType', 'propertySize', 'category', 'AMV_price', 'sellerId', 'seller_name', 'seller_branch', 'sellerType', 'm_totalImages', 'm_hasVideo', 'm_hasVirtualTour', 'm_hasBrochure', 'ber_rating', 'longitude', 'latitude']

In [92]:
import pandas as pd
df = pd.read_csv("daft_ie_v1.csv")

In [50]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3967 entries, 0 to 3966
Data columns (total 22 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   id                3967 non-null   int64  
 1   title             3967 non-null   object 
 2   featuredLevel     3967 non-null   object 
 3   publishDate       3967 non-null   object 
 4   price             3967 non-null   int64  
 5   numBedrooms       3967 non-null   int64  
 6   numBathrooms      3967 non-null   int64  
 7   propertyType      3967 non-null   object 
 8   propertySize      3612 non-null   float64
 9   category          3967 non-null   object 
 10  AMV_price         3967 non-null   int64  
 11  sellerId          3967 non-null   float64
 12  seller_name       3967 non-null   object 
 13  seller_branch     3967 non-null   object 
 14  sellerType        3967 non-null   object 
 15  m_totalImages     3967 non-null   float64
 16  m_hasVideo        3967 non-null   bool   


In [51]:
## columns
df.columns

Index(['id', 'title', 'featuredLevel', 'publishDate', 'price', 'numBedrooms',
       'numBathrooms', 'propertyType', 'propertySize', 'category', 'AMV_price',
       'sellerId', 'seller_name', 'seller_branch', 'sellerType',
       'm_totalImages', 'm_hasVideo', 'm_hasVirtualTour', 'm_hasBrochure',
       'ber_rating', 'longitude', 'latitude'],
      dtype='object')

### 3. EDA & Preprocessing

#### 3.1 Histograms - log-transforming

What we're looking for
- Original scale: if the histogram is strongly right‑skewed, a log transform often helps.
- Log scale: if the histogram of log_price looks more symmetric, bell‑ish, that’s a sign that log‑transforming for regression is useful (errors closer to normal, linearity easier).
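
Alongside the visual check, skewness can be compared numerically. The sketch below uses synthetic log-normal "prices" (an assumption standing in for the real `price` column) and pandas' `Series.skew()`; values near 0 indicate a roughly symmetric distribution.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic right-skewed prices (log-normal); NOT the Daft.ie data
prices = pd.Series(rng.lognormal(mean=12.5, sigma=0.6, size=5000))

raw_skew = prices.skew()              # strongly positive for right-skewed data
log_skew = np.log1p(prices).skew()    # close to 0 if the log scale is symmetric

print(f"skew(price)     = {raw_skew:.2f}")
print(f"skew(log_price) = {log_skew:.2f}")
```

On the real data the same two lines could be run on `df['price']` and `df['log_price']`.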

In [101]:
import plotly.express as px
from plotly.subplots import make_subplots
import numpy as np

# Log-transform price and property size
df['log_price'] = np.log1p(df['price'])          # log(1 + x) to avoid log(0)
df['log_size']  = np.log1p(df['propertySize'])

# Price histograms: original scale, then log scale
px.histogram(df, x='price', nbins=50, title='Price').show()
px.histogram(df, x='log_price', nbins=50, title='log(price)').show()

# Property size histograms: original scale, then log scale
px.histogram(df, x='propertySize', nbins=50, title='Property size').show()
px.histogram(df, x='log_size', nbins=50, title='log(size)').show()

#### 3.2 Splitting the address so we can get county and town

In [63]:
# title would be useful as it contains information on location, rather than just longitude and latitude
# to access the county and other location information we need to split the title up
print(df['title'])

# 1. Extract county / Dublin district at the end
county_pattern = r'(Co\.\s+\w+|Dublin\s+\d+)$'
df['county'] = df['title'].str.extract(county_pattern)

# 2. Remove that county piece (and any trailing comma/space) from the title
df['no_county'] = df['title'].str.replace(
    r',?\s*(Co\.\s+\w+|Dublin\s+\d+)$', 
    '', 
    regex=True
)

# 3. From what’s left, split from the RIGHT into "address_part" and "town"
#    (town is the last comma-separated chunk; 
#     address_part is everything before that, even if it has extra commas)
tmp = df['no_county'].str.rsplit(',', n=1, expand=True)
df['address_part'] = tmp[0].str.strip()
df['town'] = tmp[1].str.strip()

# Optionally, you can now drop helper column
df = df.drop(columns=['no_county'])
print(df['county'].value_counts())

0       11 Chestnut Crescent, Bridgemount, Carrigaline...
1       58 The Glen, Kilnacourt Woods, Portarlington, ...
2             16 Dodderbrook Park, Ballycullen, Dublin 24
4                     5 Columba Terrace, Kells, Co. Meath
5       75 The Lawn, Coolroe Meadows, Ballincollig, Co...
                              ...                        
3961              8 Sliabh Cairbe, Drumlish, Co. Longford
3962                13 Cherry Close, Bellfield, Waterford
3963                 8 Thomas Street, Castlebar, Co. Mayo
3965                School Land, Ballinalee, Co. Longford
3966    14 Coolmagort Ave, Beaufort, Killarney, Co. Kerry
Name: title, Length: 3612, dtype: object
county
Co. Cork         387
Co. Dublin       298
Co. Galway       168
Co. Kildare      145
Co. Wexford      130
Co. Limerick     125
Co. Meath        112
Co. Kerry        111
Co. Mayo         106
Co. Westmeath    101
Dublin 15         95
Co. Wicklow       89
Co. Waterford     85
Co. Donegal       84
Co. Tipperary     83
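
To see how the pattern behaves on individual titles, here is a small self-contained sketch on three titles taken from the printed output above. It reuses the same regex and the same right-split; `expand=False` is our choice so that `str.extract` returns a Series.

```python
import pandas as pd

titles = pd.Series([
    "16 Dodderbrook Park, Ballycullen, Dublin 24",
    "5 Columba Terrace, Kells, Co. Meath",
    "13 Cherry Close, Bellfield, Waterford",   # no "Co." prefix or Dublin district
])

county_pattern = r'(Co\.\s+\w+|Dublin\s+\d+)$'
county = titles.str.extract(county_pattern, expand=False)

# Strip the county piece, then split the remainder from the right
remainder = titles.str.replace(r',?\s*(Co\.\s+\w+|Dublin\s+\d+)$', '', regex=True)
parts = remainder.str.rsplit(',', n=1, expand=True)

result = pd.DataFrame({
    "county": county,
    "address_part": parts[0].str.strip(),
    "town": parts[1].str.strip(),
})
print(result)
```

The third title has no match, so its county is NaN and "Waterford" ends up in `town`; titles of this shape are one way a listing can end up with a missing county.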

In [64]:
df.propertyType.value_counts()
# Removing propertyType's 'Studio', 'House', 'Site', 'Duplex', 'Townhouse', 
#df = df[~df['propertyType'].isin(['Studio', 'House', 'Site', 'Duplex', 'Townhouse'])]

propertyType
Detached          958
Semi-D            884
Apartment         673
Terrace           563
End of Terrace    230
Bungalow          180
Townhouse          58
Duplex             46
Site               17
House               2
Studio              1
Name: count, dtype: int64

In [65]:
df.loc[df['price'].isin([20000])] # Listings priced at 20,000 (one Detached, one Site); possibly outliers

Unnamed: 0,id,title,featuredLevel,publishDate,price,numBedrooms,numBathrooms,propertyType,propertySize,category,...,m_hasVirtualTour,m_hasBrochure,ber_rating,longitude,latitude,log_price,log_size,county,address_part,town
1039,3690452,"Greaghhacholea, (Folio CN25921F), Killeshandra...",standard,2022-01-28,20000,1,1,Detached,169.0,Buy,...,False,False,C3,-8.971497,52.811777,9.903538,5.135798,Co. Cavan,"Greaghhacholea, (Folio CN25921F)",Killeshandra
1073,3689958,"Dunaft, (Folio DL73611F), Clonmany, Co. Donegal",standard,2022-01-28,20000,1,1,Site,50.0,Buy,...,False,False,SI_666,-7.280916,54.085887,9.903538,3.931826,Co. Donegal,"Dunaft, (Folio DL73611F)",Clonmany


#### 3.3 Identifying outliers

In [27]:
import plotly.express as px

# first define which columns would be useful to know about outliers
outlier_cols = ['price', 'numBedrooms', 'numBathrooms', 'propertySize', 'm_totalImages']

for col in outlier_cols:
    fig = px.box(df, y=col, points='suspectedoutliers', title=f"{col} - Boxplot")
    fig.show()

In [None]:
df.loc[df['price'].isin([4500000])] # Not an outlier but will it distort our model?
df.loc[df['numBedrooms'].isin([23])] # Could not find this listing online, may be an outlier
df.loc[df['propertySize'].isin([8600, 8094])] # Not an outlier but will it distort our model?
df['AMV_price'].value_counts() # what is this?

AMV_price
0    3779
1     188
Name: count, dtype: int64

#### 3.4 Why apply the IQR rule before training a model
- The IQR rule flags values that lie far from the bulk of the data, measured in the original units: anything below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR.

- Linear regression is sensitive to outliers
    - It fits by minimising squared errors (MSE), so a few very extreme points (like 4.5m houses) can dominate the loss.
    - The fitted line gets pulled towards those rare extremes, making predictions worse for the majority of “normal” houses (e.g. 190k–360k).
- Better generalisation to future data
    - Most future properties we care about are likely to be in the typical price range, not in the tiny set of extreme mansions.
    - Training on data with the extreme outliers removed should give a better bias/variance trade‑off for that typical range.
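
The sensitivity of a least-squares fit to one extreme point can be shown with a tiny synthetic example. All numbers below are invented: five "typical" houses lying exactly on a line, plus one 600 m² house priced at 4.5m.

```python
import numpy as np

# Synthetic typical houses: price = 2000 * size + 50,000 exactly
size = np.array([60, 80, 100, 120, 140], dtype=float)
price = 2000 * size + 50_000

slope_clean, intercept_clean = np.polyfit(size, price, deg=1)

# Add a single extreme listing and refit
size_out = np.append(size, 600.0)
price_out = np.append(price, 4_500_000.0)
slope_out, intercept_out = np.polyfit(size_out, price_out, deg=1)

# Compare predictions for a small, typical 60 m² house
pred_clean = slope_clean * 60 + intercept_clean
pred_out = slope_out * 60 + intercept_out

print(f"slope without outlier: {slope_clean:.0f} euro per m²")
print(f"slope with outlier:    {slope_out:.0f} euro per m²")
print(f"60 m² prediction: {pred_clean:.0f} vs {pred_out:.0f}")
```

One point out of six is enough to change the slope several-fold and to make the prediction for a small house much worse, which is the motivation for filtering before fitting.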

In [43]:
def remove_outliers(df, col):
    Q1 = df[col].quantile(0.25)
    Q3 = df[col].quantile(0.75)

    IQR = Q3 - Q1

    # define lower bound and upper bound
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR

    # Keep only rows within [lower_bound, upper_bound], i.e. drop the outliers
    no_outliers_df = df[(df[col] >= lower_bound) & (df[col] <= upper_bound)]
    return no_outliers_df

# PRICE: remove price outliers first, then re-plot

no_outliers_df = remove_outliers(df, 'price')

for col in outlier_cols:
    fig = px.box(no_outliers_df, y=col, points='suspectedoutliers', title=f"{col} - Boxplot")
    fig.show()

In [44]:
# M_TOTALIMAGES
no_outliers_df = remove_outliers(no_outliers_df, 'm_totalImages')
for col in outlier_cols:
    fig = px.box(no_outliers_df, y=col, points='suspectedoutliers', title=f"{col} - Boxplot")
    fig.show()

In [45]:
# PROPERTYSIZE 
no_outliers_df = remove_outliers(no_outliers_df, 'propertySize')
for col in outlier_cols:
    fig = px.box(no_outliers_df, y=col, points='suspectedoutliers', title=f"{col} - Boxplot")
    fig.show()

In [46]:
# Investigating propertySize values of 1 to 7 (implausibly small); these need to be dropped
df[df['propertySize'].isin([1,2,3,4,5,6,7])]

Unnamed: 0,id,title,featuredLevel,publishDate,price,numBedrooms,numBathrooms,propertyType,propertySize,category,...,sellerType,m_totalImages,m_hasVideo,m_hasVirtualTour,m_hasBrochure,ber_rating,longitude,latitude,log_price,log_size
1457,3686246,"Ballymore Lodge Ballycanew, Gorey, Co. Wexford",standard,2022-01-28,185000,4,1,Detached,1.0,Buy,...,BRANDED_AGENT,34.0,False,False,False,D2,-8.764859,53.762837,12.128117,0.693147
1640,3684706,"NICHOLSON'S, Bridge Street, Ballyhaunis, Co. Mayo",standard,2022-01-28,220000,4,5,Townhouse,1.0,Buy,...,BRANDED_AGENT,29.0,False,True,False,C1,-6.797753,53.166666,12.301387,0.693147
1652,3684616,"37 Bruach Na Gaile, Moyvane, Co. Kerry",standard,2022-01-30,149000,3,3,Semi-D,7.0,Buy,...,UNBRANDED_AGENT,1.0,False,False,False,C2,-9.548146,53.799898,11.911708,2.079442
1836,3682050,"Apartment 102, The Harbour Mill, Westport, Co....",standard,2022-01-28,185000,2,2,Apartment,7.0,Buy,...,BRANDED_AGENT,14.0,True,False,False,SI_666,-7.879693,52.616347,12.128117,2.079442
2378,3674774,"Knockeennahone, Scartaglin, Co. Kerry",standard,2022-01-26,80000,3,1,Detached,1.0,Buy,...,BRANDED_AGENT,12.0,True,False,False,G,-6.208533,53.316831,11.289794,0.693147


In [47]:
# NUMBEDROOMS - removing bedroom outliers also tightens the numBathrooms boxplot
no_outliers_df = remove_outliers(no_outliers_df, 'numBedrooms')
for col in outlier_cols:
    fig = px.box(no_outliers_df, y=col, points='suspectedoutliers', title=f"{col} - Boxplot")
    fig.show()

### Relationships with price
- price vs propertySize
    - On the raw scale the plot is dominated by outliers
    - On the log-log scale the relationship is much clearer and roughly linear, so we use log_price as the target and log_size as one of the predictors

- bedrooms and bathrooms
    - Both have an upward relationship with price
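
Why a log-log plot becomes linear: if price is roughly a power law in size, price = a * size^b, then log(price) = log(a) + b * log(size), a straight line with slope b. The sketch below checks this on synthetic data; the constants 5000 and 0.8 are assumptions chosen for illustration, and plain `np.log` is used rather than `log1p` since the synthetic values are far from zero.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic power-law data with multiplicative noise; NOT the Daft.ie data
size = rng.uniform(40, 400, size=500)
true_exponent = 0.8
price = 5000 * size ** true_exponent * np.exp(rng.normal(0, 0.1, size=500))

# A straight-line fit on the log-log scale recovers the exponent as the slope
slope, intercept = np.polyfit(np.log(size), np.log(price), deg=1)

print(f"fitted slope (exponent): {slope:.3f}")
print(f"fitted scale a:          {np.exp(intercept):.0f}")
```

The slope can be read as an elasticity: a 1% increase in size goes with roughly a `slope`% increase in price.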

In [None]:
# price vs propertySize
px.scatter(df, x='propertySize', y='price', trendline='ols',
           title='Price vs Property size').show()

px.scatter(df, x='log_size', y='log_price', trendline='ols',
           title='log Price vs log Property size').show()


# price vs numBedrooms
px.scatter(df, x='numBedrooms', y='price', trendline='ols',
           title='Price vs Bedrooms').show()

# price vs numBathrooms
px.scatter(df, x='numBathrooms', y='price', trendline='ols',
           title='Price vs Bathrooms').show()

# numBedrooms vs numBathrooms
px.scatter(df, x='numBathrooms', y='numBedrooms', trendline='ols',
           title='Bedrooms vs Bathrooms').show()

# latitude vs longitude (helps spot coordinates that fall outside Ireland)
px.scatter(df, x='longitude', y='latitude', trendline='ols',
           title='longitude vs latitude').show()

In [None]:
df[df['latitude'].isin([39.78373])]



Unnamed: 0,id,title,featuredLevel,publishDate,price,numBedrooms,numBathrooms,propertyType,propertySize,category,...,m_hasVirtualTour,m_hasBrochure,ber_rating,longitude,latitude,log_price,log_size,county,address_part,town
3902,3645859,"6 Ashgrove Drive, Ballyvolane, Ballyvolane, Co...",standard,2022-01-24,295000,3,1,Detached,84.0,Buy,...,False,False,XXX,-100.445882,39.78373,12.594734,4.442651,Co. Cork,"6 Ashgrove Drive, Ballyvolane",Ballyvolane


### Investigating null values
- All NAs are in the propertySize column. With over 3.5k complete rows available, this looks like a good case for training a model to impute the missing values.
- Other options include measures of central tendency (mean, median, mode), but a single global value would not suit propertySize because size depends on many factors
    - Finding similar records (rows) and borrowing their sizes could be another option
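
One lightweight version of the "similar records" idea is a group-wise median: fill each missing size with the median of rows sharing the same property type and bedroom count, falling back to the overall median when a group has no known sizes. This is a sketch on a made-up toy frame, with the grouping columns chosen as an assumption.

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration
toy = pd.DataFrame({
    "propertyType": ["Semi-D", "Semi-D", "Semi-D", "Apartment", "Apartment", "Detached"],
    "numBedrooms":  [3, 3, 3, 2, 2, 5],
    "propertySize": [100.0, 110.0, np.nan, 60.0, np.nan, np.nan],
})

# Median size within each (propertyType, numBedrooms) group, aligned to the rows
group_median = (
    toy.groupby(["propertyType", "numBedrooms"])["propertySize"].transform("median")
)
overall_median = toy["propertySize"].median()

# Fill from the group first, then fall back to the overall median
toy["propertySize_filled"] = (
    toy["propertySize"].fillna(group_median).fillna(overall_median)
)
print(toy)
```

The Detached row has no comparable listing with a known size, so it receives the overall median; with real data it may be preferable to leave such rows missing.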

In [31]:
## Option 1 - remove NA's
df_nona = df.dropna()
df_nona

Unnamed: 0,id,title,featuredLevel,publishDate,price,numBedrooms,numBathrooms,propertyType,propertySize,category,...,seller_name,seller_branch,sellerType,m_totalImages,m_hasVideo,m_hasVirtualTour,m_hasBrochure,ber_rating,longitude,latitude
0,3626025,"11 Chestnut Crescent, Bridgemount, Carrigaline...",featured,2022-01-28,290000,3,3,End of Terrace,96.0,Buy,...,Roy Dennehy,Dennehy Auctioneers,BRANDED_AGENT,16.0,False,False,False,C2,-8.382500,51.822940
1,3675175,"58 The Glen, Kilnacourt Woods, Portarlington, ...",featured,2022-01-28,225000,3,2,Semi-D,93.0,Buy,...,Marie Kiernan,Tom McDonald & Associates,BRANDED_AGENT,33.0,False,False,False,C1,-7.177098,53.157465
2,3673450,"16 Dodderbrook Park, Ballycullen, Dublin 24",featured,2022-01-27,575000,4,3,Semi-D,162.0,Buy,...,Moovingo,Moovingo,BRANDED_AGENT,38.0,False,True,False,A3,-6.342763,53.269493
4,3643947,"5 Columba Terrace, Kells, Co. Meath",featured,2022-01-28,120000,3,1,Terrace,68.0,Buy,...,REA T&J Gavigan,REA T & J Gavigan,BRANDED_AGENT,5.0,False,False,False,G,-6.879797,53.728601
5,3598816,"75 The Lawn, Coolroe Meadows, Ballincollig, Co...",featured,2022-01-30,400000,4,3,Semi-D,113.0,Buy,...,Norma Healy,Sherry FitzGerald Cork,BRANDED_AGENT,20.0,True,False,False,C1,-8.614786,51.883612
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3961,3644422,"8 Sliabh Cairbe, Drumlish, Co. Longford",standard,2021-12-13,185000,4,3,Semi-D,125.0,Buy,...,Paul O'Shea,Sherry FitzGerald Cork,BRANDED_AGENT,34.0,False,False,False,A3,-8.315556,51.849705
3962,3644416,"13 Cherry Close, Bellfield, Waterford",standard,2022-01-24,235000,3,3,Semi-D,103.0,Buy,...,Robert Forbes,Forbes Property,BRANDED_AGENT,24.0,False,True,False,A1,-7.212145,53.647194
3963,3644275,"8 Thomas Street, Castlebar, Co. Mayo",standard,2022-01-30,149500,3,1,Bungalow,82.0,Buy,...,DNG John O' Brien Office,DNG John O’Brien,UNBRANDED_AGENT,14.0,True,False,False,A3,-6.753848,54.115088
3965,3644099,"School Land, Ballinalee, Co. Longford",standard,2021-12-04,170000,4,2,Detached,128.0,Buy,...,Tom Hickey,Hickey O'Donoghue Auctioneers Ltd.,BRANDED_AGENT,38.0,False,True,False,A2,-8.652927,52.664558


In [28]:
# Create a mask for rows with any NA values
mask = df.isna().any(axis=1)
df[mask].head()
# df['featuredLevel'].value_counts()

Unnamed: 0,id,title,featuredLevel,publishDate,price,numBedrooms,numBathrooms,propertyType,propertySize,category,...,m_totalImages,m_hasVideo,m_hasVirtualTour,m_hasBrochure,ber_rating,longitude,latitude,county,address_part,town
3,3649708,"31 Lissanalta Drive, Dooradoyle, Co. Limerick",featured,2022-01-28,299000,3,3,Semi-D,,Buy,...,22.0,False,False,False,C2,-8.640716,52.629588,Co. Limerick,31 Lissanalta Drive,Dooradoyle
9,3486462,"Ballykeeran, Co. Westmeath",featured,2022-01-30,695000,7,6,Detached,400.0,Buy,...,24.0,True,False,False,C2,-7.899131,53.446741,Co. Westmeath,Ballykeeran,
11,3636266,"14 Ballinakill Avenue, Ballinakill, Co. Waterford",featured,2022-01-12,450000,4,2,Detached,,Buy,...,28.0,False,False,False,D1,-7.067098,52.242731,Co. Waterford,14 Ballinakill Avenue,Ballinakill
13,3655680,"Knockaneasy, Our Ladys Island",featured,2022-01-14,395000,3,2,Detached,204.0,Buy,...,43.0,True,False,False,C2,-6.37534,52.205808,,Knockaneasy,Our Ladys Island
17,3476643,"Seamount Rise, Malahide, Co. Dublin",featured,2022-01-11,625000,3,3,Terrace,,New Homes,...,29.0,False,True,False,SI_666,-7.107895,52.255349,Co. Dublin,Seamount Rise,Malahide


In [22]:
# columns that could be good to impute propertySize value 
similarity_columns = ['price', 'numBedrooms', 'numBathrooms', 'propertyType', 'county', 'town'] # seller_branch is a maybe

import numpy as np

def impute_by_similarity(
    df,
    target_col,
    similarity_cols,
    price_col='price',
    index=None
):
    """
    Impute missing values in `target_col` using similar rows.
    
    - If `index` is given, only impute that row.
    - Otherwise, impute all rows where `target_col` is NaN.
    """
    df = df.copy()
    
    # rows that need imputation
    if index is not None:
        missing_idx = [index]
    else:
        missing_idx = df[df[target_col].isna()].index

    price_std = df[price_col].std()

    for idx in missing_idx:
        row = df.loc[idx]

        # start from rows where target is known
        similar = df[df[target_col].notna()].copy()

        # match on all non-price similarity columns exactly
        for col in similarity_cols:
            if col == price_col:
                continue
            similar = similar[similar[col] == row[col]]

        # restrict by price window around the row's price
        similar = similar[
            similar[price_col].between(
                row[price_col] - price_std,
                row[price_col] + price_std
            )
        ]

        if similar.empty:
            # nothing similar found -> skip
            continue

        mean_val = similar[target_col].mean()

        if not np.isnan(mean_val):
            # round if you want an integer result
            df.loc[idx, target_col] = round(mean_val)

    return df

df_imputed = impute_by_similarity(
    df=df,                   # or whatever your DataFrame is called
    target_col='propertySize',
    similarity_cols=similarity_columns,
    price_col='price'             # matches your similarity_columns[0]
)

In [None]:
# only went from 355 na's to 248 - not great
df_imputed.isna().sum()

id                    0
title                 0
featuredLevel         0
publishDate           0
price                 0
numBedrooms           0
numBathrooms          0
propertyType          0
propertySize        248
category              0
AMV_price             0
sellerId              0
seller_name           0
seller_branch         0
sellerType            0
m_totalImages         0
m_hasVideo            0
m_hasVirtualTour      0
m_hasBrochure         0
ber_rating            0
longitude             0
latitude              0
county               41
address_part          0
town                 61
dtype: int64

### 4. Linear Regression

In [33]:
import statsmodels.api as sm
import numpy as np

nona_df = df.dropna().copy()
features = ['title', 'featuredLevel', 'publishDate', 'numBedrooms',
 'numBathrooms', 'propertyType', 'propertySize', 'category', 'AMV_price',
 'sellerId', 'seller_name', 'seller_branch', 'sellerType', 'm_totalImages',
 'm_hasVideo', 'm_hasVirtualTour', 'm_hasBrochure', 'ber_rating', 'longitude',
 'latitude']

x = nona_df[features].select_dtypes(include=[np.number])

y = nona_df['price']

# x = sm.add_constant(x)
model = sm.OLS(y,x).fit()

print(model.summary())

                                 OLS Regression Results                                
Dep. Variable:                  price   R-squared (uncentered):                   0.690
Model:                            OLS   Adj. R-squared (uncentered):              0.689
Method:                 Least Squares   F-statistic:                              1002.
Date:                Mon, 03 Nov 2025   Prob (F-statistic):                        0.00
Time:                        21:12:13   Log-Likelihood:                         -49958.
No. Observations:                3612   AIC:                                  9.993e+04
Df Residuals:                    3604   BIC:                                  9.998e+04
Df Model:                           8                                                  
Covariance Type:            nonrobust                                                  
                    coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------

features = ['id', 'title', 'featuredLevel', 'publishDate', 'price', 'numBedrooms',
       'numBathrooms', 'propertyType', 'propertySize', 'category', 'AMV_price',
       'sellerId', 'seller_name', 'seller_branch', 'sellerType',
       'm_totalImages', 'm_hasVideo', 'm_hasVirtualTour', 'm_hasBrochure',
       'ber_rating', 'longitude', 'latitude']

In [None]:
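# R-squared looks very good, but no constant is added, so statsmodels reports the
# uncentered R-squared, which is not comparable with the usual (centered) R-squared.
# There is also a high chance of overfitting.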

import statsmodels.api as sm
import numpy as np

df = df.dropna().copy()
features = ['title', 'featuredLevel', 'publishDate', 'numBedrooms',
 'numBathrooms', 'propertyType', 'log_size', 'category', 'AMV_price',
 'sellerId', 'seller_name', 'seller_branch', 'sellerType', 'm_totalImages',
 'm_hasVideo', 'm_hasVirtualTour', 'm_hasBrochure', 'ber_rating', 'longitude',
 'latitude']

x = df[features].select_dtypes(include=[np.number])

y = df['log_price']

# x = sm.add_constant(x)
model = sm.OLS(y,x).fit()

print(model.summary())

                                 OLS Regression Results                                
Dep. Variable:              log_price   R-squared (uncentered):                   0.998
Model:                            OLS   Adj. R-squared (uncentered):              0.998
Method:                 Least Squares   F-statistic:                          2.452e+05
Date:                Tue, 25 Nov 2025   Prob (F-statistic):                        0.00
Time:                        15:04:58   Log-Likelihood:                         -2890.2
No. Observations:                3612   AIC:                                      5796.
Df Residuals:                    3604   BIC:                                      5846.
Df Model:                           8                                                  
Covariance Type:            nonrobust                                                  
                    coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------

In [84]:
## Fewer features, then check performance
## Removing longitude and latitude reduces R^2 from 0.998 to 0.989; we can just drop title and the extra county/town info

import statsmodels.api as sm
import numpy as np

df = df.dropna().copy()
features = ['numBedrooms', 'numBathrooms', 'log_size', 'longitude', 'latitude']

x = df[features].select_dtypes(include=[np.number])
# print(x)

y = df['log_price']

# x = sm.add_constant(x)
model = sm.OLS(y,x).fit()

print(model.summary())

                                 OLS Regression Results                                
Dep. Variable:              log_price   R-squared (uncentered):                   0.998
Model:                            OLS   Adj. R-squared (uncentered):              0.998
Method:                 Least Squares   F-statistic:                          3.614e+05
Date:                Tue, 25 Nov 2025   Prob (F-statistic):                        0.00
Time:                        15:27:31   Log-Likelihood:                         -2909.6
No. Observations:                3515   AIC:                                      5829.
Df Residuals:                    3510   BIC:                                      5860.
Df Model:                           5                                                  
Covariance Type:            nonrobust                                                  
                   coef    std err          t      P>|t|      [0.025      0.975]
---------------------------------------
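
The summaries above report "R-squared (uncentered)" because no constant is added to `x`. The NumPy sketch below, on synthetic data, suggests why that number should be read with care: a predictor with a large mean (similar in scale to latitude) that is unrelated to the target still yields an uncentered R-squared near 1 when the model has no intercept. All values are invented, and the helper names are ours.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 1000

# Predictor with a large mean, generated independently of y
x = rng.normal(53.0, 1.0, size=n)
# Target with a large mean, on a scale similar to log_price
y = 12.5 + rng.normal(0.0, 0.5, size=n)

def fit_predict(X, y):
    """Least-squares fit of y on the columns of X; returns fitted values."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return X @ beta

# Model 1: no intercept, as in sm.OLS(y, x) without add_constant
fitted_no_const = fit_predict(x.reshape(-1, 1), y)
r2_uncentered = 1 - np.sum((y - fitted_no_const) ** 2) / np.sum(y ** 2)

# Model 2: with an intercept column, scored with the usual centered R^2
X_const = np.column_stack([np.ones(n), x])
fitted_const = fit_predict(X_const, y)
r2_centered = 1 - np.sum((y - fitted_const) ** 2) / np.sum((y - y.mean()) ** 2)

print(f"uncentered R^2, no intercept: {r2_uncentered:.3f}")
print(f"centered R^2, with intercept: {r2_centered:.3f}")
```

Uncommenting `x = sm.add_constant(x)` in the cells above would give the centered R-squared, which is the one comparable across models.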

### 5. Models and Training
Options: Decision Tree Regressor / Random Forest
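
A minimal sketch of how these two options could be compared against a mean-prediction baseline, assuming scikit-learn is available. The data is synthetic (a made-up log-price relationship with size and bedrooms), and the hyperparameters are untuned placeholders rather than recommended values.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(42)
n = 1000

# Synthetic features and target; NOT the Daft.ie data
size = rng.uniform(40, 300, size=n)
beds = rng.integers(1, 6, size=n)
log_price = 8.0 + 0.8 * np.log(size) + 0.05 * beds + rng.normal(0, 0.1, size=n)

X = np.column_stack([np.log(size), beds])
X_train, X_test, y_train, y_test = train_test_split(
    X, log_price, test_size=0.2, random_state=42
)

# Baseline: always predict the training mean
baseline_pred = np.full_like(y_test, y_train.mean())
baseline_mae = mean_absolute_error(y_test, baseline_pred)

models = {
    "decision_tree": DecisionTreeRegressor(max_depth=5, random_state=42),
    "random_forest": RandomForestRegressor(n_estimators=100, random_state=42),
}

maes = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    maes[name] = mean_absolute_error(y_test, model.predict(X_test))

print(f"baseline MAE (log scale): {baseline_mae:.3f}")
for name, mae in maes.items():
    print(f"{name} MAE (log scale): {mae:.3f}")
```

Errors here are on the log scale; for reporting in euro, predictions would need to be transformed back (e.g. with `np.expm1` if the target was built with `np.log1p`).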