# House price model predictor
## Introduction

In this notebook we build a model that predicts house prices in Melbourne, Australia, from the variables listed below.
### Original Dataset Columns
- `Suburb` – Suburb name (categorical)
- `Address` – Street address (text)
- `Rooms` – Number of rooms (numeric)
- `Type` – Property type (categorical: h, u, t, etc.)
- `Price` – Target variable (numeric)
- `Method` – Sale method (categorical)
- `SellerG` – Real estate agent (categorical)
- `Date` – Sale date (datetime)
- `Distance` – Distance from CBD (numeric)
- `Postcode` – Postal code (numeric)
- `Bedroom2` – Scraped number of bedrooms (numeric)
- `Bathroom` – Number of bathrooms (numeric)
- `Car` – Car spots (numeric)
- `Landsize` – Land size (numeric)
- `BuildingArea` – Building size (numeric)
- `YearBuilt` – Year house was built (numeric, sometimes dropped)
- `CouncilArea` – Governing council (categorical)
- `Lattitude` – Latitude (numeric)
- `Longtitude` – Longitude (numeric)
- `Regionname` – General region (categorical)
- `Propertycount` – Number of properties in suburb (numeric)



## Import Libraries
Import all required packages for:
- Data manipulation (`pandas`, `numpy`)
- Visualization (`matplotlib`, `seaborn`)
- Modeling (`xgboost`, `sklearn`)

In [1]:
# We begin by importing the required packages

import numpy as np
import pandas as pd
import seaborn as sns
from sklearn import set_config
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import (
    FunctionTransformer,
    OneHotEncoder,
    StandardScaler,
    TargetEncoder,
)
from xgboost import XGBRegressor

# Make every transformer return pandas DataFrames instead of NumPy arrays
set_config(transform_output="pandas")

## Data Cleaning
We load the data and take a first look at its shape, dtypes, missing values, and summary statistics.

In [473]:
# Load dataset
df = pd.read_csv("melb_data.csv")

# Quick overview
print("Shape of dataset:", df.shape)
print(df.info())
print(df.head())

# Check missing values
print(df.isna().sum())

# Describe numeric features
print(df.describe())

# Describe categorical features
print(df.describe(include=['object']))


Shape of dataset: (13580, 21)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13580 entries, 0 to 13579
Data columns (total 21 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Suburb         13580 non-null  object 
 1   Address        13580 non-null  object 
 2   Rooms          13580 non-null  int64  
 3   Type           13580 non-null  object 
 4   Price          13580 non-null  float64
 5   Method         13580 non-null  object 
 6   SellerG        13580 non-null  object 
 7   Date           13580 non-null  object 
 8   Distance       13580 non-null  float64
 9   Postcode       13580 non-null  float64
 10  Bedroom2       13580 non-null  float64
 11  Bathroom       13580 non-null  float64
 12  Car            13518 non-null  float64
 13  Landsize       13580 non-null  float64
 14  BuildingArea   7130 non-null   float64
 15  YearBuilt      8205 non-null   float64
 16  CouncilArea    12211 non-null  object 
 17  Lattitude      13580

In this dataset, a value of 0 for `Bathroom` or `Landsize` indicates missing information. We replace these zeros with `NaN` and impute them later during preprocessing. We also drop `Address`, since it is essentially unique to each property and therefore carries little predictive signal. We then engineer some additional features.
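
To see why the zeros matter for the median imputation used later, here is a minimal sketch on made-up numbers (the `toy` frame is hypothetical and not taken from `melb_data.csv`). Leaving the zeros in place pulls the median toward zero, whereas converting them to `NaN` lets the median reflect only the observed land sizes.

```python
import numpy as np
import pandas as pd

# Hypothetical land sizes; three zeros stand in for "missing"
toy = pd.DataFrame({"Landsize": [0, 0, 0, 400, 600, 800]})

# Median when zeros are treated as real measurements
median_with_zeros = toy["Landsize"].median()

# Median after marking zeros as missing (NaN is skipped by default)
cleaned = toy["Landsize"].replace(0, np.nan)
median_without_zeros = cleaned.median()

print("median with zeros:   ", median_with_zeros)
print("median without zeros:", median_without_zeros)
print("values to impute:    ", cleaned.isna().sum())
```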

In [474]:

# Zeros in these columns stand for missing values
df["Landsize"] = df["Landsize"].replace(0, np.nan)
df["Bathroom"] = df["Bathroom"].replace(0, np.nan)

# Drop 'Address'
df = df.drop("Address", axis=1)

# Convert 'Date' column to datetime
df["Date"] = pd.to_datetime(df["Date"], dayfirst=True)

# --- Temporal features ---
df["YearSold"] = df["Date"].dt.year        # Year of sale
df["MonthSold"] = df["Date"].dt.month      # Month of sale
df["QuarterSold"] = df["Date"].dt.quarter  # Quarter of sale
df["SeasonSold"] = df["Date"].dt.month % 12 // 3 + 1   # Season code: 1=Dec-Feb (summer in Melbourne), 2=Mar-May (autumn), 3=Jun-Aug (winter), 4=Sep-Nov (spring)

# --- Property features ---
df["HouseAge"] = df["YearSold"] - df["YearBuilt"]  # Age of property at sale time
#df["Room_to_Bathroom_Ratio"] = df["Rooms"] / df["Bathroom"]
df["Building_to_Land_Ratio"] = df["BuildingArea"] / df["Landsize"]
#df = df.drop("Address", axis = 1)

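# --- Agent feature ---
# Keep the 20 most frequent agencies and group all others as "Other".
# Note: the counts use the full dataset, i.e. before the train/test split.
top_agents = df["SellerG"].value_counts().nlargest(20).index
df["TopAgent"] = df["SellerG"].where(df["SellerG"].isin(top_agents), "Other")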


# --- Define target and features ---
X = df.drop("Price", axis=1)
y = df["Price"]
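
The `SeasonSold` arithmetic groups months into four blocks starting with December. Because Melbourne is in the Southern Hemisphere, code 1 (December to February) falls in summer there. The sketch below spells out the mapping; the `season_code` helper and the `SOUTHERN_SEASONS` labels are illustrative assumptions and not part of the notebook.

```python
# Same arithmetic as the SeasonSold feature: month % 12 // 3 + 1
def season_code(month: int) -> int:
    return month % 12 // 3 + 1

# Hypothetical labels for the Southern Hemisphere
SOUTHERN_SEASONS = {1: "Summer", 2: "Autumn", 3: "Winter", 4: "Spring"}

for month in range(1, 13):
    code = season_code(month)
    print(f"month {month:2d} -> code {code} ({SOUTHERN_SEASONS[code]})")
```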



## Building our pipeline
We impute numerical features with the median and categorical features with the most frequent value. Among the categorical features, `Suburb` and `SellerG` have high cardinality, so we target-encode them; `Type`, `Method`, `Regionname`, and `TopAgent` have few levels, so we one-hot encode them.
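
As a rough illustration of the two encodings, the sketch below uses a made-up frame and plain pandas. It is a simplification: scikit-learn's `TargetEncoder` additionally shrinks each category mean toward the global mean and uses cross-fitting in `fit_transform`, so its output will generally differ from the raw per-category means shown here.

```python
import pandas as pd

# Hypothetical data, not taken from the Melbourne dataset
toy = pd.DataFrame({
    "Suburb": ["A", "A", "B", "B", "B", "C"],
    "Type": ["h", "u", "h", "h", "u", "h"],
    "Price": [100.0, 300.0, 400.0, 500.0, 600.0, 900.0],
})

# Simplified target encoding: replace each suburb by its mean price
suburb_means = toy.groupby("Suburb")["Price"].mean()
suburb_encoded = toy["Suburb"].map(suburb_means)

# One-hot encoding: one indicator column per property type
type_onehot = pd.get_dummies(toy["Type"], prefix="Type")

print(suburb_encoded.tolist())
print(type_onehot.columns.tolist())
```

Target encoding yields a single numeric column regardless of how many suburbs exist, while one-hot encoding adds one column per level, which is why it is reserved for the low-cardinality features.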

In [475]:
# We isolate categorical columns and numerical columns
#categorical_info = [col for col in df.columns
#    if df[col].dtype ==  "object"
#]
low_card_cols = ["Type", "Method", "Regionname", "TopAgent"]
high_card_cols = ["Suburb", "SellerG"]


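# Numeric predictors: every int64/float64 column except the target.
# This list still includes "Landsize" and "Rooms", which also get dedicated
# transformers in the preprocessor. Depending on the pandas version, the
# `.dt` accessors may return int32, in which case the temporal features
# (YearSold, MonthSold, QuarterSold, SeasonSold) are not matched by this filter.
numeric_info = [
    col for col in df.columns
    if df[col].dtype in ["int64", "float64"] and col != "Price"
]

# Landsize: median imputation only
lands_pipeline_raw = Pipeline([
    ("imputer", SimpleImputer(strategy="median"))
])

# Landsize alternative: median imputation followed by log(1 + x)
lands_pipeline_log = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("log", FunctionTransformer(np.log1p, validate=False))
])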

In [476]:
# We impute categorical and numeric information

lowcategorical_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("encoder", OneHotEncoder(handle_unknown="ignore", sparse_output=False)),
])

highcategorical_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("target", TargetEncoder()),
])

numeric_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
])

# We then put them both into a single preprocessing pipeline

preprocessor = ColumnTransformer([
    ("lands", lands_pipeline_raw, ["Landsize"]),
    ("rooms", Pipeline([("log", FunctionTransformer(np.log1p)), ("scaler", StandardScaler())]), ["Rooms"]),
    ("num", numeric_pipeline, numeric_info),
    ("lowcat", lowcategorical_pipeline, low_card_cols),
    ("highcat", highcategorical_pipeline, high_card_cols)
])

## Building our model
We model the data with `XGBRegressor` and later tune its hyperparameters with a randomized, cross-validated search. We first split the data with `train_test_split`.

In [477]:
# --- Train/test split (train_test_split was imported in the first cell) ---
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

In [478]:
# We put this into a larger pipeline and apply our XGBRegressor.

model_pipeline = Pipeline([
    ("preprocessor", preprocessor),
    ("trainer", XGBRegressor(n_jobs=-1, random_state=42)),
])

model_pipeline.fit(X_train, y_train)





[Output: diagram of the fitted `Pipeline` with steps `[('preprocessor', ...), ('trainer', ...)]`]

## Testing our model
We use the fitted pipeline to predict prices for the test set and then compute the root mean squared error (RMSE).

In [480]:
from sklearn.metrics import root_mean_squared_error
y_pred = model_pipeline.predict(X_test)
root_mean_squared_error(y_test, y_pred)  # signature is (y_true, y_pred)

263549.63480477815
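
For reference, RMSE is the square root of the mean squared difference between actual and predicted prices, so it is expressed in the same unit as `Price`. A minimal NumPy sketch on made-up values (the arrays and the `rmse` helper are hypothetical):

```python
import numpy as np

# Hypothetical actual and predicted prices
y_true = np.array([500_000.0, 750_000.0, 1_200_000.0])
y_hat = np.array([450_000.0, 800_000.0, 1_000_000.0])

def rmse(actual: np.ndarray, predicted: np.ndarray) -> float:
    # Square the errors, average them, then take the square root
    return float(np.sqrt(np.mean((actual - predicted) ** 2)))

print(rmse(y_true, y_hat))
# Swapping the arguments gives the same value, since errors are squared
print(rmse(y_hat, y_true))
```

Because errors are squared before averaging, the single 200,000 miss dominates the result, which is worth keeping in mind when a few very expensive properties are in the test set.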

## Optimizing our hyperparameters
We use `RandomizedSearchCV` to optimize our hyperparameters.

In [481]:
from sklearn.model_selection import RandomizedSearchCV

In [482]:
xgb_param_distributions = {
    "trainer__n_estimators": [100, 200],
    "trainer__max_depth": [5, 10, 20, None],
    "trainer__learning_rate": [0.01, 0.05, 0.1, 0.2],
    "trainer__subsample": [0.8, 1.0],
    "trainer__colsample_bytree": [0.8, 1.0]
}
random_search = RandomizedSearchCV(
    estimator=model_pipeline,
    param_distributions=xgb_param_distributions,
    n_iter=20,                     # number of random combinations to try
    scoring="neg_mean_absolute_error",
    cv=3,                          # 3-fold cross-validation
    random_state=42,
    n_jobs=-1,
)
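
To get a feel for the cost of this search, the sketch below counts the candidate combinations in the grid above and the number of model fits implied by `n_iter=20` and `cv=3`. The variable names are illustrative; the final refit mentioned in the comment assumes the default `refit=True`.

```python
from math import prod

# Same search space as xgb_param_distributions above
param_space = {
    "trainer__n_estimators": [100, 200],
    "trainer__max_depth": [5, 10, 20, None],
    "trainer__learning_rate": [0.01, 0.05, 0.1, 0.2],
    "trainer__subsample": [0.8, 1.0],
    "trainer__colsample_bytree": [0.8, 1.0],
}

# Total number of distinct combinations in the grid
n_candidates = prod(len(values) for values in param_space.values())

n_iter, cv = 20, 3
# Each sampled candidate is fitted once per fold; with the default
# refit=True, the best candidate is fitted once more on all training data
cv_fits = n_iter * cv

print(f"{n_iter} of {n_candidates} combinations sampled, {cv_fits} cross-validation fits")
```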


In [None]:
random_search.fit(X_train,y_train)



In [None]:
best_pipeline = random_search.best_estimator_
#print("Best hyperparameters:", random_search.best_params_)
#print("Best CV MAE:", -random_search.best_score_)

# Predict on test set
y_pred = best_pipeline.predict(X_test)

In [None]:
root_mean_squared_error(y_test, y_pred)  # signature is (y_true, y_pred)