# Domain Data Preprocessing

## Purpose
Clean and preprocess the Domain property dataset for Victoria. This includes mapping properties to SA2s, extracting and normalizing features, handling missing values, filtering by property type and rent thresholds, and preparing the dataset for downstream modeling.

## Inputs
- Raw Domain property listings: `vic_rentals_all.csv`
- SA2 shapefile: `SA2_2021_AUST_GDA2020.shp`

## Outputs
- Cleaned and processed Domain dataset: `domain_cleaned.csv`

## Key Steps
1. Load property listings and SA2 boundaries.
2. Convert property lat/lon to point geometries and map them to SA2s using a spatial join.
3. Normalize and extract structured features (e.g., balcony, heating, laundry).
4. Merge similar features and create binary indicator columns.
5. Handle missing and zero values for weekly rent and bond.
6. Detect and correct outliers in rent values.
7. Filter properties by residential types and minimum rent threshold.
8. Group property types to reduce imbalance.
9. Set missing `carspaces` to 0 where no parking feature is listed, then impute the remaining missing numeric values (`bedrooms`, `bathrooms`, `carspaces`) with the median per property type.
10. Convert data types for consistency and export cleaned dataset.


In [1]:
# Libraries
import os
import geopandas as gpd
import pandas as pd
from shapely.geometry import Point
import matplotlib.pyplot as plt
import folium
import numpy as np
from pathlib import Path
import statsmodels.api as sm
import seaborn as sns
from sklearn.metrics import mean_absolute_error, mean_absolute_percentage_error
from sklearn.linear_model import LinearRegression
from fuzzywuzzy import process
from collections import Counter

In [2]:
# Load data
domain_df = pd.read_csv("../../datasets/property/domain/vic_rentals_all.csv")
print(domain_df.shape)
domain_df.head()


(12717, 30)


Unnamed: 0,listing_id,suburb,postcode,weekly_rent,bond,available_date,date_listed,days_listed,bedrooms,bathrooms,...,floorplans_count,virtual_tour,primary_type,secondary_type,agency,agency_id,agent_names,structured_features,url,land_area
0,16782629,SOUTH KINGSVILLE,3015,460.0,1994.0,"Tuesday, 02 September 2025",2025-08-13,27.0,2.0,1.0,...,0.0,False,Apartment,Apartment / Unit / Flat,Lease A Property,21230.0,Marc Angelone,"Built in wardrobes, Secure Parking, Bath, Heat...",https://www.domain.com.au/3-53-greene-street-s...,
1,17471867,SOUTH KINGSVILLE,3015,400.0,1738.0,"Thursday, 27 March 2025",2025-03-06,187.0,2.0,1.0,...,0.0,False,Apartment,Apartment / Unit / Flat,Village Real Estate,20880.0,Trudie Thobe,"Internal Laundry, Pets Allowed, Heating",https://www.domain.com.au/1-3-new-street-south...,
2,17721851,SOUTH KINGSVILLE,3015,795.0,3454.0,"Monday, 15 September 2025",2025-08-19,21.0,3.0,2.0,...,1.0,False,Townhouse/Villa,Townhouse,Jas Stephens Real Estate,22.0,Jesseigh Stella,"Internal Laundry, Balcony / Deck, Floorboards,...",https://www.domain.com.au/19-92-new-street-sou...,
3,17725855,SOUTH KINGSVILLE,3015,675.0,2933.0,"Wednesday, 20 August 2025",2025-08-21,19.0,3.0,1.0,...,0.0,False,Townhouse/Villa,Townhouse,Jellis Craig Inner West,19574.0,Zac Keltie,"Air conditioning, Built in wardrobes, Balcony ...",https://www.domain.com.au/3-14-saltley-street-...,
4,17745057,SOUTH KINGSVILLE,3015,450.0,1955.0,"Tuesday, 02 September 2025",2025-09-03,6.0,2.0,1.0,...,0.0,False,Apartment,Apartment / Unit / Flat,Belle Property Albert Park,3702.0,"William Brydon Waldren, Shar Claridge","Built in wardrobes, Heating",https://www.domain.com.au/4-2b-saltley-street-...,


### Match Domain Data to SA2

In [3]:
# Load SA2 boundaries
sa2_df = gpd.read_file("../../datasets/district_shape/SA2_GDA2020_SHAPEFILE/SA2_2021_AUST_GDA2020.shp")
sa2_df.head()

Unnamed: 0,SA2_CODE21,SA2_NAME21,CHG_FLAG21,CHG_LBL21,SA3_CODE21,SA3_NAME21,SA4_CODE21,SA4_NAME21,GCC_CODE21,GCC_NAME21,STE_CODE21,STE_NAME21,AUS_CODE21,AUS_NAME21,AREASQKM21,LOCI_URI21,geometry
0,101021007,Braidwood,0,No change,10102,Queanbeyan,101,Capital Region,1RNSW,Rest of NSW,1,New South Wales,AUS,Australia,3418.3525,http://linked.data.gov.au/dataset/asgsed3/SA2/...,"POLYGON ((149.58424 -35.44426, 149.58444 -35.4..."
1,101021008,Karabar,0,No change,10102,Queanbeyan,101,Capital Region,1RNSW,Rest of NSW,1,New South Wales,AUS,Australia,6.9825,http://linked.data.gov.au/dataset/asgsed3/SA2/...,"POLYGON ((149.21899 -35.36738, 149.218 -35.366..."
2,101021009,Queanbeyan,0,No change,10102,Queanbeyan,101,Capital Region,1RNSW,Rest of NSW,1,New South Wales,AUS,Australia,4.762,http://linked.data.gov.au/dataset/asgsed3/SA2/...,"POLYGON ((149.21326 -35.34325, 149.21619 -35.3..."
3,101021010,Queanbeyan - East,0,No change,10102,Queanbeyan,101,Capital Region,1RNSW,Rest of NSW,1,New South Wales,AUS,Australia,13.0032,http://linked.data.gov.au/dataset/asgsed3/SA2/...,"POLYGON ((149.24034 -35.34781, 149.24024 -35.3..."
4,101021012,Queanbeyan West - Jerrabomberra,0,No change,10102,Queanbeyan,101,Capital Region,1RNSW,Rest of NSW,1,New South Wales,AUS,Australia,13.6748,http://linked.data.gov.au/dataset/asgsed3/SA2/...,"POLYGON ((149.19572 -35.36126, 149.1997 -35.35..."


In [4]:
# Convert property lat/lon to geometry points
gdf_points = gpd.GeoDataFrame(domain_df, geometry=gpd.points_from_xy(domain_df['lon'], domain_df['lat']), crs="EPSG:4326" )
sa2_df = sa2_df.to_crs(gdf_points.crs)

# Spatial join: attach the SA2 polygon that contains each property point
domain_sa2 = gpd.sjoin(gdf_points, sa2_df, how="left", predicate="within")

In [5]:
# Check matching quality
total_props = len(domain_sa2)
matched_props = domain_sa2['SA2_CODE21'].notna().sum()
matching_rate = matched_props / total_props * 100

print(f"Matched {matched_props} out of {total_props} properties ({matching_rate:.2f}%)")

# suburbs that failed to match
unmatched_suburbs = domain_sa2.loc[domain_sa2['SA2_CODE21'].isna(), 'suburb'].unique()
print("unmatched suburbs:", unmatched_suburbs[:20]) 

Matched 12713 out of 12717 properties (99.97%)
unmatched suburbs: ['Rowville' 'Irymple' 'Coimadai']
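
A left spatial join returns one row per (point, matching polygon) pair, so a point that matches more than one polygon would appear more than once in the result. The sketch below is one way to check for that alongside the match rate. `join_diagnostics` and the toy frame are illustrative names and data, not part of the notebook.

```python
import pandas as pd

def join_diagnostics(joined: pd.DataFrame, key_col: str = "SA2_CODE21") -> dict:
    """Summarise a left-join result: total rows, repeated input rows, unmatched rows."""
    n_rows = len(joined)
    n_unique = joined.index.nunique()          # input rows keep their original index
    matched = int(joined[key_col].notna().sum())
    return {
        "rows": n_rows,
        "duplicated_rows": n_rows - n_unique,  # > 0 means some point matched several polygons
        "unmatched_rows": n_rows - matched,
    }

# Toy join result (hypothetical values): index 1 appears twice, index 2 has no match.
toy = pd.DataFrame(
    {"suburb": ["A", "B", "B", "C"], "SA2_CODE21": ["101", "102", "103", None]},
    index=[0, 1, 1, 2],
)
print(join_diagnostics(toy))
```

On the real data this would be called as `join_diagnostics(domain_sa2)` before the columns are renamed.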


In [6]:
# rename for consistency
domain_sa2.rename(columns={'SA2_CODE21': 'sa2_code','SA2_NAME21': 'sa2_name'},inplace=True)
domain_sa2.columns

Index(['listing_id', 'suburb', 'postcode', 'weekly_rent', 'bond',
       'available_date', 'date_listed', 'days_listed', 'bedrooms', 'bathrooms',
       'carspaces', 'property_type', 'address', 'lat', 'lon', 'scraped_date',
       'domain_page_id', 'property_id', 'photo_count', 'video_count',
       'floorplans_count', 'virtual_tour', 'primary_type', 'secondary_type',
       'agency', 'agency_id', 'agent_names', 'structured_features', 'url',
       'land_area', 'geometry', 'index_right', 'sa2_code', 'sa2_name',
       'CHG_FLAG21', 'CHG_LBL21', 'SA3_CODE21', 'SA3_NAME21', 'SA4_CODE21',
       'SA4_NAME21', 'GCC_CODE21', 'GCC_NAME21', 'STE_CODE21', 'STE_NAME21',
       'AUS_CODE21', 'AUS_NAME21', 'AREASQKM21', 'LOCI_URI21'],
      dtype='object')

In [7]:
keep_cols = ['sa2_code', 'sa2_name','suburb','postcode', 'weekly_rent', 'bond', 'address', 'lat', 'lon', 'bedrooms', 'bathrooms', 'carspaces', 'property_type', 'land_area', 'structured_features', 'date_listed']
domain = domain_sa2[keep_cols].copy()
domain.head()

Unnamed: 0,sa2_code,sa2_name,suburb,postcode,weekly_rent,bond,address,lat,lon,bedrooms,bathrooms,carspaces,property_type,land_area,structured_features,date_listed
0,213021344,Newport,SOUTH KINGSVILLE,3015,460.0,1994.0,3/53 Greene Street,-37.830982,144.87091,2.0,1.0,2.0,Apartment / Unit / Flat,,"Built in wardrobes, Secure Parking, Bath, Heat...",2025-08-13
1,213021344,Newport,SOUTH KINGSVILLE,3015,400.0,1738.0,1/3 New Street,-37.826218,144.86755,2.0,1.0,1.0,Apartment / Unit / Flat,,"Internal Laundry, Pets Allowed, Heating",2025-03-06
2,213021343,Altona North,SOUTH KINGSVILLE,3015,795.0,3454.0,19/92 New Street,-37.831226,144.86632,3.0,2.0,3.0,Townhouse,,"Internal Laundry, Balcony / Deck, Floorboards,...",2025-08-19
3,213021344,Newport,SOUTH KINGSVILLE,3015,675.0,2933.0,3/14 Saltley Street,-37.827423,144.86768,3.0,1.0,2.0,Townhouse,,"Air conditioning, Built in wardrobes, Balcony ...",2025-08-21
4,213021344,Newport,SOUTH KINGSVILLE,3015,450.0,1955.0,4/2B Saltley Street,-37.82627,144.8679,2.0,1.0,,Apartment / Unit / Flat,,"Built in wardrobes, Heating",2025-09-03


### Turn the 'structured_features' Column into Independent Columns

In [8]:
# Split into lists
feature_lists = domain['structured_features'].dropna().str.split(', ')

# Flatten
all_features = [feature for sublist in feature_lists for feature in sublist]

# Count frequencies
feature_counts = Counter(all_features)

# Convert to DataFrame
feature_df = pd.DataFrame(feature_counts.items(), columns=['Feature', 'Count'])
feature_df = feature_df.sort_values(by='Count', ascending=False).reset_index(drop=True)
feature_df.head(30)


Unnamed: 0,Feature,Count
0,Built in wardrobes,6690
1,Heating,5710
2,Dishwasher,5456
3,Secure Parking,4966
4,Internal Laundry,3993
5,Air conditioning,3796
6,Balcony / Deck,2509
7,Gas,2421
8,Floorboards,1904
9,Bath,1881


### Fuzzy Match Features to Group Similar Features Together

In [9]:
canonical_features = ['balcony', 'car parking', 'heating', 'air conditioning', 'builtin wardrobes', 'laundry', 'swimming pool', 'ensuite',
    'bathroom features', 'dishwasher', 'garden', 'gym', 'security', 'pets allowed', 'gas', 'washing machine', 'alarm', 'intercom', 'internet']

# Map a raw feature to its best canonical match; keep the raw text if the match is weak (score < 70)
def map_to_canonical(feature_string):
    feature = feature_string.lower().strip()
    best_match, score = process.extractOne(feature, canonical_features)
    if score >= 70:
        return best_match
    return feature

# Normalize every feature in a list, skipping empty tokens and dropping duplicates
def normalize_list(feature_list):
    if not isinstance(feature_list, list):
        return []
    return list({map_to_canonical(f) for f in feature_list if f.strip()})

# Apply function (rows with no features become an empty list)
domain['normalized_features'] = (domain['structured_features'].fillna('').str.split(',').apply(normalize_list))




In [10]:
# Check results
feature_lists = domain['normalized_features'].apply(lambda x: x if isinstance(x, list) else [])
all_normalized_features = [feature for sublist in feature_lists for feature in sublist]
normalized_feature_counts = Counter(all_normalized_features)
normalized_feature_df = pd.DataFrame(normalized_feature_counts.items(), columns=['Feature', 'Count'])
normalized_feature_df = normalized_feature_df.sort_values(by='Count', ascending=False).reset_index(drop=True)
normalized_feature_df.head(15)


Unnamed: 0,Feature,Count
0,heating,7693
1,builtin wardrobes,6691
2,dishwasher,5458
3,car parking,5269
4,air conditioning,4562
5,laundry,4158
6,balcony,3093
7,gas,2732
8,bathroom features,1918
9,floorboards,1904
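
To illustrate how the score threshold decides between merging a raw feature into a canonical one and keeping it as-is, here is a minimal sketch that uses the standard library's `difflib` as a stand-in for fuzzywuzzy. The scorer is different, so actual scores will not match the notebook's; the shortened `CANONICAL` list and the function names are assumptions made for the example.

```python
from difflib import SequenceMatcher

# Shortened canonical list, for illustration only
CANONICAL = ["balcony", "heating", "builtin wardrobes"]

def similarity(a: str, b: str) -> float:
    """Similarity on a 0-100 scale (difflib ratio, not fuzzywuzzy's scorer)."""
    return 100 * SequenceMatcher(None, a, b).ratio()

def map_to_canonical_sketch(raw: str, threshold: float = 70) -> str:
    """Return the closest canonical feature, or the cleaned raw text if nothing is close enough."""
    feature = raw.lower().strip()
    best = max(CANONICAL, key=lambda c: similarity(feature, c))
    return best if similarity(feature, best) >= threshold else feature

print(map_to_canonical_sketch("Built in wardrobes"))  # merged into a canonical name
print(map_to_canonical_sketch(" Heating"))            # exact match after cleaning
print(map_to_canonical_sketch("Study"))               # no close match, kept as-is
```

A lower threshold merges more aggressively and risks grouping unrelated features, so it is worth spot-checking the merged names after any change.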


In [11]:
# Extract only some features and turn into independent columns
chosen = ['balcony', 'car parking', 'heating', 'air conditioning', 'builtin wardrobes', 'laundry', 'swimming pool', 'ensuite', 'dishwasher',
    'garden', 'gym', 'security', 'pets allowed', 'gas', 'washing machine', 'alarm', 'intercom']

domain_expanded = domain.copy()

# Fill NaN with empty string
domain_expanded['normalized_features'] = domain_expanded['normalized_features'].fillna('')

# create a column for each chosen feature
for feature in chosen:
    domain_expanded[feature] = domain_expanded['normalized_features'].apply(lambda x: 1 if feature in x else 0)

# Drop original column 
domain_expanded = domain_expanded.drop(columns=['structured_features'])
domain_expanded[chosen].head()

Unnamed: 0,balcony,car parking,heating,air conditioning,builtin wardrobes,laundry,swimming pool,ensuite,dishwasher,garden,gym,security,pets allowed,gas,washing machine,alarm,intercom
0,0,1,1,0,1,0,0,0,0,0,0,0,1,0,0,0,0
1,0,0,1,0,0,1,0,0,0,0,0,0,1,0,0,0,0
2,1,0,1,1,1,1,0,1,1,0,0,0,1,1,0,0,0
3,1,1,1,1,1,0,0,0,1,0,0,0,0,0,0,0,0
4,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0
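
The indicator step can be sketched on toy data as follows. `add_indicators` is a hypothetical helper; it tests exact membership in each row's list and returns 0 for anything that is not a list.

```python
import pandas as pd

def add_indicators(df: pd.DataFrame, list_col: str, features: list) -> pd.DataFrame:
    """Add one 0/1 column per feature, based on exact membership in the row's list."""
    out = df.copy()
    for feature in features:
        # Bind `feature` as a default argument so each lambda keeps its own value
        out[feature] = out[list_col].apply(
            lambda items, f=feature: int(isinstance(items, list) and f in items)
        )
    return out

# Toy data (hypothetical): three listings with normalized feature lists
toy = pd.DataFrame({"normalized_features": [["heating", "gas"], [], ["gas"]]})
result = add_indicators(toy, "normalized_features", ["heating", "gas"])
print(result[["heating", "gas"]])
```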


In [12]:
# merge similar features
merge_dict = {'security_system': ['alarm', 'security'],'washing_machine': ['laundry', 'washing machine']}

for new_feat, old_feats in merge_dict.items():
    domain_expanded[new_feat] = domain_expanded['normalized_features'].apply(
        lambda x: any(feat in x for feat in old_feats) if isinstance(x, list) else False)

In [13]:
# remove unneeded columns and rename for consistency
for old_feats in merge_dict.values():
    for feat in old_feats:
        if feat in domain_expanded.columns:
            domain_expanded.drop(columns=feat, inplace=True)
            
domain_expanded = domain_expanded.drop(columns=['normalized_features'])
domain_expanded['security_system'] = domain_expanded['security_system'].astype(int)
domain_expanded['washing_machine'] = domain_expanded['washing_machine'].astype(int)
domain_expanded.columns = domain_expanded.columns.str.lower().str.replace(' ', '_')
domain_expanded.columns

Index(['sa2_code', 'sa2_name', 'suburb', 'postcode', 'weekly_rent', 'bond',
       'address', 'lat', 'lon', 'bedrooms', 'bathrooms', 'carspaces',
       'property_type', 'land_area', 'date_listed', 'balcony', 'car_parking',
       'heating', 'air_conditioning', 'builtin_wardrobes', 'swimming_pool',
       'ensuite', 'dishwasher', 'garden', 'gym', 'pets_allowed', 'gas',
       'intercom', 'security_system', 'washing_machine'],
      dtype='object')

## Data Cleaning

In [14]:
print(domain_expanded.isnull().sum())
print(domain_expanded.shape)

sa2_code                 4
sa2_name                 4
suburb                   0
postcode                 0
weekly_rent            279
bond                   782
address                104
lat                      4
lon                      4
bedrooms               125
bathrooms               51
carspaces             1803
property_type            0
land_area            12715
date_listed              4
balcony                  0
car_parking              0
heating                  0
air_conditioning         0
builtin_wardrobes        0
swimming_pool            0
ensuite                  0
dishwasher               0
garden                   0
gym                      0
pets_allowed             0
gas                      0
intercom                 0
security_system          0
washing_machine          0
dtype: int64
(12717, 30)


In [15]:
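# Drop rows without an SA2 match, plus land_area (almost entirely missing) and date_listed (not used in the rest of this notebook)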
domain_expanded = domain_expanded.dropna(subset=['sa2_code', 'sa2_name'])
domain_expanded = domain_expanded.drop(columns=['land_area', 'date_listed'])

### Weekly Rent and Bond

In [16]:
domain_expanded[['weekly_rent', 'bond']].describe()

Unnamed: 0,weekly_rent,bond
count,12434.0,11932.0
mean,764.777626,2787.329702
std,9430.235319,2684.197655
min,0.0,50.0
25%,485.0,2086.0
50%,560.0,2433.0
75%,690.0,2955.0
max,808500.0,212917.0


In [17]:
# Check rent to bond ratio to see relationship
valid = domain_expanded.dropna(subset=['weekly_rent', 'bond'])
valid = valid[valid['weekly_rent'] > 0]
valid['bond_ratio'] = valid['bond'] / valid['weekly_rent']
print(valid['bond_ratio'].describe())
print(valid['bond_ratio'].round().value_counts())

count    11671.000000
mean         4.407425
std          4.006144
min          0.003224
25%          4.344444
50%          4.345122
75%          4.345652
max        434.524490
Name: bond_ratio, dtype: float64
bond_ratio
4.0      11050
6.0        269
5.0        190
7.0         48
2.0         38
1.0         37
3.0         21
9.0          5
0.0          5
8.0          3
11.0         1
14.0         1
435.0        1
12.0         1
13.0         1
Name: count, dtype: int64


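Note: the bond is roughly 4 weeks' rent for most listings (median ratio of about 4.35), so a missing bond is imputed as `weekly_rent * 4` and a missing rent as `bond / 4`. The maximum bond ratio (about 435) looks like a data-entry error and is checked below.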
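
The median ratio of about 4.345 reported above is close to the number of weeks in an average calendar month (365 / 7 / 12). One interpretation, which the data itself does not state, is that most bonds are set at one month's rent. The short calculation below compares that factor with the simpler factor of 4; the constant and function names are illustrative.

```python
DAYS_PER_YEAR = 365
WEEKS_PER_MONTH = DAYS_PER_YEAR / 7 / 12  # about 4.345 weeks in an average month

def monthly_bond(weekly_rent: float) -> float:
    """Bond under the assumption that it equals one calendar month of rent."""
    return weekly_rent * WEEKS_PER_MONTH

rent = 400.0
print(round(WEEKS_PER_MONTH, 3))   # 4.345
print(round(monthly_bond(rent)))   # 1738, the bond listed for the $400 property in the sample rows
print(rent * 4)                    # 1600.0, the four-week approximation
```

Under this assumption, multiplying by 4 understates an imputed bond by roughly 8%, which may or may not matter depending on how the bond is used downstream.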

In [18]:
# Replace 0 with NaN 
domain_expanded['weekly_rent'] = domain_expanded['weekly_rent'].replace(0, np.nan)
domain_expanded['bond'] = domain_expanded['bond'].replace(0, np.nan)

# Fill a missing rent from the bond (bond / 4) and a missing bond from the rent (rent * 4)
domain_expanded.loc[domain_expanded['weekly_rent'].isna() & domain_expanded['bond'].notna(),'weekly_rent'] = domain_expanded['bond'] / 4
domain_expanded.loc[domain_expanded['bond'].isna() & domain_expanded['weekly_rent'].notna(),'bond'] = domain_expanded['weekly_rent'] * 4
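
The same fill logic can be written with `fillna`, shown here on toy data. This is a sketch of an equivalent formulation, not the notebook's code; `fill_rent_and_bond` and the toy values are hypothetical. Rows where both values are missing stay missing.

```python
import numpy as np
import pandas as pd

def fill_rent_and_bond(df: pd.DataFrame, weeks: float = 4) -> pd.DataFrame:
    """Treat zeros as missing, then fill rent from bond and bond from rent."""
    out = df.copy()
    out[["weekly_rent", "bond"]] = out[["weekly_rent", "bond"]].replace(0, np.nan)
    out["weekly_rent"] = out["weekly_rent"].fillna(out["bond"] / weeks)
    out["bond"] = out["bond"].fillna(out["weekly_rent"] * weeks)
    return out

# Toy rows: missing bond, zero rent, missing rent, and both missing
toy = pd.DataFrame({
    "weekly_rent": [500.0, 0.0, np.nan, np.nan],
    "bond": [np.nan, 2000.0, 1600.0, np.nan],
})
filled = fill_rent_and_bond(toy)
print(filled)
```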

In [19]:
# Outlier check: flag weekly rents below $50 or above $10,000
low_threshold = 50  
high_threshold = 10000
outliers = domain_expanded[(domain_expanded['weekly_rent'] < low_threshold) |
                           (domain_expanded['weekly_rent'] > high_threshold)]
outliers[['weekly_rent', 'bond', 'address', 'suburb', 'property_type']]

Unnamed: 0,weekly_rent,bond,address,suburb,property_type
282,33.0,132.0,14-16 Bubb Street,MOE,House
394,30.0,200.0,58 Saleyards Road,BENALLA,House
395,30.0,200.0,40 Gay Street,BENALLA,House
414,43.25,173.0,21 Hannah Street,BENALLA,House
1929,315000.0,1365.0,1/28 Elgin Street,MORWELL,Apartment / Unit / Flat
2586,808500.0,2607.0,83 Golf Links Road,BERWICK,House
3395,27.5,110.0,47 Chambers Street,MYRTLEFORD,House
3396,23.75,95.0,14 Jubilee Street,MYRTLEFORD,House
3397,33.0,132.0,8 Mcgeehan Cres,MYRTLEFORD,Apartment / Unit / Flat
3641,595000.0,2578.0,4 Wilfred Street,ROSEBUD,House


In [20]:
# Select rows with an implausibly high weekly rent (likely data-entry errors)
mask = domain_expanded['weekly_rent'] > high_threshold

# Fix weekly_rent by recomputing from bond
domain_expanded.loc[mask, 'weekly_rent'] = domain_expanded.loc[mask, 'bond'] / 4
domain_expanded.loc[mask, ['weekly_rent', 'bond', 'address', 'suburb']]

Unnamed: 0,weekly_rent,bond,address,suburb
1929,341.25,1365.0,1/28 Elgin Street,MORWELL
2586,651.75,2607.0,83 Golf Links Road,BERWICK
3641,644.5,2578.0,4 Wilfred Street,ROSEBUD
11193,27157.5,108630.0,HUG11-12/847 Whitehorse Road,BOX HILL


In [21]:
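# Drop rows whose rent is still implausible after the fix (this also drops rows where rent is still missing)
domain_expanded = domain_expanded[domain_expanded['weekly_rent'] < high_threshold]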
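
The repair-then-drop sequence can be sketched as a single function. `repair_rent_outliers` and the toy rows are illustrative only. A row is kept when recomputing the rent from the bond brings it under the threshold, and dropped when the bond itself is implausible.

```python
import pandas as pd

def repair_rent_outliers(df: pd.DataFrame, high: float = 10_000, weeks: float = 4) -> pd.DataFrame:
    """Recompute implausibly high rents from the bond, then drop rows that are still too high."""
    out = df.copy()
    too_high = out["weekly_rent"] > high
    out.loc[too_high, "weekly_rent"] = out.loc[too_high, "bond"] / weeks
    # The comparison is False for NaN, so rows with a missing rent are dropped as well
    return out[out["weekly_rent"] < high]

# Toy rows: a fixable rent, a normal rent, and a row whose bond is also implausible
toy = pd.DataFrame({
    "weekly_rent": [315000.0, 560.0, 50000.0],
    "bond": [1365.0, 2433.0, 108630.0],
})
fixed = repair_rent_outliers(toy)
print(fixed)
```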

In [22]:
domain_expanded.sort_values(by='weekly_rent', ascending=False).head(20)

Unnamed: 0,sa2_code,sa2_name,suburb,postcode,weekly_rent,bond,address,lat,lon,bedrooms,...,swimming_pool,ensuite,dishwasher,garden,gym,pets_allowed,gas,intercom,security_system,washing_machine
12294,207011149,Camberwell,CAMBERWELL,3124,5866.0,5866.0,54 Fairmont Avenue,-37.842377,145.07043,5.0,...,0,0,0,0,0,0,0,0,0,0
10358,206061138,Toorak,TOORAK,3142,5750.0,34500.0,2 Lisbuoy Court,-37.84854,145.0175,4.0,...,0,0,0,0,0,0,0,0,0,0
12705,208011169,Brighton (Vic.),BRIGHTON,3186,5000.0,20000.0,,-37.9044,144.99974,5.0,...,0,0,1,0,0,1,0,1,0,1
8776,207011520,Hawthorn - South,HAWTHORN,3122,4500.0,27000.0,94 Illawarra Road,-37.83688,145.0389,5.0,...,0,0,0,0,0,0,0,0,0,0
10768,206041125,South Yarra - West,SOUTH YARRA,3141,4500.0,27000.0,1 Fairlie Court,-37.83275,144.9842,3.0,...,0,0,0,0,0,0,0,0,0,0
12677,208011169,Brighton (Vic.),BRIGHTON,3186,4000.0,24000.0,74 Champion Street,-37.921494,145.00458,4.0,...,0,1,1,0,1,0,0,1,0,0
1236,206051512,South Melbourne,SOUTH MELBOURNE,3205,4000.0,24000.0,901/161 Eastern Road,-37.836548,144.96724,4.0,...,0,1,1,0,0,0,0,1,0,1
1625,206051511,Port Melbourne Industrial,PORT MELBOURNE,3207,3900.0,23400.0,1 Tarver St,-37.834713,144.91972,4.0,...,1,0,1,0,1,1,0,0,1,0
3704,206071517,Richmond (South) - Cremorne,CREMORNE,3121,3850.0,15400.0,13 Balmain St,-37.8297,144.99123,4.0,...,0,1,1,0,0,0,0,0,0,0
10707,206041125,South Yarra - West,SOUTH YARRA,3141,3700.0,22200.0,16 St Martins Lane,-37.835285,144.98175,3.0,...,0,0,0,0,0,0,0,0,0,0


In [23]:
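# Listing 12294 has a weekly rent equal to its bond (5866), so treat the rent as mis-entered and recompute it from the bond
domain_expanded.loc[12294, 'weekly_rent'] = domain_expanded.loc[12294, 'bond'] / 4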

In [24]:
print(domain_expanded.loc[12294, ['weekly_rent', 'bond', 'address', 'suburb']])

weekly_rent                1466.5
bond                       5866.0
address        54 Fairmont Avenue
suburb                 CAMBERWELL
Name: 12294, dtype: object


In [25]:
domain_expanded[['weekly_rent', 'bond']].describe()

Unnamed: 0,weekly_rent,bond
count,12676.0,12676.0
mean,626.427895,2768.190281
std,286.249267,2400.549808
min,12.5,50.0
25%,485.0,2086.0
50%,560.0,2433.0
75%,690.0,2955.0
max,5750.0,212917.0


In [26]:
print(domain_expanded['property_type'].value_counts())

property_type
House                            6566
Apartment / Unit / Flat          4502
Townhouse                        1276
Studio                            208
New Apartments / Off the Plan      44
Villa                              28
Acreage / Semi-Rural               13
Semi-Detached                      10
New House & Land                    8
Car Space                           7
Duplex                              3
Block of Units                      3
Terrace                             3
Farm                                2
Unknown                             2
Vacant land                         1
Name: count, dtype: int64


In [27]:
# Filter to only property types we want
residential_types = [
    "House", "Apartment / Unit / Flat", "Townhouse", "Villa",
    "Semi-Detached", "Terrace", "Duplex", "Acreage / Semi-Rural",
    "New Apartments / Off the Plan", "New House & Land", "Studio"
]

domain_expanded = domain_expanded[domain_expanded['property_type'].isin(residential_types)]

# Apply a minimum weekly rent of $100
domain_expanded = domain_expanded[domain_expanded['weekly_rent'] >= 100]

print("After filtering:", domain_expanded.shape)
print(domain_expanded['weekly_rent'].describe())

After filtering: (12616, 28)
count    12616.000000
mean       628.894658
std        284.408111
min        100.000000
25%        490.000000
50%        565.000000
75%        690.000000
max       5750.000000
Name: weekly_rent, dtype: float64


In [28]:
domain_expanded.groupby("property_type")["weekly_rent"].describe().sort_values("mean", ascending=False)


property_type,count,mean,std,min,25%,50%,75%,max
New House & Land,8.0,941.03125,498.488724,530.0,577.5,782.5,1112.4375,2000.0
Acreage / Semi-Rural,13.0,939.384615,454.673562,217.5,680.0,850.0,1392.0,1550.0
Semi-Detached,10.0,752.0,310.413702,385.0,500.0,687.5,977.5,1200.0
Townhouse,1276.0,728.671434,297.981589,165.0,550.0,660.0,830.0,3900.0
Terrace,3.0,700.0,43.30127,650.0,687.5,725.0,725.0,725.0
Villa,28.0,634.339286,298.175938,375.0,517.5,583.25,635.0,2000.0
House,6534.0,624.106175,281.095413,100.0,495.0,560.0,650.0,5750.0
Apartment / Unit / Flat,4489.0,618.996213,278.346549,125.0,475.0,565.0,700.0,5000.0
Duplex,3.0,570.0,98.488578,460.0,530.0,600.0,625.0,650.0
New Apartments / Off the Plan,44.0,547.545455,182.72262,370.0,450.0,450.0,638.75,1253.0


In [29]:
# Group property types into three broad classes to reduce class imbalance
mapping = {"House": "House", "New House & Land": "House", "Acreage / Semi-Rural": "House", "Apartment / Unit / Flat": "Apartment",
    "Studio": "Apartment", "New Apartments / Off the Plan": "Apartment","Townhouse": "Townhouse", "Villa": "Townhouse", "Semi-Detached": "House", "Duplex": "House", "Terrace": "House"}

domain_expanded["property_type_grouped"] = domain_expanded["property_type"].map(mapping)
domain_expanded["property_type_grouped"].value_counts()

property_type_grouped
House        6571
Apartment    4741
Townhouse    1304
Name: count, dtype: int64
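
`Series.map` with a dict returns `NaN` for any value that is not a key, so a property type missing from the mapping would silently become missing. A small guard such as the one below makes that visible; `group_types` and the shortened mapping are hypothetical.

```python
import pandas as pd

def group_types(types: pd.Series, mapping: dict) -> pd.Series:
    """Map property types to groups and fail loudly if any type is not covered."""
    grouped = types.map(mapping)
    unmapped = sorted(types[grouped.isna()].unique())
    if unmapped:
        raise ValueError(f"unmapped property types: {unmapped}")
    return grouped

# Shortened mapping for illustration
toy_mapping = {"House": "House", "Studio": "Apartment", "Villa": "Townhouse"}
grouped = group_types(pd.Series(["House", "Studio", "Villa"]), toy_mapping)
print(grouped.tolist())
```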

### Bedrooms, Bathrooms and Carspaces

In [30]:
# see missing by property type
missing_by_type = domain_expanded.groupby('property_type')[['bedrooms', 'bathrooms', 'carspaces']] \
                                 .apply(lambda x: x.isna().sum())
print(missing_by_type)

                               bedrooms  bathrooms  carspaces
property_type                                                
Acreage / Semi-Rural                  2          2          4
Apartment / Unit / Flat              27          4       1015
Duplex                                0          0          0
House                                 5          4        517
New Apartments / Off the Plan         0          0         40
New House & Land                      0          0          1
Semi-Detached                         0          0          3
Studio                               48          0        154
Terrace                               0          0          2
Townhouse                             0          0         30
Villa                                 0          0          1


In [31]:
# see median by property type
stats_by_type = domain_expanded.groupby('property_type')[['bedrooms', 'bathrooms', 'carspaces']] \
                               .median()
print(stats_by_type)

                               bedrooms  bathrooms  carspaces
property_type                                                
Acreage / Semi-Rural                4.0        2.0        6.0
Apartment / Unit / Flat             2.0        1.0        1.0
Duplex                              3.0        1.0        1.0
House                               3.0        2.0        2.0
New Apartments / Off the Plan       1.0        1.0        1.0
New House & Land                    3.0        1.5        2.0
Semi-Detached                       2.0        1.0        2.0
Studio                              1.0        1.0        1.0
Terrace                             2.0        1.0        2.0
Townhouse                           3.0        2.0        2.0
Villa                               2.0        1.0        2.0


In [32]:
# If no parking feature is listed (car_parking == 0) and carspaces is missing, assume 0 car spaces
domain_expanded.loc[(domain_expanded['car_parking'] == 0) & (domain_expanded['carspaces'].isna()), 'carspaces'] = 0

# Impute the remaining missing values with the median of the same property type
for col in ['bedrooms', 'bathrooms', 'carspaces']:
    domain_expanded[col] = domain_expanded.groupby('property_type')[col].transform(
        lambda x: x.fillna(x.median())
    )
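
Per-group median imputation can be checked on a toy frame. The sketch uses `transform("median")` followed by `fillna`, which should give the same result as the lambda form for this purpose; the column values are made up for illustration.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "property_type": ["House", "House", "House", "Studio", "Studio"],
    "bedrooms": [2.0, np.nan, 4.0, 1.0, np.nan],
})

# Median of the observed values within each property type, aligned to the original rows
group_median = toy.groupby("property_type")["bedrooms"].transform("median")

# Only missing entries are replaced; observed values are left untouched
toy["bedrooms_filled"] = toy["bedrooms"].fillna(group_median)
print(toy)
```

A group in which every value is missing has no median, so its rows would remain missing and need a fallback such as the overall median.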

In [33]:
domain_expanded.isnull().sum()

sa2_code                  0
sa2_name                  0
suburb                    0
postcode                  0
weekly_rent               0
bond                      0
address                  91
lat                       0
lon                       0
bedrooms                  0
bathrooms                 0
carspaces                 0
property_type             0
balcony                   0
car_parking               0
heating                   0
air_conditioning          0
builtin_wardrobes         0
swimming_pool             0
ensuite                   0
dishwasher                0
garden                    0
gym                       0
pets_allowed              0
gas                       0
intercom                  0
security_system           0
washing_machine           0
property_type_grouped     0
dtype: int64

In [34]:
print(domain_expanded.shape)
domain_expanded.describe()

(12616, 29)


Unnamed: 0,postcode,weekly_rent,bond,lat,lon,bedrooms,bathrooms,carspaces,balcony,car_parking,...,swimming_pool,ensuite,dishwasher,garden,gym,pets_allowed,gas,intercom,security_system,washing_machine
count,12616.0,12616.0,12616.0,12616.0,12616.0,12616.0,12616.0,12616.0,12616.0,12616.0,...,12616.0,12616.0,12616.0,12616.0,12616.0,12616.0,12616.0,12616.0,12616.0,12616.0
mean,3236.604153,628.894658,2778.875079,-37.782524,144.966695,2.729312,1.588618,1.456563,0.244689,0.416218,...,0.044705,0.141249,0.431833,0.067216,0.060558,0.082673,0.216233,0.12936,0.048431,0.329423
std,279.89591,284.408111,2400.057844,0.387175,0.554449,1.162112,0.635265,1.050912,0.42992,0.49295,...,0.206664,0.348292,0.495351,0.250406,0.238527,0.275398,0.411692,0.335611,0.214683,0.470022
min,3000.0,100.0,200.0,-38.82908,141.00055,1.0,1.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,3044.0,490.0,2086.0,-37.905163,144.869393,2.0,1.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,3138.0,565.0,2433.0,-37.823923,144.98278,3.0,2.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
75%,3220.0,690.0,2955.0,-37.751486,145.11011,4.0,2.0,2.0,0.0,1.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
max,3996.0,5750.0,212917.0,-34.16681,149.75592,50.0,12.0,22.0,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


In [35]:
# Convert counts to integers for consistency (any fractional value is truncated)
domain_expanded['bedrooms'] = domain_expanded['bedrooms'].astype(int)
domain_expanded['bathrooms'] = domain_expanded['bathrooms'].astype(int)
domain_expanded['carspaces'] = domain_expanded['carspaces'].astype(int)
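
`astype(int)` truncates toward zero, and the median table above contains a fractional value (1.5 bathrooms for New House & Land). Whether any row actually holds a fractional value is not shown in the notebook, so a quick check before converting may be worthwhile. `count_fractional` is a hypothetical helper and the series is toy data.

```python
import pandas as pd

def count_fractional(values: pd.Series) -> int:
    """Count non-missing values that are not whole numbers."""
    observed = values.dropna()
    return int((observed % 1 != 0).sum())

bathrooms = pd.Series([1.0, 1.5, 2.0])   # toy values
print(count_fractional(bathrooms))       # 1
print(bathrooms.astype(int).tolist())    # [1, 1, 2]: 1.5 is truncated to 1
```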

In [36]:
domain_expanded.to_csv('../../datasets/raw/cleaned/domain_cleaned.csv', index=False)