# Population Data Preprocessing

## Purpose

This notebook processes historical and projected population data for Victoria's SA2 regions, and integrates it with rental data to prepare features for later analysis and modelling. 

Key steps include:
- Load SA2-level population data and filter to Victoria
- Clean and select relevant columns from both historical and projected population datasets
- Check for negative or missing values and ensure consistency in SA2 codes
- Load median rental prices by suburb and map them to SA2 regions
- Merge historical population, projected population, and rental data into a combined dataset
- Compute population growth rates (historical and projected)
- Interpolate missing ERP (Estimated Resident Population) values across years
- Conduct backtesting to check interpolation accuracy (MAE, MAPE)

## Inputs
- `../../datasets/population_data/32180_ERP_2024_SA2_GDA2020.gpkg` (historical ERP)
- `../../datasets/population_data/VIF2023_SA2_Pop_Hhold_Dwelling_Projections_to_2036_Release_2.xlsx` (projected ERP)
- `../../datasets/property/median_by_suburb/Moving annual median rent by suburb and town - March quarter 2025 (2).xlsx`
- `../../datasets/district_shape/sa2_lookup/mapped_target_suburbs.csv` (suburb → SA2 mapping)

## Outputs
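- `combined_df`: merged dataset with historical population, projected population, growth features, and March 2025 median rent by SA2 (saved without geometry to `../../datasets/raw/curated/population_data.csv`)
- `combined_df_full`: SA2 codes with annual ERP values for 2015 to 2036, linearly interpolated between known years (saved to `../../datasets/raw/curated/full_erp_only_population_data.csv`)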


In [11]:
# Libraries
import os
import geopandas as gpd
import pandas as pd
from shapely.geometry import Point
import matplotlib.pyplot as plt
import folium
import numpy as np
from pathlib import Path
import statsmodels.api as sm
import seaborn as sns
from sklearn.metrics import mean_absolute_error, mean_absolute_percentage_error
from sklearn.linear_model import LinearRegression

## 1. Historical Population Data

In [12]:
gdf = gpd.read_file("../../datasets/population_data/32180_ERP_2024_SA2_GDA2020.gpkg")
gdf.columns

Index(['state_code_2021', 'state_name_2021', 'gccsa_code_2021',
       'gccsa_name_2021', 'sa4_code_2021', 'sa4_name_2021', 'sa3_code_2021',
       'sa3_name_2021', 'sa2_code_2021', 'sa2_name_2021', 'erp_2001',
       'erp_2002', 'erp_2003', 'erp_2004', 'erp_2005', 'erp_2006', 'erp_2007',
       'erp_2008', 'erp_2009', 'erp_2010', 'erp_2011', 'erp_2012', 'erp_2013',
       'erp_2014', 'erp_2015', 'erp_2016', 'erp_2017', 'erp_2018', 'erp_2019',
       'erp_2020', 'erp_2021', 'erp_2022', 'erp_2023', 'erp_2024',
       'erp_change_number_2023_24', 'erp_change_per_cent_2023_24', 'area_km2',
       'pop_density_2024_people_per_km2', 'births_2021_22', 'deaths_2021_22',
       'natural_increase_2021_22', 'internal_arrivals_2021_22',
       'internal_departures_2021_22', 'net_internal_migration_2021_22',
       'overseas_arrivals_2021_22', 'overseas_departures_2021_22',
       'net_overseas_migration_2021_22', 'births_2022_23', 'deaths_2022_23',
       'natural_increase_2022_23', 'internal_arr

In [13]:
# Filter to Victoria only
gdf_vic = gdf[gdf["state_name_2021"] == "Victoria"]
print(gdf_vic.shape)
gdf_vic.head()

(522, 66)


Unnamed: 0,state_code_2021,state_name_2021,gccsa_code_2021,gccsa_name_2021,sa4_code_2021,sa4_name_2021,sa3_code_2021,sa3_name_2021,sa2_code_2021,sa2_name_2021,...,births_2023_24,deaths_2023_24,natural_increase_2023_24,internal_arrivals_2023_24,internal_departures_2023_24,net_internal_migration_2023_24,overseas_arrivals_2023_24,overseas_departures_2023_24,net_overseas_migration_2023_24,geometry
642,2,Victoria,2RVIC,Rest of Vic.,201,Ballarat,20101,Ballarat,201011001,Alfredton,...,232,61,171,2321,1470,851,167,54,113,"MULTIPOLYGON (((143.78282 -37.56666, 143.78299..."
643,2,Victoria,2RVIC,Rest of Vic.,201,Ballarat,20101,Ballarat,201011002,Ballarat,...,86,129,-43,1312,1415,-103,159,51,108,"MULTIPOLYGON (((143.81896 -37.55582, 143.81886..."
644,2,Victoria,2RVIC,Rest of Vic.,201,Ballarat,20101,Ballarat,201011005,Buninyong,...,45,32,13,567,613,-46,91,23,68,"MULTIPOLYGON (((143.84171 -37.61596, 143.84142..."
645,2,Victoria,2RVIC,Rest of Vic.,201,Ballarat,20101,Ballarat,201011006,Delacombe,...,264,89,175,2422,1224,1198,49,16,33,"MULTIPOLYGON (((143.7505 -37.59119, 143.75052 ..."
646,2,Victoria,2RVIC,Rest of Vic.,201,Ballarat,20101,Ballarat,201011007,Smythes Creek,...,34,12,22,293,307,-14,4,1,3,"MULTIPOLYGON (((143.73296 -37.62333, 143.73103..."


In [14]:
# Drop unneeded columns
cols_to_drop = ['state_code_2021', 'state_name_2021', 'gccsa_code_2021','gccsa_name_2021', 'sa4_name_2021', 'sa4_code_2021', 'sa3_code_2021', 'sa3_name_2021', 'erp_2001','erp_2002', 'erp_2003', 'erp_2004', 'erp_2005', 'erp_2006', 'erp_2007',
       'erp_2008', 'erp_2009', 'erp_2010', 'erp_2011', 'erp_2012', 'erp_2013', 'erp_2014']

pop_vic = gdf_vic.drop(columns=cols_to_drop, errors="ignore")

print(pop_vic.shape)
pop_vic.head()

(522, 44)


Unnamed: 0,sa2_code_2021,sa2_name_2021,erp_2015,erp_2016,erp_2017,erp_2018,erp_2019,erp_2020,erp_2021,erp_2022,...,births_2023_24,deaths_2023_24,natural_increase_2023_24,internal_arrivals_2023_24,internal_departures_2023_24,net_internal_migration_2023_24,overseas_arrivals_2023_24,overseas_departures_2023_24,net_overseas_migration_2023_24,geometry
642,201011001,Alfredton,11039.0,11852,12649,13537,14434,15507,16841,18002,...,232,61,171,2321,1470,851,167,54,113,"MULTIPOLYGON (((143.78282 -37.56666, 143.78299..."
643,201011002,Ballarat,12300.0,12301,12266,12244,12320,12196,12071,11938,...,86,129,-43,1312,1415,-103,159,51,108,"MULTIPOLYGON (((143.81896 -37.55582, 143.81886..."
644,201011005,Buninyong,7191.0,7311,7409,7418,7458,7377,7229,7247,...,45,32,13,567,613,-46,91,23,68,"MULTIPOLYGON (((143.84171 -37.61596, 143.84142..."
645,201011006,Delacombe,6846.0,7195,7622,8183,8890,9755,10648,11798,...,264,89,175,2422,1224,1198,49,16,33,"MULTIPOLYGON (((143.7505 -37.59119, 143.75052 ..."
646,201011007,Smythes Creek,3966.0,3990,4004,4042,4112,4152,4211,4223,...,34,12,22,293,307,-14,4,1,3,"MULTIPOLYGON (((143.73296 -37.62333, 143.73103..."


In [15]:
# Rename for consistency
pop_vic.rename(columns={'sa2_code_2021': 'sa2_code','sa2_name_2021': 'sa2_name'},inplace=True)
pop_vic

Unnamed: 0,sa2_code,sa2_name,erp_2015,erp_2016,erp_2017,erp_2018,erp_2019,erp_2020,erp_2021,erp_2022,...,births_2023_24,deaths_2023_24,natural_increase_2023_24,internal_arrivals_2023_24,internal_departures_2023_24,net_internal_migration_2023_24,overseas_arrivals_2023_24,overseas_departures_2023_24,net_overseas_migration_2023_24,geometry
642,201011001,Alfredton,11039.0,11852,12649,13537,14434,15507,16841,18002,...,232,61,171,2321,1470,851,167,54,113,"MULTIPOLYGON (((143.78282 -37.56666, 143.78299..."
643,201011002,Ballarat,12300.0,12301,12266,12244,12320,12196,12071,11938,...,86,129,-43,1312,1415,-103,159,51,108,"MULTIPOLYGON (((143.81896 -37.55582, 143.81886..."
644,201011005,Buninyong,7191.0,7311,7409,7418,7458,7377,7229,7247,...,45,32,13,567,613,-46,91,23,68,"MULTIPOLYGON (((143.84171 -37.61596, 143.84142..."
645,201011006,Delacombe,6846.0,7195,7622,8183,8890,9755,10648,11798,...,264,89,175,2422,1224,1198,49,16,33,"MULTIPOLYGON (((143.7505 -37.59119, 143.75052 ..."
646,201011007,Smythes Creek,3966.0,3990,4004,4042,4112,4152,4211,4223,...,34,12,22,293,307,-14,4,1,3,"MULTIPOLYGON (((143.73296 -37.62333, 143.73103..."
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1159,217031476,Otway,3538.0,3556,3635,3710,3802,3911,3979,3974,...,26,36,-10,209,269,-60,51,16,35,"MULTIPOLYGON (((143.40263 -38.78152, 143.4028 ..."
1160,217041477,Moyne - East,6716.0,6709,6717,6746,6798,6883,6990,7046,...,82,52,30,475,516,-41,36,11,25,"MULTIPOLYGON (((142.41438 -38.09303, 142.39372..."
1161,217041478,Moyne - West,9467.0,9603,9686,9783,9845,9859,9967,10098,...,106,69,37,541,557,-16,68,21,47,"MULTIPOLYGON (((142.0087 -38.41715, 142.00871 ..."
1162,217041479,Warrnambool - North,21217.0,21442,21688,21954,22184,22416,22470,22586,...,254,166,88,1454,1534,-80,217,70,147,"MULTIPOLYGON (((142.43668 -38.35544, 142.43666..."


In [16]:
# Check for negative values
numeric_cols = pop_vic.select_dtypes(include='number')
negative_erp = pop_vic[numeric_cols.filter(like='erp').columns].lt(0).sum()
print("Negative population counts per column:\n", negative_erp)
print("Unique SA2 codes:", pop_vic["sa2_code"].nunique())


Negative population counts per column:
 erp_2015                        0
erp_2016                        0
erp_2017                        0
erp_2018                        0
erp_2019                        0
erp_2020                        0
erp_2021                        0
erp_2022                        0
erp_2023                        0
erp_2024                        0
erp_change_number_2023_24      50
erp_change_per_cent_2023_24    50
dtype: int64
Unique SA2 codes: 522
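
The 50 negative counts above come from `erp_change_number_2023_24` and `erp_change_per_cent_2023_24`. These match the `like='erp'` filter but measure year-on-year change, so negative values there most likely reflect population decline rather than bad data. A minimal sketch, on a toy frame with made-up values, of restricting the check to the annual `erp_<year>` columns with a regex:

```python
import pandas as pd

# Toy frame with made-up values; column names mirror the notebook's.
toy = pd.DataFrame({
    "sa2_code": [1, 2, 3],
    "erp_2023": [100, 200, 300],
    "erp_2024": [110, 190, 300],
    "erp_change_number_2023_24": [10, -10, 0],
})

# Match only annual population columns such as erp_2015 ... erp_2024.
annual_erp = toy.filter(regex=r"^erp_\d{4}$")
negatives_per_column = annual_erp.lt(0).sum()
print(negatives_per_column)
```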


In [607]:
# Save to local
pop_vic.to_file("../datasets/raw/vic_population.gpkg", driver="GPKG")

## 2. Projection Data (VIF)

In [19]:
vif_data_pop = pd.read_excel("../../datasets/population_data/VIF2023_SA2_Pop_Hhold_Dwelling_Projections_to_2036_Release_2.xlsx", 
                                       sheet_name="Dwellings_and_Households", skiprows=9)
print(vif_data_pop.shape)
vif_data_pop.head()

(612, 34)


Unnamed: 0,GCCSA,SA4 Code,SA3 Code,SA2 code,Region Type,Region,2021,2026,2031,2036,...,2031.4,2036.4,2021.5,2026.5,2031.5,2036.5,2021.6,2026.6,2031.6,2036.6
0,,,,,,,,,,,...,,,,,,,,,,
1,2RVIC,201.0,20101.0,201011001.0,SA2,Alfredton,16841.0,20756.256163,23604.443836,26060.320807,...,2.575123,2.554885,6245.0,8252.0,9732.0,10830.0,0.967833,0.935,0.935,0.935
2,2RVIC,201.0,20101.0,201011002.0,SA2,Ballarat,12071.0,11698.293593,11803.430603,11985.992387,...,1.975963,1.937952,5970.0,6134.548371,6350.037451,6553.095884,0.90819,0.895,0.895,0.895
3,2RVIC,201.0,20101.0,201011005.0,SA2,Buninyong,7229.0,7372.079773,7685.113372,8028.887243,...,2.475924,2.405964,2768.0,2943.325691,3199.637967,3445.433425,0.963041,0.94,0.94,0.94
4,2RVIC,201.0,20101.0,201011006.0,SA2,Delacombe,10648.0,15915.186041,20475.587469,24965.202439,...,2.421662,2.395607,4172.0,6585.376102,8740.266903,10770.851234,0.984004,0.955,0.955,0.955


In [20]:
# rename columns 
new_columns = ['gcsa', 'sa4_code', 'sa3_code',
    "sa2_code", 'region_type', "region",
    "erp_2021", "erp_2026", "erp_2031", "erp_2036",
    "pnpd_2021", "pnpd_2026", "pnpd_2031", "pnpd_2036",
    "popd_2021", "popd_2026", "popd_2031", "popd_2036",
    "opd_2021", "opd_2026", "opd_2031", "opd_2036",
    "hhs_2021", "hhs_2026", "hhs_2031", "hhs_2036",
    "spd_2021", "spd_2026", "spd_2031", "spd_2036",
    "occ_2021", "occ_2026", "occ_2031", "occ_2036"
]

vif_data_pop.columns = new_columns

vif_data_pop = vif_data_pop.dropna(axis=1, how='all')
vif_data_pop = vif_data_pop.dropna(how='all')
vif_data_pop.drop(columns=['gcsa', 'sa4_code', 'sa3_code', 'region_type'], inplace=True)
vif_data_pop = vif_data_pop.dropna(subset=['sa2_code']).copy()
print(vif_data_pop.columns)
print("Unique SA2 codes:", vif_data_pop["sa2_code"].nunique())

Index(['sa2_code', 'region', 'erp_2021', 'erp_2026', 'erp_2031', 'erp_2036',
       'pnpd_2021', 'pnpd_2026', 'pnpd_2031', 'pnpd_2036', 'popd_2021',
       'popd_2026', 'popd_2031', 'popd_2036', 'opd_2021', 'opd_2026',
       'opd_2031', 'opd_2036', 'hhs_2021', 'hhs_2026', 'hhs_2031', 'hhs_2036',
       'spd_2021', 'spd_2026', 'spd_2031', 'spd_2036', 'occ_2021', 'occ_2026',
       'occ_2031', 'occ_2036'],
      dtype='object')
Unique SA2 codes: 522


## 3. Median Rental Data by Suburb

In [27]:
median_rentals = pd.read_excel("../../datasets/property/median_by_suburb/Moving annual median rent by suburb and town - March quarter 2025 (2).xlsx", sheet_name="All properties", skiprows= 1)
median_rentals.head()

Unnamed: 0.1,Unnamed: 0,Unnamed: 1,Unnamed: 2,Mar 2000,Mar 2000.1,Jun 2000,Jun 2000.1,Sep 2000,Sep 2000.1,Dec 2000,...,Mar 2024,Mar 2024.1,Jun 2024,Jun 2024.1,Sep 2024,Sep 2024.1,Dec 2024,Dec 2024.1,Mar 2025,Mar 2025.1
0,,,,Count,Median,Count,Median,Count,Median,Count,...,Count,Median,Count,Median,Count,Median,Count,Median,Count,Median
1,Inner Melbourne,Albert Park-Middle Park-West St Kilda,,1143,260,1134,260,1177,270,1178,...,677,660,661,675,640,693,621,700,621,700
2,,Armadale,,733,200,737,200,738,205,739,...,571,560,565,590,569,600,589,600,625,625
3,,Carlton North,,864,260,814,260,799,265,736,...,391,680,381,690,389,680,379,700,371,720
4,,Carlton-Parkville,,1303,251,1278,260,1280,260,1301,...,2619,570,2678,580,2681,585,2621,600,2643,600


In [28]:
# Drop the region-group column and the empty spacer column, keeping the suburb column
median_rentals = median_rentals.drop(median_rentals.columns[0], axis=1)
median_rentals = median_rentals.drop(median_rentals.columns[1], axis=1)

In [29]:
# Keep only columns from March 2015 onwards (drops the 2000-2014 quarters)
median_rentals = median_rentals.drop(median_rentals.columns[1:121], axis=1)
median_rentals.head()

Unnamed: 0,Unnamed: 1,Mar 2015,Mar 2015.1,Jun 2015,Jun 2015.1,Sep 2015,Sep 2015.1,Dec 2015,Dec 2015.1,Mar 2016,...,Mar 2024,Mar 2024.1,Jun 2024,Jun 2024.1,Sep 2024,Sep 2024.1,Dec 2024,Dec 2024.1,Mar 2025,Mar 2025.1
0,,Count,Median,Count,Median,Count,Median,Count,Median,Count,...,Count,Median,Count,Median,Count,Median,Count,Median,Count,Median
1,Albert Park-Middle Park-West St Kilda,942,480,925,490,913,495,941,500,961,...,677,660,661,675,640,693,621,700,621,700
2,Armadale,680,400,663,399,663,400,659,400,653,...,571,560,565,590,569,600,589,600,625,625
3,Carlton North,535,530,529,530,541,530,561,530,544,...,391,680,381,690,389,680,379,700,371,720
4,Carlton-Parkville,2170,415,2429,440,2501,450,2510,450,2663,...,2619,570,2678,580,2681,585,2621,600,2643,600


In [30]:
# Rename columns to count_<month>_<year> and median_<month>_<year>
new_cols = []
for col in median_rentals.columns:
    if ".1" in col:
        # Median column
        base = col.replace(".1", "").strip().split(" ")
        month, year = base[0].lower(), base[1]
        new_cols.append(f"median_{month}_{year}")
    else:
        # Count column
        base = col.strip().split(" ")
        month, year = base[0].lower(), base[1]
        new_cols.append(f"count_{month}_{year}")

median_rentals.columns = new_cols
median_rentals.rename(columns={'count_unnamed:_1': 'suburb'},inplace=True)
median_rentals = median_rentals.drop(median_rentals.index[0])
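
The renaming rule above can also be written as a small function, which makes the mapping easy to check on individual labels. This is a sketch only: `rename_rent_column` is a hypothetical helper, and unlike the loop above it leaves non-quarter labels such as `Unnamed: 1` unchanged instead of turning them into `count_unnamed:_1`.

```python
import re

def rename_rent_column(label):
    """Map 'Mar 2015' -> 'count_mar_2015' and 'Mar 2015.1' -> 'median_mar_2015'."""
    match = re.fullmatch(r"([A-Za-z]{3}) (\d{4})(\.1)?", label.strip())
    if match is None:
        return label  # leave non-quarter labels (e.g. the suburb column) unchanged
    month, year, is_median = match.groups()
    prefix = "median" if is_median else "count"
    return f"{prefix}_{month.lower()}_{year}"

renamed = [rename_rent_column(c) for c in ["Unnamed: 1", "Mar 2015", "Mar 2015.1"]]
print(renamed)
```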

In [31]:
median_rentals['suburb']

1      Albert Park-Middle Park-West St Kilda
2                                   Armadale
3                              Carlton North
4                          Carlton-Parkville
5                            CBD-St Kilda Rd
                       ...                  
155                              Wanagaratta
156                                 Warragul
157                              Warrnambool
158                                  Wodonga
159                              Group Total
Name: suburb, Length: 159, dtype: object

### Map Suburb Rental Data to SA2

In [36]:
suburb_to_sa2 = pd.read_csv("../../datasets/district_shape/sa2_lookup/mapped_target_suburbs.csv")  

median_rentals['suburb'] = median_rentals['suburb'].str.strip()
suburb_to_sa2['Target_Suburb'] = suburb_to_sa2['Target_Suburb'].str.strip()

# Merge rental data with SA2 mapping
rental_with_sa2 = median_rentals.merge(suburb_to_sa2[['Target_Suburb', 'SA2_CODE21', 'SA2_NAME21']],left_on='suburb',right_on='Target_Suburb',how='left')

rental_with_sa2 = rental_with_sa2.drop(columns=['Target_Suburb'])
rental_with_sa2.head()

Unnamed: 0,suburb,count_mar_2015,median_mar_2015,count_jun_2015,median_jun_2015,count_sep_2015,median_sep_2015,count_dec_2015,median_dec_2015,count_mar_2016,...,count_jun_2024,median_jun_2024,count_sep_2024,median_sep_2024,count_dec_2024,median_dec_2024,count_mar_2025,median_mar_2025,SA2_CODE21,SA2_NAME21
0,Albert Park-Middle Park-West St Kilda,942,480,925,490,913,495,941,500,961,...,661,675,640,693,621,700,621,700,206051128.0,Albert Park
1,Armadale,680,400,663,399,663,400,659,400,653,...,565,590,569,600,589,600,625,625,206061135.0,Armadale
2,Carlton North,535,530,529,530,541,530,561,530,544,...,381,690,389,680,379,700,371,720,206041117.0,Carlton
3,Carlton-Parkville,2170,415,2429,440,2501,450,2510,450,2663,...,2678,580,2681,585,2621,600,2643,600,206041124.0,Parkville
4,CBD-St Kilda Rd,8776,450,8842,450,9146,450,9040,450,8622,...,13517,650,13253,650,13028,650,13383,650,206041125.0,South Yarra - West


In [37]:
unmatched_suburbs = rental_with_sa2[rental_with_sa2['SA2_CODE21'].isna()]['suburb'].unique()
print("Unmatched suburbs:", unmatched_suburbs)

Unmatched suburbs: ['Group Total' 'Wanagaratta']
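
An alternative way to audit a left join is pandas' `indicator=True` option, which adds a `_merge` column recording whether each row found a match. The sketch below uses toy frames; the `Group Total` rent value is made up, and the frame names `rents`, `lookup` and `audit` are illustrative.

```python
import pandas as pd

# Toy data: one suburb present in the lookup, one summary row that is not.
rents = pd.DataFrame({"suburb": ["Armadale", "Group Total"], "median_rent": [625, 560]})
lookup = pd.DataFrame({"Target_Suburb": ["Armadale"], "SA2_CODE21": [206061135]})

audit = rents.merge(lookup, left_on="suburb", right_on="Target_Suburb",
                    how="left", indicator=True)

# Rows tagged 'left_only' had no matching entry in the lookup table.
unmatched = audit.loc[audit["_merge"] == "left_only", "suburb"].tolist()
print(unmatched)
```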


In [38]:
# Drop unmatched rows, rename columns for consistency, and store SA2 codes as strings
rental_with_sa2 = rental_with_sa2[~rental_with_sa2['suburb'].isin(['Group Total', 'Wanagaratta'])]
rental_with_sa2 = rental_with_sa2.rename(columns={'SA2_CODE21': 'sa2_code','SA2_NAME21': 'sa2_name'})
rental_with_sa2['sa2_code'] = rental_with_sa2['sa2_code'].astype('Int64').astype(str)

### Merge Data for Population Analysis

In [39]:
# Get projected and historical data
proj_data = vif_data_pop.copy()
hist_data = pop_vic.copy()
print(proj_data['sa2_code'].dtype)
print(hist_data['sa2_code'].dtype)

float64
int64
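
Because the two `sa2_code` columns have different dtypes, converting each directly to a string would produce keys that do not match. A small sketch of the problem and of the nullable-integer conversion, using two SA2 codes from the tables above:

```python
import pandas as pd

codes_float = pd.Series([201011001.0, 201011002.0])  # float64, like the VIF sheet
codes_int = pd.Series([201011001, 201011002])        # int64, like the ABS file

naive = codes_float.astype(str)                  # keeps a trailing ".0"
clean = codes_float.astype("Int64").astype(str)  # nullable integer first, then string

print(naive.tolist())
print(clean.tolist())
```

Only the `clean` form lines up with `codes_int.astype(str)`, which is what a string-keyed merge needs.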


In [40]:
# Normalise SA2 codes to strings and merge (erp_2021 is dropped from proj_data because the historical value is available)
hist_data['sa2_code'] = hist_data['sa2_code'].astype(str).str.strip()
proj_data['sa2_code'] = proj_data['sa2_code'].astype('Int64').astype(str)

proj_data = proj_data.drop(columns=['erp_2021'])

combined = pd.merge(hist_data, proj_data, on='sa2_code', how='left')

In [41]:
# Check SA2 mismatches
missing_in_abs = set(proj_data['sa2_code']) - set(hist_data['sa2_code'])
print(missing_in_abs)

missing_in_vif = set(hist_data['sa2_code']) - set(proj_data['sa2_code'])
print(missing_in_vif)

combined.shape

set()
set()


(522, 72)

In [42]:
# Take only 2025 rental median values and merge with combined population data
rental_2025 = rental_with_sa2[['sa2_code', 'sa2_name', 'median_mar_2025']].copy()

rental_2025 = rental_2025.rename(columns={'median_mar_2025': 'median_rent_2025'})
combined_df = combined.merge(rental_2025[['sa2_code', 'median_rent_2025']], how='left',on='sa2_code')

## 4. Feature Engineering

In [44]:
# Replace 0s with NaN in ERP columns
erp_cols = ['erp_2015', 'erp_2020', 'erp_2021', 'erp_2024', 'erp_2026', 'erp_2031', 'erp_2036']
combined_df[erp_cols] = combined_df[erp_cols].replace(0, np.nan)

# year pairs for growth calculations
growth_pairs = {'pop_growth_2015_2020': ('erp_2015', 'erp_2020'), 'pop_growth_2020_2024': ('erp_2020', 'erp_2024'), 'pop_growth_2021_2026': ('erp_2021', 'erp_2026'), 'pop_growth_2026_2031': ('erp_2026', 'erp_2031'),'pop_growth_2031_2036': ('erp_2031', 'erp_2036')}

# Compute growth for each pair
for col, (start, end) in growth_pairs.items(): 
    combined_df[col] = ((combined_df[end] - combined_df[start]) / combined_df[start]) * 100

# Median growths
combined_df['median_projected_growth'] = combined_df[['pop_growth_2021_2026','pop_growth_2026_2031','pop_growth_2031_2036']].median(axis=1)
combined_df['median_historical_growth'] = combined_df[['pop_growth_2015_2020','pop_growth_2020_2024']].median(axis=1)
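
As a worked check of the growth formula, the sketch below reproduces Alfredton's 2015 to 2020 figure from the ERP values shown earlier. Since the growth windows differ in length (five years for 2015 to 2020, four for 2020 to 2024), it also shows an annualised rate as an optional alternative; `annualised_growth_pct` is not used anywhere in the notebook and both function names are illustrative.

```python
def total_growth_pct(start, end):
    """Percentage change from start to end, as in the growth loop above."""
    return (end - start) / start * 100

def annualised_growth_pct(start, end, n_years):
    """Compound annual growth rate in percent (optional alternative, an assumption)."""
    return ((end / start) ** (1 / n_years) - 1) * 100

# Alfredton: erp_2015 = 11039, erp_2020 = 15507.
total = total_growth_pct(11039, 15507)
annual = annualised_growth_pct(11039, 15507, 5)
print(round(total, 4), round(annual, 2))
```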

In [45]:
combined_df.head()

Unnamed: 0,sa2_code,sa2_name,erp_2015,erp_2016,erp_2017,erp_2018,erp_2019,erp_2020,erp_2021,erp_2022,...,occ_2031,occ_2036,median_rent_2025,pop_growth_2015_2020,pop_growth_2020_2024,pop_growth_2021_2026,pop_growth_2026_2031,pop_growth_2031_2036,median_projected_growth,median_historical_growth
0,201011001,Alfredton,11039.0,11852,12649,13537,14434,15507.0,16841.0,18002,...,0.935,0.935,,40.474681,29.812343,23.248359,13.722068,10.404299,13.722068,35.143512
1,201011002,Ballarat,12300.0,12301,12266,12244,12320,12196.0,12071.0,11938,...,0.895,0.895,395.0,-0.845528,-3.46835,-3.087618,0.898738,1.546684,0.898738,-2.156939
2,201011005,Buninyong,7191.0,7311,7409,7418,7458,7377.0,7229.0,7247,...,0.94,0.94,413.0,2.586567,-0.257557,1.979247,4.246205,4.473244,4.246205,1.164505
3,201011006,Delacombe,6846.0,7195,7622,8183,8890,9755.0,10648.0,11798,...,0.955,0.955,430.0,42.491966,46.294208,49.466435,28.654402,21.926672,28.654402,44.393087
4,201011007,Smythes Creek,3966.0,3990,4004,4042,4112,4152.0,4211.0,4223,...,0.93,0.93,,4.689864,3.034682,2.40082,3.369934,6.013677,3.369934,3.862273


In [None]:
# Save CSV without geometry
combined_nogeo = combined_df.drop(columns=["geometry"], errors="ignore")
combined_nogeo.to_csv("../../datasets/raw/curated/population_data.csv", index=False)

## 5. Forecasting Population Growth

### Linear Interpolation

In [46]:
erp_cols = [col for col in combined_df.columns if col.startswith('erp_20')]

def interpolate_erp(row):  
    """
    Linearly interpolate missing ERP (Estimated Resident Population) values across years for one row.

    Parameters:
        row: a DataFrame row containing ERP values for multiple years (e.g., erp_2015, erp_2016, ...).

    Returns:
        A Series of ERP values for every year from the first to the last ERP column,
        with index labels formatted as 'erp_<year>'.
    """
    # Extract year from column names
    years = [int(col.split('_')[1]) for col in erp_cols]
    values = row[erp_cols].values.astype(float)
    
    # Mask for non-NaN values
    mask = ~np.isnan(values)
    known_years = np.array(years)[mask]
    known_values = values[mask]
    
    # Interpolate for all years in range
    full_years = np.arange(min(years), max(years)+1)
    interpolated = np.interp(full_years, known_years, known_values)
    
    # Return as a Series with year-specific column names
    return pd.Series(interpolated, index=[f'erp_{y}' for y in full_years])

# Apply to all SA2s
erp_interpolated = combined_df.apply(interpolate_erp, axis=1)

combined_df_full = pd.concat([combined_df[['sa2_code']], erp_interpolated], axis=1)

combined_df_full.head()


Unnamed: 0,sa2_code,erp_2015,erp_2016,erp_2017,erp_2018,erp_2019,erp_2020,erp_2021,erp_2022,erp_2023,...,erp_2027,erp_2028,erp_2029,erp_2030,erp_2031,erp_2032,erp_2033,erp_2034,erp_2035,erp_2036
0,201011001,11039.0,11852.0,12649.0,13537.0,14434.0,15507.0,16841.0,18002.0,18995.0,...,21325.893697,21895.531232,22465.168766,23034.806301,23604.443836,24095.61923,24586.794624,25077.970018,25569.145413,26060.320807
1,201011002,12300.0,12301.0,12266.0,12244.0,12320.0,12196.0,12071.0,11938.0,11811.0,...,11719.320995,11740.348397,11761.375799,11782.403201,11803.430603,11839.94296,11876.455316,11912.967673,11949.48003,11985.992387
2,201011005,7191.0,7311.0,7409.0,7418.0,7458.0,7377.0,7229.0,7247.0,7323.0,...,7434.686493,7497.293213,7559.899932,7622.506652,7685.113372,7753.868146,7822.62292,7891.377695,7960.132469,8028.887243
3,201011006,6846.0,7195.0,7622.0,8183.0,8890.0,9755.0,10648.0,11798.0,12865.0,...,16827.266327,17739.346612,18651.426898,19563.507183,20475.587469,21373.510463,22271.433457,23169.356451,24067.279445,24965.202439
4,201011007,3966.0,3990.0,4004.0,4042.0,4112.0,4152.0,4211.0,4223.0,4267.0,...,4341.161505,4370.224481,4399.287456,4428.350431,4457.413406,4511.024293,4564.635179,4618.246065,4671.856951,4725.467837
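
To make the interpolation concrete, the sketch below recomputes Alfredton's 2027 value from its 2026 and 2031 projections and compares it with the straight-line formula written out by hand. It also shows that `np.interp` does not extrapolate: outside the known range it returns the nearest end value.

```python
import numpy as np

# Alfredton's known ERP points on either side of the 2027-2030 gap.
known_years = np.array([2026, 2031])
known_values = np.array([20756.256163, 23604.443836])

# np.interp draws a straight line between neighbouring known points.
erp_2027 = np.interp(2027, known_years, known_values)
manual = known_values[0] + (known_values[1] - known_values[0]) * (2027 - 2026) / (2031 - 2026)
print(erp_2027, manual)

# Beyond the last known year the end value is held constant.
beyond = np.interp(2040, known_years, known_values)
print(beyond)
```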


### Backtest for evaluation

In [47]:
erp_cols = [col for col in combined_df.columns if col.startswith('erp_20')]

def interpolate_erp_backtest(row, drop_years):
    """Interpolate ERP for one SA2 row while treating `drop_years` as unknown."""
    years = np.array([int(col.split('_')[1]) for col in erp_cols])
    values = row[erp_cols].values.astype(float)
    
    # Temporarily remove selected years
    mask = ~np.isin(years, drop_years)
    known_years = years[mask]
    known_values = values[mask]
    
    full_years = np.arange(min(years), max(years)+1)
    interpolated = np.interp(full_years, known_years, known_values)
    
    return pd.Series(interpolated, index=[f'erp_{y}' for y in full_years])

# years to hide for backtesting
drop_years = [2017, 2022, 2024]

results = []

for sa2 in combined_df['sa2_code'].unique():
    df_sa2 = combined_df[combined_df['sa2_code'] == sa2].iloc[0]  # single row per SA2
    
    # Interpolate
    interpolated = interpolate_erp_backtest(df_sa2, drop_years)
    
    # Actual values
    actual_values = df_sa2[[f'erp_{y}' for y in drop_years]].values.astype(float)
    pred_values = interpolated[[f'erp_{y}' for y in drop_years]].values.astype(float)
    
    if np.isnan(actual_values).any() or np.isnan(pred_values).any():
        continue
    
    # Compute errors
    mae = mean_absolute_error(actual_values, pred_values)
    mape = mean_absolute_percentage_error(actual_values, pred_values) * 100
    
    results.append({'sa2_code': sa2, 'mae': mae, 'mape': mape})

accuracy_summary = pd.DataFrame(results)
overall = accuracy_summary[['mae', 'mape']].mean()
print("Overall backtest metrics:")
print(overall)

Overall backtest metrics:
mae     115.363943
mape      1.135614
dtype: float64
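
The figures above are per-SA2 errors over the three hidden years, averaged across SA2s. For reference, a plain-Python sketch of the two metrics on made-up numbers; `mae` and `mape_pct` are illustrative stand-ins for the scikit-learn functions used in the notebook, not their implementation.

```python
def mae(actual, predicted):
    """Mean absolute error, in the same units as the data (people)."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def mape_pct(actual, predicted):
    """Mean absolute percentage error, expressed in percent."""
    return sum(abs(a - p) / abs(a) for a, p in zip(actual, predicted)) / len(actual) * 100

actual = [100.0, 200.0, 400.0]     # made-up values
predicted = [110.0, 190.0, 400.0]  # made-up values

example_mae = mae(actual, predicted)
example_mape = mape_pct(actual, predicted)
print(example_mae, example_mape)
```

MAE weights large SA2s more heavily because it is in absolute people, while MAPE is scale-free, which is why reporting both is useful here.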


In [None]:
# Save file
combined_df_full.to_csv("../../datasets/raw/curated/full_erp_only_population_data.csv", index=False)
