# HDB Resale Price Predictor & Visualisation

This project builds a data pipeline on top of the available APIs (Data.gov.sg and OneMap) to power a web-based application for:
1. HDB Price visualisation
2. HDB Price prediction

The prototype reads the latest data directly from data.gov.sg and runs ETL (Extract, Transform, and Load) to load it into a local or web database of choice.

In [1]:
import os

# Raw string so the backslashes in the Windows path are not read as escape sequences
os.chdir(r'f:\python_stuff\ml_webapp')
print(f'Working directory: {os.getcwd()}')

Working directory: f:\python_stuff\ml_webapp


In [2]:
from modules.utils import *
from modules.utils import logger
from etl import *

## Data Wrangling Contents
1. Getting the data through API call
2. Data Wrangling
3. Feature Engineering

## 1. Getting the data through API call

### Wrapper functions
* To time function calls
* To handle HTTP errors and other exceptions
* To cache API calls
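
The wrappers themselves live in the project's modules and are not shown in this notebook. As a rough sketch of the first two ideas, timing and error handling could be written as decorators like the ones below. The names `timeit`, `error_handler` and `risky_divide` are hypothetical, and `urllib.error.HTTPError` stands in for whichever HTTP exception the real code catches. Caching is handled separately by the `requests_cache` session created later in the notebook.

```python
import functools
import logging
import time
from urllib.error import HTTPError  # stand-in for the HTTP error type used by the real code

logger = logging.getLogger(__name__)


def timeit(func):
    """Log how long each call to `func` takes."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            elapsed = time.perf_counter() - start
            logger.info("%s() execution time: %.4f seconds", func.__name__, elapsed)
    return wrapper


def error_handler(func):
    """Log HTTP errors and other exceptions, returning None instead of raising."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        try:
            return func(*args, **kwargs)
        except HTTPError as err:
            logger.error("%s() HTTP error: %s", func.__name__, err)
        except Exception as err:  # deliberately broad: last-resort guard for a batch job
            logger.error("%s() failed: %s", func.__name__, err)
        return None
    return wrapper


@timeit
@error_handler
def risky_divide(a, b):
    """Hypothetical function used only to exercise the decorators."""
    return a / b
```

Stacking the decorators in this order means the timer also covers calls that fail, and `functools.wraps` keeps the original function name available for the log messages.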

In [3]:
with open('config.yaml', 'r') as file:
    config = yaml.safe_load(file)
    
    # if config['automation'] and datetime.now().day != 30:
    #     print('Exiting ETL script - script will only run on 30th of each month')
    #     sys.exit()

    # Account for different file paths when running locally vs on PythonAnywhere
    if config['local']:
        cache_filepath = config['local_cache_filepath']
    else:
        os.chdir(config['web_prefix'])
        cache_filepath = 'project_cache'
    
    # files to append to
    output_file_train = config['train']
    output_file_test = config['test']

    # Determines whether to extract all data for current year, or particular year and months
    use_curr_datetime = config['use_datetime']
    if use_curr_datetime:
        timestamp = datetime.now()
        years = [timestamp.year]
        months = list(range(1, timestamp.month + 1))
    else:
        years = config['year']
        months = config['months']

logger.info(f"{'-'*50}New run started {'-'*50}")
logger.info('Data extraction settings:')
logger.info(f'\tuse_curr_datetime: {use_curr_datetime}')
logger.info(f'\tyear(s): {years}')
logger.info(f'\tmonth(s): {months}')

# Enable caching
session = requests_cache.CachedSession(cache_filepath, backend="sqlite")

--------------------New run started ----------------------------------------------------------------------------------------------------
Data extraction settings:
	use_curr_datetime: True
	year(s): [2024]
	month(s): [1]


### Data.gov.sg API call
Details for the Data.gov.sg API call can be found at https://data.gov.sg/dataset/ckan-datastore-search
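
The internals of `datagovsg_api_call()` are not shown in this notebook. As a sketch of what a single-month request could look like, the snippet below builds a CKAN `datastore_search` URL using its standard `resource_id`, `limit`, `sort` and `filters` parameters. Filtering on an exact `month` value such as `2024-01` is an assumption about how the months and years are translated into a query, and `build_search_url` is a hypothetical helper.

```python
import json
from urllib.parse import parse_qs, urlencode, urlparse

BASE_URL = "https://data.gov.sg/api/action/datastore_search"
RESOURCE_ID = "f1765b54-a209-4718-8d38-a39237f502b3"  # resource id used in this notebook


def build_search_url(year, month, limit=10000, sort="month desc"):
    """Build a datastore_search URL for one year-month of resale records."""
    params = {
        "resource_id": RESOURCE_ID,
        "limit": limit,
        "sort": sort,
        # CKAN expects `filters` as a JSON object of exact field matches (assumed usage here)
        "filters": json.dumps({"month": f"{year}-{month:02d}"}),
    }
    return f"{BASE_URL}?{urlencode(params)}"


url = build_search_url(2024, 1)
query = parse_qs(urlparse(url).query)
print(query["filters"][0])
```

Building the query with `urlencode` keeps the JSON filter and the space in `month desc` correctly escaped.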

In [4]:
# The API now limits each call, so make one call per month instead
logger.info('Making API calls to data.gov.sg')
monthly_dfs = []
for month in months:
    temp_df = datagovsg_api_call(
        'https://data.gov.sg/api/action/datastore_search?resource_id=f1765b54-a209-4718-8d38-a39237f502b3',
        sort='month desc',
        limit=10000,
        months=[month],
        years=years)
    logger.info(f'\tData df shape received: {temp_df.shape}')
    monthly_dfs.append(temp_df)
# Concatenate once at the end; fall back to an empty dataframe if no month was requested
df = pd.concat(monthly_dfs) if monthly_dfs else pd.DataFrame()

Making API calls to data.gov.sg
datagovsg_api_call() called at 	09:11:21
datagovsg_api_call() ended at 	09:11:21 	execution time: 0.5463 seconds
	Data df shape received: (1160, 12)


In [5]:
df

,_id,month,town,flat_type,block,street_name,storey_range,floor_area_sqm,flat_model,lease_commence_date,remaining_lease,resale_price
0,169172,2024-01,ANG MO KIO,2 ROOM,116,ANG MO KIO AVE 4,07 TO 09,44,Improved,1978,53 years 06 months,288000
1,169173,2024-01,ANG MO KIO,2 ROOM,510,ANG MO KIO AVE 8,04 TO 06,44,Improved,1980,55 years 07 months,322500
2,169174,2024-01,ANG MO KIO,3 ROOM,308B,ANG MO KIO AVE 1,01 TO 03,70,Model A,2012,87 years 09 months,520000
3,169175,2024-01,ANG MO KIO,3 ROOM,308B,ANG MO KIO AVE 1,25 TO 27,70,Model A,2012,87 years 09 months,650000
4,169176,2024-01,ANG MO KIO,3 ROOM,223,ANG MO KIO AVE 1,04 TO 06,67,New Generation,1978,53 years 01 month,343800
...,...,...,...,...,...,...,...,...,...,...,...,...
1155,170327,2024-01,YISHUN,5 ROOM,504B,YISHUN ST 51,07 TO 09,113,Improved,2016,91 years 03 months,723000
1156,170328,2024-01,YISHUN,5 ROOM,602,YISHUN ST 61,07 TO 09,121,Improved,1987,62 years 05 months,688000
1157,170329,2024-01,YISHUN,5 ROOM,820,YISHUN ST 81,04 TO 06,122,Improved,1988,63 years 08 months,670002
1158,170330,2024-01,YISHUN,EXECUTIVE,356,YISHUN RING RD,04 TO 06,146,Maisonette,1988,63 years 08 months,860000


## 2. Data wrangling steps
1. Reindex the dataframe using _id (unique to every resale transaction)
2. Convert flat types into float room counts, with Executive as 5.5 rooms (extra study/balcony/bathroom)
3. Convert storey_range to avg_storey, the midpoint of the range (each range spans 3 storeys, e.g. 07 TO 09 becomes 8.0)
4. Convert resale_price and floor_area_sqm to float values
5. Convert month into datetime format, for use in detrending the time series (moving average)
6. Split the year-month value into separate year and month columns for visualisation purposes
7. Convert remaining_lease into remaining years as a float (e.g. 53 years 06 months becomes 53.5)
8. Update capitalisation and street naming conventions (for the geolocation API calls later)
9. Categorise towns into regions (North, West, East, North-East, Central), following https://www.hdb.gov.sg/about-us/history/hdb-towns-your-home
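
The implementation of `clean_df()` is not shown here. The following is a minimal sketch of how steps 2, 3 and 7 might be done on single values, consistent with the raw and cleaned tables in this notebook. The three helper names are hypothetical, and flat types other than `N ROOM` and `EXECUTIVE` are deliberately left unhandled.

```python
import re


def flat_type_to_rooms(flat_type):
    """'3 ROOM' -> 3.0, 'EXECUTIVE' -> 5.5 (step 2)."""
    cleaned = flat_type.strip().upper()
    if cleaned == "EXECUTIVE":
        return 5.5
    match = re.match(r"(\d+)\s*ROOM", cleaned)
    if match is None:
        # Other flat types are out of scope for this sketch
        raise ValueError(f"Unhandled flat type: {flat_type!r}")
    return float(match.group(1))


def storey_range_to_avg(storey_range):
    """'07 TO 09' -> 8.0, the midpoint of the range (step 3)."""
    low, high = (int(part) for part in storey_range.split(" TO "))
    return (low + high) / 2


def remaining_lease_to_years(remaining_lease):
    """'53 years 06 months' -> 53.5 (step 7)."""
    years = re.search(r"(\d+)\s*year", remaining_lease)
    months = re.search(r"(\d+)\s*month", remaining_lease)
    total_months = 12 * int(years.group(1)) if years else 0
    total_months += int(months.group(1)) if months else 0
    return total_months / 12


print(flat_type_to_rooms("EXECUTIVE"))
print(storey_range_to_avg("25 TO 27"))
print(remaining_lease_to_years("87 years 09 months"))
```

On a dataframe, each helper could be applied column-wise, for example with `df['storey_range'].map(storey_range_to_avg)`.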

In [6]:
# Data cleaning and transformation
logger.info('Cleaning data')
df = clean_df(df)
display(df.dtypes)
df

Cleaning data
clean_df() called at 	09:13:04
clean_df() ended at 	09:13:04 	execution time: 0.6841 seconds


resale_price              float64
year                        int32
month                       int32
year_month         datetime64[ns]
region                     object
town                       object
rooms                     float64
avg_storey                float64
floor_area_sqm            float64
remaining_lease           float64
address                    object
dtype: object

_id,resale_price,year,month,year_month,region,town,rooms,avg_storey,floor_area_sqm,remaining_lease,address
169172,288000.0,2024,1,2024-01-01,North-East,Ang Mo Kio,2.0,8.0,44.0,53.500000,"116, Ang Mo Kio Avenue 4"
169173,322500.0,2024,1,2024-01-01,North-East,Ang Mo Kio,2.0,5.0,44.0,55.583333,"510, Ang Mo Kio Avenue 8"
169174,520000.0,2024,1,2024-01-01,North-East,Ang Mo Kio,3.0,2.0,70.0,87.750000,"308B, Ang Mo Kio Avenue 1"
169175,650000.0,2024,1,2024-01-01,North-East,Ang Mo Kio,3.0,26.0,70.0,87.750000,"308B, Ang Mo Kio Avenue 1"
169176,343800.0,2024,1,2024-01-01,North-East,Ang Mo Kio,3.0,5.0,67.0,53.083333,"223, Ang Mo Kio Avenue 1"
...,...,...,...,...,...,...,...,...,...,...,...
170327,723000.0,2024,1,2024-01-01,North,Yishun,5.0,8.0,113.0,91.250000,"504B, Yishun Street 51"
170328,688000.0,2024,1,2024-01-01,North,Yishun,5.0,8.0,121.0,62.416667,"602, Yishun Street 61"
170329,670002.0,2024,1,2024-01-01,North,Yishun,5.0,5.0,122.0,63.666667,"820, Yishun Street 81"
170330,860000.0,2024,1,2024-01-01,North,Yishun,5.5,5.0,146.0,63.666667,"356, Yishun Ring Road"


## 3. Feature Engineering (Geodata)

Lastly, location plays a major role in house pricing, so the following geodata features are added:

3.1 Latitude, longitude, and postal code of each flat

3.2 Distance to city center

3.3 MRT station locations

3.4 Nearest MRT station and travelling time

### 3.1 Latitude & longitude from address
Using the block and street name of each flat, I queried the OneMap API for its latitude, longitude, and postal code (https://www.onemap.gov.sg/docs).
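
`get_location_data()` is not shown in this notebook, so the snippet below is only a sketch of how one address lookup could be built and parsed. The endpoint path, the query parameters and the `LATITUDE` / `LONGITUDE` / `POSTAL` field names reflect my reading of the OneMap search documentation and should be checked against it. Both helper names are hypothetical, and the sample payload is hand-written using values that appear in the tables in this notebook.

```python
from urllib.parse import urlencode

# Assumed endpoint; verify against the OneMap documentation before use
SEARCH_URL = "https://www.onemap.gov.sg/api/common/elastic/search"


def build_onemap_search_url(address):
    """Build a search URL for a single address string."""
    params = {
        "searchVal": address,
        "returnGeom": "Y",       # ask for coordinates
        "getAddrDetails": "Y",   # ask for postal code and address details
        "pageNum": 1,
    }
    return f"{SEARCH_URL}?{urlencode(params)}"


def parse_first_result(payload):
    """Return (latitude, longitude, postal_code) from the first hit, or None if there is none."""
    results = payload.get("results") or []
    if not results:
        return None
    top = results[0]
    return float(top["LATITUDE"]), float(top["LONGITUDE"]), top["POSTAL"]


# Hand-written sample shaped like the assumed response format
sample = {
    "found": 1,
    "results": [
        {"LATITUDE": "1.36200453938712", "LONGITUDE": "103.853879910407", "POSTAL": "560406"}
    ],
}
print(build_onemap_search_url("406, Ang Mo Kio Avenue 10"))
print(parse_first_result(sample))
```

Returning `None` for an empty result list lets the caller decide how to treat addresses that OneMap cannot resolve.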

In [12]:
logger.info('Getting geolocations')
geo_data_df = get_location_data(df[['address']], verbose=1)
display(geo_data_df.dtypes)
geo_data_df

get_location_data() called at 	01:22:02
get_location_data() ended at 	01:22:02 	execution time: 43.3675 seconds


lat_long        object
postal_code     object
latitude       float64
longitude      float64
numpy_array     object
dtype: object

_id,lat_long,postal_code,latitude,longitude,numpy_array
143396,"1.36200453938712,103.853879910407",560406,1.362005,103.853880,"[1.36200453938712, 103.853879910407]"
143397,"1.36790849360635,103.84771408812",560323,1.367908,103.847714,"[1.36790849360635, 103.84771408812]"
143398,"1.36622707120636,103.850085858983",560314,1.366227,103.850086,"[1.36622707120636, 103.850085858983]"
143399,"1.36622707120636,103.850085858983",560314,1.366227,103.850086,"[1.36622707120636, 103.850085858983]"
143400,"1.37400071781295,103.83643153142",560170,1.374001,103.836432,"[1.37400071781295, 103.83643153142]"
...,...,...,...,...,...
161855,"1.33454683171677,103.845077697814",310147,1.334547,103.845078,"[1.33454683171677, 103.845077697814]"
161856,"1.33159005591995,103.851295104405",310193,1.331590,103.851295,"[1.33159005591995, 103.851295104405]"
161857,"1.33716136352623,103.858353639387",312010,1.337161,103.858354,"[1.33716136352623, 103.858353639387]"
161858,"1.33472713626235,103.849822984337",315079,1.334727,103.849823,"[1.33472713626235, 103.849822984337]"


### 3.2 Distance to city center

The central district of Singapore has the highest housing prices, and properties nearer to the city center tend to command higher prices.

The distance from each flat to the city center (Marina Bay) is therefore added as a new feature, to test whether it is significant in model building.
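
The notebook computes this feature with `distance_to(..., dist_type='geodesic')`, whose implementation is not shown. As a rough illustration of the idea, the sketch below uses the haversine formula, which is a spherical approximation and not the geodesic calculation itself, so its results can differ slightly from the values in the table. `haversine_km` is a hypothetical helper. The Marina Bay coordinates are the ones printed by the cell that follows, and the flat coordinates are taken from the geolocation table above.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius


def haversine_km(point_a, point_b):
    """Great-circle distance in km between two (latitude, longitude) points in degrees."""
    lat1, lon1 = map(math.radians, point_a)
    lat2, lon2 = map(math.radians, point_b)
    dlat, dlon = lat2 - lat1, lon2 - lon1
    a = math.sin(dlat / 2) ** 2 + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


marina_bay = (1.28466204, 103.86100592)
flat = (1.36200453938712, 103.853879910407)  # _id 143396 in the geolocation table
distance = haversine_km(flat, marina_bay)
print(round(distance, 2))
```

For this flat the spherical result comes out at roughly 8.6 km, close to the 8.59 km geodesic value reported in the table.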

In [14]:
logger.info('Getting distances to city center (Marina Bay)')
dist_to_marina_bay = distance_to(geo_data_df['numpy_array'], 'Marina Bay', dist_type='geodesic', verbose=1)
dist_to_marina_bay = pd.Series(dist_to_marina_bay, name='dist_to_marina_bay')

logger.info('Combining geolocation data to main')
df = pd.concat([df, dist_to_marina_bay, geo_data_df['latitude'], geo_data_df['longitude'], geo_data_df['postal_code']], axis=1)
df

Coordinates of Marina Bay : [  1.28466204 103.86100592]


_id,resale_price,year,month,timeseries_month,region,town,rooms,avg_storey,floor_area_sqm,remaining_lease,address,dist_to_marina_bay,latitude,longitude,postal_code
143396,267000.0,2023,1,2023-01-01,North-East,Ang Mo Kio,2.0,2.0,44.0,55.416667,"406, Ang Mo Kio Avenue 10",8.59,1.362005,103.853880,560406
143397,300000.0,2023,1,2023-01-01,North-East,Ang Mo Kio,2.0,5.0,49.0,53.500000,"323, Ang Mo Kio Avenue 3",9.32,1.367908,103.847714,560323
143398,280000.0,2023,1,2023-01-01,North-East,Ang Mo Kio,2.0,5.0,44.0,54.083333,"314, Ang Mo Kio Avenue 3",9.10,1.366227,103.850086,560314
143399,282000.0,2023,1,2023-01-01,North-East,Ang Mo Kio,2.0,8.0,44.0,54.083333,"314, Ang Mo Kio Avenue 3",9.10,1.366227,103.850086,560314
143400,289800.0,2023,1,2023-01-01,North-East,Ang Mo Kio,2.0,2.0,45.0,62.083333,"170, Ang Mo Kio Avenue 4",10.25,1.374001,103.836432,560170
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
161855,820000.0,2023,9,2023-09-01,Central,Toa Payoh,4.0,8.0,86.0,81.416667,"147, Lor 2 Toa Payoh",5.79,1.334547,103.845078,310147
161856,575000.0,2023,9,2023-09-01,Central,Toa Payoh,4.0,20.0,84.0,49.916667,"193, Lor 4 Toa Payoh",5.30,1.331590,103.851295,310193
161857,708000.0,2023,9,2023-09-01,Central,Toa Payoh,4.0,11.0,104.0,73.833333,"10B, Lor 7 Toa Payoh",5.81,1.337161,103.858354,312010
161858,770000.0,2023,9,2023-09-01,Central,Toa Payoh,4.0,20.0,76.0,84.666667,"79E, Toa Payoh Ctrl",5.67,1.334727,103.849823,315079


### 3.3 MRT Locations
The locations of all MRT stations were also obtained using the OneMap API and saved locally as a JSON file.

The JSON file is loaded and converted to numpy arrays so that matrix operations can be used.

In [16]:
# Convert coordinates into numpy arrays
mrt_coordinates_dict = load_mrt_coordinates()
mrt_stations = np.array(list(mrt_coordinates_dict.keys()))
mrt_coordinates = np.array(list(mrt_coordinates_dict.values()))

get_mrt_coordinates() called at 	01:22:52
get_mrt_coordinates() ended at 	01:22:52 	execution time: 0.0006 seconds


### 3.4 Nearest MRT stations and Minimum distance/time
* Using matrix operations, we can find the nearest MRT station to each flat by straight-line distance
* Then use route_api_call() (OneMap routing) to get the travel distance/time to the MRT stations
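
`find_nearest_stations()` is not shown in this notebook, and it works on numpy arrays. The sketch below illustrates the same nearest-station lookup with plain Python loops so that it runs without dependencies; it is a stand-in, not the vectorised version. `approx_distance_km` and `find_nearest` are hypothetical names, the distance is an equirectangular approximation, and the station coordinates are approximate, hand-entered values for illustration only (not taken from the project's JSON file).

```python
import math


def approx_distance_km(a, b):
    """Equirectangular approximation of the distance in km; adequate over a few kilometres."""
    mean_lat = math.radians((a[0] + b[0]) / 2)
    dy = math.radians(b[0] - a[0])
    dx = math.radians(b[1] - a[1]) * math.cos(mean_lat)
    return 6371.0088 * math.hypot(dx, dy)


def find_nearest(flat, stations, n=1):
    """Return the n closest (station, distance_km) pairs, nearest first."""
    ranked = sorted(
        (approx_distance_km(flat, coords), name) for name, coords in stations.items()
    )
    return [(name, round(dist, 2)) for dist, name in ranked[:n]]


# Approximate coordinates, for illustration only
stations = {
    "Ang Mo Kio MRT": (1.3700, 103.8495),
    "Mayflower MRT": (1.3716, 103.8366),
    "Toa Payoh MRT": (1.3327, 103.8474),
}

flat_a = (1.367908, 103.847714)  # _id 143397 in the geolocation table
flat_b = (1.374001, 103.836432)  # _id 143400 in the geolocation table
print(find_nearest(flat_a, stations, n=2))
print(find_nearest(flat_b, stations))
```

With numpy, the same ranking can be done for all flats at once by computing a flats-by-stations distance matrix and sorting each row, which is presumably what the matrix operations refer to.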

In [18]:
n_nearest_stations = 1
# Matrix operations to find nearest MRT stations for each row
logger.info(f'Finding nearest stations: n={n_nearest_stations}')
nearest_stations = geo_data_df.apply(
    find_nearest_stations, mrt_stations=mrt_stations, mrt_coordinates=mrt_coordinates,
    n_nearest_stations=n_nearest_stations, axis=1, verbose=0)
# One name column and one distance column per requested station
station_cols = [f'nearest_station_{x}' for x in range(n_nearest_stations)]
distance_cols = [f'dist_to_station_{x}' for x in range(n_nearest_stations)]
nearest_stations_df = pd.DataFrame(
    nearest_stations.tolist(), index=geo_data_df.index, columns=station_cols + distance_cols)
nearest_stations_df

_id,nearest_station_0,dist_to_station_0
143396,Ang Mo Kio MRT,1.00
143397,Ang Mo Kio MRT,0.30
143398,Ang Mo Kio MRT,0.41
143399,Ang Mo Kio MRT,0.41
143400,Mayflower MRT,0.28
...,...,...
161855,Toa Payoh MRT,0.34
161856,Toa Payoh MRT,0.44
161857,Braddell MRT,1.34
161858,Toa Payoh MRT,0.35


In [19]:
df = pd.concat([df, nearest_stations_df], axis=1)
display(df.dtypes)
df

resale_price                 float64
year                           int32
month                          int32
timeseries_month      datetime64[ns]
region                        object
town                          object
rooms                        float64
avg_storey                   float64
floor_area_sqm               float64
remaining_lease              float64
address                       object
dist_to_marina_bay           float64
latitude                     float64
longitude                    float64
postal_code                   object
nearest_station_0             object
dist_to_station_0            float64
dtype: object

_id,resale_price,year,month,timeseries_month,region,town,rooms,avg_storey,floor_area_sqm,remaining_lease,address,dist_to_marina_bay,latitude,longitude,postal_code,nearest_station_0,dist_to_station_0
143396,267000.0,2023,1,2023-01-01,North-East,Ang Mo Kio,2.0,2.0,44.0,55.416667,"406, Ang Mo Kio Avenue 10",8.59,1.362005,103.853880,560406,Ang Mo Kio MRT,1.00
143397,300000.0,2023,1,2023-01-01,North-East,Ang Mo Kio,2.0,5.0,49.0,53.500000,"323, Ang Mo Kio Avenue 3",9.32,1.367908,103.847714,560323,Ang Mo Kio MRT,0.30
143398,280000.0,2023,1,2023-01-01,North-East,Ang Mo Kio,2.0,5.0,44.0,54.083333,"314, Ang Mo Kio Avenue 3",9.10,1.366227,103.850086,560314,Ang Mo Kio MRT,0.41
143399,282000.0,2023,1,2023-01-01,North-East,Ang Mo Kio,2.0,8.0,44.0,54.083333,"314, Ang Mo Kio Avenue 3",9.10,1.366227,103.850086,560314,Ang Mo Kio MRT,0.41
143400,289800.0,2023,1,2023-01-01,North-East,Ang Mo Kio,2.0,2.0,45.0,62.083333,"170, Ang Mo Kio Avenue 4",10.25,1.374001,103.836432,560170,Mayflower MRT,0.28
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
161855,820000.0,2023,9,2023-09-01,Central,Toa Payoh,4.0,8.0,86.0,81.416667,"147, Lor 2 Toa Payoh",5.79,1.334547,103.845078,310147,Toa Payoh MRT,0.34
161856,575000.0,2023,9,2023-09-01,Central,Toa Payoh,4.0,20.0,84.0,49.916667,"193, Lor 4 Toa Payoh",5.30,1.331590,103.851295,310193,Toa Payoh MRT,0.44
161857,708000.0,2023,9,2023-09-01,Central,Toa Payoh,4.0,11.0,104.0,73.833333,"10B, Lor 7 Toa Payoh",5.81,1.337161,103.858354,312010,Braddell MRT,1.34
161858,770000.0,2023,9,2023-09-01,Central,Toa Payoh,4.0,20.0,76.0,84.666667,"79E, Toa Payoh Ctrl",5.67,1.334727,103.849823,315079,Toa Payoh MRT,0.35
