# HDB Resale Price Predictor & Visualisation

This project aims to create a data pipeline, using the available APIs (Data.gov.sg and OneMap), to build a web-based application for:
1. HDB Price visualisation
2. HDB Price prediction

The prototype aims to read the latest data directly from data.gov.sg and perform ETL (Extract, Transform, and Load) into a local or web database of choice.
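
As a rough illustration of the load step, the sketch below writes a few records into a local SQLite database using only the standard library. The table name `resale`, the column subset, and the `load_records` helper are assumptions made for this example; the project itself leaves the choice of database open.

```python
import sqlite3

def load_records(conn, records):
    """Insert resale records into a hypothetical `resale` table, skipping rows already loaded."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS resale ("
        "_id INTEGER PRIMARY KEY, month TEXT, town TEXT, resale_price REAL)"
    )
    # INSERT OR IGNORE keeps the load idempotent: re-running it does not duplicate rows
    conn.executemany(
        "INSERT OR IGNORE INTO resale (_id, month, town, resale_price) "
        "VALUES (:_id, :month, :town, :resale_price)",
        records,
    )
    conn.commit()
    return conn.execute("SELECT COUNT(*) FROM resale").fetchone()[0]

# Two rows in the shape returned by data.gov.sg (values taken from the sample output in this notebook)
sample = [
    {"_id": 150152, "month": "2023-04", "town": "BEDOK", "resale_price": 335000},
    {"_id": 150174, "month": "2023-04", "town": "BEDOK", "resale_price": 365000},
]
conn = sqlite3.connect(":memory:")  # in-memory database, for illustration only
row_count = load_records(conn, sample)
row_count_after_rerun = load_records(conn, sample)
```

Using `_id` as the primary key means the same month can be fetched again without creating duplicate rows.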

In [1]:
import requests
from requests.exceptions import HTTPError
import pandas as pd
from time import sleep
from pprint import pprint
import json

def get_token(location):
    '''
    Function to check if API token is still valid and updates API token if outdated
    Returns the API token, or None if the credentials could not be read
    '''
    data = {}
    try:
        with open(location, 'r+') as f:
            data = json.load(f)
            response = requests.post("https://developers.onemap.sg/privateapi/auth/post/getToken", data=data)
            response.raise_for_status()
            token = response.json()
            if token['access_token'] != data.get('access_token'):
                print(f"New token found: {token['access_token']}")
                data['access_token'] = token['access_token']
                data['expiry_timestamp'] = token['expiry_timestamp']
                # Rewind and truncate so the file is overwritten rather than appended to
                f.seek(0)
                json.dump(data, f, indent=4)
                f.truncate()
                print('Updated')
    except Exception as err:
        print(err)
    return data.get('access_token')

credentials = get_token("venv/onemap.json")

def datagovsg_api_call(url, sort='month desc', limit=100, years=("2023",)):
    '''
    Function to build the API call and construct the pandas dataframe
    Inputs:
        url: url for API, with resource_id parameters
        sort: field, by ascending/desc
        limit: maximum entries (the data.gov.sg API defaults to 100 if not specified)
        years: list or tuple of years for which data is required
    Returns a pandas dataframe of the data, or None if the call fails
    '''
    month_dict = '{"month":['
    for year in years:
        for month in range(1,13):
            month_dict = month_dict + f'"{year}-{str(month).zfill(2)}", '
    month_dict = month_dict[:-2] 
    month_dict = month_dict + ']}'
    url = url+f'&sort={sort}&filters={month_dict}'
    if limit:
        print(f'Call limit : {limit}')
        url = url+f'&limit={limit}'
    pprint(f'API call = {url}')
    try:
        response = requests.get(url)
        response.raise_for_status()
        data = response.json()
        df = pd.DataFrame(data['result']['records'])
    except HTTPError as http_err:
        print(f'HTTP error occurred: {http_err}')
    except Exception as err:
        print(f'Other error occurred: {err}')
    else:
        return df

New token found: eyJ0eXAiOiJKV1QiLCJhbGciOiJIUzI1NiJ9.eyJzdWIiOjEwMjIyLCJ1c2VyX2lkIjoxMDIyMiwiZW1haWwiOiJsaW1zaWVubG9uZ0BnbWFpbC5jb20iLCJmb3JldmVyIjpmYWxzZSwiaXNzIjoiaHR0cDpcL1wvb20yLmRmZS5vbmVtYXAuc2dcL2FwaVwvdjJcL3VzZXJcL3Nlc3Npb24iLCJpYXQiOjE2ODI5MzIxMTIsImV4cCI6MTY4MzM2NDExMiwibmJmIjoxNjgyOTMyMTEyLCJqdGkiOiJhZWM2ZmUzOTJjNzlmZWQ5NTY2NzgzNTRiYWIxMGVmZSJ9.hCWNmunJFtUBNyEyqXYMokDRCAGx-Bg9zPaM8XcyurI
Updated
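
The filter string in `datagovsg_api_call` is assembled by hand. As a possible alternative, the sketch below builds the same query parameters with `json.dumps` and `urllib.parse.urlencode`, which take care of quoting and URL-encoding. The helper name `build_query` is hypothetical; the parameter names (`resource_id`, `sort`, `filters`, `limit`) are the ones already used in the call above.

```python
import json
from urllib.parse import urlencode, parse_qs

def build_query(resource_id, years, sort="month desc", limit=100):
    """Return the query string for a datastore_search call covering every month of `years`."""
    months = [f"{year}-{month:02d}" for year in years for month in range(1, 13)]
    params = {
        "resource_id": resource_id,
        "sort": sort,
        "filters": json.dumps({"month": months}),  # serialised as JSON, like the hand-built string
    }
    if limit:
        params["limit"] = limit
    return urlencode(params)  # percent-encodes spaces, quotes and braces

query = build_query("f1765b54-a209-4718-8d38-a39237f502b3", ["2023"])

# Decode the query string again to check what the server would receive
decoded = parse_qs(query)
months = json.loads(decoded["filters"][0])["month"]
```

The resulting string can be appended to `https://data.gov.sg/api/action/datastore_search?` in place of the manual concatenation.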


In [2]:
df = datagovsg_api_call('https://data.gov.sg/api/action/datastore_search?resource_id=f1765b54-a209-4718-8d38-a39237f502b3')
df

Call limit : 100
('API call = '
 'https://data.gov.sg/api/action/datastore_search?resource_id=f1765b54-a209-4718-8d38-a39237f502b3&sort=month '
 'desc&filters={"month":["2023-01", "2023-02", "2023-03", "2023-04", '
 '"2023-05", "2023-06", "2023-07", "2023-08", "2023-09", "2023-10", "2023-11", '
 '"2023-12"]}&limit=100')


| | town | flat_type | flat_model | floor_area_sqm | street_name | resale_price | month | remaining_lease | lease_commence_date | storey_range | _id | block |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | BEDOK | 3 ROOM | Improved | 60 | BEDOK NTH RD | 335000 | 2023-04 | 62 years 01 month | 1986 | 04 TO 06 | 150152 | 77 |
| 1 | BEDOK | 3 ROOM | New Generation | 68 | BEDOK RESERVOIR RD | 365000 | 2023-04 | 57 years 04 months | 1981 | 04 TO 06 | 150174 | 709 |
| 2 | ANG MO KIO | 4 ROOM | New Generation | 92 | ANG MO KIO AVE 3 | 488000 | 2023-04 | 54 years 06 months | 1978 | 07 TO 09 | 150116 | 119 |
| 3 | ANG MO KIO | 5 ROOM | Model A | 134 | ANG MO KIO AVE 4 | 635000 | 2023-04 | 72 years 03 months | 1996 | 04 TO 06 | 150134 | 618 |
| 4 | BEDOK | 3 ROOM | New Generation | 68 | BEDOK NTH AVE 1 | 375000 | 2023-04 | 56 years 03 months | 1980 | 10 TO 12 | 150145 | 548 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 95 | ANG MO KIO | 5 ROOM | Improved | 119 | ANG MO KIO AVE 6 | 792000 | 2023-04 | 56 years 02 months | 1980 | 07 TO 09 | 150136 | 716 |
| 96 | BEDOK | 3 ROOM | Improved | 59 | BEDOK NTH RD | 325000 | 2023-04 | 53 years 10 months | 1978 | 04 TO 06 | 150154 | 74 |
| 97 | ANG MO KIO | 5 ROOM | Improved | 117 | ANG MO KIO AVE 8 | 728000 | 2023-04 | 56 years 07 months | 1980 | 04 TO 06 | 150137 | 710 |
| 98 | BEDOK | 3 ROOM | Improved | 59 | BEDOK NTH RD | 338000 | 2023-04 | 53 years 09 months | 1978 | 01 TO 03 | 150156 | 76 |


In [15]:
# from dataprep.eda import create_report
# create_report(df).show()

## Data wrangling and feature engineering steps

1. Reindexed the dataframe using `_id` (unique to every resale transaction)
2. Categorised towns into regions (North, West, East, North-East, Central) based on HDB's categorisation https://www.hdb.gov.sg/about-us/history/hdb-towns-your-home
3. Converted room types into float values, with Executive as 4.5 rooms (extra study/balcony) and Multigeneration as 6 rooms
4. Converted floor area to float values
5. Converted month into datetime format, so that the time series can later be detrended using a moving average
6. Converted storey range to `max_storey`: the exact floor cannot be determined, so the highest floor in the range is used (every range spans 3 storeys)
7. Converted remaining lease into remaining years (float)
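
To make steps 3, 6, and 7 concrete, here is a small standalone sketch applied to one raw record from the sample output above. The function names are illustrative only; the notebook's own implementation lives in `clean_df`, which stores the remaining lease in years.

```python
def rooms_from_flat_type(flat_type):
    """Step 3: types starting with 'E' (Executive) map to 4.5, 'M' (Multigeneration) to 6.0."""
    first = flat_type[0].upper()
    if first == "E":
        return 4.5
    if first == "M":
        return 6.0
    return float(first)  # e.g. '3 ROOM' -> 3.0

def max_storey(storey_range):
    """Step 6: take the upper bound of a range such as '04 TO 06'."""
    return int(storey_range.split()[-1])

def lease_in_years(remaining_lease):
    """Step 7: '62 years 01 month' -> years as a float; the months part is optional."""
    parts = remaining_lease.split()
    years = float(parts[0])
    if len(parts) > 2:
        years += float(parts[2]) / 12
    return years

record = {"flat_type": "3 ROOM", "storey_range": "04 TO 06", "remaining_lease": "62 years 01 month"}
features = (
    rooms_from_flat_type(record["flat_type"]),
    max_storey(record["storey_range"]),
    round(lease_in_years(record["remaining_lease"]), 3),
)
print(features)  # (3.0, 6, 62.083)
```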

Lastly, location plays a huge role in house pricing, so the following location features were added:
1. Using street name and block, I used the OneMap API to obtain the latitude, longitude, and postal code of each flat https://www.onemap.gov.sg/docs
2. The locations of all MRT stations were also obtained using the OneMap API and saved locally as a JSON file
3. Using these two datasets, I am able to determine the nearest MRT station to each flat
4. The minimum travelling time (walk and public transport) to the nearest MRT will be an additional feature of the dataset
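
One simple way to pick the nearest station is by straight-line (haversine) distance, sketched below. This is an assumption about how the lookup could work, not necessarily the method used in the final pipeline, and the station coordinates here are hand-rounded placeholders rather than OneMap results.

```python
from math import asin, cos, radians, sin, sqrt

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, long) points."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(a))  # 6371 km is the mean Earth radius

def nearest_station(lat, lon, stations):
    """Return (name, distance_km) of the closest entry in a {name: (lat, long)} mapping."""
    name = min(stations, key=lambda s: haversine_km(lat, lon, *stations[s]))
    return name, haversine_km(lat, lon, *stations[name])

# Placeholder coordinates, rounded by hand for illustration only
stations = {
    "Bedok MRT": (1.3240, 103.9300),
    "Ang Mo Kio MRT": (1.3700, 103.8495),
}
name, distance_km = nearest_station(1.3268, 103.9190, stations)
```

The same function works with the saved JSON file, since the coordinates are read back as two-element lists. Straight-line distance only selects the candidate station; the travelling time itself still has to come from a routing call.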

In [16]:
def clean_df(df):
    '''
    function to clean the raw dataframe
    '''
    # Start
    # set index to overall id
    df.set_index('_id', inplace=True)
        
    # Create feature "rooms", "max_storey"
    def categorise_rooms(flat_type):
        '''
        Helper function for categorising number of rooms
        '''
        if flat_type[0] == 'E':
            return 4.5
        elif flat_type[0] == 'M':
            return 6.0
        else:
            return float(flat_type[0])
        
    df['rooms'] = df['flat_type'].apply(categorise_rooms)
    df['max_storey'] = df['storey_range'].apply(lambda x: int(x[-2:]))

    # Change dtypes
    df['lease_commence_date'] = df['lease_commence_date'].astype('int')
    df['resale_price'] = df['resale_price'].astype('float')
    df['floor_area_sqm'] = df['floor_area_sqm'].astype('float')
    df['month'] = pd.to_datetime(df['month'], format="%Y-%m")  # raw values look like "2023-04"
    
    # Calculate remaining_lease
    def year_month_to_year(remaining_lease):
        '''
        Helper function to change year & months, into years (float)
        '''
        remaining_lease = remaining_lease.split(' ')
        if len(remaining_lease) > 2:
            year = float(remaining_lease[0]) + float(remaining_lease[2])/12
        else:
            year = float(remaining_lease[0])
        return year
    
    df['remaining_lease'] = df['remaining_lease'].apply(year_month_to_year)

    # Change capitalization of strings
    for column in df.columns:
        if df[column].dtype == 'O':
            df[column] = df[column].str.title()
    
    # Update address abbreviations for onemap API call
    # Word boundaries (\b) stop a pattern from matching inside a longer word, e.g. "St" in "Stirling"
    df['original_street_name'] = df['street_name']
    abbreviations = {r'\bSth\b':'South', 
                     r'\bSt\.':'Saint', 
                     r'\bSt\b':'Street',
                     r'\bNth\b':'North', 
                     r'\bAve\b':'Avenue', 
                     r'\bDr\b':'Drive', 
                     r'\bRd\b':'Road'}
    for abbreviation, full in abbreviations.items():
        df['street_name'] = df['street_name'].str.replace(abbreviation, full, regex=True)
    
    # Categorise town regions
    town_regions = {'Sembawang' : 'North',
                'Woodlands' : 'North',
                'Yishun' : 'North',
                'Ang Mo Kio' : 'North-East',
                'Hougang' : 'North-East',
                'Punggol' : 'North-East',
                'Sengkang' : 'North-East',
                'Serangoon' : 'North-East',
                'Bedok' : 'East',
                'Pasir Ris' : 'East',
                'Tampines' : 'East',
                'Bukit Batok' : 'West',
                'Bukit Panjang' : 'West',
                'Choa Chu Kang' : 'West',
                'Clementi' : 'West',
                'Jurong East' : 'West',
                'Jurong West' : 'West',
                'Tengah' : 'West',
                'Bishan' : 'Central',
                'Bukit Merah' : 'Central',
                'Bukit Timah' : 'Central',
                'Central Area' : 'Central',
                'Geylang' : 'Central',
                'Kallang/Whampoa' : 'Central',
                'Marine Parade' : 'Central',
                'Queenstown' : 'Central',
                'Toa Payoh' : 'Central'}      
    df['region'] = df['town'].apply(lambda x: town_regions[x])

    # Getting latitude, longitude, postal code
    def get_lat_long(row):
        '''
        API call to get latitude, longitude, and postal code for one row of the dataframe
        Incorporates sleep time to not exceed a max of 250 calls per min
        '''
        sleep(0.25)  # 60 s / 250 calls = 0.24 s minimum spacing between calls
        address = row['block'] + ', ' + row['street_name']
        try:
            call = f'https://developers.onemap.sg/commonapi/search?searchVal={address}&returnGeom=Y&getAddrDetails=Y'
            response = requests.get(call)
            response.raise_for_status()
            data = response.json()
            return data['results'][0]['LATITUDE'] + ',' + data['results'][0]['LONGITUDE'] + ' ' + data['results'][0]['POSTAL']
        except HTTPError as http_err:
            print(f'HTTP error occurred during get_lat_long: {http_err}')
        except Exception as err:
            print(f'Error occurred during get_lat_long: {err} on the following call:')
            pprint(call)

    df['position'] = df.apply(get_lat_long, axis=1)
    try:
        df['postal_code'] = df['position'].apply(lambda x: x.split()[1]).astype('int')
        df['lat_long'] = df['position'].apply(lambda x: x.split()[0])
        # I need another split here to get floats
        df['lat'] = df['lat_long'].apply(lambda x: float(x.split(',')[0]))
        df['long'] = df['lat_long'].apply(lambda x: float(x.split(',')[1]))
        
    except Exception as err:
        print(f'Error splitting postal_code from lat_long: {err}')
    else:
        # Reorder columns
        df = df[['resale_price', 'month', 'region', 'town', 'rooms', 'max_storey', 'floor_area_sqm', 'remaining_lease',
                'lat_long', 'lat', 'long', 'postal_code']]
                # Unused columns - 'block', 'street_name', 'original_street_name', 'lease_commence_date', 'flat_model', 'storey_range', 'flat_type'
    return df
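
Many resale transactions share the same block and street, so one option is to geocode each distinct address only once and space out the remaining calls. The sketch below shows that idea under stated assumptions: `Throttle` and `geocode_once` are hypothetical helpers, and `fake_geocode` stands in for the real OneMap request so the example runs offline.

```python
import time
from functools import lru_cache

class Throttle:
    """Block so that successive calls are at least `min_interval` seconds apart."""
    def __init__(self, min_interval):
        self.min_interval = min_interval
        self._last = None

    def wait(self):
        now = time.monotonic()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                time.sleep(remaining)
        self._last = time.monotonic()

throttle = Throttle(60 / 250)  # 250 calls per minute -> 0.24 s between calls
calls = []                     # records which addresses reached the (fake) API

def fake_geocode(address):
    """Stand-in for the OneMap search request; returns a dummy 'lat,long postal' string."""
    calls.append(address)
    return "0.0,0.0 000000"

@lru_cache(maxsize=None)
def geocode_once(address):
    """Look each distinct address up only once; repeats are served from the cache."""
    throttle.wait()
    return fake_geocode(address)

addresses = ["77, Bedok North Road", "709, Bedok Reservoir Road", "77, Bedok North Road"]
positions = [geocode_once(a) for a in addresses]
```

Note that `lru_cache` also caches a failed lookup if the function returns `None`, so a real version may prefer to raise on failure instead.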

In [17]:
df = clean_df(df)
df.dtypes

resale_price              float64
month              datetime64[ns]
region                     object
town                       object
rooms                     float64
max_storey                  int64
floor_area_sqm            float64
remaining_lease           float64
lat_long                   object
lat                       float64
long                      float64
postal_code                 int64
dtype: object

In [18]:
# df.to_csv('check.csv')
df

| _id | resale_price | month | region | town | rooms | max_storey | floor_area_sqm | remaining_lease | lat_long | lat | long | postal_code |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 150166 | 360000.0 | 2023-04-01 | East | Bedok | 3.0 | 9 | 74.0 | 60.083333 | 1.32684846948333,103.919015573079 | 1.326848 | 103.919016 | 460055 |
| 150184 | 768888.0 | 2023-04-01 | East | Bedok | 4.0 | 9 | 93.0 | 94.750000 | 1.3304987981324,103.939995686405 | 1.330499 | 103.939996 | 462187 |
| 150131 | 375000.0 | 2023-04-01 | East | Bedok | 3.0 | 12 | 68.0 | 56.250000 | 1.33131021834565,103.92685391703 | 1.331310 | 103.926854 | 460548 |
| 150090 | 353000.0 | 2023-04-01 | North-East | Ang Mo Kio | 3.0 | 9 | 67.0 | 59.833333 | 1.36275784702216,103.858015323667 | 1.362758 | 103.858015 | 560474 |
| 150175 | 495000.0 | 2023-04-01 | East | Bedok | 4.0 | 12 | 92.0 | 54.250000 | 1.32953680475668,103.940406562732 | 1.329537 | 103.940407 | 460082 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 150148 | 407000.0 | 2023-04-01 | East | Bedok | 3.0 | 12 | 68.0 | 54.750000 | 1.32696621417456,103.928135771625 | 1.326966 | 103.928136 | 460419 |
| 150150 | 368000.0 | 2023-04-01 | East | Bedok | 3.0 | 12 | 68.0 | 56.166667 | 1.33082005919243,103.925595840288 | 1.330820 | 103.925596 | 460546 |
| 150168 | 440000.0 | 2023-04-01 | East | Bedok | 3.0 | 3 | 68.0 | 92.666667 | 1.32747242201869,103.923868263972 | 1.327472 | 103.923868 | 462808 |
| 150152 | 308000.0 | 2023-04-01 | East | Bedok | 3.0 | 12 | 59.0 | 54.000000 | 1.33188000305479,103.932389359044 | 1.331880 | 103.932389 | 460504 |


In [18]:
def update_mrt_coordinates(mrt_stations=None):
    '''
    Function to API call for MRT station coordinates
    Input: list of mrt station names, default to All stations if nothing is given
    '''
    if not mrt_stations:
        mrt_stations = ['Admiralty MRT', 'Aljunied MRT', 'Ang Mo Kio MRT', 'Bakau LRT', 'Bangkit LRT', 'Bartley MRT', 'Bayfront MRT',
                        'Bayshore MRT', 'Beauty World MRT', 'Bedok MRT', 'Bedok North MRT', 'Bedok Reservoir MRT', 'Bencoolen MRT',
                        'Bendemeer MRT', 'Bishan MRT', 'Boon Keng MRT', 'Boon Lay MRT', 'Botanic Gardens MRT', 'Braddell MRT',
                        'Bras Basah MRT', 'Buangkok MRT', 'Bugis MRT', 'Bukit Batok MRT', 'Bukit Brown MRT', 'Bukit Gombak MRT',
                        'Bukit Panjang MRT', 'Buona Vista MRT', 'Caldecott MRT', 'Cashew MRT', 'Changi Airport MRT',
                        'Chinatown MRT', 'Chinese Garden MRT', 'Choa Chu Kang MRT', 'City Hall MRT', 'Clarke Quay MRT',
                        'Clementi MRT', 'Commonwealth MRT', 'Compassvale LRT', 'Cove LRT', 'Dakota MRT', 'Dhoby Ghaut MRT',
                        'Downtown MRT', 'Xilin MRT', 'Tampines East MRT', 'Mayflower MRT', 'Upper Thomson MRT',
                        'Lentor MRT', 'Woodlands North MRT', 'Woodlands South MRT', 'Esplanade MRT', 'Eunos MRT',
                        'Expo MRT', 'Fajar LRT', 'Farmway LRT', 'Farrer Park MRT', 'Fort Canning MRT',
                        'Gardens by the Bay MRT', 'Geylang Bahru MRT', 'HarbourFront MRT', 'Haw Par Villa MRT', 'Hillview MRT',
                        'Holland Village MRT', 'Hougang MRT', 'Jalan Besar MRT', 'Joo Koon MRT', 'Jurong East MRT',
                        'Jurong West MRT', 'Kadaloor LRT', 'Kaki Bukit MRT', 'Kallang MRT', 'Kembangan MRT', 'Keppel MRT',
                        'King Albert Park MRT', 'Kovan MRT', 'Kranji MRT', 'Labrador Park MRT', 'Lakeside MRT', 'Lavender MRT',
                        'Layar LRT', 'Little India MRT', 'Lorong Chuan MRT', 'MacPherson MRT', 'Marina Bay MRT', 'Marina South Pier MRT',
                        'Marsiling MRT', 'Marymount MRT', 'Mattar MRT', 'Meridian LRT', 'Mountbatten MRT',
                        'Newton MRT', 'Nibong LRT', 'Nicoll Highway MRT', 'Novena MRT', 'Oasis LRT', 'One-North MRT', 'Orchard MRT',
                        'Outram Park MRT', 'Pasir Ris MRT', 'Paya Lebar MRT', 
                        'Pioneer MRT', 'Potong Pasir MRT', 'Promenade MRT', 'Punggol MRT', 'Queenstown MRT', 'Raffles Place MRT', 'Redhill MRT',
                        'Riviera LRT', 'Rochor MRT', 'Sembawang MRT', 'Sengkang MRT', 'Serangoon MRT', 'Simei MRT', 'Sixth Avenue MRT', 
                        'Somerset MRT', 'Springleaf MRT', 'Stadium MRT', 'Stevens MRT', 'Sumang LRT', 'Tai Seng MRT', 'Tampines MRT', 
                        'Tampines West MRT', 'Tanah Merah MRT', 'Tanjong Pagar MRT', 'Tanjong Rhu MRT', 'Teck Lee LRT', 
                        'Telok Ayer MRT', 'Telok Blangah MRT', 'Thanggam LRT', 'Tiong Bahru MRT', 'Toa Payoh MRT', 
                        'Tuas Crescent MRT', 'Tuas Link MRT', 'Tuas West Road MRT', 'Ubi MRT', 'Upper Changi MRT', 
                        'Woodlands MRT', 'Yew Tee MRT', 'Yio Chu Kang MRT', 'Yishun MRT']
    # Future stations - 'Tampines North MRT', 'Tengah MRT'

    mrt_coordinates = {}

    for mrt in mrt_stations:
        try:
            response = requests.get(f"https://developers.onemap.sg/commonapi/search?searchVal={mrt}&returnGeom=Y&getAddrDetails=Y")
            response.raise_for_status()
            data = response.json()
            # string (lat,long) as key
            # mrt_coordinates[f"{data['results'][0]['LATITUDE']},{data['results'][0]['LONGITUDE']}"] = mrt
            mrt_coordinates[mrt] = (float(data['results'][0]['LATITUDE']),float(data['results'][0]['LONGITUDE']))
        except HTTPError as http_err:
            print(f'HTTP error occurred: {http_err}')
        except Exception as err:
            print(f'Other error occurred: {err}')
            # Only the station name is printed: `data` may be unset or left over from a previous station
            print(f'Error for {mrt}')

    with open('static/mrt_dict.json', 'w') as f:
        json.dump(mrt_coordinates, f, indent=4)

def get_mrt_coordinates(location = 'static/mrt_dict.json'):
    '''
    Function to load the saved MRT station coordinates
    Returns a dictionary of station name -> [latitude, longitude]
    '''
    with open(location, 'r') as f:
        return json.load(f)


In [19]:
mrt_coordinates = get_mrt_coordinates()

In [35]:
start = "1.32283703302242,103.939124525951"
routeType = ['walk', 'drive', 'pt', 'cycle']
end = "1.36126901451361,103.854642365822"
try:
    response = requests.get(f"https://developers.onemap.sg/privateapi/routingsvc/route?start={start}&end={end}&routeType={routeType[1]}&token={credentials}")
    response.raise_for_status()
    data = response.json()
    # print(data)
except HTTPError as http_err:
    print(f'HTTP error occurred: {http_err}')
except Exception as err:
    print(f'Other error occurred: {err}')


# phyroute is a nested dictionary within the first dictionary returned
# what if there are more routes? Need to test

route_1 = data['route_summary']
route_2 = data['phyroute']['route_summary']

pprint(route_1)
pprint(route_2)

{'end_point': 'ANG MO KIO AVENUE 10',
 'start_point': 'BEDOK SOUTH AVENUE 2',
 'total_distance': 14540,
 'total_time': 1292}
{'end_point': 'ANG MO KIO AVENUE 10',
 'start_point': 'BEDOK SOUTH AVENUE 2',
 'total_distance': 13039,
 'total_time': 1412}
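
To turn a response like this into the "minimum travelling time" feature, one option is to collect every `route_summary` in the response and keep the smallest `total_time`. The sketch below assumes that alternative routes, if any, also appear as nested dictionaries carrying their own `route_summary`; as the comment in the cell notes, that has not been tested against responses with more routes. The helper names are hypothetical.

```python
def route_summaries(response):
    """Collect every 'route_summary' dict found in a routing response, at any nesting depth."""
    found = []
    if isinstance(response, dict):
        for key, value in response.items():
            if key == "route_summary" and isinstance(value, dict):
                found.append(value)
            else:
                found.extend(route_summaries(value))
    elif isinstance(response, list):
        for item in response:
            found.extend(route_summaries(item))
    return found

def fastest_time(response):
    """Shortest 'total_time' across all summaries, or None if there are none."""
    times = [s["total_time"] for s in route_summaries(response) if "total_time" in s]
    return min(times) if times else None

# Trimmed copy of the response printed above
sample = {
    "route_summary": {"total_distance": 14540, "total_time": 1292},
    "phyroute": {"route_summary": {"total_distance": 13039, "total_time": 1412}},
}
print(fastest_time(sample))  # 1292
```

Returning `None` for an empty or failed response lets the caller decide how to handle flats with no route.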
