# Singapore Public Housing (HDB) Resale Price Prediction Model (Part 1)
### Data Collection - Housing Address Coordinates

## 1. Introduction

In the late 19th century, American steel magnate Andrew Carnegie famously predicted that investing in real estate would be the wise way to multiply one's wealth. More than 100 years later, in the 21st century, that statement still rings true for most of us.

>“Ninety percent of all millionaires become so through owning real estate. More money has been made in real estate than in all industrial investments combined. The wise young man or wage earner of today invests his money in real estate.” - Andrew Carnegie

According to the Forbes rich list in 2018, 16 of the 22 billionaires in Singapore are real-estate tycoons. Their total fortune came to a staggering $43.7 billion, up nearly 10 percent from the previous year (Channel News Asia, 2018). We cannot deny that successful real-estate investors rely on either natural talent or decades of experience to tell a good property listing or development project from a bad one. Nevertheless, it would be fascinating if we could unravel and quantify the judgement of investors, or of ordinary home buyers, with the help of Data Science. So we set out on a journey to create a machine learning model that can benchmark a property's price based on its attributes.

In this project, we will develop a predictive model of HDB unit resale prices using several machine learning techniques of different classes. We believe that a model like this would be very valuable for any real estate agent or home buyer who wants a benchmark price to judge whether a property listing is over- or under-valued.

One thing worth noting is that HDB-developed flats sit in an environment tightly controlled by the Government to avoid market speculation on housing prices. In fact, investment in these flats is technically prohibited, as one can own at most one HDB flat. However, this does not render our study and model useless: the project still serves as a window into home buyers' psychology and into how a price premium can be justified by numerical features or attributes. The same methodology could also be applied to condominium sales in Singapore, or to real estate in city-states that resemble Singapore, where some of the key features would still be applicable and worth researching.

## 2. Problem Statement

Singapore has long had one of the most expensive housing markets in the world, so it is crucial for locals to make sure every dollar they spend is worthwhile. However, most housing prices are still benchmarked manually by experienced appraisers today. Hence, it would be really helpful for home buyers to have a predictive model that can identify undervalued property listings and maximize their savings.

## 3. About this Notebook

The raw data for the project comes from the official Singapore Government data portal, where the archive can be found at [this link](https://data.gov.sg/dataset/resale-flat-prices). It contains the transaction records for all resale public housing flats developed and managed by the Housing & Development Board (HDB) in Singapore. These flats are commonly known as "HDB" by locals in Singapore.

The available transaction data ranges from 1990 to 2020. Housing prices have risen significantly over the last few decades, and this property price inflation would not be well captured by the model without thorough research into public policies and the global economic climate. Hence, the project will only cover properties sold from January 2017 to March 2020.

The raw data comes with some basic information on the property attributes, such as the town, storey range, floor area, remaining lease and the resale price (the target variable). All HDB flats are bound by a standard 99-year lease, after which the government has the right to take back the property for future development, so the remaining lease is an important indicator of housing price in Singapore.

In this notebook, we will clean up the raw data and transform it into numerical format for ease of modelling and visualization. Then, using the OneMap public API, we will submit each address and convert each data entry into its coordinates so that we can add more geospatial data at a later stage.

## 4. Initialization

In [1]:
# Import Vanilla Libraries
import requests, json, time, random, math
import pandas as pd
import numpy as np

In [2]:
# Read property dataset as dataframe
hdb = pd.read_csv('./Dataset/Raw/hdb_2017_2020.csv')

In [3]:
hdb.head()

Unnamed: 0,month,town,flat_type,block,street_name,storey_range,floor_area_sqm,flat_model,lease_commence_date,remaining_lease,resale_price
0,2017-01,ANG MO KIO,2 ROOM,406,ANG MO KIO AVE 10,10 TO 12,44.0,Improved,1979,61 years 04 months,232000.0
1,2017-01,ANG MO KIO,3 ROOM,108,ANG MO KIO AVE 4,01 TO 03,67.0,New Generation,1978,60 years 07 months,250000.0
2,2017-01,ANG MO KIO,3 ROOM,602,ANG MO KIO AVE 5,01 TO 03,67.0,New Generation,1980,62 years 05 months,262000.0
3,2017-01,ANG MO KIO,3 ROOM,465,ANG MO KIO AVE 10,04 TO 06,68.0,New Generation,1980,62 years 01 month,265000.0
4,2017-01,ANG MO KIO,3 ROOM,601,ANG MO KIO AVE 5,01 TO 03,67.0,New Generation,1980,62 years 05 months,265000.0


## 5. Feature Engineering

A quick glance through the first 5 rows of data reveals that some of the data (such as month and remaining lease) are not in a usable format. We will need to extract and transform these data into numerical form.

### 5.1 Date Time Data

In [4]:
# Extract Month and Year from 'month' column
hdb['sold_year'] = hdb['month'].str[:4].astype(int)
hdb['sold_month'] = hdb['month'].str[-2:].astype(int)

In [5]:
# Function to convert 'remaining lease' to numerical column
def extract_lease_years(row):
    if len(row)==4:
        return round(int(row[0]) + int(row[2])/12, 2)
    else:
        return int(row[0])

In [6]:
# Apply function to 'remaining lease'
hdb['remaining_lease'] = hdb['remaining_lease'].str.split().apply(extract_lease_years)
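
As a standalone illustration of the conversion above, here is a minimal sketch that parses the lease string with a regular expression instead of checking the number of tokens. The function name `lease_to_years` is hypothetical, and the years-only input `"70 years"` is an assumed format that the `else` branch above appears to handle; it does not appear in the rows shown in this notebook.

```python
import re

def lease_to_years(text):
    """Convert a lease string such as '61 years 04 months' to fractional years."""
    # The years part is required; the months part is optional (assumed format).
    match = re.fullmatch(r"\s*(\d+)\s+years?(?:\s+(\d+)\s+months?)?\s*", text)
    if match is None:
        raise ValueError(f"Unrecognised lease format: {text!r}")
    years = int(match.group(1))
    months = int(match.group(2)) if match.group(2) else 0
    # Round to 2 decimals, as the notebook's own function does.
    return round(years + months / 12, 2)

print(lease_to_years("61 years 04 months"))  # 61.33
print(lease_to_years("62 years 01 month"))   # 62.08
print(lease_to_years("70 years"))            # 70.0
```

Unlike the token-count check, this version raises an error on an unexpected format rather than silently returning only the years.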

In [7]:
hdb.sample(5)

Unnamed: 0,month,town,flat_type,block,street_name,storey_range,floor_area_sqm,flat_model,lease_commence_date,remaining_lease,resale_price,sold_year,sold_month
14657,2017-09,SENGKANG,4 ROOM,213B,COMPASSVALE LANE,10 TO 12,95.0,Premium Apartment,2012,94.08,435000.0,2017,9
9449,2017-06,TOA PAYOH,3 ROOM,117,POTONG PASIR AVE 1,04 TO 06,67.0,New Generation,1984,66.33,265000.0,2017,6
56931,2019-09,BEDOK,4 ROOM,724,BEDOK RESERVOIR RD,01 TO 03,103.0,Model A,1984,63.83,410000.0,2019,9
23594,2018-03,JURONG WEST,4 ROOM,822,JURONG WEST ST 81,07 TO 09,106.0,Model A,1993,74.83,320000.0,2018,3
6004,2017-04,YISHUN,5 ROOM,876,YISHUN ST 81,01 TO 03,121.0,Improved,1987,69.67,430000.0,2017,4


After transforming the date into month and year and the remaining lease into a floating-point number, we need to concatenate the block number with its street name so that we can pass the full address to the OneMap API for a more accurate coordinate response.

In [8]:
# Concatenate block number and street name to form full address for API Calls
hdb['address'] = hdb['block'] + ' ' + hdb['street_name']

In [9]:
# Drop all the pre-engineered column
hdb.drop(['month', 'block', 'street_name'], axis=1, inplace=True)

### 5.2 Storey Range

For the storey range, we will simply take the midpoint of each range; since each range only covers 3 levels, the information loss is almost negligible.

In [10]:
# Check on storey range
hdb.storey_range.unique()

array(['10 TO 12', '01 TO 03', '04 TO 06', '07 TO 09', '13 TO 15',
       '19 TO 21', '22 TO 24', '16 TO 18', '34 TO 36', '28 TO 30',
       '37 TO 39', '49 TO 51', '25 TO 27', '40 TO 42', '31 TO 33',
       '46 TO 48', '43 TO 45'], dtype=object)

In [11]:
# Convert 'storey_range' into numerical by taking the midpoint of each range
hdb.storey_range = (hdb.storey_range.str[:2].astype(int) + hdb.storey_range.str[-2:].astype(int))
hdb.storey_range = (hdb.storey_range/2).astype(int)
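
The midpoint calculation can be checked on single values with a small plain-Python sketch. The helper name `storey_midpoint` is hypothetical and assumes every range follows the `'NN TO NN'` pattern listed above.

```python
def storey_midpoint(storey_range):
    """Return the middle floor of a range string such as '10 TO 12'."""
    # Split on the literal separator and convert both ends to integers.
    low, high = (int(part) for part in storey_range.split(" TO "))
    # Integer division mirrors the astype(int) truncation used in the notebook.
    return (low + high) // 2

print(storey_midpoint("10 TO 12"))  # 11
print(storey_midpoint("01 TO 03"))  # 2
print(storey_midpoint("49 TO 51"))  # 50
```

Because every range spans three floors, the midpoint is at most one level away from the true storey.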

In [12]:
# Sanity check
hdb.head()

Unnamed: 0,town,flat_type,storey_range,floor_area_sqm,flat_model,lease_commence_date,remaining_lease,resale_price,sold_year,sold_month,address
0,ANG MO KIO,2 ROOM,11,44.0,Improved,1979,61.33,232000.0,2017,1,406 ANG MO KIO AVE 10
1,ANG MO KIO,3 ROOM,2,67.0,New Generation,1978,60.58,250000.0,2017,1,108 ANG MO KIO AVE 4
2,ANG MO KIO,3 ROOM,2,67.0,New Generation,1980,62.42,262000.0,2017,1,602 ANG MO KIO AVE 5
3,ANG MO KIO,3 ROOM,5,68.0,New Generation,1980,62.08,265000.0,2017,1,465 ANG MO KIO AVE 10
4,ANG MO KIO,3 ROOM,2,67.0,New Generation,1980,62.42,265000.0,2017,1,601 ANG MO KIO AVE 5


### 5.3 Coordinates

#### 5.3.1 OneMap API Calling

To speed up the API calls, instead of submitting every entry, which would require 70,104 requests, we will submit only the unique addresses in the raw dataset. There are only 8,712 unique addresses, so the number of API calls is cut by about 88%.

In [14]:
unique_add = pd.Series(hdb['address'].unique())

In [15]:
unique_add = pd.DataFrame(unique_add, columns=["address"])

In [19]:
print("Time Saving on API Calling:", round(1 - unique_add.shape[0] / hdb.shape[0], 2) * 100, '%')

Time Saving on API Calling: 88.0 %
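
The idea of looking up each distinct address only once and then mapping the result back to every row can be sketched without pandas or network access. Everything here is illustrative: `geocode_unique` and `fake_geocode` are hypothetical names, and the returned coordinates are placeholders, not real data.

```python
def geocode_unique(addresses, geocode):
    """Call `geocode` once per distinct address and map the results back to every row."""
    cache = {}
    for address in addresses:
        if address not in cache:  # only the first occurrence triggers a lookup
            cache[address] = geocode(address)
    return [cache[address] for address in addresses]

# Stand-in geocoder that records each call instead of hitting an API.
calls = []
def fake_geocode(address):
    calls.append(address)
    return (1.0, 103.0)  # placeholder coordinates, not real data

rows = ["406 ANG MO KIO AVE 10", "108 ANG MO KIO AVE 4", "406 ANG MO KIO AVE 10"]
coords = geocode_unique(rows, fake_geocode)
print(len(rows), "rows ->", len(calls), "lookups")  # 3 rows -> 2 lookups
```

The notebook achieves the same effect by geocoding the `unique_add` dataframe and merging it back onto `hdb` by address.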


In [15]:
# Extract coordinates from OneMap API by sending full address of each block

if 'latitude' not in unique_add.columns:
    unique_add['latitude'] = np.nan
    unique_add['longitude'] = np.nan

for i, address in enumerate(unique_add['address']):
    if math.isnan(unique_add.loc[i, 'latitude']):
        try:
            print('\rWaiting... ({})... {} addresses remaining... '.format(address, len(unique_add)-i-1), end='.')
            query = "https://developers.onemap.sg/commonapi/search?searchVal=" + address + "&returnGeom=Y&getAddrDetails=N"
            response = requests.get(query)
            coor_json = json.loads(response.content)
            unique_add.loc[i, 'latitude'] = coor_json['results'][0]['LATITUDE']
            unique_add.loc[i, 'longitude'] = coor_json['results'][0]['LONGITUDE']

        except Exception:  # no result or failed request: leave coordinates as NaN
            unique_add.loc[i, 'latitude'] = np.nan
            unique_add.loc[i, 'longitude'] = np.nan

        # Export dataframe every loop to csv for inspection
        unique_add.to_csv('./Dataset/Transitional/unique_address.csv', index=False)

        # Sleeping time to avoid overloading server
        time.sleep(random.randint(1,2)/4)

if unique_add['latitude'].isnull().sum() == 0:
    print("--- Data is complete ---")

Waiting... (666A YISHUN AVE 4)... 0 addresses remaining... . . ... ...
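
The request loop above builds the URL by string concatenation and reads the first entry of `results`. A minimal sketch of those two steps as separate helpers is shown below, using only the endpoint, query parameters and JSON keys that already appear in this notebook. The helper names are hypothetical, the sample payloads are hand-written rather than real API output, and whether the service treats `+`-encoded spaces exactly like the raw string is an assumption.

```python
import json
from urllib.parse import urlencode

# Endpoint as used in this notebook.
SEARCH_URL = "https://developers.onemap.sg/commonapi/search"

def build_query(address):
    """Return the search URL with the address percent-encoded."""
    params = {"searchVal": address, "returnGeom": "Y", "getAddrDetails": "N"}
    return SEARCH_URL + "?" + urlencode(params)

def first_coordinates(payload):
    """Return (latitude, longitude) of the first result, or None if there is none."""
    results = payload.get("results") or []
    if not results:
        return None
    return float(results[0]["LATITUDE"]), float(results[0]["LONGITUDE"])

# Hand-written sample payloads (illustrative values only).
hit = json.loads('{"results": [{"LATITUDE": "1.36", "LONGITUDE": "103.85"}]}')
miss = json.loads('{"results": []}')

print(build_query("3 ST. GEORGE'S RD"))
print(first_coordinates(hit))   # (1.36, 103.85)
print(first_coordinates(miss))  # None
```

Returning `None` for an empty result makes the "address not found" case explicit, so it can be told apart from a network error.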

#### 5.3.2 Addressing Missing Data

There is a systematic pattern in the data missing from the API calls: every failed entry has an address containing "St. George's". A quick trial on OneMap showed that the system only recognises the word 'Saint' rather than the abbreviation 'St.'. So we will replace the abbreviation and pass the addresses through the loop again. This time the loop checks for NaN, so only the missing values are updated, saving us precious time and resources.

In [16]:
unique_add[unique_add['latitude'].isnull()]['address'].unique()

array(["3 ST. GEORGE'S RD", "21 ST. GEORGE'S RD", "11 ST. GEORGE'S RD",
       "8 ST. GEORGE'S LANE", "18 ST. GEORGE'S RD", "15 ST. GEORGE'S RD",
       "9 ST. GEORGE'S RD", "4B ST. GEORGE'S LANE", "7 ST. GEORGE'S LANE",
       "5 ST. GEORGE'S LANE", "22 ST. GEORGE'S RD", "20 ST. GEORGE'S RD",
       "13 ST. GEORGE'S RD", "6 ST. GEORGE'S LANE", "2 ST. GEORGE'S RD",
       "14 ST. GEORGE'S RD", "23 ST. GEORGE'S RD", "16 ST. GEORGE'S RD",
       "1 ST. GEORGE'S RD", "10 ST. GEORGE'S RD", "17 ST. GEORGE'S RD"],
      dtype=object)

In [17]:
missing_index = unique_add[unique_add['latitude'].isnull()].index

In [18]:
unique_add.loc[missing_index, 'address'] = unique_add.loc[missing_index, 'address'].str.replace('ST.', 'SAINT', regex=False)
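
One caveat with this replacement: if the pattern `'ST.'` is treated as a regular expression, the dot matches any character, not just a full stop. Older pandas versions defaulted `str.replace` to regex matching, while pandas 2.0 and later default to literal matching. The difference can be seen with the standard library alone; `"12 STIRLING RD"` is an illustrative address chosen for this sketch, not one of the failing entries.

```python
import re

target = "3 ST. GEORGE'S RD"
other = "12 STIRLING RD"  # illustrative address that should stay unchanged

# Regex matching: '.' matches any character, so 'STI' is replaced as well.
print(re.sub("ST.", "SAINT", other))             # 12 SAINTRLING RD

# Literal matching: only the exact text 'ST.' is replaced.
print(other.replace("ST.", "SAINT"))             # 12 STIRLING RD
print(target.replace("ST.", "SAINT"))            # 3 SAINT GEORGE'S RD

# Escaping the pattern gives the literal behaviour with re.sub too.
print(re.sub(re.escape("ST."), "SAINT", other))  # 12 STIRLING RD
```

In this notebook the replacement is restricted to the rows with missing coordinates, which all contain "ST. GEORGE'S", so both modes give the same result here; literal matching simply removes the risk if the set of rows changes.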

In [19]:
unique_add['latitude'] = unique_add['latitude'].astype(float)
unique_add['longitude'] = unique_add['longitude'].astype(float)

#### 5.3.3 Second Round API Calling

In [20]:
# Repassing dataframe for API Calling but only missing data is processed
for i, lat in enumerate(unique_add.latitude):
    
    # Only if latitude is missing, fire up API calling query
    if math.isnan(lat):
        address = unique_add.loc[i, 'address']
        try:
            print('\rWaiting... ({})... {} neighbourhood remaining... '.format(address, len(unique_add)-i-1), end='.')
            query = "https://developers.onemap.sg/commonapi/search?searchVal=" + address + "&returnGeom=Y&getAddrDetails=N"
            response = requests.get(query)
            coor_json = json.loads(response.content)
            unique_add.loc[i, 'latitude'] = coor_json['results'][0]['LATITUDE']
            unique_add.loc[i, 'longitude'] = coor_json['results'][0]['LONGITUDE']
            time.sleep(random.randint(1,2)/4)

        except Exception:  # no result or failed request: leave coordinates as NaN
            unique_add.loc[i, 'latitude'] = np.nan
            unique_add.loc[i, 'longitude'] = np.nan
            time.sleep(random.randint(1,2)/4)
            
unique_add.to_csv('./Dataset/Transitional/unique_address.csv', index=False)

if unique_add['latitude'].isnull().sum() == 0:
    print("--- Data is complete ---")

Waiting... (17 SAINT GEORGE'S RD)... 616 neighbourhood remaining... ....--- Data is complete ---


In [21]:
# Sanity check on missing data
unique_add = pd.read_csv("./Dataset/Transitional/unique_address.csv")

In [22]:
unique_add.isnull().sum()

address      0
latitude     0
longitude    0
dtype: int64

## 6. Final Data Export

In [24]:
hdb = pd.merge(hdb, unique_add, how='left', on='address')

In [26]:
# Export final dataframe to CSV
hdb.to_csv('./Dataset/Transitional/complete_data.csv', index=False)

In [27]:
# Sanity check on the final dataframe
hdb = pd.read_csv('./Dataset/Transitional/complete_data.csv')
hdb.isnull().sum()

town                     0
flat_type                0
storey_range             0
floor_area_sqm           0
flat_model               0
lease_commence_date      0
remaining_lease          0
resale_price             0
sold_year                0
sold_month               0
address                  0
latitude               164
longitude              164
dtype: int64

In [31]:
missing_data = hdb[hdb['latitude'].isnull()].index
hdb.loc[missing_data, 'address'] = hdb.loc[missing_data, 'address'].str.replace('ST.', 'SAINT', regex=False)
hdb = pd.merge(hdb, unique_add, how='left', on='address')

After the unique-address data is merged into the final data a second time, we have 2 sets of coordinates. We will drop the first-round columns (suffix `_x`), since the second-round columns (suffix `_y`) cover every row, including those that were missing before. The 2 remaining columns will also be renamed for ease of future reference.

In [32]:
hdb.isnull().sum()

town                     0
flat_type                0
storey_range             0
floor_area_sqm           0
flat_model               0
lease_commence_date      0
remaining_lease          0
resale_price             0
sold_year                0
sold_month               0
address                  0
latitude_x             164
longitude_x            164
latitude_y               0
longitude_y              0
dtype: int64

In [33]:
hdb.drop(['latitude_x', 'longitude_x'], axis=1, inplace=True)
hdb.rename(columns={'latitude_y': 'latitude', 'longitude_y': 'longitude'}, inplace=True)

In [37]:
# Export final dataframe to CSV
hdb.to_csv('./Dataset/Transitional/complete_data.csv', index=False)

Now that the final data has been checked and exported, we will move on to the second part on data scraping. In the 2nd notebook, data on amenities and infrastructure will be scraped from online resources and merged with our HDB dataset so that the effect of each aspect on housing price can be studied.