# Overview

I have decided to go with my second choice, which is to predict the price of HDB resale flats. If time permits, I will also predict the price of private residential properties.

# Problem Statement

For most people, purchasing a property is their biggest investment, so it is important for buyers to know whether they can afford the downpayment and the mortgage payments. In Singapore, the government gives generous subsidies to first-time buyers of public housing flats built by the Housing and Development Board (HDB). Singaporeans who meet the criteria (there is an income ceiling) have taken advantage of this policy to own their homes, resulting in 80% of residents living in HDB flats.

The private property market serves local high-income earners and HDB flat upgraders. Many foreigners also see Singapore as a safe haven to park their money.

The price of a property is determined by many factors. Besides the condition, size and location of the house, it is also affected by the state of the economy, government policy, and supply and demand. 

I will compare different regression models to predict the prices of HDB flats, based on the properties' attributes, their proximity to points of interest, market supply and demand, and macroeconomic factors. 

I will use RMSE to measure model performance, and the model should improve on the baseline by at least 10%. The baseline is defined as the average of property prices.
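The evaluation criterion above can be sketched as follows. This is a minimal illustration, assuming that "baseline" means always predicting the mean training price and that "improve by 10%" means an RMSE at least 10% lower than the baseline RMSE. The prices and the names `rmse`, `baseline_rmse` and `target_rmse` are hypothetical.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between actual and predicted prices."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Hypothetical resale prices, for illustration only
train_prices = np.array([300000, 420000, 510000, 650000])
test_prices = np.array([350000, 450000, 600000])

# Baseline (assumption): always predict the mean price of the training set
baseline_pred = np.full(len(test_prices), train_prices.mean())
baseline_rmse = rmse(test_prices, baseline_pred)

# A model meets the target if its RMSE is at most 90% of the baseline RMSE
target_rmse = 0.9 * baseline_rmse
print(baseline_rmse, target_rmse)
```

Any regression model whose test RMSE falls at or below `target_rmse` would satisfy the criterion under these assumptions.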

## Risks and Assumptions

**May not be able to fetch Geocodes for all HDB flats**

I have tried using Google Maps and Nominatim to get geocodes of HDB flats and points of interest. 

Google Maps is able to get geocodes for all physical addresses, but Google charges USD 7 per 1,000 requests for the first 100,000 requests. 

Nominatim is free but is unable to get geocodes for about 15% of the addresses. To mitigate this risk, I will collect more HDB resale transaction data and drop the observations for which it fails to return geocodes. Alternatively, I will pay for the Google service, geocode the addresses manually or explore other free geocoding services. 
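Dropping the observations that could not be geocoded might look like the sketch below. It assumes failures are stored as a latitude and longitude of 0, which is what the geocoding helpers later in this notebook return. The sample rows and the names `geocoded`, `failed` and `clean` are hypothetical.

```python
import pandas as pd

# Hypothetical geocoding results; (0, 0) marks an address that could not be geocoded
geocoded = pd.DataFrame({
    "address": ["391 yishun avenue 6", "837 yishun street 81",
                "342a yishun ring road", "505a yishun street 51"],
    "lat": [1.43, 1.41, 0.0, 1.42],
    "lng": [103.84, 103.83, 0.0, 103.85],
})

# Flag the failed lookups and measure how many there are
failed = (geocoded["lat"] == 0) & (geocoded["lng"] == 0)
failure_rate = failed.mean()

# Keep only the rows with usable coordinates
clean = geocoded[~failed].copy()
print(f"dropped {failed.sum()} of {len(geocoded)} rows ({failure_rate:.0%})")
```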

## Import libraries and data

In [15]:
import pandas as pd
import numpy as np
import geopy.geocoders
from geopy.geocoders import Nominatim, GoogleV3
from geopy.exc import GeocoderTimedOut
from geopy.exc import GeocoderServiceError
from geopy.exc import GeopyError
import scipy.stats as sps
import seaborn as sns
from time import time
from matplotlib import pyplot
from bs4 import BeautifulSoup
import requests
import json
import re

# File organization
 - notebook in codes folder
 - data files to be imported in datasets/input folder
 - data files exported in datasets/output folder

## Data Collection, Cleaning and Munging

Types of Data collected

1.HDB Flat Resale transactions (2017 to 2020)
- flat attributes and prices

2.Supply and Demand Factors
- Supply of new HDB flats
- Supply of new private properties
- population size
- number of married people

3.Macroeconomic factors
- Consumer Price Index
- Purchasing Manager Index
- Composite Leading Index
- GDP Growth
- CPF interest rates
- Singapore Interbank Offered Rate (SIBOR)
- unemployment rate
- median income of residents

4.Cost Factors
- HDB flat price index
- Private property price index

5.Points of Interest
- Shopping Malls (more malls to be manually added)
- Nature Parks
- Columbaria/crematoria/cemeteries
- Schools
- Sports Facilities
- MRT/LRT stations
- Hawker centres and markets


## Progress
State of Data collection - 99%<br/>
State of Data Munging - 99%<br/>
State of EDA - 70%

## 1. HDB Flat Resale Transactions

Source: data.gov.sg<br/>
Observation: per sale transaction

In [2]:
hdb_df = pd.read_csv('../datasets/input/resale-flat-prices-2017-2020.csv')
hdb_df.tail()

Unnamed: 0,month,town,flat_type,block,street_name,storey_range,floor_area_sqm,flat_model,lease_commence_date,remaining_lease,resale_price
70874,2020-05,YISHUN,5 ROOM,342A,YISHUN RING RD,10 TO 12,117.0,Premium Apartment,2016,94 years 09 months,578000.0
70875,2020-05,YISHUN,5 ROOM,335C,YISHUN ST 31,13 TO 15,112.0,Improved,2015,94 years 06 months,550000.0
70876,2020-05,YISHUN,5 ROOM,505A,YISHUN ST 51,13 TO 15,112.0,Improved,2016,94 years 11 months,540000.0
70877,2020-05,YISHUN,EXECUTIVE,391,YISHUN AVE 6,10 TO 12,142.0,Apartment,1988,67 years 02 months,553000.0
70878,2020-05,YISHUN,EXECUTIVE,837,YISHUN ST 81,01 TO 03,145.0,Maisonette,1988,66 years 09 months,652888.0


In [3]:
# check for any null value
hdb_df.isna().sum().any()

False

In [4]:
# Change all text values to lowercase
hdb_df['town'] = hdb_df['town'].str.lower()
hdb_df['flat_type'] = hdb_df['flat_type'].str.lower()
hdb_df['block'] = hdb_df['block'].str.lower()
hdb_df['street_name'] = hdb_df['street_name'].str.lower()
hdb_df['storey_range'] = hdb_df['storey_range'].str.lower()
hdb_df['flat_model'] = hdb_df['flat_model'].str.lower()

# combine block and street_name to create the address column, and drop the original columns
hdb_df['address'] = hdb_df['block'] + ' ' + hdb_df['street_name']
hdb_df.drop(['block','street_name'],axis=1,inplace=True)

# split 'month' (YYYY-MM) into year and month; n must be passed by keyword in recent pandas
hdb_df[['year','month']] = hdb_df['month'].str.split("-", n=1, expand=True)

hdb_df['year'] = hdb_df['year'].astype(np.int32)
hdb_df['month'] = hdb_df['month'].astype(np.int32)

hdb_df = hdb_df.reindex(['year', 'month', 'town', 'flat_type', 'storey_range', 'floor_area_sqm',
       'flat_model', 'lease_commence_date', 'remaining_lease', 'resale_price',
       'address', 'lat', 'lng'],axis=1)

In [5]:
# Many addresses in Singapore use abbreviations, which can affect the ability of geocoding services to get geocodes
# The following will convert abbreviations in addresses to their original forms

hdb_df.replace(regex={
        r"\b(upp)\b":"upper",r"\b(rd)\b":"road",r"\b(lor)\b":"lorong",
        r"\b(ave)\b":"avenue",r"\b(jln)\b":"jalan",r"\b(sth)\b":"south",r"\b(nth)\b":"north",
        r"\b(ctrl)\b":"central",r"\b(blk)\b":"block",r"\b(blvd)\b":"boulevard",
        r"\b(bt)\b":"bukit",r"\b(c'wealth)\b":"commonwealth",r"\b(cl)\b":"close",r"\b(cplx)\b":"complex",
        r"\b(ctr)\b":"centre",r"\b(dr)\b":"drive",r"\b(est)\b":"estate",r"\b(gdn)\b":"garden",
        r"\b(gdns)\b":"gardens",r"\b(gr)\b":"grove",r"\b(hse)\b":"house",r"\b(hts)\b":"heights",
        r"\b(ind)\b":"industrial",r"\b(distripk)\b":"distripark",r"\b(intl)\b":"international",
        r"\b(lk)\b":"link",r"\b(mkt)\b":"market",r"\b(mjd)\b":"masjid",r"\b(mt)\b":"mount",
        r"\b(natl)\b":"national",r"\b(opp)\b":"opposite",r"\b(pk)\b":"park",r"\b(pl)\b":"place",
        r"\b(pt)\b":"point",r"\b(resvr)\b":"reservoir",r"\b(sch)\b":"school",r"\b(sci)\b":"science",
        r"\b(sq)\b":"square",r"\b(ter)\b":"terrace",r"\b(tg)\b":"tanjong",
        r"\b(tp)\b":"temple",r"\b(twr)\b":"tower",r"\b(w'lands)\b":"woodlands",r"\b(wk)\b":"walk",
        r"\b(wtr)\b":"water",r"\b(v)\b":"village",r"\b(veh)\b":"vehicle",r"\b(warehse)\b":"warehouse",
        r"\b(bef)\b":"before",r"\b(aft)\b":"after",r"\b(svc)\b":"service",
        r"\b(svcs)\b":"services",r"\b(sg)\b":"sungei",r"\b(kg)\b":"kampong"}, inplace=True)
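The effect of the abbreviation map can be checked on a few sample addresses. The sketch below uses a small subset of the patterns and made-up addresses. The `\b` word boundaries keep a pattern such as `rd` from matching inside a longer word.

```python
import pandas as pd

# Hypothetical lowercase addresses in the same style as the resale data
demo = pd.DataFrame({"address": ["342a yishun ring rd",
                                 "391 yishun ave 6",
                                 "10 upp bt timah rd"]})

# A small subset of the abbreviation map used above
abbreviations = {r"\brd\b": "road", r"\bave\b": "avenue",
                 r"\bupp\b": "upper", r"\bbt\b": "bukit"}

# DataFrame.replace applies every pattern to every string cell
expanded = demo.replace(regex=abbreviations)
print(expanded["address"].tolist())
```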

In [6]:
# Using Google Geocoding API to get geocodes

# I have removed my Google API key because anyone could use it to access Google services and charge my account,
# so the code snippet below will NOT work without the key


def get_pos_by_name(location_name):
    
    geolocator = GoogleV3(api_key='#### Google API Key ######')
    loc = geolocator.geocode(location_name, timeout=10)
    if not loc:
        return 0,0

    return (loc.latitude, loc.longitude) 

In [7]:
# Using Nominatim to get geocodes is FREE, but it fails for about 15% of the addresses

def get_lat_long(str_location):
    
    geolocator = Nominatim(user_agent='user_1')

    str_location = str_location.replace('\n',', ')
    str_location = str_location + " Singapore"
    float_lat = 0
    float_lon = 0
    try:
        # restrict the search to Singapore with country_codes;
        # a longer timeout greatly reduces missing values
        location = geolocator.geocode(str_location, timeout=10, country_codes='sg')

        if location is not None:
            float_lat, float_lon = location.latitude, location.longitude
        else:
            print('Sorry cannot get geocode for '+str_location)
    except GeopyError:
        # covers timeouts and service errors raised by geopy
        print('Sorry cannot get geocode for '+str_location)
    
    return float_lat, float_lon 

Computing the latitude and longitude takes many hours. If I do it for all records at once, the computer hangs, so I will break the records up by year.

To avoid SettingWithCopyWarning, I will use copy(deep=True).
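Because many transactions share the same address, one possible way to shorten the run is to geocode each distinct address only once and map the result back to every row. The sketch below is an assumption rather than what the notebook does. `fake_geocode` is a hypothetical stand-in for `get_lat_long` that returns made-up coordinates, so the example runs without network access.

```python
import pandas as pd

def fake_geocode(address):
    """Hypothetical stand-in for get_lat_long; returns made-up coordinates."""
    fake_geocode.calls += 1
    return 1.0 + len(address) / 1000, 103.0 + len(address) / 1000

fake_geocode.calls = 0

# Hypothetical transactions: two of the three rows share an address
df = pd.DataFrame({"address": ["391 yishun avenue 6",
                               "837 yishun street 81",
                               "391 yishun avenue 6"]})

# Geocode each distinct address once, then map the results back to every row
lookup = {addr: fake_geocode(addr) for addr in df["address"].unique()}
df["lat"] = df["address"].map(lambda a: lookup[a][0])
df["lng"] = df["address"].map(lambda a: lookup[a][1])

print(fake_geocode.calls, "lookups for", len(df), "rows")
```

With a real geocoder, the saving grows with the number of repeated addresses in the data.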

In [None]:
hdb_2019_df = hdb_df[hdb_df['year']==2019].copy(deep=True)

In [None]:
hdb_2020_df = hdb_df[hdb_df['year']==2020].copy(deep=True)

In [None]:
def update_df_hdb(df,yr):
  print('fetching lat/long and updating the dataframe .....')
  end = len(df)
  for i in range(end):
    df.iloc[i,11], df.iloc[i,12] = get_lat_long(df.iloc[i,10])

  path = '../datasets/output/hdb-resale-flat-prices-'+str(yr)+'.csv'
  df.to_csv(path, index = False)

  print('completed for year '+str(yr))


In [None]:
#update_df_hdb(hdb_2020_df,int(2020))

In [None]:
#update_df_hdb(hdb_2019_df,int(2019))

## 2. Supply and Demand Factors

In [22]:
# The observations in the datasets are monthly, quarterly or yearly.
# 
# Function to Expand quarterly figures to spread over the 3 months, to be used for datasets with quarterly records

def qtr_2_mth(df_in):
    df = df_in.copy(deep=True)
    
    size = len(df)
    
    for i in range(0,size*3,3):
        for j in range(0,2):
                
            mth = df.iloc[i+j,1]+1

            replica = pd.DataFrame({"year":int(df.iloc[i+j,0]),"month":mth,df.columns[2]:df.iloc[i+j,2]},index=[i+j+1])

            df = pd.concat([df.iloc[:i+j+1],replica,df.iloc[i+j+1:]]).reset_index(drop=True)
    
    return df
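A vectorized alternative to the loop above is sketched below. It is not the function used in this notebook. It assumes a `month` column that holds the first month of each quarter (1, 4, 7 or 10), and the name `qtr_to_month` and the sample values are hypothetical.

```python
import numpy as np
import pandas as pd

def qtr_to_month(df_in):
    """Repeat each quarterly row three times, one row per month of the quarter."""
    # Repeat every row 3 times while keeping the original order
    positions = np.repeat(np.arange(len(df_in)), 3)
    out = df_in.iloc[positions].reset_index(drop=True)
    # Add 0, 1, 2 to the starting month of each quarter
    out["month"] = out["month"] + np.tile([0, 1, 2], len(df_in))
    return out

# Hypothetical quarterly data, for illustration only
quarterly = pd.DataFrame({"year": [2019, 2019], "month": [1, 4], "cli": [101.5, 102.0]})
monthly = qtr_to_month(quarterly)
print(monthly)
```

Unlike the loop version, this sketch does not depend on the dataframe having exactly three columns, but it would still duplicate rows if it were run twice on the same data.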

### Supply of private houses

Source: data.gov.sg<br/>
Observation: quarterly

- The dataset includes both completed and uncompleted units, so I will split it into 2 dataframes: 1 for completed units and 1 for units in the pipeline.
- Expand the quarterly figures to monthly

In [None]:
pte_status_df = pd.read_csv('../datasets/input/completion-status-of-private-residential.csv')
pte_status_df.tail()

In [None]:
# Split into 2 dataframes - 1 for completed houses, 1 for incomplete/planned

pte_pipe_df = pte_status_df.loc[:,['quarter']].copy(deep=True)
pte_pipe_df['pte_pipe'] = (pte_status_df['provisional_permission'] 
                        + pte_status_df['written_permission'] 
                        + pte_status_df['building_plan_approval'] 
                        + pte_status_df['building_commencement'])

pte_built_df = pte_status_df.loc[:,['quarter']].copy(deep=True)
pte_built_df['pte_built'] = pte_status_df['building_completion']

In [None]:
# split the date field to year and quarter
pte_pipe_df[['year','month']] = pte_pipe_df['quarter'].str.split("-", n=1, expand=True)
pte_built_df[['year','month']] = pte_built_df['quarter'].str.split("-", n=1, expand=True)

# convert quarter to the starting month of the quarter in numeric form 
pte_pipe_df = pte_pipe_df.replace('Q1',1).replace('Q2',4).replace('Q3',7).replace('Q4',10)
pte_built_df = pte_built_df.replace('Q1',1).replace('Q2',4).replace('Q3',7).replace('Q4',10)

# reorder the columns
pte_pipe_df = pte_pipe_df.reindex(['year','month','pte_pipe'],axis=1)
pte_built_df = pte_built_df.reindex(['year','month','pte_built'],axis=1)

# change data type of numeric columns to integer
pte_pipe_df['year'] = pte_pipe_df['year'].astype(np.int32)
pte_pipe_df['month'] = pte_pipe_df['month'].astype(np.int32)
pte_built_df['year'] = pte_built_df['year'].astype(np.int32)
pte_built_df['month'] = pte_built_df['month'].astype(np.int32)


In [None]:
# IMPORTANT - DO NOT RUN MORE THAN ONCE

pte_pipe_df = qtr_2_mth(pte_pipe_df)
pte_built_df = qtr_2_mth(pte_built_df)

In [None]:
pte_built_df.tail()

### Supply of HDB flats

Source: data.gov.sg<br/>
Observation: yearly

The dataset includes both completed and uncompleted flats, so I will split it into 2 dataframes: 1 for completed flats and 1 for flats in the pipeline.

In [None]:
hdb_status_df = pd.read_csv('../datasets/input/completion-status-of-hdb-residential-developments.csv')
hdb_status_df.tail()

In [None]:
# Get a list of HDB flats in the pipeline
# Eligibility criteria and other terms for DBSS are almost the same as normal HDB flats so they will be combined 

hdb_pipe_df =  hdb_status_df[hdb_status_df['status'] != 'Completed']

# na means zero flats built for that year
hdb_pipe_df = hdb_pipe_df.replace('na',0)

# change datatype of numeric columns to integer 
hdb_pipe_df['no_of_units'] = hdb_pipe_df['no_of_units'].astype(np.int32)
hdb_pipe_df['financial_year'] = hdb_pipe_df['financial_year'].astype(np.int32)

# sum up DBSS and normal HDB flats
hdb_pipe_df = hdb_pipe_df.groupby(['financial_year'])['no_of_units'].sum().reset_index()

# rename columns
hdb_pipe_df.rename(columns={'financial_year':'year','no_of_units':'hdb_pipe'},inplace=True)

hdb_pipe_df

In [None]:
# Get a list of HDB flats built
# Eligibility criteria and other terms for DBSS are almost the same as normal HDB flats so they will be combined 

hdb_built_df =  hdb_status_df[hdb_status_df['status']=='Completed']

# na means zero flats built for that year
hdb_built_df = hdb_built_df.replace('na',0)

# change datatype of numeric columns to integer 
hdb_built_df['no_of_units'] = hdb_built_df['no_of_units'].astype(np.int32)
hdb_built_df['financial_year'] = hdb_built_df['financial_year'].astype(np.int32)

# sum up DBSS and normal HDB flats
hdb_built_df = hdb_built_df.groupby(['financial_year'])['no_of_units'].sum().reset_index()

# rename columns
hdb_built_df.rename(columns={'financial_year':'year','no_of_units':'hdb_new'},inplace=True)

hdb_built_df

### New HDB flats booked

source: data.gov.sg<br/>
observation: yearly

This is another indicator of the supply of flats. When buyers book a flat, they can expect to wait from a few months to a few years before they can move in.

In [None]:
hdb_booked_df = pd.read_csv('../datasets/input/bookings-for-new-flats.csv')
hdb_booked_df.head()

In [None]:
# change datatype of numeric columns to integer 
hdb_booked_df['no_of_units'] = hdb_booked_df['no_of_units'].astype(np.int32)
hdb_booked_df['financial_year'] = hdb_booked_df['financial_year'].astype(np.int32)

# Rename columns
hdb_booked_df.rename(columns={'financial_year':'year','no_of_units':'new_hdb_booked'},inplace=True)

### Population and types of home

Source: singstat<br/>
Observation: yearly

I am mainly interested to find out if the population size affects house prices.

I will also get the number of residents living in different types of houses. 

In [None]:
p_df = pd.read_csv('../datasets/input/population_singstat.csv')
p_df.head(9)

In [None]:
# Retrieve total residents and breakdown of residents for each type of house

s1 = p_df.iloc[0:1,0:].T.iloc[1:,0]
s2 = p_df.iloc[1:2,0:].T.iloc[1:,0]
s3 = p_df.iloc[2:3,0:].T.iloc[1:,0]
s4 = p_df.iloc[3:4,0:].T.iloc[1:,0]
s5 = p_df.iloc[4:5,0:].T.iloc[1:,0]
s6 = p_df.iloc[5:6,0:].T.iloc[1:,0]
s7 = p_df.iloc[6:7,0:].T.iloc[1:,0]
s8 = p_df.iloc[7:8,0:].T.iloc[1:,0]
s9 = p_df.iloc[8:9,0:].T.iloc[1:,0]

pop_df = pd.DataFrame()
pop_df['year'] = s1.index
pop_df['total_residents'] = s1.values
pop_df['hdb_dwellers'] = s2.values
pop_df['hdb_2r_dwellers'] = s3.values
pop_df['hdb_3r_dwellers'] = s4.values
pop_df['hdb_4r_dwellers'] = s5.values
pop_df['hdb_5r_dwellers'] = s6.values
pop_df['condo_dwellers'] = s7.values
pop_df['landed_dwellers'] = s8.values
pop_df['other_dwellers'] = s9.values
pop_df.tail()

### Number of Married People

Source: singstat<br/>
Observation: yearly

The majority of HDB flats are purchased by married couples, so it would be interesting to see how the number of married people affects flat prices.

In [None]:
m_df = pd.read_csv('../datasets/input/married_singstat.csv')
m_df.tail()

In [None]:
# transpose the selected row (position 2) into a column

s = m_df.iloc[2:3,0:].T.iloc[1:,0]

married_df = pd.DataFrame()
married_df['year'] = s.index
married_df['married'] = s.values

married_df.tail()

In [None]:
# remove commas from married
married_df = married_df.replace(",","",regex=True)

# change datatype of numeric columns to integer 
married_df['year'] = married_df['year'].astype(np.int32)
married_df['married'] = married_df['married'].astype(np.int64)

In [None]:
married_df.dtypes

## 3. Macroeconomic factors

I will collect data from Singapore government portals and industry bodies (data.gov.sg, singstat, SIPMM and the Association of Banks in Singapore), such as CPF interest rates, the interbank interest rate, the consumer price index, PMI, the composite leading index, income, unemployment rates, GDP growth rates, etc.

The observations are monthly, quarterly or yearly. I will expand the quarterly and yearly figures to monthly.
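Expanding yearly figures to monthly could be sketched as a cross join between the yearly rows and the months 1 to 12, so that every month carries the yearly value. This is an assumption about how the expansion might be done. The sample values are hypothetical, and `how="cross"` needs pandas 1.2 or later.

```python
import pandas as pd

# Hypothetical yearly figures, for illustration only
yearly = pd.DataFrame({"year": [2018, 2019], "married": [2000, 2100]})

# Pair each year with the months 1 to 12
months = pd.DataFrame({"month": range(1, 13)})
monthly = yearly.merge(months, how="cross")[["year", "month", "married"]]

print(monthly.head(13))
```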

### Composite Leading Index (CLI)
Source: data.gov.sg<br/>
Observation: quarterly

- Singapore's Composite Leading Index is used to anticipate the turning points of growth cycles, or fluctuations in the economy’s growth rate
- Expand the quarterly figures to monthly

In [None]:
cli_df = pd.read_csv('../datasets/input/composite-leading-index.csv')
cli_df.tail()

In [None]:
# split the date field to year and quarter
cli_df[['year','month']] = cli_df['quarter'].str.split("-", n=1, expand=True)

# convert quarter to the starting month of the quarter in numeric form 
cli_df = cli_df.replace('Q1',1).replace('Q2',4).replace('Q3',7).replace('Q4',10)

In [None]:
# rename column
cli_df.rename(columns={"value":"cli"},inplace=True)

# reorder the columns
cli_df = cli_df.reindex(['year','month','cli'],axis=1)

cli_df['year'] = cli_df['year'].astype(np.int32)
cli_df['month'] = cli_df['month'].astype(np.int32)

In [None]:
cli_df = qtr_2_mth(cli_df)

In [None]:
cli_df.tail()

### Purchasing Manager Index (PMI)
Source: Singapore Institute of Purchasing & Materials Management<br/>
Observation: Monthly

The PMI is an indicator of business activity - both in the manufacturing and services sectors

In [None]:
pmi_df = pd.read_csv('../datasets/input/pmi_sipmm.csv')
pmi_df.tail()

In [None]:
# split the date field to year and month
pmi_df[['month','year']] = pmi_df['Month/Year'].str.split("-", n=1, expand=True)

# Rename column
pmi_df.rename(columns={"Singapore PMI":"pmi"},inplace=True)

# change month to numeric form
dic = {'Jan':1,'Feb':2,'Mar':3,'Apr':4,'May':5,'Jun':6,'Jul':7,'Aug':8,'Sep':9,'Oct':10,'Nov':11,'Dec':12}
pmi_df.month = pmi_df.month.map(dic)

# change datatype of numeric columns to integer 
pmi_df['month'] = pmi_df['month'].astype(np.int32)
pmi_df['year'] = pmi_df['year'].astype(np.int32)

# change year to 4-digit format
pmi_df['year'] = pmi_df['year']+2000

# reorder the columns
pmi_df = pmi_df.reindex(['year','month','pmi'],axis=1)


### CPF interest rates

Source: data.gov.sg<br/>
observation: monthly

When you take a housing loan from HDB, you will enjoy a concessionary interest rate. This concessionary interest rate is pegged at 0.10% above the prevailing CPF Ordinary Account (OA) interest rate, and may be adjusted in January, April, July, and October, in line with CPF interest rate revisions. [Source: HDB](https://www.hdb.gov.sg/cs/infoweb/residential/servicing-your-hdb-loan/mortgage-loan/interest-rate)

In [None]:
# import CPF interest rates
ci_df = pd.read_csv('../datasets/input/cpf-interest-rates.csv')

# housing loan interest rate is based on Ordinary account
cpf_df = ci_df[ci_df['account_type']=='Ordinary'].copy(deep=True)

In [None]:
# split the date field to year and month
cpf_df[['year','month']] = cpf_df['mth'].str.split("-", n=1, expand=True)

In [None]:
cpf_df.tail()

In [None]:
# housing loan interest rate is based on Ordinary account
cpf_df = ci_df[ci_df['account_type']=='Ordinary'].copy(deep=True)

# split the date field to year and month
cpf_df[['year','month']] = cpf_df['mth'].str.split("-", n=1, expand=True)

# rename column to a meaningful name
cpf_df.rename(columns={"interest_rate":"cpf_rate"},inplace=True)

# reorder the columns
cpf_df = cpf_df.reindex(['year','month','cpf_rate'],axis=1)

# change datatypes of numeric columns to integer/float
cpf_df['year'] = cpf_df['year'].astype(np.int32)
cpf_df['month'] = cpf_df['month'].astype(np.int32)
# np.float was removed in NumPy 1.24; the built-in float gives the same float64 dtype
cpf_df['cpf_rate'] = cpf_df['cpf_rate'].astype(float)

In [None]:
cpf_df.dtypes

In [None]:
cpf_df.tail()

### Consumer Price Index

source: singstat<br/>
observation: monthly

In [None]:
c_df = pd.read_csv('../datasets/input/cpi_singstat.csv')

In [None]:
s = c_df.iloc[0:1,0:].T.iloc[1:,0]

cpi_df = pd.DataFrame()
cpi_df['month'] = s.index
cpi_df['cpi_index'] = s.values

cpi_df.tail()

In [None]:
# split the date field to year and month
cpi_df[['year','month']] = cpi_df['month'].str.split(" ", n=1, expand=True)

# change months to numeric form
dic = {'Jan':1,'Feb':2,'Mar':3,'Apr':4,'May':5,'Jun':6,'Jul':7,'Aug':8,'Sep':9,'Oct':10,'Nov':11,'Dec':12}
cpi_df.month = cpi_df.month.map(dic)

# rearrange columns
cpi_df = cpi_df.reindex(['year','month','cpi_index'],axis=1)

# change datatypes of numeric columns to integer
cpi_df['year'] = cpi_df['year'].astype(np.int32)
cpi_df['month'] = cpi_df['month'].astype(np.int32)

### GDP Growth Rates

source: singstat<br/>
observation: quarterly

- Expand the quarterly figures to monthly

In [None]:
g_df = pd.read_csv('../datasets/input/gdp_growth_singstat.csv')
g_df.tail()

In [None]:
s = g_df.iloc[0:1,0:].T.iloc[1:,0]

gdp_df = pd.DataFrame()
gdp_df['month'] = s.index
gdp_df['gdp_growth'] = s.values

gdp_df.tail()

In [None]:
# split the date field to year and quarter
gdp_df[['year','month']] = gdp_df['month'].str.split(" ", n=1, expand=True)

# convert quarter to the starting month of the quarter in numeric form 
gdp_df = gdp_df.replace('1Q',1).replace('2Q',4).replace('3Q',7).replace('4Q',10)

# reorder the columns
gdp_df = gdp_df.reindex(['year','month','gdp_growth'],axis=1)

gdp_df['year'] = gdp_df['year'].astype(np.int32)
gdp_df['month'] = gdp_df['month'].astype(np.int32)

In [None]:
# DO NOT RUN more than once. Otherwise it will keep expanding the rows
gdp_df = qtr_2_mth(gdp_df)

In [None]:
gdp_df.tail()

### Unemployment Rate

Source: singstat<br/>
observation: quarterly

- Expand the quarterly figures to monthly

In [None]:
u_df = pd.read_csv('../datasets/input/unemployment_singstat.csv')
u_df.tail()

In [None]:
s = u_df.iloc[0:1,0:].T.iloc[1:,0]

unemployed_df = pd.DataFrame()
unemployed_df['month'] = s.index
unemployed_df['unemployed_rate'] = s.values

unemployed_df.tail()

In [None]:
# split the date field to year and quarter

unemployed_df[['year','month']] = unemployed_df['month'].str.split(" ", n=1, expand=True)

# convert quarter to the starting month of the quarter in numeric form 

unemployed_df = unemployed_df.replace('1Q',1).replace('2Q',4).replace('3Q',7).replace('4Q',10)

# reorder the columns
unemployed_df = unemployed_df.reindex(['year','month','unemployed_rate'],axis=1)

# change datatype of numeric columns to integer
unemployed_df['year'] = unemployed_df['year'].astype(np.int32)
unemployed_df['month'] = unemployed_df['month'].astype(np.int32)

unemployed_df.dtypes

In [None]:
# DO NOT RUN more than once. Otherwise it will keep expanding the rows
unemployed_df = qtr_2_mth(unemployed_df)

unemployed_df.tail(15)

### SIBOR 

Source: abs.org.sg<br/>
Observation : business days

- Housing loans are commonly pegged to SIBOR (Singapore Interbank Offered Rate). There are several types of SIBOR, and the common ones for housing loans are 1-month and 3-month.

- Calculate the average rate for each month
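The monthly averaging can also be done for both tenors in a single `groupby`, as in the sketch below. The column names follow the SIBOR file used in this notebook, but the rates are made up, and the dates are assumed to be in day/month/year format with a four-digit year.

```python
import pandas as pd

# Hypothetical daily rates, for illustration only
daily = pd.DataFrame({
    "SIBOR DATE": ["02/01/2020", "03/01/2020", "03/02/2020"],
    "SIBOR 1M": [1.70, 1.72, 1.68],
    "SIBOR 3M": [1.76, 1.78, 1.74],
})

# Parse the dates (assumed dd/mm/yyyy), then average both tenors per calendar month
dates = pd.to_datetime(daily["SIBOR DATE"], format="%d/%m/%Y")
monthly = (
    daily.assign(year=dates.dt.year, month=dates.dt.month)
    .groupby(["year", "month"])[["SIBOR 1M", "SIBOR 3M"]]
    .mean()
    .reset_index()
    .rename(columns={"SIBOR 1M": "sibor_1m", "SIBOR 3M": "sibor_3m"})
)
print(monthly)
```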

In [None]:
sibor_df = pd.read_csv('../datasets/input/sibor_abs.csv')
sibor_df.tail()

In [None]:
# split the date to day, month, year
sibor_df[['day','month','year']] = sibor_df['SIBOR DATE'].str.split("/", n=2, expand=True)

# convert numeric columns to integer
sibor_df['month'] = sibor_df['month'].astype(np.int32)
sibor_df['year'] = sibor_df['year'].astype(np.int32)

# store sibor_3m average in a temp dataframe

df = sibor_df.groupby(['year','month'])['SIBOR 3M'].mean().reset_index()

# calculate average of sibor_1m by month

sibor_df = sibor_df.groupby(['year','month'])['SIBOR 1M'].mean().reset_index()
sibor_df

# merge the 1m and 3m rates into the original dataframe
sibor_df['sibor_3m'] = df['SIBOR 3M']

# rename column to lowercase and replace space with underscore
sibor_df.rename(columns={'SIBOR 1M':'sibor_1m'},inplace=True)

sibor_df.tail(15)

### Monthly Income

source: singstat<br/>
observation: yearly


I will get the median income of the population of each year and see how it affects house prices

In [None]:
ic_df = pd.read_csv('../datasets/input/monthly_income_singstat.csv')
ic_df.head()

In [None]:
# transpose first row of dataframe to get median monthly income

s = ic_df.iloc[0:1,0:].T.iloc[1:,0]

income_df = pd.DataFrame()
income_df['year'] = s.index
income_df['income'] = s.values
income_df.head()

In [None]:
# There is a missing value for income in 2005
# I will use the average of the incomes for 2004 and 2006

# remove commas from income
income_df = income_df.replace(",","",regex=True)

# get the incomes 
inc_2004 = int(income_df[income_df['year']=='2004'].values[0][1])
inc_2006 = int(income_df[income_df['year']=='2006'].values[0][1])

# update dataframe with the average
income_df.iloc[4,1] = (inc_2004+inc_2006)/2
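Filling a single gap with the mean of its two neighbours is what linear interpolation does, so the same result could be obtained without hard-coding the row position. The sketch below uses made-up incomes and assumes the missing value is already stored as NaN. A text placeholder would first need converting, for example with `pd.to_numeric(..., errors="coerce")`.

```python
import numpy as np
import pandas as pd

# Hypothetical median incomes with a gap in 2005
income = pd.DataFrame({"year": [2004, 2005, 2006],
                       "income": [2000.0, np.nan, 2400.0]})

# Linear interpolation fills the gap with the mean of 2004 and 2006
income["income"] = income["income"].interpolate(method="linear")
print(income)
```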

In [None]:
income_df

In [None]:
# change datatype of numeric columns to integer 
income_df['year'] = income_df['year'].astype(np.int32)
income_df['income'] = income_df['income'].astype(np.int32)

## 4. Cost Factors

### HDB Resale Price Index

Source: singstat<br/>
observation : quarterly

- Expand the quarterly figures to monthly

In [None]:
hi_df = pd.read_csv('../datasets/input/hdb_price_index_singstat.csv')
hi_df.tail()

In [None]:
# transpose first row of dataframe
s = hi_df.iloc[0:1,0:].T.iloc[1:,0]
hdb_index_df = pd.DataFrame()
hdb_index_df['month'] = s.index
hdb_index_df['hdb_index'] = s.values

# split the month into year and month
hdb_index_df[['year','month']] = hdb_index_df['month'].str.split(" ", n=1, expand=True)

# convert quarters to the first month of the quarter
hdb_index_df = hdb_index_df.replace('1Q',1).replace('2Q',4).replace('3Q',7).replace('4Q',10)

# convert numeric columns to integer
hdb_index_df['year'] = hdb_index_df['year'].astype(np.int32)
hdb_index_df['month'] = hdb_index_df['month'].astype(np.int32)

In [None]:
# reorder the columns
hdb_index_df = hdb_index_df.reindex(['year','month','hdb_index'],axis=1)

In [None]:
# DO NOT RUN more than once. Otherwise it will keep expanding the rows
hdb_index_df = qtr_2_mth(hdb_index_df)

In [None]:
hdb_index_df.tail(10)

### Private property price index

Source: data.gov.sg<br/>
observation: quarterly

- Expand the quarterly figures to monthly

In [None]:
pi_df = pd.read_csv('../datasets/input/private-residential-property-price-index.csv') 
pi_df.tail()

In [None]:
# split into condo and all private residential

condo_index_df = pi_df[pi_df['property_type']=='Non-Landed'].copy(deep=True)
pte_index_df = pi_df[pi_df['property_type']=='All Residential'].copy(deep=True)

In [None]:
# split quarter into year and month 
condo_index_df[['year','month']] = condo_index_df['quarter'].str.split("-", n=1, expand=True)
pte_index_df[['year','month']] = pte_index_df['quarter'].str.split("-", n=1, expand=True)

# drop unnecessary columns
# condo_index_df.drop(columns=['quarter','property_type'],inplace=True)

# rename column to meaningful name
condo_index_df.rename(columns={"index":"condo_index"},inplace=True)
pte_index_df.rename(columns={"index":"pte_index"},inplace=True)

# preparing for expanding quarterly figures to monthly
condo_index_df = condo_index_df.replace('Q1',1).replace('Q2',4).replace('Q3',7).replace('Q4',10)
pte_index_df = pte_index_df.replace('Q1',1).replace('Q2',4).replace('Q3',7).replace('Q4',10)

# convert numeric columns to integer
condo_index_df['year'] = condo_index_df['year'].astype(np.int32)
condo_index_df['month'] = condo_index_df['month'].astype(np.int32)
pte_index_df['year'] = pte_index_df['year'].astype(np.int32)
pte_index_df['month'] = pte_index_df['month'].astype(np.int32)

# reorder the columns
condo_index_df = condo_index_df.reindex(['year','month','condo_index'],axis=1)
pte_index_df = pte_index_df.reindex(['year','month','pte_index'],axis=1)

In [None]:
# Expand quarterly figures to monthly
# DO NOT RUN more than once. Otherwise it will keep expanding the rows
condo_index_df = qtr_2_mth(condo_index_df)
pte_index_df = qtr_2_mth(pte_index_df)

## 5. Points of Interest

### Primary and Secondary Schools

Source: data.gov.sg

- fetch the geocodes of schools based on their physical addresses

In [None]:
school_df = pd.read_csv('../datasets/input/general-information-of-schools.csv')

school_df = school_df[['school_name','address','mainlevel_code']].copy(deep=True)

school_df['school_name'] = school_df['school_name'].str.lower()
school_df['address'] = school_df['address'].str.lower()
school_df['mainlevel_code'] = school_df['mainlevel_code'].str.lower()

school_df['lat'] = 0
school_df['lng'] = 0

print('fetching lat/long and updating the dataframe .....')
#end = len(school_df)
#for i in range(end):
#    school_df.iloc[i,3], school_df.iloc[i,4] = get_lat_long(school_df.iloc[i,1])

#path = '../datasets/output/schools.csv'
#school_df.to_csv(path, index = False)

print('completed')

### Hawker Centres and Markets

Source: data.gov.sg

- fetch the geocodes of hawker centres based on their physical addresses

In [None]:
hc_df = pd.read_csv('../datasets/input/list-of-government-markets-hawker-centres.csv')
hc_df.head()

In [None]:
hc_df = hc_df[['name_of_centre','location_of_centre','type_of_centre']].copy(deep=True)

hc_df.rename(columns={'name_of_centre':'hawker_centre','location_of_centre':'address'},inplace=True)

hc_df['lat'] = 0
hc_df['lng'] = 0

hc_df.columns

print('fetching lat/long and updating the dataframe .....')
#end = len(hc_df)
#for i in range(end):
#    hc_df.iloc[i,3], hc_df.iloc[i,4] = get_lat_long(hc_df.iloc[i,1])

#path = '../datasets/output/hawker_centre_market.csv'
#hc_df.to_csv(path, index = False)

print('completed')

### MRT and LRT Stations

Source: https://www.kaggle.com/yxlee245/singapore-train-station-coordinates?select=mrt_lrt_data.csv

In [None]:
station_df = pd.read_csv('../datasets/input/datasets_287088_590207_mrt_lrt_data.csv')

station_df

### Shopping Malls

Source: wikipedia

- use Beautiful Soup to scrape the data from the web page
- fetch geocodes based on the malls' names (only works with the Google geocoding service)

In [None]:
# Use the requests library to get the html from the home page
res = requests.get('https://en.wikipedia.org/wiki/List_of_shopping_malls_in_Singapore')

# Create a soup object from the html
soup = BeautifulSoup(res.content, 'lxml')

In [None]:
malls = soup.find('div',{'class':'mw-parser-output'})

In [None]:
# create a new dataframe
df_malls = pd.DataFrame(columns=['mall','lat','lng'])

In [None]:
df_malls

In [None]:
# Loop through each <a> tag to get mall's name and get its geocode
# The code below will NOT run successfully because I have removed the Google API key from the function
# Only Google service can get the geocodes of the malls. The free Nominatim failed to get any geocode for malls

#i = 0
#for m in malls.find_all('a', {'class': 'new'}):
#    print(m.text)
    
#    lat, lng = get_pos_by_name(m.text + ' Singapore')
    
#    df_malls.loc[i] = [m.text,lat,lng]
    
#    i += 1

In [None]:
df_malls

In [None]:
#df_malls.to_csv('../datasets/output/malls.csv', index=False)

### Nature Parks

Source: data.gov.sg

- scrape the data from the geojson file provided

In [None]:
# create a new dataframe for nature parks
df_parks = pd.DataFrame(columns=['park','lat','lng'])

with open('../datasets/input/parks-geojson.geojson') as f:
    data = json.load(f)

i=0
for feature in data['features']:
    
    soup = BeautifulSoup(feature['properties']['Description'], 'lxml')
    
    table = soup.find('table')
       
    for line in table.findAll('tr'):
        
        name = re.findall(r"\bNAME\b",line.getText())
        
        if len(name)>0:
            park = re.findall(r"NAME\s([\S\s]*)",line.getText())
            
            df_parks.loc[i] = [park[0],feature['geometry']['coordinates'][1],feature['geometry']['coordinates'][0]]
            
            i += 1


In [None]:
df_parks

In [None]:
df_parks.to_csv('../datasets/output/parks.csv', index=False)

### Crematoria, Columbaria and Cemeteries

Source: data.gov.sg

- scrape the data from the geojson file provided

In [18]:

# create a new dataframe for crematoria
df_crematoria = pd.DataFrame(columns=['crematorium','lat','lng'])

with open('../datasets/input/crematoria-geojson.geojson') as f:
    data = json.load(f)

i=0
for feature in data['features']:
    
    soup = BeautifulSoup(feature['properties']['Description'], 'lxml')
    
    table = soup.find('table')
       
    for line in table.findAll('tr'):
        
        name = re.findall(r"\bNAME\b",line.getText())
        
        if len(name)>0:
            crema = re.findall(r"NAME\s([\S\s]*)",line.getText())
            
            df_crematoria.loc[i] = [crema[0],feature['geometry']['coordinates'][1],feature['geometry']['coordinates'][0]]
            
            i += 1


In [19]:
df_crematoria

| | crematorium | lat | lng |
|---|---|---|---|
| 0 | Kong Meng San Phor Kark See Monastery (Bright ... | 1.361505 | 103.835805 |
| 1 | Tze Tho Aum Temple | 1.36186 | 103.838299 |
| 2 | Choa Chu Kang Crematorium | 1.370578 | 103.686983 |
| 3 | Mandai Crematorium | 1.413912 | 103.809632 |


In [None]:
df_crematoria.to_csv('../datasets/output/crematoria.csv', index=False)

In [16]:
# create a new dataframe for columbaria
df_columbaria = pd.DataFrame(columns=['columbaria','lat','lng'])

with open('../datasets/input/dedicated-columbaria-geojson.geojson') as f:
    data = json.load(f)

i=0
for feature in data['features']:
    
    soup = BeautifulSoup(feature['properties']['Description'], 'lxml')
    
    table = soup.find('table')
       
    for line in table.findAll('tr'):
        
        name = re.findall(r"\bNAME\b",line.getText())
        
        if len(name)>0:
            colum = re.findall(r"NAME\s([\S\s]*)",line.getText())
            
            df_columbaria.loc[i] = [colum[0],feature['geometry']['coordinates'][1],feature['geometry']['coordinates'][0]]
            
            i += 1


In [21]:
df_columbaria.to_csv('../datasets/output/columbaria.csv', index=False)

In [23]:
# create a new dataframe for cemeteries
df_cemeteries = pd.DataFrame(columns=['cemetery','lat','lng'])

with open('../datasets/input/active-cemeteries-geojson.geojson') as f:
    data = json.load(f)

i=0
for feature in data['features']:
    
    soup = BeautifulSoup(feature['properties']['Description'], 'lxml')
    
    table = soup.find('table')
       
    for line in table.findAll('tr'):
        
        name = re.findall(r"\bNAME\b",line.getText())
        
        if len(name)>0:
            cem = re.findall(r"NAME\s([\S\s]*)",line.getText())
            
            df_cemeteries.loc[i] = [cem[0],feature['geometry']['coordinates'][1],feature['geometry']['coordinates'][0]]
            
            i += 1



In [24]:
df_cemeteries

| | cemetery | lat | lng |
|---|---|---|---|
| 0 | Chua Chu Kang Ahmadiyya Jama'at Cemetery | 1.3694 | 103.688042 |
| 1 | Chua Chu Kang Bahai Cemetery | 1.374913 | 103.692977 |
| 2 | Chua Chu Kang Chinese Cemetery | 1.381858 | 103.68639 |
| 3 | Chua Chu Kang Christian Cemetery | 1.373497 | 103.689689 |
| 4 | Chua Chu Kang Hindu Cemetery | 1.369422 | 103.685886 |
| 5 | Chua Chu Kang Jewish Cemetery | 1.371647 | 103.699983 |
| 6 | Chua Chu Kang Muslim Cemetery | 1.383451 | 103.687554 |
| 7 | Chua Chu Kang Parsi Cemetery | 1.371738 | 103.699554 |
| 8 | Lawn Cemetery | 1.373497 | 103.689689 |
| 9 | State Cemetery, Kranji | 1.41948 | 103.757139 |


In [25]:
df_cemeteries.to_csv('../datasets/output/cemeteries.csv', index=False)
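
The parks, crematoria, columbaria and cemeteries cells above all repeat the same loop, so it could be factored into one helper. The sketch below is one possible version using only the standard library. It assumes each feature's `Description` is an HTML table whose rows pair a `<th>` key with a `<td>` value, and that the geometry is a point stored as `[lng, lat, ...]`. The names `AttributeTableParser` and `extract_points` are hypothetical, and the sample feature reuses the Mandai Crematorium row shown earlier with a simplified table.

```python
from html.parser import HTMLParser


class AttributeTableParser(HTMLParser):
    """Collect <th>key</th> <td>value</td> pairs from an HTML table."""

    def __init__(self):
        super().__init__()
        self.attributes = {}
        self._cell = None   # 'th' or 'td' while inside such a cell
        self._key = None    # text of the most recent <th>
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag in ('th', 'td'):
            self._cell = tag
            self._text = []

    def handle_data(self, data):
        if self._cell is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag != self._cell:
            return
        text = ''.join(self._text).strip()
        if tag == 'th':
            self._key = text
        elif self._key is not None:
            self.attributes[self._key] = text
            self._key = None
        self._cell = None


def extract_points(geojson, name_field='NAME'):
    """Return one {'name', 'lat', 'lng'} dict per feature that has a name."""
    rows = []
    for feature in geojson['features']:
        parser = AttributeTableParser()
        parser.feed(feature['properties']['Description'])
        parser.close()
        name = parser.attributes.get(name_field)
        if name is None:
            continue
        lng, lat = feature['geometry']['coordinates'][:2]
        rows.append({'name': name, 'lat': lat, 'lng': lng})
    return rows


# Simplified sample feature; the table layout is an assumption.
sample = {
    'features': [{
        'properties': {
            'Description': '<table><tr><th colspan="2"><em>Attributes</em></th></tr>'
                           '<tr><th>NAME</th> <td>Mandai Crematorium</td></tr></table>'
        },
        'geometry': {'type': 'Point', 'coordinates': [103.809632, 1.413912, 0.0]},
    }]
}
print(extract_points(sample))
```

Each of the four cells would then reduce to loading the file with `json.load`, calling `extract_points(data)`, and passing the result to `pd.DataFrame`.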

### Sports Facilities

Source: data.gov.sg

- scrape the data from the geojson file provided
- convert categorical data to dummy variables

In [None]:
# create a new dataframe for sports facilities
df_sports = pd.DataFrame(columns=['road','facilities','lat','lng'])

# Create an empty list to hold all facilities found
facilities_list = []

with open('../datasets/input/sportsg-sport-facilities-geojson.geojson') as f:
    data = json.load(f)

i=0
for feature in data['features']:
    
    soup = BeautifulSoup(feature['properties']['Description'], 'lxml')
    
    table = soup.find('table')
       
    for line in table.findAll('tr'):
        
        
        fa = re.findall(r"\bFACILITIES\b",line.getText())
        if len(fa)>0:
            fac = re.findall(r"FACILITIES\s([\S\s]*)",line.getText())
            
            # add to facilities_list
            facilities_list.extend(fac[0].lower().strip().split('/'))
        
        
        rd = re.findall(r"\bROAD_NAME\b",line.getText())
        if len(rd)>0:
            road = re.findall(r"ROAD_NAME\s([\S\s]*)",line.getText())
            
            #print(road[0])
            
            
            gp = feature['geometry']['coordinates']

        
    # keep the feature only if both fields were found and the first vertex has a valid latitude
    if len(fac)>0 and len(road)>0 and gp[0][0][1]>0:
        df_sports.loc[i] = [road[0],fac[0].lower().strip(),gp[0][0][1],gp[0][0][0]]

    i += 1

In [None]:
# facilities_list now holds every facility type found, with many duplicates,
# so strip the whitespace and use a set to keep only the unique types
f_list=[]
for i in facilities_list:
    f_list.append(i.strip())
    
facilities_set = set(f_list)
print(facilities_set)

In [None]:
# Now create the dummy variables, one for each type of facility
for i in facilities_set:
    df_sports[i] = 0

for index, row in df_sports.iterrows():
    f_list = row['facilities'].strip().split('/') # split the facilities string
    
    for j in f_list:
        j = j.strip() # match the stripped names stored in facilities_set
        # set corresponding column to 1
        if j in facilities_set:
            df_sports.loc[index,j]=1

In [None]:
df_sports

In [None]:
df_sports.rename(columns={'gateball & petanque courts':'gateball_petanque_courts',
                'swimming complex':'swimming_complex','squash centre':'squash_centre',
                'netball centre':'netball_centre','lawn bowl':'lawn_bowl','sports hall':'sports_hall',
                'futsal court':'futsal_court','tennis centre':'tennis_centre',
                'practice track':'practice_track','hockey pitch':'hockey_pitch'
                },inplace=True)

In [None]:
df_sports.drop(labels='facilities',axis=1,inplace=True)

In [None]:
df_sports.head()

In [None]:
df_sports.to_csv('../datasets/output/sports.csv', index=False)
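
The dummy-variable cells above can also be written more compactly with pandas' `Series.str.get_dummies`, which splits a delimited string column into 0/1 columns. This is an alternative to the notebook's loop, not the method it uses. The sketch assumes the facilities are separated by `/`, possibly with stray spaces, and the two sample rows (including the road names) are made up for illustration.

```python
import pandas as pd

# Made-up sample rows in the same shape as df_sports before encoding.
df = pd.DataFrame({
    'road': ['road a', 'road b'],
    'facilities': ['swimming complex / sports hall', 'sports hall/tennis centre'],
})

# Normalise case and remove whitespace around the separator.
cleaned = (df['facilities']
           .str.lower()
           .str.strip()
           .str.replace(r'\s*/\s*', '/', regex=True))

# One 0/1 column per facility type.
dummies = cleaned.str.get_dummies(sep='/')

# Replace spaces and punctuation in the column names with underscores.
dummies.columns = dummies.columns.str.replace(r'\W+', '_', regex=True)

result = pd.concat([df.drop(columns='facilities'), dummies], axis=1)
print(result)
```

Renaming the columns with a regular expression avoids having to list every facility type by hand, which the `rename` cell above has to do.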