In [139]:
# import packages

import pandas as pd

# Model Prediction Task I

You are tasked with building a predictive algorithm to determine the factors affecting the prices of residential properties in Singapore. You need to provide insights to your reporting officer, detailing one or more strategies for curbing housing price inflation.

Your colleagues should be able to access and contribute to your code and replicate the same insights. Their local devices do not have GPU access. Provide justification for every choice you make.

### Exploratory Data Analysis

In [172]:
approval_1990_1999 = pd.read_csv("data/resale-flat-prices-based-on-approval-date-1990-1999.csv")
approval_2000_2012 = pd.read_csv("data/resale-flat-prices-based-on-approval-date-2000-feb-2012.csv")
registration_2012_2014 = pd.read_csv("data/resale-flat-prices-based-on-registration-date-from-mar-2012-to-dec-2014.csv")
registration_2015_2016 = pd.read_csv("data/resale-flat-prices-based-on-registration-date-from-jan-2015-to-dec-2016.csv")
registration_2017 = pd.read_csv("data/resale-flat-prices-based-on-registration-date-from-jan-2017-onwards.csv")

In [173]:
print(f"resale-flat-prices-based-on-approval-date-1990-1999 with shape {approval_1990_1999.shape}\n", approval_1990_1999.head())
print(f"\nresale-flat-prices-based-on-approval-date-2000-feb-2012 {approval_2000_2012.shape}\n", approval_2000_2012.head())

resale-flat-prices-based-on-approval-date-1990-1999 with shape (287200, 10)
      month        town flat_type block       street_name storey_range  \
0  1990-01  ANG MO KIO    1 ROOM   309  ANG MO KIO AVE 1     10 TO 12   
1  1990-01  ANG MO KIO    1 ROOM   309  ANG MO KIO AVE 1     04 TO 06   
2  1990-01  ANG MO KIO    1 ROOM   309  ANG MO KIO AVE 1     10 TO 12   
3  1990-01  ANG MO KIO    1 ROOM   309  ANG MO KIO AVE 1     07 TO 09   
4  1990-01  ANG MO KIO    3 ROOM   216  ANG MO KIO AVE 1     04 TO 06   

   floor_area_sqm      flat_model  lease_commence_date  resale_price  
0            31.0        IMPROVED                 1977          9000  
1            31.0        IMPROVED                 1977          6000  
2            31.0        IMPROVED                 1977          8000  
3            31.0        IMPROVED                 1977          6000  
4            73.0  NEW GENERATION                 1976         47200  

resale-flat-prices-based-on-approval-date-2000-feb-2012 (

In [174]:
print(f"resale-flat-prices-based-on-registration-date-from-mar-2012-to-dec-2014 with shape {registration_2012_2014.shape}\n", registration_2012_2014.head())
print(f"\nresale-flat-prices-based-on-registration-date-from-jan-2015-to-dec-2016 {registration_2015_2016.shape}\n", registration_2015_2016.head())
print(f"\nresale-flat-prices-based-on-registration-date-from-jan-2017-onwards.csv {registration_2017.shape}\n", registration_2017.head())

resale-flat-prices-based-on-registration-date-from-mar-2012-to-dec-2014 with shape (52203, 10)
      month        town flat_type block        street_name storey_range  \
0  2012-03  ANG MO KIO    2 ROOM   172   ANG MO KIO AVE 4     06 TO 10   
1  2012-03  ANG MO KIO    2 ROOM   510   ANG MO KIO AVE 8     01 TO 05   
2  2012-03  ANG MO KIO    3 ROOM   610   ANG MO KIO AVE 4     06 TO 10   
3  2012-03  ANG MO KIO    3 ROOM   474  ANG MO KIO AVE 10     01 TO 05   
4  2012-03  ANG MO KIO    3 ROOM   604   ANG MO KIO AVE 5     06 TO 10   

   floor_area_sqm      flat_model  lease_commence_date  resale_price  
0            45.0        Improved                 1986      250000.0  
1            44.0        Improved                 1980      265000.0  
2            68.0  New Generation                 1980      315000.0  
3            67.0  New Generation                 1984      320000.0  
4            67.0  New Generation                 1980      321000.0  

resale-flat-prices-based-on-regi

Judging by the names of the 5 CSV files, two distinct datasets can be created by combining files:

1. resale-flat-prices-based-on-approval-date
    - resale-flat-prices-based-on-approval-date-1990-1999.csv
    - resale-flat-prices-based-on-approval-date-2000-feb-2012.csv
2. resale-flat-prices-based-on-registration-date
    - resale-flat-prices-based-on-registration-date-from-mar-2012-to-dec-2014.csv
    - resale-flat-prices-based-on-registration-date-from-jan-2015-to-dec-2016.csv
    - resale-flat-prices-based-on-registration-date-from-jan-2017-onwards.csv

However, some data manipulation is required first: data types and value formats differ between the files covering different years, and some files introduce new variables. Before any files are combined, the differences between them are analysed in more detail to decide the final data types and how each variable should be standardised.
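One way to spot new or missing variables before combining files is to tabulate which columns appear in which dataset. The following is a minimal sketch on toy frames, not the notebook's own method; the helper `compare_columns` and the column `extra_variable` are hypothetical names introduced only for illustration.

```python
import pandas as pd

def compare_columns(frames):
    """Return a boolean table: rows are column names, columns are dataset names."""
    # Union of every column name seen in any frame, in a stable order
    all_columns = sorted(set().union(*(df.columns for df in frames.values())))
    presence = {
        name: [col in df.columns for col in all_columns]
        for name, df in frames.items()
    }
    return pd.DataFrame(presence, index=all_columns)

# Toy frames standing in for the real CSVs; "extra_variable" is hypothetical
older = pd.DataFrame({"month": ["2012-03"], "resale_price": [250000.0]})
newer = pd.DataFrame(
    {"month": ["2017-01"], "resale_price": [232000.0], "extra_variable": ["x"]}
)

presence = compare_columns({"older": older, "newer": newer})
print(presence)
# Variables that are missing from at least one dataset
print(presence[~presence.all(axis=1)])
```

Passing the real frames in a dict (for example keyed by their variable names) would give the same kind of table for the five files.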

#### Create approval_df 
using resale-flat-prices-based-on-approval-date-1990-1999.csv and resale-flat-prices-based-on-approval-date-2000-feb-2012.csv

In [143]:
"""
Input: <list> list of dataframes
       <list> list of str of dataframe names
Description: iterates through the dataframes to compare each variable's datatypes and returns the comparison
Output: <DataFrame> dataframe comparing variables and datatypes
"""

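def get_datatypes(dfs, df_names):
    # NOTE: df_names is accepted for labelling but is not used yet; the merge
    # suffixes below are hard-coded for the two approval datasets.
    df_datatypes = pd.DataFrame()
    for i, df in enumerate(dfs):
        # record the Python type of the first value in each column;
        # .iloc[0] is positional, so it works even if the index has no label 0
        col_datatypes = {col: [type(df[col].iloc[0])] for col in df.columns}
        col_df = pd.DataFrame.from_dict(col_datatypes, orient='index').reset_index()
        col_df.columns = ["variable", "data type"]
        if i == 0:
            df_datatypes = col_df
        else:
            # inner join on the variable name: only variables present in both frames are kept
            df_datatypes = pd.merge(df_datatypes, col_df, on="variable", suffixes=('_1990_1999', '_2000_2012'))

    return df_datatypes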
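A more compact alternative, sketched here under assumptions, compares the column dtypes reported by pandas instead of the Python type of the first value. The helper name `compare_dtypes` and the toy frames are hypothetical. Note that string columns are reported by their pandas dtype here (typically `object`, depending on the pandas version) and not as `<class 'str'>`.

```python
import pandas as pd

def compare_dtypes(frames):
    """Build a table of column dtypes, one column per named DataFrame."""
    # DataFrame.dtypes is a Series indexed by column name; the frames are
    # aligned on that index, so a variable missing from a frame shows as NaN
    return pd.DataFrame({name: df.dtypes.astype(str) for name, df in frames.items()})

# Toy frames mimicking the int64 vs float64 difference in resale_price
a = pd.DataFrame({"flat_model": ["IMPROVED"], "resale_price": [9000]})
b = pd.DataFrame({"flat_model": ["Improved"], "resale_price": [250000.0]})

dtype_table = compare_dtypes({"a": a, "b": b})
# Keep only the variables whose dtype differs between the two frames
mismatched = dtype_table[dtype_table["a"] != dtype_table["b"]]
print(mismatched)
```

Because the dict keys become the column labels, this version scales to any number of frames without hard-coded suffixes.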


In [144]:
approval_dfs = [approval_1990_1999, approval_2000_2012]
get_datatypes(approval_dfs, ["approval_1990_1999", "approval_2000_2012"])

| | variable | data type_1990_1999 | data type_2000_2012 |
|---|---|---|---|
| 0 | month | `<class 'str'>` | `<class 'str'>` |
| 1 | town | `<class 'str'>` | `<class 'str'>` |
| 2 | flat_type | `<class 'str'>` | `<class 'str'>` |
| 3 | block | `<class 'str'>` | `<class 'str'>` |
| 4 | street_name | `<class 'str'>` | `<class 'str'>` |
| 5 | storey_range | `<class 'str'>` | `<class 'str'>` |
| 6 | floor_area_sqm | `<class 'numpy.float64'>` | `<class 'numpy.float64'>` |
| 7 | flat_model | `<class 'str'>` | `<class 'str'>` |
| 8 | lease_commence_date | `<class 'numpy.int64'>` | `<class 'numpy.int64'>` |
| 9 | resale_price | `<class 'numpy.int64'>` | `<class 'numpy.float64'>` |


In [145]:
print(f"resale-flat-prices-based-on-approval-date-1990-1999 with shape {approval_1990_1999.shape}\n", approval_1990_1999.head())
print(f"\nresale-flat-prices-based-on-approval-date-2000-feb-2012 {approval_2000_2012.shape}\n", approval_2000_2012.head())

resale-flat-prices-based-on-approval-date-1990-1999 with shape (287200, 10)
      month        town flat_type block       street_name storey_range  \
0  1990-01  ANG MO KIO    1 ROOM   309  ANG MO KIO AVE 1     10 TO 12   
1  1990-01  ANG MO KIO    1 ROOM   309  ANG MO KIO AVE 1     04 TO 06   
2  1990-01  ANG MO KIO    1 ROOM   309  ANG MO KIO AVE 1     10 TO 12   
3  1990-01  ANG MO KIO    1 ROOM   309  ANG MO KIO AVE 1     07 TO 09   
4  1990-01  ANG MO KIO    3 ROOM   216  ANG MO KIO AVE 1     04 TO 06   

   floor_area_sqm      flat_model  lease_commence_date  resale_price  
0            31.0        IMPROVED                 1977          9000  
1            31.0        IMPROVED                 1977          6000  
2            31.0        IMPROVED                 1977          8000  
3            31.0        IMPROVED                 1977          6000  
4            73.0  NEW GENERATION                 1976         47200  

resale-flat-prices-based-on-approval-date-2000-feb-2012 (

The 1990_1999 and 2000_2012 datasets have exactly the same variables and data types, except for resale_price: the former uses int64 and the latter float64. Since a resale price can in principle include cents, float64 is used for both. The 1990_1999 dataset writes all string values in upper case, while the 2000_2012 dataset does not do so for flat_model. To standardise, flat_model in 2000_2012 is converted to upper case before it is combined with 1990_1999.
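Whether resale_price actually contains cents can be checked before settling on float64. The following is a minimal sketch on toy values; the helper `has_fractional_values` is a hypothetical name and not part of the notebook.

```python
import pandas as pd

def has_fractional_values(series):
    """Return True if any non-missing value has a non-zero fractional part."""
    # Coerce to numeric so stray strings become NaN, then drop missing values
    values = pd.to_numeric(series, errors="coerce").dropna()
    return bool((values % 1 != 0).any())

# Toy price series; the real check would use a resale_price column
whole = pd.Series([250000.0, 265000.0])
with_cents = pd.Series([250000.0, 265000.5])

print(has_fractional_values(whole))
print(has_fractional_values(with_cents))
```

If the check returned False on every file, an integer dtype would also be a defensible choice; float64 remains the safer common type when the files disagree.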

In [170]:
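def create_approval_df(approval_1990_1999, approval_2000_2012):
    # cast resale_price to float so both frames share one dtype;
    # assign() returns a new frame, so the input frames are not modified
    older = approval_1990_1999.assign(resale_price=approval_1990_1999["resale_price"].astype(float))
    # convert flat_model to upper case to match the 1990-1999 style
    newer = approval_2000_2012.assign(flat_model=approval_2000_2012["flat_model"].str.upper())
    # stack the two frames; the original row indices are kept,
    # so index labels repeat across the two source frames
    approval_df = pd.concat([older, newer])

    return approval_df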

In [171]:
approval_df = create_approval_df(approval_1990_1999, approval_2000_2012)
approval_df

| | month | town | flat_type | block | street_name | storey_range | floor_area_sqm | flat_model | lease_commence_date | resale_price |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1990-01 | ANG MO KIO | 1 ROOM | 309 | ANG MO KIO AVE 1 | 10 TO 12 | 31.0 | IMPROVED | 1977 | 9000.0 |
| 1 | 1990-01 | ANG MO KIO | 1 ROOM | 309 | ANG MO KIO AVE 1 | 04 TO 06 | 31.0 | IMPROVED | 1977 | 6000.0 |
| 2 | 1990-01 | ANG MO KIO | 1 ROOM | 309 | ANG MO KIO AVE 1 | 10 TO 12 | 31.0 | IMPROVED | 1977 | 8000.0 |
| 3 | 1990-01 | ANG MO KIO | 1 ROOM | 309 | ANG MO KIO AVE 1 | 07 TO 09 | 31.0 | IMPROVED | 1977 | 6000.0 |
| 4 | 1990-01 | ANG MO KIO | 3 ROOM | 216 | ANG MO KIO AVE 1 | 04 TO 06 | 73.0 | NEW GENERATION | 1976 | 47200.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 369646 | 2012-02 | YISHUN | 5 ROOM | 212 | YISHUN ST 21 | 10 TO 12 | 121.0 | IMPROVED | 1985 | 476888.0 |
| 369647 | 2012-02 | YISHUN | 5 ROOM | 758 | YISHUN ST 72 | 01 TO 03 | 122.0 | IMPROVED | 1986 | 490000.0 |
| 369648 | 2012-02 | YISHUN | 5 ROOM | 873 | YISHUN ST 81 | 01 TO 03 | 122.0 | IMPROVED | 1988 | 488000.0 |
| 369649 | 2012-02 | YISHUN | EXECUTIVE | 664 | YISHUN AVE 4 | 07 TO 09 | 181.0 | APARTMENT | 1992 | 705000.0 |


#### Create registration_df 
using resale-flat-prices-based-on-registration-date-from-mar-2012-to-dec-2014.csv, resale-flat-prices-based-on-registration-date-from-jan-2015-to-dec-2016.csv and resale-flat-prices-based-on-registration-date-from-jan-2017-onwards.csv

# Model Prediction Task II

Since we are building a predictive algorithm, the categorical variables (town, storey_range, flat_model) will be transformed using one-hot encoding, because most machine-learning algorithms expect numerical inputs and cannot use raw string categories directly.
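The encoding step described above could look like the following minimal sketch, which uses `pd.get_dummies` on a toy frame. The toy rows are made up for illustration, and the choice of `get_dummies` over other encoders is an assumption, not something the notebook specifies.

```python
import pandas as pd

# Toy frame with the three categorical variables plus one numeric column
toy = pd.DataFrame({
    "town": ["ANG MO KIO", "YISHUN", "ANG MO KIO"],
    "storey_range": ["10 TO 12", "01 TO 03", "04 TO 06"],
    "flat_model": ["IMPROVED", "APARTMENT", "NEW GENERATION"],
    "floor_area_sqm": [31.0, 181.0, 73.0],
})

categorical_cols = ["town", "storey_range", "flat_model"]
# Each category becomes its own 0/1 column named "<variable>_<category>";
# the original categorical columns are dropped, numeric columns are untouched
encoded = pd.get_dummies(toy, columns=categorical_cols, dtype=int)
print(encoded.columns.tolist())
```

One-hot encoding adds one column per distinct category, so high-cardinality variables widen the feature matrix considerably; that cost is worth keeping in mind on devices without GPU access.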

# Link Analysis Task I

# Link Analysis Task II

# Bonus Question: Link Analysis Task III 