## Libraries
<hr>
The libraries used are pandas, numpy, seaborn, Counter, matplotlib, axes3d, linearregression.

In [2]:
# Data analysis
import pandas as pd
import numpy as np
import seaborn as sns
import scipy.stats as stats
import sklearn
from collections import Counter

# Visualization
%matplotlib inline 
from IPython.display import set_matplotlib_formats
set_matplotlib_formats('png2x','pdf')
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import axes3d
from IPython.display import IFrame
import geocoder

# machine learning library
from sklearn import datasets, linear_model, cross_validation
from sklearn.model_selection import train_test_split, KFold, cross_val_score, cross_val_predict, ShuffleSplit
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.metrics import mean_absolute_error
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import SGDRegressor
from sklearn.preprocessing import PolynomialFeatures
from sklearn.ensemble import AdaBoostRegressor
from sklearn.ensemble import GradientBoostingRegressor
import xgboost as xgb
from sklearn.neural_network import MLPRegressor

# other
import time

pd.options.mode.chained_assignment = None #SettingWithCopyWarning for confusing chained assignment disabled

## 1 Data Acquisition
<hr>
The data for this report is acquired from the Singapore government website [1]. Data are collected from the period 1990 until January 2018. The data is provided in four seperate files, which will be merged into Python. The third file (> 20 MB) was seperated into periods of 2006-2012 and 2012-2014. This was necessary to make use of the Github repository.

In [3]:
# Load datasets
data1 = pd.read_csv('sg-resale-flat-prices-1990-1999.csv', sep =',')
data2 = pd.read_csv('sg-resale-flat-prices-2000-2005.csv', sep =',')
data3 = pd.read_csv('sg-resale-flat-prices-2006-2012.csv', sep =',')
data4 = pd.read_csv('sg-resale-flat-prices-2012-2014.csv', sep =',')
data5 = pd.read_csv('sg-resale-flat-prices-2014-2018.csv', sep =',')

In [4]:
data5 = data5.drop('remaining_lease',1)
print('Number of training data =', data5.shape)

Number of training data = (58631, 10)


In [5]:
#concatenate dataset
sets = [data1, data2, data3, data4, data5]
data = pd.concat(sets)
print('Number of training data =', data.shape)


Number of training data = (768629, 10)


## 3 Cleaning and Preprocessing the Dataset
<hr>
The cleaning and preprocessing section are divided into four sections, namely cleaning, encoding, feature engineering and one hot encoding.

### 3.1 Data Cleaning
During the exploration, there are some cleaning that should be performed. First, the flat types consist of eight types, which should be seven types instead. The flat type "Multi Generation" has a unique value with a space in between and one with a hyphen. Second, the flat models consist of 32 models, which should be 21 instead. This is also because of the capital usage. These doubles are removed by cleaning the data. Third, some storey range values in data set 4 are not correct. In the exploration, we have research the difference and since *** these values will be too different, these data points have been removed from the data set. Fourth, the feature "month" consists of sales year and sales month, e.g. 1990-01. To include the years and months in the model, this variable will be seperated to a variable called sales year and a variable called sales month. 


In [6]:
pd.options.mode.chained_assignment = None #SettingWithCopyWarning for confusing chained assignment disabled

#remove doubles
data['flat_type'][data['flat_type'] == 'MULTI-GENERATION'] = 'MULTI GENERATION'

#flat_type count
count_flat_type = data['flat_type'].nunique()
print("Total Flat Type Count:", count_flat_type)
flat_type_count = data['flat_type'].value_counts()
print("Flat Type \n" +str(flat_type_count))

Total Flat Type Count: 7
Flat Type 
4 ROOM              285136
3 ROOM              258482
5 ROOM              156260
EXECUTIVE            58177
2 ROOM                8859
1 ROOM                1246
MULTI GENERATION       469
Name: flat_type, dtype: int64


In [7]:
#remove doubles
data['flat_model'][data['flat_model'] == 'MODEL A'] = 'Model A'
data['flat_model'][data['flat_model'] == 'IMPROVED'] = 'Improved'
data['flat_model'][data['flat_model'] == 'NEW GENERATION'] = 'New Generation'
data['flat_model'][data['flat_model'] == 'PREMIUM APARTMENT'] = 'Premium Apartment'
data['flat_model'][data['flat_model'] == 'SIMPLIFIED'] = 'Simplified'
data['flat_model'][data['flat_model'] == 'STANDARD'] = 'Standard'
data['flat_model'][data['flat_model'] == 'APARTMENT'] = 'Apertment'
data['flat_model'][data['flat_model'] == 'MAISONETTE'] = 'Maisonette'
data['flat_model'][data['flat_model'] == 'ADJOINED FLAT'] = 'Adjoined flat'
data['flat_model'][data['flat_model'] == 'MODEL A-MAISONETTE'] = 'Model A-Maisonette'
data['flat_model'][data['flat_model'] == 'TERRACE'] = 'Terrace'
data['flat_model'][data['flat_model'] == 'MULTI GENERATION'] = 'Multi Generation'
data['flat_model'][data['flat_model'] == 'IMPROVED-MAISONETTE'] = 'Improved-Maisonette'
data['flat_model'][data['flat_model'] == '2-ROOM'] = '2-room'

#flat_model count
count_flat_model = data['flat_model'].nunique()
print("Total Flat Model Count:", count_flat_model)
flat_model_count = data['flat_model'].value_counts()
print("Flat Model Count \n" +str(flat_model_count))

Total Flat Model Count: 21
Flat Model Count 
Model A                   208633
Improved                  202602
New Generation            169643
Simplified                 51604
Standard                   38234
Premium Apartment          28886
Maisonette                 25136
Apartment                  19745
Apertment                   9901
Model A2                    8382
Adjoined flat               1913
Model A-Maisonette          1784
Terrace                      609
DBSS                         601
Multi Generation             469
Type S1                      183
Improved-Maisonette          105
Type S2                       80
Premium Maisonette            75
2-room                        38
Premium Apartment Loft         6
Name: flat_model, dtype: int64


In [8]:
#remove storey range outliers
#data = data.ix[data['storey_range'].isin(['01 TO 05','06 TO 10','11 TO 15','16 TO 20','21 TO 25','26 TO 30','31 TO 35','36 TO 40'])]
data = data.loc[data['storey_range'].isin(['01 TO 03','04 TO 06','07 TO 09','10 TO 12','13 TO 15','16 TO 18','19 TO 21','22 TO 24','25 TO 27','28 TO 30','31 TO 33','34 TO 36','37 TO 39','40 TO 42','43 TO 45','46 TO 48','49 TO 51'])]

#storey range count
count_storey_range = data['storey_range'].nunique()
print("Total Storey Range Count:", count_storey_range)
storey_range_count = data['storey_range'].value_counts()
print("Storey Range Count \n" +str(storey_range_count))

Total Storey Range Count: 17
Storey Range Count 
04 TO 06    196169
07 TO 09    177012
01 TO 03    158446
10 TO 12    149470
13 TO 15     46780
16 TO 18     16906
19 TO 21      8337
22 TO 24      5233
25 TO 27      2100
28 TO 30       788
34 TO 36       151
31 TO 33       151
37 TO 39       148
40 TO 42        73
46 TO 48        11
43 TO 45        11
49 TO 51         5
Name: storey_range, dtype: int64


In [9]:
#add sales year variable
if ('sales_year' not in data.columns):
    data.insert(1,'sales_year',(pd.DatetimeIndex(data['month']).year))

#add sales year variable
if ('sales_month' not in data.columns):
    data.insert(1,'sales_month',(pd.DatetimeIndex(data['month']).month))
    
#add sales year variable
if ('month' in data.columns):
    del data['month']

Note that two variables are not going to be used, which are respectively block and street name.

In [10]:
#remove unnecessary variables
data = data.drop('block',1)
data = data.drop('street_name',1)

data.head(5)

Unnamed: 0,sales_month,sales_year,town,flat_type,storey_range,floor_area_sqm,flat_model,lease_commence_date,resale_price
0,1,1990,ANG MO KIO,1 ROOM,10 TO 12,31.0,Improved,1977,9000.0
1,1,1990,ANG MO KIO,1 ROOM,04 TO 06,31.0,Improved,1977,6000.0
2,1,1990,ANG MO KIO,1 ROOM,10 TO 12,31.0,Improved,1977,8000.0
3,1,1990,ANG MO KIO,1 ROOM,07 TO 09,31.0,Improved,1977,6000.0
4,1,1990,ANG MO KIO,3 ROOM,04 TO 06,73.0,New Generation,1976,47200.0


### 3.2 Encoding
When using categorical data, strings are not able to be interpreted by algorithms. Therefore, these values needs to be translated to a numerical value. For example, the towns in the dataset will be translated to $1,2,…,n$. Since there are rows in our dataset containing characters. These rows (town, flat type, flat model and storey range) are transformed into dummy variables to clarify their levels, with other words, to quantify the qualitative data. 

In [11]:
#Note that data.copy() is used to make a copy of data, which will be used for analysis
data_enc = data.copy() 

In [12]:
#dummies for town
town_array = np.unique(data['town'])
n = len(town_array)

for i in range(0,n):
    data_enc['town'][data['town'] == town_array[i]] = i+1

#count_town = data['town'].nunique()
#print("Total Town Count:", count_town)
#town_count = data['town'].value_counts()
#print("Town Count \n" +str(town_count))

In [13]:
#dummies for flat types
data_enc['flat_type'][data.flat_type == '1 ROOM'] = 1
data_enc['flat_type'][data.flat_type == '2 ROOM'] = 2
data_enc['flat_type'][data.flat_type == '3 ROOM'] = 3
data_enc['flat_type'][data.flat_type == '4 ROOM'] = 4
data_enc['flat_type'][data.flat_type == '5 ROOM'] = 5
data_enc['flat_type'][data.flat_type == 'MULTI GENERATION'] = 6
data_enc['flat_type'][data.flat_type == 'EXECUTIVE'] = 7

#flat_type_count = data['flat_type'].value_counts()
#print("Flat Type \n" +str(flat_type_count))

In [14]:
#dummies for storey ranges 
storey_range_array = np.unique(data['storey_range'])
n = len(storey_range_array)

for i in range(0,n):
    data_enc['storey_range'][data['storey_range'] == storey_range_array[i]] = i+1

#count_storey_range = data['storey_range'].nunique()
#print("Total Storey Range Count:", count_storey_range)
#storey_range_count = data['storey_range'].value_counts()
#print("Storey Range Count \n" +str(storey_range_count))

In [15]:
#dummies for flat models
flat_model_array = np.unique(data['flat_model'])
n = len(flat_model_array)

for i in range(0,n):
    data_enc['flat_model'][data['flat_model'] == flat_model_array[i]] = i+1

#count_flat_model = data['flat_model'].nunique()
#print("Total Flat Model Count:", count_flat_model)
#flat_model_count = data['flat_model'].value_counts()
#print("Flat Model Count \n" +str(flat_model_count))

In [16]:
data_enc.head()

Unnamed: 0,sales_month,sales_year,town,flat_type,storey_range,floor_area_sqm,flat_model,lease_commence_date,resale_price
0,1,1990,1,1,4,31.0,6,1977,9000.0
1,1,1990,1,1,2,31.0,6,1977,6000.0
2,1,1990,1,1,4,31.0,6,1977,8000.0
3,1,1990,1,1,3,31.0,6,1977,6000.0
4,1,1990,1,3,2,73.0,13,1976,47200.0


### 3.4 One Hot Encoding (Categorical Data)
Label encoding is a traditional way of translating strings into numerical values. The disadvantage of this method is the fact that algorithms might misinterpret these values. A higher town value does not necessarily mean that it has the potential of having higher resale prices.

To cope with this problem, the one hot encoding approach is utilised. Instead of giving a numerical value, new columns are created per feature value. Continuing with the town example, this would mean that every town would have a new column. When the datapoint is part of this value, it will receive a $1$, whilst the other values receive a $0$ in this column. Therefore, it can be considered as a boolean solution for the feature values; either it is part of the value (True) or it is not (False). The disadvantage of the method is the fact that a significant amount of columns will be added to the dataset.

To use the one hot encoding approach, the Panda feature get_dummies is used. This is similar to the LabelBinarizer function used in the Scikit-learn package. We chose for the pandas approach as our data was already converted to a pandas DataFrame. One hot encoding are used for the following features: town/area, flat type, flat model and storey range. 

In [17]:
#Note that data.copy() is used to make a copy of data, which will be used for analysis
data_henc = data.copy()

In [18]:
#one hot encoding for town
dummies = pd.get_dummies(data_henc['town']).rename(columns=lambda x: 'town_' + str(x))
data_henc = pd.concat([data_henc, dummies], axis=1)

In [19]:
#one hot encoding for flat types
dummies = pd.get_dummies(data_henc['flat_type']).rename(columns=lambda x: 'flat_type_' + str(x))
data_henc = pd.concat([data_henc, dummies], axis=1)

#source: http://www.hdb.gov.sg/cs/infoweb/residential/buying-a-flat/new/types-of-flats&rendermode=preview

In [20]:
#one hot encoding for storey ranges
dummies = pd.get_dummies(data_henc['storey_range']).rename(columns=lambda x: 'storey_range_' + str(x))
data_henc = pd.concat([data_henc, dummies], axis=1)


In [21]:
#one hot encoding for flat models
dummies = pd.get_dummies(data_henc['flat_model']).rename(columns=lambda x: 'flat_model_' + str(x))
data_henc = pd.concat([data_henc, dummies], axis=1)


In [22]:
#remove unnecessary variables
data_henc = data_henc.drop('town',1)
data_henc = data_henc.drop('flat_type',1)
data_henc = data_henc.drop('storey_range',1)
data_henc = data_henc.drop('flat_model',1)

print(data_henc.shape)

(761791, 77)


In [23]:
data_henc.head(5)

Unnamed: 0,sales_month,sales_year,floor_area_sqm,lease_commence_date,resale_price,town_ANG MO KIO,town_BEDOK,town_BISHAN,town_BUKIT BATOK,town_BUKIT MERAH,...,flat_model_Multi Generation,flat_model_New Generation,flat_model_Premium Apartment,flat_model_Premium Apartment Loft,flat_model_Premium Maisonette,flat_model_Simplified,flat_model_Standard,flat_model_Terrace,flat_model_Type S1,flat_model_Type S2
0,1,1990,31.0,1977,9000.0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,1,1990,31.0,1977,6000.0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,1,1990,31.0,1977,8000.0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,1,1990,31.0,1977,6000.0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,1,1990,73.0,1976,47200.0,1,0,0,0,0,...,0,1,0,0,0,0,0,0,0,0


### 3.4 New Features
In this part, we will explore new features that we can add to make our data more valuable. Since the data consists of seven objects, two floats and one integer, the seven objects will be researched and to see which can and will be changed. (Note that adding and dropping variables have been changed to comments, because an error would pop up otherwise. This is because the variable is already added or dropped, thus it cannot be performed again.) <br>

***Remaining Lease*** Linear regression will not be able to read the years, since it can see it as another numerical value. Therefore, the remaining lease year is calculated. Once the sales year variable is created, the remaining lease year can be computed by using the following formula: $remaining lease year = 99 - (sales year - lease commence date)$.

To check whether the remaining lease variable is correct, the data tail from dataset 5 in the data acquisition is used to compare with the new data. Since only the fifth data set consists of this data, we could use the column for validation. 

In [24]:
#Note that data.copy() is used to make a copy of data, which will be used for analysis
data_f = data_henc.copy()

In [25]:
#compute remaining lease variable
if ('remaining_lease' not in data_f.columns):
    data_f['remaining_lease'] = 99 - (data.sales_year - data.lease_commence_date)

***Longtitude and Latitude*** Another interesting feature would be the longtitude and latitude of the street name. Fortunately, Google has such a package to make this possible. Unfortunately, this is only possible for 2,500 data points per day. Since we have 768.629 data points, this task was not possible for us. However, we still want to show that we have tried running the code underneath. Note that this can be seen as a limitation for our study.

In [15]:
#compute longlat variable
#data['long_lat'] = geocoder.google(data['street_name']).lating

Status code Unknown from https://maps.googleapis.com/maps/api/geocode/json: ERROR - ('Connection aborted.', OSError("(32, 'EPIPE')",))


***Area*** Instead of longtitude and latitutde, we have made an extra variable called "Area". The Area captures all the town in a specific region, which is based on the information of the official Singapore government site. 

In [81]:
#Note that data.copy() is used to make a copy of data, which will be used for analysis
data_2f = data_f.copy()

In [82]:
#add area variable
data_2f.insert(1,'area',(data['town']))

In [83]:
#divide towns into areas
data_2f['area'][data_2f.area == 'BUKIT MERAH'] = 'CENTRAL'
data_2f['area'][data_2f.area == 'TOA PAYOH'] = 'CENTRAL'
data_2f['area'][data_2f.area == 'QUEENSTOWN'] = 'CENTRAL'
data_2f['area'][data_2f.area == 'GEYLANG'] = 'CENTRAL'
data_2f['area'][data_2f.area == 'KALLANG/WHAMPOA'] = 'CENTRAL'
data_2f['area'][data_2f.area == 'BISHAN'] = 'CENTRAL'
data_2f['area'][data_2f.area == 'MARINE PARADE'] = 'CENTRAL'
data_2f['area'][data_2f.area == 'CENTRAL AREA'] = 'CENTRAL'
data_2f['area'][data_2f.area == 'BUKIT TIMAH'] = 'CENTRAL'
data_2f['area'][data_2f.area == 'TAMPINES'] = 'NORTH'
data_2f['area'][data_2f.area == 'YISHUN'] = 'NORTH'
data_2f['area'][data_2f.area == 'BEDOK'] = 'NORTH'
data_2f['area'][data_2f.area == 'PASIR RIS'] = 'NORTH'
data_2f['area'][data_2f.area == 'JURONG WEST'] = 'WEST'
data_2f['area'][data_2f.area == 'BUKIT BATOK'] = 'WEST'
data_2f['area'][data_2f.area == 'CHOA CHU KANG'] = 'WEST'
data_2f['area'][data_2f.area == 'CLEMENTI'] = 'WEST'
data_2f['area'][data_2f.area == 'JURONG EAST'] = 'WEST'
data_2f['area'][data_2f.area == 'BUKIT PANJANG'] = 'WEST'
data_2f['area'][data_2f.area == 'WOODLANDS'] = 'EAST'
data_2f['area'][data_2f.area == 'SEMBAWANG'] = 'EAST'
data_2f['area'][data_2f.area == 'LIM CHU KANG'] = 'EAST'
data_2f['area'][data_2f.area == 'ANG MO KIO'] = 'NORTH EAST'
data_2f['area'][data_2f.area == 'HOUGANG'] = 'NORTH EAST'
data_2f['area'][data_2f.area == 'SERANGOON'] = 'NORTH EAST'
data_2f['area'][data_2f.area == 'SENGKANG'] = 'NORTH EAST'
data_2f['area'][data_2f.area == 'PUNGGOL'] = 'NORTH EAST'

area_count = data_2f['area'].value_counts()
print("Area \n" +str(area_count))

#source: http://www.hdb.gov.sg/cs/infoweb/about-us/history/hdb-towns-your-home

Area 
NORTH         213559
WEST          191750
CENTRAL       158781
NORTH EAST    134634
EAST           63067
Name: area, dtype: int64


In [84]:
#one hot encoding for area
dummies = pd.get_dummies(data_2f['area']).rename(columns=lambda x: 'area_' + str(x))
data_2f = pd.concat([data_2f, dummies], axis=1)

del data_2f['area']

***GDP*** As an emerging market, Singapore went through rapid economic transform since the 1980s. Together with the growth of international trade, and its increased significance in the financial markets. The increase in GDP, and thereby the income of Singaporeans, will likely have an effect on the resale prices of HDB flats.

Source: https://data.worldbank.org/country/singapore

In [90]:
gdp = pd.read_csv('per-capita-gni-and-per-capita-gdp-at-current-market-prices-in-usd-annual.csv', sep =',')
idx = (gdp['year'] > 1989)
gdp = gdp[idx]
gdp = gdp.drop('level_1',axis=1)
gdp['sales_year'] = gdp['year']
gdp['gdp_per_capita'] = gdp['value']
del gdp['year']
del gdp['value']
print(gdp)

     sales_year  gdp_per_capita
60         1990           12638
61         1990           12766
62         1991           14285
63         1991           14504
64         1992           16153
65         1992           16144
66         1993           17916
67         1993           18302
68         1994           21607
69         1994           21578
70         1995           25117
71         1995           24937
72         1996           26073
73         1996           26261
74         1997           26903
75         1997           26387
76         1998           22146
77         1998           21824
78         1999           22044
79         1999           21797
80         2000           23648
81         2000           23794
82         2001           21351
83         2001           21577
84         2002           21400
85         2002           22017
86         2003           22756
87         2003           23574
88         2004           25474
89         2004           27403
90      

In [129]:
#data_3f = pd.concat([data_2f,gdp])
data_3f = data_2f.merge(gdp[['sales_year','gdp_per_capita']], on='sales_year',how='outer')
#data_3f = data_2f.copy()

In [130]:
print(data_2f.shape)
print(data_3f.shape)

print(data_3f.columns)

(761791, 83)
(1502104, 84)
Index(['sales_month', 'sales_year', 'floor_area_sqm', 'lease_commence_date',
       'resale_price', 'town_ANG MO KIO', 'town_BEDOK', 'town_BISHAN',
       'town_BUKIT BATOK', 'town_BUKIT MERAH', 'town_BUKIT PANJANG',
       'town_BUKIT TIMAH', 'town_CENTRAL AREA', 'town_CHOA CHU KANG',
       'town_CLEMENTI', 'town_GEYLANG', 'town_HOUGANG', 'town_JURONG EAST',
       'town_JURONG WEST', 'town_KALLANG/WHAMPOA', 'town_LIM CHU KANG',
       'town_MARINE PARADE', 'town_PASIR RIS', 'town_PUNGGOL',
       'town_QUEENSTOWN', 'town_SEMBAWANG', 'town_SENGKANG', 'town_SERANGOON',
       'town_TAMPINES', 'town_TOA PAYOH', 'town_WOODLANDS', 'town_YISHUN',
       'flat_type_1 ROOM', 'flat_type_2 ROOM', 'flat_type_3 ROOM',
       'flat_type_4 ROOM', 'flat_type_5 ROOM', 'flat_type_EXECUTIVE',
       'flat_type_MULTI GENERATION', 'storey_range_01 TO 03',
       'storey_range_04 TO 06', 'storey_range_07 TO 09',
       'storey_range_10 TO 12', 'storey_range_13 TO 15',
       '

***Land*** Located on the bottom of the Malaysian peninsula, Singapore is a small city-state with limited land mass. Due to this situation, housing prices tend to be higher than for other countries. To cope with the limited amount of land, the government initiated projects to increase the land mass. From the period 1960 until 2018, the land mass of the country increased with 24% from 581.5 sq/km to 721.5 sq/km.

source: https://data.gov.sg/dataset/total-land-area-of-singapore

In [60]:
land = pd.read_csv('total-land-area-of-singapore.csv', sep =',')
idx = (land['year'] > 1990)
land = land[idx]
land['sales_year'] = land['year']
del land['year']
#print(land)

In [61]:
data_4f = data_3f.merge(land, on='sales_year',how='left')
#print(data_4f.columns)

***Demand for Rental and Sold Flats*** Besides key metrics regarding the nation’s economy, the Singaporean government also supplies information concerning the demand for rental and sold HDB flats. The data dates back from 1960 until 2016. This data gives insight in the demand from Singaporeans, and is provided in batches of five years. To work with the data, the average of the five years is included per year.

source: https://data.gov.sg/dataset/key-stats-since-1960-demand-for-rental-and-sold-flats

In [62]:
demand = pd.read_csv('demand-for-rental-and-sold-flats.csv', sep =',')
print(demand.columns)
demand = demand.drop(columns=['start_year','end_year','flat_type','demand_for_rental.1'],axis=1)
#print(demand)

Index(['sales_year', 'demand_for_rental', 'demand_for_owner', 'start_year',
       'end_year', 'flat_type', 'demand_for_rental.1'],
      dtype='object')


In [63]:
data_5f = data_3f.merge(demand, on='sales_year',how='left')
#print(data_5f.columns)

In [64]:
print(data_2f.shape)
print(data_3f.shape)

(761791, 83)
(1476880, 84)


### 3.5 Normalizations
Data normalization is known as a fundamental preprocessing task to improve the prediction of a model. A normalized dataset might enhances the learning capability with minimum error, since the quality of the data is guaranteed before using any learning algorithm. The data will be scaled in the same range of values for the input features to minize bias [source]. We are using this technique as well to research if this method might improve our scores, since the input are on widely different scales. Two different types of normalization will be used, respectively Z-scoring and Max/Min normalization.

[source] S.C. Nayak, B.B. Misra, and H.S. Behera (2016) Impact of Data Normalization on Stock Index Forecasting, p. 257-269 Volume 6 https://pdfs.semanticscholar.org/f412/4953553981e32c39273bb2745a140311d160.pdf

#### 3.5.1 Z-scoring
Z-scoring normalizes the values of the features according to the mean and standard deviation.

In [139]:
data_z = data_2f.copy()

In [140]:
# z-scoring
data_z[['sales_month', 'sales_year','floor_area_sqm','remaining_lease','lease_commence_date']] = (data_z[['sales_month', 'sales_year', 'floor_area_sqm','remaining_lease','lease_commence_date']] - data_z[['sales_month', 'sales_year', 'floor_area_sqm','remaining_lease','lease_commence_date']].mean())/data_z[['sales_month', 'sales_year', 'floor_area_sqm','remaining_lease','lease_commence_date']].std()


#### 3.5.2 Max/Min-Normalization
This method normalizes the values of the features according to the minimum and maximum of these values. 

In [141]:
data_n = data_2f.copy()

In [142]:
# max/min
data_n[['sales_month', 'sales_year', 'floor_area_sqm','remaining_lease','lease_commence_date']] = (data_n[['sales_month', 'sales_year', 'floor_area_sqm','remaining_lease','lease_commence_date']] - data_n[['sales_month', 'sales_year', 'floor_area_sqm','remaining_lease','lease_commence_date']].min())/(data_n[['sales_month', 'sales_year', 'floor_area_sqm','remaining_lease','lease_commence_date']].max() - data_n[['sales_month', 'sales_year', 'floor_area_sqm','remaining_lease','lease_commence_date']].min())


### 3.6 Additional One Hot Encoding (Discrete Values)

In [143]:
data_henc_2 = data_2f.copy()

In [144]:
#one hot encoding for sales_year
dummies = pd.get_dummies(data_henc_2['sales_year']).rename(columns=lambda x: 'sy_' + str(x))
data_henc_2 = pd.concat([data_henc_2, dummies], axis=1)

#one hot encoding for sales_month
dummies = pd.get_dummies(data_henc_2['sales_month']).rename(columns=lambda x: 'sm_' + str(x))
data_henc_2 = pd.concat([data_henc_2, dummies], axis=1)

#one hot encoding for lease_commence_date
dummies = pd.get_dummies(data_henc_2['lease_commence_date']).rename(columns=lambda x: 'lcd_' + str(x))
data_henc_2 = pd.concat([data_henc_2, dummies], axis=1)

In [145]:
data_henc_2 = data_henc_2.drop(columns=['sales_year','sales_month','lease_commence_date'],axis=1)

In [146]:
print(data_henc_2.shape)

(761791, 171)


## 4 Data Analysis
<hr>
This part of the report will show algorithms that have been applied to predict the housing prices. We have focused on regressions with different features. This section will first define all the models that will be used and afterwards the results of applying the models to the different scenarios.

### 4.1 Models
In our data analysis, we used different techniques to predict the future resale price. First, the linear regression is discussed. Afterwards, two ensemble learning techniques are introduced. To cover the scope of ensembling, we will  both use a bagging and boosting technique. The random forest bagging technique is introduced first, before moving to the gradient boosting technique. To complete the analysis, and compare the predictive functions, we also introduce the neural network prediction. This is done to validate the research objective of this study.

#### Linear Regression
Linear regression is one of the most common used modeling technique where the dependent variable is continuous and the independent variables can be either continuous or discrete [source]. Since our dataset has the same setting, this technique is used as one of the possible models.

[source] B. Patel, 17 Apr. 2017, Predicting house value using regression analysis, https://towardsdatascience.com/regression-analysis-model-used-in-machine-learning-318f7656108a

In [147]:
# Linear Regression
def lin_reg(data,cv):
    start = time.time()

    data_input = data.drop('resale_price' ,axis=1)
    data_output = data['resale_price']
    x_train, x_test, y_train, y_test = train_test_split(data_input, data_output, test_size=0.33, random_state=42)

    model_lin_reg = LinearRegression()
    model_lin_reg.fit(x_train, y_train)
    y_pred_l = model_lin_reg.predict(x_test)
    #y_pred_l_train = model_lin_reg.predict(x_train)
    
    mae_l = mean_absolute_error(y_test, y_pred_l)
    #mae_l_train = mean_absolute_error(y_train, y_pred_l_train)
    print("\nMAE for Linear Regression is: %.0f"%mae_l)
    #print("For the train set: %.0f" %mae_l_train)
    
    if (cv == 1):
        cv = ShuffleSplit(n_splits=3, test_size=0.33, random_state=42)
        model_lin_reg_cv = LinearRegression()
        scores_lr = cross_val_score(model_lin_reg_cv, data_input, data_output, cv=cv, scoring='neg_mean_absolute_error')
        scores_lr = - scores_lr
        print(scores_lr)
        print("CV MAE: %0.2f (+/- %0.2f)" % (scores_lr.mean(), scores_lr.std() * 2))
    
    
    print('LR Time = %.0f'%(time.time() - start))

#### Random Forest
Random Forest is a learning technique that is part of the bootstrap aggregating techniques, also known as bagging. The technique can be utilised for both classification and regression problems. However, [SOURCE] notes that in case of overfitting data, the random forest will make it even worse. However, when there are enough trees in the forest, the classifier will not easily overfit the model. [SOURCE2] describe random forests as multiple decision trees that are being merged, resulting in a more accurate and stable prediction. 

SOURCE: Breiman, L. (2001). Random forests. Machine learning, 45(1), 5-32.
SOURCE2: https://towardsdatascience.com/the-random-forest-algorithm-d457d499ffcd

In [148]:
# Run Random Forest
def random_f(data,version,cv):
    start = time.time()
    
    data_input = data.drop('resale_price' ,axis=1)
    data_output = data['resale_price']
    x_train, x_test, y_train, y_test = train_test_split(data_input, data_output, test_size=0.33, random_state=42)

    model_Forest = RandomForestRegressor()
    model_Forest.fit(x_train, y_train)
    y_pred_f = model_Forest.predict(x_test)
    #y_pred_f_train = model_Forest.predict(x_train)
    
    mae_f = mean_absolute_error(y_test, y_pred_f)
    #mae_f_train = mean_absolute_error(y_train, y_pred_f_train)
    print("\nMAE for Random Forest is: %.0f"%mae_f)
    #print("For the train set: %.0f" %mae_f_train)
    
    cdf_f = r2_score(y_test, y_pred_f)
    print('R-squared for Random Forest: %.2f' % cdf_f)

    if (version == 1):
        importances = model_Forest.feature_importances_
        indices = np.argsort(importances)[::-1]
        columns = np.array(list(data_input))
        return importances
        
        # Print the feature ranking
        print("\nFeature ranking:")
        
        for f in range(x_train.shape[1]):
            print("%d. %s (%f)" % (f + 1, columns[indices[f]], importances[indices[f]]))
        
    if (cv == 1):
        cv = ShuffleSplit(n_splits=3, test_size=0.33, random_state=42)
        scores_rf = cross_val_score(model_Forest, data_input, data_output, cv=cv, scoring='neg_mean_absolute_error')
        scores_rf = - scores_rf
        print(scores_rf)
        print("CV MAE: %0.2f (+/- %0.2f)" % (scores_rf.mean(), scores_rf.std() * 2))
    
    print('RF Time = %.0f'%(time.time() - start))

#### Gradient Boosting Regressor
Gradient boosting is a machine learning technique for both regression and classification problems [SOURCE]. The algorithm is able to produce a decision tree based on input data. Boosting is an ensemble technique, where the predictors are not made independently, but sequentially [SOURCE2]. By including this algorithm, we perform two types of ensembling in our analysis: bagging (random forest) and boosting (gradient boosting).

In terms of gradient boosting, we apply three different techniques. The first technique is the standard gradient boosting regressor and the latter two are based on the technique with tweaks in the learning process.


SOURCES:
Ridgeway, G. (1999). The state of boosting. Computing Science and Statistics, 172-181.

Meir, R., & Rätsch, G. (2003). An introduction to boosting and leveraging. In Advanced lectures on machine learning (pp. 118-183). Springer, Berlin, Heidelberg.

In [149]:
# Run GradientBoostingRegressor
def gbr(data,cv):
    start = time.time()
    
    data_input = data.drop('resale_price' ,axis=1)
    data_output = data['resale_price']
    x_train, x_test, y_train, y_test = train_test_split(data_input, data_output, test_size=0.33, random_state=42)

    model_gbr = GradientBoostingRegressor()
    model_gbr.fit(x_train, y_train)
    y_pred_gbr = model_gbr.predict(x_test)
    
    mae_gbr = mean_absolute_error(y_test, y_pred_gbr)
    print("\nMean Absolute Error for GradientBoostingRegressor is: %.0f" %mae_gbr) 

    cdf_gbr = r2_score(y_test, y_pred_gbr)
    print('R-squared for GradientBoostingRegressor: %.2f' % cdf_gbr)

    if (cv == 1):
        cv = ShuffleSplit(n_splits=3, test_size=0.33, random_state=42)
        scores_gbr = cross_val_score(model_gbr, data_input, data_output, cv=cv, scoring='neg_mean_absolute_error')
        scores_gbr = - scores_gbr
        print(scores_gbr)
        print("CV MAE: %0.2f (+/- %0.2f)" % (scores_gbr.mean(), scores_gbr.std() * 2))
     
    print('GBR Time = %.0f'%(time.time() - start))

#### Ada Boost

In [150]:
# Run AdaBoost
def ada(data,cv):
    start = time.time()
    
    data_input = data.drop('resale_price' ,axis=1)
    data_output = data['resale_price']
    x_train, x_test, y_train, y_test = train_test_split(data_input, data_output, test_size=0.33, random_state=42)

    model_abr = AdaBoostRegressor()
    model_abr.fit(x_train, y_train)
    y_pred_abr = model_abr.predict(x_test)
    
    mae_abr = mean_absolute_error(y_test, y_pred_abr)
    print("\nMean Absolute Error for AdaBoost is: %.0f" %mae_abr)
    
    cdf_abr = r2_score(y_test, y_pred_abr)
    print('R-squared for AdaBoost: %.2f' % cdf_abr)
 
    if (cv == 1):
        cv = ShuffleSplit(n_splits=3, test_size=0.33, random_state=42)
        scores_ada = cross_val_score(model_abr, data_input, data_output, cv=cv, scoring='neg_mean_absolute_error')
        scores_ada = - scores_ada
        print(scores_ada)
        print("CV MAE: %0.2f (+/- %0.2f)" % (scores_ada.mean(), scores_ada.std() * 2))
    
    print('Ada Time = %.0f'%(time.time() - start))


#### XG Boost

In [151]:
# Run XG Boost
def xg_boost(data,cv):
    start = time.time()
    
    data_input = data.drop('resale_price' ,axis=1)
    data_output = data['resale_price']
    x_train, x_test, y_train, y_test = train_test_split(data_input, data_output, test_size=0.33, random_state=42)
    
    dtrain = xgb.DMatrix(x_train, label = y_train)
    dtest = xgb.DMatrix(x_test, label = y_train)
    param = {
        'max_depth': 3,  # the maximum depth of each tree. Try with max_depth: 2 to 10.
        'eta': 0.3,  # the training step for each iteration. Try with ETA: 0.1, 0.2, 0.3...
        'silent': 1,  # logging mode - quiet
        'objective': 'reg:linear'}  # defines the loss function to be minimized  
    num_round = 20  # the number of training iterations. Try with num_round around few hundred!
    #----------------
    bst = xgb.train(param, dtrain, num_round)
    y_pred_xgb = bst.predict(dtest)
    best_preds = np.asarray([np.argmax(line) for line in y_pred_xgb])

    mae_xgb = mean_absolute_error(y_test, y_pred_xgb)
    print("\nMean Absolute Error for XGBoost is: %.0f" %mae_xgb)
    #xgb.plot_importance(bst)
    #plt.show()

    cdf_xgb = r2_score(y_test, y_pred_xgb)
    print('R-squared for XGBoost: %.2f' % cdf_xgb)
    
    if (cv == 1):
        cv = ShuffleSplit(n_splits=3, test_size=0.33, random_state=42)
        scores_xgb = cross_val_score(xgb.train, data_input, data_output, cv=cv, scoring='neg_mean_absolute_error')
        scores_xgb = - scores_xgb
        print(scores_xgb)
        print("CV MAE: %0.2f (+/- %0.2f)" % (scores_xgb.mean(), scores_xgb.std() * 2))
    
    print('XGB Time = %.0f'%(time.time() - start))

#### Neural Network
Predictive Neural Network is a powerful predictive modeling technique, which can learn to perform predictive tasks. It can for example be trained to predict numerical values, such as housing prices. As mentioned in the introduction, previous studies showed that this technique might perform better compared to a regression model. However, since Singapore is a quasi-open market, Neural Network is used to research if this is the case.

In [152]:
# Neural Network
def neural(data,cv):
    start = time.time()
    
    data_input = data.drop('resale_price' ,axis=1)
    data_output = data['resale_price']
    x_train, x_test, y_train, y_test = train_test_split(data_input, data_output, test_size=0.33, random_state=42)

    model_n = MLPRegressor()
    model_n.fit(x_train, y_train)
    y_pred_n = model_n.predict(x_test)
    
    mae_n = mean_absolute_error(y_test, y_pred_n)
    print("\nMean Absolute Error for Neural Network is: %.0f" %mae_n)  
    
    cdf_n = r2_score(y_test, y_pred_n)
    print('R-squared for Neural Network: %.2f' % cdf_n)
    
    if (cv == 1):
        cv = ShuffleSplit(n_splits=3, test_size=0.33, random_state=42)
        scores = cross_val_score(model_n, data_input, data_output, cv=cv, scoring='neg_mean_absolute_error')
        scores = - scores
        print(scores)
        print("CV MAE: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))
    
    print('Neural Time = %.0f'%(time.time() - start))

### 4.2 Analysis
The analysis section will show the Mean Absolute Error and R-squared of every model with different scenarios. 

***Mean Absolute Error***

Formula: <br>
$$MAE = \frac{1}{n}\sum_{j=1}^{n}|y_j - \hat{y}_j|$$

***Coefficient of Determination***
The coefficient of determination, also known as $R^2$, is the proportion variance between the dependent variable and the prediction from the independent variable. This coefficient ranges from 0 to 1, where 1 means that there is no variance between the predicted and actual.

Formula:<br> 
$$R^2 = 1 - \frac{\sum(y - \hat{y})^2}{\sum(y - \bar{y})^2}$$

where $y$ is the actual value, $\hat{y}$ is the predicted value of y, and $\bar{y}$ is the mean value of y.

#### Scenario 1: Only data encoding
Some columns are still of 'object' type and need to be changed to int or float in order to run XGBoost.

In [153]:
#print(data_enc.dtypes)
data_enc['flat_type'] = pd.to_numeric(data_enc['flat_type'])
data_enc['storey_range'] = pd.to_numeric(data_enc['storey_range'])
data_enc['flat_model'] = pd.to_numeric(data_enc['flat_model'])
data_enc['town'] = pd.to_numeric(data_enc['town'])
print(data_enc.dtypes)

sales_month              int64
sales_year               int64
town                     int64
flat_type                int64
storey_range             int64
floor_area_sqm         float64
flat_model               int64
lease_commence_date      int64
resale_price           float64
dtype: object


To quickly different algorithms and features, we use a sample size of the full train data.<br>

In [154]:
data_sample = data_enc.sample(frac=0.1)

In [155]:
lin_reg(data_enc,1)
random_f(data_enc,0,1)
gbr(data_enc,1)
xg_boost(data_enc,0)
neural(data_enc,0)


MAE for Linear Regression is: 1222104978
[  1.09201779e+16   4.60056157e+16   2.89980905e+10]
CV MAE: 18975274203502268.00 (+/- 39252751419199512.00)
LR Time = 2

MAE for Random Forest is: 15973
R-squared for Random Forest: 0.97


KeyboardInterrupt: 

#### Scenorio 2: Analysis on one hot encoding

In [400]:
#data_sample = data_henc.sample(frac=0.1)

In [159]:
lin_reg(data_henc,1)
random_f(data_henc,0,1)
#gbr(data_sample,1)
#xg_boost(data_sample,1)
#neural(data_sample,0)


MAE for Linear Regression is: 47967
R-squared for Linear Regression: 0.80
[ 55523.06253656  49459.87404337  43044.16671053  67857.85820959
  96245.58648405]

CV MAE: 62426.11 (+/- 37573.35)

LR Time = 32

MAE for Random Forest is: 15874
R-squared for Random Forest: 0.97

CV MAE: 56892.06 (+/- 65887.97)

RF Time = 341


#### Scenario 3: Analysis on one new feature (remaining lease)

In [461]:
#data_sample = data_f.sample(frac=0.1)

In [89]:
lin_reg(data_f,1)
random_f(data_f,0,1)
#gbr(data_f,1)
#xg_boost(data_f,1)
#neural(data_f,0)


MAE for Linear Regression is: 47969
[ 47967.84234952  48116.02335898  47996.2386813 ]
CV MAE: 48026.70 (+/- 128.43)
LR Time = 19

MAE for Random Forest is: 15900
R-squared for Random Forest: 0.97
[ 15870.00850873  15892.01256976  15853.29174657]
CV MAE: 15871.77 (+/- 31.71)
RF Time = 336


Scores improved and thus we will keep this feature.

#### Scenario 4: Analysis on two new features (remaining lease and area)

In [None]:
#data_sample = data_2f.sample(frac=0.1)

In [116]:
lin_reg(data_2f,1)
random_f(data_2f,0,1)
#gbr(data_2f,1)
#xg_boost(data_2f,1)
#neural(data_2f,0)


MAE for Linear Regression is: 47967

CV MAE for LR is: 98571776
[  5.55347464e+04   4.95112696e+04   4.29991840e+04   6.78804067e+04
   4.92643553e+08]

CV MAE: 98571895.67 (+/- 394071657.39)

LR Time = 50

MAE for Random Forrest is: 15738

CV MAE: 56892.00 (+/- 67088.69)

RF Time = 345


#### Scenario 5: Analysis on three new features (remaining lease, area and GDP per capita)

In [None]:
#data_sample = data_3f.sample(frac=0.1)

In [116]:
lin_reg(data_3f,1)
random_f(data_3f,0,1)
#gbr(data_3f,1)
#xg_boost(data_3f,1)
#neural(data_3f,0)


MAE for Linear Regression is: 47967

CV MAE for LR is: 98571776
[  5.55347464e+04   4.95112696e+04   4.29991840e+04   6.78804067e+04
   4.92643553e+08]

CV MAE: 98571895.67 (+/- 394071657.39)

LR Time = 50

MAE for Random Forrest is: 15738

CV MAE: 56892.00 (+/- 67088.69)

RF Time = 345


#### Scenario 6: Analysis on three new features (remaining lease, area, GDP per capita and land)

In [None]:
#data_sample = data_4f.sample(frac=0.1)

In [116]:
lin_reg(data_4f,1)
random_f(data_4f,0,1)
#gbr(data_4f,1)
#xg_boost(data_4f,1)
#neural(data_4f,0)


MAE for Linear Regression is: 47967

CV MAE for LR is: 98571776
[  5.55347464e+04   4.95112696e+04   4.29991840e+04   6.78804067e+04
   4.92643553e+08]

CV MAE: 98571895.67 (+/- 394071657.39)

LR Time = 50

MAE for Random Forrest is: 15738

CV MAE: 56892.00 (+/- 67088.69)

RF Time = 345


#### Scenario 7: Analysis on four new features (remaining lease, area, GDP per capita, land and demand)

In [None]:
#data_sample = data_5f.sample(frac=0.1)

In [116]:
lin_reg(data_5f,1)
random_f(data_5f,0,1)
#gbr(data_5f,1)
#xg_boost(data_5f,1)
#neural(data_5f,0)


MAE for Linear Regression is: 47967

CV MAE for LR is: 98571776
[  5.55347464e+04   4.95112696e+04   4.29991840e+04   6.78804067e+04
   4.92643553e+08]

CV MAE: 98571895.67 (+/- 394071657.39)

LR Time = 50

MAE for Random Forrest is: 15738

CV MAE: 56892.00 (+/- 67088.69)

RF Time = 345


#### Scenario 8: Analysis on Normalization Data

In [None]:
#data_sample = data_z.sample(frac=0.1)

In [None]:
print('\n### z-scoring ###')
lin_reg(data_z,1)
random_f(data_z,0,1)
#gbr(data_z,1)
#xg_boost(data_z,1)
#neural(data_z,0)

#### Scenario 9: Analysis on max-min normalization data

In [None]:
#data_sample = data_n.sample(frac=0.1)

In [None]:
print('\n### max/min ###')
lin_reg(data_n,1)
random_f(data_n,0,1)
#gbr(data_n,1)
#xg_boost(data_n,1)
#neural(data_n,0)

nothing happened

#### Scenario 10: Addition One hot Encoding of Months and Years

In [382]:
#data_sample = data_henc_2.sample(frac=0.1)

In [383]:
lin_reg(data_henc_2,1)
random_f(data_henc_2,0,1)
#gbr(data_henc_2,1)
#xg_boost(data_henc_2,1)
#neural(data_henc_2,0)


MAE for Linear Regression is: 29559
LR Time = 9

MAE for Random Forrest is: 16084
RF Time = 145


its horsecrap, will not do

### 4.3 Reduced Features

In [139]:
data_r = data_2f.copy()

In [140]:
print(data_r.shape)

(761791, 83)


In [141]:
#data_sample = data_r.sample(frac=0.1)

In [142]:
importances = random_f(data_r,1)


MAE for Random Forest is: 15725
R-squared for Random Forest: 0.97


In [143]:
#indices = np.argsort(importances)[::-1]
columns = np.array(list(data_r.drop('resale_price',1)))

for i in range(0,len(columns)):
    feature = columns[i]
    weight = importances[i]
    if (weight < 0.005):
        del data_r[feature]
        
print(data_r.shape)

(761791, 10)


In [144]:
lin_reg(data_r,1)
random_f(data_r,0,1)
#gbr(data_r,1)
#xg_boost(data_r,1)
#neural(data_r,0)


MAE for Linear Regression is: 53504
R-squared for Linear Regression: 0.75

CV MAE for LR is: 67267
[  56148.96950623   53980.03231626   49540.52757144   74387.40058907
  102306.47657642]

CV MAE: 67272.68 (+/- 38913.72)

LR Time = 4

MAE for Random Forest is: 24652
R-squared for Random Forest: 0.93

CV MAE: 60318.27 (+/- 63016.36)

RF Time = 100


- deleting any features with low importance only made the result worse
- some overall not important at all but maybe very important for the few datapoints

### 4.4 Split Datasets according to Periods

Following our assumption in the previous chapter, we also try to run the regressions on seperate datasets to research whether the accuracy will increase.

Looking at the average price per square meter in the sales years, periods can be identified. The first period identified is the economic growth from 1990 until 1997 [13]. The "Asian Crisis" of 1997-1998 affected Singapore and other emerging markets, which is visible from the decline in resale price in the data [14]. In the subsequent years, Singapore had a stable growth in economic terms, but coped with the economic slowdown in the US, Japan and the EU. Combined with the SARS outbreak in 2003, the resale prices remained relatively stable until 2007. According to [15], the HDB resale prices from 2007 onwards grew even faster than the private property market. [15] argues that the increase in price is the result of an increase in median income of Singaporeans. 


Splitting the dataset in these periods could help to predict the resale prices of HDB in Singapore. We thereby assume that the resale prices of data in the first period (i.e. 1990-1997) will be less accurate to predict the resale price in 2018. This is based on both economic motives, as well as demographic motives (e.g. increased population and land mass).

In [85]:
# Creating the datasets based on the periods described
data_period1 = data_2f.loc[data['sales_year'].isin(['1990','1991','1992','1993','1994','1995','1996','1997','1998'])]
data_period2 = data_2f.loc[data['sales_year'].isin(['1999','2000','2001','2002','2003','2004','2005','2006','2007'])]
data_period3 = data_2f.loc[data['sales_year'].isin(['2008','2009','2010','2011','2012','2013''2014','2015','2016','2017','2018'])]


In [86]:
#data_sample_p1 = data_period1.sample(frac=0.1)
#data_sample_p2 = data_period2.sample(frac=0.1)
#data_sample_p3 = data_period3.sample(frac=0.1)


In [87]:
periods = [data_period1,data_period2,data_period3]
for i in range(0,3):
    period = periods[i]
    print(period.shape)

(230238, 83)
(309490, 83)
(189870, 83)


In [88]:
period_names = ['1990-1998','1999-2007','2008-2018']

for i in range(0,3):
    print('\n### For',period_names[i],'###')
    period = periods[i]
    #lin_reg(period,1)
    random_f(period,0,1)
    #gbr(period,1)
    #xg_boost(period,1)
    #neural(period,0)


### For 1990-1998 ###

MAE for Random Forest is: 13955
R-squared for Random Forest: 0.98
[ 13919.90539571  13949.29864603  13988.00869607]
CV MAE: 13952.40 (+/- 55.78)
RF Time = 65

### For 1999-2007 ###

MAE for Random Forest is: 13109
R-squared for Random Forest: 0.96
[ 13130.03719051  13086.40567397  13111.15103569]
CV MAE: 13109.20 (+/- 35.73)
RF Time = 84

### For 2008-2018 ###

MAE for Random Forest is: 21205
R-squared for Random Forest: 0.94
[ 21093.12909408  21335.68599991  21127.06180391]
CV MAE: 21185.29 (+/- 214.49)
RF Time = 53


In [138]:
overall_mae = (13944*230238 + 13119*309490 + 21123*189870)/(230238+309490+189870)
print('Overall Mean Absolute Error: %.2f' % overall_mae)

overall_rsq = (0.98*230238 + 0.96*309490 + 0.94*189870)/(230238+309490+189870)
print('Overall R-squared: %.2f' % overall_rsq)

Overall Mean Absolute Error: 15462.30
Overall R-squared: 0.96


## 5 Conclusions, Limitations & Future Research
<hr>
Add here

## References
<hr>
Add here