## Pre-processing
- One-hot encode categorical variables.
- Train/test split your data.
- Scale your data.
- Consider using automated feature selection.
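
The order of the steps above matters: splitting before scaling keeps the test rows from influencing the fitted scaler and encoder. A minimal sketch of that order, assuming scikit-learn's `ColumnTransformer` and a made-up toy frame (the values and the choice of columns are illustrative, not taken from the housing data):

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy stand-in for the housing data (values are made up for illustration).
toy = pd.DataFrame({
    "flat_type": ["3 ROOM", "4 ROOM", "5 ROOM", "4 ROOM", "3 ROOM", "5 ROOM"],
    "age_at_sale": [38, 10, 25, 20, 30, 12],
    "est_floor_level": [8, 11, 8, 3, 2, 14],
    "resale_price_log": [12.72, 13.43, 13.41, 13.22, 12.60, 13.50],
})
X_toy = toy.drop(columns="resale_price_log")
y_toy = toy["resale_price_log"]

# 1. Split first, so the test rows never influence the fitted transformers.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.33, random_state=42
)

# 2. Scale numeric columns and one-hot encode categorical ones.
#    handle_unknown="ignore" encodes categories unseen in training as all zeros;
#    sparse_threshold=0 forces a dense array as output.
preprocess = ColumnTransformer(
    [
        ("num", StandardScaler(), ["age_at_sale", "est_floor_level"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["flat_type"]),
    ],
    sparse_threshold=0,
)

X_tr_prep = preprocess.fit_transform(X_tr)  # statistics learned from training rows only
X_te_prep = preprocess.transform(X_te)      # the same statistics reused on the test rows
print(X_tr_prep.shape, X_te_prep.shape)
```

The same `preprocess` object could be placed in a `Pipeline` ahead of the regressor so that cross-validation refits it inside each fold.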

## Modeling
- **Establish your baseline score.**
- Fit linear regression. Look at your coefficients. Are any of them wildly overblown?
- Fit lasso/ridge/elastic net with default parameters.
- Go back and remove features that might be causing issues in your models.
- Tune hyperparameters.
- **Identify a production model.** (This does not have to be your best performing Kaggle model, but rather the model that best answers your problem statement.)
- Refine and interpret your production model.


In [1]:
# imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.linear_model import Ridge, Lasso, ElasticNet, LinearRegression, RidgeCV, LassoCV, ElasticNetCV
from sklearn.model_selection import train_test_split, cross_val_score, KFold, RandomizedSearchCV
from sklearn import metrics
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import StandardScaler, Normalizer, PolynomialFeatures 

import statsmodels.api as sm
import math


In [2]:
pd.options.display.max_columns = 100
pd.options.display.max_rows = 100

In [3]:
housing = pd.read_csv('./housing.csv', low_memory=False)
housing.head()

Unnamed: 0.1,Unnamed: 0,flat_type,flat_model,resale_price,Tranc_Year,max_floor_lvl,commercial,planning_area,Mall_Within_2km,Hawker_Within_2km,hawker_food_stalls,hawker_market_stalls,mrt_interchange,pri_sch_name,sec_sch_name,resale_price_log,1_2_3_rooms_sold,4_5_rooms_sold,Other_rooms_sold,est_floor_level,age_at_sale,Mall_Nearest_Distance_log,Hawker_Nearest_Distance_log,mrt_nearest_distance_log,bus_stop_nearest_distance_log,pri_sch_nearest_distance_log,sec_sch_nearest_dist_log,region
0,0,4 ROOM,Model A,680000.0,2016,25,N,Kallang,7.0,13.0,84,60,0,Geylang Methodist School,Geylang Methodist School,13.429848,0,142,0,11,10,6.997679,5.041833,5.799344,3.381926,7.037584,7.037584,Central
1,1,5 ROOM,Improved,665000.0,2012,9,N,Bishan,3.0,7.0,80,77,1,Kuo Chuan Presbyterian Primary School,Kuo Chuan Presbyterian Secondary School,13.407542,0,112,0,8,25,6.764971,6.461706,6.806453,4.064019,6.029741,6.104557,Central
2,3,4 ROOM,Model A,550000.0,2012,11,Y,Bishan,4.0,9.0,32,86,1,Catholic High School,Catholic High School,13.217674,0,75,0,3,20,6.856646,6.587846,6.810642,3.770379,5.964904,5.964904,Central
3,4,4 ROOM,Simplified,298000.0,2017,4,N,Yishun,2.0,1.0,45,0,0,Naval Base Primary School,Orchid Park Secondary School,12.604849,0,48,0,2,30,6.592732,7.339636,6.021856,4.863084,5.994462,5.743085,North
4,5,3 ROOM,Improved,335000.0,2013,12,Y,Geylang,6.0,11.0,79,82,1,Saint Margaret's Primary School,Geylang Methodist School,12.721886,188,5,0,8,38,6.527964,5.000034,6.519577,5.436689,6.387096,6.411553,Central


In [4]:
housing.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 144640 entries, 0 to 144639
Data columns (total 28 columns):
 #   Column                         Non-Null Count   Dtype  
---  ------                         --------------   -----  
 0   Unnamed: 0                     144640 non-null  int64  
 1   flat_type                      144640 non-null  object 
 2   flat_model                     144640 non-null  object 
 3   resale_price                   144640 non-null  float64
 4   Tranc_Year                     144640 non-null  int64  
 5   max_floor_lvl                  144640 non-null  int64  
 6   commercial                     144640 non-null  object 
 7   planning_area                  144640 non-null  object 
 8   Mall_Within_2km                144640 non-null  float64
 9   Hawker_Within_2km              144640 non-null  float64
 10  hawker_food_stalls             144640 non-null  int64  
 11  hawker_market_stalls           144640 non-null  int64  
 12  mrt_interchange               

In [5]:
# Identifying numerical predictors for scaling

numerical_var = housing.select_dtypes(include=['number'])
numerical_var.columns

Index(['Unnamed: 0', 'resale_price', 'Tranc_Year', 'max_floor_lvl',
       'Mall_Within_2km', 'Hawker_Within_2km', 'hawker_food_stalls',
       'hawker_market_stalls', 'mrt_interchange', 'resale_price_log',
       '1_2_3_rooms_sold', '4_5_rooms_sold', 'Other_rooms_sold',
       'est_floor_level', 'age_at_sale', 'Mall_Nearest_Distance_log',
       'Hawker_Nearest_Distance_log', 'mrt_nearest_distance_log',
       'bus_stop_nearest_distance_log', 'pri_sch_nearest_distance_log',
       'sec_sch_nearest_dist_log'],
      dtype='object')

In [6]:
# Keep both target columns out of the predictors that will be scaled
numerical_var = numerical_var.drop(columns=['resale_price', 'resale_price_log'])

In [7]:
# Standardising predictors

ss = StandardScaler()
num_ss = ss.fit_transform(numerical_var)
num_ss = pd.DataFrame(num_ss, columns = numerical_var.columns)

### Pre-Processing

In [8]:
# Converting the following variables to categorical 

housing['Tranc_Year'] = housing['Tranc_Year'].astype('object')

In [9]:
# Identifying categorical predictors for one-hot encoding

categorical_var = housing.select_dtypes(include=['object'])
categorical_var.columns

Index(['flat_type', 'flat_model', 'Tranc_Year', 'commercial', 'planning_area',
       'pri_sch_name', 'sec_sch_name', 'region'],
      dtype='object')

In [10]:
categorical_var = housing.select_dtypes(include=['object'])
categorical_var

Unnamed: 0,flat_type,flat_model,Tranc_Year,commercial,planning_area,pri_sch_name,sec_sch_name,region
0,4 ROOM,Model A,2016,N,Kallang,Geylang Methodist School,Geylang Methodist School,Central
1,5 ROOM,Improved,2012,N,Bishan,Kuo Chuan Presbyterian Primary School,Kuo Chuan Presbyterian Secondary School,Central
2,4 ROOM,Model A,2012,Y,Bishan,Catholic High School,Catholic High School,Central
3,4 ROOM,Simplified,2017,N,Yishun,Naval Base Primary School,Orchid Park Secondary School,North
4,3 ROOM,Improved,2013,Y,Geylang,Saint Margaret's Primary School,Geylang Methodist School,Central
...,...,...,...,...,...,...,...,...
144635,EXECUTIVE,Apartment,2020,Y,Woodlands,Evergreen Primary School,Evergreen Secondary School,North
144636,5 ROOM,Improved,2017,N,Jurong West,Jurong West Primary School,Boon Lay Secondary School,West
144637,EXECUTIVE,Apartment,2020,N,Bedok,Maha Bodhi School,Manjusri Secondary School,East
144638,3 ROOM,Improved,2016,N,Queenstown,New Town Primary School,Queensway Secondary School,Central


In [11]:
# Dummifying all categorical variables
# Note: passing data=housing returns the full frame (untouched numeric columns plus the dummies)

categorical_var = pd.get_dummies(columns=categorical_var.columns,data=housing, drop_first=True)



In [12]:
housing = pd.concat([num_ss, categorical_var], axis=1)
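
The duplicated numeric columns that show up after this concat (for example `max_floor_lvl.1`) appear to come from `pd.get_dummies` being given the whole frame: it returns the untouched numeric columns alongside the dummies, and the scaled copies are then concatenated next to them. The sketch below shows the difference on a made-up three-row frame. In this notebook the target columns also reach `housing` through that route, so switching to the second form would mean re-attaching `resale_price` and `resale_price_log` separately.

```python
import pandas as pd

# Made-up frame with one categorical and one numeric column.
toy = pd.DataFrame({
    "flat_type": ["3 ROOM", "4 ROOM", "5 ROOM"],
    "age_at_sale": [38, 10, 25],
})
cat_cols = ["flat_type"]

# Passing the whole frame keeps the untouched numeric column in the result.
full = pd.get_dummies(toy, columns=cat_cols, drop_first=True)
print(list(full.columns))

# Passing only the categorical columns returns just the dummy columns.
dummies_only = pd.get_dummies(toy[cat_cols], drop_first=True)
print(list(dummies_only.columns))
```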

In [13]:
housing.head()

Unnamed: 0.2,Unnamed: 0,Tranc_Year,max_floor_lvl,Mall_Within_2km,Hawker_Within_2km,hawker_food_stalls,hawker_market_stalls,mrt_interchange,1_2_3_rooms_sold,4_5_rooms_sold,Other_rooms_sold,est_floor_level,age_at_sale,Mall_Nearest_Distance_log,Hawker_Nearest_Distance_log,mrt_nearest_distance_log,bus_stop_nearest_distance_log,pri_sch_nearest_distance_log,sec_sch_nearest_dist_log,Unnamed: 0.1,resale_price,max_floor_lvl.1,Mall_Within_2km.1,Hawker_Within_2km.1,hawker_food_stalls.1,hawker_market_stalls.1,mrt_interchange.1,resale_price_log,1_2_3_rooms_sold.1,4_5_rooms_sold.1,Other_rooms_sold.1,est_floor_level.1,age_at_sale.1,Mall_Nearest_Distance_log.1,Hawker_Nearest_Distance_log.1,mrt_nearest_distance_log.1,bus_stop_nearest_distance_log.1,pri_sch_nearest_distance_log.1,sec_sch_nearest_dist_log.1,flat_type_2 ROOM,flat_type_3 ROOM,flat_type_4 ROOM,flat_type_5 ROOM,flat_type_EXECUTIVE,flat_type_MULTI-GENERATION,flat_model_Adjoined flat,flat_model_Apartment,flat_model_DBSS,flat_model_Improved,flat_model_Improved-Maisonette,...,sec_sch_name_Peirce Secondary School,sec_sch_name_Ping Yi Secondary School,sec_sch_name_Presbyterian High School,sec_sch_name_Punggol Secondary School,sec_sch_name_Queenstown Secondary School,sec_sch_name_Queensway Secondary School,sec_sch_name_Raffles Girls' School,sec_sch_name_Raffles Institution,sec_sch_name_Regent Secondary School,sec_sch_name_River Valley High School,sec_sch_name_Riverside Secondary School,sec_sch_name_Saint Andrew's Secondary School,sec_sch_name_Saint Anthony's Canossian Secondary School,sec_sch_name_Saint Gabriel's Secondary School,sec_sch_name_Saint Hilda's Secondary School,sec_sch_name_Saint Margaret's Secondary School,sec_sch_name_Saint Patrick's School,sec_sch_name_Sembawang Secondary School,sec_sch_name_Seng Kang Secondary School,sec_sch_name_Serangoon Garden Secondary School,sec_sch_name_Serangoon Secondary School,sec_sch_name_Springfield Secondary School,sec_sch_name_Swiss Cottage Secondary School,sec_sch_name_Tampines 
Secondary School,sec_sch_name_Tanglin Secondary School,sec_sch_name_Tanjong Katong Secondary School,sec_sch_name_Teck Whye Secondary School,sec_sch_name_Temasek Junior College,sec_sch_name_Temasek Secondary School,sec_sch_name_Unity Secondary School,sec_sch_name_West Spring Secondary School,sec_sch_name_Westwood Secondary School,sec_sch_name_Whitley Secondary School,sec_sch_name_Woodgrove Secondary School,sec_sch_name_Woodlands Ring Secondary School,sec_sch_name_Woodlands Secondary School,sec_sch_name_Xinmin Secondary School,sec_sch_name_Yio Chu Kang Secondary School,sec_sch_name_Yishun Secondary School,sec_sch_name_Yishun Town Secondary School,sec_sch_name_Yuan Ching Secondary School,sec_sch_name_Yuhua Secondary School,sec_sch_name_Yusof Ishak Secondary School,sec_sch_name_Yuying Secondary School,sec_sch_name_Zhenghua Secondary School,sec_sch_name_Zhonghua Secondary School,region_East,region_North,region_North-East,region_West
0,-1.731723,-0.160166,1.914547,0.54547,2.42405,1.872676,0.047956,-0.596441,-0.592605,1.276735,-0.316256,0.604276,-1.221962,1.088056,-1.654436,-1.156403,-2.383493,2.133487,1.674724,0,680000.0,25,7.0,13.0,84,60,0,13.429848,0,142,0,11,10,6.997679,5.041833,5.799344,3.381926,7.037584,7.037584,0,0,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,-1.7317,-1.617931,-1.055757,-0.634582,0.874458,1.66048,0.353766,1.676612,-0.592605,0.707454,-0.316256,0.003485,0.052966,0.706116,-0.200632,0.534001,-1.071584,0.373422,0.086207,1,665000.0,9,3.0,7.0,80,77,1,13.407542,0,112,0,8,25,6.764971,6.461706,6.806453,4.064019,6.029741,6.104557,0,0,0,1,0,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,-1.731654,-1.617931,-0.684469,-0.339569,1.390989,-0.885872,0.515666,1.676612,-0.592605,0.005341,-0.316256,-0.997833,-0.37201,0.856581,-0.071477,0.541033,-1.636358,0.260192,-0.15156,3,550000.0,11,4.0,9.0,32,86,1,13.217674,0,75,0,3,20,6.856646,6.587846,6.810642,3.770379,5.964904,5.964904,0,0,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,-1.731631,0.204275,-1.983977,-0.929595,-0.675134,-0.196235,-1.031373,-0.596441,-0.592605,-0.507012,-0.316256,-1.198097,0.477942,0.423421,0.698278,-0.782924,0.465305,0.311811,-0.529216,4,298000.0,4,2.0,1.0,45,0,0,12.604849,0,48,0,2,30,6.592732,7.339636,6.021856,4.863084,5.994462,5.743085,0,0,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0
4,-1.731609,-1.25349,-0.498825,0.250457,1.90752,1.607431,0.44371,1.676612,2.040287,-1.322981,-0.316256,0.003485,1.157903,0.317117,-1.697233,0.052488,1.568553,0.997495,0.608879,5,335000.0,12,6.0,11.0,79,82,1,12.721886,188,5,0,8,38,6.527964,5.000034,6.519577,5.436689,6.387096,6.411553,0,1,0,0,0,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


In [14]:
housing.drop(columns = "Unnamed: 0", inplace = True)

In [15]:
housing.describe(include = 'all')

Unnamed: 0,Tranc_Year,max_floor_lvl,Mall_Within_2km,Hawker_Within_2km,hawker_food_stalls,hawker_market_stalls,mrt_interchange,1_2_3_rooms_sold,4_5_rooms_sold,Other_rooms_sold,est_floor_level,age_at_sale,Mall_Nearest_Distance_log,Hawker_Nearest_Distance_log,mrt_nearest_distance_log,bus_stop_nearest_distance_log,pri_sch_nearest_distance_log,sec_sch_nearest_dist_log,resale_price,max_floor_lvl.1,Mall_Within_2km.1,Hawker_Within_2km.1,hawker_food_stalls.1,hawker_market_stalls.1,mrt_interchange.1,resale_price_log,1_2_3_rooms_sold.1,4_5_rooms_sold.1,Other_rooms_sold.1,est_floor_level.1,age_at_sale.1,Mall_Nearest_Distance_log.1,Hawker_Nearest_Distance_log.1,mrt_nearest_distance_log.1,bus_stop_nearest_distance_log.1,pri_sch_nearest_distance_log.1,sec_sch_nearest_dist_log.1,flat_type_2 ROOM,flat_type_3 ROOM,flat_type_4 ROOM,flat_type_5 ROOM,flat_type_EXECUTIVE,flat_type_MULTI-GENERATION,flat_model_Adjoined flat,flat_model_Apartment,flat_model_DBSS,flat_model_Improved,flat_model_Improved-Maisonette,flat_model_Maisonette,flat_model_Model A,...,sec_sch_name_Peirce Secondary School,sec_sch_name_Ping Yi Secondary School,sec_sch_name_Presbyterian High School,sec_sch_name_Punggol Secondary School,sec_sch_name_Queenstown Secondary School,sec_sch_name_Queensway Secondary School,sec_sch_name_Raffles Girls' School,sec_sch_name_Raffles Institution,sec_sch_name_Regent Secondary School,sec_sch_name_River Valley High School,sec_sch_name_Riverside Secondary School,sec_sch_name_Saint Andrew's Secondary School,sec_sch_name_Saint Anthony's Canossian Secondary School,sec_sch_name_Saint Gabriel's Secondary School,sec_sch_name_Saint Hilda's Secondary School,sec_sch_name_Saint Margaret's Secondary School,sec_sch_name_Saint Patrick's School,sec_sch_name_Sembawang Secondary School,sec_sch_name_Seng Kang Secondary School,sec_sch_name_Serangoon Garden Secondary School,sec_sch_name_Serangoon Secondary School,sec_sch_name_Springfield Secondary School,sec_sch_name_Swiss Cottage Secondary 
School,sec_sch_name_Tampines Secondary School,sec_sch_name_Tanglin Secondary School,sec_sch_name_Tanjong Katong Secondary School,sec_sch_name_Teck Whye Secondary School,sec_sch_name_Temasek Junior College,sec_sch_name_Temasek Secondary School,sec_sch_name_Unity Secondary School,sec_sch_name_West Spring Secondary School,sec_sch_name_Westwood Secondary School,sec_sch_name_Whitley Secondary School,sec_sch_name_Woodgrove Secondary School,sec_sch_name_Woodlands Ring Secondary School,sec_sch_name_Woodlands Secondary School,sec_sch_name_Xinmin Secondary School,sec_sch_name_Yio Chu Kang Secondary School,sec_sch_name_Yishun Secondary School,sec_sch_name_Yishun Town Secondary School,sec_sch_name_Yuan Ching Secondary School,sec_sch_name_Yuhua Secondary School,sec_sch_name_Yusof Ishak Secondary School,sec_sch_name_Yuying Secondary School,sec_sch_name_Zhenghua Secondary School,sec_sch_name_Zhonghua Secondary School,region_East,region_North,region_North-East,region_West
count,144640.0,144640.0,144640.0,144640.0,144640.0,144640.0,144640.0,144640.0,144640.0,144640.0,144640.0,144640.0,144640.0,144640.0,144640.0,144640.0,144640.0,144640.0,144640.0,144640.0,144640.0,144640.0,144640.0,144640.0,144640.0,144640.0,144640.0,144640.0,144640.0,144640.0,144640.0,144640.0,144640.0,144640.0,144640.0,144640.0,144640.0,144640.0,144640.0,144640.0,144640.0,144640.0,144640.0,144640.0,144640.0,144640.0,144640.0,144640.0,144640.0,144640.0,...,144640.0,144640.0,144640.0,144640.0,144640.0,144640.0,144640.0,144640.0,144640.0,144640.0,144640.0,144640.0,144640.0,144640.0,144640.0,144640.0,144640.0,144640.0,144640.0,144640.0,144640.0,144640.0,144640.0,144640.0,144640.0,144640.0,144640.0,144640.0,144640.0,144640.0,144640.0,144640.0,144640.0,144640.0,144640.0,144640.0,144640.0,144640.0,144640.0,144640.0,144640.0,144640.0,144640.0,144640.0,144640.0,144640.0,144640.0,144640.0,144640.0,144640.0
mean,-4.061938e-14,-1.634877e-16,8.331585000000001e-17,-2.5544950000000004e-17,-4.1068430000000006e-17,2.004296e-17,5.978502e-17,2.6527450000000003e-17,-4.8830160000000005e-17,-2.6527450000000003e-17,4.3033420000000007e-17,1.13454e-16,3.697632e-16,-1.036143e-15,-9.897687e-16,-1.234558e-15,-6.446171e-16,5.131097e-16,433821.77859,14.686995,5.15103,3.614111,48.699129,57.334098,0.262396,12.94233,42.314595,74.718536,6.041275,7.982598,24.376839,6.334752,6.657655,6.488306,4.621161,5.815913,6.053924,0.013046,0.268626,0.412465,0.230911,0.074157,0.000228,0.001258,0.038876,0.006554,0.254515,0.000131,0.024972,0.312286,...,0.001618,0.016517,0.0015,0.014754,0.002925,0.013392,0.005344,0.001728,0.002779,0.010094,0.005248,0.005434,0.001625,0.002745,0.008269,0.000733,0.004134,0.010868,0.012396,0.006195,0.002102,0.002295,0.007163,0.006603,0.003533,0.002468,0.003001,0.00578,0.001051,0.013606,0.006921,0.005227,0.003077,0.006769,0.010882,0.00878,0.007011,0.006312,0.011719,0.011449,0.004418,0.004715,0.003775,0.006603,0.010979,0.004287,0.16566,0.172829,0.245693,0.247117
std,1.000003,1.000003,1.000003,1.000003,1.000003,1.000003,1.000003,1.000003,1.000003,1.000003,1.000003,1.000003,1.000003,1.000003,1.000003,1.000003,1.000003,1.000003,121070.637266,5.386673,3.389694,3.871999,18.850562,55.590277,0.439938,0.276081,71.404631,52.698244,19.10255,4.993436,11.76541,0.609279,0.976664,0.595782,0.519926,0.572619,0.587359,0.113473,0.443246,0.49228,0.421417,0.262026,0.015103,0.03545,0.1933,0.080693,0.43559,0.011461,0.156041,0.463427,...,0.04019,0.127453,0.038704,0.120567,0.054,0.114946,0.072909,0.041539,0.052646,0.099961,0.07225,0.073517,0.040275,0.052319,0.090557,0.027061,0.064166,0.103684,0.110647,0.078462,0.045797,0.047855,0.084329,0.080988,0.059333,0.04962,0.054695,0.075806,0.0324,0.11585,0.082902,0.072107,0.055382,0.081992,0.103749,0.093292,0.083435,0.079199,0.107617,0.106387,0.06632,0.068505,0.061324,0.080988,0.104204,0.065331,0.371776,0.378101,0.430499,0.431337
min,-1.617931,-2.355265,-1.51962,-0.9333998,-2.58344,-1.031373,-0.5964407,-0.592605,-1.417861,-0.316256,-1.198097,-1.986919,-10.39717,-6.174053,-5.704396,-4.661824,-3.483145,-4.073475,150000.0,2.0,0.0,0.0,0.0,0.0,0.0,11.918391,0.0,0.0,0.0,2.0,1.0,0.0,0.627699,3.089742,2.197367,3.821405,3.661341,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,-0.8890485,-0.4988246,-0.6345816,-0.6751344,-0.4614801,-1.031373,-0.5964407,-0.592605,-0.9434604,-0.316256,-0.5973058,-0.7969864,-0.5987973,-0.7326821,-0.5836748,-0.5852302,-0.6882688,-0.6669764,345000.0,12.0,3.0,1.0,40.0,0.0,0.0,12.7513,0.0,25.0,0.0,5.0,15.0,5.969919,5.942073,6.140564,4.316886,5.421799,5.66217,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,0.204275,-0.3131806,-0.04455584,-0.416869,-0.3023331,-0.1679094,-0.5964407,-0.592605,0.1761253,-0.316256,0.003484952,0.0529657,0.1146586,0.0328318,0.08536791,0.1095461,0.0587098,0.0733421,417000.0,13.0,5.0,2.0,43.0,48.0,0.0,12.940842,0.0,84.0,0.0,8.0,25.0,6.404611,6.689721,6.539166,4.678117,5.849531,6.097001,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
75%,0.9331573,0.2437514,0.2504571,0.6161926,0.5994999,0.7495204,1.676612,0.331708,0.6695021,-0.316256,0.6042757,0.7329274,0.7096647,0.8105158,0.676635,0.7004766,0.6974868,0.6863644,505000.0,16.0,6.0,6.0,60.0,99.0,1.0,13.132314,66.0,110.0,0.0,11.0,33.0,6.767134,7.449254,6.891431,4.985356,6.215306,6.457064,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
max,1.66204,6.555647,11.16593,3.973643,9.405634,7.549294,1.676612,6.801899,5.944838,7.117332,6.612184,2.517827,2.994926,1.865598,2.827971,2.836153,3.994874,3.652865,779000.0,50.0,43.0,19.0,226.0,477.0,1.0,13.565766,528.0,388.0,142.0,41.0,54.0,8.15949,8.479712,8.173154,6.095745,8.103446,8.199458,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


In [16]:
housing.shape

(144640, 415)

### Train Test Split

In [17]:
# Identifying X and y variables

X = housing.drop(columns = ['resale_price','resale_price_log'])
y = housing['resale_price_log']
y_orig = housing['resale_price']

In [18]:
# # Instantiate our PolynomialFeatures object to create all two-way terms.
# poly = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)

# # Fit and transform our X data.
# X_overfit = poly.fit_transform(X)

In [19]:
#poly.get_feature_names_out(X.columns)

In [37]:
# Train-test-split (log)

X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.3,
    random_state=42
)

In [38]:
# Train-test-split (not log)

X_train_orig, X_test_orig, y_train_orig, y_test_orig = train_test_split(
    X,
    y_orig,
    test_size=0.3,
    random_state=42
)
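
The two splits above stay aligned only because they share `test_size` and `random_state`. As an alternative sketch, `train_test_split` accepts any number of arrays and splits them with the same row assignment in a single call. The data below is synthetic and the variable names are placeholders:

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(10, 3))                # synthetic predictors
price = rng.uniform(200_000, 800_000, size=10)   # synthetic prices
log_price = np.log(price)

# One call, one shuffle: the log and dollar targets share the same rows.
X_tr, X_te, ylog_tr, ylog_te, y_tr, y_te = train_test_split(
    X_demo, log_price, price, test_size=0.3, random_state=42
)

# The log target and the dollar target still describe the same rows.
print(np.allclose(np.exp(ylog_tr), y_tr))
```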

### Baseline Score

In [55]:
# Mean resale price 

mean = np.mean(y_train_orig)
mean

433917.48494597414

In [56]:
pred = [mean for x in y_train_orig]

In [58]:
# Baseline model RMSE

baseline_rmse = np.sqrt(np.mean((pred-y_train_orig)**2))
print(f'Baseline RMSE: {baseline_rmse}')

Baseline RMSE: 121148.4311571188
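
This baseline is in dollars, while the models below are fitted on `resale_price_log`, so their RMSE values are in log units and not directly comparable. A minimal sketch of putting both on the same scale, using made-up prices and a hypothetical `rmse` helper:

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length arrays."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Made-up prices standing in for y_train_orig.
prices = np.array([298_000.0, 335_000.0, 550_000.0, 665_000.0, 680_000.0])
log_prices = np.log(prices)

# Baseline on the log scale: predict the mean log price for every row.
log_baseline_pred = np.full_like(log_prices, log_prices.mean())
baseline_log_rmse = rmse(log_prices, log_baseline_pred)

# Dollar-scale error of a log-scale prediction: exponentiate before scoring.
baseline_dollar_rmse = rmse(prices, np.exp(log_baseline_pred))

print(baseline_log_rmse, baseline_dollar_rmse)
```

The same `np.exp` step could be applied to a model's log-scale predictions to report its error in dollars next to the dollar baseline.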


### Linear Regression

In [23]:
# Instantiating linear regression model

lr_model = LinearRegression()

In [24]:
# Fitting
lr_model.fit(X_train,y_train)

ValueError: Found input variables with inconsistent numbers of samples: [101248, 101249]
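
The error says `X_train` and `y_train` have different row counts. In a notebook this often points to cells having been run out of order, so that one variable is stale; that is an assumption here, not something the traceback confirms. A small guard such as the hypothetical `check_aligned` helper below can surface the mismatch before fitting:

```python
import pandas as pd

def check_aligned(X, y):
    """Raise a descriptive error if X and y cannot be paired row by row."""
    if len(X) != len(y):
        raise ValueError(f"Row count mismatch: X has {len(X)}, y has {len(y)}")
    if not X.index.equals(y.index):
        raise ValueError("X and y have the same length but different indexes")

# Toy inputs for illustration.
X_ok = pd.DataFrame({"a": [1, 2, 3]})
y_ok = pd.Series([10.0, 20.0, 30.0])
check_aligned(X_ok, y_ok)  # passes silently

try:
    check_aligned(X_ok, y_ok.iloc[:2])  # one row short
except ValueError as err:
    message = str(err)
    print(message)
```

Re-running the split cell immediately before the fit cell is usually enough to bring the two back in sync.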

In [None]:
# Returning the R^2 for the model

#Training set R2
print('Training set R^2: ', lr_model.score(X_train, y_train))

#Test set R2
print('Test set R^2: ', lr_model.score(X_test, y_test))

In [None]:
# Cross validation

lr_model_scores = cross_val_score(lr_model, X_train, y_train, cv=10)

print (lr_model_scores)
print (np.mean(lr_model_scores))

In [None]:
# Summary results with p-value 

X_train_lr = sm.add_constant(X_train)
ols = sm.OLS(y_train, X_train_lr).fit()
ols.summary()


In [None]:
# Function for obtaining coefficient names, values and p-values from OLS model:

def get_coef_table(lin_reg, variable):
    err_series = lin_reg.params - lin_reg.conf_int()[0]
    
    coef_df = pd.DataFrame({'varname': variable.columns,
                            'coef': lin_reg.params.values[1:],
                            'ci_err': err_series.values[1:],
                            'pvalue': lin_reg.pvalues.round(4).values[1:]
                           })
    return coef_df

In [None]:
pd.set_option('display.max_rows', None)
get_coef_table(ols, X)

In [None]:
# Predictions using LR
lr_y_pred = lr_model.predict(X_test)

In [None]:
r_sq_score = lr_model.score(X_train, y_train)
print('Training R-Squared Score:', r_sq_score)

In [None]:
print("Linear Regression Root Mean Squared Error:", np.sqrt(mean_squared_error(y_test,lr_y_pred)))

In [None]:
# Create scatterplot to show predicted values versus actual values on the test set (logged)

plt.figure(figsize=(8,8))
sns.regplot(x=lr_y_pred, y=y_test, marker='x', color='skyblue', line_kws={'color':'black'})
plt.xlabel('Predicted Log Resale Price', fontsize=14)
plt.ylabel('Actual Log Resale Price', fontsize=14)
plt.title('Linear Regression: Predicted vs Actual Log Resale Price', fontsize=18)

The linear regression model shows clear signs of overfitting: its cross-validation scores vary widely from fold to fold, and several coefficients are heavily inflated. Although the root mean squared error is low, the inflated coefficients make the model hard to interpret, so it is of limited use for identifying the factors that best predict housing prices.

Let's try Ridge Regression to see whether regularisation, which penalises large coefficients, can reduce model complexity and overfitting.
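
To illustrate why a penalty helps with inflated coefficients, here is a small sketch on synthetic data with two nearly identical predictors. The data and the `alpha=10.0` value are arbitrary choices for the demonstration, not the tuned values used below:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.01, size=n)  # almost an exact copy of x1
X_demo = np.column_stack([x1, x2])
y_demo = 3 * x1 + rng.normal(scale=0.5, size=n)

ols_demo = LinearRegression().fit(X_demo, y_demo)
ridge_demo = Ridge(alpha=10.0).fit(X_demo, y_demo)

# With collinear inputs OLS can split the effect into large offsetting values;
# the ridge penalty pulls the pair towards a shared, smaller coefficient.
print("OLS coefficients:  ", ols_demo.coef_)
print("Ridge coefficients:", ridge_demo.coef_)
```

The ridge coefficient vector is never longer than the unpenalised one, which is the shrinkage that stabilises the fit when predictors overlap.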

### Ridge Regression

In [None]:
# finding the optimal alpha 

ridge_alphas = np.logspace(0, 5, 200)

optimal_ridge = RidgeCV(alphas=ridge_alphas, cv=10)
optimal_ridge.fit(X_train, y_train)

print (optimal_ridge.alpha_)

In [None]:
# fitting to the model and getting R^2 scores

ridge = Ridge(alpha=optimal_ridge.alpha_)

ridge_scores = cross_val_score(ridge, X_train, y_train, cv=10)

print ('Cross-Validation scores:', ridge_scores)
print ('Mean Cross-Validation score:', np.mean(ridge_scores))

In [None]:
ridge.fit(X_train, y_train)

In [None]:
print("Training score:", ridge.score(X_train, y_train))
print("Test score:", ridge.score(X_test, y_test))

In [None]:
# Predictions using Ridge
ridge_y_pred = ridge.predict(X_test)
pd.DataFrame(ridge_y_pred).head()

In [None]:
# RMSE 

print("Ridge Root Mean Squared Error:", np.sqrt(mean_squared_error(y_test,ridge_y_pred)))

In [None]:
# R-Squared

print("R-Squared:", metrics.r2_score(y_test, ridge_y_pred))

In [None]:
# Function for getting the approximate dollar change in resale price for a 1 unit change in each predictor

def coef_fx(model):
    # The target is log(resale_price), so exp(coef) is the multiplicative effect of a 1 unit increase
    multipliers = np.exp(model.coef_)
    # Convert the proportional change to dollars, evaluated at the mean resale price
    mean_price = housing['resale_price'].mean()
    return ((multipliers - 1) * mean_price).tolist()
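
The conversion in `coef_fx` follows from the model being fitted on the log of the price. A short derivation, where $\bar{y}$ denotes the mean resale price and $\beta_j$ the coefficient of predictor $x_j$ (notation introduced here for illustration):

```latex
\log \hat{y} = \beta_0 + \sum_j \beta_j x_j
\quad\Longrightarrow\quad
\frac{\hat{y}(x_j + 1)}{\hat{y}(x_j)} = e^{\beta_j}

\Delta \hat{y} \;\approx\; \left(e^{\beta_j} - 1\right)\bar{y}
```

The dollar figure is therefore an approximation evaluated at the mean price. For a predictor that has been standardised, a 1 unit change corresponds to one standard deviation of the original variable.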

In [None]:
coef_fx(ridge)[:5]

In [None]:
# Summarising coefficients

ridge_coefs = pd.DataFrame({'variable':X.columns,
                            'coef':ridge.coef_,
                            'abs_coef':np.abs(ridge.coef_),
                            'coef_effect':coef_fx(ridge),
                           })

ridge_coefs.sort_values('abs_coef', inplace=True, ascending=False)
ridge_coefs.head(50)

In [None]:
# Create scatterplot to show predicted values versus actual values on the test set (logged)

plt.figure(figsize=(10,8))
sns.regplot(x=ridge_y_pred, y=y_test, marker='x', color='skyblue', line_kws={'color':'black'})
plt.xlabel('Predicted Log Resale Price', fontsize=14)
plt.ylabel('Actual Log Resale Price', fontsize=14)
plt.title('Ridge Regression: Predicted vs Actual Log Resale Price', fontsize=18)

In [None]:
# Create scatterplot to show predicted values versus actual values on the test set (not logged)

plt.figure(figsize=(10,8))
sns.regplot(x=np.exp(ridge_y_pred), y=np.exp(y_test),
            marker='x', color='orange', line_kws={'color':'black'})
plt.xlabel('Predicted Resale Price (SGD)', fontsize=14)
plt.ylabel('Actual Resale Price (SGD)', fontsize=14)
plt.title('Ridge Regression: Predicted vs Actual Resale Price', fontsize=22)

This time the model does not overfit, and the RMSE remains low. But perhaps more can be done to improve accuracy and to remove variables that explain little of the variance in resale price. We try L1 regularisation to see whether a Lasso regression can perform automated feature selection and keep the variables that best predict resale prices.
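
As a sketch of what automated feature selection means here, the example below fits a Lasso on synthetic data in which only two of six predictors affect the target. The data and `alpha=0.2` are arbitrary and chosen only to make the effect visible:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(1)
n = 300
X_demo = rng.normal(size=(n, 6))
# Only the first two columns influence the target; the other four are noise.
y_demo = 2.0 * X_demo[:, 0] - 1.5 * X_demo[:, 1] + rng.normal(scale=0.3, size=n)

lasso_demo = Lasso(alpha=0.2).fit(X_demo, y_demo)

# Indices of predictors whose coefficient was not shrunk to exactly zero.
kept = np.flatnonzero(lasso_demo.coef_)
print("Coefficients:", lasso_demo.coef_.round(3))
print("Columns kept:", kept)
```

Unlike the ridge penalty, the L1 penalty can set coefficients to exactly zero, which is why the zero-coefficient rows of a fitted Lasso can be read as removed predictors.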

### Lasso Regression

In [None]:
# Using LassoCV to obtain the optimal alpha

optimal_lasso = LassoCV(n_alphas=1000, cv=10)
optimal_lasso.fit(X_train, y_train)

print (optimal_lasso.alpha_)

In [None]:
# Running cross-validation

lasso = Lasso(alpha=optimal_lasso.alpha_)
lasso_scores = cross_val_score(lasso, X_train, y_train, cv=10)

print ('Cross-Validation scores:', lasso_scores)
print ('Cross-Validation mean score:', np.mean(lasso_scores))

The cross-validation scores suggest minimal overfitting.

In [None]:
lasso.fit(X_train, y_train)

In [None]:
print("Training set score:", lasso.score(X_train, y_train))
print("Test set score:", lasso.score(X_test, y_test))

In [None]:
# Obtaining each predictor's coefficient and converting it to show the effect of every unit change

lasso_coefs = pd.DataFrame({'variable':X.columns,
                            'coef':lasso.coef_,
                            'abs_coef':np.abs(lasso.coef_),
                            'coef_effect':coef_fx(lasso),
                           })

lasso_coefs.sort_values('abs_coef', inplace=True, ascending=False)
lasso_coefs.head(20)

In [None]:
# Predictors that were removed

lasso_coefs[lasso_coefs['abs_coef'] == 0]

In [None]:
lasso_y_pred = lasso.predict(X_test)
lasso_y_pred

In [None]:
# Calculating the RMSE 

print("Lasso Root Mean Squared Error:", np.sqrt(mean_squared_error(y_test,lasso_y_pred)))

In [None]:
# R-Squared

print("R-Squared:", metrics.r2_score(y_test, lasso_y_pred))

In [None]:
# Comparing predictions against the test set

lasso_predicted = pd.DataFrame({'y_hat':lasso.predict(X_test),
                               'y_actual': y_test,
                               'residuals': (y_test - lasso.predict(X_test)),
                               'actual test values': np.exp(y_test),
                               'predicted values': np.exp(lasso_y_pred).round(decimals = 1)                             
                              })

lasso_predicted.sort_values('residuals', inplace=True, ascending=False)
lasso_predicted.head(10)

In [None]:
# Create scatterplot to show predicted values versus actual values on the test set (logged)

plt.figure(figsize=(10,8))
sns.regplot(x=lasso_y_pred, y=y_test, marker='x', color='skyblue', line_kws={'color':'black'})
plt.xlabel('Predicted Log Resale Price', fontsize=14)
plt.ylabel('Actual Log Resale Price', fontsize=14)
plt.title('Lasso Regression: Predicted vs Actual Log Resale Price', fontsize=22)

In [None]:
# Create scatterplot to show predicted values versus actual values on the test set (not logged)

plt.figure(figsize=(10,8))
sns.regplot(x=np.exp(lasso_y_pred), y=np.exp(y_test),
            marker='x', color='orange', line_kws={'color':'black'})
plt.xlabel('Predicted Resale Price (SGD)', fontsize=14)
plt.ylabel('Actual Resale Price (SGD)', fontsize=14)
plt.title('Lasso Regression: Predicted vs Actual Resale Price', fontsize=22)

The RMSE of the Lasso Regression (0.081) is slightly higher than that of the Ridge Regression (0.080).

### ElasticNet

In [None]:
# Finding the optimal alpha and l1 ratio

l1_ratios = np.linspace(0.01, 1.0, 25)

optimal_enet = ElasticNetCV(l1_ratio=l1_ratios, n_alphas=1000, cv=10)
optimal_enet.fit(X_train, y_train)

print (f'Optimal alpha: {optimal_enet.alpha_}')
print (f'Optimal L1 ratio: {optimal_enet.l1_ratio_}')
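
For reference, the two hyperparameters searched above enter the objective that scikit-learn's `ElasticNet` minimises as follows, with $n$ the number of training rows, $w$ the coefficient vector, $\alpha$ the value of `alpha` and $\rho$ the value of `l1_ratio`:

```latex
\min_{w}\;
\frac{1}{2n}\,\lVert y - Xw \rVert_2^2
\;+\; \alpha\,\rho\,\lVert w \rVert_1
\;+\; \frac{\alpha\,(1-\rho)}{2}\,\lVert w \rVert_2^2
```

Setting $\rho = 1$ leaves only the L1 term, which is the Lasso penalty, while values near 0 leave almost only the squared L2 term, as in ridge. The grid above runs from 0.01 to 1.0, so a pure Lasso is among the candidates.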


In [None]:
# fitting the model

enet = ElasticNet(alpha=optimal_enet.alpha_, l1_ratio=optimal_enet.l1_ratio_)

enet_scores = cross_val_score(enet, X_train, y_train, cv=10)

print ("Cross-Validation scores: ", enet_scores)
print ("Cross-Validation mean score: ", np.mean(enet_scores))

In [None]:
enet.fit(X_train, y_train)

In [None]:
print("Training set scores: ", enet.score(X_train, y_train))
print("Test set scores: ", enet.score(X_test, y_test))

In [None]:
enet_y_pred = enet.predict(X_test)

In [None]:
print("ElasticNet Root Mean Squared Error:", np.sqrt(mean_squared_error(y_test,enet_y_pred)))

In [None]:
# R-Squared

print("R-Squared:", metrics.r2_score(y_test, enet_y_pred))

In [None]:
# Predicting on the test set

enet_predicted = pd.DataFrame({'y_hat':enet.predict(X_test),
                               'y_actual': y_test,
                               'residuals': (y_test - enet.predict(X_test)),
                               'actual test values': np.exp(y_test),
                               'predicted values': np.exp(enet_y_pred).round(decimals = 1)                             
                              })

enet_predicted.sort_values('residuals', inplace=True, ascending=False)
enet_predicted.head(10)

In [None]:
# Coefficients of predictors

enet_coefs = pd.DataFrame({'variable':X.columns,
                           'enet_coef':enet.coef_,
                           'enet_abs_coef':np.abs(enet.coef_),
                           'coef_effect':coef_fx(enet)})

enet_coefs.sort_values('enet_abs_coef', inplace=True, ascending=False)
enet_coefs.head(30)

In [None]:
# Create scatterplot to show predicted values versus actual values on the test set (logged)

plt.figure(figsize=(10,8))
sns.regplot(x=enet_y_pred, y=y_test, marker='x', color='skyblue', line_kws={'color':'black'})
plt.xlabel('Predicted Log Resale Price', fontsize=14)
plt.ylabel('Actual Log Resale Price', fontsize=14)
plt.title('ElasticNet: Predicted vs Actual Log Resale Price', fontsize=22)

In [None]:
# Create scatterplot to show predicted values versus actual values on the test set (not logged)

plt.figure(figsize=(10,8))
sns.regplot(x=np.exp(enet_y_pred), y=np.exp(y_test),
            marker='x', color='orange', line_kws={'color':'black'})
plt.xlabel('Predicted Resale Price (SGD)', fontsize=14)
plt.ylabel('Actual Resale Price (SGD)', fontsize=14)
plt.title('ElasticNet: Predicted vs Actual Resale Price', fontsize=22)

---

### Review

**Comparison between the 4 Models**:

| Model | Overfitting | RMSE (log scale) | R-Squared |
|---|---|---|---|
| Linear Regression | Yes | 0.0805 | not recorded |
| Ridge Regression | No | 0.0805 | about 0.87 |
| Lasso Regression | No | 0.0812 | about 0.87 |
| ElasticNet Regression | No | 0.0812 | about 0.87 |

- The baseline linear regression model performed most poorly, with inflated coefficients and overfitting (evident in the cross-validation scores).
- Ridge, Lasso and ElasticNet perform similarly in terms of R^2 scores (about 87%).
- Strongest predictors: floor area, age, distance to MRT stations and hawker centres, whether the estate is in Woodlands, and floor level.
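
A comparison like the one above can also be generated in code, which avoids copy-and-paste slips between metrics. The sketch below uses a hypothetical `summarise_models` helper on synthetic data; the alpha values are arbitrary, not the tuned values from this notebook:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import ElasticNet, Lasso, LinearRegression, Ridge
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

def summarise_models(models, X_tr, X_te, y_tr, y_te):
    """Fit each model and return train/test R^2 and test RMSE as a DataFrame."""
    rows = []
    for name, model in models.items():
        model.fit(X_tr, y_tr)
        pred = model.predict(X_te)
        rows.append({
            "model": name,
            "train_r2": model.score(X_tr, y_tr),
            "test_r2": r2_score(y_te, pred),
            "test_rmse": np.sqrt(mean_squared_error(y_te, pred)),
        })
    return pd.DataFrame(rows).set_index("model")

# Synthetic data standing in for the housing predictors and log price.
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(200, 5))
true_coefs = np.array([1.0, 0.5, 0.0, 0.0, -0.8])
y_demo = X_demo @ true_coefs + rng.normal(scale=0.2, size=200)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42
)

summary = summarise_models(
    {
        "Linear": LinearRegression(),
        "Ridge": Ridge(alpha=1.0),
        "Lasso": Lasso(alpha=0.01),
        "ElasticNet": ElasticNet(alpha=0.01, l1_ratio=0.5),
    },
    X_tr, X_te, y_tr, y_te,
)
print(summary.round(4))
```

A large gap between `train_r2` and `test_r2` in such a table is one quick indicator of overfitting.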

**Interpretation:**

<ol>
    <li>With every one-unit increase in floor area, resale price is estimated to increase by SGD 85K.</li>
    <li>With every one-unit increase in lease commencement date, resale price is estimated to increase by SGD 47K.</li>
    <li>Compared to houses in Kallang, resale prices in Woodlands, Jurong East, Sembawang and Choa Chu Kang are estimated to be SGD 20K to 36K lower, while prices in Marine Parade, Bishan, Bukit Merah and Queenstown are estimated to be SGD 11K to 15K higher.</li>
    <li>With every one-unit increase in distance from MRT stations and hawker centres, resale price is estimated to decrease by SGD 25K and SGD 23K respectively.</li>
    <li>With every one-unit increase in the estimated height of the house relative to the highest storey, resale price is estimated to increase by SGD 15K.</li>
</ol>


**Conclusion and Recommendations**

- Houses that have a larger floor area, have longer remaining leases, are closer to hawker centres and MRT stations, and are in mature estates nearer the central area tend to fetch higher resale prices.
- On the flip side, houses located far from the city centre have lower resale prices.

- The results indicate that people value a larger and newer home, which gives owners more flexibility in terms of family planning or rental.
- Accessibility to public transport (MRT stations) and cheap F&B options (hawker centres) matters because it affects day-to-day activities such as commuting and the overall cost of living.
- Keeping these needs and preferences in mind is useful when marketing or selling a house: emphasise the property's strengths and downplay its weaknesses.

**Future Steps**
- TBC

