# BMI 6018 Final Project & Final Exam
# Author: Tae Jang 

## Housing Price Prediction

* Encoding <<< includes Final Exam
* Data Split (train and test)
* Variable Selection (RFE method)
* Model Adjustments (ipywidgets)
* Prediction
* Validation

In [1]:
from sklearn.model_selection import train_test_split
import statsmodels.formula.api as smf 
import pandas as pd
import statistics as s
import matplotlib.pyplot as plt 
import seaborn as sn
import statsmodels.api as sm
import ipywidgets as widgets
from ipywidgets import Layout, Button, Box, FloatText, Textarea, Dropdown, Label, IntSlider, FloatSlider, SelectMultiple, interactive
import numpy as np
import random

  from pandas.core import datetools


In [2]:
%matplotlib inline

In [3]:
#data imputed in R with "missForest" package

np.random.seed(1234567)

data = pd.read_csv("real_estate_data.csv")
data.head()

Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,...,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
0,1,60,RL,65.0,8450,Pave,Pave,Reg,Lvl,AllPub,...,0,Fa,GdPrv,Shed,0,2,2008,WD,Normal,208500.0
1,2,20,RL,80.0,9600,Pave,Pave,Reg,Lvl,AllPub,...,0,Gd,MnPrv,Shed,0,5,2007,WD,Normal,181500.0
2,3,60,RL,68.0,11250,Pave,Pave,IR1,Lvl,AllPub,...,0,Fa,GdPrv,Shed,0,9,2008,WD,Normal,223500.0
3,4,70,RL,60.0,9550,Pave,Grvl,IR1,Lvl,AllPub,...,0,Gd,GdPrv,Shed,0,2,2006,WD,Abnorml,140000.0
4,5,60,RL,84.0,14260,Pave,Pave,IR1,Lvl,AllPub,...,0,Fa,GdPrv,Shed,0,12,2008,WD,Normal,250000.0


## Encoding

Encoding is necessary because for variable selection process. Most of variable selction methods (e.g. Recursive Feature Elimination (RFE) and regularization methods, Ridge and Lasso) do not support categorical variables. Encoding takes care of this problem as the outcome will turn into binary variables. We are encoding before we split the data.

I was frustrated with encoding in the beginning of this project, but I found this subject interesting. I wanted to explore different encoding methods. This part includes tutorial (Final Exam).


<em>***Code Source***

https://pandas.pydata.org/pandas-docs/stable/generated/pandas.get_dummies.html

https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html

https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html
             

In [4]:
#creating sample data to play with.
sample = data[['Street', 'Alley', 'SaleCondition', 'LotFrontage', 'SalePrice']]
sample.head()

#Pandas get_dummies method take only Pandas dataframe
np.random.seed(1234567)
pd.get_dummies(sample).head()

Unnamed: 0,LotFrontage,SalePrice,Street_Grvl,Street_Pave,Alley_Grvl,Alley_Pave,SaleCondition_Abnorml,SaleCondition_AdjLand,SaleCondition_Alloca,SaleCondition_Family,SaleCondition_Normal,SaleCondition_Partial
0,65.0,208500.0,0,1,0,1,0,0,0,0,1,0
1,80.0,181500.0,0,1,0,1,0,0,0,0,1,0
2,68.0,223500.0,0,1,0,1,0,0,0,0,1,0
3,60.0,140000.0,0,1,1,0,1,0,0,0,0,0
4,84.0,250000.0,0,1,0,1,0,0,0,0,1,0


Prefix of variable name can be changed. If prefix is not specified, get_dummies will automatically grab categorical variables only and turn them into binary variables.


In [5]:
pd.get_dummies(sample, prefix=["ST", "Al", "SC"]).head()


Unnamed: 0,LotFrontage,SalePrice,ST_Grvl,ST_Pave,Al_Grvl,Al_Pave,SC_Abnorml,SC_AdjLand,SC_Alloca,SC_Family,SC_Normal,SC_Partial
0,65.0,208500.0,0,1,0,1,0,0,0,0,1,0
1,80.0,181500.0,0,1,0,1,0,0,0,0,1,0
2,68.0,223500.0,0,1,0,1,0,0,0,0,1,0
3,60.0,140000.0,0,1,1,0,1,0,0,0,0,0
4,84.0,250000.0,0,1,0,1,0,0,0,0,1,0


Let's look at a dataset with NAs.

In [6]:
#randomly inserting NaN
mask = np.random.choice([True, False], size=sample.shape)
mask[mask.all(1),-1] = 0
sample1 = sample.mask(mask)
print(sample1.head())
#encoding as it is
pd.get_dummies(sample1).head()

  Street Alley SaleCondition  LotFrontage  SalePrice
0   Pave  Pave           NaN          NaN   208500.0
1    NaN   NaN        Normal          NaN   181500.0
2    NaN   NaN        Normal         68.0   223500.0
3    NaN   NaN       Abnorml         60.0        NaN
4   Pave  Pave           NaN          NaN   250000.0


Unnamed: 0,LotFrontage,SalePrice,Street_Grvl,Street_Pave,Alley_Grvl,Alley_Pave,SaleCondition_Abnorml,SaleCondition_AdjLand,SaleCondition_Alloca,SaleCondition_Family,SaleCondition_Normal,SaleCondition_Partial
0,,208500.0,0,1,0,1,0,0,0,0,0,0
1,,181500.0,0,0,0,0,0,0,0,0,1,0
2,68.0,223500.0,0,0,0,0,0,0,0,0,1,0
3,60.0,,0,0,0,0,1,0,0,0,0,0
4,,250000.0,0,1,0,1,0,0,0,0,0,0


In [7]:
#dummy_na agrument gives "NaN columns"
pd.get_dummies(sample1, dummy_na=True).head()

Unnamed: 0,LotFrontage,SalePrice,Street_Grvl,Street_Pave,Street_nan,Alley_Grvl,Alley_Pave,Alley_nan,SaleCondition_Abnorml,SaleCondition_AdjLand,SaleCondition_Alloca,SaleCondition_Family,SaleCondition_Normal,SaleCondition_Partial,SaleCondition_nan
0,,208500.0,0,1,0,0,1,0,0,0,0,0,0,0,1
1,,181500.0,0,0,1,0,0,1,0,0,0,0,1,0,0
2,68.0,223500.0,0,0,1,0,0,1,0,0,0,0,1,0,0
3,60.0,,0,0,1,0,0,1,1,0,0,0,0,0,0
4,,250000.0,0,1,0,0,1,0,0,0,0,0,0,0,1


### LabelEncoder and OneHotEncoder
#### Slightly different kind of encoding

LabelEncoder assigns numerical values (not binaey values) to categorical variables. LabelEncoder takes only arrays. OneHotEncoder will turn these numerical values into binary matrix. Also, you can invert to the original categorical variables with LabelEncoder. A disadvantage of LabelEncoder and OneHotEncoder is that they do not support multiple columns at the same time. Each column has to be converted to array.

As we learned in Natural Language Processing lecture, it is convinient for word vectors. Applications with encoding is unlimited.


In [8]:
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import LabelEncoder
from numpy import array
from numpy import argmax


sample_st= array(sample['Street'])
sample_al= array(sample['Alley'])
sample_sc= array(sample['SaleCondition'])
le = LabelEncoder()
integer_encoded = le.fit_transform(sample_sc) #<- change inputs here
#inv_sc=le.fit_transform(sample_sc)
#le.inverse_transform(inv_sc)

onehot_encoder = OneHotEncoder(sparse=False)
integer_encoded = integer_encoded.reshape(len(integer_encoded), 1)
onehot_encoded = onehot_encoder.fit_transform(integer_encoded)
onehot_encoded

#This array can be decomposed into 6 columns (just like get_dummies).

array([[0., 0., 0., 0., 1., 0.],
       [0., 0., 0., 0., 1., 0.],
       [0., 0., 0., 0., 1., 0.],
       ...,
       [0., 0., 0., 0., 1., 0.],
       [0., 0., 0., 0., 1., 0.],
       [0., 0., 0., 0., 1., 0.]])

In [9]:
#invert back to words
le.inverse_transform(integer_encoded)


  if diff:


array([['Normal'],
       ['Normal'],
       ['Normal'],
       ...,
       ['Normal'],
       ['Normal'],
       ['Normal']], dtype=object)

In [10]:
def encoder(x):
    'encoder take a dataset of any kind and turns into encoded dataset'
    return pd.get_dummies(pd.DataFrame(x))

In [11]:
encoder(sample_sc).head()

Unnamed: 0,0_Abnorml,0_AdjLand,0_Alloca,0_Family,0_Normal,0_Partial
0,0,0,0,0,1,0
1,0,0,0,0,1,0
2,0,0,0,0,1,0
3,1,0,0,0,0,0
4,0,0,0,0,1,0


## Now let us get back to the project


Notice that 81 variables turned into 290 variables

In [12]:
#Here, I'm simply using Pandas get_dummies because it is the easiest and fastest option out there.
data1 = pd.get_dummies(data)
data1.head()

Unnamed: 0,Id,MSSubClass,LotFrontage,LotArea,OverallQual,OverallCond,YearBuilt,YearRemodAdd,MasVnrArea,BsmtFinSF1,...,SaleType_ConLw,SaleType_New,SaleType_Oth,SaleType_WD,SaleCondition_Abnorml,SaleCondition_AdjLand,SaleCondition_Alloca,SaleCondition_Family,SaleCondition_Normal,SaleCondition_Partial
0,1,60,65.0,8450,7,5,2003,2003,196.0,706,...,0,0,0,1,0,0,0,0,1,0
1,2,20,80.0,9600,6,8,1976,1976,0.0,978,...,0,0,0,1,0,0,0,0,1,0
2,3,60,68.0,11250,7,5,2001,2002,162.0,486,...,0,0,0,1,0,0,0,0,1,0
3,4,70,60.0,9550,7,5,1915,1970,0.0,216,...,0,0,0,1,1,0,0,0,0,0
4,5,60,84.0,14260,8,5,2000,2000,350.0,655,...,0,0,0,1,0,0,0,0,1,0


## Train and Test Data Split

In [13]:
train ,test = train_test_split(data1,test_size=0.3) 


print(len(train))
print(len(test))

#NAs are taken care of
sum(train.isna().sum())

1022
438


0

## Variable Selection: Recursive Feature Elimination Method

https://machinelearningmastery.com/feature-selection-machine-learning-python/

As we wanted to pick 20 most important variables, when we ran RFE with all 289 independent variables (including encoded variables), the result was only giving us encoded variable (which doesn't make any sense). So we decided to run RFE among continuous variables only (the result made more sense).

For the sake of practice, we combined those two results to include in our regression.

In [14]:
#grouping continuous variables
num_df= train.select_dtypes(include=['int64', 'float64']).copy()
num = pd.DataFrame(num_df)

In [15]:
#Recursive Feature Elimination Method

from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

#dropping outcome variable for RFE (continuous variables only)
num1 = num.drop('SalePrice', axis = 1)

xvar = train.drop('SalePrice', axis = 1)
yvar = train['SalePrice']

model = LinearRegression()
#select 10 variables that are the most important
rfe = RFE(model, 10)
fit = rfe.fit(num1, yvar)
#fit1 = rfe.fit(xvar, yvar)
print("Num Features: %s" % (fit.n_features_))
print("Selected Features: %s" % (fit.support_))
#print("Selected Features: %s" % (fit1.support_))
#print("Feature Ranking: %s" % (fit.ranking_))

rank1 = fit.ranking_
feat = fit.support_

l1=[i for i, x in enumerate(feat) if x]

z1 = []
for v in l1:
    z1.append(train.columns[v])
z1

Num Features: 10
Selected Features: [False False False False  True False False False False False False False
 False False False False False  True  True  True  True  True  True  True
  True False  True False False False False False False False False False
 False]


['OverallQual',
 'BsmtFullBath',
 'BsmtHalfBath',
 'FullBath',
 'HalfBath',
 'BedroomAbvGr',
 'KitchenAbvGr',
 'TotRmsAbvGrd',
 'Fireplaces',
 'GarageCars']

In [16]:
#This RFE is including encoded variables (This is selecting only encoded variable!)
fit1 = rfe.fit(xvar, yvar)
fit1.n_features_
feat1 = fit1.support_

l=[i for i, x in enumerate(feat1) if x]

z = []
for v in l:
    z.append(train.columns[v])
z

['Functional_Typ',
 'FireplaceQu_Ex',
 'FireplaceQu_Fa',
 'FireplaceQu_Gd',
 'FireplaceQu_Po',
 'PoolQC_Fa',
 'PoolQC_Gd',
 'Fence_GdPrv',
 'Fence_GdWo',
 'Fence_MnPrv']

In [17]:
#selected variables (We are combing the two RFE results here)
v_selected = z+z1

#all variables except for SalePrice
v_list = list(train.columns)
v_list.pop(v_list.index('SalePrice'))

v_list

['Id',
 'MSSubClass',
 'LotFrontage',
 'LotArea',
 'OverallQual',
 'OverallCond',
 'YearBuilt',
 'YearRemodAdd',
 'MasVnrArea',
 'BsmtFinSF1',
 'BsmtFinSF2',
 'BsmtUnfSF',
 'TotalBsmtSF',
 'X1stFlrSF',
 'X2ndFlrSF',
 'LowQualFinSF',
 'GrLivArea',
 'BsmtFullBath',
 'BsmtHalfBath',
 'FullBath',
 'HalfBath',
 'BedroomAbvGr',
 'KitchenAbvGr',
 'TotRmsAbvGrd',
 'Fireplaces',
 'GarageYrBlt',
 'GarageCars',
 'GarageArea',
 'WoodDeckSF',
 'OpenPorchSF',
 'EnclosedPorch',
 'X3SsnPorch',
 'ScreenPorch',
 'PoolArea',
 'MiscVal',
 'MoSold',
 'YrSold',
 'MSZoning_C (all)',
 'MSZoning_FV',
 'MSZoning_RH',
 'MSZoning_RL',
 'MSZoning_RM',
 'Street_Grvl',
 'Street_Pave',
 'Alley_Grvl',
 'Alley_Pave',
 'LotShape_IR1',
 'LotShape_IR2',
 'LotShape_IR3',
 'LotShape_Reg',
 'LandContour_Bnk',
 'LandContour_HLS',
 'LandContour_Low',
 'LandContour_Lvl',
 'Utilities_AllPub',
 'Utilities_NoSeWa',
 'LotConfig_Corner',
 'LotConfig_CulDSac',
 'LotConfig_FR2',
 'LotConfig_FR3',
 'LotConfig_Inside',
 'LandSlope_Gtl',

In [18]:
#list to string
#col_list = " ".join(str(x) for x in train.columns[:-1])
#col = col_list.replace(" ", " + ")
#list(train.columns)

In [19]:
#all-variable linear regression
#reg1 = smf.ols('SalePrice ~'+col, data=train).fit()
#print (reg1.summary())

## Model Adjustments

* Correlation Threshold (to identify multicollinearities)
* Pick your variables (default is the result of RFE)

In [20]:
def disp1a(Cor):
    corr = data.corr()
    corr = corr[corr > Cor]
    corr = corr[corr.sum()>1]
    corr1 = corr.filter(corr.sum() > 1, axis = 1)
    corr1 = corr1.to_string()
    corr1 = corr1.replace("[","")
    corr1 = corr1.replace("]","")
    corr1 = corr1.replace("Empty DataFrame\nColumns: \nIndex: ","")
    corr1 = corr1.split(", ")
    corr2 = corr.filter(corr1, axis = 1)
    corr2
    heat = sn.heatmap(corr2, annot=False, cmap="Reds")
    return plt.show()

w = interactive(disp1a, Cor = FloatSlider(value=0.7, min=0, max=1, step = 0.01))
display(w)

In [21]:
#Checking for multicollinearities. It seems okay with 70% correlation threshold.
v_selected

['Functional_Typ',
 'FireplaceQu_Ex',
 'FireplaceQu_Fa',
 'FireplaceQu_Gd',
 'FireplaceQu_Po',
 'PoolQC_Fa',
 'PoolQC_Gd',
 'Fence_GdPrv',
 'Fence_GdWo',
 'Fence_MnPrv',
 'OverallQual',
 'BsmtFullBath',
 'BsmtHalfBath',
 'FullBath',
 'HalfBath',
 'BedroomAbvGr',
 'KitchenAbvGr',
 'TotRmsAbvGrd',
 'Fireplaces',
 'GarageCars']

In [26]:
def disp2a(S):
    SS = list(S)
    #print(SS)
    #list to string
    col_list = " ".join(str(x) for x in SS)
    col = col_list.replace(" ", " + ")
    #linear regression
    reg1 = smf.ols('SalePrice ~'+col, data=train).fit()
    print(reg1.summary())
    
P = interactive(disp2a, S = SelectMultiple(options=v_list, 
                                            value=v_selected,
                                            description = 'Correlation Threshold', style ={'description_width': 'initial'},
                                            layout=Layout(flex_flow='column',
                                                            border='solid 2px',
                                                            align_items='stretch',
                                                            width='40%',
                                                            height='200px')))
display(P)

## Prediction and Validation

Using the above model, we will predict and validate the model comparing Root Mean Squared Error (RMSE).

In [23]:
#Predict train set
X = train[v_selected]
Y = train['SalePrice']

regr = LinearRegression()
regr.fit(X, Y)
pred_train = regr.predict(X)

In [24]:
#Predict test set
X_test = test[v_selected]
Y_test = test['SalePrice']

regr.fit(X_test, Y_test)
pred_test = regr.predict(X_test)

In [25]:
#calculate RMSE of both train and test set
from sklearn.metrics import mean_squared_error

lin_mse = mean_squared_error(pred_train, Y)
lin_rmse = np.sqrt(lin_mse)
print('Linear Regression RMSE - Train: %.4f' % lin_rmse)

lin_mse = mean_squared_error(pred_test, Y_test)
lin_rmse = np.sqrt(lin_mse)
print('Linear Regression RMSE - Test: %.4f' % lin_rmse)

#The model is slightly underfit but acceptable

Linear Regression RMSE - Train: 37549.4296
Linear Regression RMSE - Test: 37382.7285
