# Part 3: Creating our Final Model 

The Data used in this notebook can be found on the Kaggle competition page. Here is the description of the challenge (from Kaggle): 

Ask a home buyer to describe their dream house, and they probably won't begin with the height of the basement ceiling or the proximity to an east-west railroad. But this playground competition's dataset proves that much more influences price negotiations than the number of bedrooms or a white-picket fence.

With 79 explanatory variables describing (almost) every aspect of residential homes in Ames, Iowa, this competition challenges you to predict the final price of each home.

(https://www.kaggle.com/c/house-prices-advanced-regression-techniques)

In [165]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import warnings
warnings.filterwarnings('ignore')
%matplotlib inline
import numpy as np
import seaborn as sns
from sklearn.model_selection import KFold, cross_val_score,StratifiedKFold
from sklearn.preprocessing import RobustScaler
from sklearn.metrics import mean_squared_error


pd.set_option('display.max_columns', None)  
pd.set_option('display.max_rows', None) 
pd.set_option('display.max_colwidth', -1)

In this notebook we create different models and check the cross validation score for them. Then we will optimize the model by tuning the hyperparameters. Using these models we blend them together to create the best possible model. 

## Step 1: Import and Randomize our Data

As usual, first we import the data and randomize it. Instead of splitting the dataset in training, testing, and validation data, we are just randomizing it. This is because during our hyperparameter tuning we will use Cross Validation

In [166]:
#import our Data

data_location = 'housing_data_final.csv'
df = pd.read_csv(data_location)

In [167]:
#Randomize our Data an call K-fold to preform cross validation 

df = df.sample(frac=1)

df = df.reset_index(drop=True)

df_X = df.drop(columns=['SalePrice'])
df_y = df['SalePrice']

kf = KFold(n_splits=5)

## Step 2: Implement and Tune Models

We will now use six different algorithms and run cross validation on them. This will optomoize our models hyperparameters. For each of the cell in this section (excluding the first one below this) remember to change the loop (and the model nested within it) too loop through the different hyperparameter variables.  

In [170]:
#function that does cross validation 

def cf_score(model, X, y):
    rmse = np.sqrt(-cross_val_score(model, X, y, 
                    scoring = "neg_mean_squared_error", cv = kf))
    
    return rmse

In [171]:
#Assses individuals models using cross validation 

from sklearn.linear_model import LinearRegression

linear = LinearRegression()

met = cf_score(linear, df_X, df_y)

print(f'Mean: {np.mean(met)}, Std. Dev.: {np.std(met)}')

Mean: 0.12763245416656266, Std. Dev.: 0.008258263033702888


In [173]:
#HyperParameter tuning (Ridge)

from sklearn.linear_model import Ridge


alpha = [.5 ,1, 1.5, 2, 2.5, 3]

score_mean = []
score_std = []

for a in alpha: 
    ridge = Ridge(alpha = a)
    met = cf_score(ridge, df_X, df_y)
    score_mean.append(np.mean(met))
    score_std.append(np.std(met))
    
    
scores = pd.DataFrame(data = {'alpha':alpha, 'mean':score_mean, 'Std. Dev.':score_std})
                       
scores

Unnamed: 0,alpha,mean,Std. Dev.
0,0.5,0.12721,0.008001
1,1.0,0.127081,0.007725
2,1.5,0.127056,0.007483
3,2.0,0.127077,0.007277
4,2.5,0.127125,0.0071
5,3.0,0.127188,0.006947


In [180]:
#HyperParameter tuning (RandomForest)


from sklearn.ensemble import RandomForestRegressor

n_est = [500, 750, 1000, 1250, 1500]
max_depth = [5, 10, 15, 20, 25]
min_samples_split = [2, 4, 5, 7, 10]
max_features = [1, .1, .25, .5, .75, 'auto', None]

score_mean = []
score_std = []


for i in max_features:  
    rf = RandomForestRegressor(n_estimators=1000,
                              max_depth=20,
                              min_samples_split= 5,
                              min_samples_leaf=1,
                              max_features=.25, 
                              oob_score=True,
                              random_state=42)

    met=cf_score(rf, df_X, df_y)
    score_mean.append(np.mean(met))
    print(i)
    print(np.mean(met))
    score_std.append(np.std(met))
    
scores = pd.DataFrame(data = {'metric':max_features, 'mean':score_mean, 'Std. Dev.':score_std})

scores

1
0.15448835259628496
0.1
0.14006288175890852
0.25
0.13683913305493425
0.5
0.13835232634419464
0.75
0.1402230458485898
auto
0.14234795915104584
None
0.14234795915104584


Unnamed: 0,metric,mean,Std. Dev.
0,1,0.154488,0.0099
1,0.1,0.140063,0.010221
2,0.25,0.136839,0.010007
3,0.5,0.138352,0.010413
4,0.75,0.140223,0.010363
5,auto,0.142348,0.010165
6,,0.142348,0.010165


In [195]:
#HyperParameter tuning (XGBRegressor)


from xgboost import XGBRegressor

learning_rate = [.05, .1, .2, .5, .75, 1]
max_depth = [2,4,6,10,12]
min_child_weight = [5,10,15,20]
gamma = [0, .1, .6, 1]
alpha = [0, .1, .5, 1, 1.5]
#alpha = np.arange(1,7,2)

score_mean = []
std_mean = []

for i in alpha:  ## change to alpha
    xgb = XGBRegressor(learning_rate= .05,
                           n_estimators=6000,
                           max_depth= 12,
                           min_child_weight=15,
                           gamma=0,  
                           objective='reg:squarederror',
                           nthread=-1,
                           seed=27,
                           reg_alpha=i,
                           random_state=42)
    print(i)

    met=cf_score(xgb, df_X, df_y)
    score_mean.append(np.mean(met))
    std_mean.append(np.std(met))
    print(met)
    
scores = pd.DataFrame(data = {'metric':alpha, 'mean':score_mean, 'Std. Dev.':std_mean})  #change metric:gamma metric:alpha

scores

0
[0.14455633 0.13561319 0.14502696 0.12949093 0.13560097]
0.1
[0.14277981 0.13535803 0.14136242 0.12752613 0.13329778]
0.5
[0.13862838 0.12718245 0.13835522 0.12378724 0.12939915]
1
[0.13623831 0.12481996 0.13651455 0.12320651 0.12978315]
1.5
[0.1365239  0.12136062 0.13866885 0.12345615 0.13224422]


Unnamed: 0,metric,mean,Std. Dev.
0,0.0,0.138058,0.005936
1,0.1,0.136065,0.005554
2,0.5,0.13147,0.006006
3,1.0,0.130112,0.005556
4,1.5,0.130451,0.006917


In [184]:
#HyperParameter tuning (Lasso)


from sklearn.linear_model import Lasso


alpha = [.00001 ,.00005, .0001, .00015, .0002,.00025,.0003]

score_mean = []
score_std = []

for a in alpha: 
    lasso = Lasso(alpha = a)
    met = cf_score(lasso, df_X, df_y)
    score_mean.append(np.mean(met))
    score_std.append(np.std(met))
    
    
scores = pd.DataFrame(data = {'alpha':alpha, 'mean':score_mean, 'Std. Dev.':score_std})
                       
scores

Unnamed: 0,alpha,mean,Std. Dev.
0,1e-05,0.127541,0.008252
1,5e-05,0.12725,0.008195
2,0.0001,0.127064,0.0081
3,0.00015,0.126961,0.008025
4,0.0002,0.126924,0.007894
5,0.00025,0.127033,0.007656
6,0.0003,0.127161,0.0074


In [187]:
#HyperParameter tuning (LGBMRegressor)


from lightgbm import LGBMRegressor 
    
    
num_leaves = np.arange(2,11,1)
n_estimators = np.arange(250,751,50)
learning_rate = np.arange(.01, .35, .01)
boosting_type = ['gbdt','dart','goss']


score_mean = []
score_std = []


for i in num_leaves:
    lgb = LGBMRegressor(boosting_type='gbdt',
                                  num_leaves = 3,
                                  n_estimators = 550,
                                  learning_rate = .06)
                    
    met = cf_score(lgb, df_X, df_y)
    score_mean.append(np.mean(met))
    score_std.append(np.std(met))
    
    
scores = pd.DataFrame(data = {'metrics':num_leaves, 'mean':score_mean, 'Std. Dev.':score_std})
                       
scores

Unnamed: 0,metrics,mean,Std. Dev.
0,2,0.13326,0.007918
1,3,0.128798,0.007409
2,4,0.129573,0.006936
3,5,0.129769,0.00635
4,6,0.13021,0.006291
5,7,0.131092,0.006231
6,8,0.131919,0.006767
7,9,0.131957,0.006701
8,10,0.133149,0.006266


## Part 3: Create a blended model, and assess our Models Training and Validation scores

In [196]:
from lightgbm import LGBMRegressor 
from sklearn.linear_model import Lasso
from xgboost import XGBRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Ridge
from sklearn.linear_model import LinearRegression

Create our model with our optimized hyperparameters. 

In [197]:
#Create our Models based on optimized hyperparameter values 

rf = RandomForestRegressor(n_estimators=1000,
                           max_depth=20,
                           min_samples_split=5,
                           min_samples_leaf=1,
                           max_features=.25,
                           oob_score=True,
                           random_state=42)

ridge = Ridge(alpha = 1.5)

linear = LinearRegression()

xgb = XGBRegressor(learning_rate=.05,
                    n_estimators=6000,
                    max_depth=12,
                    min_child_weight=15,
                    gamma=0,
                    objective='reg:squarederror',
                    nthread=-1,
                    seed=27,
                    reg_alpha=1,
                    random_state=42)


lasso = Lasso(alpha = 0.0002)

lgb = LGBMRegressor(boosting_type='gbdt',
                    num_leaves = 3,
                    n_estimators = 550,
                    learning_rate = .06)


We will also reimport the data again, this time splitting them into trainng and validation datasets. This will be used to create the ideal blend for our final model. 

In [198]:
#re-import data and randomize it again 
data_location = 'housing_data_final.csv'
df = pd.read_csv(data_location)      

df_X = df.drop(columns=['SalePrice'])
df_y = df['SalePrice']

In [199]:
#randomize data into training and validation sets
rand_split = np.random.rand(len(df))  
train_list = rand_split < 0.75
valid_list = rand_split >= 0.75

train_X = df_X[train_list]
train_y = df_y[train_list]

valid_X = df_X[valid_list]
valid_y = df_y[valid_list]

The function below will be used to calculate the RMSLE score of our predictions. We will go through all the models in isolation and see how they preforms. In the end we will use our lists (created two cells below) to create a dataframe to find the best model blend.

In [200]:
#function for rmsle
def rmsle(model, X, y):
    
    rmse = np.sqrt(mean_squared_error(model.predict(X),y))
    return rmse


In [201]:
#create empty list that we will fill with model scores
train_score_list = []
valid_score_list = []
name_list = []

In [202]:
#create models that will be blended together
rf.fit(train_X, train_y) 

train_score = rmsle(rf, train_X, train_y)
valid_score = rmsle(rf, valid_X, valid_y)

print(f'train score = {train_score}')
print(f'valid score = {valid_score}')

name_list.append('rf')
train_score_list.append(train_score)
valid_score_list.append(valid_score)

train score = 0.06742138156613312
valid score = 0.13924309949398544


In [203]:
linear.fit(train_X, train_y) 

train_score = rmsle(linear, train_X, train_y)
valid_score = rmsle(linear, valid_X, valid_y)

print(f'train score = {train_score}')
print(f'valid score = {valid_score}')

name_list.append('linear')
train_score_list.append(train_score)
valid_score_list.append(valid_score)

train score = 0.1175657886064351
valid score = 0.1337759462181755


In [204]:
xgb.fit(train_X, train_y) 

train_score = rmsle(xgb, train_X, train_y)
valid_score = rmsle(xgb, valid_X, valid_y)

print(f'train score = {train_score}')
print(f'valid score = {valid_score}')

name_list.append('xgb')
train_score_list.append(train_score)
valid_score_list.append(valid_score)

train score = 0.08047667781513794
valid score = 0.13554007677774318


In [205]:
ridge.fit(train_X, train_y) 

train_score = rmsle(ridge, train_X, train_y)
valid_score = rmsle(ridge, valid_X, valid_y)

print(f'train score = {train_score}')
print(f'valid score = {valid_score}')

name_list.append('ridge')
train_score_list.append(train_score)
valid_score_list.append(valid_score)

train score = 0.11818933106390538
valid score = 0.13362273781753262


In [206]:
lasso.fit(train_X, train_y) 

train_score = rmsle(lasso, train_X, train_y)
valid_score = rmsle(lasso, valid_X, valid_y)

print(f'train score = {train_score}')
print(f'valid score = {valid_score}')

name_list.append('lasso')
train_score_list.append(train_score)
valid_score_list.append(valid_score)

train score = 0.11832332003151178
valid score = 0.13319813799178432


In [207]:
lgb.fit(train_X, train_y) 

train_score = rmsle(lgb, train_X, train_y)
valid_score = rmsle(lgb, valid_X, valid_y)

print(f'train score = {train_score}')
print(f'valid score = {valid_score}')

name_list.append('lgb')
train_score_list.append(train_score)
valid_score_list.append(valid_score)

train score = 0.10447589741919013
valid score = 0.13599208677279326


In [208]:
#get model predictions

train_pred_rf = rf.predict(train_X)
valid_pred_rf = rf.predict(valid_X)

train_pred_ridge = ridge.predict(train_X)
valid_pred_ridge = ridge.predict(valid_X)

train_pred_linear = linear.predict(train_X)
valid_pred_linear = linear.predict(valid_X)

train_pred_lasso = lasso.predict(train_X)
valid_pred_lasso = lasso.predict(valid_X)

train_pred_xgb = xgb.predict(train_X)
valid_pred_xgb = xgb.predict(valid_X)

train_pred_lgb = lgb.predict(train_X)
valid_pred_lgb = lgb.predict(valid_X)


The function below is used to calculate the RMSLE score of the blended model. I wrote a nested loop to go through all several possible model blending combination in an attempt to find the best one.

In [209]:
#function to calculate blended model score
def rmsle_blended(predict_list, percentage_list, y): 
    
    sum_list = []
    
    for i,j in zip(predict_list, percentage_list):
        sum_list.append(i*j)

    sum_mat = np.array(sum_list)

    pred = np.sum(sum_mat, axis = 0)
    
    
    rmse = np.sqrt(mean_squared_error(pred,y))
    
    return rmse

In [210]:
train_list = [train_pred_rf, train_pred_ridge, train_pred_linear, train_pred_xgb, train_pred_lgb, train_pred_lasso]
valid_list = [valid_pred_rf, valid_pred_ridge, valid_pred_linear, valid_pred_xgb, valid_pred_lgb, valid_pred_lasso]
percentage_list = [.1, .4, .2, .1, .1, .1]

train_rmsle= rmsle_blended(train_list, percentage_list, train_y)
valid_rmsle= rmsle_blended(valid_list, percentage_list, valid_y)

train_rmsle, valid_rmsle

(0.10466490594687117, 0.12951802227212394)

In [211]:
layer_1 = np.arange(.1,.91,.1)
layer_2 = np.arange(.1,.91,.1)
layer_3 = np.arange(.1,.91,.1)
layer_4 = np.arange(.1,.91,.1)
layer_5 = np.arange(.1,.91,.1)
layer_6 = np.arange(.1,.91,.1)

for i in layer_1:
    for j in layer_2:
        for k in layer_3:
            for l in layer_4:
                for m in layer_5:
                    for n in layer_6:
                        per_list = [i,j,k,l,m,n]
                        name = '-'.join(str("{:.1f}".format(e)) for e  in per_list)
                        if name not in name_list and (i+j+k+l+m+n) == 1:
                            train_rmsle= rmsle_blended(train_list, per_list, train_y)
                            valid_rmsle= rmsle_blended(valid_list, per_list, valid_y)

                            train_score_list.append(train_rmsle)
                            valid_score_list.append(valid_rmsle)
                            name_list.append(name)
                    

In [212]:
metric_dict = {'Name': name_list, 'Train Score' : train_score_list, 'Valid Score': valid_score_list}

metric_data = pd.DataFrame(data = metric_dict)
metric_data.sort_values(by=['Valid Score'])

Unnamed: 0,Name,Train Score,Valid Score
31,0.1-0.1-0.3-0.3-0.1-0.1,0.095994,0.128519
46,0.1-0.2-0.2-0.3-0.1-0.1,0.096079,0.128575
25,0.1-0.1-0.2-0.3-0.1-0.2,0.09612,0.128576
74,0.2-0.1-0.3-0.2-0.1-0.1,0.094512,0.128612
40,0.1-0.2-0.1-0.3-0.1-0.2,0.096219,0.128645
54,0.1-0.3-0.1-0.3-0.1-0.1,0.09618,0.128647
15,0.1-0.1-0.1-0.3-0.1-0.3,0.096265,0.128648
71,0.2-0.1-0.2-0.2-0.1-0.2,0.094639,0.128663
82,0.2-0.2-0.2-0.2-0.1-0.1,0.094599,0.128664
24,0.1-0.1-0.2-0.2-0.2-0.2,0.098481,0.1287


## Step 4: Transform Testing Data and get Testing Predictions

Now that we have a found the model we want to use its time to import our testing data and transform it to match the format of our cleaned training data. 

In [213]:
data_location = 'test.csv'
df = pd.read_csv(data_location, low_memory=False)
df=df.set_index('Id')

In [214]:
replace_num={'MSSubClass': 50.0,
 'LotFrontage': 69.0,
 'LotArea': 9478.5,
 'OverallQual': 6.0,
 'OverallCond': 5.0,
 'YearBuilt': 1973.0,
 'YearRemodAdd': 1994.0,
 'MasVnrArea': 0.0,
 'BsmtFinSF1': 383.5,
 'BsmtFinSF2': 0.0,
 'BsmtUnfSF': 477.5,
 'TotalBsmtSF': 991.5,
 '1stFlrSF': 1087.0,
 '2ndFlrSF': 0.0,
 'LowQualFinSF': 0.0,
 'GrLivArea': 1464.0,
 'BsmtFullBath': 0.0,
 'BsmtHalfBath': 0.0,
 'FullBath': 2.0,
 'HalfBath': 0.0,
 'BedroomAbvGr': 3.0,
 'KitchenAbvGr': 1.0,
 'TotRmsAbvGrd': 6.0,
 'Fireplaces': 1.0,
 'GarageYrBlt': 1980.0,
 'GarageCars': 2.0,
 'GarageArea': 480.0,
 'WoodDeckSF': 0.0,
 'OpenPorchSF': 25.0,
 'EnclosedPorch': 0.0,
 '3SsnPorch': 0.0,
 'ScreenPorch': 0.0,
 'PoolArea': 0.0,
 'MiscVal': 0.0,
 'MoSold': 6.0,
 'YrSold': 2008.0,
 'TotalSF': 2474.0,
 'TotalBath': 2.0,
 'SalePrice': 163000.0}

df = df.fillna(replace_num)

In [215]:
repl_val={'MasVnrType':'None','BsmtFinType1':'NoBas','BsmtFinType2':'NoBas','BsmtCond':'NoBas', 'BsmtExposure': 'NoBas',
         'Electrical':'Sbrkr','FireplaceQu':'NoFp','GarageType':'NoGar','GarageFinish':'NoGar','GarageQual':'NoGar','GarageCond':'NoGar',
         'PoolQC':'NoP','Fence':'NoF','BsmtQual':'NoBas','KitchenQual' : 'TA', 'GarageQual': 'TA' }

df= df.fillna(repl_val)

df['TotalSF'] = df['TotalBsmtSF'] + df['1stFlrSF'] + df['2ndFlrSF'] 

df['TotalBath'] = .5 *(df['HalfBath'] + df['BsmtHalfBath']) + df['FullBath'] + df['BsmtFullBath']

In [216]:
cols_ohe = ['Neighborhood', 'HouseStyle', 'Foundation','GarageType'] 
cols = ['ExterQual', 'BsmtQual', 'BsmtExposure', 'KitchenQual', 'GarageFinish', 'OverallQual', 'YearBuilt', 'MasVnrArea',
        'GrLivArea', 'TotRmsAbvGrd', 'Fireplaces', 'GarageArea', 'TotalSF', 'TotalBath']

replace= {'ExterQual':{'Ex':7,'Fa':2,'Gd':5,'TA':3,'Po':1},'BsmtQual':{'Ex':6,'Fa':2,'Gd':4,'TA':3,'Po':1,'NoBas':2},
          'BsmtExposure':{'No':3,'Gd':5,'Mn':4,'Av':4,'NoBas':2},'KitchenQual':{'TA':3,'Gd':4,'Ex':6,'Fa':2,'Po':1},
           'FireplaceQu':{'TA':2,'Gd':2.2,'Ex':3.3,'Fa':1.6,'Po':1.2,'NoFp':1.4},'GarageFinish': {'RFn':4,'Unf':3,'Fin':5,'NoGar':2}}

df_cleaned = df[cols].replace(replace)

df_cleaned['YearBuilt']=2020-df_cleaned['YearBuilt']


df_cleaned_dummy= pd.get_dummies(df[cols_ohe])
df_cleaned_dummy['HouseStyle_2.5Fin'] = 0

df_cleaned = pd.concat([df_cleaned,df_cleaned_dummy],axis=1)

In [217]:
replace_2 = {'OverallCond': {1:1, 2:1, 3:1, 4:2, 5:2, 6:2, 7:3, 8:3, 9:3, 10:3}, 
           'GarageQual': {'TA': 1, 'Gd': 2, 'Fa': 1, 'Ex': 3,'Po': 1}}
            
new_cols = ['OverallCond', 'GarageQual']

df_add = df[new_cols]

df_add = df_add.fillna({'OverallCond' : 5, 'GarageQual': 'NoGar'})

df_add['CommerZone']=0
df_add.loc[df['MSZoning'] == 'C','CommerZone']=1


for i  in (replace_2.keys()):
    df_add[i] = df_add[i].map(replace_2[i])

 
df_testing = pd.concat([df_cleaned, df_add], axis=1)

for j in ['MasVnrArea','GrLivArea','GarageArea','TotalSF']:
    df_testing.loc[:,j] = np.log1p(df_testing[j])

In [218]:
df_testing = df_testing[df_X.columns]
df_testing['ExterQual']=df_testing['ExterQual'].astype('int32')
df_testing['KitchenQual']=df_testing['KitchenQual'].astype('int32')

Finally, we will now fit the data to our entire labelled dataset. Then we use use it to predict the unlabeled test data. 

In [219]:
rf.fit(df_X, df_y) 
ridge.fit(df_X,df_y)
xgb.fit(df_X,df_y)
linear.fit(df_X,df_y)
lasso.fit(df_X,df_y)
lgb.fit(df_X,df_y)

LGBMRegressor(boosting_type='gbdt', class_weight=None, colsample_bytree=1.0,
       importance_type='split', learning_rate=0.06, max_depth=-1,
       min_child_samples=20, min_child_weight=0.001, min_split_gain=0.0,
       n_estimators=550, n_jobs=-1, num_leaves=3, objective=None,
       random_state=None, reg_alpha=0.0, reg_lambda=0.0, silent=True,
       subsample=1.0, subsample_for_bin=200000, subsample_freq=0)

In [220]:
test_list = [train_pred_rf, train_pred_ridge, train_pred_linear, train_pred_xgb, train_pred_lgb, train_pred_lasso]
percentage_list = [.1, .1, .3, .3, .1, .1]

test_pred = (.1 * rf.predict(df_testing) + .1 * ridge.predict(df_testing) + .3 * linear.predict(df_testing) +
             .3 * xgb.predict(df_testing) + .1 * lgb.predict(df_testing) + .1 * lasso.predict(df_testing))

In [221]:
testing_y = np.expm1(test_pred)

In [222]:
output = pd.DataFrame(data =
                     {'Id' : df_testing.index.values, 'SalePrice': testing_y})

In [223]:
output.to_csv('output.csv',index=False)

Now all that is left to do is upload the results to the kaggle competition. 

Fin