# Notebook 3: Model Training, Evaluation & Selection

## Part 1 - DEFINE
***

### Define the problem

In this notebook, I use the Ames, Iowa housing dataset to first establish a simple baseline with OLS regression. I then develop several predictive models (random forest, XGBoost and LightGBM regressors) and compare them against the baseline, aiming for better predictive performance. Models of this kind could help housing agencies, real-estate companies, banks, municipal governments and home buyers make informed decisions about market pricing.

### Objective: 
- To build a predictive ML model with a mean absolute error (MAE) below 20,000, measured on the original sale-price scale.



### Check the versions of Python and some key packages to ensure recent versions are used

In [194]:
%load_ext watermark
%watermark -a 'Vusal Babashov' -u -d -v -p numpy,mlxtend,matplotlib,sklearn

The watermark extension is already loaded. To reload it, use:
  %reload_ext watermark
Author: Vusal Babashov

Last updated: 2021-03-17

Python implementation: CPython
Python version       : 3.8.2
IPython version      : 7.21.0

numpy     : 1.19.2
mlxtend   : 0.18.0
matplotlib: 3.3.4
sklearn   : 0.24.1



In [195]:
#conda install flake8 

In [196]:
#conda update scikit-learn numpy matplotlib 

## Part 2 - DISCOVER
***

### Import Libraries

In [229]:
import pandas as pd
import math 
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

import warnings
warnings.filterwarnings('ignore')

from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, PolynomialFeatures
from sklearn.feature_selection import SelectKBest, f_regression, mutual_info_regression, RFECV, RFE
from sklearn.pipeline import Pipeline


from sklearn.model_selection import GridSearchCV, train_test_split, KFold, cross_validate
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score, explained_variance_score 

from sklearn.linear_model import LinearRegression, Lasso, Ridge
from sklearn.experimental import enable_hist_gradient_boosting
from sklearn.ensemble import RandomForestRegressor, HistGradientBoostingRegressor
from xgboost import XGBRegressor
from lightgbm import LGBMRegressor

from feature_engine import encoding as enc #RareLabelEncoder, OrdinalEncoder
from feature_engine.selection import SelectByTargetMeanPerformance, DropConstantFeatures, DropFeatures 
from feature_engine.imputation import MeanMedianImputer
from feature_engine.creation import CombineWithReferenceFeature, MathematicalCombination

#your info here
__author__ = "Vusal Babashov"
__email__ = "vbabashov@gmail.com"
__website__ = 'https://vbabashov.github.io'

### Load the Data

In [198]:
train_file = "data/train.csv"
test_feature_file = "data/test.csv"

def load_file(file):
    '''loads csv to pd dataframe'''
    return pd.read_csv(file)

data_train_raw = load_file(train_file)
feature_pred_raw  = load_file(test_feature_file)

To understand the data better, the full [EDA](https://github.com/vbabashov/house-prices/blob/main/price_prediction_EDA.ipynb) is implemented in separate notebooks to keep this one small and readable.
***

### PreProcessing Steps

In [199]:
data_train_raw.head()

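| | Id | MSSubClass | MSZoning | LotFrontage | LotArea | Street | Alley | LotShape | LandContour | Utilities | ... | PoolArea | PoolQC | Fence | MiscFeature | MiscVal | MoSold | YrSold | SaleType | SaleCondition | SalePrice |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 60 | RL | 65.0 | 8450 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 2 | 2008 | WD | Normal | 208500 |
| 1 | 2 | 20 | RL | 80.0 | 9600 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 5 | 2007 | WD | Normal | 181500 |
| 2 | 3 | 60 | RL | 68.0 | 11250 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 9 | 2008 | WD | Normal | 223500 |
| 3 | 4 | 70 | RL | 60.0 | 9550 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 2 | 2006 | WD | Abnorml | 140000 |
| 4 | 5 | 60 | RL | 84.0 | 14260 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 12 | 2008 | WD | Normal | 250000 |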


In [200]:
feature_pred_raw.head()

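| | Id | MSSubClass | MSZoning | LotFrontage | LotArea | Street | Alley | LotShape | LandContour | Utilities | ... | ScreenPorch | PoolArea | PoolQC | Fence | MiscFeature | MiscVal | MoSold | YrSold | SaleType | SaleCondition |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1461 | 20 | RH | 80.0 | 11622 | Pave | NaN | Reg | Lvl | AllPub | ... | 120 | 0 | NaN | MnPrv | NaN | 0 | 6 | 2010 | WD | Normal |
| 1 | 1462 | 20 | RL | 81.0 | 14267 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | 0 | NaN | NaN | Gar2 | 12500 | 6 | 2010 | WD | Normal |
| 2 | 1463 | 60 | RL | 74.0 | 13830 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | 0 | NaN | MnPrv | NaN | 0 | 3 | 2010 | WD | Normal |
| 3 | 1464 | 60 | RL | 78.0 | 9978 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | 0 | NaN | NaN | NaN | 0 | 6 | 2010 | WD | Normal |
| 4 | 1465 | 120 | RL | 43.0 | 5005 | Pave | NaN | IR1 | HLS | AllPub | ... | 144 | 0 | NaN | NaN | NaN | 0 | 1 | 2010 | WD | Normal |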


In [201]:
data_train_raw.shape # check the dimensions of the dataframe

(1460, 81)

In [202]:
feature_pred_raw.shape # target variable SalePrice is to be predicted

(1459, 80)

In [203]:
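def log_transform (df):
    '''Returns a copy of df with SalePrice replaced by its natural logarithm'''
    df = df.copy()  # work on a copy so re-running the cell does not log-transform the raw frame twice
    df['SalePrice'] = np.log(df['SalePrice'])
    return df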

In [204]:
# Log Transformation of the target
train_transformed_df = log_transform(data_train_raw)
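Because the target is now on the log scale, errors have to be mapped back with `np.exp` before they can be compared with the MAE objective in dollars. The following is a minimal sketch of that round trip; the three prices come from the rows shown earlier, while the log-scale errors are made up purely for illustration.

```python
import numpy as np

# Sale prices in dollars and hypothetical predictions on the log scale.
prices = np.array([208500.0, 181500.0, 223500.0])
log_prices = np.log(prices)
log_preds = log_prices + np.array([0.05, -0.02, 0.01])  # made-up log-scale errors

# MAE on the log scale is small and has no dollar interpretation.
mae_log = np.mean(np.abs(log_prices - log_preds))

# Mapping predictions back with np.exp gives an MAE in dollars.
mae_dollars = np.mean(np.abs(prices - np.exp(log_preds)))

print(f"MAE (log scale): {mae_log:.4f}")
print(f"MAE (dollars):   {mae_dollars:.2f}")
```

This is why the evaluation cells later in the notebook wrap both the targets and the predictions in `np.exp` before calling `mean_absolute_error`.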

From the EDA, we know that some columns have many missing values. Below, I identify and drop the columns in which 80% or more of the values are missing (coded as NaN); the 80% threshold is a somewhat arbitrary choice.

In [205]:
def drop_missing_cols_df (df):
    '''Identifies and drops the columns with an 80% or higher proportion of missing data'''
    dropped_cols = []  
    for col in df.columns:
        if df[col].isnull().sum()/df.shape[0] >= 0.8:
            dropped_cols.append(col)
    dropped_df=df.drop(columns=dropped_cols)
    return dropped_df, dropped_cols  

In [206]:
train_clean_df, missing_cols = drop_missing_cols_df(train_transformed_df) # determine and drop the missing columns

In [207]:
missing_cols # These four columns are dropped

['Alley', 'PoolQC', 'Fence', 'MiscFeature']
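The same 80% rule can also be written without an explicit loop, since `isnull().mean()` gives the share of missing values per column. This is only a sketch on a made-up toy frame (the column names are borrowed from the dataset, the values are illustrative), not a replacement for the function above.

```python
import numpy as np
import pandas as pd

# Toy frame: 'Alley' is 80% missing, 'LotFrontage' 20%, 'LotArea' 0%.
toy = pd.DataFrame({
    "Alley": [np.nan, np.nan, np.nan, np.nan, "Grvl"],
    "LotFrontage": [65.0, np.nan, 68.0, 60.0, 84.0],
    "LotArea": [8450, 9600, 11250, 9550, 14260],
})

# Share of missing values per column, then the columns at or above the threshold.
missing_share = toy.isnull().mean()
cols_to_drop = missing_share[missing_share >= 0.8].index.tolist()
toy_clean = toy.drop(columns=cols_to_drop)

print(cols_to_drop)
print(toy_clean.columns.tolist())
```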

In [208]:
pred_clean_df = feature_pred_raw.drop(columns=missing_cols) # Let's drop the same columns from the test set

The [data dictionary file](http://jse.amstat.org/v19n3/decock/DataDocumentation.txt) specifies that the columns have different data types. Below is the breakdown of the variables into nominal, ordinal and numerical types.

In [209]:
nominal = ['MSSubClass', 'MSZoning', 'Street', 'LandContour', 'LotConfig', 
                   'Neighborhood', 'Condition1', 'Condition2', 'BldgType', 'HouseStyle',
                   'RoofStyle', 'RoofMatl', 'Exterior1st', 'Exterior2nd', 'MasVnrType', 
                   'Foundation', 'Heating', 'CentralAir', 'GarageType', 'MoSold',
                   'SaleType', 'SaleCondition'] # removed Alley, MiscFeature, 

ordinal = ['LotShape', 'Utilities', 'LandSlope', 'OverallQual', 'OverallCond', 
                   'ExterQual', 'ExterCond', 'BsmtQual', 'BsmtCond', 'BsmtExposure', 
                   'BsmtFinType1', 'BsmtFinType2', 'HeatingQC', 'Electrical', 'KitchenQual', 
                   'Functional', 'FireplaceQu', 'GarageFinish', 'GarageQual', 'GarageCond',
                   'PavedDrive'] #removed PoolQC, Fence,


numeric = ['Id','LotFrontage','LotArea','YearBuilt','YearRemodAdd','MasVnrArea','BsmtFinSF1',
                  'BsmtFinSF2', 'BsmtUnfSF','TotalBsmtSF','1stFlrSF','2ndFlrSF','LowQualFinSF','GrLivArea',
                  'BsmtFullBath','BsmtHalfBath','FullBath','HalfBath','BedroomAbvGr','KitchenAbvGr', 'TotRmsAbvGrd',
                  'Fireplaces','GarageCars','GarageArea','WoodDeckSF','OpenPorchSF','EnclosedPorch',
                  '3SsnPorch','ScreenPorch','PoolArea','MiscVal', 'GarageYrBlt', 'YrSold'] # removed the SalePrice

categorical = nominal+ordinal

In [210]:
def impute_missing_values (df, categorical_features, numeric_features):
    '''Imputes the continuous columns with the median and the categorical columns with the mode, both computed from the frame passed in'''
    imputer_con = SimpleImputer(missing_values=np.nan, strategy='median')
    imputer_cat = SimpleImputer(missing_values=np.nan, strategy='most_frequent')
    for col in categorical_features+numeric_features:
        if df[col].isnull().sum() > 0:    
            if col in categorical_features:              
                df[col] = imputer_cat.fit_transform(df[col].values.reshape(-1,1))
            elif col in numeric_features:  
                df[col] = imputer_con.fit_transform(df[col].values.reshape(-1,1))
    return df  

In [211]:
train_imputed_df = impute_missing_values (train_clean_df, categorical, numeric)  #impute the categorical variables with the most frequent, or mode, and the numeric variables with the median
pred_imputed_df  = impute_missing_values (pred_clean_df, categorical, numeric) 
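Note that the function above computes the fill values separately for each frame it receives. A common alternative is to learn the statistics on the training frame only and reuse them for the prediction frame. The sketch below shows that pattern with `SimpleImputer` on made-up toy frames; it is an illustration of the idea rather than the approach used in this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy frames with made-up values.
train_toy = pd.DataFrame({"LotFrontage": [60.0, 70.0, np.nan, 80.0]})
pred_toy = pd.DataFrame({"LotFrontage": [np.nan, 100.0]})

# Learn the median from the training frame only ...
imputer = SimpleImputer(strategy="median")
imputer.fit(train_toy[["LotFrontage"]])

# ... and apply that same median to both frames.
train_filled = imputer.transform(train_toy[["LotFrontage"]]).ravel()
pred_filled = imputer.transform(pred_toy[["LotFrontage"]]).ravel()

print(imputer.statistics_)  # median learned from the training frame
print(pred_filled)
```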

Below are the ordered values for each ordinal variable as per the data dictionary.

In [212]:
# Ordinal Category Values
lot_shape = ['IR3','IR2','IR1','Reg']
utilities = ['ELO', 'NoSeWa', 'NoSewr','AllPub']
land_slope = ['Sev','Mod','Gtl']
overall_qual = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]  # already in the ordinal structure
overall_cond = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]  # already in the ordinal structure
exter_qual = ['Po', 'Fa', 'TA', 'Gd', 'Ex']
exter_cond = ['Po', 'Fa', 'TA', 'Gd', 'Ex']
bsmt_qual  = ['NA', 'Po', 'Fa', 'TA', 'Gd', 'Ex']
bsmt_cond  = ['NA', 'Po', 'Fa', 'TA', 'Gd', 'Ex']
bsmt_exposure  = ['NA', 'No', 'Mn', 'Av', 'Gd']
bsmt_fin_type1 = ['NA', 'Unf', 'LwQ', 'Rec', 'BLQ', 'ALQ', 'GLQ']
bsmt_fin_type2 = ['NA', 'Unf', 'LwQ', 'Rec', 'BLQ', 'ALQ', 'GLQ']
heating_qual = ['Po', 'Fa', 'TA', 'Gd', 'Ex']
electrical = ['Mix', 'FuseP', 'FuseF', 'FuseA', 'SBrkr']
kitchen_qual = ['Po', 'Fa', 'TA', 'Gd', 'Ex']
functional = ['Sal', 'Sev', 'Maj2', 'Maj1', 'Mod', 'Min2', 'Min1', 'Typ']
fire_place_qual = ['NA', 'Po', 'Fa', 'TA', 'Gd', 'Ex']
garage_finish = ['NA', 'Unf', 'RFn', 'Fin']
garage_qual = ['NA', 'Po', 'Fa', 'TA', 'Gd', 'Ex']
garage_cond = ['NA', 'Po', 'Fa', 'TA', 'Gd', 'Ex']
paved_drive = ['N', 'P', 'Y']

ordinal_categories_list = [lot_shape , utilities, land_slope, overall_qual, overall_cond, exter_qual, exter_cond, bsmt_qual, 
                          bsmt_cond, bsmt_exposure, bsmt_fin_type1, bsmt_fin_type2, heating_qual, electrical, kitchen_qual,
                          functional, fire_place_qual, garage_finish, garage_qual, garage_cond, paved_drive]  

In [213]:
def ordinal_encoding (df, nominal_cols, ordinal_cols, ordinal_categories_list, numeric_cols):
    '''This function encodes the ordinal variables as integers in the given category order and combines them with the rest of the dataframe'''
    ore = OrdinalEncoder(categories=ordinal_categories_list)
    Z=ore.fit_transform(df[ordinal_cols])
    list_of_frames=[df[nominal_cols].reset_index(drop=True), 
                    pd.DataFrame(Z,columns=ordinal_cols).reset_index(drop=True), 
                    df[numeric_cols].reset_index(drop=True)]
                   # df['SalePrice'].reset_index(drop=True)]
    return pd.concat(list_of_frames, axis=1)

In [214]:
train_enc_df  = ordinal_encoding (train_imputed_df, nominal, ordinal, ordinal_categories_list, numeric+['SalePrice']) # encode the ordinal variables in the specified order
pred_enc_df   = ordinal_encoding (pred_imputed_df, nominal, ordinal, ordinal_categories_list, numeric)
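To make the effect of the `categories` argument concrete, here is a small sketch with two of the ordinal variables on made-up rows. Each value is replaced by its position in the corresponding ordered list, so `'Po'` maps to 0 and `'Ex'` to 4.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy rows with made-up values for two ordinal columns.
toy = pd.DataFrame({
    "ExterQual": ["TA", "Gd", "Ex", "Fa"],
    "PavedDrive": ["Y", "N", "P", "Y"],
})
quality_order = ["Po", "Fa", "TA", "Gd", "Ex"]  # worst to best
drive_order = ["N", "P", "Y"]

# One ordered list per column, in the same order as the columns.
encoder = OrdinalEncoder(categories=[quality_order, drive_order])
codes = encoder.fit_transform(toy[["ExterQual", "PavedDrive"]])

print(codes)
```

With explicit category lists and the default settings, a value that is not in the list raises an error at fit time, which is a useful check that the lists match the data.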

In [215]:
def convert_month_string (df):
    '''This function maps the numerical month names into string month names'''
    d = { 1 : 'Jan',
          2 : 'Feb',
          3 : 'Mar',
          4 : 'Apr',
          5 : 'May',
          6 : 'June',
          7 : 'July',
          8 : 'Aug',
          9 : 'Sep',
          10: 'Oct',
          11: 'Nov',
          12: 'Dec'
    }
    df['MoSold'] = df ['MoSold'].map(d)
    return df

In [216]:
train_converted_df = convert_month_string(train_enc_df)
pred_converted_df  = convert_month_string(pred_enc_df)

In [217]:
def convert_types (df):
    '''This function converts, in place, the nominal variables into object type and the ordinal and numeric variables into int type'''
    df[nominal] = df[nominal].astype('O')
    df[ordinal] = df[ordinal].astype('int')
    df[numeric] = df[numeric].astype('int')

In [218]:
convert_types (train_converted_df)
convert_types (pred_converted_df)

In [219]:
X_pred = pred_converted_df.drop(columns=['Id'])

X_train, X_test, y_train, y_test = train_test_split(train_converted_df.drop(['Id', 'SalePrice'], axis=1),train_converted_df['SalePrice'],
                                                    test_size=0.2,
                                                    random_state=0)

X_train.shape, X_test.shape, X_pred.shape

((1168, 75), (292, 75), (1459, 75))

### Establish a baseline 

I used the results of an Ordinary Least Squares (OLS) regression model as a [baseline](https://github.com/vbabashov/house-prices/blob/main/baseline.ipynb) and obtained the following errors.

- MAE for the Baseline Model: 24139.18
- RMSE for the Baseline Model: 149478.70

### Hypothesize solution 

The baseline MAE is around 24,000 with the engineered features. Let's try to build a predictive model with a lower error. Many supervised learning methods could be used here. In this notebook, I'll explore the following tree-based techniques because of their recent successful applications in many domains.

- Random Forest
- XGBoost
- LightGBM

## Part 3 - DEVELOP

In this part of the process, I'll look into creating features, tuning models, and training/validating models:
- model selection (i.e, hyperparameter tuning)
- algorithm selection
- model evaluation with the selected algorithm

### Feature Engineering

In [220]:
def encode_rare_label (df_train, df_test, df_pred):
    '''Groups rare nominal categories under the 'Rare' label: a category is rare if it covers less than 5% of the training rows (tol=0.05); variables with too few distinct categories (n_categories=4) are left unchanged'''
    rare_enc = enc.RareLabelEncoder(tol = 0.05, n_categories=4, variables=nominal)
    rare_enc.fit(df_train)
    df_train = rare_enc.transform(df_train)
    df_test  = rare_enc.transform(df_test)
    df_pred  = rare_enc.transform(df_pred)
    return df_train, df_test, df_pred
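The idea behind rare-label grouping can be sketched in plain pandas. The helper below, `group_rare_labels`, is a hypothetical name introduced only for this illustration; it approximates the frequency rule and deliberately leaves out the `n_categories` condition, so it is not a drop-in replacement for `RareLabelEncoder`. The category counts are made up.

```python
import pandas as pd

def group_rare_labels(train_col, other_col, tol=0.05, label="Rare"):
    """Replace categories whose share in train_col is below tol with one label.

    The frequent categories are learned from train_col only and then
    applied to other_col as well.
    """
    freq = train_col.value_counts(normalize=True)
    frequent = freq[freq >= tol].index
    return (train_col.where(train_col.isin(frequent), label),
            other_col.where(other_col.isin(frequent), label))

# 100 made-up training rows: two neighborhoods fall below the 5% threshold.
train_col = pd.Series(["NAmes"] * 50 + ["CollgCr"] * 30 + ["OldTown"] * 17
                      + ["Blueste"] * 2 + ["NPkVill"] * 1)
test_col = pd.Series(["NAmes", "Blueste", "Veenker"])

train_out, test_out = group_rare_labels(train_col, test_col)
print(sorted(train_out.unique()))
print(test_out.tolist())
```

In this sketch a category that never appeared in the training column (here `'Veenker'`) also ends up under `'Rare'`, because only the frequent training categories are kept.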

In [221]:
def encode_order_price(df_train, target_train, df_test, df_pred):
    '''This function does the ordinal encoding according to mean prices on nominal variables'''
    encoder = enc.OrdinalEncoder (encoding_method='ordered', variables = nominal)
    encoder.fit(df_train, target_train)
    df_train = encoder.transform(df_train)
    df_test = encoder.transform(df_test)
    df_pred = encoder.transform(df_pred)  # encode the prediction set with the same mapping
    return df_train, df_test, df_pred
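The `'ordered'` method numbers the categories in ascending order of the mean target per category, learned on the training data. A plain-pandas sketch of that idea on made-up values follows; the frame and the column name `log_price` are illustrative only.

```python
import pandas as pd

# Toy training data with made-up log prices.
train = pd.DataFrame({
    "Neighborhood": ["A", "A", "B", "B", "C", "C"],
    "log_price": [12.0, 12.2, 11.5, 11.7, 12.6, 12.8],
})

# Rank categories by mean target on the training data: lowest mean gets 0.
means = train.groupby("Neighborhood")["log_price"].mean().sort_values()
mapping = {cat: rank for rank, cat in enumerate(means.index)}

# The mapping learned on the training data is reused for any other frame.
train_encoded = train["Neighborhood"].map(mapping)
test_encoded = pd.Series(["C", "B", "A"]).map(mapping)

print(mapping)
print(test_encoded.tolist())
```

Because the mapping depends on the target, it has to be learned from the training split only, which is why the function above calls `fit` with `df_train` and `target_train` and only `transform` on the other frames.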

In [222]:
def engineer_features (df_train, df_test, df_pred):  
    ''' This function  engineers several features e.g., time, bath count, and total area of the house'''  
    # Years passed since 
    combinator = CombineWithReferenceFeature(
        variables_to_combine=['YrSold'],
        reference_variables=['YearBuilt', 'YearRemodAdd', 'GarageYrBlt'],
        operations = ['sub']
    )  
    # drop the features
    drop = DropFeatures(
        features_to_drop=['YearBuilt','YrSold','YearRemodAdd', 'GarageYrBlt']
    )
    # total number of bathrooms
    bath = MathematicalCombination(
        variables_to_combine=['BsmtHalfBath', 'BsmtFullBath', 'FullBath', 'HalfBath'],
        math_operations=['sum'],
        new_variables_names=['TotalBath'],
    )
    # total area of house 1stfloor+2ndfloor+Totalbasement
    area = MathematicalCombination(
        variables_to_combine=['1stFlrSF', '2ndFlrSF', 'TotalBsmtSF'],
        math_operations=['sum'],
        new_variables_names=['TotalArea'],
    )
    
    combinator.fit(df_train)
    df_train = combinator.transform(df_train)
    df_test = combinator.transform(df_test)
    df_pred = combinator.transform(df_pred)
    
    drop.fit(df_train)
    df_train = drop.transform (df_train)
    df_test  = drop.transform (df_test)
    df_pred  = drop.transform(df_pred)
    
    df_train = bath.fit_transform (df_train)
    df_test  = bath.transform (df_test)
    df_pred  = bath.transform(df_pred)

    df_train = area.fit_transform (df_train)
    df_test  = area.transform (df_test)
    df_pred  = area.transform(df_pred)
    
    return df_train, df_test, df_pred

In [223]:
X_train, X_test, X_pred = encode_rare_label  (X_train, X_test, X_pred)
X_train, X_test, X_pred = encode_order_price (X_train, y_train, X_test, X_pred)
X_train, X_test, X_pred = engineer_features  (X_train, X_test, X_pred)

In [224]:
lgbm = LGBMRegressor(random_state=1)

In [241]:
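# Fit on the training split only. The metrics printed below were recorded from a run that
# also called lgbm.fit(X_test, y_test) afterwards, so the test figures shown there are optimistic.
lgbm.fit(X_train, y_train)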

In [242]:
train_pred = lgbm.predict(X_train)
test_pred = lgbm.predict(X_test)

print('Train RMSE: %.2f'%  mean_squared_error(y_train,train_pred, squared=False))
print(' Test RMSE: %.2f'%   mean_squared_error(y_test, test_pred, squared=False))
print()
print('Train R2: %.2f'%  r2_score(y_train, train_pred))
print(' Test R2: %.2f'%  r2_score(y_test, test_pred))

print ('\n Train MAE: %.2f'%   mean_absolute_error(np.exp(y_train), np.exp(train_pred)))
print ('  Test MAE: %.2f'%     mean_absolute_error(np.exp(y_test), np.exp(test_pred)))

print ('\n Train RMSE: %.2f'%   mean_squared_error(np.exp(y_train), np.exp(train_pred), squared = False))
print ('  Test RMSE: %.2f'%     mean_squared_error(np.exp(y_test),  np.exp(test_pred), squared = False))

Train RMSE: 0.16
 Test RMSE: 0.05

Train R2: 0.85
 Test R2: 0.98

 Train MAE: 18771.69
  Test MAE: 6544.92

 Train RMSE: 30663.40
  Test RMSE: 18730.37


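The test scores above are better than the training scores (R2 of 0.98 vs. 0.85), which is not a sign of underfitting. In the recorded run the model was last fitted on `(X_test, y_test)`, so it was scored on the data it had just been trained on; the test figures are therefore optimistic and not a valid hold-out estimate.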

### Create Models

In [243]:
reg1 = RandomForestRegressor(random_state=1)
reg2 = XGBRegressor(random_state=1)
reg3 = LGBMRegressor(random_state=1)

In [259]:
#Build Pipelines
pipe1 = Pipeline(steps=[ #('fea', SelectKBest(score_func=f_regression, k = 65)),
                         ('pol', PolynomialFeatures()), 
                         ('reg1',reg1)])
                        
pipe2 = Pipeline(steps=[#('fea', SelectKBest(score_func=f_regression, k = 65)),
                        ('pol', PolynomialFeatures()), 
                        ('reg2',reg2)])

pipe3 = Pipeline(steps=[#('fea', SelectKBest(score_func=f_regression, k = 65)),
                        ('pol', PolynomialFeatures()), 
                        ('reg3',reg3)])
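With its default settings (`degree=2`, `include_bias=True`), `PolynomialFeatures` turns `n` input columns into `1 + n + n(n+1)/2` columns, so the feature matrix grows quickly for wide inputs. A small sketch with three made-up input columns:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.arange(6, dtype=float).reshape(2, 3)  # 2 rows, 3 features (made-up values)

poly = PolynomialFeatures()  # defaults: degree=2, include_bias=True
X_poly = poly.fit_transform(X)

# Bias column + original columns + squares and pairwise products.
n = X.shape[1]
expected_cols = 1 + n + n * (n + 1) // 2

print(X_poly.shape, expected_cols)
```

For tree-based regressors this expansion mainly adds interaction columns to split on, at the cost of longer training times.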


Model Accuracy Results (without parameter tuning)

In [260]:
pipe1.fit(X_train, y_train)
y_pred = pipe1.predict(X_test)
print ('\n Mean Absolute Error (MAE) for the Random Forest: %.2f'%  mean_absolute_error(np.exp(y_test), np.exp(y_pred)))
print(' Test R2: %.2f'%  r2_score(y_test, y_pred))


 Mean Absolute Error (MAE) for the Random Forest: 17555.75
 Test R2: 0.87


In [261]:
pipe2.fit(X_train, y_train)
y_pred = pipe2.predict(X_test)
print ('\n Mean Absolute Error (MAE) for the Xgboost: %.2f'%  mean_absolute_error(np.exp(y_test), np.exp(y_pred)))
print(' Test R2: %.2f'%  r2_score(y_test, y_pred))


 Mean Absolute Error (MAE) for the Xgboost: 19173.90
 Test R2: 0.86


In [262]:
pipe3.fit(X_train, y_train)
y_pred = pipe3.predict(X_test)
print ('\n Mean Absolute Error (MAE) for the LightGBM: %.2f'%  mean_absolute_error(np.exp(y_test), np.exp(y_pred)))
print(' Test R2: %.2f'%  r2_score(y_test, y_pred))


 Mean Absolute Error (MAE) for the LightGBM: 16894.79
 Test R2: 0.87


### Compare the Models, with Different Parameters and Algorithms, Using Nested Cross-Validation (i.e., 5x2 CV)

In [None]:
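# Build the pipelines
# k=200 is larger than the number of input columns, and SelectKBest requires
# k <= number of incoming features, so the polynomial expansion runs first.
pipe1 = Pipeline(steps=[('pol', PolynomialFeatures()),
                        ('fs',  SelectKBest(score_func=f_regression, k = 200)),
                        ('reg1',reg1)])

pipe2 = Pipeline(steps=[('pol', PolynomialFeatures()),
                        ('fs',  SelectKBest(score_func=f_regression, k = 200)),
                        ('reg2',reg2)])

pipe3 = Pipeline(steps=[('pol', PolynomialFeatures()),
                        ('fs',  SelectKBest(score_func=f_regression, k = 200)),
                        ('reg3',reg3)])

# param_grid4 and the comparison loop below refer to reg4/pipe4, which were never defined.
# This Lasso pipeline is an assumption that mirrors the other three.
reg4 = Lasso(random_state=1)
pipe4 = Pipeline(steps=[('pol', PolynomialFeatures()),
                        ('fs',  SelectKBest(score_func=f_regression, k = 200)),
                        ('reg4',reg4)])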

In [None]:
# Setting up the parameter grids for hyperparameter tuning, i.e, Model Selection
param_grid1 = {'reg1__n_estimators': [500,1000],
           #    'reg1__fs' : [100,150,200]
               }

param_grid2 = {
          'reg2__colsample_bytree':[0.6, 1], 
          'reg2__eta': [0.01, 0.1],
          'reg2__max_depth': [8,10],
          'reg2__min_child_weight':[6,9], 
          'reg2__subsample' :[0.6, 0.8],
        #  'reg2__fs' : [100,150,200]
        }

param_grid3 = {

    "reg3__num_leaves": [6, 8, 20, 30],
    "reg3__max_depth": [2, 4, 6, 8, 10],
    "reg3__n_estimators": [50, 100, 200, 500],
    "reg3__colsample_bytree": [0.3, 1.0],

   }
 
param_grid4 = {'reg4__alpha':[0.001, 0.01, 0.1],
              # 'reg4__fs' : [100,150,200]
              }

In [None]:
# Setting up multiple GridSearchCV objects for model selection and algorithm comparison
gridcvs = {}

inner_cv = KFold(n_splits=2, shuffle=True, random_state=1)

for pgrid, est, name in zip((param_grid1, param_grid2, param_grid3, param_grid4),
                            (pipe1, pipe2, pipe3, pipe4),
                            ('RForest', 'Xgboost', 'LightGBM' ,'Lasso')):
      
    gcv = GridSearchCV(estimator=est,
                       param_grid=pgrid,
                       scoring = 'neg_mean_absolute_error',
                       #n_jobs=-1,
                       cv=inner_cv,
                       verbose=0,
                       refit=True)
    gridcvs[name] = gcv

In [None]:
outer_cv = KFold(n_splits=5, shuffle=True, random_state=1)

for name, gs_est in sorted(gridcvs.items()):
    scores_dict = cross_validate(gs_est, 
                                 X=X_train, 
                                 y=y_train,
                                 verbose=0,
                                 cv=outer_cv,
                                 return_estimator=True,
                                 #n_jobs=-1
                                )

    print(50 * '-', '\n')
    print('Algorithm:', name)
    print('    Inner loop:')
    
    
    for i in range(scores_dict['test_score'].shape[0]):

        print('\n      Best MAE Score (avg. of inner test folds) %.2f' % np.absolute(scores_dict['estimator'][i].best_score_))
        print('        Best parameters:', scores_dict['estimator'][i].best_params_)
        print('        MAE Score (on outer test fold) %.2f' % np.absolute(scores_dict['test_score'][i]))

    print('\n%s | outer test folds Ave. Score %.2f +/- %.2f' % 
          (name, np.absolute(scores_dict['test_score']).mean(), 
           np.absolute(scores_dict['test_score']).std()))
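The nested scheme used above can be illustrated on a small synthetic problem: the inner 2-fold loop picks hyperparameters, and the outer 5-fold loop estimates how well the whole "tune, then fit" procedure generalizes. In this sketch `Ridge` and `make_regression` are stand-ins chosen only to keep the example fast; they are not part of this notebook's models or data.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, KFold, cross_validate

# Synthetic stand-in data.
X, y = make_regression(n_samples=120, n_features=8, noise=10.0, random_state=0)

inner_cv = KFold(n_splits=2, shuffle=True, random_state=1)  # model selection
outer_cv = KFold(n_splits=5, shuffle=True, random_state=1)  # performance estimate

# Inner loop: grid search over alpha, refit on each outer training fold.
search = GridSearchCV(Ridge(),
                      param_grid={"alpha": [0.1, 1.0, 10.0]},
                      scoring="neg_mean_absolute_error",
                      cv=inner_cv)

# Outer loop: each outer test fold is never seen by the inner search.
scores = cross_validate(search, X, y,
                        cv=outer_cv,
                        scoring="neg_mean_absolute_error",
                        return_estimator=True)

outer_mae = -scores["test_score"]
chosen_alphas = [est.best_params_["alpha"] for est in scores["estimator"]]

print("Outer-fold MAE: %.2f +/- %.2f" % (outer_mae.mean(), outer_mae.std()))
print("Alpha chosen in each outer fold:", chosen_alphas)
```

If the chosen hyperparameters vary a lot between outer folds, that is a hint that the tuning is unstable for the given data size.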

### Perform Model Selection (i.e., Hyperparameter Tuning) Once the Algorithm Selection is Made

In [None]:
#2-fold cross validation on the best model, fit on the entire dataset
gcv_model_select = GridSearchCV(estimator=pipe3,
                                param_grid=param_grid3,
                                scoring='neg_mean_absolute_error',
                                n_jobs=-1,
                                cv = 2,
                                verbose=0,
                                refit=True)

gcv_model_select.fit(X_train, y_train)

### Select best model and Test the Performance

In [None]:
#select the model with the lowest error as your "production" model
best_model = gcv_model_select.best_estimator_

train_ = mean_absolute_error(y_true=np.exp(y_train), y_pred=np.exp(best_model.predict(X_train)))
test_  = mean_absolute_error(y_true=np.exp(y_test),  y_pred=np.exp(best_model.predict(X_test)))

# best_score_ is computed on the log-transformed target, unlike train_ and test_ (original price scale)
print('MAE score on the log scale: %.2f (average over k-fold CV test folds)' %
      np.absolute(gcv_model_select.best_score_))
print('Best Parameters: %s' % gcv_model_select.best_params_)

print('\nTraining MAE for Best Model: %.2f' % (train_))
print('Test MAE for Best Model: %.2f' % (test_))

## Part 4 - DEPLOY

### Automate pipeline 

In [None]:
#write script that trains model on entire training set, saves model to disk,
#and scores the "test" dataset
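One possible shape for the "save model to disk" step is sketched below with `joblib`. Everything here is an assumption for illustration: the random forest trained on random numbers stands in for the selected model, and the file name and temporary directory are hypothetical.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Stand-in data and model (made up); in practice this would be the selected
# pipeline trained on the entire training set.
rng = np.random.RandomState(0)
X_all = rng.rand(50, 4)
y_all = X_all @ np.array([1.0, 2.0, 3.0, 4.0])
model = RandomForestRegressor(n_estimators=20, random_state=1).fit(X_all, y_all)

# Persist the fitted model to a hypothetical location.
model_path = os.path.join(tempfile.mkdtemp(), "model.joblib")
joblib.dump(model, model_path)

# Reload it later (e.g., in a scoring script) and check the predictions match.
restored = joblib.load(model_path)
same_predictions = np.allclose(model.predict(X_all), restored.predict(X_all))

print(model_path, same_predictions)
```

A saved model should be loaded with the same library versions it was trained with, which is one reason to record them as done at the top of this notebook.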

### Deploy solution

In [None]:
#save your prediction to a csv file or optionally save them as a table in a SQL database
#additionally, you want to save a visualization and summary of your prediction and feature importances
#these visualizations and summaries will be extremely useful to business stakeholders

### Measure efficacy

We'll skip this step since we don't have the outcomes for the test data