# Predict Home Price - Fine-Tune Base Model 

### *Introduction*  
The goal of this project is to make predictions about the future sale prices of homes. The predictions are evaluated by the **Mean Absolute Error** between the predicted log error and the actual log error. The log error (the target variable) is defined as ***logerror = log(Zestimate) − log(SalePrice)*** and is recorded in the training data. 
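
As a small illustration of the definitions above, the sketch below computes the log error and the Mean Absolute Error for three homes. The prices are made up, the variable names are hypothetical, and the natural logarithm is assumed.

```python
import numpy as np

# Hypothetical Zestimates and sale prices for three homes (made-up numbers).
zestimate = np.array([500_000.0, 320_000.0, 750_000.0])
sale_price = np.array([480_000.0, 335_000.0, 750_000.0])

# Target variable as defined above: logerror = log(Zestimate) - log(SalePrice).
# Natural log is an assumption here.
logerror = np.log(zestimate) - np.log(sale_price)

# A model predicts the log error; here a constant guess of 0 for every home.
predicted_logerror = np.zeros_like(logerror)

# Mean Absolute Error between the predicted and the actual log error.
mae = np.mean(np.abs(predicted_logerror - logerror))
print(logerror, mae)
```

A positive log error means the Zestimate was above the sale price, a negative one means it was below.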

In the previous notebook, 2016 data on real estate properties in three California counties (Los Angeles, Orange, and Ventura) was analyzed and cleaned. The dataset contains both numerical and categorical features, the correlation coefficients between the features and the target variable are relatively small, and a large portion of the data is missing. Base models were built on the cleaned dataset and compared: the Ridge, RandomForestRegressor, and GradientBoostingRegressor models showed similar cross-validation scores. 

### *About This Notebook*
This notebook focuses on **base model fine-tuning**, i.e., using GridSearchCV to tune the model hyperparameters.  
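
For readers new to GridSearchCV, here is a minimal sketch of the mechanism on synthetic data. The data, the `*_demo` names, and the candidate alphas are assumptions for illustration only, not the values used in this notebook.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Synthetic regression data (an assumption for illustration only).
rng = np.random.RandomState(8)
X_demo = rng.normal(size=(200, 5))
true_coef = np.array([1.0, -2.0, 0.5, 0.0, 0.0])
y_demo = X_demo @ true_coef + rng.normal(scale=0.1, size=200)

# Fit a Ridge model for each candidate alpha with 5-fold cross validation,
# scoring every fold by negative mean absolute error.
demo_grid = GridSearchCV(Ridge(), param_grid={'alpha': [0.1, 1, 10, 100]},
                         cv=5, scoring='neg_mean_absolute_error')
demo_grid.fit(X_demo, y_demo)

print(demo_grid.best_params_)   # candidate with the highest (least negative) score
print(-demo_grid.best_score_)   # the corresponding mean absolute error
```

Because scikit-learn maximizes scores, the error is negated; flipping the sign recovers the MAE.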
  
Cross-validation scores of the optimized models (negative mean absolute error, so values closer to zero are better):
- Ridge: mean -0.053128 (std 0.000603)
- KNN: mean -0.053374 (std 0.000624)
- RF: mean -0.053026 (std 0.000639)
- GB: mean -0.052970 (std 0.000647)

***Next Step:***  
The next step is to stack these fine-tuned base models and to evaluate the performance of the stacked models.

## 0 | Packages and Configuration

In [14]:
import pandas as pd
import numpy as np
import os
import seaborn as sns
sns.set_style('white')
import matplotlib.pyplot as plt
%matplotlib inline

import pickle
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import KFold
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import LabelEncoder
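try:
    # scikit-learn >= 0.20: Imputer was replaced by SimpleImputer
    from sklearn.impute import SimpleImputer as Imputer
except ImportError:
    # older scikit-learn releases, as used when this notebook was first run
    from sklearn.preprocessing import Imputer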
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import LabelBinarizer
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.pipeline import FeatureUnion
from sklearn.linear_model import Ridge
from sklearn.neighbors import KNeighborsRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import GradientBoostingRegressor

In [2]:
# initiate random seed
SEED = 8

In [3]:
# hide warnings
import warnings
warnings.filterwarnings('ignore')

## 1 | Get Data

In [4]:
DATA_PATH = '../data'

def load_data(path, file_name):
    """load csv data and return dataframe"""
    csv_path = os.path.join(path, file_name)
    return pd.read_csv(csv_path)

In [5]:
# load train_merge.csv - created in Notebook1
train_merge = load_data(DATA_PATH, file_name='train_merge.csv')
# drop `transactiondate` and `parcelid` in train
train_merge.drop(['transactiondate', 'parcelid'], axis=1, inplace=True)

In [6]:
# load prop_downsized.csv - created in Notebook1
prop = load_data(DATA_PATH, file_name='prop_downsized.csv')
# drop `parcelid` in prop
prop.drop(['parcelid'], axis=1, inplace=True)

In [7]:
# set aside a test set
train_set, test_set = train_test_split(train_merge, test_size=0.2, random_state=SEED)
print('Training set size: {}\nTest set size: {}'.format(train_set.shape, test_set.shape))

Training set size: (72220, 58)
Test set size: (18055, 58)


## 2 | Model Fine-Tune

### 2.1 Transformer Class and Pipelines

In [8]:
class DataFrameSelector(BaseEstimator, TransformerMixin):
    """select desired features and drop the rest"""
    def __init__(self, features):
        self.features = features
    
    def fit(self, X, y=None):
        return self
    
    def transform(self, X, y=None):
        return X[self.features]

In [9]:
class FeatureAdder(BaseEstimator, TransformerMixin):
    """add new features including average size of rooms, ratio between living area and lot size,
    ratio between property tax and total tax, and ratio between structure value and land value"""
    def __init__(self, add_new_feature=True):
        self.add_new_feature = add_new_feature
        
    def fit(self, X, y=None):
        return self
    
    def transform(self, X, y=None):
        # define new features
        N_AvgSize = X['calculatedfinishedsquarefeet']/(X['bedroomcnt'] + X['bathroomcnt'] + 1)
        N_PropLot = X['calculatedfinishedsquarefeet']/X['lotsizesquarefeet']
        N_ValueRatio = X['taxamount']/X['taxvaluedollarcnt']
        N_StructLand = X['structuretaxvaluedollarcnt']/X['landtaxvaluedollarcnt']
        # add new features if True
        if self.add_new_feature:
            return np.c_[X, N_AvgSize, N_PropLot, N_ValueRatio, N_StructLand]
        else:
            return X

In [10]:
class FeatureDropper(BaseEstimator, TransformerMixin):
    """drop features whose fraction of missing values is at least missing_pct"""
    def __init__(self, missing_pct=1, drop_cols=None):
        self.missing_pct = missing_pct # missing value fraction threshold
        self.drop_cols = drop_cols # columns that are always dropped (None = no extra columns)
        
    def fit(self, X, y=None):
        # decide which columns to drop from the data seen at fit time only,
        # without mutating the list passed to the constructor
        always_drop = set(self.drop_cols or [])
        self.drop_cols_ = [col for col in X.columns
                           if col in always_drop
                           or pd.isnull(X[col]).mean() >= self.missing_pct]
        return self
    
    def transform(self, X, y=None):
        return X.drop(self.drop_cols_, axis=1)

In [11]:
class CatTransformer(BaseEstimator, TransformerMixin):
    """categorical feature transformer: impute categorical value and encode categories"""
    def __init__(self, cat_dict):
        self.cat_dict = cat_dict
        
    def fit(self, X, y=None):
        return self
    
    def transform(self, X, y=None):
        X = X.copy() # work on a copy so the caller's dataframe is not modified
        for col in X.columns:
            # fill missing values with a sentinel that matches the column type
            fill_value = '-99' if X[col].dtype == 'O' else -99
            # encode with the fixed levels in cat_dict so every subset gets the same categories
            X[col] = pd.Categorical(X[col].fillna(fill_value),
                                    categories=self.cat_dict[col])
        return X

In [12]:
class DummyEncoder(BaseEstimator, TransformerMixin):
    """create dummy variables"""
    def __init__(self, columns=None):
        self.columns = columns
        
    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        return pd.get_dummies(X, columns=self.columns)

In [13]:
# numerical features
num_features = ['basementsqft', 'bathroomcnt', 'bedroomcnt', 'calculatedbathnbr', \
             'threequarterbathnbr', 'finishedfloor1squarefeet', 'calculatedfinishedsquarefeet',\
             'finishedsquarefeet6', 'finishedsquarefeet12', 'finishedsquarefeet13', \
             'finishedsquarefeet15', 'finishedsquarefeet50', 'fireplacecnt', 'fullbathcnt', \
             'garagecarcnt', 'garagetotalsqft', 'latitude', 'longitude', 'lotsizesquarefeet', \
             'numberofstories', 'poolcnt', 'poolsizesum', 'roomcnt', \
             'unitcnt', 'yardbuildingsqft17', 'yardbuildingsqft26', 'yearbuilt', \
             'taxvaluedollarcnt', 'structuretaxvaluedollarcnt', 'landtaxvaluedollarcnt', \
             'taxamount', 'assessmentyear']

# categorical features
cat_features = ['airconditioningtypeid', 'decktypeid', 'architecturalstyletypeid', \
               'buildingclasstypeid', 'heatingorsystemtypeid', 'fips', 'fireplaceflag', \
               'hashottuborspa', 'pooltypeid10', 'pooltypeid2', 'propertylandusetypeid', \
               'propertyzoningdesc', 'regionidcounty', 'taxdelinquencyflag', 'propertycountylandusecode', \
                'rawcensustractandblock', 'censustractandblock', 'regionidcity', 'regionidzip', \
                'regionidneighborhood', 'storytypeid', 'pooltypeid7', 'typeconstructiontypeid', 'taxdelinquencyyear']

# potential features to drop (categorical variables with large number of levels) 
drop_features = ['propertycountylandusecode', 'rawcensustractandblock', 'censustractandblock', \
                 'regionidcity', 'regionidzip', 'regionidneighborhood', 'propertyzoningdesc']

In [18]:
# levels of categorical variables
categories = {}
for col in cat_features:
    if prop[col].dtype == 'O':
        prop[col].fillna('-99', inplace=True)
    prop[col].fillna(-99, inplace=True)
    categories[col] = prop[col].astype('category').cat.categories

In [19]:
# save the categorical variables for future use
with open('../data/categories.pickle', 'wb') as handle:
    pickle.dump(categories, handle, protocol=pickle.HIGHEST_PROTOCOL)

In [15]:
with open('../data/categories.pickle', 'rb') as handle:
    categories = pickle.load(handle)
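
The reason for storing the category levels can be sketched on toy data: when the levels are fixed up front, every subset of the data produces the same dummy columns, even if some levels never occur in it. The levels and values below are made up, and all names are hypothetical.

```python
import pandas as pd

# Category levels as they might be learned from the full property table (toy values).
levels = pd.Index([-99.0, 1.0, 2.0, 3.0])

# Two hypothetical subsets that each contain only some of the levels.
train_col = pd.Series([1.0, 2.0, None]).fillna(-99)
test_col = pd.Series([3.0, 3.0, 1.0])

# Encoding with the same fixed categories gives identical dummy columns.
train_dummies = pd.get_dummies(pd.Series(pd.Categorical(train_col, categories=levels)))
test_dummies = pd.get_dummies(pd.Series(pd.Categorical(test_col, categories=levels)))
print(list(train_dummies.columns) == list(test_dummies.columns))
```

Without the fixed levels, `pd.get_dummies` would create columns only for the values present in each subset, and the feature matrices would not line up.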

In [16]:
# pipeline for numerical features
num_pipe = Pipeline([
        ('selector', DataFrameSelector(num_features)),
        ('feature_dropper', FeatureDropper(missing_pct=0.95)),
        ('feature_adder', FeatureAdder(add_new_feature=True)),
        ('imputer', Imputer(strategy='median')),
        ('scaler', StandardScaler())
    ])

In [17]:
# pipeline for categorical features
cat_pipe = Pipeline([
        ('selector', DataFrameSelector(cat_features)), 
        ('feature_dropper', FeatureDropper(missing_pct=0.95, drop_cols=drop_features)),
        ('cat_transform', CatTransformer(categories)),
        ('get_dummy', DummyEncoder())
    ])

In [18]:
# full pipeline combining pipelines for numerical and categorical features
full_pipe = FeatureUnion([
        ('num_pipeline', num_pipe),
        ('cat_pipeline', cat_pipe)
    ])
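
To show what the combined pipeline produces, here is a toy sketch of the same idea: one branch imputes and scales a numerical column, the other one-hot encodes a categorical column, and `FeatureUnion` stacks the two results side by side. The data and the `toy_*` names are hypothetical, `FunctionTransformer` stands in for the custom selector classes, and `SimpleImputer` assumes a recent scikit-learn.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler

# Toy data: one numerical column with a missing value, one categorical column.
toy = pd.DataFrame({'sqft': [1000.0, 1500.0, np.nan, 2000.0],
                    'fips': ['6037', '6059', '6037', '6111']})

def select_num(df):
    return df[['sqft']]

def select_cat(df):
    return df[['fips']]

def to_dummies(df):
    return pd.get_dummies(df).to_numpy(dtype=float)

# Numerical branch: select, impute with the median, standardize.
toy_num = Pipeline([
    ('select', FunctionTransformer(select_num, validate=False)),
    ('impute', SimpleImputer(strategy='median')),
    ('scale', StandardScaler()),
])
# Categorical branch: select, then create dummy variables.
toy_cat = Pipeline([
    ('select', FunctionTransformer(select_cat, validate=False)),
    ('dummies', FunctionTransformer(to_dummies, validate=False)),
])

# Concatenate the outputs of both branches column-wise.
toy_union = FeatureUnion([('num', toy_num), ('cat', toy_cat)])
out = toy_union.fit_transform(toy)
print(out.shape)
```

The first output column is the scaled numerical feature and the remaining columns are the dummy variables, one per category level.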

### 2.2 Grid Search

In [19]:
train_wo_outlier = train_set[(train_set.logerror > -0.4) & (train_set.logerror < 0.42)]
train_w_outlier = train_set.copy()
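
The filter above keeps only rows whose `logerror` lies strictly between -0.4 and 0.42. A toy sketch with made-up values shows the effect. Note that the labels used for the grid searches below also come from the trimmed set, so the cross-validation errors exclude those outliers.

```python
import pandas as pd

# Made-up log errors, two of them far from zero.
toy = pd.DataFrame({'logerror': [-2.3, -0.05, 0.01, 0.2, 1.5]})

# Same boolean mask as in the cell above.
inliers = toy[(toy.logerror > -0.4) & (toy.logerror < 0.42)]
kept_fraction = len(inliers) / len(toy)
print(inliers.logerror.tolist(), kept_fraction)
```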

In [20]:
labels = train_wo_outlier['logerror'].values
features = train_wo_outlier.drop(['logerror'], axis=1)

In [21]:
# transform features with full_pipe
features_transformed = full_pipe.fit_transform(features)

In [27]:
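# naive predictor - constant (median) predictor
class NaivePredictor(BaseEstimator):
    def fit(self, X, y):
        # nothing to learn; return self so the estimator follows the scikit-learn API
        return self
    
    def predict(self, X):
        # predict the same constant for every row
        return np.full((len(X), 1), 0.107)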
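
A constant baseline like the one above can also be sketched with scikit-learn's built-in `DummyRegressor`, which learns the median from the training folds instead of hard-coding it. This is an alternative shown on synthetic data, not the approach used in this notebook. The data and names are hypothetical.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.model_selection import cross_val_score

# Synthetic features and targets (assumptions for illustration only).
rng = np.random.RandomState(8)
X_toy = rng.normal(size=(100, 3))
y_toy = rng.normal(loc=0.01, scale=0.05, size=100)

# Always predict the median of the training targets, ignoring the features.
baseline = DummyRegressor(strategy='median')
baseline_scores = cross_val_score(baseline, X_toy, y_toy, cv=5,
                                  scoring='neg_mean_absolute_error')
print(-baseline_scores.mean())
```

Any tuned model should beat this score to justify its added complexity.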

#### Ridge

In [45]:
# grid search for ridge
ridge = Ridge(random_state=SEED)
param_ridge = {'alpha': [0, 0.1, 1, 5, 10, 100, 1000, 10000, 100000]}

grid_ridge = GridSearchCV(ridge, param_grid=param_ridge, cv=5,
                         scoring='neg_mean_absolute_error')
grid_ridge.fit(features_transformed, labels)

GridSearchCV(cv=5, error_score='raise',
       estimator=Ridge(alpha=1.0, copy_X=True, fit_intercept=True, max_iter=None,
   normalize=False, random_state=8, solver='auto', tol=0.001),
       fit_params={}, iid=True, n_jobs=1,
       param_grid={'alpha': [0, 0.1, 1, 5, 10, 100, 1000, 10000, 100000]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
       scoring='neg_mean_absolute_error', verbose=0)

In [47]:
cv_ridge = grid_ridge.cv_results_
for mean_score, params in zip(cv_ridge['mean_test_score'], 
                              cv_ridge['params']):
    print(-mean_score, params)

73184877.9651 {'alpha': 0}
0.0532292958466 {'alpha': 0.1}
0.0532266417536 {'alpha': 1}
0.0532236846688 {'alpha': 5}
0.0532228450081 {'alpha': 10}
0.0532063875374 {'alpha': 100}
0.0531570333955 {'alpha': 1000}
0.0531468038533 {'alpha': 10000}
0.0531980478795 {'alpha': 100000}


In [48]:
# fine tune
param_ridge2 = {'alpha': [8000, 9000, 10000, 11000, 12000, 13000]}

grid_ridge2 = GridSearchCV(ridge, param_grid=param_ridge2, cv=5,
                         scoring='neg_mean_absolute_error')
grid_ridge2.fit(features_transformed, labels)

cv_ridge2 = grid_ridge2.cv_results_
for mean_score, params in zip(cv_ridge2['mean_test_score'], 
                              cv_ridge2['params']):
    print(-mean_score, params)

0.0531427412412 {'alpha': 8000}
0.0531448051647 {'alpha': 9000}
0.0531468038533 {'alpha': 10000}
0.0531487035324 {'alpha': 11000}
0.053150534403 {'alpha': 12000}
0.0531523167158 {'alpha': 13000}


In [50]:
# fine tune
param_ridge3 = {'alpha': [2000, 3000, 4000, 5000, 6000, 7000, 8000]}

grid_ridge3 = GridSearchCV(ridge, param_grid=param_ridge3, cv=5,
                         scoring='neg_mean_absolute_error')
grid_ridge3.fit(features_transformed, labels)

cv_ridge3 = grid_ridge3.cv_results_
for mean_score, params in zip(cv_ridge3['mean_test_score'], 
                              cv_ridge3['params']):
    print(-mean_score, params)

0.0531423855228 {'alpha': 2000}
0.0531371518153 {'alpha': 3000}
0.0531364293486 {'alpha': 4000}
0.0531372551883 {'alpha': 5000}
0.0531386986655 {'alpha': 6000}
0.05314061851 {'alpha': 7000}
0.0531427412412 {'alpha': 8000}


In [51]:
# best alpha: 4000
print(grid_ridge3.best_estimator_)

Ridge(alpha=4000, copy_X=True, fit_intercept=True, max_iter=None,
   normalize=False, random_state=8, solver='auto', tol=0.001)
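
The Ridge search above follows a coarse-to-fine pattern: a wide, roughly logarithmic grid first, then narrower grids around the best value. The sketch below reproduces that pattern on synthetic data. The helper `search_alpha`, the data, and the grid values are hypothetical and chosen only for illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

def search_alpha(X, y, alphas, cv=5):
    """Return (best_alpha, mae) from a grid search over `alphas`."""
    grid = GridSearchCV(Ridge(), {'alpha': list(alphas)}, cv=cv,
                        scoring='neg_mean_absolute_error')
    grid.fit(X, y)
    return grid.best_params_['alpha'], -grid.best_score_

# Synthetic data with one weak signal feature (an assumption for illustration).
rng = np.random.RandomState(8)
X_demo = rng.normal(size=(300, 20))
y_demo = 0.1 * X_demo[:, 0] + rng.normal(size=300)

# Stage 1: coarse grid spanning several orders of magnitude.
coarse_alpha, coarse_mae = search_alpha(X_demo, y_demo, [0.1, 1, 10, 100, 1000])

# Stage 2: linear grid around the coarse winner; the winner itself stays in the grid.
fine_grid = np.unique(np.append(np.linspace(coarse_alpha / 10, coarse_alpha * 10, 9),
                                coarse_alpha))
fine_alpha, fine_mae = search_alpha(X_demo, y_demo, fine_grid)
print(coarse_alpha, coarse_mae, fine_alpha, fine_mae)
```

Keeping the coarse winner in the fine grid means the second stage cannot end up with a worse cross-validation error than the first, since both stages use the same folds.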


#### KNN

In [33]:
# grid search for KNN
knn = KNeighborsRegressor(n_neighbors=5)
param_knn = {'n_neighbors': [5, 10, 20], 'p': [1, 2]}

grid_knn = GridSearchCV(knn, param_grid=param_knn, cv=3, n_jobs=2,
                         scoring='neg_mean_absolute_error')
grid_knn.fit(features_transformed, labels)

cv_knn = grid_knn.cv_results_
for mean_score, params in zip(cv_knn['mean_test_score'], 
                              cv_knn['params']):
    print(-mean_score, params)

0.0604260722443 {'n_neighbors': 5, 'p': 1}
0.0604371977931 {'n_neighbors': 5, 'p': 2}
0.0568401130065 {'n_neighbors': 10, 'p': 1}
0.0569258934352 {'n_neighbors': 10, 'p': 2}
0.0549210901831 {'n_neighbors': 20, 'p': 1}
0.0549880686881 {'n_neighbors': 20, 'p': 2}


In [36]:
# fine tune
knn = KNeighborsRegressor(n_neighbors=20, p=1)
param_knn2 = {'n_neighbors': [20, 30, 40, 50]}

grid_knn2 = GridSearchCV(knn, param_grid=param_knn2, cv=3, n_jobs=2,
                         scoring='neg_mean_absolute_error')
grid_knn2.fit(features_transformed, labels)

cv_knn2 = grid_knn2.cv_results_
for mean_score, params in zip(cv_knn2['mean_test_score'], 
                              cv_knn2['params']):
    print(-mean_score, params)

0.0549210901831 {'n_neighbors': 20}
0.0541970097562 {'n_neighbors': 30}
0.0538585365139 {'n_neighbors': 40}
0.05366127579 {'n_neighbors': 50}


In [37]:
# fine tune
knn = KNeighborsRegressor(n_neighbors=50, p=1)
param_knn3 = {'n_neighbors': [50, 60, 70, 80]}

grid_knn3 = GridSearchCV(knn, param_grid=param_knn3, cv=3, n_jobs=2,
                         scoring='neg_mean_absolute_error')
grid_knn3.fit(features_transformed, labels)

cv_knn3 = grid_knn3.cv_results_
for mean_score, params in zip(cv_knn3['mean_test_score'], 
                              cv_knn3['params']):
    print(-mean_score, params)

0.05366127579 {'n_neighbors': 50}
0.0535133545367 {'n_neighbors': 60}
0.0534084607925 {'n_neighbors': 70}
0.0533529750979 {'n_neighbors': 80}


In [38]:
# model
print(grid_knn3.best_estimator_)

KNeighborsRegressor(algorithm='auto', leaf_size=30, metric='minkowski',
          metric_params=None, n_jobs=1, n_neighbors=80, p=1,
          weights='uniform')
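
The `p` values searched above are the power parameter of the Minkowski metric that `KNeighborsRegressor` uses by default: `p=1` is the Manhattan distance and `p=2` is the Euclidean distance. A minimal sketch with a hypothetical helper function:

```python
import numpy as np

def minkowski(a, b, p):
    """Minkowski distance of order p between two vectors."""
    return float(np.sum(np.abs(a - b) ** p) ** (1.0 / p))

a = np.array([0.0, 0.0])
b = np.array([3.0, 4.0])

manhattan = minkowski(a, b, p=1)   # sum of absolute coordinate differences
euclidean = minkowski(a, b, p=2)   # straight-line distance
print(manhattan, euclidean)
```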


#### RandomForest

In [25]:
# grid search for RF
rf = RandomForestRegressor(random_state=SEED)
param_rf = {'n_estimators': [10, 50, 100], 'max_depth': [3, 5, 10]}

grid_rf = GridSearchCV(rf, param_grid=param_rf, cv=5,
                         scoring='neg_mean_absolute_error')
grid_rf.fit(features_transformed, labels)

cv_rf = grid_rf.cv_results_
for mean_score, params in zip(cv_rf['mean_test_score'], 
                              cv_rf['params']):
    print(-mean_score, params)

0.0531147282249 {'max_depth': 3, 'n_estimators': 10}
0.0530987405142 {'max_depth': 3, 'n_estimators': 50}
0.053093382021 {'max_depth': 3, 'n_estimators': 100}
0.0530525955749 {'max_depth': 5, 'n_estimators': 10}
0.0530105185664 {'max_depth': 5, 'n_estimators': 50}
0.0530089600657 {'max_depth': 5, 'n_estimators': 100}
0.0532443393581 {'max_depth': 10, 'n_estimators': 10}
0.0530095235023 {'max_depth': 10, 'n_estimators': 50}
0.0529812476742 {'max_depth': 10, 'n_estimators': 100}


In [26]:
# fine tune
rf = RandomForestRegressor(random_state=SEED, max_depth=5)
param_rf2 = {'n_estimators': [100, 150, 200]}

grid_rf2 = GridSearchCV(rf, param_grid=param_rf2, cv=5,
                         scoring='neg_mean_absolute_error')
grid_rf2.fit(features_transformed, labels)

cv_rf2 = grid_rf2.cv_results_
for mean_score, params in zip(cv_rf2['mean_test_score'], 
                              cv_rf2['params']):
    print(-mean_score, params)

0.0530089600657 {'n_estimators': 100}
0.0530076035873 {'n_estimators': 150}
0.0530074511852 {'n_estimators': 200}


In [27]:
# fine tune
rf = RandomForestRegressor(random_state=SEED, max_depth=5)
param_rf3 = {'n_estimators': [200, 250, 300, 350]}

grid_rf3 = GridSearchCV(rf, param_grid=param_rf3, cv=5, n_jobs=2,
                         scoring='neg_mean_absolute_error')
grid_rf3.fit(features_transformed, labels)

cv_rf3 = grid_rf3.cv_results_
for mean_score, params in zip(cv_rf3['mean_test_score'], 
                              cv_rf3['params']):
    print(-mean_score, params)

0.0530074511852 {'n_estimators': 200}
0.0530045805246 {'n_estimators': 250}
0.05300359793 {'n_estimators': 300}
0.0530031392825 {'n_estimators': 350}


In [31]:
# fine tune
rf = RandomForestRegressor(random_state=SEED, max_depth=5)
param_rf4 = {'n_estimators': [350, 400, 500]}

grid_rf4 = GridSearchCV(rf, param_grid=param_rf4, cv=5, n_jobs=2,
                         scoring='neg_mean_absolute_error')
grid_rf4.fit(features_transformed, labels)

cv_rf4 = grid_rf4.cv_results_
for mean_score, params in zip(cv_rf4['mean_test_score'], 
                              cv_rf4['params']):
    print(-mean_score, params)

0.0530031392825 {'n_estimators': 350}
0.0530033507674 {'n_estimators': 400}
0.0530024647723 {'n_estimators': 500}


In [32]:
# model
print(grid_rf4.best_estimator_)

RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=5,
           max_features='auto', max_leaf_nodes=None,
           min_impurity_split=1e-07, min_samples_leaf=1,
           min_samples_split=2, min_weight_fraction_leaf=0.0,
           n_estimators=500, n_jobs=1, oob_score=False, random_state=8,
           verbose=0, warm_start=False)


#### GradientBoosting

In [39]:
# grid search for GB
gb = GradientBoostingRegressor(random_state=SEED)
param_gb = {'n_estimators': [100, 150, 200], 'max_depth': [3, 5, 10]}

grid_gb = GridSearchCV(gb, param_grid=param_gb, cv=5, n_jobs=2,
                         scoring='neg_mean_absolute_error')
grid_gb.fit(features_transformed, labels)

cv_gb = grid_gb.cv_results_
for mean_score, params in zip(cv_gb['mean_test_score'], 
                              cv_gb['params']):
    print(-mean_score, params)

0.0529321639709 {'max_depth': 3, 'n_estimators': 100}
0.0529623922466 {'max_depth': 3, 'n_estimators': 150}
0.0529861676242 {'max_depth': 3, 'n_estimators': 200}
0.0530355975463 {'max_depth': 5, 'n_estimators': 100}
0.0531170844951 {'max_depth': 5, 'n_estimators': 150}
0.0532187494185 {'max_depth': 5, 'n_estimators': 200}
0.0538949579796 {'max_depth': 10, 'n_estimators': 100}
0.0542690280416 {'max_depth': 10, 'n_estimators': 150}
0.0546415439229 {'max_depth': 10, 'n_estimators': 200}


In [40]:
# fine tune
gb = GradientBoostingRegressor(random_state=SEED, max_depth=3)
param_gb2 = {'n_estimators': [70, 80, 90, 100]}

grid_gb2 = GridSearchCV(gb, param_grid=param_gb2, cv=5, n_jobs=2,
                         scoring='neg_mean_absolute_error')
grid_gb2.fit(features_transformed, labels)

cv_gb2 = grid_gb2.cv_results_
for mean_score, params in zip(cv_gb2['mean_test_score'], 
                              cv_gb2['params']):
    print(-mean_score, params)

0.0529322186146 {'n_estimators': 70}
0.0529304416679 {'n_estimators': 80}
0.0529312332548 {'n_estimators': 90}
0.0529321639709 {'n_estimators': 100}


In [41]:
# model
print(grid_gb2.best_estimator_)

GradientBoostingRegressor(alpha=0.9, criterion='friedman_mse', init=None,
             learning_rate=0.1, loss='ls', max_depth=3, max_features=None,
             max_leaf_nodes=None, min_impurity_split=1e-07,
             min_samples_leaf=1, min_samples_split=2,
             min_weight_fraction_leaf=0.0, n_estimators=80, presort='auto',
             random_state=8, subsample=1.0, verbose=0, warm_start=False)


#### CV - optimized model

In [43]:
# models to test
models = []
models.append(('Ridge', Ridge(random_state=SEED, alpha=4000)))
models.append(('KNN', KNeighborsRegressor(n_neighbors=80, p=1)))
models.append(('RF', RandomForestRegressor(random_state=SEED, max_depth=5, n_estimators=500)))
models.append(('GB', GradientBoostingRegressor(random_state=SEED, max_depth=3, n_estimators=80)))

In [44]:
# evaluate each model
results = []
names = []
kfold = KFold(n_splits=5, shuffle=True, random_state=SEED)

for name, model in models:
    cv_results = cross_val_score(model, features_transformed, labels, cv=kfold, 
                                 scoring='neg_mean_absolute_error')
    results.append(cv_results)
    names.append(name)
    print("%s: score mean %f (score std %f)" % (name, cv_results.mean(), cv_results.std()))

Ridge: score mean -0.053128 (score std 0.000603)
KNN: score mean -0.053374 (score std 0.000624)
RF: score mean -0.053026 (score std 0.000639)
GB: score mean -0.052970 (score std 0.000647)
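
To put these numbers in perspective, the sketch below ranks the models by MAE and compares the gap between the best and worst model with the fold-to-fold standard deviation. The scores are copied from the output above. The comparison is a rough heuristic, not a formal significance test.

```python
# Scores copied from the output above: (mean negative MAE, std across the 5 folds).
reported = {
    'Ridge': (-0.053128, 0.000603),
    'KNN': (-0.053374, 0.000624),
    'RF': (-0.053026, 0.000639),
    'GB': (-0.052970, 0.000647),
}

# Flip the sign to get the MAE, then rank from best (lowest MAE) to worst.
mae = {name: -mean for name, (mean, _) in reported.items()}
ranking = sorted(mae, key=mae.get)

# Gap between the best and worst model versus the smallest fold-to-fold std.
spread = max(mae.values()) - min(mae.values())
smallest_std = min(std for _, std in reported.values())
print(ranking, round(spread, 6), spread < smallest_std)
```

The gap between the four models is smaller than the variation across folds, which suggests that none of them is clearly ahead on its own.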


**THE END** 