# Model Prediction

### Stacked Model: XGB + LGBM => Elastic Net Meta Regressor

The best-performing model in terms of fit was a stacked ensemble, with an R2 score of approximately 0.9125. It is worth restating what the learning curves for each model showed: although the R2 scores are quite high, model tuning did not change the overall convergence of either algorithm (XGB, LGBM). This indicates that more data is needed both to improve the fit and to let the training and validation errors converge to a better value.

This section benchmarks the stacked ensemble's prediction time and presents the final error metrics for the testing phase.
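As a minimal sketch of the kind of timing measurement used in this section, the helper below averages the wall-clock time of repeated predict calls and reports it per batch and per row. The names `benchmark_predict` and `toy_predict` are hypothetical and do not appear in the notebook; `toy_predict` merely stands in for a fitted model's `predict` method. The sketch uses `time.perf_counter`, which returns seconds, so the result is multiplied by 1,000,000 to get microseconds.

```python
import time

def benchmark_predict(predict_fn, rows, num_runs=100):
    """Return (mean microseconds per batch call, mean microseconds per row)."""
    total_seconds = 0.0
    for _ in range(num_runs):
        start = time.perf_counter()
        predict_fn(rows)
        total_seconds += time.perf_counter() - start
    per_call_us = total_seconds / num_runs * 1_000_000  # seconds -> microseconds
    per_row_us = per_call_us / len(rows)
    return per_call_us, per_row_us

# Hypothetical stand-in for model.predict; any callable taking a batch works.
def toy_predict(rows):
    return [sum(row) for row in rows]

rows = [[1.0, 2.0, 3.0]] * 1000
per_call_us, per_row_us = benchmark_predict(toy_predict, rows, num_runs=20)
print("Avg batch predict time (μs):", per_call_us)
print("Avg single-row predict time (μs):", per_row_us)
```

Averaging over many runs smooths out scheduler noise, which matters when a single call takes only a few microseconds.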

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
# RandomizedLasso is omitted: it is unused here and was removed from recent scikit-learn releases
from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet
from sklearn.feature_selection import RFE
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import AdaBoostRegressor
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split, cross_val_score
import warnings
import os
import timeit
import time
import datetime
warnings.filterwarnings('ignore')
SEED=42
CV_FOLDS=10

kc_data = pd.read_csv('./data/ci_i5_ap.csv')



In [2]:
kc_data = kc_data.drop(['id', 'date'], axis=1)
kc_data.head()
# kc_data['price'] = np.log(kc_data['price'])
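# NOTE: test_size=0 leaves no held-out rows (recent scikit-learn versions reject it);
# set a positive fraction before running the test-phase cells below.
kc_train, kc_test, train_Y, test_Y = train_test_split(
    kc_data, kc_data['price'],
    test_size=0,
    random_state=SEED
)
kc_train = kc_train.drop('price', axis=1)
kc_test = kc_test.drop('price', axis=1)
# Fit the scaler on the training split only
scaler = StandardScaler()
kc_train = scaler.fit_transform(kc_train)
# Apply the training-set scaling to the test split as well (skipped when the split is empty)
if len(kc_test) > 0:
    kc_test = scaler.transform(kc_test)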

In [3]:
from mlxtend.regressor import StackingRegressor
import xgboost as xgb
from sklearn.tree import DecisionTreeRegressor
from sklearn import svm

In [4]:
gbm = xgb.XGBRegressor(
    min_child_weight=9, 
    max_depth=1, 
    objective='reg:linear',
    learning_rate=0.01, 
    seed=SEED, 
    n_estimators=2750, 
    verbose=True
)

svr = svm.SVR(degree=1, C=0.03)

dtr = DecisionTreeRegressor(
    max_depth=12, 
    min_samples_leaf=11,
    min_samples_split=2,
    random_state=SEED
)

In [5]:
from mlxtend.regressor import StackingRegressor
lin_reg = LinearRegression()
meta_reg_elastic = ElasticNet(
#     alpha=0.0095, 
#     l1_ratio=0, 
#     tol=0.00001, 
    random_state=SEED
)
stacked_reg = StackingRegressor(regressors=[dtr, svr, gbm], meta_regressor = meta_reg_elastic)

In [6]:
# Construct DataFrames to save results (a flat column list; names match the result dicts built below)
result_columns = ['name', 'avg_pred', 'single_pred', 'rmse', 'mae', 'log_error']
gbm_df = pd.DataFrame(columns=result_columns)
lgbm_df = pd.DataFrame(columns=result_columns)

### Model Fitting and Prediction

In [7]:

cv = cross_val_score(stacked_reg, kc_train, train_Y, cv=10, scoring='neg_median_absolute_error')
print(cv.mean())

-0.34797528024686236


In [None]:
stacked_reg.fit(kc_train, train_Y)

In [None]:
num_runs = 100  # number of timed repetitions; same value the LGBM benchmark uses
ensemble_time = 0

for i in range(num_runs):
    start = time.time()
    stacked_pred = stacked_reg.predict(kc_test)
    end = time.time()
    ensemble_time += end-start


In [None]:
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error
import math

unlog_Y = np.exp(test_Y)
unlog_pred = np.exp(stacked_pred)
log_err = np.sum(np.log(unlog_Y) - np.log(unlog_pred))

ensemble_dict ={
    'name': 'ensemble',
    'avg_pred':(ensemble_time/num_runs * 100000), 
    'single_pred':(ensemble_time/num_runs) / len(test_Y) * 100000, 
    'rmse': math.sqrt(mean_squared_error(unlog_Y, unlog_pred)),
    'mae': mean_absolute_error(unlog_Y, unlog_pred),
    'log_error': log_err
}
ensemble_dict

In [None]:
print("Ensemble Avg Predict Time (μs): ", ensemble_dict['avg_pred'])
print("Ensemble Single Predict Time (μs): ", ensemble_dict['single_pred'])
print("Ensemble RMSE: ", ensemble_dict['rmse'])
print("Ensemble MAE: ", ensemble_dict['mae'])
print("Ensemble Log-Error: ", ensemble_dict['log_error'])

In [None]:
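# GBMRegressor is assumed to come from the pyLightGBM wrapper, which drives the LightGBM CLI binary
from pylightgbm.models import GBMRegressor

home = os.path.expanduser("~")
exec_path = os.path.join(home, "LightGBM/lightgbm")  # renamed to avoid shadowing the built-in exec

lgbm_alone = GBMRegressor(exec_path=exec_path, 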
                    boosting_type='gbdt',
                    feature_fraction_seed=SEED,
                    bagging_seed=SEED,
                    tree_learner='serial',
                    metric='r2', #l2
                    verbose=False,
                    
                    num_leaves=35,
                    num_iterations=3800,#350 faster1
                    learning_rate=0.01,#0.1 faster1
                    max_bin=500, #255 faster1

                    min_data_in_leaf=5,
                    feature_fraction=1,

                    bagging_fraction=1,
                    bagging_freq=10,

                    metric_freq=1,
                    early_stopping_round=19
                   )

In [None]:
lgbm_alone.fit(kc_train, train_Y)
i = 0
lgbm_time = 0
num_runs = 100
for i in range(num_runs):
    start = time.time()
    lgbm_pred = lgbm_alone.predict(kc_test)
    end = time.time()
    lgbm_time += end-start


In [None]:
unlog_lgbm_pred = np.exp(lgbm_pred)

lgbm_dict ={
    'name': 'lgbm',
    'avg_pred':(lgbm_time/num_runs * 100000), 
    'single_pred':(lgbm_time/num_runs) / len(test_Y) * 100000, 
    'rmse': math.sqrt(mean_squared_error(unlog_Y, unlog_lgbm_pred)),
    'mae': mean_absolute_error(unlog_Y, unlog_lgbm_pred),
    'log_error': np.sum(np.log(unlog_Y) - np.log(unlog_lgbm_pred))
}

print("LGBM Avg Predict Time (μs): ", lgbm_dict['avg_pred'])
print("LGBM Single Predict Time (μs): ", lgbm_dict['single_pred'])
print("LGBM RMSE: ", lgbm_dict['rmse'])
print("LGBM MAE: ", lgbm_dict['mae'])
print("LGBM Log-Error: ", lgbm_dict['log_error'])

In [None]:
gbm_alone = xgb.XGBRegressor(
    min_child_weight=5, 
    max_depth=3, 
    objective='reg:linear',
    gamma=0,
    reg_alpha=0.6,
    reg_lambda=1,
    learning_rate=0.1, 
    colsample_bytree=1.0, 
    seed=SEED, 
    n_estimators=2375, 
    subsample=1,
    verbose=True
)

In [None]:
gbm_alone.fit(kc_train, train_Y)

In [None]:
i = 0
gbm_time = 0

for i in range(num_runs):
    start = time.time()
    gbm_pred = gbm_alone.predict(kc_test)
    end = time.time()
    gbm_time += end-start
    


In [None]:
# Use the XGBoost predictions here (the original line reused lgbm_pred by mistake)
unlog_gbm_pred = np.exp(gbm_pred)
gbm_dict ={
    'name': 'GBM',
    'avg_pred':(gbm_time/num_runs * 100000), 
    'single_pred':(gbm_time/num_runs) / len(test_Y) * 100000, 
    'rmse': math.sqrt(mean_squared_error(unlog_Y, unlog_gbm_pred)),
    'mae': mean_absolute_error(unlog_Y, unlog_gbm_pred),
    'log_error': np.sum(np.log(unlog_Y) - np.log(unlog_gbm_pred))
}

print("GBM Avg Predict Time(μs): ", gbm_dict['avg_pred'])
print("GBM Single Predict Time (μs): ", gbm_dict['single_pred'])
print("GBM RMSE: ", gbm_dict['rmse'])
print("GBM MAE: ", gbm_dict['mae'])
print("GBM Log-Error: ", gbm_dict['log_error'])

# Analysis of Performance Results

In [None]:
# Collect the three result dicts into one DataFrame (DataFrame.append was removed in pandas 2.0)
df = pd.DataFrame([ensemble_dict, lgbm_dict, gbm_dict])

In [None]:
df

In [None]:
sns.barplot(x='name', y='avg_pred', data=df)

In [None]:
sns.barplot(x='name', y='single_pred', data=df)

In [None]:
sns.barplot(x='name', y='mae', data=df)

In [None]:
sns.barplot(x='name', y='rmse', data=df)

In [None]:
sns.barplot(x='name', y='log_error', data=df)

The RMSE and MAE scores of the three models are broadly comparable, with the ensemble's RMSE slightly lower (by approximately 1,000 dollars). There is a tradeoff to consider: the ensemble takes the most time to predict (approximately the sum of the LGBM and GBM prediction times), but it performs better on the error metrics overall, most notably cutting the summed log error from 16 down to 8.5.
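To make the three error metrics concrete, here is a small standard-library sketch on made-up numbers (the values below are illustrative only, not results from this notebook). The helper names are hypothetical. `signed_log_error` mirrors the summed `log(y) - log(pred)` quantity computed in the cells above; `mean_abs_log_error` is an additional variant, not used in the notebook, shown only because a signed sum lets over- and under-predictions cancel.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def signed_log_error(y_true, y_pred):
    """Sum of log(y) - log(pred); positive and negative residuals can cancel."""
    return sum(math.log(t) - math.log(p) for t, p in zip(y_true, y_pred))

def mean_abs_log_error(y_true, y_pred):
    """Mean of |log(y) - log(pred)|; residuals cannot cancel."""
    return sum(abs(math.log(t) - math.log(p)) for t, p in zip(y_true, y_pred)) / len(y_true)

# Illustrative prices: one prediction is 2x too high, the other 2x too low.
y_true = [100.0, 200.0]
y_pred = [200.0, 100.0]

print("RMSE:", rmse(y_true, y_pred))                          # 100.0
print("MAE:", mae(y_true, y_pred))                            # 100.0
print("Signed log error:", signed_log_error(y_true, y_pred))  # 0.0
print("Mean abs log error:", mean_abs_log_error(y_true, y_pred))
```

In this toy case the signed log error is zero even though both predictions are off by a factor of two, so it is best read alongside RMSE and MAE rather than on its own.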

GBM appears to be the most lightweight of the models, requiring only 7 microseconds per prediction. The run-time use case matters when choosing among them, for example when building a customer-facing API that serves predictions from one of these models. These prediction times are only relative: although the stacked ensemble is the slowest, it is still quick in absolute terms, requiring only 32 microseconds per prediction. Additionally, if speed remains a problem, a stacked ensemble lends itself to distributed processing, since the base models can be fit and run on different machines in parallel.
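The parallel idea can be sketched as follows, assuming each base model is exposed as a plain callable that maps a batch of rows to a list of predictions. Everything here is hypothetical: `stacked_predict`, the two toy base predictors, and the averaging meta predictor are illustrative stand-ins, and a thread pool on one machine merely stands in for separate workers or hosts.

```python
from concurrent.futures import ThreadPoolExecutor

def stacked_predict(base_predictors, meta_predictor, rows):
    """Run all base predictors concurrently, then feed their outputs to the meta predictor."""
    with ThreadPoolExecutor(max_workers=len(base_predictors)) as pool:
        futures = [pool.submit(predict, rows) for predict in base_predictors]
        base_outputs = [future.result() for future in futures]  # keeps submission order
    # One meta-feature row per input row: (base_1 prediction, base_2 prediction, ...)
    meta_features = list(zip(*base_outputs))
    return meta_predictor(meta_features)

# Hypothetical base models and meta model, for illustration only.
def base_sum(rows):
    return [sum(row) for row in rows]

def base_max(rows):
    return [max(row) for row in rows]

def meta_average(meta_features):
    return [sum(features) / len(features) for features in meta_features]

rows = [[1, 2], [3, 4]]
result = stacked_predict([base_sum, base_max], meta_average, rows)
print(result)  # [2.5, 5.5]
```

With the base models running side by side, the latency of the ensemble approaches that of its slowest base model plus the meta step, rather than the sum of all base models.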

A great deal of tuning and feature engineering remains available for improving the predictive capability of this model set, but these approaches were constrained by cost and time. Additionally, the models need more data in order to converge further toward an optimal result. Possible additional features were explored in the feature engineering section, which concluded that the lat and long features could be leveraged further, for example to derive coast distances and overall proximity to amenities for a neighborhood rating.
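As one possible way to turn lat and long into a distance feature, the sketch below computes the great-circle (haversine) distance from a home to the nearest of a set of reference points. This is an assumption about how such a feature might be built, not the method used in the feature engineering section; the function names and the reference coordinates are made up for illustration and are not real coastline or amenity data.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, long) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = math.sin(d_phi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

def distance_to_nearest_km(lat, lon, reference_points):
    """Distance from (lat, lon) to the closest reference point."""
    return min(haversine_km(lat, lon, r_lat, r_lon) for r_lat, r_lon in reference_points)

# Hypothetical reference points (placeholders, not real coastline data).
coast_points = [(47.60, -122.34), (47.70, -122.40)]

# Hypothetical home location.
home_lat, home_lon = 47.65, -122.30
print("Distance to nearest reference point (km):",
      distance_to_nearest_km(home_lat, home_lon, coast_points))
```

The same helper could be reused with a list of amenity coordinates to build a proximity feature for a neighborhood rating.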