## Initial Modelling

In [1]:
import pandas as pd
import numpy as np

from xgboost import XGBRegressor

from sklearn.linear_model import Ridge
from sklearn.svm import SVR
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

In [2]:
# Load the pre-split, pre-processed data
X_train = pd.read_csv("../data/processed/X_train.csv")
X_test = pd.read_csv("../data/processed/X_test.csv")
# The targets are stored on the log scale; flatten them to 1-D arrays
y_train = np.array(pd.read_csv("../data/processed/y_train.csv")).reshape(-1)
y_test = np.array(pd.read_csv("../data/processed/y_test.csv")).reshape(-1)

# Fit each model with its default hyperparameters as a baseline
ridge = Ridge().fit(X_train, y_train)
supp_vec = SVR().fit(X_train, y_train)
rf = RandomForestRegressor().fit(X_train, y_train)
grad = XGBRegressor().fit(X_train, y_train)

## Model Evaluation

Exponentiate the target to undo the log transform, so that the metrics are in actual dollars.

In [3]:
y_test = np.exp(y_test)

In [4]:
ridge_pred = np.exp(ridge.predict(X_test))
supp_vec_pred = np.exp(supp_vec.predict(X_test))
rf_pred = np.exp(rf.predict(X_test))
grad_pred = np.exp(grad.predict(X_test))
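As a small illustration of what the back-transform does to errors, the sketch below uses made-up values (not taken from this dataset). A fixed error in log space becomes a proportional error in dollars, so dollar-scale metrics tend to be dominated by the largest targets.

```python
import numpy as np

# Hypothetical true values in dollars (illustrative only, not from the dataset)
y_true_dollars = np.array([100_000.0, 1_000_000.0])

# Suppose a model over-predicts by the same amount, 0.1, in log space
log_error = 0.1
y_pred_log = np.log(y_true_dollars) + log_error

# Back-transform the predictions to dollars with np.exp
y_pred_dollars = np.exp(y_pred_log)

abs_error_dollars = y_pred_dollars - y_true_dollars
rel_error = abs_error_dollars / y_true_dollars

print(abs_error_dollars)  # the dollar error grows with the size of the target
print(rel_error)          # the relative error is identical: exp(0.1) - 1
```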

In [5]:
metrics = {}

predictions = {'Ridge Regression': ridge_pred,
               'SVR': supp_vec_pred,
               'Random Forest': rf_pred,
               'Gradient Boosting': grad_pred}

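n = len(y_test)      # number of test observations
p = X_test.shape[1]  # number of predictors

# Compute metrics on the dollar scale for each model
for model_name, y_pred in predictions.items():
    mse = mean_squared_error(y_test, y_pred)
    rmse = np.sqrt(mse)
    mae = mean_absolute_error(y_test, y_pred)
    r2 = r2_score(y_test, y_pred)
    # Adjusted R-squared penalises R-squared for the number of predictors
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)
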
    metrics[model_name] = {'MSE': mse,
                           'RMSE': rmse,
                           'MAE': mae,
                           'R-squared': r2,
                           'Adjusted R-squared': adj_r2}
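
For reference, the `adj_r2` line in the loop corresponds to the usual adjusted $R^2$ formula, where $n$ is the number of test observations and $p$ is the number of predictors:

$$
R^2_{\text{adj}} = 1 - \left(1 - R^2\right)\frac{n - 1}{n - p - 1}
$$

The correction factor $\frac{n - 1}{n - p - 1}$ is always at least 1 when $n > p + 1$, so adjusted $R^2$ can never exceed $R^2$, and the gap widens as $p$ grows relative to $n$.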




In [6]:
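# Format every value in scientific notation (DataFrame.map needs pandas >= 2.1; older versions use applymap)
metrics_df = pd.DataFrame(metrics).map(lambda x: "{:.2e}".format(x))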
metrics_df


| Metric | Ridge Regression | SVR | Random Forest | Gradient Boosting |
|---|---|---|---|---|
| MSE | 2.16e+11 | 7.09e+09 | 1.11e+10 | 8.60e+08 |
| RMSE | 4.65e+05 | 8.42e+04 | 1.05e+05 | 2.93e+04 |
| MAE | 1.25e+05 | 4.34e+04 | 1.35e+04 | 8.71e+03 |
| R-squared | 4.76e-01 | 9.83e-01 | 9.73e-01 | 9.98e-01 |
| Adjusted R-squared | 4.42e-01 | 9.82e-01 | 9.71e-01 | 9.98e-01 |


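XGBoost (the "Gradient Boosting" column) is best on every metric, and by a wide margin: its RMSE is roughly a third of the next best (SVR), and its MAE is about two thirds of the next best (Random Forest).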

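On the evaluation criteria:

MSE, RMSE, $R^2$ and adjusted $R^2$ are all functions of the squared error, so they penalise large misses heavily. RMSE has the benefit of being in the target's units (dollars), and $R^2$ gives a scale-free measure of success relative to always predicting the mean.

MAE is the average absolute error, so it weights every dollar of error equally and is less sensitive to a few large misses. It does not correspond to the squared-error objective that Ridge, Random Forest and XGBoost minimise by default (SVR uses an epsilon-insensitive loss instead).

The choice matters in practice: in the table above, SVR beats Random Forest on RMSE (8.42e+04 vs 1.05e+05) but loses clearly on MAE (4.34e+04 vs 1.35e+04), which suggests Random Forest is closer on most observations but has a few very large misses.

Overall, the squared-error metrics (MSE, RMSE, $R^2$, adjusted $R^2$) are used as the primary selectors of model fit here, because they are most closely tied to what the models are trained to minimise. MAE is easy to explain to stakeholders as the typical miss in dollars, so it is kept as a reporting metric and a secondary check, not as the selection criterion.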
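
To make the difference between the squared-error metrics and MAE concrete, here is a minimal sketch with made-up predictions (not from this dataset). Two hypothetical models have the same MAE, but one concentrates all of its error in a single large miss, and only RMSE separates them.

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root mean squared error."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


def mae(y_true, y_pred):
    """Mean absolute error."""
    return float(np.mean(np.abs(y_true - y_pred)))


# Hypothetical targets (illustrative only)
y_true = np.zeros(10)

# Model A: misses every point by 10
pred_a = np.full(10, 10.0)

# Model B: perfect on nine points, misses one point by 100
pred_b = np.zeros(10)
pred_b[0] = 100.0

print("A: RMSE =", rmse(y_true, pred_a), "MAE =", mae(y_true, pred_a))
print("B: RMSE =", rmse(y_true, pred_b), "MAE =", mae(y_true, pred_b))
```

Both models have an MAE of 10, but Model B's RMSE is about three times larger because squaring magnifies the single large miss.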



## Feature Selection

A future goal is to explore methods such as RFECV or forward/backward selection to reduce the number of features the model uses.
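
As a starting point, recursive feature elimination with cross-validation might look something like the sketch below. It runs on synthetic data from `make_regression` as a stand-in for `X_train` and `y_train`. The choice of `Ridge` as the estimator, the 5-fold split, and the RMSE scoring are all assumptions made for illustration, not decisions taken in this notebook.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.feature_selection import RFECV
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold

# Synthetic stand-in for X_train / y_train: 20 features, 5 of them informative
X, y = make_regression(n_samples=200, n_features=20, n_informative=5,
                       noise=10.0, random_state=0)

# Drop one feature per round and score each feature count by cross-validated RMSE
selector = RFECV(
    estimator=Ridge(),
    step=1,
    cv=KFold(n_splits=5, shuffle=True, random_state=0),
    scoring="neg_root_mean_squared_error",
    min_features_to_select=1,
)
selector.fit(X, y)

print("Features kept:", selector.n_features_)
print("Selection mask:", selector.support_)

# Keep only the selected columns
X_reduced = selector.transform(X)
```

RFECV needs an estimator that exposes `coef_` or `feature_importances_`, so it would work with Ridge, Random Forest or XGBoost but not with the default RBF-kernel SVR. For forward or backward selection, scikit-learn's `SequentialFeatureSelector` accepts any estimator, at the cost of many more model fits.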