This notebook should include preliminary and baseline modeling.
- Try as many different models as possible.
- Don't worry about hyperparameter tuning or cross validation here.
- Ideas include:
    - linear regression
    - support vector machines
    - random forest
    - xgboost

In [1]:
from sklearn.linear_model import Ridge
from sklearn.svm import SVR
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import GradientBoostingRegressor
import pandas as pd
import numpy as np


In [2]:
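# Load the pre-split features and targets; targets are flattened to 1-D arrays.
X_train = pd.read_csv("../data/X_train.csv")
X_test = pd.read_csv("../data/X_test.csv")
y_train = np.array(pd.read_csv("../data/y_train.csv")).reshape(-1)
y_test = np.array(pd.read_csv("../data/y_test.csv")).reshape(-1)

# Fit each baseline with default hyperparameters (no tuning or cross validation yet).
# The tree ensembles are randomized; without a random_state, scores can vary slightly between runs.
ridge = Ridge().fit(X_train, y_train)
supp_vec = SVR().fit(X_train, y_train)
rf = RandomForestRegressor().fit(X_train, y_train)
grad = GradientBoostingRegressor().fit(X_train, y_train)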


Consider what metrics you want to use to evaluate success.
- Mean squared error is in squared units; can we actually relate that number to the size of a typical error?
- Try root mean squared error so that the error is back in the original units (dollars).
- What does RMSE do to outliers?
- Is mean absolute error a good metric for this problem?
- What about R^2? Adjusted R^2?
- Briefly describe your reasons for picking the metrics you use
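
To make the outlier question concrete, here is a minimal sketch comparing RMSE and MAE on made-up numbers. The prices and both prediction lists are hypothetical and not taken from the dataset. The helpers `rmse` and `mae` are written in plain Python so the cell runs without the data files; they are meant to mirror `np.sqrt(mean_squared_error(...))` and `mean_absolute_error(...)`.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error: square root of the average squared residual."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


def mae(y_true, y_pred):
    """Mean absolute error: average absolute residual."""
    absolute = [abs(t - p) for t, p in zip(y_true, y_pred)]
    return sum(absolute) / len(absolute)


# Hypothetical prices in dollars (illustration only)
y_true = [200_000, 250_000, 300_000, 350_000, 400_000]

# Case A: every prediction is off by $10,000
y_even = [210_000, 240_000, 310_000, 340_000, 410_000]

# Case B: four exact predictions and one $50,000 miss
y_one_miss = [200_000, 250_000, 300_000, 350_000, 450_000]

for label, y_pred in [("even errors", y_even), ("one large miss", y_one_miss)]:
    print(f"{label}: MAE = {mae(y_true, y_pred):,.0f}, RMSE = {rmse(y_true, y_pred):,.0f}")
```

Both cases have the same total absolute error, so MAE is identical, but RMSE more than doubles in the second case because squaring concentrates the weight on the single large miss.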

In [3]:
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

In [4]:
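# Undo the log transform so the metrics are in dollars (assumes y was stored as log(price)).
# Run this cell only once: re-running it would exponentiate y_test a second time.
y_test = np.exp(y_test)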

In [5]:
ridge_pred = np.exp(ridge.predict(X_test))
supp_vec_pred = np.exp(supp_vec.predict(X_test))
rf_pred = np.exp(rf.predict(X_test))
grad_pred = np.exp(grad.predict(X_test))

In [6]:
metrics = {}

predictions = {'Ridge Regression': ridge_pred,
               'SVR': supp_vec_pred,
               'Random Forest': rf_pred,
               'Gradient Boosting': grad_pred}

n = len(y_test)       # number of test observations
p = X_test.shape[1]   # number of predictors

for model_name, y_pred in predictions.items():
    mse = mean_squared_error(y_test, y_pred)
    rmse = np.sqrt(mse)
    mae = mean_absolute_error(y_test, y_pred)
    r2 = r2_score(y_test, y_pred)
    # Adjusted R-squared penalizes R-squared for the number of predictors
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)
    
    metrics[model_name] = {'MSE': mse,
                           'RMSE': rmse,
                           'MAE': mae,
                           'R-squared': r2,
                           'Adjusted R-squared': adj_r2}
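
The adjusted R-squared line in the loop can be pulled out into a small helper to see how the penalty behaves. This is only a sketch: the function name `adjusted_r2` and the values of `r2`, `n`, and `p` below are hypothetical and chosen for easy arithmetic, not taken from this dataset.

```python
def adjusted_r2(r2, n, p):
    """Adjusted R-squared for n observations and p predictors.

    Same formula as in the metrics loop: 1 - (1 - r2) * (n - 1) / (n - p - 1).
    """
    if n - p - 1 <= 0:
        raise ValueError("need more observations than predictors plus one")
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)


# Hypothetical values: same R-squared and sample size, different predictor counts
few_predictors = adjusted_r2(0.9, n=101, p=10)    # roughly 0.889
many_predictors = adjusted_r2(0.9, n=101, p=50)   # roughly 0.8

print(few_predictors, many_predictors)
```

With the raw R-squared held fixed, adding predictors lowers the adjusted value, which is why the gap between the two rows is a rough signal of how much of the fit comes from sheer feature count.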




In [9]:
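# Format every value in scientific notation with three significant figures.
# Column-wise Series.map is used because DataFrame.applymap is deprecated in recent pandas.
metrics_df = pd.DataFrame(metrics).apply(lambda col: col.map("{:.2e}".format))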
metrics_df


Metric               Ridge Regression   SVR        Random Forest   Gradient Boosting
MSE                  2.29e+11           1.04e+11   1.11e+10        3.69e+10
RMSE                 4.79e+05           3.22e+05   1.05e+05        1.92e+05
MAE                  1.28e+05           6.38e+04   1.40e+04        7.66e+04
R-squared            4.45e-01           7.48e-01   9.73e-01        9.11e-01
Adjusted R-squared   4.14e-01           7.34e-01   9.72e-01        9.05e-01


In [12]:
# Random forest wins on every metric, and it isn't close.
# Mean absolute error may be the best single metric here: it is a direct dollar error
# and is less skewed by outliers than RMSE.