## Model Selection

This notebook should include preliminary and baseline modeling.
- Try as many different models as possible.
- Don't worry about hyperparameter tuning or cross-validation here.
- Ideas include:
    - linear regression
    - support vector machines
    - random forest
    - xgboost

In [1]:
import pandas as pd
import numpy as np
import statsmodels.api as sm
from sklearn.linear_model import LinearRegression
from sklearn.svm import SVR
from sklearn.ensemble import RandomForestRegressor
import xgboost as xgb

In [2]:
# Import the preprocessed train/test splits

X_train = pd.read_csv('../data/preprocessed/X_train.csv')
X_test = pd.read_csv('../data/preprocessed/X_test.csv')
y_train = pd.read_csv('../data/preprocessed/y_train.csv')
y_train = y_train.values.ravel() # Change pd df to 1d numpy array
y_test = pd.read_csv('../data/preprocessed/y_test.csv')
y_test = y_test.values.ravel() # Change pd df to 1d numpy array

In [3]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

In [4]:
# Initialize models
linear_reg = LinearRegression()
svm_reg = SVR()
rf_reg = RandomForestRegressor()
xgb_reg = xgb.XGBRegressor()

# Fit models
linear_reg.fit(X_train, y_train)
svm_reg.fit(X_train, y_train)
rf_reg.fit(X_train, y_train)
xgb_reg.fit(X_train, y_train)

# Predict
y_pred_linear = linear_reg.predict(X_test)
y_pred_svm = svm_reg.predict(X_test)
y_pred_rf = rf_reg.predict(X_test)
y_pred_xgb = xgb_reg.predict(X_test)

In [5]:
# import models and fit

Consider what metrics you want to use to evaluate success.
- If you think about mean squared error, can we actually relate to the amount of error?
- Try root mean squared error so that error is closer to the original units (dollars)
- What does RMSE do to outliers?
- Is mean absolute error a good metric for this problem?
- What about R^2? Adjusted R^2?
- Briefly describe your reasons for picking the metrics you use
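
The question about RMSE and outliers can be checked with a small experiment. The sketch below uses made-up prices (illustrative values, not taken from this dataset) and a hypothetical helper `mae_rmse`. It compares two sets of predictions with the same total absolute error: one spreads the error evenly, the other puts it all on a single row.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Hypothetical sale prices in dollars (illustrative, not from the dataset)
y_true = np.array([200_000, 250_000, 300_000, 350_000, 400_000], dtype=float)

# Case 1: every prediction is off by $10,000
y_pred_even = y_true + 10_000

# Case 2: four predictions are exact, one is off by $50,000
# (same total absolute error as case 1)
y_pred_outlier = y_true.copy()
y_pred_outlier[-1] += 50_000


def mae_rmse(y, y_hat):
    """Return (MAE, RMSE) for one set of predictions."""
    mae = mean_absolute_error(y, y_hat)
    rmse = np.sqrt(mean_squared_error(y, y_hat))
    return mae, rmse


mae_even, rmse_even = mae_rmse(y_true, y_pred_even)
mae_outlier, rmse_outlier = mae_rmse(y_true, y_pred_outlier)

print(f"even errors:   MAE={mae_even:,.0f}  RMSE={rmse_even:,.0f}")
print(f"one big error: MAE={mae_outlier:,.0f}  RMSE={rmse_outlier:,.0f}")
```

MAE is 10,000 in both cases, while RMSE rises from 10,000 to about 22,361 when the error is concentrated in one row. A large gap between RMSE and MAE on real data is therefore a hint that a few rows are predicted very badly.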

In [6]:
# gather evaluation metrics and compare results

In [7]:
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

In [8]:
# Mean Absolute Error
mae_linear = mean_absolute_error(y_test, y_pred_linear)
mae_svm = mean_absolute_error(y_test, y_pred_svm)
mae_rf = mean_absolute_error(y_test, y_pred_rf)
mae_xgb = mean_absolute_error(y_test, y_pred_xgb)

# Mean Squared Error
mse_linear = mean_squared_error(y_test, y_pred_linear)
mse_svm = mean_squared_error(y_test, y_pred_svm)
mse_rf = mean_squared_error(y_test, y_pred_rf)
mse_xgb = mean_squared_error(y_test, y_pred_xgb)

# Root Mean Squared Error
rmse_linear = np.sqrt(mse_linear)
rmse_svm = np.sqrt(mse_svm)
rmse_rf = np.sqrt(mse_rf)
rmse_xgb = np.sqrt(mse_xgb)

# R-squared
r2_linear = r2_score(y_test, y_pred_linear)
r2_svm = r2_score(y_test, y_pred_svm)
r2_rf = r2_score(y_test, y_pred_rf)
r2_xgb = r2_score(y_test, y_pred_xgb)

metrics = {
    'Model': ['Linear Regression', 'Support Vector Machine', 'Random Forest', 'XGBoost'],
    'MAE': [mae_linear, mae_svm, mae_rf, mae_xgb],
    'MSE': [mse_linear, mse_svm, mse_rf, mse_xgb],
    'RMSE': [rmse_linear, rmse_svm, rmse_rf, rmse_xgb],
    'R-squared': [r2_linear, r2_svm, r2_rf, r2_xgb]
}

# Create a df
results_df = pd.DataFrame(metrics)
results_df

|   | Model | MAE | MSE | RMSE | R-squared |
|---|---|---|---|---|---|
| 0 | Linear Regression | 131319.80281 | 185262400000.0 | 430421.150297 | 0.298296 |
| 1 | Support Vector Machine | 215308.01048 | 274805300000.0 | 524218.750852 | -0.040858 |
| 2 | Random Forest | 40837.795086 | 149877700000.0 | 387140.385611 | 0.43232 |
| 3 | XGBoost | 41733.465233 | 157308000000.0 | 396620.694151 | 0.404177 |
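
The per-model metric variables in the cell above could also be produced with a loop over a dictionary of models, which makes it easier to add or remove a model later. This is only a sketch under stated assumptions: it runs on synthetic stand-in data from `make_regression` rather than the preprocessed CSVs, it includes only two of the four models so that it runs without `xgboost`, and the names `models`, `rows`, and `results_sketch` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data so the sketch runs on its own
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# One entry per model; add SVR() or xgb.XGBRegressor() here as needed
models = {
    "Linear Regression": LinearRegression(),
    "Random Forest": RandomForestRegressor(n_estimators=50, random_state=0),
}

rows = []
for name, model in models.items():
    # Fit on the training split, predict on the held-out split
    y_pred = model.fit(X_tr, y_tr).predict(X_te)
    mse = mean_squared_error(y_te, y_pred)
    rows.append({
        "Model": name,
        "MAE": mean_absolute_error(y_te, y_pred),
        "MSE": mse,
        "RMSE": np.sqrt(mse),
        "R-squared": r2_score(y_te, y_pred),
    })

results_sketch = pd.DataFrame(rows)
print(results_sketch)
```

Fixing `random_state` on the forest, as done here, makes the table reproducible from run to run.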


#### Summary

Mean Absolute Error (MAE) measures the average magnitude of errors in the predictions, regardless of their direction. Lower values are better. Random Forest has the lowest MAE, indicating it has the smallest average prediction error.

- Best: Random Forest (40,837.80)
- Worst: Support Vector Machine (215,308.01)

Mean Squared Error (MSE) is the average of the squared errors, so it penalizes large errors more heavily than MAE does. Lower values are better. Random Forest has the lowest MSE.

- Best: Random Forest (1.498777e+11)
- Worst: Support Vector Machine (2.748053e+11)

Root Mean Squared Error (RMSE) is the square root of MSE, which puts the error back in the units of the target variable (dollars). Random Forest has the lowest RMSE. Its RMSE is still more than nine times its MAE (387,140 vs. 40,838), which suggests that a small number of very large errors dominate the squared-error metrics.

- Best: Random Forest (387,140.39)
- Worst: Support Vector Machine (524,218.75)

R-squared measures the proportion of variance in the target variable that the model explains. Higher values indicate a better fit, and 1.0 is a perfect fit. Random Forest has the highest R-squared, so it explains the most variance in the target. The Support Vector Machine has a negative R-squared, which means it performs worse on the test set than a model that always predicts the mean of the test targets.

- Best: Random Forest (0.432320)
- Worst: Support Vector Machine (-0.040858)
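
To make the negative R-squared concrete, the sketch below shows on toy numbers that a constant mean prediction scores exactly 0 and that a worse constant scores below 0. It also includes a hypothetical helper `adjusted_r2` using the standard formula $1 - (1 - R^2)\frac{n - 1}{n - p - 1}$. The `n_samples` and `n_features` values passed to it are placeholders, not the real shape of `X_test`.

```python
import numpy as np
from sklearn.metrics import r2_score


def adjusted_r2(r2, n_samples, n_features):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    return 1.0 - (1.0 - r2) * (n_samples - 1) / (n_samples - n_features - 1)


# Toy target values (illustrative only)
y_true = np.array([100.0, 200.0, 300.0, 400.0, 500.0])

# Predicting the mean of y_true for every row gives R^2 of exactly 0
r2_mean = r2_score(y_true, np.full_like(y_true, y_true.mean()))

# A constant prediction far from the mean does worse than the mean: R^2 < 0
r2_bad = r2_score(y_true, np.zeros_like(y_true))

# Adjusted R^2 for the Random Forest score reported in the results table.
# n_samples and n_features are placeholder assumptions; in the notebook they
# would come from X_test.shape.
adj = adjusted_r2(0.43232, n_samples=1000, n_features=50)

print(r2_mean, r2_bad, adj)
```

Adjusted R-squared is always at most the plain R-squared, and the gap grows as the number of features approaches the number of samples.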

- Random Forest is the best performer overall, with the lowest MAE, MSE, and RMSE and the highest R-squared.
- XGBoost is a close second: its MAE is about 2% higher (41,733 vs. 40,838) and its R-squared is slightly lower (0.404 vs. 0.432).
- Linear Regression is a distant third: its MAE (131,320) is more than three times that of the tree-based models, and its R-squared is 0.298.
- Support Vector Machine is the worst on every metric, and its negative R-squared means it does not beat a constant mean prediction.

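Random Forest and XGBoost are tree ensembles that can capture non-linear relationships and feature interactions that Linear Regression cannot. `SVR` with its default RBF kernel can also model non-linearity, so its poor score here more likely reflects the untuned defaults (`C=1.0`, `epsilon=0.1`), which are small relative to a target measured in dollars, than a limitation of the method. Because these results come from a single train/test split and the forest was fit without a fixed `random_state`, the small gap between Random Forest and XGBoost may not hold on a rerun.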

## Feature Selection - STRETCH

> **This step doesn't need to be part of your Minimum Viable Product (MVP), but it's recommended you complete it if you have time!**

Even with all the preprocessing we did in Notebook 1, you probably still have a lot of features. Are they all important for prediction?

Investigate some feature selection algorithms (Lasso, RFE, Forward/Backward Selection)
- Perform feature selection to get a reduced subset of your original features
- Refit your models with this reduced dimensionality - how does performance change on your chosen metrics?
- Based on this, should you include feature selection in your final pipeline? Explain

Remember, feature selection often doesn't directly improve performance, but if performance remains the same, a simpler model is often preferable.
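
As a starting point, two of the approaches named above (Lasso and RFE) might look like the sketch below. It is a minimal example under stated assumptions, not a recommended pipeline: the data is synthetic (`make_regression` with 20 features, 5 of them informative), `alpha=1.0` and `n_features_to_select=5` are arbitrary illustrative choices, and the variable names are hypothetical.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import RFE, SelectFromModel
from sklearn.linear_model import Lasso, LinearRegression
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: 20 features, only 5 of which carry signal
X, y = make_regression(
    n_samples=300, n_features=20, n_informative=5, noise=5.0, random_state=0
)
X = StandardScaler().fit_transform(X)

# Lasso: keep the features whose coefficients are not shrunk to (near) zero
lasso_selector = SelectFromModel(Lasso(alpha=1.0)).fit(X, y)
lasso_mask = lasso_selector.get_support()
X_lasso = lasso_selector.transform(X)

# RFE: repeatedly refit and drop the weakest feature until 5 remain
rfe = RFE(LinearRegression(), n_features_to_select=5).fit(X, y)
rfe_mask = rfe.get_support()
X_rfe = rfe.transform(X)

print("Lasso kept:", int(lasso_mask.sum()), "features")
print("RFE kept:  ", int(rfe_mask.sum()), "features")
```

In the notebook, the selectors would be fit on `X_train` only and then applied to `X_test` with `transform`, so that the test set does not influence which features are kept. For forward or backward selection, scikit-learn provides `SequentialFeatureSelector` in the same module.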



In [9]:
# perform feature selection 
# refit models
# gather evaluation metrics and compare to the previous step (full feature set)