# Model Comparison

## Introduction
In this notebook, I compare the performance of multiple machine learning models for predicting car prices. I focus on six different initial models and evaluate their performance using metrics such as Mean Absolute Error (MAE), Mean Squared Error (MSE), and R2 Score.

## Summary
### Key Findings
- **Random Forest:**
  - Mean Absolute Error (MAE): 2841.05
  - Mean Squared Error (MSE): 21853604.32
  - R2 Score: 0.8781
<br>
<br>
- **XGBoost:**
  - Mean Absolute Error (MAE): 3017.48
  - Mean Squared Error (MSE): 21386162.54
  - R2 Score: 0.8808


### Conclusion
Both the Random Forest and XGBoost models performed well. Random Forest had the lowest initial MAE (2841.05 vs. 3017.48), while XGBoost had a slightly lower MSE and a slightly higher R2 score (0.8808 vs. 0.8781), which suggests it handled large errors marginally better. I will proceed with hyperparameter tuning for both models to improve their performance further.
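
MSE squares each residual, so a few large misses dominate it, which is how one model can have the lower MAE and the other the lower MSE. A minimal sketch with made-up residuals (hypothetical values, not taken from the car dataset):

```python
# Hypothetical residuals (prediction errors in dollars); not from the car dataset.
errors_a = [3000, 3000, 3000, 3000]   # consistently moderate misses
errors_b = [1000, 1000, 1000, 7000]   # mostly small misses plus one large miss

def mae(errors):
    """Mean absolute error of a list of residuals."""
    return sum(abs(e) for e in errors) / len(errors)

def mse(errors):
    """Mean squared error of a list of residuals."""
    return sum(e ** 2 for e in errors) / len(errors)

# Set B wins on MAE, but its single large miss makes it lose on MSE.
print("MAE:", mae(errors_a), mae(errors_b))
print("MSE:", mse(errors_a), mse(errors_b))
```

Because R2 is computed from squared errors, it moves with MSE rather than MAE on a fixed test set.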

In [1]:
from pathlib import Path
import pandas as pd
import numpy as np
from sklearnex import patch_sklearn # Intel Extension for Scikit-learn: speeds up supported algorithms on Intel processors
patch_sklearn() # must run before the sklearn estimators are imported
from sklearn.linear_model import LinearRegression, Lasso, Ridge, LassoCV, RidgeCV
import xgboost as xgb
import seaborn as sns
from sklearn import metrics, svm, preprocessing
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR, LinearSVR
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report
from sklearn.ensemble import RandomForestRegressor
import time
from datetime import datetime

Intel(R) Extension for Scikit-learn* enabled (https://github.com/intel/scikit-learn-intelex)


## Scaling & Partitioning Data
#### 80% train / 20% test

In [2]:
cars_df = pd.read_csv('Car Data Cleaned.csv')

predictors = ['Year', 'Model', 'State', 'Mileage']

X = pd.get_dummies(cars_df[predictors], drop_first=False).values # one-hot encodes categorical columns
y = cars_df['Price'].values.reshape(-1,1)

column_names = pd.get_dummies(cars_df[predictors], drop_first=False) # encoded DataFrame, kept only for its column labels (X above is a bare NumPy array)

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
X_scaled = pd.DataFrame(X_scaled, columns=column_names.columns) # convert back to a DataFrame so the column names are preserved

# splitting data
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=.2, random_state=1)

print('Data Processing Finished')

Data Processing Finished
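
Note that the cell above fits `StandardScaler` on the full dataset before splitting, so the test rows influence the scaling statistics. A common alternative is to split first and fit the scaler on the training rows only. The sketch below uses a small synthetic frame; the data and the `toy_*` names are made up for illustration and are not part of the original notebook.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the numeric predictors (hypothetical values).
rng = np.random.default_rng(1)
toy_df = pd.DataFrame({
    "Year": rng.integers(2005, 2021, size=50),
    "Mileage": rng.integers(5_000, 150_000, size=50),
})
toy_y = rng.normal(20_000, 5_000, size=50)

# Split first, so the test rows never influence the scaler.
toy_train, toy_test, toy_y_train, toy_y_test = train_test_split(
    toy_df, toy_y, test_size=0.2, random_state=1
)

toy_scaler = StandardScaler()
toy_train_scaled = toy_scaler.fit_transform(toy_train)  # learn mean/std from training rows only
toy_test_scaled = toy_scaler.transform(toy_test)        # reuse the training statistics

print(toy_train_scaled.shape, toy_test_scaled.shape)
```

With this ordering the training columns have mean zero by construction, while the test columns generally do not, which is the expected behavior.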


# Model Comparison

## Linear Regression

#### Baseline model

In [3]:
start_time = time.time()

# Create model
Linear_model = LinearRegression()
Linear_model.fit(X_train,y_train.ravel())

# Predict the response for test dataset
pred_linear = Linear_model.predict(X_test)

# Print coefficients & performance
print('intercept ', Linear_model.intercept_)
print(pd.DataFrame({'Predictor': X_scaled.columns, 'coefficient': Linear_model.coef_}))
print('\n')
print("\033[1mLinear Regression Performance:\033[0m")
print("Mean Absolute Error (MAE): ", metrics.mean_absolute_error(y_test, pred_linear))
print("Mean Squared Error  (MSE): ", metrics.mean_squared_error(y_test, pred_linear))
print("R2 Score             (R2): ", metrics.r2_score(y_test, pred_linear))

end_time = time.time()
elapsed_time = (end_time - start_time) / 60
print('\n\n')
print(f"{elapsed_time:.1f} minutes to execute.")

intercept  12175.147152581978
         Predictor   coefficient
0             Year   4371.740331
1          Mileage  -3961.047810
2       Model_370z  -8117.330227
3    Model_4runner -31240.426375
4         Model_86  -2291.306801
..             ...           ...
333       State_VT  14305.366320
334       State_WA  41700.820438
335       State_WI  49996.538895
336       State_WV  20841.242387
337       State_WY   7342.995466

[338 rows x 2 columns]


Linear Regression Performance:
Mean Absolute Error (MAE):  4175153.4496936365
Mean Squared Error  (MSE):  1.4512715646334742e+17
R2 Score             (R2):  -809191514.02251



0.0 minutes to execute.
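
An R2 this far below zero usually signals a numerically unstable fit rather than a model that is simply uninformative. One plausible contributor (an assumption, not verified against this dataset) is that `drop_first=False` keeps every dummy level, so the dummies for each categorical column sum to one and are perfectly collinear with the intercept. The toy frame below is made up to illustrate that effect and does not diagnose the actual data.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature dataset with one categorical and one numeric column.
toy = pd.DataFrame({
    "State": ["CA", "NY", "TX", "CA", "NY", "TX"],
    "Mileage": [10, 20, 30, 40, 50, 65],
})

def design_rank(drop_first):
    """Return (rank, n_columns) of the design matrix including an intercept column."""
    dummies = pd.get_dummies(toy, drop_first=drop_first, dtype=float)
    design = np.column_stack([np.ones(len(dummies)), dummies.to_numpy()])
    return int(np.linalg.matrix_rank(design)), design.shape[1]

# Rank below the column count means the least-squares solution is not unique.
print("drop_first=False:", design_rank(False))
print("drop_first=True: ", design_rank(True))
```

Ridge and Lasso penalize coefficient size, which may be why they stay stable on the same features.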


## Random Forest

In [4]:
start_time = time.time()

# Create model
RF_model = RandomForestRegressor(n_jobs=-1)
RF_model.fit(X_train,y_train.ravel())

# Predict the response for test dataset
pred_RF = RF_model.predict(X_test)

# Print performance
print("\033[1mRandom Forest Regression Performance:\033[0m")
print("Mean Absolute Error (MAE): ", metrics.mean_absolute_error(y_test, pred_RF))
print("Mean Squared Error  (MSE): ", metrics.mean_squared_error(y_test, pred_RF))
print("R2 Score             (R2): ", metrics.r2_score(y_test, pred_RF))

end_time = time.time()
elapsed_time = (end_time - start_time) / 60
print('\n\n')
print(f"{elapsed_time:.1f} minutes to execute.")

Random Forest Regression Performance:
Mean Absolute Error (MAE):  2841.0501529556477
Mean Squared Error  (MSE):  21853604.317072425
R2 Score             (R2):  0.8781499505883282



0.8 minutes to execute.
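
Each model cell repeats the same fit, predict, score, and timing steps. A helper along these lines could factor them out; `evaluate_model` and the `demo_*` data are hypothetical names introduced for this sketch, not part of the original notebook.

```python
import time
import numpy as np
from sklearn import metrics
from sklearn.linear_model import LinearRegression

def evaluate_model(model, X_train, y_train, X_test, y_test):
    """Fit a model, score it on the test split, and return metrics plus elapsed minutes."""
    start = time.perf_counter()
    model.fit(X_train, np.ravel(y_train))
    pred = model.predict(X_test)
    return {
        "MAE": metrics.mean_absolute_error(y_test, pred),
        "MSE": metrics.mean_squared_error(y_test, pred),
        "R2": metrics.r2_score(y_test, pred),
        "minutes": (time.perf_counter() - start) / 60,
    }

# Toy usage on an exactly linear relationship (made-up data).
demo_X = np.arange(20, dtype=float).reshape(-1, 1)
demo_y = 2.0 * demo_X.ravel() + 1.0
demo_scores = evaluate_model(
    LinearRegression(), demo_X[:15], demo_y[:15], demo_X[15:], demo_y[15:]
)
print(demo_scores)
```

Returning a dictionary rather than printing makes it easy to collect every model's scores into one comparison table.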


## XGBoost

In [5]:
start_time = time.time()

# Create model
XGBoost_model = xgb.XGBRegressor()
XGBoost_model.fit(X_train, y_train.ravel())

# Predict the response for test dataset
pred_XGBoost = XGBoost_model.predict(X_test)

# print performance
print("\033[1mXGBoost Performance:\033[0m")
print("Mean Absolute Error (MAE): ", metrics.mean_absolute_error(y_test, pred_XGBoost))
print("Mean Squared Error  (MSE): ", metrics.mean_squared_error(y_test, pred_XGBoost))
print("R2 Score             (R2): ", metrics.r2_score(y_test, pred_XGBoost))

end_time = time.time()
elapsed_time = (end_time - start_time) / 60
print('\n\n')
print(f"{elapsed_time:.1f} minutes to execute.")

XGBoost Performance:
Mean Absolute Error (MAE):  3017.4821608950706
Mean Squared Error  (MSE):  21386162.54130363
R2 Score             (R2):  0.8807562851154895



0.0 minutes to execute.


## Ridge Regression

In [6]:
start_time = time.time()

# Create model (RidgeCV selects the regularization strength by cross-validation, unlike plain Ridge)
Ridge_model = RidgeCV(alphas=np.logspace(-6, 6, 13), cv=5) # candidate alphas from 1e-6 to 1e6; 5-fold CV picks the best one
Ridge_model.fit(X_train,y_train.ravel())

# Predict the response for test dataset
pred_Ridge = Ridge_model.predict(X_test)

# Print performance
print("\033[1mRidge Regression Performance:\033[0m")
print("Mean Absolute Error (MAE): ", metrics.mean_absolute_error(y_test, pred_Ridge))
print("Mean Squared Error  (MSE): ", metrics.mean_squared_error(y_test, pred_Ridge))
print("R2 Score             (R2): ", metrics.r2_score(y_test, pred_Ridge))

end_time = time.time()
elapsed_time = (end_time - start_time) / 60
print('\n\n')
print(f"{elapsed_time:.1f} minutes to execute.")

Ridge Regression Performance:
Mean Absolute Error (MAE):  3430.504209729433
Mean Squared Error  (MSE):  28128574.353743095
R2 Score             (R2):  0.8431623394862201



0.2 minutes to execute.
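
For reference, the candidate grid passed to `RidgeCV` above can be inspected directly. This small sketch only reproduces the `np.logspace` call; `alpha_grid` is a name chosen here for illustration.

```python
import numpy as np

# 13 log-spaced candidates: one value per power of ten from 1e-6 to 1e6.
alpha_grid = np.logspace(-6, 6, 13)
print(alpha_grid)
```

After fitting, `Ridge_model.alpha_` holds the value that cross-validation selected. If it lands on either end of the grid, widening the range may be worth trying.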


## Lasso Regression 

In [7]:
start_time = time.time()

# Create model
Lasso_model = LassoCV(cv=5, random_state=1)
Lasso_model.fit(X_train,y_train.ravel())

# Predict the response for test dataset
pred_Lasso = Lasso_model.predict(X_test)

# Print performance
print("\033[1mLasso Regression Performance:\033[0m")
print("Mean Absolute Error (MAE): ", metrics.mean_absolute_error(y_test, pred_Lasso))
print("Mean Squared Error  (MSE): ", metrics.mean_squared_error(y_test, pred_Lasso))
print("R2 Score             (R2): ", metrics.r2_score(y_test, pred_Lasso))

end_time = time.time()
elapsed_time = (end_time - start_time) / 60
print('\n\n')
print(f"{elapsed_time:.1f} minutes to execute.")

Lasso Regression Performance:
Mean Absolute Error (MAE):  3429.165967294308
Mean Squared Error  (MSE):  28135442.02252226
R2 Score             (R2):  0.8431240471401182



0.1 minutes to execute.


## Support Vector Regression

In [8]:
start_time = time.time()

# Create model
SVR_model = SVR(kernel = 'rbf')
SVR_model.fit(X_train,y_train.ravel())

# Predict the response for test dataset
pred_SVR = SVR_model.predict(X_test)

# Print performance
print("\033[1mSupport Vector Regression Performance:\033[0m")
print("Mean Absolute Error (MAE): ", metrics.mean_absolute_error(y_test, pred_SVR))
print("Mean Squared Error  (MSE): ", metrics.mean_squared_error(y_test, pred_SVR))
print("R2 Score             (R2): ", metrics.r2_score(y_test, pred_SVR))


end_time = time.time()
elapsed_time = (end_time - start_time) / 60
print('\n\n')
print(f"{elapsed_time:.1f} minutes to execute.")

Support Vector Regression Performance:
Mean Absolute Error (MAE):  9406.713729606216
Mean Squared Error  (MSE):  182016973.66927552
R2 Score             (R2):  -0.01487960125816512



0.4 minutes to execute.
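
An R2 near zero means the SVR predictions are barely better than predicting the mean price. One possible explanation (an assumption, not verified here) is that the default `C=1.0` is very small relative to a target measured in thousands of dollars, so the fitted function stays nearly flat. Scaling the target is a common remedy. The sketch below shows the idea on synthetic data with scikit-learn's `TransformedTargetRegressor`; the data and variable names are made up for illustration.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic price-like target: large offset and spread, driven by the first feature.
rng = np.random.default_rng(0)
svr_X = rng.normal(size=(300, 3))
svr_y = 20_000 + 5_000 * svr_X[:, 0] + rng.normal(scale=500, size=300)

Xtr, Xte, ytr, yte = train_test_split(svr_X, svr_y, test_size=0.2, random_state=1)

# Default SVR trained on the raw dollar-scale target.
raw_svr = SVR(kernel="rbf").fit(Xtr, ytr)

# Same SVR, but the target is standardized for fitting and un-scaled for predictions.
scaled_svr = TransformedTargetRegressor(
    regressor=SVR(kernel="rbf"), transformer=StandardScaler()
).fit(Xtr, ytr)

r2_raw = r2_score(yte, raw_svr.predict(Xte))
r2_scaled = r2_score(yte, scaled_svr.predict(Xte))
print("R2 raw target:   ", r2_raw)
print("R2 scaled target:", r2_scaled)
```

Tuning `C` and `epsilon` directly is another option; target scaling is shown only because it leaves the SVR defaults unchanged.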
