# Model Comparison

## Objective
The objective of this notebook is to compare the performance of various machine learning models for predicting car prices, identifying the top-performing models for further optimization.

## Summary
The comparison of six machine learning models revealed that Random Forest and XGBoost were the top performers. Random Forest exhibited the best overall performance, while XGBoost showed potential for improvement through hyperparameter tuning.

### Key Findings
- **Random Forest:**
  - Mean Absolute Error (MAE): `1,824.63`
  - Mean Squared Error  (MSE): `8,395,573`
  - R2 Score             (R2): `0.9515`
  - **Performance Summary:** Random Forest outperformed the other models overall, demonstrating the lowest MAE and MSE, indicating higher accuracy and better generalization.
<br>
<br>

- **XGBoost:**
  - Mean Absolute Error (MAE): `2,216.44`
  - Mean Squared Error  (MSE): `10,367,387`
  - R2 Score             (R2): `0.9402`
  - **Performance Summary:** XGBoost showed competitive performance with a slightly lower R2 score compared to Random Forest. However, it is known for responding well to hyperparameter tuning, which could potentially improve its performance.
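
For reference, the three metrics reported throughout this notebook can be written out by hand. This is a minimal sketch on made-up toy prices (not values from the dataset). It also shows why R2 can go negative when predictions are worse than simply predicting the mean price.

```python
# Hypothetical toy prices, not taken from the notebook's dataset.
def mae(y_true, y_pred):
    """Mean absolute error: average size of the miss, in dollars."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def mse(y_true, y_pred):
    """Mean squared error: penalizes large misses more heavily."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def r2(y_true, y_pred):
    """1 - SS_res / SS_tot; negative when worse than predicting the mean."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

y_true = [10000, 20000, 30000]
good_pred = [11000, 19000, 31000]
bad_pred = [40000, 0, 60000]

print(mae(y_true, good_pred), mse(y_true, good_pred), r2(y_true, good_pred))
print(r2(y_true, bad_pred))
```

These hand-written functions follow the same definitions as `metrics.mean_absolute_error`, `metrics.mean_squared_error`, and `metrics.r2_score`.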


In [None]:
# Necessary libraries
import time
from datetime import datetime
import pandas as pd
import numpy as np
import joblib
from sklearn import metrics
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, LassoCV, RidgeCV
from sklearn.ensemble import RandomForestRegressor
import xgboost as xgb
from catboost import CatBoostRegressor
import lightgbm as lgb

### Import and Process Data

In [None]:
# Load data
df_cars = pd.read_csv('cleaned_data_july_21st.csv')

# Define features and save column names
features = ['Year', 'Model', 'State', 'Mileage', 'Trim', 'Make', 'Body Style', 'City']  # City adds more than 3,000 one-hot columns
column_names = pd.get_dummies(df_cars[features], drop_first=False) # Fixes numpy array error

# Encode features and define target
X = pd.get_dummies(df_cars[features], drop_first=False).values
y = df_cars['Price'].values.reshape(-1)

# Define scaler and scale features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
X_scaled = pd.DataFrame(X_scaled, columns=column_names.columns) # Convert back to DataFrame to maintain column names

# Split data into 80/20 train/test
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=.2, random_state=1)

print('Data imported & processed')
print('\nTrain & Test Data Row, Column Count:', X_train.shape, X_test.shape)

Data imported & processed

Train & Test Data Row, Column Count: (66316, 4221) (16579, 4221)
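
The 4,221 columns come from one-hot encoding: each categorical feature contributes one column per distinct value. A small sketch, using a made-up four-row table rather than the real listings, shows how the encoded width can be estimated from `nunique()` before encoding.

```python
import pandas as pd

# Hypothetical miniature of the listings table; all values are made up.
df = pd.DataFrame({
    'Year': [2015, 2018, 2020, 2019],
    'Mileage': [60000, 30000, 12000, 25000],
    'Make': ['toyota', 'nissan', 'toyota', 'honda'],
    'City': ['stockton', 'warren', 'rochester', 'scottsdale'],
})

categorical = ['Make', 'City']
numeric = ['Year', 'Mileage']

# Same call pattern as the notebook: keep every level (drop_first=False).
encoded = pd.get_dummies(df[numeric + categorical], drop_first=False)

# Numeric columns pass through; each categorical adds one column per level.
expected_width = len(numeric) + sum(df[c].nunique() for c in categorical)

print(encoded.shape[1], expected_width)
print(df[categorical].nunique().sort_values(ascending=False))
```

Running the same `nunique()` check on the real `df_cars[features]` would show which feature drives the column count.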


## Model Comparison
(Preliminary Model Performance)

### Linear Regression (Baseline Model)

In [None]:
# Track training time
start_time = time.time()

# Train Linear regression model
Linear_model = LinearRegression(n_jobs=-1)
Linear_model.fit(X_train,y_train.ravel())
pred_linear = Linear_model.predict(X_test)

# Replace NaN values in predictions with zero, fixes NaN error
pred_linear = np.nan_to_num(pred_linear, nan=0.0)

# Print performance
print("\033[1mLinear Regression Performance:\033[0m")
print("Mean Absolute Error (MAE): ", metrics.mean_absolute_error(y_test, pred_linear))
print("Mean Squared Error  (MSE): ", metrics.mean_squared_error(y_test, pred_linear))
print("R2 Score             (R2): ", metrics.r2_score(y_test, pred_linear))

# Print model execution time
end_time = time.time()
elapsed_time = (end_time - start_time)
print(f"{elapsed_time:.1f} seconds to execute.")
print('\n')

# Print coefficients
print('intercept ', Linear_model.intercept_)
print(pd.DataFrame({'Predictor': X_scaled.columns, 'coefficient': Linear_model.coef_}))

Linear Regression Performance:
Mean Absolute Error (MAE):  26721.3905543157
Mean Squared Error  (MSE):  888169699.3344593
R2 Score             (R2):  -4.100408125966033
871.5 seconds to execute.


intercept  nan
            Predictor   coefficient
0                Year  4.366843e+03
1             Mileage -3.678320e+03
2          Model_370z -5.752513e+13
3       Model_4runner  6.883538e+14
4            Model_86  4.909451e+13
...               ...           ...
4216    City_oak lawn           NaN
4217   City_rochester           NaN
4218  City_scottsdale           NaN
4219    City_stockton           NaN
4220      City_warren           NaN

[4221 rows x 2 columns]
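
The NaN intercept and coefficients on the order of 1e13 suggest an ill-conditioned fit. One plausible contributor (an assumption, not something verified against this dataset) is that `drop_first=False` keeps every dummy level, so the dummies for each feature sum to one and are perfectly collinear with the intercept. A toy sketch of that effect:

```python
import numpy as np

# Hypothetical toy design: one categorical feature with three levels,
# fully one-hot encoded (no level dropped), over six rows.
levels = np.array([0, 1, 2, 0, 1, 2])
one_hot = np.eye(3)[levels]  # shape (6, 3)

# LinearRegression fits an intercept by default, which behaves like a
# column of ones in the design matrix.
ones = np.ones(len(levels))
with_all_levels = np.column_stack([ones, one_hot])
with_first_dropped = np.column_stack([ones, one_hot[:, 1:]])

# Rank below the column count means the least-squares solution is not unique.
print(with_all_levels.shape[1], np.linalg.matrix_rank(with_all_levels))
print(with_first_dropped.shape[1], np.linalg.matrix_rank(with_first_dropped))
```

Regularized models such as Ridge and Lasso still have a well-defined solution under this kind of collinearity.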


### Random Forest Regressor

In [None]:
# Track training time
start_time = time.time()

# Train Random Forest regressor model
RF_model = RandomForestRegressor(n_jobs=-1)
RF_model.fit(X_train,y_train.ravel())
pred_RF = RF_model.predict(X_test)

# Print performance
print("\033[1mRandom Forest Regression Performance:\033[0m")
print(f'Mean Absolute Error (MAE): $ {metrics.mean_absolute_error(y_test, pred_RF):,.2f}')
print(f'Mean Squared Error  (MSE): {int(metrics.mean_squared_error(y_test, pred_RF)):,}')
print(f'R2 Score             (R2): {metrics.r2_score(y_test, pred_RF):.4f}')

# Print model execution time
end_time = time.time()
elapsed_time = (end_time - start_time)
print(f"{elapsed_time:.1f} seconds to execute.")

Random Forest Regression Performance:
Mean Absolute Error (MAE): $ 1,832.50
Mean Squared Error  (MSE): 9,145,979
R2 Score             (R2): 0.9475
338.0 seconds to execute.
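
Each model cell repeats the same timing lines, and the measured time covers fitting, prediction, and metric printing together. A small helper could time only the step of interest. This is a sketch, with `timed` being a hypothetical name that is not part of the notebook.

```python
import time
from contextlib import contextmanager

@contextmanager
def timed(label):
    """Print how long the wrapped block took, using a monotonic clock."""
    start = time.perf_counter()
    result = {}
    try:
        yield result
    finally:
        result['seconds'] = time.perf_counter() - start
        print(f"{label}: {result['seconds']:.1f} seconds to execute.")

# Stand-in workload; in the notebook this would be a call such as model.fit(...).
with timed('fit') as fit_timing:
    total = sum(i * i for i in range(100_000))
```

`time.perf_counter()` is used here instead of `time.time()` because it is not affected by system clock adjustments.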


### XGBoost Regressor

In [None]:
# Track training time
start_time = time.time()

# Train XGBoost regressor model
XGBoost_model = xgb.XGBRegressor()
XGBoost_model.fit(X_train, y_train.ravel())
pred_XGBoost = XGBoost_model.predict(X_test)

# Print performance
print("\033[1mXGBoost Regression Performance:\033[0m")
print(f'Mean Absolute Error (MAE): $ {metrics.mean_absolute_error(y_test, pred_XGBoost):,.2f}')
print(f'Mean Squared Error  (MSE): {int(metrics.mean_squared_error(y_test, pred_XGBoost)):,}')
print(f'R2 Score             (R2): {metrics.r2_score(y_test, pred_XGBoost):.4f}')

# Print model execution time
end_time = time.time()
elapsed_time = (end_time - start_time)
print(f"{elapsed_time:.1f} seconds to execute.")

XGBoost Regression Performance:
Mean Absolute Error (MAE): $ 2,247.27
Mean Squared Error  (MSE): 11,244,244
R2 Score             (R2): 0.9354
9.9 seconds to execute.


### Ridge Regression

In [None]:
# Track training time
start_time = time.time()

# Train Ridge regression model (cv for more accurate score)
Ridge_model = RidgeCV(alphas=np.logspace(-6, 6, 13), cv=5)  # RidgeCV picks the best alpha from these log-spaced candidates
Ridge_model.fit(X_train,y_train)
pred_Ridge = Ridge_model.predict(X_test)

# Print performance
print("\033[1mRidge Regression Performance:\033[0m")
print(f'Mean Absolute Error (MAE): $ {metrics.mean_absolute_error(y_test, pred_Ridge):,.2f}')
print(f'Mean Squared Error  (MSE): {int(metrics.mean_squared_error(y_test, pred_Ridge)):,}')
print(f'R2 Score             (R2): {metrics.r2_score(y_test, pred_Ridge):.4f}')

# Print model execution time
end_time = time.time()
elapsed_time = (end_time - start_time)
print(f"{elapsed_time:.1f} seconds to execute.")

Ridge Regression Performance:
Mean Absolute Error (MAE): $ 2,574.89
Mean Squared Error  (MSE): 15,995,740
R2 Score             (R2): 0.9081
437.4 seconds to execute.


### Lasso Regression 

In [None]:
# Track training time
start_time = time.time()

# Train Lasso regression model
Lasso_model = LassoCV(cv=5, random_state=1)
Lasso_model.fit(X_train,y_train.ravel())
pred_Lasso = Lasso_model.predict(X_test)

# Print performance
print("\033[1mLasso Regression Performance:\033[0m")
print(f'Mean Absolute Error (MAE): $ {metrics.mean_absolute_error(y_test, pred_Lasso):,.2f}')
print(f'Mean Squared Error  (MSE): {int(metrics.mean_squared_error(y_test, pred_Lasso)):,}')
print(f'R2 Score             (R2): {metrics.r2_score(y_test, pred_Lasso):.4f}')

# Print model execution time
end_time = time.time()
elapsed_time = (end_time - start_time)
print(f"{elapsed_time:.1f} seconds to execute.")

Lasso Regression Performance:
Mean Absolute Error (MAE): $ 2,553.84
Mean Squared Error  (MSE): 15,812,539
R2 Score             (R2): 0.9092
255.2 seconds to execute.


## Conclusion - Preliminary Model Comparison

The Random Forest model had the best overall performance, with the lowest error metrics and the highest R2 score, making it a strong candidate for predicting car prices. However, the XGBoost model, despite slightly higher error metrics, demonstrated a competitive R2 score, suggesting it captures a substantial amount of variance. Given XGBoost’s potential for significant improvements through hyperparameter tuning, I have decided to move forward with optimizing both the Random Forest and XGBoost models. The goal is to achieve the highest possible predictive performance by fine-tuning these models and comparing their performance post-optimization.
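
As a rough illustration of what the tuning step could look like, here is a minimal sketch using scikit-learn's `RandomizedSearchCV` on synthetic stand-in data. The search space, `X_demo`, and `y_demo` are assumptions made for the example; they are not the settings used in this project.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in data; the real notebook would pass X_train / y_train.
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(200, 5))
y_demo = 3 * X_demo[:, 0] - 2 * X_demo[:, 1] + rng.normal(scale=0.1, size=200)

# Hypothetical search space, chosen only for illustration.
param_distributions = {
    'n_estimators': [20, 50],
    'max_depth': [3, 6, None],
    'min_samples_leaf': [1, 3, 5],
}

search = RandomizedSearchCV(
    RandomForestRegressor(random_state=1),
    param_distributions=param_distributions,
    n_iter=4,                             # number of sampled combinations
    scoring='neg_mean_absolute_error',    # higher (closer to 0) is better
    cv=3,
    random_state=1,
)
search.fit(X_demo, y_demo)

print(search.best_params_)
print(f'Cross-validated MAE: {-search.best_score_:.3f}')
```

Scoring on cross-validated MAE keeps the test split untouched until the final comparison.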

# New Findings

Below are models that I learned about while trying to optimize the other models. Initially I moved forward with XGBoost, then found that XGBoost has an option to enable categorical columns. Six of my eight features are categorical, so this was a great fit. My optimized XGBoost model performed slightly worse than the one-hot-encoded XGBoost model, but when applied to the validation data it was significantly better. I then had trouble creating my dashboard with that model. While researching the problem, I found CatBoost, a model built specifically for categorical features, and LightGBM, another gradient-boosting model. All of these models can handle the data without encoding, which significantly reduces training time.

## Categorical Models without One-Hot Encoding

#### Process Data for Categorical Models

In [None]:
# Define features and target
features = ['Year', 'Model', 'State', 'Mileage', 'Trim', 'Make', 'Body Style', 'City']
X_categorical = df_cars[features].copy()
y_categorical = df_cars['Price']

# Define categorical features
categorical_features = ['Model', 'State', 'Trim', 'Make', 'Body Style', 'City']

# Format categorical features
X_categorical[categorical_features] = X_categorical[categorical_features].astype('category')

# Scale numerical features
scaler = StandardScaler()
X_categorical[['Year', 'Mileage']] = scaler.fit_transform(X_categorical[['Year', 'Mileage']])
joblib.dump(scaler, 'numerical_scaler.pkl') # Save numerical scaler

# Split the data
train_X, test_X, train_y, test_y = train_test_split(X_categorical, y_categorical, test_size=0.2, random_state=1)

print('Train/Test Data Row & Column Count:', train_X.shape, test_X.shape)
print('\nData processed for categorical model')

Train/Test Data Row & Column Count: (66316, 8) (16579, 8)

Data processed for categorical model


## XGBoost Regressor (enable_categorical=True)

In [None]:
# Track training time
start_time = time.time()

# Train XGBoost regressor (enable_categorical) model
XGBoost_categorical_model = xgb.XGBRegressor(enable_categorical=True, n_jobs=-1)
XGBoost_categorical_model.fit(train_X, train_y.to_numpy())
pred_XGBoost_categorical = XGBoost_categorical_model.predict(test_X)

# Print performance
print("\033[1mCategorical XGBoost Regressor Performance:\033[0m")
print(f'Mean Absolute Error (MAE): $ {metrics.mean_absolute_error(test_y, pred_XGBoost_categorical):,.2f}')
print(f'Mean Squared Error  (MSE): {int(metrics.mean_squared_error(test_y, pred_XGBoost_categorical)):,}')
print(f'R2 Score             (R2): {metrics.r2_score(test_y, pred_XGBoost_categorical):.4f}')

# Print model execution time
end_time = time.time()
elapsed_time = (end_time - start_time)
print(f"{elapsed_time:.1f} seconds to execute.")

# Print performance
print("\033[1m\nEncoded XGBoost Regressor Performance:\033[0m")
print(f'Mean Absolute Error (MAE): $ {metrics.mean_absolute_error(y_test, pred_XGBoost):,.2f}')
print(f'Mean Squared Error  (MSE): {int(metrics.mean_squared_error(y_test, pred_XGBoost)):,}')
print(f'R2 Score             (R2): {metrics.r2_score(y_test, pred_XGBoost):.4f}')

Categorical XGBoost Regressor Performance:
Mean Absolute Error (MAE): $ 1,820.59
Mean Squared Error  (MSE): 9,253,507
R2 Score             (R2): 0.9469
1.4 seconds to execute.

Encoded XGBoost Regressor Performance:
Mean Absolute Error (MAE): $ 2,247.27
Mean Squared Error  (MSE): 11,244,244
R2 Score             (R2): 0.9354


### Apply Categorical XGBoost to Validation Set

In [None]:
# Load validation data
df_validation = pd.read_csv('cleaned_data_aug_16th.csv') 

# Format categorical features
df_validation[categorical_features] = df_validation[categorical_features].astype('category')

# Scale and format numerical features
X_validation = df_validation[features].copy()
X_validation[['Year', 'Mileage']] = scaler.transform(X_validation[['Year', 'Mileage']])
X_validation[['Year', 'Mileage']] = X_validation[['Year', 'Mileage']].astype('float64')

print('Validation Data loaded and processed')
print('\nValidation Data Row and Column Count:', X_validation.shape)

# Predict validation data
pred_xgb = XGBoost_categorical_model.predict(X_validation)

# Define validation target
y_validation = df_validation['Price'].values

# Print validation performance
print('\n\033[1mCategorical XGBoost Regressor Performance on Validation Data from 8/15:\033[0m')
print(f'Mean Absolute Error (MAE): $ {round(metrics.mean_absolute_error(y_validation, pred_xgb), 2):,}')
print(f'Mean Squared Error  (MSE): {int(round(metrics.mean_squared_error(y_validation, pred_xgb))):,}')
print(f'R2 Score             (R2): {round(metrics.r2_score(y_validation, pred_xgb), 4)}')

# Print original performance
print("\033[1m\nCategorical XGBoost Regressor Performance on Original Data from 7/21:\033[0m")
print(f'Mean Absolute Error (MAE): $ {metrics.mean_absolute_error(test_y, pred_XGBoost_categorical):,.2f}')
print(f'Mean Squared Error  (MSE): {int(metrics.mean_squared_error(test_y, pred_XGBoost_categorical)):,}')
print(f'R2 Score             (R2): {metrics.r2_score(test_y, pred_XGBoost_categorical):.4f}')

Validation Data loaded and processed

Validation Data Row and Column Count: (9229, 8)

Categorical XGBoost Regressor Performance on Validation Data from 8/15:
Mean Absolute Error (MAE): $ 8,773.75
Mean Squared Error  (MSE): 164,607,404
R2 Score             (R2): 0.0359

Categorical XGBoost Regressor Performance on Original Data from 7/21:
Mean Absolute Error (MAE): $ 1,820.59
Mean Squared Error  (MSE): 9,253,507
R2 Score             (R2): 0.9469


## Summary

After running into issues with my encoded XGBoost model in the dashboard, I learned more about XGBoost and found the parameter that lets it use categorical columns without encoding, `enable_categorical=True`. Trying it took the number of features from over 4,200 to only 8. The model trained significantly faster and was more accurate without any hyperparameter tuning, so I moved forward with hyperparameter tuning and validation. The model performed extremely well everywhere until I applied it to the validation set. It was unusable on the validation data, which introduced a small number of new models and trims, as shown above where the untuned model is applied to the validation data. I considered going back to my encoded XGBoost model but decided to research whether any other models accept categorical columns without encoding. I came across CatBoost and LightGBM. Below I compare the two models and apply them to the validation data before deciding which model to move forward with.
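
One possible explanation for the validation collapse (an assumption, not confirmed here) is that the validation cell calls `astype('category')` on `df_validation` independently of the training data. Pandas then infers a separate category list, so the same label can map to a different integer code. Depending on the XGBoost version, the model may rely on those codes. The sketch below uses hypothetical toy trims to show the mismatch and one way to reuse the training categories.

```python
import pandas as pd

# Hypothetical toy columns; the real notebook uses train_X and df_validation.
train_trim = pd.Series(['base', 'limited', 'sport']).astype('category')
valid_raw = pd.Series(['sport', 'limited', 'premium'])  # 'premium' is unseen

# Inferring categories independently gives the validation data its own codes.
valid_independent = valid_raw.astype('category')
print(dict(zip(train_trim, train_trim.cat.codes)))
print(dict(zip(valid_raw, valid_independent.cat.codes)))

# Reusing the training categories keeps codes consistent;
# labels never seen in training become missing (code -1).
train_dtype = pd.CategoricalDtype(categories=train_trim.cat.categories)
valid_aligned = valid_raw.astype(train_dtype)
print(dict(zip(valid_raw, valid_aligned.cat.codes)))
```

In the independent version, `limited` moves from code 1 to code 0 and the unseen `premium` takes over code 1, so a model reading raw codes would treat those rows as different trims.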

## CatBoost Regressor

In [None]:
# Track training time
start_time = time.time()

# Train CatBoost regressor model
catboost_model = CatBoostRegressor(cat_features=categorical_features, verbose=0)
catboost_model.fit(train_X,train_y)
pred_catboost = catboost_model.predict(test_X)

# Print performance
print("\033[1mCatBoost Regressor Performance:\033[0m")
print(f'Mean Absolute Error (MAE): $ {metrics.mean_absolute_error(test_y, pred_catboost):,.2f}')
print(f'Mean Squared Error  (MSE): {int(metrics.mean_squared_error(test_y, pred_catboost)):,}')
print(f'R2 Score             (R2): {metrics.r2_score(test_y, pred_catboost):.4f}')

# Print model execution time
end_time = time.time()
elapsed_time = (end_time - start_time)
print(f"{elapsed_time:.1f} seconds to execute.")

# Print feature importance
print("\n\033[1mFeature Importance:\033[0m")
importance_catboost = pd.DataFrame({
    'Feature': train_X.columns,
    'Importance': catboost_model.get_feature_importance()
})
importance_catboost = importance_catboost.sort_values(by='Importance', ascending=False)
print(importance_catboost)

CatBoost Regressor Performance:
Mean Absolute Error (MAE): $ 1,922.83
Mean Squared Error  (MSE): 10,072,576
R2 Score             (R2): 0.9422
14.7 seconds to execute.

Feature Importance:
      Feature  Importance
1       Model   28.995478
0        Year   18.955490
4        Trim   15.433566
3     Mileage   14.942127
5        Make   13.957470
6  Body Style    6.524228
2       State    0.753334
7        City    0.438307


## LightGBM Regressor

In [None]:
# Track training time
start_time = time.time()

# Train LightGBM regressor model
light_model = lgb.LGBMRegressor(n_jobs=-1, verbose=-1)
light_model.fit(train_X, train_y, categorical_feature=categorical_features)
pred_light = light_model.predict(test_X)

# Print performance
print("\033[1mLightGBM Regressor Performance:\033[0m")
print(f'Mean Absolute Error (MAE): $ {metrics.mean_absolute_error(test_y, pred_light):,.2f}')
print(f'Mean Squared Error  (MSE): {int(metrics.mean_squared_error(test_y, pred_light)):,}')
print(f'R2 Score             (R2): {metrics.r2_score(test_y, pred_light):.4f}')

# Print model execution time
end_time = time.time()
elapsed_time = (end_time - start_time)
print(f"{elapsed_time:.1f} seconds to execute.")

# Print feature importance
print("\n\033[1mFeature Importance:\033[0m")
importance_light = pd.DataFrame({
    'Feature': train_X.columns,
    'Importance': light_model.feature_importances_
})
importance_light = importance_light.sort_values(by='Importance', ascending=False)
print(importance_light)

LightGBM Regressor Performance:
Mean Absolute Error (MAE): $ 1,695.51
Mean Squared Error  (MSE): 8,147,484
R2 Score             (R2): 0.9532
0.4 seconds to execute.

Feature Importance:
      Feature  Importance
1       Model         685
4        Trim         607
3     Mileage         569
7        City         533
0        Year         384
2       State         135
6  Body Style          44
5        Make          43


### Apply CatBoost to Validation Set

In [None]:
# Predict validation data
pred_catboost_val = catboost_model.predict(X_validation)

# Print validation performance
print('\n\033[1mCatBoost Regressor Performance on Validation Data from 8/15:\033[0m')
print(f'Mean Absolute Error (MAE): $ {round(metrics.mean_absolute_error(y_validation, pred_catboost_val), 2):,}')
print(f'Mean Squared Error  (MSE): {int(round(metrics.mean_squared_error(y_validation, pred_catboost_val))):,}')
print(f'R2 Score             (R2): {round(metrics.r2_score(y_validation, pred_catboost_val), 4)}')

# Print original performance
print("\033[1m\nCatBoost Regressor Performance on Original Data from 7/21:\033[0m")
print(f'Mean Absolute Error (MAE): $ {metrics.mean_absolute_error(test_y, pred_catboost):,.2f}')
print(f'Mean Squared Error  (MSE): {int(metrics.mean_squared_error(test_y, pred_catboost)):,}')
print(f'R2 Score             (R2): {metrics.r2_score(test_y, pred_catboost):.4f}')


CatBoost Regressor Performance on Validation Data from 8/15:
Mean Absolute Error (MAE): $ 2,371.04
Mean Squared Error  (MSE): 14,785,259
R2 Score             (R2): 0.9134

CatBoost Regressor Performance on Original Data from 7/21:
Mean Absolute Error (MAE): $ 1,922.83
Mean Squared Error  (MSE): 10,072,576
R2 Score             (R2): 0.9422


### Apply LightGBM to Validation Set

In [None]:
# Predict validation data
pred_light_val = light_model.predict(X_validation)

# Print validation performance
print('\n\033[1mLightGBM Regressor Performance on Validation Data from 8/15:\033[0m')
print(f'Mean Absolute Error (MAE): $ {round(metrics.mean_absolute_error(y_validation, pred_light_val), 2):,}')
print(f'Mean Squared Error  (MSE): {int(round(metrics.mean_squared_error(y_validation, pred_light_val))):,}')
print(f'R2 Score             (R2): {round(metrics.r2_score(y_validation, pred_light_val), 4)}')

# Print original performance
print("\033[1m\nLightGBM Regressor Performance on Original Data from 7/21:\033[0m")
print(f'Mean Absolute Error (MAE): $ {metrics.mean_absolute_error(test_y, pred_light):,.2f}')
print(f'Mean Squared Error  (MSE): {int(metrics.mean_squared_error(test_y, pred_light)):,}')
print(f'R2 Score             (R2): {metrics.r2_score(test_y, pred_light):.4f}')


LightGBM Regressor Performance on Validation Data from 8/15:
Mean Absolute Error (MAE): $ 1,898.75
Mean Squared Error  (MSE): 10,302,178
R2 Score             (R2): 0.9397

LightGBM Regressor Performance on Original Data from 7/21:
Mean Absolute Error (MAE): $ 1,695.51
Mean Squared Error  (MSE): 8,147,484
R2 Score             (R2): 0.9532


## Conclusion

These datasets are significantly simplified, with only 8 features and no encoding. All models train very efficiently. LightGBM has the best initial performance. I will move forward with these models and measure their performance on new validation data to find the best-performing model.
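
To make the comparison concrete, the drop from the 7/21 test split to the validation data can be expressed as a relative increase in MAE. The sketch below only re-uses the MAE values printed in the outputs above. `relative_increase` is a hypothetical helper name.

```python
# MAE values copied from the outputs above: (7/21 test split, validation data).
mae_results = {
    'XGBoost (categorical)': (1820.59, 8773.75),
    'CatBoost': (1922.83, 2371.04),
    'LightGBM': (1695.51, 1898.75),
}

def relative_increase(test_mae, validation_mae):
    """Percentage change in MAE from the test split to the validation data."""
    return 100 * (validation_mae - test_mae) / test_mae

degradation = {
    name: relative_increase(test_mae, val_mae)
    for name, (test_mae, val_mae) in mae_results.items()
}

# Smallest increase first: the model that held up best on new data.
for name, pct in sorted(degradation.items(), key=lambda kv: kv[1]):
    print(f'{name:<22} +{pct:.1f}%')
```

By this measure LightGBM degrades the least on the validation data, which is consistent with it also having the lowest validation MAE.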