## Summary

### Key Findings
- **Random Forest:**
  - **Mean Absolute Error (MAE):** 1,828.46
  - **Mean Squared Error  (MSE):** 8,429,433.71
  - **R2 Score             (R2):** 0.9513
  - **Performance Summary:** Random Forest outperformed the other models overall, with the lowest MAE and MSE and the highest R2 score on the held-out test set.


- **XGBoost:**
  - **Mean Absolute Error (MAE):** 2,216.44
  - **Mean Squared Error  (MSE):** 10,367,387.06
  - **R2 Score             (R2):** 0.9402
  - **Performance Summary:** XGBoost showed competitive performance with a slightly lower R2 score compared to Random Forest. However, it is known for responding well to hyperparameter tuning, which could potentially improve its performance.

### Conclusion
The Random Forest model had the best overall performance, with the lowest error metrics and the highest R2 score, making it a strong candidate for predicting car prices. However, the XGBoost model, despite slightly higher error metrics, demonstrated a competitive R2 score, suggesting it captures a substantial amount of variance. Given XGBoost’s potential for significant improvements through hyperparameter tuning, I have decided to move forward with optimizing both the Random Forest and XGBoost models. The goal is to achieve the highest possible predictive performance by fine-tuning these models and comparing their performance post-optimization.
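The tuning step described above could be approached with a randomized search over hyperparameters. The following is a minimal sketch, not the optimization actually used in this project: it runs on synthetic data from `make_regression` as a stand-in for the car data, and the search space in `param_distributions` is a hypothetical, illustrative choice. Because `xgb.XGBRegressor` follows the scikit-learn estimator interface, the same pattern should carry over to XGBoost with its own parameter names.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# Synthetic stand-in data (assumption: the real X_train / y_train would be used instead)
X, y = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

# Hypothetical search space; the values are illustrative, not tuned for the car data
param_distributions = {
    "n_estimators": [50, 100, 200],
    "max_depth": [None, 5, 10, 20],
    "min_samples_leaf": [1, 2, 4],
    "max_features": [1.0, "sqrt"],
}

# Sample 5 parameter combinations and score each with 3-fold cross-validation on MAE
search = RandomizedSearchCV(
    estimator=RandomForestRegressor(random_state=1),
    param_distributions=param_distributions,
    n_iter=5,
    scoring="neg_mean_absolute_error",
    cv=3,
    random_state=1,
)
search.fit(X_train, y_train)

best_params = search.best_params_
cv_mae = -search.best_score_  # the scorer returns negative MAE, so flip the sign
test_mae = float(np.mean(np.abs(y_test - search.predict(X_test))))

print("Best parameters:", best_params)
print(f"Cross-validated MAE: {cv_mae:.2f}")
print(f"Test MAE: {test_mae:.2f}")
```

Scoring on MAE keeps the search aligned with the metric reported in the findings. The test set is touched only once, after the search has picked its parameters.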


In [None]:
import time
from datetime import datetime
from pathlib import Path
import pandas as pd
import numpy as np
import xgboost as xgb
import seaborn as sns
from sklearnex import patch_sklearn 
patch_sklearn()
from sklearn.linear_model import LinearRegression, Lasso, Ridge, LassoCV, RidgeCV
from sklearn import metrics, svm, preprocessing
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR, LinearSVR
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report
from sklearn.ensemble import RandomForestRegressor

Intel(R) Extension for Scikit-learn* enabled (https://github.com/intel/scikit-learn-intelex)


## Processing Data
#### 80% train / 20% test

In [None]:
data_path = Path('path_name')
car_data_df = pd.read_csv(data_path / 'car_data_cleaned.csv')

car_data_df.drop(columns=['Listing ID'], inplace=True) # Listing IDs are identifiers and carry no price information
car_data_df.drop(columns=['Stock Type'], inplace=True) # Every listing has stock type 'used', so the column is constant

seven_features = ['Year', 'Model', 'State', 'Mileage', 'Trim', 'Make', 'Body Style']

encoded_df = pd.get_dummies(car_data_df[seven_features], drop_first=False) # one-hot encode the categorical columns once
X = encoded_df.values
y = car_data_df['Price'].values.reshape(-1,1)

column_names = encoded_df # keeps the encoded column labels, which the NumPy array X does not carry

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X) # returns a NumPy array without column names
X_scaled = pd.DataFrame(X_scaled, columns=column_names.columns) # wrap in a DataFrame again so the column names are kept

# splitting data
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=.2, random_state=1)

print('Data processed')

Data processed


# Model Comparison

## Linear Regression

#### Baseline model

In [None]:
start_time = time.time()

# Train model
Linear_model = LinearRegression()
Linear_model.fit(X_train,y_train.ravel())
pred_linear = Linear_model.predict(X_test)

# Print performance
print("\033[1mLinear Regression Performance:\033[0m")
print("Mean Absolute Error (MAE): ", metrics.mean_absolute_error(y_test, pred_linear))
print("Mean Squared Error  (MSE): ", metrics.mean_squared_error(y_test, pred_linear))
print("R2 Score             (R2): ", metrics.r2_score(y_test, pred_linear))

end_time = time.time()
elapsed_time = (end_time - start_time)
print(f"{elapsed_time:.1f} seconds to execute.")
print('\n')

# Print coefficients
print('intercept ', Linear_model.intercept_)
print(pd.DataFrame({'Predictor': X_scaled.columns, 'coefficient': Linear_model.coef_}))

Linear Regression Performance:
Mean Absolute Error (MAE):  1.2229085108027732e+138
Mean Squared Error  (MSE):  1.0536039679942167e+280
R2 Score             (R2):  -6.081423121374969e+271
5.2 seconds to execute.

intercept  9.779237449219169e+120
                    Predictor    coefficient
0                        Year   4.388868e+03
1                     Mileage  -3.661332e+03
2                  Model_370z  -1.794623e+14
3               Model_4runner   2.065697e+14
4                    Model_86   1.473291e+13
..                        ...            ...
975  Body Style_passenger van -3.652301e+118
976   Body Style_pickup truck -1.247812e+119
977          Body Style_sedan -1.292097e+119
978            Body Style_suv -1.627791e+119
979          Body Style_wagon -2.659945e+118

[980 rows x 2 columns]
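The enormous errors and coefficients above are consistent with an ill-conditioned design matrix. One plausible cause (an assumption, not verified against this dataset) is that `drop_first=False` keeps every dummy column, so the dummies of each categorical feature sum to one and are perfectly collinear with the intercept. The sketch below shows the effect on a small hypothetical DataFrame; `toy_df` and `design_matrix_rank` are illustrative names, not part of the notebook.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values) with one numeric and one categorical column
toy_df = pd.DataFrame({
    "Mileage": [10_000, 52_000, 31_000, 78_000, 45_000, 23_000],
    "Body Style": ["sedan", "suv", "wagon", "sedan", "suv", "wagon"],
})

def design_matrix_rank(drop_first):
    """Return (rank, number of columns) of the design matrix including an intercept column."""
    encoded = pd.get_dummies(toy_df, drop_first=drop_first).astype(float)
    design = np.column_stack([np.ones(len(encoded)), encoded.to_numpy()])
    return int(np.linalg.matrix_rank(design)), design.shape[1]

rank_full, cols_full = design_matrix_rank(drop_first=False)
rank_drop, cols_drop = design_matrix_rank(drop_first=True)

print(f"drop_first=False: rank {rank_full} of {cols_full} columns")
print(f"drop_first=True:  rank {rank_drop} of {cols_drop} columns")
```

On this toy data the matrix with all dummies is rank-deficient, while dropping the first level restores full rank. The well-behaved Ridge and Lasso results later in the notebook would fit this explanation, since regularization keeps the coefficients bounded even when columns are collinear.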


## Random Forest Regression

In [None]:
start_time = time.time()

# Train model
RF_model = RandomForestRegressor(n_jobs=-1)
RF_model.fit(X_train,y_train.ravel())
pred_RF = RF_model.predict(X_test)

# Print performance
print("\033[1mRandom Forest Regression Performance:\033[0m")
print("Mean Absolute Error (MAE): ", metrics.mean_absolute_error(y_test, pred_RF))
print("Mean Squared Error  (MSE): ", metrics.mean_squared_error(y_test, pred_RF))
print("R2 Score             (R2): ", metrics.r2_score(y_test, pred_RF))

end_time = time.time()
elapsed_time = (end_time - start_time)
print(f"{elapsed_time:.1f} seconds to execute.")

Random Forest Regression Performance:
Mean Absolute Error (MAE):  1828.4619247116739
Mean Squared Error  (MSE):  8429433.705244176
R2 Score             (R2):  0.9513451404964234
61.0 seconds to execute.


## XGBoost Regression

In [None]:
start_time = time.time()

# Train model
XGBoost_model = xgb.XGBRegressor()
XGBoost_model.fit(X_train, y_train.ravel())
pred_XGBoost = XGBoost_model.predict(X_test)

# Print performance
print("\033[1mXGBoost Regression Performance:\033[0m")
print("Mean Absolute Error (MAE): ", metrics.mean_absolute_error(y_test, pred_XGBoost))
print("Mean Squared Error  (MSE): ", metrics.mean_squared_error(y_test, pred_XGBoost))
print("R2 Score             (R2): ", metrics.r2_score(y_test, pred_XGBoost))

end_time = time.time()
elapsed_time = (end_time - start_time)
print(f"{elapsed_time:.1f} seconds to execute.")

XGBoost Regression Performance:
Mean Absolute Error (MAE):  2216.44419821337
Mean Squared Error  (MSE):  10367387.055782594
R2 Score             (R2):  0.9401592350973133
2.0 seconds to execute.


## Ridge Regression

In [None]:
start_time = time.time()

# Train model (RidgeCV selects alpha by cross-validation instead of using one fixed alpha as Ridge does)
Ridge_model = RidgeCV(alphas=np.logspace(-6, 6, 13), cv=5) # candidate alphas: 13 log-spaced values from 1e-6 to 1e6
Ridge_model.fit(X_train,y_train.ravel())
pred_Ridge = Ridge_model.predict(X_test)

# Print performance
print("\033[1mRidge Regression Performance:\033[0m")
print("Mean Absolute Error (MAE): ", metrics.mean_absolute_error(y_test, pred_Ridge))
print("Mean Squared Error  (MSE): ", metrics.mean_squared_error(y_test, pred_Ridge))
print("R2 Score             (R2): ", metrics.r2_score(y_test, pred_Ridge))

end_time = time.time()
elapsed_time = (end_time - start_time)
print(f"{elapsed_time:.1f} seconds to execute.")

Ridge Regression Performance:
Mean Absolute Error (MAE):  2565.9715533219355
Mean Squared Error  (MSE):  15555801.976605577
R2 Score             (R2):  0.9102116006717822
32.6 seconds to execute.


## Lasso Regression 

In [None]:
start_time = time.time()

# Train model
Lasso_model = LassoCV(cv=5, random_state=1)
Lasso_model.fit(X_train,y_train.ravel())
pred_Lasso = Lasso_model.predict(X_test)

print("\033[1mLasso Regression Performance:\033[0m")
print("Mean Absolute Error (MAE): ", metrics.mean_absolute_error(y_test, pred_Lasso))
print("Mean Squared Error  (MSE): ", metrics.mean_squared_error(y_test, pred_Lasso))
print("R2 Score             (R2): ", metrics.r2_score(y_test, pred_Lasso))

end_time = time.time()
elapsed_time = (end_time - start_time)
print(f"{elapsed_time:.1f} seconds to execute.")

Lasso Regression Performance:
Mean Absolute Error (MAE):  2563.41953317199
Mean Squared Error  (MSE):  15539556.65403859
R2 Score             (R2):  0.9103053689977132
31.2 seconds to execute.


## Support Vector Regression

In [None]:
start_time = time.time()

# Train model
SVR_model = SVR()
SVR_model.fit(X_train,y_train.ravel())
pred_SVR = SVR_model.predict(X_test)

print("\033[1mSupport Vector Regression Performance:\033[0m")
print("Mean Absolute Error (MAE): ", metrics.mean_absolute_error(y_test, pred_SVR))
print("Mean Squared Error  (MSE): ", metrics.mean_squared_error(y_test, pred_SVR))
print("R2 Score             (R2): ", metrics.r2_score(y_test, pred_SVR))

end_time = time.time()
elapsed_time = (end_time - start_time)
print(f"{elapsed_time:.1f} seconds to execute.")

Support Vector Regression Performance:
Mean Absolute Error (MAE):  9320.233042386373
Mean Squared Error  (MSE):  175549010.58708507
R2 Score             (R2):  -0.013272391097010505
59.7 seconds to execute.
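
The per-model metric printouts above can also be collected into a single table for side-by-side comparison. This is a minimal sketch under stated assumptions: `evaluate_model` is a hypothetical helper, the data is synthetic, and only two models are included to keep it short. In the notebook, the existing `X_train`, `X_test`, `y_train.ravel()` and `y_test` would take the place of the synthetic split.

```python
import pandas as pd
from sklearn import metrics
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

def evaluate_model(name, model, X_train, X_test, y_train, y_test):
    """Fit a model and return its test-set metrics as a dict (hypothetical helper)."""
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    return {
        "Model": name,
        "MAE": metrics.mean_absolute_error(y_test, pred),
        "MSE": metrics.mean_squared_error(y_test, pred),
        "R2": metrics.r2_score(y_test, pred),
    }

# Synthetic stand-in data (assumption: replace with the notebook's own split)
X, y = make_regression(n_samples=200, n_features=5, noise=5.0, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

models = {
    "Ridge": Ridge(alpha=1.0),
    "Random Forest": RandomForestRegressor(n_estimators=50, random_state=1),
}

# One row per model, sorted so the lowest MAE comes first
results_df = pd.DataFrame(
    [evaluate_model(name, model, X_train, X_test, y_train, y_test)
     for name, model in models.items()]
).sort_values("MAE").reset_index(drop=True)

print(results_df)
```

Fixing `random_state` on the Random Forest makes its row reproducible between runs, which the default constructor does not guarantee.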
