### **Objective**
Validate the performance of the XGBoost model on new data collected nearly a month after the original training data. The goal is to assess how robustly the model predicts car prices under significant market fluctuations, and whether it remains effective as a pricing optimization tool over time.

### **Summary**
In this notebook, the XGBoost model trained on data from July 21st is evaluated on new data from August 16th. Because the two snapshots are nearly a month apart, market changes were expected to affect the model’s performance. The validation aligns the one-hot encoded columns of the new data with the original training features, scales them with the saved scaler, and then compares the model's predictive accuracy on both datasets.

### **Key Findings**
- **Mean Absolute Error (MAE):**
  - Original Data (July 21st): **\$1,680.10**
  - Validation Data (August 16th): **\$2,760.08**
  - **Comparison:** The MAE increased by \$1,079.98, reflecting market fluctuations over the past month. The error remains within a reasonable range, representing approximately **21.12%** of the standard deviation of prices in the validation set.


<br>

- **Mean Squared Error (MSE):**
  - Original Data (July 21st): **6,899,667.86**
  - Validation Data (August 16th): **21,093,686.75**
  - **Comparison:** The MSE more than tripled, highlighting a significant rise in larger errors. This increase underscores the effect of market shifts on the model's prediction accuracy.

<br>

- **R2 Score:**
  - Original Data (July 21st): **0.9604**
  - Validation Data (August 16th): **0.8765**
  - **Comparison:** The R2 score dropped by 0.084, indicating a reduction in the model's ability to explain the variance in car prices. Despite this, the R2 score remains relatively strong, suggesting that the model still captures most of the variance in the data.

- **Data Set Size Consideration:**
  - The validation set consisted of **9,229** samples, compared to **66,282** for training and **16,571** for testing in the original data. The smaller size of the validation set could contribute to higher variability in performance metrics, as it may not fully capture the range of conditions seen in the larger training set.
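
The comparison figures quoted in the findings above can be re-derived from the raw metric values printed by the notebook. The snippet below is a minimal sketch of that arithmetic; the variable names are illustrative, and the RMSE lines are an addition not reported in the original findings, included only because RMSE expresses the MSE in the same units as the price.

```python
import math

# Raw values copied from the notebook output and the describe() table.
mae_original, mae_validation = 1680.1063897516956, 2760.0871583883004
mse_original, mse_validation = 6899667.861436898, 21093686.751713946
r2_original, r2_validation = 0.9603779299684401, 0.8764603568996824
price_std_validation = 13067.6  # std of Price in the validation set

# Derived comparisons used in the findings.
mae_increase = mae_validation - mae_original
mae_share_of_std = mae_validation / price_std_validation
mse_ratio = mse_validation / mse_original
r2_drop = r2_original - r2_validation

# RMSE (assumption: not part of the original report) puts MSE on the price scale.
rmse_original = math.sqrt(mse_original)
rmse_validation = math.sqrt(mse_validation)

print(f"MAE increase:            {mae_increase:,.2f}")
print(f"MAE as share of std:     {mae_share_of_std:.2%}")
print(f"MSE ratio (new / old):   {mse_ratio:.2f}")
print(f"R2 drop:                 {r2_drop:.4f}")
print(f"RMSE original / new:     {rmse_original:,.2f} / {rmse_validation:,.2f}")
```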

### **Conclusion**
The validation shows that the XGBoost model loses accuracy on data collected almost a month after training, yet still provides reasonably accurate predictions: MAE rose from \$1,680.10 to \$2,760.08, and the R2 score fell from about 0.96 to about 0.88. The increase in error is consistent with market changes over that period, and the smaller validation set may add some variability to the metrics. The model’s performance nonetheless remains within an acceptable range, supporting its continued use as a pricing optimization tool. To sustain accuracy, the model should be retrained periodically with updated data, particularly in rapidly changing market environments.
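
One way to make the retraining recommendation concrete is a simple drift check on MAE. The sketch below is hypothetical: the function name `needs_retraining` and the `max_relative_increase` threshold are assumptions introduced for illustration, not part of the original notebook, and the right threshold is a business decision.

```python
def needs_retraining(baseline_mae, current_mae, max_relative_increase):
    """Return True when MAE has grown by more than the tolerated fraction.

    Hypothetical helper: the threshold is a policy choice, not a value
    taken from the notebook.
    """
    if baseline_mae <= 0:
        raise ValueError("baseline_mae must be positive")
    relative_increase = (current_mae - baseline_mae) / baseline_mae
    return relative_increase > max_relative_increase


# MAE values from this notebook (July 21st baseline vs. August 16th data).
baseline_mae = 1680.11
current_mae = 2760.09

# The relative increase here is roughly 64%, so the answer depends on the threshold.
strict = needs_retraining(baseline_mae, current_mae, max_relative_increase=0.50)
lenient = needs_retraining(baseline_mae, current_mae, max_relative_increase=0.75)
print("Retrain under a 50% tolerance:", strict)
print("Retrain under a 75% tolerance:", lenient)
```

A relative threshold is used rather than an absolute dollar amount so the same check could carry over to datasets with a different price level.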


In [1]:
from pathlib import Path
import pandas as pd
import joblib
import time
from datetime import datetime
from sklearnex import patch_sklearn
patch_sklearn()
import xgboost as xgb
from sklearn import metrics
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split, StratifiedKFold
from sklearn.metrics import accuracy_score, r2_score

Intel(R) Extension for Scikit-learn* enabled (https://github.com/intel/scikit-learn-intelex)


In [2]:
df_validation = pd.read_csv('cleaned_data_aug_16th.csv') 
print('Data imported')

Data imported


In [13]:
# Drop irrelevant columns
# errors='ignore' keeps this cell safe to re-run once the columns are gone
df_validation = df_validation.drop(columns=['Listing ID', 'Stock Type'], errors='ignore')

print("Number of Rows and Columns:", df_validation.shape)
print("\n")

print("Summary Statistics")
print(round(df_validation.describe(),1))

Number of Rows and Columns: (9229, 9)


Summary Statistics
         Year     Price   Mileage
count  9229.0    9229.0    9229.0
mean   2019.9   26628.8   55274.0
std       3.0   13067.6   39231.3
min    2010.0    2950.0    1002.0
25%    2018.0   17998.0   24565.0
50%    2021.0   23990.0   47128.0
75%    2022.0   31702.0   78890.0
max    2024.0  109890.0  270964.0


### Load Model

In [21]:
# Load the model from the file
cars_xgb_model = joblib.load('cars_xgb_model.pkl')

# Load the saved scaler
scaler = joblib.load('cars_scaler.pkl')

print('Model loaded')

Model loaded


In [25]:
# Define features
features = ['Year', 'Model', 'State', 'Mileage', 'Trim', 'Make', 'Body Style']

# Prepare validation data
X_validation = pd.get_dummies(df_validation[features], drop_first=False)

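# Align the validation columns with the columns the scaler was fitted on:
# training categories absent from the new data become all-zero columns,
# and categories that only appear in the new data are dropped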
X_validation = X_validation.reindex(columns=scaler.feature_names_in_, fill_value=0)

# Scale the validation data using the loaded scaler
X_validation_scaled = scaler.transform(X_validation)

# Convert back to DataFrame to maintain column names (optional)
X_validation_scaled = pd.DataFrame(X_validation_scaled, columns=scaler.feature_names_in_)

# Make predictions on the validation set
y_pred = cars_xgb_model.predict(X_validation_scaled)

# Actual prices from the 'Price' column of the validation data
y_validation = df_validation['Price'].values

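# Metrics for the original data are hard-coded here for side-by-side comparison;
# they are not recomputed in this notebook
print("\n\033[1mModel Performance on Original Data:\033[0m")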
print('Mean Absolute Error (MAE):  1680.1063897516956')
print('Mean Squared Error  (MSE):  6899667.861436898')
print('R2 Score             (R2):  0.9603779299684401')

print("\n\033[1mModel Performance on Validation Set from 8/16:\033[0m")
print("Mean Absolute Error (MAE): ", metrics.mean_absolute_error(y_validation, y_pred))
print("Mean Squared Error  (MSE): ", metrics.mean_squared_error(y_validation, y_pred))
print("R2 Score             (R2): ", metrics.r2_score(y_validation, y_pred))


Model Performance on Original Data:
Mean Absolute Error (MAE):  1680.1063897516956
Mean Squared Error  (MSE):  6899667.861436898
R2 Score             (R2):  0.9603779299684401

Model Performance on Validation Set from 8/16:
Mean Absolute Error (MAE):  2760.0871583883004
Mean Squared Error  (MSE):  21093686.751713946
R2 Score             (R2):  0.8764603568996824
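
The column alignment step used above silently drops categories that did not exist at training time, which is one plausible source of extra error on newer listings. The toy example below sketches that behavior. The `training_columns` list stands in for `scaler.feature_names_in_`, and the data and category names are made up for illustration. `dtype=int` is passed to `get_dummies` here only to keep the toy columns numeric; the notebook cell does not set it.

```python
import pandas as pd

# Hypothetical training-time columns (stand-in for scaler.feature_names_in_).
training_columns = ["Year", "Mileage", "Make_Ford", "Make_Honda"]

# Toy validation rows; "Tesla" and "Toyota" were never seen during training.
toy_validation = pd.DataFrame({
    "Year": [2020, 2022, 2021],
    "Mileage": [30000, 12000, 45000],
    "Make": ["Ford", "Tesla", "Toyota"],
})

# One-hot encode, then force the exact training column set and order.
encoded = pd.get_dummies(toy_validation, drop_first=False, dtype=int)
aligned = encoded.reindex(columns=training_columns, fill_value=0)

# Dummy columns created by the new data that reindex discards.
unseen_columns = sorted(set(encoded.columns) - set(training_columns))

# Rows whose Make has no training-time column end up with all-zero Make dummies.
make_columns = [c for c in training_columns if c.startswith("Make_")]
rows_without_known_make = int((aligned[make_columns].sum(axis=1) == 0).sum())

print(aligned)
print("Dropped columns:", unseen_columns)
print("Rows with no known Make:", rows_without_known_make)
```

Counting such rows on the real validation set would indicate how much of the data the model has to price without a usable category signal.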
