# Model Validation
 
### Objective:
This notebook validates the LightGBM model on new data collected about three weeks after the original training data. The goal is to assess how accurately the model predicts car prices as the market shifts, and whether it is suitable as a vehicle value estimation tool.

### Summary:
In this notebook, the LightGBM model, originally trained on data from July 26th, was validated on new data from August 16th. Validation involved applying the scaler fitted on the training data to the new data's numeric features (`Year` and `Mileage`) and comparing performance metrics between the two datasets. After three weeks of market changes, the model maintained robust predictive accuracy, with only small increases in its error metrics.

#### **Key Findings**:
- **Mean Absolute Error (MAE)**:
  - Original Data (July 26th): `$1,609.03`
  - Validation Data (August 16th): `$1,805.08`
  - **Comparison**: The MAE increased by `$196.05` (about 12%), a modest degradation that is consistent with market movement between the two collection dates.

<br>

- **Mean Squared Error (MSE)**:
  - Original Data (July 26th): `7,441,905`
  - Validation Data (August 16th): `8,084,573`
  - **Comparison**: The MSE rose by `642,668` (about 8.6%), a smaller relative increase than the MAE.

<br>

- **R² Score**:
  - Original Data (July 26th): `0.9573`
  - Validation Data (August 16th): `0.9527`
  - **Comparison**: The R² score dropped by only `0.0046`, so the model retained nearly all of its explanatory power on the newer data.

<br>

- **Data Set Size Consideration:**
  - The validation set consisted of `9,229` samples, compared to `66,282` for training and `16,571` for testing in the original data. The smaller size of the validation set could contribute to higher variability in performance metrics, as it may not fully capture the range of conditions seen in the larger training set.
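
The comparisons above can be reproduced from the reported numbers alone. The following is a minimal sketch, not part of the original analysis: the `reported` dictionary and the `change` helper are illustrative names, and the values are copied from the findings listed here. It also derives the RMSE, which puts the squared error back into dollars.

```python
# Metrics copied from the Key Findings: (original 7/26, validation 8/16)
reported = {
    "MAE": (1609.03, 1805.08),
    "MSE": (7_441_905, 8_084_573),
    "R2": (0.9573, 0.9527),
}


def change(original, validation):
    """Return (absolute change, relative change in percent)."""
    delta = validation - original
    return delta, 100 * delta / original


for name, (orig, val) in reported.items():
    delta, pct = change(orig, val)
    print(f"{name}: {delta:+,.4f} ({pct:+.2f}%)")

# RMSE is the square root of the MSE, expressed in dollars like the MAE.
rmse_original = reported["MSE"][0] ** 0.5
rmse_validation = reported["MSE"][1] ** 0.5
print(f"RMSE: {rmse_original:,.0f} -> {rmse_validation:,.0f}")
```

Reporting relative changes alongside absolute ones makes the MAE and MSE movements comparable, since the two metrics are on very different scales.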

In [None]:
# Necessary libraries
import pandas as pd
import joblib
from sklearn import metrics

### Load Models

In [None]:
# Load final model and final scaler
final_model = joblib.load('final_model.pkl')
final_scaler = joblib.load('final_scaler.pkl')

print('Final model and scaler loaded')

Final model and scaler loaded


### Load and Process Validation Data

In [None]:
# Load validation data
df_validation = pd.read_csv('cleaned_data_aug_16th.csv') 

# Define validation features and categorical features
features = ['Year', 'Model', 'State', 'Mileage', 'Trim', 'Make', 'Body Style']
cat_features = ['Model', 'State', 'Trim', 'Make', 'Body Style']

# Format non-numerical columns as 'category'
df_validation[cat_features] = df_validation[cat_features].astype('category')

# Scale and format numerical features with scaler
X_validation = df_validation[features].copy()
X_validation[['Year', 'Mileage']] = final_scaler.transform(X_validation[['Year', 'Mileage']])
X_validation[['Year', 'Mileage']] = X_validation[['Year', 'Mileage']].astype('float64')

# Define validation target 
y_validation = df_validation['Price'].values

# Print validation summary statistics
print('Data loaded & processed\n')
print('Validation Data Row and Column Count:', df_validation.shape)
print('\nValidation Data Summary Statistics:')
print(round(df_validation.describe(),1))

Data loaded & processed

Validation Data Row and Column Count: (9229, 11)

Validation Data Summary Statistics:
         Year     Price   Mileage
count  9229.0    9229.0    9229.0
mean   2019.9   26628.8   55274.0
std       3.0   13067.6   39231.3
min    2010.0    2950.0    1002.0
25%    2018.0   17998.0   24565.0
50%    2021.0   23990.0   47128.0
75%    2022.0   31702.0   78890.0
max    2024.0  109890.0  270964.0
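
Because the categorical columns are cast to `category` independently of the training data, it may be worth checking whether the August data contains makes, models, or trims that the model never saw in July. The sketch below is an assumption-laden illustration rather than part of the original pipeline: `unseen_category_report` is a hypothetical helper, and the toy dictionaries stand in for the July training frame (which this notebook does not load) and `df_validation`.

```python
def unseen_category_report(reference, new, cat_cols):
    """Count values in `new` whose category never appears in `reference`.

    Both arguments only need to support `obj[col]` returning an iterable,
    so pandas DataFrames and plain dicts of lists both work.
    """
    report = {}
    for col in cat_cols:
        known = set(reference[col])
        values = list(new[col])
        unseen = [v for v in values if v not in known]
        report[col] = {
            "unseen_rows": len(unseen),
            "share": len(unseen) / len(values) if values else 0.0,
            # str() keeps sorting safe if a column mixes types
            "examples": sorted(set(map(str, unseen)))[:5],
        }
    return report


# Toy data standing in for the July training frame and df_validation.
train_toy = {"Make": ["Ford", "Toyota", "Honda"], "State": ["TX", "CA", "TX"]}
new_toy = {
    "Make": ["Ford", "Rivian", "Honda", "Rivian"],
    "State": ["TX", "CA", "CA", "TX"],
}

report = unseen_category_report(train_toy, new_toy, ["Make", "State"])
for col, stats in report.items():
    print(col, stats)
```

In the notebook itself, the call would look like `unseen_category_report(df_train, df_validation, cat_features)`, where `df_train` is a placeholder for the original July data. A high share of unseen values in a column would be one candidate explanation for degraded predictions.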


## Final Model on Validation Data

In [None]:
# Predict validation data using final model
pred_final = final_model.predict(X_validation)

# Print validation data performance
print('\n\033[1mOptimized LightGBM Regressor Performance on Validation Data from 8/16:\033[0m')
print(f'Mean Absolute Error (MAE): $ {round(metrics.mean_absolute_error(y_validation, pred_final), 2):,}')
print(f'Mean Squared Error  (MSE): {int(round(metrics.mean_squared_error(y_validation, pred_final))):,}')
print(f'R² Score             (R²): {round(metrics.r2_score(y_validation, pred_final), 4)}')

# Print optimized lightgbm model on original data performance
print('\n\033[1mOptimized LightGBM Regressor Performance on Original Data from 7/26:\033[0m')
print('Mean Absolute Error (MAE): $ 1,609.03')
print('Mean Squared Error  (MSE): 7,441,905')
print('R² Score             (R²): 0.9573')

# Print preliminary lightgbm model on original data performance
print('\n\033[1mPreliminary LightGBM Regressor Performance on Original Data from 7/26:\033[0m')
print('Mean Absolute Error (MAE): $ 1,695.51')
print('Mean Squared Error  (MSE): 8,147,484')
print('R² Score             (R²): 0.9532')


Optimized LightGBM Regressor Performance on Validation Data from 8/16:
Mean Absolute Error (MAE): $ 1,805.08
Mean Squared Error  (MSE): 8,084,573
R² Score             (R²): 0.9527

Optimized LightGBM Regressor Performance on Original Data from 7/26:
Mean Absolute Error (MAE): $ 1,609.03
Mean Squared Error  (MSE): 7,441,905
R² Score             (R²): 0.9573

Preliminary LightGBM Regressor Performance on Original Data from 7/26:
Mean Absolute Error (MAE): $ 1,695.51
Mean Squared Error  (MSE): 8,084,573
R² Score             (R²): 0.9532
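
Since the validation set is much smaller than the training set, a single MAE value says little about how much of the gap could be sampling noise. One way to gauge this is a percentile bootstrap over the absolute errors. The sketch below is illustrative only: `bootstrap_mae_interval` is a hypothetical helper, the data is synthetic rather than the notebook's, and the resample count and seed are arbitrary choices.

```python
import random


def bootstrap_mae_interval(y_true, y_pred, n_resamples=1000, alpha=0.05, seed=0):
    """Return (MAE, lower, upper) using a percentile bootstrap interval."""
    abs_errors = [abs(t - p) for t, p in zip(y_true, y_pred)]
    n = len(abs_errors)
    rng = random.Random(seed)
    # Resample the absolute errors with replacement and record each MAE.
    resampled = sorted(
        sum(rng.choices(abs_errors, k=n)) / n for _ in range(n_resamples)
    )
    lower = resampled[int((alpha / 2) * n_resamples)]
    upper = resampled[int((1 - alpha / 2) * n_resamples) - 1]
    return sum(abs_errors) / n, lower, upper


# Synthetic prices and predictions, not the notebook's data.
rng = random.Random(42)
y_true_toy = [rng.uniform(5_000, 60_000) for _ in range(500)]
y_pred_toy = [y + rng.gauss(0, 2_000) for y in y_true_toy]

mae, low, high = bootstrap_mae_interval(y_true_toy, y_pred_toy)
print(f"MAE: {mae:,.0f} (95% interval: {low:,.0f} to {high:,.0f})")
```

In this notebook the call would be `bootstrap_mae_interval(y_validation, pred_final)`. If the original MAE of `$1,609.03` fell inside the resulting interval, the increase could plausibly be noise; if it fell well outside, a real shift would be the more likely explanation.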


## **Conclusion**:
The LightGBM model held up well on data collected three weeks after training: MAE rose from `$1,609.03` to `$1,805.08` and R² fell from `0.9573` to `0.9527`. Performance on the validation data remains within an acceptable range, which suggests the model is suitable for continued use in car price prediction for dealerships. Regular retraining on fresh data will help maintain accuracy over time, especially in dynamic market conditions. These results support the LightGBM model as an effective tool for informing vehicle pricing strategies based on historical data.