# Model Optimization

### Objective:
The objective of this notebook is to optimize the LightGBM model by tuning hyperparameters and improving its performance on car price predictions. The optimization process aims to reduce error metrics like Mean Absolute Error (MAE) and Mean Squared Error (MSE).

### Summary:
In this notebook, the LightGBM model was tuned with RandomizedSearchCV (300 sampled configurations, 5-fold cross-validation). The untuned model already performed well, so the search was run to see whether further gains were available. After tuning, the final model showed only modest improvements in predictive performance.

#### **Key Findings**:
- **Legacy Model (XGBoost):**
  - Mean Absolute Error (MAE): `$1,695.51`
  - Mean Squared Error (MSE): `8,147,484`
  - R² Score: `0.9532`
  - Required manual encoding of categorical variables, which increased preprocessing time and complexity.

<br>

- **Optimized LightGBM Model**:
  - **Final Performance After Optimization**:
    - Mean Absolute Error (MAE): `$1,663.60`
    - Mean Squared Error (MSE): `7,327,172`
    - R² Score: `0.9579`

<br>

- **Improvement**:
  - Tuning reduced MAE by `$31.91` and MSE by `820,312` compared to XGBoost.

In [None]:
# Necessary libraries
import time
from datetime import datetime
import pandas as pd
import numpy as np
import joblib
from sklearn import metrics
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split, RandomizedSearchCV
import lightgbm as lgb

### Process Data and Train Model

In [None]:
# Load data
df_cars = pd.read_csv('cleaned_data_july_21st.csv')

# Define features and target
features = ['Year', 'Model', 'State', 'Mileage', 'Trim', 'Make', 'Body Style']
X = df_cars[features].copy()
y = df_cars['Price']

# Define and format categorical features
categorical_features = ['Model', 'State', 'Trim', 'Make', 'Body Style']
X[categorical_features] = X[categorical_features].astype('category')

# Scale numerical features with the previously fitted scaler (transform only, no refit)
scaler = joblib.load('final_scaler.pkl')
X[['Year', 'Mileage']] = scaler.transform(X[['Year', 'Mileage']])

# Split data 80/20
train_X, test_X, train_y, test_y = train_test_split(X, y, test_size=0.2, random_state=1)

print('Data loaded & processed')
print('\nTrain/Test Data Row & Column Count:', train_X.shape, test_X.shape)

# Fit a baseline LightGBM model; RandomizedSearchCV clones this estimator for each candidate
lgb_model = lgb.LGBMRegressor(n_jobs=-1, verbose=-1)
lgb_model.fit(train_X, train_y, categorical_feature=categorical_features)
print('\nLightGBM model trained')

Data loaded & processed

Train/Test Data Row & Column Count: (66316, 7) (16579, 7)

LightGBM model trained
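
The cell above scales `Year` and `Mileage` on the full dataset before splitting. When no pre-fitted scaler is available, one way to keep test-set statistics out of the scaling is to split first and fit the scaler on the training rows only. The sketch below illustrates that order of operations; the helper `scale_after_split` and the synthetic data are assumptions for illustration, not part of the original pipeline.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler


def scale_after_split(X, y, numeric_cols, test_size=0.2, random_state=1):
    """Hypothetical helper: split first, then fit the scaler on training rows only."""
    tr_X, te_X, tr_y, te_y = train_test_split(
        X, y, test_size=test_size, random_state=random_state
    )
    tr_X, te_X = tr_X.copy(), te_X.copy()
    demo_scaler = StandardScaler()
    # Statistics (mean, std) come from the training split only
    tr_X[numeric_cols] = demo_scaler.fit_transform(tr_X[numeric_cols])
    # The test split is transformed with those same training statistics
    te_X[numeric_cols] = demo_scaler.transform(te_X[numeric_cols])
    return tr_X, te_X, tr_y, te_y, demo_scaler


# Synthetic stand-in for the car data (values are made up for this example)
rng = np.random.default_rng(0)
demo_X = pd.DataFrame({
    'Year': rng.integers(2005, 2024, size=200).astype(float),
    'Mileage': rng.integers(1_000, 200_000, size=200).astype(float),
})
demo_y = pd.Series(rng.normal(20_000, 5_000, size=200))

train_X_demo, test_X_demo, train_y_demo, test_y_demo, fitted_scaler = scale_after_split(
    demo_X, demo_y, ['Year', 'Mileage']
)
print(train_X_demo.shape, test_X_demo.shape)
print(train_X_demo[['Year', 'Mileage']].mean().round(6))
```

Tree-based models such as LightGBM are largely insensitive to monotonic feature scaling, so this mainly matters when the same preprocessing is shared with scale-sensitive models.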


## LightGBM Hyperparameter Tuning w/ RandomizedSearchCV

In [None]:
# Define hyperparameter search space
param_search = {
    'num_leaves': [31, 50, 75, 100],
    'max_depth': [5, 7, 9, 12, 15],
    'learning_rate': [0.01, 0.1, 0.2],
    'n_estimators': [250, 500, 1000],
    'subsample': [0.7, 0.8, 1.0],  # only takes effect when subsample_freq > 0 (LightGBM default is 0)
    'colsample_bytree': [0.7, 0.8, 0.9, 1.0],
    'reg_alpha': [0, 0.001, 0.1, 0.5],
    'reg_lambda': [0, 0.1, 0.5]
}

# Track training time
start_time = time.time()

# Perform RandomizedSearch with cross-validation
random_search_light = RandomizedSearchCV(
    lgb_model, 
    param_distributions=param_search, 
    n_iter=300, 
    cv=5, 
    scoring='neg_mean_squared_error',
    random_state=1,
)

# Run the search; with refit=True (the default) the best configuration is refit on all of train_X
light_model_optimized = random_search_light.fit(train_X, train_y)

# Predict test data
pred_light_optimized = light_model_optimized.predict(test_X)

# Print best hyperparameters
print("Best Hyperparameters:", light_model_optimized.best_params_)
print('\n')

# Print test-set performance of the best model found by the search
print("\033[1mOptimized LightGBM Model Performance:\033[0m")
print(f'Mean Absolute Error (MAE): $ {metrics.mean_absolute_error(test_y, pred_light_optimized):,.2f}')
print(f'Mean Squared Error  (MSE): {int(metrics.mean_squared_error(test_y, pred_light_optimized)):,}')
print(f'R² Score             (R²): {metrics.r2_score(test_y, pred_light_optimized):.4f}')

# Print search execution time
end_time = time.time()
elapsed_time = (end_time - start_time) / 60
print(f"{elapsed_time:.1f} minutes to execute.")

Best Hyperparameters: {'subsample': 1.0, 'reg_lambda': 0.5, 'reg_alpha': 0.001, 'num_leaves': 100, 'n_estimators': 1000, 'max_depth': 5, 'learning_rate': 0.1, 'colsample_bytree': 0.9}


Optimized LightGBM Model Performance:
Mean Absolute Error (MAE): $ 1,663.60
Mean Squared Error  (MSE): 7,327,172
R² Score             (R²): 0.9579
37.4 minutes to execute.
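
The search above samples `n_iter=300` configurations from the grid and evaluates each with 5-fold cross-validation. The short calculation below is a sketch that re-enters the grid from `param_search` under the assumed name `grid_demo` and shows how large the full grid is, what share of it the search visits, and roughly how long each fit took given the 37.4 minutes reported.

```python
from math import prod

# Same candidate values as param_search in the cell above (re-entered for illustration)
grid_demo = {
    'num_leaves': [31, 50, 75, 100],
    'max_depth': [5, 7, 9, 12, 15],
    'learning_rate': [0.01, 0.1, 0.2],
    'n_estimators': [250, 500, 1000],
    'subsample': [0.7, 0.8, 1.0],
    'colsample_bytree': [0.7, 0.8, 0.9, 1.0],
    'reg_alpha': [0, 0.001, 0.1, 0.5],
    'reg_lambda': [0, 0.1, 0.5],
}
n_iter_demo, cv_demo = 300, 5

# Size of the full grid: product of the number of candidates per parameter
total_combinations = prod(len(values) for values in grid_demo.values())

# Share of the grid that the randomized search actually samples
coverage = n_iter_demo / total_combinations

# Each sampled configuration is fitted once per cross-validation fold
total_fits = n_iter_demo * cv_demo

# Rough per-fit cost from the reported wall-clock time
seconds_per_fit = 37.4 * 60 / total_fits

print(f'Full grid size:      {total_combinations:,}')
print(f'Share sampled:       {coverage:.2%}')
print(f'Cross-val fits:      {total_fits:,}')
print(f'Approx. seconds/fit: {seconds_per_fit:.2f}')
```

With 25,920 possible combinations, the search covers a little over 1% of the grid in 1,500 cross-validation fits, which is why a randomized search is practical here where an exhaustive grid search would not be. The per-fit figure of about 1.5 seconds is a slight overestimate, since the timer also includes the final refit and the test-set prediction.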


## Final Model

In [None]:
# Define final/best hyperparameters
best_hyperparams = {
    'subsample': 1.0, 
    'reg_lambda': 0.5, 
    'reg_alpha': 0.001, 
    'num_leaves': 100, 
    'n_estimators': 1000, 
    'max_depth': 5, 
    'learning_rate': 0.1, 
    'colsample_bytree': 0.9
}

# Track training time
start_time = time.time()

# Train final optimized LightGBM model
final_model = lgb.LGBMRegressor(**best_hyperparams, verbose=-1, n_jobs=-1)
final_model.fit(train_X, train_y, categorical_feature=categorical_features)

# Predict test data
final_pred = final_model.predict(test_X)

# Print final optimized LightGBM model performance
print('\033[1mOptimized LightGBM Model Performance (Final Model):\033[0m')
print(f'Mean Absolute Error (MAE): $ {metrics.mean_absolute_error(test_y, final_pred):,.2f}')
print(f'Mean Squared Error  (MSE): {int(metrics.mean_squared_error(test_y, final_pred)):,}')
print(f'R² Score             (R²): {metrics.r2_score(test_y, final_pred):.4f}')

# Print final model execution time
end_time = time.time()
elapsed_time = (end_time - start_time)
print(f"{elapsed_time:.1f} seconds to execute.")

# Print preliminary model performance for comparison (hard-coded reference values)
print('\n\033[1mPreliminary LightGBM Model Performance:\033[0m')
print('Mean Absolute Error (MAE): $ 1,695.51')
print('Mean Squared Error  (MSE): 8,147,484')
print('R² Score             (R²): 0.9532')

# Save final optimized LightGBM model
joblib.dump(final_model, 'final_model.pkl')

Optimized LightGBM Model Performance (Final Model):
Mean Absolute Error (MAE): $ 1,663.60
Mean Squared Error  (MSE): 7,327,172
R² Score             (R²): 0.9579
1.1 seconds to execute.

Preliminary LightGBM Model Performance:
Mean Absolute Error (MAE): $ 1,695.51
Mean Squared Error  (MSE): 8,147,484
R² Score             (R²): 0.9532


['final_model.pkl']
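
To put the two result blocks above side by side, the sketch below computes absolute and relative changes from the printed values and converts MSE to RMSE so the squared-error metric can be read in dollars. The dictionary names and the helper `relative_reduction` are illustrative; the numbers are copied from the output above.

```python
from math import sqrt

# Values copied from the printed output above
preliminary_scores = {'mae': 1695.51, 'mse': 8_147_484, 'r2': 0.9532}
final_scores = {'mae': 1663.60, 'mse': 7_327_172, 'r2': 0.9579}


def relative_reduction(before, after):
    """Percentage by which `after` is lower than `before` (illustrative helper)."""
    return (before - after) / before * 100


mae_drop = preliminary_scores['mae'] - final_scores['mae']
mse_drop = preliminary_scores['mse'] - final_scores['mse']
mae_pct = relative_reduction(preliminary_scores['mae'], final_scores['mae'])
mse_pct = relative_reduction(preliminary_scores['mse'], final_scores['mse'])

# RMSE is the square root of MSE, so it is in the same unit as the target (dollars)
rmse_preliminary = sqrt(preliminary_scores['mse'])
rmse_final = sqrt(final_scores['mse'])

print(f'MAE reduction: $ {mae_drop:,.2f} ({mae_pct:.2f}%)')
print(f'MSE reduction: {mse_drop:,} ({mse_pct:.2f}%)')
print(f'RMSE: $ {rmse_preliminary:,.2f} -> $ {rmse_final:,.2f}')
```

The MSE improves proportionally more than the MAE, which suggests the tuned model mainly reduces the larger errors rather than the typical one.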

## **Conclusion**:
Hyperparameter tuning reduced both MAE and MSE and slightly raised R², though the gains over the untuned model were modest. Because LightGBM handles categorical features natively, no manual encoding was needed, which kept preprocessing simple and training fast (the final model trains in about a second). Compared to the legacy XGBoost model, the optimized LightGBM model was more accurate on the held-out test set, making it the preferred choice for predicting vehicle prices and the better candidate for deployment.
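
For deployment, incoming rows need the same column order and `category` dtypes the model saw during training. The sketch below shows one defensive way to prepare new rows with pandas only. The helper `align_categories`, the tiny training frame, and the sample listings are hypothetical, and the scaling and `predict` steps are deliberately left out.

```python
import pandas as pd


def align_categories(new_df, train_df, categorical_cols):
    """Hypothetical helper: reorder columns and reuse the training category levels."""
    # Put columns in the same order as the training frame
    out = new_df[train_df.columns].copy()
    for col in categorical_cols:
        # Values not seen during training become NaN instead of new category codes
        out[col] = pd.Categorical(out[col], categories=train_df[col].cat.categories)
    return out


# Tiny made-up training frame with a categorical column
train_demo = pd.DataFrame({
    'Make': ['Ford', 'Honda', 'Toyota'],
    'Mileage': [10.0, 20.0, 30.0],
})
train_demo['Make'] = train_demo['Make'].astype('category')

# New listings arrive with a different column order and an unseen make
new_demo = pd.DataFrame({
    'Mileage': [15.0, 25.0],
    'Make': ['Honda', 'Tesla'],
})

aligned_demo = align_categories(new_demo, train_demo, ['Make'])
print(aligned_demo)
print(aligned_demo.dtypes)
```

Mapping unseen levels to missing values is only one possible policy; whether to reject such rows instead is a design decision this notebook does not settle.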