# Supply Chain Multi-Output Prediction
## 1. Introduction
This notebook builds multi-output regression models for supply chain management predictions. I'll predict 16 targets (LBL and MTLp2 through MTLp16) with three machine learning models and compare their results.
## 2. Setup and Dependencies

In [1]:
# Standard library
import os
import warnings

# Third-party
import arff
import joblib
import lightgbm as lgb
import numpy as np
import pandas as pd
import xgboost as xgb
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.multioutput import MultiOutputRegressor
from sklearn.preprocessing import StandardScaler

# Silence library warnings to keep the notebook output readable
warnings.filterwarnings('ignore')

## 3. Data Loading and Preparation
I'm working with ARFF files containing supply chain data.

In [2]:
# Load the SupplyChainModel class
from src.scm import SupplyChainModel
# Initialize the model
scm = SupplyChainModel()

# Load training and test data
train_df = scm.load_arff_data('data/Supply Chain Management_train.arff')
test_df = scm.load_arff_data('data/Supply Chain Management_test.arff')

# Display data info
print("Training data shape:", train_df.shape)
print("\nFeature names:")
print(train_df.columns.tolist())

Training data shape: (8145, 296)

Feature names:
['timeunit', 'storageCost', 'interestRate', 'compidx0lt2', 'compidx0lt2l1', 'compidx0lt2l2', 'compidx0lt2l4', 'compidx0lt2l8', 'compidx1lt2', 'compidx2lt2', 'compidx3lt2', 'compidx4lt2', 'compidx4lt2l1', 'compidx4lt2l2', 'compidx4lt2l4', 'compidx4lt2l8', 'compidx5lt2', 'compidx6lt2', 'compidx6lt2l1', 'compidx6lt2l2', 'compidx6lt2l4', 'compidx6lt2l8', 'compidx7lt2', 'compidx8lt2', 'compidx8lt2l1', 'compidx8lt2l2', 'compidx8lt2l4', 'compidx8lt2l8', 'compidx9lt2', 'compidx10lt2', 'compidx10lt2l1', 'compidx10lt2l2', 'compidx10lt2l4', 'compidx10lt2l8', 'compidx11lt2', 'compidx12lt2', 'compidx12lt2l1', 'compidx12lt2l2', 'compidx12lt2l4', 'compidx12lt2l8', 'compidx13lt2', 'compidx14lt2', 'compidx14lt2l1', 'compidx14lt2l2', 'compidx14lt2l4', 'compidx14lt2l8', 'compidx15lt2', 'compidx0lt6', 'compidx0lt6l1', 'compidx0lt6l2', 'compidx0lt6l4', 'compidx0lt6l8', 'compidx1lt6', 'compidx2lt6', 'compidx3lt6', 'compidx4lt6', 'compidx4lt6l1', 'compidx4lt6l
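
The implementation of `scm.load_arff_data` is not shown in this notebook, so here is a minimal sketch of one way to read an ARFF file into a DataFrame and separate the targets from the features. It uses SciPy's `loadarff` as a stand-in (the notebook itself imports `arff`), and the tiny in-memory file, `load_arff_frame`, and `split_features_targets` are hypothetical names for illustration only.

```python
from io import StringIO

import pandas as pd
from scipy.io.arff import loadarff

# Toy ARFF content: two feature columns and two target columns.
# The real files have 296 columns; this is only an illustration.
TOY_ARFF = """@relation scm_toy
@attribute timeunit numeric
@attribute storageCost numeric
@attribute LBL numeric
@attribute MTLp2 numeric
@data
1,0.5,1500.0,1600.0
2,0.6,1510.0,1620.0
"""


def load_arff_frame(source):
    """Read an ARFF file (path or file-like object) into a DataFrame."""
    data, _meta = loadarff(source)
    return pd.DataFrame(data)


def split_features_targets(df, target_names):
    """Separate the target columns from the remaining feature columns."""
    X = df.drop(columns=target_names)
    y = df[target_names]
    return X, y


toy_df = load_arff_frame(StringIO(TOY_ARFF))
X_toy, y_toy = split_features_targets(toy_df, ["LBL", "MTLp2"])
print(X_toy.shape, y_toy.shape)
```

With the real data, the same split would presumably use the 16 target names (`LBL`, `MTLp2` through `MTLp16`) and leave 280 feature columns, which matches the feature count LightGBM reports during training.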

## 4. Model Training and Evaluation
I'll train three different models:

- Random Forest
- XGBoost
- LightGBM
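
Not every regressor handles several targets at once, so a common pattern is to wrap a single-output model in scikit-learn's `MultiOutputRegressor`, which fits one copy of the model per target. The sketch below shows that pattern on synthetic data. What `scm.create_models()` actually does is not shown here, so this is an assumption, and `GradientBoostingRegressor` stands in for XGBoost and LightGBM to keep the example dependent on scikit-learn only.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.multioutput import MultiOutputRegressor
from sklearn.preprocessing import StandardScaler

# Synthetic data: 5 features, 3 targets (the real task has 280 and 16).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
W = rng.normal(size=(5, 3))
Y = X @ W + 0.1 * rng.normal(size=(200, 3))
X_tr, X_te, Y_tr, Y_te = X[:150], X[150:], Y[:150], Y[150:]

# Fit the scaler on the training split only, then reuse it for the test split.
scaler = StandardScaler().fit(X_tr)

# One GradientBoostingRegressor is cloned and fitted per target column.
model = MultiOutputRegressor(
    GradientBoostingRegressor(n_estimators=50, random_state=0)
)
model.fit(scaler.transform(X_tr), Y_tr)

pred = model.predict(scaler.transform(X_te))
print(pred.shape)              # one prediction column per target
print(len(model.estimators_))  # one fitted estimator per target
```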

In [3]:
# Prepare data
X_train, y_train = scm.prepare_data(train_df)
X_test, y_test = scm.prepare_data(test_df)

# Create models
scm.create_models()

# Train and evaluate
target_names = ['LBL'] + [f'MTLp{i}' for i in range(2, 17)]
results = scm.train_and_evaluate(X_train, y_train, X_test, y_test, target_names)


Training random_forest...

Training xgboost...

Training lightgbm...
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.021586 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 68846
[LightGBM] [Info] Number of data points in the train set: 8145, number of used features: 280
[LightGBM] [Info] Start training from score 1206.718232
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.015267 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 68846
[LightGBM] [Info] Number of data points in the train set: 8145, number of used features: 280
[LightGBM] [Info] Start training from score 1290.648619
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.014510 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 68846
[LightGBM] [Info] Number of data points i

## 5. Results Analysis
Let's analyze the performance of each model:

In [4]:
# Print results for each model
for model_name, model_results in results.items():
    print(f"\n{model_name.upper()} Performance Metrics:")
    print("-" * 40)
    for target, metrics in model_results.items():
        if target != 'overall_r2':
            print(f"\nTarget: {target}")
            print(f"RMSE: {metrics['RMSE']:.4f}")
            print(f"R2 Score: {metrics['R2']:.4f}")
    print(f"\nOverall R2 Score: {model_results['overall_r2']:.4f}")


RANDOM_FOREST Performance Metrics:
----------------------------------------

Target: LBL
RMSE: 90.4847
R2 Score: 0.8205

Target: MTLp2
RMSE: 98.3226
R2 Score: 0.8216

Target: MTLp3
RMSE: 104.1193
R2 Score: 0.7763

Target: MTLp4
RMSE: 122.9627
R2 Score: 0.7556

Target: MTLp5
RMSE: 159.7497
R2 Score: 0.6100

Target: MTLp6
RMSE: 192.0373
R2 Score: 0.5439

Target: MTLp7
RMSE: 181.8115
R2 Score: 0.5812

Target: MTLp8
RMSE: 210.8607
R2 Score: 0.5014

Target: MTLp9
RMSE: 94.1596
R2 Score: 0.8104

Target: MTLp10
RMSE: 102.3098
R2 Score: 0.8070

Target: MTLp11
RMSE: 103.5664
R2 Score: 0.7922

Target: MTLp12
RMSE: 114.5011
R2 Score: 0.7851

Target: MTLp13
RMSE: 121.1037
R2 Score: 0.8153

Target: MTLp14
RMSE: 115.4555
R2 Score: 0.8398

Target: MTLp15
RMSE: 123.7046
R2 Score: 0.8401

Target: MTLp16
RMSE: 125.6062
R2 Score: 0.8409

Overall R2 Score: 0.7463

XGBOOST Performance Metrics:
----------------------------------------

Target: LBL
RMSE: 82.8515
R2 Score: 0.8495

Target: MTLp2
RMSE: 92.7464
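
For reference, here is a minimal sketch of how per-target RMSE and R² can be computed and stored in the same dictionary layout the printing loop reads (`RMSE`, `R2`, `overall_r2`). The function name `per_target_metrics` is hypothetical, and treating the overall score as the plain mean of the per-target R² values is an assumption. It is consistent with the Random Forest output above, where the 16 listed R² values average to 0.7463.

```python
import numpy as np


def per_target_metrics(y_true, y_pred, target_names):
    """Return {target: {'RMSE', 'R2'}} plus the mean R2 under 'overall_r2'."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    results = {}
    for j, name in enumerate(target_names):
        residual = y_true[:, j] - y_pred[:, j]
        ss_res = np.sum(residual ** 2)
        ss_tot = np.sum((y_true[:, j] - y_true[:, j].mean()) ** 2)
        results[name] = {
            "RMSE": float(np.sqrt(np.mean(residual ** 2))),
            "R2": float(1.0 - ss_res / ss_tot),
        }
    # Uniform average of the per-target R2 values (assumed definition).
    results["overall_r2"] = float(
        np.mean([results[name]["R2"] for name in target_names])
    )
    return results


# Toy check: target "a" is predicted perfectly, target "b" by its mean.
y_true = np.array([[1.0, 2.0], [2.0, 4.0], [3.0, 6.0]])
y_pred = np.array([[1.0, 4.0], [2.0, 4.0], [3.0, 4.0]])
toy = per_target_metrics(y_true, y_pred, ["a", "b"])

# Random Forest R2 values copied from the output above.
rf_r2 = [0.8205, 0.8216, 0.7763, 0.7556, 0.6100, 0.5439, 0.5812, 0.5014,
         0.8104, 0.8070, 0.7922, 0.7851, 0.8153, 0.8398, 0.8401, 0.8409]
rf_overall = round(sum(rf_r2) / len(rf_r2), 4)
print(rf_overall)
```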

## 6. Model Export
Export the trained models for future use:

In [5]:
# Export all models and scalers
scm.export_models()


Best performing model: xgboost
All models, scalers, and configurations exported to exported_models/
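
The body of `scm.export_models()` is not shown, so the following is only a sketch of how such an export could work with `joblib`. The file naming (`<name>_model.joblib`, `<name>_scaler.joblib`) follows the paths loaded in the next section. The helper `export_model`, the toy linear model, and the temporary directory are assumptions made for the example.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler


def export_model(name, model, scaler, out_dir):
    """Save a fitted model and its scaler side by side; return both paths."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    model_path = out_dir / f"{name}_model.joblib"
    scaler_path = out_dir / f"{name}_scaler.joblib"
    joblib.dump(model, model_path)
    joblib.dump(scaler, scaler_path)
    return model_path, scaler_path


# Toy model: 4 features, 2 targets.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))
Y = X @ rng.normal(size=(4, 2))
scaler = StandardScaler().fit(X)
model = LinearRegression().fit(scaler.transform(X), Y)

with tempfile.TemporaryDirectory() as tmp:
    model_path, scaler_path = export_model("toy", model, scaler, tmp)
    restored_model = joblib.load(model_path)
    restored_scaler = joblib.load(scaler_path)
    # The restored pair should reproduce the original predictions.
    same = np.allclose(
        model.predict(scaler.transform(X)),
        restored_model.predict(restored_scaler.transform(X)),
    )
print(same)
```

Saving the scaler together with the model matters because predictions are only valid when new data is transformed with the statistics learned from the training set.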


## 7. Using the Exported Models
Here's how to load and use the exported models:

In [9]:
# Load a saved model and scaler
model = joblib.load('exported_models/random_forest_model.joblib')
scaler = joblib.load('exported_models/random_forest_scaler.joblib')

# Prepare new data
new_data = X_test.copy()  # Example using test data
new_data_scaled = scaler.transform(new_data)

# Make predictions
predictions = model.predict(new_data_scaled)
predictions_df = pd.DataFrame(predictions, columns=target_names)
predictions_df


|  | LBL | MTLp2 | MTLp3 | MTLp4 | MTLp5 | MTLp6 | MTLp7 | MTLp8 | MTLp9 | MTLp10 | MTLp11 | MTLp12 | MTLp13 | MTLp14 | MTLp15 | MTLp16 |
|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|
| 0 | 1521.52 | 1638.99 | 1642.94 | 1727.48 | 1979.64 | 2108.98 | 2132.41 | 2232.81 | 1579.97 | 1661.90 | 1655.88 | 1740.38 | 2059.71 | 2101.49 | 2172.44 | 2229.93 |
| 1 | 1523.16 | 1631.60 | 1639.26 | 1740.77 | 1995.89 | 2070.54 | 2128.22 | 2222.97 | 1592.70 | 1670.51 | 1721.09 | 1784.79 | 2059.99 | 2115.20 | 2224.46 | 2236.00 |
| 2 | 1521.72 | 1638.43 | 1638.62 | 1734.87 | 1982.73 | 2100.38 | 2128.40 | 2227.61 | 1619.99 | 1682.74 | 1714.85 | 1783.87 | 2059.59 | 2135.04 | 2207.72 | 2255.14 |
| 3 | 1522.52 | 1623.94 | 1649.31 | 1741.52 | 1987.19 | 2072.99 | 2114.93 | 2213.70 | 1608.02 | 1681.63 | 1713.51 | 1787.33 | 2072.65 | 2154.22 | 2230.80 | 2269.45 |
| 4 | 1505.88 | 1614.63 | 1639.80 | 1735.74 | 1985.03 | 2095.61 | 2126.73 | 2239.13 | 1597.75 | 1685.69 | 1726.89 | 1788.95 | 2123.71 | 2183.10 | 2255.16 | 2290.63 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1653 | 1318.50 | 1430.77 | 1416.70 | 1511.34 | 1521.92 | 1592.93 | 1564.26 | 1622.32 | 1380.09 | 1495.72 | 1463.68 | 1593.28 | 1693.69 | 1764.16 | 1672.62 | 1708.25 |
| 1654 | 1305.86 | 1454.11 | 1456.97 | 1503.85 | 1560.79 | 1621.88 | 1616.66 | 1700.77 | 1401.63 | 1501.58 | 1482.16 | 1603.56 | 1624.81 | 1783.99 | 1703.29 | 1729.76 |
| 1655 | 1323.12 | 1434.98 | 1461.65 | 1512.29 | 1592.01 | 1680.66 | 1615.69 | 1741.42 | 1390.74 | 1502.82 | 1452.75 | 1585.62 | 1696.20 | 1828.68 | 1772.65 | 1797.14 |
| 1656 | 1342.88 | 1467.89 | 1493.99 | 1537.21 | 1679.26 | 1716.60 | 1658.15 | 1743.51 | 1419.00 | 1531.70 | 1500.87 | 1621.57 | 1725.63 | 1820.82 | 1754.97 | 1795.76 |


## 8. Conclusion
I've built and compared three multi-output regression models for supply chain prediction. Accuracy varies considerably by target: for Random Forest, per-target R² ranges from 0.50 (MTLp8) to 0.84 (MTLp16).

Best Performing Model: XGBoost
- Overall R² Score: 0.7684
- Followed by LightGBM (0.7650) and Random Forest (0.7463)

Key Findings:

1. Model Performance by Target Groups:
   - Strong Predictions (R² > 0.80):
     * Near-term targets (LBL, MTLp2)
     * Later-term targets (MTLp13-MTLp16)
     * All models performed well on LBL, with XGBoost reaching R² = 0.8495

   - Moderate Predictions (0.70 < R² < 0.80):
     * Mid-term targets (MTLp3, MTLp4)
     * MTLp9-MTLp12, although in the Random Forest output MTLp9 (0.8104) and MTLp10 (0.8070) land just above the 0.80 line
     * Consistent performance across all models

   - Challenging Predictions (R² < 0.70):
     * MTLp5-MTLp8
     * Particularly MTLp8, which shows the lowest performance (R² ≈ 0.50) across all models

2. Model Comparison:
   - XGBoost showed superior performance in overall R² and most individual targets
   - LightGBM performed slightly better than XGBoost on some specific targets (e.g., LBL)
   - Random Forest, while still good, consistently ranked third in performance

3. RMSE Analysis:
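   - Lowest RMSE on LBL, MTLp2, and MTLp9 (Random Forest: 90.48, 98.32, and 94.16)
   - Highest RMSE on MTLp5-MTLp8 (Random Forest: 159.75 to 210.86)
   - RMSE does not increase steadily with the target index: for Random Forest it peaks at MTLp8 (210.86), falls back to 94.16 at MTLp9, and then generally rises to 125.61 at MTLp16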
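
The target grouping in finding 1 can be reproduced from the per-target scores. Below is a small sketch applied to the Random Forest R² values printed in Section 5. The function name `bucket_targets` and the handling of values exactly at 0.80 or 0.70 are my assumptions. For Random Forest alone, MTLp9 and MTLp10 fall into the strong group because both sit slightly above 0.80.

```python
# Random Forest R2 per target, copied from the Section 5 output.
rf_r2 = {
    "LBL": 0.8205, "MTLp2": 0.8216, "MTLp3": 0.7763, "MTLp4": 0.7556,
    "MTLp5": 0.6100, "MTLp6": 0.5439, "MTLp7": 0.5812, "MTLp8": 0.5014,
    "MTLp9": 0.8104, "MTLp10": 0.8070, "MTLp11": 0.7922, "MTLp12": 0.7851,
    "MTLp13": 0.8153, "MTLp14": 0.8398, "MTLp15": 0.8401, "MTLp16": 0.8409,
}


def bucket_targets(r2_by_target, strong=0.80, moderate=0.70):
    """Group targets into strong / moderate / challenging by R2 threshold."""
    groups = {"strong": [], "moderate": [], "challenging": []}
    for target, r2 in r2_by_target.items():
        if r2 >= strong:
            groups["strong"].append(target)
        elif r2 >= moderate:
            groups["moderate"].append(target)
        else:
            groups["challenging"].append(target)
    return groups


groups = bucket_targets(rf_r2)
for label, targets in groups.items():
    print(f"{label}: {', '.join(targets)}")
```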

Recommendations:
1. Use XGBoost as the primary model for this supply chain prediction task
2. Consider ensemble methods for MTLp5-MTLp8 to improve performance
3. Implement feature engineering or selection to enhance predictions for weaker targets
4. Consider collecting additional relevant features for MTLp5-MTLp8 predictions
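
As a starting point for recommendation 2, the simplest ensemble is a weighted average of the prediction matrices from the individual models. The sketch below is not something this notebook ran. The helper `blend_predictions` and the toy arrays are hypothetical, and in practice the weights would need to be chosen on a validation split rather than on the test set.

```python
import numpy as np


def blend_predictions(predictions, weights=None):
    """Weighted average of several (n_samples, n_targets) prediction arrays."""
    stacked = np.stack(predictions, axis=0)  # (n_models, n_samples, n_targets)
    if weights is None:
        weights = np.ones(len(predictions))
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()  # normalise so the weights sum to 1
    # Contract the model axis: result has shape (n_samples, n_targets).
    return np.tensordot(weights, stacked, axes=1)


# Toy predictions from two hypothetical models for 4 samples and 2 targets.
pred_a = np.full((4, 2), 1.0)
pred_b = np.full((4, 2), 3.0)

equal_blend = blend_predictions([pred_a, pred_b])
weighted_blend = blend_predictions([pred_a, pred_b], weights=[3, 1])
print(equal_blend[0], weighted_blend[0])
```

Restricting the blend to the MTLp5-MTLp8 columns would leave the stronger targets on the single best model.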

Future Improvements:
1. Hyperparameter tuning for each model
2. Feature importance analysis to understand key drivers
3. Develop specialized models for the challenging prediction windows
4. Implement stacking or blending of models for improved performance
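
For the feature importance analysis in item 2, one option is to read `feature_importances_` from each per-target estimator and average across targets. This is a sketch on synthetic data, assuming the models are wrapped in `MultiOutputRegressor` (the notebook does not show how `scm` builds them), with Random Forest used because it needs only scikit-learn.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.multioutput import MultiOutputRegressor

# Synthetic data: target 0 depends on feature 0, target 1 on feature 1.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 6))
Y = np.column_stack([
    3.0 * X[:, 0] + 0.1 * rng.normal(size=300),
    2.0 * X[:, 1] + 0.1 * rng.normal(size=300),
])

model = MultiOutputRegressor(
    RandomForestRegressor(n_estimators=30, random_state=0)
).fit(X, Y)

# One row of importances per target, one column per feature.
per_target = np.vstack([est.feature_importances_ for est in model.estimators_])

# Average across targets and rank features from most to least important.
mean_importance = per_target.mean(axis=0)
ranking = [int(i) for i in np.argsort(mean_importance)[::-1]]
print(ranking)
```

Keeping the per-target rows is useful here because the drivers of the weaker targets (MTLp5-MTLp8) may differ from those of the stronger ones.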

The models demonstrate strong predictive capability overall, with particular strength in near-term and long-term predictions, while showing room for improvement in mid-term forecasting.