# 03 - Model Building with AutoML (FLAML)

## What we're doing

We want to predict **% Silica Concentrate** (our quality metric) from process inputs.

Instead of assuming a specific algorithm, we'll use **FLAML** (Fast and Lightweight AutoML) to:
1. Try multiple algorithms automatically
2. Tune hyperparameters for each
3. Pick the best one based on validation performance

## What is FLAML?

FLAML is Microsoft's AutoML library. It's:
- **Fast**: Uses cost-aware search that tries cheap configurations first, rather than an exhaustive grid
- **Lightweight**: Minimal dependencies
- **Automatic**: Handles model selection + hyperparameter tuning

### Algorithms FLAML tries:
| Algorithm | Type | Strengths |
|-----------|------|----------|
| LightGBM | Gradient boosted trees | Fast, handles large data |
| XGBoost | Gradient boosted trees | Robust, widely used |
| Random Forest | Ensemble of trees | Stable, less overfitting |
| Extra Trees | Ensemble of trees | More randomness, faster |
| Linear/Ridge | Linear regression | Simple baseline |
| CatBoost | Gradient boosted trees | Great with categoricals |

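FLAML allocates more time to promising algorithms and less to poor ones. The exact list it searches depends on the task, the FLAML version, and which optional packages (such as CatBoost) are installed.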

## Step 1: Load Preprocessed Data

In notebook #2, we created `mining_automl_ready.csv`, which has:
- Redundant features dropped (% Silica Feed, % Iron Concentrate)
- Starch Flow log-transformed (was heavily skewed)
- All features standardized (mean=0, std=1)

**Why standardize?** AutoML tries multiple model types:
- Tree-based models (XGBoost, LightGBM) don't need it, but aren't hurt by it
- Linear models, SVMs, and neural nets are sensitive to feature scale: regularization and gradient-based training work much better on standardized inputs

By standardizing, we give every model type a fair chance.
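
As a rough sketch of the preprocessing described above (the real code lives in notebook #2 and may differ), here is a log-transform followed by standardization on a toy frame. The toy values and the choice of `np.log1p` are assumptions made for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy stand-in for two process columns (values are made up for illustration)
toy = pd.DataFrame({
    "Starch Flow": [10.0, 50.0, 200.0, 1000.0, 6000.0],  # heavily right-skewed
    "Ore Pulp pH": [9.2, 9.5, 9.8, 10.1, 10.4],
})

# Assumption: log1p (log(1 + x), safe at zero) is used to tame the skew
toy["Starch Flow"] = np.log1p(toy["Starch Flow"])

# Standardize every column to mean 0 and unit variance
toy_scaler = StandardScaler()
toy_scaled = pd.DataFrame(toy_scaler.fit_transform(toy), columns=toy.columns)

print(toy_scaled.mean().round(6).tolist())
print(toy_scaled.std(ddof=0).round(6).tolist())

# inverse_transform maps standardized values back to the (log-transformed) units
toy_restored = toy_scaler.inverse_transform(toy_scaled)
```

`StandardScaler` divides by the population standard deviation (`ddof=0`), while pandas' `.std()` defaults to the sample version (`ddof=1`), so on a large dataset the two agree only after rounding.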

In [1]:
import pandas as pd
import numpy as np
import matplotlib
matplotlib.use('Agg')  # non-interactive backend
import matplotlib.pyplot as plt
import seaborn as sns
import joblib
import warnings
warnings.filterwarnings('ignore')

# Load preprocessed data (standardized, redundant features removed)
# Note: date column not included - not needed for modeling
df = pd.read_csv('../data/processed/mining_automl_ready.csv')

# Load feature names and scaler (saved from notebook #2)
feature_names = joblib.load('../data/processed/feature_names.joblib')
scaler = joblib.load('../data/processed/feature_scaler.joblib')

print(f"Dataset: {df.shape[0]} rows, {df.shape[1]} columns")
print(f"\nFeatures ({len(feature_names)}): standardized, ready for any model type")
print(f"Target: % Silica Concentrate (NOT standardized - in original units)")
df.head()

Dataset: 4097 rows, 21 columns

Features (20): standardized, ready for any model type
Target: % Silica Concentrate (NOT standardized - in original units)


Unnamed: 0,% Iron Feed,Starch Flow,Amina Flow,Ore Pulp Flow,Ore Pulp pH,Ore Pulp Density,Flotation Column 01 Air Flow,Flotation Column 02 Air Flow,Flotation Column 03 Air Flow,Flotation Column 04 Air Flow,...,Flotation Column 06 Air Flow,Flotation Column 07 Air Flow,Flotation Column 01 Level,Flotation Column 02 Level,Flotation Column 03 Level,Flotation Column 04 Level,Flotation Column 05 Level,Flotation Column 06 Level,Flotation Column 07 Level,% Silica Concentrate
0,-0.212251,0.404113,1.083194,0.140386,0.914982,0.771216,-0.985665,-0.915495,-1.089311,-1.793315,...,-1.379915,-1.48976,-0.571874,-0.652849,-0.583654,0.386426,0.407072,0.456197,0.406791,1.31
1,-0.212251,0.381629,0.586454,0.274018,0.957985,-0.197532,-1.029399,-0.915904,-1.094421,-1.793315,...,-1.42462,-1.501172,-0.580146,-0.623916,-0.586492,0.391865,0.312846,0.338555,0.418576,1.11
2,-0.212251,0.63422,1.239983,0.141633,0.742794,0.820657,-1.019853,-0.919639,-1.093962,-1.793315,...,-1.41338,-1.479619,-0.575238,-0.618578,-0.58057,0.402535,0.344986,0.397891,0.403538,1.27
3,-0.212251,0.453448,1.25508,0.27344,0.399418,0.794703,-1.018239,-0.91622,-1.091336,-1.793315,...,-1.427471,-1.491618,-0.264436,-0.268758,-0.317273,0.96906,0.938527,0.964562,1.018911,1.36
4,-0.212251,0.526425,1.572251,0.243344,-0.057179,1.340809,-1.028134,-0.917724,-1.09512,-1.793315,...,-1.413128,-1.49452,0.23565,0.235556,0.130673,1.69314,1.656277,1.731899,1.781621,1.34


In [2]:
# Define target and features (using saved feature names)
target = '% Silica Concentrate'
features = feature_names  # loaded from preprocessing step

# Create X (inputs) and y (output)
X = df[features]
y = df[target]

print(f"X shape: {X.shape}")
print(f"y shape: {y.shape}")

# Verify standardization worked (features should have mean≈0, std≈1)
print(f"\nFeature statistics (should be mean≈0, std≈1):")
print(X.describe().T[['mean', 'std']].head(5).round(3))

print(f"\nTarget statistics (NOT standardized - original units):")
print(f"  Mean: {y.mean():.2f}%")
print(f"  Std:  {y.std():.2f}%")

X shape: (4097, 20)
y shape: (4097,)

Feature statistics (should be mean≈0, std≈1):
               mean  std
% Iron Feed    -0.0  1.0
Starch Flow     0.0  1.0
Amina Flow      0.0  1.0
Ore Pulp Flow  -0.0  1.0
Ore Pulp pH    -0.0  1.0

Target statistics (NOT standardized - original units):
  Mean: 2.33%
  Std:  1.12%


## Step 2: Train/Test Split

### Why split the data?

We need to evaluate how well our model works on **data it hasn't seen**.

- **Training set (80%)**: Model learns from this
- **Test set (20%)**: We evaluate on this — model never sees it during training

If we evaluated on training data, the model could just memorize answers (overfitting).
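
To make the memorization point concrete, here is a small sketch on synthetic data (not the mining dataset). An unpruned decision tree is used as a deliberately overfit-prone stand-in; all names ending in `_demo` or starting with `Xd_`/`yd_` are illustrative.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data: one informative feature, two noise features, noisy target
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(500, 3))
y_demo = X_demo[:, 0] + rng.normal(scale=1.0, size=500)

Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

# An unpruned tree can fit every training row exactly
tree = DecisionTreeRegressor(random_state=0).fit(Xd_train, yd_train)

train_r2 = tree.score(Xd_train, yd_train)  # R² on data the tree has seen
test_r2 = tree.score(Xd_test, yd_test)     # R² on held-out data

print(f"Train R²: {train_r2:.3f}")
print(f"Test R²:  {test_r2:.3f}")
```

The gap between the two scores is the part of the training performance that comes from memorizing noise rather than learning the signal.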

### Time series consideration

Our data is time-ordered. For true production use, we'd want to:
- Train on past data
- Test on future data

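For this prototype, we'll use a random split because it is simpler. Keep in mind that neighboring rows in time-ordered process data are often very similar, so a random split can leak information between train and test and make scores look better than they would on genuinely future data.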
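
For reference, a chronological split could be sketched as follows. `chronological_split` is a hypothetical helper, and it assumes the rows are still in time order (the date column itself was dropped in preprocessing).

```python
import numpy as np
import pandas as pd

def chronological_split(frame, test_fraction=0.2):
    """Split a time-ordered DataFrame so all test rows come after all training rows."""
    n_test = int(round(len(frame) * test_fraction))
    cut = len(frame) - n_test
    return frame.iloc[:cut], frame.iloc[cut:]

# Toy frame where "t" stands in for the time order
ordered = pd.DataFrame({"t": np.arange(10), "value": np.arange(10) * 0.5})
past, future = chronological_split(ordered, test_fraction=0.2)

print(len(past), len(future))
```

Passing `shuffle=False` to `train_test_split` gives the same kind of split, with the last rows held out as the test set.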

In [3]:
from sklearn.model_selection import train_test_split

# 80% train, 20% test, random state for reproducibility
X_train, X_test, y_train, y_test = train_test_split(
    X, y, 
    test_size=0.2, 
    random_state=42
)

print(f"Training set: {X_train.shape[0]} rows")
print(f"Test set:     {X_test.shape[0]} rows")

Training set: 3277 rows
Test set:     820 rows


## Step 3: Train with FLAML AutoML

### What happens inside FLAML:

1. **Starts with cheap configurations** (e.g. a handful of small trees)
2. **Estimates performance** quickly using cross-validation or a holdout set (chosen automatically)
3. **Allocates more time** to promising algorithms
4. **Tunes hyperparameters** for each algorithm it tries
5. **Returns the best model** found within time budget

### Key parameters:

| Parameter | What it does |
|-----------|-------------|
| `time_budget` | How long to search (seconds) |
| `metric` | What to optimize (r2, rmse, mae) |
| `task` | 'regression' or 'classification' |
| `estimator_list` | Which algorithms to try |

We'll give it 60 seconds to search — enough to try multiple algorithms.
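
The parameters in the table can be collected in a settings dictionary, which makes it easy to restrict the search. The values below are hypothetical choices for illustration, not the configuration used in this notebook; the keys are keyword arguments of `AutoML.fit`.

```python
# Hypothetical settings; each key is a keyword argument of flaml.AutoML.fit
automl_settings = {
    "task": "regression",
    "metric": "r2",
    "time_budget": 60,                                          # seconds
    "estimator_list": ["lgbm", "xgboost", "rf", "extra_tree"],  # restrict the search
    "eval_method": "cv",                                        # force cross-validation
    "n_splits": 5,                                              # number of CV folds
    "seed": 42,
}

# Usage (needs flaml installed and X_train / y_train from Step 2):
# automl = AutoML()
# automl.fit(X_train, y_train, **automl_settings)
```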

In [4]:
from flaml import AutoML

# Create AutoML instance
automl = AutoML()

# Configure and run
automl.fit(
    X_train, y_train,
    task='regression',           # we're predicting a continuous value
    metric='r2',                 # optimize for R² (higher = better)
    time_budget=60,              # search for 60 seconds
    verbose=1,                   # keep logging minimal (FLAML's default is 3)
    seed=42,                     # reproducibility
)

## Step 4: Results — What did FLAML find?

Let's see which algorithm won and how well it performs.

In [5]:
# Best model found
print("=" * 50)
print("BEST MODEL FOUND")
print("=" * 50)
print(f"\nAlgorithm: {automl.best_estimator}")
print(f"\nBest hyperparameters:")
for param, value in automl.best_config.items():
    print(f"  {param}: {value}")

BEST MODEL FOUND

Algorithm: lgbm

Best hyperparameters:
  n_estimators: 2318
  num_leaves: 34
  min_child_samples: 6
  learning_rate: 0.01921057990310202
  log_max_bin: 9
  colsample_bytree: 0.7333657199833747
  reg_alpha: 0.00682115245814188
  reg_lambda: 0.026798931149202433


## Step 5: Evaluate on Test Set

### Metrics we'll use:

| Metric | What it means | Good value |
|--------|--------------|------------|
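| **R²** | Share of variance explained (1 = perfect, 0 = no better than predicting the mean, negative = worse) | >0.7 is decent, >0.9 is great |
| **RMSE** | Root mean squared error, in the same units as the target; penalizes large errors more heavily | Lower is better |
| **MAE** | Mean absolute error, in the same units as the target | Lower is better, easier to interpret |

### Interpretation:

- **R² = 0.8** means: model explains 80% of the variance in silica concentrate
- **RMSE = 0.2** means: the root-mean-square error is 0.2 percentage points of silica; a few large misses can dominate it
- **MAE = 0.15** means: predictions are off by 0.15 percentage points of silica on average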
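
To see where these numbers come from, here is a minimal by-hand computation on four made-up values (not taken from the dataset). It should agree with the scikit-learn functions used in the next cell.

```python
import numpy as np

# Four made-up actual / predicted values
y_true_toy = np.array([1.0, 2.0, 3.0, 4.0])
y_pred_toy = np.array([1.5, 2.0, 2.5, 4.0])

errors = y_true_toy - y_pred_toy

mae_toy = np.mean(np.abs(errors))         # mean absolute error
rmse_toy = np.sqrt(np.mean(errors ** 2))  # root mean squared error

# R² compares the model's squared error with that of always predicting the mean
ss_res = np.sum(errors ** 2)
ss_tot = np.sum((y_true_toy - y_true_toy.mean()) ** 2)
r2_toy = 1 - ss_res / ss_tot

# Baseline: predicting the mean everywhere gives R² = 0 by construction
baseline_pred = np.full_like(y_true_toy, y_true_toy.mean())
r2_baseline = 1 - np.sum((y_true_toy - baseline_pred) ** 2) / ss_tot

print(f"MAE={mae_toy:.3f}  RMSE={rmse_toy:.3f}  R²={r2_toy:.3f}  baseline R²={r2_baseline:.3f}")
```

Here MAE is 0.25, RMSE is about 0.354, and R² is 0.9. RMSE exceeds MAE because squaring gives the two 0.5 misses more weight than the two exact hits.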

In [6]:
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error

# Predict on test set
y_pred = automl.predict(X_test)

# Calculate metrics
r2 = r2_score(y_test, y_pred)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
mae = mean_absolute_error(y_test, y_pred)

print("=" * 50)
print("TEST SET PERFORMANCE")
print("=" * 50)
print(f"\nR² Score:  {r2:.4f}")
print(f"RMSE:      {rmse:.4f}")
print(f"MAE:       {mae:.4f}")

print(f"\n--- Interpretation ---")
print(f"Model explains {r2*100:.1f}% of variance in % Silica Concentrate")
print(f"Average prediction error: {mae:.3f}% silica")
print(f"Target range in data: {y.min():.1f} - {y.max():.1f}%")

TEST SET PERFORMANCE

R² Score:  0.3590
RMSE:      0.9113
MAE:       0.6824

--- Interpretation ---
Model explains 35.9% of variance in % Silica Concentrate
Average prediction error: 0.682% silica
Target range in data: 0.6 - 5.5%


In [7]:
# Actual vs Predicted plot
fig, axes = plt.subplots(1, 2, figsize=(12, 5))

# Scatter plot
ax = axes[0]
ax.scatter(y_test, y_pred, alpha=0.5, s=20)
ax.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r--', lw=2, label='Perfect prediction')
ax.set_xlabel('Actual % Silica Concentrate')
ax.set_ylabel('Predicted % Silica Concentrate')
ax.set_title(f'Actual vs Predicted (R² = {r2:.3f})')
ax.legend()

# Residuals (errors)
ax = axes[1]
residuals = y_test - y_pred
ax.scatter(y_pred, residuals, alpha=0.5, s=20)
ax.axhline(y=0, color='r', linestyle='--', lw=2)
ax.set_xlabel('Predicted % Silica Concentrate')
ax.set_ylabel('Residual (Actual - Predicted)')
ax.set_title('Residual Plot')

plt.tight_layout()
plt.savefig('../data/processed/model_01_predictions.png', dpi=150, bbox_inches='tight')
plt.show()
print("\nSaved: data/processed/model_01_predictions.png")


Saved: data/processed/model_01_predictions.png


### How to read these plots:

**Left plot (Actual vs Predicted):**
- Points on the red dashed line = perfect predictions
- Scatter around the line = prediction error
- Tighter scatter = better model

**Right plot (Residuals):**
- Should look like random noise around zero
- Patterns in residuals = model is missing something
- Funnel shape = error depends on prediction value (heteroscedasticity)
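
A funnel shape can also be checked numerically. The sketch below uses a hypothetical helper, `residual_spread_by_bin`, that sorts by predicted value and reports the residual spread in each bin; it is demonstrated on synthetic residuals whose noise grows with the prediction.

```python
import numpy as np

def residual_spread_by_bin(y_pred, residuals, n_bins=4):
    """Standard deviation of residuals within equal-sized bins of increasing prediction."""
    order = np.argsort(np.asarray(y_pred))
    chunks = np.array_split(np.asarray(residuals)[order], n_bins)
    return [float(chunk.std()) for chunk in chunks]

# Synthetic example: residual noise scales with the predicted value (a funnel)
rng = np.random.default_rng(1)
pred_demo = rng.uniform(1.0, 5.0, size=400)
resid_demo = rng.normal(scale=0.2, size=400) * pred_demo

spread = residual_spread_by_bin(pred_demo, resid_demo)
print([round(s, 2) for s in spread])
```

In this notebook the call would be `residual_spread_by_bin(y_pred, residuals)`. Roughly equal values across bins suggest constant error; a steady increase suggests heteroscedasticity.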

## Step 6: Feature Importance

### What is feature importance?

It tells us **which inputs matter most** for predictions.

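For tree-based models, importance is usually measured either by how often a feature is used for splits or by how much those splits reduce prediction error (gain). LightGBM's scikit-learn wrapper reports split counts by default (`importance_type='split'`), which is why its importance values are large integers rather than fractions.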

### Why it matters for optimization:

- High importance + controllable → **lever you can pull**
- High importance + uncontrollable → explains variance but can't optimize
- Low importance → doesn't affect output much
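
Built-in tree importances are one option. A model-agnostic alternative, shown here as a sketch on synthetic data rather than on the FLAML model, is permutation importance: shuffle one feature at a time and measure how much the score drops. The random forest and the `_toy` names are stand-ins for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

# Synthetic data: column 0 drives the target, column 1 is pure noise
rng = np.random.default_rng(2)
X_toy = rng.normal(size=(300, 2))
y_toy = 3.0 * X_toy[:, 0] + rng.normal(scale=0.1, size=300)

forest = RandomForestRegressor(n_estimators=50, random_state=0).fit(X_toy, y_toy)

# Shuffle each column 5 times and record the average drop in R²
result = permutation_importance(forest, X_toy, y_toy, n_repeats=5, random_state=0)
perm_means = result.importances_mean

print(perm_means.round(3))
```

Because it is measured as a drop in score, permutation importance is easier to compare across model types than split counts, and it is best computed on held-out data.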

In [8]:
# Get feature importance from the best model
# Note: Not all models have feature_importances_, so we handle that

model = automl.model.estimator

if hasattr(model, 'feature_importances_'):
    importance = pd.DataFrame({
        'feature': features,
        'importance': model.feature_importances_
    }).sort_values('importance', ascending=False)
    
    print("Feature Importance (top 10):")
    print(importance.head(10).to_string(index=False))
elif hasattr(model, 'coef_'):
    # For linear models, use absolute coefficients
    # (comparable across features because the inputs are standardized)
    importance = pd.DataFrame({
        'feature': features,
        'importance': np.abs(np.ravel(model.coef_))
    }).sort_values('importance', ascending=False)
    print("Feature Importance (absolute coefficients):")
    print(importance.head(10).to_string(index=False))
else:
    importance = None
    print("Feature importance not available for this model type")

Feature Importance (top 10):
                     feature  importance
                 Ore Pulp pH        4784
                 Starch Flow        4430
                  Amina Flow        4308
Flotation Column 03 Air Flow        4217
Flotation Column 01 Air Flow        4102
               Ore Pulp Flow        4093
Flotation Column 02 Air Flow        4085
   Flotation Column 03 Level        3954
            Ore Pulp Density        3942
Flotation Column 05 Air Flow        3897


In [9]:
# Plot feature importance
if importance is not None:
    fig, ax = plt.subplots(figsize=(10, 8))
    
    # Color by controllability
    controllable = ['Starch Flow', 'Amina Flow', 'Ore Pulp Flow', 'Ore Pulp pH', 'Ore Pulp Density']
    controllable += [f'Flotation Column 0{i} Air Flow' for i in range(1, 8)]
    controllable += [f'Flotation Column 0{i} Level' for i in range(1, 8)]
    
    colors = ['green' if f in controllable else 'orange' for f in importance['feature']]
    
    sns.barplot(
        data=importance,
        x='importance', 
        y='feature',
        hue='feature',
        palette=colors,
        legend=False,
        ax=ax
    )
    ax.set_xlabel('Importance')
    ax.set_ylabel('Feature')
    ax.set_title('Feature Importance (Green = Controllable, Orange = Uncontrollable)')
    
    plt.tight_layout()
    plt.savefig('../data/processed/model_02_importance.png', dpi=150, bbox_inches='tight')
    plt.show()
    print("\nSaved: data/processed/model_02_importance.png")


Saved: data/processed/model_02_importance.png


## Step 7: Compare what FLAML tried

Let's see the best configuration FLAML found for each algorithm it evaluated.

In [10]:
# All models tried during search
print("=" * 50)
print("ALL MODELS EVALUATED")
print("=" * 50)

# Get search history if available
if hasattr(automl, 'best_config_per_estimator'):
    print("\nBest config found per algorithm:")
    for estimator, config in automl.best_config_per_estimator.items():
        if config:
            print(f"\n{estimator}:")
            for k, v in config.items():
                print(f"    {k}: {v}")

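print(f"\n\nFinal winner: {automl.best_estimator}")

# FLAML minimizes a loss; for metric='r2' that loss is 1 - R², so convert it back
best_val_r2 = 1 - automl.best_loss if automl.best_loss is not None else None
print(f"Best validation R²: {best_val_r2:.4f}" if best_val_r2 is not None else "Best validation R²: N/A")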

ALL MODELS EVALUATED

Best config found per algorithm:

lgbm:
    n_estimators: 2318
    num_leaves: 34
    min_child_samples: 6
    learning_rate: 0.01921057990310202
    log_max_bin: 9
    colsample_bytree: 0.7333657199833747
    reg_alpha: 0.00682115245814188
    reg_lambda: 0.026798931149202433

rf:
    n_estimators: 23
    max_features: 1.0
    max_leaves: 12

xgboost:
    n_estimators: 100
    max_leaves: 14
    min_child_weight: 0.06169806162467062
    learning_rate: 0.11183666490279637
    subsample: 1.0
    colsample_bylevel: 0.8686106651969953
    colsample_bytree: 0.9855946937981651
    reg_alpha: 0.0009765625
    reg_lambda: 0.026172971513902767

extra_tree:
    n_estimators: 11
    max_features: 1.0
    max_leaves: 29

xgb_limitdepth:
    n_estimators: 27
    max_depth: 5
    min_child_weight: 1.7687572479859561
    learning_rate: 0.14073492405525717
    subsample: 0.9215040509386039
    colsample_bylevel: 0.928235678013149
    colsample_bytree: 0.9414474294855173
    reg_

## Summary

### What we did:
1. Loaded **preprocessed** data from EDA (standardized, redundant features removed)
2. Defined inputs (X) and output (y) - features already prepared for any model type
3. Split into train/test sets
4. Used FLAML to automatically try multiple algorithms (trees, linear, etc.)
5. Evaluated on held-out test data
6. Analyzed feature importance

### Why preprocessing mattered:
- **Standardization** puts scale-sensitive models (such as linear models) on an equal footing with tree models
- **Log-transform on Starch Flow** reduces the skew that linear models handle poorly
- **Dropping % Silica Feed and % Iron Concentrate** removed redundancy that could hurt linear models

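### What we learned:
- LightGBM (`lgbm`) was the best algorithm FLAML found within the 60-second budget
- Accuracy is modest: test R² = 0.359, RMSE = 0.91, MAE = 0.68, against a target range of 0.6 to 5.5% and well below the ">0.7 is decent" bar from Step 5
- Importance is spread almost evenly: the top 10 importance scores range only from 4784 to 3897, so no single input dominates
- The top 5 features (Ore Pulp pH, Starch Flow, Amina Flow, Flotation Column 03 and 01 Air Flow) are all controllable, but at this accuracy they are not yet reliable optimization levers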

### Next steps:
- **If performance is good**: Use model for "what-if" simulations
- **If performance is poor** (the case here, at R² ≈ 0.36): Try feature engineering, more data, or a different approach
- **For production**: More rigorous validation (cross-validation, time-based splits), with the scaler fit on training data only
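
As a sketch of what time-aware cross-validation could look like, the example below uses scikit-learn's `TimeSeriesSplit` on synthetic data. `Ridge` is only a cheap stand-in model and the `_ts` names are illustrative; with the real data, the rows would need to be in chronological order.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import TimeSeriesSplit, cross_val_score

# Synthetic, time-ordered stand-in for the process data
rng = np.random.default_rng(3)
X_ts = rng.normal(size=(200, 4))
y_ts = X_ts @ np.array([1.0, -0.5, 0.0, 0.25]) + rng.normal(scale=0.1, size=200)

# Each fold trains on an initial segment and validates on the rows that follow it
splitter = TimeSeriesSplit(n_splits=5)
fold_scores = cross_val_score(Ridge(alpha=1.0), X_ts, y_ts, cv=splitter, scoring="r2")

# Sanity check: no fold validates on rows that precede its training rows
folds_ordered = all(
    train_idx.max() < test_idx.min() for train_idx, test_idx in splitter.split(X_ts)
)

print(fold_scores.round(3), folds_ordered)
```

The spread of the fold scores gives a sense of how stable performance is over time, which a single random split cannot show.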

In [11]:
# Final summary
print("=" * 50)
print("MODEL SUMMARY")
print("=" * 50)
print(f"\nBest algorithm: {automl.best_estimator}")
print(f"Test R²:        {r2:.4f}")
print(f"Test RMSE:      {rmse:.4f}")
print(f"Test MAE:       {mae:.4f}")
print(f"\nInterpretation:")
print(f"  Model explains {r2*100:.1f}% of silica concentrate variance")
print(f"  Typical prediction error: ±{mae:.3f}% silica")

if importance is not None:
    top_controllable = [f for f in importance.head(5)['feature'] if f in controllable]
    if top_controllable:
        print(f"\nTop controllable features:")
        for f in top_controllable:
            print(f"  - {f}")

MODEL SUMMARY

Best algorithm: lgbm
Test R²:        0.3590
Test RMSE:      0.9113
Test MAE:       0.6824

Interpretation:
  Model explains 35.9% of silica concentrate variance
  Typical prediction error: ±0.682% silica

Top controllable features:
  - Ore Pulp pH
  - Starch Flow
  - Amina Flow
  - Flotation Column 03 Air Flow
  - Flotation Column 01 Air Flow
