# 05 - Selective Time Lags (Best Model)

## What We Learned from Previous Notebooks

| Notebook | Approach | R² | Problem |
|----------|----------|-----|--------|
| #3 | No lags, 20 features | 0.36 | Ignores process delays |
| #4 | All lags, 134 features | -0.05 | Too many features, overfit |
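
The negative R² for notebook #4 is not a typo. R² compares a model's squared error with that of always predicting the mean of the evaluated targets, so it falls below zero when the model does worse than that constant. A minimal sketch with invented numbers (not values from this dataset):

```python
import numpy as np
from sklearn.metrics import r2_score

# Toy targets, invented for illustration only
y_true = np.array([1.0, 2.0, 3.0, 4.0])

# Predicting the mean of y_true everywhere gives R² = 0 by definition
mean_pred = np.full_like(y_true, y_true.mean())
r2_mean = r2_score(y_true, mean_pred)

# Predictions that are worse than the mean give a negative R²
bad_pred = np.array([4.0, 3.0, 2.0, 1.0])
r2_bad = r2_score(y_true, bad_pred)

print(r2_mean, r2_bad)
```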

## This Notebook: The Middle Ground

- Add lags only for **key features** (identified from EDA)
- Use only **lag2 and lag4** (not all lags)
- Result: 30 features instead of 134

This captures the main process delays while keeping the feature count low enough to avoid the overfitting seen in notebook #4.

In [1]:
import pandas as pd
import numpy as np
import matplotlib
matplotlib.use('Agg')
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error
import warnings
warnings.filterwarnings('ignore')

# Load original data
df = pd.read_csv('../data/processed/mining_hourly.csv', parse_dates=['date'])
df = df.set_index('date').sort_index()

print(f"Dataset: {df.shape[0]} rows, {df.shape[1]} columns")

Dataset: 4097 rows, 23 columns


## Step 1: Define Key Features for Lagging

From EDA (notebook #2), these features had the strongest correlations with silica output:
- Ore Pulp pH
- Starch Flow
- Amina Flow
- Flotation Column 01 Air Flow
- Flotation Column 03 Air Flow

We'll add lag2 (2 hours ago) and lag4 (4 hours ago) only for these.

In [2]:
target = '% Silica Concentrate'

# Columns to drop (outputs or redundant)
drop_cols = ['% Iron Concentrate', '% Silica Feed']

# Key features to add lags for (most important from EDA)
key_inputs = [
    'Ore Pulp pH', 
    'Starch Flow', 
    'Amina Flow',
    'Flotation Column 01 Air Flow', 
    'Flotation Column 03 Air Flow'
]

print(f"Key inputs for lagging: {len(key_inputs)}")
for f in key_inputs:
    print(f"  - {f}")

Key inputs for lagging: 5
  - Ore Pulp pH
  - Starch Flow
  - Amina Flow
  - Flotation Column 01 Air Flow
  - Flotation Column 03 Air Flow


## Step 2: Create Lag Features

For each key input, add:
- `feature_lag2`: value from 2 hours ago
- `feature_lag4`: value from 4 hours ago
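
To make the lag construction concrete, here is a small sketch on an invented six-row hourly series (the values and the toy frame are assumptions, not the mining data). It shows how `shift` fills the first rows with NaN and why `dropna()` then removes exactly four rows:

```python
import pandas as pd

# Toy hourly series; the values are invented for illustration
idx = pd.date_range("2017-01-01", periods=6, freq="h")
toy = pd.DataFrame({"Amina Flow": [10.0, 11.0, 12.0, 13.0, 14.0, 15.0]}, index=idx)

# shift(k) moves values down by k rows, so row t holds the value from row t - k
toy["Amina Flow_lag2"] = toy["Amina Flow"].shift(2)
toy["Amina Flow_lag4"] = toy["Amina Flow"].shift(4)

# The first 4 rows have no lag4 value, so dropna() removes them
complete = toy.dropna()
print(complete)
```

Note that `shift` counts rows, not clock time. It equals "2 hours ago" only if the index is evenly spaced at one hour with no missing timestamps.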

In [3]:
# Add lag features for key inputs only
for col in key_inputs:
    df[f'{col}_lag2'] = df[col].shift(2)
    df[f'{col}_lag4'] = df[col].shift(4)

# Drop rows with NaN (first 4 rows)
rows_before = len(df)
df = df.dropna()
rows_after = len(df)

print(f"Added {len(key_inputs) * 2} lag features")
print(f"Dropped {rows_before - rows_after} rows (needed for lag calculation)")
print(f"Remaining: {rows_after} rows")

Added 10 lag features
Dropped 4 rows (needed for lag calculation)
Remaining: 4093 rows


In [4]:
# Drop redundant/output columns
df = df.drop(columns=drop_cols)

# Log-transform Starch Flow columns (original + lagged)
starch_cols = [c for c in df.columns if 'Starch' in c]
for col in starch_cols:
    df[col] = np.log1p(df[col])

print(f"Log-transformed: {starch_cols}")

Log-transformed: ['Starch Flow', 'Starch Flow_lag2', 'Starch Flow_lag4']
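
The notebook does not state why Starch Flow is log-transformed; a common reason, assumed here, is a right-skewed distribution. The sketch below uses invented values to show what `np.log1p` does and that `np.expm1` inverts it:

```python
import numpy as np

# Invented flow values spanning several orders of magnitude
flow = np.array([0.0, 9.0, 99.0, 999.0])

# log1p(x) = log(1 + x): defined at 0 and compresses large values
flow_log = np.log1p(flow)

# expm1 is the inverse, so the original scale is recoverable
flow_back = np.expm1(flow_log)
print(flow_log.round(3), flow_back)
```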


## Step 3: Prepare Features

In [5]:
# Define feature columns
feature_cols = [c for c in df.columns if c != target]

print(f"Total features: {len(feature_cols)}")
print(f"\nFeature list:")
for f in feature_cols:
    lag_indicator = " (LAGGED)" if '_lag' in f else ""
    print(f"  - {f}{lag_indicator}")

Total features: 30

Feature list:
  - % Iron Feed
  - Starch Flow
  - Amina Flow
  - Ore Pulp Flow
  - Ore Pulp pH
  - Ore Pulp Density
  - Flotation Column 01 Air Flow
  - Flotation Column 02 Air Flow
  - Flotation Column 03 Air Flow
  - Flotation Column 04 Air Flow
  - Flotation Column 05 Air Flow
  - Flotation Column 06 Air Flow
  - Flotation Column 07 Air Flow
  - Flotation Column 01 Level
  - Flotation Column 02 Level
  - Flotation Column 03 Level
  - Flotation Column 04 Level
  - Flotation Column 05 Level
  - Flotation Column 06 Level
  - Flotation Column 07 Level
  - Ore Pulp pH_lag2 (LAGGED)
  - Ore Pulp pH_lag4 (LAGGED)
  - Starch Flow_lag2 (LAGGED)
  - Starch Flow_lag4 (LAGGED)
  - Amina Flow_lag2 (LAGGED)
  - Amina Flow_lag4 (LAGGED)
  - Flotation Column 01 Air Flow_lag2 (LAGGED)
  - Flotation Column 01 Air Flow_lag4 (LAGGED)
  - Flotation Column 03 Air Flow_lag2 (LAGGED)
  - Flotation Column 03 Air Flow_lag4 (LAGGED)


In [6]:
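# Standardize features
# Note: the scaler is fit on all rows, before the train/test split in Step 4,
# so test-set means and variances influence the scaling.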
X = df[feature_cols]
y = df[target]

scaler = StandardScaler()
X = pd.DataFrame(scaler.fit_transform(X), columns=feature_cols, index=X.index)

print(f"X shape: {X.shape}")
print(f"y shape: {y.shape}")
print(f"\nFeature stats (mean should be ~0, std ~1):")
print(X.describe().T[['mean', 'std']].head(5).round(3))

X shape: (4093, 30)
y shape: (4093,)

Feature stats (mean should be ~0, std ~1):


               mean  std
% Iron Feed    -0.0  1.0
Starch Flow     0.0  1.0
Amina Flow      0.0  1.0
Ore Pulp Flow  -0.0  1.0
Ore Pulp pH    -0.0  1.0


## Step 4: Train/Test Split

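We use a random split here because a time-based split showed distribution shift (see notebook #4). Caveat: with lagged features, adjacent hours can land on opposite sides of a random split, so the test score may be optimistic.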
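
For comparison, a chronological evaluation could look like the sketch below. It uses scikit-learn's `TimeSeriesSplit` on a stand-in array of 20 time-ordered rows (the array and fold count are assumptions, not the notebook's setup). In each fold, every training row precedes every test row:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Stand-in for 20 time-ordered rows; not the mining data
X_toy = np.arange(20).reshape(-1, 1)

tscv = TimeSeriesSplit(n_splits=4)
folds = list(tscv.split(X_toy))

for train_idx, test_idx in folds:
    # Every training index precedes every test index
    print(f"train 0..{train_idx.max()}  test {test_idx.min()}..{test_idx.max()}")
```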

In [7]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, 
    test_size=0.2, 
    random_state=42
)

print(f"Training set: {len(X_train)} rows")
print(f"Test set:     {len(X_test)} rows")

Training set: 3274 rows
Test set:     819 rows
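
A leakage-free variant of the scaling step would fit the scaler on the training rows only and reuse it for the test rows. The sketch below shows the pattern on synthetic data (all names and values are illustrative assumptions):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the feature matrix (not the mining data)
rng = np.random.default_rng(42)
X_raw = rng.normal(loc=5.0, scale=2.0, size=(200, 3))
y_raw = rng.normal(size=200)

X_tr, X_te, y_tr, y_te = train_test_split(X_raw, y_raw, test_size=0.2, random_state=42)

# Fit on the training rows only, then apply the same transform to the test rows
scaler_tr = StandardScaler().fit(X_tr)
X_tr_scaled = scaler_tr.transform(X_tr)
X_te_scaled = scaler_tr.transform(X_te)

print(X_tr_scaled.mean(axis=0).round(3), X_te_scaled.mean(axis=0).round(3))
```

Tree-based models are insensitive to monotonic rescaling of features, so the practical effect on a tree ensemble is likely small. The pattern matters more for linear or distance-based models.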


## Step 5: Train with FLAML

In [8]:
from flaml import AutoML

automl = AutoML()

automl.fit(
    X_train, y_train,
    task='regression',
    metric='r2',
    time_budget=120,  # 2 minutes
    verbose=1,
    seed=42,
)

print(f"\nBest model: {automl.best_estimator}")


Best model: extra_tree


## Step 6: Evaluate

In [9]:
y_pred = automl.predict(X_test)

r2 = r2_score(y_test, y_pred)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
mae = mean_absolute_error(y_test, y_pred)

print("=" * 50)
print("TEST SET PERFORMANCE")
print("=" * 50)
print(f"\nR² Score:  {r2:.4f}")
print(f"RMSE:      {rmse:.4f}")
print(f"MAE:       {mae:.4f}")

print(f"\n--- Comparison with Other Notebooks ---")
print(f"Notebook #3 (no lags):        R² = 0.359")
print(f"Notebook #4 (all lags):       R² = -0.048")
print(f"Notebook #5 (selective lags): R² = {r2:.3f}")

TEST SET PERFORMANCE

R² Score:  0.4364
RMSE:      0.8666
MAE:       0.6586

--- Comparison with Other Notebooks ---
Notebook #3 (no lags):        R² = 0.359
Notebook #4 (all lags):       R² = -0.048
Notebook #5 (selective lags): R² = 0.436
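
For reference, the three metrics can be computed by hand. The sketch below uses invented values and matches the definitions used by `sklearn.metrics`:

```python
import numpy as np
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error

# Invented values for illustration
y_true = np.array([2.0, 3.0, 1.5, 4.0, 2.5])
y_hat = np.array([2.2, 2.7, 1.9, 3.5, 2.6])

err = y_true - y_hat
rmse_manual = np.sqrt(np.mean(err ** 2))   # root mean squared error
mae_manual = np.mean(np.abs(err))          # mean absolute error
# R² = 1 - (residual sum of squares) / (total sum of squares around the mean)
r2_manual = 1 - np.sum(err ** 2) / np.sum((y_true - y_true.mean()) ** 2)

print(rmse_manual, mae_manual, r2_manual)
```

Read this way, the test R² of 0.4364 leaves 1 - 0.4364 ≈ 0.56 of the test-set variance unexplained.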


In [10]:
# Actual vs Predicted plot
fig, axes = plt.subplots(1, 2, figsize=(12, 5))

ax = axes[0]
ax.scatter(y_test, y_pred, alpha=0.5, s=20)
ax.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r--', lw=2)
ax.set_xlabel('Actual % Silica')
ax.set_ylabel('Predicted % Silica')
ax.set_title(f'Actual vs Predicted (R² = {r2:.3f})')

ax = axes[1]
residuals = y_test - y_pred
ax.scatter(y_pred, residuals, alpha=0.5, s=20)
ax.axhline(y=0, color='r', linestyle='--', lw=2)
ax.set_xlabel('Predicted % Silica')
ax.set_ylabel('Residual')
ax.set_title('Residual Plot')

plt.tight_layout()
plt.savefig('../data/processed/model_05_selective_lags.png', dpi=150)
plt.show()

## Step 7: Feature Importance

In [11]:
model = automl.model.estimator

if hasattr(model, 'feature_importances_'):
    importance = pd.DataFrame({
        'feature': feature_cols,
        'importance': model.feature_importances_
    }).sort_values('importance', ascending=False)
    
    print("Feature Importance (all features):")
    print(importance.to_string(index=False))
else:
    if hasattr(model, 'coef_'):
        importance = pd.DataFrame({
            'feature': feature_cols,
            'importance': np.abs(model.coef_)
        }).sort_values('importance', ascending=False)
        print("Feature Importance (absolute coefficients):")
        print(importance.to_string(index=False))
    else:
        importance = None
        print("Feature importance not available")

Feature Importance (all features):
                          feature  importance
                       Amina Flow    0.054935
                      % Iron Feed    0.048435
     Flotation Column 04 Air Flow    0.043824
                  Amina Flow_lag2    0.038853
        Flotation Column 05 Level    0.038097
        Flotation Column 06 Level    0.037000
                 Ore Pulp pH_lag2    0.036477
        Flotation Column 04 Level    0.036353
     Flotation Column 03 Air Flow    0.035971
        Flotation Column 03 Level    0.035714
        Flotation Column 01 Level    0.035326
     Flotation Column 05 Air Flow    0.035136
        Flotation Column 07 Level    0.034268
                      Ore Pulp pH    0.033921
                  Amina Flow_lag4    0.032204
                 Ore Pulp pH_lag4    0.032094
        Flotation Column 02 Level    0.032034
     Flotation Column 01 Air Flow    0.030210
Flotation Column 03 Air Flow_lag4    0.029596
                      Starch Flow    0.028996
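
The importances listed above range from about 0.029 to 0.055, close to the uniform share of 1/30 ≈ 0.033, so the ranking is weak evidence on its own. Assuming these are scikit-learn's impurity-based importances, a cross-check is permutation importance on held-out rows. The sketch below runs it on synthetic data where only feature 0 matters (the data and model settings are assumptions, not the FLAML model):

```python
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data: only feature 0 drives the target
rng = np.random.default_rng(0)
X_syn = rng.normal(size=(400, 5))
y_syn = 3.0 * X_syn[:, 0] + rng.normal(scale=0.1, size=400)

X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, test_size=0.25, random_state=0)
model_syn = ExtraTreesRegressor(n_estimators=100, random_state=0).fit(X_tr, y_tr)

# Shuffle each column on held-out rows and measure the drop in R²
perm = permutation_importance(model_syn, X_te, y_te, n_repeats=10, random_state=0)
top_feature = int(np.argmax(perm.importances_mean))
print(top_feature, perm.importances_mean.round(3))
```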

In [12]:
# Plot feature importance
if importance is not None:
    fig, ax = plt.subplots(figsize=(10, 10))
    
    # Color: green = lagged, blue = current
    colors = ['green' if '_lag' in f else 'steelblue' for f in importance['feature']]
    
    sns.barplot(
        data=importance,
        x='importance', 
        y='feature',
        hue='feature',
        palette=colors,
        legend=False,
        ax=ax
    )
    ax.set_xlabel('Importance')
    ax.set_ylabel('Feature')
    ax.set_title('Feature Importance\nGreen = Lagged features, Blue = Current values')
    
    plt.tight_layout()
    plt.savefig('../data/processed/model_05_importance.png', dpi=150)
    plt.show()

## Step 8: Analyze Lag Importance

Which lag times matter most? This can hint at the process response time, although summed importance favors the group with the most features (20 current vs. 5 per lag).

In [13]:
if importance is not None:
    # Categorize features
    def get_lag_type(name):
        if '_lag2' in name:
            return 'lag2 (2h ago)'
        elif '_lag4' in name:
            return 'lag4 (4h ago)'
        else:
            return 'current (t=0)'
    
    importance['lag_type'] = importance['feature'].apply(get_lag_type)
    
    lag_summary = importance.groupby('lag_type')['importance'].agg(['sum', 'mean', 'count'])
    lag_summary = lag_summary.sort_values('sum', ascending=False)
    
    print("Importance by Lag Type:")
    print("=" * 50)
    print(lag_summary.round(2))
    
    print("\n--- Interpretation ---")
    top_lag = lag_summary['sum'].idxmax()
    print(f"Most important lag type: {top_lag}")
    if 'lag4' in top_lag:
        print("Process response time appears to be ~4 hours")
    elif 'lag2' in top_lag:
        print("Process response time appears to be ~2 hours")
    else:
        print("Current values matter most (fast response or no lag effect)")

Importance by Lag Type:
                sum  mean  count
lag_type                        
current (t=0)  0.70  0.03     20
lag2 (2h ago)  0.16  0.03      5
lag4 (4h ago)  0.15  0.03      5

--- Interpretation ---
Most important lag type: current (t=0)
Current values matter most (fast response or no lag effect)
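
Because the "current" group has 20 features and each lag group has 5, its summed importance is larger almost by construction. A fairer comparison divides each sum by its count. The sketch below does this with the rounded numbers from the table above (the frame name is illustrative):

```python
import pandas as pd

# Rounded sums and counts copied from the printed table
lag_summary_printed = pd.DataFrame(
    {"sum": [0.70, 0.16, 0.15], "count": [20, 5, 5]},
    index=["current (t=0)", "lag2 (2h ago)", "lag4 (4h ago)"],
)

# Per-feature average removes the advantage of having more columns
lag_summary_printed["mean_per_feature"] = (
    lag_summary_printed["sum"] / lag_summary_printed["count"]
)
print(lag_summary_printed.round(3))
```

The per-feature averages come out at roughly 0.035, 0.032, and 0.030. Given that the inputs are rounded to two decimals, these differences are too small to single out one group.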


## Summary

### Model Comparison

| Notebook | Features | Approach | R² |
|----------|----------|----------|----|
| #3 | 20 | No lags | 0.36 |
| #4 | 134 | All lags for all features | -0.05 |
| **#5** | **30** | **Selective lags for key features** | **0.44** |

### Key Insights

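1. **Selective lags help**: adding lag2 and lag4 for five key features raised R² from 0.36 to 0.44
2. **More features ≠ better**: lagging every feature (134 columns) caused overfitting
3. **No single lag dominates**: current, lag2, and lag4 features each average about 0.03 importance, so this model does not pin down a process response time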

### Limitations

- R² = 0.44 means about 56% of the test-set variance is unexplained
- Model shows correlations, not guaranteed causal effects
- Use for directional guidance, not precise predictions