# Explainable RUL Prediction Models for Fleet Managers

This notebook teaches you how to build **transparent, trustworthy models** that fleet managers can understand and confidently use for DPF maintenance decisions.

## 🎯 Why Explainable Models Matter

**Traditional "Black Box" Approach**:
- Complex neural networks and large, unconstrained ensembles
- High accuracy but no explanations
- "Trust the algorithm" - hard to validate
- Difficult to debug when predictions fail

**Our Explainable Approach**:
- Simple, transparent models
- Every prediction comes with reasons
- Fleet managers can validate logic
- Easy to improve and maintain

## 📚 What You'll Learn

1. **Model Selection**: Choosing algorithms that naturally provide explanations
2. **Feature Importance**: Understanding which sensors matter most
3. **Prediction Explanations**: Generating human-readable reasons for each prediction
4. **Model Validation**: Ensuring predictions make operational sense
5. **Deployment Strategy**: Rolling out explainable models in production

## 🏆 Success Criteria

A successful explainable model should:
- ✅ Predict RUL to within ±10 days of the actual value
- ✅ Provide clear explanations for every prediction
- ✅ Use features that fleet managers understand
- ✅ Be simple enough to validate manually
- ✅ Enable actionable maintenance decisions
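
The ±10-day criterion above can be checked directly. The following is a minimal sketch, assuming the criterion is read as "share of predictions whose absolute error is at most 10 days". The function name `within_tolerance_rate` and the sample values are hypothetical and not part of the original notebook.

```python
import numpy as np

def within_tolerance_rate(y_true, y_pred, tolerance_days=10):
    """Fraction of predictions whose absolute error is at most tolerance_days."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(y_true - y_pred) <= tolerance_days))

# Hypothetical values, for illustration only
actual_rul = [40, 12, 90, 25]
predicted_rul = [35, 30, 84, 27]

rate = within_tolerance_rate(actual_rul, predicted_rul)
print(f"{rate:.0%} of predictions are within ±10 days")
```

Reporting this rate alongside MAE shows how often the model meets the target, which an average error alone does not reveal.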

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime, timedelta
import warnings
warnings.filterwarnings('ignore')

# Explainable ML libraries
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.tree import DecisionTreeRegressor, plot_tree
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import mean_absolute_error, r2_score, mean_squared_error
from sklearn.preprocessing import StandardScaler
import shap  # For model explanations

# Set up plotting
plt.style.use('default')
sns.set_palette("Set2")
plt.rcParams['figure.figsize'] = (12, 8)

print("🚀 Explainable Model Tutorial Ready!")
print("🎯 Goal: Build RUL models that fleet managers can trust and understand")
print("📊 Using interpretable features from previous notebook")

🚀 Explainable Model Tutorial Ready!
🎯 Goal: Build RUL models that fleet managers can trust and understand
📊 Using interpretable features from previous notebook


## 📊 Step 1: Load and Prepare Interpretable Features

We'll use the interpretable features created in the previous notebook. These features were specifically designed to be explainable to fleet managers.

In [ ]:
# Load interpretable features dataset
print("📁 Loading Interpretable Features Dataset...")

try:
    # First try to load the interpretable features dataset
    feature_dataset = pd.read_csv('../data/interpretable_features_dataset.csv')
    print(f"✅ Loaded interpretable features: {len(feature_dataset)} examples")
    
    # Check if we have the necessary columns
    if 'rul_days' not in feature_dataset.columns:
        print("⚠️ No RUL column found. Checking for maintenance data to create RUL...")
        
        # If no RUL column, we need to create it from maintenance data
        maintenance_df = pd.read_csv('../data/dpf_maintenance_records.csv')
        maintenance_df['Date of Issue'] = pd.to_datetime(maintenance_df['Date of Issue'])
        
        # Create RUL by finding time between consecutive maintenance events
        rul_data = []
        for vehicle_num in maintenance_df['Vehicle_Number'].unique():
            vehicle_maintenance = maintenance_df[
                maintenance_df['Vehicle_Number'] == vehicle_num
            ].sort_values('Date of Issue')
            
            # For each maintenance event except the last, calculate RUL
            for i in range(len(vehicle_maintenance) - 1):
                current_event = vehicle_maintenance.iloc[i]
                next_event = vehicle_maintenance.iloc[i + 1]
                
                rul_days = (next_event['Date of Issue'] - current_event['Date of Issue']).days
                
                # Find corresponding entry in feature dataset
                matching_entries = feature_dataset[
                    (feature_dataset['vehicle_number'] == vehicle_num) &
                    (pd.to_datetime(feature_dataset['maintenance_date']).dt.date == current_event['Date of Issue'].date())
                ]
                
                if len(matching_entries) > 0:
                    rul_data.append({
                        'index': matching_entries.index[0],
                        'rul_days': rul_days
                    })
        
        # Add RUL column to feature dataset
        feature_dataset['rul_days'] = np.nan
        for rul_entry in rul_data:
            feature_dataset.loc[rul_entry['index'], 'rul_days'] = rul_entry['rul_days']
        
        # Remove rows without RUL data
        feature_dataset = feature_dataset.dropna(subset=['rul_days'])
        print(f"✅ Created RUL column: {len(feature_dataset)} examples with RUL data")
    
except FileNotFoundError:
    print("⚠️ Interpretable features not found. Creating from raw data...")
    
    # Load raw data and create features quickly
    maintenance_df = pd.read_csv('../data/dpf_maintenance_records.csv')
    sensor_df = pd.read_csv('../data/dpf_vehicle_stats.csv')
    
    # Convert datetime columns and handle timezones
    maintenance_df['Date of Issue'] = pd.to_datetime(maintenance_df['Date of Issue'])
    sensor_df['time'] = pd.to_datetime(sensor_df['time']).dt.tz_localize(None)
    
    # Create simple RUL dataset for demonstration
    print("🔧 Creating simplified feature dataset...")
    
    feature_data = []
    
    # Create RUL labels (time between maintenance events)
    for vehicle_num in maintenance_df['Vehicle_Number'].unique():
        vehicle_maintenance = maintenance_df[
            maintenance_df['Vehicle_Number'] == vehicle_num
        ].sort_values('Date of Issue')
        
        # For each maintenance event except the last, calculate RUL
        for i in range(len(vehicle_maintenance) - 1):
            current_event = vehicle_maintenance.iloc[i]
            next_event = vehicle_maintenance.iloc[i + 1]
            
            rul_days = (next_event['Date of Issue'] - current_event['Date of Issue']).days
            
            # Get sensor data for 30 days before current maintenance
            vin = current_event['VIN Number']
            start_date = current_event['Date of Issue'] - timedelta(days=30)
            end_date = current_event['Date of Issue']
            
            vehicle_sensors = sensor_df[
                (sensor_df['vin'] == vin) &
                (sensor_df['time'] >= start_date) &
                (sensor_df['time'] < end_date)
            ]
            
            if len(vehicle_sensors) >= 5:  # Need minimum data
                # Calculate simple interpretable features
                features = {
                    'vehicle_number': vehicle_num,
                    'vin': vin,
                    'rul_days': rul_days,
                    'maintenance_type': current_event['lines_jobDescriptions'],
                    'data_points': len(vehicle_sensors)
                }
                
                # Add simple sensor statistics
                for sensor in ['engineLoadPercent', 'engineRpm', 'ecuSpeedMph', 'defLevelMilliPercent']:
                    if sensor in vehicle_sensors.columns:
                        values = vehicle_sensors[sensor].dropna()
                        if len(values) > 0:
                            features[f'{sensor}_mean'] = values.mean()
                            features[f'{sensor}_std'] = values.std()
                            features[f'{sensor}_trend'] = np.polyfit(range(len(values)), values, 1)[0] if len(values) > 1 else 0
                
                feature_data.append(features)
    
    feature_dataset = pd.DataFrame(feature_data)
    print(f"✅ Created simplified feature dataset: {len(feature_dataset)} examples")

if len(feature_dataset) == 0:
    print("❌ No feature data available. Please run the feature engineering notebook first.")
    raise ValueError("No feature data available")

# Display dataset overview
print(f"\n📊 Dataset Overview:")
print(f"   Examples: {len(feature_dataset)}")

# Check for RUL column
if 'rul_days' not in feature_dataset.columns:
    print("❌ No RUL column found in dataset")
    raise ValueError("RUL column missing")

feature_cols_count = len([col for col in feature_dataset.columns if col not in ['vehicle_number', 'vin', 'rul_days', 'maintenance_type', 'data_points', 'maintenance_date', 'window_days']])
print(f"   Features: {feature_cols_count}")
print(f"   Target: RUL (days until next maintenance)")

# Show RUL distribution
print(f"\n🎯 RUL Distribution:")
print(f"   Mean: {feature_dataset['rul_days'].mean():.1f} days")
print(f"   Median: {feature_dataset['rul_days'].median():.1f} days")
print(f"   Range: {feature_dataset['rul_days'].min():.0f}-{feature_dataset['rul_days'].max():.0f} days")

# Filter out extreme outliers for modeling (>365 days likely data quality issues)
feature_dataset = feature_dataset[feature_dataset['rul_days'] <= 365]
print(f"   After filtering outliers: {len(feature_dataset)} examples")
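
The nested loops above label each maintenance event with the number of days until the same vehicle's next event. The same labels can be sketched in vectorized form with `groupby(...).shift(-1)`. This is an illustrative alternative, not the notebook's method; the records below are made up and only reuse the `Vehicle_Number` and `Date of Issue` column names.

```python
import pandas as pd

# Hypothetical maintenance records, for illustration only
records = pd.DataFrame({
    'Vehicle_Number': [101, 101, 101, 202, 202],
    'Date of Issue': pd.to_datetime(
        ['2024-01-01', '2024-02-15', '2024-05-01', '2024-03-10', '2024-04-09']
    ),
})

records = records.sort_values(['Vehicle_Number', 'Date of Issue'])

# Date of the next event for the same vehicle (NaT for each vehicle's last event)
next_event = records.groupby('Vehicle_Number')['Date of Issue'].shift(-1)
records['rul_days'] = (next_event - records['Date of Issue']).dt.days

# The last event per vehicle has no following event, so it gets no label
labelled = records.dropna(subset=['rul_days'])
print(labelled)
```

As in the loop version, the final event of each vehicle is dropped because its time to the next maintenance is unknown.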

## 🤖 Step 2: Model Selection for Explainability

We'll compare different model types based on their explainability and performance:

### 1. Linear Regression (Most Explainable)
**Pros**: Crystal clear coefficients, easy to validate
**Cons**: Assumes linear relationships
**Best for**: When you need maximum transparency

### 2. Decision Tree (Rule-Based Explainability)
**Pros**: Clear if-then rules, handles non-linear patterns
**Cons**: Can overfit, unstable
**Best for**: When you want rule-based explanations

### 3. Random Forest (Balanced Approach)
**Pros**: Good performance, feature importance, stable
**Cons**: Less transparent than a single tree
**Best for**: When you need good performance with some explainability

In [3]:
# Prepare data for modeling
def prepare_modeling_data(feature_dataset):
    """
    Prepare data for explainable modeling.
    """
    # Identify feature columns (exclude metadata)
    exclude_cols = ['vehicle_number', 'vin', 'rul_days', 'maintenance_type', 'data_points', 'maintenance_date', 'window_days']
    feature_cols = [col for col in feature_dataset.columns if col not in exclude_cols]
    
    if len(feature_cols) == 0:
        print("❌ No feature columns found!")
        return None, None, None
    
    # Prepare feature matrix
    X = feature_dataset[feature_cols].fillna(0)  # Fill missing with 0 (a simple placeholder; 0 is not a neutral value for every sensor)
    y = feature_dataset['rul_days']
    
    # Remove infinite values
    X = X.replace([np.inf, -np.inf], 0)
    
    print(f"📊 Prepared modeling data:")
    print(f"   Samples: {len(X)}")
    print(f"   Features: {len(feature_cols)}")
    print(f"   Target range: {y.min():.0f} - {y.max():.0f} days")
    
    return X, y, feature_cols

# Prepare the data
X, y, feature_cols = prepare_modeling_data(feature_dataset)

if X is None:
    print("❌ Could not prepare modeling data")
else:
    # Split into train/test sets
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=42
    )
    
    print(f"\n📋 Data Split:")
    print(f"   Training: {len(X_train)} samples")
    print(f"   Testing: {len(X_test)} samples")
    
    # Show feature preview
    print(f"\n🔧 Feature Preview (first 10):")
    for i, feature in enumerate(feature_cols[:10]):
        print(f"   {i+1}. {feature}")
    
    if len(feature_cols) > 10:
        print(f"   ... and {len(feature_cols) - 10} more features")

KeyError: 'rul_days'
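
The random split above can place intervals from the same vehicle in both the training and test sets, which may make test scores look better than they would on unseen vehicles. A hedged sketch of a vehicle-level split using scikit-learn's `GroupShuffleSplit` follows. The demo data is synthetic, and using `vehicle_number` as the grouping key is an assumption.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

# Synthetic data: 6 vehicles with 5 maintenance intervals each (illustration only)
rng = np.random.default_rng(42)
demo = pd.DataFrame({
    'vehicle_number': np.repeat(np.arange(6), 5),
    'engineLoadPercent_mean': rng.normal(50, 10, 30),
    'rul_days': rng.integers(10, 200, 30),
})

# Split so that every vehicle lands entirely in train or entirely in test
splitter = GroupShuffleSplit(n_splits=1, test_size=0.3, random_state=42)
train_idx, test_idx = next(
    splitter.split(demo, demo['rul_days'], groups=demo['vehicle_number'])
)

train_vehicles = set(demo.iloc[train_idx]['vehicle_number'])
test_vehicles = set(demo.iloc[test_idx]['vehicle_number'])
print(f"Train vehicles: {sorted(train_vehicles)}")
print(f"Test vehicles:  {sorted(test_vehicles)}")
```

Because no vehicle appears on both sides, the test error is a closer estimate of how the model would behave on a vehicle it has never seen.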

In [4]:
# Model 1: Linear Regression (Maximum Explainability)
def build_linear_rul_model(X_train, X_test, y_train, y_test, feature_cols):
    """
    Build and evaluate a linear regression model for RUL prediction.
    This model provides maximum explainability through coefficients.
    """
    print("📈 Building Linear Regression Model (Maximum Explainability)")
    print("="*60)
    
    # Standardize features so coefficient magnitudes are comparable across sensors
    scaler = StandardScaler()
    X_train_scaled = scaler.fit_transform(X_train)
    X_test_scaled = scaler.transform(X_test)
    
    # Train linear model
    linear_model = LinearRegression()
    linear_model.fit(X_train_scaled, y_train)
    
    # Make predictions
    y_pred_train = linear_model.predict(X_train_scaled)
    y_pred_test = linear_model.predict(X_test_scaled)
    
    # Evaluate performance
    train_mae = mean_absolute_error(y_train, y_pred_train)
    test_mae = mean_absolute_error(y_test, y_pred_test)
    train_r2 = r2_score(y_train, y_pred_train)
    test_r2 = r2_score(y_test, y_pred_test)
    
    print(f"📊 Model Performance:")
    print(f"   Training MAE: {train_mae:.1f} days")
    print(f"   Testing MAE: {test_mae:.1f} days")
    print(f"   Training R²: {train_r2:.3f}")
    print(f"   Testing R²: {test_r2:.3f}")
    
    # Feature importance (coefficients)
    feature_importance = pd.DataFrame({
        'feature': feature_cols,
        'coefficient': linear_model.coef_,
        'abs_coefficient': np.abs(linear_model.coef_)
    }).sort_values('abs_coefficient', ascending=False)
    
    print(f"\n🔍 Top 10 Most Important Features (Standardized Linear Coefficients):")
    for i, row in feature_importance.head(10).iterrows():
        direction = "↗️ Increases" if row['coefficient'] > 0 else "↘️ Decreases"
        # Features were standardized, so each coefficient is in days per one standard deviation
        print(f"   {row['feature']}: {direction} RUL by {abs(row['coefficient']):.2f} days per 1 std. dev. increase")
    
    return linear_model, scaler, feature_importance

# Build linear model
if X is not None:
    linear_model, linear_scaler, linear_importance = build_linear_rul_model(
        X_train, X_test, y_train, y_test, feature_cols
    )

NameError: name 'X' is not defined
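
Because the linear model is trained on standardized features, each coefficient is expressed in days per one standard deviation of the feature, not per raw sensor unit. A minimal sketch of converting to per-unit effects by dividing by `StandardScaler.scale_` is shown below. The data is synthetic and noiseless so the true per-unit effects (-0.8 and 0.02) are known; the two columns stand in for hypothetical load and RPM features.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Synthetic features: column 0 ~ engine load (%), column 1 ~ engine RPM
rng = np.random.default_rng(0)
X_demo = np.column_stack([rng.normal(50, 10, 200), rng.normal(1500, 300, 200)])
y_demo = 100 - 0.8 * X_demo[:, 0] + 0.02 * X_demo[:, 1]  # known per-unit effects

scaler = StandardScaler().fit(X_demo)
model = LinearRegression().fit(scaler.transform(X_demo), y_demo)

coef_per_std = model.coef_                   # days per one standard deviation
coef_per_unit = model.coef_ / scaler.scale_  # days per one raw unit

print("Days per std. dev.:", np.round(coef_per_std, 2))
print("Days per raw unit: ", np.round(coef_per_unit, 3))
```

The per-unit form is usually easier to explain to a fleet manager ("each extra point of engine load costs about 0.8 days"), while the standardized form is the right one for ranking features against each other.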

In [5]:
# Model 2: Decision Tree (Rule-Based Explainability)
def build_decision_tree_rul_model(X_train, X_test, y_train, y_test, feature_cols):
    """
    Build and evaluate a decision tree model for RUL prediction.
    This model provides rule-based explainability.
    """
    print("\n🌳 Building Decision Tree Model (Rule-Based Explainability)")
    print("="*60)
    
    # Train decision tree (limit depth for interpretability)
    tree_model = DecisionTreeRegressor(
        max_depth=5,  # Limit depth for explainability
        min_samples_split=10,  # Prevent overfitting
        min_samples_leaf=5,    # Ensure meaningful leaves
        random_state=42
    )
    tree_model.fit(X_train, y_train)
    
    # Make predictions
    y_pred_train = tree_model.predict(X_train)
    y_pred_test = tree_model.predict(X_test)
    
    # Evaluate performance
    train_mae = mean_absolute_error(y_train, y_pred_train)
    test_mae = mean_absolute_error(y_test, y_pred_test)
    train_r2 = r2_score(y_train, y_pred_train)
    test_r2 = r2_score(y_test, y_pred_test)
    
    print(f"📊 Model Performance:")
    print(f"   Training MAE: {train_mae:.1f} days")
    print(f"   Testing MAE: {test_mae:.1f} days")
    print(f"   Training R²: {train_r2:.3f}")
    print(f"   Testing R²: {test_r2:.3f}")
    
    # Feature importance
    feature_importance = pd.DataFrame({
        'feature': feature_cols,
        'importance': tree_model.feature_importances_
    }).sort_values('importance', ascending=False)
    
    print(f"\n🔍 Top 10 Most Important Features (Decision Tree):")
    for i, row in feature_importance.head(10).iterrows():
        print(f"   {row['feature']}: {row['importance']:.3f} importance")
    
    # Summarize the tree structure (depth, leaf count, top feature)
    print(f"\n📋 Tree Structure Summary:")
    print(f"   Tree depth: {tree_model.get_depth()}")
    print(f"   Number of leaves: {tree_model.get_n_leaves()}")
    print(f"   Most important feature: {feature_importance.iloc[0]['feature']}")
    
    return tree_model, feature_importance

# Build decision tree model
if X is not None:
    tree_model, tree_importance = build_decision_tree_rul_model(
        X_train, X_test, y_train, y_test, feature_cols
    )

NameError: name 'X' is not defined
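
The cell above reports the tree's depth and leaf count but does not print the rules themselves. scikit-learn's `export_text` renders a fitted tree as nested if-then conditions. The sketch below uses synthetic data in which high engine load shortens RUL; the feature names are borrowed from the fallback feature set and the thresholds it prints are not real fleet values.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor, export_text

# Synthetic data: RUL drops sharply when mean engine load exceeds 60%
rng = np.random.default_rng(1)
load = rng.uniform(20, 90, 300)
rpm = rng.uniform(800, 2200, 300)
X_demo = np.column_stack([load, rpm])
y_demo = np.where(load > 60, 20.0, 90.0) + rng.normal(0, 3, 300)

tree = DecisionTreeRegressor(max_depth=2, min_samples_leaf=5, random_state=42)
tree.fit(X_demo, y_demo)

# Render the fitted tree as indented if-then rules
rules = export_text(
    tree,
    feature_names=['engineLoadPercent_mean', 'engineRpm_mean'],
    decimals=1,
)
print(rules)
```

Each leaf line shows the predicted RUL for vehicles that satisfy all conditions above it, which is the form a fleet manager can check against experience.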

In [6]:
# Model 3: Random Forest (Balanced Performance/Explainability)
def build_random_forest_rul_model(X_train, X_test, y_train, y_test, feature_cols):
    """
    Build and evaluate a random forest model for RUL prediction.
    This model balances performance and explainability.
    """
    print("\n🌲 Building Random Forest Model (Balanced Approach)")
    print("="*60)
    
    # Train random forest
    rf_model = RandomForestRegressor(
        n_estimators=100,      # Enough trees for stability
        max_depth=6,           # Limit depth for interpretability
        min_samples_split=10,  # Prevent overfitting
        min_samples_leaf=5,    # Ensure meaningful leaves
        random_state=42
    )
    rf_model.fit(X_train, y_train)
    
    # Make predictions
    y_pred_train = rf_model.predict(X_train)
    y_pred_test = rf_model.predict(X_test)
    
    # Evaluate performance
    train_mae = mean_absolute_error(y_train, y_pred_train)
    test_mae = mean_absolute_error(y_test, y_pred_test)
    train_r2 = r2_score(y_train, y_pred_train)
    test_r2 = r2_score(y_test, y_pred_test)
    
    print(f"📊 Model Performance:")
    print(f"   Training MAE: {train_mae:.1f} days")
    print(f"   Testing MAE: {test_mae:.1f} days")
    print(f"   Training R²: {train_r2:.3f}")
    print(f"   Testing R²: {test_r2:.3f}")
    
    # Feature importance
    feature_importance = pd.DataFrame({
        'feature': feature_cols,
        'importance': rf_model.feature_importances_
    }).sort_values('importance', ascending=False)
    
    print(f"\n🔍 Top 10 Most Important Features (Random Forest):")
    for i, row in feature_importance.head(10).iterrows():
        print(f"   {row['feature']}: {row['importance']:.3f} importance")
    
    return rf_model, feature_importance

# Build random forest model
if X is not None:
    rf_model, rf_importance = build_random_forest_rul_model(
        X_train, X_test, y_train, y_test, feature_cols
    )

NameError: name 'X' is not defined
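
The impurity-based `feature_importances_` used above are computed on training data and can be hard to translate into operational terms. One complementary check is permutation importance on held-out data: shuffle one feature and measure how much the error grows. The sketch below is an assumption-laden illustration on synthetic data where only the first feature carries signal; the feature names are placeholders.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data: only feature 0 influences the target
rng = np.random.default_rng(2)
X_demo = rng.normal(size=(400, 3))
y_demo = 60 - 25 * X_demo[:, 0] + rng.normal(0, 2, 400)
names = ['feature_a', 'feature_b', 'feature_c']  # placeholder names

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42
)
rf = RandomForestRegressor(n_estimators=50, max_depth=6, random_state=42)
rf.fit(X_tr, y_tr)

# With this scoring, importance is the increase in MAE (days) after shuffling
result = permutation_importance(
    rf, X_te, y_te,
    scoring='neg_mean_absolute_error', n_repeats=5, random_state=42,
)
for name, mean, std in zip(names, result.importances_mean, result.importances_std):
    print(f"   {name}: +{mean:.2f} days MAE when shuffled (± {std:.2f})")
```

Expressing importance as "extra days of error" keeps the explanation in the same unit as the prediction.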

## 📊 Step 3: Model Comparison and Selection

Let's compare our three models on multiple criteria that matter for fleet management deployment.

In [None]:
# Compare all models
def compare_models():
    """
    Compare all three models on performance and explainability criteria.
    """
    print("🏆 MODEL COMPARISON SUMMARY")
    print("="*50)
    
    if X is None:
        print("❌ No models to compare")
        return
    
    # Calculate performance metrics for all models
    models_performance = []
    
    # Linear Regression
    if 'linear_model' in globals():
        X_test_scaled = linear_scaler.transform(X_test)
        linear_pred = linear_model.predict(X_test_scaled)
        linear_mae = mean_absolute_error(y_test, linear_pred)
        linear_r2 = r2_score(y_test, linear_pred)
        
        models_performance.append({
            'Model': 'Linear Regression',
            'MAE (days)': linear_mae,
            'R² Score': linear_r2,
            'Explainability': '⭐⭐⭐⭐⭐',
            'Complexity': '⭐',
            'Best For': 'Maximum transparency'
        })
    
    # Decision Tree
    if 'tree_model' in globals():
        tree_pred = tree_model.predict(X_test)
        tree_mae = mean_absolute_error(y_test, tree_pred)
        tree_r2 = r2_score(y_test, tree_pred)
        
        models_performance.append({
            'Model': 'Decision Tree',
            'MAE (days)': tree_mae,
            'R² Score': tree_r2,
            'Explainability': '⭐⭐⭐⭐',
            'Complexity': '⭐⭐',
            'Best For': 'Rule-based decisions'
        })
    
    # Random Forest
    if 'rf_model' in globals():
        rf_pred = rf_model.predict(X_test)
        rf_mae = mean_absolute_error(y_test, rf_pred)
        rf_r2 = r2_score(y_test, rf_pred)
        
        models_performance.append({
            'Model': 'Random Forest',
            'MAE (days)': rf_mae,
            'R² Score': rf_r2,
            'Explainability': '⭐⭐⭐',
            'Complexity': '⭐⭐⭐',
            'Best For': 'Performance + some explainability'
        })
    
    # Display comparison
    if models_performance:
        comparison_df = pd.DataFrame(models_performance)
        print(comparison_df.to_string(index=False, float_format='%.2f'))
        
        # Recommendations
        print(f"\n🎯 RECOMMENDATIONS:")
        
        best_performance = comparison_df.loc[comparison_df['MAE (days)'].idxmin()]
        print(f"   🏆 Best Performance: {best_performance['Model']} (MAE: {best_performance['MAE (days)']:.1f} days)")
        
        print(f"   🔍 Most Explainable: Linear Regression (crystal clear coefficients)")
        print(f"   ⚖️ Best Balance: Random Forest (good performance + feature importance)")
        
        print(f"\n💡 SELECTION GUIDE:")
        print(f"   • Choose Linear Regression if: Transparency is most important")
        print(f"   • Choose Decision Tree if: You want simple if-then rules")
        print(f"   • Choose Random Forest if: You can trade some transparency for predictive performance")
    
    return models_performance

# Compare models
model_comparison = compare_models()
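
The comparison above relies on a single train/test split, which can be noisy on a small dataset. `cross_val_score` is imported at the top of the notebook but not used, so the following sketch shows one way it could supply a steadier MAE estimate. The data is synthetic and the candidate settings are copied loosely from the cells above, so treat the numbers as illustrative only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data (illustration only)
rng = np.random.default_rng(3)
X_demo = rng.normal(size=(150, 4))
y_demo = 80 + 15 * X_demo[:, 0] - 10 * X_demo[:, 1] + rng.normal(0, 5, 150)

candidates = {
    # The pipeline refits the scaler inside each fold, avoiding leakage
    'Linear Regression': make_pipeline(StandardScaler(), LinearRegression()),
    'Decision Tree': DecisionTreeRegressor(
        max_depth=5, min_samples_leaf=5, random_state=42
    ),
}

cv_mae = {}
for name, estimator in candidates.items():
    scores = cross_val_score(
        estimator, X_demo, y_demo, cv=5, scoring='neg_mean_absolute_error'
    )
    cv_mae[name] = -scores.mean()  # flip sign: scikit-learn maximizes scores
    print(f"   {name}: {cv_mae[name]:.1f} days MAE (5-fold CV)")
```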

## 🔍 Step 4: Generate Explanations for Predictions

The most important part of explainable models is providing clear explanations for each prediction. Let's create explanation systems for our models.

In [None]:
# Explanation system for Linear Regression
def explain_linear_prediction(model, scaler, feature_cols, feature_importance, sample_features, sample_rul=None):
    """
    Generate human-readable explanations for linear regression predictions.
    """
    # Scale the sample
    sample_scaled = scaler.transform([sample_features])
    
    # Make prediction
    predicted_rul = model.predict(sample_scaled)[0]
    
    # Calculate feature contributions
    contributions = []
    intercept_contribution = model.intercept_
    
    for i, (feature, value, scaled_value) in enumerate(zip(feature_cols, sample_features, sample_scaled[0])):
        coefficient = model.coef_[i]
        contribution = coefficient * scaled_value
        
        if abs(contribution) > 1:  # Keep only contributions larger than 1 day
            contributions.append({
                'feature': feature,
                'value': value,
                'coefficient': coefficient,
                'contribution': contribution
            })
    
    # Sort by absolute contribution
    contributions.sort(key=lambda x: abs(x['contribution']), reverse=True)
    
    # Generate explanation
    explanation = {
        'predicted_rul': predicted_rul,
        'actual_rul': sample_rul,
        'baseline_rul': intercept_contribution,
        'top_contributors': contributions[:5],
        'explanation_text': []
    }
    
    # Generate human-readable text
    explanation['explanation_text'].append(f"🎯 Predicted RUL: {predicted_rul:.0f} days")
    if sample_rul is not None:
        error = abs(predicted_rul - sample_rul)
        explanation['explanation_text'].append(f"📊 Actual RUL: {sample_rul:.0f} days (Error: {error:.0f} days)")
    
    explanation['explanation_text'].append(f"\n📈 Key Contributing Factors:")
    
    for i, contrib in enumerate(contributions[:3], 1):
        direction = "increases" if contrib['contribution'] > 0 else "decreases"
        impact = abs(contrib['contribution'])
        
        explanation['explanation_text'].append(
            f"   {i}. {contrib['feature']} ({contrib['value']:.2f}) {direction} RUL by {impact:.1f} days"
        )
    
    return explanation

# Explanation system for Decision Tree
def explain_tree_prediction(model, feature_cols, sample_features, sample_rul=None):
    """
    Generate human-readable explanations for decision tree predictions.
    """
    # Make prediction
    predicted_rul = model.predict([sample_features])[0]
    
    # Get the IDs of every node visited from the root down to the leaf
    node_path = model.decision_path([sample_features]).indices
    
    # Collect the split condition applied at each internal node on that path
    feature_used = []
    
    for node_id in node_path:
        if model.tree_.feature[node_id] != -2:  # -2 marks a leaf (no split)
            feature_idx = model.tree_.feature[node_id]
            threshold = model.tree_.threshold[node_id]
            feature_name = feature_cols[feature_idx]
            feature_value = sample_features[feature_idx]
            
            direction = "≤" if feature_value <= threshold else ">"
            feature_used.append(f"{feature_name} {direction} {threshold:.2f}")
    
    explanation = {
        'predicted_rul': predicted_rul,
        'actual_rul': sample_rul,
        'decision_path': feature_used,
        'explanation_text': []
    }
    
    # Generate human-readable text
    explanation['explanation_text'].append(f"🎯 Predicted RUL: {predicted_rul:.0f} days")
    if sample_rul is not None:
        error = abs(predicted_rul - sample_rul)
        explanation['explanation_text'].append(f"📊 Actual RUL: {sample_rul:.0f} days (Error: {error:.0f} days)")
    
    explanation['explanation_text'].append(f"\n🌳 Decision Path:")
    for i, condition in enumerate(feature_used[:5], 1):
        explanation['explanation_text'].append(f"   {i}. {condition}")
    
    return explanation

# Demonstrate explanations with a sample
def demonstrate_explanations():
    """
    Demonstrate explanation systems with sample predictions.
    """
    if X is None or len(X_test) == 0:
        print("❌ No test data available for explanation demonstration")
        return
    
    print("🔍 EXPLANATION DEMONSTRATION")
    print("="*50)
    
    # Get a sample from test set
    sample_idx = 0
    sample_features = X_test.iloc[sample_idx].values
    sample_rul = y_test.iloc[sample_idx]
    
    print(f"Sample Vehicle Analysis:")
    # X_test.index holds row labels, not positions, so look up with .loc
    print(f"Vehicle ID: {feature_dataset.loc[X_test.index[sample_idx]].get('vehicle_number', 'Unknown')}")
    
    # Linear Regression Explanation
    if 'linear_model' in globals():
        print(f"\n📈 LINEAR REGRESSION EXPLANATION:")
        linear_explanation = explain_linear_prediction(
            linear_model, linear_scaler, feature_cols, linear_importance, 
            sample_features, sample_rul
        )
        
        for line in linear_explanation['explanation_text']:
            print(line)
    
    # Decision Tree Explanation
    if 'tree_model' in globals():
        print(f"\n🌳 DECISION TREE EXPLANATION:")
        tree_explanation = explain_tree_prediction(
            tree_model, feature_cols, sample_features, sample_rul
        )
        
        for line in tree_explanation['explanation_text']:
            print(line)
    
    print(f"\n💡 How Fleet Managers Should Use These Explanations:")
    print(f"   • Linear Regression: Focus on top contributing factors")
    print(f"   • Decision Tree: Follow the if-then rule logic")
    print(f"   • Both models: Look for actionable maintenance triggers")

# Demonstrate explanations
demonstrate_explanations()
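
The linear explanation works because a linear prediction decomposes exactly: the intercept plus the sum of `coefficient × scaled value` over all features equals the predicted RUL. With standardized features the intercept is also the mean training RUL, which is why it serves as the baseline. A small sketch verifying both properties on synthetic data (all values hypothetical):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Synthetic data: two features on different scales
rng = np.random.default_rng(4)
X_demo = rng.normal(loc=[50, 1500], scale=[10, 300], size=(100, 2))
y_demo = 120 - 0.9 * X_demo[:, 0] + 0.01 * X_demo[:, 1] + rng.normal(0, 4, 100)

scaler = StandardScaler().fit(X_demo)
model = LinearRegression().fit(scaler.transform(X_demo), y_demo)

# Explain the first sample
sample_scaled = scaler.transform(X_demo[:1])
contributions = model.coef_ * sample_scaled[0]  # days added or removed per feature
reconstructed = model.intercept_ + contributions.sum()
predicted = model.predict(sample_scaled)[0]

print(f"Baseline (mean training RUL): {model.intercept_:.1f} days")
print(f"Sum of contributions:         {contributions.sum():+.1f} days")
print(f"Reconstructed vs predicted:   {reconstructed:.1f} vs {predicted:.1f} days")
```

Note that the explanation function above only lists contributions larger than one day, so the displayed factors may not add up exactly to the prediction.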

## 🎯 Step 5: Model Validation for Fleet Management

Beyond statistical metrics, we need to validate that our models make operational sense for fleet managers.

In [None]:
# Fleet management validation
def validate_for_fleet_management():
    """
    Validate models from a fleet management perspective.
    """
    if X is None:
        print("❌ No models to validate")
        return
    
    print("🚛 FLEET MANAGEMENT VALIDATION")
    print("="*50)
    
    # Test different scenarios
    validation_scenarios = [
        {
            'name': 'Early Warning Capability',
            'description': 'Can the model predict issues 30+ days in advance?',
            'test': lambda pred, actual: (pred <= 30 and actual <= 45) or (pred > 30 and actual > 30)
        },
        {
            'name': 'Emergency Alert Accuracy',
            'description': 'Does the model correctly identify urgent cases (<15 days)?',
            'test': lambda pred, actual: (pred <= 15 and actual <= 20) or (pred > 15 and actual > 10)
        },
        {
            'name': 'False Alarm Avoidance',
            'description': 'How often does the model avoid flagging a vehicle (≤30 days predicted) that actually has >60 days left?',
            'test': lambda pred, actual: not (pred <= 30 and actual > 60)
        }
    ]
    
    # Test each model
    model_results = {}
    
    if 'linear_model' in globals():
        X_test_scaled = linear_scaler.transform(X_test)
        linear_predictions = linear_model.predict(X_test_scaled)
        model_results['Linear Regression'] = linear_predictions
    
    if 'tree_model' in globals():
        tree_predictions = tree_model.predict(X_test)
        model_results['Decision Tree'] = tree_predictions
    
    if 'rf_model' in globals():
        rf_predictions = rf_model.predict(X_test)
        model_results['Random Forest'] = rf_predictions
    
    # Evaluate scenarios for each model
    print(f"📋 Scenario-Based Validation Results:")
    print()
    
    for scenario in validation_scenarios:
        print(f"🎯 {scenario['name']}:")
        print(f"   {scenario['description']}")
        
        for model_name, predictions in model_results.items():
            # Calculate success rate for this scenario
            successes = 0
            total = len(predictions)
            
            for pred, actual in zip(predictions, y_test):
                if scenario['test'](pred, actual):
                    successes += 1
            
            success_rate = (successes / total) * 100
            print(f"   {model_name}: {success_rate:.1f}% success rate")
        
        print()
    
    # Operational recommendations
    print(f"🎯 OPERATIONAL RECOMMENDATIONS:")
    
    # Find best model for early warning
    best_early_warning = None
    best_early_warning_rate = 0
    
    for model_name, predictions in model_results.items():
        early_warning_successes = sum(1 for pred, actual in zip(predictions, y_test) 
                                    if (pred <= 30 and actual <= 45) or (pred > 30 and actual > 30))
        rate = early_warning_successes / len(predictions)
        
        if rate > best_early_warning_rate:
            best_early_warning_rate = rate
            best_early_warning = model_name
    
    if best_early_warning:
        print(f"   🏆 Best for Early Warning: {best_early_warning} ({best_early_warning_rate:.1%} success)")
    
    print(f"   📅 Recommended Check Frequency: Weekly for vehicles with <60 days predicted RUL")
    print(f"   🚨 Urgent Alert Threshold: ≤15 days predicted RUL")
    print(f"   ⚠️ Warning Alert Threshold: 16-30 days predicted RUL")
    print(f"   ⚡ Caution Alert Threshold: 31-60 days predicted RUL")
    print(f"   ✅ Normal Monitoring: >60 days predicted RUL")

# Run fleet management validation
validate_for_fleet_management()
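
The scenario tests above fold correct alerts, missed alerts, and false alarms into one success rate. Separating them can make the trade-off clearer. The sketch below computes precision, recall, false alarms, and misses for a single alert threshold. The function name `urgent_alert_metrics`, the 30-day default, and the sample values are assumptions for illustration.

```python
import numpy as np

def urgent_alert_metrics(y_true, y_pred, threshold_days=30):
    """Treat 'RUL <= threshold_days' as an alert and score it against actual RUL."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)

    alerted = y_pred <= threshold_days       # model raised an alert
    truly_urgent = y_true <= threshold_days  # maintenance really was due soon

    tp = int(np.sum(alerted & truly_urgent))   # correct alerts
    fp = int(np.sum(alerted & ~truly_urgent))  # false alarms
    fn = int(np.sum(~alerted & truly_urgent))  # missed urgent cases

    precision = tp / (tp + fp) if (tp + fp) else float('nan')
    recall = tp / (tp + fn) if (tp + fn) else float('nan')
    return {'precision': precision, 'recall': recall,
            'false_alarms': fp, 'missed': fn}

# Hypothetical actual and predicted RUL values (days)
actual = [10, 25, 70, 90, 28, 45]
predicted = [12, 40, 20, 95, 30, 50]

metrics = urgent_alert_metrics(actual, predicted)
print(metrics)
```

A missed urgent case usually costs more than a false alarm in this setting, so recall is often the number to watch first when choosing the threshold.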

## 🚀 Step 6: Deployment Strategy for Explainable Models

Now let's create a practical deployment strategy that fleet managers can implement.

In [ ]:
# Create deployment package
def create_deployment_strategy():
    """
    Create a practical deployment strategy for fleet managers.
    """
    print("🚀 EXPLAINABLE MODEL DEPLOYMENT STRATEGY")
    print("="*60)
    
    print(f"📋 PHASE 1: PILOT DEPLOYMENT (Weeks 1-4)")
    print(f"   🎯 Objective: Validate model performance with small fleet subset")
    print(f"   📊 Scope: 5-10 highest-risk vehicles")
    print(f"   📈 Success Criteria:")
    print(f"      • Predictions within ±15 days of actual maintenance")
    print(f"      • Fleet managers understand all explanations")
    print(f"      • At least 2 successful proactive maintenance actions")
    print(f"   🔧 Actions:")
    print(f"      • Daily model runs and prediction reviews")
    print(f"      • Weekly explanation validation meetings")
    print(f"      • Track prediction accuracy vs actual maintenance")
    
    print(f"\n📋 PHASE 2: SCALED DEPLOYMENT (Weeks 5-12)")
    print(f"   🎯 Objective: Expand to full fleet with automated alerts")
    print(f"   📊 Scope: All vehicles with sufficient data coverage")
    print(f"   📈 Success Criteria:")
    print(f"      • 20% reduction in emergency DPF repairs")
    print(f"      • 15% improvement in maintenance planning efficiency")
    print(f"      • Fleet manager confidence score >8/10")
    print(f"   🔧 Actions:")
    print(f"      • Automated daily reports with explanations")
    print(f"      • Integration with existing maintenance systems")
    print(f"      • Staff training on model interpretation")
    
    print(f"\n📋 PHASE 3: OPTIMIZATION (Weeks 13+)")
    print(f"   🎯 Objective: Continuous improvement and model refinement")
    print(f"   📊 Scope: Full fleet + model enhancements")
    print(f"   🔧 Actions:")
    print(f"      • Monthly model retraining with new data")
    print(f"      • Feature threshold adjustments based on experience")
    print(f"      • Expansion to other maintenance categories")
    
    print(f"\n🛠️ TECHNICAL IMPLEMENTATION:")
    
    # Model selection recommendation
    if 'model_comparison' in globals() and model_comparison:
        best_model = min(model_comparison, key=lambda x: x['MAE (days)'])
        print(f"   🏆 Recommended Model: {best_model['Model']}")
        print(f"      Accuracy: ±{best_model['MAE (days)']:.1f} days")
        print(f"      Explainability: {best_model['Explainability']}")
        print(f"      Best For: {best_model['Best For']}")
    else:
        print(f"   🏆 Recommended Model: Linear Regression (Most Explainable)")
        print(f"      • Crystal clear coefficient interpretations")
        print(f"      • Easy to validate predictions manually")
        print(f"      • Maximum transparency for fleet managers")
    
    print(f"\n   📊 Alert System Configuration:")
    print(f"      🚨 URGENT (Immediate Action): RUL ≤ 15 days")
    print(f"         → Schedule maintenance within 1 week")
    print(f"         → Daily monitoring of key sensors")
    print(f"      ⚠️ WARNING (Plan Soon): RUL 16-30 days")
    print(f"         → Schedule maintenance within 2-3 weeks")
    print(f"         → Weekly sensor review")
    print(f"      ⚡ CAUTION (Monitor): RUL 31-60 days")
    print(f"         → Plan maintenance within 1-2 months")
    print(f"         → Trend analysis every two weeks")
    print(f"      ✅ NORMAL (Routine): RUL > 60 days")
    print(f"         → Standard maintenance schedule")
    print(f"         → Monthly monitoring")
    
    print(f"\n🎓 TRAINING REQUIREMENTS:")
    print(f"   👥 Fleet Managers (4 hours):")
    print(f"      • Understanding model predictions and explanations")
    print(f"      • Interpreting feature importance and trends")
    print(f"      • Making data-driven maintenance decisions")
    print(f"   🔧 Maintenance Staff (2 hours):")
    print(f"      • Reading alert reports and explanations")
    print(f"      • Validating predictions against actual findings")
    print(f"      • Providing feedback for model improvement")
    
    print(f"\n📈 SUCCESS METRICS:")
    print(f"   📊 Operational Metrics:")
    print(f"      • Prediction accuracy (target: ±10 days)")
    print(f"      • Emergency repair reduction (target: 25%)")
    print(f"      • Maintenance cost savings (target: 15%)")
    print(f"      • Vehicle downtime reduction (target: 20%)")
    print(f"   👥 User Adoption Metrics:")
    print(f"      • Fleet manager explanation comprehension (target: >90%)")
    print(f"      • Daily system usage rate (target: >80%)")
    print(f"      • User confidence in predictions (target: >8/10)")

# Create deployment strategy
create_deployment_strategy()
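
The alert tiers printed in the deployment strategy can be expressed as a small lookup function. This is a minimal sketch that assumes the four thresholds from the alert configuration (15, 30, and 60 days). The function name `classify_rul_alert`, the returned labels, and the vehicle IDs are hypothetical and do not come from earlier cells. Because model predictions are fractional, a value between 15 and 16 days is treated as WARNING here.

```python
def classify_rul_alert(predicted_rul_days: float) -> str:
    """Map a predicted RUL (in days) to one of the four alert tiers."""
    if predicted_rul_days <= 15:
        return "URGENT"   # schedule maintenance within 1 week
    if predicted_rul_days <= 30:
        return "WARNING"  # schedule maintenance within 2-3 weeks
    if predicted_rul_days <= 60:
        return "CAUTION"  # plan maintenance within 1-2 months
    return "NORMAL"       # standard maintenance schedule


# Hypothetical predictions for three vehicles (illustrative values only)
example_predictions = [("TRUCK-001", 12.0), ("TRUCK-002", 45.5), ("TRUCK-003", 90.0)]

for vehicle_id, rul_days in example_predictions:
    print(f"{vehicle_id}: {rul_days:.0f} days -> {classify_rul_alert(rul_days)}")
```

Keeping the thresholds in one function makes them easy to adjust in Phase 3, when feature thresholds are revisited based on experience.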

## 💾 Step 7: Save Explainable Models for Production

Let's save the trained explainable models, together with their metadata and a usage example, so they can be loaded in production.

In [ ]:
# Save models for production
def save_production_models():
    """
    Save the best explainable models for production deployment.
    """
    print("💾 SAVING EXPLAINABLE MODELS FOR PRODUCTION")
    print("="*50)
    
    import joblib
    import json
    from datetime import datetime
    
    # Create model metadata
    model_metadata = {
        'created_date': datetime.now().isoformat(),
        'model_purpose': 'DPF RUL Prediction',
        'target_variable': 'Days until next DPF maintenance',
        'feature_count': len(feature_cols) if feature_cols else 0,
        'training_samples': len(X_train) if X is not None else 0,
        'features': feature_cols if feature_cols else [],
        'model_versions': []
    }
    
    saved_models = []
    
    # Save Linear Regression (Most Explainable)
    if 'linear_model' in globals():
        try:
            joblib.dump(linear_model, '../models/linear_rul_model.pkl')
            joblib.dump(linear_scaler, '../models/linear_rul_scaler.pkl')
            linear_importance.to_csv('../models/linear_feature_importance.csv', index=False)
            
            # Test performance
            X_test_scaled = linear_scaler.transform(X_test)
            linear_pred = linear_model.predict(X_test_scaled)
            linear_mae = mean_absolute_error(y_test, linear_pred)
            
            model_metadata['model_versions'].append({
                'name': 'Linear Regression',
                'file': 'linear_rul_model.pkl',
                'scaler': 'linear_rul_scaler.pkl',
                'feature_importance': 'linear_feature_importance.csv',
                'mae_days': linear_mae,
                'explainability_level': 'Maximum',
                'explanation_type': 'Coefficient-based'
            })
            
            saved_models.append('Linear Regression')
            print(f"✅ Saved Linear Regression model (MAE: {linear_mae:.1f} days)")
            
        except Exception as e:
            print(f"❌ Failed to save Linear Regression: {e}")
    
    # Save Decision Tree (Rule-Based)
    if 'tree_model' in globals():
        try:
            joblib.dump(tree_model, '../models/tree_rul_model.pkl')
            tree_importance.to_csv('../models/tree_feature_importance.csv', index=False)
            
            # Test performance
            tree_pred = tree_model.predict(X_test)
            tree_mae = mean_absolute_error(y_test, tree_pred)
            
            model_metadata['model_versions'].append({
                'name': 'Decision Tree',
                'file': 'tree_rul_model.pkl',
                'scaler': None,
                'feature_importance': 'tree_feature_importance.csv',
                'mae_days': tree_mae,
                'explainability_level': 'High',
                'explanation_type': 'Rule-based'
            })
            
            saved_models.append('Decision Tree')
            print(f"✅ Saved Decision Tree model (MAE: {tree_mae:.1f} days)")
            
        except Exception as e:
            print(f"❌ Failed to save Decision Tree: {e}")
    
    # Save Random Forest (Balanced)
    if 'rf_model' in globals():
        try:
            joblib.dump(rf_model, '../models/rf_rul_model.pkl')
            rf_importance.to_csv('../models/rf_feature_importance.csv', index=False)
            
            # Test performance
            rf_pred = rf_model.predict(X_test)
            rf_mae = mean_absolute_error(y_test, rf_pred)
            
            model_metadata['model_versions'].append({
                'name': 'Random Forest',
                'file': 'rf_rul_model.pkl',
                'scaler': None,
                'feature_importance': 'rf_feature_importance.csv',
                'mae_days': rf_mae,
                'explainability_level': 'Medium',
                'explanation_type': 'Feature importance'
            })
            
            saved_models.append('Random Forest')
            print(f"✅ Saved Random Forest model (MAE: {rf_mae:.1f} days)")
            
        except Exception as e:
            print(f"❌ Failed to save Random Forest: {e}")
    
    # Save metadata
    try:
        with open('../models/model_metadata.json', 'w') as f:
            json.dump(model_metadata, f, indent=2)
        print(f"✅ Saved model metadata")
    except Exception as e:
        print(f"❌ Failed to save metadata: {e}")
    
    # Create usage example
    usage_example = f'''
# PRODUCTION USAGE EXAMPLE
import json
import joblib
import pandas as pd

# Load the most explainable model and its scaler
model = joblib.load('models/linear_rul_model.pkl')
scaler = joblib.load('models/linear_rul_scaler.pkl')

# Load the feature names saved with the model.
# Assumption: the 'features' list is in the same order as the training columns.
with open('models/model_metadata.json') as f:
    feature_names = json.load(f)['features']

# Prepare new vehicle data (same features, same order as training)
new_vehicle_features = [...]  # Your feature values here

# Make prediction
scaled_features = scaler.transform([new_vehicle_features])
predicted_rul = model.predict(scaled_features)[0]

# Generate a per-vehicle explanation: each feature's contribution (in days,
# relative to the model intercept) is its coefficient times its scaled value
contributions = pd.Series(model.coef_ * scaled_features[0], index=feature_names)
order = contributions.abs().sort_values(ascending=False).index
top_contributors = contributions[order].head(5)

print(f"Predicted RUL: {{predicted_rul:.0f}} days")
print("Top contributing factors for this vehicle:")
for feature, days in top_contributors.items():
    direction = "increases" if days > 0 else "decreases"
    print(f"  - {{feature}}: {{direction}} predicted RUL by {{abs(days):.1f}} days")
'''
    
    try:
        with open('../models/usage_example.py', 'w') as f:
            f.write(usage_example)
        print(f"✅ Created usage example")
    except Exception as e:
        print(f"❌ Failed to create usage example: {e}")
    
    print(f"\n🎉 PRODUCTION READY!")
    print(f"   📁 Models saved in: ../models/")
    print(f"   🤖 Models available: {', '.join(saved_models)}")
    print(f"   📊 Metadata file: model_metadata.json")
    print(f"   💡 Usage example: usage_example.py")
    
    return model_metadata

# Create models directory and save
import os
os.makedirs('../models', exist_ok=True)

if X is not None:
    production_metadata = save_production_models()
else:
    print("❌ No models available to save")

print(f"\n📚 TUTORIAL COMPLETE!")
print(f"🎯 You've learned how to:")
print(f"   • Build explainable RUL prediction models")
print(f"   • Generate human-readable explanations for predictions")
print(f"   • Validate models for fleet management use")
print(f"   • Create deployment strategies for production")
print(f"   • Save models for operational use")
print(f"\n🚀 Ready for deployment in your fleet management system!")
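
For a linear model trained on standardized features, each prediction decomposes into the intercept plus the sum of coefficient × scaled feature value. The sketch below illustrates this coefficient-based explanation on synthetic data so that it runs on its own. The feature names (`soot_load_pct`, `avg_exhaust_temp_c`, `regen_count_30d`), the data, and the helper `explain_prediction` are all made up for illustration and are not the notebook's actual features or functions.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data; names and values are illustrative only
rng = np.random.default_rng(42)
feature_names = ["soot_load_pct", "avg_exhaust_temp_c", "regen_count_30d"]
X_demo = pd.DataFrame(rng.normal(size=(200, 3)), columns=feature_names)
y_demo = (
    90
    - 20 * X_demo["soot_load_pct"]
    + 5 * X_demo["avg_exhaust_temp_c"]
    + rng.normal(scale=2, size=200)
)

# Same pattern as the saved linear model: scale first, then fit
scaler = StandardScaler().fit(X_demo)
model = LinearRegression().fit(scaler.transform(X_demo), y_demo)


def explain_prediction(model, scaler, vehicle_row, top_n=3):
    """Return per-feature contributions (in days) for a one-row DataFrame."""
    scaled = scaler.transform(vehicle_row)[0]
    explanation = pd.DataFrame({
        "feature": list(vehicle_row.columns),
        "contribution_days": model.coef_ * scaled,
    })
    # Largest absolute contribution first
    order = explanation["contribution_days"].abs().sort_values(ascending=False).index
    return explanation.loc[order].head(top_n).reset_index(drop=True)


vehicle = X_demo.iloc[[0]]  # double brackets keep it a one-row DataFrame
predicted_rul = model.predict(scaler.transform(vehicle))[0]
explanation = explain_prediction(model, scaler, vehicle)

print(f"Predicted RUL: {predicted_rul:.0f} days "
      f"(baseline for an average vehicle: {model.intercept_:.0f} days)")
for row in explanation.itertuples(index=False):
    direction = "adds" if row.contribution_days > 0 else "removes"
    print(f"  - {row.feature}: {direction} {abs(row.contribution_days):.1f} days")
```

Because the contributions and the intercept add up exactly to the prediction, a fleet manager can check any single prediction by hand.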