# Comprehensive Process Energy Demand Modelling and Evaluation

This notebook provides a unified framework for preprocessing multiple datasets, extracting processes, simulating them, modeling energy profiles, and evaluating results across all datasets.

## 1. Import Required Libraries

Import all necessary libraries for data processing, process mining, modeling, and visualization.

In [2]:
# Data manipulation and analysis
import pandas as pd
import numpy as np
from datetime import datetime, timedelta

# Process mining
import pm4py
# from pm4py.objects.log.importer.xes import importer as xes_importer
# from pm4py.objects.conversion.log import converter as log_converter
# from pm4py.algo.discovery.alpha import algorithm as alpha_miner
# from pm4py.algo.discovery.inductive import algorithm as inductive_miner
# from pm4py.simulation.playout import simulator

# Machine learning
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.preprocessing import LabelEncoder

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go

# Utilities
import warnings
warnings.filterwarnings('ignore')

print("All libraries imported successfully!")

All libraries imported successfully!


## 2. Configuration and Path Setup

Define paths for multiple datasets and set configuration parameters for preprocessing and modeling.

In [6]:
# Dataset configuration
DATASETS = {
    'dataset1': {
        'path': '/Users/davidzapata/Documents/GitHub/process_energy_demand_modelling/data/silver/company_1/df_sensor_joined.parquet',
        'name': 'Dataset 1',
        'case_id': 'case:concept:name',
        'activity': 'concept:name',
        'timestamp': 'time:timestamp',
        'energy': 'energy_consumed'
    },
    'dataset2': {
        'path': '/Users/davidzapata/Documents/GitHub/process_energy_demand_modelling/data/silver/company_2/df_combined_legend.parquet',
        'name': 'Dataset 2',
        'case_id': 'CaseID',
        'activity': 'Activity',
        'timestamp': 'Timestamp',
        'energy': 'EnergyDemand'
    },
    # Add more datasets as needed
}

# Modeling configuration
CONFIG = {
    'test_size': 0.2,
    'random_state': 42,
    'simulation_traces': 100,
    'models': ['RandomForest', 'GradientBoosting', 'LinearRegression']
}

# Results storage
results_dict = {}
preprocessed_datasets = {}

print(f"Configuration loaded for {len(DATASETS)} datasets")
print(f"Models to evaluate: {CONFIG['models']}")

Configuration loaded for 2 datasets
Models to evaluate: ['RandomForest', 'GradientBoosting', 'LinearRegression']


## 3. Data Preprocessing Module

Create reusable preprocessing functions that handle event log formatting, timestamp conversion, and data cleaning for all datasets.

In [7]:
def preprocess_event_log(dataset_key, config):
    """
    Preprocess event log data for process mining.
    
    Parameters:
    -----------
    dataset_key : str
        Key from DATASETS dictionary
    config : dict
        Dataset configuration with column mappings
        
    Returns:
    --------
    pd.DataFrame : Preprocessed event log
    """
    print(f"\n{'='*60}")
    print(f"Preprocessing: {config['name']}")
    print(f"{'='*60}")
    
    # Load data
    try:
        df = pd.read_csv(config['path'])
        print(f"✓ Loaded {len(df)} events from {config['path']}")
    except FileNotFoundError:
        print(f"✗ File not found: {config['path']}")
        return None
    
    # Rename columns to standard format
    column_mapping = {
        config['case_id']: 'case:concept:name',
        config['activity']: 'concept:name',
        config['timestamp']: 'time:timestamp',
        config['energy']: 'energy_consumed'
    }
    
    df = df.rename(columns=column_mapping)
    
    # Convert timestamp to datetime
    if not pd.api.types.is_datetime64_any_dtype(df['time:timestamp']):
        df['time:timestamp'] = pd.to_datetime(df['time:timestamp'], errors='coerce')
        print(f"✓ Converted timestamps to datetime format")
    
    # Remove rows with missing critical values
    initial_len = len(df)
    df = df.dropna(subset=['case:concept:name', 'concept:name', 'time:timestamp'])
    if len(df) < initial_len:
        print(f"✓ Removed {initial_len - len(df)} rows with missing critical values")
    
    # Sort by case and timestamp
    df = df.sort_values(['case:concept:name', 'time:timestamp'])
    print(f"✓ Sorted by case and timestamp")
    
    # Add additional features
    df['hour'] = df['time:timestamp'].dt.hour
    df['day_of_week'] = df['time:timestamp'].dt.dayofweek
    df['month'] = df['time:timestamp'].dt.month
    
    # Calculate duration (time since last event in case)
    df['duration'] = df.groupby('case:concept:name')['time:timestamp'].diff().dt.total_seconds()
    df['duration'] = df['duration'].fillna(0)
    
    print(f"✓ Added temporal features (hour, day_of_week, month, duration)")
    
    # Summary statistics
    print(f"\nDataset Summary:")
    print(f"  - Total events: {len(df)}")
    print(f"  - Unique cases: {df['case:concept:name'].nunique()}")
    print(f"  - Unique activities: {df['concept:name'].nunique()}")
    print(f"  - Date range: {df['time:timestamp'].min()} to {df['time:timestamp'].max()}")
    
    if 'energy_consumed' in df.columns:
        print(f"  - Energy range: {df['energy_consumed'].min():.2f} to {df['energy_consumed'].max():.2f}")
        print(f"  - Mean energy: {df['energy_consumed'].mean():.2f}")
    
    return df


def apply_preprocessing_to_all():
    """
    Apply preprocessing to all configured datasets.
    
    Returns:
    --------
    dict : Dictionary of preprocessed dataframes
    """
    preprocessed = {}
    
    for dataset_key, dataset_config in DATASETS.items():
        df = preprocess_event_log(dataset_key, dataset_config)
        if df is not None:
            preprocessed[dataset_key] = df
    
    print(f"\n{'='*60}")
    print(f"Preprocessing Complete: {len(preprocessed)}/{len(DATASETS)} datasets processed")
    print(f"{'='*60}")
    
    return preprocessed

In [8]:
# Execute preprocessing for all datasets
preprocessed_datasets = apply_preprocessing_to_all()


Preprocessing: Dataset 1


UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 7: invalid start byte

## 4. Process Discovery and Simulation

Extract process models from preprocessed event logs and generate simulated traces for each dataset.

In [None]:
def discover_and_simulate_process(df, dataset_name, num_traces=100):
    """
    Discover process model and simulate traces.
    
    Parameters:
    -----------
    df : pd.DataFrame
        Preprocessed event log
    dataset_name : str
        Name of the dataset for logging
    num_traces : int
        Number of traces to simulate
        
    Returns:
    --------
    tuple : (original_log, simulated_log, process_model)
    """
    print(f"\n--- Process Discovery: {dataset_name} ---")
    
    # Convert to PM4Py event log
    log = log_converter.apply(df, variant=log_converter.Variants.TO_EVENT_LOG)
    print(f"✓ Converted to event log format")
    
    # Discover process model using Inductive Miner
    try:
        net, initial_marking, final_marking = inductive_miner.apply(log)
        print(f"✓ Process model discovered using Inductive Miner")
    except Exception as e:
        print(f"✗ Error in process discovery: {e}")
        return None, None, None
    
    # Simulate process traces
    try:
        simulated_log = simulator.apply(
            net, 
            initial_marking, 
            final_marking,
            parameters={
                simulator.Variants.STOCHASTIC_PLAYOUT.value.Parameters.NO_TRACES: num_traces
            }
        )
        print(f"✓ Simulated {len(simulated_log)} process traces")
    except Exception as e:
        print(f"✗ Error in simulation: {e}")
        return log, None, (net, initial_marking, final_marking)
    
    return log, simulated_log, (net, initial_marking, final_marking)


def convert_simulated_log_to_df(simulated_log, original_df):
    """
    Convert simulated log to DataFrame with features from original data.
    
    Parameters:
    -----------
    simulated_log : EventLog
        Simulated event log from PM4Py
    original_df : pd.DataFrame
        Original preprocessed dataframe for feature mapping
        
    Returns:
    --------
    pd.DataFrame : Simulated events as dataframe
    """
    simulated_data = []
    
    for trace_idx, trace in enumerate(simulated_log):
        for event_idx, event in enumerate(trace):
            event_dict = {
                'case:concept:name': f"simulated_case_{trace_idx}",
                'concept:name': event['concept:name'],
                'event_id': event_idx
            }
            simulated_data.append(event_dict)
    
    simulated_df = pd.DataFrame(simulated_data)
    
    # Add features based on original data patterns
    activity_features = original_df.groupby('concept:name').agg({
        'energy_consumed': 'mean',
        'duration': 'mean',
        'hour': lambda x: x.mode()[0] if len(x.mode()) > 0 else 12,
        'day_of_week': lambda x: x.mode()[0] if len(x.mode()) > 0 else 0
    }).reset_index()
    
    simulated_df = simulated_df.merge(activity_features, on='concept:name', how='left')
    
    print(f"✓ Converted simulated log to DataFrame with {len(simulated_df)} events")
    
    return simulated_df


# Process discovery and simulation for all datasets
simulated_datasets = {}

for dataset_key, df in preprocessed_datasets.items():
    dataset_name = DATASETS[dataset_key]['name']
    original_log, simulated_log, model = discover_and_simulate_process(
        df, 
        dataset_name, 
        CONFIG['simulation_traces']
    )
    
    if simulated_log is not None:
        simulated_df = convert_simulated_log_to_df(simulated_log, df)
        simulated_datasets[dataset_key] = {
            'original': df,
            'simulated': simulated_df,
            'model': model
        }

print(f"\nProcess discovery and simulation complete for {len(simulated_datasets)} datasets")

## 5. Energy Profile Modeling

Apply energy demand modeling to the simulated processes, creating energy profiles for each dataset.

In [None]:
def prepare_features_for_modeling(df):
    """
    Prepare features for machine learning models.
    
    Parameters:
    -----------
    df : pd.DataFrame
        Event log dataframe
        
    Returns:
    --------
    tuple : (X, y, feature_names)
    """
    # Encode categorical features
    le = LabelEncoder()
    df_model = df.copy()
    
    if 'concept:name' in df_model.columns:
        df_model['activity_encoded'] = le.fit_transform(df_model['concept:name'])
    
    # Select features
    feature_cols = ['activity_encoded', 'hour', 'day_of_week', 'duration', 'event_id']
    available_features = [col for col in feature_cols if col in df_model.columns]
    
    X = df_model[available_features].fillna(0)
    y = df_model['energy_consumed'].fillna(df_model['energy_consumed'].mean())
    
    return X, y, available_features


def train_energy_models(X_train, y_train, X_test, y_test):
    """
    Train multiple energy prediction models.
    
    Parameters:
    -----------
    X_train, y_train : Training data
    X_test, y_test : Test data
    
    Returns:
    --------
    dict : Trained models and their predictions
    """
    models = {
        'RandomForest': RandomForestRegressor(n_estimators=100, random_state=CONFIG['random_state']),
        'GradientBoosting': GradientBoostingRegressor(n_estimators=100, random_state=CONFIG['random_state']),
        'LinearRegression': LinearRegression()
    }
    
    results = {}
    
    for model_name, model in models.items():
        # Train model
        model.fit(X_train, y_train)
        
        # Make predictions
        y_pred_train = model.predict(X_train)
        y_pred_test = model.predict(X_test)
        
        results[model_name] = {
            'model': model,
            'y_pred_train': y_pred_train,
            'y_pred_test': y_pred_test
        }
        
        print(f"✓ Trained {model_name}")
    
    return results


# Train models for all datasets
model_results = {}

for dataset_key, dataset in simulated_datasets.items():
    dataset_name = DATASETS[dataset_key]['name']
    print(f"\n{'='*60}")
    print(f"Energy Profile Modeling: {dataset_name}")
    print(f"{'='*60}")
    
    # Use simulated data for modeling
    df = dataset['simulated']
    
    # Prepare features
    X, y, feature_names = prepare_features_for_modeling(df)
    print(f"✓ Prepared {len(feature_names)} features: {feature_names}")
    
    # Split data
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, 
        test_size=CONFIG['test_size'], 
        random_state=CONFIG['random_state']
    )
    print(f"✓ Split data: {len(X_train)} training, {len(X_test)} test samples")
    
    # Train models
    trained_models = train_energy_models(X_train, y_train, X_test, y_test)
    
    model_results[dataset_key] = {
        'name': dataset_name,
        'X_train': X_train,
        'X_test': X_test,
        'y_train': y_train,
        'y_test': y_test,
        'models': trained_models,
        'features': feature_names
    }

print(f"\nEnergy modeling complete for {len(model_results)} datasets")

## 6. Comprehensive Evaluation Framework

Implement evaluation metrics and comparison logic to assess model performance across all datasets.

In [None]:
def evaluate_model(y_true, y_pred, model_name, dataset_name, split='test'):
    """
    Evaluate model performance with multiple metrics.
    
    Parameters:
    -----------
    y_true : array-like
        True values
    y_pred : array-like
        Predicted values
    model_name : str
        Name of the model
    dataset_name : str
        Name of the dataset
    split : str
        'train' or 'test'
        
    Returns:
    --------
    dict : Evaluation metrics
    """
    metrics = {
        'Dataset': dataset_name,
        'Model': model_name,
        'Split': split,
        'RMSE': np.sqrt(mean_squared_error(y_true, y_pred)),
        'MAE': mean_absolute_error(y_true, y_pred),
        'R2': r2_score(y_true, y_pred),
        'MAPE': np.mean(np.abs((y_true - y_pred) / y_true)) * 100 if not np.any(y_true == 0) else np.nan
    }
    
    return metrics


def comprehensive_evaluation(model_results):
    """
    Perform comprehensive evaluation across all datasets and models.
    
    Parameters:
    -----------
    model_results : dict
        Dictionary containing model results for all datasets
        
    Returns:
    --------
    pd.DataFrame : Evaluation results
    """
    all_evaluations = []
    
    for dataset_key, results in model_results.items():
        dataset_name = results['name']
        
        print(f"\n--- Evaluating: {dataset_name} ---")
        
        for model_name, model_data in results['models'].items():
            # Evaluate on training set
            train_metrics = evaluate_model(
                results['y_train'],
                model_data['y_pred_train'],
                model_name,
                dataset_name,
                'train'
            )
            all_evaluations.append(train_metrics)
            
            # Evaluate on test set
            test_metrics = evaluate_model(
                results['y_test'],
                model_data['y_pred_test'],
                model_name,
                dataset_name,
                'test'
            )
            all_evaluations.append(test_metrics)
            
            print(f"  {model_name:20s} - Test R2: {test_metrics['R2']:.4f}, RMSE: {test_metrics['RMSE']:.4f}")
    
    evaluation_df = pd.DataFrame(all_evaluations)
    
    return evaluation_df


# Perform comprehensive evaluation
evaluation_results = comprehensive_evaluation(model_results)

print("\n" + "="*60)
print("EVALUATION COMPLETE")
print("="*60)
print(f"\nTotal evaluations: {len(evaluation_results)}")
print(f"Datasets: {evaluation_results['Dataset'].nunique()}")
print(f"Models: {evaluation_results['Model'].nunique()}")

## 7. Results Aggregation and Comparison

Collect, aggregate, and visualize evaluation results for all datasets in comparative tables and charts.

In [None]:
# Display test set results
test_results = evaluation_results[evaluation_results['Split'] == 'test'].copy()

print("TEST SET PERFORMANCE SUMMARY")
print("="*80)
print(test_results.to_string(index=False))
print("="*80)

In [None]:
# Calculate best model for each dataset
best_models = test_results.loc[test_results.groupby('Dataset')['R2'].idxmax()]

print("\nBEST MODEL FOR EACH DATASET (by R² Score)")
print("="*80)
print(best_models[['Dataset', 'Model', 'R2', 'RMSE', 'MAE']].to_string(index=False))
print("="*80)

In [None]:
# Visualization 1: R² Score Comparison
fig = px.bar(
    test_results,
    x='Dataset',
    y='R2',
    color='Model',
    barmode='group',
    title='R² Score Comparison Across Datasets and Models',
    labels={'R2': 'R² Score', 'Dataset': 'Dataset'},
    height=500
)

fig.update_layout(
    xaxis_title="Dataset",
    yaxis_title="R² Score",
    legend_title="Model",
    font=dict(size=12)
)

fig.show()

In [None]:
# Visualization 2: RMSE Comparison
fig = px.bar(
    test_results,
    x='Dataset',
    y='RMSE',
    color='Model',
    barmode='group',
    title='RMSE Comparison Across Datasets and Models',
    labels={'RMSE': 'Root Mean Squared Error', 'Dataset': 'Dataset'},
    height=500
)

fig.update_layout(
    xaxis_title="Dataset",
    yaxis_title="RMSE",
    legend_title="Model",
    font=dict(size=12)
)

fig.show()

In [None]:
# Visualization 3: Heatmap of R² scores
pivot_r2 = test_results.pivot(index='Model', columns='Dataset', values='R2')

fig = px.imshow(
    pivot_r2,
    labels=dict(x="Dataset", y="Model", color="R² Score"),
    title="R² Score Heatmap: Models vs Datasets",
    aspect="auto",
    color_continuous_scale='RdYlGn',
    text_auto='.3f'
)

fig.update_layout(height=400)
fig.show()

In [None]:
# Visualization 4: Performance metrics summary
fig, axes = plt.subplots(2, 2, figsize=(15, 10))
fig.suptitle('Comprehensive Model Performance Comparison', fontsize=16, fontweight='bold')

metrics = ['R2', 'RMSE', 'MAE', 'MAPE']
titles = ['R² Score (Higher is Better)', 'RMSE (Lower is Better)', 
          'MAE (Lower is Better)', 'MAPE (Lower is Better)']

for idx, (metric, title) in enumerate(zip(metrics, titles)):
    ax = axes[idx // 2, idx % 2]
    
    pivot_data = test_results.pivot(index='Model', columns='Dataset', values=metric)
    pivot_data.plot(kind='bar', ax=ax, width=0.8)
    
    ax.set_title(title, fontweight='bold')
    ax.set_xlabel('Model')
    ax.set_ylabel(metric)
    ax.legend(title='Dataset', bbox_to_anchor=(1.05, 1), loc='upper left')
    ax.grid(axis='y', alpha=0.3)

plt.tight_layout()
plt.show()

In [None]:
# Summary statistics
print("\nOVERALL PERFORMANCE STATISTICS")
print("="*80)

for metric in ['R2', 'RMSE', 'MAE']:
    print(f"\n{metric}:")
    summary = test_results.groupby('Model')[metric].agg(['mean', 'std', 'min', 'max'])
    print(summary.to_string())
    print("-"*80)

In [None]:
# Export results to CSV
output_path = '/Users/davidzapata/Documents/GitHub/process_energy_demand_modelling/evaluation_results.csv'
evaluation_results.to_csv(output_path, index=False)
print(f"\n✓ Evaluation results exported to: {output_path}")

# Export best models summary
best_models_path = '/Users/davidzapata/Documents/GitHub/process_energy_demand_modelling/best_models_summary.csv'
best_models.to_csv(best_models_path, index=False)
print(f"✓ Best models summary exported to: {best_models_path}")

## Conclusion

This notebook provides a complete pipeline for:

1. **Preprocessing** multiple event log datasets with standardized formatting
2. **Process Discovery** using PM4Py's inductive miner
3. **Simulation** of process traces for model training
4. **Energy Modeling** using multiple machine learning algorithms
5. **Comprehensive Evaluation** across all datasets and models
6. **Results Visualization** for easy comparison and interpretation

### Key Findings:
- Compare model performance across different datasets
- Identify the best-performing model for each dataset
- Understand trade-offs between different modeling approaches
- Export results for further analysis and reporting

### Next Steps:
- Fine-tune hyperparameters for best-performing models
- Add additional datasets to the evaluation
- Experiment with advanced features engineering
- Deploy best models for production use

## Conclusion

This notebook provides a complete pipeline for:

1. **Preprocessing** multiple event log datasets with standardized formatting
2. **Process Discovery** using PM4Py's inductive miner
3. **Simulation** of process traces for model training
4. **Energy Modeling** using multiple machine learning algorithms
5. **Comprehensive Evaluation** across all datasets and models
6. **Results Visualization** for easy comparison and interpretation

### Key Findings:
- Compare model performance across different datasets
- Identify the best-performing model for each dataset
- Understand trade-offs between different modeling approaches
- Export results for further analysis and reporting

### Next Steps:
- Fine-tune hyperparameters for best-performing models
- Add additional datasets to the evaluation
- Experiment with advanced features engineering
- Deploy best models for production use