# 🚗 Complete Traffic Forecasting Pipeline - HCMC

**All-in-One Workflow**: Download → Preprocess → EDA → Feature Engineering → Model Training

---

## 📋 Pipeline Overview

| Step | Description | Optional |
|------|-------------|----------|
| 1️⃣ | **Configuration** - Set pipeline options | ✗ Required |
| 2️⃣ | **Download Data** - Get latest from VM | ✓ Skippable |
| 3️⃣ | **Explore Data** - Preview raw data | ✓ Skippable |
| 4️⃣ | **Preprocess** - Convert & add features | ✓ Skippable if `USE_PREPROCESSED = True` |
| 5️⃣ | **Comprehensive EDA** - Deep analysis | ✓ Skippable |
| 6️⃣ | **Feature Engineering** - ML features | ✗ Required |
| 7️⃣ | **Model Training** - Train & compare | ✗ Required |
| 8️⃣ | **Save Results** - Export models | ✓ Skippable |

---

**Project:** DSP391m Traffic Forecasting - Ho Chi Minh City  
**VM:** traffic-forecast-collector (GCP asia-southeast1-a)  
**Coverage:** 64 intersections, 144 road segments, 4096 m radius

---

### 💡 Quick Start Tips:

- **First time?** Run all cells with default config
- **Have data already?** Set `USE_EXISTING_DATA = True`
- **Quick test?** Disable the EDA steps and set `QUICK_MODE = True`
- **Production?** Enable all steps for comprehensive analysis

---
## Step 1️⃣: Pipeline Configuration

**Configure which steps to run** - Customize the pipeline to your needs.
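
The dependency rule that the configuration cell enforces (model training needs engineered features) can also be written as a small pure function, which is easier to test outside the notebook. This is only a sketch: `resolve_flags` is a hypothetical helper, not part of the project, and the second rule about saving models is an assumption based on how the save step is gated later on.

```python
def resolve_flags(flags):
    """Return a copy of `flags` with dependent steps switched on, plus notes.

    Hypothetical helper: mirrors the rule that model training
    cannot run without feature engineering.
    """
    resolved = dict(flags)
    notes = []
    if resolved.get("ENABLE_MODEL_TRAINING") and not resolved.get("ENABLE_FEATURE_ENGINEERING"):
        resolved["ENABLE_FEATURE_ENGINEERING"] = True
        notes.append("ENABLE_FEATURE_ENGINEERING forced on (required by training)")
    if resolved.get("ENABLE_SAVE_MODELS") and not resolved.get("ENABLE_MODEL_TRAINING"):
        notes.append("ENABLE_SAVE_MODELS has no effect without training")
    return resolved, notes


resolved, notes = resolve_flags({
    "ENABLE_FEATURE_ENGINEERING": False,
    "ENABLE_MODEL_TRAINING": True,
    "ENABLE_SAVE_MODELS": True,
})
print(resolved["ENABLE_FEATURE_ENGINEERING"])
print(notes)
```

Returning the notes instead of printing them keeps the function free of side effects, so the notebook can decide how to display them.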

In [None]:
# ═══════════════════════════════════════════════════════════════════
# 🎛️  PIPELINE CONFIGURATION
# ═══════════════════════════════════════════════════════════════════

# ─── Data Source Options ───────────────────────────────────────────
USE_EXISTING_DATA = False      # True: Use current data | False: Download new data
DOWNLOAD_LATEST_ONLY = True    # True: Latest run only | False: All available runs
USE_PREPROCESSED = False       # True: Load from Parquet | False: Process from JSON

# ─── Pipeline Steps Control ────────────────────────────────────────
ENABLE_DATA_EXPLORATION = True    # Show raw data preview
ENABLE_PREPROCESSING = True       # Convert JSON to Parquet
ENABLE_COMPREHENSIVE_EDA = True   # Full exploratory analysis (maps, charts)
ENABLE_FEATURE_ENGINEERING = True # Create ML features (always needed for training)
ENABLE_MODEL_TRAINING = True      # Train and compare models
ENABLE_SAVE_MODELS = True         # Save trained models to disk

# ─── Analysis Options ──────────────────────────────────────────────
QUICK_MODE = False             # True: Faster execution, less detail
SHOW_INTERACTIVE_MAPS = True   # Folium geographic visualizations
SHOW_PLOTLY_CHARTS = True      # Interactive Plotly charts
VERBOSE = True                 # Print detailed progress

# ─── Data Paths ────────────────────────────────────────────────────
DATA_DIR = '../data/runs'
PROCESSED_DIR = '../data/processed'
MODELS_DIR = '../traffic_forecast/models/saved'

# ═══════════════════════════════════════════════════════════════════

import warnings
warnings.filterwarnings('ignore')

print("✅ Configuration loaded!\n")
print("=" * 70)
print("📋 Pipeline Configuration:")
print("=" * 70)
print(f"🔹 Data Source:")
print(f"   {'✓' if USE_EXISTING_DATA else '✗'} Use existing data (skip download)")
print(f"   {'✓' if DOWNLOAD_LATEST_ONLY else '✗'} Download latest only")
print(f"   {'✓' if USE_PREPROCESSED else '✗'} Use preprocessed Parquet files")
print(f"\n🔹 Pipeline Steps:")
print(f"   {'✓' if ENABLE_DATA_EXPLORATION else '✗'} Data exploration")
print(f"   {'✓' if ENABLE_PREPROCESSING else '✗'} Preprocessing")
print(f"   {'✓' if ENABLE_COMPREHENSIVE_EDA else '✗'} Comprehensive EDA")
print(f"   {'✓' if ENABLE_FEATURE_ENGINEERING else '✗'} Feature engineering")
print(f"   {'✓' if ENABLE_MODEL_TRAINING else '✗'} Model training")
print(f"   {'✓' if ENABLE_SAVE_MODELS else '✗'} Save models")
print(f"\n🔹 Analysis Options:")
print(f"   {'✓' if QUICK_MODE else '✗'} Quick mode (faster)")
print(f"   {'✓' if SHOW_INTERACTIVE_MAPS else '✗'} Interactive maps")
print(f"   {'✓' if SHOW_PLOTLY_CHARTS else '✗'} Plotly charts")
print(f"   {'✓' if VERBOSE else '✗'} Verbose output")
print("=" * 70)

# Validate configuration
if ENABLE_MODEL_TRAINING and not ENABLE_FEATURE_ENGINEERING:
    print("\n⚠️  WARNING: Model training requires feature engineering!")
    print("   Automatically enabling ENABLE_FEATURE_ENGINEERING")
    ENABLE_FEATURE_ENGINEERING = True

if USE_PREPROCESSED and not ENABLE_PREPROCESSING:
    print("\n💡 TIP: Using preprocessed data, skipping preprocessing step")

print("\n✨ Ready to run! Execute cells below to start the pipeline.")

---
## Step 2️⃣: Data Source Selection

Choose your data source based on the configuration above.

### Option A: Download Latest Data from VM

The cell below downloads only when `USE_EXISTING_DATA = False`; otherwise it prints a skip notice, so it is safe to run either way.
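
The download step shells out to a script and inspects the exit code. A slightly more defensive version of that pattern, with a timeout and handling for a missing executable, might look like the sketch below. `run_step` and the 600-second default are assumptions made for illustration, and a Python one-liner stands in for the real download script so the example runs anywhere.

```python
import subprocess
import sys


def run_step(cmd, cwd=None, timeout_s=600):
    """Run a command and return (ok, stdout, stderr).

    Hypothetical helper; the timeout default is an arbitrary assumption.
    """
    try:
        result = subprocess.run(
            cmd, capture_output=True, text=True, cwd=cwd, timeout=timeout_s
        )
    except subprocess.TimeoutExpired:
        return False, "", f"timed out after {timeout_s}s"
    except FileNotFoundError as exc:
        # Raised when the executable itself cannot be found
        return False, "", str(exc)
    return result.returncode == 0, result.stdout, result.stderr


# Stand-in command; in the notebook this would be
# ['bash', 'scripts/data/download_latest.sh'] with cwd='..'.
ok, out, err = run_step([sys.executable, "-c", "print('downloaded')"])
print(ok, out.strip())
```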

In [None]:
if not USE_EXISTING_DATA:
    import subprocess
    import os
    from datetime import datetime
    
    print("🔽 Downloading latest data from VM...\n")
    print("=" * 70)
    
    # Run download script
    result = subprocess.run(
        ['bash', 'scripts/data/download_latest.sh'],
        capture_output=True,
        text=True,
        cwd='..'  # Run from project root
    )
    
    print(result.stdout)
    if result.returncode == 0:
        print("\n✅ Download completed successfully!")
    else:
        print("\n❌ Download failed!")
        print(result.stderr)
else:
    print("⏭️  Skipping download - Using existing data")
    print("   Set USE_EXISTING_DATA = False to download new data")

### Option B: Use Existing Data

When `USE_EXISTING_DATA = True`, the pipeline reads whatever is already in `data/runs/` (the `DATA_DIR` setting).

---
## Step 3️⃣: Data Exploration

**Preview and validate** the data we'll be working with.
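
The exploration cell reads the collection time from run directory names of the form `run_YYYYMMDD_HHMMSS`. That parsing step can be isolated as shown in this sketch; `parse_run_time` is a hypothetical helper and the directory names are made up.

```python
from datetime import datetime


def parse_run_time(run_name, prefix="run_", fmt="%Y%m%d_%H%M%S"):
    """Return the datetime encoded in a run directory name, or None."""
    if not run_name.startswith(prefix):
        return None
    try:
        return datetime.strptime(run_name[len(prefix):], fmt)
    except ValueError:
        # The suffix does not match the expected timestamp layout
        return None


names = ["run_20251025_143000", "run_badstamp", "scratch"]  # made-up names
for name in names:
    parsed = parse_run_time(name)
    print(name, "->", parsed.isoformat(sep=" ") if parsed else "unparsed")
```

Catching only `ValueError` keeps unrelated failures, such as a keyboard interrupt, from being silently swallowed.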

In [None]:
if ENABLE_DATA_EXPLORATION:
    import os
    import json
    import pandas as pd
    from pathlib import Path
    from datetime import datetime
    
    # Find all runs
    data_dir = Path(DATA_DIR)
    run_dirs = sorted([d for d in data_dir.iterdir() if d.is_dir()], reverse=True)
    
    print(f"📊 Found {len(run_dirs)} collection runs\n")
    print("=" * 80)
    
    # Display runs table
    runs_info = []
    for run_dir in run_dirs[:10]:  # Show latest 10
        files = list(run_dir.glob('*.json'))
        total_size = sum(f.stat().st_size for f in files)
        
        # Parse timestamp from run name
        run_name = run_dir.name
        if run_name.startswith('run_'):
            timestamp_str = run_name[4:]  # Remove 'run_' prefix
            try:
                run_time = datetime.strptime(timestamp_str, '%Y%m%d_%H%M%S')
                time_display = run_time.strftime('%Y-%m-%d %H:%M:%S')
            except ValueError:
                time_display = timestamp_str
        else:
            time_display = run_name
        
        runs_info.append({
            'Run': run_name,
            'Time': time_display,
            'Files': len(files),
            'Size (KB)': total_size // 1024
        })
    
    df_runs = pd.DataFrame(runs_info)
    print(df_runs.to_string(index=False))
    print("=" * 80)
    print(f"\n💾 Total size of runs shown: {df_runs['Size (KB)'].sum():,} KB")
else:
    print("⏭️  Skipping data exploration")
    print("   Set ENABLE_DATA_EXPLORATION = True to view data details")

### 🔍 Inspect Latest Run Details

In [None]:
if ENABLE_DATA_EXPLORATION:
    # Load latest run
    latest_run = run_dirs[0]
    print(f"📁 Latest Run: {latest_run.name}\n")
    print("=" * 80)
    
    # Load all JSON files
    files_content = {}
    for json_file in latest_run.glob('*.json'):
        with open(json_file, 'r', encoding='utf-8') as f:
            files_content[json_file.stem] = json.load(f)
        print(f"✓ Loaded: {json_file.name}")
    
    # Display summary
    print("\n📊 Data Summary:")
    print("=" * 80)
    
    if 'nodes' in files_content:
        nodes = files_content['nodes']
        print(f"🔷 Nodes (Intersections): {len(nodes)}")
        if nodes:
            print(f"   Sample: {nodes[0].get('name', 'N/A')}")
            print(f"   Location: ({nodes[0]['lat']:.6f}, {nodes[0]['lon']:.6f})")
    
    if 'edges' in files_content:
        edges = files_content['edges']
        print(f"\n🔶 Edges (Road Segments): {len(edges)}")
        if edges:
            print(f"   Sample: {edges[0]['from_node']} → {edges[0]['to_node']}")
    
    if 'traffic_edges' in files_content:
        traffic = files_content['traffic_edges']
        print(f"\n🚦 Traffic Data: {len(traffic)} records")
        if traffic:
            sample = traffic[0]
            print(f"   Speed: {sample.get('speed_kmh', 'N/A')} km/h")
            print(f"   Duration: {sample.get('duration_sec', 'N/A')} sec")
            print(f"   Distance: {sample.get('distance_km', 'N/A')} km")
            print(f"   Timestamp: {sample.get('timestamp', 'N/A')}")
    
    if 'weather_snapshot' in files_content:
        weather = files_content['weather_snapshot']
        print(f"\n🌤️  Weather Data: {len(weather)} node records")
        if weather:
            sample = weather[0]
            print(f"   Temperature: {sample.get('temperature_c', 'N/A')}°C")
            print(f"   Wind Speed: {sample.get('wind_speed_kmh', 'N/A')} km/h")
            print(f"   Precipitation: {sample.get('precipitation_mm', 'N/A')} mm")
    
    if 'statistics' in files_content:
        stats = files_content['statistics']
        print(f"\n📈 Network Statistics:")
        print(f"   Total Nodes: {stats.get('total_nodes', 'N/A')}")
        print(f"   Total Edges: {stats.get('total_edges', 'N/A')}")
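        avg_degree = stats.get('avg_degree')
        # Format only numeric values; ':.2f' raises ValueError on an 'N/A' string
        if isinstance(avg_degree, (int, float)):
            print(f"   Avg Degree: {avg_degree:.2f}")
        else:
            print("   Avg Degree: N/A")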
    
    print("\n" + "=" * 80)
else:
    print("⏭️  Skipping detailed exploration")

---
## Step 4️⃣: Data Preprocessing

**Convert JSON → Parquet** for 10x faster loading + add derived features.
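
One of the derived features is a congestion label based on speed thresholds of 15, 25 and 35 km/h. The same mapping can be expressed as a table lookup, which keeps the thresholds in one place. This is a dependency-free sketch using `bisect`; the names `BOUNDS`, `LABELS` and `congestion_level` are introduced here for illustration.

```python
from bisect import bisect_right

# Upper bounds in km/h (exclusive) and the label for each resulting bucket
BOUNDS = [15, 25, 35]
LABELS = ["heavy", "moderate", "light", "free_flow"]


def congestion_level(speed_kmh):
    """Map a speed to a congestion label.

    A speed equal to a bound falls into the faster bucket,
    matching a chain of strict `<` comparisons.
    """
    return LABELS[bisect_right(BOUNDS, speed_kmh)]


for speed in (8.0, 15.0, 24.9, 35.0, 60.0):
    print(speed, congestion_level(speed))
```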

In [None]:
if ENABLE_PREPROCESSING and not USE_PREPROCESSED:
    import sys
    sys.path.insert(0, '..')  # Add project root to path
    
    from pathlib import Path
    import pandas as pd
    import numpy as np
    import json
    
    print("⚙️  Preprocessing data...\n")
    print("=" * 80)
    
    # Create preprocessor class (inline for notebook)
    class SimplePreprocessor:
        def __init__(self, data_dir=DATA_DIR, output_dir=PROCESSED_DIR):
            self.data_dir = Path(data_dir)
            self.output_dir = Path(output_dir)
            self.output_dir.mkdir(parents=True, exist_ok=True)
        
        def process_run(self, run_dir):
            """Process a single run directory"""
            run_name = run_dir.name
            if VERBOSE:
                print(f"📦 Processing: {run_name}")
            
            # Load JSON files (UTF-8, matching the exploration cell)
            with open(run_dir / 'traffic_edges.json', 'r', encoding='utf-8') as f:
                traffic_data = json.load(f)
            
            with open(run_dir / 'weather_snapshot.json', 'r', encoding='utf-8') as f:
                weather_data = json.load(f)
            
            with open(run_dir / 'nodes.json', 'r', encoding='utf-8') as f:
                nodes_data = json.load(f)
            
            # Convert to DataFrames
            df_traffic = pd.DataFrame(traffic_data)
            df_weather = pd.DataFrame(weather_data)
            df_nodes = pd.DataFrame(nodes_data)
            
            # Parse timestamp
            df_traffic['timestamp'] = pd.to_datetime(df_traffic['timestamp'])
            
            # Add time-based features
            df_traffic['hour'] = df_traffic['timestamp'].dt.hour
            df_traffic['minute'] = df_traffic['timestamp'].dt.minute
            df_traffic['day_of_week'] = df_traffic['timestamp'].dt.dayofweek
            df_traffic['day_name'] = df_traffic['timestamp'].dt.day_name()
            df_traffic['is_weekend'] = df_traffic['day_of_week'].isin([5, 6])
            
            # Add congestion levels based on speed
            def categorize_congestion(speed):
                if speed < 15:
                    return 'heavy'
                elif speed < 25:
                    return 'moderate'
                elif speed < 35:
                    return 'light'
                else:
                    return 'free_flow'
            
            df_traffic['congestion_level'] = df_traffic['speed_kmh'].apply(categorize_congestion)
            
            # Add speed categories
            df_traffic['speed_category'] = pd.cut(
                df_traffic['speed_kmh'],
                bins=[0, 10, 20, 30, 40, 100],
                labels=['very_slow', 'slow', 'moderate', 'fast', 'very_fast'],
                include_lowest=True  # keep a speed of exactly 0 km/h in the first bin
            )
            
            # Merge weather data (average across all nodes)
            weather_avg = df_weather.groupby('node_id').first().reset_index()
            avg_temp = weather_avg['temperature_c'].mean()
            avg_wind = weather_avg['wind_speed_kmh'].mean()
            avg_precip = weather_avg['precipitation_mm'].mean()
            
            df_traffic['temperature_c'] = avg_temp
            df_traffic['wind_speed_kmh'] = avg_wind
            df_traffic['precipitation_mm'] = avg_precip
            
            # Add run metadata
            df_traffic['run_name'] = run_name
            df_traffic['collection_time'] = df_traffic['timestamp'].iloc[0]
            
            # Save to Parquet
            output_file = self.output_dir / f"{run_name}.parquet"
            df_traffic.to_parquet(output_file, index=False)
            if VERBOSE:
                print(f"   ✓ Saved: {output_file.name} ({output_file.stat().st_size // 1024} KB)")
            
            return df_traffic
        
        def process_all(self, limit=None):
            """Process all runs"""
            run_dirs = sorted([d for d in self.data_dir.iterdir() if d.is_dir()], reverse=True)
            
            if limit:
                run_dirs = run_dirs[:limit]
            
            all_data = []
            for run_dir in run_dirs:
                try:
                    df = self.process_run(run_dir)
                    all_data.append(df)
                except Exception as e:
                    if VERBOSE:
                        print(f"   ⚠️  Error: {e}")
            
            # Combine all runs
            if all_data:
                df_combined = pd.concat(all_data, ignore_index=True)
                combined_file = self.output_dir / 'all_runs_combined.parquet'
                df_combined.to_parquet(combined_file, index=False)
                print(f"\n📊 Combined dataset: {combined_file.name}")
                print(f"   Records: {len(df_combined):,}")
                print(f"   Size: {combined_file.stat().st_size // 1024} KB")
                return df_combined
            
            return None
    
    # Run preprocessing
    preprocessor = SimplePreprocessor()
    df_all = preprocessor.process_all()
    
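    print("\n" + "=" * 80)
    if df_all is None:
        # process_all() returns None when no run could be processed
        print("❌ No runs were processed. Check DATA_DIR and any errors above.")
    else:
        print("✅ Preprocessing completed!")
        print(f"\n📈 Total records: {len(df_all):,}")
        print(f"📅 Date range: {df_all['timestamp'].min()} to {df_all['timestamp'].max()}")
        print(f"🚦 Avg speed: {df_all['speed_kmh'].mean():.2f} km/h")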

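elif USE_PREPROCESSED:
    # Imports repeated here so this branch works when earlier cells were skipped
    from pathlib import Path
    import pandas as pd
    import numpy as np
    
    # Load from preprocessed Parquet files
    print("⚡ Loading preprocessed data (fast)...\n")
    combined_file = Path(PROCESSED_DIR) / 'all_runs_combined.parquet'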
    if combined_file.exists():
        df_all = pd.read_parquet(combined_file)
        print(f"✅ Loaded {len(df_all):,} records from {combined_file.name}")
        print(f"📅 Date range: {df_all['timestamp'].min()} to {df_all['timestamp'].max()}")
        print(f"🚦 Avg speed: {df_all['speed_kmh'].mean():.2f} km/h")
    else:
        print(f"❌ Preprocessed file not found: {combined_file}")
        print("   Set ENABLE_PREPROCESSING = True to create it")
else:
    print("⏭️  Skipping preprocessing")
    print("   Set ENABLE_PREPROCESSING = True to process data")

### 🔍 Preview Preprocessed Data

In [None]:
print("📋 Dataset Info:\n")
df_all.info()  # info() prints directly and returns None, so it is not wrapped in print()

print("\n" + "=" * 80)
print("\n📊 First 5 rows:\n")
print(df_all.head())

print("\n" + "=" * 80)
print("\n📈 Statistical Summary:\n")
print(df_all[['speed_kmh', 'duration_sec', 'distance_km', 'temperature_c', 'wind_speed_kmh']].describe())

---
## Step 5️⃣: Comprehensive Exploratory Data Analysis

**Deep dive into patterns** - Geographic maps, correlations, time series.

In [None]:
if ENABLE_COMPREHENSIVE_EDA:
    print("📊 Running Comprehensive EDA...\n")
    print("=" * 70)
    
    # Import visualization libraries
    if SHOW_PLOTLY_CHARTS:
        import plotly.express as px
        import plotly.graph_objects as go
        from plotly.subplots import make_subplots
    
    if SHOW_INTERACTIVE_MAPS:
        import folium
        from folium.plugins import HeatMap
    
    print("✅ Visualization libraries loaded")
else:
    print("⏭️  Skipping comprehensive EDA")
    print("   Set ENABLE_COMPREHENSIVE_EDA = True for detailed analysis")

### 5a. Traffic Speed Distribution

In [None]:
if ENABLE_COMPREHENSIVE_EDA and SHOW_PLOTLY_CHARTS:
    # Traffic Speed Analysis
    fig = make_subplots(
        rows=2, cols=2,
        subplot_titles=('Speed Distribution', 'Speed Box Plot', 
                        'Duration Distribution', 'Speed vs Distance'),
        specs=[[{'type': 'histogram'}, {'type': 'box'}],
               [{'type': 'histogram'}, {'type': 'scatter'}]]
    )
    
    # Speed Histogram
    fig.add_trace(
        go.Histogram(x=df_all['speed_kmh'], nbinsx=30, name='Speed',
                    marker_color='steelblue', showlegend=False),
        row=1, col=1
    )
    
    # Speed Box Plot
    fig.add_trace(
        go.Box(y=df_all['speed_kmh'], name='Speed',
               marker_color='steelblue', showlegend=False),
        row=1, col=2
    )
    
    # Duration Histogram
    fig.add_trace(
        go.Histogram(x=df_all['duration_sec'], nbinsx=30, name='Duration',
                    marker_color='coral', showlegend=False),
        row=2, col=1
    )
    
    # Speed vs Distance Scatter
    fig.add_trace(
        go.Scatter(x=df_all['distance_km'], y=df_all['speed_kmh'],
                  mode='markers', name='Speed vs Distance',
                  marker=dict(color='green', size=6, opacity=0.4),
                  showlegend=False),
        row=2, col=2
    )
    
    fig.update_xaxes(title_text="Speed (km/h)", row=1, col=1)
    fig.update_xaxes(title_text="Speed (km/h)", row=1, col=2)
    fig.update_xaxes(title_text="Duration (seconds)", row=2, col=1)
    fig.update_xaxes(title_text="Distance (km)", row=2, col=2)
    fig.update_yaxes(title_text="Frequency", row=1, col=1)
    fig.update_yaxes(title_text="Speed (km/h)", row=1, col=2)
    fig.update_yaxes(title_text="Frequency", row=2, col=1)
    fig.update_yaxes(title_text="Speed (km/h)", row=2, col=2)
    
    fig.update_layout(height=700, title_text="🚗 Traffic Speed & Duration Analysis", showlegend=False)
    fig.show()
    
    # Print congestion statistics
    print("\n📊 Congestion Level Distribution:")
    print(df_all['congestion_level'].value_counts().sort_index())
    print(f"\n🚦 Average Speed: {df_all['speed_kmh'].mean():.2f} km/h")
    print(f"⏱️  Average Duration: {df_all['duration_sec'].mean():.2f} sec")

### 5b. Hourly Traffic Patterns
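
The core of this analysis is a group-by on the hour followed by picking the hours with the lowest mean speed. The sketch below shows that logic without pandas, on made-up observations; `slowest_hours` is a hypothetical helper.

```python
from collections import defaultdict
from statistics import mean


def slowest_hours(records, n=3):
    """Return the n hours with the lowest mean speed as (hour, mean) pairs.

    `records` is an iterable of (hour, speed_kmh) tuples.
    """
    by_hour = defaultdict(list)
    for hour, speed in records:
        by_hour[hour].append(speed)
    hourly_mean = {hour: mean(speeds) for hour, speeds in by_hour.items()}
    return sorted(hourly_mean.items(), key=lambda item: item[1])[:n]


# Made-up observations for illustration only
sample = [(7, 18.0), (7, 22.0), (12, 30.0), (17, 14.0), (17, 16.0), (22, 38.0)]
for hour, avg in slowest_hours(sample, n=2):
    print(f"{hour:02d}:00 avg {avg:.1f} km/h")
```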

In [None]:
if ENABLE_COMPREHENSIVE_EDA:
    import matplotlib.pyplot as plt
    import seaborn as sns
    
    # Hourly patterns
    hourly_speed = df_all.groupby('hour')['speed_kmh'].agg(['mean', 'std', 'count']).reset_index()
    
    fig, axes = plt.subplots(1, 2, figsize=(16, 5))
    fig.suptitle('📈 Hourly Traffic Patterns', fontsize=16, fontweight='bold')
    
    # Average speed by hour
    axes[0].plot(hourly_speed['hour'], hourly_speed['mean'], marker='o', linewidth=2, markersize=8, color='steelblue')
    axes[0].fill_between(
        hourly_speed['hour'],
        hourly_speed['mean'] - hourly_speed['std'],
        hourly_speed['mean'] + hourly_speed['std'],
        alpha=0.3,
        color='steelblue'
    )
    axes[0].set_xlabel('Hour of Day', fontsize=12)
    axes[0].set_ylabel('Average Speed (km/h)', fontsize=12)
    axes[0].set_title('Average Speed by Hour (with std deviation)')
    axes[0].set_xticks(range(0, 24, 2))
    axes[0].grid(True, alpha=0.3)
    axes[0].axhline(y=df_all['speed_kmh'].mean(), color='red', linestyle='--', label='Overall Average')
    axes[0].legend()
    
    # Traffic volume by hour
    axes[1].bar(hourly_speed['hour'], hourly_speed['count'], color='coral', alpha=0.7, edgecolor='black')
    axes[1].set_xlabel('Hour of Day', fontsize=12)
    axes[1].set_ylabel('Number of Records', fontsize=12)
    axes[1].set_title('Traffic Data Volume by Hour')
    axes[1].set_xticks(range(0, 24, 2))
    axes[1].grid(True, alpha=0.3, axis='y')
    
    plt.tight_layout()
    plt.show()
    
    # Peak hours identification
    peak_hours = hourly_speed.nsmallest(3, 'mean')[['hour', 'mean']]
    print("\n🔴 Slowest Hours (Peak Congestion):")
    for _, row in peak_hours.iterrows():
        print(f"   • Hour {int(row['hour']):02d}:00 - Avg Speed: {row['mean']:.2f} km/h")

---
## Step 6️⃣: Feature Engineering for ML

**Create advanced features** for machine learning models.
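
Two of the feature families deserve a closer look: cyclical encodings, which place hours on a circle so that 23:00 and 00:00 end up close together, and lag features, which give each row the speeds observed just before it on the same segment. The following dependency-free sketch illustrates both ideas; the helper names are hypothetical and `None` plays the role that `NaN` plays in pandas.

```python
import math


def cyclical(value, period):
    """Encode a periodic value as a (sin, cos) pair on the unit circle."""
    angle = 2 * math.pi * value / period
    return math.sin(angle), math.cos(angle)


def circle_distance(a, b, period=24):
    """Euclidean distance between two encoded values."""
    (s1, c1), (s2, c2) = cyclical(a, period), cyclical(b, period)
    return math.hypot(s1 - s2, c1 - c2)


def lag_features(speeds, lags=(1, 2, 3)):
    """For one segment's time-ordered speeds, return per-row lag values.

    Rows without enough history get None.
    """
    return [
        {f"speed_lag_{k}": speeds[i - k] if i >= k else None for k in lags}
        for i in range(len(speeds))
    ]


# 23:00 vs 00:00 is a short hop on the circle; 12:00 vs 00:00 is the far side
print(round(circle_distance(23, 0), 3), round(circle_distance(12, 0), 3))
print(lag_features([30.0, 28.0, 25.0, 27.0])[3])
```

Because every lag looks strictly backwards, none of these values contains the speed being predicted for the same row.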

In [None]:
if ENABLE_FEATURE_ENGINEERING:
    from sklearn.preprocessing import LabelEncoder
    import numpy as np  # imported here so the cell also works when preprocessing was skipped
    
    print("🔧 Engineering features...\n")
    print("=" * 70)
    
    # Sort by timestamp for lag features
    df_all = df_all.sort_values(['edge_id', 'timestamp']).reset_index(drop=True)
    
    # 1. Lag features (previous speeds)
    print("1️⃣  Creating lag features...")
    for lag in [1, 2, 3]:
        df_all[f'speed_lag_{lag}'] = df_all.groupby('edge_id')['speed_kmh'].shift(lag)
    print(f"   ✓ Added: speed_lag_1, speed_lag_2, speed_lag_3")
    
    # 2. Rolling statistics over the previous 3 observations
    #    shift(1) keeps the current speed (the prediction target) out of its own features
    print("\n2️⃣  Creating rolling statistics...")
    df_all['speed_rolling_mean_3'] = df_all.groupby('edge_id')['speed_kmh'].transform(
        lambda x: x.shift(1).rolling(window=3, min_periods=1).mean()
    )
    df_all['speed_rolling_std_3'] = df_all.groupby('edge_id')['speed_kmh'].transform(
        lambda x: x.shift(1).rolling(window=3, min_periods=1).std()
    )
    print(f"   ✓ Added: speed_rolling_mean_3, speed_rolling_std_3")
    
    # 3. Cyclical time features
    print("\n3️⃣  Creating cyclical time features...")
    df_all['hour_sin'] = np.sin(2 * np.pi * df_all['hour'] / 24)
    df_all['hour_cos'] = np.cos(2 * np.pi * df_all['hour'] / 24)
    df_all['day_of_week_sin'] = np.sin(2 * np.pi * df_all['day_of_week'] / 7)
    df_all['day_of_week_cos'] = np.cos(2 * np.pi * df_all['day_of_week'] / 7)
    print(f"   ✓ Added: hour_sin, hour_cos, day_of_week_sin, day_of_week_cos")
    
    # 4. Rush hour indicators
    print("\n4️⃣  Creating rush hour indicators...")
    df_all['is_morning_rush'] = df_all['hour'].isin([7, 8, 9]).astype(int)
    df_all['is_evening_rush'] = df_all['hour'].isin([17, 18, 19]).astype(int)
    df_all['is_rush_hour'] = (df_all['is_morning_rush'] | df_all['is_evening_rush']).astype(int)
    print(f"   ✓ Added: is_morning_rush, is_evening_rush, is_rush_hour")
    
    # 5. Encode categorical features
    print("\n5️⃣  Encoding categorical features...")
    le_congestion = LabelEncoder()
    df_all['congestion_level_encoded'] = le_congestion.fit_transform(df_all['congestion_level'])
    print(f"   ✓ Added: congestion_level_encoded")
    
    # Drop rows with NaN in any column (mainly the first rows of each segment, which lack lag history)
    df_features = df_all.dropna().copy()
    
    print("\n" + "=" * 70)
    print(f"✅ Feature engineering completed!")
    print(f"\n📊 Dataset shape: {df_features.shape}")
    print(f"📋 Total features: {df_features.shape[1]}")
    print(f"🚫 Dropped {len(df_all) - len(df_features)} rows with missing values (mostly lag history)")
else:
    print("⏭️  Skipping feature engineering")
    print("   ⚠️  WARNING: Required for model training!")
    df_features = df_all.copy()

### 📋 Feature List

In [None]:
print("📝 All Features:\n")
print("=" * 80)

feature_groups = {
    '🚦 Traffic Features': ['speed_kmh', 'duration_sec', 'distance_km', 'congestion_level', 'speed_category'],
    '⏱️  Time Features': ['hour', 'minute', 'day_of_week', 'day_name', 'is_weekend', 'hour_sin', 'hour_cos', 'day_of_week_sin', 'day_of_week_cos'],
    '🌤️  Weather Features': ['temperature_c', 'wind_speed_kmh', 'precipitation_mm'],
    '📊 Lag Features': ['speed_lag_1', 'speed_lag_2', 'speed_lag_3'],
    '📈 Rolling Features': ['speed_rolling_mean_3', 'speed_rolling_std_3'],
    '🚨 Rush Hour Features': ['is_morning_rush', 'is_evening_rush', 'is_rush_hour'],
    '🔢 Encoded Features': ['congestion_level_encoded']
}

for group_name, features in feature_groups.items():
    print(f"\n{group_name}:")
    for feat in features:
        if feat in df_features.columns:
            print(f"   ✓ {feat}")
        else:
            print(f"   ✗ {feat} (missing)")

print("\n" + "=" * 80)

---
## Step 7️⃣: Train & Compare Models

**Train multiple models** and compare their performance.
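
For forecasting, the test set should contain only observations that are later than everything in the training set. A positional cut gives that guarantee only if the rows are ordered by time at the moment of the cut, which matters when the data was previously sorted by segment to build lag features. A minimal sketch of a chronological split, with a hypothetical `time_based_split` helper and made-up rows:

```python
def time_based_split(rows, time_key="timestamp", train_fraction=0.8):
    """Split rows chronologically: earliest rows train, latest rows test."""
    if not 0 < train_fraction < 1:
        raise ValueError("train_fraction must be between 0 and 1")
    ordered = sorted(rows, key=lambda row: row[time_key])
    cut = int(len(ordered) * train_fraction)
    return ordered[:cut], ordered[cut:]


# Made-up rows from two segments, deliberately out of time order.
# ISO 8601 strings sort chronologically, so no parsing is needed here.
rows = [
    {"edge_id": "B", "timestamp": "2025-01-01T08:30", "speed_kmh": 21.0},
    {"edge_id": "A", "timestamp": "2025-01-01T08:00", "speed_kmh": 30.0},
    {"edge_id": "A", "timestamp": "2025-01-01T09:00", "speed_kmh": 26.0},
    {"edge_id": "B", "timestamp": "2025-01-01T08:00", "speed_kmh": 19.0},
    {"edge_id": "B", "timestamp": "2025-01-01T09:00", "speed_kmh": 23.0},
]
train, test = time_based_split(rows)
print(len(train), len(test))
print(max(r["timestamp"] for r in train) <= min(r["timestamp"] for r in test))
```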

In [None]:
if ENABLE_MODEL_TRAINING:
    from sklearn.model_selection import train_test_split
    
    print("✂️  Splitting data...\n")
    print("=" * 70)
    
    # Define features for modeling
    feature_columns = [
        # Time features
        'hour', 'day_of_week', 'is_weekend', 'hour_sin', 'hour_cos', 'day_of_week_sin', 'day_of_week_cos',
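        # Traffic features
        # NOTE: if speed_kmh was derived as distance_km / duration_sec, these two
        # columns together reveal the target; consider dropping duration_sec in that case.
        'distance_km', 'duration_sec',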
        # Weather features
        'temperature_c', 'wind_speed_kmh', 'precipitation_mm',
        # Lag features
        'speed_lag_1', 'speed_lag_2', 'speed_lag_3',
        # Rolling features
        'speed_rolling_mean_3', 'speed_rolling_std_3',
        # Rush hour
        'is_morning_rush', 'is_evening_rush', 'is_rush_hour'
    ]
    
    target_column = 'speed_kmh'
    
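    # Prepare X and y in chronological order.
    # df_features is sorted by edge_id for the lag features, so re-sort by timestamp;
    # otherwise the positional split below would separate segments, not time periods.
    df_model = df_features.sort_values('timestamp', kind='mergesort')
    X = df_model[feature_columns].copy()
    y = df_model[target_column].copy()
    
    # Time-based split (last 20% as test)
    split_idx = int(len(df_model) * 0.8)
    X_train = X.iloc[:split_idx]
    X_test = X.iloc[split_idx:]
    y_train = y.iloc[:split_idx]
    y_test = y.iloc[split_idx:]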
    
    print(f"📊 Dataset split (time-based):") 
    print(f"   Training set: {len(X_train):,} samples ({len(X_train)/len(X)*100:.1f}%)")
    print(f"   Test set: {len(X_test):,} samples ({len(X_test)/len(X)*100:.1f}%)")
    print(f"\n📋 Features: {len(feature_columns)}")
    print(f"🎯 Target: {target_column}")
else:
    print("⏭️  Skipping train-test split")
    print("   Set ENABLE_MODEL_TRAINING = True to train models")

### 🤖 Model Training
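
Each model is scored with MAE, RMSE and R². The sketch below spells out those three definitions in plain Python, with MAE as the mean absolute error, RMSE as the square root of the mean squared error, and R² as one minus the ratio of residual to total sum of squares. `regression_metrics` is a hypothetical helper, and returning `nan` for a constant target is a choice made for this sketch.

```python
import math


def regression_metrics(y_true, y_pred):
    """Compute MAE, RMSE and R² for two equal-length sequences."""
    n = len(y_true)
    if n == 0 or n != len(y_pred):
        raise ValueError("inputs must be non-empty and of equal length")
    errors = [t - p for t, p in zip(y_true, y_pred)]
    ss_res = sum(e * e for e in errors)
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return {
        "mae": sum(abs(e) for e in errors) / n,
        "rmse": math.sqrt(ss_res / n),
        # R² is undefined when the target never varies
        "r2": 1 - ss_res / ss_tot if ss_tot else float("nan"),
    }


# Made-up speeds in km/h
metrics = regression_metrics([20.0, 30.0, 40.0], [22.0, 28.0, 41.0])
print({name: round(value, 3) for name, value in metrics.items()})
```

R² can be negative when a model does worse than always predicting the mean, which is worth remembering when reading the comparison table.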

In [None]:
if ENABLE_MODEL_TRAINING:
    from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
    from sklearn.linear_model import LinearRegression, Ridge
    from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
    import time
    import numpy as np  # np.sqrt is used for RMSE below
    
    print("🤖 Training models...\n")
    print("=" * 70)
    
    # Define models
    if QUICK_MODE:
        models = {
            'Linear Regression': LinearRegression(),
            'Random Forest': RandomForestRegressor(n_estimators=50, max_depth=10, random_state=42, n_jobs=-1),
        }
        print("⚡ Quick Mode: Training 2 fast models\n")
    else:
        models = {
            'Linear Regression': LinearRegression(),
            'Ridge Regression': Ridge(alpha=1.0),
            'Random Forest': RandomForestRegressor(n_estimators=100, max_depth=10, random_state=42, n_jobs=-1),
            'Gradient Boosting': GradientBoostingRegressor(n_estimators=100, max_depth=5, random_state=42)
        }
        print("🔬 Full Mode: Training 4 models\n")
    
    results = []
    
    for model_name, model in models.items():
        print(f"🔧 Training: {model_name}")
        
        # Train
        start_time = time.time()
        model.fit(X_train, y_train)
        train_time = time.time() - start_time
        
        # Predict
        y_pred = model.predict(X_test)
        
        # Evaluate
        mae = mean_absolute_error(y_test, y_pred)
        rmse = np.sqrt(mean_squared_error(y_test, y_pred))
        r2 = r2_score(y_test, y_pred)
        
        results.append({
            'Model': model_name,
            'MAE': mae,
            'RMSE': rmse,
            'R²': r2,
            'Train Time (s)': train_time
        })
        
        print(f"   ✓ MAE: {mae:.3f} km/h")
        print(f"   ✓ RMSE: {rmse:.3f} km/h")
        print(f"   ✓ R²: {r2:.3f}")
        print(f"   ⏱️  Time: {train_time:.2f}s\n")
    
    print("=" * 70)
    print("✅ All models trained!")
else:
    print("⏭️  Skipping model training")
    print("   Set ENABLE_MODEL_TRAINING = True to train models")

### 📊 Model Comparison

In [None]:
if ENABLE_MODEL_TRAINING:
    # Create comparison table
    df_results = pd.DataFrame(results)
    df_results = df_results.sort_values('RMSE')
    
    print("\n🏆 Model Performance Comparison:\n")
    print("=" * 70)
    print(df_results.to_string(index=False))
    print("=" * 70)
    
    best_model_name = df_results.iloc[0]['Model']
    best_rmse = df_results.iloc[0]['RMSE']
    best_r2 = df_results.iloc[0]['R²']
    
    print(f"\n🥇 Best Model: {best_model_name}")
    print(f"   RMSE: {best_rmse:.3f} km/h")
    print(f"   R²: {best_r2:.3f}")

### 📈 Performance Visualization

In [None]:
if ENABLE_MODEL_TRAINING:
    import matplotlib.pyplot as plt  # imported here in case the EDA cells were skipped
    
    fig, axes = plt.subplots(1, 3, figsize=(18, 5))
    fig.suptitle('Model Performance Metrics', fontsize=16, fontweight='bold')
    
    # MAE comparison
    axes[0].barh(df_results['Model'], df_results['MAE'], color='skyblue')
    axes[0].set_xlabel('MAE (km/h) - Lower is Better')
    axes[0].set_title('Mean Absolute Error')
    axes[0].grid(True, alpha=0.3, axis='x')
    
    # RMSE comparison
    axes[1].barh(df_results['Model'], df_results['RMSE'], color='lightcoral')
    axes[1].set_xlabel('RMSE (km/h) - Lower is Better')
    axes[1].set_title('Root Mean Squared Error')
    axes[1].grid(True, alpha=0.3, axis='x')
    
    # R² comparison
    axes[2].barh(df_results['Model'], df_results['R²'], color='lightgreen')
    axes[2].set_xlabel('R² Score - Higher is Better')
    axes[2].set_title('R² Score')
    axes[2].set_xlim(min(0, df_results['R²'].min()), 1)  # R² can be negative; do not clip it
    axes[2].grid(True, alpha=0.3, axis='x')
    
    plt.tight_layout()
    plt.show()

---
## Step 8️⃣: Save Models & Results

**Export trained models** for deployment and future use.
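
Alongside the models, this step stores the list of feature columns. Whatever code later loads a model has to present the features in exactly that order. The sketch below shows one way a consumer might enforce this; `build_feature_vector` is a hypothetical helper, and the shortened feature list and sample row are made up.

```python
def build_feature_vector(row, feature_info):
    """Order a raw feature dict the way the models were trained.

    Raises KeyError listing any feature names missing from `row`.
    Extra keys in `row` are ignored.
    """
    columns = feature_info["feature_columns"]
    missing = [name for name in columns if name not in row]
    if missing:
        raise KeyError(f"missing features: {missing}")
    return [float(row[name]) for name in columns]


# Shortened, made-up feature list for illustration
feature_info = {
    "feature_columns": ["hour", "distance_km", "speed_lag_1"],
    "target_column": "speed_kmh",
}
row = {"speed_lag_1": 24.5, "hour": 8, "distance_km": 1.2, "extra": "ignored"}
print(build_feature_vector(row, feature_info))
```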

In [None]:
if ENABLE_SAVE_MODELS and ENABLE_MODEL_TRAINING:
    import joblib
    from pathlib import Path
    
    # Create models directory
    models_dir = Path(MODELS_DIR)
    models_dir.mkdir(parents=True, exist_ok=True)
    
    print("💾 Saving models...\n")
    print("=" * 70)
    
    # Save all models
    for model_name, model in models.items():
        filename = model_name.lower().replace(' ', '_') + '.pkl'
        filepath = models_dir / filename
        joblib.dump(model, filepath)
        print(f"✓ Saved: {filename} ({filepath.stat().st_size // 1024} KB)")
    
    # Save feature columns
    feature_info = {
        'feature_columns': feature_columns,
        'target_column': target_column,
        'num_features': len(feature_columns)
    }
    joblib.dump(feature_info, models_dir / 'feature_info.pkl')
    print(f"\n✓ Saved feature info")
    
    # Save results
    df_results.to_csv(models_dir / 'model_comparison.csv', index=False)
    print(f"✓ Saved model comparison")
    
    print("\n" + "=" * 70)
    print(f"✅ All artifacts saved to: {models_dir}")
elif not ENABLE_MODEL_TRAINING:
    print("⏭️  No models to save (training was skipped)")
else:
    print("⏭️  Skipping model saving")
    print("   Set ENABLE_SAVE_MODELS = True to save models")

---
## 🎉 Pipeline Complete!

**Summary of execution** and next steps.

In [None]:
print("╔" + "═" * 68 + "╗")
print("║" + " " * 20 + "🎉 PIPELINE EXECUTION COMPLETE!" + " " * 17 + "║")
print("╚" + "═" * 68 + "╝")
print()

# Summary
print("📊 EXECUTION SUMMARY")
print("=" * 70)

steps_summary = [
    ("Configuration", True, "✅ Loaded"),
    ("Download Data", not USE_EXISTING_DATA, "✅ Downloaded" if not USE_EXISTING_DATA else "⏭️  Skipped (used existing)"),
    ("Data Exploration", ENABLE_DATA_EXPLORATION, "✅ Explored" if ENABLE_DATA_EXPLORATION else "⏭️  Skipped"),
    ("Preprocessing", ENABLE_PREPROCESSING, "✅ Processed" if ENABLE_PREPROCESSING else "⏭️  Skipped"),
    ("Comprehensive EDA", ENABLE_COMPREHENSIVE_EDA, "✅ Analyzed" if ENABLE_COMPREHENSIVE_EDA else "⏭️  Skipped"),
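    # feature_columns only exists when the training cells ran, so guard the count
    ("Feature Engineering", ENABLE_FEATURE_ENGINEERING, (f"✅ Created {len(feature_columns)} features" if ENABLE_MODEL_TRAINING else "✅ Created") if ENABLE_FEATURE_ENGINEERING else "⏭️  Skipped"),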
    ("Model Training", ENABLE_MODEL_TRAINING, f"✅ Trained {len(models)} models" if ENABLE_MODEL_TRAINING else "⏭️  Skipped"),
    ("Save Models", ENABLE_SAVE_MODELS and ENABLE_MODEL_TRAINING, "✅ Saved" if ENABLE_SAVE_MODELS and ENABLE_MODEL_TRAINING else "⏭️  Skipped"),
]

for step_name, executed, status in steps_summary:
    print(f"• {step_name:.<25} {status}")

if ENABLE_MODEL_TRAINING:
    print("\n📈 BEST MODEL RESULTS")
    print("=" * 70)
    print(f"• Model: {best_model_name}")
    print(f"• RMSE: {best_rmse:.3f} km/h")
    print(f"• MAE: {df_results.iloc[0]['MAE']:.3f} km/h")
    print(f"• R² Score: {best_r2:.3f}")
    print(f"• Training Time: {df_results.iloc[0]['Train Time (s)']:.2f}s")

print("\n🚀 NEXT STEPS")
print("=" * 70)
print("1. 🔧 Hyperparameter Tuning - Optimize best model with GridSearch")
print("2. 🧠 Advanced Models - Try LSTM, GNN for temporal/spatial patterns")
print("3. 🎯 Ensemble Methods - Combine multiple models")
print("4. 🌐 Deploy API - Create endpoint for real-time predictions")
print("5. 📊 Dashboard - Build real-time monitoring dashboard")

print("\n💡 TO RE-RUN THIS PIPELINE")
print("=" * 70)
print("• Just 'Run All Cells' again!")
print("• Or adjust configuration in Step 1 and re-run")
print("• Set USE_EXISTING_DATA = False to download fresh data")
print("• Enable/disable steps as needed")

print("\n" + "=" * 70)
print("✨ Thank you for using the Traffic Forecasting Pipeline!")
print("=" * 70)