# ML Optimization Framework with Optuna

## Comprehensive Analysis and Demonstration

This notebook demonstrates the complete workflow of the ML optimization framework, including:

1. **Data Pipeline Setup** - Loading and preprocessing the Adult Income dataset
2. **Model Optimization** - Hyperparameter tuning for multiple ML models
3. **Advanced Features** - Multi-objective optimization, sampler comparison
4. **Visualization & Analysis** - Comprehensive result analysis and plotting
5. **Performance Comparison** - Model comparison and insights

---

## 1. Setup and Imports

First, let's import all necessary libraries and set up our environment.

In [None]:
# Standard library imports
import sys
import os
import warnings
import time
from pathlib import Path

# Add the project root (parent of the notebook directory) to the path so the `src` package is importable
sys.path.insert(0, str(Path.cwd().parent))

# Data science libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.graph_objects as go
import plotly.express as px

# ML libraries
from sklearn.metrics import classification_report, confusion_matrix

# Optuna
import optuna

# Our framework
from src.data.data_pipeline import DataPipeline
from src.models.random_forest_optimizer import RandomForestOptimizer
from src.models.xgboost_optimizer import XGBoostOptimizer
from src.models.lightgbm_optimizer import LightGBMOptimizer
from src.optimization.config import OptimizationConfig
from src.optimization.study_manager import StudyManager
from src.optimization.advanced_features import MultiObjectiveOptimizer
from src.visualization.plots import OptimizationPlotter

# Configuration (the filter below silences all warnings, including deprecation notices)
warnings.filterwarnings('ignore')
plt.style.use('seaborn-v0_8')
sns.set_palette("husl")

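# Random seed for reproducibility (np.random.seed covers NumPy's global RNG only;
# the framework classes below receive RANDOM_STATE explicitly)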
RANDOM_STATE = 42
np.random.seed(RANDOM_STATE)

print("✅ All imports successful!")
print(f"📊 Working directory: {Path.cwd()}")
print(f"🎲 Random state: {RANDOM_STATE}")

## 2. Data Pipeline

Let's start by setting up our data pipeline and exploring the Adult Income dataset.

In [None]:
# Initialize data pipeline
print("🔄 Initializing data pipeline...")
data_pipeline = DataPipeline(
    random_state=RANDOM_STATE,
    test_size=0.2,
    val_size=0.2
)

# Complete data preparation
print("⚙️ Preparing data...")
summary = data_pipeline.prepare_data()

print("\n✅ Data Preparation Summary:")
for key, value in summary.items():
    print(f"   • {key}: {value}")

# Get prepared data splits
X_train, X_val, y_train, y_val = data_pipeline.get_train_val_data()
X_test, y_test = data_pipeline.get_test_data()

print("\n📊 Data splits:")
print(f"   • Training: {X_train.shape[0]} samples")
print(f"   • Validation: {X_val.shape[0]} samples")
print(f"   • Test: {X_test.shape[0]} samples")
print(f"   • Features: {X_train.shape[1]}")
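For readers without access to `DataPipeline`, a three-way split like the one above can be approximated with two calls to scikit-learn's `train_test_split`. This is a sketch on synthetic data, not the pipeline's actual implementation. In particular, it assumes `val_size` is a fraction of the data remaining after the test split, which the notebook does not confirm.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the preprocessed dataset (illustrative only)
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(1000, 5))
y_demo = rng.integers(0, 2, size=1000)

# Step 1: hold out 20% of all samples as the test set
X_rest, X_te, y_rest, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

# Step 2: hold out 20% of the remainder as the validation set (assumed meaning of val_size)
X_tr, X_va, y_tr, y_va = train_test_split(
    X_rest, y_rest, test_size=0.2, stratify=y_rest, random_state=42
)

print(len(X_tr), len(X_va), len(X_te))
```

Stratifying both splits keeps the class ratio similar across the three sets, which matters for an imbalanced target such as income bracket.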

## 3. Single Model Optimization

We begin model tuning with a single Random Forest classifier, scored by 5-fold cross-validated accuracy.

In [None]:
# Initialize Random Forest optimizer
print("🌲 Initializing Random Forest optimizer...")
rf_optimizer = RandomForestOptimizer(
    random_state=RANDOM_STATE,
    cv_folds=5,
    scoring_metric='accuracy',
    verbose=True
)

# Run optimization
print("🚀 Starting Random Forest optimization...")
start_time = time.time()

rf_study = rf_optimizer.optimize(
    X_train, X_val, y_train, y_val,
    n_trials=50  # Reduced for demo
)

optimization_time = time.time() - start_time

print("\n✅ Random Forest optimization completed!")
print(f"⏱️ Time taken: {optimization_time:.2f} seconds")
print(f"🎯 Best CV score: {rf_study.best_value:.4f}")
print(f"⚙️ Best parameters: {rf_study.best_params}")

In [None]:
# Evaluate on test set
print("📊 Evaluating Random Forest on test set...")
rf_test_metrics = rf_optimizer.evaluate(X_test, y_test)

print("\n🎯 Random Forest Test Results:")
for metric, value in rf_test_metrics.items():
    print(f"   • {metric}: {value:.4f}")

# Get feature importance
rf_importance = rf_optimizer.analyze_feature_importance()
print("\n🔍 Feature importance analysis completed")
print(f"   • Mean importance: {rf_importance['mean_importance']:.4f}")
print(f"   • Top 5 features: {rf_importance['top_features_indices'][:5]}")
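The keys `mean_importance` and `top_features_indices` come from the framework. How they are computed is not shown, so the following is a guess at an equivalent summary built from scikit-learn's `feature_importances_` attribute. The helper `summarize_importance` and the synthetic data are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier


def summarize_importance(model, top_k=5):
    """Summarize impurity-based importances of a fitted tree ensemble."""
    importances = model.feature_importances_
    # Indices of features sorted from most to least important
    order = np.argsort(importances)[::-1]
    return {
        "mean_importance": float(importances.mean()),
        "top_features_indices": order[:top_k].tolist(),
    }


X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=42)
forest = RandomForestClassifier(n_estimators=50, random_state=42).fit(X_demo, y_demo)

importance_summary = summarize_importance(forest)
print(importance_summary)
```

Note that scikit-learn normalizes impurity-based importances to sum to 1, so under this assumed definition the mean is simply 1 divided by the number of features. The ranking of the top indices is the more informative part of the summary.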

## 4. Multi-Model Comparison

Next, we tune Random Forest, XGBoost, and LightGBM with the same trial budget and compare their cross-validation and test-set performance.

In [None]:
# Initialize optimizers for all models
print("🔧 Initializing all model optimizers...")

optimizers = {
    'Random Forest': RandomForestOptimizer(
        random_state=RANDOM_STATE,
        cv_folds=5,
        verbose=False
    ),
    'XGBoost': XGBoostOptimizer(
        random_state=RANDOM_STATE,
        cv_folds=5,
        early_stopping_rounds=10,
        verbose=False
    ),
    'LightGBM': LightGBMOptimizer(
        random_state=RANDOM_STATE,
        cv_folds=5,
        early_stopping_rounds=10,
        verbose=False
    )
}

print(f"✅ Initialized {len(optimizers)} optimizers")

In [None]:
# Run optimization for all models
results = {}
n_trials = 30  # Reduced for demo

for model_name, optimizer in optimizers.items():
    print(f"\n🚀 Optimizing {model_name}...")
    start_time = time.time()
    
    study = optimizer.optimize(
        X_train, X_val, y_train, y_val,
        n_trials=n_trials
    )
    
    optimization_time = time.time() - start_time
    test_metrics = optimizer.evaluate(X_test, y_test)
    
    results[model_name] = {
        'optimizer': optimizer,
        'study': study,
        'best_cv_score': study.best_value,
        'best_params': study.best_params,
        'test_metrics': test_metrics,
        'optimization_time': optimization_time
    }
    
    print(f"   ✅ {model_name}: CV={study.best_value:.4f}, Test={test_metrics['accuracy']:.4f}")

print("\n🎉 All model optimizations completed!")
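With several models in `results`, a flat table is easier to compare than nested dictionaries. The sketch below flattens a dictionary with the same keys as the one built in the loop above into a pandas DataFrame sorted by test accuracy. The helper `results_to_frame` is hypothetical, and the numbers in `placeholder_results` are made-up stand-ins so the block runs on its own; they are not outputs of this notebook.

```python
import pandas as pd


def results_to_frame(results):
    """Flatten per-model result dictionaries into one comparison table."""
    rows = []
    for model_name, entry in results.items():
        rows.append(
            {
                "model": model_name,
                "best_cv_score": entry["best_cv_score"],
                "test_accuracy": entry["test_metrics"]["accuracy"],
                "optimization_time_s": entry["optimization_time"],
            }
        )
    table = pd.DataFrame(rows)
    # Best test accuracy first
    return table.sort_values("test_accuracy", ascending=False).reset_index(drop=True)


# Placeholder values for illustration only (not real results)
placeholder_results = {
    "model_a": {"best_cv_score": 0.80, "test_metrics": {"accuracy": 0.79}, "optimization_time": 12.0},
    "model_b": {"best_cv_score": 0.82, "test_metrics": {"accuracy": 0.81}, "optimization_time": 30.0},
}

comparison_table = results_to_frame(placeholder_results)
print(comparison_table)
```

In the notebook itself, calling `results_to_frame(results)` should work as long as each `test_metrics` dictionary contains an `accuracy` key, as the loop's print statement assumes.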