# ML Engine Tutorial

This notebook demonstrates the custom ML engine capabilities of the Magik Merlin ML Platform.

## What You'll Learn

1. 🤖 **AutoML Pipeline** - Automated model comparison
2. ⚙️ **Hyperparameter Optimization** - Using Optuna for tuning
3. 📊 **Model Evaluation** - Performance metrics and comparison
4. 🔍 **Feature Importance** - Understanding model decisions
5. 🎯 **Individual Models** - Using specific models directly

## Requirements

```bash
uv sync --extra ml
# OR
pip install xgboost lightgbm catboost optuna scikit-learn pandas numpy matplotlib seaborn
```

In [None]:
# Setup
import sys
from pathlib import Path

# Add src to path
sys.path.insert(0, str(Path.cwd().parent / "src"))

import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# ML Engine imports
from core.ml_engine import (
    AutoMLPipeline,
    XGBoostClassifier,
    model_registry,
)

# Set style
sns.set_style("whitegrid")
plt.rcParams["figure.figsize"] = (10, 6)

print("✅ Imports successful!")

## 1. Model Registry

The ML engine includes a centralized model registry for discovering available models.
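
As a rough illustration of the registry idea, here is a minimal sketch of a name-to-class registry with category filtering. It is not the platform's implementation: the `ModelRegistry` class, its `register` decorator, the `create` helper, and the toy model classes are all hypothetical. Only the `list_models(category=...)` call shape is taken from this notebook.

```python
from collections import defaultdict


class ModelRegistry:
    """Hypothetical registry mapping model names to classes, grouped by category."""

    def __init__(self):
        self._models = {}
        self._by_category = defaultdict(list)

    def register(self, name, category):
        """Return a class decorator that records the class under `name`."""

        def decorator(cls):
            if name in self._models:
                raise ValueError(f"Model {name!r} is already registered")
            self._models[name] = cls
            self._by_category[category].append(name)
            return cls

        return decorator

    def list_models(self, category=None):
        """List all registered names, or only those in one category."""
        if category is None:
            return sorted(self._models)
        return sorted(self._by_category.get(category, []))

    def create(self, name, **params):
        """Instantiate a registered model with the given parameters."""
        return self._models[name](**params)


registry = ModelRegistry()


@registry.register("toy_classifier", category="classification")
class ToyClassifier:  # placeholder class, assumed for illustration only
    def __init__(self, depth=3):
        self.depth = depth


@registry.register("toy_regressor", category="regression")
class ToyRegressor:  # placeholder class, assumed for illustration only
    def __init__(self, alpha=1.0):
        self.alpha = alpha


print(registry.list_models())                           # ['toy_classifier', 'toy_regressor']
print(registry.list_models(category="classification"))  # ['toy_classifier']
model = registry.create("toy_classifier", depth=5)
```

Keeping the category index separate from the name index makes filtered listing cheap, and the decorator form lets each model module register itself at import time.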

In [None]:
# List all available models
all_models = model_registry.list_models()
print("üìã All available models:")
for model in all_models:
    print(f"   ‚Ä¢ {model}")

# List by category
print("\nüéØ Classification models:")
for model in model_registry.list_models(category="classification"):
    print(f"   ‚Ä¢ {model}")

print("\nüìà Regression models:")
for model in model_registry.list_models(category="regression"):
    print(f"   ‚Ä¢ {model}")

## 2. Generate Sample Data

Let's create a binary classification dataset for demonstration.

In [None]:
# Generate synthetic dataset
X, y = make_classification(
    n_samples=1000,
    n_features=20,
    n_informative=15,
    n_redundant=3,
    n_classes=2,
    random_state=42,
    flip_y=0.1,
)

# Convert to DataFrame
feature_names = [f"feature_{i}" for i in range(X.shape[1])]
X_df = pd.DataFrame(X, columns=feature_names)
y_series = pd.Series(y, name="target")

# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X_df, y_series, test_size=0.2, random_state=42, stratify=y
)

print("üìä Dataset Information:")
print(f"   Total samples: {len(X_df)}")
print(f"   Features: {X_df.shape[1]}")
print(f"   Training samples: {len(X_train)}")
print(f"   Test samples: {len(X_test)}")
print("\n   Class distribution:")
print(y_series.value_counts())

## 3. AutoML Pipeline - Model Comparison

The AutoML pipeline automatically compares multiple models using cross-validation.
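
To make the comparison step concrete, the following sketch reproduces the general idea with plain scikit-learn: cross-validate each candidate, score it on an internal holdout, and rank the results. It is an approximation under stated assumptions, not what `AutoMLPipeline.compare_models` actually does internally. The candidate models and the `cv_std` column are chosen for illustration; the `model`, `cv_mean`, and `test_score` column names mirror the ones plotted in this notebook.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Small synthetic dataset so the sketch runs quickly
X, y = make_classification(n_samples=400, n_features=10, random_state=0)

# Internal holdout, playing the role of a `test_size=0.2` split
X_fit, X_hold, y_fit, y_hold = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# Candidate models (an arbitrary selection for illustration)
candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "decision_tree": DecisionTreeClassifier(random_state=0),
    "random_forest": RandomForestClassifier(n_estimators=50, random_state=0),
}

rows = []
for name, model in candidates.items():
    # 5-fold cross-validation on the fitting portion only
    cv_scores = cross_val_score(model, X_fit, y_fit, cv=5, scoring="accuracy")
    # Refit on the full fitting portion, then score on the holdout
    model.fit(X_fit, y_fit)
    rows.append(
        {
            "model": name,
            "cv_mean": cv_scores.mean(),
            "cv_std": cv_scores.std(),
            "test_score": model.score(X_hold, y_hold),
        }
    )

# Rank by mean cross-validation score, best first
comparison = (
    pd.DataFrame(rows).sort_values("cv_mean", ascending=False).reset_index(drop=True)
)
best_name = comparison.loc[0, "model"]
print(comparison)
print(f"Best model by CV mean: {best_name}")
```

Ranking by the cross-validation mean rather than the holdout score keeps the holdout as an independent check on the chosen model.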

In [None]:
# Create AutoML pipeline
pipeline = AutoMLPipeline(task_type="classification", random_state=42)

# Compare models with 5-fold CV
print("ü§ñ Comparing models... (this may take a minute)\n")
results = pipeline.compare_models(X_train, y_train, cv=5, test_size=0.2)

# Display results
print("\nüèÜ Model Comparison Results:\n")
display(results)

# Visualize results
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))

# CV Mean scores
results.plot(x="model", y="cv_mean", kind="barh", ax=ax1, color="steelblue")
ax1.set_xlabel("Cross-Validation Mean Score")
ax1.set_title("Model Comparison - CV Scores")

# Test scores
results.plot(x="model", y="test_score", kind="barh", ax=ax2, color="coral")
ax2.set_xlabel("Test Score")
ax2.set_title("Model Comparison - Test Scores")

plt.tight_layout()
plt.show()

print(f"\n✨ Best model: {pipeline.best_model_name}")

## 4. Model Evaluation

Let's evaluate the best model on the test set.
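
The evaluation cell computes class probabilities with `predict_proba` but reports only threshold-based metrics. As a supplementary sketch, the probabilities can also feed a threshold-free metric such as ROC AUC. The example uses a scikit-learn `LogisticRegression` as a stand-in, on the assumption that the best model's `predict_proba` returns the usual `(n_samples, n_classes)` array.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic binary classification data for the sketch
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# Stand-in model; any classifier exposing predict_proba would do
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# predict_proba returns one column per class, ordered as in clf.classes_
proba = clf.predict_proba(X_te)
positive_scores = proba[:, 1]  # probability of the positive class (label 1)

# ROC AUC ranks predictions by score, so it needs no decision threshold
auc = roc_auc_score(y_te, positive_scores)
print(f"ROC AUC: {auc:.4f}")
```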

In [None]:
# Get best model
best_model = pipeline.get_best_model()

# Make predictions
y_pred = best_model.predict(X_test)
y_pred_proba = best_model.predict_proba(X_test)

# Calculate metrics
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
)

metrics = {
    "Accuracy": accuracy_score(y_test, y_pred),
    "Precision": precision_score(y_test, y_pred),
    "Recall": recall_score(y_test, y_pred),
    "F1-Score": f1_score(y_test, y_pred),
}

print("üìä Test Set Performance:\n")
for metric, value in metrics.items():
    print(f"   {metric}: {value:.4f}")

# Confusion Matrix
cm = confusion_matrix(y_test, y_pred)

plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt="d", cmap="Blues", cbar=False)
plt.title(f"Confusion Matrix - {pipeline.best_model_name}")
plt.ylabel("True Label")
plt.xlabel("Predicted Label")
plt.show()

## 5. Feature Importance

Understanding which features are most important for the model's predictions.
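
When a model does not expose built-in importances, permutation importance is a model-agnostic alternative: shuffle one feature at a time and measure how much the score drops. The sketch below uses scikit-learn's `permutation_importance` with a `RandomForestClassifier` stand-in. The `feature` and `importance` column names are chosen to match the ones this notebook plots; this is not necessarily how `get_feature_importance` computes its values.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data with named features
X, y = make_classification(
    n_samples=500, n_features=8, n_informative=4, random_state=0
)
names = [f"feature_{i}" for i in range(X.shape[1])]
X_df = pd.DataFrame(X, columns=names)
X_tr, X_te, y_tr, y_te = train_test_split(X_df, y, test_size=0.25, random_state=0)

# Stand-in model for the sketch
model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_tr, y_tr)

# Shuffle each feature 5 times on held-out data and record the mean score drop
result = permutation_importance(model, X_te, y_te, n_repeats=5, random_state=0)

perm_df = (
    pd.DataFrame({"feature": names, "importance": result.importances_mean})
    .sort_values("importance", ascending=False)
    .reset_index(drop=True)
)
print(perm_df)
```

Computing the importances on held-out data rather than training data avoids rewarding features the model has merely memorized.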

In [None]:
# Get feature importance
if hasattr(best_model, "get_feature_importance"):
    importance_df = best_model.get_feature_importance()

    print("🔍 Top 15 Most Important Features:\n")
    display(importance_df.head(15))

    # Visualize on an explicit Axes; DataFrame.plot would otherwise open
    # its own figure and leave the one created here empty
    fig, ax = plt.subplots(figsize=(10, 8))
    importance_df.head(15).plot(
        x="feature", y="importance", kind="barh", color="green", alpha=0.7, ax=ax
    )
    ax.set_xlabel("Importance")
    ax.set_title("Top 15 Feature Importances")
    ax.invert_yaxis()
    plt.tight_layout()
    plt.show()
else:
    print("⚠️ Feature importance not available for this model.")

## 6. Hyperparameter Optimization

Use Optuna to find optimal hyperparameters for the best model.

**Note:** This cell may take several minutes to run depending on `n_trials`.

In [None]:
# Uncomment to run hyperparameter optimization
# WARNING: This may take several minutes!

# optimization_result = pipeline.optimize_hyperparameters(
#     X_train, y_train,
#     model_name=pipeline.best_model_name,
#     n_trials=30,  # Increase for better results
#     cv=5
# )

# print("⚙️ Optimization Results:\n")
# print(f"   Best Score: {optimization_result['best_score']:.4f}")
# print(f"\n   Best Parameters:")
# for param, value in optimization_result['best_params'].items():
#     print(f"      {param}: {value}")

# # Get optimized model
# optimized_model = optimization_result['model']
# optimized_accuracy = optimized_model.score(X_test, y_test)
# print(f"\n   Test Accuracy (optimized): {optimized_accuracy:.4f}")

print("üí° Uncomment the code above to run hyperparameter optimization.")

## 7. Using Individual Models

You can also use specific models directly with custom parameters.

In [None]:
# Create XGBoost classifier with custom parameters
xgb_model = XGBoostClassifier(
    n_estimators=100, max_depth=6, learning_rate=0.1, random_state=42
)

# Train
xgb_model.fit(X_train, y_train)

# Evaluate
xgb_accuracy = xgb_model.score(X_test, y_test)
print(f"üéØ XGBoost Accuracy: {xgb_accuracy:.4f}")

# Get and display parameters
params = xgb_model.get_params()
print("\nüìã Model Parameters:")
for key, value in list(params.items())[:5]:  # Show first 5
    print(f"   {key}: {value}")
print("   ...")

## 8. Sklearn Compatibility

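ML engine models implement the scikit-learn estimator interface (`fit`, `predict`, `score`, `get_params`), so they can be used inside a `Pipeline` and with utilities such as `cross_val_score`.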
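
To show what scikit-learn compatibility requires in practice, here is a minimal sketch of a custom estimator that works with `clone`, `Pipeline`, and `cross_val_score`. The `CentroidClassifier` class and its `shrink` parameter are hypothetical and unrelated to the ML engine's own models; the point is the contract: `__init__` only stores hyperparameters, `fit` sets trailing-underscore attributes and returns `self`, and `predict` uses them.

```python
import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin, clone
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


class CentroidClassifier(ClassifierMixin, BaseEstimator):
    """Hypothetical toy estimator: predict the class with the nearest centroid."""

    def __init__(self, shrink=0.0):
        # Only store hyperparameters here so get_params() and clone() work
        self.shrink = shrink

    def fit(self, X, y):
        X = np.asarray(X, dtype=float)
        y = np.asarray(y)
        self.classes_ = np.unique(y)
        overall_mean = X.mean(axis=0)
        centroids = np.stack([X[y == c].mean(axis=0) for c in self.classes_])
        # Pull each class centroid toward the overall mean by the shrink fraction
        self.centroids_ = (1 - self.shrink) * centroids + self.shrink * overall_mean
        return self

    def predict(self, X):
        X = np.asarray(X, dtype=float)
        # Distance from every sample to every centroid: shape (n_samples, n_classes)
        dists = np.linalg.norm(X[:, None, :] - self.centroids_[None, :, :], axis=2)
        return self.classes_[dists.argmin(axis=1)]


X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# clone() rebuilds an unfitted copy from get_params()
cloned = clone(CentroidClassifier(shrink=0.2))

# The estimator drops into a Pipeline and cross_val_score like any built-in model
demo_pipeline = Pipeline(
    [("scaler", StandardScaler()), ("classifier", CentroidClassifier(shrink=0.2))]
)
demo_scores = cross_val_score(demo_pipeline, X, y, cv=5, scoring="accuracy")
print(f"Mean CV accuracy: {demo_scores.mean():.4f}")
```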

In [None]:
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Create sklearn pipeline
sklearn_pipeline = Pipeline(
    [("scaler", StandardScaler()), ("classifier", XGBoostClassifier(n_estimators=50))]
)

# Use with cross_val_score
cv_scores = cross_val_score(
    sklearn_pipeline, X_train, y_train, cv=5, scoring="accuracy"
)

print("✅ Sklearn Compatibility Demo:\n")
print(f"   Cross-validation scores: {cv_scores}")
print(f"   Mean CV score: {cv_scores.mean():.4f} (±{cv_scores.std():.4f})")

## 9. Summary & Next Steps

### What We Covered

✅ Model registry and discovery  
✅ Automated model comparison with AutoML  
✅ Model evaluation and metrics  
✅ Feature importance analysis  
✅ Hyperparameter optimization (optional)  
✅ Individual model usage  
✅ Sklearn compatibility  

### Next Steps

1. 📚 Read the comprehensive [ML Engine Guide](../docs/ML_ENGINE_GUIDE.md)
2. 🔬 Try with your own datasets
3. ⚙️ Experiment with hyperparameter optimization
4. 🎯 Integrate with MLflow for experiment tracking
5. 🚀 Deploy your best model

### Resources

- [README.md](../README.md) - Platform overview
- [CLAUDE.md](../CLAUDE.md) - Development commands
- [ML_ENGINE_GUIDE.md](../docs/ML_ENGINE_GUIDE.md) - Comprehensive guide
- [ROADMAP.md](../ROADMAP.md) - Development roadmap