# Phase 2.4: Understanding Model Flavors in MLflow

This notebook covers:
1. **What are Flavors** - Understanding the concept
2. **Sklearn Flavor** - Logging different sklearn models
3. **Pipeline Flavor** - Logging sklearn pipelines
4. **Custom PyFunc** - Creating your own model wrapper
5. **Loading & Using** - How to load models with different interfaces

## What are Model Flavors?

**Flavors** are MLflow's way of supporting different ML frameworks. Think of a flavor as a "format" or "packaging" for your model.

### Why Flavors Exist

Different ML frameworks save models differently:
- **Scikit-learn** uses pickle/joblib
- **TensorFlow** uses SavedModel format
- **PyTorch** uses .pt/.pth files
- **XGBoost** uses its own binary format

MLflow flavors provide:
- **Native interface**: Use the original framework's API
- **Unified interface**: Use `mlflow.pyfunc` for any model type

## Common MLflow Flavors

| Flavor | Framework | Import |
|--------|-----------|--------|
| `sklearn` | Scikit-learn | `mlflow.sklearn` |
| `pytorch` | PyTorch | `mlflow.pytorch` |
| `tensorflow` | TensorFlow | `mlflow.tensorflow` |
| `xgboost` | XGBoost | `mlflow.xgboost` |
| `lightgbm` | LightGBM | `mlflow.lightgbm` |
| `pyfunc` | Any Python | `mlflow.pyfunc` |

## Learning Goals
- Understand what flavors are and why they're useful
- Log different types of sklearn models
- Create and log sklearn pipelines
- Build a custom PyFunc model
- Load models using different interfaces

## Step 1: Import Libraries

We'll import MLflow along with various sklearn models and pipeline tools.

In [None]:
# MLflow for experiment tracking
import mlflow
import mlflow.sklearn
import mlflow.pyfunc

# sklearn dataset loader
from sklearn.datasets import load_iris

# Various sklearn classifiers to compare
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression

# sklearn pipeline and preprocessing
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

# Data handling
import pandas as pd
import numpy as np

# System
import os
import warnings
warnings.filterwarnings('ignore')

print("All libraries imported successfully!")
print("Ready to learn about model flavors!")

## Step 2: Connect to MLflow

Set up our connection to the MLflow tracking server.

In [None]:
# Get MLflow tracking server URL
TRACKING_URI = os.getenv("MLFLOW_TRACKING_URI", "http://localhost:5000")

# Connect to MLflow
mlflow.set_tracking_uri(TRACKING_URI)

# Set experiment for this tutorial
mlflow.set_experiment("phase2-model-flavors")

print(f"Connected to MLflow at: {TRACKING_URI}")
print("Experiment: phase2-model-flavors")

## Step 3: Prepare Data

Load and prepare the Iris dataset for our experiments.
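
The cells in this notebook fit and score on the full dataset, so the logged metric is a training accuracy. If a held-out estimate is wanted, a split along the following lines could be used. This is a sketch only: the 80/20 split, the stratification, and the names `X_all`, `holdout_model`, and `test_accuracy` are assumptions, not part of the notebook's flow.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Load Iris as a DataFrame/Series pair
X_all, y_all = load_iris(return_X_y=True, as_frame=True)

# Hold out 20% of the rows, keeping class proportions equal in both parts
X_train, X_test, y_train, y_test = train_test_split(
    X_all, y_all, test_size=0.2, stratify=y_all, random_state=42
)

# Fit on the training part only, then score on unseen rows
holdout_model = RandomForestClassifier(n_estimators=50, random_state=42)
holdout_model.fit(X_train, y_train)
test_accuracy = holdout_model.score(X_test, y_test)

print(f"Train rows: {len(X_train)}, test rows: {len(X_test)}")
print(f"Held-out accuracy: {test_accuracy:.4f}")
```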

In [None]:
# Load Iris dataset
iris = load_iris()

# Create DataFrame with feature names
X = pd.DataFrame(iris.data, columns=iris.feature_names)
y = iris.target

print("Dataset loaded!")
print(f"Samples: {len(X)}")
print(f"Features: {list(iris.feature_names)}")
print(f"Classes: {list(iris.target_names)}")

## Part 1: Sklearn Flavor - Multiple Model Types

The `sklearn` flavor works with ANY scikit-learn model. Let's train and log several different types of classifiers to see how the same flavor handles different model types.

In [None]:
print("="*60)
print("Part 1: Logging Multiple Sklearn Models")
print("="*60)

# Define several different classifiers to compare
# Each uses different algorithms but all can be logged with sklearn flavor
models = {
    "random_forest": RandomForestClassifier(
        n_estimators=50,    # 50 decision trees
        random_state=42
    ),
    "gradient_boosting": GradientBoostingClassifier(
        n_estimators=50,    # 50 boosting stages
        random_state=42
    ),
    "logistic_regression": LogisticRegression(
        max_iter=200,       # Maximum iterations for convergence
        random_state=42
    ),
}

print("\nTraining and logging models:")
print("-" * 40)

# Train and log each model
for name, model in models.items():
    # Each run is named after the model type
    with mlflow.start_run(run_name=name):
        # Train the model
        model.fit(X, y)
        
        # Create signature from the data
        signature = mlflow.models.infer_signature(X, model.predict(X))
        
        # Log the model using sklearn flavor
        # This works for RandomForest, GradientBoosting, LogisticRegression, etc.
        mlflow.sklearn.log_model(model, "model", signature=signature)
        
        # Add tags to identify the model
        # type(model).__name__ gets the class name like "RandomForestClassifier"
        mlflow.set_tag("model_type", type(model).__name__)
        mlflow.set_tag("flavor", "sklearn")
        
        # Calculate and log training accuracy
        accuracy = (model.predict(X) == y).mean()
        mlflow.log_metric("train_accuracy", accuracy)
        
        print(f"  {name}: accuracy = {accuracy:.4f} ({type(model).__name__})")

print("\nAll sklearn models logged with the same flavor!")

## Part 2: Sklearn Pipeline

A **Pipeline** chains multiple steps together (preprocessing + model). The sklearn flavor can log entire pipelines as a single model!

**Benefits of Pipelines:**
- All preprocessing is bundled with the model
- Prevents data leakage during cross-validation, because the preprocessing is refit on each training fold only
- Easy to deploy - single object handles everything
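
The data-leakage point can be sketched with `cross_val_score`: when the whole pipeline is passed in, the scaler is refit inside each fold on that fold's training rows only. The choice of `LogisticRegression`, five folds, and the names `cv_pipeline` and `cv_scores` are assumptions made for this illustration.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X_cv, y_cv = load_iris(return_X_y=True)

# Scaler and classifier travel together as one estimator
cv_pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("classifier", LogisticRegression(max_iter=200)),
])

# Each fold clones the pipeline and fits it on that fold's training rows,
# so the validation rows never influence the scaler's mean and std
cv_scores = cross_val_score(cv_pipeline, X_cv, y_cv, cv=5)

print(f"Fold accuracies: {cv_scores.round(4)}")
print(f"Mean accuracy: {cv_scores.mean():.4f}")
```

Scaling the full dataset first and then cross-validating only the classifier would let statistics from the validation rows leak into training.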

In [None]:
print("\n" + "="*60)
print("Part 2: Logging Sklearn Pipeline")
print("="*60)

# Create a pipeline with preprocessing and classification
# The pipeline has two steps:
# 1. StandardScaler: Normalizes features (mean=0, std=1)
# 2. RandomForestClassifier: Makes predictions

pipeline = Pipeline([
    ("scaler", StandardScaler()),  # Step 1: Normalize data
    ("classifier", RandomForestClassifier(n_estimators=50, random_state=42))  # Step 2: Classify
])

print("\nPipeline structure:")
print("-" * 40)
for i, (name, step) in enumerate(pipeline.steps, 1):
    print(f"  Step {i}: {name} -> {type(step).__name__}")

# Train the pipeline
print("\nTraining pipeline...")
pipeline.fit(X, y)

# Log the pipeline
with mlflow.start_run(run_name="sklearn-pipeline"):
    # Create signature
    signature = mlflow.models.infer_signature(X, pipeline.predict(X))
    
    # Log the entire pipeline as a single model
    # When loaded, it will automatically apply scaling before prediction
    mlflow.sklearn.log_model(pipeline, "model", signature=signature)
    
    # Add tags
    mlflow.set_tag("model_type", "Pipeline")
    mlflow.set_tag("steps", "StandardScaler + RandomForestClassifier")
    
    # Log accuracy
    accuracy = (pipeline.predict(X) == y).mean()
    mlflow.log_metric("train_accuracy", accuracy)
    
    print("\nPipeline logged!")
    print(f"  Accuracy: {accuracy:.4f}")
    print("  The model includes both preprocessing AND classification!")

## Part 3: Custom PyFunc Model

**PyFunc** (Python Function) is MLflow's universal model format. It allows you to create custom models with:
- Custom preprocessing logic
- Custom postprocessing (e.g., return class names instead of numbers)
- Combine multiple models
- Any Python code you need!

To create a PyFunc model, subclass `mlflow.pyfunc.PythonModel`:
- `predict(context, model_input)`: Required. Defines the prediction logic
- `__init__()`: Optional. A plain Python constructor for storing your model and configuration

In [None]:
print("\n" + "="*60)
print("Part 3: Creating Custom PyFunc Model")
print("="*60)

# Define a custom model class
# This wraps a sklearn classifier with custom preprocessing and postprocessing

class PreprocessingModel(mlflow.pyfunc.PythonModel):
    """
    A custom model that includes:
    - Data normalization (preprocessing)
    - Classification
    - Returns class names instead of numbers (postprocessing)
    
    This demonstrates how to wrap existing models with custom logic.
    """
    
    def __init__(self, classifier, class_names):
        """
        Initialize the custom model.
        
        Args:
            classifier: Any sklearn classifier
            class_names: List of human-readable class names
        """
        self.classifier = classifier
        self.class_names = class_names
        self.scaler = StandardScaler()
    
    def fit(self, X, y):
        """
        Fit both the scaler and classifier.
        
        This method trains:
        1. The scaler to learn mean/std from training data
        2. The classifier on the scaled data
        """
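        # First, fit the scaler on a plain array and transform the data.
        # predict() also passes arrays, so this avoids sklearn's
        # feature-name mismatch warning at inference time
        X_scaled = self.scaler.fit_transform(np.asarray(X))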
        
        # Then, train the classifier on scaled data
        self.classifier.fit(X_scaled, y)
        
        return self
    
    def predict(self, context, model_input):
        """
        Make predictions with preprocessing and postprocessing.
        
        This method:
        1. Converts input to numpy array
        2. Scales the input (preprocessing)
        3. Makes predictions
        4. Converts numeric predictions to class names (postprocessing)
        
        Args:
            context: MLflow context (can contain model artifacts)
            model_input: Input data (DataFrame or array)
        
        Returns:
            List of class names (e.g., ['setosa', 'versicolor', ...])
        """
        # Handle different input types
        if isinstance(model_input, pd.DataFrame):
            X = model_input.values
        else:
            X = np.array(model_input)
        
        # Preprocess: Scale the input data
        X_scaled = self.scaler.transform(X)
        
        # Make predictions (returns numbers like 0, 1, 2)
        predictions = self.classifier.predict(X_scaled)
        
        # Postprocess: Convert numbers to class names
        return [self.class_names[p] for p in predictions]


print("\nCustom model class defined!")
print("\nFeatures:")
print("  - Automatic data scaling")
print("  - Returns class NAMES instead of numbers")
print("  - Works with any sklearn classifier")

In [None]:
# Create and train the custom model
print("\nCreating and training custom model...")

# Create a RandomForest as the base classifier
rf = RandomForestClassifier(n_estimators=100, random_state=42)

# Wrap it in our custom model with Iris class names
custom_model = PreprocessingModel(
    classifier=rf,
    class_names=iris.target_names.tolist()  # plain Python strings: ['setosa', 'versicolor', 'virginica']
)

# Train the custom model
custom_model.fit(X, y)

# Test it
sample_predictions = custom_model.predict(None, X.iloc[:5])
print(f"\nSample predictions: {sample_predictions}")
print("Note: Returns class NAMES, not numbers!")

In [None]:
# Log the custom model to MLflow
print("\nLogging custom PyFunc model...")

with mlflow.start_run(run_name="custom-pyfunc") as pyfunc_run:
    # Log using pyfunc flavor (not sklearn!)
    # python_model= tells MLflow this is a custom Python class
    mlflow.pyfunc.log_model(
        "model",
        python_model=custom_model,
        signature=mlflow.models.infer_signature(
            X,
            custom_model.predict(None, X)  # Sample output for signature
        )
    )
    
    # Add tags
    mlflow.set_tag("model_type", "CustomPyFunc")
    mlflow.set_tag("flavor", "pyfunc")
    mlflow.set_tag("features", "preprocessing + class_names")
    
    saved_pyfunc_run_id = pyfunc_run.info.run_id
    
    print("\nCustom PyFunc model logged!")
    print(f"Run ID: {saved_pyfunc_run_id}")

## Part 4: Loading and Testing Models

Now let's load the models we logged and test them. We'll use the `pyfunc` interface which works for ANY flavor - this is the recommended way to load models for deployment.

In [None]:
print("\n" + "="*60)
print("Part 4: Loading and Testing Models")
print("="*60)

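# Get up to 10 runs from our experiment (most recent first by default)
# Note: re-running the notebook adds new runs, so older ones may drop out of this window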
runs = mlflow.search_runs(
    experiment_names=["phase2-model-flavors"],
    max_results=10
)

# Prepare sample input
sample_input = X.iloc[:3]

print("\nSample input data:")
print(sample_input)
print("\n" + "-"*60)

# Load and test each model
for _, run in runs.iterrows():
    run_name = run["tags.mlflow.runName"]
    run_id = run["run_id"]
    
    print(f"\n{run_name}:")
    
    # Load using pyfunc interface (works for all flavors!)
    model_uri = f"runs:/{run_id}/model"
    
    try:
        loaded = mlflow.pyfunc.load_model(model_uri)
        preds = loaded.predict(sample_input)
        
        # Format output based on prediction type
        if isinstance(preds[0], str):
            # Custom model returns strings
            print(f"  Predictions: {list(preds[:3])}")
        else:
            # Standard models return numbers
            pred_names = [iris.target_names[p] for p in preds[:3]]
            print(f"  Predictions: {list(preds[:3])} -> {pred_names}")
            
    except Exception as e:
        print(f"  Error loading: {e}")

## Part 5: Comparing Sklearn and PyFunc Loading

Let's compare the two ways to load a model:
1. **Native flavor** (`mlflow.sklearn.load_model`): Returns the original sklearn object, with all of its attributes and methods
2. **PyFunc** (`mlflow.pyfunc.load_model`): Returns a generic `PyFuncModel` wrapper built around `predict()`

In [None]:
print("\n" + "="*60)
print("Part 5: Sklearn vs PyFunc Loading")
print("="*60)

# Find the random_forest run
rf_run = runs[runs["tags.mlflow.runName"] == "random_forest"].iloc[0]
rf_uri = f"runs:/{rf_run['run_id']}/model"

print("\nLoading Random Forest model...")

# Method 1: Load with sklearn flavor
print("\n[Method 1: mlflow.sklearn.load_model]")
print("-" * 40)
sklearn_model = mlflow.sklearn.load_model(rf_uri)
print(f"Type: {type(sklearn_model).__name__}")
print(f"n_estimators: {sklearn_model.n_estimators}")
print(f"Has feature_importances_: {hasattr(sklearn_model, 'feature_importances_')}")

# Method 2: Load with pyfunc
print("\n[Method 2: mlflow.pyfunc.load_model]")
print("-" * 40)
pyfunc_model = mlflow.pyfunc.load_model(rf_uri)
print(f"Type: {type(pyfunc_model).__name__}")
print(f"Has feature_importances_: {hasattr(pyfunc_model, 'feature_importances_')}")

# Both make the same predictions
print("\n[Comparison]")
print("-" * 40)
sklearn_preds = sklearn_model.predict(sample_input)
pyfunc_preds = pyfunc_model.predict(sample_input)
print(f"Sklearn predictions: {list(sklearn_preds)}")
print(f"PyFunc predictions:  {list(pyfunc_preds)}")
print(f"Same results: {list(sklearn_preds) == list(pyfunc_preds)}")

## Summary: Model Flavors

### When to Use Each Approach

| Approach | When to Use |
|----------|-------------|
| **Native Flavor** (e.g., `mlflow.sklearn`) | Need access to framework-specific features (like `feature_importances_`) |
| **PyFunc Loading** | Deployment, serving, or when you just need predictions |
| **Custom PyFunc** | Need custom preprocessing, postprocessing, or logic |

### Key Concepts

1. **Flavors are Format Handlers**
   - Each framework has its own flavor
   - MLflow handles serialization/deserialization

2. **PyFunc is Universal**
   - Works with any flavor
   - Provides consistent `predict()` interface
   - Best for deployment

3. **Custom PyFunc for Custom Logic**
   - Inherit from `mlflow.pyfunc.PythonModel`
   - Implement `predict(context, model_input)`
   - Bundle preprocessing/postprocessing with model

In [None]:
print("="*60)
print("Model Flavors Tutorial Complete!")
print("="*60)
print(f"\nView at: {TRACKING_URI}")
print("\nWhat you learned:")
print("  1. What model flavors are and why they exist")
print("  2. How to log different sklearn model types")
print("  3. How to log sklearn pipelines")
print("  4. How to create and log custom PyFunc models")
print("  5. Difference between native and pyfunc loading")
print("\nTry viewing the models in MLflow UI and comparing their structures!")