# Phase 2.2: Logging and Loading ML Models with MLflow

This notebook demonstrates:
1. **Training an ML Model** - Build a RandomForest classifier
2. **Logging Models** - Save models to MLflow with metadata
3. **Model Signatures** - Define input/output schemas
4. **Input Examples** - Store sample inputs with your model
5. **Loading Models** - Retrieve models for predictions

## What is Model Logging?

**Model logging** saves your trained model to MLflow so you can:
- **Reproduce** results later by loading the exact same model
- **Deploy** models to production
- **Compare** different model versions
- **Share** models with your team

## Learning Goals
- Understand how to log sklearn models to MLflow
- Learn about model signatures and why they matter
- Know how to load models back for predictions
- Use both `sklearn` and `pyfunc` interfaces

## Step 1: Import Libraries

We'll use MLflow for experiment tracking and sklearn for building our model.

In [None]:
# mlflow: Main library for experiment tracking
import mlflow

# mlflow.sklearn: Special module for logging sklearn models
# This provides optimized functions for sklearn model serialization
import mlflow.sklearn

# sklearn.datasets: Contains built-in datasets for practice
from sklearn.datasets import load_iris

# sklearn.model_selection: Tools for splitting data
from sklearn.model_selection import train_test_split

# sklearn.ensemble: Contains RandomForestClassifier
from sklearn.ensemble import RandomForestClassifier

# sklearn.metrics: Functions to evaluate model performance
from sklearn.metrics import accuracy_score, classification_report

# pandas: For working with tabular data
import pandas as pd

# os: For environment variables
import os

# Suppress warnings for cleaner output
import warnings
warnings.filterwarnings('ignore')

print("All libraries imported successfully!")
print("Ready to learn about model logging!")

## Step 2: Connect to MLflow

Connect to the MLflow tracking server and set up our experiment.

In [None]:
# Get the MLflow tracking server URL
TRACKING_URI = os.getenv("MLFLOW_TRACKING_URI", "http://localhost:5000")

# Tell MLflow where to send tracking data
mlflow.set_tracking_uri(TRACKING_URI)

# Create or select an experiment
mlflow.set_experiment("phase2-model-logging")

print(f"Connected to MLflow at: {TRACKING_URI}")
print("Experiment: phase2-model-logging")
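
Setting the tracking URI does not contact the server, so a wrong URL or a stopped server only shows up later as a connection error. A minimal pre-flight check, assuming the server exposes the `/health` endpoint that `mlflow server` provides; the helper name `tracking_server_is_up` is hypothetical and not part of MLflow:

```python
import urllib.error
import urllib.request


def tracking_server_is_up(base_url: str, timeout: float = 2.0) -> bool:
    """Return True if GET <base_url>/health answers with HTTP 200."""
    health_url = base_url.rstrip("/") + "/health"
    try:
        with urllib.request.urlopen(health_url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError, ValueError):
        # Connection refused, DNS failure, timeout, or a malformed URL
        return False
```

Calling `tracking_server_is_up(TRACKING_URI)` before `mlflow.set_experiment(...)` gives a clear yes/no answer instead of a stack trace in a later cell.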

## Step 3: Load and Explore the Dataset

We'll use the classic **Iris dataset** - a simple dataset for classification that contains measurements of iris flowers.

**Why use pandas DataFrame?**
- Preserves feature names (important for signatures)
- Lets MLflow infer a column-based signature with a name and type for each feature
- Easier to inspect and work with

In [None]:
# Load the Iris dataset from sklearn
iris = load_iris()

# Create a pandas DataFrame with feature names as column headers
# This is better than using raw numpy arrays because:
# 1. Column names are preserved
# 2. MLflow can infer better signatures
X = pd.DataFrame(iris.data, columns=iris.feature_names)

# Target variable (what we're trying to predict)
y = iris.target

print("="*60)
print("Dataset Information")
print("="*60)
print("\nDataset: Iris")
print(f"Total samples: {len(X)}")
print("\nFeatures (measurements):")
for i, name in enumerate(iris.feature_names, 1):
    print(f"  {i}. {name}")
print("\nClasses (flower types):")
for i, name in enumerate(iris.target_names):
    print(f"  {i}: {name}")

print("\nFirst 5 samples:")
X.head()

## Step 4: Split Data into Train and Test Sets

We split the data to:
- **Train** the model on 80% of the data
- **Test** the model on the remaining 20% (unseen data)

In [None]:
# Split the data into training and testing sets
# test_size=0.2 means 20% for testing
# random_state=42 ensures reproducibility (same split every time)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, 
    test_size=0.2, 
    random_state=42
)

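print("Data split complete!")
# Compute the percentages so the output stays correct if test_size changes
print(f"Training samples: {len(X_train)} ({len(X_train) / len(X):.0%})")
print(f"Testing samples: {len(X_test)} ({len(X_test) / len(X):.0%})")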
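
The split above is purely random, so the three classes are not guaranteed to appear equally often in the test samples. One option, shown here as a sketch rather than a change to this notebook's workflow, is the `stratify` argument of `train_test_split`, which keeps class proportions the same in both splits:

```python
from collections import Counter

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()

# stratify=iris.target keeps the class proportions the same in both splits
X_train, X_test, y_train, y_test = train_test_split(
    iris.data,
    iris.target,
    test_size=0.2,
    random_state=42,
    stratify=iris.target,
)

# Count how many samples of each class landed in each split
train_counts = Counter(y_train.tolist())
test_counts = Counter(y_test.tolist())
print("train:", dict(train_counts))
print("test: ", dict(test_counts))
```

Iris has 50 samples per class, so a stratified 80/20 split puts 40 of each class in the training set and 10 of each in the test set.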

## Step 5: Train Model and Log to MLflow

Now let's train a RandomForest model and log everything to MLflow.

**What gets logged:**
1. Parameters (hyperparameters like n_estimators, max_depth)
2. Metrics (performance scores like accuracy)
3. Model (the trained model itself)
4. Signature (input/output schema)
5. Input example (sample input data)

In [None]:
print("="*60)
print("Training and Logging Model")
print("="*60)

# Start an MLflow run - all logs will be grouped under this run
with mlflow.start_run(run_name="model-logging-demo") as run:
    
    # ===== STEP 1: Train the Model =====
    print("\n[1] Training model...")
    
    # Create a RandomForest classifier
    # - n_estimators: Number of trees in the forest (more trees usually give more stable predictions but take longer to train and predict)
    # - max_depth: How deep each tree can grow (prevents overfitting)
    # - random_state: Ensures reproducibility
    model = RandomForestClassifier(
        n_estimators=100,  # Use 100 decision trees
        max_depth=5,       # Limit tree depth to 5 levels
        random_state=42    # For reproducibility
    )
    
    # Train the model on training data
    # The model learns patterns from X_train to predict y_train
    model.fit(X_train, y_train)
    
    # ===== STEP 2: Evaluate the Model =====
    # Use the trained model to predict on test data
    y_pred = model.predict(X_test)
    
    # Calculate accuracy: percentage of correct predictions
    accuracy = accuracy_score(y_test, y_pred)
    print(f"    Accuracy: {accuracy:.4f} ({accuracy*100:.1f}% correct)")
    
    # ===== STEP 3: Log Parameters =====
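    # Parameters are the configuration settings you chose
    # Read them from the model so the logged values cannot drift from the ones used
    print("\n[2] Logging parameters...")
    mlflow.log_param("n_estimators", model.n_estimators)
    mlflow.log_param("max_depth", model.max_depth)
    print(f"    Logged: n_estimators={model.n_estimators}, max_depth={model.max_depth}")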
    
    # ===== STEP 4: Log Metrics =====
    # Metrics are numerical performance measurements
    print("\n[3] Logging metrics...")
    mlflow.log_metric("accuracy", accuracy)
    print(f"    Logged: accuracy={accuracy:.4f}")
    
    # ===== STEP 5: Create Model Signature =====
    print("\n[4] Creating model signature...")
    
    # The signature describes what the model expects as input and produces as output
    # This matters for:
    # - Input validation at prediction time (the pyfunc interface checks inputs against it)
    # - Catching wrong column names or types before they reach production code
    # - Documentation for users of your model
    signature = mlflow.models.infer_signature(
        X_train,               # Sample input data
        model.predict(X_train) # Sample output (predictions)
    )
    
    print(f"    Input schema: {signature.inputs}")
    print(f"    Output schema: {signature.outputs}")
    
    # ===== STEP 6: Log the Model =====
    print("\n[5] Logging model with signature and input example...")
    
    # Log the model with all metadata
    # - "random_forest_model": Name of the model artifact folder
    # - signature: Input/output schema
    # - input_example: Sample input for documentation/testing
    mlflow.sklearn.log_model(
        model,                          # The trained model object
        "random_forest_model",          # Artifact folder name
        signature=signature,            # Schema information
        input_example=X_train.iloc[:3]  # First 3 rows as example
    )
    
    print("    Model logged successfully!")
    print(f"    Run ID: {run.info.run_id}")
    
    # Save run_id for later use
    saved_run_id = run.info.run_id

print("\nTraining complete!")

## Step 6: Load the Model Back from MLflow

Once a model is logged, you can load it back at any time. This notebook uses two of the loaders MLflow provides:

1. **`mlflow.sklearn.load_model()`** - Returns the native sklearn model
2. **`mlflow.pyfunc.load_model()`** - Returns a generic MLflow wrapper (`PyFuncModel`) with a `predict()` method

**Model URI Format:**
- `runs:/<run_id>/<artifact_path>` - Load from a specific run
- `models:/<model_name>/<version>` - Load from Model Registry
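
Both URI formats are plain strings, so they can be assembled with ordinary string formatting. The helper names `runs_uri` and `registered_model_uri` below are hypothetical conveniences, not MLflow functions, and the run ID and model name are placeholders:

```python
def runs_uri(run_id: str, artifact_path: str) -> str:
    """Build a URI of the form runs:/<run_id>/<artifact_path>."""
    return f"runs:/{run_id}/{artifact_path.strip('/')}"


def registered_model_uri(model_name: str, version: int) -> str:
    """Build a URI of the form models:/<model_name>/<version>."""
    return f"models:/{model_name}/{version}"


# Placeholder values for illustration only
example_run_uri = runs_uri("0123456789abcdef", "random_forest_model")
example_registry_uri = registered_model_uri("iris-classifier", 1)
print(example_run_uri)
print(example_registry_uri)
```

Either string can be passed to `mlflow.sklearn.load_model()` or `mlflow.pyfunc.load_model()`; the `models:/` form only resolves once a model with that name and version has been registered.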

In [None]:
print("="*60)
print("Loading Model from MLflow")
print("="*60)

# Construct the model URI (Uniform Resource Identifier)
# Format: runs:/<run_id>/<artifact_path>
model_uri = f"runs:/{saved_run_id}/random_forest_model"

print(f"\n[1] Loading model from: {model_uri}")

# Load the model using sklearn flavor
# This returns the native sklearn RandomForestClassifier object
loaded_model = mlflow.sklearn.load_model(model_uri)

print("    Model loaded successfully!")
print(f"    Model type: {type(loaded_model).__name__}")
print(f"    Number of trees: {loaded_model.n_estimators}")
print(f"    Max depth: {loaded_model.max_depth}")

## Step 7: Make Predictions with the Loaded Model

Let's verify that the loaded model works correctly by making predictions on test data.

In [None]:
print("\n[2] Making predictions with loaded model...")

# Take 5 samples from test data
sample_data = X_test.iloc[:5]

# Make predictions using the loaded model
predictions = loaded_model.predict(sample_data)

# Display results
print("\n    Sample Predictions:")
print("    " + "-" * 55)
print(f"    {'#':<4} {'Predicted':<15} {'Actual':<15} {'Match'}")
print("    " + "-" * 55)

for i in range(5):
    # y_test is a NumPy array (it comes from iris.target), so index it directly; it has no .iloc
    actual_class = iris.target_names[y_test[i]]
    predicted_class = iris.target_names[predictions[i]]
    match = "correct" if actual_class == predicted_class else "wrong"
    print(f"    {i+1:<4} {predicted_class:<15} {actual_class:<15} {match}")

print("    " + "-" * 55)

## Step 8: Use the PyFunc Interface

MLflow's **pyfunc** (Python Function) interface provides a unified way to load a model from any flavor that supports it. This is useful because:
- Same loading code works for sklearn, TensorFlow, PyTorch, etc.
- MLflow's serving and deployment tools work with the pyfunc flavor, so no framework-specific serving code is needed
- Consistent predict() interface

In [None]:
print("\n[3] Testing pyfunc interface...")

# Load model using pyfunc interface
# This returns an MLflow wrapper, not the native sklearn object
pyfunc_model = mlflow.pyfunc.load_model(model_uri)

print(f"    Model type: {type(pyfunc_model).__name__}")

# Make predictions - same predict() method, same results!
pyfunc_predictions = pyfunc_model.predict(sample_data)

print(f"    Pyfunc predictions: {list(pyfunc_predictions)}")

# Verify predictions are identical
if list(predictions) == list(pyfunc_predictions):
    print("\n    Both interfaces produce identical results!")
else:
    print("\n    WARNING: sklearn and pyfunc predictions differ!")

## Step 9: View Full Classification Report

In [None]:
print("\n" + "="*60)
print("Full Classification Report (Loaded Model)")
print("="*60)

# Get predictions for all test data
all_predictions = loaded_model.predict(X_test)

# Print detailed classification report
print(classification_report(
    y_test, 
    all_predictions, 
    target_names=iris.target_names
))

## Summary: Model Logging Workflow

### Step-by-Step Process

```python
# 1. Start an MLflow run
with mlflow.start_run():
    
    # 2. Train your model
    model.fit(X_train, y_train)
    
    # 3. Log parameters and metrics
    mlflow.log_param("param_name", value)
    mlflow.log_metric("metric_name", value)
    
    # 4. Create signature
    signature = mlflow.models.infer_signature(X_train, model.predict(X_train))
    
    # 5. Log the model
    mlflow.sklearn.log_model(model, "model_name", signature=signature)
```

### Loading Models

```python
# Load as native sklearn model
model = mlflow.sklearn.load_model("runs:/<run_id>/model_name")

# Load as pyfunc (universal interface)
model = mlflow.pyfunc.load_model("runs:/<run_id>/model_name")
```

### Key Concepts

| Concept | Description |
|---------|-------------|
| **Signature** | Defines expected input/output types |
| **Input Example** | Sample data stored with model |
| **Model URI** | Address to load model from |
| **Flavor** | Framework-specific model type (sklearn, pytorch, etc.) |

In [None]:
print("="*60)
print("Model Logging Tutorial Complete!")
print("="*60)
print(f"\nView at: {TRACKING_URI}/#/experiments")
print("\nWhat you learned:")
print("  1. How to log sklearn models with mlflow.sklearn.log_model()")
print("  2. How to create and use model signatures")
print("  3. How to store input examples with your model")
print("  4. How to load models using sklearn and pyfunc interfaces")
print("  5. How to make predictions with loaded models")