# OPTIONAL: Model Deployment - Lightweight Sklearn Pipeline Saving

**Module**: ML700 Advanced Topics (Optional)  
**Notebook**: 04 - Model Deployment: Lightweight Sklearn Pipeline Saving  
**Status**: OPTIONAL - This notebook covers advanced material beyond the core curriculum.

---

## Learning Objectives

By the end of this notebook, you will be able to:

1. Explain the difference between training and serving, and why saving the full pipeline matters
2. Use `joblib` to save and load scikit-learn pipelines
3. Build a complete pipeline (preprocessing + model) and persist it to disk
4. Verify that a loaded model produces identical predictions
5. Follow best practices for model versioning and reproducibility

## Prerequisites

- Understanding of scikit-learn pipelines and `ColumnTransformer` (Module ML600)
- Familiarity with `Pipeline`, `fit`, `predict` API
- Basic file I/O concepts

## Table of Contents

1. [Training vs. Serving](#1.-Training-vs.-Serving)
2. [Why Save the Full Pipeline?](#2.-Why-Save-the-Full-Pipeline?)
3. [Joblib: Save and Load](#3.-Joblib:-Save-and-Load)
4. [Hands-On: Build, Save, Load, Predict](#4.-Hands-On:-Build,-Save,-Load,-Predict)
5. [Pickle vs. Joblib](#5.-Pickle-vs.-Joblib)
6. [Best Practices](#6.-Best-Practices)
7. [Beyond Joblib: ONNX and Docker (Conceptual)](#7.-Beyond-Joblib:-ONNX-and-Docker-(Conceptual))
8. [Common Mistakes](#8.-Common-Mistakes)
9. [Summary](#9.-Summary)

---

## 1. Training vs. Serving

In production ML, there are two distinct phases:

**Training** (offline):
- Access to full dataset
- Fit preprocessors (scalers, encoders) and model
- Evaluate and tune hyperparameters
- Can take minutes to hours

**Serving / Inference** (online):
- Receive new, unseen data points one at a time (or in small batches)
- Apply the same preprocessing and model to produce predictions
- Must be fast (milliseconds to seconds)
- **Must not re-fit anything** -- only `transform` and `predict`

The bridge between training and serving is **model serialization**: saving the trained
pipeline to disk so it can be loaded elsewhere.
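
The "no re-fitting" rule can be illustrated with a small sketch. The toy numbers and variable names below are made up for illustration; the only point is that a scaler re-fit on a single serving request forgets the statistics learned during training.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Training (offline): fit the scaler once on the training data.
X_train = np.array([[10.0], [20.0], [30.0]])  # toy data, illustration only
scaler = StandardScaler().fit(X_train)        # learns mean and std from X_train

# Serving (online): only transform, reusing the training statistics.
x_new = np.array([[30.0]])
served = scaler.transform(x_new)

# Anti-pattern: re-fitting on the incoming request discards the training mean.
refit = StandardScaler().fit_transform(x_new)

print(f"transform only:    {served[0, 0]:.3f}")
print(f"re-fit at serving: {refit[0, 0]:.3f}")
```

The two values differ because the re-fit scaler centers the request on itself, so every single-row request would be mapped to zero.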

## 2. Why Save the Full Pipeline?

A common mistake is saving only the model (e.g., the Random Forest) without the preprocessor.
This breaks because:

- The model expects **preprocessed** input (scaled, encoded, etc.)
- At serving time, you need the **exact same** scaler parameters (mean, std) that were fit on training data
- Rebuilding the preprocessor requires the original training data, which may not be available

**Solution**: Always save the **entire pipeline** (preprocessor + model) as a single artifact.
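
A minimal sketch of the failure mode, assuming synthetic data and a `LogisticRegression` swapped in for the Random Forest to keep the example small. The bare classifier pulled out of the pipeline is given raw input and produces different probabilities than the full pipeline.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data with shifted, stretched features (illustration only).
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
X = X * 50.0 + 100.0

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
]).fit(X, y)

# Correct: the full pipeline scales raw input before the classifier sees it.
proba_pipeline = pipe.predict_proba(X[:5])

# Mistake: the bare classifier receives raw, unscaled input.
bare_model = pipe.named_steps["clf"]
proba_bare = bare_model.predict_proba(X[:5])

probabilities_agree = bool(np.allclose(proba_pipeline, proba_bare))
print("Bare model agrees with full pipeline:", probabilities_agree)
```

No error is raised in the mistaken case, which is what makes this bug easy to miss.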

## 3. Joblib: Save and Load

`joblib` is a common choice for serializing scikit-learn objects:

```python
import joblib

# Save
joblib.dump(pipeline, 'model.joblib')

# Load
loaded_pipeline = joblib.load('model.joblib')
```

`joblib` is preferred over `pickle` for scikit-learn because it stores large NumPy arrays
more efficiently: it offers built-in compression on `dump` and can memory-map uncompressed
files on `load` (via `mmap_mode`).
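
As a rough sketch of the memory-mapping side, the snippet below dumps a plain NumPy array (a stand-in for a large fitted attribute, not a real model) and loads it back with `mmap_mode="r"`. The file name and array size are arbitrary choices for illustration.

```python
import os
import tempfile

import joblib
import numpy as np

# Stand-in for a large fitted attribute (illustration only).
weights = np.arange(1_000_000, dtype=np.float64)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "weights.joblib")

    # Uncompressed dump, so the array bytes can be memory-mapped later.
    joblib.dump(weights, path)

    # mmap_mode="r" maps the array read-only instead of copying it into RAM.
    mapped = joblib.load(path, mmap_mode="r")
    loaded_type = type(mapped).__name__
    values_match = bool(np.array_equal(weights, mapped))

    # Release the mapping before the temporary directory is deleted.
    del mapped

print("Loaded as:", loaded_type)
print("Values match:", values_match)
```

Memory mapping and compression are alternatives here: a file written with `compress` has to be decompressed into memory when loaded.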

## 4. Hands-On: Build, Save, Load, Predict

Let us build a full pipeline on the breast cancer dataset, save it, load it, and verify.

In [None]:
import numpy as np
import pandas as pd
import joblib
import sklearn
import os
import tempfile
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report

print(f"scikit-learn version: {sklearn.__version__}")
print(f"numpy version: {np.__version__}")

In [None]:
# Load data into a DataFrame for realistic column-based processing
data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(f"Training set: {X_train.shape}")
print(f"Test set:     {X_test.shape}")
print(f"Features:     {list(X.columns[:5])} ... ({X.shape[1]} total)")

In [None]:
# Build a full pipeline: ColumnTransformer (scaling) + RandomForest
# All features are numeric in this dataset, so we scale all of them
numeric_features = list(X.columns)

preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numeric_features),
    ],
    remainder='drop'
)

full_pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('classifier', RandomForestClassifier(n_estimators=100, random_state=42))
])

# Train the pipeline
full_pipeline.fit(X_train, y_train)

# Evaluate on test set
y_pred = full_pipeline.predict(X_test)
print(f"Test Accuracy: {accuracy_score(y_test, y_pred):.4f}")
print("\nClassification Report:")
print(classification_report(y_test, y_pred, target_names=data.target_names))

In [None]:
# Save the full pipeline with joblib
save_dir = tempfile.mkdtemp()
model_path = os.path.join(save_dir, 'breast_cancer_pipeline.joblib')

joblib.dump(full_pipeline, model_path)

file_size_kb = os.path.getsize(model_path) / 1024
print(f"Model saved to: {model_path}")
print(f"File size: {file_size_kb:.1f} KB")

In [None]:
# Load the pipeline from disk
loaded_pipeline = joblib.load(model_path)

print(f"Loaded pipeline type: {type(loaded_pipeline)}")
print(f"Pipeline steps: {[step[0] for step in loaded_pipeline.steps]}")

In [None]:
# Verify: loaded model produces IDENTICAL predictions
y_pred_original = full_pipeline.predict(X_test)
y_pred_loaded = loaded_pipeline.predict(X_test)

predictions_match = np.array_equal(y_pred_original, y_pred_loaded)
print(f"Predictions match: {predictions_match}")

# Also check predicted probabilities
proba_original = full_pipeline.predict_proba(X_test)
proba_loaded = loaded_pipeline.predict_proba(X_test)

probas_match = np.allclose(proba_original, proba_loaded)
print(f"Probabilities match: {probas_match}")

if predictions_match and probas_match:
    print("\nThe loaded pipeline reproduces the original predictions on the test set.")

In [None]:
# Simulate serving: predict on new data
# In production, this is all you need: load the pipeline and call predict
new_sample = X_test.iloc[:3]  # simulate 3 new patients

serving_pipeline = joblib.load(model_path)
predictions = serving_pipeline.predict(new_sample)
probabilities = serving_pipeline.predict_proba(new_sample)

print("Serving predictions on 3 new samples:")
for i in range(len(new_sample)):
    label = data.target_names[predictions[i]]
    confidence = probabilities[i].max()
    print(f"  Sample {i+1}: {label} (confidence: {confidence:.2f})")
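
In a real service, a request usually arrives as a dictionary (for example, a parsed JSON body) rather than a DataFrame slice. The sketch below shows one way to bridge that gap; `predict_one` is a hypothetical helper, and the small pipeline is a stand-in trained on the spot so that the example is self-contained.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


def predict_one(pipeline, payload, feature_names):
    """Predict the class index for a single {feature_name: value} payload."""
    # Build a one-row DataFrame with columns in the training order.
    # A missing key raises KeyError instead of silently producing NaN.
    values = [payload[name] for name in feature_names]
    row = pd.DataFrame([values], columns=feature_names)
    return int(pipeline.predict(row)[0])


# Stand-in pipeline (in production this would come from joblib.load).
data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("classifier", RandomForestClassifier(n_estimators=10, random_state=0)),
]).fit(X, data.target)

feature_names = list(X.columns)
payload = X.iloc[0].to_dict()  # simulates a parsed request body
label_index = predict_one(pipeline, payload, feature_names)
print("Predicted class:", data.target_names[label_index])
```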

## 5. Pickle vs. Joblib

| Feature | `pickle` | `joblib` |
|---------|----------|----------|
| Part of standard library | Yes | No (installed as a dependency of scikit-learn) |
| NumPy array handling | Standard serialization | Optimized (optional compression; memory mapping for uncompressed files) |
| Large model files | Can be slower | Typically faster for large arrays |
| API | `pickle.dump/load` | `joblib.dump/load` |
| Compression | Manual | Built-in (`compress` parameter) |

**Recommendation**: Use `joblib` for scikit-learn models. It is more efficient for objects
containing large NumPy arrays (like Random Forests with many trees).
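
The two APIs are close enough to compare side by side. This is a minimal sketch on a toy decision tree (chosen only for brevity), using in-memory buffers instead of files.

```python
import io
import pickle

import joblib
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Toy model (illustration only).
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 1, 1])
model = DecisionTreeClassifier(random_state=0).fit(X, y)

# pickle: standard library, works on bytes or file objects.
from_pickle = pickle.loads(pickle.dumps(model))

# joblib: same idea, here written to an in-memory buffer instead of a path.
buffer = io.BytesIO()
joblib.dump(model, buffer)
buffer.seek(0)
from_joblib = joblib.load(buffer)

expected = model.predict(X)
print("pickle round trip OK:", np.array_equal(expected, from_pickle.predict(X)))
print("joblib round trip OK:", np.array_equal(expected, from_joblib.predict(X)))
```

Both formats are built on the pickle protocol, so loading a file can execute arbitrary code. Only load model files from sources you trust.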

In [None]:
# Joblib with compression
compressed_path = os.path.join(save_dir, 'breast_cancer_pipeline_compressed.joblib')
joblib.dump(full_pipeline, compressed_path, compress=3)

original_size = os.path.getsize(model_path) / 1024
compressed_size = os.path.getsize(compressed_path) / 1024

print(f"Original size:   {original_size:.1f} KB")
print(f"Compressed size: {compressed_size:.1f} KB")
print(f"Compression ratio: {original_size / compressed_size:.1f}x")

# Verify compressed model works identically
loaded_compressed = joblib.load(compressed_path)
y_pred_compressed = loaded_compressed.predict(X_test)
print(f"Compressed model predictions match: {np.array_equal(y_pred_original, y_pred_compressed)}")

## 6. Best Practices

### Save the entire pipeline, not individual components
The pipeline encapsulates all preprocessing steps. Saving just the model means you must
recreate the preprocessor at serving time, which is error-prone.

### Version your model files
Use meaningful file names that include version or date information:
```
model_v1.0_2024-01-15.joblib
model_v1.1_2024-02-01.joblib
```
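
One way to keep such names consistent is to generate them from a small helper. The function name `versioned_model_path` and the `models` directory are hypothetical, shown only to sketch the naming scheme above.

```python
from datetime import date
from pathlib import Path


def versioned_model_path(directory, name, version, trained_on):
    """Build a path like <directory>/<name>_v<version>_<YYYY-MM-DD>.joblib."""
    filename = f"{name}_v{version}_{trained_on.isoformat()}.joblib"
    return Path(directory) / filename


path = versioned_model_path("models", "model", "1.0", date(2024, 1, 15))
print(path.name)
```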

### Log scikit-learn version and feature names
Version mismatches between training and serving can cause subtle bugs or failures.

In [None]:
# Best practice: save metadata alongside the model
import json
from datetime import datetime

metadata = {
    'model_name': 'breast_cancer_classifier',
    'model_version': '1.0',
    'sklearn_version': sklearn.__version__,
    'numpy_version': np.__version__,
    'training_date': datetime.now().isoformat(),
    'feature_names': list(X.columns),
    'n_features': X.shape[1],
    'target_names': list(data.target_names),
    'training_samples': X_train.shape[0],
    'test_accuracy': float(accuracy_score(y_test, y_pred)),
    'random_state': 42,
}

metadata_path = os.path.join(save_dir, 'model_metadata.json')
with open(metadata_path, 'w') as f:
    json.dump(metadata, f, indent=2)

print("Model metadata saved:")
print(json.dumps(metadata, indent=2))

In [None]:
# Best practice: verify at load time that versions match
def load_model_with_checks(model_path, metadata_path):
    """Load a model and verify version compatibility."""
    # Load metadata
    with open(metadata_path, 'r') as f:
        meta = json.load(f)
    
    # Check sklearn version
    if meta['sklearn_version'] != sklearn.__version__:
        print(f"WARNING: Model trained with sklearn {meta['sklearn_version']}, "
              f"but current version is {sklearn.__version__}")
    else:
        print(f"sklearn version match: {sklearn.__version__}")
    
    # Load model
    pipeline = joblib.load(model_path)
    
    print(f"Model loaded: {meta['model_name']} v{meta['model_version']}")
    print(f"Trained on: {meta['training_date']}")
    print(f"Expected features: {meta['n_features']}")
    print(f"Test accuracy: {meta['test_accuracy']:.4f}")
    
    return pipeline, meta

loaded_pipe, loaded_meta = load_model_with_checks(model_path, metadata_path)
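
The saved feature names can also be used to validate incoming data before it reaches the pipeline. The sketch below assumes a hypothetical helper, `check_features`, and a shortened feature list; in practice the expected names would come from `meta['feature_names']`.

```python
import pandas as pd


def check_features(frame, expected_features):
    """Raise if columns are missing; return the frame with columns in training order."""
    missing = [name for name in expected_features if name not in frame.columns]
    if missing:
        raise ValueError(f"Missing features: {missing}")
    # Select (and reorder) exactly the training-time columns; extras are dropped.
    return frame[list(expected_features)]


# Shortened list for illustration; normally read from the metadata file.
expected = ["mean radius", "mean texture"]
request = pd.DataFrame({"mean texture": [10.4], "extra": [1], "mean radius": [18.0]})

clean = check_features(request, expected)
print(list(clean.columns))
```

Failing loudly on a missing column is a design choice: it surfaces upstream schema changes immediately instead of letting them show up as degraded predictions.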

## 7. Beyond Joblib: ONNX and Docker (Conceptual)

For production deployments beyond simple scripts, two technologies are commonly used:

### ONNX (Open Neural Network Exchange)
- An open format for representing ML models
- Convert scikit-learn models to ONNX for language-agnostic serving (C++, Java, etc.)
- Faster inference in some cases
- Library: `skl2onnx` for scikit-learn to ONNX conversion

### Docker
- Package your model, dependencies, and serving code into a container
- Ensures identical environment between development and production
- Typical pattern: Flask/FastAPI app inside a Docker container that loads the joblib model

These are beyond the scope of this notebook, but important to know about for real deployments.

## 8. Common Mistakes

1. **Saving the model without the preprocessor**: At serving time, raw data must go through the same preprocessing. Save the entire `Pipeline`.
2. **Version mismatch**: scikit-learn does not guarantee that a model saved with one version (e.g., 1.2) loads or behaves correctly in another (e.g., 1.4). Always log and check versions.
3. **Not testing the loaded model**: Always verify that `loaded_model.predict(X_test)` matches the original predictions before deploying.
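4. **Relying on column position**: If the model consumes features by position (for example, from a plain NumPy array) and the input columns arrive in a different order, predictions will be silently wrong. Pass a DataFrame and select columns by name in the `ColumnTransformer`.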
5. **Not saving metadata**: Without metadata (feature names, training date, performance metrics), you cannot audit or debug deployed models.

In [None]:
# Clean up temporary files
import shutil
shutil.rmtree(save_dir)
print(f"Cleaned up temporary directory: {save_dir}")

## 9. Summary

- **Training vs. serving**: Train offline, serve online. The bridge is model serialization.
- **Save the full pipeline**: Always serialize the entire `Pipeline` (preprocessor + model), not just the model.
- **joblib** is preferred over pickle for scikit-learn objects (better NumPy array handling).
- **Workflow**: `joblib.dump(pipeline, path)` to save, `joblib.load(path)` to restore.
- **Best practices**:
  - Version your model files
  - Log sklearn version and feature names in metadata
  - Verify loaded model produces identical results
  - Use compression for large models (`compress` parameter)
- **Beyond joblib**: ONNX for cross-language serving, Docker for reproducible deployment environments.