# Phase 2.3: Understanding Model Signatures in MLflow

This notebook demonstrates:
1. **Automatic Signature Inference** - Let MLflow figure out the schema
2. **Manual Signature Definition** - Define schemas explicitly
3. **ColSpec Signatures** - For tabular data (DataFrames)
4. **TensorSpec Signatures** - For tensor data (deep learning)
5. **Signature Validation** - How MLflow validates inputs

## What is a Model Signature?

A **model signature** is a contract that defines:
- **Input Schema**: What data types and shapes the model expects
- **Output Schema**: What the model returns

Think of it like a function definition:
```python
def predict(sepal_length: float, sepal_width: float, ...) -> int:
    ...
```

## Why Signatures Matter

1. **Documentation**: Others know what data to provide
2. **Validation**: MLflow can catch input errors early
3. **Deployment**: Production systems know the expected format
4. **Type Safety**: Prevents runtime errors from wrong data types

## Learning Goals
- Understand what signatures are and why they're important
- Know how to infer signatures automatically
- Learn to define signatures manually (ColSpec and TensorSpec)
- See how signature validation works

## Step 1: Import Libraries

We'll import MLflow's signature-related classes along with our usual tools.

In [1]:
# mlflow: Main library for experiment tracking
import mlflow
import mlflow.sklearn

# Signature-related imports
from mlflow.models import infer_signature, ModelSignature
from mlflow.types.schema import Schema, ColSpec, TensorSpec

# sklearn for building models
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

# Data handling
import pandas as pd
import numpy as np

# System
import os
import warnings
warnings.filterwarnings('ignore')

print("All libraries imported successfully!")
print("Ready to learn about model signatures!")

All libraries imported successfully!
Ready to learn about model signatures!


## Step 2: Connect to MLflow

Set up our connection to the MLflow tracking server.

In [2]:
# Get MLflow tracking server URL
TRACKING_URI = os.getenv("MLFLOW_TRACKING_URI", "http://localhost:5000")

# Connect to MLflow
mlflow.set_tracking_uri(TRACKING_URI)

# Set experiment for this tutorial
mlflow.set_experiment("phase2-signatures")

print(f"Connected to MLflow at: {TRACKING_URI}")
print("Experiment: phase2-signatures")

2026/01/10 22:22:07 INFO mlflow.tracking.fluent: Experiment with name 'phase2-signatures' does not exist. Creating a new experiment.


Connected to MLflow at: http://localhost:5000
Experiment: phase2-signatures


## Step 3: Prepare Data and Train a Model

We'll use the Iris dataset and train a simple model to demonstrate signatures.

In [3]:
# Load Iris dataset
iris = load_iris()

# Create DataFrame with feature names
# Using pandas DataFrame is important because column names become part of the signature
X = pd.DataFrame(iris.data, columns=iris.feature_names)
y = iris.target

print("Dataset loaded!")
print(f"\nFeature names: {list(iris.feature_names)}")
print(f"\nSample data:")
print(X.head())

# Train a model
print("\nTraining model...")
model = RandomForestClassifier(n_estimators=50, random_state=42)
model.fit(X, y)

# Get predictions for signature inference
predictions = model.predict(X)

print(f"Model trained! Sample predictions: {predictions[:5]}")

Dataset loaded!

Feature names: ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']

Sample data:
   sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)
0                5.1               3.5                1.4               0.2
1                4.9               3.0                1.4               0.2
2                4.7               3.2                1.3               0.2
3                4.6               3.1                1.5               0.2
4                5.0               3.6                1.4               0.2

Training model...
Model trained! Sample predictions: [0 0 0 0 0]


## Method 1: Automatic Signature Inference

The easiest way to create a signature is to let MLflow infer it from sample data. MLflow analyzes your input data and model predictions to automatically determine the schema.

**Pros:**
- Very easy - just one function call
- Accurate for most use cases

**Cons:**
- Less control over exact types: they follow the dtypes of the sample data
- The schema is only as representative as the sample you pass in

In [4]:
print("="*60)
print("Method 1: Inferred Signature (Automatic)")
print("="*60)

# infer_signature(input_data, output_data) automatically creates a signature
# It looks at the data types and shapes of your inputs and outputs
inferred_signature = infer_signature(X, predictions)

# Let's examine what MLflow inferred
print("\nInput Schema (what the model expects):")
print("-" * 40)
# to_dict() converts the schema to a readable dictionary
for col in inferred_signature.inputs.to_dict():
    print(f"  - {col['name']}: {col['type']}")

print("\nOutput Schema (what the model returns):")
print("-" * 40)
for col in inferred_signature.outputs.to_dict():
    if 'name' in col:
        print(f"  - {col['name']}: {col['type']}")
    else:
        print(f"  - {col['type']}")

# Log model with inferred signature
with mlflow.start_run(run_name="inferred-signature"):
    mlflow.sklearn.log_model(
        model,
        "model",
        signature=inferred_signature
    )
    print("\nModel logged with inferred signature!")

Method 1: Inferred Signature (Automatic)

Input Schema (what the model expects):
----------------------------------------
  - sepal length (cm): double
  - sepal width (cm): double
  - petal length (cm): double
  - petal width (cm): double

Output Schema (what the model returns):
----------------------------------------
  - tensor





Model logged with inferred signature!
🏃 View run inferred-signature at: http://localhost:5000/#/experiments/15/runs/485bb821b9544f7bb7dc24d7c218d304
🧪 View experiment at: http://localhost:5000/#/experiments/15


## Method 2: Manual Signature with ColSpec

**ColSpec** (Column Specification) is used for tabular data with named columns. This is perfect for:
- Pandas DataFrames
- CSV data
- SQL query results

### Common ColSpec Types

| Type | NumPy/pandas Equivalent | Use For |
|------|-------------------------|---------|
| `"double"` | float64 | Continuous numbers |
| `"float"` | float32 | Memory-efficient floats |
| `"long"` | int64 | 64-bit integers |
| `"integer"` | int32 | 32-bit integers |
| `"boolean"` | bool | True/False values |
| `"string"` | str | Text data |

In [5]:
print("\n" + "="*60)
print("Method 2: Manual Signature with ColSpec")
print("="*60)

# Define input schema explicitly
# ColSpec(data_type, column_name)
input_schema = Schema([
    ColSpec("double", "sepal length (cm)"),  # First feature
    ColSpec("double", "sepal width (cm)"),   # Second feature
    ColSpec("double", "petal length (cm)"),  # Third feature
    ColSpec("double", "petal width (cm)"),   # Fourth feature
])

# Define output schema
# Our model outputs integer class predictions (0, 1, or 2)
output_schema = Schema([
    ColSpec("long", "prediction")  # "long" = 64-bit integer
])

# Create the signature by combining input and output schemas
manual_signature = ModelSignature(
    inputs=input_schema, 
    outputs=output_schema
)

# Display the manual signature
print("\nManual Input Schema:")
print("-" * 40)
for col in manual_signature.inputs.to_dict():
    print(f"  - {col['name']}: {col['type']}")

print("\nManual Output Schema:")
print("-" * 40)
for col in manual_signature.outputs.to_dict():
    print(f"  - {col['name']}: {col['type']}")

# Log model with manual ColSpec signature
with mlflow.start_run(run_name="manual-colspec-signature"):
    mlflow.sklearn.log_model(
        model,
        "model",
        signature=manual_signature
    )
    print("\nModel logged with manual ColSpec signature!")




Method 2: Manual Signature with ColSpec

Manual Input Schema:
----------------------------------------
  - sepal length (cm): double
  - sepal width (cm): double
  - petal length (cm): double
  - petal width (cm): double

Manual Output Schema:
----------------------------------------
  - prediction: long





Model logged with manual ColSpec signature!
🏃 View run manual-colspec-signature at: http://localhost:5000/#/experiments/15/runs/c12ffd0ed7934635b4bd256f149b6820
🧪 View experiment at: http://localhost:5000/#/experiments/15


## Method 3: TensorSpec Signature (For Deep Learning)

**TensorSpec** is used for tensor data, common in deep learning models (TensorFlow, PyTorch). It specifies:
- Data type (dtype)
- Shape (including batch dimension)
- Optional name (when set, inputs must be supplied under that name)

**Shape Convention:**
- Use `-1` for the batch dimension (variable number of samples)
- Other dimensions are fixed (e.g., 4 features for Iris)

Example shapes:
- `(-1, 4)` = Any number of samples, each with 4 features
- `(-1, 28, 28, 1)` = Batch of 28x28 grayscale images
- `(-1, 224, 224, 3)` = Batch of 224x224 RGB images
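
The `-1` convention can be illustrated with plain NumPy. The helper `shape_matches` below is a hypothetical illustration of the rule, not MLflow's own implementation: a dimension declared as `-1` accepts any size, and every other dimension must match exactly.

```python
import numpy as np

def shape_matches(actual, expected):
    """Hypothetical helper: True if `actual` fits `expected`, where -1 is a wildcard."""
    if len(actual) != len(expected):
        return False
    return all(e == -1 or a == e for a, e in zip(actual, expected))

print(shape_matches(np.zeros((1, 4)).shape, (-1, 4)))   # one sample, 4 features
print(shape_matches(np.zeros((32, 4)).shape, (-1, 4)))  # 32 samples, 4 features
print(shape_matches(np.zeros((32, 3)).shape, (-1, 4)))  # wrong feature count
print(shape_matches(np.zeros(4).shape, (-1, 4)))        # missing batch dimension
```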

In [6]:
print("\n" + "="*60)
print("Method 3: TensorSpec Signature (For Tensors)")
print("="*60)

# Define input schema with TensorSpec
# TensorSpec(dtype, shape, name)
tensor_input_schema = Schema([
    TensorSpec(
        np.dtype("float64"),  # Data type: 64-bit float
        (-1, 4),               # Shape: batch of 4-feature vectors
        "input_features"       # Name: inputs are matched against it at predict time
    )
])

# Define output schema with TensorSpec
tensor_output_schema = Schema([
    TensorSpec(
        np.dtype("int64"),    # Data type: 64-bit integer
        (-1,),                 # Shape: batch of scalar predictions
        "predictions"          # Name for documentation
    )
])

# Create the tensor signature
tensor_signature = ModelSignature(
    inputs=tensor_input_schema,
    outputs=tensor_output_schema
)

# Display the tensor signature
print("\nTensor Input Schema:")
print("-" * 40)
for spec in tensor_signature.inputs.to_dict():
    print(f"  - name: {spec.get('name', 'N/A')}")
    print(f"    dtype: {spec.get('tensor-spec', {}).get('dtype', 'N/A')}")
    print(f"    shape: {spec.get('tensor-spec', {}).get('shape', 'N/A')}")

print("\nTensor Output Schema:")
print("-" * 40)
for spec in tensor_signature.outputs.to_dict():
    print(f"  - name: {spec.get('name', 'N/A')}")
    print(f"    dtype: {spec.get('tensor-spec', {}).get('dtype', 'N/A')}")
    print(f"    shape: {spec.get('tensor-spec', {}).get('shape', 'N/A')}")

# Log model with TensorSpec signature
with mlflow.start_run(run_name="tensor-signature") as tensor_run:
    mlflow.sklearn.log_model(
        model,
        "model",
        signature=tensor_signature
    )
    saved_run_id = tensor_run.info.run_id
    print("\nModel logged with TensorSpec signature!")




Method 3: TensorSpec Signature (For Tensors)

Tensor Input Schema:
----------------------------------------
  - name: input_features
    dtype: float64
    shape: (-1, 4)

Tensor Output Schema:
----------------------------------------
  - name: predictions
    dtype: int64
    shape: (-1,)





Model logged with TensorSpec signature!
🏃 View run tensor-signature at: http://localhost:5000/#/experiments/15/runs/954c451b0e41489c866c17bbd29b9ad5
🧪 View experiment at: http://localhost:5000/#/experiments/15


## Step 4: Signature Validation in Action

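Now let's see how signatures help validate input data. When you load a model with a signature through `mlflow.pyfunc`, MLflow checks each `predict()` input against the input schema before it reaches the model. The tests below run against the model logged with the TensorSpec signature, and in the recorded outputs all three calls fail. The DataFrame is rejected because its column names do not include the tensor name `input_features`. The NumPy arrays fail with a `float()` error about a `dict`, which suggests the array is wrapped under the tensor name before being handed to the scikit-learn model.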

In [7]:
print("\n" + "="*60)
print("Signature Validation Demo")
print("="*60)

# Load the model with tensor signature
model_uri = f"runs:/{saved_run_id}/model"
loaded_model = mlflow.pyfunc.load_model(model_uri)

print(f"\nLoaded model from: {model_uri}")

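# Test 1: DataFrame input with the Iris column names
# (rejected in the recorded run: the signature expects an input named 'input_features')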
print("\n[Test 1] Valid DataFrame Input:")
print("-" * 40)
valid_df_input = pd.DataFrame(
    [[5.1, 3.5, 1.4, 0.2]], 
    columns=iris.feature_names
)
print(f"Input: {valid_df_input.values.tolist()}")
try:
    result = loaded_model.predict(valid_df_input)
    print(f"Prediction: {result}")
    print("Status: SUCCESS")
except Exception as e:
    print(f"Error: {e}")


Signature Validation Demo

Loaded model from: runs:/954c451b0e41489c866c17bbd29b9ad5/model

[Test 1] Valid DataFrame Input:
----------------------------------------
Input: [[5.1, 3.5, 1.4, 0.2]]
Error: Failed to enforce schema of data '   sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)
0                5.1               3.5                1.4               0.2' with schema '['input_features': Tensor('float64', (-1, 4))]'. Error: Model is missing inputs ['input_features']. Note that there were extra inputs: ['petal width (cm)', 'sepal length (cm)', 'sepal width (cm)', 'petal length (cm)'].


In [8]:
# Test 2: NumPy array input matching the declared shape (-1, 4)
# (still fails in the recorded run; see the error in the output)
print("\n[Test 2] Valid NumPy Array Input:")
print("-" * 40)
valid_numpy_input = np.array([[5.1, 3.5, 1.4, 0.2]])
print(f"Input shape: {valid_numpy_input.shape}")
print(f"Input: {valid_numpy_input.tolist()}")
try:
    result = loaded_model.predict(valid_numpy_input)
    print(f"Prediction: {result}")
    print("Status: SUCCESS")
except Exception as e:
    print(f"Error: {e}")


[Test 2] Valid NumPy Array Input:
----------------------------------------
Input shape: (1, 4)
Input: [[5.1, 3.5, 1.4, 0.2]]
Error: float() argument must be a string or a real number, not 'dict'


In [9]:
# Test 3: Multiple samples (fails in the recorded run for the same reason as Test 2)
print("\n[Test 3] Multiple Samples:")
print("-" * 40)
multi_sample_input = np.array([
    [5.1, 3.5, 1.4, 0.2],  # Sample 1
    [6.2, 2.9, 4.3, 1.3],  # Sample 2
    [7.7, 3.0, 6.1, 2.3],  # Sample 3
])
print(f"Input shape: {multi_sample_input.shape}")
try:
    results = loaded_model.predict(multi_sample_input)
    print(f"Predictions: {results}")
    print(f"Class names: {[iris.target_names[p] for p in results]}")
    print("Status: SUCCESS")
except Exception as e:
    print(f"Error: {e}")


[Test 3] Multiple Samples:
----------------------------------------
Input shape: (3, 4)
Error: float() argument must be a string or a real number, not 'dict'


## Comparing Signature Methods

Let's create a summary of when to use each signature method.

In [10]:
print("\n" + "="*60)
print("Signature Methods Comparison")
print("="*60)

comparison = pd.DataFrame({
    "Method": ["infer_signature()", "ColSpec", "TensorSpec"],
    "Best For": [
        "Quick setup, most sklearn models",
        "Tabular data with named columns",
        "Deep learning, image/tensor data"
    ],
    "Complexity": ["Low", "Medium", "Medium"],
    "Control": ["Automatic", "Full control", "Full control"],
    "Example Input": [
        "DataFrame, ndarray",
        "DataFrame with column names",
        "Numpy arrays, tensors"
    ]
})

print("\n" + comparison.to_string(index=False))


Signature Methods Comparison

           Method                         Best For Complexity      Control               Example Input
infer_signature() Quick setup, most sklearn models        Low    Automatic          DataFrame, ndarray
          ColSpec  Tabular data with named columns     Medium Full control DataFrame with column names
       TensorSpec Deep learning, image/tensor data     Medium Full control       Numpy arrays, tensors


## Summary: Model Signatures

### Key Takeaways

1. **Signatures are Important!**
   - They document what your model expects
   - They enable input validation
   - They're required for some deployment targets

2. **Three Ways to Create Signatures:**
   ```python
   # Method 1: Automatic inference (easiest)
   signature = mlflow.models.infer_signature(X, predictions)
   
   # Method 2: ColSpec (for tabular data)
   input_schema = Schema([ColSpec("double", "feature_name")])
   signature = ModelSignature(inputs=input_schema, outputs=output_schema)
   
   # Method 3: TensorSpec (for tensors/deep learning)
   input_schema = Schema([TensorSpec(np.dtype("float32"), (-1, 10))])
   signature = ModelSignature(inputs=input_schema, outputs=output_schema)
   ```

3. **Best Practices:**
   - Always include a signature when logging models
   - Use `infer_signature()` for quick development
   - Use manual signatures for production models
   - Test that your signature works with real inputs

In [11]:
print("="*60)
print("Signatures Tutorial Complete!")
print("="*60)
print(f"\nView models at: {TRACKING_URI}")
print("\nWhat you learned:")
print("  1. What signatures are and why they matter")
print("  2. How to infer signatures automatically")
print("  3. How to define ColSpec signatures for tabular data")
print("  4. How to define TensorSpec signatures for tensors")
print("  5. How signature validation works")
print("\nCheck the models in MLflow UI to see their signatures!")

Signatures Tutorial Complete!

View models at: http://localhost:5000

What you learned:
  1. What signatures are and why they matter
  2. How to infer signatures automatically
  3. How to define ColSpec signatures for tabular data
  4. How to define TensorSpec signatures for tensors
  5. How signature validation works

Check the models in MLflow UI to see their signatures!
