# Phase 5.1: Automatic Logging with MLflow Autolog

This notebook demonstrates:
1. **Enabling Autolog** - One-line setup for automatic logging
2. **What Gets Logged** - Parameters, metrics, models, and more
3. **GridSearchCV Support** - Automatic CV result logging
4. **Configuration Options** - Customizing autolog behavior

## What is Autolog?

**Autolog** records parameters, training metrics, and the fitted model for you, so you don't have to write explicit `mlflow.log_*` calls.

### Without Autolog (Manual)
```python
with mlflow.start_run():  # the run is closed automatically, even on errors
    mlflow.log_param("n_estimators", 100)
    mlflow.log_param("max_depth", 5)
    model.fit(X, y)
    accuracy = model.score(X, y)  # training accuracy
    mlflow.log_metric("accuracy", accuracy)
    mlflow.sklearn.log_model(model, "model")
```

### With Autolog (Automatic)
```python
mlflow.sklearn.autolog()  # Enable once
model.fit(X, y)           # Everything logged automatically!
```

## Supported Frameworks

| Framework | Autolog call |
|-----------|--------------|
| Scikit-learn | `mlflow.sklearn.autolog()` |
| TensorFlow | `mlflow.tensorflow.autolog()` |
| PyTorch (Lightning models) | `mlflow.pytorch.autolog()` |
| XGBoost | `mlflow.xgboost.autolog()` |
| LightGBM | `mlflow.lightgbm.autolog()` |

## Learning Goals
- Enable autologging with one line
- Understand what gets logged automatically
- Use autolog with GridSearchCV
- Configure autolog options

## Step 1: Import Libraries

In [None]:
# MLflow imports
import mlflow
import mlflow.sklearn

# sklearn imports
from sklearn.datasets import load_iris, load_wine
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

# System
import os
import warnings
warnings.filterwarnings("ignore")

print("All libraries imported successfully!")
print("Ready to learn about autologging!")

## Step 2: Connect to MLflow

In [None]:
# Get MLflow tracking server URL
TRACKING_URI = os.getenv("MLFLOW_TRACKING_URI", "http://localhost:5000")

# Connect to MLflow
mlflow.set_tracking_uri(TRACKING_URI)
mlflow.set_experiment("phase5-autolog")

print(f"Connected to MLflow at: {TRACKING_URI}")
print("Experiment: phase5-autolog")

## Step 3: Enable Autologging

Call `mlflow.sklearn.autolog()` once, and MLflow patches scikit-learn so that every subsequent `fit()` call in this session is logged to a run.

In [None]:
print("="*60)
print("MLflow Autologging Demo")
print("="*60)

# Enable autologging for sklearn
# This single line patches sklearn estimators so later fit() calls are logged
mlflow.sklearn.autolog()

print("\nAutologging enabled for sklearn!")
print("\nWhat will be logged automatically:")
print("  - Model parameters (n_estimators, max_depth, etc.)")
print("  - Training metrics (training_score, training_accuracy_score, training_f1_score, etc.)")
print("  - The trained model itself")
print("  - Model signature")
print("  - Training confusion matrix plot (classifiers, if matplotlib is installed)")

## Step 4: Prepare Data

In [None]:
# Load Iris dataset
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42
)

print("Data loaded!")
print(f"Training samples: {len(X_train)}")
print(f"Testing samples: {len(X_test)}")

## Example 1: Simple Model Training

Just train a model normally - autolog handles everything!

In [None]:
print("\n" + "-"*60)
print("[Example 1: Simple Model Training]")
print("-"*60)

print("\nTraining RandomForestClassifier...")
print("(Parameters, metrics, and model will be logged automatically!)")

# Just train normally - no mlflow.log_* calls needed!
model = RandomForestClassifier(
    n_estimators=100, 
    max_depth=5, 
    random_state=42
)
model.fit(X_train, y_train)

# Calculate accuracy
accuracy = model.score(X_test, y_test)

print(f"\nAccuracy: {accuracy:.4f}")
print("\nCheck MLflow UI - everything was logged automatically!")
print("Look for:")
print("  - Parameters: n_estimators=100, max_depth=5")
print("  - Metrics: training_score, training_accuracy_score")
print("  - Artifacts: the model and a training confusion matrix plot")

## Example 2: GridSearchCV with Autolog

Autolog supports GridSearchCV: it logs the full cross-validation results as an artifact on a parent run and creates child runs for the best parameter combinations.

In [None]:
print("\n" + "-"*60)
print("[Example 2: GridSearchCV]")
print("-"*60)

print("\nRunning GridSearchCV...")
print("(All CV results will be logged automatically!)")

# Define parameter grid
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [3, 5, 10]
}

# Create GridSearchCV
grid_search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    cv=3,                      # 3-fold cross-validation
    scoring="accuracy",        # Optimize for accuracy
    return_train_score=True    # Log training scores too
)

# Fit - autolog captures everything!
grid_search.fit(X_train, y_train)

print(f"\nBest parameters: {grid_search.best_params_}")
print(f"Best CV score: {grid_search.best_score_:.4f}")
print(f"Test score: {grid_search.score(X_test, y_test):.4f}")
print("\nCV results logged to MLflow!")
print("Check the UI - you'll see a parent run with child runs for the best parameter combinations (top 5 by default).")

## Example 3: Multiple Models

Train multiple different models - each gets its own run automatically.

In [None]:
print("\n" + "-"*60)
print("[Example 3: Multiple Models]")
print("-"*60)

# Load different dataset for variety
wine = load_wine()
X_train_w, X_test_w, y_train_w, y_test_w = train_test_split(
    wine.data, wine.target, test_size=0.2, random_state=42
)

# Define multiple models to try
models = [
    ("LogisticRegression", LogisticRegression(max_iter=1000, random_state=42)),
    ("DecisionTree", DecisionTreeClassifier(max_depth=5, random_state=42)),
    ("RandomForest", RandomForestClassifier(n_estimators=50, random_state=42)),
]

print("\nTraining multiple models on Wine dataset...")

# Train each model - autolog creates separate runs automatically!
for name, model in models:
    print(f"  Training {name}...")
    model.fit(X_train_w, y_train_w)
    accuracy = model.score(X_test_w, y_test_w)
    print(f"    Accuracy: {accuracy:.4f}")

print("\nAll models logged automatically with autolog!")
print("Each model has its own run in MLflow.")

## Example 4: Configuring Autolog

You can customize what autolog captures. Calling `autolog()` again with keyword arguments replaces the current configuration.

In [None]:
print("\n" + "-"*60)
print("[Example 4: Autolog Configuration]")
print("-"*60)

# Configure autolog with specific options
mlflow.sklearn.autolog(
    log_input_examples=True,      # Log sample input data
    log_model_signatures=True,    # Log input/output schemas
    log_models=True,              # Log the model itself
    log_post_training_metrics=True,  # Log metrics from post-fit calls such as score()
    silent=False,                 # Show autolog messages
    max_tuning_runs=5             # Limit GridSearchCV child runs
)

print("\nAutolog configured with custom settings:")
print("  - log_input_examples: True")
print("  - log_model_signatures: True")
print("  - log_models: True")
print("  - log_post_training_metrics: True")
print("  - max_tuning_runs: 5")

# Train with custom config
print("\nTraining model with custom config...")
model = RandomForestClassifier(n_estimators=75, max_depth=7, random_state=42)
model.fit(X_train, y_train)

print(f"Accuracy: {model.score(X_test, y_test):.4f}")

## Example 5: Disabling Autolog

You can disable autolog when you want manual control.

In [None]:
print("\n" + "-"*60)
print("[Example 5: Disabling Autolog]")
print("-"*60)

# Disable autolog
mlflow.sklearn.autolog(disable=True)
print("\nAutolog disabled.")

# This training will NOT be logged
print("\nTraining model (this will NOT be logged)...")
model = RandomForestClassifier(n_estimators=25, random_state=42)
model.fit(X_train, y_train)
print(f"Accuracy: {model.score(X_test, y_test):.4f}")
print("(No run created in MLflow)")

# Re-enable with default settings (the custom options from Example 4 are not restored)
mlflow.sklearn.autolog()
print("\nAutolog re-enabled.")

## Summary: Autolog Configuration Options

| Option | Default | Description |
|--------|---------|-------------|
| `log_input_examples` | False | Log sample input data |
| `log_model_signatures` | True | Log input/output schemas |
| `log_models` | True | Log the trained model |
| `log_post_training_metrics` | True | Log metrics computed after `fit()`, e.g. via `score()` or `sklearn.metrics` functions |
| `disable` | False | Disable autologging |
| `silent` | False | Suppress autolog messages |
| `max_tuning_runs` | 5 | Max GridSearchCV child runs to log |

### When to Use Autolog

**Use Autolog When:**
- Rapid prototyping
- Training many models quickly
- You want consistent logging

**Use Manual Logging When:**
- You need custom metrics
- You want specific artifact organization
- You need full control over what's logged

In [None]:
print("="*60)
print("Autologging Tutorial Complete!")
print("="*60)
print(f"\nView experiments at: {TRACKING_URI}")
print("\nWhat you learned:")
print("  1. How to enable autolog with one line")
print("  2. What gets logged automatically")
print("  3. How autolog works with GridSearchCV")
print("  4. How to configure autolog options")
print("  5. When to use autolog vs manual logging")