# Model Versioning and MLOps

## Overview
This notebook sets up model versioning and related MLOps practices for production model management. It records model lineage and performance metrics in a file-based registry, and writes the configuration for canary deployments and scheduled retraining.

## Prerequisites
- Completed: `kserve-model-deployment.ipynb`
- KServe InferenceService deployed
- Model registry access
- Prometheus for metrics collection

## Learning Objectives
- Implement model versioning strategy
- Track model lineage and metadata
- Automate model retraining
- Implement canary deployments
- Monitor model drift

## Key Concepts
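- **Model Registry**: Central repository that stores model artifacts together with their versions and metadata
- **Versioning**: Semantic versioning (MAJOR.MINOR.PATCH) for models
- **Lineage**: Record of the data and code used to train each model version
- **Canary Deployment**: Gradual traffic shifting from the stable version to a new version
- **Model Drift**: Performance degradation over time as production data diverges from the training data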
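
The semantic versioning scheme can be made concrete with a small helper. This is a minimal sketch: `bump_version` is a hypothetical name, and which kind of model change counts as major, minor, or patch is an assumption each team has to settle for itself (for example, major for a changed input schema, minor for new features, patch for a retrain on fresh data).

```python
def bump_version(version: str, change: str) -> str:
    """Return the next semantic version for a 'major', 'minor', or 'patch' change."""
    major, minor, patch = (int(part) for part in version.split("."))
    if change == "major":
        return f"{major + 1}.0.0"
    if change == "minor":
        return f"{major}.{minor + 1}.0"
    if change == "patch":
        return f"{major}.{minor}.{patch + 1}"
    raise ValueError(f"Unknown change type: {change!r}")


print(bump_version("1.0.0", "minor"))  # 1.1.0
```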

## Setup Section

In [None]:
import sys
import os
import json
import yaml
import pickle
import logging
from pathlib import Path
from datetime import datetime, timedelta
import pandas as pd
import numpy as np
import hashlib

# Setup path for utils module - works from any directory
def find_utils_path():
    """Find utils path regardless of current working directory"""
    possible_paths = [
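        # __file__ is undefined in notebooks; check globals(), because dir() inside a function lists only local names
        Path(__file__).parent.parent / 'utils' if '__file__' in globals() else None,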
        Path.cwd() / 'notebooks' / 'utils',
        Path.cwd().parent / 'utils',
        Path('/workspace/repo/notebooks/utils'),
        Path('/opt/app-root/src/notebooks/utils'),
    ]
    for p in possible_paths:
        if p and p.exists() and (p / 'common_functions.py').exists():
            return str(p)
    return None

utils_path = find_utils_path()
if utils_path:
    sys.path.insert(0, utils_path)
    print(f"✅ Utils path found: {utils_path}")

# Try to import common functions, with fallback
try:
    from common_functions import setup_environment
    print("✅ Common functions imported")
except ImportError as e:
    print(f"⚠️ Using fallback setup_environment ({e})")
    def setup_environment():
        os.makedirs('/opt/app-root/src/data/processed', exist_ok=True)
        os.makedirs('/opt/app-root/src/models', exist_ok=True)
        return {'data_dir': '/opt/app-root/src/data', 'models_dir': '/opt/app-root/src/models'}

# Configure logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

# Setup environment
env_info = setup_environment()
logger.info(f"Environment ready: {env_info}")

# Define paths
MODELS_DIR = Path('/opt/app-root/src/models')
MODELS_DIR.mkdir(parents=True, exist_ok=True)
DATA_DIR = Path('/opt/app-root/src/data')
PROCESSED_DIR = DATA_DIR / 'processed'
PROCESSED_DIR.mkdir(parents=True, exist_ok=True)

# Model registry
MODEL_REGISTRY = MODELS_DIR / 'registry'
MODEL_REGISTRY.mkdir(exist_ok=True)

logger.info(f"Model registry: {MODEL_REGISTRY}")

## Implementation Section

### 1. Create Model Metadata

In [None]:
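def create_model_metadata(model_name, model_version, training_data_hash, performance_metrics):
    """
    Create model metadata covering identity, lineage, and deployment requirements.

    Args:
        model_name: Name of the model
        model_version: Semantic version string, e.g. '1.0.0'
        training_data_hash: Fingerprint of the training data
        performance_metrics: Dict of evaluation metrics

    Returns:
        Metadata dict
    """
    # Use a single timestamp so created_at and training_date agree
    created_at = datetime.now().isoformat()
    metadata = {
        'model_name': model_name,
        'model_version': model_version,
        'created_at': created_at,
        'training_data_hash': training_data_hash,
        'performance_metrics': performance_metrics,
        'lineage': {
            'training_notebook': 'ensemble-anomaly-methods.ipynb',
            'training_date': created_at,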
            'data_sources': [
                'prometheus-metrics-collection',
                'openshift-events-analysis',
                'log-parsing-analysis'
            ]
        },
        'deployment_config': {
            'framework': 'sklearn',
            'python_version': '3.11',
            'dependencies': [
                'scikit-learn>=1.0',
                'pandas>=1.5',
                'numpy>=1.23'
            ]
        }
    }
    return metadata

# Load or generate training data for hash
training_data_file = PROCESSED_DIR / 'synthetic_anomalies.parquet'
if training_data_file.exists():
    training_data = pd.read_parquet(training_data_file)
    logger.info(f"Loaded training data for hash: {training_data.shape}")
else:
    logger.info("Training data not found - generating for validation")
    np.random.seed(42)
    training_data = pd.DataFrame(np.random.normal(50, 10, (100, 5)), columns=[f'metric_{i}' for i in range(5)])
    training_data['label'] = np.random.choice([0, 1], 100, p=[0.95, 0.05])
    training_data['timestamp'] = pd.date_range(end=datetime.now(), periods=100, freq='1min')
    training_data.to_parquet(training_data_file)

# Fingerprint the data once for both branches (MD5 is a content fingerprint here, not a security measure)
data_hash = hashlib.md5(pd.util.hash_pandas_object(training_data, index=True).values).hexdigest()

# Load or create ensemble config for performance metrics
ensemble_config_file = MODELS_DIR / 'ensemble_config.pkl'
if ensemble_config_file.exists():
    with open(ensemble_config_file, 'rb') as f:
        ensemble_config = pickle.load(f)
    performance = ensemble_config.get('performance', [{}])[0]
else:
    logger.info("Ensemble config not found - creating default")
    ensemble_config = {
        'best_method': 'ensemble_weighted',
        'performance': [{'Method': 'Ensemble', 'Precision': 0.92, 'Recall': 0.88, 'F1': 0.90}]
    }
    with open(ensemble_config_file, 'wb') as f:
        pickle.dump(ensemble_config, f)
    performance = ensemble_config['performance'][0]

# Create metadata
metadata = create_model_metadata(
    'anomaly-detector',
    '1.0.0',
    data_hash,
    performance
)

logger.info(f"Created model metadata")
print(json.dumps(metadata, indent=2, default=str))
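
The metadata above fingerprints the training data. The same idea can be applied to the model artifact, so that a registry entry can later be checked against the file it describes. The sketch below assumes a hypothetical helper `file_sha256` and uses a throwaway file in place of a real model; storing the result under a key such as `model_sha256` would be an addition to the metadata, not something this notebook does.

```python
import hashlib
import tempfile
from pathlib import Path


def file_sha256(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large artifacts do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Demo with a temporary stand-in for a model file
with tempfile.TemporaryDirectory() as tmp:
    artifact = Path(tmp) / "model.pkl"
    artifact.write_bytes(b"example model bytes")
    checksum = file_sha256(artifact)

print(checksum)
```

Recomputing the checksum at deployment time and comparing it with the stored value would reveal an artifact that was modified after registration.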

### 2. Implement Model Registry

In [None]:
def register_model(metadata, model_path):
    """
    Register model in model registry.
    
    Args:
        metadata: Model metadata
        model_path: Path to model file
    
    Returns:
        Registration result
    """
    try:
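        # Create version directory (parents=True also covers a missing registry root)
        version_dir = MODEL_REGISTRY / metadata['model_version']
        version_dir.mkdir(parents=True, exist_ok=True)
        
        # Save metadata
        metadata_file = version_dir / 'metadata.json'
        with open(metadata_file, 'w') as f:
            json.dump(metadata, f, indent=2, default=str)
        
        # Copy model; warn instead of skipping silently when the file is missing
        import shutil
        model_dest = version_dir / 'model.pkl'
        if model_path.exists():
            shutil.copy(model_path, model_dest)
        else:
            logger.warning(f"Model file not found, registering metadata only: {model_path}")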
        
        logger.info(f"Registered model version {metadata['model_version']}")
        return {'success': True, 'version': metadata['model_version'], 'path': str(version_dir)}
    except Exception as e:
        logger.error(f"Registration error: {e}")
        return {'success': False, 'error': str(e)}

# Register model (the ensemble config pickle stands in for the model artifact here)
registration_result = register_model(metadata, MODELS_DIR / 'ensemble_config.pkl')
logger.info(f"Registration result: {registration_result}")
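
Once several versions are registered, the registry layout above (one directory per version, each holding `metadata.json`) can be queried directly. The following is a minimal sketch, assuming that layout and purely numeric `MAJOR.MINOR.PATCH` directory names; `list_versions` and `load_latest_metadata` are hypothetical helpers, and the demo runs against a temporary directory instead of the real registry.

```python
import json
import tempfile
from pathlib import Path


def list_versions(registry: Path) -> list[str]:
    """Return registered versions, oldest first, compared numerically."""
    versions = [
        d.name for d in registry.iterdir()
        if d.is_dir() and (d / "metadata.json").exists()
    ]
    # Compare as integer tuples so that 1.10.0 sorts after 1.2.0
    return sorted(versions, key=lambda v: tuple(int(p) for p in v.split(".")))


def load_latest_metadata(registry: Path) -> dict:
    """Load metadata.json of the highest registered version."""
    versions = list_versions(registry)
    if not versions:
        raise FileNotFoundError(f"No registered versions in {registry}")
    with open(registry / versions[-1] / "metadata.json") as f:
        return json.load(f)


# Demo against a temporary registry
with tempfile.TemporaryDirectory() as tmp:
    demo_registry = Path(tmp)
    for version in ("1.0.0", "1.10.0", "1.2.0"):
        version_dir = demo_registry / version
        version_dir.mkdir()
        (version_dir / "metadata.json").write_text(json.dumps({"model_version": version}))
    demo_versions = list_versions(demo_registry)
    latest = load_latest_metadata(demo_registry)

print(demo_versions)
print(latest["model_version"])
```

Sorting on integer tuples matters because a plain string sort would place `1.10.0` before `1.2.0`.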

### 3. Canary Deployment Configuration

In [None]:
# Create canary deployment configuration
canary_config = {
    'current_version': '1.0.0',
    'canary_version': '1.1.0',
    'traffic_split': {
        'stable': 90,  # 90% traffic to stable version
        'canary': 10   # 10% traffic to canary version
    },
    'canary_duration_hours': 24,
    'success_criteria': {
        'error_rate_threshold': 0.05,  # 5%
        'latency_p99_threshold': 500,  # milliseconds
        'accuracy_threshold': 0.85
    },
    'rollback_triggers': [
        'error_rate > 5%',
        'latency_p99 > 500ms',
        'accuracy < 85%'
    ]
}

# Save canary config
with open(MODELS_DIR / 'canary_config.json', 'w') as f:
    json.dump(canary_config, f, indent=2)

logger.info(f"Created canary deployment configuration")
print(json.dumps(canary_config, indent=2))
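
The success criteria above can be turned into an automated promote-or-rollback decision. This is a sketch under stated assumptions: `evaluate_canary` is a hypothetical helper, the metric keys mirror the names used in this notebook, and the degraded sample values are made up for illustration.

```python
def evaluate_canary(metrics: dict, criteria: dict) -> dict:
    """Compare observed canary metrics with the success criteria."""
    failures = []
    if metrics["error_rate"] > criteria["error_rate_threshold"]:
        failures.append("error_rate")
    if metrics["latency_p99"] > criteria["latency_p99_threshold"]:
        failures.append("latency_p99")
    if metrics["accuracy"] < criteria["accuracy_threshold"]:
        failures.append("accuracy")
    return {
        "action": "rollback" if failures else "promote",
        "failed_checks": failures,
    }


criteria = {
    "error_rate_threshold": 0.05,
    "latency_p99_threshold": 500,  # milliseconds
    "accuracy_threshold": 0.85,
}

healthy = evaluate_canary({"error_rate": 0.01, "latency_p99": 180, "accuracy": 0.92}, criteria)
# Illustrative degraded metrics, not measured values
degraded = evaluate_canary({"error_rate": 0.08, "latency_p99": 620, "accuracy": 0.90}, criteria)

print(healthy)
print(degraded)
```

Returning the list of failed checks, not only the action, keeps the rollback reason available for logging and alerting.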

### 4. Automated Retraining Schedule

In [None]:
# Create retraining schedule
retraining_schedule = {
    'enabled': True,
    'schedule': '0 2 * * 0',  # Weekly at 2 AM on Sunday (cron format)
    'triggers': {
        'model_drift': {
            'enabled': True,
            'threshold': 0.1  # 10% performance drop
        },
        'data_drift': {
            'enabled': True,
            'threshold': 0.05  # 5% data distribution change
        },
        'manual': {
            'enabled': True
        }
    },
    'retraining_config': {
        'training_data_window': 30,  # days
        'validation_split': 0.2,
        'test_split': 0.1,
        'hyperparameter_tuning': True
    },
    'deployment_strategy': 'canary'
}

# Save retraining schedule
with open(MODELS_DIR / 'retraining_schedule.json', 'w') as f:
    json.dump(retraining_schedule, f, indent=2)

logger.info(f"Created retraining schedule")
print(json.dumps(retraining_schedule, indent=2))
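
The trigger thresholds above only take effect once something evaluates them. The sketch below shows one possible check. It assumes the 10% model drift threshold means a relative drop in F1 against a baseline, which the configuration does not spell out, and it leaves out the data drift trigger because its distance measure is not defined here. `relative_drop` and `should_retrain` are hypothetical names.

```python
def relative_drop(baseline: float, current: float) -> float:
    """Fractional drop from baseline to current; 0.0 when the metric improved."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return max(0.0, (baseline - current) / baseline)


def should_retrain(baseline_f1: float, current_f1: float, triggers: dict,
                   manual_request: bool = False) -> list[str]:
    """Return the names of the triggers that fired (empty list means no retraining)."""
    reasons = []
    drift = triggers["model_drift"]
    if drift["enabled"] and relative_drop(baseline_f1, current_f1) >= drift["threshold"]:
        reasons.append("model_drift")
    if triggers["manual"]["enabled"] and manual_request:
        reasons.append("manual")
    return reasons


triggers = {
    "model_drift": {"enabled": True, "threshold": 0.1},
    "manual": {"enabled": True},
}

drifted = should_retrain(baseline_f1=0.90, current_f1=0.80, triggers=triggers)
stable = should_retrain(baseline_f1=0.90, current_f1=0.85, triggers=triggers)

print(drifted)
print(stable)
```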

### 5. Model Performance Tracking

In [None]:
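# Create model performance tracking (initial record with hard-coded placeholder values;
# in production these would come from collected metrics)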
performance_tracking = pd.DataFrame([
    {
        'timestamp': datetime.now().isoformat(),
        'model_version': '1.0.0',
        'accuracy': 0.92,
        'precision': 0.89,
        'recall': 0.95,
        'f1_score': 0.92,
        'latency_p50': 45,
        'latency_p99': 180,
        'throughput': 120,
        'error_rate': 0.01,
        'status': 'healthy'
    }
])

# Save performance tracking
performance_tracking.to_parquet(MODELS_DIR / 'performance_tracking.parquet')

logger.info(f"Created performance tracking")
print(performance_tracking.to_string())
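
The latency and error-rate fields in the tracking record are hard-coded above. One way they could be derived from raw request data is sketched below, using only the standard library. `summarize_requests` is a hypothetical helper, and the sample latencies are synthetic values chosen to make the percentiles easy to check.

```python
import statistics


def summarize_requests(latencies_ms: list, error_count: int) -> dict:
    """Compute p50/p99 latency and the error rate for a batch of requests."""
    # n=100 yields 99 cut points: index 49 is the 50th percentile, index 98 the 99th
    cut_points = statistics.quantiles(latencies_ms, n=100, method="inclusive")
    return {
        "latency_p50": cut_points[49],
        "latency_p99": cut_points[98],
        "error_rate": error_count / len(latencies_ms),
    }


# Synthetic sample: 100 requests with latencies of 1..100 ms, one of which failed
sample_latencies = list(range(1, 101))
summary = summarize_requests(sample_latencies, error_count=1)
print(summary)
```

The `inclusive` method treats the sample as the full population, which suits a fixed batch of observed requests.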

## Validation Section

In [None]:
# Verify outputs
assert (MODEL_REGISTRY / '1.0.0' / 'metadata.json').exists(), "Model metadata not saved"
assert (MODELS_DIR / 'canary_config.json').exists(), "Canary config not saved"
assert (MODELS_DIR / 'retraining_schedule.json').exists(), "Retraining schedule not saved"
assert (MODELS_DIR / 'performance_tracking.parquet').exists(), "Performance tracking not saved"

logger.info("✅ All validations passed")
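print("\nModel Registry Summary:")
print(f"  Model Version: {metadata['model_version']}")
print(f"  Registry Path: {MODEL_REGISTRY}")
print(f"  Canary Deployment: Configured ({canary_config['traffic_split']['canary']}% of traffic to {canary_config['canary_version']})")
print(f"  Automated Retraining: {'Enabled' if retraining_schedule['enabled'] else 'Disabled'}")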

## Integration Section

This notebook integrates with:
- **Input**: Deployed KServe InferenceService
- **Output**: Model registry with versioning and metadata
- **Monitoring**: Prometheus for performance tracking
- **Automation**: Scheduled retraining and canary deployments
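
As a rough sketch of how the canary traffic split could reach the deployed InferenceService, the snippet below builds a patch body that sets `canaryTrafficPercent` on the predictor. That field is part of the KServe `v1beta1` InferenceService spec and applies to serverless deployments. Everything else is an assumption for illustration: `build_canary_patch` is a hypothetical helper, the storage URI is a placeholder, and the patch is only printed, not applied to a cluster.

```python
import json


def build_canary_patch(canary_config: dict, storage_uri: str) -> dict:
    """Build a patch body that routes part of the traffic to a new model revision."""
    return {
        "spec": {
            "predictor": {
                # Percentage of traffic sent to the newest revision; the rest stays on the previous one
                "canaryTrafficPercent": canary_config["traffic_split"]["canary"],
                "model": {
                    "modelFormat": {"name": "sklearn"},
                    "storageUri": storage_uri,
                },
            }
        }
    }


canary_config = {"canary_version": "1.1.0", "traffic_split": {"stable": 90, "canary": 10}}
# Hypothetical storage location for the canary model artifact
patch = build_canary_patch(canary_config, "pvc://model-storage/anomaly-detector/1.1.0")
print(json.dumps(patch, indent=2))
```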

## Next Steps

1. Monitor model performance metrics
2. Proceed to `inference-pipeline-setup.ipynb`
3. Set up real-time inference pipeline
4. Implement model drift detection
5. Automate retraining workflow

## References

- ADR-008: Kubeflow Pipelines for MLOps Automation
- ADR-012: Notebook Architecture for End-to-End Workflows
- [MLflow Model Registry](https://mlflow.org/docs/latest/model-registry.html)
- [Semantic Versioning](https://semver.org/)