# Model Training and Management with MLflow

This notebook demonstrates end-to-end model lifecycle management with MLflow. We'll perform the following steps:

1. Set up the environment and MLflow tracking with DagsHub
2. Load and explore the processed data
3. Train multiple models and track the experiments with MLflow
4. Visualize and compare model performance
5. Register the best model in the MLflow Model Registry
6. Load the registered model and use it for inference
7. Version and promote models through the registry stages (Staging, Production)
8. Integrate with DagsHub for collaboration

## Prerequisites

- Python 3.11+
- Required packages (scikit-learn, pandas, numpy, matplotlib, seaborn, joblib, mlflow, dagshub)
- Completed data cleaning notebook (01_data_cleaning.ipynb)

## 1. Environment Setup

Let's start by importing the necessary libraries and setting up MLflow tracking with DagsHub.

In [None]:
# Import necessary libraries
import os
import sys
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import joblib
import json
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error
from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.svm import SVR
import warnings

# Suppress warnings for cleaner output
warnings.filterwarnings('ignore')

# For visualization
plt.style.use('seaborn-v0_8-whitegrid')
sns.set_theme(style="whitegrid")

# Add parent directory to path to import project modules
sys.path.append(os.path.abspath(os.path.join(os.getcwd(), '..')))

# For MLflow tracking
try:
    import mlflow
    import dagshub
    
    # Initialize DagsHub with MLflow tracking
    dagshub.init(
        repo_owner='yahiaehab10', 
        repo_name='MLflow_demo_MF', 
        mlflow=True
    )
    
    # Set experiment name
    mlflow.set_experiment("model_training")
    
    print("MLflow tracking with DagsHub initialized.")
except Exception as e:
    print(f"Warning: Could not initialize DagsHub MLflow tracking: {e}")
    print("Continuing with default MLflow tracking.")
    import mlflow
    mlflow.set_experiment("model_training")

# Set up paths
data_dir = os.path.join('..', 'data')
processed_dir = os.path.join(data_dir, 'processed')
models_dir = os.path.join('..', 'models')

# Create directories if they don't exist
os.makedirs(models_dir, exist_ok=True)

print("Environment setup completed.")

## 2. Load and Explore Processed Data

Let's load the processed data that we prepared in the data cleaning notebook.

In [None]:
# Load the processed data
try:
    X_train = pd.read_csv(os.path.join(processed_dir, "diabetes_X_train.csv"))
    X_test = pd.read_csv(os.path.join(processed_dir, "diabetes_X_test.csv"))
    y_train = pd.read_csv(os.path.join(processed_dir, "diabetes_y_train.csv")).iloc[:, 0]
    y_test = pd.read_csv(os.path.join(processed_dir, "diabetes_y_test.csv")).iloc[:, 0]
    
    print("Processed data loaded successfully.")
    print(f"Training data shape: {X_train.shape}")
    print(f"Testing data shape: {X_test.shape}")
except FileNotFoundError:
    print("Processed data not found. Please run the data cleaning notebook first.")
    X_train, X_test, y_train, y_test = None, None, None, None

# Basic exploration of the processed data
if X_train is not None:
    # Display basic statistics
    print("\nFeature Statistics:")
    print(X_train.describe())
    
    # Correlation with target
    correlations = pd.DataFrame()
    correlations['feature'] = X_train.columns
    correlations['correlation_with_target'] = [np.corrcoef(X_train[col], y_train)[0, 1] for col in X_train.columns]
    correlations = correlations.sort_values('correlation_with_target', ascending=False)
    
    print("\nFeature Correlations with Target:")
    print(correlations)
    
    # Visualize correlations
    plt.figure(figsize=(12, 6))
    sns.barplot(x='feature', y='correlation_with_target', data=correlations)
    plt.title('Feature Correlations with Target')
    plt.xticks(rotation=45)
    plt.tight_layout()
    plt.show()
    
    # Pair plot of top correlated features
    top_features = correlations['feature'].head(3).tolist()
    plot_data = X_train[top_features].copy()
    plot_data['target'] = y_train
    
    # pairplot creates its own figure, so no plt.figure() call is needed here
    sns.pairplot(plot_data, corner=True)
    plt.suptitle('Pairplot of Top Correlated Features', y=1.02)
    plt.tight_layout()
    plt.show()
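
The correlation table above relies on `np.corrcoef`, which returns the Pearson coefficient for each feature against the target. As a rough sketch of what that number measures, the same quantity can be computed by hand. The helper name `pearson` and the sample values are assumptions made up for illustration; they are not part of the notebook or the diabetes data.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Numerator: sum of products of deviations from the means
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Denominator: square root of the product of squared deviations
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Made-up values: the target is an exact linear function of the feature
feature = [1.0, 2.0, 3.0, 4.0]
target = [2.0, 4.0, 6.0, 8.0]
print(pearson(feature, target))
```

A value near 1 or -1 indicates a strong linear relationship, while a value near 0 only rules out a linear one; a feature can still matter to a non-linear model such as a random forest.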

## 3. Train Multiple Models and Track with MLflow

Let's train several regression models with different hyperparameters and track everything with MLflow.

In [None]:
# Create a function to train and log models
def train_and_log_model(model, model_name, X_train, y_train, X_test, y_test, params=None, log_artifacts=True):
    """
    Train a model, evaluate it, and log to MLflow
    
    Args:
        model: The model instance to train
        model_name: Name of the model for logging
        X_train: Training features
        y_train: Training target
        X_test: Testing features
        y_test: Testing target
        params: Model parameters to log
        log_artifacts: Whether to log artifacts
    
    Returns:
        dict: Results including metrics
    """
    # Start MLflow run
    with mlflow.start_run(run_name=f"train_{model_name}") as run:
        # Log model parameters
        if params:
            for param_name, param_value in params.items():
                mlflow.log_param(param_name, param_value)
        
        # Train the model
        model.fit(X_train, y_train)
        
        # Make predictions
        y_pred_train = model.predict(X_train)
        y_pred_test = model.predict(X_test)
        
        # Calculate metrics
        train_rmse = np.sqrt(mean_squared_error(y_train, y_pred_train))
        test_rmse = np.sqrt(mean_squared_error(y_test, y_pred_test))
        
        train_mae = mean_absolute_error(y_train, y_pred_train)
        test_mae = mean_absolute_error(y_test, y_pred_test)
        
        train_r2 = r2_score(y_train, y_pred_train)
        test_r2 = r2_score(y_test, y_pred_test)
        
        # Log metrics to MLflow
        mlflow.log_metric("train_rmse", train_rmse)
        mlflow.log_metric("test_rmse", test_rmse)
        mlflow.log_metric("train_mae", train_mae)
        mlflow.log_metric("test_mae", test_mae)
        mlflow.log_metric("train_r2", train_r2)
        mlflow.log_metric("test_r2", test_r2)
        
        # Perform cross-validation
        cv_scores = cross_val_score(model, X_train, y_train, cv=5, scoring='neg_mean_squared_error')
        cv_rmse = np.sqrt(-cv_scores.mean())
        mlflow.log_metric("cv_rmse", cv_rmse)
        
        # Log the model
        mlflow.sklearn.log_model(model, f"{model_name}_model")
        
        # Save model locally
        model_path = os.path.join(models_dir, f"{model_name}_model.joblib")
        joblib.dump(model, model_path)
        
        if log_artifacts:
            # Create residual plot
            plt.figure(figsize=(10, 6))
            residuals = y_test - y_pred_test
            plt.scatter(y_pred_test, residuals)
            plt.axhline(y=0, color='r', linestyle='-')
            plt.xlabel('Predicted Values')
            plt.ylabel('Residuals')
            plt.title(f'Residual Plot - {model_name}')
            plt.tight_layout()
            
            # Save plot
            residual_plot_path = os.path.join(models_dir, f"{model_name}_residual_plot.png")
            plt.savefig(residual_plot_path)
            plt.close()
            
            # Create actual vs predicted plot
            plt.figure(figsize=(10, 6))
            plt.scatter(y_test, y_pred_test, alpha=0.5)
            min_val = min(y_test.min(), y_pred_test.min())
            max_val = max(y_test.max(), y_pred_test.max())
            plt.plot([min_val, max_val], [min_val, max_val], 'r--')
            plt.xlabel('Actual Values')
            plt.ylabel('Predicted Values')
            plt.title(f'Actual vs Predicted - {model_name}')
            plt.tight_layout()
            
            # Save plot
            pred_plot_path = os.path.join(models_dir, f"{model_name}_prediction_plot.png")
            plt.savefig(pred_plot_path)
            plt.close()
            
            # Feature importance if available
            if hasattr(model, 'feature_importances_'):
                # Create feature importance dataframe
                feature_importance = pd.DataFrame({
                    'feature': X_train.columns,
                    'importance': model.feature_importances_
                }).sort_values('importance', ascending=False)
                
                # Create feature importance plot
                plt.figure(figsize=(10, 6))
                sns.barplot(x='importance', y='feature', data=feature_importance)
                plt.title(f'Feature Importance - {model_name}')
                plt.tight_layout()
                
                # Save plot
                importance_plot_path = os.path.join(models_dir, f"{model_name}_feature_importance.png")
                plt.savefig(importance_plot_path)
                plt.close()
                
                # Save and log feature importance
                feature_importance.to_csv(os.path.join(models_dir, f"{model_name}_feature_importance.csv"), index=False)
                mlflow.log_artifact(os.path.join(models_dir, f"{model_name}_feature_importance.csv"))
            
            # Log plots as artifacts
            mlflow.log_artifact(residual_plot_path)
            mlflow.log_artifact(pred_plot_path)
            if hasattr(model, 'feature_importances_'):
                mlflow.log_artifact(importance_plot_path)
        
        # Get the run ID
        run_id = run.info.run_id
        
    # Return results
    results = {
        'model': model,
        'model_name': model_name,
        'train_rmse': train_rmse,
        'test_rmse': test_rmse,
        'train_r2': train_r2,
        'test_r2': test_r2,
        'cv_rmse': cv_rmse,
        'run_id': run_id,
        'model_path': model_path
    }
    
    return results
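
To make explicit what the three metrics logged by `train_and_log_model` measure, here is a minimal sketch that computes RMSE, MAE, and R² without scikit-learn. The function name `regression_metrics` and the sample values are hypothetical and chosen only so the arithmetic is easy to follow.

```python
import math

def regression_metrics(y_true, y_pred):
    """Compute RMSE, MAE, and R² for two equal-length sequences."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    # Residual sum of squares and total sum of squares around the mean
    ss_res = sum(e ** 2 for e in errors)
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return {
        "rmse": math.sqrt(ss_res / n),           # root of the mean squared error
        "mae": sum(abs(e) for e in errors) / n,  # mean absolute error
        "r2": 1 - ss_res / ss_tot,               # share of variance explained
    }

# Made-up values for illustration only
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.0, 5.0, 8.0, 9.0]
metrics = regression_metrics(y_true, y_pred)
print(metrics)
```

Note that `cv_rmse` in the function above is the square root of the mean fold MSE, which is generally not the same number as the mean of the per-fold RMSE values.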

In [None]:
# Train multiple models
models_to_train = [
    {
        'name': 'linear_regression',
        'model': LinearRegression(),
        'params': {}
    },
    {
        'name': 'ridge_regression',
        'model': Ridge(alpha=0.1),
        'params': {'alpha': 0.1}
    },
    {
        'name': 'lasso_regression',
        'model': Lasso(alpha=0.01),
        'params': {'alpha': 0.01}
    },
    {
        'name': 'elastic_net',
        'model': ElasticNet(alpha=0.01, l1_ratio=0.5),
        'params': {'alpha': 0.01, 'l1_ratio': 0.5}
    },
    {
        'name': 'random_forest',
        'model': RandomForestRegressor(n_estimators=100, max_depth=10, random_state=42),
        'params': {'n_estimators': 100, 'max_depth': 10, 'random_state': 42}
    },
    {
        'name': 'gradient_boosting',
        'model': GradientBoostingRegressor(n_estimators=100, learning_rate=0.1, max_depth=3, random_state=42),
        'params': {'n_estimators': 100, 'learning_rate': 0.1, 'max_depth': 3, 'random_state': 42}
    }
]

# Train and log all models
all_results = {}

for model_info in models_to_train:
    print(f"\nTraining {model_info['name']}...")
    results = train_and_log_model(
        model=model_info['model'],
        model_name=model_info['name'],
        X_train=X_train,
        y_train=y_train,
        X_test=X_test,
        y_test=y_test,
        params=model_info['params']
    )
    all_results[model_info['name']] = results
    
    print(f"  Test RMSE: {results['test_rmse']:.4f}")
    print(f"  Test R²: {results['test_r2']:.4f}")

# Create a summary dataframe of all model results
summary = pd.DataFrame({
    'model_name': [results['model_name'] for results in all_results.values()],
    'train_rmse': [results['train_rmse'] for results in all_results.values()],
    'test_rmse': [results['test_rmse'] for results in all_results.values()],
    'train_r2': [results['train_r2'] for results in all_results.values()],
    'test_r2': [results['test_r2'] for results in all_results.values()],
    'cv_rmse': [results['cv_rmse'] for results in all_results.values()],
    'run_id': [results['run_id'] for results in all_results.values()]
})

# Sort by test RMSE
summary = summary.sort_values('test_rmse')

print("\nModel Performance Summary:")
print(summary)
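
Sorting by test RMSE alone can favor a model that fits the training set much better than unseen data. One possible check, sketched below, ranks models by `cv_rmse` and flags a large relative gap between train and test RMSE. The helper `overfit_report`, the `max_gap` threshold, and all numbers are assumptions for illustration, not results from this notebook.

```python
def overfit_report(results, max_gap=0.25):
    """Rank models by cv_rmse and flag large train/test RMSE gaps.

    results maps a model name to a dict with train_rmse, test_rmse, and cv_rmse,
    the same keys returned by train_and_log_model.
    max_gap is a hypothetical threshold on (test_rmse - train_rmse) / test_rmse.
    """
    report = []
    for name, r in results.items():
        gap = (r["test_rmse"] - r["train_rmse"]) / r["test_rmse"]
        report.append({
            "model_name": name,
            "cv_rmse": r["cv_rmse"],
            "relative_gap": gap,
            "overfit": gap > max_gap,
        })
    # Lowest cross-validated error first
    return sorted(report, key=lambda row: row["cv_rmse"])

# Made-up numbers, not measured results
example_results = {
    "random_forest": {"train_rmse": 22.0, "test_rmse": 56.0, "cv_rmse": 58.0},
    "linear_regression": {"train_rmse": 53.0, "test_rmse": 54.0, "cv_rmse": 55.0},
}
report = overfit_report(example_results)
for row in report:
    print(row)
```

In the notebook, `overfit_report(all_results)` would accept the dictionary built in the training loop, since it only reads the three RMSE keys.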

## 4. Visualize and Compare Model Performance

Let's visualize the performance of all models to help select the best one.

In [None]:
import matplotlib.pyplot as plt
import pandas as pd
import mlflow
from mlflow.tracking import MlflowClient

# Get the experiment ID (the experiment name was set in the setup cell)
experiment = mlflow.get_experiment_by_name("model_training")
experiment_id = experiment.experiment_id

# Create MLflow client
client = MlflowClient()

# Get all runs for our experiment
runs = mlflow.search_runs(experiment_ids=[experiment_id])

# Regression metrics logged by train_and_log_model, keyed by display label
metric_columns = {
    'Test RMSE': 'metrics.test_rmse',
    'Test MAE': 'metrics.test_mae',
    'Test R²': 'metrics.test_r2',
    'CV RMSE': 'metrics.cv_rmse'
}

# If we have runs, create a visualization
if not runs.empty:
    # Filter out runs with missing metrics
    runs = runs.dropna(subset=list(metric_columns.values()))
    
    # Create a comparison dataframe; the run name identifies each model
    comparison_df = pd.DataFrame({'Model': runs['tags.mlflow.runName']})
    for label, column in metric_columns.items():
        comparison_df[label] = runs[column]
    
    # Sort by test RMSE (lower is better)
    comparison_df = comparison_df.sort_values('Test RMSE')
    
    # Display the comparison table
    print("Model Performance Comparison:")
    display(comparison_df)
    
    # Plot the results
    plt.figure(figsize=(12, 8))
    metrics = list(metric_columns)
    
    # Create one bar chart per metric
    for i, metric in enumerate(metrics):
        plt.subplot(2, 2, i+1)
        comparison_df.plot.bar(x='Model', y=metric, ax=plt.gca(), legend=False)
        plt.title(f'{metric} Comparison')
        plt.xticks(rotation=45)
    
    plt.suptitle('Model Performance Metrics Comparison', fontsize=16)
    plt.tight_layout(rect=[0, 0, 1, 0.95])
    plt.show()
else:
    print("No runs found in the experiment.")

## 5. Register the Best Model

MLflow Model Registry is a centralized model store that enables you to:
- Store models with version control
- Transition models between stages (from development to staging to production)
- Annotate models with metadata
- Handle the full lifecycle of a model

Let's register our best performing model to the MLflow Model Registry.

In [None]:
# Get the best run based on test RMSE (lower is better)
if not summary.empty:
    best_row = summary.loc[summary['test_rmse'].idxmin()]
    best_run_id = best_row['run_id']
    best_model_name = best_row['model_name']
    
    # train_and_log_model logs each model under "<model_name>_model"
    model_uri = f"runs:/{best_run_id}/{best_model_name}_model"
    
    # Load the model from the best run
    best_model = mlflow.sklearn.load_model(model_uri)
    
    # Register the model (this registry name is reused in the following sections)
    model_name = "classification_model"
    model_version = mlflow.register_model(model_uri, model_name)
    
    print(f"Registered model: {model_name}, version: {model_version.version}")
    print(f"Best model: {best_model_name}")
    
    # Transition the model to staging
    client = MlflowClient()
    client.transition_model_version_stage(
        name=model_name,
        version=model_version.version,
        stage="Staging"
    )
    
    # Add description
    client.update_model_version(
        name=model_name,
        version=model_version.version,
        description=f"This is the best model ({best_model_name}) based on test RMSE."
    )
    
    print(f"Model {model_name} version {model_version.version} transitioned to Staging.")
    
    # Get the latest registered version in each stage
    model_versions = client.get_latest_versions(model_name)
    print("\nRegistered Model Versions:")
    for mv in model_versions:
        print(f"Model: {mv.name}, Version: {mv.version}, Stage: {mv.current_stage}")
else:
    print("No runs found to register models.")
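
Registration and loading both depend on model URIs. A run-relative URI has the form `runs:/<run_id>/<artifact_path>`, where the artifact path must match the name passed to `mlflow.sklearn.log_model`, and a registry URI has the form `models:/<name>/<version or stage>`. The two helpers below are hypothetical and only build the strings; the run ID is a placeholder.

```python
def run_model_uri(run_id, artifact_path):
    """URI of a model logged inside a run, e.g. for mlflow.register_model."""
    return f"runs:/{run_id}/{artifact_path}"

def registry_model_uri(name, version_or_stage):
    """URI of a registered model, addressed by version number or stage name."""
    return f"models:/{name}/{version_or_stage}"

# Placeholder run ID; the artifact path follows the "<model_name>_model"
# convention used by train_and_log_model
logged_uri = run_model_uri("0123456789abcdef", "ridge_regression_model")
staging_uri = registry_model_uri("classification_model", "Staging")
version_uri = registry_model_uri("classification_model", 1)

print(logged_uri)
print(staging_uri)
print(version_uri)
```

Addressing a model by version pins one exact artifact, while addressing it by stage follows whatever version currently holds that stage.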

## 6. Load and Use the Registered Model

Let's demonstrate how to load a model from the registry and use it for prediction. In a real-world scenario, this would be part of the deployment process.

In [None]:
# Load model from registry by name and stage
model_name = "classification_model"
stage = "Staging"  # Could also be "Production" or "Archived"

try:
    # Load the model from the registry
    loaded_model = mlflow.sklearn.load_model(f"models:/{model_name}/{stage}")
    
    print(f"Successfully loaded model {model_name} from {stage} stage")
    
    # Use the model to make predictions on test data
    y_pred = loaded_model.predict(X_test)
    
    # Calculate regression metrics to verify the model works as expected
    # (the metric functions were imported in the setup cell)
    rmse = np.sqrt(mean_squared_error(y_test, y_pred))
    mae = mean_absolute_error(y_test, y_pred)
    r2 = r2_score(y_test, y_pred)
    
    print("Model Performance on Test Data:")
    print(f"RMSE: {rmse:.4f}")
    print(f"MAE: {mae:.4f}")
    print(f"R²: {r2:.4f}")
    
    # Make predictions on a few examples
    print("\nPrediction Examples:")
    n_samples = min(5, len(X_test))
    sample_indices = np.random.choice(len(X_test), n_samples, replace=False)
    
    for i, idx in enumerate(sample_indices):
        true_value = y_test.iloc[idx] if hasattr(y_test, 'iloc') else y_test[idx]
        pred_value = loaded_model.predict(X_test.iloc[idx:idx+1] if hasattr(X_test, 'iloc') else X_test[idx:idx+1])[0]
        print(f"Example {i+1}: True value = {true_value:.2f}, Predicted value = {pred_value:.2f}")
        
except Exception as e:
    print(f"Error loading or using the model: {e}")
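
In a deployment pipeline, the manual check above is often turned into an automated gate. The sketch below is one way to do that, assuming a hypothetical helper `smoke_test` and an arbitrary error threshold; the stand-in model and the sample rows are invented so the example runs without the registry.

```python
import math

def smoke_test(predict, rows, expected, max_rmse):
    """Return (passed, rmse) for a single-row prediction callable."""
    predictions = [predict(row) for row in rows]
    mse = sum((e - p) ** 2 for e, p in zip(expected, predictions)) / len(expected)
    rmse = math.sqrt(mse)
    return rmse <= max_rmse, rmse

# Stand-in for a loaded model: a fixed linear rule on one feature
def stand_in_predict(row):
    return 2.0 * row[0] + 1.0

# Made-up held-out rows and targets
rows = [[1.0], [2.0], [3.0]]
expected = [3.0, 5.5, 7.0]

passed, rmse = smoke_test(stand_in_predict, rows, expected, max_rmse=0.5)
print(passed, rmse)
```

With the registered model, `predict` could wrap `loaded_model.predict` for a single row, and `max_rmse` could be derived from the test RMSE recorded at training time.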

## 7. Model Versioning and Promotion

In a real-world scenario, you would train new model versions as more data becomes available or as you refine your approach. MLflow tracks these versions and lets you promote them through the registry stages (None → Staging → Production, with retired versions moved to Archived).

Let's simulate training a new model version and promoting it.

In [None]:
# Let's train a new improved model (for demonstration purposes)
# In a real scenario, this might include new data or improved features

with mlflow.start_run(run_name="improved_random_forest") as run:
    # Set a tag to identify this as an improved model
    mlflow.set_tag("model_name", "ImprovedRandomForest")
    mlflow.set_tag("version", "2.0")
    
    # Train a new regressor with different hyperparameters
    # (RandomForestRegressor was imported in the setup cell)
    rf_model = RandomForestRegressor(
        n_estimators=200,  # More trees than the baseline (100)
        max_depth=10,      # Same depth as the baseline
        min_samples_split=5,
        random_state=42
    )
    
    # Train the model
    rf_model.fit(X_train, y_train)
    
    # Make predictions
    y_pred = rf_model.predict(X_test)
    
    # Calculate regression metrics
    test_rmse = np.sqrt(mean_squared_error(y_test, y_pred))
    test_mae = mean_absolute_error(y_test, y_pred)
    test_r2 = r2_score(y_test, y_pred)
    
    # Log parameters
    mlflow.log_param("n_estimators", 200)
    mlflow.log_param("max_depth", 10)
    mlflow.log_param("min_samples_split", 5)
    
    # Log metrics under the same names used by train_and_log_model
    mlflow.log_metric("test_rmse", test_rmse)
    mlflow.log_metric("test_mae", test_mae)
    mlflow.log_metric("test_r2", test_r2)
    
    # Log the model
    mlflow.sklearn.log_model(rf_model, "model")
    
    print("Trained improved model with:")
    print(f"Test RMSE: {test_rmse:.4f}")
    print(f"Test MAE: {test_mae:.4f}")
    print(f"Test R²: {test_r2:.4f}")
    
    # Register the new model version
    model_name = "classification_model"
    new_model_version = mlflow.register_model(f"runs:/{run.info.run_id}/model", model_name)
    
    print(f"Registered new model version: {new_model_version.version}")

# Promote the new version to Production.
# This demo promotes unconditionally; a real pipeline would first compare
# its metrics against the version currently in Staging or Production.
client = MlflowClient()

# Get the latest version in each stage
model_versions = client.get_latest_versions(model_name)
print("\nLatest Model Versions by Stage:")
for mv in model_versions:
    print(f"Version: {mv.version}, Stage: {mv.current_stage}")

# Get the latest version
latest_version = max([int(mv.version) for mv in model_versions])

# Transition the latest version to Production
client.transition_model_version_stage(
    name=model_name,
    version=str(latest_version),
    stage="Production"
)

print(f"\nPromoted version {latest_version} to Production")

# Update description
client.update_model_version(
    name=model_name,
    version=str(latest_version),
    description=f"Random forest retrained with 200 trees and min_samples_split=5 (test RMSE: {test_rmse:.4f})."
)

# Get the latest version in each stage after promotion
model_versions = client.get_latest_versions(model_name)
print("\nUpdated Model Versions:")
for mv in model_versions:
    print(f"Version: {mv.version}, Stage: {mv.current_stage}")
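
Promotion is usually conditional on the candidate beating the model it replaces. The sketch below shows one way to express that rule. `promote_if_better` and `RecordingClient` are hypothetical names, and the RMSE values are made up. The only MLflow call involved is `transition_model_version_stage`, whose `archive_existing_versions` flag moves the previous Production version to Archived.

```python
def promote_if_better(client, name, candidate_version, candidate_rmse, current_rmse):
    """Promote candidate_version to Production only if its RMSE is lower.

    current_rmse may be None when no Production version exists yet.
    """
    if current_rmse is not None and candidate_rmse >= current_rmse:
        return False
    client.transition_model_version_stage(
        name=name,
        version=str(candidate_version),
        stage="Production",
        archive_existing_versions=True,  # retire the previous Production version
    )
    return True


class RecordingClient:
    """Stand-in for MlflowClient that only records calls (for a dry run)."""

    def __init__(self):
        self.calls = []

    def transition_model_version_stage(self, **kwargs):
        self.calls.append(kwargs)


# Dry run with made-up RMSE values; no tracking server is contacted
dry_run = RecordingClient()
promoted = promote_if_better(dry_run, "classification_model", 2,
                             candidate_rmse=52.1, current_rmse=54.3)
print(promoted, dry_run.calls)
```

Passing a real `MlflowClient()` instead of `RecordingClient()` would apply the transition to the registry; keeping the client as a parameter makes the rule easy to test without one.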

## 8. DagsHub Integration for Collaboration

DagsHub provides a collaborative platform for MLOps, making it easy to share experiments, models, and artifacts with your team. Let's explore how to integrate with DagsHub for enhanced collaboration.

In [None]:
try:
    # Check if dagshub is installed
    import dagshub
    dagshub_installed = True
except ImportError:
    print("DagsHub package not installed. Run 'pip install dagshub' to enable DagsHub integration.")
    dagshub_installed = False

if dagshub_installed:
    # Setup DagsHub integration
    # Replace with your DagsHub username and repo
    username = "your-dagshub-username"
    repo_name = "MLFlow_demo"
    
    print("To configure DagsHub integration, you would run:")
    print(f"dagshub.init(repo_owner='{username}', repo_name='{repo_name}', mlflow=True)")
    
    # In a real scenario, you would uncomment and use the line below
    # dagshub.init(repo_owner=username, repo_name=repo_name, mlflow=True)
    
    print("\nBenefits of DagsHub integration:")
    print("1. Centralized experiment tracking accessible to all team members")
    print("2. Automated versioning of models and data")
    print("3. Collaborative model review process")
    print("4. Integrated CI/CD for MLOps workflows")
    print("5. Metrics visualization and sharing")
    
    print("\nTo share your MLflow experiments with your team, you would:")
    print("1. Push your code to the DagsHub repository")
    print("2. Share the DagsHub repository URL with your team")
    print(f"3. Team members can view experiments at https://dagshub.com/{username}/{repo_name}/experiments")
else:
    print("\nTo enable DagsHub integration, install the package:")
    print("pip install dagshub")
    print("\nThen configure it in your scripts with:")
    print("import dagshub")
    print("dagshub.init(repo_owner='your-username', repo_name='your-repo', mlflow=True)")
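
`dagshub.init(..., mlflow=True)` configures MLflow for you, but the same connection can also be described with MLflow's standard environment variables, which is convenient in CI jobs. The sketch below assumes the usual DagsHub tracking URL pattern `https://dagshub.com/<owner>/<repo>.mlflow`. The helper names are hypothetical and the credentials are placeholders.

```python
def dagshub_tracking_uri(repo_owner, repo_name):
    """Build the MLflow tracking URI for a DagsHub repository (assumed pattern)."""
    return f"https://dagshub.com/{repo_owner}/{repo_name}.mlflow"

def tracking_env(repo_owner, repo_name, username, token):
    """Environment variables MLflow reads for the server address and credentials."""
    return {
        "MLFLOW_TRACKING_URI": dagshub_tracking_uri(repo_owner, repo_name),
        "MLFLOW_TRACKING_USERNAME": username,
        "MLFLOW_TRACKING_PASSWORD": token,
    }

# Placeholder values; never hard-code a real token in a notebook
env = tracking_env("your-dagshub-username", "MLFlow_demo",
                   "your-dagshub-username", "<your-token>")
for key in env:
    print(key)
```

Applying the result with `os.environ.update(env)` before the first MLflow call would point the client at the remote server, provided the token comes from a secret store rather than from source code.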

## 9. Conclusion

In this notebook, we've demonstrated a complete MLOps workflow using MLflow for model training, tracking, and registry:

1. **Setup**: We configured our MLflow tracking environment
2. **Data Loading**: We loaded and explored the processed dataset
3. **Model Training**: We trained multiple models with hyperparameter tracking
4. **Model Comparison**: We visualized and compared model performance
5. **Model Registry**: We registered our best model and managed its lifecycle
6. **Model Deployment**: We showed how to load and use registered models
7. **Model Versioning**: We demonstrated version management and promotion
8. **Collaboration**: We explored DagsHub integration for team collaboration

This workflow provides a solid foundation for implementing MLOps practices in your organization, enabling reproducibility, collaboration, and effective model lifecycle management.