# MLflow Model Training and Serving Tutorial with DigitalHub

This notebook demonstrates how to build an end-to-end machine learning pipeline using MLflow with the DigitalHub SDK. We'll work with the Iris dataset, train a classifier with hyperparameter tuning, track experiments with MLflow, and deploy the model as a REST API service.

## Overview
- **Model Training**: Train an SVM classifier with grid search hyperparameter tuning
- **MLflow Integration**: Automatic logging of datasets, parameters, and metrics
- **Model Serving**: Deploy the trained MLflow model as a REST API endpoint
- **Pipeline Orchestration**: Create a workflow pipeline to automate the ML process

## Setup and Function Definitions

First, we'll create the necessary directory structure and define the training function for our MLflow pipeline. The code is stored in a single `src/functions.py` file for easy management.

In [None]:
from pathlib import Path

Path("src").mkdir(exist_ok=True)

### Function Definitions

This cell creates our main functions file with the following components:

- **`train_model`**: Trains an SVM classifier with grid search hyperparameter tuning using MLflow autologging
- **MLflow Integration**: Automatically logs datasets, parameters, metrics, and model artifacts
- **Model Registration**: Registers the trained model in DigitalHub with MLflow metadata

The function uses MLflow's autolog feature to automatically capture training metrics, parameters, and artifacts, then integrates them with DigitalHub's model management system.
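To make the hyperparameter search concrete, the sketch below enumerates the grid that `train_model` passes to `GridSearchCV` (two kernels, two values of `C`) using only the standard library. The helper name `expand_grid` is hypothetical and exists only for illustration. The fit count assumes `GridSearchCV` is left at its default of 5-fold cross-validation with a final refit, as in the training code.

```python
from itertools import product

# Same search space as the one defined in train_model
parameters = {"kernel": ("linear", "rbf"), "C": [1, 10]}


def expand_grid(grid):
    """Return every combination of the grid as a list of dicts (illustrative helper)."""
    keys = sorted(grid)
    return [dict(zip(keys, values)) for values in product(*(grid[k] for k in keys))]


candidates = expand_grid(parameters)
for candidate in candidates:
    print(candidate)

# ASSUMPTION: default 5-fold cross-validation, plus one refit of the best candidate
cv_folds = 5
total_fits = len(candidates) * cv_folds + 1
print(f"{len(candidates)} candidates, {total_fits} fits in total")
```

Each candidate is scored by cross-validation, and the best one is refitted on the full dataset, which is the model that autologging captures.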

In [None]:
%%writefile "src/functions.py"
import mlflow
from digitalhub import from_mlflow_run, get_mlflow_model_metrics
from digitalhub_runtime_python import handler
from sklearn import datasets, svm
from sklearn.model_selection import GridSearchCV


@handler(outputs=["model"])
def train_model(project):
    """
    Train an SVM classifier on the Iris dataset with hyperparameter tuning using MLflow
    """
    # Enable MLflow autologging for sklearn
    mlflow.sklearn.autolog(log_datasets=True)

    # Load Iris dataset
    iris = datasets.load_iris()

    # Define hyperparameter search space
    parameters = {"kernel": ("linear", "rbf"), "C": [1, 10]}
    svc = svm.SVC()
    clf = GridSearchCV(svc, parameters)

    # Train model with grid search
    clf.fit(iris.data, iris.target)

    # Get MLflow run information
    run_id = mlflow.last_active_run().info.run_id

    # Extract MLflow run artifacts and metadata for DigitalHub integration
    model_params = from_mlflow_run(run_id)
    metrics = get_mlflow_model_metrics(run_id)

    # Register model in DigitalHub with MLflow metadata
    model = project.log_model(name="iris-classifier", kind="mlflow", **model_params)
    model.log_metrics(metrics)
    return model

## Project Initialization

Now we'll initialize our DigitalHub project, using the same project name as the other tutorials.

In [None]:
import digitalhub as dh

p_name = "tutorial-project"
project = dh.get_or_create_project(p_name)

## Step 1: Model Training with MLflow

We'll create and run our MLflow-integrated training function. This will train an SVM classifier on the Iris dataset with hyperparameter tuning, automatically logging all experiments with MLflow.
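Before submitting the job, it can help to run the same grid search locally as a quick sanity check. The sketch below repeats the search from `train_model` without MLflow or DigitalHub, assuming scikit-learn is installed in the notebook environment. It relies only on standard `GridSearchCV` attributes (`best_params_`, `best_score_`, `cv_results_`).

```python
from sklearn import datasets, svm
from sklearn.model_selection import GridSearchCV

# Same data and search space as in train_model
iris = datasets.load_iris()
parameters = {"kernel": ("linear", "rbf"), "C": [1, 10]}

# Default settings: 5-fold cross-validation, accuracy scoring, refit on the full data
clf = GridSearchCV(svm.SVC(), parameters)
clf.fit(iris.data, iris.target)

print("Best parameters:", clf.best_params_)
print(f"Best mean cross-validated accuracy: {clf.best_score_:.3f}")
print("Candidates evaluated:", len(clf.cv_results_["params"]))
```

If the local search behaves as expected, the remote run should log comparable parameters and metrics through autologging.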

In [None]:
train_fn = project.new_function(
    name="train-mlflow-model",
    kind="python",
    python_version="PYTHON3_10",
    code_src="src/functions.py",
    handler="train_model",
    requirements=["numpy<2", "mlflow<3", "scikit-learn <= 1.6.1"],
)

In [None]:
train_model_run = train_fn.run(action="job", wait=True)

## Step 2: Model Serving

Now we'll deploy our trained MLflow model as a REST API service. This will allow us to make predictions via HTTP requests using the MLflow serving infrastructure.

In [None]:
model = train_model_run.output("model")
serve_func = project.new_function(
    name="serve-mlflow-model",
    kind="mlflowserve",
    model_name=model.name,
    path=model.key,
)

In [None]:
serve_run = serve_func.run("serve", wait=True)

### Test the Model API

Let's test our deployed MLflow model by making a prediction request with sample Iris data:
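The request body used in this test has one `inputs` entry with a name, shape, datatype, and data, in the style of the V2 (Open Inference Protocol) REST format. The sketch below shows one way to build that body from raw feature rows and to turn predicted class indices back into species names. The helpers `build_payload`, `flatten`, and `decode_labels` are hypothetical, and the response structure is an assumption (a V2-style `outputs` list), so it should be checked against what the service actually returns.

```python
import json

# Class order of the Iris dataset (index -> species name)
IRIS_CLASSES = ["setosa", "versicolor", "virginica"]


def build_payload(rows, name="input-0", datatype="FP64"):
    """Wrap feature rows in the request structure used in this tutorial."""
    n_features = len(rows[0])
    if any(len(row) != n_features for row in rows):
        raise ValueError("all rows must have the same number of features")
    data = [[float(value) for value in row] for row in rows]
    return {
        "inputs": [
            {"name": name, "shape": [-1, n_features], "datatype": datatype, "data": data}
        ]
    }


def flatten(values):
    """Flatten arbitrarily nested lists into one flat list."""
    flat = []
    for value in values:
        if isinstance(value, list):
            flat.extend(flatten(value))
        else:
            flat.append(value)
    return flat


def decode_labels(response):
    """Map class indices to names. ASSUMES a V2-style 'outputs' list in the response."""
    indices = flatten(response["outputs"][0]["data"])
    return [IRIS_CLASSES[int(index)] for index in indices]


# Feature values of the first two Iris samples
rows = [[5.1, 3.5, 1.4, 0.2], [4.9, 3.0, 1.4, 0.2]]
payload = build_payload(rows)
print(json.dumps(payload))

# HYPOTHETICAL response used only to exercise decode_labels; not real service output
example_response = {
    "outputs": [{"name": "output-0", "shape": [2], "datatype": "INT64", "data": [0, 2]}]
}
print(decode_labels(example_response))
```

Validating row lengths before sending the request surfaces malformed input locally rather than as a server-side error.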

In [None]:
from sklearn import datasets

# Load some test data from the Iris dataset
iris = datasets.load_iris()
data = iris.data[0:2].tolist()
json_payload = {
    "inputs": [{"name": "input-0", "shape": [-1, 4], "datatype": "FP64", "data": data}]
}

# Make prediction
result = serve_run.invoke(model_name=model.name, json=json_payload).json()
print("Prediction result:")
print(result)

## Pipeline Orchestration

Now let's create a workflow that orchestrates the MLflow training process. The pipeline is defined with Hera, a Python SDK for Argo Workflows, which describes the execution flow of our MLflow-based ML pipeline.

The pipeline consists of a single step:
1. **A**: Train model with MLflow integration

In [None]:
%%writefile "src/pipeline.py"
from hera.workflows import Workflow, Steps
from digitalhub_runtime_hera.dsl import step


def pipeline():
    with Workflow(entrypoint="dag") as w:
        with Steps(name="dag"):
            A = step(
                template={"action": "job"},
                function="train-mlflow-model",
                outputs=["model"],
            )
    return w

### Execute the Complete Pipeline

Finally, let's create and execute our complete MLflow pipeline workflow. The workflow is first built with the `build` action and then executed with the `pipeline` action, which runs the training process in an automated, orchestrated manner.

In [None]:
workflow = project.new_workflow(
    name="mlflow-pipeline",
    kind="hera",
    code_src="src/pipeline.py",
    handler="pipeline",
)

In [None]:
workflow.run("build", wait=True)

In [None]:
wf_run = workflow.run("pipeline", wait=True)