# Machine Learning Pipeline Tutorial with DigitalHub

This notebook demonstrates how to build an end-to-end machine learning pipeline using scikit-learn with the DigitalHub SDK. We'll work with the breast cancer dataset, train a classification model, deploy it as a REST API service, and orchestrate data preparation and training as a workflow.

## Overview
- **Data Preparation**: Load the breast cancer dataset bundled with scikit-learn
- **Model Training**: Train an SVM classifier and log its performance metrics
- **Model Serving**: Deploy the trained model as a REST API endpoint
- **Orchestration**: Create a workflow pipeline that automates data preparation and training

## Setup and Function Definitions

First, we'll create the necessary directory structure and define all the functions we'll need for our ML pipeline. All functions will be stored in a single `src/functions.py` file for easy management.

In [None]:
from pathlib import Path

Path("src").mkdir(exist_ok=True)

### Function Definitions

This cell creates our main functions file with the following components:

- **`data_generator`**: Loads the breast cancer dataset from scikit-learn and returns it as a single DataFrame with a `target` column
- **`train_model`**: Trains an SVM classifier and logs performance metrics

Each function is decorated with `@handler` to integrate with the DigitalHub runtime system. The training function also logs four metrics computed on the held-out test set: accuracy, precision, recall, and F1-score.
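
To make the logged metrics concrete, here is a minimal sketch that computes the same four quantities by hand from confusion counts. The helper name `binary_metrics` and the label lists are hypothetical and only for illustration; the tutorial itself relies on `sklearn.metrics`. Note that scikit-learn's defaults treat label `1` as the positive class, which in this dataset is the benign class.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall and F1 for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = len(pairs) - tp - fp - fn

    accuracy = (tp + tn) / len(pairs)
    # Guard against division by zero when a class is never predicted or never present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"f1_score": f1, "accuracy": accuracy, "precision": precision, "recall": recall}


# Hypothetical labels, not taken from the tutorial's dataset.
y_true = [1, 1, 1, 0, 0, 1, 0, 1]
y_pred = [1, 0, 1, 0, 1, 1, 0, 1]
example_metrics = binary_metrics(y_true, y_pred)
print(example_metrics)
```

With these labels there are four true positives, one false positive, one false negative, and two true negatives, so accuracy is 0.75 while precision, recall, and F1 all come out at about 0.8.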

In [None]:
%%writefile "src/functions.py"

import os
import pandas as pd
import numpy as np
from pickle import dump
import sklearn.metrics
from digitalhub_runtime_python import handler
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC


@handler(outputs=["dataset"])
def data_generator():
    """
    A function which generates the breast cancer dataset from scikit-learn
    """
    breast_cancer = load_breast_cancer()
    breast_cancer_dataset = pd.DataFrame(data=breast_cancer.data, columns=breast_cancer.feature_names)
    breast_cancer_labels = pd.DataFrame(data=breast_cancer.target, columns=["target"])
    breast_cancer_dataset = pd.concat([breast_cancer_dataset, breast_cancer_labels], axis=1)
    return breast_cancer_dataset


@handler(outputs=["model"])
def train_model(project, di):
    """
    Train an SVM classifier on the breast cancer dataset and log metrics
    """
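    # Load the input data item as a pandas DataFrame and separate features from labels
    df_cancer = di.as_df()
    X = df_cancer.drop(["target"], axis=1)
    y = df_cancer["target"]
    # Hold out 20% of the rows for evaluation; the fixed seed makes the split reproducible
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=5)
    svc_model = SVC()
    svc_model.fit(X_train, y_train)
    y_predict = svc_model.predict(X_test)

    # Serialize the fitted model into a local directory that is logged below
    os.makedirs("model", exist_ok=True)
    with open("model/breast_cancer_classifier.pkl", "wb") as f:
        dump(svc_model, f, protocol=5)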

    metrics = {
        "f1_score": sklearn.metrics.f1_score(y_test, y_predict),
        "accuracy": sklearn.metrics.accuracy_score(y_test, y_predict),
        "precision": sklearn.metrics.precision_score(y_test, y_predict),
        "recall": sklearn.metrics.recall_score(y_test, y_predict),
    }
    model = project.log_model(name="breast_cancer_classifier", kind="sklearn", source="./model/")
    model.log_metrics(metrics)
    return model

## Project Initialization

Next, we initialize the DigitalHub project, using the same project name as the other tutorials.

In [None]:
import digitalhub as dh

p_name = "tutorial-project"
project = dh.get_or_create_project(p_name)

## Step 1: Data Preparation

As the first step of the ML pipeline, we create and run the data preparation function, which generates the breast cancer dataset.

In [None]:
data_gen_fn = project.new_function(
    name="prepare-data",
    kind="python",
    python_version="PYTHON3_10",
    code_src="src/functions.py",
    handler="data_generator",
)

In [None]:
gen_data_run = data_gen_fn.run("job", wait=True)

Let's examine the generated dataset:

In [None]:
dataset = gen_data_run.output("dataset")
dataset.as_df().head()

## Step 2: Model Training

Now we'll train our SVM classifier on the breast cancer dataset. The training function splits the data 80/20 into training and test sets, fits the model, and logs accuracy, precision, recall, and F1-score measured on the test set.
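
As a rough illustration of what a seeded hold-out split does, the sketch below shuffles row indices with a fixed seed and sets aside 20% of them. It uses only the standard library and is not scikit-learn's algorithm, so the selected rows will differ from those chosen by `train_test_split`; the helper name `split_indices` is hypothetical. The row count of 569 is the size of the breast cancer dataset.

```python
import math
import random


def split_indices(n_samples, test_fraction=0.20, seed=5):
    """Shuffle row indices with a fixed seed and split them into train and test parts."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Round the test size up, so 569 rows at 20% give 114 test rows.
    n_test = math.ceil(n_samples * test_fraction)
    return indices[n_test:], indices[:n_test]


train_idx, test_idx = split_indices(569)
print(len(train_idx), len(test_idx))  # 455 114
```

Because the seed is fixed, calling `split_indices(569)` again returns the same partition, which is the property `random_state=5` provides in the training function.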

In [None]:
train_fn = project.new_function(
    name="train-classifier",
    kind="python",
    python_version="PYTHON3_10",
    code_src="src/functions.py",
    handler="train_model",
    requirements=["numpy<2"],
)

In [None]:
train_run = train_fn.run(action="job", inputs={"di": dataset.key}, wait=True)

## Step 3: Model Serving

Now we'll deploy our trained model as a REST API service. This will allow us to make predictions via HTTP requests.

In [None]:
model = train_run.output("model")
serve_func = project.new_function(
    name="serve-classifier",
    kind="sklearnserve",
    path=model.spec.path + "breast_cancer_classifier.pkl",
)

In [None]:
serve_run = serve_func.run("serve", labels=["ml-service"], wait=True)

### Test the Model API

Let's test our deployed model by making a prediction request:
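
The request body used in this tutorial wraps the feature rows in an `inputs` list, where each entry carries a name, a shape, a datatype, and the data itself. A minimal sketch of assembling such a body from plain Python lists follows; the helper `build_payload` and the placeholder values are hypothetical, and the field layout is simply copied from the request in this notebook.

```python
import json


def build_payload(rows, name="input-0", datatype="FP32"):
    """Wrap a rectangular list of feature rows in the request structure used in this tutorial."""
    if not rows or any(len(row) != len(rows[0]) for row in rows):
        raise ValueError("rows must be a non-empty rectangular list of lists")
    return {
        "inputs": [
            {
                "name": name,
                # Shape is [number of samples, number of features per sample].
                "shape": [len(rows), len(rows[0])],
                "datatype": datatype,
                "data": rows,
            }
        ]
    }


# Two placeholder samples with 30 features each, matching the dataset's feature count.
sample_rows = [[0.5] * 30, [0.1] * 30]
payload = build_payload(sample_rows)
print(json.dumps(payload)[:80])
```

Deriving `shape` from the rows keeps it in sync with `data`, which avoids mismatches when the number of samples changes.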

In [None]:
import numpy as np

# Generate two random samples with 30 features each (placeholder values, not real measurements)
data = np.random.rand(2, 30).tolist()
json_payload = {"inputs": [{"name": "input-0", "shape": [2, 30], "datatype": "FP32", "data": data}]}

# Make prediction
result = serve_run.refresh().invoke(json=json_payload).json()
print("Prediction result:")
print(result)

## Pipeline Orchestration

Now let's create a workflow that orchestrates the data preparation and training steps automatically. This pipeline uses Hera, a Python SDK for Argo Workflows, to define the execution flow:

1. **A**: Prepare data (generate dataset)
2. **B**: Train model (depends on A)

The pipeline creates a simple sequential flow where model training depends on data preparation completion.
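
The dependency between the two steps can be illustrated without Hera. The sketch below uses Python's standard `graphlib` module to derive an execution order from a dependency map; it is only an illustration of how a DAG orders its steps, not how Hera or Argo Workflows schedule them. The step names are the function names registered earlier in this notebook.

```python
from graphlib import TopologicalSorter

# Map each step to the set of steps it depends on.
dependencies = {
    "prepare-data": set(),
    "train-classifier": {"prepare-data"},
}

# static_order() yields every step after all of its dependencies.
execution_order = list(TopologicalSorter(dependencies).static_order())
print(execution_order)  # ['prepare-data', 'train-classifier']
```

In the Hera definition, `A >> B` expresses this same dependency.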

In [None]:
%%writefile "src/pipeline.py"
from digitalhub_runtime_hera.dsl import step
from hera.workflows import DAG, Workflow


def pipeline():
    with Workflow(entrypoint="dag") as w:
        with DAG(name="dag"):
            A = step(template={"action": "job"},
                     function="prepare-data",
                     outputs=["dataset"])
            B = step(template={"action": "job", "inputs": {"di": "{{inputs.parameters.di}}"}},
                     function="train-classifier",
                     inputs={"di": A.get_parameter("dataset")})
            A >> B
    return w

### Execute the Complete Pipeline

Finally, let's create and execute the ML pipeline workflow. We first run the `build` action and then the `pipeline` action, which runs data preparation and model training in the order defined by the DAG.

In [None]:
workflow = project.new_workflow(name="ml-pipeline", kind="hera", code_src="src/pipeline.py", handler="pipeline")

In [None]:
build_run = workflow.run("build", wait=True)

In [None]:
wf_run = workflow.run("pipeline", wait=True)