# Kubeflow Pipelines e2e mnist example

In this notebook you will create an end-to-end (e2e) MNIST Kubeflow Pipeline that performs:
- Hyperparameter tuning using Katib
- Distributed training with the best hyperparameters using TFJob
- Serving of the trained model using KServe

Reference documentation:

- https://www.kubeflow.org/docs/components/training/tftraining/
- https://www.kubeflow.org/docs/components/katib/
- https://www.kubeflow.org/docs/external-add-ons/kserve/

**Note**: This Pipeline runs in multi-user mode. Follow [this guide](https://www.kubeflow.org/docs/components/pipelines/sdk/connect-api/#multi-user-mode) to give your Notebook access to Kubeflow Pipelines.

In [65]:
!ls -ltra /var

total 44
drwxrwsr-x 2 root staff 4096 Apr 15  2020 local
drwxr-xr-x 2 root root  4096 Apr 15  2020 backups
drwxr-xr-x 2 root root  4096 Apr 16  2021 spool
lrwxrwxrwx 1 root root     4 Apr 16  2021 run -> /run
drwxr-xr-x 2 root root  4096 Apr 16  2021 opt
drwxrwsr-x 2 root mail  4096 Apr 16  2021 mail
lrwxrwxrwx 1 root root     9 Apr 16  2021 lock -> /run/lock
drwxrwxrwt 2 root root  4096 Apr 16  2021 tmp
drwxr-xr-x 1 root root  4096 Jun  3  2021 .
drwxr-xr-x 1 root root  4096 Jun  3  2021 lib
drwxr-xr-x 1 root root  4096 Jun  3  2021 cache
drwxr-xr-x 1 root root  4096 Jun  3  2021 log
drwxr-xr-x 1 root root  4096 May  4 20:26 ..


In [66]:
# Install required packages (Kubeflow Pipelines and Katib SDK).
!pip install kfp==1.8.4
!pip install kubeflow-katib==0.12.0



In [67]:
import kfp
import kfp.dsl as dsl
from kfp import components

from kubeflow.katib import ApiClient
from kubeflow.katib import V1beta1ExperimentSpec
from kubeflow.katib import V1beta1AlgorithmSpec
from kubeflow.katib import V1beta1ObjectiveSpec
from kubeflow.katib import V1beta1ParameterSpec
from kubeflow.katib import V1beta1FeasibleSpace
from kubeflow.katib import V1beta1TrialTemplate
from kubeflow.katib import V1beta1TrialParameterSpec
from kubeflow.katib import V1beta1AlgorithmSetting


## Define the Pipelines tasks

To run this Pipeline, you should define:
1. Katib hyperparameter tuning
2. TFJob training
3. KServe inference



### Step 1. Katib hyperparameter tuning task

Create the Kubeflow Pipelines task for the Katib hyperparameter tuning. This Experiment uses the "bayesianoptimization" algorithm (with a fixed `random_state` setting) and a TFJob as the Trial's worker.

The Katib Experiment is similar to this example: https://github.com/kubeflow/katib/blob/master/examples/v1beta1/kubeflow-training-operator/tfjob-mnist-with-summaries.yaml.
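
The Trial template used in the next cell contains `${trialParameters.<name>}` placeholders, and each trial parameter points at a search-space parameter through its `reference` field. The sketch below mimics that substitution in plain Python, only to illustrate how the pieces connect. `render_trial_command` is a hypothetical helper, not part of the Katib SDK, and the assigned values are made up; Katib performs the real substitution itself when it creates each Trial.

```python
import re

# Hypothetical helper, for illustration only: Katib performs this
# substitution itself when it creates a Trial from the template.
PLACEHOLDER = re.compile(r"\$\{trialParameters\.(\w+)\}")


def render_trial_command(command, references, assignments):
    """Replace ${trialParameters.<name>} placeholders with suggested values.

    references maps a trial parameter name to a search-space parameter name
    (the `reference` field); assignments maps search-space names to values.
    """
    def substitute(match):
        return str(assignments[references[match.group(1)]])

    return [PLACEHOLDER.sub(substitute, part) for part in command]


command = [
    "python",
    "/opt/model.py",
    "--tf-train-steps=200",
    "--tf-learning-rate=${trialParameters.learningRate}",
    "--tf-batch-size=${trialParameters.batchSize}",
]
references = {"learningRate": "learning_rate", "batchSize": "batch_size"}
assignments = {"learning_rate": "0.03", "batch_size": "90"}  # made-up values

rendered = render_trial_command(command, references, assignments)
print(rendered)
```

The names on the left of `references` must match the placeholders in the command, and the names on the right must match the `name` fields in the search space.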

In [68]:
# You should define the Experiment name, namespace and number of training steps in the arguments.
def create_katib_experiment_task(experiment_name, experiment_namespace, training_steps):
    # Trial count specification.
    max_trial_count = 30
    max_failed_trial_count = 5
    parallel_trial_count = 2

    # Objective specification.
    objective = V1beta1ObjectiveSpec(
        type="minimize",
        goal=0.001,
        objective_metric_name="loss"
    )

    # Algorithm specification.
    algorithm = V1beta1AlgorithmSpec(
        algorithm_name="bayesianoptimization",
        algorithm_settings=[
            V1beta1AlgorithmSetting(
                name="random_state",
                value="10"
            )
        ]
    )


    # Experiment search space.
    # In this example we tune learning rate and batch size.
    parameters = [
        V1beta1ParameterSpec(
            name="learning_rate",
            parameter_type="double",
            feasible_space=V1beta1FeasibleSpace(
                min="0.01",
                max="0.05"
            ),
        ),
        V1beta1ParameterSpec(
            name="batch_size",
            parameter_type="int",
            feasible_space=V1beta1FeasibleSpace(
                min="80",
                max="100"
            ),
        )
    ]

    # Experiment Trial template.
    # TODO (andreyvelich): Use community image for the mnist example.
    trial_spec = {
        "apiVersion": "kubeflow.org/v1",
        "kind": "TFJob",
        "spec": {
            "tfReplicaSpecs": {
                "Chief": {
                    "replicas": 1,
                    "restartPolicy": "OnFailure",
                    "template": {
                        "metadata": {
                            "annotations": {
                                "sidecar.istio.io/inject": "false"
                            }
                        },
                        "spec": {
                            "containers": [
                                {
                                    "name": "tensorflow",
                                    "image": "docker.io/liuhougangxa/tf-estimator-mnist",
                                    "command": [
                                        "python",
                                        "/opt/model.py",
                                        "--tf-train-steps=" + str(training_steps),
                                        "--tf-learning-rate=${trialParameters.learningRate}",
                                        "--tf-batch-size=${trialParameters.batchSize}"
                                    ]
                                }
                            ]
                        }
                    }
                },
                "Worker": {
                    "replicas": 1,
                    "restartPolicy": "OnFailure",
                    "template": {
                        "metadata": {
                            "annotations": {
                                "sidecar.istio.io/inject": "false"
                            }
                        },
                        "spec": {
                            "containers": [
                                {
                                    "name": "tensorflow",
                                    "image": "docker.io/liuhougangxa/tf-estimator-mnist",
                                    "command": [
                                        "python",
                                        "/opt/model.py",
                                        "--tf-train-steps=" + str(training_steps),
                                        "--tf-learning-rate=${trialParameters.learningRate}",
                                        "--tf-batch-size=${trialParameters.batchSize}"
                                    ]
                                }
                            ]
                        }
                    }
                }
            }
        }
    }

    # Configure parameters for the Trial template.
    trial_template = V1beta1TrialTemplate(
        primary_container_name="tensorflow",
        trial_parameters=[
            V1beta1TrialParameterSpec(
                name="learningRate",
                description="Learning rate for the training model",
                reference="learning_rate"
            ),
            V1beta1TrialParameterSpec(
                name="batchSize",
                description="Batch size for the model",
                reference="batch_size"
            ),
        ],
        trial_spec=trial_spec
    )

    # Create an Experiment from the above parameters.
    experiment_spec = V1beta1ExperimentSpec(
        max_trial_count=max_trial_count,
        max_failed_trial_count=max_failed_trial_count,
        parallel_trial_count=parallel_trial_count,
        objective=objective,
        algorithm=algorithm,
        parameters=parameters,
        trial_template=trial_template
    )

    # Create the KFP task for the Katib Experiment.
    # Experiment Spec should be serialized to a valid Kubernetes object.
    katib_experiment_launcher_op = components.load_component_from_url(
        "https://raw.githubusercontent.com/kubeflow/pipelines/master/components/kubeflow/katib-launcher/component.yaml")
    op = katib_experiment_launcher_op(
        experiment_name=experiment_name,
        experiment_namespace=experiment_namespace,
        experiment_spec=ApiClient().sanitize_for_serialization(experiment_spec),
        experiment_timeout_minutes=60,
        delete_finished_experiment=False)

    return op

### Step 2. TFJob training task

Create the Kubeflow Pipelines task for the TFJob training. In this example the TFJob runs one Chief replica and one Worker replica.

Learn more about TFJob replica specifications in the Kubeflow docs: https://www.kubeflow.org/docs/components/training/tftraining/#what-is-tfjob.
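
The Chief and Worker specifications in this notebook share most of their fields. As a rough sketch of that common shape, the snippet below builds both from one function. `make_replica_spec` is a hypothetical helper (not a Kubeflow API), the argument string is an example value, and the volume mount that the Chief uses for exporting the model is left out for brevity.

```python
import json


def make_replica_spec(image, args, replicas=1):
    """Build one TFJob replica spec (hypothetical helper, not a Kubeflow API)."""
    return {
        "replicas": replicas,
        "restartPolicy": "OnFailure",
        "template": {
            # Disable Istio sidecar injection, as the notebook's specs do.
            "metadata": {"annotations": {"sidecar.istio.io/inject": "false"}},
            "spec": {
                "containers": [
                    {
                        "name": "tensorflow",
                        "image": image,
                        "command": ["sh", "-c"],
                        "args": [args],
                    }
                ]
            },
        },
    }


image = "docker.io/liuhougangxa/tf-estimator-mnist"
args = "python /opt/model.py --tf-train-steps=200"  # example arguments

tf_replica_specs = {
    "Chief": make_replica_spec(image, args),
    "Worker": make_replica_spec(image, args),
}
print(json.dumps(tf_replica_specs, indent=2))
```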

In [69]:
# This function converts Katib Experiment HP results to args.
def convert_katib_results(katib_results) -> str:
    import json
    import pprint
    katib_results_json = json.loads(katib_results)
    print("Katib results:")
    pprint.pprint(katib_results_json)
    best_hps = []
    for pa in katib_results_json["currentOptimalTrial"]["parameterAssignments"]:
        if pa["name"] == "learning_rate":
            best_hps.append("--tf-learning-rate=" + pa["value"])
        elif pa["name"] == "batch_size":
            best_hps.append("--tf-batch-size=" + pa["value"])
    print("Best Hyperparameters: {}".format(best_hps))
    return " ".join(best_hps)
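
To make the expected input concrete, here is a small standalone sketch of the same mapping applied to a sample payload. The payload is made up and contains only the fields that `convert_katib_results` reads; a real Katib result may carry more fields. `katib_results_to_args` is a hypothetical stand-in for the function above without the logging.

```python
import json

# Made-up sample, shaped like the fields that convert_katib_results reads.
sample_results = json.dumps({
    "currentOptimalTrial": {
        "parameterAssignments": [
            {"name": "learning_rate", "value": "0.03"},
            {"name": "batch_size", "value": "90"},
        ]
    }
})

# Search-space parameter name -> command-line flag prefix.
FLAG_BY_PARAMETER = {
    "learning_rate": "--tf-learning-rate=",
    "batch_size": "--tf-batch-size=",
}


def katib_results_to_args(katib_results: str) -> str:
    """Map the best Trial's parameter assignments to training flags."""
    best_trial = json.loads(katib_results)["currentOptimalTrial"]
    return " ".join(
        FLAG_BY_PARAMETER[pa["name"]] + pa["value"]
        for pa in best_trial["parameterAssignments"]
        if pa["name"] in FLAG_BY_PARAMETER
    )


best_args = katib_results_to_args(sample_results)
print(best_args)  # --tf-learning-rate=0.03 --tf-batch-size=90
```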

In [70]:
# You should define the TFJob name, namespace, number of training steps, output of Katib and model volume tasks in the arguments.
def create_tfjob_task(tfjob_name, tfjob_namespace, training_steps, katib_op, model_volume_op):
    import json
    # Get parameters from the Katib Experiment.
    # Parameters are in the format "--tf-learning-rate=0.01 --tf-batch-size=100"
    convert_katib_results_op = components.func_to_container_op(convert_katib_results)
    best_hp_op = convert_katib_results_op(katib_op.output)
    best_hps = str(best_hp_op.output)

    # Create the TFJob Chief and Worker specification with the best Hyperparameters.
    # TODO (andreyvelich): Use community image for the mnist example.
    tfjob_chief_spec = {
        "replicas": 1,
        "restartPolicy": "OnFailure",
        "template": {
            "metadata": {
                "annotations": {
                    "sidecar.istio.io/inject": "false"
                }
            },
            "spec": {
                "containers": [
                    {
                        "name": "tensorflow",
                        "image": "docker.io/liuhougangxa/tf-estimator-mnist",
                        "command": [
                            "sh",
                            "-c"
                        ],
                        "args": [
                            "python /opt/model.py --tf-export-dir=/mnt/export --tf-train-steps={} {}".format(training_steps, best_hps)
                        ],
                        "volumeMounts": [
                            {
                                "mountPath": "/mnt/export",
                                "name": "model-volume"
                            }
                        ]
                    }
                ],
                "volumes": [
                    {
                        "name": "model-volume",
                        "persistentVolumeClaim": {
                            "claimName": str(model_volume_op.outputs["name"])
                        }
                    }
                ]
            }
        }
    }

    tfjob_worker_spec = {
        "replicas": 1,
        "restartPolicy": "OnFailure",
        "template": {
            "metadata": {
                "annotations": {
                    "sidecar.istio.io/inject": "false"
                }
            },
            "spec": {
                "containers": [
                    {
                        "name": "tensorflow",
                        "image": "docker.io/liuhougangxa/tf-estimator-mnist",
                        "command": [
                            "sh",
                            "-c",
                        ],
                        "args": [
                            "python /opt/model.py --tf-export-dir=/mnt/export --tf-train-steps={} {}".format(training_steps, best_hps)
                        ],
                    }
                ],
            }
        }
    }

    # Create the KFP task for the TFJob.
    tfjob_launcher_op = components.load_component_from_url(
        "https://raw.githubusercontent.com/kubeflow/pipelines/master/components/kubeflow/launcher/component.yaml")
    op = tfjob_launcher_op(
        name=tfjob_name,
        namespace=tfjob_namespace,
        chief_spec=json.dumps(tfjob_chief_spec),
        worker_spec=json.dumps(tfjob_worker_spec),
        tfjob_timeout_minutes=60,
        delete_finished_tfjob=False)
    return op

### Step 3. KServe inference

Create the Kubeflow Pipelines task for the KServe inference.

In [71]:
def create_serving_task(model_name, model_namespace, tfjob_op, model_volume_op):

    api_version = 'serving.kserve.io/v1beta1'
    serving_component_url = 'https://raw.githubusercontent.com/kubeflow/pipelines/master/components/kserve/component.yaml'

    # Uncomment the following two lines if you are using KFServing v0.6.x or v0.5.x
    # api_version = 'serving.kubeflow.org/v1beta1'
    # serving_component_url = 'https://raw.githubusercontent.com/kubeflow/pipelines/master/components/kubeflow/kfserving/component.yaml'

    inference_service = '''
apiVersion: "{}"
kind: "InferenceService"
metadata:
  name: {}
  namespace: {}
  annotations:
    "sidecar.istio.io/inject": "false"
spec:
  predictor:
    tensorflow:
      storageUri: "pvc://{}/"
'''.format(api_version, model_name, model_namespace, str(model_volume_op.outputs["name"]))

    serving_launcher_op = components.load_component_from_url(serving_component_url)
    serving_launcher_op(action="apply", inferenceservice_yaml=inference_service).after(tfjob_op)

## Run the Kubeflow Pipeline

You should create the Kubeflow Pipeline from the above tasks.

In [72]:
name="mnist-e2e-test-bayesian-3"
namespace="kubeflow-user-example-com"
training_steps="200"

@dsl.pipeline(
    name="End to End Pipeline",
    description="An end to end mnist example including hyperparameter tuning, train and inference"
)
def mnist_pipeline(name=name, namespace=namespace, training_steps=training_steps):
    katib_op = create_katib_experiment_task(name, namespace, training_steps)

    model_volume_op = dsl.VolumeOp(
        name="model-volume",
        resource_name="model-volume",
        size="1Gi",
        modes=dsl.VOLUME_MODE_RWO
    )

    # Run the distributed training with TFJob.
    tfjob_op = create_tfjob_task(name, namespace, training_steps, katib_op, model_volume_op)

    # Create the KServe inference.
    create_serving_task(name, namespace, tfjob_op, model_volume_op)

pipeline_func = mnist_pipeline
experiment_name = 'mnist-e2e-test-bayesian-3'
kfp.compiler.Compiler().compile(pipeline_func, '{}.zip'.format(experiment_name))    


In [73]:
!unzip -o mnist-e2e-test-bayesian-3.zip -d test-bayesian

Archive:  mnist-e2e-test-bayesian-3.zip
  inflating: test-bayesian/pipeline.yaml  


In [74]:
from kfp_tekton import TektonClient

KUBEFLOW_PUBLIC_ENDPOINT_URL = 'https://kubeflow-cml-group-projec-4f27b99c6360f285c2c732f9adc614f1-0003.us-east.containers.appdomain.cloud'
KUBEFLOW_PROFILE_NAME = f'kubeflow-user-example-com'
SESSION_COOKIE = f'authservice_session=MTY1MTcyMjM1NHxOd3dBTkU5U1VqSlNUVWRWTXpWVlRqZEhTemRMV1VOYVNFeFBXRFZYVFRSUU5sRkVXRXhKUXpKYVVUTkhXak5FU1RWWlZrcERUMUU9fKvYlgJLPelGI4L4q0361S30P-PeZOY3BRSMvYQt3rJM'

client = TektonClient(host=f'{KUBEFLOW_PUBLIC_ENDPOINT_URL}/pipeline',
                     cookies=SESSION_COOKIE)

EXPERIMENT_NAME = 'Test E2E Experiments - Bayesian'
experiment = client.create_experiment(name=EXPERIMENT_NAME, namespace=KUBEFLOW_PROFILE_NAME)
run = client.run_pipeline(experiment.id, 'pipeline-e2e-bayesian-3', 'test-bayesian/pipeline.yaml')

# # Run the Kubeflow Pipeline in the user's namespace.
# KUBEFLOW_PUBLIC_ENDPOINT_URL = 'https://kubeflow-cml-group-projec-4f27b99c6360f285c2c732f9adc614f1-0002.us-east.containers.appdomain.cloud/'
# KUBEFLOW_PROFILE_NAME = f'kubeflow-user-example-com'
# SESSION_COOKIE = f'authservice_session=MTY1MTY5ODU1MnxOd3dBTkRKTFdWUXpWa1ZZVDBnM1J6VkxXRXBIU2sxQlMxRktNelJPVmpNelZFbExURmRJUVZsTFRVdExNMHBTUkV4U1MwNDFVa0U9fOG--YDfspvVKSAYKI0OjfdMnxkyP8spF4ksOUS2mpXY'

# client = TektonClient(host=f'{KUBEFLOW_PUBLIC_ENDPOINT_URL}/pipeline',
#                      cookies=SESSION_COOKIE)

# kfp_client=kfp.Client()
# run_id = kfp_client.create_run_from_pipeline_func(mnist_pipeline, namespace=namespace, arguments={}).run_id
# print("Run ID: ", run_id)

## Predict from the trained model

Once the Kubeflow Pipeline has finished, you can call the API endpoint with an [MNIST image](https://raw.githubusercontent.com/kubeflow/katib/master/examples/v1beta1/kubeflow-pipelines/images/9.bmp) to get a prediction from the trained model.

**Note**: If you are using a Kubeflow + Dex setup and running this Notebook outside of your Kubernetes cluster, follow [this guide](https://github.com/kserve/kserve/tree/master/docs/samples/istio-dex#authentication) to get the session ID for the API requests.
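
The prediction cells in this notebook send a JSON body with an `instances` key to a `:predict` URL derived from the model name and namespace. The sketch below builds both pieces with the standard library only, using a synthetic all-zero image so that it runs without network access. `build_predict_request` and `build_predict_url` are hypothetical helpers introduced for illustration.

```python
import json


def build_predict_request(pixels):
    """Wrap a 28x28 grid of grayscale values as one (28, 28, 1) instance."""
    if len(pixels) != 28 or any(len(row) != 28 for row in pixels):
        raise ValueError("expected a 28x28 grid of pixel values")
    # Add the trailing channel dimension and convert every value to float.
    instance = [[[float(value)] for value in row] for row in pixels]
    return json.dumps({"instances": [instance]})


def build_predict_url(name, namespace):
    """Build the in-cluster predictor URL in the form used by this notebook."""
    return (
        "http://{0}-predictor-default.{1}.svc.cluster.local"
        "/v1/models/{0}:predict"
    ).format(name, namespace)


blank_image = [[0] * 28 for _ in range(28)]  # synthetic stand-in image
body = build_predict_request(blank_image)
predict_url = build_predict_url("mnist-e2e", "kubeflow-user-example-com")

print(predict_url)
print(len(json.loads(body)["instances"][0]))  # 28 rows in the single instance
```

With a NumPy array of shape `(1, 28, 28, 1)`, `json.dumps({"instances": data.tolist()})` should produce an equivalent body without formatting the array as a string first.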

In [52]:
%%time
import numpy as np
from PIL import Image
import requests
import time

run_id = run.id
# The Pipeline Run must have succeeded before the model can be queried.
kfp_run = client.get_run(run_id=run_id)
if kfp_run.run.status == "Succeeded":
    # print("Run {} has been Succeeded\n".format(run_id))

    url = "http://{}-predictor-default.{}.svc.cluster.local/v1/models/{}:predict".format(name, namespace, name)

    start = time.time()
    image_url = "https://i.imgur.com/6qsCz2W.png"
    image = Image.open(requests.get(image_url, stream=True).raw)
    data = np.array(image.convert('L').resize((28, 28))).astype(np.float64).reshape(-1, 28, 28, 1)
    data_formatted = np.array2string(data, separator=",", formatter={"float": lambda x: "%.1f" % x})
    json_request = '{{ "instances" : {} }}'.format(data_formatted)

    response = requests.post(url, data=json_request)
    print("Prediction for the image")
    print(response.json())
    display(image)
    end = time.time()
    print(end - start)


CPU times: user 1.43 ms, sys: 3.77 ms, total: 5.2 ms
Wall time: 43.2 ms


In [53]:
import numpy as np
from PIL import Image
import requests
import time

runs=[1, 4, 8, 16, 32, 64, 100, 128, 256, 512]
start = time.time()
image_url = "https://i.imgur.com/6qsCz2W.png"
# Specify the prediction URL. If you are running this notebook outside of the Kubernetes cluster, you should set the Cluster IP.
url = "http://mnist-e2e-predictor-default.kubeflow-user-example-com.svc.cluster.local/v1/models/mnist-e2e:predict"
for i in range(runs[9]):
    image = Image.open(requests.get(image_url, stream=True).raw)
    # np.float is deprecated since NumPy 1.20; np.float64 is the equivalent explicit dtype.
    data = np.array(image.convert('L').resize((28, 28))).astype(np.float64).reshape(-1, 28, 28, 1)
    data_formatted = np.array2string(data, separator=",", formatter={"float": lambda x: "%.1f" % x})
    json_request = '{{ "instances" : {} }}'.format(data_formatted)

    response = requests.post(url, data=json_request)
    #print("Prediction for the image")
    #print(response.json())
end = time.time()
print(end - start)


20.36825132369995
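
The total above covers 512 loop iterations, so it can be turned into a mean time per iteration and a rough throughput figure. The snippet below does that arithmetic with the printed value. Each iteration also downloads and preprocesses the image, so the result is best read as an upper bound on the prediction latency rather than the latency of the predictor alone.

```python
total_seconds = 20.36825132369995  # total printed by the timing loop
iteration_count = 512              # number of loop iterations

mean_time_ms = total_seconds / iteration_count * 1000
throughput_per_second = iteration_count / total_seconds

print("Mean time per iteration: {:.1f} ms".format(mean_time_ms))
print("Throughput: {:.1f} iterations/s".format(throughput_per_second))
```

This prints roughly 39.8 ms per iteration and 25.1 iterations per second.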
