# About this Jupyter Notebook

@author: Yingding Wang\
@updated: 26.09.2022

This notebook demonstrates how to build a Kubeflow pipeline from Python functions.

## Install KFP Python SDK to build a V1 pipeline
* Build KF pipeline with python SDK: https://www.kubeflow.org/docs/components/pipelines/sdk/build-pipeline/
* Current KFP python SDK version on pypi.org: https://pypi.org/project/kfp/ 

In [1]:
import sys

In [2]:
!{sys.executable} -m pip install --upgrade --user kfp==1.8.14



## Restart the Kernel

After installing the KFP Python SDK, restart the notebook kernel so that the newly installed version is loaded.

## Getting familiar with the Jupyter Notebook environment

In [3]:
from platform import python_version
print (f"current platform python version: {python_version()}")

current platform python version: 3.8.10


In [4]:
# run kubectl to show the resource quota of the current namespace
!kubectl describe quota

I0926 16:22:20.729154   25758 request.go:668] Waited for 1.076901146s due to client-side throttling, not priority and fairness, request: GET:https://10.43.0.1:443/apis/sources.knative.dev/v1alpha1?timeout=32s
Name:                                                         kf-resource-quota
Namespace:                                                    kubeflow-kindfor
Resource                                                      Used    Hard
--------                                                      ----    ----
basic-csi.storageclass.storage.k8s.io/persistentvolumeclaims  4       5
basic-csi.storageclass.storage.k8s.io/requests.storage        16Gi    50Gi
cpu                                                           3250m   128
longhorn.storageclass.storage.k8s.io/persistentvolumeclaims   0       10
longhorn.storageclass.storage.k8s.io/requests.storage         0       500Gi
memory                                                        4782Mi  512Gi


In [5]:
# examine the KFP Python SDK version installed in this Kubeflow v1.5.1 environment
!{sys.executable} -m pip list | grep kfp

kfp                          1.8.14
kfp-pipeline-spec            0.1.16
kfp-server-api               1.8.2


## Setup global variables

In [6]:
import kfp
client = kfp.Client()
NAMESPACE = client.get_user_namespace()
EXPERIMENT_NAME = 'demo' # Name of the experiment in the KF webapp UI
EXPERIMENT_DESC = 'this kf experiment loads iris data from tf dataset and builds models'

print(NAMESPACE)

kubeflow-kindfor


In [7]:
from collections import namedtuple
Settings = namedtuple('Settings', [
    'tf_io', 
    'tf_datasets',
    'pandas_version',
    'jinja2_version',
    'sklearn_version',
    'base_tf_image',
    'base_python_image'
])
# the base images are from the dockerhub https://hub.docker.com/_/python
settings = Settings(
    tf_io="0.27.0", 
    tf_datasets="4.6.0",
    pandas_version="1.5.0",
    jinja2_version="3.1.2",
    sklearn_version="1.1.2", # scikit-learn
    base_tf_image="tensorflow/tensorflow:2.10.0",
    base_python_image="python:3.8.14"
) 
print(f"{settings}")

Settings(tf_io='0.27.0', tf_datasets='4.6.0', pandas_version='1.5.0', jinja2_version='3.1.2', sklearn_version='1.1.2', base_tf_image='tensorflow/tensorflow:2.10.0', base_python_image='python:3.8.14')


### Creating Kubeflow components from Python functions

* Creating model with iris dataset: https://medium.com/@nutanbhogendrasharma/tensorflow-deep-learning-model-with-iris-dataset-8ec344c49f91

In [8]:
# import kfp dsl components
import kfp.dsl as dsl
from functools import partial
from kfp.dsl import (
    pipeline,
    ContainerOp
)
from kfp.components import (
    InputPath,
    OutputPath,
    create_component_from_func
)

#### Create download component

In [9]:
@partial(
    create_component_from_func,
    output_component_file='demo_download_component.yaml',
    base_image=settings.base_tf_image, # use tf base image
    packages_to_install=[
        f"tensorflow-datasets=={settings.tf_datasets}",
        f"pandas=={settings.pandas_version}",
        f"Jinja2=={settings.jinja2_version}", # needed by tf dataset
    ] # adding additional libs
)
def download_data(output_path: OutputPath("CSV")):
    # tfds with Keras example: https://www.tensorflow.org/datasets/keras_example
    # iris dataset catalog entry: https://www.tensorflow.org/datasets/catalog/iris
    import tensorflow_datasets as tfds
    import tensorflow as tf
    
    (ds_train), ds_info = tfds.load(
        'iris',
        split=tfds.Split.TRAIN,
        shuffle_files=True,
        as_supervised=True,
        with_info=True)
    # assert type
    assert isinstance(ds_train, tf.data.Dataset)
    size = ds_train.cardinality().numpy()
    
    # convert to pandas dataframe
    df = tfds.as_dataframe(ds_train.take(size), ds_info)
    
    # export csv data without index
    with open(output_path, "w+", encoding="utf-8") as f:
        df.to_csv(f, index=False, header=True, encoding="utf-8")

#### Create data processing component

In [10]:
@partial(
    create_component_from_func,
    output_component_file='process_iris_data_component.yaml',
    base_image=settings.base_python_image, # use python base image
    packages_to_install=[
        f"pandas=={settings.pandas_version}",
        f"scikit-learn=={settings.sklearn_version}",
    ] # adding additional libs
)
def process_data(label_col_name: str, input_path: InputPath("CSV"), train_output_path: OutputPath("CSV"), test_output_path: OutputPath("CSV")):
    import pandas as pd
    from sklearn.model_selection import train_test_split
    
    df = pd.read_csv(input_path, sep=",", header=0, index_col=None)
    
    print("input csv dataframe")
    print(df)
    print(df.shape)
    
    all_feature_cols_mask = ~df.columns.isin([label_col_name])
    X_train, X_test, y_train, y_test = train_test_split(
        df.loc[:, all_feature_cols_mask], df.loc[:, [label_col_name]], test_size=0.2, random_state=0)
    
    # join on index
    df_train = X_train.join(y_train) 
    df_test = X_test.join(y_test)
    print(f"df_train.shape {df_train.shape}")
    print(f"df_test.shape {df_test.shape}")
    
    # get row by index label
    # print(df_train.loc[137])
    
    # output training set
    with open(train_output_path, "w+", encoding="utf-8") as f:
        df_train.to_csv(f, index=False, header=True, encoding="utf-8")
    
    # output test set
    with open(test_output_path, "w+", encoding="utf-8") as f:
        df_test.to_csv(f, index=False, header=True, encoding="utf-8")       
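
The components above declare their file parameters with `InputPath` and `OutputPath`, using names such as `output_path` and `train_output_path`. The sketch below illustrates how those parameter names map to the input and output names used when wiring tasks together, assuming the KFP v1 SDK drops a trailing `_path` (and, as a further assumption, `_file`). The helper `io_name_from_parameter` is hypothetical and is not the SDK's own code.

```python
def io_name_from_parameter(param_name: str) -> str:
    """Hypothetical helper: mimic how a path parameter name becomes an input/output name."""
    # Assumption: the suffixes are stripped in this order, one after the other.
    for suffix in ("_path", "_file"):
        if param_name.endswith(suffix):
            param_name = param_name[: -len(suffix)]
    return param_name


for name in ("output_path", "input_path", "train_output_path", "test_output_path"):
    print(f"{name} -> {io_name_from_parameter(name)}")
```

Under this assumption, the outputs of `process_data` would be addressed as `train_output` and `test_output`.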

### Define Helper Function

In [11]:
def pod_resource_transformer(op: ContainerOp, mem_req="200Mi", cpu_req="2000m", mem_lim="2000Mi", cpu_lim='2000m'):
    """
    Set the resource requests and limits of a container op.
    op.set_memory_limit('1000Mi') limits memory to 1000 MiB (roughly 1 GB)
    op.set_cpu_limit('1000m') limits CPU to 1 core
    """
    return op.set_memory_request(mem_req)\
            .set_memory_limit(mem_lim)\
            .set_cpu_request(cpu_req)\
            .set_cpu_limit(cpu_lim)
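
The helper passes Kubernetes quantity strings such as `'2000m'` and `'2000Mi'`. The following sketch shows what these strings mean numerically: the `m` suffix is a thousandth of a CPU core and `Mi` is a mebibyte (2^20 bytes). The functions `cpu_cores` and `memory_bytes` are hypothetical illustrations that only cover the suffixes used in this notebook; they are not part of KFP or Kubernetes.

```python
# Binary (power-of-two) memory suffixes; decimal suffixes such as "M" are not handled here.
_BINARY_SUFFIXES = {"Ki": 1024, "Mi": 1024**2, "Gi": 1024**3}


def cpu_cores(quantity: str) -> float:
    """Convert a CPU quantity such as '2000m' or '2' to a number of cores."""
    if quantity.endswith("m"):
        return int(quantity[:-1]) / 1000
    return float(quantity)


def memory_bytes(quantity: str) -> int:
    """Convert a memory quantity such as '200Mi' to bytes."""
    for suffix, factor in _BINARY_SUFFIXES.items():
        if quantity.endswith(suffix):
            return int(quantity[: -len(suffix)]) * factor
    return int(quantity)  # a plain number is a byte count


print(cpu_cores("1000m"))      # 1.0
print(memory_bytes("1000Mi"))  # 1048576000
```

So `'1000Mi'` is slightly more than one decimal gigabyte, and `'2000m'` corresponds to two CPU cores.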

## Define Pipeline
* Intro Kubeflow pipeline: https://v1-5-branch.kubeflow.org/docs/components/pipelines/introduction/
* Kubeflow pipeline SDK v1: https://v1-5-branch.kubeflow.org/docs/components/pipelines/sdk/sdk-overview/

In [12]:
@pipeline(
    name = EXPERIMENT_NAME,
    description = EXPERIMENT_DESC
)
def custom_pipeline():
    download_task = download_data()
    # request 500Mi of memory and 1 CPU core
    download_task = pod_resource_transformer(download_task, mem_req="500Mi", cpu_req="1000m")
    # cache the download step for one day ("P1D"); set "P0D" to disable caching
    download_task.execution_options.caching_strategy.max_cache_staleness = "P1D"
    download_task.set_display_name("download iris data")
    
    # the parameter is named "output_path"; the "_path" suffix is removed by the system, so the output is called "output"
    process_data_task = process_data("label", download_task.outputs["output"])
    process_data_task = pod_resource_transformer(process_data_task, mem_req="500Mi", cpu_req="1000m")
    process_data_task.execution_options.caching_strategy.max_cache_staleness = "P0D"
    process_data_task.set_display_name("split iris data")   
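
The `max_cache_staleness` values `"P1D"` and `"P0D"` used in the pipeline are duration strings. The sketch below, assuming the RFC 3339 / ISO 8601 duration notation with only day, hour, minute and second components, converts such a string to a `timedelta` to make the meaning explicit. The function `staleness_to_timedelta` is a hypothetical helper and not part of the KFP SDK.

```python
import re
from datetime import timedelta

# Matches durations such as "P1D", "P0D", "PT12H" or "P1DT6H30M".
_DURATION_RE = re.compile(
    r"^P(?:(?P<days>\d+)D)?"
    r"(?:T(?:(?P<hours>\d+)H)?(?:(?P<minutes>\d+)M)?(?:(?P<seconds>\d+)S)?)?$"
)


def staleness_to_timedelta(value: str) -> timedelta:
    """Convert a duration string such as 'P1D' to a timedelta."""
    match = _DURATION_RE.match(value)
    parts = {}
    if match is not None:
        parts = {k: int(v) for k, v in match.groupdict().items() if v is not None}
    if not parts:
        raise ValueError(f"unsupported duration: {value!r}")
    return timedelta(**parts)


print(staleness_to_timedelta("P1D").total_seconds())  # 86400.0
print(staleness_to_timedelta("P0D").total_seconds())  # 0.0
```

A staleness of zero means that no cached result is ever considered fresh enough, which matches the use of `"P0D"` to disable caching.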

### (optional) pipeline compile step
Use the following command to compile the pipeline to a YAML file.

In [13]:
PIPE_LINE_FILE_NAME="kfp_iris_demo_pipeline"
kfp.compiler.Compiler().compile(custom_pipeline, f"{PIPE_LINE_FILE_NAME}.yaml")

### Create Experiment Run

Create a run label with the current date and time:
```python
from datetime import datetime
from pytz import timezone as ptimezone
ts = datetime.strftime(datetime.now(ptimezone("Europe/Berlin")), "%Y-%m-%d %H-%M-%S")
print(ts)
```

Reference:
* https://stackoverflow.com/questions/25837452/python-get-current-time-in-right-timezone/25887393#25887393

In [14]:
from datetime import datetime
from pytz import timezone as ptimezone

def get_local_time_str(target_tz_str: str = "Europe/Berlin", format_str: str = "%Y-%m-%d %H-%M-%S") -> str:
    """
    This helper exists because the local timezone is misconfigured on the server.
    @param target_tz_str: target timezone name, default "Europe/Berlin"
    @param format_str: strftime format; "%Y-%m-%d %H-%M-%S" returns e.g. 2022-07-07 12-08-45
    """
    target_tz = ptimezone(target_tz_str) # create timezone; on Python 3.9+ the standard library zoneinfo.ZoneInfo can be used instead
    # utc_dt = datetime.now(datetime.timezone.utc)
    target_dt = datetime.now(target_tz)
    return datetime.strftime(target_dt, format_str)
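
The comment in the function above mentions the standard library alternative to `pytz`. A minimal sketch of that variant follows. It assumes Python 3.9 or newer (the kernel shown earlier runs 3.8.10, where `zoneinfo` is not available) and a system time zone database. The names `format_in_zone` and `get_local_time_str_zoneinfo` are hypothetical.

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo  # standard library, Python 3.9+


def format_in_zone(moment: datetime, tz_name: str = "Europe/Berlin",
                   format_str: str = "%Y-%m-%d %H-%M-%S") -> str:
    """Convert a timezone-aware datetime to tz_name and format it."""
    return moment.astimezone(ZoneInfo(tz_name)).strftime(format_str)


def get_local_time_str_zoneinfo(tz_name: str = "Europe/Berlin",
                                format_str: str = "%Y-%m-%d %H-%M-%S") -> str:
    """Same idea as get_local_time_str, without the pytz dependency."""
    return format_in_zone(datetime.now(timezone.utc), tz_name, format_str)


# A fixed UTC timestamp makes the conversion easy to check by hand:
fixed = datetime(2022, 7, 7, 10, 8, 45, tzinfo=timezone.utc)
print(format_in_zone(fixed))  # 2022-07-07 12-08-45 (Berlin is UTC+2 in July)
print(get_local_time_str_zoneinfo())
```

Starting from UTC and converting explicitly keeps the result independent of the server's local timezone setting, which is the reason the original helper exists.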

### Configure pipeline run
* Setting imagePullSecrets for Pipeline with SDK: https://github.com/kubeflow/pipelines/issues/5843#issuecomment-859799181

In [15]:
# from kubernetes import client as k8s_client
pipeline_config = dsl.PipelineConf()

# pipeline_config.set_image_pull_secrets([k8s_client.V1ObjectReference(name=K8_GIT_SECRET_NAME, namespace=NAMESPACE)])
# pipeline_config.set_image_pull_policy("Always")
pipeline_config.set_image_pull_policy("IfNotPresent")

<kfp.dsl._pipeline.PipelineConf at 0x7f6842d773d0>

In [16]:
RUN_NAME = f"kfp_iris_demo {get_local_time_str()}"

# client = kfp.Client()
client.create_run_from_pipeline_func(
    pipeline_func=custom_pipeline,
    arguments = {}, #arguments_pipeline,
    run_name = RUN_NAME,
    pipeline_conf=pipeline_config,
    experiment_name=EXPERIMENT_NAME,
    namespace=NAMESPACE,
)

RunPipelineResult(run_id=4f2cb26e-4a67-4b50-954e-da232d06e52f)