## About this Jupyter Notebook

@author: Yingding Wang\
@updated: 11.09.2023

This notebook demonstrates how to pass a list, returned as a field of a `NamedTuple`, from one pipeline component to another.

## Install KFP Python SDK to build a V1 pipeline

* Build a KF pipeline with the Python SDK: https://www.kubeflow.org/docs/components/pipelines/sdk/build-pipeline/
* Current KFP Python SDK version on pypi.org: https://pypi.org/project/kfp/

In [1]:
import sys
!{sys.executable} -m pip install --upgrade --user kfp==1.8.22



## Restart the Kernel
After the installation of KFP python SDK, the notebook kernel must be restarted.

In [2]:
from platform import python_version
print(f"current platform python version: {python_version()}")

current platform python version: 3.8.10


In [3]:
# run kubectl to show the resource quota of the current namespace
!kubectl describe quota

Name:                                                         kf-resource-quota
Namespace:                                                    kubeflow-kindfor
Resource                                                      Used   Hard
--------                                                      ----   ----
basic-csi.storageclass.storage.k8s.io/persistentvolumeclaims  4      15
basic-csi.storageclass.storage.k8s.io/requests.storage        115Gi  150Gi
cpu                                                           290m   128
longhorn.storageclass.storage.k8s.io/persistentvolumeclaims   1      15
longhorn.storageclass.storage.k8s.io/requests.storage         250Gi  500Gi
memory                                                        902Mi  512Gi
requests.nvidia.com/mig-1g.10gb                               0      2
requests.nvidia.com/mig-1g.20gb                               0      1
requests.nvidia.com/mig-2g.20gb                               0      1


## Getting familiar with the Jupyter Notebook environment

In [4]:
# examine the KFP Python SDK version inside Kubeflow v1.5.1
!{sys.executable} -m pip list | grep kfp

kfp                      1.8.22
kfp-pipeline-spec        0.1.16
kfp-server-api           1.8.5


## Define global variables

In [5]:
import kfp
client = kfp.Client()
NAMESPACE = client.get_user_namespace()
EXPERIMENT_NAME = 'default' # Name of the experiment in the KF webapp UI
EXPERIMENT_DESC = 'pass list obj between components'
PREFIX = "demo_"

print(NAMESPACE)

kubeflow-kindfor


In [6]:
from dataclasses import dataclass
@dataclass
class Config:
    # python 3.8
    base_image: str = "python:3.8.18"
    
config = Config()

## Creating Kubeflow components from Python functions

In [7]:
import kfp.dsl as dsl
from functools import partial
from kfp.dsl import (
    pipeline,
    ContainerOp
)
from kfp.components import (
    InputPath,
    OutputPath,
    create_component_from_func
)

## List Write Component

In [8]:
from typing import NamedTuple
@partial(
    create_component_from_func,
    output_component_file=f"{PREFIX}list_write_component.yaml",
    base_image=config.base_image,
    packages_to_install=[],  # add extra pip packages here
)
def upstream_write_task() -> NamedTuple("output", [('mylist', list)]):
    from collections import namedtuple
    content_list = [f"string number {i}" for i in range(10)]
    output = namedtuple('output', ['mylist'])
    return output(content_list)

In [9]:
# the parameter must be annotated with the built-in list type, not typing.List
@partial(
    create_component_from_func,
    output_component_file=f"{PREFIX}list_read_component.yaml",
    base_image=config.base_image,
    packages_to_install=[],  # add extra pip packages here
)
def downstream_read_task(my_list: list):
    print(my_list)
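
The two components above can be tried out locally before building a pipeline. The following is a minimal sketch in plain Python, without `kfp`. It assumes that the v1 SDK passes a value annotated as `list` between pods as a JSON string, which is why `json.dumps` and `json.loads` stand in for the transfer here. The function names `upstream_write` and `downstream_read` are illustrative, and the list is shortened to three items.

```python
import json
from collections import namedtuple
from typing import NamedTuple


def upstream_write() -> NamedTuple("output", [("mylist", list)]):
    # Same pattern as the write component: build the list, wrap it in a namedtuple.
    content_list = [f"string number {i}" for i in range(3)]
    output = namedtuple("output", ["mylist"])
    return output(content_list)


def downstream_read(my_list: list) -> int:
    # Same pattern as the read component; returns the length for easy checking.
    print(my_list)
    return len(my_list)


result = upstream_write()

# Assumption: the list travels between components as JSON text.
serialized = json.dumps(result.mylist)
restored = json.loads(serialized)

count = downstream_read(restored)
```

Only JSON-compatible items (strings, numbers, booleans, nested lists and dicts) survive such a round trip unchanged, so under this assumption the list content should be limited to those types.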

## Define Helper Function
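Kubernetes distinguishes binary suffixes such as `Gi` (2^30 bytes) from decimal suffixes such as `G` (10^9 bytes), so `2Gi` and `2G` are different sizes. See:
https://stackoverflow.com/questions/50804915/kubernetes-size-definitions-whats-the-difference-of-gi-and-g/50805048#50805048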

In [10]:
def pod_resource_transformer(op: ContainerOp, mem_req="200Mi", cpu_req="2000m", mem_lim="4000Mi", cpu_lim='4000m'):
    """
    Set the resource requests and limits of a container op.
    op.set_memory_limit('1000Mi') sets 1000 MiB (about 1.05 GB)
    op.set_cpu_limit('1000m') sets 1 CPU core
    """
    return op.set_memory_request(mem_req)\
            .set_memory_limit(mem_lim)\
            .set_cpu_request(cpu_req)\
            .set_cpu_limit(cpu_lim)
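
To make the suffixes used in the helper's arguments concrete, here is a small sketch that converts Kubernetes-style quantity strings into plain numbers. The helper names `memory_quantity_to_bytes` and `cpu_quantity_to_cores` are hypothetical and not part of `kfp` or Kubernetes. The sketch covers only the common suffixes up to tera and skips exponent notation.

```python
from decimal import Decimal

# Binary (power-of-two) and decimal (power-of-ten) memory suffixes.
BINARY_SUFFIXES = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40}
DECIMAL_SUFFIXES = {"k": 10**3, "M": 10**6, "G": 10**9, "T": 10**12}


def memory_quantity_to_bytes(quantity: str) -> int:
    """Convert a memory quantity such as '2Gi' or '2G' to bytes."""
    for suffixes in (BINARY_SUFFIXES, DECIMAL_SUFFIXES):
        for suffix, factor in suffixes.items():
            if quantity.endswith(suffix):
                return int(Decimal(quantity[: -len(suffix)]) * factor)
    return int(Decimal(quantity))  # no suffix: plain bytes


def cpu_quantity_to_cores(quantity: str) -> float:
    """Convert a CPU quantity such as '2000m' (millicores) to cores."""
    if quantity.endswith("m"):
        return int(quantity[:-1]) / 1000
    return float(quantity)


print(memory_quantity_to_bytes("2Gi"))     # 2147483648
print(memory_quantity_to_bytes("2G"))      # 2000000000
print(memory_quantity_to_bytes("1000Mi"))  # 1048576000
print(cpu_quantity_to_cores("2000m"))      # 2.0
```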

## Define Pipeline
* Intro Kubeflow pipeline: https://v1-5-branch.kubeflow.org/docs/components/pipelines/introduction/
* Kubeflow pipeline SDK v1: https://v1-5-branch.kubeflow.org/docs/components/pipelines/sdk/sdk-overview/

In [11]:
@pipeline(
    name = EXPERIMENT_NAME,
    description = EXPERIMENT_DESC
)
def custom_pipeline(epochs: int):
    '''local variable'''
    no_artifact_cache = "P0D"
    # artifact_cache_today = "P1D"
    cache_setting = no_artifact_cache
    batch_size = 50
    # epochs = 100
    
    '''pipeline'''   
    task1 = upstream_write_task() 
    task1 = pod_resource_transformer(task1, mem_req="1000Mi", cpu_req="1000m")
    task1.execution_options.caching_strategy.max_cache_staleness = cache_setting
    task1.set_display_name("create list")
    
    task2 = downstream_read_task(task1.outputs['mylist']) 
    task2 = pod_resource_transformer(task2, mem_req="1000Mi", cpu_req="1000m")
    task2.execution_options.caching_strategy.max_cache_staleness = cache_setting
    task2.set_display_name("read list")

### (optional) pipeline compile step
Use the following cell to compile the pipeline into a YAML file.

In [12]:
PIPE_LINE_FILE_NAME=f"{PREFIX}kfp_pass_list_pipeline"
kfp.compiler.Compiler().compile(custom_pipeline, f"{PIPE_LINE_FILE_NAME}.yaml")

### Create Experiment Run

Create a run label with the current date and time:
```python
from datetime import datetime
from pytz import timezone as ptimezone
ts = datetime.strftime(datetime.now(ptimezone("Europe/Berlin")), "%Y-%m-%d %H-%M-%S")
print(ts)
```

Reference:
* https://stackoverflow.com/questions/25837452/python-get-current-time-in-right-timezone/25887393#25887393

In [13]:
from datetime import datetime
from pytz import timezone as ptimezone

def get_local_time_str(target_tz_str: str = "Europe/Berlin", format_str: str = "%Y-%m-%d %H-%M-%S") -> str:
    """
    Return the current time in the target timezone as a formatted string.
    This helper exists because the local timezone is misconfigured on the server.
    @param target_tz_str: target timezone name, default "Europe/Berlin"
    @param format_str: strftime format; "%Y-%m-%d %H-%M-%S" returns e.g. 2022-07-07 12-08-45
    """
    target_tz = ptimezone(target_tz_str) # create timezone, in python3.9 use standard lib ZoneInfo
    # utc_dt = datetime.now(datetime.timezone.utc)
    target_dt = datetime.now(target_tz)
    return datetime.strftime(target_dt, format_str)
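
The code comment above mentions the standard-library `ZoneInfo` as an alternative to `pytz`. A possible variant is sketched below, assuming Python 3.9 or newer (so it would not run on the 3.8 kernel shown earlier) and a time zone database available on the system. The names `format_in_zone` and `get_local_time_str_zoneinfo` are illustrative.

```python
from datetime import datetime, timedelta, timezone, tzinfo
from zoneinfo import ZoneInfo  # standard library since Python 3.9

DEFAULT_FORMAT = "%Y-%m-%d %H-%M-%S"


def format_in_zone(moment: datetime, target_tz: tzinfo, format_str: str = DEFAULT_FORMAT) -> str:
    """Convert an aware datetime to target_tz and format it."""
    return moment.astimezone(target_tz).strftime(format_str)


def get_local_time_str_zoneinfo(target_tz_str: str = "Europe/Berlin", format_str: str = DEFAULT_FORMAT) -> str:
    """Return the current time in the named zone, without pytz."""
    return format_in_zone(datetime.now(timezone.utc), ZoneInfo(target_tz_str), format_str)


# Deterministic check with a fixed instant and a fixed +02:00 offset
# (the offset Europe/Berlin uses in summer).
fixed = datetime(2022, 7, 7, 10, 8, 45, tzinfo=timezone.utc)
label = format_in_zone(fixed, timezone(timedelta(hours=2)))
print(label)  # 2022-07-07 12-08-45
```

Calling `get_local_time_str_zoneinfo()` should then yield a label in the same format as `get_local_time_str()`.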

In [14]:
# from kubernetes import client as k8s_client
pipeline_config = dsl.PipelineConf()

# pipeline_config.set_image_pull_secrets([k8s_client.V1ObjectReference(name=K8_GIT_SECRET_NAME, namespace=NAMESPACE)])
# pipeline_config.set_image_pull_policy("Always")
pipeline_config.set_image_pull_policy("IfNotPresent")

pipeline_args = {}

In [15]:
RUN_NAME = f"{PREFIX}kfp_pass_list_pipeline {get_local_time_str()}"

# client = kfp.Client()
client.create_run_from_pipeline_func(
    pipeline_func=custom_pipeline,
    arguments = pipeline_args, #{}
    run_name = RUN_NAME,
    pipeline_conf=pipeline_config,
    experiment_name=EXPERIMENT_NAME,
    namespace=NAMESPACE,
)

RunPipelineResult(run_id=14e5ed59-4a6a-4db3-ae7c-ba53071d1dde)