# Creating A Pipeline Using MLRUN

A pipeline automates a data science workflow. Users define the steps in the pipeline as well as the relationships between those steps. <br>
A typical pipeline comprises data collection, training / hyperparameter tuning, selecting the best-fit model, deploying to a test environment, running tests, and then deploying to a production system. <br>
In this notebook we'll demonstrate how to create a simple pipeline that comprises two steps: "training" and "validation". <br> Once the pipeline has been created, go to the main Iguazio dashboard and look for "Pipeline" in the top-left menu. In the pipeline view you should see the new pipeline called "mlrun demo 1", along with historical experiments if any exist.


In [1]:
%env MLRUN_PACKAGE_PATH=git+https://github.com/mlrun/mlrun.git
%env MLRUN_META_DBPATH=/User/experiment-tracking
import kfp
from kfp import dsl
from mlrun.platforms import mount_v3io
from mlrun.builder import build_image
from mlrun import mlrun_op, get_run_db

env: MLRUN_PACKAGE_PATH=git+https://github.com/mlrun/mlrun.git
env: MLRUN_META_DBPATH=/User/experiment-tracking


<b> Test/debug the code locally and verify that it works </b>

In [2]:
!python -m mlrun run -p p1=5 -s file=secrets.txt --out-path /User/mlrun/ training.py

[mlrun] 2019-09-11 21:13:16,104 starting run None uid=c2dc5fe6d4904729a9f662c64959cbd2
Run: train (uid=c2dc5fe6d4904729a9f662c64959cbd2)
Params: p1=5, p2=a-string
accesskey = 456
file
b"i'm a local input file"


[mlrun] 2019-09-11 21:13:29,923 run executed, status=created


## Build & Run a Kubeflow Pipeline

This example uses the Iguazio shared file system (v3io); the `/User` directory is the home directory of the user and of the Jupyter notebook.<br>
The code is mounted into the pipeline containers, so there is no need to rebuild containers when the code changes, and the runtime has access to the user's local files.

MLRUN has a DB, specified in the `db_path` argument (this example uses files to store runs and artifacts).<br>
The result artifacts are versioned and stored under the specified location; each workflow has a unique artifacts directory (`/<path>/{{workflow.uid}}/`).

Artifact and DB paths can be file paths or URLs for supported datastores (prefixed with s3://, v3io://, ..). <br>
Notes: file-store artifacts cannot be viewed by KFP (use object URLs), and URL-based stores may require passing secrets.
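
To make the distinction between file paths and URL-based stores concrete, here is a minimal sketch in plain Python. The helpers `store_kind` and `render_path` are hypothetical names introduced for illustration and are not part of MLRUN; in a real run the `{{workflow.uid}}` placeholder is filled in by the pipeline engine, not by user code.

```python
from urllib.parse import urlparse


def store_kind(path):
    """Return the URL scheme of a path, or 'file' for a plain file path."""
    scheme = urlparse(path).scheme
    return scheme if scheme else "file"


def render_path(template, workflow_uid):
    """Substitute the workflow UID placeholder by hand (illustration only)."""
    return template.replace("{{workflow.uid}}", workflow_uid)


file_template = "/User/experiment-tracking/data/{{workflow.uid}}/"
url_template = "v3io:///bigdata/mlrun/{{workflow.uid}}/"

# "example-uid" is a made-up value standing in for a real workflow UID
for template in (file_template, url_template):
    print(store_kind(template), render_path(template, "example-uid"))
```

A check like `store_kind(path) == "file"` is one way to flag, before submitting a pipeline, artifact paths that KFP will not be able to display.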

In [3]:
this_path = '/User/experiment-tracking'
db_path = this_path
#artifacts_path = this_path + '/data/{{workflow.uid}}/'
artifacts_path = 'v3io:///bigdata/mlrun/{{workflow.uid}}/'
image_name = 'mlrun/mypipe:latest'

In [4]:
# set the docker registry to the local registry
from os import environ
img = 'docker-registry.{}:80/{}'.format(environ.get('IGZ_NAMESPACE_DOMAIN'), image_name)
img

'docker-registry.default-tenant.app.vpxomsanigin.iguazio-cd2.com:80/mlrun/mypipe:latest'

## Build a container image

Before running a pipeline job we need to either create an image with the relevant code and libraries or use an existing one. The following is an example of creating an image using MLRUN. We use this image as the base image on top of which the code runs.

In [5]:
# build an image with MLRUN (dev branch vs master), can add other packages to it 
# since registry secret was not specified it will push to the local cluster registry 
x = build_image(image_name, 
                  base_image='python:3.6')

FROM python:3.6
WORKDIR /run
RUN pip install git+https://github.com/mlrun/mlrun.git
ENV PYTHONPATH /run
[mlrun] 2019-09-11 21:13:37,533 using in-cluster config.
[mlrun] 2019-09-11 21:13:37,551 Pod mlrun-build-tjghq created
..
[36mINFO[0m[0000] Downloading base image python:3.6            
2019/09/11 21:13:39 No matching credentials were found, falling back on anonymous
[36mINFO[0m[0000] Unpacking rootfs as cmd RUN pip install git+https://github.com/mlrun/mlrun.git requires it. 
[36mINFO[0m[0013] Taking snapshot of full filesystem...        
[36mINFO[0m[0017] Skipping paths under /kaniko, as it is a whitelisted directory 
[36mINFO[0m[0017] Skipping paths under /empty, as it is a whitelisted directory 
[36mINFO[0m[0017] Skipping paths under /var/run, as it is a whitelisted directory 
[36mINFO[0m[0017] Skipping paths under /proc, as it is a whitelisted directory 
[36mINFO[0m[0017] Skipping paths under /sys, as it is a whitelisted directory 
[36mINFO[0m[0017] Skipping pat

In [6]:
x

'succeeded'

## Creating a two-step workflow (training, validation)

* 1st step: Execute the training job with parameters <b>p1</b> and <b>p2</b>, log results and various artifacts including the model (see [training.py](training.py))
* 2nd step: Take the <b>modelfile</b> from the 1st step and run validation on it (see [validation.py](validation.py))

### Building a Pipeline with Hyperparams and Parallel Execution
We may want to run the same training job with multiple parameter options, and we can leverage MLRUN parallelism for this.<br>
Instead of running each run in a separate container, with extra start and stop times, we can use a pool of serverless functions<br>
or containers which run the workload in parallel.

We extend our pipeline to use hyperparameters: the training job accepts a list per parameter and runs all the parameter<br>
combinations (grid search), combining the fixed parameters (`params`) with the expanded parameters (from `hyperparams`).<br>
We can apply a `selector`, which returns the best result based on a criterion, e.g. `max.accuracy`.
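
The grid expansion and the selector can be sketched in plain Python. This is an illustration of the idea only, not MLRUN's implementation: `expand_grid`, `select_best`, and `fake_train` are hypothetical names, and `fake_train` is a stand-in for a real training job with made-up scoring.

```python
from itertools import product


def expand_grid(params, hyperparams):
    """Yield one parameter dict per combination of the hyperparameter lists."""
    names = list(hyperparams)
    for values in product(*(hyperparams[name] for name in names)):
        combo = dict(params)              # fixed parameters are copied into every run
        combo.update(zip(names, values))  # expanded parameters vary per iteration
        yield combo


def select_best(results, selector):
    """Pick the best result for a selector such as 'max.accuracy' or 'min.loss'."""
    op, key = selector.split(".", 1)
    choose = max if op == "max" else min
    return choose(results, key=lambda r: r[key])


def fake_train(p):
    # Hypothetical stand-in for a real training job; the scores are made up
    return {"params": p, "accuracy": p["p1"] * 2, "loss": p["p1"] * 3}


runs = [fake_train(p) for p in expand_grid({"p2": "text"}, {"p1": [5, 7, 3]})]
best = select_best(runs, "max.accuracy")
print(best["params"])
```

With several list-valued hyperparameters the number of runs is the product of the list lengths, which is why running the iterations in parallel matters.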

Parameter combinations can also be provided using the `param_file` option, which reads the parameter values per iteration<br>
from a CSV file (where the first row holds the parameter names and the following rows hold the parameter values).<br>
The use of `hyperparams` and `param_file` can be extended to many tasks, including data and ETL tasks,<br>
e.g. create a list of text or image file paths in a CSV file and run a step which processes all those files in parallel.
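
The CSV layout described above (header row with parameter names, one row per iteration) can be read as follows. This is a minimal sketch using only the standard library; `read_param_file` and the sample contents are hypothetical, and how MLRUN itself parses or type-converts the values is not shown here.

```python
import csv
import io


def read_param_file(text):
    """Parse CSV text: header row = parameter names, each later row = one iteration."""
    reader = csv.DictReader(io.StringIO(text))
    return [dict(row) for row in reader]


# Hypothetical file contents; csv returns every value as a string
sample = "p1,p2\n5,text-a\n7,text-b\n"

for iteration, params in enumerate(read_param_file(sample), start=1):
    print(iteration, params)
```

Because the values come back as strings, a step that expects numbers would need to convert them explicitly.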

In [7]:
# run training using params p1 and p2, generate 2 registered outputs (model, dataset) to be listed in the pipeline UI
# user can specify the target path per output e.g. 'model.txt':'<some-path>', or leave blank to use the default out_path
def mlrun_train(p1, p2):
    return mlrun_op('training', 
                    image = img,
                    command = this_path + '/training.py', 
                    params = {'p2':p2},
                    hyperparams = {'p1': p1},
                    selector = 'max.accuracy',
                    in_path='/User/experiment-tracking',
                    outputs = {'model.txt':'', 'dataset.csv':''},
                    out_path = artifacts_path)
                    
# use data (model) from the first step as an input
def mlrun_validate(modelfile):
    return mlrun_op('validation', 
                    image = img,
                    command = this_path + '/validation.py', 
                    inputs = {'model.txt':modelfile},
                    in_path='/User/experiment-tracking',
                    out_path = artifacts_path)

<b> Create a Kubeflow Pipelines DSL (execution graph/DAG)</b>

In [8]:
@dsl.pipeline(
    name='My MLRUN pipeline',
    description='Shows how to use mlrun.'
)
def mlrun_pipeline(
   p1 = [5, 6, 2] , p2 = '"text"'
):
    # create a train step, apply v3io mount to it (will add the /User mount to the container)
    train = mlrun_train(p1, p2).apply(mount_v3io())
    
    # feed the 1st step's results into the second step
    # Note: the '.' in model.txt must be substituted with '-'
    validate = mlrun_validate(train.outputs['model-txt']).apply(mount_v3io())
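
The naming rule in the note above can be expressed as a one-line helper. This is a sketch of that single rule only; `output_key` is a hypothetical name, and any other characters that might be rewritten in output names are not covered.

```python
def output_key(filename):
    """Map an output file name to its key form by replacing '.' with '-'."""
    return filename.replace(".", "-")


for name in ("model.txt", "dataset.csv"):
    print(name, "->", output_key(name))
```

Using such a helper when indexing `train.outputs` avoids mistyping the key by hand.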

<b> Create a KFP client and an experiment, then run the pipeline with custom parameters </b>

In [9]:
client = kfp.Client(namespace='default-tenant')
arguments = {'p1': [5, 7, 3]}
run_result = client.create_run_from_pipeline_func(mlrun_pipeline, arguments, run_name='mlrun demo 1', experiment_name='mlrun demo')

# for debug you can see the pipeline as a yaml file, use
# kfp.compiler.Compiler().compile(mlrun_pipeline, 'mlrunpipe.yaml')

<b> See the run status and results in the run database </b>

In [10]:
# connect to the run db 
db = get_run_db(db_path).connect()

In [12]:
# query the DB with filter on workflow ID (only show this workflow) 
db.list_runs('', labels=f'workflow={run_result.run_id}').show()

| uid | iter | start | state | name | labels | inputs | parameters | results | artifacts |
|---|---|---|---|---|---|---|---|---|---|
| ...6758d1 | 3 | Sep 11 21:15:36 | completed | training | workflow=d45e4d70-0e8c-4be1-8426-16a486e77f5d<br>kind=local<br>owner=root<br>host=my-mlrun-pipeline-mw5cd-1328661568<br>framework=sklearn | infile.txt | p2=text<br>p1=3 | accuracy=6<br>loss=9 | model.txt<br>results.html<br>dataset.csv<br>chart.html |
| ...6758d1 | 2 | Sep 11 21:15:35 | completed | training | workflow=d45e4d70-0e8c-4be1-8426-16a486e77f5d<br>kind=local<br>owner=root<br>host=my-mlrun-pipeline-mw5cd-1328661568<br>framework=sklearn | infile.txt | p2=text<br>p1=7 | accuracy=14<br>loss=21 | model.txt<br>results.html<br>dataset.csv<br>chart.html |
| ...6758d1 | 1 | Sep 11 21:15:33 | completed | training | workflow=d45e4d70-0e8c-4be1-8426-16a486e77f5d<br>kind=local<br>owner=root<br>host=my-mlrun-pipeline-mw5cd-1328661568<br>framework=sklearn | infile.txt | p2=text<br>p1=5 | accuracy=10<br>loss=15 | model.txt<br>results.html<br>dataset.csv<br>chart.html |
| ...6758d1 | 0 | Sep 11 21:15:32 | completed | training | workflow=d45e4d70-0e8c-4be1-8426-16a486e77f5d<br>kind=local<br>owner=root | | p2=text | best_iteration=2<br>accuracy=14<br>loss=21 | model.txt<br>results.html<br>dataset.csv<br>chart.html<br>iteration_results.csv |
