# ML Pipeline example - XGBoost Training

A pipeline automates a data science workflow. Users define the steps in the pipeline and the relationships between them. <br>
A typical pipeline comprises data collection, training or hyperparameter tuning, selecting the best-fit model, deploying to a test environment, running tests, and then deploying to a production system. <br>
In this notebook we demonstrate how to create a pipeline with three steps: training, deploying the model created by the training step as a function, and plotting the results. <br>
Once the pipeline has been created, go to the main Iguazio dashboard and select "Pipeline" from the menu items at the top left. <br>
In the pipeline view you should see the new run named "xgb 1" under the "xgb" experiment, along with historical experiments if any exist.


In [1]:
# nuclio: ignore
# if the nuclio-jupyter package is not installed run !pip install nuclio-jupyter
import nuclio 

### Install and register package dependencies and build commands
These commands are converted to container build instructions.

In [2]:
%%nuclio cmd 
pip install sklearn
pip install xgboost
pip install matplotlib



In [3]:
%nuclio config spec.build.baseImage = "python:3.6-jessie"
#%nuclio config spec.image = ".mlrun/xgb:latest"

%nuclio: setting spec.build.baseImage to 'python:3.6-jessie'
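
As a rough illustration of how the base image and build commands above end up as container build instructions, the sketch below renders them as Dockerfile-style lines. The helper `to_build_instructions` is hypothetical and is not how nuclio or MLRun perform the conversion; it only mirrors the shape of the build output shown later in this notebook.

```python
def to_build_instructions(base_image, commands):
    """Render a base image and shell commands as Dockerfile-style lines.

    Hypothetical helper for illustration only; not part of nuclio or MLRun.
    """
    lines = ["FROM {}".format(base_image)]
    # each non-empty build command becomes one RUN instruction
    lines.extend("RUN {}".format(cmd.strip()) for cmd in commands if cmd.strip())
    return lines


instructions = to_build_instructions(
    "python:3.6-jessie",
    ["pip install sklearn", "pip install xgboost", "pip install matplotlib"],
)
print("\n".join(instructions))
```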


## ML Training code

In [4]:
import xgboost as xgb
import os
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
import numpy as np
from sklearn.metrics import accuracy_score

dtrain = dtest = Y_test = None

def load_dataset():
    global dtrain, dtest, Y_test
    iris = load_iris()
    y = iris['target']
    X = iris['data']
    X_train, X_test, Y_train, Y_test = train_test_split(X, y, test_size=0.2)
    dtrain = xgb.DMatrix(X_train, label=Y_train)
    dtest = xgb.DMatrix(X_test, label=Y_test)


def xgb_train(context, model_name='model.bst',
            max_depth=6,
            num_class=10,
            eta=0.2,
            gamma=0.1,
            steps=20):
    global dtrain, dtest, Y_test

    if dtrain is None:
        load_dataset()

    # Build the XGBoost parameters from the handler arguments
    param = {"max_depth": max_depth,
             "eta": eta, "nthread": 4,
             "num_class": num_class,
             "gamma": gamma,
             "objective": "multi:softprob"}

    # Train model
    xgb_model = xgb.train(param, dtrain, steps)

    preds = xgb_model.predict(dtest)
    best_preds = np.asarray([np.argmax(line) for line in preds])

    context.log_result('accuracy', float(accuracy_score(Y_test, best_preds)))

    os.makedirs('models', exist_ok=True)
    model_file = model_name #os.path.join('models', model_name)
    xgb_model.save_model(model_file)
    context.log_artifact('model', src_path=model_file, labels={'framework': 'xgboost'})

from mlrun.artifacts import PlotArtifact
import matplotlib
import matplotlib.pyplot as plt
import pandas as pd
from io import BytesIO

def plot_iter(context, iterations, col='accuracy', num_bins=10):
    df = pd.read_csv(BytesIO(iterations.get()))
    x = df['output.{}'.format(col)]
    fig, ax = plt.subplots(figsize=(6,6))
    # density=True normalizes the histogram, so the y-axis shows density rather than raw counts
    n, bins, patches = ax.hist(x, num_bins, density=True)
    ax.set_xlabel('Accuracy')
    ax.set_ylabel('Density')
    context.log_artifact(PlotArtifact('myfig', body=fig))

In [5]:
# nuclio: end-code
# (end-code marker tells nuclio to stop parsing the notebook from this cell)
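
With the `multi:softprob` objective, `predict` returns one probability per class for every row, and the training code takes the index of the largest probability as the predicted class before computing accuracy. The following is a dependency-free sketch of those two steps, using made-up probabilities rather than real model output:

```python
def argmax_rows(probabilities):
    """Return the index of the largest value in each row."""
    return [max(range(len(row)), key=row.__getitem__) for row in probabilities]


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    matches = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return matches / len(y_true)


# hypothetical class probabilities for four samples and three classes
probs = [
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.2, 0.3, 0.5],
    [0.6, 0.3, 0.1],
]
labels = [0, 1, 2, 1]

predicted = argmax_rows(probs)
print(predicted)
print(accuracy(labels, predicted))
```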

In [6]:
from mlrun import new_function, code_to_function, NewTask, get_run_db, mlconf, mount_v3io, new_model_server
mlconf.dbpath = '/User/mlrun'
import kfp
from kfp import dsl

## Test the code locally 

In [7]:
task = NewTask(handler=xgb_train, out_path='/User/mlrun/data').with_hyper_params({'eta': [0.1, 0.2, 0.3]}, selector='max.accuracy')
run = new_function().run(task)

[mlrun] 2019-10-29 12:08:51,608 starting run None uid=bc617472b91b4301ba8579dc2e099f29  -> /User/mlrun


uid,iter,start,state,name,labels,inputs,parameters,results,artifacts
...099f29,0,Oct 29 12:08:51,completed,,kind=handler owner=iguazio,,,best_iteration=1 accuracy=0.8666666666666667,model iteration_results


type result.show() to see detailed results/progress or use CLI:
!mlrun get run --uid bc617472b91b4301ba8579dc2e099f29 
[mlrun] 2019-10-29 12:08:52,304 run executed, status=completed
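
The `selector='max.accuracy'` argument asks for the iteration with the highest `accuracy` result to be reported as the best one (the `best_iteration` value in the output above). The sketch below illustrates that idea with a hypothetical `select_best` helper and made-up per-iteration results; it is not MLRun's implementation.

```python
def select_best(iterations, selector):
    """Pick the iteration that maximizes or minimizes one result key.

    `selector` has the form '<max|min>.<result name>', e.g. 'max.accuracy'.
    """
    op, _, key = selector.partition(".")
    if op not in ("max", "min") or not key:
        raise ValueError("selector must look like 'max.<key>' or 'min.<key>'")
    choose = max if op == "max" else min
    return choose(iterations, key=lambda it: it["results"][key])


# made-up results for three eta values (not taken from a real run)
iterations = [
    {"iter": 1, "params": {"eta": 0.1}, "results": {"accuracy": 0.90}},
    {"iter": 2, "params": {"eta": 0.2}, "results": {"accuracy": 0.95}},
    {"iter": 3, "params": {"eta": 0.3}, "results": {"accuracy": 0.85}},
]

best = select_best(iterations, "max.accuracy")
print(best["iter"], best["params"])
```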


## Create a containerized function from the notebook code

We create a function object that defines the code, metadata, execution, and build instructions. <br>
Before running a pipeline job, we need to either build an image with the relevant code and libraries or use an existing one. <br>
The following example builds such an image with MLRun; the pipeline steps then run the notebook code on top of it. <br>
The image only needs to be built once; there is no need to rebuild it for every run.

In [8]:
# create a job function from the notebook code (the Iguazio data fabric (v3io) mount is applied later, in the pipeline steps)
fn = code_to_function('training')

In [9]:
fn.build(image='mlrun/xgb:latest')

[mlrun] 2019-10-29 12:09:13,978 building image (mlrun/xgb:latest)
FROM python:3.6-jessie
WORKDIR /run
RUN pip install sklearn
RUN pip install xgboost
RUN pip install matplotlib
RUN pip install mlrun
ENV PYTHONPATH /run
[mlrun] 2019-10-29 12:09:13,980 using in-cluster config.
[mlrun] 2019-10-29 12:09:13,998 Pod mlrun-build-mr2fx created
..
[36mINFO[0m[0000] Resolved base name python:3.6-jessie to python:3.6-jessie 
[36mINFO[0m[0000] Resolved base name python:3.6-jessie to python:3.6-jessie 
[36mINFO[0m[0000] Downloading base image python:3.6-jessie     
[36mINFO[0m[0000] Error while retrieving image from cache: getting file info: stat /cache/sha256:0318d80cb241983eda20b905d77fa0bfb06e29e5aabf075c7941ea687f1c125a: no such file or directory 
[36mINFO[0m[0000] Downloading base image python:3.6-jessie     
[36mINFO[0m[0000] Built cross stage deps: map[]                
[36mINFO[0m[0000] Downloading base image python:3.6-jessie     
[36mINFO[0m[0000] Error while retrieving im

<mlrun.runtimes.local.LocalRuntime at 0x7f980e0f3f98>

## Create and run the pipeline

### Creating a three-step workflow (training, deployment, and an HTML chart artifact)
* 1st stage: Execute the training job with hyperparameters for eta and gamma, and log the results and various artifacts, including the model, into the artifact path.
* 2nd stage: Run two steps in parallel: take the model file from the first stage and deploy it as a nuclio function, and plot an HTML chart of the accuracy distribution across iterations. <br>
Note that we use the v3io mount to save artifacts, so the nuclio function can access and use the model file created in the first stage.
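
The dependency structure described above (one training step, followed by deployment and plotting, which both depend on it) can be sketched as a small graph. The `execution_levels` helper is hypothetical and only illustrates how steps without unmet dependencies can be grouped to run in parallel; the actual scheduling is done by the pipeline engine.

```python
def execution_levels(deps):
    """Group steps into levels; steps in the same level can run in parallel.

    `deps` maps each step name to the list of steps it depends on.
    """
    remaining = {step: set(parents) for step, parents in deps.items()}
    levels = []
    while remaining:
        # steps whose dependencies have all completed are ready to run
        ready = sorted(s for s, parents in remaining.items() if not parents)
        if not ready:
            raise ValueError("cycle detected in step dependencies")
        levels.append(ready)
        for step in ready:
            del remaining[step]
        for parents in remaining.values():
            parents.difference_update(ready)
    return levels


deps = {"xgb_train": [], "deploy": ["xgb_train"], "plot": ["xgb_train"]}
levels = execution_levels(deps)
print(levels)
```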

In [10]:
artifacts_path = 'v3io:///users/admin/mlrun/kfp/{{workflow.uid}}/'
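
The `{{workflow.uid}}` part of the path is a placeholder that the workflow engine fills in when the pipeline runs, so each run writes its artifacts under its own directory. The following is a minimal sketch of that substitution, with a hypothetical `render_path` helper and a made-up workflow ID:

```python
def render_path(template, values):
    """Replace '{{name}}' placeholders in a path template."""
    rendered = template
    for name, value in values.items():
        rendered = rendered.replace("{{" + name + "}}", value)
    return rendered


artifacts_template = "v3io:///users/admin/mlrun/kfp/{{workflow.uid}}/"
# the ID below is invented for illustration
path = render_path(artifacts_template, {"workflow.uid": "1234-abcd"})
print(path)
```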

In [11]:
@dsl.pipeline(
    name='My XGBoost training pipeline',
    description='Shows how to use mlrun.'
)
def xgb_pipeline(
    eta=[0.1, 0.2, 0.3], gamma=[0.0, 0.1, 0.2, 0.3]
):
    fn.with_code()  # update the code from notebook
    train = fn.to_step(
        NewTask(handler='xgb_train', out_path=artifacts_path, outputs=['model'])\
                .with_hyper_params({'eta': eta, 'gamma': gamma}, selector='max.accuracy'),
        name='xgb_train').apply(mount_v3io())
    
    # deploy the model using nuclio functions
    srvfn = new_model_server('mysrv3', model_class='XGBoostModel', filename='nuclio_serving.ipynb')
    deploy = srvfn.with_v3io('User','~/').deploy_step(project = 'xgb', models={'netops_v1': train.outputs['model']}, dashboard='http://10.233.3.121:8070')
    
    # feed the 1st step's results into the plot step
    plot = fn.to_step(
        NewTask(handler='plot_iter', out_path=artifacts_path, 
                inputs={'iterations': train.outputs['iteration_results']}), 
        name='plot').apply(mount_v3io()) 
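
The `plot` step receives the `iteration_results` artifact of the training step as CSV content and builds a histogram from one column. The sketch below reproduces that flow with the standard library only. The CSV rows and the `param.eta` column are invented for illustration; the `output.accuracy` column name follows the `'output.{}'.format(col)` lookup in `plot_iter`.

```python
import csv
from io import StringIO


def read_column(csv_text, column):
    """Return one CSV column as a list of floats."""
    reader = csv.DictReader(StringIO(csv_text))
    return [float(row[column]) for row in reader]


def histogram(values, num_bins):
    """Count values in equal-width bins between min(values) and max(values)."""
    low, high = min(values), max(values)
    # fall back to a width of 1.0 when all values are equal
    width = (high - low) / num_bins or 1.0
    counts = [0] * num_bins
    for v in values:
        # the maximum value falls into the last bin
        index = min(int((v - low) / width), num_bins - 1)
        counts[index] += 1
    return counts


# invented iteration results, not taken from a real run
csv_text = """iter,param.eta,output.accuracy
1,0.1,0.90
2,0.2,1.00
3,0.3,0.98
4,0.1,0.91
"""

accuracies = read_column(csv_text, "output.accuracy")
print(histogram(accuracies, num_bins=2))
```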
    

### Create a Kubeflow client and submit the pipeline with parameters

In [12]:
# for debugging, compile the pipeline DSL to a YAML file
#kfp.compiler.Compiler().compile(xgb_pipeline, 'mlrunpipe.yaml')

In [13]:
client = kfp.Client(namespace='default-tenant')
arguments = {'eta': [0.05, 0.10, 0.20, 0.30], 'gamma': [0.0, 0.1, 0.2, 0.3]}
run_result = client.create_run_from_pipeline_func(xgb_pipeline, arguments, run_name='xgb 1', experiment_name='xgb')
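
With these arguments the training step is expanded over every combination of `eta` and `gamma`, that is 4 x 4 = 16 iterations, which is consistent with the iteration numbers in the run listing further down. The sketch below enumerates such a grid. The ordering, with `eta` changing fastest, is an assumption chosen to match that listing and not a documented guarantee.

```python
from itertools import product

eta = [0.05, 0.10, 0.20, 0.30]
gamma = [0.0, 0.1, 0.2, 0.3]

# gamma is the outer loop and eta the inner loop, so eta changes fastest
grid = [
    {"iter": i, "eta": e, "gamma": g}
    for i, (g, e) in enumerate(product(gamma, eta), start=1)
]

print(len(grid))
print(grid[12])  # the 13th combination
```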



### See the run status and results in the run database

In [14]:
# connect to the run db 
db = get_run_db().connect()

In [15]:
# query the DB with filter on workflow ID (only show this workflow) 
db.list_runs('', labels=f'workflow={run_result.run_id}').show()

uid,iter,start,state,name,labels,inputs,parameters,results,artifacts
...a4007a,0,Oct 29 12:15:27,completed,plot,workflow=4fe894c2-2fa3-4aea-8adb-ec8cb30c173a kind=local owner=root host=my-xgboost-training-pipeline-vv5c2-3468883816,iterations,,,myfig.html
...b5d719,16,Oct 29 12:15:18,completed,xgb_train,workflow=4fe894c2-2fa3-4aea-8adb-ec8cb30c173a kind=local owner=root host=my-xgboost-training-pipeline-vv5c2-645664411,,eta=0.3 gamma=0.3,accuracy=0.9666666666666667,model
...b5d719,15,Oct 29 12:15:18,completed,xgb_train,workflow=4fe894c2-2fa3-4aea-8adb-ec8cb30c173a kind=local owner=root host=my-xgboost-training-pipeline-vv5c2-645664411,,eta=0.2 gamma=0.3,accuracy=0.9666666666666667,model
...b5d719,14,Oct 29 12:15:18,completed,xgb_train,workflow=4fe894c2-2fa3-4aea-8adb-ec8cb30c173a kind=local owner=root host=my-xgboost-training-pipeline-vv5c2-645664411,,eta=0.1 gamma=0.3,accuracy=1.0,model
...b5d719,13,Oct 29 12:15:18,completed,xgb_train,workflow=4fe894c2-2fa3-4aea-8adb-ec8cb30c173a kind=local owner=root host=my-xgboost-training-pipeline-vv5c2-645664411,,eta=0.05 gamma=0.3,accuracy=0.9333333333333333,model
...b5d719,12,Oct 29 12:15:18,completed,xgb_train,workflow=4fe894c2-2fa3-4aea-8adb-ec8cb30c173a kind=local owner=root host=my-xgboost-training-pipeline-vv5c2-645664411,,eta=0.3 gamma=0.2,accuracy=0.9333333333333333,model
...b5d719,11,Oct 29 12:15:18,completed,xgb_train,workflow=4fe894c2-2fa3-4aea-8adb-ec8cb30c173a kind=local owner=root host=my-xgboost-training-pipeline-vv5c2-645664411,,eta=0.2 gamma=0.2,accuracy=1.0,model
...b5d719,10,Oct 29 12:15:17,completed,xgb_train,workflow=4fe894c2-2fa3-4aea-8adb-ec8cb30c173a kind=local owner=root host=my-xgboost-training-pipeline-vv5c2-645664411,,eta=0.1 gamma=0.2,accuracy=0.9666666666666667,model
...b5d719,9,Oct 29 12:15:17,completed,xgb_train,workflow=4fe894c2-2fa3-4aea-8adb-ec8cb30c173a kind=local owner=root host=my-xgboost-training-pipeline-vv5c2-645664411,,eta=0.05 gamma=0.2,accuracy=0.9666666666666667,model
...b5d719,8,Oct 29 12:15:17,completed,xgb_train,workflow=4fe894c2-2fa3-4aea-8adb-ec8cb30c173a kind=local owner=root host=my-xgboost-training-pipeline-vv5c2-645664411,,eta=0.3 gamma=0.1,accuracy=0.9333333333333333,model
