# Running Dask on the cluster with MLRun

The Dask framework lets users parallelize their Python code and run it as a distributed process on an Iguazio cluster, which can dramatically accelerate performance. <br>
In this notebook you'll learn how to create a Dask cluster and then an mlrun function that runs as a Dask client. <br>
It also demonstrates how to parallelize a custom algorithm using Dask's `delayed` option.

For more information on dask over kubernetes: https://kubernetes.dask.org/en/latest/

## Basic configuration

Import mlrun and dask. nuclio is used only to convert the notebook code into an mlrun function.

In [1]:
# nuclio: ignore
import nuclio 

In [2]:
%nuclio config kind = "dask"
%nuclio config spec.image = "mlrun/ml-models"

%nuclio: setting kind to 'dask'
%nuclio: setting spec.image to 'mlrun/ml-models'


In [3]:
from mlrun.execution import MLClientCtx
from mlrun.datastore import DataItem
import mlrun

from dask.distributed import Client
from dask import delayed
from dask import dataframe as dd

import warnings
import numpy as np
import os

artifact_path = mlrun.set_environment(api_path = 'http://mlrun-api:8080',
                                      artifact_path = os.path.abspath('./'))

warnings.filterwarnings("ignore")

### Create a python function

This simple function reads a CSV file into a Dask dataframe and runs `groupby` and `describe` on the dataset. <br>
It also shows how to use Dask's `delayed` function to run a Python API that Dask does not support natively as a distributed task. <br>
Here we call NumPy's `asmatrix`, which interprets its input as a matrix; wrapping it in `delayed` lets Dask schedule the call as a task on the cluster.
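
The `delayed` pattern described above can also be tried on its own, outside the mlrun function. The following is a minimal sketch, assuming only that `dask` and `numpy` are installed; the `to_matrix` helper and the `chunks` list are hypothetical stand-ins for real data partitions and are not part of the notebook.

```python
import numpy as np
from dask import compute, delayed


def to_matrix(chunk):
    """A plain NumPy call that Dask does not parallelize by itself."""
    return np.asmatrix(chunk)


# Hypothetical input: four small 2-D blocks standing in for dataset partitions.
chunks = [np.arange(6).reshape(2, 3) + i for i in range(4)]

# delayed() only records each call in a task graph; nothing runs yet.
tasks = [delayed(to_matrix)(chunk) for chunk in chunks]

# compute() executes the graph. The tasks are independent of each other,
# so the scheduler is free to run them in parallel.
matrices = compute(*tasks)

print(len(matrices), type(matrices[0]).__name__)
```

Without a `Client`, `compute` uses Dask's default local scheduler; once a `Client` is connected to a cluster, the same graph is submitted to that cluster instead.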

In [4]:
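def test_dask(context: MLClientCtx,
              dataset: DataItem,
              dask_scheduler: str = None) -> None:

    # Reuse a Dask client that is already attached to the context, if any
    if hasattr(context, "dask_client"):
        dask_client = context.dask_client

    # Otherwise connect to an explicitly specified scheduler address
    elif dask_scheduler is not None:
        dask_client = Client(dask_scheduler)

    # Fall back to a local Dask cluster
    else:
        dask_client = Client()

    context.dask_client = dask_client

    # Read the CSV lazily as a Dask dataframe
    df = dd.read_csv(dataset.url)
    # Summary statistics and per-vendor row counts
    df_describe = df.describe().compute()
    df_grpby = df.groupby("VendorID").count().compute()
    # Wrap a plain NumPy call in delayed() so that it runs as a Dask task
    df_matrix = delayed(np.asmatrix)(df).compute()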

In [5]:
# nuclio: end-code

### Convert the code to an MLRun function

Use `code_to_function` to convert the code to an MLRun function, then specify the configuration for the Dask process (for example, replicas and memory). <br>
Note that the resource configurations are per worker.
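
Because the requests apply to each worker, the total the cluster must provide grows with the number of replicas. The arithmetic can be sketched as follows, using the values configured in this notebook (4 replicas, 4G of memory and 2 CPUs per worker). The `total_requests` helper is hypothetical and not part of mlrun, and the sketch ignores whatever the scheduler pod itself consumes.

```python
def total_requests(replicas: int, mem_gb_per_worker: float, cpu_per_worker: float) -> dict:
    """Multiply the per-worker requests by the number of workers."""
    return {
        "mem_gb": replicas * mem_gb_per_worker,
        "cpu": replicas * cpu_per_worker,
    }


# Values taken from the function spec used in this notebook.
totals = total_requests(replicas=4, mem_gb_per_worker=4, cpu_per_worker=2)
print(totals)  # {'mem_gb': 16, 'cpu': 8}
```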

In [6]:
fn = mlrun.code_to_function("test_dask",  kind='dask', handler="test_dask").apply(mlrun.mount_v3io())

> 2020-11-04 13:23:02,573 [info] using in-cluster config.


In [7]:
fn.spec.remote = True
fn.spec.replicas = 4
fn.spec.max_replicas = 4
fn.spec.service_type = "NodePort"
fn.with_requests(mem='4G', cpu='2')

In [8]:
DATA_URL = 'https://s3.wasabisys.com/iguazio/data/Taxi/yellow_tripdata_2019-01_subset.csv'

### Run the function

When you run the function, the result includes a link. Click the link to open the Dask monitoring dashboard.

In [9]:
fn.run(handler = test_dask,
       inputs={"dataset": DATA_URL})

> 2020-11-04 13:23:05,845 [info] starting run test-dask-test_dask uid=7876386b891041f18f797c22ba1fff88  -> http://mlrun-api:8080
> 2020-11-04 13:23:10,889 [info] trying dask client at: tcp://mlrun-test-dask-4c70401c-5.default-tenant:8786
> 2020-11-04 13:23:10,926 [info] using remote dask scheduler (mlrun-test-dask-4c70401c-5) at: tcp://mlrun-test-dask-4c70401c-5.default-tenant:8786


final state: completed


project,uid,iter,start,state,name,labels,inputs,parameters,results,artifacts
default,...ba1fff88,0,Nov 04 13:23:05,completed,test-dask-test_dask,v3io_user=test kind=dask owner=test host=jupyter-test-655d8b5fcd-vnfzp,dataset,,,


to track results use .show() or .logs() or in CLI: 
!mlrun get run 7876386b891041f18f797c22ba1fff88 --project default , !mlrun logs 7876386b891041f18f797c22ba1fff88 --project default
> 2020-11-04 13:24:23,551 [info] run executed, status=completed


<mlrun.model.RunObject at 0x7f829528ea90>