# Running Dask on the cluster with mlrun

The Dask framework enables users to parallelize their Python code and run it as a distributed process on an Iguazio cluster, which can dramatically accelerate performance. <br>
In this notebook you'll learn how to create a Dask cluster and then an MLRun function that runs as a Dask client. <br>
It also demonstrates how to parallelize a custom algorithm using the Dask Delayed option.

For more information on Dask on Kubernetes: https://kubernetes.dask.org/en/latest/

## Basic configuration

Import mlrun and dask. nuclio is used only to convert the code into an MLRun function.

In [1]:
# Make sure that mlrun is installed. If it is already installed, skip this step.
# To install mlrun, run the following:

!/User/align_mlrun.sh

Both server & client are aligned (0.6.0rc13).


## Load sample data

In [2]:
!mkdir -p /User/examples/

In [3]:
%%sh
CSV_PATH="/User/examples/ytrip.csv"
curl -L "https://s3.wasabisys.com/iguazio/data/Taxi/yellow_tripdata_2019-01_subset.csv" > ${CSV_PATH}

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 84.9M  100 84.9M    0     0  8206k      0  0:00:10  0:00:10 --:--:-- 8639k


In [4]:
# nuclio: ignore
import nuclio 

In [5]:
# nuclio: start-code
%nuclio config kind = "job"
%nuclio config spec.image = "mlrun/ml-models"

%nuclio: setting kind to 'job'
%nuclio: setting spec.image to 'mlrun/ml-models'


In [None]:
from mlrun.execution import MLClientCtx
from mlrun.datastore import DataItem

from dask.distributed import Client
from dask import delayed
from dask import dataframe as dd

import warnings
import numpy as np
import os

warnings.filterwarnings("ignore")

### Create a python function

This simple function reads a CSV file using a Dask dataframe and runs the group-by and describe functions on the dataset. <br>
It also shows how to use the Dask delayed function to run a Python API that is not natively supported by Dask, so that Dask can run it as a distributed process. <br>
In this case we run numpy asmatrix, which interprets the input as a matrix. Wrapping the call with Dask Delayed schedules it as a Dask task.
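The delayed pattern described above can be tried in isolation before it is used inside an MLRun function. The following is a minimal sketch, not the notebook's own code: `row_total` and `rows` are hypothetical stand-ins for an API that Dask does not wrap natively, and the default local scheduler is used instead of a cluster.

```python
from dask import delayed


def row_total(row):
    # Hypothetical stand-in for a plain Python call that Dask does not wrap natively.
    return sum(row)


rows = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]

# delayed() wraps the function so that calling it records a task
# instead of running it immediately.
lazy_totals = [delayed(row_total)(row) for row in rows]

# Combining lazy results is also lazy; nothing has been executed yet.
lazy_grand_total = delayed(sum)(lazy_totals)

# compute() hands the task graph to the scheduler, which is free to run
# the independent row_total tasks in parallel.
grand_total = lazy_grand_total.compute()
print(grand_total)
```

The three `row_total` tasks do not depend on each other, which is what gives the scheduler room to run them concurrently. A single delayed call, by contrast, is one task.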

In [None]:
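def test_dask(context: MLClientCtx,
              dataset: DataItem,
              dask_client: str=None) -> None:
    
    # Connect to an existing scheduler if an address is given;
    # otherwise start a local Dask cluster
    if dask_client:
        client = Client(dask_client)
    else:
        client = Client()
        
    # Read the dataset as a Dask dataframe
    df = dataset.as_df(df_module=dd)
    # Summary statistics and per-vendor row counts
    df_describe = df.describe().compute()
    df_grpby = df.groupby("VendorID").count().compute()
    # Wrap a plain numpy call with delayed so that Dask schedules it
    df_matrix = delayed(np.asmatrix)(df).compute()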

In [None]:
# nuclio: end-code

### Set up the environment

In [None]:
import mlrun
artifact_path = mlrun.set_environment(api_path = mlrun.mlconf.dbpath or 'http://mlrun-api:8080',
                                      artifact_path = os.path.abspath('./'))

### Convert the code to an MLRun function

Use code_to_function to convert the code to an MLRun function, and specify the configuration for the Dask process (e.g. replicas, memory, etc.). <br>
Note that the resource configurations are per worker.

In [None]:
fn = mlrun.code_to_function("test_dask",  kind='job', handler="test_dask").apply(mlrun.mount_v3io())

### Initialize the Dask cluster

In [None]:
dsf = mlrun.new_function("dask_init", kind='dask', image='mlrun/ml-models').apply(mlrun.mount_v3io())

In [None]:
dsf.spec.remote = True
dsf.spec.replicas = 2
dsf.spec.max_replicas = 4
dsf.spec.service_type = "NodePort"
dsf.with_requests(mem='2G', cpu='2')
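
Because the resource requests are per worker, the total footprint of the cluster scales with the number of workers. The arithmetic can be sketched as follows. This is an illustration only: `cluster_footprint` is a hypothetical helper, '2G' is treated as 2 GB, `max_replicas` is assumed to be an upper bound on the worker count, and the scheduler pod is not counted.

```python
def cluster_footprint(workers, mem_gb_per_worker, cpu_per_worker):
    """Return (total_mem_gb, total_cpu) requested by `workers` identical workers."""
    return workers * mem_gb_per_worker, workers * cpu_per_worker


# Values taken from the configuration cell above.
replicas, max_replicas = 2, 4
mem_gb, cpu = 2, 2

# Requests with the initial number of workers.
baseline = cluster_footprint(replicas, mem_gb, cpu)
# Requests if the cluster grows to max_replicas (assumed upper bound).
ceiling = cluster_footprint(max_replicas, mem_gb, cpu)

print(f"baseline: {baseline[0]} GB, {baseline[1]} CPUs")
print(f"ceiling:  {ceiling[0]} GB, {ceiling[1]} CPUs")
```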

In [None]:
client = dsf.client
client

## Replace DASK_CLIENT with the client scheduler address (see above)

In [None]:
DATA_URL = '/User/examples/ytrip.csv'
DASK_CLIENT = client.scheduler.address
# e.g. DASK_CLIENT = 'tcp://mlrun-dask-init-9d8122b2-b.default-tenant:8786'
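
If the address is pasted in by hand rather than read from the client, it can help to check its shape before passing it to the job. The following is a minimal sketch using only the standard library. `split_scheduler_address` is a hypothetical helper, and accepting only the `tcp` and `tls` schemes is an assumption made for this example.

```python
from urllib.parse import urlparse


def split_scheduler_address(address):
    """Return (host, port) for an address of the form 'tcp://host:port'."""
    parsed = urlparse(address)
    # Assumption for this sketch: only tcp:// and tls:// addresses are expected.
    if parsed.scheme not in ("tcp", "tls"):
        raise ValueError(f"unexpected scheme in scheduler address: {address!r}")
    if not parsed.hostname or parsed.port is None:
        raise ValueError(f"scheduler address needs a host and a port: {address!r}")
    return parsed.hostname, parsed.port


# The example address from the cell above.
host, port = split_scheduler_address(
    "tcp://mlrun-dask-init-9d8122b2-b.default-tenant:8786"
)
print(host, port)
```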

### Run the function

When you run the function, a link appears as part of the result. Clicking this link takes you to the Dask monitoring dashboard.

In [None]:
fn.run(handler = test_dask,
       inputs={"dataset": DATA_URL},
       params={"dask_client": DASK_CLIENT})

## Track the progress in the UI

Users can view the progress and detailed information in the MLRun UI by clicking on the uid above. <br>
To track the Dask progress in the Dask UI, click on the "dashboard link" above the "client" section.