# Running Dask on the cluster with mlrun

The Dask framework lets users parallelize their Python code and run it as a distributed process on an Iguazio cluster, which can substantially accelerate performance. <br>
In this notebook you'll learn how to create a Dask cluster and then an MLRun function that runs as a Dask client. <br>
It also demonstrates how to parallelize a custom algorithm using the Dask Delayed option.

For more information on dask over kubernetes: https://kubernetes.dask.org/en/latest/

## Basic configuration

Import mlrun and dask. nuclio is used only to convert the code into an MLRun function.

## Load sample data

In [86]:
!mkdir -p /User/examples/

In [87]:
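import mlrun
import requests

# Download the sample taxi dataset and store it under /User/examples/
source_url = mlrun.get_sample_path("/data/Taxi/yellow_tripdata_2019-01_subset.csv")
response = requests.get(source_url, allow_redirects=True)
response.raise_for_status()  # stop here if the download failed
with open('/User/examples/ytrip.csv', 'wb') as csv_file:
    csv_file.write(response.content)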

In [88]:
# mlrun: start-code
from mlrun.execution import MLClientCtx
from mlrun.datastore import DataItem

from dask.distributed import Client
from dask import delayed
from dask import dataframe as dd

import warnings
import numpy as np
import os

warnings.filterwarnings("ignore")

### Create a Python function

This simple function reads a CSV file into a Dask dataframe and runs the group-by and describe functions on the dataset. <br>
It also shows how to use the Dask delayed function to run a Python API that Dask does not support natively, so that Dask runs it as a distributed process. <br>
In this case we run NumPy's asmatrix, which interprets the input as a matrix. With Dask Delayed, the call is scheduled and executed by Dask.
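
As a minimal sketch of the delayed pattern described above (not the notebook's own function), the following wraps `np.asmatrix` and a small hypothetical helper, `row_total`, with `dask.delayed`. The sample rows are made up for illustration, and the code runs on Dask's default local scheduler rather than on a cluster.

```python
import numpy as np
from dask import delayed

# Hypothetical sample data, used only for illustration
rows = [[1, 2], [3, 4]]


def row_total(row):
    """Plain Python helper that Dask knows nothing about."""
    return sum(row)


# delayed() wraps a callable; calling the wrapper builds a lazy task instead of running it
lazy_matrix = delayed(np.asmatrix)(rows)
lazy_totals = [delayed(row_total)(row) for row in rows]
lazy_sum = delayed(sum)(lazy_totals)

# compute() executes the task graph; independent tasks may run in parallel
matrix = lazy_matrix.compute()
grand_total = lazy_sum.compute()

print(matrix.shape)
print(grand_total)
```

Nothing runs until `compute()` is called, which is what lets Dask decide where and when each task executes.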

In [89]:
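def test_dask(context,
              dataset: DataItem,
              dask_client: str=None) -> None:
    
    # Connect to the given scheduler address, or start a local cluster if none is given
    if dask_client:
        client = Client(dask_client)
    else:
        client = Client()
        
    # Read the CSV as a Dask dataframe and compute summary statistics
    df = dataset.as_df(df_module=dd)
    df_describe = df.describe().compute()
    df_grpby = df.groupby("VendorID").count().compute()
    # Run a NumPy call that Dask does not provide natively by wrapping it with delayed
    df_matrix = delayed(np.asmatrix)(df).compute()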

In [90]:
# mlrun: end-code

### Set up the environment

In [91]:
import mlrun
artifact_path = mlrun.set_environment(api_path = mlrun.mlconf.dbpath or 'http://mlrun-api:8080',
                                      artifact_path = os.path.abspath('./'))

> 2022-06-07 14:18:56,952 [info] loaded project default from MLRun DB


### Convert the code to an MLRun function

Use code_to_function to convert the code to an MLRun function, and specify the configuration for the Dask process (e.g. replicas, memory, etc.). <br>
Note that the resource configurations are per worker.
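
Because resource requests apply to each worker, the total that the cluster asks for grows with the number of replicas. The arithmetic below is a rough sketch using the values configured in this notebook (2 replicas, up to 4, with 2G of memory and 2 CPUs per worker). The helper `total_requests` is hypothetical and not part of MLRun, and it ignores the scheduler pod.

```python
def total_requests(replicas: int, mem_gb_per_worker: int, cpu_per_worker: int) -> dict:
    """Return the aggregate memory and CPU requested by all workers."""
    return {
        "mem_gb": replicas * mem_gb_per_worker,
        "cpu": replicas * cpu_per_worker,
    }


# Requests at the initial replica count and at the configured maximum
baseline = total_requests(replicas=2, mem_gb_per_worker=2, cpu_per_worker=2)
peak = total_requests(replicas=4, mem_gb_per_worker=2, cpu_per_worker=2)

print(baseline)  # {'mem_gb': 4, 'cpu': 4}
print(peak)      # {'mem_gb': 8, 'cpu': 8}
```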

### Init dask cluster

In [92]:
dsf = mlrun.new_function("dask_init", kind='dask', image='mlrun/ml-models').apply(mlrun.mount_v3io())

In [93]:
dsf.spec.remote = True
dsf.spec.replicas = 2
dsf.spec.max_replicas = 4
dsf.spec.service_type = "NodePort"
dsf.with_requests(mem='2G', cpu='2')

In [94]:
client = dsf.client
client

> 2022-06-07 14:18:57,008 [info] trying dask client at: tcp://mlrun-dask-init-d0ef8acd-8.default-tenant:8786
> 2022-06-07 14:19:37,599 [info] using remote dask scheduler (mlrun-dask-init-d1179375-8) at: tcp://mlrun-dask-init-d1179375-8.default-tenant:8786


Connection method: Direct <br>
Dashboard: http://mlrun-dask-init-d1179375-8.default-tenant:8787/status

Comm: tcp://10.200.108.145:8786 <br>
Workers: 0 <br>
Dashboard: http://10.200.108.145:8787/status <br>
Total threads: 0 <br>
Started: Just now <br>
Total memory: 0 B


## Replace DASK_CLIENT with the client scheduler address (see above)

In [95]:
DATA_URL = "/User/examples/ytrip.csv"
DASK_CLIENT = str(client.scheduler.address)
# e.g. DASK_CLIENT = 'tcp://mlrun-dask-init-9d8122b2-b.default-tenant:8786'

In [96]:
DASK_CLIENT

'tcp://mlrun-dask-init-d1179375-8.default-tenant:8786'
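
The scheduler address is a plain `tcp://host:port` string, so it can be sanity-checked before it is passed to the job as a parameter. The sketch below uses only the standard library and assumes the `tcp` scheme seen in this notebook; `check_scheduler_address` is a hypothetical helper, and the address is the one printed above.

```python
from urllib.parse import urlparse


def check_scheduler_address(address: str) -> tuple:
    """Split a Dask scheduler address into (host, port), raising on malformed input."""
    parsed = urlparse(address)
    if parsed.scheme != "tcp" or not parsed.hostname or parsed.port is None:
        raise ValueError(f"unexpected scheduler address: {address!r}")
    return parsed.hostname, parsed.port


host, port = check_scheduler_address(
    "tcp://mlrun-dask-init-d1179375-8.default-tenant:8786"
)
print(host, port)
```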

In [97]:
fn = mlrun.code_to_function("test_dask",  kind='job', handler="test_dask", image='mlrun/ml-base').apply(mlrun.mount_v3io())

### Run the function

When you run the function, a link appears as part of the result. Clicking this link takes you to the Dask monitoring dashboard.

In [101]:
fn.run(name ='dasking',
       handler = 'test_dask',
       inputs={"dataset": DATA_URL},
       params={"dask_client": DASK_CLIENT}
      )

> 2022-06-07 14:21:19,759 [info] starting run dasking uid=38f5b7d118be4b5988ee8d767322e6bc DB=http://mlrun-api:8080
> 2022-06-07 14:21:19,935 [info] Job is running in the background, pod: dasking-4tmzj
> 2022-06-07 14:21:37,411 [info] run executed, status=completed
final state: completed


| project | uid | iter | start | state | name | labels | inputs | parameters | results | artifacts |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| default | ...7322e6bc | 0 | Jun 07 14:21:25 | completed | dasking | v3io_user=avia<br>kind=job<br>owner=avia<br>mlrun/client_version=1.0.2<br>host=dasking-4tmzj | dataset | dask_client=tcp://mlrun-dask-init-d1179375-8.default-tenant:8786 | | |





> 2022-06-07 14:21:39,307 [info] run executed, status=completed


<mlrun.model.RunObject at 0x7fb2d16b57d0>

## Track the progress in the UI

Users can view the progress and detailed information in the MLRun UI by clicking the uid above. <br>
To track the Dask progress in the Dask UI, click the "dashboard link" above the "client" section.