# Run Distributed Dask in Azure Machine Learning

To run this notebook:

1. Create a new "Machine Learning" resource in the [Azure Portal](https://portal.azure.com/).
2. Install the Azure Machine Learning SDK (core package) as follows:

```
pip install azureml-core
```

This tutorial uses a built-in Dask-ML estimator (```dask_ml.linear_model.LinearRegression```), but you can also run **scikit-learn compliant functions and jobs** in a distributed manner on a Dask cluster.<br>
For details on distributed Dask, see "[Run Distributed Dask on Azure Kubernetes Service](https://tsmatz.wordpress.com/2021/05/17/dask-distributed-on-azure-kubernetes/)".

## 1. Create script for Distributed Dask training (Regression Example)

Create a directory for saving your script.

In [1]:
import os
script_folder = './script'
os.makedirs(script_folder, exist_ok=True)

Save the script file (```dask_distributed_example.py```) for distributed Dask training.<br>
The job runs on 4 nodes, one MPI rank per node, with the following roles:

- Rank 0 : Dask Scheduler
- Rank 1 : Dask Worker
- Rank 2 : Dask Worker
- Rank 3 : Dask Client
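
The rank-to-role assignment above can be sketched as a small helper. This is only an illustration: `role_for_rank` is a hypothetical name that the script below does not use, and it assumes the scheduler is always rank 0 and the client is always the last rank.

```python
def role_for_rank(rank: int, world_size: int = 4) -> str:
    """Return the Dask role for an MPI rank (hypothetical helper)."""
    if world_size < 3:
        raise ValueError("need at least 3 ranks: scheduler, worker, client")
    if not 0 <= rank < world_size:
        raise ValueError(f"rank {rank} is outside 0..{world_size - 1}")
    if rank == 0:
        return "scheduler"
    if rank == world_size - 1:
        return "client"
    # Every rank between the first and the last is a worker.
    return "worker"


roles = [role_for_rank(r) for r in range(4)]
print(roles)
```

With more than four ranks, this mapping would simply add workers, whereas the script below hard-codes rank 3 as the client.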

This example only prints the evaluation score, but you can also save the trained model to the AML run outputs.

In [2]:
%%writefile script/dask_distributed_example.py
import asyncio
from dask.distributed import Scheduler, Worker, Client
from dask_ml.datasets import make_regression
from dask_ml.model_selection import train_test_split
from dask_ml.linear_model import LinearRegression
from mpi4py import MPI
import socket
import time

mpi_comm = MPI.COMM_WORLD
mpi_rank = mpi_comm.Get_rank()
if mpi_rank == 0 :
    #
    # Rank 0 is scheduler
    #

    async def f_scheduler():
        # Start scheduler
        s = Scheduler(port=8786)
        # Send address to workers
        ipaddr = socket.gethostbyname(socket.gethostname())
        scheduler_info = {
            "address"  : ipaddr + ":8786"
        }
        scheduler_info = mpi_comm.bcast(scheduler_info, root=0)
        # Wait requests
        s = await s
        # Finalize
        await s.finished()
    asyncio.get_event_loop().run_until_complete(f_scheduler())

elif mpi_rank == 3 :
    #
    # Rank 3 is client
    #

    # Wait for scheduler's message (with address info)
    scheduler_info = mpi_comm.bcast(None, root=0)
    scheduler_address = scheduler_info["address"]

    # Wait for worker's ready message
    req = mpi_comm.irecv(source=1, tag=1)
    data = req.wait()
    req = mpi_comm.irecv(source=2, tag=2)
    data = req.wait()

    # Create client
    time.sleep(3)
    c = Client(scheduler_address)

    # Run program !
    X, y = make_regression(
      n_samples=100000,
      n_features=4,
      chunks=50)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1)
    lr = LinearRegression()
    lr.fit(X_train, y_train)
    print('##### Trained Model Result #####')
    print('Score : {}'.format(lr.score(X_test, y_test)))

    # simple test
    # y = c.submit(lambda x: x + 1, 10)
    # print('The result is {}'.format(y.result()))

    # Stop scheduler and workers
    c.shutdown()

else :
    #
    # Others (Rank 1, 2) are workers
    #

    # Wait for scheduler's message (with address info)
    scheduler_info = mpi_comm.bcast(None, root=0)
    scheduler_address = scheduler_info["address"]
    async def f_worker(scheduler_address):
        # Send ready message to client (rank 3)
        req = mpi_comm.isend('ready', dest=3, tag=mpi_rank)
        req.wait()
        # Start Worker
        w = await Worker(scheduler_address)
        # Wait until the worker finishes
        await w.finished()
    asyncio.get_event_loop().run_until_complete(f_worker(scheduler_address))

Overwriting script/dask_distributed_example.py
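
In the script, each worker sends its MPI "ready" message before the Dask `Worker` has actually started, which is why the client sleeps for 3 seconds before connecting. One possible alternative, sketched here, is to let the client block until the scheduler reports the expected number of workers by calling `Client.wait_for_workers`. The helper name `wait_until_ready` and the 60-second timeout are assumptions for illustration and are not part of the original script.

```python
def wait_until_ready(client, n_workers=2, timeout=60):
    """Block until `n_workers` workers have registered with the scheduler.

    `client` is expected to be a dask.distributed.Client connected to the
    scheduler; it is returned unchanged so the call can be chained.
    """
    client.wait_for_workers(n_workers=n_workers, timeout=timeout)
    return client


# In the client branch (rank 3) this could stand in for `time.sleep(3)`:
#     c = wait_until_ready(Client(scheduler_address))
```

This keeps the start-up time tied to when the workers actually register instead of a fixed delay.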


## 2. Connect to Azure Machine Learning (Create AML config)

Connect to the Azure Machine Learning (AML) workspace, which is the resource created above.<br>
Fill in the workspace name, subscription ID, and resource group name below. (You can find these values on the AML resource blade in the Azure Portal.)

In [3]:
from azureml.core import Workspace
import azureml.core

ws = Workspace(
    workspace_name = "{AML WORKSPACE NAME}",
    subscription_id = "{SUBSCRIPTION ID}",
    resource_group = "{RESOURCE GROUP NAME}")
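
As an alternative to passing the three values in code, the SDK can read them from a `config.json` file through `Workspace.from_config()`. The sketch below only writes such a file with the standard library. The helper name `write_aml_config` is hypothetical, and the placeholder values are the same ones used above.

```python
import json
from pathlib import Path


def write_aml_config(subscription_id, resource_group, workspace_name,
                     folder=".azureml"):
    """Write the three workspace identifiers to <folder>/config.json."""
    path = Path(folder) / "config.json"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps({
        "subscription_id": subscription_id,
        "resource_group": resource_group,
        "workspace_name": workspace_name,
    }, indent=2))
    return path


config_path = write_aml_config(
    subscription_id="{SUBSCRIPTION ID}",
    resource_group="{RESOURCE GROUP NAME}",
    workspace_name="{AML WORKSPACE NAME}")

# With the file in place, the workspace could then be loaded with:
#     ws = Workspace.from_config()
```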

## 3. Create cluster (multiple nodes)

Create a remote cluster of 4 CPU VM nodes.

In [4]:
from azureml.core import Workspace
import azureml.core
from azureml.core.compute import AmlCompute, ComputeTarget
from azureml.core.compute_target import ComputeTargetException

# Create AML compute (or Get existing one)
# (Total 4 : scheduler, worker1, worker2, client)
try:
    compute_target = ComputeTarget(workspace=ws, name='cluster01')
    print('found existing:', compute_target.name)
except ComputeTargetException:
    print('creating new.')
    compute_config = AmlCompute.provisioning_configuration(
        vm_size='Standard_DS2_v2',
        min_nodes=4,
        max_nodes=4)
    compute_target = ComputeTarget.create(ws, 'cluster01', compute_config)
    compute_target.wait_for_completion(show_output=True)

creating new.
Creating.........
SucceededProvisioning operation finished, operation "Succeeded"
Succeeded.................................................................................................................
AmlCompute wait for completion finished

Minimum number of nodes requested have been provisioned


## 4. Generate config for run

Generate a script run configuration in AML.<br>
We use an AML container image configured with Python 3.8 and Open MPI.

In [5]:
from azureml.core import ScriptRunConfig, Experiment, Run
from azureml.core.runconfig import RunConfiguration, DockerConfiguration
from azureml.core.conda_dependencies import CondaDependencies
from azureml.core.runconfig import MpiConfiguration

conda_dep = CondaDependencies.create()
conda_dep.set_python_version('3.8')
conda_dep.add_pip_package('mpi4py')
conda_dep.add_conda_package('dask')
conda_dep.add_conda_package('distributed')
conda_dep.add_conda_package('dask-ml')
run_config = RunConfiguration(
    framework='python',
    conda_dependencies=conda_dep)
run_config.target = compute_target.name
run_config.docker = DockerConfiguration(use_docker=True)
# (see https://github.com/Azure/AzureML-Containers for AML base images)
run_config.environment.docker.base_image = 'mcr.microsoft.com/azureml/openmpi3.1.2-ubuntu18.04'
run_config.communicator = 'OpenMpi'
run_config.node_count = 4
run_config.mpi.process_count_per_node = 1

# getting master node's ip like "10.0.0.4" by $AZ_BATCH_MASTER_NODE
# (or use $AZ_BATCHAI_MPI_MASTER_NODE)
src = ScriptRunConfig(
    source_directory='./script',
    script='dask_distributed_example.py',
    run_config=run_config,
    # 4 nodes x 1 process per node = the 4 MPI ranks the script expects
    distributed_job_config=MpiConfiguration(node_count=4))
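
The script assigns roles by hard-coded rank (0 to 3), so the MPI job has to launch exactly four processes, that is, `node_count * process_count_per_node` must equal 4. A small sanity check such as the following could catch a mismatch before the run is submitted. The function name `check_world_size` and its `required` parameter are hypothetical.

```python
def check_world_size(node_count: int, process_count_per_node: int,
                     required: int = 4) -> int:
    """Raise if the MPI job would not start the number of ranks the script expects."""
    world_size = node_count * process_count_per_node
    if world_size != required:
        raise ValueError(
            f"script expects {required} ranks, but the run would start {world_size}")
    return world_size


# Values set on run_config above: 4 nodes, 1 process per node.
print(check_world_size(node_count=4, process_count_per_node=1))
```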



## 5. Run !

In [6]:
exp = Experiment(workspace=ws, name='dask_distributed')
run = exp.submit(config=src)
run.wait_for_completion(show_output=True)

RunId: dask_distributed_1621128875_af6ca77b
Web View: https://ml.azure.com/runs/dask_distributed_1621128875_af6ca77b?wsid=/subscriptions/b3ae1c15-4fef-4362-8c3a-5d804cdeb18d/resourcegroups/DASK-ML-TEST/workspaces/ws01&tid=72f988bf-86f1-41af-91ab-2d7cd011db47

Streaming azureml-logs/55_azureml-execution-tvmps_25245805a89cdb5cad28699070cc66036c7dcc6f84b025a9bafdcb317bfa9c25_d.txt

2021-05-16T01:34:59Z Successfully mounted a/an Blobfuse File System at /mnt/batch/tasks/shared/LS_root/jobs/ws01/azureml/dask_distributed_1621128875_af6ca77b/mounts/workspaceblobstore
2021-05-16T01:35:00Z Starting output-watcher...
2021-05-16T01:35:01Z IsDedicatedCompute == True, won't poll for Low Pri Preemption
Login Succeeded
Using default tag: latest
latest: Pulling from azureml/azureml_00000ce8edf1f5f474de6e888439e341
01bf7da0a88c: Pulling fs layer
f3b4a5f15c7a: Pulling fs layer
57ffbe87baa1: Pulling fs layer
86120caa19f5: Pulling fs layer
c0f2d44469de: Pulling fs layer
638bc09d59ce: Pulling fs layer
cec7e


Streaming azureml-logs/75_job_post-tvmps_25245805a89cdb5cad28699070cc66036c7dcc6f84b025a9bafdcb317bfa9c25_d.txt

[2021-05-16T01:38:38.736105] Entering job release
[2021-05-16T01:38:39.791136] job release stage : copy_batchai_cached_logs starting...
[2021-05-16T01:38:39.791193] job release stage : copy_batchai_cached_logs completed...

Execution Summary
RunId: dask_distributed_1621128875_af6ca77b
Web View: https://ml.azure.com/runs/dask_distributed_1621128875_af6ca77b?wsid=/subscriptions/b3ae1c15-4fef-4362-8c3a-5d804cdeb18d/resourcegroups/DASK-ML-TEST/workspaces/ws01&tid=72f988bf-86f1-41af-91ab-2d7cd011db47



{'runId': 'dask_distributed_1621128875_af6ca77b',
 'target': 'cluster01',
 'status': 'Completed',
 'startTimeUtc': '2021-05-16T01:34:51.95837Z',
 'endTimeUtc': '2021-05-16T01:38:52.894763Z',
 'properties': {'_azureml.ComputeTargetType': 'amlcompute',
  'ContentSnapshotId': 'beb8381c-228f-4f5d-ac60-765d46f5b1e8',
  'ProcessInfoFile': 'azureml-logs/process_info.json',
  'ProcessStatusFile': 'azureml-logs/process_status.json'},
 'inputDatasets': [],
 'outputDatasets': [],
 'runDefinition': {'script': 'dask_distributed_example.py',
  'command': '',
  'useAbsolutePath': False,
  'arguments': [],
  'sourceDirectoryDataStore': None,
  'framework': 'Python',
  'communicator': 'Mpi',
  'target': 'cluster01',
  'dataReferences': {},
  'data': {},
  'outputData': {},
  'jobName': None,
  'maxRunDurationSeconds': None,
  'nodeCount': 4,
  'priority': None,
  'credentialPassthrough': False,
  'identity': None,
  'environment': {'name': 'Experiment dask_distributed Environment',
   'version': 'Autos

## 6. See the results

Let's look at the results. They are all stored in the logs of the Azure Machine Learning experiment run.<br>
In this example we only look at the AML logs for standard output, but you can also create a model file and save it in the run's "```outputs```" folder.
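
A minimal sketch of saving a model file, assuming the fitted estimator can be pickled: files that the training script writes under `./outputs` are uploaded to the run by AML. The helper name `save_model` and the file name `model.pkl` are hypothetical, and a plain dict stands in for the fitted estimator here.

```python
import os
import pickle


def save_model(model, folder="outputs", filename="model.pkl"):
    """Pickle `model` into `folder` and return the file path."""
    os.makedirs(folder, exist_ok=True)
    path = os.path.join(folder, filename)
    with open(path, "wb") as f:
        pickle.dump(model, f)
    return path


# In the training script this would be called on the client rank after
# `lr.fit(...)`, for example save_model(lr). A dict stands in for the model.
saved_path = save_model({"coef": [1.0, 2.0], "intercept": 0.5})
print(saved_path)
```

Only the client rank holds the fitted estimator in this tutorial, so that is the rank that would write the file.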

In [7]:
# You can see and download results (container logs, model files, etc).
run.get_file_names()

['azureml-logs/55_azureml-execution-tvmps_25245805a89cdb5cad28699070cc66036c7dcc6f84b025a9bafdcb317bfa9c25_d.txt',
 'azureml-logs/55_azureml-execution-tvmps_7f1986873306d9d3a5c4c22886c2f51b6766e032af5ff24afa08251ef60e3af1_d.txt',
 'azureml-logs/55_azureml-execution-tvmps_86ec332c27a655a9669f0eb859c1939385762d7a7f9bc7092eb4d9de35120e29_d.txt',
 'azureml-logs/55_azureml-execution-tvmps_896dedd4b1920b628175db9ef35c574293e50e6b95264e763a9f97c979f74c63_d.txt',
 'azureml-logs/65_job_prep-tvmps_25245805a89cdb5cad28699070cc66036c7dcc6f84b025a9bafdcb317bfa9c25_d.txt',
 'azureml-logs/65_job_prep-tvmps_7f1986873306d9d3a5c4c22886c2f51b6766e032af5ff24afa08251ef60e3af1_d.txt',
 'azureml-logs/65_job_prep-tvmps_86ec332c27a655a9669f0eb859c1939385762d7a7f9bc7092eb4d9de35120e29_d.txt',
 'azureml-logs/65_job_prep-tvmps_896dedd4b1920b628175db9ef35c574293e50e6b95264e763a9f97c979f74c63_d.txt',
 'azureml-logs/70_driver_log_0.txt',
 'azureml-logs/70_driver_log_1.txt',
 'azureml-logs/70_driver_log_2.txt',
 'azu

Now let's check the evaluation results from rank 3 (the Dask client). You can download these logs to your local machine.<br>
**Make sure to change the log filename below to match your run.**
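
Rather than typing the filename by hand, one option is to pick it out of the list returned by `run.get_file_names()`. The sketch below works on a plain list of strings. `find_driver_log` is a hypothetical helper, and the sample names follow the `70_driver_log_<rank>.txt` pattern seen in the listing above.

```python
def find_driver_log(file_names, rank=3):
    """Return the driver log name for `rank`, or None if it is not listed."""
    suffix = f"70_driver_log_{rank}.txt"
    for name in file_names:
        if name.endswith(suffix):
            return name
    return None


sample = [
    "azureml-logs/70_driver_log_0.txt",
    "azureml-logs/70_driver_log_1.txt",
    "azureml-logs/70_driver_log_2.txt",
    "azureml-logs/70_driver_log_3.txt",
]
log_name = find_driver_log(sample, rank=3)
print(log_name)
```

In the notebook, the same helper could be called as `find_driver_log(run.get_file_names())` and the result passed to `run.download_file`.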

In [8]:
# Download and see Dask Client logs
run.download_file(
    name='azureml-logs/70_driver_log_3.txt',
    output_file_path='remote_logs/70_driver_log_3.txt')

In [14]:
!tail -n 30 remote_logs/70_driver_log_3.txt

[55145dbb0fa747b8bf70cb744ca7ed29000004:00108] select: init of component vader returned failure
[55145dbb0fa747b8bf70cb744ca7ed29000004:00108] mca: base: close: component vader closed
[55145dbb0fa747b8bf70cb744ca7ed29000004:00108] mca: base: close: unloading component vader
[55145dbb0fa747b8bf70cb744ca7ed29000004:00108] mca: bml: Using self btl for send to [[7244,1],3] on node 55145dbb0fa747b8bf70cb744ca7ed29000004
[55145dbb0fa747b8bf70cb744ca7ed29000004:00108] btl:tcp: path from 10.0.0.8 to 10.0.0.6: IPV4 PRIVATE SAME NETWORK
[55145dbb0fa747b8bf70cb744ca7ed29000004:00108] btl:tcp: now connected to 10.0.0.6, process [[7244,1],1]
[55145dbb0fa747b8bf70cb744ca7ed29000004:00108] btl:tcp: path from 10.0.0.8 to 10.0.0.7: IPV4 PRIVATE SAME NETWORK
[55145dbb0fa747b8bf70cb744ca7ed29000004:00108] btl:tcp: now connected to 10.0.0.7, process [[7244,1],2]
##### Trained Model Result #####
Score : 0.9999999997470762
tornado.application - ERROR - Exception in callback <bound method Client._h
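
To pull the score out of the downloaded log programmatically, a regular expression over the `Score : ...` line printed by the script is enough. This is a sketch: `extract_score` is a hypothetical helper, and the sample text reproduces two lines from the output above.

```python
import re


def extract_score(log_text):
    """Return the last 'Score : <float>' value found in the log, or None."""
    matches = re.findall(r"^Score : ([-+0-9.eE]+)\s*$", log_text,
                         flags=re.MULTILINE)
    return float(matches[-1]) if matches else None


sample_log = (
    "##### Trained Model Result #####\n"
    "Score : 0.9999999997470762\n"
)
score = extract_score(sample_log)
print(score)
```

For the real file, the text could be read with `open('remote_logs/70_driver_log_3.txt').read()` and passed to the same function.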

## 7. Remove cluster (Clean-up)

In [15]:
# Delete cluster (nodes) in AML workspace
mycompute = AmlCompute(workspace=ws, name='cluster01')
mycompute.delete()