# Run Distributed Dask in Azure Machine Learning

To run this notebook:

1. Create a new "Machine Learning" resource in the [Azure Portal](https://portal.azure.com/).
2. Install the Azure Machine Learning SDK (core package) as follows:

```
pip install azureml-core
```

This tutorial uses a built-in Dask ML estimator (```dask_ml.linear_model.LinearRegression```), but you can also run **scikit-learn compliant functions and jobs** in a distributed manner on a Dask cluster.<br>
For details on distributed Dask, see "[Run Distributed Dask on Azure Kubernetes Service](https://tsmatz.wordpress.com/2021/05/17/dask-distributed-on-azure-kubernetes/)".

## 1. Create script for Distributed Dask training (Regression Example)

Create a directory for saving your script.

In [1]:
import os
script_folder = './script'
os.makedirs(script_folder, exist_ok=True)

Save a script file (```dask_distributed_example.py```) for Distributed Dask training.<br>
The job runs on 4 nodes, one MPI process per node, with the following roles.

- Rank 0 : Dask Scheduler
- Rank 1 : Dask Worker
- Rank 2 : Dask Worker
- Rank 3 : Dask Client

In this example, we only print the evaluation score, but you can also save the trained model to the run's outputs in AML.
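
The rank layout listed above can be written as a small lookup, which may help if you later change the node count. This is a minimal sketch, not part of the training script: the helper name `role_for_rank` is hypothetical, and it assumes the convention used here (first rank is the scheduler, last rank is the client, everything in between is a worker).

```python
def role_for_rank(rank: int, world_size: int) -> str:
    """Map an MPI rank to a Dask role under this tutorial's convention."""
    if world_size < 3:
        # At least one scheduler, one worker, and one client are required.
        raise ValueError("need at least 3 processes: scheduler, worker, client")
    if not 0 <= rank < world_size:
        raise ValueError(f"rank {rank} is outside 0..{world_size - 1}")
    if rank == 0:
        return "scheduler"
    if rank == world_size - 1:
        return "client"
    return "worker"


# Print the layout for the 4-process job used in this notebook.
for rank in range(4):
    print(rank, role_for_rank(rank, 4))
```

In the script, `world_size` would come from `mpi_comm.Get_size()`.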

In [2]:
%%writefile script/dask_distributed_example.py
import asyncio
from dask.distributed import Scheduler, Worker, Client
from dask_ml.datasets import make_regression
from dask_ml.model_selection import train_test_split
from dask_ml.linear_model import LinearRegression
from mpi4py import MPI
import socket
import time

mpi_comm = MPI.COMM_WORLD
mpi_rank = mpi_comm.Get_rank()
if mpi_rank == 0 :
    #
    # Rank 0 is scheduler
    #

    async def f_scheduler():
        # Create scheduler (it starts when awaited below)
        s = Scheduler(port=8786)
        # Send address to workers and client
        ipaddr = socket.gethostbyname(socket.gethostname())
        scheduler_info = {
            "address"  : ipaddr + ":8786"
        }
        scheduler_info = mpi_comm.bcast(scheduler_info, root=0)
        # Start scheduler and serve requests
        s = await s
        # Wait until the scheduler is shut down
        await s.finished()
    asyncio.get_event_loop().run_until_complete(f_scheduler())

elif mpi_rank == 3 :
    #
    # Rank 3 is client
    #

    # Wait for scheduler's message (with address info)
    scheduler_info = mpi_comm.bcast(None, root=0)
    scheduler_address = scheduler_info["address"]

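    # Wait for the workers' ready messages
    req = mpi_comm.irecv(source=1, tag=1)
    data = req.wait()
    req = mpi_comm.irecv(source=2, tag=2)
    data = req.wait()

    # Create client
    # (Workers send "ready" before they connect, so pause briefly first.)
    time.sleep(3)
    c = Client(scheduler_address)
    # Block until both workers have registered with the scheduler
    c.wait_for_workers(n_workers=2)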

    # Run program !
    X, y = make_regression(
      n_samples=100000,
      n_features=4,
      chunks=50)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1)
    lr = LinearRegression()
    lr.fit(X_train, y_train)
    print('##### Trained Model Result #####')
    print('Score : {}'.format(lr.score(X_test, y_test)))

    # simple test
    # y = c.submit(lambda x: x + 1, 10)
    # print('The result is {}'.format(y.result()))

    # Stop scheduler and workers
    c.shutdown()

else :
    #
    # Others (Rank 1, 2) are workers
    #

    # Wait for scheduler's message (with address info)
    scheduler_info = mpi_comm.bcast(None, root=0)
    scheduler_address = scheduler_info["address"]
    async def f_worker(scheduler_address):
        # Send ready message to client (rank 3)
        req = mpi_comm.isend('ready', dest=3, tag=mpi_rank)
        req.wait()
        # Start Worker
        w = await Worker(scheduler_address)
        # Wait until the worker is shut down
        await w.finished()
    asyncio.get_event_loop().run_until_complete(f_worker(scheduler_address))

Writing script/dask_distributed_example.py
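
The client in the script pauses with `time.sleep(3)` because the scheduler broadcasts its address before it has actually started listening. A fixed delay can be too short on a slow node. One possible alternative, sketched here with the standard library only, is to poll the scheduler's TCP port until it accepts a connection. The helper name `wait_for_port` and its default timings are assumptions for illustration, not part of the original script.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout: float = 30.0, interval: float = 0.5) -> bool:
    """Return True once a TCP connection to host:port succeeds, False on timeout."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)
```

In the client branch, the address string could be split with `host, port = scheduler_address.rsplit(":", 1)` and passed as `wait_for_port(host, int(port))` before creating the `Client`.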


## 2. Connect to Azure Machine Learning (Create AML config)

Connect to the Azure Machine Learning (AML) workspace, which is the resource created above.<br>
Fill in the workspace name, subscription ID, and resource group name below. (You can find these values on the AML resource blade in the Azure Portal.)

In [3]:
from azureml.core import Workspace
import azureml.core
ws = Workspace(
    workspace_name = "{AML WORKSPACE NAME}",
    subscription_id = "{SUBSCRIPTION ID}",
    resource_group = "{RESOURCE GROUP NAME}")
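
To avoid hard-coding these three values in every notebook, they can be kept in a config file. The sketch below writes a `.azureml/config.json` with the standard library, assuming the documented layout (`subscription_id`, `resource_group`, `workspace_name`) that `Workspace.from_config()` looks for. The helper name `write_workspace_config` is hypothetical.

```python
import json
import os


def write_workspace_config(subscription_id, resource_group, workspace_name, folder="."):
    """Write <folder>/.azureml/config.json and return its path."""
    config_dir = os.path.join(folder, ".azureml")
    os.makedirs(config_dir, exist_ok=True)
    path = os.path.join(config_dir, "config.json")
    config = {
        "subscription_id": subscription_id,
        "resource_group": resource_group,
        "workspace_name": workspace_name,
    }
    with open(path, "w") as f:
        json.dump(config, f, indent=2)
    return path
```

With the file in place, `ws = Workspace.from_config()` should load the same workspace. An existing `Workspace` object can also produce this file itself through `ws.write_config()`.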

## 3. Create cluster (multiple nodes)

Create a remote cluster of 4 CPU VM nodes.

In [4]:
from azureml.core import Workspace
import azureml.core
from azureml.core.compute import AmlCompute, ComputeTarget
from azureml.core.compute_target import ComputeTargetException

# Create AML compute (or Get existing one)
# (Total 4 : scheduler, worker1, worker2, client)
try:
    compute_target = ComputeTarget(workspace=ws, name='cluster01')
    print('found existing:', compute_target.name)
except ComputeTargetException:
    print('creating new.')
    compute_config = AmlCompute.provisioning_configuration(
        vm_size='Standard_DS2_v2',
        min_nodes=4,
        max_nodes=4)
    compute_target = ComputeTarget.create(ws, 'cluster01', compute_config)
    compute_target.wait_for_completion(show_output=True)

creating new.
InProgress......
SucceededProvisioning operation finished, operation "Succeeded"
Succeeded.................................
AmlCompute wait for completion finished

Minimum number of nodes requested have been provisioned
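
With `min_nodes=4`, all four VMs stay allocated until the cluster is deleted. If you prefer to keep the cluster around between runs, one option is to let it scale down to zero nodes when idle. The sketch below only builds the keyword arguments. The helper name `scale_to_zero_settings` and the 600-second idle time are assumptions for illustration, while `idle_seconds_before_scaledown` is an existing parameter of `AmlCompute.provisioning_configuration`.

```python
def scale_to_zero_settings(node_count, vm_size="Standard_DS2_v2", idle_seconds=600):
    """Build provisioning keyword arguments for a cluster that idles at 0 nodes."""
    if node_count < 1:
        raise ValueError("node_count must be at least 1")
    return {
        "vm_size": vm_size,
        "min_nodes": 0,                     # release all VMs when idle
        "max_nodes": node_count,            # still allow the full job size
        "idle_seconds_before_scaledown": idle_seconds,
    }


settings = scale_to_zero_settings(node_count=4)
print(settings)
```

These settings could then be passed as `AmlCompute.provisioning_configuration(**settings)`. The trade-off is that a submitted run has to wait for the nodes to be allocated again.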


## 4. Generate config for run

Generate a script run configuration in AML.<br>
We use an AML base container image with Open MPI configured, and add a conda environment with Python 3.8 on top of it.

In [5]:
from azureml.core.conda_dependencies import CondaDependencies
from azureml.core.environment import Environment
from azureml.core import Run, ScriptRunConfig
from azureml.core.runconfig import DockerConfiguration, MpiConfiguration

# create environment
env = Environment('test-dask-env')
conda_dep = CondaDependencies.create()
conda_dep.set_python_version('3.8')
conda_dep.add_pip_package('mpi4py')
conda_dep.add_conda_package('dask')
conda_dep.add_conda_package('distributed')
conda_dep.add_conda_package('dask-ml')
env.python.conda_dependencies = conda_dep
# (see https://github.com/Azure/AzureML-Containers for AML base images)
env.docker.base_image = 'mcr.microsoft.com/azureml/openmpi3.1.2-ubuntu18.04'

# register environment to re-use later
env.register(workspace=ws)
# # speed up by using the existing environment
# env = Environment.get(ws, name='test-dask-env')

# create script run config
src = ScriptRunConfig(
    source_directory='./script',
    script='dask_distributed_example.py',
    compute_target=compute_target,
    environment=env,
    docker_runtime_config=DockerConfiguration(use_docker=True),
    distributed_job_config=MpiConfiguration(process_count_per_node=1, node_count=4))

## 5. Run !

In [6]:
from azureml.core import Experiment
exp = Experiment(workspace=ws, name='dask_distributed')
run = exp.submit(config=src)
run.wait_for_completion(show_output=True)

RunId: dask_distributed_1626669413_36c83737
Web View: https://ml.azure.com/runs/dask_distributed_1626669413_36c83737?wsid=/subscriptions/b3ae1c15-4fef-4362-8c3a-5d804cdeb18d/resourcegroups/TEST20210719/workspaces/ws01&tid=72f988bf-86f1-41af-91ab-2d7cd011db47

Streaming azureml-logs/55_azureml-execution-tvmps_6959661f1ceda8af9798d81f4cf1128d58dc4c4abf5b142ae22a98d8339acf12_d.txt

2021-07-19T04:37:12Z Successfully mounted a/an Blobfuse File System at /mnt/batch/tasks/shared/LS_root/jobs/ws01/azureml/dask_distributed_1626669413_36c83737/mounts/workspaceblobstore
2021-07-19T04:37:13Z The vmsize standard_ds2_v2 is not a GPU VM, skipping get GPU count by running nvidia-smi command.
2021-07-19T04:37:13Z Starting output-watcher...
2021-07-19T04:37:13Z IsDedicatedCompute == True, won't poll for Low Pri Preemption
Login Succeeded
Using default tag: latest
latest: Pulling from azureml/azureml_14bdc1c83869140b43750e17452d1b9f
Digest: sha256:7a13b213d6a0f7dbc9957b577a410b98d548b76d9da681e4edcb3f74e


Streaming azureml-logs/75_job_post-tvmps_6959661f1ceda8af9798d81f4cf1128d58dc4c4abf5b142ae22a98d8339acf12_d.txt

[2021-07-19T04:39:31.606511] Entering job release
[2021-07-19T04:39:32.318658] job release stage : copy_batchai_cached_logs starting...
[2021-07-19T04:39:32.318697] job release stage : copy_batchai_cached_logs completed...

Execution Summary
RunId: dask_distributed_1626669413_36c83737
Web View: https://ml.azure.com/runs/dask_distributed_1626669413_36c83737?wsid=/subscriptions/b3ae1c15-4fef-4362-8c3a-5d804cdeb18d/resourcegroups/TEST20210719/workspaces/ws01&tid=72f988bf-86f1-41af-91ab-2d7cd011db47



{'runId': 'dask_distributed_1626669413_36c83737',
 'target': 'cluster01',
 'status': 'Completed',
 'startTimeUtc': '2021-07-19T04:37:08.452057Z',
 'endTimeUtc': '2021-07-19T04:39:44.98936Z',
 'properties': {'_azureml.ComputeTargetType': 'amlcompute',
  'ContentSnapshotId': '59f1c1aa-a302-4c4e-a418-8ae6158426de',
  'azureml.git.repository_uri': 'https://github.com/tsmatz/azureml-samples.git',
  'mlflow.source.git.repoURL': 'https://github.com/tsmatz/azureml-samples.git',
  'azureml.git.branch': 'master',
  'mlflow.source.git.branch': 'master',
  'azureml.git.commit': '2cfb88db961caedb534362729bc36ba0189a4380',
  'mlflow.source.git.commit': '2cfb88db961caedb534362729bc36ba0189a4380',
  'azureml.git.dirty': 'True',
  'ProcessInfoFile': 'azureml-logs/process_info.json',
  'ProcessStatusFile': 'azureml-logs/process_status.json',
  'azureml.RuntimeType': 'Hosttools'},
 'inputDatasets': [],
 'outputDatasets': [],
 'runDefinition': {'script': 'dask_distributed_example.py',
  'command': '',
  '

## 6. See the results

Let's look at the output. All results are managed in the Azure Machine Learning experiment's logs.<br>
In this example, we only look at the AML logs for standard output, but you can also create a model file and save it in the run's "```outputs```" folder.
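
As a rough sketch of that second option, the helper below pickles an object into an `outputs` directory relative to the working directory, which AML uploads to the run record. The helper name `save_to_outputs` and the file name in the usage note are hypothetical, and whether a fitted Dask ML estimator pickles cleanly in your environment is an assumption you should verify.

```python
import os
import pickle


def save_to_outputs(obj, file_name, outputs_dir="outputs"):
    """Pickle obj into outputs_dir/file_name and return the written path."""
    os.makedirs(outputs_dir, exist_ok=True)
    path = os.path.join(outputs_dir, file_name)
    with open(path, "wb") as f:
        pickle.dump(obj, f)
    return path
```

In the client branch of the training script, this could be called as `save_to_outputs(lr, 'model.pkl')` after `lr.fit(...)`. The file would then appear in `run.get_file_names()` under `outputs/`.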

In [7]:
# You can see and download results (container logs, model files, etc).
run.get_file_names()

['azureml-logs/55_azureml-execution-tvmps_6959661f1ceda8af9798d81f4cf1128d58dc4c4abf5b142ae22a98d8339acf12_d.txt',
 'azureml-logs/55_azureml-execution-tvmps_7ecdb980a2ef2d4e987861600c55988d191d71941080b87e2a357fdcbe69437c_d.txt',
 'azureml-logs/55_azureml-execution-tvmps_d74c68ec2690131ffeab23cdc5b5909a754c8665852faac2fe1b1ebb40c18f89_d.txt',
 'azureml-logs/55_azureml-execution-tvmps_ffa5ae788dcbeb91223ca7bbc58f1b38ddccbd006db98527ba10dcbf2db685a6_d.txt',
 'azureml-logs/65_job_prep-tvmps_6959661f1ceda8af9798d81f4cf1128d58dc4c4abf5b142ae22a98d8339acf12_d.txt',
 'azureml-logs/65_job_prep-tvmps_7ecdb980a2ef2d4e987861600c55988d191d71941080b87e2a357fdcbe69437c_d.txt',
 'azureml-logs/65_job_prep-tvmps_d74c68ec2690131ffeab23cdc5b5909a754c8665852faac2fe1b1ebb40c18f89_d.txt',
 'azureml-logs/65_job_prep-tvmps_ffa5ae788dcbeb91223ca7bbc58f1b38ddccbd006db98527ba10dcbf2db685a6_d.txt',
 'azureml-logs/70_driver_log_0.txt',
 'azureml-logs/70_driver_log_1.txt',
 'azureml-logs/70_driver_log_2.txt',
 'azu

Now, let's check the validation results from rank 3 (the Dask client). You can download these logs to your local machine.<br>
**Make sure to change the following log filename** so that it matches an entry in the list above.

In [8]:
# Download and see Dask Client logs
run.download_file(
    name='azureml-logs/70_driver_log_3.txt',
    output_file_path='remote_logs/70_driver_log_3.txt')
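
Instead of editing the filename by hand, the client's log can be picked out of the list returned by `run.get_file_names()`. This is a minimal sketch that assumes the driver logs follow the `azureml-logs/70_driver_log_<rank>.txt` pattern seen in this run. The helper name `driver_log_for_rank` is hypothetical.

```python
import re


def driver_log_for_rank(file_names, rank):
    """Return the driver log path for the given MPI rank from a list of run files."""
    pattern = re.compile(r"^azureml-logs/70_driver_log_(\d+)\.txt$")
    for name in file_names:
        match = pattern.match(name)
        if match and int(match.group(1)) == rank:
            return name
    raise LookupError(f"no driver log found for rank {rank}")
```

For this notebook, `driver_log_for_rank(run.get_file_names(), rank=3)` would give the name to pass to `run.download_file`.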

In [9]:
!tail -n 30 remote_logs/70_driver_log_3.txt

[fbc67941082e4509aec2527bf8612700000003:00113] btl:tcp: Attempting to bind to AF_INET port 1024
[fbc67941082e4509aec2527bf8612700000003:00113] btl:tcp: Successfully bound to AF_INET port 1024
[fbc67941082e4509aec2527bf8612700000003:00113] btl:tcp: my listening v4 socket is 0.0.0.0:1024
[fbc67941082e4509aec2527bf8612700000003:00113] btl:tcp: examining interface eth0
[fbc67941082e4509aec2527bf8612700000003:00113] btl:tcp: using ipv6 interface eth0
[fbc67941082e4509aec2527bf8612700000003:00113] select: init of component tcp returned success
[fbc67941082e4509aec2527bf8612700000003:00113] select: initializing btl component self
[fbc67941082e4509aec2527bf8612700000003:00113] select: init of component self returned success
[fbc67941082e4509aec2527bf8612700000003:00113] select: initializing btl component vader
[fbc67941082e4509aec2527bf8612700000003:00113] select: init of component vader returned failure
[fbc67941082e4509aec2527bf8612700000003:00113] mca: base: close: component vader
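
As the output above shows, the tail of the log can be dominated by Open MPI shutdown messages, so the score line may not be visible. One way around this is to search the whole file for the `Score : ...` line that the script prints. The helper name `find_score` is hypothetical, and the sketch assumes the score is printed as a plain decimal number.

```python
import re


def find_score(log_text):
    """Return the last 'Score : <number>' value in the log as a float, or None."""
    matches = re.findall(r"Score : ([-+]?[0-9][0-9.eE+-]*)", log_text)
    return float(matches[-1]) if matches else None
```

After the download step, this could be used as `find_score(open('remote_logs/70_driver_log_3.txt').read())`.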

## 7. Remove cluster (Clean-up)

In [10]:
# Delete cluster (nodes) in AML workspace
mycompute = AmlCompute(workspace=ws, name='cluster01')
mycompute.delete()