# MXNet Distributed Training (MNIST Sample) in Azure Machine Learning

To run this notebook:

1. Create a new "Machine Learning" resource in the [Azure Portal](https://portal.azure.com/).<br>
In the creation wizard, specify a location (region) that supports NC-series (K80 GPU) virtual machines, such as "East US".<br>
See [here](https://azure.microsoft.com/en-us/global-infrastructure/services/?products=virtual-machines) for the regions that support NC-series.
2. Install the Azure Machine Learning SDK (core package) as follows:

```
pip install azureml-core
```

See "[MXNet Distributed Training Example for Azure ML service](https://tsmatz.wordpress.com/2019/01/17/azure-machine-learning-service-custom-amlcompute-and-runconfig-for-mxnet-distributed-training/)" for details.

## 1. Create script for MXNet distributed training (mnist_distributed.py)

Save the following script file (mnist_distributed.py) for MXNet distributed training.

> Note: To debug locally on a single CPU device, use the commented-out lines in place of the active ones.

In [1]:
import os
script_folder = './script'
os.makedirs(script_folder, exist_ok=True)

In [2]:
%%writefile script/mnist_distributed.py
import os, random
import mxnet as mx
from mxnet import kv, gluon, autograd, nd
from mxnet.gluon import nn

store = kv.create('dist')

gpus_per_machine = 1
batch_size_per_gpu = 64
batch_size = batch_size_per_gpu * gpus_per_machine
num_epochs = 5

ctx = [mx.gpu(i) for i in range(gpus_per_machine)]
# ctx = mx.cpu(0)

class SplitSampler(gluon.data.sampler.Sampler):
    """
    length: Number of examples in the dataset
    num_parts: Partition the data into multiple parts
    part_index: The index of the part to read from
    """
    def __init__(self, length, num_parts, part_index):
        self.part_len = length // num_parts
        self.start = self.part_len * part_index
        self.end = self.start + self.part_len
    def __iter__(self):
        indices = list(range(self.start, self.end))
        random.shuffle(indices)
        return iter(indices)
    def __len__(self):
        return self.part_len

mx.random.seed(42)
def data_xform(data):
    """Move channel axis to the beginning, cast to float32, and normalize to [0, 1]"""
    return nd.moveaxis(data, 2, 0).astype('float32') / 255
train_data = gluon.data.DataLoader(
    gluon.data.vision.MNIST(train=True).transform_first(data_xform),
    batch_size=batch_size,
    # 59904 = 468 * 128, so with 2 workers each one reads 468 full batches of 64
    sampler=SplitSampler(59904, store.num_workers, store.rank))
test_data = gluon.data.DataLoader(
    gluon.data.vision.MNIST(train=False).transform_first(data_xform),
    batch_size=batch_size,
    shuffle=False)
# train_data = gluon.data.DataLoader(
#     gluon.data.vision.MNIST(train=True, root='./data').transform_first(data_xform),
#     batch_size=batch_size)
# test_data = gluon.data.DataLoader(
#     gluon.data.vision.MNIST(train=False, root='./data').transform_first(data_xform),
#     batch_size=batch_size,
#     shuffle=False)

net = nn.HybridSequential(prefix='MLP_')
with net.name_scope():
    net.add(
        nn.Flatten(),
        nn.Dense(128, activation='relu'),
        nn.Dense(64, activation='relu'),
        nn.Dense(10, activation=None)
    )

net.hybridize()

net.initialize(mx.init.Xavier(), ctx=ctx)

loss_function = gluon.loss.SoftmaxCrossEntropyLoss()

trainer = gluon.Trainer(
    params=net.collect_params(),
    optimizer='sgd',
    optimizer_params={'learning_rate': 0.07},
    kvstore=store)
# trainer = gluon.Trainer(
#     params=net.collect_params(),
#     optimizer='sgd',
#     optimizer_params={'learning_rate': 0.07},
# )

for epoch in range(num_epochs):
    # Train
    for batch in train_data:
        inputs = gluon.utils.split_and_load(batch[0], ctx)
        labels = gluon.utils.split_and_load(batch[1], ctx)
        # inputs = batch[0].as_in_context(ctx)
        # labels = batch[1].as_in_context(ctx)
        with autograd.record():
            loss = [loss_function(net(X), Y) for X, Y in zip(inputs, labels)]
            # loss = loss_function(net(inputs), labels)
        for l in loss:
            l.backward()
        # loss.backward()
        trainer.step(batch_size=batch[0].shape[0])
    # Evaluate on the test set and print the accuracy
    metric = mx.metric.Accuracy()
    for i, (test_input, test_label) in enumerate(test_data):
        test_input = test_input.as_in_context(ctx[0])
        test_label = test_label.as_in_context(ctx[0])
        # test_input = test_input.as_in_context(ctx)
        # test_label = test_label.as_in_context(ctx)
        test_output = net(test_input)
        test_pred = nd.argmax(test_output, axis=1)
        metric.update(preds=test_pred, labels=test_label)
    print('Epoch %d: Accuracy %f' % (epoch, metric.get()[1]))

# Save the model (both architecture and parameters) from worker rank 0 only
if store.rank == 0:
    os.makedirs('./outputs', exist_ok=True)
    net.export('./outputs/test', epoch=1)
# os.makedirs('./outputs', exist_ok=True)
# net.export('./outputs/test', epoch=1)

Writing script/mnist_distributed.py
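The `SplitSampler` in the script above gives each worker a contiguous, non-overlapping slice of the training set. The sketch below reproduces that index arithmetic in plain Python, without MXNet. The helper names `usable_length` and `worker_range` are hypothetical, and the reading that 59904 is the largest multiple of `num_workers * batch_size` not exceeding the 60000 MNIST training examples is an assumption; the script itself does not say why that value was chosen.

```python
def usable_length(dataset_size, num_workers, batch_size):
    """Largest length <= dataset_size that gives every worker only full batches."""
    chunk = num_workers * batch_size
    return (dataset_size // chunk) * chunk


def worker_range(length, num_parts, part_index):
    """Same arithmetic as SplitSampler.__init__: (start, end) indices for one worker."""
    part_len = length // num_parts
    start = part_len * part_index
    return start, start + part_len


# 2 workers with a per-worker batch size of 64, as in this notebook
length = usable_length(60000, num_workers=2, batch_size=64)
print(length)                      # 59904
print(worker_range(length, 2, 0))  # (0, 29952)
print(worker_range(length, 2, 1))  # (29952, 59904)
```

If the number of workers or the batch size changes, the hard-coded length in the script would presumably need to be recomputed in the same way.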


## 2. Create script for entry (start_mx_role.py)

Create an entry script that starts each role in MXNet distributed training.<br>
Here we run 4 nodes, one MPI process per node, with the following roles:

- Rank 0 : Scheduler
- Rank 1 : Parameter Server
- Rank 2 : Worker
- Rank 3 : Worker
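
The rank-to-role mapping listed above can be sketched as a small pure-Python function. It mirrors the `if` / `elif` / `else` chain in the entry script; the function name `role_for_rank` is hypothetical and used only for illustration.

```python
def role_for_rank(mpi_rank, num_servers):
    """Map an MPI rank to a DMLC role, following the entry script's rules."""
    if mpi_rank == 0:
        return 'scheduler'   # rank 0 is always the scheduler
    if mpi_rank <= num_servers:
        return 'server'      # ranks 1..num_servers are parameter servers
    return 'worker'          # every remaining rank is a worker


# 4 nodes with 1 parameter server, as in this notebook
roles = [role_for_rank(rank, num_servers=1) for rank in range(4)]
print(roles)  # ['scheduler', 'server', 'worker', 'worker']
```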

In [3]:
%%writefile script/start_mx_role.py
import argparse
import os
from mpi4py import MPI

parser = argparse.ArgumentParser()
parser.add_argument(
    '--num_workers',
    type=int,
    default=0,
    help='Specifies how many worker roles')
parser.add_argument(
    '--num_servers',
    type=int,
    default=0,
    help='Specifies how many server roles')
parser.add_argument(
    '--scheduler_host',
    type=str,
    default='10.0.0.4',
    help='Specifies the IP of the scheduler')
FLAGS, unparsed = parser.parse_known_args()

#
# See https://mxnet.incubator.apache.org/faq/distributed_training.html
#

mpi_comm = MPI.COMM_WORLD
mpi_rank = mpi_comm.Get_rank()
if mpi_rank == 0 :
    # Rank 0 is scheduler
    os.environ['DMLC_ROLE'] = 'scheduler'
elif mpi_rank <= FLAGS.num_servers :
    # Rank 1, ..., FLAGS.num_servers is server
    os.environ['DMLC_ROLE'] = 'server'
else :
    # All remaining ranks are workers (their count must equal FLAGS.num_workers)
    os.environ['DMLC_ROLE'] = 'worker'
os.environ['DMLC_PS_ROOT_URI'] = FLAGS.scheduler_host
os.environ['DMLC_PS_ROOT_PORT'] = '9092'
os.environ['DMLC_NUM_WORKER'] = str(FLAGS.num_workers)
os.environ['DMLC_NUM_SERVER'] = str(FLAGS.num_servers)

#
# Run previous script !
#
import mnist_distributed

Writing script/start_mx_role.py
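
The entry script assumes that the number of MPI processes equals one scheduler plus the requested servers and workers, but it does not verify this. A minimal sketch of such a check follows; `check_layout` is a hypothetical helper, not part of the script. In the real entry script the first argument would presumably come from `MPI.COMM_WORLD.Get_size()`.

```python
def check_layout(world_size, num_workers, num_servers):
    """Raise if the process count does not match 1 scheduler + servers + workers."""
    expected = 1 + num_servers + num_workers
    if world_size != expected:
        raise ValueError(
            'expected %d processes (1 scheduler + %d servers + %d workers), got %d'
            % (expected, num_servers, num_workers, world_size))
    return expected


# The layout used in this notebook: 4 processes, 2 workers, 1 server
print(check_layout(world_size=4, num_workers=2, num_servers=1))  # 4
```

Failing early like this is a design choice: a mismatched layout would otherwise tend to show up later as a job that waits for roles that never join.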


## 3. Connect to Azure Machine Learning (Create AML config)

Connect to the Azure Machine Learning (AML) workspace, the resource you created above.<br>
Fill in the workspace name, subscription ID, and resource group name below. (You can find these values on the AML resource blade in the Azure Portal.)

In [4]:
from azureml.core import Workspace
import azureml.core

ws = Workspace(
  workspace_name = "{AML WORKSPACE NAME}",
  subscription_id = "{SUBSCRIPTION ID}",
  resource_group = "{RESOURCE GROUP NAME}")

Performing interactive authentication. Please follow the instructions on the terminal.
To sign in, use a web browser to open the page https://microsoft.com/devicelogin and enter the code A9C2L7XLE to authenticate.
You have logged in. Now let us find all the subscriptions to which you have access...
Interactive authentication successfully completed.
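
To avoid typing the three workspace values into the notebook, they could be read from environment variables instead. The sketch below is one way to do that under stated assumptions: the variable names `AML_WORKSPACE_NAME`, `AML_SUBSCRIPTION_ID`, and `AML_RESOURCE_GROUP` and the helper `workspace_kwargs` are hypothetical, not something the SDK defines. The returned keys match the keyword arguments used in the cell above, so the result could be passed as `Workspace(**workspace_kwargs())`.

```python
import os

# Hypothetical environment variable names, chosen for this sketch only
ENV_NAMES = {
    'workspace_name': 'AML_WORKSPACE_NAME',
    'subscription_id': 'AML_SUBSCRIPTION_ID',
    'resource_group': 'AML_RESOURCE_GROUP',
}


def workspace_kwargs(env=None):
    """Collect the Workspace keyword arguments from a mapping (default: os.environ)."""
    env = os.environ if env is None else env
    missing = [var for var in ENV_NAMES.values() if not env.get(var)]
    if missing:
        raise KeyError('missing settings: ' + ', '.join(missing))
    return {key: env[var] for key, var in ENV_NAMES.items()}


# Example with an explicit mapping instead of the real environment
example = {
    'AML_WORKSPACE_NAME': 'ws01',
    'AML_SUBSCRIPTION_ID': '00000000-0000-0000-0000-000000000000',
    'AML_RESOURCE_GROUP': 'my-resource-group',
}
print(workspace_kwargs(example)['workspace_name'])  # ws01
```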


## 4. Create cluster (multiple nodes)

Create a remote cluster of 4 GPU VM nodes: scheduler, parameter server, worker 0, and worker 1.

In [5]:
from azureml.core import Workspace
import azureml.core
from azureml.core.compute import AmlCompute, ComputeTarget
from azureml.core.compute_target import ComputeTargetException
 
# Create AML compute (or Get existing one)
# (Total 4 : scheduler, server, worker1, worker2)
try:
    compute_target = ComputeTarget(workspace=ws, name='cluster01')
    print('found existing:', compute_target.name)
except ComputeTargetException:
    print('creating new.')
    compute_config = AmlCompute.provisioning_configuration(
        vm_size='Standard_NC6',
        min_nodes=4,
        max_nodes=4)
    compute_target = ComputeTarget.create(ws, 'cluster01', compute_config)
    compute_target.wait_for_completion(show_output=True)

creating new.
Creating..........
SucceededProvisioning operation finished, operation "Succeeded"
Succeeded.......................
AmlCompute wait for completion finished

Minimum number of nodes requested have been provisioned


## 5. Generate config for run

Generate a script run configuration in AML.<br>
Here we use a custom container image in which Open MPI is installed and configured. (See [here](https://tsmatz.wordpress.com/2019/01/17/azure-machine-learning-service-custom-amlcompute-and-runconfig-for-mxnet-distributed-training/) for details.)

In [7]:
from azureml.core import ScriptRunConfig, Experiment, Run
from azureml.core.runconfig import RunConfiguration, DockerConfiguration
from azureml.core.conda_dependencies import CondaDependencies
 
conda_dep = CondaDependencies.create()
conda_dep.add_pip_package('mxnet-cu90')
conda_dep.add_pip_package('mpi4py')
run_config = RunConfiguration(
    framework='python',
    conda_dependencies=conda_dep)
run_config.target = compute_target.name
run_config.docker = DockerConfiguration(use_docker=True)
run_config.environment.docker.base_image = 'tsmatz/azureml-openmpi:0.1.0-gpu'
run_config.communicator = 'OpenMpi'
run_config.node_count = 4
run_config.mpi.process_count_per_node = 1

src = ScriptRunConfig(
    source_directory='./script',
    script='start_mx_role.py',
    run_config=run_config,
    arguments=[
        '--num_workers', 2,
        '--num_servers', 1,
        '--scheduler_host', '`cut -d ":" -f 1 <<< $AZ_BATCH_MASTER_NODE`']) # getting master node's ip like "10.0.0.4" (or use $AZ_BATCHAI_MPI_MASTER_NODE)
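
The `--scheduler_host` argument above relies on shell substitution: `cut -d ":" -f 1` keeps everything before the first colon of `$AZ_BATCH_MASTER_NODE`. The same extraction can be sketched in Python. The helper name `scheduler_host_from` is hypothetical, and the sample value `10.0.0.4:6000` (in particular the port) is an assumed example of the `ip:port` shape implied by the `cut` command, not a value taken from a real run.

```python
def scheduler_host_from(master_node):
    """Return the part before the first ':' (same result as `cut -d ":" -f 1`)."""
    return master_node.split(':', 1)[0]


# Assumed example value in "ip:port" form
print(scheduler_host_from('10.0.0.4:6000'))  # 10.0.0.4

# A value without a colon is returned unchanged, as with `cut`
print(scheduler_host_from('10.0.0.4'))       # 10.0.0.4
```

An entry script could apply this to `os.environ['AZ_BATCH_MASTER_NODE']` directly instead of depending on the shell, assuming that variable is set inside the container.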

## 6. Run!

In [8]:
exp = Experiment(workspace=ws, name='mnist_mxnet_distributed')
run = exp.submit(config=src)
run.wait_for_completion(show_output=True)

RunId: mnist_mxnet_distributed_1619049406_9c858a71
Web View: https://ml.azure.com/runs/mnist_mxnet_distributed_1619049406_9c858a71?wsid=/subscriptions/b3ae1c15-4fef-4362-8c3a-5d804cdeb18d/resourcegroups/TEST20210422-02/workspaces/ws01&tid=72f988bf-86f1-41af-91ab-2d7cd011db47

Streaming azureml-logs/20_image_build_log.txt

2021/04/21 23:57:04 Downloading source code...
2021/04/21 23:57:06 Finished downloading source code
2021/04/21 23:57:06 Creating Docker network: acb_default_network, driver: 'bridge'
2021/04/21 23:57:07 Successfully set up Docker network: acb_default_network
2021/04/21 23:57:07 Setting up Docker configuration...
2021/04/21 23:57:07 Successfully set up Docker configuration
2021/04/21 23:57:07 Logging in to registry: edf3a08dff924ac4ad0911e6b022f278.azurecr.io
2021/04/21 23:57:09 Successfully logged into edf3a08dff924ac4ad0911e6b022f278.azurecr.io
2021/04/21 23:57:09 Executing step ID: acb_step_0. Timeout(sec): 5400, Working directory: '', Network: 'acb_default_network'

  Downloading mpi4py-3.0.3.tar.gz (1.4 MB)
Collecting azureml-dataset-runtime[fuse]~=1.27.0
  Downloading azureml_dataset_runtime-1.27.0-py3-none-any.whl (3.4 kB)
Collecting azureml-model-management-sdk==1.0.1b6.post1
  Downloading azureml_model_management_sdk-1.0.1b6.post1-py2.py3-none-any.whl (130 kB)
Collecting json-logging-py==0.2
  Downloading json-logging-py-0.2.tar.gz (3.6 kB)
Collecting configparser==3.7.4
  Downloading configparser-3.7.4-py2.py3-none-any.whl (22 kB)
Collecting gunicorn==19.9.0
  Downloading gunicorn-19.9.0-py2.py3-none-any.whl (112 kB)
Collecting applicationinsights>=0.11.7
  Downloading applicationinsights-0.11.9-py2.py3-none-any.whl (58 kB)
Collecting werkzeug<=1.0.1,>=0.16.1
  Downloading Werkzeug-1.0.1-py2.py3-none-any.whl (298 kB)
Collecting azureml-core~=1.27.0
  Downloading azureml_core-1.27.0-py3-none-any.whl (2.2 MB)
Collecting flask==1.0.3
  Downloading Flask-1.0.3-py2.py3-none-any.whl (92 kB)
Collecting graphviz<0.9.0,>=0.8.1
  Downloading graphviz-

Removing intermediate container db38a93adde2
 ---> 8c00a80a12fe
Step 9/18 : ENV PATH /azureml-envs/azureml_71486be63d67b01363c50561276303e4/bin:$PATH
 ---> Running in 9980eaba1fbc
Removing intermediate container 9980eaba1fbc
 ---> 0fdc5ee0adf6
Step 10/18 : COPY azureml-environment-setup/send_conda_dependencies.py azureml-environment-setup/send_conda_dependencies.py
 ---> 6caf83ad3247
Step 11/18 : COPY azureml-environment-setup/environment_context.json azureml-environment-setup/environment_context.json
 ---> b59406a431a2
Step 12/18 : RUN python /azureml-environment-setup/send_conda_dependencies.py -p /azureml-envs/azureml_71486be63d67b01363c50561276303e4
 ---> Running in f8b507c8ce0a
Report materialized dependencies for the environment
Reading environment context
Exporting conda environment
Exception occured on getting conda environment details
Failed to send materialized environment details
Removing intermediate container f8b507c8ce0a
 ---> 66c10248bfac
Step 13/18 : ENV AZUREML_CONDA_E


Streaming azureml-logs/55_azureml-execution-tvmps_2420ae3280fee7c35bf313ffa30e1fc6d9264a9305f31b3ae2d121e8a0f8f46f_d.txt

2021-04-22T00:08:42Z Successfully mounted a/an Blobfuse File System at /mnt/batch/tasks/shared/LS_root/jobs/ws01/azureml/mnist_mxnet_distributed_1619049406_9c858a71/mounts/workspaceblobstore
2021-04-22T00:08:42Z Failed to start nvidia-fabricmanager due to exit status 5 with output Failed to start nvidia-fabricmanager.service: Unit nvidia-fabricmanager.service not found.
. Please ignore this if the GPUs don't utilize NVIDIA® NVLink® switches.
2021-04-22T00:08:43Z Starting output-watcher...
2021-04-22T00:08:43Z IsDedicatedCompute == True, won't poll for Low Pri Preemption

Streaming azureml-logs/65_job_prep-tvmps_671f5cd2ff5f357d074859b99abe2a726dca420c23feeb0488f9e59b496cf1aa_d.txt

[2021-04-22T00:09:42.743284] Entering job preparation.

Streaming azureml-logs/65_job_prep-tvmps_37fe55b73e3416702ce9173178ef00bb8f33cf8106ef043683da7138921ee1d0_d.txt

[2021-04-22T00:09

{'runId': 'mnist_mxnet_distributed_1619049406_9c858a71',
 'target': 'cluster01',
 'status': 'Completed',
 'startTimeUtc': '2021-04-22T00:08:35.49867Z',
 'endTimeUtc': '2021-04-22T00:19:56.317711Z',
 'properties': {'_azureml.ComputeTargetType': 'amlcompute',
  'ContentSnapshotId': '1ca627c3-b0ab-4c15-af44-50ab4c715f8c',
  'azureml.git.repository_uri': 'https://github.com/tsmatz/azureml-samples.git',
  'mlflow.source.git.repoURL': 'https://github.com/tsmatz/azureml-samples.git',
  'azureml.git.branch': 'master',
  'mlflow.source.git.branch': 'master',
  'azureml.git.commit': '6caaeb22950e8e8a3c59e6413648544e20cc122d',
  'mlflow.source.git.commit': '6caaeb22950e8e8a3c59e6413648544e20cc122d',
  'azureml.git.dirty': 'True',
  'ProcessInfoFile': 'azureml-logs/process_info.json',
  'ProcessStatusFile': 'azureml-logs/process_status.json'},
 'inputDatasets': [],
 'outputDatasets': [],
 'runDefinition': {'script': 'start_mx_role.py',
  'command': '',
  'useAbsolutePath': False,
  'arguments': ['

## 7. See the results

Let's look at the results. All outputs and logs are stored with the experiment run in Azure Machine Learning.<br>
The `outputs` folder contains the model exported by MXNet: `outputs/test-symbol.json` (architecture) and `outputs/test-0001.params` (parameters).
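
The two file names follow from the `net.export('./outputs/test', epoch=1)` call in the training script: MXNet appends `-symbol.json` to the path for the architecture and `-NNNN.params`, with a zero-padded four-digit epoch, for the parameters. The sketch below only reproduces that naming rule in plain Python; `exported_files` is a hypothetical helper and does not call MXNet.

```python
def exported_files(path, epoch):
    """Return the (symbol file, params file) names produced for a given export path."""
    return '%s-symbol.json' % path, '%s-%04d.params' % (path, epoch)


print(exported_files('./outputs/test', 1))
# ('./outputs/test-symbol.json', './outputs/test-0001.params')
```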

In [9]:
# You can see and download results (test-symbol.json, test-0001.params).
run.get_file_names()

['azureml-logs/20_image_build_log.txt',
 'azureml-logs/55_azureml-execution-tvmps_2420ae3280fee7c35bf313ffa30e1fc6d9264a9305f31b3ae2d121e8a0f8f46f_d.txt',
 'azureml-logs/55_azureml-execution-tvmps_37fe55b73e3416702ce9173178ef00bb8f33cf8106ef043683da7138921ee1d0_d.txt',
 'azureml-logs/55_azureml-execution-tvmps_671f5cd2ff5f357d074859b99abe2a726dca420c23feeb0488f9e59b496cf1aa_d.txt',
 'azureml-logs/55_azureml-execution-tvmps_b82f0ed4845e507ca2675aa0f4891bd9fb535621cc315269ba1640d672e6d27b_d.txt',
 'azureml-logs/65_job_prep-tvmps_2420ae3280fee7c35bf313ffa30e1fc6d9264a9305f31b3ae2d121e8a0f8f46f_d.txt',
 'azureml-logs/65_job_prep-tvmps_37fe55b73e3416702ce9173178ef00bb8f33cf8106ef043683da7138921ee1d0_d.txt',
 'azureml-logs/65_job_prep-tvmps_671f5cd2ff5f357d074859b99abe2a726dca420c23feeb0488f9e59b496cf1aa_d.txt',
 'azureml-logs/65_job_prep-tvmps_b82f0ed4845e507ca2675aa0f4891bd9fb535621cc315269ba1640d672e6d27b_d.txt',
 'azureml-logs/70_driver_log_0.txt',
 'azureml-logs/70_driver_log_1.txt',
 '

To see the validation results printed by the workers (see the training script above), download the driver logs for rank 2 and rank 3.

In [11]:
run.download_file(
    name='azureml-logs/70_driver_log_2.txt',
    output_file_path='remote_logs/70_driver_log_2.txt')
run.download_file(
    name='azureml-logs/70_driver_log_3.txt',
    output_file_path='remote_logs/70_driver_log_3.txt')

In [12]:
!tail -n 15 remote_logs/70_driver_log_2.txt remote_logs/70_driver_log_3.txt

==> remote_logs/70_driver_log_2.txt <==
Downloading /root/.mxnet/datasets/mnist/train-labels-idx1-ubyte.gz from https://apache-mxnet.s3-accelerate.dualstack.amazonaws.com/gluon/dataset/mnist/train-labels-idx1-ubyte.gz...
Downloading /root/.mxnet/datasets/mnist/t10k-images-idx3-ubyte.gz from https://apache-mxnet.s3-accelerate.dualstack.amazonaws.com/gluon/dataset/mnist/t10k-images-idx3-ubyte.gz...
Downloading /root/.mxnet/datasets/mnist/t10k-labels-idx1-ubyte.gz from https://apache-mxnet.s3-accelerate.dualstack.amazonaws.com/gluon/dataset/mnist/t10k-labels-idx1-ubyte.gz...
Epoch 0: Accuracy 0.939300
Epoch 1: Accuracy 0.955400
Epoch 2: Accuracy 0.964300
Epoch 3: Accuracy 0.963500
Epoch 4: Accuracy 0.971200


[2021-04-22T00:18:33.896938] The experiment completed successfully. Finalizing run...
Cleaning up all outstanding Run operations, waiting 900.0 seconds
1 items cleaning up...
Cleanup took 0.07186508178710938 seconds
[2021-04-22T00:18:34.209841] Finished context manager
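
Rather than reading the accuracy lines by eye, the downloaded log can be parsed. The sketch below extracts the `Epoch N: Accuracy X` lines that the training script prints; `parse_accuracy` is a hypothetical helper, and the sample lines are copied from the log output above.

```python
import re

# Matches lines such as "Epoch 0: Accuracy 0.939300"
EPOCH_LINE = re.compile(r'^Epoch (\d+): Accuracy ([0-9.]+)$')


def parse_accuracy(lines):
    """Return {epoch: accuracy} for every matching line; other lines are ignored."""
    results = {}
    for line in lines:
        match = EPOCH_LINE.match(line.strip())
        if match:
            results[int(match.group(1))] = float(match.group(2))
    return results


sample = [
    'Epoch 0: Accuracy 0.939300',
    'Epoch 1: Accuracy 0.955400',
    'Epoch 2: Accuracy 0.964300',
    'Epoch 3: Accuracy 0.963500',
    'Epoch 4: Accuracy 0.971200',
    'Cleaning up all outstanding Run operations, waiting 900.0 seconds',
]
accuracy = parse_accuracy(sample)
print(max(accuracy, key=accuracy.get))  # 4
```

To use it on a real log, the lines could be read with `open('remote_logs/70_driver_log_2.txt')` after the download cell has run.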

## 8. Remove cluster (Clean-up)

In [13]:
# Delete cluster (nodes) in AML workspace
mycompute = AmlCompute(workspace=ws, name='cluster01')
mycompute.delete()