# MXNet Distributed Training (MNIST Sample) in Azure Machine Learning

To run this notebook:

1. Create a new "Machine Learning" resource in the [Azure Portal](https://portal.azure.com/).
2. Install the Azure Machine Learning SDK (core package) as follows:

```
pip install azureml-core
```

See "[MXNet Distributed Training Example for Azure ML service](https://tsmatz.wordpress.com/2019/01/17/azure-machine-learning-service-custom-amlcompute-and-runconfig-for-mxnet-distributed-training/)" for details.

## 1. Create script for MXNet distributed training (mnist_distributed.py)

Save a script file (mnist_distributed.py) for MXNet distributed training.

> Note: To debug locally on a single CPU device, use the commented-out lines in place of the active lines next to them.

In [1]:
import os
script_folder = './script'
os.makedirs(script_folder, exist_ok=True)

In [2]:
%%writefile script/mnist_distributed.py
import os, random
import mxnet as mx
from mxnet import kv, gluon, autograd, nd
from mxnet.gluon import nn

store = kv.create('dist')

gpus_per_machine = 1
batch_size_per_gpu = 64
batch_size = batch_size_per_gpu * gpus_per_machine
num_epochs = 5

ctx = [mx.gpu(i) for i in range(gpus_per_machine)]
# ctx = mx.cpu(0)

class SplitSampler(gluon.data.sampler.Sampler):
    """
    length: Number of examples in the dataset
    num_parts: Partition the data into multiple parts
    part_index: The index of the part to read from
    """
    def __init__(self, length, num_parts, part_index):
        self.part_len = length // num_parts
        self.start = self.part_len * part_index
        self.end = self.start + self.part_len
    def __iter__(self):
        indices = list(range(self.start, self.end))
        random.shuffle(indices)
        return iter(indices)
    def __len__(self):
        return self.part_len

mx.random.seed(42)
def data_xform(data):
    """Move channel axis to the beginning, cast to float32, and normalize to [0, 1]"""
    return nd.moveaxis(data, 2, 0).astype('float32') / 255
train_data = gluon.data.DataLoader(
    gluon.data.vision.MNIST(train=True).transform_first(data_xform),
    batch_size=batch_size,
    sampler=SplitSampler(59904, store.num_workers, store.rank))
test_data = gluon.data.DataLoader(
    gluon.data.vision.MNIST(train=False).transform_first(data_xform),
    batch_size=batch_size,
    shuffle=False)
# train_data = gluon.data.DataLoader(
#     gluon.data.vision.MNIST(train=True, root='./data').transform_first(data_xform),
#     batch_size=batch_size)
# test_data = gluon.data.DataLoader(
#     gluon.data.vision.MNIST(train=False, root='./data').transform_first(data_xform),
#     batch_size=batch_size,
#     shuffle=False)

net = nn.HybridSequential(prefix='MLP_')
with net.name_scope():
    net.add(
        nn.Flatten(),
        nn.Dense(128, activation='relu'),
        nn.Dense(64, activation='relu'),
        nn.Dense(10, activation=None)
    )

net.hybridize()

net.initialize(mx.init.Xavier(), ctx=ctx)

loss_function = gluon.loss.SoftmaxCrossEntropyLoss()

trainer = gluon.Trainer(
    params=net.collect_params(),
    optimizer='sgd',
    optimizer_params={'learning_rate': 0.07},
    kvstore=store)
# trainer = gluon.Trainer(
#     params=net.collect_params(),
#     optimizer='sgd',
#     optimizer_params={'learning_rate': 0.07},
# )

for epoch in range(num_epochs):
    """ Train ! """
    for batch in train_data:
        inputs = gluon.utils.split_and_load(batch[0], ctx)
        labels = gluon.utils.split_and_load(batch[1], ctx)
        # inputs = batch[0].as_in_context(ctx)
        # labels = batch[1].as_in_context(ctx)
        with autograd.record():
            loss = [loss_function(net(X), Y) for X, Y in zip(inputs, labels)]
            # loss = loss_function(net(inputs), labels)
        for l in loss:
            l.backward()
        # loss.backward()
        trainer.step(batch_size=batch[0].shape[0])
    """ Evaluate and Output ! """
    metric = mx.metric.Accuracy()
    for i, (test_input, test_label) in enumerate(test_data):
        test_input = test_input.as_in_context(ctx[0])
        test_label = test_label.as_in_context(ctx[0])
        # test_input = test_input.as_in_context(ctx)
        # test_label = test_label.as_in_context(ctx)
        test_output = net(test_input)
        test_pred = nd.argmax(test_output, axis=1)
        metric.update(preds=test_pred, labels=test_label)
    print('Epoch %d: Accuracy %f' % (epoch, metric.get()[1]))

""" Save Model (both architecture and parameters) """
if store.rank == 0:
    os.makedirs('./outputs', exist_ok=True)
    net.export('./outputs/test', epoch=1)
# os.makedirs('./outputs', exist_ok=True)
# net.export('./outputs/test', epoch=1)

Writing script/mnist_distributed.py
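
To make the data split in `SplitSampler` concrete, here is a minimal sketch in plain Python (no MXNet required) that reproduces only its index arithmetic. The helper name `partition_bounds` is hypothetical, and the ranks 0 and 1 stand in for the `store.rank` values of the two workers used in this notebook.

```python
def partition_bounds(length, num_parts, part_index):
    """Return the [start, end) index range of one part, as SplitSampler computes it."""
    part_len = length // num_parts
    start = part_len * part_index
    return start, start + part_len

# 59904 is the sample count passed to SplitSampler in the script above.
# Two parts correspond to the two workers; ranks 0 and 1 are assumed here.
bounds = [partition_bounds(59904, 2, rank) for rank in range(2)]

# With batch_size = 64, each worker iterates over this many batches per epoch.
batches_per_worker = (bounds[0][1] - bounds[0][0]) // 64

print(bounds)
print(batches_per_worker)
```

Each worker shuffles only within its own range, so the two workers never read the same training example during an epoch.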


## 2. Create script for entry (start_mx_role.py)

Create an entry script that starts each role in MXNet distributed training.<br>
Here we run 4 nodes with the following roles.

- Rank 0 : Scheduler
- Rank 1 : Parameter Server
- Rank 2 : Worker
- Rank 3 : Worker

In [3]:
%%writefile script/start_mx_role.py
import argparse
import os
from mpi4py import MPI

parser = argparse.ArgumentParser()
parser.add_argument(
    '--num_workers',
    type=int,
    default=0,
    help='Specifies how many worker roles')
parser.add_argument(
    '--num_servers',
    type=int,
    default=0,
    help='Specifies how many server roles')
parser.add_argument(
    '--scheduler_host',
    type=str,
    default='10.0.0.4',
    help='Specifies the IP of the scheduler')
FLAGS, unparsed = parser.parse_known_args()

#
# See https://mxnet.incubator.apache.org/faq/distributed_training.html
#

mpi_comm = MPI.COMM_WORLD
mpi_rank = mpi_comm.Get_rank()
if mpi_rank == 0 :
    # Rank 0 is scheduler
    os.environ['DMLC_ROLE'] = 'scheduler'
elif mpi_rank <= FLAGS.num_servers :
    # Rank 1, ..., FLAGS.num_servers is server
    os.environ['DMLC_ROLE'] = 'server'
else :
    # All remaining ranks are workers (their count must equal FLAGS.num_workers)
    os.environ['DMLC_ROLE'] = 'worker'
os.environ['DMLC_PS_ROOT_URI'] = FLAGS.scheduler_host
os.environ['DMLC_PS_ROOT_PORT'] = '9092'
os.environ['DMLC_NUM_WORKER'] = str(FLAGS.num_workers)
os.environ['DMLC_NUM_SERVER'] = str(FLAGS.num_servers)

#
# Run previous script !
#
import mnist_distributed

Writing script/start_mx_role.py
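
The rank-to-role mapping used by the entry script can be checked locally without MPI. The following is a minimal sketch that mirrors the `if`/`elif`/`else` branches above; the function name `role_for_rank` is hypothetical and not part of the script.

```python
def role_for_rank(mpi_rank, num_servers):
    """Mirror the role selection in start_mx_role.py for a given MPI rank."""
    if mpi_rank == 0:
        return 'scheduler'   # rank 0 is the scheduler
    if mpi_rank <= num_servers:
        return 'server'      # ranks 1..num_servers are parameter servers
    return 'worker'          # every remaining rank is a worker

# 4 MPI processes with 1 server, as configured in this notebook.
roles = [role_for_rank(rank, num_servers=1) for rank in range(4)]
print(roles)
```

The mapping only yields the intended layout when the number of MPI processes equals 1 + `num_servers` + `num_workers`.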


## 3. Connect to Azure Machine Learning (Create AML config)

Connect to the Azure Machine Learning (AML) workspace, which is the resource created above.<br>
Fill in the workspace name, subscription ID, and resource group name below. (You can find these values on the AML resource blade in the Azure Portal.)

In [4]:
from azureml.core import Workspace
import azureml.core

ws = Workspace(
  workspace_name = "{AML WORKSPACE NAME}",
  subscription_id = "{SUBSCRIPTION ID}",
  resource_group = "{RESOURCE GROUP NAME}")

Performing interactive authentication. Please follow the instructions on the terminal.
To sign in, use a web browser to open the page https://microsoft.com/devicelogin and enter the code R448ZPH6H to authenticate.
You have logged in. Now let us find all the subscriptions to which you have access...
Interactive authentication successfully completed.


## 4. Create cluster (multiple nodes)

Create a remote cluster of 4 GPU VM nodes: scheduler, parameter server, and two workers.

Before running a GPU cluster in Azure Machine Learning, **check the following**.

- Your Azure subscription needs quota for a dedicated ML GPU cluster. If you don't have any, request quota in the Azure Portal.
- Set ```vm_size``` and ```location``` below to a GPU VM size and region that you can use.

In [5]:
from azureml.core import Workspace
import azureml.core
from azureml.core.compute import AmlCompute, ComputeTarget
from azureml.core.compute_target import ComputeTargetException
 
# Create AML compute (or Get existing one)
# (Total 4 : scheduler, server, worker1, worker2)
try:
    compute_target = ComputeTarget(workspace=ws, name='cluster01')
    print('found existing:', compute_target.name)
except ComputeTargetException:
    print('creating new.')
    compute_config = AmlCompute.provisioning_configuration(
        vm_size='Standard_NC4as_T4_v3',
        min_nodes=4,
        max_nodes=4,
        location="eastus")
    compute_target = ComputeTarget.create(ws, 'cluster01', compute_config)
    compute_target.wait_for_completion(show_output=True)

creating new.
InProgress......
SucceededProvisioning operation finished, operation "Succeeded"
Succeeded..........................
AmlCompute wait for completion finished

Minimum number of nodes requested have been provisioned


## 5. Generate config for run

Generate a script run configuration in AML.<br>
Here we use a custom container image in which Open MPI is installed and configured. (See [here](https://tsmatz.wordpress.com/2019/01/17/azure-machine-learning-service-custom-amlcompute-and-runconfig-for-mxnet-distributed-training/) for details.)

The ```$AZ_BATCH_MASTER_NODE``` environment variable identifies the master node. The ```cut``` command in the arguments below keeps only the part before the first ":", which is the master node's IP, such as "10.0.0.4". (Alternatively, use ```$AZ_BATCHAI_MPI_MASTER_NODE```.)
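
For readers less familiar with `cut`, the extraction performed by `cut -d ":" -f 1` can be sketched in Python as follows. The function name `scheduler_host` is hypothetical, and the sample value `10.0.0.4:6000` (including the port) is an assumed illustration of a colon-delimited value, not one taken from a real run.

```python
def scheduler_host(master_node):
    """Return the text before the first ':' (the field that `cut -d ":" -f 1` selects)."""
    return master_node.split(':', 1)[0]

# Assumed example value; the port number here is made up for illustration.
print(scheduler_host('10.0.0.4:6000'))
```

Like `cut`, this returns the whole string unchanged when it contains no colon.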

In [6]:
from azureml.core.conda_dependencies import CondaDependencies
from azureml.core.environment import Environment
from azureml.core import Run, ScriptRunConfig
from azureml.core.runconfig import DockerConfiguration, MpiConfiguration

# create environment
env = Environment('test-mxnet-gpu-env')
conda_dep = CondaDependencies.create()
conda_dep.add_pip_package('mxnet-cu90')
conda_dep.add_pip_package('mpi4py')
env.python.conda_dependencies = conda_dep
env.docker.base_image = 'tsmatz/azureml-openmpi:0.1.0-gpu'

# register environment to re-use later
env.register(workspace=ws)
## # speed up by using the existing environment
## env = Environment.get(ws, name='test-mxnet-gpu-env')

# create script run config
src = ScriptRunConfig(
    source_directory='./script',
    script='start_mx_role.py',
    arguments=[
        '--num_workers', 2,
        '--num_servers', 1,
        '--scheduler_host', '`cut -d ":" -f 1 <<< $AZ_BATCH_MASTER_NODE`'], 
    compute_target=compute_target,
    environment=env,
    docker_runtime_config=DockerConfiguration(use_docker=True),
    distributed_job_config=MpiConfiguration(process_count_per_node=1, node_count=4))
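
Because the entry script assigns roles by MPI rank, `node_count` has to match the role counts passed in `arguments`. The following is a minimal sketch of that sanity check, assuming one process per node as configured above; the helper name `expected_node_count` is hypothetical.

```python
def expected_node_count(num_workers, num_servers):
    """Nodes needed with one process per node: 1 scheduler + servers + workers."""
    return 1 + num_servers + num_workers

# Values used in this notebook: 2 workers, 1 server, node_count=4.
node_count = 4
needed = expected_node_count(num_workers=2, num_servers=1)
if needed != node_count:
    raise ValueError(f'node_count is {node_count}, but the roles need {needed} nodes')
print(needed)
```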

## 6. Run !

In [7]:
from azureml.core import Experiment
exp = Experiment(workspace=ws, name='mnist_mxnet_distributed')
run = exp.submit(config=src)
run.wait_for_completion(show_output=True)

RunId: mnist_mxnet_distributed_1626666873_c49459f4
Web View: https://ml.azure.com/runs/mnist_mxnet_distributed_1626666873_c49459f4?wsid=/subscriptions/b3ae1c15-4fef-4362-8c3a-5d804cdeb18d/resourcegroups/TEST20210719/workspaces/ws01&tid=72f988bf-86f1-41af-91ab-2d7cd011db47

Streaming azureml-logs/55_azureml-execution-tvmps_21fb42f5e7a0d9d5743a41c31bda5cd6ef0e8bcd1a6672e91d3e9a44a3a79658_d.txt

2021-07-19T03:54:53Z Successfully mounted a/an Blobfuse File System at /mnt/batch/tasks/shared/LS_root/jobs/ws01/azureml/mnist_mxnet_distributed_1626666873_c49459f4/mounts/workspaceblobstore
2021-07-19T03:54:53Z Failed to start nvidia-fabricmanager due to exit status 5 with output Failed to start nvidia-fabricmanager.service: Unit nvidia-fabricmanager.service not found.
. Please ignore this if the GPUs don't utilize NVIDIA® NVLink® switches.
2021-07-19T03:54:53Z Starting output-watcher...
2021-07-19T03:54:53Z IsDedicatedCompute == True, won't poll for Low Pri Preemption
Login Succeeded
Using defau


Streaming azureml-logs/70_driver_log_0.txt

[2021-07-19T03:55:02.154394] Entering context manager injector.
[2021-07-19T03:55:02.617227] context_manager_injector.py Command line Options: Namespace(inject=['ProjectPythonPath:context_managers.ProjectPythonPath', 'RunHistory:context_managers.RunHistory', 'TrackUserError:context_managers.TrackUserError', 'UserExceptions:context_managers.UserExceptions'], invocation=['start_mx_role.py', '--num_workers', '2', '--num_servers', '1', '--scheduler_host', '10.0.0.5'])
This is an MPI job. Rank:0
Script type = None
[2021-07-19T03:55:02.620903] Entering Run History Context Manager.
[2021-07-19T03:55:03.280680] Current directory: /mnt/batch/tasks/shared/LS_root/jobs/ws01/azureml/mnist_mxnet_distributed_1626666873_c49459f4/wd/azureml/mnist_mxnet_distributed_1626666873_c49459f4
[2021-07-19T03:55:03.280921] Preparing to call script [start_mx_role.py] with arguments:['--num_workers', '2', '--num_servers', '1', '--scheduler_host', '10.0.0.5']
[2021-07-19

{'runId': 'mnist_mxnet_distributed_1626666873_c49459f4',
 'target': 'cluster01',
 'status': 'Completed',
 'startTimeUtc': '2021-07-19T03:54:48.930013Z',
 'endTimeUtc': '2021-07-19T03:56:25.697589Z',
 'properties': {'_azureml.ComputeTargetType': 'amlcompute',
  'ContentSnapshotId': 'a656e20f-3a58-4752-88ba-585ad7e71ac1',
  'azureml.git.repository_uri': 'https://github.com/tsmatz/azureml-samples.git',
  'mlflow.source.git.repoURL': 'https://github.com/tsmatz/azureml-samples.git',
  'azureml.git.branch': 'master',
  'mlflow.source.git.branch': 'master',
  'azureml.git.commit': '2cfb88db961caedb534362729bc36ba0189a4380',
  'mlflow.source.git.commit': '2cfb88db961caedb534362729bc36ba0189a4380',
  'azureml.git.dirty': 'True',
  'ProcessInfoFile': 'azureml-logs/process_info.json',
  'ProcessStatusFile': 'azureml-logs/process_status.json',
  'azureml.RuntimeType': 'Hosttools'},
 'inputDatasets': [],
 'outputDatasets': [],
 'runDefinition': {'script': 'start_mx_role.py',
  'command': '',
  'use

## 7. See the results

Let's look at the results. They are all kept in the Azure Machine Learning experiment's logs and outputs.<br>
The ```outputs``` folder contains the model generated by MXNet (both ```outputs/test-0001.params``` and ```outputs/test-symbol.json```).

In [8]:
# You can see and download results (test-symbol.json, test-0001.params).
run.get_file_names()

['azureml-logs/55_azureml-execution-tvmps_21fb42f5e7a0d9d5743a41c31bda5cd6ef0e8bcd1a6672e91d3e9a44a3a79658_d.txt',
 'azureml-logs/55_azureml-execution-tvmps_4f6aea116ce43cf7eec456d73564f080b26058c117e597f94b7545cc953cf93e_d.txt',
 'azureml-logs/55_azureml-execution-tvmps_b6f63e9eb0f173c6ae1e249728ea934f5a022453bd47eaf1246b5528eca6deb8_d.txt',
 'azureml-logs/55_azureml-execution-tvmps_ddb28de6ba3497cec85dbdfc2d44933be8933d6cde5411fbab69b055558f504b_d.txt',
 'azureml-logs/65_job_prep-tvmps_21fb42f5e7a0d9d5743a41c31bda5cd6ef0e8bcd1a6672e91d3e9a44a3a79658_d.txt',
 'azureml-logs/65_job_prep-tvmps_4f6aea116ce43cf7eec456d73564f080b26058c117e597f94b7545cc953cf93e_d.txt',
 'azureml-logs/65_job_prep-tvmps_b6f63e9eb0f173c6ae1e249728ea934f5a022453bd47eaf1246b5528eca6deb8_d.txt',
 'azureml-logs/65_job_prep-tvmps_ddb28de6ba3497cec85dbdfc2d44933be8933d6cde5411fbab69b055558f504b_d.txt',
 'azureml-logs/70_driver_log_0.txt',
 'azureml-logs/70_driver_log_1.txt',
 'azureml-logs/70_driver_log_2.txt',
 'azu

To see the validation results printed by the workers (see the source code above), download the driver logs for rank 2 and rank 3.

In [9]:
run.download_file(
    name='azureml-logs/70_driver_log_2.txt',
    output_file_path='remote_logs/70_driver_log_2.txt')
run.download_file(
    name='azureml-logs/70_driver_log_3.txt',
    output_file_path='remote_logs/70_driver_log_3.txt')

In [10]:
!tail -n 15 remote_logs/70_driver_log_2.txt remote_logs/70_driver_log_3.txt

==> remote_logs/70_driver_log_2.txt <==
Downloading /root/.mxnet/datasets/mnist/train-labels-idx1-ubyte.gz from https://apache-mxnet.s3-accelerate.dualstack.amazonaws.com/gluon/dataset/mnist/train-labels-idx1-ubyte.gz...
Downloading /root/.mxnet/datasets/mnist/t10k-images-idx3-ubyte.gz from https://apache-mxnet.s3-accelerate.dualstack.amazonaws.com/gluon/dataset/mnist/t10k-images-idx3-ubyte.gz...
Downloading /root/.mxnet/datasets/mnist/t10k-labels-idx1-ubyte.gz from https://apache-mxnet.s3-accelerate.dualstack.amazonaws.com/gluon/dataset/mnist/t10k-labels-idx1-ubyte.gz...
Epoch 0: Accuracy 0.937700
Epoch 1: Accuracy 0.955500
Epoch 2: Accuracy 0.963900
Epoch 3: Accuracy 0.968300
Epoch 4: Accuracy 0.971400


[2021-07-19T03:56:07.820678] The experiment completed successfully. Finalizing run...
Cleaning up all outstanding Run operations, waiting 900.0 seconds
1 items cleaning up...
Cleanup took 0.07885503768920898 seconds
[2021-07-19T03:56:08.024354] Finished context manager
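
If you want the accuracies as numbers rather than log text, a small parser is enough. The following is a minimal sketch using only the standard library. It assumes the `Epoch %d: Accuracy %f` format printed by the training script; the function name `parse_accuracies` is hypothetical, and the sample text is copied from the log output above.

```python
import re

# Matches lines such as "Epoch 0: Accuracy 0.937700".
EPOCH_LINE = re.compile(r'^Epoch (\d+): Accuracy ([0-9.]+)$')

def parse_accuracies(log_text):
    """Return {epoch: accuracy} for every matching line in the log text."""
    results = {}
    for line in log_text.splitlines():
        match = EPOCH_LINE.match(line.strip())
        if match:
            results[int(match.group(1))] = float(match.group(2))
    return results

sample = """Epoch 0: Accuracy 0.937700
Epoch 1: Accuracy 0.955500
[2021-07-19T03:56:07.820678] The experiment completed successfully. Finalizing run...
"""
accuracies = parse_accuracies(sample)
print(accuracies)
```

To apply it to a downloaded file, read the file (for example `remote_logs/70_driver_log_2.txt`) into a string and pass that string to `parse_accuracies`.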

## 8. Remove cluster (Clean-up)

In [11]:
# Delete cluster (nodes) in AML workspace
mycompute = AmlCompute(workspace=ws, name='cluster01')
mycompute.delete()