# MXNet Distributed Training (MNIST Example) in Azure Machine Learning

In this notebook, we run MXNet distributed training in Azure Machine Learning.<br>
When training completes, the compute cluster automatically scales down to 0 instances.
Before running this notebook, complete the following steps.

1. Create a new "Machine Learning" resource in the [Azure Portal](https://portal.azure.com/).
2. Install the Azure Machine Learning CLI v2 on Ubuntu as follows.

```
# install Azure CLI
curl -sL https://aka.ms/InstallAzureCLIDeb | sudo bash
# install AML CLI extension
az extension add --name ml
```

See "[MXNet Distributed Training Example for Azure ML service](https://tsmatz.wordpress.com/2019/01/17/azure-machine-learning-service-custom-amlcompute-and-runconfig-for-mxnet-distributed-training/)" for details.

## 1. Create script for MXNet distributed training (train_mnist.py)

Save a script file (```train_mnist.py```) for MXNet distributed training.

> Note : To debug locally on a single CPU device, use the commented-out lines in place of the distributed (GPU) lines they follow.

In [2]:
import os
script_folder = './script'
os.makedirs(script_folder, exist_ok=True)

In [3]:
%%writefile script/train_mnist.py
import os, random
import argparse
import mxnet as mx
from mxnet import kv, gluon, autograd, nd
from mxnet.gluon import nn

store = kv.create('dist')

parser = argparse.ArgumentParser()
parser.add_argument("--gpus_per_machine",
    type=int,
    required=False,
    default=1,
    help="number of gpus in each worker")
parser.add_argument("--batch_size_per_gpu",
    type=int,
    required=False,
    default=64,
    help="batch size in each gpu")
parser.add_argument("--epoch",
    type=int,
    required=False,
    default=5,
    help="number of epochs")
args = parser.parse_args()

batch_size = args.batch_size_per_gpu * args.gpus_per_machine

ctx = [mx.gpu(i) for i in range(args.gpus_per_machine)]
# ctx = mx.cpu(0)

# Sampler that splits the data among workers
class SplitSampler(gluon.data.sampler.Sampler):
    """
    length: Number of examples in the dataset
    num_parts: Partition the data into multiple parts
    part_index: The index of the part to read from
    """
    def __init__(self, length, num_parts, part_index):
        self.part_len = length // num_parts
        self.start = self.part_len * part_index
        self.end = self.start + self.part_len
    def __iter__(self):
        indices = list(range(self.start, self.end))
        random.shuffle(indices)
        return iter(indices)
    def __len__(self):
        return self.part_len

mx.random.seed(42)
def data_xform(data):
    """Move channel axis to the beginning, cast to float32, and normalize to [0, 1]"""
    return nd.moveaxis(data, 2, 0).astype('float32') / 255
train_data = gluon.data.DataLoader(
    gluon.data.vision.MNIST(train=True).transform_first(data_xform),
    batch_size=batch_size,
    sampler=SplitSampler(59904, store.num_workers, store.rank))
test_data = gluon.data.DataLoader(
    gluon.data.vision.MNIST(train=False).transform_first(data_xform),
    batch_size=batch_size,
    shuffle=False)
# train_data = gluon.data.DataLoader(
#     gluon.data.vision.MNIST(train=True, root='./data').transform_first(data_xform),
#     batch_size=batch_size)
# test_data = gluon.data.DataLoader(
#     gluon.data.vision.MNIST(train=False, root='./data').transform_first(data_xform),
#     batch_size=batch_size,
#     shuffle=False)

net = nn.HybridSequential(prefix='MLP_')
with net.name_scope():
    net.add(
        nn.Flatten(),
        nn.Dense(128, activation='relu'),
        nn.Dense(64, activation='relu'),
        nn.Dense(10, activation=None)
    )

net.hybridize()

net.initialize(mx.init.Xavier(), ctx=ctx)

loss_function = gluon.loss.SoftmaxCrossEntropyLoss()

trainer = gluon.Trainer(
    params=net.collect_params(),
    optimizer='sgd',
    optimizer_params={'learning_rate': 0.07},
    kvstore=store)
# trainer = gluon.Trainer(
#     params=net.collect_params(),
#     optimizer='sgd',
#     optimizer_params={'learning_rate': 0.07},
# )

for epoch in range(args.epoch):
    """ Train ! """
    for batch in train_data:
        inputs = gluon.utils.split_and_load(batch[0], ctx)
        labels = gluon.utils.split_and_load(batch[1], ctx)
        # inputs = batch[0].as_in_context(ctx)
        # labels = batch[1].as_in_context(ctx)
        with autograd.record():
            loss = [loss_function(net(X), Y) for X, Y in zip(inputs, labels)]
            # loss = loss_function(net(inputs), labels)
        for l in loss:
            l.backward()
        # loss.backward()
        trainer.step(batch_size=batch[0].shape[0])
    """ Evaluate and Output ! """
    metric = mx.metric.Accuracy()
    for i, (test_input, test_label) in enumerate(test_data):
        test_input = test_input.as_in_context(ctx[0])
        test_label = test_label.as_in_context(ctx[0])
        # test_input = test_input.as_in_context(ctx)
        # test_label = test_label.as_in_context(ctx)
        test_output = net(test_input)
        test_pred = nd.argmax(test_output, axis=1)
        metric.update(preds=test_pred, labels=test_label)
    print('Epoch %d: Accuracy %f' % (epoch, metric.get()[1]))

""" Save Model (both architecture and parameters) """
if store.rank == 0:
    os.makedirs('./outputs', exist_ok=True)
    net.export('./outputs/test', epoch=1)
# os.makedirs('./outputs', exist_ok=True)
# net.export('./outputs/test', epoch=1)

Writing script-mxnet/train_mnist.py
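
The `SplitSampler` in the script gives each worker a disjoint, contiguous slice of the dataset and shuffles only within that slice. The following is a minimal pure-Python sketch of the same index arithmetic, without MXNet, so it can be run anywhere. The helper name `split_indices` and its `seed` parameter are illustrative assumptions and are not part of the training script.

```python
import random


def split_indices(length, num_parts, part_index, seed=None):
    """Return the shuffled indices one worker would iterate over.

    Mirrors the arithmetic of SplitSampler: each part is a contiguous
    slice of length // num_parts examples, shuffled within the slice.
    """
    part_len = length // num_parts
    start = part_len * part_index
    end = start + part_len
    indices = list(range(start, end))
    random.Random(seed).shuffle(indices)  # seed is only for reproducibility here
    return indices


# Two workers over the 59904 examples used in train_mnist.py
parts = [split_indices(59904, 2, rank, seed=0) for rank in range(2)]
for rank, part in enumerate(parts):
    print(f"worker {rank}: {len(part)} examples, indices {min(part)}..{max(part)}")
# worker 0: 29952 examples, indices 0..29951
# worker 1: 29952 examples, indices 29952..59903
```

With 2 workers, each slice holds 29952 examples, which is exactly 468 batches of 64. This may be why the script passes 59904 rather than the full 60000 training examples, since no worker ends up with a partial batch.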


## 2. Create shell script for setting up mxnet and running training (run.sh)

Create a shell script that starts each role in MXNet distributed training.<br>
Here we run the following 4 nodes. (The parameter servers can also be distributed, but here we use a single parameter server.)

- Rank 0 : Scheduler
- Rank 1 : Parameter Server
- Rank 2 : Worker
- Rank 3 : Worker

> Note : ```$AZ_BATCHAI_MPI_MASTER_NODE``` (used below) is an environment variable that holds the MPI master's host name (for example, ```10.5.0.4```).<br>
> You can also use the following shell command to retrieve the master's host name. (See [here](https://docs.microsoft.com/en-us/azure/machine-learning/how-to-train-distributed-gpu) for the environment variables available in an AML distributed cluster.)<br>
> ```cut -d ":" -f 1 <<< $AZ_BATCH_MASTER_NODE```

In [4]:
%%writefile script/run.sh
# setup role
if [ $OMPI_COMM_WORLD_RANK -eq 0 ]
then
    export DMLC_ROLE=scheduler
elif [ $OMPI_COMM_WORLD_RANK -eq 1 ]
then
    export DMLC_ROLE=server
else
    export DMLC_ROLE=worker
fi
export DMLC_PS_ROOT_URI=$AZ_BATCHAI_MPI_MASTER_NODE
export DMLC_PS_ROOT_PORT=9092
export DMLC_NUM_SERVER=1
export DMLC_NUM_WORKER=2

# run training
python train_mnist.py --gpus_per_machine 1 --batch_size_per_gpu 64 --epoch 5

Writing script-mxnet/run.sh
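
To make the role assignment in `run.sh` easier to follow, here is a small Python sketch of the same mapping from MPI rank to the `DMLC_*` variables. It is only an illustration of the logic: the function name `dmlc_env` is hypothetical, and stripping an optional `:port` suffix from the master address is an assumption modeled on the `cut` command in the note above.

```python
def dmlc_env(rank, master_node, num_servers=1, num_workers=2, port=9092):
    """Return the DMLC_* variables that run.sh exports for a given MPI rank."""
    if rank == 0:
        role = "scheduler"
    elif rank == 1:
        role = "server"
    else:
        role = "worker"
    # Accept either "10.5.0.4" or "10.5.0.4:6000" and keep only the host part.
    host = master_node.split(":")[0]
    return {
        "DMLC_ROLE": role,
        "DMLC_PS_ROOT_URI": host,
        "DMLC_PS_ROOT_PORT": str(port),
        "DMLC_NUM_SERVER": str(num_servers),
        "DMLC_NUM_WORKER": str(num_workers),
    }


for rank in range(4):
    print(rank, dmlc_env(rank, "10.5.0.4")["DMLC_ROLE"])
# 0 scheduler
# 1 server
# 2 worker
# 3 worker
```

Every process receives the same scheduler address and the same server and worker counts. Only `DMLC_ROLE` differs between ranks.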


## 3. Submit Job in Azure Machine Learning

### Prepare for connecting to Azure Machine Learning workspace

Log in to Azure and prepare to connect to your Azure Machine Learning (AML) workspace.<br>
Fill in the subscription ID, AML workspace name, and resource group name below.

In [None]:
!az login

In [None]:
!az account set -s {AZURE_SUBSCRIPTION_ID}

In [None]:
my_resource_group = "{AML_RESOURCE_GROUP_NAME}"
my_workspace = "{AML_WORKSPACE_NAME}"

### Create cluster (multiple nodes)

Create a remote cluster with up to 4 GPU nodes, one for each role: scheduler, parameter server, worker 0, and worker 1.

Before running a GPU cluster in Azure Machine Learning, **check the following**.

- Your Azure subscription needs quota for dedicated ML GPU clusters. If you don't have any, request quota in the Azure Portal.
- Set ```--size``` in the following command to a GPU VM size that you can use.

In [5]:
!az ml compute create --name cluster01 \
  --resource-group $my_resource_group \
  --workspace-name $my_workspace \
  --type amlcompute \
  --min-instances 0 \
  --max-instances 4 \
  --size Standard_NC4as_T4_v3

{
  "id": "/subscriptions/b3ae1c15-4fef-4362-8c3a-5d804cdeb18d/resourceGroups/rg-AML/providers/Microsoft.MachineLearningServices/workspaces/ws01/computes/cluster01",
  "idle_time_before_scale_down": 120,
  "location": "eastus",
  "max_instances": 4,
  "min_instances": 0,
  "name": "cluster01",
  "network_settings": {},
  "provisioning_state": "Succeeded",
  "resourceGroup": "rg-AML",
  "size": "STANDARD_NC6",
  "ssh_public_access_enabled": true,
  "tier": "dedicated",
  "type": "amlcompute"
}

### Create AML environment for MXNet distribution 

Now we prepare a custom environment for running MXNet distributed training.

In [6]:
%%writefile conda_distributed_mxnet.yml
name: mxnet_environment
dependencies:
- python=3.6
- pip:
  - mxnet-cu90
  - mpi4py
channels:
- anaconda
- conda-forge

Writing conda_distributed_mxnet.yml


In [7]:
%%writefile env_distributed_mxnet.yml
$schema: https://azuremlschemas.azureedge.net/latest/environment.schema.json
name: mxnet-distribution-env
image: tsmatz/azureml-openmpi:0.1.0-gpu
conda_file: conda_distributed_mxnet.yml
description: environment for mxnet distribution

Writing env_distributed_mxnet.yml


In [8]:
!az ml environment create --file env_distributed_mxnet.yml \
  --resource-group $my_resource_group \
  --workspace-name $my_workspace

{
  "conda_file": {
    "channels": [
      "anaconda",
      "conda-forge"
    ],
    "dependencies": [
      "python=3.6",
      {
        "pip": [
          "mxnet-cu90",
          "mpi4py"
        ]
      }
    ],
    "name": "mxnet_environment"
  },
  "creation_context": {
    "created_at": "2022-08-24T05:54:33.791916+00:00",
    "created_by": "Tsuyoshi Matsuzaki",
    "created_by_type": "User",
    "last_modified_at": "2022-08-24T05:54:33.791916+00:00",
    "last_modified_by": "Tsuyoshi Matsuzaki",
    "last_modified_by_type": "User"
  },
  "description": "environment for mxnet distribution",
  "id": "azureml:/subscriptions/b3ae1c15-4fef-4362-8c3a-5d804cdeb18d/resourceGroups/rg-AML/providers/Microsoft.MachineLearningServices/workspaces/ws01/environments/mxnet-distribution-env/versions/1",
  "image": "tsmatz/azureml-openmpi:0.1.0-gpu",
  "name": "mxnet-distribution-env",
  "os_type": "linux",
  "resourceGroup": "rg-AML",
  "tags": {},
  "version": "

### Submit Job

Now let's run distributed MXNet training.

> Note : The first run builds a new environment image, so training takes a long time to start. (Once the environment is built and registered, later runs start faster.)

In [9]:
%%writefile train_distributed_mxnet.yml
$schema: https://azuremlschemas.azureedge.net/latest/commandJob.schema.json
code: script
command: bash run.sh
environment: azureml:mxnet-distribution-env@latest
compute: azureml:cluster01
display_name: mxnet_dist_test
experiment_name: mxnet_dist_test
resources:
  instance_count: 4
distribution:
  type: mpi
  process_count_per_instance: 1
description: MXNet distributed training

Writing train_distributed_mxnet.yml
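
The job definition and `run.sh` must agree on the number of processes: the job launches `instance_count` × `process_count_per_instance` MPI processes, while `run.sh` expects one scheduler plus `DMLC_NUM_SERVER` servers plus `DMLC_NUM_WORKER` workers. The following sanity check is a sketch of that arithmetic. The function name `check_process_count` is hypothetical and is not part of the Azure ML CLI or MXNet.

```python
def check_process_count(instance_count, process_count_per_instance,
                        num_servers, num_workers):
    """Raise ValueError if the job launches a different number of processes
    than the scheduler/server/worker layout in run.sh expects."""
    launched = instance_count * process_count_per_instance
    expected = 1 + num_servers + num_workers  # 1 scheduler + servers + workers
    if launched != expected:
        raise ValueError(
            f"job launches {launched} processes, but run.sh expects {expected}"
        )
    return launched


# Values from train_distributed_mxnet.yml and run.sh
print(check_process_count(instance_count=4, process_count_per_instance=1,
                          num_servers=1, num_workers=2))
# 4
```

If you change `instance_count` in the YAML file, `DMLC_NUM_WORKER` (or `DMLC_NUM_SERVER`) in `run.sh` needs to change with it.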


In [10]:
!az ml job create --file train_distributed_mxnet.yml \
  --resource-group $my_resource_group \
  --workspace-name $my_workspace

Uploading script-mxnet (0.0 MBs): 100%|█| 4631/4631 [00:00<00:00, 154985.13it/s]

{
  "code": "azureml:/subscriptions/b3ae1c15-4fef-4362-8c3a-5d804cdeb18d/resourceGroups/rg-AML/providers/Microsoft.MachineLearningServices/workspaces/ws01/codes/e545fbff-b6f3-47be-9a21-7ed75cafc161/versions/1",
  "command": "bash run.sh",
  "compute": "azureml:cluster01",
  "creation_context": {
    "created_at": "2022-08-24T05:55:10.280969+00:00",
    "created_by": "Tsuyoshi Matsuzaki",
    "created_by_type": "User"
  },
  "description": "MXNet distributed training",
  "display_name": "mxnet_dist_test",
  "distribution": {
    "process_count_per_instance": 1,
    "type": "mpi"
  },
  "environment": "azureml:mxnet-distribution-env:1",
  "environment_variables": {},
  "experiment_name": "mxnet_dist_test",
  "id": "azureml:/subscriptions/b3ae1c15-4fef-4362-8c3a-5d804cdeb18d/resourceGroups/rg-AML/providers/Microsoft.MachineLearningServices/workspaces/ws01/jobs/goofy_vulture_qc2j57jmx9",
  "inp

## 4. Check results

Go to [Azure Machine Learning studio](https://ml.azure.com/) and view the job's output artifacts.<br>
The "```outputs```" folder includes a generated model, ```outputs/test-0001.params``` and ```outputs/test-symbol.json```, as follows.

![output's artifacts](./output_artifact.jpg)

To evaluate the model generated by this training, download the results (including the generated model) to your local machine.

In [11]:
job_name = "{FILL_JOB_NAME}"
# Example : job_name = "goofy_vulture_qc2j57jmx9"

In [12]:
!az ml job download --name $job_name \
  --resource-group $my_resource_group \
  --workspace-name $my_workspace \
  --download-path mxnet_training_result

Downloading artifact azureml://datastores/workspaceartifactstore/ExperimentRun/dcid.goofy_vulture_qc2j57jmx9 to mxnet_training_result/artifacts
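
Once the artifacts are downloaded, the exported model files can be located and loaded for evaluation. The sketch below searches the download folder recursively, because the exact sub-folder layout is an assumption here. The helper names `find_exported_model` and `load_exported_model` are hypothetical. Loading relies on `gluon.nn.SymbolBlock.imports` from MXNet 1.x and assumes the exported network's input is named `data`, which is the default for a hybridized Gluon block with a single input.

```python
from pathlib import Path


def find_exported_model(download_dir, prefix="test", epoch=1):
    """Locate the files written by net.export('./outputs/test', epoch=1)."""
    root = Path(download_dir)
    symbol_name = f"{prefix}-symbol.json"
    params_name = f"{prefix}-{epoch:04d}.params"
    symbol = next(root.rglob(symbol_name), None)
    params = next(root.rglob(params_name), None)
    if symbol is None or params is None:
        raise FileNotFoundError(
            f"{symbol_name} or {params_name} not found under {root}"
        )
    return symbol, params


def load_exported_model(symbol_file, params_file):
    """Load the exported network on CPU (requires MXNet to be installed)."""
    # Imported lazily so that the path lookup above works without MXNet.
    import mxnet as mx
    from mxnet import gluon
    return gluon.nn.SymbolBlock.imports(
        str(symbol_file), ["data"], str(params_file), ctx=mx.cpu()
    )


# Example usage (run after the download above has finished):
# symbol_file, params_file = find_exported_model("mxnet_training_result")
# net = load_exported_model(symbol_file, params_file)
```

The loaded block can then be called on a batch shaped like the training input, using the same `data_xform` preprocessing as in `train_mnist.py`.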

Let's check the training output in the worker logs.

In [13]:
!tail -n 15 mxnet_training_result/artifacts/azureml-logs/70_driver_log_2.txt \
  mxnet_training_result/artifacts/azureml-logs/70_driver_log_3.txt

==> mxnet_training_result/artifacts/azureml-logs/70_driver_log_2.txt <==
Downloading /root/.mxnet/datasets/mnist/train-images-idx3-ubyte.gz from https://apache-mxnet.s3-accelerate.dualstack.amazonaws.com/gluon/dataset/mnist/train-images-idx3-ubyte.gz...
Downloading /root/.mxnet/datasets/mnist/train-labels-idx1-ubyte.gz from https://apache-mxnet.s3-accelerate.dualstack.amazonaws.com/gluon/dataset/mnist/train-labels-idx1-ubyte.gz...
Downloading /root/.mxnet/datasets/mnist/t10k-images-idx3-ubyte.gz from https://apache-mxnet.s3-accelerate.dualstack.amazonaws.com/gluon/dataset/mnist/t10k-images-idx3-ubyte.gz...
Downloading /root/.mxnet/datasets/mnist/t10k-labels-idx1-ubyte.gz from https://apache-mxnet.s3-accelerate.dualstack.amazonaws.com/gluon/dataset/mnist/t10k-labels-idx1-ubyte.gz...
Epoch 0: Accuracy 0.938700
Epoch 1: Accuracy 0.951600
Epoch 2: Accuracy 0.961200
Epoch 3: Accuracy 0.966900
Epoch 4: Accuracy 0.970800
[2022-08-24T06:13:41.278757] Command finished with exit code 0
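
If you want to track accuracy programmatically rather than reading the log by eye, the `Epoch N: Accuracy X` lines printed by `train_mnist.py` can be parsed with a regular expression. This is a minimal sketch: the function name `parse_accuracy` is hypothetical, and the sample text is copied from the log output above.

```python
import re

# Matches the lines printed by train_mnist.py: 'Epoch %d: Accuracy %f'
ACCURACY_LINE = re.compile(r"^Epoch (\d+): Accuracy ([0-9.]+)$")


def parse_accuracy(log_text):
    """Return {epoch: accuracy} for every accuracy line found in the log."""
    results = {}
    for line in log_text.splitlines():
        match = ACCURACY_LINE.match(line.strip())
        if match:
            results[int(match.group(1))] = float(match.group(2))
    return results


sample = """Epoch 0: Accuracy 0.938700
Epoch 1: Accuracy 0.951600
Epoch 2: Accuracy 0.961200
Epoch 3: Accuracy 0.966900
Epoch 4: Accuracy 0.970800
[2022-08-24T06:13:41.278757] Command finished with exit code 0"""

history = parse_accuracy(sample)
best = max(history, key=history.get)
print(f"best epoch: {best}, accuracy: {history[best]:.4f}")
# best epoch: 4, accuracy: 0.9708
```

To apply it to a downloaded log, read the file with `open(path).read()` and pass the text to `parse_accuracy`.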

## 5. Remove cluster (Clean-up)

In [15]:
!az ml compute delete --name cluster01 \
  --resource-group $my_resource_group \
  --workspace-name $my_workspace \
  --yes

Deleting compute cluster01 
.................................................................................................................Done.
(9m 30s)