# Exercise05 : Distributed Training with Curated Environments

Here we modify our sample (see "[Exercise03 : Just Train in Your Working Machine](./exercise03_train_simple.ipynb)") to run distributed training on multiple machines in Azure Machine Learning.

In this exercise, we use the Horovod framework in an AML built-in (curated) environment. (As you saw in the previous [Exercise04](./exercise04_train_remote.ipynb), you can also run distributed training with a manually configured custom environment.)

In this example, we use multiple machines, but you can also configure Horovod distributed training to run on multiple devices (such as multiple GPUs).

*back to [index](https://github.com/tsmatz/azureml-tutorial/)*

## Variable Settings

Replace the bracketed strings below to set the required variables.

> Note : By running the following ```az configure --defaults``` command, you can omit the ```--resource-group``` and ```--workspace-name``` options in each ```az ml``` command.<br>
> ```az configure --defaults group=$my_resource_group workspace=$my_workspace```

In [1]:
my_resource_group = "{AML-RESOURCE-GROUP-NAME}"
my_workspace = "{AML-WORKSPACE-NAME}"
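
Before running the CLI cells, it can help to confirm that the placeholders were actually replaced. The following is a minimal sketch, not part of the original exercise; the helper name `is_placeholder` is hypothetical, and `rg-AML` is only an example value.

```python
import re

def is_placeholder(value: str) -> bool:
    """Return True if the value still looks like an unreplaced "{...}" placeholder."""
    return re.fullmatch(r"\{[^{}]*\}", value.strip()) is not None

# Example values for illustration only
examples = {
    "replaced": "rg-AML",
    "not replaced": "{AML-RESOURCE-GROUP-NAME}",
}
for label, value in examples.items():
    status = "still a placeholder" if is_placeholder(value) else "ok"
    print("{} ({}): {}".format(value, label, status))
```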

## Save your training script as file (train.py)

Create the ```script``` directory.

In [2]:
import os
script_folder = './script'
os.makedirs(script_folder, exist_ok=True)

Change our original source code ```train.py``` (see "[Exercise03 : Just Train in Your Working Machine](./exercise03_train_simple.ipynb)") as follows. (The lines marked with the comment "##### modified" are the modified lines.)<br>
This source code will then be saved as ```./script/train_horovod.py```.

In [3]:
%%writefile script/train_horovod.py
import os
import argparse
import tensorflow as tf

import horovod.tensorflow.keras as hvd ##### modified

# device test
print("##### List of available GPU #####")
print(tf.config.list_physical_devices("GPU"))

# parse arguments
parser = argparse.ArgumentParser()
parser.add_argument(
    "--data_folder",
    type=str,
    default="./data/train",
    help="Folder path for input data")
parser.add_argument(
    "--model_folder",
    type=str,
    default="./outputs",  # AML experiments outputs folder
    help="Folder path for model output")
parser.add_argument(
    "--learning_rate",
    type=float,
    default=0.001,
    help="Learning Rate")
parser.add_argument(
    "--first_layer",
    type=int,
    default=128,
    help="Neuron number for the first hidden layer")
parser.add_argument(
    "--second_layer",
    type=int,
    default=64,
    help="Neuron number for the second hidden layer")
parser.add_argument(
    "--epochs_num",
    type=int,
    default=6,
    help="Number of epochs")
FLAGS, unparsed = parser.parse_known_args()

hvd.init() ##### modified

# Horovod config output
print("##### Horovod config #####")
print("Size {}".format(hvd.size()))
print("Rank {}".format(hvd.rank()))

# build model
model = tf.keras.models.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(FLAGS.first_layer, activation="relu"),
    tf.keras.layers.Dense(FLAGS.second_layer, activation="relu"),
    tf.keras.layers.Dense(10)
])
opt = tf.keras.optimizers.Adam(FLAGS.learning_rate)
opt = hvd.DistributedOptimizer(opt) ##### modified
model.compile(
    optimizer=opt,
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=[tf.keras.metrics.SparseCategoricalAccuracy()],
)

# run training
train_data = tf.data.experimental.load(FLAGS.data_folder)
model.fit(
    train_data.shuffle(1000).batch(128).prefetch(tf.data.AUTOTUNE),
    callbacks=[hvd.callbacks.BroadcastGlobalVariablesCallback(0)],  ##### modified
    epochs=FLAGS.epochs_num
)

# save model and variables
if hvd.rank() == 0 : ##### modified
    model_path = os.path.join(FLAGS.model_folder, "mnist_tf_model")
    model.save(model_path)
    print("current working directory : ", os.getcwd())
    print("model folder : ", model_path)

Writing script/train_horovod.py
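
Conceptually, `hvd.DistributedOptimizer` averages the gradients computed by each worker before the update is applied, and `BroadcastGlobalVariablesCallback(0)` copies the initial variables of rank 0 to all other workers so that every replica starts from the same state. The following pure-Python sketch only simulates these two ideas in a single process. It does not call Horovod, and the function names and numbers are hypothetical.

```python
from typing import List

def allreduce_average(per_worker_grads: List[List[float]]) -> List[float]:
    """Average gradients element-wise across workers (the result of an averaging allreduce)."""
    num_workers = len(per_worker_grads)
    return [sum(values) / num_workers for values in zip(*per_worker_grads)]

def broadcast_from_root(per_worker_weights: List[List[float]], root: int = 0) -> List[List[float]]:
    """Give every worker a copy of the root worker's weights."""
    return [list(per_worker_weights[root]) for _ in per_worker_weights]

# Three simulated workers, each starting from different initial weights
weights = [[0.5, -1.0], [0.1, 0.2], [0.9, 0.3]]
weights = broadcast_from_root(weights, root=0)  # all replicas now match rank 0

# Each worker computes its own gradient on its own mini-batch
grads = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
avg_grad = allreduce_average(grads)

# Every worker applies the same averaged gradient, so the replicas stay in sync
learning_rate = 0.5
weights = [[w - learning_rate * g for w, g in zip(ws, avg_grad)] for ws in weights]
print("averaged gradient:", avg_grad)
print("weights per worker:", weights)
```

Since all replicas hold identical weights after each step, it is enough to save the model on a single rank, which is why the script saves only when `hvd.rank() == 0`.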


## Train on multiple machines (Horovod)

Now let's start to integrate with AML and automate distributed training on remote virtual machines.

### Step 1 : Create multiple virtual machines (cluster)

Create a new AML compute cluster for distributed training. By enabling auto-scaling from 0 to 3 nodes, you can scale out distributed workloads and also save money, because all nodes are terminated when the cluster is inactive.

> Note : By setting an appropriate duration in the ```--idle-time-before-scale-down``` option, you can keep the nodes from scaling down right after the training has finished. (Otherwise, the cluster scales down 120 seconds after the training has finished, and the next training will be slow to start because of cluster resizing.)
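
For instance, the idle-time option could be added to the cluster-creation command as sketched below. This sketch only assembles the argument list in Python for inspection and does not call Azure. The 1800-second value, the helper name, and the example resource names are assumptions chosen for illustration.

```python
import shlex
from typing import List

def build_compute_create_args(name: str, resource_group: str, workspace: str,
                              idle_seconds: int = 1800) -> List[str]:
    """Assemble an `az ml compute create` argument list with a longer idle time."""
    return [
        "az", "ml", "compute", "create",
        "--name", name,
        "--resource-group", resource_group,
        "--workspace-name", workspace,
        "--type", "amlcompute",
        "--min-instances", "0",
        "--max-instances", "3",
        "--size", "Standard_D2_v2",
        # assumed value: keep idle nodes for 30 minutes instead of the 120-second default
        "--idle-time-before-scale-down", str(idle_seconds),
    ]

args = build_compute_create_args("mycluster01", "rg-AML", "ws01")
print(shlex.join(args))  # shell-ready command line
```

The resulting list can be passed to `subprocess.run(args, check=True)`, or the printed command line can be pasted after `!` in a notebook cell.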

In [4]:
!az ml compute create --name mycluster01 \
  --resource-group $my_resource_group \
  --workspace-name $my_workspace \
  --type amlcompute \
  --min-instances 0 \
  --max-instances 3 \
  --size Standard_D2_v2

{
  "id": "/subscriptions/b3ae1c15-4fef-4362-8c3a-5d804cdeb18d/resourceGroups/rg-AML/providers/Microsoft.MachineLearningServices/workspaces/ws01/computes/mycluster01",
  "idle_time_before_scale_down": 120,
  "location": "eastus",
  "max_instances": 3,
  "min_instances": 0,
  "name": "mycluster01",
  "network_settings": {},
  "provisioning_state": "Succeeded",
  "resourceGroup": "rg-AML",
  "size": "STANDARD_D2_V2",
  "ssh_public_access_enabled": true,
  "tier": "dedicated",
  "type": "amlcompute"
}

### Step 2 : Submit a training job

Submit a training job with the above compute.<br>
In this training, the job will be distributed across 3 nodes.

Horovod (with TensorFlow) 0.23.0 is installed in this built-in image, ```AzureML-tensorflow-2.7-ubuntu20.04-py38-cuda11-gpu```.

In this example, I also mount the registered data asset named ```mnist_data``` on the compute target. (Run "[Exercise02 : Prepare Data](./exercise02_prepare_data.ipynb)" for data preparation.)

> Note : In this example, I use the built-in GPU environment (```AzureML-tensorflow-2.7-ubuntu20.04-py38-cuda11-gpu```) on a CPU cluster. If no GPU is available, the job still runs correctly on CPU.<br>
> If you prefer a CPU image, you can also create and configure your own image. (See [Exercise04](./exercise04_train_remote.ipynb).)

In [5]:
%%writefile 05_mnist_distributed_job.yml
$schema: https://azuremlschemas.azureedge.net/latest/commandJob.schema.json
code: script
command: >-
  python train_horovod.py
  --data_folder ${{inputs.mnist_tf}}/train
inputs:
  mnist_tf:
    type: uri_folder
    path: azureml:mnist_data@latest
environment: azureml:AzureML-tensorflow-2.7-ubuntu20.04-py38-cuda11-gpu@latest
compute: azureml:mycluster01
resources:
  instance_count: 3
distribution:
  type: mpi
  process_count_per_instance: 1
display_name: tf_distribued
experiment_name: tf_distribued
description: This is example

Writing 05_mnist_distributed_job.yml
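
With the `mpi` distribution type, the total number of worker processes is `instance_count` multiplied by `process_count_per_instance`, and this is the value that `hvd.size()` prints in the training script. A minimal sketch of that arithmetic follows; the helper name is hypothetical.

```python
def world_size(instance_count: int, process_count_per_instance: int) -> int:
    """Total number of worker processes launched for the job."""
    return instance_count * process_count_per_instance

# Values taken from the job YAML above
size = world_size(instance_count=3, process_count_per_instance=1)
print("Size {}".format(size))
print("Ranks {}".format(list(range(size))))
```

On multi-GPU nodes, `process_count_per_instance` is typically set to the number of GPUs per node, so that each process drives one device.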


Now let's submit the job with the AML CLI.<br>
See the progress and results in the job view on [AML Studio](https://ml.azure.com/).

In [6]:
!az ml job create --file 05_mnist_distributed_job.yml \
  --resource-group $my_resource_group \
  --workspace-name $my_workspace

Uploading script (0.01 MBs): 100%|██████| 6861/6861 [00:00<00:00, 149916.75it/s]

{
  "code": "azureml:/subscriptions/b3ae1c15-4fef-4362-8c3a-5d804cdeb18d/resourceGroups/rg-AML/providers/Microsoft.MachineLearningServices/workspaces/ws01/codes/7c6cb5a7-40e7-4e84-9627-a794d3e2e9a5/versions/1",
  "command": "python train_horovod.py --data_folder ${{inputs.mnist_tf}}/train",
  "compute": "azureml:mycluster01",
  "creation_context": {
    "created_at": "2022-10-04T05:29:36.109492+00:00",
    "created_by": "Tsuyoshi Matsuzaki",
    "created_by_type": "User"
  },
  "description": "This is example",
  "display_name": "tf_distribued",
  "distribution": {
    "process_count_per_instance": 1,
    "type": "mpi"
  },
  "environment": "azureml:AzureML-tensorflow-2.7-ubuntu20.04-py38-cuda11-gpu:23",
  "environment_variables": {},
  "experiment_name": "tf_distribued",
  "id": "azureml:/subscriptions/b3ae1c15-4fef-4362-8c3a-5d804cdeb18d/resourceGroups/rg-AML/providers/Microsoft.MachineLe

You can show the progress and result with the following CLI command.<br>
(**Replace ```nice_stick_z0qlkx99sm``` with your generated job name**. To find the job name, see the output above.)
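
In the job JSON, the `id` field ends with `/jobs/<job name>`, so the name can also be extracted programmatically. The following is a minimal sketch, assuming the JSON text printed by the CLI has been captured into a string. The helper name is hypothetical, and the sample below is shortened with a dummy subscription ID.

```python
import json

def job_name_from_output(cli_json: str) -> str:
    """Return the job name, i.e. the last path segment of the job's "id" field."""
    job = json.loads(cli_json)
    return job["id"].rstrip("/").split("/")[-1]

# Shortened sample for illustration (dummy subscription ID)
sample = json.dumps({
    "display_name": "tf_distribued",
    "id": "azureml:/subscriptions/0000/resourceGroups/rg-AML/providers/"
          "Microsoft.MachineLearningServices/workspaces/ws01/jobs/nice_stick_z0qlkx99sm",
})
job_name = job_name_from_output(sample)
print(job_name)
```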

In [7]:
job_name = "nice_stick_z0qlkx99sm"

In [8]:
!az ml job show --name $job_name \
  --resource-group $my_resource_group \
  --workspace-name $my_workspace

{
  "code": "azureml:/subscriptions/b3ae1c15-4fef-4362-8c3a-5d804cdeb18d/resourceGroups/rg-AML/providers/Microsoft.MachineLearningServices/workspaces/ws01/codes/7c6cb5a7-40e7-4e84-9627-a794d3e2e9a5/versions/1",
  "command": "python train_horovod.py --data_folder ${{inputs.mnist_tf}}/train",
  "compute": "azureml:mycluster01",
  "creation_context": {
    "created_at": "2022-10-04T05:29:36.109492+00:00",
    "created_by": "Tsuyoshi Matsuzaki",
    "created_by_type": "User"
  },
  "description": "This is example",
  "display_name": "tf_distribued",
  "distribution": {
    "process_count_per_instance": 1,
    "type": "mpi"
  },
  "environment": "azureml:AzureML-tensorflow-2.7-ubuntu20.04-py38-cuda11-gpu:23",
  "environment_variables": {},
  "experiment_name": "tf_distribued",
  "id": "azureml:/subscriptions/b3ae1c15-4fef-4362-8c3a-5d804cdeb18d/resourceGroups/rg-AML/providers/Microsoft.MachineLearningServices/workspaces/ws01/jobs/nice_stick_z0qlkx99sm",
  "inputs": {
   

### Step 3 : Download results and check

Check the generated model on your local computer.

By running the following ```az ml job download``` command, the logs and outputs are downloaded to your local computer.<br>
The logs are saved in ```artifacts/logs``` and the outputs in ```artifacts/outputs```.
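
A quick way to confirm what was downloaded is to walk the `artifacts` folder. The following is a minimal sketch using only the standard library. The helper name is hypothetical, and the demonstration builds a throwaway directory with stand-in file names so that it runs without an actual download.

```python
import tempfile
from pathlib import Path
from typing import List

def list_artifacts(root: str) -> List[str]:
    """Return all file paths under root, relative to root, in sorted order."""
    base = Path(root)
    return sorted(p.relative_to(base).as_posix() for p in base.rglob("*") if p.is_file())

# Demonstration with a temporary stand-in for ./artifacts (file names are placeholders)
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "logs").mkdir()
    (Path(tmp) / "outputs" / "mnist_tf_model").mkdir(parents=True)
    (Path(tmp) / "logs" / "example.log").write_text("log")
    (Path(tmp) / "outputs" / "mnist_tf_model" / "saved_model.pb").write_text("model")
    found = list_artifacts(tmp)
    print(found)
```

After the real download, calling `list_artifacts("./artifacts")` would list the log and model files in the same way.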

In [9]:
!az ml job download --name $job_name \
  --resource-group $my_resource_group \
  --workspace-name $my_workspace

Downloading artifact azureml://datastores/workspaceartifactstore/ExperimentRun/dcid.nice_stick_z0qlkx99sm to /home/tsmatsuz/cli_yaml/artifacts

Now check the downloaded result.

In [10]:
import tensorflow as tf

test_data = tf.data.Dataset.load("./data/test")

loaded_model = tf.keras.models.load_model("./artifacts/outputs/mnist_tf_model")
for image, true_value in test_data.take(3):
    pred_output = loaded_model(tf.expand_dims(image, axis=0))
    pred_value = tf.math.argmax(pred_output, axis=-1).numpy().item()
    print("Predicted {}, True {}".format(pred_value, true_value))

2022-10-04 05:58:44.192520: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-10-04 05:58:44.334760: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
2022-10-04 05:58:44.334790: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
2022-10-04 05:58:44.371487: E tensorflow/stream_executor/cuda/cuda_blas.cc:2981] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2022-10-04 05:58:45.203129: W tensorflow/stream_executor/platform/de

Predicted 7, True 7
Predicted 1, True 2
Predicted 1, True 1
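
Because the last layer is `Dense(10)` without an activation (the loss is configured with `from_logits=True`), the model returns raw logits, and `argmax` picks the index of the largest one as the predicted digit. The following pure-Python sketch mirrors that step and computes the accuracy over the three (predicted, true) pairs printed above. The logits are made-up values for illustration, and the helper names are hypothetical.

```python
from typing import List, Sequence, Tuple

def argmax(logits: Sequence[float]) -> int:
    """Index of the largest logit, i.e. the predicted class."""
    return max(range(len(logits)), key=lambda i: logits[i])

def accuracy(pairs: List[Tuple[int, int]]) -> float:
    """Fraction of (predicted, true) pairs that match."""
    return sum(pred == true for pred, true in pairs) / len(pairs)

# Made-up logits for one image: class 7 has the largest score
example_logits = [-2.1, 0.3, -0.5, 1.2, -3.0, 0.0, -1.7, 6.4, 0.9, 2.2]
print("Predicted {}".format(argmax(example_logits)))

# The three results shown in the cell output
pairs = [(7, 7), (1, 2), (1, 1)]
print("Accuracy {:.2f}".format(accuracy(pairs)))
```

Three samples are far too few to judge the model; evaluating over the whole test set gives a meaningful accuracy.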


### Step 4 : Remove AML compute

**You don't need to remove your AML compute** to save money, because the nodes are automatically terminated when the cluster is inactive.<br>
But if you want to clean up, run the following command.

In [11]:
!az ml compute delete --name mycluster01 \
  --resource-group $my_resource_group \
  --workspace-name $my_workspace \
  --yes

Deleting compute mycluster01 
.......Done.
(0m 36s)
