# Exercise05 : Distributed Training with Curated Environments

Here we change our sample (see "[Exercise03 : Just Train in Your Working Machine](./exercise03_train_simple.ipynb)") for distributed training using multiple machines in Azure Machine Learning.

In this exercise, we use the Horovod framework in an AML built-in (curated) environment. (As you saw in the previous [Exercise04](./exercise04_train_remote.ipynb), you can also run distributed training with a manually configured custom environment.)

In this example, we use multiple machines, but you can also configure Horovod distributed training to run on multiple devices (such as multiple GPUs).

> Note : In this example, I manually configure MPI distribution, but you can also use the built-in ```azure.ai.ml.TensorFlowDistribution``` for TensorFlow distribution.

*back to [index](https://github.com/tsmatz/azureml-tutorial/)*

## Initialize MLClient

Replace the bracketed strings below with your subscription ID, resource group name, and AML workspace name.<br>
(Note that creating ```MLClient``` does not connect to the AML workspace; the client initialization is lazy.)

In [1]:
from azure.ai.ml import MLClient
from azure.identity import DeviceCodeCredential

# When you run on remote
cred = DeviceCodeCredential()

# # When you run on Azure ML Notebook
# from azure.identity import DefaultAzureCredential
# cred = DefaultAzureCredential()

# Get a handle to the workspace
ml_client = MLClient(
    credential=cred,
    subscription_id="{SUBSCRIPTION ID}",
    resource_group_name="{RESOURCE GROUP NAME}",
    workspace_name="{AML WORKSPACE NAME}",
)
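
If you prefer not to hard-code the three identifiers, one option is to read them from environment variables and pass the result to ```MLClient```. The following is a minimal sketch under that assumption; the variable names ```AML_SUBSCRIPTION_ID```, ```AML_RESOURCE_GROUP```, and ```AML_WORKSPACE_NAME``` and the helper ```workspace_settings``` are hypothetical and not part of the Azure SDK.

```python
import os

# Hypothetical environment variable names (not defined by the Azure SDK),
# mapped to the keyword arguments that MLClient takes in the cell above.
ENV_TO_KWARG = {
    "AML_SUBSCRIPTION_ID": "subscription_id",
    "AML_RESOURCE_GROUP": "resource_group_name",
    "AML_WORKSPACE_NAME": "workspace_name",
}

def workspace_settings(env=None):
    """Collect MLClient keyword arguments from environment variables."""
    env = os.environ if env is None else env
    missing = [name for name in ENV_TO_KWARG if not env.get(name)]
    if missing:
        raise KeyError("missing environment variables: " + ", ".join(missing))
    return {kwarg: env[name] for name, kwarg in ENV_TO_KWARG.items()}

# Usage (assumes the variables are set and `cred` is defined as above):
# ml_client = MLClient(credential=cred, **workspace_settings())
```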

## Save your training script as file (train.py)

Create the ```script``` directory.

In [2]:
import os
script_folder = './script'
os.makedirs(script_folder, exist_ok=True)

Modify our original source code ```train.py``` (see "[Exercise03 : Just Train in Your Working Machine](./exercise03_train_simple.ipynb)") as follows. (The lines commented with "##### modified" are the modified lines.)<br>
This source code is then saved as ```./script/train_horovod.py```.

In [3]:
%%writefile script/train_horovod.py
import os
import argparse
import tensorflow as tf

import horovod.tensorflow.keras as hvd ##### modified

# device test
print("##### List of available GPU #####")
print(tf.config.list_physical_devices("GPU"))

# parse arguments
parser = argparse.ArgumentParser()
parser.add_argument(
    "--data_folder",
    type=str,
    default="./data/train",
    help="Folder path for input data")
parser.add_argument(
    "--model_folder",
    type=str,
    default="./outputs",  # AML experiments outputs folder
    help="Folder path for model output")
parser.add_argument(
    "--learning_rate",
    type=float,
    default="0.001",
    help="Learning Rate")
parser.add_argument(
    "--first_layer",
    type=int,
    default="128",
    help="Neuron number for the first hidden layer")
parser.add_argument(
    "--second_layer",
    type=int,
    default="64",
    help="Neuron number for the second hidden layer")
parser.add_argument(
    "--epochs_num",
    type=int,
    default="6",
    help="Number of epochs")
FLAGS, unparsed = parser.parse_known_args()

hvd.init() ##### modified

# Horovod config output
print("##### Horovod config #####")
print("Size {}".format(hvd.size()))
print("Rank {}".format(hvd.rank()))

# build model
model = tf.keras.models.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(FLAGS.first_layer, activation="relu"),
    tf.keras.layers.Dense(FLAGS.second_layer, activation="relu"),
    tf.keras.layers.Dense(10)
])
opt = tf.keras.optimizers.Adam(FLAGS.learning_rate)
opt = hvd.DistributedOptimizer(opt) ##### modified
model.compile(
    optimizer=opt,
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=[tf.keras.metrics.SparseCategoricalAccuracy()],
)

# run training
train_data = tf.data.experimental.load(FLAGS.data_folder)
model.fit(
    train_data.shuffle(1000).batch(128).prefetch(tf.data.AUTOTUNE),
    callbacks=[hvd.callbacks.BroadcastGlobalVariablesCallback(0)],  ##### modified
    epochs=FLAGS.epochs_num
)

# save model and variables
if hvd.rank() == 0 : ##### modified
    model_path = os.path.join(FLAGS.model_folder, "mnist_tf_model")
    model.save(model_path)
    print("current working directory : ", os.getcwd())
    print("model folder : ", model_path)

Writing script/train_horovod.py
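
To see what the ```hvd.DistributedOptimizer``` line does for the script above: each worker computes gradients on its own mini-batch, and the gradients are averaged across all workers before the weights are updated, so every worker applies the same update. The following is a plain-Python illustration of that averaging step only, not Horovod's actual implementation (Horovod performs it with an allreduce operation across processes); the function name ```average_gradients``` is hypothetical.

```python
def average_gradients(per_worker_grads):
    """Average gradients element-wise across workers.

    per_worker_grads: one list of floats per worker, all the same length.
    """
    num_workers = len(per_worker_grads)
    return [sum(values) / num_workers for values in zip(*per_worker_grads)]

# Three workers (as in this exercise), each with gradients for two weights.
grads = [
    [1.0, -2.0],  # rank 0
    [2.0, -1.0],  # rank 1
    [3.0, -3.0],  # rank 2
]
averaged = average_gradients(grads)
# Every worker applies this same averaged gradient, so weights stay in sync.
print(averaged)  # [2.0, -2.0]
```

This is also why the script broadcasts the initial variables from rank 0: averaging keeps the workers in sync only if they start from identical weights.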


## Train on multiple machines (Horovod)

Now let's start to integrate with AML and automate distributed training on remote virtual machines.

### Step 1 : Create multiple virtual machines (cluster)

Create a new AML compute cluster for distributed training. With auto-scaling from 0 to 3 nodes, you can scale distributed workloads and also save money, because all nodes are terminated when the cluster is inactive.

> Note : By setting an appropriate duration in the ```idle_time_before_scale_down``` parameter, you can keep the cluster from scaling down right after the training has finished. (Otherwise, it scales down 120 seconds after the training has finished, and the next training is slow to start because of cluster resizing.)

In [4]:
from azure.ai.ml.entities import AmlCompute

try:
    compute_target = ml_client.compute.get("mycluster01")
    print("found existing: ", compute_target.name)
except Exception:
    print("creating new.")
    compute_target = AmlCompute(
        name="mycluster01",
        type="amlcompute",
        size="Standard_D2_v2",
        min_instances=0,
        max_instances=3,
        tier="Dedicated",
    )
    compute_target = ml_client.begin_create_or_update(compute_target)

To sign in, use a web browser to open the page https://microsoft.com/devicelogin and enter the code FKC6KGBWX to authenticate.
creating new.
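
The note above mentions ```idle_time_before_scale_down```. As a sketch, the settings below add that parameter (in seconds) to the same arguments used in the cell above; the 30-minute value and the helper name ```cluster_settings``` are arbitrary choices for illustration.

```python
def cluster_settings(idle_minutes):
    """Keyword arguments for AmlCompute, with the idle timeout in seconds."""
    return {
        "name": "mycluster01",
        "type": "amlcompute",
        "size": "Standard_D2_v2",
        "min_instances": 0,
        "max_instances": 3,
        "tier": "Dedicated",
        # Seconds a node stays idle before the cluster scales down.
        "idle_time_before_scale_down": idle_minutes * 60,
    }

settings = cluster_settings(idle_minutes=30)
print(settings["idle_time_before_scale_down"])  # 1800

# Usage (requires the azure-ai-ml package and the ml_client created earlier):
# compute_target = AmlCompute(**settings)
# ml_client.begin_create_or_update(compute_target)
```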


### Step 2 : Submit a training job

Submit a training job on the compute created above.<br>
In this training, the job is distributed across 3 nodes.
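
With MPI distribution, the number of Horovod workers is the number of nodes multiplied by the number of processes started on each node. These correspond to ```instance_count``` and ```process_count_per_instance``` in the job definition in this step. A small sketch of that arithmetic (the function name ```world_size``` is hypothetical):

```python
def world_size(instance_count, process_count_per_instance):
    """Total number of worker processes launched for the job."""
    return instance_count * process_count_per_instance

# Values used in this exercise: 3 nodes, 1 process per node.
print(world_size(3, 1))  # 3

# For example, 2 nodes with 4 processes each (one per GPU) would give 8 workers.
print(world_size(2, 4))  # 8
```

In the training script, ```hvd.size()``` should report this total, and ```hvd.rank()``` ranges from 0 to one less than it.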

Horovod (with TensorFlow) 0.23.0 is installed in this built-in image, ```AzureML-tensorflow-2.7-ubuntu20.04-py38-cuda11-gpu```.

In this example, I also use the registered data asset named ```mnist_data```, which is mounted on your compute target. (Run "[Exercise02 : Prepare Data](./exercise02_prepare_data.ipynb)" for data preparation.)

> Note : In this example, I have used the built-in GPU environment (```AzureML-tensorflow-2.7-ubuntu20.04-py38-cuda11-gpu```) on a CPU cluster. If a GPU is not available, it still runs correctly on CPU.<br>
> If you prefer a CPU image, you can also create and configure your own image. (See [Exercise04](./exercise04_train_remote.ipynb).)

See the progress and results in the job view on [AML Studio](https://ml.azure.com/).

In [6]:
from azure.ai.ml import command, Input, MpiDistribution
from azure.ai.ml.entities import JobResourceConfiguration

# create the command
job = command(
    code="./script",
    command="python train_horovod.py --data_folder ${{inputs.mnist_tf}}/train",
    inputs={
        "mnist_tf": Input(
            type="uri_folder",
            path="mnist_data@latest",
        ),
    },
    environment="AzureML-tensorflow-2.7-ubuntu20.04-py38-cuda11-gpu@latest",
    compute="mycluster01",
    resources=JobResourceConfiguration(instance_count=3),
    distribution=MpiDistribution(process_count_per_instance=1),
    display_name="tf_distribued",
    experiment_name="tf_distribued",
    description="This is example",
)

# submit the command
returned_job = ml_client.create_or_update(job)

Uploading script (0.0 MBs): 100%|██████████████████████████████████████████| 4135/4135 [00:00<00:00, 92418.53it/s]



Please wait until the job is completed.

You can see the current status (progress) in the [AML studio UI](https://ml.azure.com/) (see the "Jobs" pane) or with the following SDK call.

In [7]:
ml_client.jobs.get(returned_job.name)

| Experiment | Name | Type | Status | Details Page |
| --- | --- | --- | --- | --- |
| tf_distribued | ivory_map_2092zc6s7t | command | Completed | Link to Azure Machine Learning studio |
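
If you would rather wait for completion from code than re-run the cell above, a polling loop is one option. The sketch below is written against a generic ```get_status``` callable so that it runs without Azure; it assumes the job object exposes a ```status``` string and that ```Completed```, ```Failed```, and ```Canceled``` are the terminal values. The helper name ```wait_for_job``` and the polling interval are hypothetical.

```python
import time

# Assumed terminal status strings for an AML job.
TERMINAL_STATES = {"Completed", "Failed", "Canceled"}

def wait_for_job(get_status, poll_seconds=30, sleep=time.sleep):
    """Call get_status() repeatedly until it returns a terminal state."""
    while True:
        status = get_status()
        if status in TERMINAL_STATES:
            return status
        sleep(poll_seconds)

# Demo with a fake status source so the example runs anywhere.
fake_statuses = iter(["Queued", "Running", "Running", "Completed"])
final = wait_for_job(lambda: next(fake_statuses), sleep=lambda seconds: None)
print(final)  # Completed

# Usage against the real job (assumes ml_client and returned_job from above):
# wait_for_job(lambda: ml_client.jobs.get(returned_job.name).status)
```

The SDK also provides ```ml_client.jobs.stream(returned_job.name)```, which streams the job logs to the console until the job finishes.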


### Step 3 : Remove AML compute

**You don't need to remove your AML compute** to save money, because the nodes are automatically terminated when the cluster is inactive.<br>
But if you want to clean up, run the following.

In [7]:
ml_client.compute.begin_delete("mycluster01")

Deleting compute mycluster01 


.......

Done.
(0m 36s)

