# TensorFlow with Horovod

<img src='https://github.com/retkowsky/images/blob/master/AzureMLservicebanniere.png?raw=true'>

The TensorFlow estimator also supports distributed training across CPU and GPU clusters. You can easily run distributed TensorFlow jobs and Azure Machine Learning will manage the orchestration for you.

Azure Machine Learning supports two methods of distributed training in TensorFlow:

- **MPI-based distributed training** using the Horovod framework
- **Native distributed TensorFlow** using the parameter server method

**Horovod**

Horovod is an open-source framework for distributed training developed by Uber. It offers a straightforward path to running distributed TensorFlow jobs on GPUs.

To use Horovod, pass an **MpiConfiguration** object as the `distributed_training` parameter of the `TensorFlow` constructor. With this setting, the Horovod library is installed for you to use in your training script.
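To make the idea concrete, here is a toy, single-process sketch of the gradient averaging step that Horovod-style data-parallel training relies on: an allreduce (sum across workers) followed by a division by the number of workers. It is purely illustrative. `allreduce_average` is a hypothetical helper written for this note, not part of Horovod, and a real job performs this exchange across processes through MPI or NCCL rather than inside one Python list.

```python
from typing import List


def allreduce_average(worker_grads: List[List[float]]) -> List[float]:
    """Return the element-wise mean of the gradients held by each worker.

    Hypothetical helper: simulates in one process what an allreduce
    (sum across workers) followed by a division by the worker count yields.
    """
    if not worker_grads:
        raise ValueError("at least one worker is required")
    size = len(worker_grads)
    length = len(worker_grads[0])
    if any(len(g) != length for g in worker_grads):
        raise ValueError("all workers must hold gradients of the same shape")
    # Sum position by position across workers, then divide by the worker count.
    return [sum(g[i] for g in worker_grads) / size for i in range(length)]


# Two simulated workers, each with a gradient computed on its own data shard.
grads = [[0.5, -1.0, 2.0], [1.5, 1.0, 0.0]]
averaged = allreduce_average(grads)
print(averaged)
```

Every worker then applies the same averaged gradient, which keeps the model replicas identical from step to step.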

## 1. Environment information

In [1]:
import sys
sys.version

'3.6.9 |Anaconda, Inc.| (default, Jul 30 2019, 19:07:31) \n[GCC 7.3.0]'

In [2]:
import datetime
maintenant = datetime.datetime.now()
print('Date :', maintenant)

Date : 2020-04-17 13:46:08.288250


In [3]:
import azureml.core
print("Azure ML version :", azureml.core.VERSION)

Azure ML version : 1.2.0


In [4]:
from azureml.core.workspace import Workspace

ws = Workspace.from_config()
print('Workspace name: ' + ws.name, 
      'Azure region: ' + ws.location, 
      'Resource group: ' + ws.resource_group, sep = '\n')

Workspace name: workshopAML2020
Azure region: westeurope
Resource group: workshopAML2020-rg


## 2. Create the GPU compute

In [5]:
compute_targets = ws.compute_targets
for name, ct in compute_targets.items():
    print(name, " - " , ct.type, " - ", ct.provisioning_state)

instance-aks  -  AKS  -  Succeeded
instance  -  ComputeInstance  -  Succeeded
gpuNC6  -  AmlCompute  -  Succeeded


In [6]:
from azureml.core.compute import ComputeTarget, AmlCompute
from azureml.core.compute_target import ComputeTargetException

cluster_name = "gpuclusterNC6"

try:
    compute_target = ComputeTarget(workspace=ws, name=cluster_name)
    print('Found existing compute target')
except ComputeTargetException:
    print('Creating a new compute target...')
    compute_config = AmlCompute.provisioning_configuration(vm_size='STANDARD_NC6', 
                                                           min_nodes=1,
                                                           max_nodes=8)

    # create the cluster
    compute_target = ComputeTarget.create(ws, cluster_name, compute_config)

    compute_target.wait_for_completion(show_output=True)

print(compute_target.status.serialize())

Creating a new compute target...
Creating
Succeeded..................
AmlCompute wait for completion finished

Minimum number of nodes requested have been provisioned
{'currentNodeCount': 1, 'targetNodeCount': 1, 'nodeStateCounts': {'preparingNodeCount': 1, 'runningNodeCount': 0, 'idleNodeCount': 0, 'unusableNodeCount': 0, 'leavingNodeCount': 0, 'preemptedNodeCount': 0}, 'allocationState': 'Steady', 'allocationStateTransitionTime': '2020-04-17T13:51:53.131000+00:00', 'errors': None, 'creationTime': '2020-04-17T13:47:02.870613+00:00', 'modifiedTime': '2020-04-17T13:50:19.008373+00:00', 'provisioningState': 'Succeeded', 'provisioningStateTransitionTime': None, 'scaleSettings': {'minNodeCount': 1, 'maxNodeCount': 8, 'nodeIdleTimeBeforeScaleDown': 'PT120S'}, 'vmPriority': 'Dedicated', 'vmSize': 'STANDARD_NC6'}


In [7]:
compute_targets = ws.compute_targets
for name, ct in compute_targets.items():
    print(name, " - " , ct.type, " - ", ct.provisioning_state)

instance-aks  -  AKS  -  Succeeded
gpuNC6  -  AmlCompute  -  Deleting
instance  -  ComputeInstance  -  Succeeded
gpuclusterNC6  -  AmlCompute  -  Succeeded


In [8]:
print("Cluster status")
compute_target.get_status().serialize()

Cluster status


{'currentNodeCount': 1,
 'targetNodeCount': 1,
 'nodeStateCounts': {'preparingNodeCount': 1,
  'runningNodeCount': 0,
  'idleNodeCount': 0,
  'unusableNodeCount': 0,
  'leavingNodeCount': 0,
  'preemptedNodeCount': 0},
 'allocationState': 'Steady',
 'allocationStateTransitionTime': '2020-04-17T13:51:53.131000+00:00',
 'errors': None,
 'creationTime': '2020-04-17T13:47:02.870613+00:00',
 'modifiedTime': '2020-04-17T13:50:19.008373+00:00',
 'provisioningState': 'Succeeded',
 'provisioningStateTransitionTime': None,
 'scaleSettings': {'minNodeCount': 1,
  'maxNodeCount': 8,
  'nodeIdleTimeBeforeScaleDown': 'PT120S'},
 'vmPriority': 'Dedicated',
 'vmSize': 'STANDARD_NC6'}
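
The status dictionary can also be inspected in code rather than read by eye. The following is a minimal sketch, assuming the dictionary keeps the shape shown above. `iso_duration_to_seconds` and `summarize_cluster_status` are hypothetical helpers introduced here, and the duration parser only handles simple time-only values such as `PT120S`, not the full ISO 8601 grammar.

```python
import re


def iso_duration_to_seconds(value: str) -> int:
    """Convert a simple ISO 8601 duration such as 'PT120S' or 'PT1H30M' to seconds.

    Only the time part (hours, minutes, whole seconds) is handled.
    """
    match = re.fullmatch(r"PT(?:(\d+)H)?(?:(\d+)M)?(?:(\d+)S)?", value)
    if match is None or not any(match.groups()):
        raise ValueError(f"unsupported duration: {value!r}")
    hours, minutes, seconds = (int(g) if g else 0 for g in match.groups())
    return hours * 3600 + minutes * 60 + seconds


def summarize_cluster_status(status: dict) -> dict:
    """Extract a few fields of interest from a serialized cluster status."""
    scale = status["scaleSettings"]
    counts = status["nodeStateCounts"]
    return {
        "vm_size": status["vmSize"],
        "current_nodes": status["currentNodeCount"],
        "idle_or_running_nodes": counts["idleNodeCount"] + counts["runningNodeCount"],
        "min_nodes": scale["minNodeCount"],
        "max_nodes": scale["maxNodeCount"],
        "idle_seconds_before_scale_down": iso_duration_to_seconds(
            scale["nodeIdleTimeBeforeScaleDown"]
        ),
        "has_errors": status["errors"] is not None,
    }


# Subset of the status printed above, copied by hand for the demonstration.
status = {
    "currentNodeCount": 1,
    "nodeStateCounts": {"preparingNodeCount": 1, "runningNodeCount": 0, "idleNodeCount": 0},
    "errors": None,
    "scaleSettings": {
        "minNodeCount": 1,
        "maxNodeCount": 8,
        "nodeIdleTimeBeforeScaleDown": "PT120S",
    },
    "vmSize": "STANDARD_NC6",
}
summary = summarize_cluster_status(status)
print(summary)
```

In the notebook, the input would be `compute_target.get_status().serialize()` instead of the hand-copied dictionary. Here the single node is still preparing, so no node is idle or running yet.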

## 3. Load the data

In [9]:
import os
import urllib.request  # the submodule must be imported explicitly for urlretrieve

os.makedirs('./data', exist_ok=True)
download_url = 'http://mattmahoney.net/dc/text8.zip'
urllib.request.urlretrieve(download_url, filename='./data/text8.zip')

('./data/text8.zip', <http.client.HTTPMessage at 0x7f57d9db0550>)

In [10]:
ds = ws.get_default_datastore()
print(ds.datastore_type, ds.account_name, ds.container_name)

AzureBlob workshopaml2027584246021 azureml-blobstore-1696467d-5136-4ed9-9f78-4c69eaff7896


In [11]:
ds.upload(src_dir='data', target_path='data', overwrite=True, show_progress=True)

Uploading an estimated of 5 files
Uploading data/mnist/test-images.gz
Uploading data/mnist/test-labels.gz
Uploading data/mnist/train-images.gz
Uploading data/mnist/train-labels.gz
Uploading data/text8.zip
Uploaded data/mnist/train-labels.gz, 1 files out of an estimated total of 5
Uploaded data/mnist/test-labels.gz, 2 files out of an estimated total of 5
Uploaded data/mnist/test-images.gz, 3 files out of an estimated total of 5
Uploaded data/mnist/train-images.gz, 4 files out of an estimated total of 5
Uploaded data/text8.zip, 5 files out of an estimated total of 5
Uploaded 5 files


$AZUREML_DATAREFERENCE_d53a98effa00406faafd69b7bdf7c126

In [12]:
path_on_datastore = 'data/text8.zip'
ds_data = ds.path(path_on_datastore)
print(ds_data)

$AZUREML_DATAREFERENCE_6b27e129b423478aabb213fd51ffb869


## 4. Train the model

In [13]:
import os

project_folder = './tf-distr-hvd'
os.makedirs(project_folder, exist_ok=True)

In [14]:
import shutil

shutil.copy('tf_horovod_word2vec.py', project_folder)

'./tf-distr-hvd/tf_horovod_word2vec.py'

In [15]:
from azureml.core import Experiment

experiment_name = 'Exemple13-TFHorovod'
experiment = Experiment(ws, name=experiment_name)

### 4.1 TensorFlow estimator
The Azure Machine Learning SDK's TensorFlow estimator lets you submit TensorFlow training jobs for both single-node and distributed runs. For more information, see the [TensorFlow estimator documentation](https://docs.microsoft.com/azure/machine-learning/service/how-to-train-tensorflow).

In [16]:
from azureml.core.runconfig import MpiConfiguration
from azureml.train.dnn import TensorFlow

# Arguments forwarded to the training script
script_params = {
    '--input_data': ds_data
}

estimator = TensorFlow(source_directory=project_folder,
                       compute_target=compute_target,
                       script_params=script_params,
                       entry_script='tf_horovod_word2vec.py',
                       node_count=2,
                       process_count_per_node=1,
                       distributed_training=MpiConfiguration(),  # MPI backend, required for Horovod
                       use_gpu=True)



The code above runs the training script on `2` nodes, with one worker process per node. To execute a distributed run with MPI/Horovod, you must set the `distributed_training` argument to the MPI backend. With these settings, TensorFlow, Horovod, and their dependencies are installed for you. If your script uses other packages, install them through the `TensorFlow` constructor's `pip_packages` or `conda_packages` parameters.
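
The total number of worker processes is `node_count * process_count_per_node`, so this job starts two. The sketch below enumerates the workers for a given configuration. It assumes ranks are assigned node by node, which is a common launcher default but is not guaranteed. `WorkerSlot` and `worker_layout` are hypothetical names used only for illustration; inside a Horovod script the corresponding values come from `hvd.rank()`, `hvd.local_rank()`, and `hvd.size()`.

```python
from typing import List, NamedTuple


class WorkerSlot(NamedTuple):
    rank: int        # global index of the process across the whole job
    node: int        # index of the node hosting the process
    local_rank: int  # index of the process within its node


def worker_layout(node_count: int, process_count_per_node: int) -> List[WorkerSlot]:
    """List every worker process, assuming ranks fill one node before the next."""
    slots = []
    for node in range(node_count):
        for local_rank in range(process_count_per_node):
            rank = node * process_count_per_node + local_rank
            slots.append(WorkerSlot(rank, node, local_rank))
    return slots


# Same settings as the estimator: 2 nodes, 1 process per node.
slots = worker_layout(node_count=2, process_count_per_node=1)
for slot in slots:
    print(slot)
```

With one process per node, every worker has `local_rank` 0, which is the value typically used to pick the GPU on a single-GPU VM such as `STANDARD_NC6`.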

Note that we passed our training data reference `ds_data` to our script's `--input_data` argument. This will 1) mount our datastore on the remote compute and 2) provide the path to the data zip file on our datastore.
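
The training script itself is not shown in this notebook. As a rough sketch of the receiving side, a script could parse `--input_data` with `argparse` and read the corpus from the zip archive as below. `parse_args` and `read_words` are hypothetical helpers, and the tiny archive built here is only a stand-in for `text8.zip` so that the example runs on its own.

```python
import argparse
import os
import tempfile
import zipfile
from typing import List


def parse_args(argv: List[str]) -> argparse.Namespace:
    """Parse the single argument passed through script_params."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--input_data", type=str, required=True,
                        help="path to the zip archive holding the corpus")
    return parser.parse_args(argv)


def read_words(zip_path: str) -> List[str]:
    """Return the whitespace-separated tokens of the first file in the archive."""
    with zipfile.ZipFile(zip_path) as archive:
        first_member = archive.namelist()[0]
        return archive.read(first_member).decode("utf-8").split()


# Self-contained demonstration with a tiny stand-in archive (not the real corpus).
with tempfile.TemporaryDirectory() as tmp:
    demo_zip = os.path.join(tmp, "demo.zip")
    with zipfile.ZipFile(demo_zip, "w") as archive:
        archive.writestr("demo", "the quick brown fox")
    args = parse_args(["--input_data", demo_zip])
    words = read_words(args.input_data)

print(words)
```

On the remote compute, the value received in `--input_data` would be the mounted path that replaces the `$AZUREML_DATAREFERENCE_...` placeholder.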

### 4.2 Run

> Takes about 10 minutes of processing time

In [17]:
tags = {"Type": "test" , "Langage" : "Python" , "Framework" : "Tensorflow Horovod"}

In [18]:
run = experiment.submit(estimator, tags=tags)
print(run)

Run(Experiment: Exemple13-TFHorovod,
Id: Exemple13-TFHorovod_1587131608_050d1784,
Type: azureml.scriptrun,
Status: Queued)


In [19]:
from azureml.widgets import RunDetails
RunDetails(run).show()

_UserRunWidget(widget_settings={'childWidgetDisplay': 'popup', 'send_telemetry': False, 'log_level': 'INFO', '…

In [38]:
# Run progress
run.get_details()

{'runId': 'Exemple13-TFHorovod_1587131608_050d1784',
 'target': 'gpuclusterNC6',
 'status': 'Completed',
 'startTimeUtc': '2020-04-17T14:05:44.706017Z',
 'endTimeUtc': '2020-04-17T14:09:30.012269Z',
 'properties': {'_azureml.ComputeTargetType': 'amlcompute',
  'ContentSnapshotId': 'c13610ad-10f4-4a42-8d95-4f3cacddc4d9',
  'azureml.git.repository_uri': 'https://github.com/retkowsky/WorkshopAML2020',
  'mlflow.source.git.repoURL': 'https://github.com/retkowsky/WorkshopAML2020',
  'azureml.git.branch': 'master',
  'mlflow.source.git.branch': 'master',
  'azureml.git.commit': 'eb05ad565a41b5121d26c6fda4b1c6398a9243d7',
  'mlflow.source.git.commit': 'eb05ad565a41b5121d26c6fda4b1c6398a9243d7',
  'azureml.git.dirty': 'True',
  'ProcessInfoFile': 'azureml-logs/process_info.json',
  'ProcessStatusFile': 'azureml-logs/process_status.json'},
 'inputDatasets': [],
 'runDefinition': {'script': 'tf_horovod_word2vec.py',
  'useAbsolutePath': False,
  'arguments': ['--input_data',
   '$AZUREML_DATAREF

In [39]:
# Status of the cluster nodes
compute_target.list_nodes()

[{'nodeId': 'tvmps_69002b1a04ce866541dd014fc8db1e72c8cbb932b960296992f8ff60b3a70d89_d',
  'port': 50000,
  'publicIpAddress': '51.138.26.205',
  'privateIpAddress': '10.0.0.4',
  'nodeState': 'idle'},
 {'nodeId': 'tvmps_c207d86753a4849ae1b2ccc19eb052d0ff248f4ca6b43b97285da811abead036_d',
  'port': 50001,
  'publicIpAddress': '51.138.26.205',
  'privateIpAddress': '10.0.0.5',
  'nodeState': 'idle'}]

### 4.3 Results

In [40]:
run.wait_for_completion(show_output=True)

RunId: Exemple13-TFHorovod_1587131608_050d1784
Web View: https://ml.azure.com/experiments/Exemple13-TFHorovod/runs/Exemple13-TFHorovod_1587131608_050d1784?wsid=/subscriptions/70b8f39e-8863-49f7-b6ba-34a80799550c/resourcegroups/workshopAML2020-rg/workspaces/workshopAML2020

Execution Summary
RunId: Exemple13-TFHorovod_1587131608_050d1784
Web View: https://ml.azure.com/experiments/Exemple13-TFHorovod/runs/Exemple13-TFHorovod_1587131608_050d1784?wsid=/subscriptions/70b8f39e-8863-49f7-b6ba-34a80799550c/resourcegroups/workshopAML2020-rg/workspaces/workshopAML2020



{'runId': 'Exemple13-TFHorovod_1587131608_050d1784',
 'target': 'gpuclusterNC6',
 'status': 'Completed',
 'startTimeUtc': '2020-04-17T14:05:44.706017Z',
 'endTimeUtc': '2020-04-17T14:09:30.012269Z',
 'properties': {'_azureml.ComputeTargetType': 'amlcompute',
  'ContentSnapshotId': 'c13610ad-10f4-4a42-8d95-4f3cacddc4d9',
  'azureml.git.repository_uri': 'https://github.com/retkowsky/WorkshopAML2020',
  'mlflow.source.git.repoURL': 'https://github.com/retkowsky/WorkshopAML2020',
  'azureml.git.branch': 'master',
  'mlflow.source.git.branch': 'master',
  'azureml.git.commit': 'eb05ad565a41b5121d26c6fda4b1c6398a9243d7',
  'mlflow.source.git.commit': 'eb05ad565a41b5121d26c6fda4b1c6398a9243d7',
  'azureml.git.dirty': 'True',
  'ProcessInfoFile': 'azureml-logs/process_info.json',
  'ProcessStatusFile': 'azureml-logs/process_status.json'},
 'inputDatasets': [],
 'runDefinition': {'script': 'tf_horovod_word2vec.py',
  'useAbsolutePath': False,
  'arguments': ['--input_data',
   '$AZUREML_DATAREF
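
The `startTimeUtc` and `endTimeUtc` fields give the execution time of the run. A small sketch, assuming both timestamps carry fractional seconds and a trailing `Z` as in the output above (`parse_utc` is a hypothetical helper, and the two values are copied by hand from that output):

```python
from datetime import datetime


def parse_utc(timestamp: str) -> datetime:
    """Parse timestamps of the form '2020-04-17T14:05:44.706017Z'."""
    return datetime.strptime(timestamp, "%Y-%m-%dT%H:%M:%S.%fZ")


# Values copied from the run details shown above.
details = {
    "startTimeUtc": "2020-04-17T14:05:44.706017Z",
    "endTimeUtc": "2020-04-17T14:09:30.012269Z",
}
elapsed = parse_utc(details["endTimeUtc"]) - parse_utc(details["startTimeUtc"])
print(elapsed.total_seconds())
```

This measures only the time between the recorded start and end of the run, not the time spent queued or preparing the nodes.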

In [41]:
run.get_metrics()

{'Loss': [268.85772705078125, 118.6477854642868, 55.62814997577667]}
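
To read the trend in the logged metric, one option is to compute the relative decrease between consecutive values. This is a minimal sketch: `relative_decreases` is a hypothetical helper, and the list is copied from the output above rather than fetched with `run.get_metrics()`.

```python
from typing import List

# Values copied from the 'Loss' metric shown above.
losses = [268.85772705078125, 118.6477854642868, 55.62814997577667]


def relative_decreases(values: List[float]) -> List[float]:
    """Fraction by which each value drops compared with the previous one."""
    return [(prev - curr) / prev for prev, curr in zip(values, values[1:])]


drops = relative_decreases(losses)
for step, drop in enumerate(drops, start=1):
    print(f"log {step} -> {step + 1}: {drop:.1%}")
```

Each logged value is a little under half of the previous one, so the loss was still falling when the run ended.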

In [42]:
experiment

Name,Workspace,Report Page,Docs Page
Exemple13-TFHorovod,workshopAML2020,Link to Azure Machine Learning studio,Link to Documentation


## 5. Delete the resource

In [43]:
compute_targets = ws.compute_targets
for name, ct in compute_targets.items():
    print(name, " - " , ct.type, " - ", ct.provisioning_state)

instance-aks  -  AKS  -  Succeeded
pipelines  -  AmlCompute  -  Deleting
cpupipelines  -  AmlCompute  -  Deleting
instance  -  ComputeInstance  -  Succeeded
gpuclusterNC6  -  AmlCompute  -  Succeeded


In [44]:
# Delete the compute cluster
compute_target.delete()

In [45]:
compute_targets = ws.compute_targets
for name, ct in compute_targets.items():
    print(name, " - " , ct.type, " - ", ct.provisioning_state)

instance-aks  -  AKS  -  Succeeded
gpuclusterNC6  -  AmlCompute  -  Deleting
pipelines  -  AmlCompute  -  Deleting
cpupipelines  -  AmlCompute  -  Deleting
instance  -  ComputeInstance  -  Succeeded


<img src="https://github.com/retkowsky/images/blob/master/Powered-by-MS-Azure-logo-v2.png?raw=true" height="300" width="300">