# TensorFlow with Horovod

<img src='https://github.com/retkowsky/images/blob/master/AzureMLservicebanniere.png?raw=true'>

The TensorFlow estimator supports distributed training across CPU and GPU clusters. You submit a distributed TensorFlow job and Azure Machine Learning manages the orchestration for you.

Azure Machine Learning supports two methods of distributed training in TensorFlow:

- **MPI-based distributed training** using the Horovod framework
- **Native distributed TensorFlow** using the parameter server method

**Horovod**
Horovod is an open-source framework for distributed training developed by Uber. It offers an easy path to distributed GPU TensorFlow jobs.

To use Horovod, specify an **MpiConfiguration object** for the `distributed_training` parameter in the `TensorFlow` constructor. This parameter ensures that the Horovod library is installed for you to use in your training script.
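
To give a feel for what MPI-based data-parallel training does, here is a toy sketch in plain Python. It assumes the usual pattern in which each worker computes gradients on its own batch, the gradients are averaged across workers (an allreduce), and every worker applies the same update. The function names `allreduce_average` and `sgd_step` are hypothetical and this is not Horovod's implementation, only an illustration of the idea.

```python
from typing import List


def allreduce_average(worker_grads: List[List[float]]) -> List[float]:
    """Average gradients element-wise across workers (toy stand-in for an allreduce)."""
    n_workers = len(worker_grads)
    if n_workers == 0:
        raise ValueError("need at least one worker")
    length = len(worker_grads[0])
    if any(len(g) != length for g in worker_grads):
        raise ValueError("all workers must hold gradients of the same length")
    return [sum(g[i] for g in worker_grads) / n_workers for i in range(length)]


def sgd_step(weights: List[float], grads: List[float], lr: float) -> List[float]:
    """Apply one plain SGD update; every worker applies the same averaged gradient."""
    return [w - lr * g for w, g in zip(weights, grads)]


# Two simulated workers, each holding gradients from its own mini-batch.
weights = [1.0, 2.0]
grads_worker0 = [0.5, -1.0]
grads_worker1 = [1.5, 3.0]

avg = allreduce_average([grads_worker0, grads_worker1])
new_weights = sgd_step(weights, avg, lr=0.1)
print("averaged gradients:", avg)
print("updated weights:", new_weights)
```

Because every worker ends up with the same averaged gradient, the model replicas stay in sync without a central parameter server.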

## 1. Information

In [1]:
import sys
sys.version

'3.6.9 |Anaconda, Inc.| (default, Jul 30 2019, 19:07:31) \n[GCC 7.3.0]'

In [2]:
import datetime
maintenant = datetime.datetime.now()
print('Date :', maintenant)

Date : 2020-09-16 13:39:16.962904


In [3]:
import azureml.core
print("Azure ML version :", azureml.core.VERSION)

Azure ML version : 1.13.0


In [4]:
from azureml.core.workspace import Workspace

ws = Workspace.from_config()
print('Workspace name: ' + ws.name, 
      'Azure region: ' + ws.location, 
      'Resource group: ' + ws.resource_group, sep = '\n')

Workspace name: workshopAML2020
Azure region: westeurope
Resource group: workshopAML2020-rg


## 2. Creating the GPU compute

In [5]:
compute_targets = ws.compute_targets
for name, ct in compute_targets.items():
    print(name, " - " , ct.type, " - ", ct.provisioning_state)

cpupipelines  -  AmlCompute  -  Deleting
gpuclusterstdnc6  -  AmlCompute  -  Deleting
AutoML  -  AmlCompute  -  Succeeded
Designer  -  AmlCompute  -  Succeeded
notebooksjupyter  -  ComputeInstance  -  Succeeded
cpucluster  -  AmlCompute  -  Succeeded
AutoMLsdk  -  AmlCompute  -  Succeeded
clustergpuNC6  -  AmlCompute  -  Succeeded
cpu-standardd4  -  AmlCompute  -  Succeeded


In [6]:
from azureml.core.compute import ComputeTarget, AmlCompute
from azureml.core.compute_target import ComputeTargetException

cluster_name = "gpuclustNC6"

try:
    compute_target = ComputeTarget(workspace=ws, name=cluster_name)
    print('Found existing compute target')
except ComputeTargetException:
    print('Creating a new compute target...')
    compute_config = AmlCompute.provisioning_configuration(vm_size='STANDARD_NC6', 
                                                           min_nodes=1,
                                                           max_nodes=4)

    # create the cluster
    compute_target = ComputeTarget.create(ws, cluster_name, compute_config)

    compute_target.wait_for_completion(show_output=True)


Creating a new compute target...
Creating
Succeeded.................
AmlCompute wait for completion finished

Minimum number of nodes requested have been provisioned


In [7]:
compute_targets = ws.compute_targets
for name, ct in compute_targets.items():
    print(name, "-" , ct.type, "=", ct.provisioning_state)

gpuclusterstdnc6 - AmlCompute = Deleting
AutoML - AmlCompute = Succeeded
Designer - AmlCompute = Succeeded
notebooksjupyter - ComputeInstance = Succeeded
cpucluster - AmlCompute = Succeeded
AutoMLsdk - AmlCompute = Succeeded
clustergpuNC6 - AmlCompute = Succeeded
cpu-standardd4 - AmlCompute = Succeeded
gpuclustNC6 - AmlCompute = Succeeded


## 3. Loading the data

In [8]:
import os
import urllib.request  # the request submodule must be imported explicitly

os.makedirs('./data', exist_ok=True)
download_url = 'http://mattmahoney.net/dc/text8.zip'
urllib.request.urlretrieve(download_url, filename='./data/text8.zip')

('./data/text8.zip', <http.client.HTTPMessage at 0x7f53bc5fc630>)

In [9]:
ds = ws.get_default_datastore()
print(ds.datastore_type, ds.account_name, ds.container_name)

AzureBlob workshopaml2027584246021 azureml-blobstore-1696467d-5136-4ed9-9f78-4c69eaff7896


In [10]:
ds.upload(src_dir='data', target_path='data', overwrite=True, show_progress=True)

Uploading an estimated of 5 files
Uploading data/mnist/test-images.gz
Uploaded data/mnist/test-images.gz, 1 files out of an estimated total of 5
Uploading data/mnist/test-labels.gz
Uploaded data/mnist/test-labels.gz, 2 files out of an estimated total of 5
Uploading data/mnist/train-labels.gz
Uploaded data/mnist/train-labels.gz, 3 files out of an estimated total of 5
Uploading data/mnist/train-images.gz
Uploaded data/mnist/train-images.gz, 4 files out of an estimated total of 5
Uploading data/text8.zip
Uploaded data/text8.zip, 5 files out of an estimated total of 5
Uploaded 5 files


$AZUREML_DATAREFERENCE_bb1b00b71cb54f13b95b902d0812729e

In [11]:
path_on_datastore = 'data/text8.zip'
ds_data = ds.path(path_on_datastore)
print(ds_data)

$AZUREML_DATAREFERENCE_737eb72bbb104ab6a521f2bc9d6f58ec


## 4. Training the model

In [12]:
import os

project_folder = './tf-distr-hvd'
os.makedirs(project_folder, exist_ok=True)

In [13]:
import shutil

shutil.copy('tf_horovod_word2vec.py', project_folder)

'./tf-distr-hvd/tf_horovod_word2vec.py'

In [14]:
from azureml.core import Experiment

experiment_name = 'Exemple13-TFHorovod'
experiment = Experiment(ws, name=experiment_name)

### 4.1 TensorFlow estimator
The AML SDK's TensorFlow estimator lets you submit TensorFlow training jobs for both single-node and distributed runs. For more information on the TensorFlow estimator, see the [documentation](https://docs.microsoft.com/azure/machine-learning/service/how-to-train-tensorflow).

In [15]:
from azureml.train.dnn import TensorFlow

script_params = {
    '--input_data': ds_data
}

estimator = TensorFlow(source_directory=project_folder,
                       compute_target=compute_target,
                       script_params=script_params,
                       entry_script='tf_horovod_word2vec.py',
                       node_count=2,
                       process_count_per_node=1,
                       distributed_training='mpi',  # For Horovod
                       use_gpu=True)



The above code specifies that we will run our training script on `2` nodes, with one worker per node. To execute a distributed run using MPI/Horovod, you must select the MPI backend, which the code above does with `distributed_training='mpi'`. With these settings, TensorFlow, Horovod and their dependencies are installed for you. However, if your script also uses other packages, make sure to install them via the `TensorFlow` constructor's `pip_packages` or `conda_packages` parameters.
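
As a rough sketch of how these settings combine, the snippet below keeps the estimator's keyword arguments in a plain dictionary and computes how many worker processes the job would launch (nodes times processes per node). The `pip_packages` entry is a hypothetical extra dependency chosen only for illustration, and `total_workers` is a helper name introduced here, not part of the SDK.

```python
# Keyword arguments mirroring the estimator cell above, kept in a plain dict so
# they can be inspected without the Azure ML SDK installed.
estimator_kwargs = {
    "entry_script": "tf_horovod_word2vec.py",
    "node_count": 2,
    "process_count_per_node": 1,
    "distributed_training": "mpi",
    "use_gpu": True,
    # Hypothetical extra dependency, listed only to show where it would go.
    "pip_packages": ["scikit-learn"],
}


def total_workers(node_count: int, process_count_per_node: int) -> int:
    """Number of worker processes launched: one per process slot on each node."""
    if node_count < 1 or process_count_per_node < 1:
        raise ValueError("both counts must be at least 1")
    return node_count * process_count_per_node


workers = total_workers(estimator_kwargs["node_count"],
                        estimator_kwargs["process_count_per_node"])
print("Worker processes launched:", workers)
```

With the SDK available, the dictionary could be expanded into the constructor, for example `TensorFlow(source_directory=project_folder, compute_target=compute_target, script_params=script_params, **estimator_kwargs)`.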

Note that we passed our training data reference `ds_data` to our script's `--input_data` argument. This will 1) mount our datastore on the remote compute and 2) provide the path to the data zip file on our datastore.
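
On the script side, the value of `--input_data` arrives as an ordinary command-line argument holding a path. The training script `tf_horovod_word2vec.py` is not shown in this notebook, so the following is only a hypothetical sketch of how such a script might read the argument and open the zip file. The helper names `parse_args` and `read_first_words` are assumptions, and a tiny stand-in archive replaces the real `text8.zip` so the example runs locally.

```python
import argparse
import os
import tempfile
import zipfile


def parse_args(argv=None):
    """Parse the single argument the estimator passes to the entry script."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--input_data", type=str, required=True,
                        help="path to the zipped training corpus")
    return parser.parse_args(argv)


def read_first_words(zip_path, n_words=5):
    """Return the first few whitespace-separated words of the first archive member."""
    with zipfile.ZipFile(zip_path) as zf:
        first_member = zf.namelist()[0]
        with zf.open(first_member) as fh:
            chunk = fh.read(1024).decode("utf-8", errors="ignore")
    return chunk.split()[:n_words]


# Local demonstration with a small stand-in archive (not the real dataset).
with tempfile.TemporaryDirectory() as tmp:
    demo_zip = os.path.join(tmp, "demo.zip")
    with zipfile.ZipFile(demo_zip, "w") as zf:
        zf.writestr("corpus.txt", "the quick brown fox jumps")
    args = parse_args(["--input_data", demo_zip])
    words = read_first_words(args.input_data, n_words=3)

print(words)
```

On the remote compute, the same argument would instead point to the mounted location of `data/text8.zip` on the datastore.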

### 4.2 Run

> Requires about 10 minutes of processing time

In [16]:
tags = {"Type": "test" , "Langage" : "Python" , "Framework" : "Tensorflow Horovod"}

In [17]:
run = experiment.submit(estimator, tags=tags)
print(run)



Run(Experiment: Exemple13-TFHorovod,
Id: Exemple13-TFHorovod_1600263765_02dcf98d,
Type: azureml.scriptrun,
Status: Starting)


In [18]:
from azureml.widgets import RunDetails
RunDetails(run).show()

_UserRunWidget(widget_settings={'childWidgetDisplay': 'popup', 'send_telemetry': False, 'log_level': 'INFO', '…

In [23]:
print("Run status:", run.get_status())

Run status: Completed


In [24]:
# Run progress
run.get_details()

{'runId': 'Exemple13-TFHorovod_1600263765_02dcf98d',
 'target': 'gpuclustNC6',
 'status': 'Completed',
 'startTimeUtc': '2020-09-16T13:45:05.369093Z',
 'endTimeUtc': '2020-09-16T13:47:53.049781Z',
 'properties': {'_azureml.ComputeTargetType': 'amlcompute',
  'ContentSnapshotId': '08928909-7f59-4f8b-8a6a-9a61cb7ab8d8',
  'azureml.git.repository_uri': 'https://github.com/retkowsky/WorkshopAML2020',
  'mlflow.source.git.repoURL': 'https://github.com/retkowsky/WorkshopAML2020',
  'azureml.git.branch': 'master',
  'mlflow.source.git.branch': 'master',
  'azureml.git.commit': 'eb05ad565a41b5121d26c6fda4b1c6398a9243d7',
  'mlflow.source.git.commit': 'eb05ad565a41b5121d26c6fda4b1c6398a9243d7',
  'azureml.git.dirty': 'True',
  'ProcessInfoFile': 'azureml-logs/process_info.json',
  'ProcessStatusFile': 'azureml-logs/process_status.json'},
 'inputDatasets': [],
 'outputDatasets': [],
 'runDefinition': {'script': 'tf_horovod_word2vec.py',
  'scriptType': None,
  'useAbsolutePath': False,
  'argume

In [25]:
# Node status
compute_target.list_nodes()

[{'nodeId': 'tvmps_1a14da1138fbb63d8c3f1bfcb96efae9003c90593dca6239c5991ac35bc9b3cf_d',
  'port': 50001,
  'publicIpAddress': '51.124.6.12',
  'privateIpAddress': '10.0.0.5',
  'nodeState': 'idle'},
 {'nodeId': 'tvmps_1b3e9678d9e3ff8ebd84377325c7c904c7f79066b011c2ac01d27eb032515be8_d',
  'port': 50000,
  'publicIpAddress': '51.124.6.12',
  'privateIpAddress': '10.0.0.4',
  'nodeState': 'idle'}]

### 4.3 Results

In [26]:
run.wait_for_completion(show_output=True)

RunId: Exemple13-TFHorovod_1600263765_02dcf98d
Web View: https://ml.azure.com/experiments/Exemple13-TFHorovod/runs/Exemple13-TFHorovod_1600263765_02dcf98d?wsid=/subscriptions/70b8f39e-8863-49f7-b6ba-34a80799550c/resourcegroups/workshopAML2020-rg/workspaces/workshopAML2020

Execution Summary
RunId: Exemple13-TFHorovod_1600263765_02dcf98d
Web View: https://ml.azure.com/experiments/Exemple13-TFHorovod/runs/Exemple13-TFHorovod_1600263765_02dcf98d?wsid=/subscriptions/70b8f39e-8863-49f7-b6ba-34a80799550c/resourcegroups/workshopAML2020-rg/workspaces/workshopAML2020



{'runId': 'Exemple13-TFHorovod_1600263765_02dcf98d',
 'target': 'gpuclustNC6',
 'status': 'Completed',
 'startTimeUtc': '2020-09-16T13:45:05.369093Z',
 'endTimeUtc': '2020-09-16T13:47:53.049781Z',
 'properties': {'_azureml.ComputeTargetType': 'amlcompute',
  'ContentSnapshotId': '08928909-7f59-4f8b-8a6a-9a61cb7ab8d8',
  'azureml.git.repository_uri': 'https://github.com/retkowsky/WorkshopAML2020',
  'mlflow.source.git.repoURL': 'https://github.com/retkowsky/WorkshopAML2020',
  'azureml.git.branch': 'master',
  'mlflow.source.git.branch': 'master',
  'azureml.git.commit': 'eb05ad565a41b5121d26c6fda4b1c6398a9243d7',
  'mlflow.source.git.commit': 'eb05ad565a41b5121d26c6fda4b1c6398a9243d7',
  'azureml.git.dirty': 'True',
  'ProcessInfoFile': 'azureml-logs/process_info.json',
  'ProcessStatusFile': 'azureml-logs/process_status.json'},
 'inputDatasets': [],
 'outputDatasets': [],
 'runDefinition': {'script': 'tf_horovod_word2vec.py',
  'scriptType': None,
  'useAbsolutePath': False,
  'argume

In [27]:
run.get_metrics()

{'Loss': [263.4669189453125, 118.24787282180786, 55.1748961482048]}
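
One quick way to read these logged values is to look at how much the loss falls between consecutive entries. The sketch below copies the values from the output above and uses a hypothetical helper, `relative_drops`. It assumes the logged losses are non-zero and says nothing about how often the script logs them.

```python
from typing import List


def relative_drops(losses: List[float]) -> List[float]:
    """Fractional decrease between each pair of consecutive loss values."""
    return [(prev - cur) / prev for prev, cur in zip(losses, losses[1:])]


# Values copied from the run.get_metrics() output above.
metrics = {'Loss': [263.4669189453125, 118.24787282180786, 55.1748961482048]}

drops = relative_drops(metrics['Loss'])
for step, drop in enumerate(drops, start=1):
    print(f"interval {step}: loss fell by {drop:.1%}")
```

Here the loss falls by a little more than half over each logged interval.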

In [28]:
experiment

| Name | Workspace | Report Page | Docs Page |
|---|---|---|---|
| Exemple13-TFHorovod | workshopAML2020 | Link to Azure Machine Learning studio | Link to Documentation |


## 5. Deleting the resource

In [29]:
compute_targets = ws.compute_targets
for name, ct in compute_targets.items():
    print(name, " - " , ct.type, " - ", ct.provisioning_state)

AutoML  -  AmlCompute  -  Succeeded
Designer  -  AmlCompute  -  Succeeded
notebooksjupyter  -  ComputeInstance  -  Succeeded
cpucluster  -  AmlCompute  -  Succeeded
AutoMLsdk  -  AmlCompute  -  Succeeded
clustergpuNC6  -  AmlCompute  -  Succeeded
cpu-standardd4  -  AmlCompute  -  Succeeded
gpuclustNC6  -  AmlCompute  -  Succeeded


In [30]:
# To delete the compute cluster
# compute_target.delete()

In [31]:
compute_targets = ws.compute_targets
for name, ct in compute_targets.items():
    print(name, " - " , ct.type, " - ", ct.provisioning_state)

AutoML  -  AmlCompute  -  Succeeded
Designer  -  AmlCompute  -  Succeeded
notebooksjupyter  -  ComputeInstance  -  Succeeded
cpucluster  -  AmlCompute  -  Succeeded
AutoMLsdk  -  AmlCompute  -  Succeeded
clustergpuNC6  -  AmlCompute  -  Succeeded
cpu-standardd4  -  AmlCompute  -  Succeeded
gpuclustNC6  -  AmlCompute  -  Succeeded


<img src="https://github.com/retkowsky/images/blob/master/Powered-by-MS-Azure-logo-v2.png?raw=true" height="300" width="300">