# TensorFlow GPU with Azure ML

<img src='https://github.com/retkowsky/images/blob/master/AzureMLservicebanniere.png?raw=true'>

In [1]:
import sys
sys.version

'3.6.9 |Anaconda, Inc.| (default, Jul 30 2019, 19:07:31) \n[GCC 7.3.0]'

In [2]:
import utils

In [3]:
import datetime
maintenant = datetime.datetime.now()
print("Run date = ", maintenant)

Run date =  2020-03-10 15:15:26.283425


In [4]:
# Check core SDK version number
import azureml.core

print("Azure ML version = ", azureml.core.VERSION)

Azure ML version =  1.0.83


## 1. Workspace

In [5]:
from azureml.core.workspace import Workspace

ws = Workspace.from_config()
print('Workspace name: ' + ws.name, 
      'Azure region: ' + ws.location, 
      'Subscription id: ' + ws.subscription_id, 
      'Resource group: ' + ws.resource_group, sep='\n')

Workspace name: AzureMLWorkshop
Azure region: westeurope
Subscription id: 70b8f39e-8863-49f7-b6ba-34a80799550c
Resource group: AzureMLWorkshopRG


## 2. AML Compute GPU

The NC, NCv2 and NCv3 sizes are optimized for compute-intensive and network-intensive applications and algorithms.

Examples include CUDA- and OpenCL-based applications and simulations, artificial intelligence, and deep learning. The NCv3 series, equipped with the NVIDIA Tesla V100 GPU, targets high-performance computing workloads.

- The NC series uses the Intel Xeon E5-2690 v3 2.60 GHz processor (Haswell); NCv2 and NCv3 virtual machines use the Intel Xeon E5-2690 v4 processor (Broadwell).

- ND and NDv2: the ND series is aimed at training and inference scenarios for deep learning. It uses the NVIDIA Tesla P40 GPU and the Intel Xeon E5-2690 v4 processor (Broadwell). The NDv2 series uses the Intel Xeon Platinum 8168 processor (Skylake).

- The NV and NVv3 sizes are optimized and designed for remote visualization, streaming, gaming, encoding, and VDI scenarios that use frameworks such as OpenGL or DirectX. These virtual machines are backed by the NVIDIA Tesla M60 GPU.

- The NVv4 sizes are optimized and designed for VDI and remote visualization. With partitioned GPUs, NVv4 offers the right size for workloads that need smaller GPU resources. These virtual machines are backed by the AMD Radeon Instinct MI25 GPU.

https://docs.microsoft.com/fr-fr/azure/virtual-machines/windows/sizes-gpu

In [6]:
from azureml.core.compute import ComputeTarget, AmlCompute

AmlCompute.supported_vmsizes(workspace = ws)

[{'name': 'Standard_D1_v2',
  'vCPUs': 1,
  'gpus': 0,
  'memoryGB': 3.5,
  'maxResourceVolumeMB': 51200},
 {'name': 'Standard_D2_v2',
  'vCPUs': 2,
  'gpus': 0,
  'memoryGB': 7.0,
  'maxResourceVolumeMB': 102400},
 {'name': 'Standard_D3_v2',
  'vCPUs': 4,
  'gpus': 0,
  'memoryGB': 14.0,
  'maxResourceVolumeMB': 204800},
 {'name': 'Standard_D4_v2',
  'vCPUs': 8,
  'gpus': 0,
  'memoryGB': 28.0,
  'maxResourceVolumeMB': 409600},
 {'name': 'Standard_D11_v2',
  'vCPUs': 2,
  'gpus': 0,
  'memoryGB': 14.0,
  'maxResourceVolumeMB': 102400},
 {'name': 'Standard_D12_v2',
  'vCPUs': 4,
  'gpus': 0,
  'memoryGB': 28.0,
  'maxResourceVolumeMB': 204800},
 {'name': 'Standard_D13_v2',
  'vCPUs': 8,
  'gpus': 0,
  'memoryGB': 56.0,
  'maxResourceVolumeMB': 409600},
 {'name': 'Standard_D14_v2',
  'vCPUs': 16,
  'gpus': 0,
  'memoryGB': 112.0,
  'maxResourceVolumeMB': 819200},
 {'name': 'Standard_DS1_v2',
  'vCPUs': 1,
  'gpus': 0,
  'memoryGB': 3.5,
  'maxResourceVolumeMB': 7168},
 {'name': 'Standar
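
The full list is long, so it can help to narrow it down to GPU-enabled sizes before picking one. The sketch below filters a list shaped like the output above on its `gpus` field. The helper name `gpu_sizes` and the sample list are assumptions made for illustration (the `STANDARD_NC6` entry is written by hand, not copied from the service); in the notebook one would presumably pass the result of `AmlCompute.supported_vmsizes(workspace=ws)` instead.

```python
def gpu_sizes(vm_sizes):
    """Return the names of the VM sizes that expose at least one GPU."""
    return [size['name'] for size in vm_sizes if size.get('gpus', 0) > 0]


# Hand-written sample in the same shape as the output above.
# The first two entries come from that output; the NC6 entry is illustrative.
sample_sizes = [
    {'name': 'Standard_D1_v2', 'vCPUs': 1, 'gpus': 0, 'memoryGB': 3.5},
    {'name': 'Standard_D2_v2', 'vCPUs': 2, 'gpus': 0, 'memoryGB': 7.0},
    {'name': 'STANDARD_NC6', 'vCPUs': 6, 'gpus': 1, 'memoryGB': 56.0},
]

print(gpu_sizes(sample_sizes))
```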

In [7]:
from azureml.core.compute import ComputeTarget, AmlCompute
from azureml.core.compute_target import ComputeTargetException

# choose a name for your cluster
cluster_name = "gpuclusterNC6"

try:
    compute_target = ComputeTarget(workspace=ws, name=cluster_name)
    print('Found existing compute target.')
except ComputeTargetException:
    print('Creating a new compute target...')
    compute_config = AmlCompute.provisioning_configuration(vm_size='STANDARD_NC6', 
                                                           min_nodes=1,
                                                           max_nodes=4)

    # create the cluster
    compute_target = ComputeTarget.create(ws, cluster_name, compute_config)

    compute_target.wait_for_completion(show_output=True)

# use get_status() to get a detailed status for the current cluster. 
print(compute_target.get_status().serialize())

Creating a new compute target...
Creating
Succeeded........................
AmlCompute wait for completion finished
Minimum number of nodes requested have been provisioned
{'currentNodeCount': 1, 'targetNodeCount': 1, 'nodeStateCounts': {'preparingNodeCount': 1, 'runningNodeCount': 0, 'idleNodeCount': 0, 'unusableNodeCount': 0, 'leavingNodeCount': 0, 'preemptedNodeCount': 0}, 'allocationState': 'Steady', 'allocationStateTransitionTime': '2020-03-10T15:17:48.774000+00:00', 'errors': None, 'creationTime': '2020-03-10T15:15:35.695125+00:00', 'modifiedTime': '2020-03-10T15:15:51.519596+00:00', 'provisioningState': 'Succeeded', 'provisioningStateTransitionTime': None, 'scaleSettings': {'minNodeCount': 1, 'maxNodeCount': 4, 'nodeIdleTimeBeforeScaleDown': 'PT120S'}, 'vmPriority': 'Dedicated', 'vmSize': 'STANDARD_NC6'}
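
The serialized status is a plain dictionary, so it can be inspected without further SDK calls. In the output above the single node is still counted under `preparingNodeCount`, which suggests it is not yet ready to pick up a job even though provisioning succeeded. The following sketch, with a hypothetical helper `summarize_status` and a hand-copied subset of that output, shows one way to read those fields:

```python
def summarize_status(status):
    """Return (number of idle or running nodes, True if the cluster is steady and provisioned)."""
    counts = status['nodeStateCounts']
    ready_nodes = counts['idleNodeCount'] + counts['runningNodeCount']
    settled = (status['allocationState'] == 'Steady'
               and status['provisioningState'] == 'Succeeded')
    return ready_nodes, settled


# Subset of the status dictionary printed above.
status = {
    'currentNodeCount': 1,
    'targetNodeCount': 1,
    'nodeStateCounts': {'preparingNodeCount': 1, 'runningNodeCount': 0,
                        'idleNodeCount': 0, 'unusableNodeCount': 0,
                        'leavingNodeCount': 0, 'preemptedNodeCount': 0},
    'allocationState': 'Steady',
    'provisioningState': 'Succeeded',
}

ready_nodes, settled = summarize_status(status)
print('Ready nodes:', ready_nodes, '| settled:', settled)
```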


In [8]:
cts = ws.compute_targets
for ct in cts:
    print(ct)

pipeline
aks-exemple
gpu-cluster2
gpuclusterNC6
pipeline-cpu


## 3. Data

In [9]:
from azureml.core.dataset import Dataset
web_paths = ['http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz',
             'http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz',
             'http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz',
             'http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz'
            ]
dataset = Dataset.File.from_files(path = web_paths)

In [10]:
dataset = dataset.register(workspace = ws,
                           name = 'mnist dataset',
                           description='training and test dataset',
                           create_new_version=True)

In [11]:
dataset.to_path()

array(['/http/yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz',
       '/http/yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz',
       '/http/yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz',
       '/http/yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz'],
      dtype=object)
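
The four registered files are gzip-compressed IDX files: a 4-byte magic number (two zero bytes, a data-type code, and the number of dimensions) followed by one big-endian 32-bit size per dimension, then the raw data. As a minimal sketch of that layout, using only the standard library, the hypothetical helper `read_idx_header` below reads the header; it is tried on a tiny synthetic file rather than on the real MNIST data, and it is not the `utils.load_data` function used by the training script.

```python
import gzip
import os
import struct
import tempfile


def read_idx_header(path):
    """Read the header of a gzip-compressed IDX file.

    Returns (type_code, dims), where dims holds one size per dimension.
    """
    with gzip.open(path, 'rb') as gz:
        # Magic number: two zero bytes, one type code, one dimension count.
        zero1, zero2, type_code, n_dims = struct.unpack('>BBBB', gz.read(4))
        if zero1 != 0 or zero2 != 0:
            raise ValueError('not an IDX file: ' + path)
        # Each dimension size is a big-endian unsigned 32-bit integer.
        dims = struct.unpack('>' + 'I' * n_dims, gz.read(4 * n_dims))
    return type_code, dims


# Build a tiny synthetic "image" file (2 images of 3x3 pixels) to try the helper.
demo_path = os.path.join(tempfile.mkdtemp(), 'demo-images-idx3-ubyte.gz')
with gzip.open(demo_path, 'wb') as gz:
    gz.write(struct.pack('>BBBB', 0, 0, 0x08, 3))  # 0x08 = unsigned byte, 3 dimensions
    gz.write(struct.pack('>III', 2, 3, 3))         # 2 items, 3 rows, 3 columns
    gz.write(bytes(2 * 3 * 3))                     # all-zero pixel data

print(read_idx_header(demo_path))
```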

## 4. Create the project folder and experiment

In [12]:
import os
script_folder = './tf-resume-training'
os.makedirs(script_folder, exist_ok=True)

In [13]:
from azureml.core import Experiment

experiment_name = 'Exemple11-TensorFlow'
experiment = Experiment(ws, name=experiment_name)

## 5. Create and run the TensorFlow estimator

### Viewing the Python training script

In [14]:
with open(os.path.join(script_folder, 'tf_mnist_with_checkpoint.py'), 'r') as f:
    print(f.read())

# Copyright (c) Microsoft Corporation. All rights reserved.
# Licensed under the MIT License.

import numpy as np
import utils
import argparse
import os
import re
import tensorflow as tf
import glob

from azureml.core import Run
from utils import load_data

print("TensorFlow version:", tf.__version__)

parser = argparse.ArgumentParser()
parser.add_argument('--data-folder', type=str, dest='data_folder', help='data folder mounting point')

parser.add_argument('--resume-from', type=str, default=None,
                    help='location of the model or checkpoint files from where to resume the training')
args = parser.parse_args()


previous_model_location = args.resume_from
# You can also use environment variable to get the model/checkpoint files location
# previous_model_location = os.path.expandvars(os.getenv("AZUREML_DATAREFERENCE_MODEL_LOCATION", None))

data_folder = args.data_folder
print('Data folder:', data_folder)

# load train and test set into numpy arrays
# note we scale the pi

In [15]:
from azureml.train.dnn import TensorFlow

script_params={
    '--data-folder': dataset.as_named_input('mnist').as_mount()
}

estimator = TensorFlow(source_directory=script_folder,
                       compute_target=compute_target,
                       script_params=script_params,
                       entry_script='tf_mnist_with_checkpoint.py',
                       use_gpu=True,
                       pip_packages=['azureml-dataprep[pandas,fuse]'])
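
The entries of `script_params` should reach the entry script as command-line arguments, with the mounted dataset replaced by a path on the compute node. The sketch below reproduces the two `argparse` options declared in the training script shown earlier and parses a hypothetical command line; the wrapper `parse_training_args` and the path `/mnt/mnist` are assumptions for illustration, since the real mount point is chosen at run time.

```python
import argparse


def parse_training_args(argv):
    """Parse the two options declared by the training script shown above."""
    parser = argparse.ArgumentParser()
    parser.add_argument('--data-folder', type=str, dest='data_folder',
                        help='data folder mounting point')
    parser.add_argument('--resume-from', type=str, default=None,
                        help='location of the model or checkpoint files from where to resume the training')
    return parser.parse_args(argv)


# Hypothetical command line: '/mnt/mnist' stands in for the real mount point.
args = parse_training_args(['--data-folder', '/mnt/mnist'])
print('Data folder:', args.data_folder)
print('Resume from:', args.resume_from)
```

Because `--resume-from` is not part of `script_params` here, it keeps its default of `None`.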



### Submitting the run

In [16]:
run = experiment.submit(estimator)
print(run)

Run(Experiment: Exemple11-TensorFlow,
Id: Exemple11-TensorFlow_1583853521_94783f0b,
Type: azureml.scriptrun,
Status: Starting)


### Widget to monitor the run

In [17]:
from azureml.widgets import RunDetails
RunDetails(run).show()

_UserRunWidget(widget_settings={'childWidgetDisplay': 'popup', 'send_telemetry': False, 'log_level': 'INFO', '…

In [21]:
run.get_details()

{'runId': 'Exemple11-TensorFlow_1583853521_94783f0b',
 'target': 'gpuclusterNC6',
 'status': 'Completed',
 'startTimeUtc': '2020-03-10T15:19:00.990505Z',
 'endTimeUtc': '2020-03-10T15:22:22.994874Z',
 'properties': {'_azureml.ComputeTargetType': 'amlcompute',
  'ContentSnapshotId': 'fbbb369a-2230-4e7a-a8c3-2fbbe3bf6943',
  'azureml.git.repository_uri': 'https://github.com/retkowsky/WorkshopAML2020',
  'mlflow.source.git.repoURL': 'https://github.com/retkowsky/WorkshopAML2020',
  'azureml.git.branch': 'master',
  'mlflow.source.git.branch': 'master',
  'azureml.git.commit': '92bcd73fc9ec1037078710902a207fd495a95825',
  'mlflow.source.git.commit': '92bcd73fc9ec1037078710902a207fd495a95825',
  'azureml.git.dirty': 'False',
  'ProcessInfoFile': 'azureml-logs/process_info.json',
  'ProcessStatusFile': 'azureml-logs/process_status.json'},
 'inputDatasets': [{'dataset': {'id': '35f275fa-1eaa-4895-87a3-2445433be7f0'}, 'consumptionDetails': {'type': 'RunInput', 'inputName': 'mnist', 'mechanism'
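
The `startTimeUtc` and `endTimeUtc` fields are enough to work out how long the run took on the cluster. A minimal sketch, assuming the timestamps always carry fractional seconds and a trailing `Z` as in the output above (the helper name `run_duration_seconds` is hypothetical, and `strptime` is used so that it also works on the Python 3.6 kernel shown at the top of the notebook):

```python
from datetime import datetime

# Format of the UTC timestamps in the details shown above.
TIMESTAMP_FORMAT = '%Y-%m-%dT%H:%M:%S.%fZ'


def run_duration_seconds(details):
    """Return the elapsed time between startTimeUtc and endTimeUtc, in seconds."""
    start = datetime.strptime(details['startTimeUtc'], TIMESTAMP_FORMAT)
    end = datetime.strptime(details['endTimeUtc'], TIMESTAMP_FORMAT)
    return (end - start).total_seconds()


# Values copied from the output above.
details = {
    'startTimeUtc': '2020-03-10T15:19:00.990505Z',
    'endTimeUtc': '2020-03-10T15:22:22.994874Z',
}

print('Run duration: {:.1f} s'.format(run_duration_seconds(details)))
```

In the notebook itself, the same helper could be applied directly to the dictionary returned by `run.get_details()`.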

### Run metrics

In [22]:
run.get_metrics()

{}

### Viewing the metrics in the experiment from Azure ML Studio

In [23]:
experiment

| Name | Workspace | Report Page | Docs Page |
|---|---|---|---|
| Exemple11-TensorFlow | AzureMLWorkshop | Link to Azure Machine Learning studio | Link to Documentation |


<img src="https://github.com/retkowsky/images/blob/master/metriques.jpg?raw=true">

<img src="https://github.com/retkowsky/images/blob/master/Powered-by-MS-Azure-logo-v2.png?raw=true" height="300" width="300">