# Azure ML Compute

<img src='https://github.com/retkowsky/images/blob/master/AzureMLservicebanniere.png?raw=true'>

Documentation:<br>
https://docs.microsoft.com/en-us/azure/machine-learning/concept-compute-target <br>
https://docs.microsoft.com/en-us/azure/machine-learning/how-to-set-up-training-targets


**Azure Machine Learning Compute** is a **managed-compute infrastructure** that lets you create a single-node or multi-node compute cluster. The compute is created within your workspace region as a resource that can be shared with other users in your workspace. It **scales up automatically when a job is submitted** and can be placed in an Azure Virtual Network. Jobs run in a containerized environment, with your model dependencies packaged in a **Docker container**.

You can use Azure Machine Learning Compute to distribute the training process across a cluster of **CPU or GPU** compute nodes in the cloud. For more information on the VM sizes that include GPUs, see GPU-optimized virtual machine sizes.

Azure Machine Learning Compute has default limits, such as the number of cores that can be allocated. For more information, see Manage and request quotas for Azure resources.

You can create an Azure Machine Learning compute environment **on demand** when you schedule a run, or as a **persistent resource**.


## 1. Intro

In [None]:
import sys
sys.version

In [2]:
import datetime
now = datetime.datetime.now()
print(now)

2020-10-13 09:35:07.147158


In [3]:
import azureml.core
print("Version Azure ML :", azureml.core.VERSION)

Version Azure ML : 1.15.0


## 2. Workspace

Initialize a workspace object from persisted configuration

In [4]:
from azureml.core import Workspace
ws = Workspace.from_config()

## 3. Experiment

**Experiment** is a logical container in an Azure ML Workspace. It hosts run records which can include run metrics and output artifacts from your experiments.

In [5]:
from azureml.core import Experiment
experiment_name = 'Workshop-amlcompute'
experiment = Experiment(workspace = ws, name = experiment_name)

## 4. Introduction to AmlCompute

> https://docs.microsoft.com/en-us/azure/machine-learning/service/how-to-set-up-training-targets

## List of defined compute targets

In [6]:
print("Compute targets in your Azure ML workspace:")
cts = ws.compute_targets
for ct in cts:
    print('-', ct)

Compute targets in your Azure ML workspace:
- monclustergpu
- nbookinstance
- monclustercpu
- cpucluster
- cpuclusterd2
- clustergpu


### 4.1 List of available AML Compute VM sizes

In [7]:
from azureml.core.compute import ComputeTarget, AmlCompute

AmlCompute.supported_vmsizes(workspace = ws)

[{'name': 'Standard_D1_v2',
  'vCPUs': 1,
  'gpus': 0,
  'memoryGB': 3.5,
  'maxResourceVolumeMB': 51200},
 {'name': 'Standard_D2_v2',
  'vCPUs': 2,
  'gpus': 0,
  'memoryGB': 7.0,
  'maxResourceVolumeMB': 102400},
 {'name': 'Standard_D3_v2',
  'vCPUs': 4,
  'gpus': 0,
  'memoryGB': 14.0,
  'maxResourceVolumeMB': 204800},
 {'name': 'Standard_D4_v2',
  'vCPUs': 8,
  'gpus': 0,
  'memoryGB': 28.0,
  'maxResourceVolumeMB': 409600},
 {'name': 'Standard_D11_v2',
  'vCPUs': 2,
  'gpus': 0,
  'memoryGB': 14.0,
  'maxResourceVolumeMB': 102400},
 {'name': 'Standard_D12_v2',
  'vCPUs': 4,
  'gpus': 0,
  'memoryGB': 28.0,
  'maxResourceVolumeMB': 204800},
 {'name': 'Standard_D13_v2',
  'vCPUs': 8,
  'gpus': 0,
  'memoryGB': 56.0,
  'maxResourceVolumeMB': 409600},
 {'name': 'Standard_D14_v2',
  'vCPUs': 16,
  'gpus': 0,
  'memoryGB': 112.0,
  'maxResourceVolumeMB': 819200},
 {'name': 'Standard_DS1_v2',
  'vCPUs': 1,
  'gpus': 0,
  'memoryGB': 3.5,
  'maxResourceVolumeMB': 7168},
 {'name': 'Standar
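
The value returned by `AmlCompute.supported_vmsizes` is a plain list of dictionaries, so it can be filtered with ordinary Python before choosing a `vm_size`. The following is a minimal sketch, assuming the keys shown in the output above (`name`, `vCPUs`, `gpus`, `memoryGB`). The helper `filter_vm_sizes` is hypothetical (not part of the SDK), and the sample entries are copied from that output rather than fetched from a live workspace.

```python
def filter_vm_sizes(vm_sizes, min_vcpus=1, min_memory_gb=0.0, gpu=False):
    """Return the names of VM sizes meeting the CPU, memory, and GPU criteria.

    Hypothetical helper: `vm_sizes` is assumed to be a list of dicts shaped
    like the output of AmlCompute.supported_vmsizes shown above.
    """
    return [
        size["name"]
        for size in vm_sizes
        if size["vCPUs"] >= min_vcpus
        and size["memoryGB"] >= min_memory_gb
        and (size["gpus"] > 0) == gpu
    ]


# Sample entries copied from the output above (not a live API call).
sample_sizes = [
    {"name": "Standard_D1_v2", "vCPUs": 1, "gpus": 0, "memoryGB": 3.5},
    {"name": "Standard_D2_v2", "vCPUs": 2, "gpus": 0, "memoryGB": 7.0},
    {"name": "Standard_D3_v2", "vCPUs": 4, "gpus": 0, "memoryGB": 14.0},
    {"name": "Standard_D11_v2", "vCPUs": 2, "gpus": 0, "memoryGB": 14.0},
]

# CPU-only sizes with at least 2 vCPUs and 14 GB of memory.
cpu_candidates = filter_vm_sizes(sample_sizes, min_vcpus=2, min_memory_gb=14.0)
print(cpu_candidates)
```

In a notebook, the same helper could be applied to the live result, for example `filter_vm_sizes(AmlCompute.supported_vmsizes(workspace=ws), min_vcpus=4)`.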

In [8]:
import os
import shutil

project_folder = './train-on-amlcompute'
os.makedirs(project_folder, exist_ok=True)
shutil.copy('train_aml.py', project_folder)

'./train-on-amlcompute/train_aml.py'

In [9]:
with open(os.path.join(project_folder, 'train_aml.py'), 'r') as f:
    print(f.read())

# Copyright (c) Microsoft. All rights reserved.
# Licensed under the MIT license.

from sklearn.datasets import load_diabetes
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from azureml.core.run import Run
from sklearn.externals import joblib
import os
import numpy as np

os.makedirs('./outputs', exist_ok=True)

X, y = load_diabetes(return_X_y=True)

run = Run.get_context()

X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.2,
                                                    random_state=0)
data = {"train": {"X": X_train, "y": y_train},
        "test": {"X": X_test, "y": y_test}}

# list of numbers from 0.0 to 1.0 with a 0.05 interval
alphas = np.arange(0.0, 1.0, 0.05)

for alpha in alphas:
    # Use Ridge algorithm to create a regression model
    reg = Ridge(alpha=alpha)
    reg.fit(data["train"]["X"], data["train"]["y"

In [10]:
import sklearn
print('Version scikit-learn =', sklearn.__version__)

Version scikit-learn = 0.22.2.post1


### 4.3 Environment

In [13]:
from azureml.core import Environment
from azureml.core.conda_dependencies import CondaDependencies

myenv = Environment("myenv")

myenv.docker.enabled = True

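# scikit-learn is pinned to 0.20.3, a release that still ships sklearn.externals.joblib
# (imported by train_aml.py; deprecated in 0.21 and removed in 0.23)
myenv.python.conda_dependencies = CondaDependencies.create(conda_packages=['scikit-learn==0.20.3'])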

> Documentation : https://docs.microsoft.com/en-us/azure/machine-learning/service/how-to-set-up-training-targets#amlcompute<br>
> Pricing : https://azure.microsoft.com/en-us/pricing/details/machine-learning/

In [14]:
%%time
from azureml.core.compute import ComputeTarget, AmlCompute
from azureml.core.compute_target import ComputeTargetException

# Cluster name
cpu_cluster_name = "cpuclusterd2v2"

try:
    cpu_cluster = ComputeTarget(workspace=ws, name=cpu_cluster_name)
    print('Found existing cluster, use it.')
except ComputeTargetException:
    compute_config = AmlCompute.provisioning_configuration(vm_size='STANDARD_D2_V2',
                                                           min_nodes = 1, # Set to 0 to let the cluster scale down to zero nodes when idle
                                                           max_nodes = 4)
    cpu_cluster = ComputeTarget.create(ws, cpu_cluster_name, compute_config)

cpu_cluster.wait_for_completion(show_output=True)

Creating
Succeeded.................
AmlCompute wait for completion finished

Minimum number of nodes requested have been provisioned
CPU times: user 350 ms, sys: 32.2 ms, total: 383 ms
Wall time: 1min 49s


In [15]:
# List the available compute targets
listecomputeservers = ws.compute_targets
for listesrv in listecomputeservers:
    print(listesrv)

nbookinstance
monclustercpu
cpucluster
cpuclusterd2
clustergpu
cpuclusterd2v2


In [16]:
cpu_cluster.get_status().serialize()

{'currentNodeCount': 1,
 'targetNodeCount': 1,
 'nodeStateCounts': {'preparingNodeCount': 0,
  'runningNodeCount': 0,
  'idleNodeCount': 1,
  'unusableNodeCount': 0,
  'leavingNodeCount': 0,
  'preemptedNodeCount': 0},
 'allocationState': 'Steady',
 'allocationStateTransitionTime': '2020-10-13T09:39:58.164000+00:00',
 'errors': None,
 'creationTime': '2020-10-13T09:38:22.248494+00:00',
 'modifiedTime': '2020-10-13T09:38:37.962911+00:00',
 'provisioningState': 'Succeeded',
 'provisioningStateTransitionTime': None,
 'scaleSettings': {'minNodeCount': 1,
  'maxNodeCount': 4,
  'nodeIdleTimeBeforeScaleDown': 'PT120S'},
 'vmPriority': 'Dedicated',
 'vmSize': 'STANDARD_D2_V2'}
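
The `nodeIdleTimeBeforeScaleDown` field in the status above is an ISO 8601 duration string (`'PT120S'` means 120 seconds). The following is a minimal sketch of converting such a value to seconds. The helper `iso_duration_to_seconds` is hypothetical, and it assumes only time-based durations (hours, minutes, whole seconds), which is all the output above shows.

```python
import re

# Matches time-only ISO 8601 durations such as PT120S, PT2M or PT1H30M.
_DURATION_RE = re.compile(
    r"^PT(?:(?P<hours>\d+)H)?(?:(?P<minutes>\d+)M)?(?:(?P<seconds>\d+)S)?$"
)


def iso_duration_to_seconds(value):
    """Convert a time-only ISO 8601 duration string to a number of seconds."""
    match = _DURATION_RE.match(value)
    # Reject strings that do not match, and a bare "PT" with no components.
    if match is None or not any(match.groupdict().values()):
        raise ValueError(f"Unsupported duration: {value!r}")
    parts = {name: int(number or 0) for name, number in match.groupdict().items()}
    return parts["hours"] * 3600 + parts["minutes"] * 60 + parts["seconds"]


idle_seconds = iso_duration_to_seconds("PT120S")
print(idle_seconds)  # 120
```

With a live cluster, the input would come from `cpu_cluster.get_status().serialize()['scaleSettings']['nodeIdleTimeBeforeScaleDown']`.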

In [17]:
# List the cluster nodes and their state
cpu_cluster.list_nodes()

[{'nodeId': 'tvmps_89a740fde66671fb6cbc40857379dc2c5b625d649f2027cfea1bbc63c0f024ee_d',
  'port': 50000,
  'publicIpAddress': '51.105.167.166',
  'privateIpAddress': '10.0.0.4',
  'nodeState': 'idle'}]
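
On a larger cluster, the node list is easier to read as a count per state. The following is a minimal sketch, assuming each entry carries a `nodeState` key as in the output above. The helper `count_node_states` is hypothetical, and the sample node list (IDs and states) is made up for illustration.

```python
from collections import Counter


def count_node_states(nodes):
    """Count nodes per state from a list shaped like list_nodes() output."""
    return Counter(node["nodeState"] for node in nodes)


# Made-up sample data; real entries also include ports and IP addresses.
sample_nodes = [
    {"nodeId": "node-a", "nodeState": "idle"},
    {"nodeId": "node-b", "nodeState": "running"},
    {"nodeId": "node-c", "nodeState": "idle"},
]

state_counts = count_node_states(sample_nodes)
for state, count in sorted(state_counts.items()):
    print(f"{state}: {count}")
```

In the notebook, the call would be `count_node_states(cpu_cluster.list_nodes())`.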

### 4.4 Configuring and submitting the run

In [18]:
# Python script to run
!ls train_aml.py -l

-rwxrwxrwx 1 root root 1538 Oct 13 08:24 train_aml.py


In [22]:
with open('train_aml.py', 'r') as f:
    print(f.read())

# Copyright (c) Microsoft. All rights reserved.
# Licensed under the MIT license.

from sklearn.datasets import load_diabetes
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from azureml.core.run import Run
from sklearn.externals import joblib
import os
import numpy as np

os.makedirs('./outputs', exist_ok=True)

X, y = load_diabetes(return_X_y=True)

run = Run.get_context()

X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.2,
                                                    random_state=0)
data = {"train": {"X": X_train, "y": y_train},
        "test": {"X": X_test, "y": y_test}}

# list of numbers from 0.0 to 1.0 with a 0.05 interval
alphas = np.arange(0.0, 1.0, 0.05)

for alpha in alphas:
    # Use Ridge algorithm to create a regression model
    reg = Ridge(alpha=alpha)
    reg.fit(data["train"]["X"], data["train"]["y"

In [23]:
from azureml.core import ScriptRunConfig
from azureml.core.runconfig import DEFAULT_CPU_IMAGE

src = ScriptRunConfig(source_directory=project_folder, script='train_aml.py')

# Set compute target to the one created in previous step
src.run_config.target = cpu_cluster.name

# Set environment
src.run_config.environment = myenv

In [24]:
# Define tags for the run
tagsdurun = {"Type": "test" , "Langage" : "Python" , "Framework" : "Scikit-Learn", "Team" : "DataScience" , "Pays" : "France"}

> Let's go! Submit the run.

In [25]:
# Submit the run
run = experiment.submit(config=src, tags=tagsdurun)
run

Experiment,Id,Type,Status,Details Page,Docs Page
Workshop-amlcompute,Workshop-amlcompute_1602582161_46313f6a,azureml.scriptrun,Starting,Link to Azure Machine Learning studio,Link to Documentation


### 4.5 Widget for tracking run progress

In [26]:
from azureml.widgets import RunDetails
RunDetails(run).show()

_UserRunWidget(widget_settings={'childWidgetDisplay': 'popup', 'send_telemetry': False, 'log_level': 'INFO', '…

### 4.6 Additional information

> Use **run.get_details** to track **the progress of the run**. <br>If the cluster is idle, the run may take longer to process.

In [35]:
# Run status
run.get_status()

'Completed'

In [36]:
# Run details
run.get_details()

{'runId': 'Workshop-amlcompute_1602582161_46313f6a',
 'target': 'cpuclusterd2v2',
 'status': 'Completed',
 'startTimeUtc': '2020-10-13T09:49:25.688789Z',
 'endTimeUtc': '2020-10-13T09:53:06.664352Z',
 'properties': {'_azureml.ComputeTargetType': 'amlcompute',
  'ContentSnapshotId': '03322f09-e6b5-428a-b8a7-875deafbfd44',
  'ProcessInfoFile': 'azureml-logs/process_info.json',
  'ProcessStatusFile': 'azureml-logs/process_status.json'},
 'inputDatasets': [],
 'outputDatasets': [],
 'runDefinition': {'script': 'train_aml.py',
  'command': [],
  'useAbsolutePath': False,
  'arguments': [],
  'sourceDirectoryDataStore': None,
  'framework': 'Python',
  'communicator': 'None',
  'target': 'cpuclusterd2v2',
  'dataReferences': {},
  'data': {},
  'outputData': {},
  'jobName': None,
  'maxRunDurationSeconds': 2592000,
  'nodeCount': 1,
  'priority': None,
  'environment': {'name': 'myenv',
   'version': 'Autosave_2020-10-13T09:42:45Z_daf56b1b',
   'python': {'interpreterPath': 'python',
    'u
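
The `startTimeUtc` and `endTimeUtc` fields in the run details can be used to compute how long the run took. The following is a minimal sketch using the two timestamps from the output above. The helper `run_duration_seconds` is hypothetical, and it assumes both timestamps use the `...T...Z` format with fractional seconds, as they do here.

```python
from datetime import datetime

# Timestamp layout seen in the run details above (UTC, fractional seconds).
TIMESTAMP_FORMAT = "%Y-%m-%dT%H:%M:%S.%fZ"


def run_duration_seconds(details):
    """Return the elapsed time in seconds between run start and run end."""
    start = datetime.strptime(details["startTimeUtc"], TIMESTAMP_FORMAT)
    end = datetime.strptime(details["endTimeUtc"], TIMESTAMP_FORMAT)
    return (end - start).total_seconds()


# Values copied from the run details output above.
sample_details = {
    "startTimeUtc": "2020-10-13T09:49:25.688789Z",
    "endTimeUtc": "2020-10-13T09:53:06.664352Z",
}

duration = run_duration_seconds(sample_details)
print(f"Run duration: {duration:.1f} s")
```

For this run the result is a little under 3 minutes 41 seconds. In the notebook, the call would be `run_duration_seconds(run.get_details())`.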

In [37]:
# Node status
cpu_cluster.list_nodes()

[{'nodeId': 'tvmps_89a740fde66671fb6cbc40857379dc2c5b625d649f2027cfea1bbc63c0f024ee_d',
  'port': 50000,
  'publicIpAddress': '51.105.167.166',
  'privateIpAddress': '10.0.0.4',
  'nodeState': 'idle'}]

> To view the experiment's metrics (available only once the run has finished). <br>The metrics are also visible in the Azure portal.

In [38]:
print("List of metrics:")
run.get_metrics()

List of metrics:


{'alpha': [0.0,
  0.05,
  0.1,
  0.15000000000000002,
  0.2,
  0.25,
  0.30000000000000004,
  0.35000000000000003,
  0.4,
  0.45,
  0.5,
  0.55,
  0.6000000000000001,
  0.65,
  0.7000000000000001,
  0.75,
  0.8,
  0.8500000000000001,
  0.9,
  0.9500000000000001],
 'mse': [3424.3166882137343,
  3408.9153122589296,
  3372.649627810032,
  3345.14964347419,
  3325.294679467878,
  3311.5562509289744,
  3302.6736334017264,
  3297.658733944204,
  3295.74106435581,
  3296.316884705676,
  3298.9096058070622,
  3303.140055527517,
  3308.7042707723226,
  3315.3568399622573,
  3322.898314903962,
  3331.1656169285875,
  3340.024662032161,
  3349.364644348603,
  3359.093569748443,
  3369.1347399130477]}
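
Because `alpha` and `mse` are logged as parallel lists, the best regularization strength can be read off by pairing them and taking the minimum MSE. The following is a minimal sketch, assuming the dictionary shape shown above. The helper `best_alpha` is hypothetical, and the sample uses three consecutive entries copied from that output.

```python
def best_alpha(metrics):
    """Return the (alpha, mse) pair with the lowest mean squared error."""
    pairs = zip(metrics["alpha"], metrics["mse"])
    return min(pairs, key=lambda pair: pair[1])


# Three consecutive entries copied from the metrics output above.
sample_metrics = {
    "alpha": [0.35000000000000003, 0.4, 0.45],
    "mse": [3297.658733944204, 3295.74106435581, 3296.316884705676],
}

alpha, mse = best_alpha(sample_metrics)
print(f"Best alpha: {alpha} (MSE = {mse:.2f})")
```

Applied to the full metrics above, the minimum MSE (about 3295.74) is also reached at `alpha = 0.4`. In the notebook, the call would be `best_alpha(run.get_metrics())`.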

### Compute cluster information:

In [39]:
print("Cluster status:")
cpu_cluster.get_status().serialize()

Cluster status:


{'currentNodeCount': 1,
 'targetNodeCount': 1,
 'nodeStateCounts': {'preparingNodeCount': 0,
  'runningNodeCount': 0,
  'idleNodeCount': 1,
  'unusableNodeCount': 0,
  'leavingNodeCount': 0,
  'preemptedNodeCount': 0},
 'allocationState': 'Steady',
 'allocationStateTransitionTime': '2020-10-13T09:39:58.164000+00:00',
 'errors': None,
 'creationTime': '2020-10-13T09:38:22.248494+00:00',
 'modifiedTime': '2020-10-13T09:38:37.962911+00:00',
 'provisioningState': 'Succeeded',
 'provisioningStateTransitionTime': None,
 'scaleSettings': {'minNodeCount': 1,
  'maxNodeCount': 4,
  'nodeIdleTimeBeforeScaleDown': 'PT120S'},
 'vmPriority': 'Dedicated',
 'vmSize': 'STANDARD_D2_V2'}

In [40]:
print("Cluster nodes:")
cpu_cluster.list_nodes()

Cluster nodes:


[{'nodeId': 'tvmps_89a740fde66671fb6cbc40857379dc2c5b625d649f2027cfea1bbc63c0f024ee_d',
  'port': 50000,
  'publicIpAddress': '51.105.167.166',
  'privateIpAddress': '10.0.0.4',
  'nodeState': 'idle'}]

### The cluster configuration can be changed:

In [41]:
cpu_cluster.update(min_nodes=0) # Set the minimum node count to 0

In [42]:
cpu_cluster.get_status().serialize()

{'currentNodeCount': 1,
 'targetNodeCount': 1,
 'nodeStateCounts': {'preparingNodeCount': 0,
  'runningNodeCount': 0,
  'idleNodeCount': 1,
  'unusableNodeCount': 0,
  'leavingNodeCount': 0,
  'preemptedNodeCount': 0},
 'allocationState': 'Steady',
 'allocationStateTransitionTime': '2020-10-13T09:39:58.164000+00:00',
 'errors': None,
 'creationTime': '2020-10-13T09:38:22.248494+00:00',
 'modifiedTime': '2020-10-13T09:54:41.274807+00:00',
 'provisioningState': 'Succeeded',
 'provisioningStateTransitionTime': None,
 'scaleSettings': {'minNodeCount': 0,
  'maxNodeCount': 4,
  'nodeIdleTimeBeforeScaleDown': 'PT120S'},
 'vmPriority': 'Dedicated',
 'vmSize': 'STANDARD_D2_V2'}

In [43]:
cpu_cluster.update(max_nodes=10)

In [44]:
cpu_cluster.update(idle_seconds_before_scaledown=1200) # Change the idle timeout before scale-down (in seconds)

In [45]:
print("Cluster status:")
cpu_cluster.get_status().serialize()

Cluster status:


{'currentNodeCount': 1,
 'targetNodeCount': 1,
 'nodeStateCounts': {'preparingNodeCount': 0,
  'runningNodeCount': 0,
  'idleNodeCount': 1,
  'unusableNodeCount': 0,
  'leavingNodeCount': 0,
  'preemptedNodeCount': 0},
 'allocationState': 'Steady',
 'allocationStateTransitionTime': '2020-10-13T09:39:58.164000+00:00',
 'errors': None,
 'creationTime': '2020-10-13T09:38:22.248494+00:00',
 'modifiedTime': '2020-10-13T09:54:49.548267+00:00',
 'provisioningState': 'Succeeded',
 'provisioningStateTransitionTime': None,
 'scaleSettings': {'minNodeCount': 0,
  'maxNodeCount': 10,
  'nodeIdleTimeBeforeScaleDown': 'PT1200S'},
 'vmPriority': 'Dedicated',
 'vmSize': 'STANDARD_D2_V2'}

In [46]:
cpu_cluster.update(min_nodes=2, max_nodes=4, idle_seconds_before_scaledown=600)
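
When several scale settings are changed in one call, it can help to sanity-check them locally first. The following is a minimal sketch of such a check. The helper `validate_scale_settings` is hypothetical and only enforces basic consistency (non-negative values, minimum not above maximum); the service applies its own limits, such as core quotas, which are not modeled here.

```python
def validate_scale_settings(min_nodes, max_nodes, idle_seconds_before_scaledown):
    """Check basic consistency and return keyword arguments for update()."""
    if min_nodes < 0 or max_nodes < 0:
        raise ValueError("Node counts must not be negative.")
    if min_nodes > max_nodes:
        raise ValueError("min_nodes must not exceed max_nodes.")
    if idle_seconds_before_scaledown < 0:
        raise ValueError("The idle timeout must not be negative.")
    return {
        "min_nodes": min_nodes,
        "max_nodes": max_nodes,
        "idle_seconds_before_scaledown": idle_seconds_before_scaledown,
    }


# Same values as the update call above.
settings = validate_scale_settings(2, 4, 600)
print(settings)
```

The returned dictionary can then be passed on as `cpu_cluster.update(**settings)`. Note that a non-zero `min_nodes` keeps that many nodes allocated even when no run is active.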

### Deleting the compute cluster:

In [47]:
# Delete the compute cluster
cpu_cluster.delete()

In [48]:
compute_targets = ws.compute_targets
for name, ct in compute_targets.items():
    print(name, " - " , ct.type, " - ", ct.provisioning_state)

cpuclusterd2v2  -  AmlCompute  -  Deleting
nbookinstance  -  ComputeInstance  -  Succeeded
monclustercpu  -  AmlCompute  -  Succeeded
cpucluster  -  AmlCompute  -  Succeeded
cpuclusterd2  -  AmlCompute  -  Succeeded
clustergpu  -  AmlCompute  -  Succeeded


<img src="https://github.com/retkowsky/images/blob/master/Powered-by-MS-Azure-logo-v2.png?raw=true" height="300" width="300">