Copyright (c) Microsoft Corporation. All rights reserved.

Licensed under the MIT License.


# Train using Azure Machine Learning Compute

* Initialize a Workspace
* Create an Experiment
* Introduction to AmlCompute
* Submit an AmlCompute run in a few different ways
    - Provision as a persistent compute target (Basic)
    - Provision as a persistent compute target (Advanced)
* Additional operations to perform on AmlCompute
* Find the best model in the run

## Prerequisites
If you are using an Azure Machine Learning Notebook VM, you are all set. Otherwise, go through the [configuration](../../../configuration.ipynb) notebook first, if you haven't already, to establish your connection to the Azure ML Workspace.

In [1]:
# Check core SDK version number
import azureml.core

print("SDK version:", azureml.core.VERSION)

SDK version: 1.6.0


## Initialize a Workspace

Initialize a workspace object from the persisted configuration file.

In [2]:
from azureml.core import Workspace

ws = Workspace.from_config('../../../config/config.json')
print(ws.name, ws.resource_group, ws.location, ws.subscription_id, sep = '\n')

avadevitsmlsvc
RG-ITSMLTeam-Dev
westus2
ff2e23ae-7d7c-4cbd-99b8-116bb94dca6e
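
`Workspace.from_config` reads a small JSON file. As a rough sketch of what that file is expected to contain, the snippet below writes and re-reads one using only the standard library. The three keys are the ones the SDK documents for `config.json`; the placeholder values and the helper names `write_config` and `read_config` are invented for illustration and are not part of the SDK.

```python
import json
import os
import tempfile

# Placeholder values for illustration only; substitute your own workspace details.
config = {
    "subscription_id": "<your-subscription-id>",
    "resource_group": "<your-resource-group>",
    "workspace_name": "<your-workspace-name>",
}


def write_config(directory, settings):
    """Write settings to <directory>/config.json and return the file path."""
    path = os.path.join(directory, "config.json")
    with open(path, "w") as f:
        json.dump(settings, f, indent=4)
    return path


def read_config(path):
    """Load a config.json file and check that the expected keys are present."""
    with open(path) as f:
        loaded = json.load(f)
    missing = {"subscription_id", "resource_group", "workspace_name"} - set(loaded)
    if missing:
        raise KeyError(f"config.json is missing keys: {sorted(missing)}")
    return loaded


# Round-trip the file in a temporary directory so nothing is left behind.
with tempfile.TemporaryDirectory() as tmp:
    config_path = write_config(tmp, config)
    loaded_config = read_config(config_path)
    print(sorted(loaded_config))
```

Checking the keys up front gives a clearer error than a failed workspace lookup later on.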


## Create an Experiment

An **Experiment** is a logical container in an Azure ML Workspace. It hosts run records, which can include run metrics and output artifacts from your experiments.

In [3]:
from azureml.core import Experiment
experiment_name = 'train-on-amlcompute'
experiment = Experiment(workspace = ws, name = experiment_name)

## Introduction to AmlCompute

Azure Machine Learning Compute (AmlCompute) is managed compute infrastructure that lets you easily create single-node to multi-node compute of the appropriate VM family. It is created **within your workspace region** and is a resource that can be used by other users in your workspace. By default, it autoscales up to `max_nodes` when a job is submitted, and it executes jobs in a containerized environment that packages the dependencies you specify.

Since it is managed compute, job scheduling and cluster management are handled internally by the Azure Machine Learning service.

For more information on Azure Machine Learning Compute, please read [this article](https://docs.microsoft.com/azure/machine-learning/service/how-to-set-up-training-targets#amlcompute).

If you are an existing BatchAI customer who is migrating to Azure Machine Learning, please read [this article](https://aka.ms/batchai-retirement).

**Note**: As with other Azure services, there are limits on certain resources (e.g., AmlCompute quota) associated with the Azure Machine Learning service. Please read [this article](https://docs.microsoft.com/azure/machine-learning/service/how-to-manage-quotas) on the default limits and how to request more quota.
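
To make the autoscaling description in this section concrete, here is a toy model of how a node count could be bounded by the cluster limits. It is only an illustration: the function `target_node_count` is hypothetical, and the real scheduling logic lives inside the Azure Machine Learning service and is not shown in this notebook.

```python
def target_node_count(requested_nodes, min_nodes=0, max_nodes=4):
    """Clamp the nodes requested by queued jobs to the range [min_nodes, max_nodes]."""
    if min_nodes > max_nodes:
        raise ValueError("min_nodes must not exceed max_nodes")
    return max(min_nodes, min(requested_nodes, max_nodes))


# With the defaults above, the cluster idles at 0 nodes and never exceeds 4.
for requested in (0, 1, 3, 10):
    print(requested, "->", target_node_count(requested))
```

In this simplified picture, a request for more nodes than `max_nodes` is capped at the limit rather than rejected.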


The training script, `tsne.py` (the script submitted in the cells below), is already created for you.

### Create project directory

Create a directory that will contain all the code from your local machine that you will need on the remote resource. This includes the training script and any additional files your training script depends on.

In [6]:
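import os
import shutil

# Folder holding the training script and any files it depends on
project_folder = './tsne_code'
os.makedirs(project_folder, exist_ok=True)

# Folder for output files
output_dir = "outputs"
os.makedirs(output_dir, exist_ok=True)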
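
If the training script lives outside the project directory, it has to be copied in before the run is submitted. The following is a minimal sketch of that step using `os` and `shutil`. The helper name `stage_project` and the throwaway script contents are assumptions made for the example; the demonstration runs in a temporary directory so it does not touch the real `./tsne_code` folder.

```python
import os
import shutil
import tempfile


def stage_project(script_path, project_folder):
    """Copy a training script into project_folder, creating the folder if needed."""
    os.makedirs(project_folder, exist_ok=True)
    return shutil.copy(script_path, project_folder)


# Demonstrate with a throwaway script in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    script = os.path.join(tmp, "tsne.py")
    with open(script, "w") as f:
        f.write("print('hello from the training script')\n")

    staged = stage_project(script, os.path.join(tmp, "tsne_code"))
    staged_name = os.path.basename(staged)
    staged_exists = os.path.exists(staged)
    print(staged_name, staged_exists)
```

Everything inside the project directory is uploaded with the run, so it is worth keeping large data files out of it.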

## Running Local

In [24]:
from azureml.core import Environment, ScriptRunConfig


# Build the environment for the local run; run configuration properties can be edited on the fly.
user_managed_env = Environment("user-managed-env")

# Option 1: use an existing conda environment that already has the requirements installed
# user_managed_env.python.user_managed_dependencies = True
# user_managed_env.python.interpreter_path = '/Users/anders.swanson/opt/anaconda3/envs/mlnb/bin/python'

# Option 2: let Azure ML build the environment from a pip requirements file
user_managed_env = Environment.from_pip_requirements("myenv", '../../../config/requirements.txt')


src_local = ScriptRunConfig(source_directory=project_folder, script='tsne.py')
src_local.run_config.environment = user_managed_env

In [25]:
run_local = experiment.submit(config=src_local)
run_local

| Experiment | Id | Type | Status | Details Page | Docs Page |
| --- | --- | --- | --- | --- | --- |
| train-on-amlcompute | train-on-amlcompute_1591492728_0539ef5f | azureml.scriptrun | Running | Link to Azure Machine Learning studio | Link to Documentation |


In [None]:
%%time
# Shows output of the run on stdout.
run_local.wait_for_completion(show_output=True)

RunId: train-on-amlcompute_1591492728_0539ef5f
Web View: https://ml.azure.com/experiments/train-on-amlcompute/runs/train-on-amlcompute_1591492728_0539ef5f?wsid=/subscriptions/ff2e23ae-7d7c-4cbd-99b8-116bb94dca6e/resourcegroups/RG-ITSMLTeam-Dev/workspaces/avadevitsmlsvc

Streaming azureml-logs/70_driver_log.txt

Entering context manager injector. Current time:2020-06-06T18:18:54.670991
Starting the daemon thread to refresh tokens in background for process with pid = 16891
Entering Run History Context Manager.
Preparing to call script [ tsne.py ] with arguments: []
After variable expansion, calling script [ tsne.py ] with arguments: []

Azure ML version: 1.6.0
gensim version: 3.8.3
sklearn version: 0.19.2
scipy version: 1.1.0
numpy version: 1.17.2
pandas version: 1.0.4
[t-SNE] Computing 121 nearest neighbors...
[t-SNE] Indexed 1000 samples in 0.001s...
[t-SNE] Computed neighbors for 1000 samples in 0.029s...
[t-SNE] Computed conditional probabilities for sample 1000 / 1000
[t-SNE] Mean s

## Running Remote

### Create environment

Create a Docker-based environment with scikit-learn installed.

In [16]:
from azureml.core import Environment
from azureml.core.conda_dependencies import CondaDependencies

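# Method 1: build the environment from a pip requirements file
myenv = Environment.from_pip_requirements("myenv", '../../../config/requirements.txt')

# Method 2: enable Docker and list the pip packages explicitly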
# myenv = Environment("myenv")
# myenv.docker.enabled = True
# myenv.python.conda_dependencies = CondaDependencies.create(
#     pip_packages=[
#         'azureml-sdk==1.6.0', 'gensim==3.8.3', 'scikit-learn==0.19.2',
#         'matplotlib==2.2.3', 'numpy==1.17.2', 'scipy==1.1.0','ipykernel==5.3.0'
#     ]
# )


print(list(myenv.python.conda_dependencies.pip_packages))

['azureml-sdk==1.6.0', 'gensim==3.8.3', 'scikit-learn==0.19.2', 'matplotlib==2.2.3', 'numpy==1.17.2', 'scipy==1.1.0', 'ipykernel==5.3.0', 'pandas==1.0.4', 'ipython==7.15.0']
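
The printed list mirrors the pinned entries of the requirements file. As a rough illustration of that mapping, the sketch below turns requirements text into a list of pins. It is not the SDK's own parser: `parse_requirements` is a hypothetical helper, and the sample text reuses three of the pins shown in the output above.

```python
def parse_requirements(text):
    """Return the non-empty, non-comment entries of a requirements file."""
    packages = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and surrounding whitespace
        if line:
            packages.append(line)
    return packages


# Sample content for illustration; the pins are taken from the output above.
sample = """
# pinned versions
azureml-sdk==1.6.0
gensim==3.8.3
scikit-learn==0.19.2   # trailing comments are ignored
"""

requirements = parse_requirements(sample)
pins = dict(entry.split("==") for entry in requirements)
print(requirements)
print(pins["scikit-learn"])
```

Pinning exact versions, as the requirements file here does, keeps the local and remote environments consistent with each other.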


### Submit an AmlCompute run in a few different ways

First, let's check which VM families are available in your region. Azure is a regional service, and some specialized SKUs (especially GPUs) are only available in certain regions. Since AmlCompute is created in the region of your workspace, we will use the `supported_vmsizes()` function to see if the VM family we want to use ('STANDARD_D2_V2') is supported.

You can also pass a different region to check availability and then re-create your workspace in that region through the [configuration notebook](../../../configuration.ipynb).

In [5]:
from azureml.core.compute import ComputeTarget, AmlCompute

# AmlCompute.supported_vmsizes(workspace = ws)
#AmlCompute.supported_vmsizes(workspace = ws, location='southcentralus')
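
Once the list of VM sizes is available, checking for a specific size is a simple lookup. The sketch below assumes that `supported_vmsizes()` returns a list of dictionaries that each carry a `name` key; that return shape is an assumption here, and the sample entries are hypothetical rather than real service output. The helper `is_vm_size_supported` is likewise invented for the example.

```python
def is_vm_size_supported(vm_sizes, wanted):
    """Return True if `wanted` matches the 'name' of any entry, ignoring case."""
    wanted = wanted.lower()
    return any(entry.get("name", "").lower() == wanted for entry in vm_sizes)


# Hypothetical sample shaped like the assumed return value; not real service output.
sample_sizes = [
    {"name": "Standard_D1_v2", "vCPUs": 1, "memoryGB": 3.5},
    {"name": "Standard_D2_v2", "vCPUs": 2, "memoryGB": 7.0},
]

print(is_vm_size_supported(sample_sizes, "STANDARD_D2_V2"))
print(is_vm_size_supported(sample_sizes, "STANDARD_NC6"))

# With a real workspace, the list would come from the SDK call shown above:
# sizes = AmlCompute.supported_vmsizes(workspace=ws)
# is_vm_size_supported(sizes, "STANDARD_D2_V2")
```

The comparison ignores case because the size is written in upper case in this notebook, while the service may report it in mixed case.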

### Provision as a persistent compute target (Basic)

You can provision a persistent AmlCompute resource by defining just two parameters, thanks to smart defaults. By default, it autoscales from 0 nodes and provisions dedicated VMs to run your job in a container. This is useful when you want to continuously reuse the same target, debug it between jobs, or simply share the resource with other users of your workspace.

* `vm_size`: VM family of the nodes provisioned by AmlCompute. Choose from the results of `supported_vmsizes()` above.
* `max_nodes`: Maximum number of nodes to autoscale to while running a job on AmlCompute.

In [17]:
from azureml.core.compute import ComputeTarget, AmlCompute
from azureml.core.compute_target import ComputeTargetException

# Choose a name for your CPU cluster
cpu_cluster_name = "cpu-cluster"

# Reuse the cluster if it already exists; otherwise create it
try:
    cpu_cluster = ComputeTarget(workspace=ws, name=cpu_cluster_name)
    print('Found existing cluster, use it.')
except ComputeTargetException:
    compute_config = AmlCompute.provisioning_configuration(vm_size='STANDARD_D2_V2',
                                                           max_nodes=4)
    cpu_cluster = ComputeTarget.create(ws, cpu_cluster_name, compute_config)

cpu_cluster.wait_for_completion(show_output=True)

Found existing cluster, use it.
Succeeded
AmlCompute wait for completion finished

Minimum number of nodes requested have been provisioned
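
The outline at the top of this notebook also lists an "Advanced" way to provision a persistent compute target. A possible sketch of that variant is shown below, setting the scaling and priority options explicitly instead of relying on the defaults. The parameter values and the helper name `build_advanced_config` are illustrative assumptions, not settings taken from this notebook, and the cluster name in the usage comment is hypothetical.

```python
def build_advanced_config(vm_size="STANDARD_D2_V2"):
    """Return an AmlCompute provisioning configuration with explicit scaling settings."""
    # Imported lazily so this cell can be defined without the SDK installed.
    from azureml.core.compute import AmlCompute

    return AmlCompute.provisioning_configuration(
        vm_size=vm_size,
        vm_priority="lowpriority",          # cheaper than dedicated, but nodes can be preempted
        min_nodes=0,                        # scale down to zero when idle
        max_nodes=4,                        # upper bound for autoscaling
        idle_seconds_before_scaledown=300,  # wait 5 minutes before releasing idle nodes
    )


# Usage (requires the workspace `ws` created earlier; the cluster name is hypothetical):
# from azureml.core.compute import ComputeTarget
# cluster = ComputeTarget.create(ws, "cpu-cluster-adv", build_advanced_config())
# cluster.wait_for_completion(show_output=True)
```

Low-priority nodes trade reliability for cost, so they suit jobs that can tolerate being restarted.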


### Configure & Run

In [21]:
from azureml.core import ScriptRunConfig
from azureml.core.runconfig import DEFAULT_CPU_IMAGE

src = ScriptRunConfig(source_directory=project_folder, script='tsne.py')

# Set compute target to the one created in previous step
src.run_config.target = cpu_cluster.name

# Set environment
src.run_config.environment = myenv
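
The run above passes no arguments to the script, but `ScriptRunConfig` also accepts an `arguments` list that is forwarded to the script's command line. As a sketch of the receiving side, a training script could parse such arguments with `argparse`. The option names `--perplexity` and `--n-samples` and their defaults are hypothetical; the contents of `tsne.py` are not shown in this notebook.

```python
import argparse


def parse_args(argv=None):
    """Parse the command-line arguments a training script might accept."""
    parser = argparse.ArgumentParser(description="Hypothetical t-SNE script options")
    parser.add_argument("--perplexity", type=float, default=30.0)
    parser.add_argument("--n-samples", type=int, default=1000)
    return parser.parse_args(argv)


# Equivalent to submitting the run with arguments=['--perplexity', '40'].
args = parse_args(["--perplexity", "40"])
print(args.perplexity, args.n_samples)
```

Passing hyperparameters as arguments keeps the script unchanged between runs, so each run record differs only in its recorded arguments.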
 


In [22]:
run = experiment.submit(config=src)
run

| Experiment | Id | Type | Status | Details Page | Docs Page |
| --- | --- | --- | --- | --- | --- |
| train-on-amlcompute | train-on-amlcompute_1591492618_a927a9af | azureml.scriptrun | Starting | Link to Azure Machine Learning studio | Link to Documentation |


Note: if you need to cancel a run, you can follow [these instructions](https://aka.ms/aml-docs-cancel-run).
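
Cancellation can also be automated. The sketch below polls a run's status and cancels it if it has not finished within a time limit. It relies only on the `get_status()` and `cancel()` methods of a run object; the helper `wait_with_timeout`, the timeout values, and the `FakeRun` stand-in used for the demonstration are assumptions made for this example.

```python
import time

# Statuses after which a run will not change any further.
TERMINAL_STATES = {"Completed", "Failed", "Canceled"}


def wait_with_timeout(run, timeout_seconds=600, poll_seconds=10, sleep=time.sleep):
    """Poll run.get_status(); cancel the run if it is still active after the timeout."""
    waited = 0
    status = run.get_status()
    while status not in TERMINAL_STATES:
        if waited >= timeout_seconds:
            run.cancel()
            return run.get_status()
        sleep(poll_seconds)
        waited += poll_seconds
        status = run.get_status()
    return status


class FakeRun:
    """Stand-in for illustration: reports a fixed sequence of statuses."""

    def __init__(self, statuses):
        self._statuses = list(statuses)
        self.cancelled = False

    def get_status(self):
        if self.cancelled:
            return "Canceled"
        # Stay on the last status once the sequence is exhausted.
        return self._statuses.pop(0) if len(self._statuses) > 1 else self._statuses[0]

    def cancel(self):
        self.cancelled = True


def no_sleep(_seconds):
    """Skip real waiting in the demonstration."""


finished = wait_with_timeout(FakeRun(["Starting", "Running", "Completed"]), sleep=no_sleep)
stuck_run = FakeRun(["Running"])
stuck = wait_with_timeout(stuck_run, timeout_seconds=20, poll_seconds=10, sleep=no_sleep)
print(finished, stuck)
```

With a real run, the call would simply be `wait_with_timeout(run)`. Note that a real cancellation may take a moment to take effect, so the status returned right after `cancel()` is not guaranteed to be final.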

In [23]:
%%time
# Shows output of the run on stdout.
run.wait_for_completion(show_output=True)

RunId: train-on-amlcompute_1591492618_a927a9af
Web View: https://ml.azure.com/experiments/train-on-amlcompute/runs/train-on-amlcompute_1591492618_a927a9af?wsid=/subscriptions/ff2e23ae-7d7c-4cbd-99b8-116bb94dca6e/resourcegroups/RG-ITSMLTeam-Dev/workspaces/avadevitsmlsvc

Streaming azureml-logs/55_azureml-execution-tvmps_7d320f823f73f1b45ab17c63d48e0cadfb009af972385874a80333bd8c6e3182_d.txt

2020-06-07T01:17:23Z Starting output-watcher...
2020-06-07T01:17:23Z IsDedicatedCompute == True, won't poll for Low Pri Preemption
66c01970b98cf858f203970bf01de970b14fa11ef1d28e1d3e977fd92fbbed33

Streaming azureml-logs/65_job_prep-tvmps_7d320f823f73f1b45ab17c63d48e0cadfb009af972385874a80333bd8c6e3182_d.txt

Entering job preparation. Current time:2020-06-07T01:17:24.913118
Starting job preparation. Current time:2020-06-07T01:17:25.713097
Extracting the control code.
fetching and extracting the control code on master node.
Retrieving project from snapshot: 7a29a86f-c31d-480d-9653-346fc213ae21
Starting

{'runId': 'train-on-amlcompute_1591492618_a927a9af',
 'target': 'cpu-cluster',
 'status': 'Completed',
 'startTimeUtc': '2020-06-07T01:17:20.753881Z',
 'endTimeUtc': '2020-06-07T01:17:55.502662Z',
 'properties': {'_azureml.ComputeTargetType': 'amlcompute',
  'ContentSnapshotId': '05a3d318-9b8b-4b6b-b797-96488c48700b',
  'azureml.git.repository_uri': 'https://github.com/swanderz/MachineLearningNotebooks.git',
  'mlflow.source.git.repoURL': 'https://github.com/swanderz/MachineLearningNotebooks.git',
  'azureml.git.branch': 'SO_CPR',
  'mlflow.source.git.branch': 'SO_CPR',
  'azureml.git.commit': '3bf76c8fa7e9920c267965b7fbf534ee9cb2f378',
  'mlflow.source.git.commit': '3bf76c8fa7e9920c267965b7fbf534ee9cb2f378',
  'azureml.git.dirty': 'True',
  'ProcessInfoFile': 'azureml-logs/process_info.json',
  'ProcessStatusFile': 'azureml-logs/process_status.json'},
 'inputDatasets': [],
 'runDefinition': {'script': 'tsne.py',
  'useAbsolutePath': False,
  'arguments': [],
  'sourceDirectoryDataStor

In [15]:
run.get_metrics()

{}
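
The empty dictionary is what one would expect when the training script does not log anything. Metrics appear in `get_metrics()` only after the script records them on its run object with `run.log(name, value)`. The sketch below shows the pattern with a simplified stand-in. `log_metrics` and `RecordingRun` are hypothetical helpers, and the metric names and values are made up for illustration.

```python
def log_metrics(run, metrics):
    """Log each name/value pair on the given run object via run.log()."""
    for name, value in metrics.items():
        run.log(name, value)


class RecordingRun:
    """Simplified stand-in for illustration: keeps the latest value per metric name."""

    def __init__(self):
        self._metrics = {}

    def log(self, name, value):
        self._metrics[name] = value

    def get_metrics(self):
        return dict(self._metrics)


demo_run = RecordingRun()
before = demo_run.get_metrics()  # nothing logged yet
log_metrics(demo_run, {"perplexity": 40, "n_samples": 1000})
after = demo_run.get_metrics()
print(before, after)

# Inside a real training script, the run object would come from the SDK:
# from azureml.core import Run
# log_metrics(Run.get_context(), {"perplexity": 40})
```

Logging even one or two values per run makes runs comparable in the studio without opening their logs.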