In [1]:
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = 'last'

%load_ext autoreload
%autoreload 2

In [2]:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
import os

import azureml.core
from azureml.core import Workspace, Experiment, Datastore
from azureml.core.compute import AmlCompute, ComputeTarget
from azureml.train.dnn import PyTorch
from azureml.widgets import RunDetails
from azureml.tensorboard import Tensorboard

# check core SDK version number
print("Azure ML SDK Version: ", azureml.core.VERSION)

Azure ML SDK Version:  1.0.33


# Training models using Azure Machine Learning

In this notebook, instead of running the training script manually on a VM, we use the Azure Machine Learning (AML) Python SDK to run our experiments.

See the official [tutorial](https://docs.microsoft.com/en-us/azure/machine-learning/service/tutorial-train-models-with-aml) covering this setup with a scikit-learn example.

## Connect to AML workspace

Refer to [Create an Azure Machine Learning service workspace](https://docs.microsoft.com/en-us/azure/machine-learning/service/setup-create-workspace) for how to create an AML workspace in the Azure portal, with the Python SDK, or with the Azure CLI.
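
`Workspace.from_config()` in the next cell reads a `config.json` file that identifies the workspace. As a minimal sketch, such a file can be written with the standard library. The three keys are the fields that identify a workspace; the helper name `write_workspace_config` and the placeholder values are hypothetical and only for illustration.

```python
import json
import tempfile
from pathlib import Path


def write_workspace_config(folder, subscription_id, resource_group, workspace_name):
    """Write a config.json holding the three fields that identify a workspace.

    Hypothetical helper for illustration; it is not part of the azureml SDK.
    """
    config = {
        "subscription_id": subscription_id,
        "resource_group": resource_group,
        "workspace_name": workspace_name,
    }
    path = Path(folder) / "config.json"
    path.write_text(json.dumps(config, indent=4))
    return path


# Demonstrate in a temporary folder so no real config.json is overwritten.
# The angle-bracket values are placeholders, not real identifiers.
with tempfile.TemporaryDirectory() as tmp_dir:
    config_path = write_workspace_config(
        tmp_dir, "<subscription-id>", "<resource-group>", "<workspace-name>"
    )
    print(config_path.read_text())
```

To use it for real, write the file into the notebook's folder with your own values before calling `Workspace.from_config()`.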

In [3]:
# load workspace configuration from the config.json file in the current folder.
ws = Workspace.from_config()
print(ws.name, ws.location, ws.resource_group, sep='\t')

If you run your code in unattended mode, i.e., where you can't give a user input, then we recommend to use ServicePrincipalAuthentication or MsiAuthentication.
Please refer to aka.ms/aml-notebook-auth for different authentication mechanisms in azureml-sdk.


siyu	westus2	yasiyu_rg


## Create or attach an existing compute resource

Documentation on [AmlCompute.provisioning_configuration](https://docs.microsoft.com/en-us/python/api/azureml-core/azureml.core.compute.amlcompute%28class%29?view=azure-ml-py#provisioning-configuration-vm-size-----vm-priority--dedicated---min-nodes-0--max-nodes-none--idle-seconds-before-scaledown-none--admin-username-none--admin-user-password-none--admin-user-ssh-key-none--vnet-resourcegroup-name-none--vnet-name-none--subnet-name-none--tags-none--description-none-)

In [4]:
# choose a name for your cluster
compute_name = 'gpu-cluster'
compute_min_nodes = 1
compute_max_nodes = 2
idle_seconds = 120

# credentials for SSH access to individual nodes when debugging; replace these placeholders
admin_username='username'
admin_user_password='password'


vm_size = 'STANDARD_NC6'  # Choose a GPU SKU that is available in your subscription's AML quota (separate from main VM quota) and region

if compute_name in ws.compute_targets:
    compute_target = ws.compute_targets[compute_name]
    if isinstance(compute_target, AmlCompute):
        print('Found compute target and will be using it: ' + compute_name)
else:
    print('Creating a new compute target...')
    provisioning_config = AmlCompute.provisioning_configuration(vm_size=vm_size,
                                                                min_nodes=compute_min_nodes, 
                                                                max_nodes=compute_max_nodes,
                                                                idle_seconds_before_scaledown=idle_seconds,
                                                                admin_username=admin_username,
                                                                admin_user_password=admin_user_password)

    # create the cluster
    compute_target = ComputeTarget.create(ws, compute_name, provisioning_config)
    
    # can poll for a minimum number of nodes and for a specific timeout. 
    # if no min node count is provided it will use the scale settings for the cluster
    compute_target.wait_for_completion(show_output=True, min_node_count=None, timeout_in_minutes=20)
    
    # For a more detailed view of current AmlCompute status, use get_status()
    print(compute_target.get_status().serialize())

Found compute target and will be using it: gpu-cluster


## Connect to datastore

[Documentation](https://docs.microsoft.com/en-us/azure/machine-learning/service/how-to-access-data)

In this demonstration, all training data, preprocessed using the SpaceNet utilities, is stored in a container on Azure Blob Storage. This allows us to mount the container on VMs or on the nodes of AML clusters like the one we connected to above. For an AML compute resource to access your data, the data must be stored in the cloud (in either Blob Storage or a File Share).

In [7]:
storage_account_name = os.environ.get('STORAGE_ACCOUNT_NAME')
storage_account_key = os.environ.get('STORAGE_ACCOUNT_KEY')
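
`os.environ.get` returns `None` when a variable is unset, so a missing credential would likely only surface later, when the datastore is registered. The sketch below fails early instead. `require_env` is a hypothetical helper, not part of the SDK, and the demonstration uses a stand-in dictionary with a placeholder account name rather than the real environment.

```python
import os


def require_env(name, environ=None):
    """Return the value of environment variable `name`, or raise if it is unset or empty.

    `environ` defaults to os.environ; passing a dict makes the helper easy to test.
    """
    environ = os.environ if environ is None else environ
    value = environ.get(name)
    if not value:
        raise RuntimeError(
            "Environment variable {} is not set; export it before starting the notebook.".format(name)
        )
    return value


# Stand-in environment for demonstration only (placeholder value).
demo_environ = {"STORAGE_ACCOUNT_NAME": "<storage-account-name>"}
print(require_env("STORAGE_ACCOUNT_NAME", demo_environ))
```

In the cell above, `require_env('STORAGE_ACCOUNT_NAME')` and `require_env('STORAGE_ACCOUNT_KEY')` could replace the two `os.environ.get` calls.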

In [8]:
input_data_store_name = 'dataprocessed'
output_data_store_name = 'models'

# in my set-up, the input data and output models are in two containers under the same storage account
input_container_name = 'data-processed'
output_container_name = 'models'

input_data_store = None
output_data_store = None
for name, ds in ws.datastores.items():
    if name == input_data_store_name:
        input_data_store = ds
    if name == output_data_store_name:
        output_data_store = ds
        
if input_data_store is None:
    print('Input datastore {} is not in the workspace; registering it...'.format(input_data_store_name))
    input_data_store = Datastore.register_azure_blob_container(workspace=ws, 
                                             datastore_name=input_data_store_name, 
                                             container_name=input_container_name,
                                             account_name=storage_account_name, 
                                             account_key=storage_account_key,
                                             create_if_not_exists=True)

if output_data_store is None:
    print('Output datastore {} is not in the workspace; registering it...'.format(output_data_store_name))
    output_data_store = Datastore.register_azure_blob_container(workspace=ws, 
                                             datastore_name=output_data_store_name, 
                                             container_name=output_container_name,
                                             account_name=storage_account_name, 
                                             account_key=storage_account_key,
                                             create_if_not_exists=True)

print(input_data_store)
print(output_data_store)

<azureml.data.azure_storage_datastore.AzureBlobDatastore object at 0x1166d9fd0>
<azureml.data.azure_storage_datastore.AzureBlobDatastore object at 0x11674f128>


## Create an AML experiment
In each AML workspace we can have multiple experiments, and each experiment can have multiple runs. You can use experiments to organize your project/workflow. 

In [9]:
experiment_name = 'baseline'

exp = Experiment(workspace=ws, name=experiment_name)

## Create the estimator to submit a run

An estimator object is used to submit a run of the experiment. 

For more information on the AML `PyTorch` estimator class, see [the documentation](https://docs.microsoft.com/en-us/azure/machine-learning/service/how-to-train-pytorch).

In [10]:
script_params = {
    '--experiment_name': experiment_name,
    '--data_path_root': input_data_store,
    '--out_dir': output_data_store,
    '--num_epochs': 1
}

pt_est = PyTorch(source_directory='../training',  # this folder gets copied from your local machine to the remote compute
                 script_params=script_params,
                 compute_target=compute_target,
                 entry_script='train_aml.py',  # relative to source_directory
                 pip_packages=['scikit-image', 'tensorflow==1.9.0'],  # there's also a conda_packages option
                 use_gpu=True)
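
Each key in `script_params` is passed to the entry script as a command-line argument. `train_aml.py` is not shown in this notebook, so the following is only a sketch of how such a script might read those arguments with `argparse`. The function name `parse_args`, the default value, the example paths, and the assumption that the two datastore arguments arrive as filesystem paths on the compute node are all hypothetical.

```python
import argparse


def parse_args(argv=None):
    """Parse the arguments named in script_params (hypothetical entry-script code)."""
    parser = argparse.ArgumentParser(description="Sketch of entry-script argument parsing")
    parser.add_argument("--experiment_name", type=str, required=True,
                        help="name used to label outputs of this run")
    parser.add_argument("--data_path_root", type=str, required=True,
                        help="assumed: path where the input datastore is available")
    parser.add_argument("--out_dir", type=str, required=True,
                        help="assumed: path where the output datastore is available")
    parser.add_argument("--num_epochs", type=int, default=1,
                        help="number of training epochs")
    return parser.parse_args(argv)


# Example invocation with made-up paths, mirroring the keys in script_params.
args = parse_args([
    "--experiment_name", "baseline",
    "--data_path_root", "/tmp/example/data-processed",
    "--out_dir", "/tmp/example/models",
    "--num_epochs", "1",
])
print(args.experiment_name, args.data_path_root, args.out_dir, args.num_epochs)
```

Passing an explicit list to `parse_args` keeps the sketch runnable inside a notebook; in a real entry script, calling `parse_args()` with no argument reads `sys.argv`.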

In [11]:
# submit the PyTorch job
run = exp.submit(pt_est)

To read more on what happens when you submit a job, see [Monitor a remote run](https://docs.microsoft.com/en-us/azure/machine-learning/service/tutorial-train-models-with-aml#monitor-a-remote-run).

In [16]:
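# Without parentheses, this displays the bound method's repr, which includes a summary of the run;
# call run.get_details() to get the full details dictionary instead.
run.get_details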

<bound method Run.get_details of Run(Experiment: baseline,
Id: baseline_1557850469_10634baa,
Type: azureml.scriptrun,
Status: Queued)>
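
The run above is still `Queued`. The SDK offers `run.wait_for_completion(show_output=True)` to block until it finishes. As an alternative, here is a minimal polling sketch that records each status change. It accepts any zero-argument callable, so `run.get_status` can be passed directly. The helper name `wait_for_terminal_status` is hypothetical, and the set of terminal states is an assumption about the status strings the service reports.

```python
import time

# Assumed terminal status strings; adjust if your SDK version reports others.
TERMINAL_STATES = {"Completed", "Failed", "Canceled"}


def wait_for_terminal_status(get_status, poll_seconds=30.0, max_polls=None, sleep=time.sleep):
    """Call get_status() until it returns a terminal state.

    Returns the distinct statuses observed, in order. Raises TimeoutError
    if max_polls is given and reached before a terminal state is seen.
    """
    seen = []
    polls = 0
    while True:
        status = get_status()
        if not seen or seen[-1] != status:
            seen.append(status)  # record only changes, not every poll
        if status in TERMINAL_STATES:
            return seen
        polls += 1
        if max_polls is not None and polls >= max_polls:
            raise TimeoutError("No terminal status after {} polls".format(polls))
        sleep(poll_seconds)


# Demonstration with a scripted status sequence instead of a real run.
scripted = iter(["Queued", "Queued", "Running", "Completed"])
history = wait_for_terminal_status(lambda: next(scripted), sleep=lambda seconds: None)
print(history)  # ['Queued', 'Running', 'Completed']
```

With a real run, the call would be `wait_for_terminal_status(run.get_status)`.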

In [None]:
# RunDetails(run).show()

## Start TensorBoard

In [17]:
# The Tensorboard constructor takes a list of runs
tb = Tensorboard([run])

# If successful, start() returns a string with the URI of the instance.
tb.start()

http://MacBook-Pro-6.local:6006


'http://MacBook-Pro-6.local:6006'

In [None]:
# when done, call the stop() method of the Tensorboard object, or it will stay running even after your job completes.
# tb.stop()
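
Because the TensorBoard instance keeps running until `stop()` is called, one option is to wrap the `start()`/`stop()` pair in a context manager so that `stop()` runs even if an error occurs in between. This is a sketch only: `running` and `FakeServer` are hypothetical names, and `FakeServer` stands in for the `Tensorboard` object so the example runs without the SDK.

```python
from contextlib import contextmanager


@contextmanager
def running(server):
    """Start `server`, yield whatever start() returns, and always call stop() on exit."""
    url = server.start()
    try:
        yield url
    finally:
        server.stop()


class FakeServer:
    """Stand-in with the same start()/stop() shape as the Tensorboard object."""

    def __init__(self):
        self.stopped = False

    def start(self):
        return "http://localhost:6006"

    def stop(self):
        self.stopped = True


demo = FakeServer()
with running(demo) as url:
    print("TensorBoard would be available at", url)
print("stopped:", demo.stopped)
```

With the real object, the equivalent would be `with running(Tensorboard([run])) as url: ...`, keeping the block open for as long as you want to browse the dashboard.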