Copyright (c) Microsoft Corporation. All rights reserved.

Licensed under the MIT License.

# Distributed CNTK using custom docker images
In this tutorial, you will train a CNTK model on the [MNIST](http://yann.lecun.com/exdb/mnist/) dataset using a custom docker image and distributed training.

## Prerequisites
* Understand the [architecture and terms](https://docs.microsoft.com/azure/machine-learning/service/concept-azure-machine-learning-architecture) introduced by Azure Machine Learning
* Go through the [configuration notebook](../../../configuration.ipynb) to:
    * install the AML SDK
    * create a workspace and its configuration file (`config.json`)

In [1]:
# Check core SDK version number
import azureml.core

print("SDK version:", azureml.core.VERSION)

SDK version: 1.0.10


## Diagnostics
Opt in to diagnostics collection to help improve the experience, quality, and security of future releases.

In [2]:
from azureml.telemetry import set_diagnostics_collection

set_diagnostics_collection(send_diagnostics=True)

Turning diagnostics collection on. 


## Initialize workspace

Initialize a [Workspace](https://docs.microsoft.com/azure/machine-learning/service/concept-azure-machine-learning-architecture#workspace) object from the existing workspace you created in the Prerequisites step. `Workspace.from_config()` creates a workspace object from the details stored in `config.json`.

In [4]:
from azureml.core.workspace import Workspace

ws = Workspace.from_config()
print('Workspace name: ' + ws.name,
      'Azure region: ' + ws.location, 
      'Subscription id: ' + ws.subscription_id, 
      'Resource group: ' + ws.resource_group, sep='\n')

Falling back to use azure cli credentials. This fall back to use azure cli credentials will be removed in the next release. 
Make sure your code doesn't require 'az login' to have happened before using azureml-sdk, except the case when you are specifying AzureCliAuthentication in azureml-sdk.


Found the config file in: C:\Users\adminye\notebooks\AzureML\how-to-use-azureml\training-with-deep-learning\config.json
Performing interactive authentication. Please follow the instructions on the terminal.


Note, we have launched a browser for you to login. For old experience with device code, use "az login --use-device-code"
You have logged in. Now let us find all the subscriptions to which you have access...
Failed to authenticate '{'additional_properties': {}, 'id': '/tenants/89359cf4-9e60-4099-80c4-775a0cfe27a7', 'tenant_id': '89359cf4-9e60-4099-80c4-775a0cfe27a7'}' due to error 'Get Token request returned http error: 400 and server response: {"error":"interaction_required","error_description":"AADSTS50076: Due to a configuration change made by your administrator, or because you moved to a new location, you must use multi-factor authentication to access '797f4846-ba00-4fd7-ba43-dac1f8f63013'.\r\nTrace ID: 9a877ca2-681a-4d02-98e9-6d9ea5650a01\r\nCorrelation ID: 00e9d1b1-beea-4e85-adfb-877ebce51f4b\r\nTimestamp: 2019-08-08 18:06:53Z","error_codes":[50076],"timestamp":"2019-08-08 18:06:53Z","trace_id":"9a877ca2-681a-4d02-98e9-6d9ea5650a01","correlation_id":"00e9d1b1-beea-4e85-adfb-877ebc

Interactive authentication successfully completed.
Workspace name: deepsatelliteye
Azure region: southcentralus
Subscription id: 95f31ea2-0e41-4d66-a5db-9ef0449ad928
Resource group: deepsatelliteye
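
For reference, `Workspace.from_config()` reads a small JSON file. The sketch below uses only the standard library to write and reload a file with the three keys the configuration notebook typically stores (`subscription_id`, `resource_group`, `workspace_name`). The placeholder values and the helper name `load_workspace_config` are illustrative assumptions, not part of the SDK.

```python
import json
import tempfile
from pathlib import Path

# Keys that a workspace config.json is expected to contain
REQUIRED_KEYS = ("subscription_id", "resource_group", "workspace_name")


def load_workspace_config(path):
    """Load a workspace config file and check that the expected keys exist."""
    with open(path, encoding="utf-8") as f:
        config = json.load(f)
    missing = [key for key in REQUIRED_KEYS if key not in config]
    if missing:
        raise KeyError(f"config.json is missing keys: {missing}")
    return config


# Placeholder values for illustration only
example = {
    "subscription_id": "<your-subscription-id>",
    "resource_group": "<your-resource-group>",
    "workspace_name": "<your-workspace-name>",
}

with tempfile.TemporaryDirectory() as tmp:
    config_path = Path(tmp) / "config.json"
    config_path.write_text(json.dumps(example, indent=2), encoding="utf-8")
    loaded = load_workspace_config(config_path)

print(loaded["workspace_name"])
```

A check like this can help when `from_config()` fails because the file was edited by hand and a key was dropped.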


## Create or Attach existing AmlCompute
You will need to create a [compute target](https://docs.microsoft.com/azure/machine-learning/service/concept-azure-machine-learning-architecture#compute-target) for training your model. In this tutorial, you create `AmlCompute` as your training compute resource.

**Creation of AmlCompute takes approximately 5 minutes.** If an AmlCompute cluster with that name already exists in your workspace, this code skips the creation process.

As with other Azure services, there are limits on certain resources (e.g. AmlCompute) associated with the Azure Machine Learning service. Please read [this article](https://docs.microsoft.com/en-us/azure/machine-learning/service/how-to-manage-quotas) on the default limits and how to request more quota.

In [5]:
from azureml.core.compute import ComputeTarget, AmlCompute
from azureml.core.compute_target import ComputeTargetException

# choose a name for your cluster
cluster_name = "gpuclusterye"

try:
    compute_target = ComputeTarget(workspace=ws, name=cluster_name)
    print('Found existing compute target.')
except ComputeTargetException:
    print('Creating a new compute target...')
    compute_config = AmlCompute.provisioning_configuration(vm_size='STANDARD_NC6',
                                                           max_nodes=4)

    # create the cluster
    compute_target = ComputeTarget.create(ws, cluster_name, compute_config)

    compute_target.wait_for_completion(show_output=True)

# use get_status() to get a detailed status for the current AmlCompute
print(compute_target.get_status().serialize())

Creating a new compute target...
Creating
Succeeded
AmlCompute wait for completion finished
Minimum number of nodes requested have been provisioned
{'allocationState': 'Steady', 'allocationStateTransitionTime': '2019-08-08T18:10:49.621000+00:00', 'creationTime': '2019-08-08T18:08:13.470480+00:00', 'currentNodeCount': 0, 'errors': None, 'modifiedTime': '2019-08-08T18:10:59.326246+00:00', 'nodeStateCounts': {'idleNodeCount': 0, 'leavingNodeCount': 0, 'preemptedNodeCount': 0, 'preparingNodeCount': 0, 'runningNodeCount': 0, 'unusableNodeCount': 0}, 'provisioningState': 'Succeeded', 'provisioningStateTransitionTime': None, 'scaleSettings': {'minNodeCount': 0, 'maxNodeCount': 4, 'nodeIdleTimeBeforeScaleDown': 'PT120S'}, 'targetNodeCount': 0, 'vmPriority': 'Dedicated', 'vmSize': 'STANDARD_NC6'}
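
The serialized status is a plain dictionary, so it can be inspected with ordinary Python. The sketch below copies a subset of the fields printed above and converts the idle timeout, which appears to be an ISO 8601 duration (`PT120S`), into seconds. The helper `iso8601_duration_to_seconds` is a hypothetical utility written for this example and only handles hours, minutes, and seconds.

```python
import re


def iso8601_duration_to_seconds(duration):
    """Convert a simple ISO 8601 time duration such as 'PT120S' to seconds."""
    match = re.fullmatch(r"PT(?:(\d+)H)?(?:(\d+)M)?(?:(\d+)S)?", duration)
    if match is None or not any(match.groups()):
        raise ValueError(f"Unsupported duration: {duration!r}")
    hours, minutes, seconds = (int(g) if g else 0 for g in match.groups())
    return hours * 3600 + minutes * 60 + seconds


# Subset of the status dictionary printed above
status = {
    "currentNodeCount": 0,
    "scaleSettings": {
        "minNodeCount": 0,
        "maxNodeCount": 4,
        "nodeIdleTimeBeforeScaleDown": "PT120S",
    },
    "vmSize": "STANDARD_NC6",
}

scale = status["scaleSettings"]
idle_seconds = iso8601_duration_to_seconds(scale["nodeIdleTimeBeforeScaleDown"])
print(f"{status['vmSize']}: {scale['minNodeCount']}-{scale['maxNodeCount']} nodes, "
      f"scale down after {idle_seconds}s idle")
```

Read this way, the output shows a cluster that currently holds zero nodes, can grow to four, and releases idle nodes after two minutes.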


## Upload training data
For this tutorial, we will be using the MNIST dataset.

First, let's download the dataset. We've included the `install_mnist.py` script to download the data and convert it to a CNTK-supported format. Our data files will get written to a directory named `'mnist'`.

In [None]:
#import install_mnist

#install_mnist.main('mnist')

To make the data accessible for remote training, you will need to upload the data from your local machine to the cloud. AML provides a convenient way to do so via a [Datastore](https://docs.microsoft.com/azure/machine-learning/service/how-to-access-data). The datastore provides a mechanism for you to upload/download data, and interact with it from your remote compute targets. 

Each workspace is associated with a default datastore. In this tutorial, we will upload the training data to this default datastore, which we will then mount on the remote compute for training in the next section.

In [6]:
ds = ws.get_default_datastore()
print(ds.datastore_type, ds.account_name, ds.container_name)

AzureBlob deepsatellitey0943707353 azureml-blobstore-62b14c36-d3d2-4ca1-a937-ab34bd32ba7d


The following code will upload the training data to the path `./training_data` on the default datastore.

In [7]:
src_dir='C:/Users/adminye/Data/landcovertutorial/training_data'
ds.upload(src_dir=src_dir, target_path='./training_data')

Uploading C:/Users/adminye/Data/landcovertutorial/training_data\B10_LandCover.tif
Uploading C:/Users/adminye/Data/landcovertutorial/training_data\B10_NAIP.tif
Uploading C:/Users/adminye/Data/landcovertutorial/training_data\B11_LandCover.tif
Uploading C:/Users/adminye/Data/landcovertutorial/training_data\B11_NAIP.tif
Uploading C:/Users/adminye/Data/landcovertutorial/training_data\B12_LandCover.tif
Uploading C:/Users/adminye/Data/landcovertutorial/training_data\B12_NAIP.tif
Uploading C:/Users/adminye/Data/landcovertutorial/training_data\B13_LandCover.tif
Uploading C:/Users/adminye/Data/landcovertutorial/training_data\B13_NAIP.tif
Uploading C:/Users/adminye/Data/landcovertutorial/training_data\B14_LandCover.tif
Uploading C:/Users/adminye/Data/landcovertutorial/training_data\B14_NAIP.tif
Uploading C:/Users/adminye/Data/landcovertutorial/training_data\B15_LandCover.tif
Uploading C:/Users/adminye/Data/landcovertutorial/training_data\B15_NAIP.tif
Uploaded C:/Users/adminye/Data/landcovertutori

$AZUREML_DATAREFERENCE_566aaf08d65d4846ad7f41533cb42757

Now let's get a reference to the path on the datastore with the training data. We can do so using the `path` method. In the next section, we can then pass this reference to our training script's `--data_dir` argument. 

In [8]:
path_on_datastore = 'training_data'
ds_data = ds.path(path_on_datastore)
print(ds_data)

$AZUREML_DATAREFERENCE_dbd70bb2e55d48419e67a84cee337758
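
On the training side, the script sees the data reference as an ordinary directory path passed on the command line. The following is a minimal sketch of how an entry script might read `--data_dir` and list the `.tif` files in it. It is not the actual `train_distributed.py`; the function names are hypothetical, and a temporary directory stands in for the mounted datastore path.

```python
import argparse
import os
import tempfile


def parse_args(argv=None):
    """Parse the command-line arguments the estimator passes to the script."""
    parser = argparse.ArgumentParser(description="Hypothetical training entry point")
    parser.add_argument("--data_dir", required=True,
                        help="Directory where the datastore path is mounted")
    return parser.parse_args(argv)


def list_training_files(data_dir, suffix=".tif"):
    """Return the sorted file names in data_dir that end with `suffix`."""
    return sorted(name for name in os.listdir(data_dir) if name.endswith(suffix))


# A temporary directory stands in for the mounted datastore path
with tempfile.TemporaryDirectory() as tmp:
    for name in ("B10_NAIP.tif", "B10_LandCover.tif", "readme.txt"):
        open(os.path.join(tmp, name), "w").close()
    args = parse_args(["--data_dir", tmp])
    files = list_training_files(args.data_dir)

print(files)
```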


## Train model on the remote compute
Now that we have the cluster ready to go, let's run our distributed training job.

### Create a project directory
Create a directory that will contain all the necessary code from your local machine that you will need access to on the remote resource. This includes the training script, and any additional files your training script depends on.

In [9]:
import os

project_folder = './cntk-distr'
os.makedirs(project_folder, exist_ok=True)

Copy the training script `train_distributed.py` into this project directory.

In [14]:
import shutil

# Copy the training script into the project directory.
# Any additional files the script depends on must be copied here as well.
training_script = 'C:/Users/adminye/Data/landcovertutorial/scripts/train_distributed.py'
shutil.copy(training_script, project_folder)

'./cntk-distr\\train_distributed.py'
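
If the training script imports helper modules, they need to be copied into the project directory too. One way to do that, sketched here with only the standard library, is to copy every `.py` file from the scripts directory. The helper `copy_scripts` and the file names in the demonstration are hypothetical.

```python
import shutil
import tempfile
from pathlib import Path


def copy_scripts(src_dir, dst_dir, pattern="*.py"):
    """Copy every file matching `pattern` from src_dir into dst_dir."""
    dst = Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    copied = []
    for src_file in sorted(Path(src_dir).glob(pattern)):
        shutil.copy(src_file, dst / src_file.name)
        copied.append(src_file.name)
    return copied


# Demonstration with temporary directories and hypothetical file names
with tempfile.TemporaryDirectory() as tmp:
    scripts = Path(tmp) / "scripts"
    scripts.mkdir()
    for name in ("train_distributed.py", "helpers.py", "notes.txt"):
        (scripts / name).write_text("# placeholder\n", encoding="utf-8")
    copied = copy_scripts(scripts, Path(tmp) / "cntk-distr")

print(copied)
```

Only files matching the pattern are copied, so notes and data files that happen to sit next to the scripts stay out of the project directory.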

### Create an experiment
Create an [experiment](https://docs.microsoft.com/azure/machine-learning/service/concept-azure-machine-learning-architecture#experiment) to track all the runs in your workspace for this distributed CNTK tutorial. 

In [16]:
from azureml.core import Experiment

experiment_name = 'cntk-distr-land-use'
experiment = Experiment(ws, name=experiment_name)

### Create an Estimator
The AML SDK's base Estimator enables you to easily submit custom scripts for both single-node and distributed runs. You should use this generic estimator for training code that relies on frameworks such as sklearn or CNTK, which don't have corresponding custom estimators. For more information on using the generic estimator, see [this article](https://docs.microsoft.com/azure/machine-learning/service/how-to-train-ml-models).

In [19]:
from azureml.train.estimator import Estimator

script_params = {
    '--num_epochs': 1,
    '--data_dir': ds_data.as_mount(),
    '--output_dir': './outputs'
}

estimator = Estimator(source_directory=project_folder,
                      compute_target=compute_target,
                      entry_script='train_distributed.py',
                      script_params=script_params,
                      node_count=2,
                      process_count_per_node=1,
                      distributed_backend='mpi',
                      pip_packages=['cntk-gpu==2.6'],
                      custom_docker_base_image='microsoft/mmlspark:gpu-0.12',
                      use_gpu=True)

We would like to train our model using a [pre-built Docker container](https://hub.docker.com/r/microsoft/mmlspark/). To do so, pass the name of the Docker image to the `custom_docker_base_image` argument. This argument accepts only images available in public Docker repositories such as Docker Hub. To use an image from a private Docker repository, use the constructor's `environment_definition` parameter instead. Finally, we pass `cntk-gpu==2.6` to `pip_packages` to install CNTK 2.6 on our custom image.

The above code specifies that we will run our training script on `2` nodes, with one worker per node. In order to run distributed CNTK, which uses MPI, you must provide the argument `distributed_backend='mpi'`.
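
With 2 nodes and one process per node, the job runs 2 MPI workers in total. As a rough illustration of what that means for the training script, the sketch below reads a worker's rank and the world size from the environment variables that Open MPI sets (`OMPI_COMM_WORLD_RANK`, `OMPI_COMM_WORLD_SIZE`) and splits a file list between workers. It assumes an Open MPI launcher and is independent of CNTK, whose own distributed training API normally handles this. The helper names and the file list are hypothetical.

```python
import os


def get_worker_info(env=None):
    """Read this process's rank and the world size from Open MPI's environment variables."""
    env = os.environ if env is None else env
    rank = int(env.get("OMPI_COMM_WORLD_RANK", 0))
    size = int(env.get("OMPI_COMM_WORLD_SIZE", 1))
    return rank, size


def shard(items, rank, size):
    """Give each worker every `size`-th item, starting at its own rank."""
    return items[rank::size]


# Simulate the two workers of this tutorial with explicit environments
files = ["B10_NAIP.tif", "B11_NAIP.tif", "B12_NAIP.tif", "B13_NAIP.tif"]
shards = []
for r in range(2):
    rank, size = get_worker_info({"OMPI_COMM_WORLD_RANK": str(r),
                                  "OMPI_COMM_WORLD_SIZE": "2"})
    shards.append(shard(files, rank, size))
    print(rank, shards[-1])
```

When the variables are absent, for example in a local single-process run, the helper falls back to rank 0 of 1, so the same script still works outside the cluster.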

### Submit job
Run your experiment by submitting your estimator object. Note that this call is asynchronous.

In [20]:
run = experiment.submit(estimator)
print(run)

Run(Experiment: cntk-distr-land-use,
Id: cntk-distr-land-use_1565288969_8119448f,
Type: azureml.scriptrun,
Status: Starting)
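
Because submission is asynchronous, the status printed above is only a snapshot. A simple way to follow a run without the widget is to poll its status until it reaches a terminal state. The sketch below shows that loop against a simulated status sequence. The set of terminal state names and the helper `wait_for_terminal_state` are assumptions made for this example.

```python
import time

# Assumed names of the states in which a run no longer changes
TERMINAL_STATES = {"Completed", "Failed", "Canceled"}


def wait_for_terminal_state(get_status, poll_seconds=10.0,
                            timeout_seconds=3600.0, sleep=time.sleep):
    """Call get_status() until it returns a terminal state or the timeout expires."""
    waited = 0.0
    while True:
        status = get_status()
        if status in TERMINAL_STATES:
            return status
        if waited >= timeout_seconds:
            raise TimeoutError(f"Run still in state {status!r} after {waited:.0f}s")
        sleep(poll_seconds)
        waited += poll_seconds


# Simulated status sequence standing in for a real run
fake_statuses = iter(["Starting", "Running", "Running", "Completed"])
final = wait_for_terminal_state(lambda: next(fake_statuses), poll_seconds=0.0)
print(final)
```

With a real run, passing `run.get_status` as `get_status` should give the same behavior, although `run.wait_for_completion()` is usually the simpler choice.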


### Monitor your run
You can monitor the progress of the run with a Jupyter widget. Like the run submission, the widget is asynchronous and provides live updates every 10-15 seconds until the job completes.

In [21]:
from azureml.widgets import RunDetails

RunDetails(run).show()


Alternatively, you can block until the script has completed training before running more code.

In [None]:
run.wait_for_completion(show_output=True)

RunId: cntk-distr-land-use_1565288969_8119448f

Streaming azureml-logs/20_image_build_log.txt

2019/08/08 18:29:42 Downloading source code...
2019/08/08 18:29:44 Finished downloading source code
2019/08/08 18:29:44 Using acb_vol_0b546b88-2e04-4ee0-bbf1-228530bb3d3f as the home volume
2019/08/08 18:29:44 Creating Docker network: acb_default_network, driver: 'bridge'
2019/08/08 18:29:45 Successfully set up Docker network: acb_default_network
2019/08/08 18:29:45 Setting up Docker configuration...
2019/08/08 18:29:46 Successfully set up Docker configuration
2019/08/08 18:29:46 Logging in to registry: deepsatellite7cd7a82.azurecr.io
2019/08/08 18:29:47 Successfully logged into deepsatellite7cd7a82.azurecr.io
2019/08/08 18:29:47 Executing step ID: acb_step_0. Timeout(sec): 5400, Working directory: '', Network: 'acb_default_network'
2019/08/08 18:29:47 Scanning for dependencies...
2019/08/08 18:29:48 Successfully scanned dependencies
2019/08/08 18:29:48 Launching container with name: acb_step

Collecting flask==1.0.3 (from azureml-defaults->-r /azureml-environment-setup/condaenv.98rm6mdk.requirements.txt (line 1))
  Downloading https://files.pythonhosted.org/packages/9a/74/670ae9737d14114753b8c8fdf2e8bd212a05d3b361ab15b44937dfd40985/Flask-1.0.3-py2.py3-none-any.whl (92kB)
Collecting azureml-model-management-sdk==1.0.1b6.post1 (from azureml-defaults->-r /azureml-environment-setup/condaenv.98rm6mdk.requirements.txt (line 1))
  Downloading https://files.pythonhosted.org/packages/4e/53/9004a1e7d6d4e796abc4bcc8286bfc2a32739c5fbac3859ca7429a228897/azureml_model_management_sdk-1.0.1b6.post1-py2.py3-none-any.whl (130kB)
Collecting configparser==3.7.4 (from azureml-defaults->-r /azureml-environment-setup/condaenv.98rm6mdk.requirements.txt (line 1))
  Downloading https://files.pythonhosted.org/packages/ba/05/6c96328e92e625fc31445d24d75a2c92ef9ba34fc5b037fe69693c362a0d/configparser-3.7.4-py2.py3-none-any.whl
Collecting gunicorn==19.9.0 (from azureml-defaults->-r /azureml-environment-se

Collecting azure-common>=1.1.12 (from azureml-core==1.0.55.*->azureml-defaults->-r /azureml-environment-setup/condaenv.98rm6mdk.requirements.txt (line 1))
  Downloading https://files.pythonhosted.org/packages/00/55/a703923c12cd3172d5c007beda0c1a34342a17a6a72779f8a7c269af0cd6/azure_common-1.1.23-py2.py3-none-any.whl
Collecting azure-mgmt-keyvault>=0.40.0 (from azureml-core==1.0.55.*->azureml-defaults->-r /azureml-environment-setup/condaenv.98rm6mdk.requirements.txt (line 1))
  Downloading https://files.pythonhosted.org/packages/b3/d1/9fed0a3a3b43d0b1ad59599b5c836ccc4cf117e26458075385bafe79575b/azure_mgmt_keyvault-2.0.0-py2.py3-none-any.whl (80kB)
Collecting contextlib2 (from azureml-core==1.0.55.*->azureml-defaults->-r /azureml-environment-setup/condaenv.98rm6mdk.requirements.txt (line 1))
  Downloading https://files.pythonhosted.org/packages/a2/71/8273a7eeed0aff6a854237ab5453bc9aa67deb49df4832801c21f0ff3782/contextlib2-0.5.5-py2.py3-none-any.whl
Collecting msrestazure>=0.4.33 (from azu

Successfully built json-logging-py dill liac-arff pathspec pycparser
Installing collected packages: MarkupSafe, Jinja2, itsdangerous, Werkzeug, click, flask, dill, numpy, pytz, six, python-dateutil, pandas, liac-arff, pycparser, cffi, asn1crypto, cryptography, PyJWT, urllib3, chardet, idna, requests, adal, azureml-model-management-sdk, configparser, gunicorn, json-logging-py, applicationinsights, oauthlib, requests-oauthlib, isodate, msrest, jeepney, SecretStorage, azure-common, msrestazure, azure-mgmt-authorization, jmespath, pyopenssl, pyasn1, ndg-httpsclient, jsonpickle, azure-mgmt-containerregistry, ruamel.yaml, azure-mgmt-keyvault, contextlib2, azure-graphrbac, backports.weakref, backports.tempfile, azure-mgmt-resource, azure-mgmt-storage, websocket-client, docker, pathspec, azureml-core, azureml-defaults, scipy, cntk-gpu
Successfully installed Jinja2-2.10.1 MarkupSafe-1.1.1 PyJWT-1.7.1 SecretStorage-3.1.1 Werkzeug-0.15.5 adal-1.2.2 applicationinsights-0.11.9 asn1crypto-0.24.0 azu

19d5748c63e7: Pushed
407bb2bd3356: Pushed
db584c622b50: Pushed
52a7ea2bb533: Pushed
52f389ea437e: Pushed
88888b9b1b5b: Pushed
a94e0d5a7c40: Pushed
7028b03ecbb7: Pushed
54822109680c: Pushed
ae84bde7eb3c: Pushed
latest: digest: sha256:c733fb16d5008c9440648f9cb535018fcc8432a1e932982ff0ca3293bd609f58 size: 5139
2019/08/08 18:38:41 Successfully pushed image: deepsatellite7cd7a82.azurecr.io/azureml/azureml_5d1827098459fe91d7f93288491b208b:latest
2019/08/08 18:38:41 Step ID: acb_step_0 marked as successful (elapsed time in seconds: 320.454910)
2019/08/08 18:38:41 Populating digests for step ID: acb_step_0...
2019/08/08 18:38:43 Successfully populated digests for step ID: acb_step_0
2019/08/08 18:38:43 Step ID: acb_step_1 marked as successful (elapsed time in seconds: 213.971258)
2019/08/08 18:38:43 The following dependencies were found:
2019/08/08 18:38:43 
- image:
    registry: deepsatellite7cd7a82.azurecr.io
    repository: azureml/azureml_5d1827098459fe91d7f93288491b208b
    tag: latest
 