# BatchAI Training: MNIST

In this sample, we create a cluster for BatchAI training and submit a training job to it. Before you start, set up the following:
 * A Service Principal, created as described [here](https://docs.microsoft.com/en-us/azure/azure-resource-manager/resource-group-authenticate-service-principal-cli)
 * An Azure Storage Account to store the initial data
 * A file share in that storage account containing `ConvNet_MNIST.py` and both data files (the job defined below reads its inputs from a `data` directory inside the mounted share)
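The last item of the list above can also be done from Python. The following is a minimal sketch, assuming `file_service` behaves like the legacy `azure.storage.file.FileService` (its `create_share`, `create_directory` and `create_file_from_path` methods). The helper name `upload_inputs` is hypothetical, and the target directory is passed in by the caller because the notebook does not state it explicitly.

```python
import os


def upload_inputs(file_service, share_name, directory_name, local_paths):
    """Upload local files into one directory of an Azure file share.

    `file_service` is assumed to expose the same methods as the legacy
    azure.storage.file.FileService; any object with these three methods works.
    """
    # Create the share and the directory. In the legacy SDK these calls are
    # expected not to raise when the target already exists (assumption).
    file_service.create_share(share_name)
    file_service.create_directory(share_name, directory_name)

    uploaded = []
    for path in local_paths:
        # Keep only the file name so the layout on the share is flat.
        file_name = os.path.basename(path)
        file_service.create_file_from_path(
            share_name, directory_name, file_name, path)
        uploaded.append(file_name)
    return uploaded
```

With the real SDK, a call could look like `upload_inputs(FileService(storage_account_name, storage_account_key), fileshare, 'data', ['ConvNet_MNIST.py', ...])`, with the two data files added to the list.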

In [1]:
from __future__ import print_function
from datetime import datetime
import os
import sys
import zipfile
from azure.storage.file import FileService
from azure.storage.blob import BlockBlobService
import azure.mgmt.batchai.models as models
import azure.mgmt.batchai as batchai
from azure.common.credentials import ServicePrincipalCredentials

First, let's specify the parameters used throughout the notebook:

In [5]:
tenant = "--place correct value here--"
subscription = "--place correct value here--"
resource_group_name = "batchai"

storage_account_name = "batchaidemo"
storage_account_key = "--place correct value here--"
fileshare = "data"
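Instead of pasting secrets such as the storage account key or the Service Principal secret into the notebook, they can be read from environment variables. This is only a sketch of one possible approach; the helper `require_env` and the variable name in the comment are hypothetical and not part of the original sample.

```python
import os


def require_env(name, environ=None):
    """Return the value of an environment variable, or fail with a clear error.

    `environ` defaults to os.environ; passing a dict makes the helper easy to test.
    """
    environ = os.environ if environ is None else environ
    value = environ.get(name, "").strip()
    if not value:
        raise RuntimeError(
            "Environment variable {0} is not set".format(name))
    return value


# Hypothetical usage (the variable name is an assumption):
# storage_account_key = require_env("BATCHAI_STORAGE_KEY")
```

Failing early with a named variable is usually easier to debug than an authentication error raised later by the service.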

We create a `credentials` object that authenticates all requests with our Service Principal, and then a `client` object to manage BatchAI.

In [3]:
credentials = ServicePrincipalCredentials(client_id="--place correct value here--",
                                          secret="--place correct value here--",
                                          token_uri="https://login.microsoftonline.com/{0}/oauth2/token".format(tenant))
client = batchai.BatchAIManagementClient(
    credentials=credentials,
    subscription_id=subscription,
    base_url="")

Now we get a reference to the resource group where all objects will be placed. If the group does not exist, it is created automatically.

In [4]:
from azure.mgmt.resource import ResourceManagementClient

resource_management_client = ResourceManagementClient(credentials=credentials, subscription_id=subscription)

group = resource_management_client.resource_groups.create_or_update(
        resource_group_name, {'location': 'northeurope'})

## Create cluster

A cluster is a pool of compute resources that accepts jobs. Here we define the configuration of the cluster and create it. Once created, the cluster consumes resources, so you should delete it as soon as you are done.

In [6]:
cluster_name = 'shwarscluster'
relative_mount_point = 'azurefileshare'

parameters = models.ClusterCreateParameters(
    location='northeurope',
    vm_size='STANDARD_NC6',
    user_account_settings=models.UserAccountSettings(
        admin_user_name="shwars",
        admin_user_password="ShwarZ13!"),
    scale_settings=models.ScaleSettings(
        manual=models.ManualScaleSettings(target_node_count=1)
    ),
    node_setup=models.NodeSetup(
        # Mount shared volumes to the host
        mount_volumes=models.MountVolumes(
            azure_file_shares=[
                models.AzureFileShareReference(
                    account_name=storage_account_name,
                    credentials=models.AzureStorageCredentialsInfo(
                        account_key=storage_account_key),
                    azure_file_url='https://{0}.file.core.windows.net/{1}'.format(
                        storage_account_name, fileshare),
                    relative_mount_path=relative_mount_point)],
        ),
    ),
)

client.clusters.create(resource_group_name, cluster_name, parameters).result()

<azure.mgmt.batchai.models.cluster.Cluster at 0x6519198>

You can now see the cluster in the Azure Portal. The step above can also be performed through the Azure Portal. Next, we check the cluster status before submitting jobs to it.

In [10]:
cluster = client.clusters.get(resource_group_name, cluster_name)
print('Cluster state: {0} Target: {1}; Allocated: {2}; Idle: {3}; '
      'Unusable: {4}; Running: {5}; Preparing: {6}; leaving: {7}'.format(
    cluster.allocation_state,
    cluster.scale_settings.manual.target_node_count,
    cluster.current_node_count,
    cluster.node_state_counts.idle_node_count,
    cluster.node_state_counts.unusable_node_count,
    cluster.node_state_counts.running_node_count,
    cluster.node_state_counts.preparing_node_count,
    cluster.node_state_counts.leaving_node_count))

Cluster state: steady Target: 1; Allocated: 1; Idle: 0; Unusable: 0; Running: 0; Preparing: 1; leaving: 0
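In the output above the only node is still preparing, so no node is idle yet. If you prefer to wait until a node is idle before submitting a job, the status check can be repeated in a loop. The sketch below is one way to do this; the function names, the timeout and the polling interval are assumptions, and `get_cluster` is any callable that returns a fresh cluster object, for example `lambda: client.clusters.get(resource_group_name, cluster_name)`.

```python
import time


def idle_nodes(cluster):
    """Number of idle nodes reported by a cluster object."""
    return cluster.node_state_counts.idle_node_count


def wait_for_idle_nodes(get_cluster, needed=1, timeout=1200, interval=15,
                        sleep=time.sleep, clock=time.time):
    """Poll until at least `needed` nodes are idle or `timeout` seconds pass.

    Returns the last cluster object on success and raises RuntimeError
    on timeout. `sleep` and `clock` are injectable for testing.
    """
    deadline = clock() + timeout
    while True:
        cluster = get_cluster()
        if idle_nodes(cluster) >= needed:
            return cluster
        if clock() >= deadline:
            raise RuntimeError(
                "Timed out waiting for {0} idle node(s)".format(needed))
        sleep(interval)
```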


## Create and submit job

A job is a task to be performed on the cluster. In our case, the job is based on a Docker image, so when the job is submitted, the following happens:
 * The job is scheduled on the cluster
 * The chosen VM pulls the Docker image
 * A container is started from the image with the provided command line


In [None]:
job_name = 'trainjob'

parameters = models.JobCreateParameters(
    location='northeurope',
    cluster=models.ResourceId(id=cluster.id),
    # The number of VMs in the cluster to use
    node_count=1,

    # Override the path where the std out and std err files will be written to.
    # In this case we will write these out to an Azure Files share
    std_out_err_path_prefix='$AZ_BATCHAI_MOUNT_ROOT/{0}'.format(relative_mount_point),

    input_directories=[models.InputDirectory(
        id='SAMPLE',
        path='$AZ_BATCHAI_MOUNT_ROOT/{0}/data'.format(relative_mount_point))],

    # Specify directories where files will get written to
    output_directories=[models.OutputDirectory(
        id='MODEL',
        path_prefix='$AZ_BATCHAI_MOUNT_ROOT/{0}'.format(relative_mount_point),
        path_suffix="Models")],

    # Container configuration
    container_settings=models.ContainerSettings(
        image_source_registry=models.ImageSourceRegistry(
            image='microsoft/cntk:2.1-gpu-python3.5-cuda8.0-cudnn6.0')),

    # Toolkit specific settings
    cntk_settings=models.CNTKsettings(
        python_script_file_path='$AZ_BATCHAI_INPUT_SAMPLE/ConvNet_MNIST.py',
        command_line_args='$AZ_BATCHAI_INPUT_SAMPLE $AZ_BATCHAI_OUTPUT_MODEL')
)

# Create the job
client.jobs.create(resource_group_name, job_name, parameters).result()
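Note how the job definition above ties directory ids to environment variables: the input directory with id `SAMPLE` is referenced as `$AZ_BATCHAI_INPUT_SAMPLE`, and the output directory with id `MODEL` as `$AZ_BATCHAI_OUTPUT_MODEL`. The sketch below spells out that naming pattern as it appears in this notebook; the helper names are hypothetical.

```python
def input_dir_variable(directory_id):
    """Environment variable that points at the input directory with this id."""
    return "$AZ_BATCHAI_INPUT_{0}".format(directory_id)


def output_dir_variable(directory_id):
    """Environment variable that points at the output directory with this id."""
    return "$AZ_BATCHAI_OUTPUT_{0}".format(directory_id)


def build_script_settings(script_name, input_id, output_id):
    """Build the script path and command line from directory ids.

    The script is expected to live in the input directory; the input and
    output locations are passed to it as two positional arguments.
    """
    input_var = input_dir_variable(input_id)
    return {
        "python_script_file_path": "{0}/{1}".format(input_var, script_name),
        "command_line_args": "{0} {1}".format(
            input_var, output_dir_variable(output_id)),
    }
```

The resulting dictionary has the same two keys as the toolkit settings in the cell above, so it could be passed as `models.CNTKsettings(**build_script_settings('ConvNet_MNIST.py', 'SAMPLE', 'MODEL'))`. Deriving both strings from the ids keeps them in sync if an id is renamed.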

Now we can monitor job status to check when it is done:

In [None]:
job = client.jobs.get(resource_group_name, job_name)
print('Job state: {0} '.format(job.execution_state.name))
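Rather than re-running the cell above by hand, the state can be polled until the job finishes. This is a rough sketch: the terminal state names `succeeded` and `failed`, the polling interval and the function names are assumptions, and `get_job` is any callable that returns a fresh job object, for example `lambda: client.jobs.get(resource_group_name, job_name)`.

```python
import time

# State names assumed to mean that the job will not change any more.
TERMINAL_STATES = ("succeeded", "failed")


def state_name(job):
    """Return the execution state as a plain string.

    Works both when the state is an enum member (with a .name attribute)
    and when it is already a string.
    """
    state = job.execution_state
    return getattr(state, "name", state)


def wait_for_job(get_job, interval=30, max_polls=240, sleep=time.sleep):
    """Poll the job until it reaches a terminal state.

    Returns the final job object, or raises RuntimeError after
    `max_polls` attempts. `sleep` is injectable for testing.
    """
    for _ in range(max_polls):
        job = get_job()
        if state_name(job) in TERMINAL_STATES:
            return job
        sleep(interval)
    raise RuntimeError(
        "Job did not finish after {0} polls".format(max_polls))
```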

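Here is how we can list the job's output files (in this case, the standard output and error logs) together with their download URLs. The same files are also available on the file share that we provided.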

In [None]:
files = client.jobs.list_output_files(
    resource_group_name, job_name,
    models.JobsListOutputFilesOptions(outputdirectoryid="stdouterr"))

for output_file in files:
    print('file: {0}, download url: {1}'.format(output_file.name, output_file.download_url))
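The download URLs printed above can be used to copy the files to the local machine. The following sketch assumes Python 3 and relies only on the `name` and `download_url` attributes used in the cell above; the function name and the injectable `fetch` parameter are hypothetical.

```python
import os
from urllib.request import urlretrieve


def download_outputs(files, dest_dir, fetch=urlretrieve):
    """Download every listed output file into `dest_dir`.

    `files` is any iterable of objects with `name` and `download_url`
    attributes. Entries without a download URL are skipped.
    Returns the list of local paths that were written.
    """
    os.makedirs(dest_dir, exist_ok=True)
    saved = []
    for item in files:
        if not item.download_url:
            continue
        # Use only the base name so files land directly in dest_dir.
        target = os.path.join(dest_dir, os.path.basename(item.name))
        fetch(item.download_url, target)
        saved.append(target)
    return saved
```

For example, `download_outputs(files, 'job_logs')` would place the logs in a local `job_logs` directory.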

## Clean up

To make sure that unused resources do not keep consuming our Azure subscription, we need to delete the job and the cluster.

In [None]:
client.jobs.delete(resource_group_name, job_name)

In [None]:
client.clusters.delete(resource_group_name, cluster_name)
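The two delete calls can be wrapped so that the cluster is removed even if deleting the job fails. This is a sketch under the assumption that the delete operations return a poller with a `result()` method, as the create operations used earlier in this notebook do; the helper names are hypothetical.

```python
def delete_and_wait(delete_operation, *args):
    """Start a delete operation and wait for it if it returns a poller.

    Assumption: long-running operations return an object with .result().
    If nothing waitable is returned, the call is treated as complete.
    """
    poller = delete_operation(*args)
    if poller is not None and hasattr(poller, "result"):
        poller.result()


def clean_up(client, resource_group_name, job_name, cluster_name):
    """Delete the job first, then the cluster.

    The cluster deletion sits in a finally block so that it is attempted
    even when deleting the job raises an exception.
    """
    try:
        delete_and_wait(client.jobs.delete, resource_group_name, job_name)
    finally:
        delete_and_wait(client.clusters.delete,
                        resource_group_name, cluster_name)
```

Calling `clean_up(client, resource_group_name, job_name, cluster_name)` then replaces the two cells above.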