# Exercise04 : Train on Remote GPU Virtual Machine

Now we run our previous sample (see "[Exercise03 : Just Train in Your Working Machine](./exercise03_train_simple.ipynb)") on a remote virtual machine with a GPU.

> Note : If you don't have GPU quota, you can also run this example on CPU.

*back to [index](https://github.com/tsmatz/azureml-tutorial/)*

## Save your training script as file (train.py)

Create the ```script``` directory.

In [1]:
import os
script_folder = './script'
os.makedirs(script_folder, exist_ok=True)

By adding the following ```%%writefile``` magic at the beginning of the source code from "[Exercise03 : Just Train in Your Working Machine](./exercise03_train_simple.ipynb)", the code is saved as ```./script/train.py```.

In [2]:
%%writefile script/train.py
import os
import argparse
import tensorflow as tf

# device test
print("##### List of available GPU #####")
print(tf.config.list_physical_devices("GPU"))

# parse arguments
parser = argparse.ArgumentParser()
parser.add_argument(
    "--data_folder",
    type=str,
    default="./data",
    help="Folder path for input data")
parser.add_argument(
    "--model_folder",
    type=str,
    default="./outputs",  # AML experiments outputs folder
    help="Folder path for model output")
parser.add_argument(
    "--learning_rate",
    type=float,
    default=0.001,
    help="Learning Rate")
parser.add_argument(
    "--first_layer",
    type=int,
    default=128,
    help="Neuron number for the first hidden layer")
parser.add_argument(
    "--second_layer",
    type=int,
    default=64,
    help="Neuron number for the second hidden layer")
parser.add_argument(
    "--epochs_num",
    type=int,
    default=6,
    help="Number of epochs")
FLAGS, unparsed = parser.parse_known_args()

# build model
model = tf.keras.models.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(FLAGS.first_layer, activation="relu"),
    tf.keras.layers.Dense(FLAGS.second_layer, activation="relu"),
    tf.keras.layers.Dense(10)
])
model.compile(
    optimizer=tf.keras.optimizers.Adam(FLAGS.learning_rate),
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=[tf.keras.metrics.SparseCategoricalAccuracy()],
)

# run training
train_data_path = os.path.join(FLAGS.data_folder, "train")
train_data = tf.data.experimental.load(train_data_path)
model.fit(
    train_data.shuffle(1000).batch(128).prefetch(tf.data.AUTOTUNE),
    epochs=FLAGS.epochs_num
)

# save model and variables
model_path = os.path.join(FLAGS.model_folder, "mnist_tf_model")
model.save(model_path)
print("current working directory : ", os.getcwd())
print("model folder : ", model_path)

Writing script/train.py
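
The script reads its settings with ```parse_known_args()```, so any argument it does not define is collected separately instead of raising an error. The following is a minimal sketch of that behavior, reduced to two of the arguments above; ```--extra_flag``` and the path ```/mnt/mnist``` are hypothetical values used only for illustration.

```python
import argparse

# Same pattern as train.py, reduced to two of its arguments.
parser = argparse.ArgumentParser()
parser.add_argument("--data_folder", type=str, default="./data")
parser.add_argument("--epochs_num", type=int, default=6)

# "--extra_flag" is a hypothetical argument that the script does not define.
argv = ["--data_folder", "/mnt/mnist", "--extra_flag", "1"]
FLAGS, unparsed = parser.parse_known_args(argv)

print(FLAGS.data_folder)  # value passed on the command line
print(FLAGS.epochs_num)   # default value, because --epochs_num was not passed
print(unparsed)           # arguments the parser did not recognize
```

Because of this, the same ```train.py``` can run both locally (with all defaults) and on the remote machine (where only ```--data_folder``` is passed).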


## Train on remote VM

Now let's integrate with AML and automate training on a remote virtual machine.

### Step 1 : Get workspace setting

Before starting, you must read your configuration settings. (See "[Exercise01 : Prepare Config Settings](./exercise01_prepare_config.ipynb)")

In [3]:
from azureml.core import Workspace
import azureml.core

ws = Workspace.from_config()

### Step 2 : Create new remote virtual machine

Create your new remote virtual machine with GPU.<br>
Before starting, **please check the following**.

- Make sure that the VM size used in the following script (```Standard_NC4as_T4_v3```) is supported in the location where your AML workspace resides.<br>
You can also specify ```location``` in ```provisioning_configuration```, but setting a location different from the AML workspace is not recommended, since data in the AML workspace will be mounted on this virtual machine.
- You need quota for ML GPU VMs in your Azure subscription. If you don't have it, please request quota in the Azure Portal.

**If you don't have any quota for GPU, please change the VM size (for example, to Standard_D2_v2).**

> Note : If the VM already exists, this script gets the existing one.

In [4]:
from azureml.core.compute import AmlCompute, ComputeTarget
from azureml.core.compute_target import ComputeTargetException

try:
    compute_target = ComputeTarget(workspace=ws, name='myvm01')
    print('found existing:', compute_target.name)
except ComputeTargetException:
    print('creating new.')
    compute_config = AmlCompute.provisioning_configuration(
        vm_size='Standard_NC4as_T4_v3', # change to another size (such as Standard_NC6 or Standard_D2_v2) if needed
        min_nodes=0,
        max_nodes=1)
    compute_target = ComputeTarget.create(ws, 'myvm01', compute_config)
    compute_target.wait_for_completion(show_output=True)

creating new.
InProgress..
SucceededProvisioning operation finished, operation "Succeeded"
Succeeded
AmlCompute wait for completion finished

Minimum number of nodes requested have been provisioned


With auto-scaling enabled (from 0 to 1 node), the node is terminated when it's inactive, which saves money.

> Note : You can also attach an existing virtual machine (bring your own compute resource) as a compute target.

### Step 3 : Get dataset reference for files

You can use the registered dataset (named ```mnist_dataset```) and mount it in your compute target.<br>
See "[Exercise02 : Prepare Data](./exercise02_prepare_data.ipynb)" for data preparation.

> Note : Dataset registration is not mandatory. (You can mount any data in an AML datastore as a dataset.)

In [5]:
from azureml.core import Dataset

dataset = Dataset.get_by_name(ws, 'mnist_dataset', version='latest')

# # For using unregistered data, see below
# from azureml.core import Datastore
# from azureml.core import Dataset
# ds = ws.get_default_datastore()
# ds_paths = [(ds, 'tfdata/')]
# dataset = Dataset.File.from_files(path = ds_paths)

### Step 4 : Run script and wait for completion

Submit a training job.

In this example, I use the registered dataset named ```mnist_dataset``` and mount this data in my compute target. (Run "[Exercise02 : Prepare Data](./exercise02_prepare_data.ipynb)" for data preparation.)

> Note : Here I use AML built-in environment (```AzureML-tensorflow-2.7-ubuntu20.04-py38-cuda11-gpu```), but you can build and use your own environment.<br>
> In the later example in this notebook, I'll run the same script with my own environment.

In [6]:
from azureml.core import Experiment, Environment, Run, ScriptRunConfig
from azureml.core.runconfig import DockerConfiguration

# create script run config
tf_env = Environment.get(workspace=ws, name='AzureML-tensorflow-2.7-ubuntu20.04-py38-cuda11-gpu')
src = ScriptRunConfig(
    source_directory='./script',
    script='train.py',
    arguments=['--data_folder', dataset.as_mount()],
    compute_target=compute_target,
    environment=tf_env,
    docker_runtime_config=DockerConfiguration(use_docker=True))

# submit and run !
exp = Experiment(workspace=ws, name='tf_remote_experiment')
run = exp.submit(config=src)
run.wait_for_completion(show_output=True)

RunId: tf_remote_experiment_1664945818_1d396756
Web View: https://ml.azure.com/runs/tf_remote_experiment_1664945818_1d396756?wsid=/subscriptions/b3ae1c15-4fef-4362-8c3a-5d804cdeb18d/resourcegroups/rg-AML/workspaces/ws01&tid=72f988bf-86f1-41af-91ab-2d7cd011db47

Streaming user_logs/std_log.txt

##### List of available GPU #####
[PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]
2022-10-05 05:05:07.148158: I tensorflow/core/platform/cpu_feature_guard.cc:151] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-10-05 05:05:07.816872: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1525] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 10792 MB memory:  -> device: 0, name: Tesla K80, pci bus id: 4dd9:00:00.0, compute capability: 3.7
Epoch 1/6

  

{'runId': 'tf_remote_experiment_1664945818_1d396756',
 'target': 'myvm01',
 'status': 'Completed',
 'startTimeUtc': '2022-10-05T05:02:32.009974Z',
 'endTimeUtc': '2022-10-05T05:05:20.568375Z',
 'services': {},
 'properties': {'_azureml.ComputeTargetType': 'amlctrain',
  'ContentSnapshotId': 'd1e62fa9-0a70-4726-be39-2b23edcbacab',
  'ProcessInfoFile': 'azureml-logs/process_info.json',
  'ProcessStatusFile': 'azureml-logs/process_status.json'},
 'inputDatasets': [{'dataset': {'id': '16c18986-c760-49b0-a222-eeb89a5f9262'}, 'consumptionDetails': {'type': 'RunInput', 'inputName': 'input__16c18986', 'mechanism': 'Mount'}}],
 'outputDatasets': [],
 'runDefinition': {'script': 'train.py',
  'command': '',
  'useAbsolutePath': False,
  'arguments': ['--data_folder', 'DatasetConsumptionConfig:input__16c18986'],
  'sourceDirectoryDataStore': None,
  'framework': 'Python',
  'communicator': 'None',
  'target': 'myvm01',
  'dataReferences': {},
  'data': {'input__16c18986': {'dataLocation': {'datas
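
The run above passes only ```--data_folder```, so ```train.py``` falls back to its defaults for everything else. To try other hyperparameters, the extra arguments could be appended to the same ```arguments``` list. The helper below is a minimal sketch; ```build_arguments``` is a hypothetical name and the values are illustrative, not tuned.

```python
def build_arguments(hyperparameters):
    """Flatten {"learning_rate": 0.01} into ["--learning_rate", "0.01"]."""
    arguments = []
    for name, value in hyperparameters.items():
        arguments.extend(["--" + name, str(value)])
    return arguments

# Hypothetical values; the names match the arguments defined in train.py.
extra_args = build_arguments({"learning_rate": 0.01, "epochs_num": 3})
print(extra_args)

# With the objects from the cell above, the list could be passed as:
#   arguments=['--data_folder', dataset.as_mount()] + extra_args
```

Converting every value to a string keeps the list in the same shape as a command line, which is what ```argparse``` in ```train.py``` expects.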

### Step 5 : Download results and evaluate

Now let's check the generated model on the local computer.

First, check the generated files and logs.

In [7]:
run.get_file_names()

['outputs/mnist_tf_model/keras_metadata.pb',
 'outputs/mnist_tf_model/saved_model.pb',
 'outputs/mnist_tf_model/variables/variables.data-00000-of-00001',
 'outputs/mnist_tf_model/variables/variables.index',
 'system_logs/cs_capability/cs-capability.log',
 'system_logs/data_capability/data-capability.log',
 'system_logs/data_capability/rslex.log.2022-10-05-05',
 'system_logs/hosttools_capability/hosttools-capability.log',
 'system_logs/lifecycler/execution-wrapper.log',
 'system_logs/lifecycler/lifecycler.log',
 'system_logs/metrics_capability/metrics-capability.log',
 'system_logs/snapshot_capability/snapshot-capability.log',
 'user_logs/std_log.txt']

Download the model to your local machine.

In [8]:
run.download_file(
    name='outputs/mnist_tf_model/keras_metadata.pb',
    output_file_path='remote_model/keras_metadata.pb')
run.download_file(
    name='outputs/mnist_tf_model/saved_model.pb',
    output_file_path='remote_model/saved_model.pb')
run.download_file(
    name='outputs/mnist_tf_model/variables/variables.data-00000-of-00001',
    output_file_path='remote_model/variables/variables.data-00000-of-00001')
run.download_file(
    name='outputs/mnist_tf_model/variables/variables.index',
    output_file_path='remote_model/variables/variables.index')
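
The four calls above follow one pattern: every file under ```outputs/mnist_tf_model/``` is written to the same relative path under ```remote_model/```. As a sketch, the same thing can be written as a loop over ```run.get_file_names()```. The function name ```download_model_files``` is hypothetical; it only relies on the two ```run``` methods already used in this notebook.

```python
import os

def download_model_files(run, prefix="outputs/mnist_tf_model/", target_dir="remote_model"):
    """Download every run file under `prefix`, keeping the relative layout."""
    downloaded = []
    for name in run.get_file_names():
        if not name.startswith(prefix):
            continue  # skip logs and any other outputs
        relative_path = name[len(prefix):]
        local_path = os.path.join(target_dir, *relative_path.split("/"))
        run.download_file(name=name, output_file_path=local_path)
        downloaded.append(local_path)
    return downloaded

# Usage with the run object from Step 4:
#   download_model_files(run)
```

This avoids hard-coding file names, which may differ between TensorFlow versions.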

Predict on your test data using the downloaded model.

In [9]:
import tensorflow as tf

test_data = tf.data.Dataset.load("./data/test")

loaded_model = tf.keras.models.load_model("./remote_model")
for image, true_value in test_data.take(3):
    pred_output = loaded_model(tf.expand_dims(image, axis=0))
    pred_value = tf.math.argmax(pred_output, axis=-1).numpy().item()
    print("Predicted {}, True {}".format(pred_value, true_value))

2022-10-05 05:11:26.239294: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 AVX512F FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-10-05 05:11:26.393022: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
2022-10-05 05:11:26.393059: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
2022-10-05 05:11:26.425198: E tensorflow/stream_executor/cuda/cuda_blas.cc:2981] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2022-10-05 05:11:27.169895: W tensorflow/stream_executor/pla

Predicted 7, True 7
Predicted 2, True 2
Predicted 1, True 1
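
The last layer in ```train.py``` is ```Dense(10)``` without an activation, and the loss uses ```from_logits=True```, so the model returns raw scores (logits) rather than probabilities. Taking the argmax of the logits is enough to get the predicted digit; softmax is only needed if you also want a probability. The following plain-Python sketch illustrates this with hypothetical logits (not real model output).

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    shift = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for the ten digit classes (not real model output).
logits = [-3.1, -1.2, 0.4, 1.0, -2.5, -0.7, -4.0, 6.3, -0.9, 0.2]

# argmax over the logits: index of the largest score
predicted = max(range(len(logits)), key=lambda i: logits[i])
probabilities = softmax(logits)

print("predicted digit:", predicted)
print("probability: {:.3f}".format(probabilities[predicted]))
```

Softmax preserves the ordering of the scores, so the argmax of the probabilities is always the same as the argmax of the logits.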


### Step 6 : Register Model with Dataset reference

By registering the model with a dataset reference, you can trace the model back to the corresponding dataset version.

In [10]:
model = run.register_model(
    model_name='mnist_model_test',
    model_path='outputs/mnist_tf_model',
    datasets=[('training data', dataset)])

To track the data used for this model, open the model in [Azure Machine Learning Studio](https://ml.azure.com/) and select the "Data" tab. (See the following screenshot.)

![data tracking](https://tsmatz.files.wordpress.com/2021/08/20210823_track_data.jpg)

### [Optional] Step 7 : Train with your own environment

**This is not mandatory. (You can skip this section.)**

You can also build your own environment with a custom docker image.<br>
Here we create a new docker environment for running scripts, and run the same training with this environment.

Register a custom environment (named ```test-remote-gpu-env```) in AML with the following conda configuration.<br>
Here I use ```DEFAULT_GPU_IMAGE``` (```mcr.microsoft.com/azureml/openmpi4.1.0-cuda11.2-cudnn8-ubuntu20.04```), but you can also bring your own image.

In [17]:
from azureml.core.conda_dependencies import CondaDependencies
from azureml.core.environment import Environment

# create environment
env = Environment('test-remote-gpu-env')
env.python.conda_dependencies = CondaDependencies.create(
    python_version="3.8",
    pip_packages=['tensorflow-gpu==2.10.0'])
env.docker.base_image = 'mcr.microsoft.com/azureml/openmpi4.1.0-cuda11.2-cudnn8-ubuntu20.04'
# You can also use default GPU image (azureml.core.runconfig.DEFAULT_GPU_IMAGE)

# register environment to re-use later
env.register(workspace=ws)

{
    "assetId": "azureml://locations/eastus/workspaces/9f284df9-d636-40ed-bae1-0303c21d4b4f/environments/test-remote-gpu-env/versions/3",
    "databricks": {
        "eggLibraries": [],
        "jarLibraries": [],
        "mavenLibraries": [],
        "pypiLibraries": [],
        "rcranLibraries": []
    },
    "docker": {
        "arguments": [],
        "baseDockerfile": null,
        "baseImage": "mcr.microsoft.com/azureml/openmpi4.1.0-cuda11.2-cudnn8-ubuntu20.04",
        "baseImageRegistry": {
            "address": null,
            "password": null,
            "registryIdentity": null,
            "username": null
        },
        "buildContext": null,
        "enabled": false,
        "platform": {
            "architecture": "amd64",
            "os": "Linux"
        },
        "sharedVolumes": true,
        "shmSize": null
    },
    "environmentVariables": {
        "EXAMPLE_ENV_VAR": "EXAMPLE_VALUE"
    },
    "inferencingStackVersion": null,
    "name": "test-remote-gp

Run the training script with the custom environment above.

The first run takes a long time (over 30 minutes), because it pulls the base image, builds a new image (the custom environment), starts the nodes in the cluster, and runs the script.

In [18]:
src = ScriptRunConfig(
    source_directory='./script',
    script='train.py',
    arguments=['--data_folder', dataset.as_mount()],
    compute_target=compute_target,
    environment=env,
    docker_runtime_config=DockerConfiguration(use_docker=True))

# submit and run !
exp = Experiment(workspace=ws, name='tf_remote_experiment')
run = exp.submit(config=src)
run.wait_for_completion(show_output=True)

RunId: tf_remote_experiment_1664949349_0c9b7841
Web View: https://ml.azure.com/runs/tf_remote_experiment_1664949349_0c9b7841?wsid=/subscriptions/b3ae1c15-4fef-4362-8c3a-5d804cdeb18d/resourcegroups/rg-AML/workspaces/ws01&tid=72f988bf-86f1-41af-91ab-2d7cd011db47

Streaming azureml-logs/20_image_build_log.txt

2022/10/05 05:55:52 Downloading source code...
2022/10/05 05:55:53 Finished downloading source code
2022/10/05 05:55:54 Creating Docker network: acb_default_network, driver: 'bridge'
2022/10/05 05:55:54 Successfully set up Docker network: acb_default_network
2022/10/05 05:55:54 Setting up Docker configuration...
2022/10/05 05:55:55 Successfully set up Docker configuration
2022/10/05 05:55:55 Logging in to registry: 9f284df9d63640edbae10303c21d4b4f.azurecr.io
2022/10/05 05:55:56 Successfully logged into 9f284df9d63640edbae10303c21d4b4f.azurecr.io
2022/10/05 05:55:56 Executing step ID: acb_step_0. Timeout(sec): 5400, Working directory: '', Network: 'acb_default_network'
2022/10/05 05:

  Downloading google_pasta-0.2.0-py3-none-any.whl (57 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 57.5/57.5 kB 12.3 MB/s eta 0:00:00
Collecting protobuf<3.20,>=3.9.2
  Downloading protobuf-3.19.6-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.1 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.1/1.1 MB 71.6 MB/s eta 0:00:00
Collecting six>=1.12.0
  Downloading six-1.16.0-py2.py3-none-any.whl (11 kB)
Collecting absl-py>=1.0.0
  Downloading absl_py-1.2.0-py3-none-any.whl (123 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 123.4/123.4 kB 15.3 MB/s eta 0:00:00
Collecting gast<=0.4.0,>=0.2.1
  Downloading gast-0.4.0-py3-none-any.whl (9.8 kB)
Collecting libclang>=13.0.0
  Downloading libclang-14.0.6-py2.py3-none-manylinux2010_x86_64.whl (

 ---> Running in ed5f5da16703
Removing intermediate container ed5f5da16703
 ---> 9f5f59e3d6c6
Step 21/21 : CMD ["bash"]
 ---> Running in 93d59c77e030
Removing intermediate container 93d59c77e030
 ---> 5131f164ead5
Successfully built 5131f164ead5
Successfully tagged 9f284df9d63640edbae10303c21d4b4f.azurecr.io/azureml/azureml_7e083e90b49a8884f2992d5c300d503d:latest
Successfully tagged 9f284df9d63640edbae10303c21d4b4f.azurecr.io/azureml/azureml_7e083e90b49a8884f2992d5c300d503d:1
2022/10/05 06:01:51 Successfully executed container: acb_step_0
2022/10/05 06:01:51 Executing step ID: acb_step_1. Timeout(sec): 5400, Working directory: '', Network: 'acb_default_network'
2022/10/05 06:01:51 Pushing image: 9f284df9d63640edbae10303c21d4b4f.azurecr.io/azureml/azureml_7e083e90b49a8884f2992d5c300d503d:1, attempt 1
The push refers to repository [9f284df9d63640edbae10303c21d4b4f.azurecr.io/azureml/azureml_7e083e90b49a8884f2992d5c300d503d]
cfa4d0675bf6: Preparing
2693497bb117: Preparing
046f789d146c: Pr

{'runId': 'tf_remote_experiment_1664949349_0c9b7841',
 'target': 'myvm01',
 'status': 'Completed',
 'startTimeUtc': '2022-10-05T06:21:22.340264Z',
 'endTimeUtc': '2022-10-05T06:23:42.428514Z',
 'services': {},
 'properties': {'_azureml.ComputeTargetType': 'amlctrain',
  'ContentSnapshotId': '854efebe-1b17-4b1c-bd49-a1410fb8e940',
  'ProcessInfoFile': 'azureml-logs/process_info.json',
  'ProcessStatusFile': 'azureml-logs/process_status.json'},
 'inputDatasets': [{'dataset': {'id': '16c18986-c760-49b0-a222-eeb89a5f9262'}, 'consumptionDetails': {'type': 'RunInput', 'inputName': 'input__16c18986', 'mechanism': 'Mount'}}],
 'outputDatasets': [],
 'runDefinition': {'script': 'train.py',
  'command': '',
  'useAbsolutePath': False,
  'arguments': ['--data_folder', 'DatasetConsumptionConfig:input__16c18986'],
  'sourceDirectoryDataStore': None,
  'framework': 'Python',
  'communicator': 'None',
  'target': 'myvm01',
  'dataReferences': {},
  'data': {'input__16c18986': {'dataLocation': {'datas

### Step 8 : Remove AML compute

**You don't need to remove your AML compute** to save money, because the nodes are terminated automatically when they are inactive.<br>
But if you want to clean up, please run the following.

In [12]:
# Delete cluster (nodes) and remove from AML workspace
mycompute = AmlCompute(workspace=ws, name='myvm01')
mycompute.delete()

In [13]:
# get a status for the current cluster.
print(mycompute.status.serialize())

{'currentNodeCount': 1, 'targetNodeCount': 1, 'nodeStateCounts': {'preparingNodeCount': 0, 'runningNodeCount': 0, 'idleNodeCount': 1, 'unusableNodeCount': 0, 'leavingNodeCount': 0, 'preemptedNodeCount': 0}, 'allocationState': 'Steady', 'allocationStateTransitionTime': '2021-08-31T04:52:16.858000+00:00', 'errors': None, 'creationTime': '2021-08-31T04:45:45.747268+00:00', 'modifiedTime': '2021-08-31T04:46:11.291359+00:00', 'provisioningState': 'Succeeded', 'provisioningStateTransitionTime': None, 'scaleSettings': {'minNodeCount': 0, 'maxNodeCount': 1, 'nodeIdleTimeBeforeScaleDown': 'PT1800S'}, 'vmPriority': 'Dedicated', 'vmSize': 'STANDARD_NC4AS_T4_V3'}
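
In the status above, ```nodeIdleTimeBeforeScaleDown``` is an ISO 8601 duration: ```PT1800S``` means that an idle node is kept for 1800 seconds before the cluster scales down. The helper below is a minimal sketch for reading such values; ```iso8601_seconds``` is a hypothetical name, and it handles only the hour, minute, and second parts.

```python
import re

def iso8601_seconds(duration):
    """Parse a simple ISO 8601 duration such as 'PT1800S' or 'PT1H30M' into seconds."""
    match = re.fullmatch(r"PT(?:(\d+)H)?(?:(\d+)M)?(?:(\d+)S)?", duration)
    if match is None or not any(match.groups()):
        raise ValueError("unsupported duration: " + duration)
    hours, minutes, seconds = (int(g) if g else 0 for g in match.groups())
    return hours * 3600 + minutes * 60 + seconds

idle = iso8601_seconds("PT1800S")
print(idle, "seconds =", idle // 60, "minutes")
```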
