# Exercise04 : Train on Remote GPU Virtual Machine

Now we run our previous sample (see "[Exercise03 : Just Train in Your Working Machine](./exercise03_train_simple.ipynb)") on a remote virtual machine with a GPU.

> Note : If you don't have GPU quota, you can also run this example on CPU.

*back to [index](https://github.com/tsmatz/azureml-tutorial/)*

## Initialize MLClient

Replace the bracketed placeholder strings below with your subscription ID, resource group name, and AML workspace name.<br>
(Note that creating ```MLClient``` does not connect to the AML workspace; the client initialization is lazy.)

In [1]:
from azure.ai.ml import MLClient
from azure.identity import DeviceCodeCredential, TokenCachePersistenceOptions

# When you run on remote
cache_opt = TokenCachePersistenceOptions(allow_unencrypted_storage=True)
cred = DeviceCodeCredential(cache_persistence_options=cache_opt)

# # When you run on Azure ML Notebook
# from azure.identity import DefaultAzureCredential
# cred = DefaultAzureCredential()

# Get a handle to the workspace
ml_client = MLClient(
    credential=cred,
    subscription_id="{SUBSCRIPTION ID}",
    resource_group_name="{RESOURCE GROUP NAME}",
    workspace_name="{AML WORKSPACE NAME}",
)
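If you prefer not to hard-code the three placeholders above, one option is to read them from environment variables. The following is only a sketch: the variable names (`AML_SUBSCRIPTION_ID` and so on) and the `workspace_settings` helper are hypothetical and not part of the Azure ML SDK.

```python
import os
from typing import Mapping, Optional

# Hypothetical environment variable names; choose whatever names suit you.
REQUIRED_VARS = {
    "subscription_id": "AML_SUBSCRIPTION_ID",
    "resource_group_name": "AML_RESOURCE_GROUP",
    "workspace_name": "AML_WORKSPACE_NAME",
}


def workspace_settings(environ: Optional[Mapping[str, str]] = None) -> dict:
    """Collect the MLClient keyword arguments from environment variables."""
    environ = os.environ if environ is None else environ
    missing = [var for var in REQUIRED_VARS.values() if not environ.get(var)]
    if missing:
        raise KeyError("missing environment variables: " + ", ".join(missing))
    return {key: environ[var] for key, var in REQUIRED_VARS.items()}


# Usage with the credential created above:
# ml_client = MLClient(credential=cred, **workspace_settings())
```

This keeps subscription details out of the notebook itself, which helps when the notebook is shared or committed to a repository.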

## Save your training script as file (train.py)

Create a ```script``` directory and save the Python script as ```./script/train.py```.

In [2]:
import os
script_folder = './script'
os.makedirs(script_folder, exist_ok=True)

In [3]:
%%writefile script/train.py
import os
import argparse
import tensorflow as tf

# device test
print("##### List of available GPU #####")
print(tf.config.list_physical_devices("GPU"))

# parse arguments
parser = argparse.ArgumentParser()
parser.add_argument(
    "--data_folder",
    type=str,
    default="./data/train",
    help="Folder path for input data")
parser.add_argument(
    "--model_folder",
    type=str,
    default="./outputs",  # AML experiments outputs folder
    help="Folder path for model output")
parser.add_argument(
    "--learning_rate",
    type=float,
    default=0.001,
    help="Learning Rate")
parser.add_argument(
    "--first_layer",
    type=int,
    default=128,
    help="Neuron number for the first hidden layer")
parser.add_argument(
    "--second_layer",
    type=int,
    default=64,
    help="Neuron number for the second hidden layer")
parser.add_argument(
    "--epochs_num",
    type=int,
    default=6,
    help="Number of epochs")
FLAGS, unparsed = parser.parse_known_args()

# build model
model = tf.keras.models.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(FLAGS.first_layer, activation="relu"),
    tf.keras.layers.Dense(FLAGS.second_layer, activation="relu"),
    tf.keras.layers.Dense(10)
])
model.compile(
    optimizer=tf.keras.optimizers.Adam(FLAGS.learning_rate),
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=[tf.keras.metrics.SparseCategoricalAccuracy()],
)

# run training
train_data = tf.data.experimental.load(FLAGS.data_folder)
model.fit(
    train_data.shuffle(1000).batch(128).prefetch(tf.data.AUTOTUNE),
    epochs=FLAGS.epochs_num
)

# save model and variables
model_path = os.path.join(FLAGS.model_folder, "mnist_tf_model")
model.save(model_path)
print("current working directory : ", os.getcwd())
print("model folder : ", model_path)

Writing script/train.py
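The script reads its hyperparameters with `parse_known_args()`, which converts each command-line value with the declared `type` and collects arguments it does not recognize instead of failing. The following standalone sketch illustrates this behavior with a reduced parser; the `--unknown_flag` argument is made up for the demonstration.

```python
import argparse

# Same pattern as train.py, reduced to two arguments.
parser = argparse.ArgumentParser()
parser.add_argument("--learning_rate", type=float, default=0.001)
parser.add_argument("--epochs_num", type=int, default=6)

# Simulated command line; "--unknown_flag" is a made-up extra argument.
argv = ["--epochs_num", "3", "--unknown_flag", "abc"]
flags, unparsed = parser.parse_known_args(argv)

print(flags.learning_rate)  # default value, because it was not passed
print(flags.epochs_num)     # the string "3" converted by type=int
print(unparsed)             # arguments the parser does not declare
```

Tolerating unknown arguments is convenient when a launcher appends its own flags, but it also means a misspelled option is silently ignored, so check `unparsed` if a setting seems to have no effect.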


## Train on remote VM

Now let's integrate with AML and automate training on a remote virtual machine.

### Step 1 : Create new remote virtual machine

Create a new remote virtual machine with a GPU.<br>
Before starting, **please check the following**.

- Make sure that the VM size used in the following script (```Standard_NC4as_T4_v3```) is supported in the region where your AML workspace resides.
- You need quota for ML GPU VMs in your Azure subscription. If you don't have any, request quota in the Azure Portal.

**If you don't have any quota for GPU, use a CPU VM (such as Standard_D2_v2).**

In [4]:
from azure.ai.ml.entities import AmlCompute

try:
    compute_target = ml_client.compute.get("myvm01")
    print("found existing: ", compute_target.name)
except Exception:
    print("creating new.")
    compute_target = AmlCompute(
        name="myvm01",
        type="amlcompute",
        size="Standard_NC4as_T4_v3", # change to another size, such as Standard_NC6 or Standard_D2_v2, if needed
        min_instances=0,
        max_instances=1,
        tier="Dedicated",
    )
    # begin_create_or_update() returns a poller; result() waits for the created compute
    compute_target = ml_client.begin_create_or_update(compute_target).result()

To sign in, use a web browser to open the page https://microsoft.com/devicelogin and enter the code FB56HDWZ5 to authenticate.
creating new.


Setting ```min_instances``` to 0 lets the cluster scale down to zero nodes when it is inactive, which saves money.

> Note : By setting an appropriate duration (in seconds) in the ```idle_time_before_scale_down``` parameter, you can keep the nodes running for a while after training has finished. (Otherwise, the cluster scales down 120 seconds after training has finished, and the next training job will be slow to start because of cluster resizing.)

> Note : You can also attach an existing virtual machine (bring your own compute resource) as a compute target.
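As a sketch of the ```idle_time_before_scale_down``` note above, the cluster settings could be collected in a dictionary and passed to `AmlCompute`. The value of 1800 seconds is an arbitrary example, and `build_compute` is a hypothetical helper; the import is placed inside the function so that the settings can be inspected without the SDK installed.

```python
# Cluster settings; 1800 seconds (30 minutes) is an arbitrary example value.
compute_settings = {
    "name": "myvm01",
    "size": "Standard_NC4as_T4_v3",
    "min_instances": 0,
    "max_instances": 1,
    "tier": "Dedicated",
    "idle_time_before_scale_down": 1800,  # seconds to stay up after the last job
}


def build_compute(settings):
    """Create the AmlCompute entity (requires the azure-ai-ml package)."""
    from azure.ai.ml.entities import AmlCompute  # imported lazily on purpose

    return AmlCompute(**settings)


# Usage with the ml_client created earlier:
# ml_client.begin_create_or_update(build_compute(compute_settings)).result()
```

A longer idle time trades cost for faster start of the next job, so it is mainly worthwhile when you submit several jobs in a row.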

### Step 2 : Submit training job

Submit a training job that runs on the compute created above.

In this example, I use the registered data asset named ```mnist_data``` and mount it on my compute target. (Run "[Exercise02 : Prepare Data](./exercise02_prepare_data.ipynb)" for data preparation.)<br>
To use a data asset in AML, specify ```{DATA_NAME}:{DATA_VERSION}```, or ```{DATA_NAME}@latest``` for the latest version of the asset, as follows.
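The two reference forms can be built with a small helper. This is only an illustration of the string format described above; `data_asset_ref` is a hypothetical function, not part of the SDK.

```python
from typing import Optional


def data_asset_ref(name: str, version: Optional[str] = None) -> str:
    """Return '{name}:{version}', or '{name}@latest' when no version is given."""
    if version is None:
        return f"{name}@latest"
    return f"{name}:{version}"


print(data_asset_ref("mnist_data"))       # mnist_data@latest
print(data_asset_ref("mnist_data", "1"))  # mnist_data:1
```

Pinning an explicit version makes a job reproducible, while `@latest` always picks up the newest registered data.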

You can see the progress and results in the job view of [AML Studio](https://ml.azure.com/).

> Note : Here I use an AML built-in environment (```AzureML-tensorflow-2.7-ubuntu20.04-py38-cuda11-gpu```), but you can build and use your own environment.<br>
> In a later example in this notebook, I'll run the same script with my own environment.

In [5]:
from azure.ai.ml import command, Input

# create the command
job = command(
    code="./script",
    command="python train.py --data_folder ${{inputs.mnist_tf}}/train",
    inputs={
        "mnist_tf": Input(
            type="uri_folder",
            path="mnist_data@latest",
        ),
    },
    environment="AzureML-tensorflow-2.7-ubuntu20.04-py38-cuda11-gpu@latest",
    compute="myvm01",
    display_name="tf_remote_experiment",
    experiment_name="tf_remote_experiment",
    description="This is example",
)

# submit the command
returned_job = ml_client.create_or_update(job)

Uploading script (0.0 MBs): 100%|█████████████████████████████████████████| 1862/1862 [00:00<00:00, 283662.43it/s]



You can get the job name as follows.<br>
The job name is used whenever you retrieve detailed information about the job.

In [6]:
returned_job.name

'dynamic_brush_6xv178xzwd'

Please wait until the job is completed.

You can see the current status (progress) in the [AML studio UI](https://ml.azure.com/) (see the "Jobs" pane) or with the following SDK call.

In [7]:
ml_client.jobs.get(returned_job.name)

| Experiment | Name | Type | Status | Details Page |
| --- | --- | --- | --- | --- |
| tf_remote_experiment | dynamic_brush_6xv178xzwd | command | Completed | Link to Azure Machine Learning studio |
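Instead of re-running the status cell by hand, you could poll until the job reaches a final state. The sketch below is one possible approach: `wait_for_job` is a hypothetical helper that takes any callable returning the status string, and the set of terminal states (`Completed`, `Failed`, `Canceled`) is an assumption that you should check against the statuses you see in your workspace.

```python
import time

# Statuses after which the job is assumed not to change any more.
TERMINAL_STATES = {"Completed", "Failed", "Canceled"}


def wait_for_job(get_status, poll_seconds=30.0, timeout_seconds=3600.0):
    """Poll get_status() until it returns a terminal state or the timeout expires."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        status = get_status()
        if status in TERMINAL_STATES:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job still in state {status!r} after {timeout_seconds} s")
        time.sleep(poll_seconds)


# Usage with the objects from this notebook:
# final = wait_for_job(lambda: ml_client.jobs.get(returned_job.name).status)
```

The SDK also offers `ml_client.jobs.stream(returned_job.name)`, which streams the job logs and returns when the job has finished, if you prefer not to write a loop yourself.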


### Step 3 : Download results and evaluate

After the training has completed, go to the [Azure ML studio UI](https://ml.azure.com/).<br>
You can then see the saved model in the outputs directory.

![Saved Outputs](https://tsmatz.github.io/images/github/azure-ml-tensorflow-complete-sample/20220225_Experiment_Outputs.jpg)

Now let's check the generated model on your local computer.<br>
Download the artifacts (including the generated model in outputs) with the SDK as follows.

In [10]:
ml_client.jobs.download(name=returned_job.name, download_path="./")

Downloading artifact azureml://datastores/workspaceartifactstore/ExperimentRun/dcid.dynamic_brush_6xv178xzwd to artifacts


Now check the downloaded result.

In [11]:
import tensorflow as tf

test_data = tf.data.Dataset.load("./data/test")

loaded_model = tf.keras.models.load_model("./artifacts/outputs/mnist_tf_model")
for image, true_value in test_data.take(3):
    pred_output = loaded_model(tf.expand_dims(image, axis=0))
    pred_value = tf.math.argmax(pred_output, axis=-1).numpy().item()
    print("Predicted {}, True {}".format(pred_value, true_value))

2022-10-05 00:36:25.639683: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-10-05 00:36:25.817034: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
2022-10-05 00:36:25.817090: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
2022-10-05 00:36:25.861734: E tensorflow/stream_executor/cuda/cuda_blas.cc:2981] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2022-10-05 00:36:27.051673: W tensorflow/stream_executor/platform/de

Predicted 7, True 7
Predicted 2, True 2
Predicted 1, True 1
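The last layer of the model is `Dense(10)` without a softmax, so the model returns raw logits, and the prediction is simply the index of the largest logit. Because softmax preserves ordering, the argmax is the same before and after it. The following pure-Python sketch shows this with made-up logits (the values are illustrative only, not actual model output).

```python
import math

# Made-up logits for the 10 digit classes (illustrative values only).
logits = [-1.2, 0.3, 4.1, 0.0, -0.7, 1.5, -2.0, 0.9, 0.2, -0.4]


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    shift = max(values)
    exps = [math.exp(v - shift) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def argmax(values):
    """Index of the largest element."""
    return max(range(len(values)), key=lambda i: values[i])


probs = softmax(logits)
print(argmax(logits), argmax(probs))  # both are 2
```

Applying softmax is only needed when you want class probabilities, for example to report a confidence alongside the predicted digit.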


### Step 4 : Register Model

Now upload (register) the downloaded model to AML model management.

> Note : You can also register the model files directly from the run history as follows.<br>
> ```
> run_model = Model(
>   path="azureml://subscriptions/XXX/resourceGroups/XXX/workspaces/XXX/jobs/XXX/outputs/artifacts/outputs/mnist_tf_model",
>   type="custom_model",
>   name="mnist_model_test",
> )
> ml_client.models.create_or_update(run_model)
> ```

In [12]:
from azure.ai.ml.entities import Model
#from azure.ai.ml.constants import ModelType

file_model = Model(
    path="./artifacts/outputs/mnist_tf_model",
    type="custom_model",
    name="mnist_model_test",
)
ml_client.models.create_or_update(file_model)

Uploading mnist_tf_model (1.42 MBs): 100%|████████████████████████| 1418402/1418402 [00:00<00:00, 12633967.98it/s]



Model({'job_name': None, 'is_anonymous': False, 'auto_increment_version': False, 'name': 'mnist_model_test', 'description': None, 'tags': {}, 'properties': {}, 'id': '/subscriptions/b3ae1c15-4fef-4362-8c3a-5d804cdeb18d/resourceGroups/AML-rg/providers/Microsoft.MachineLearningServices/workspaces/ws01/models/mnist_model_test/versions/1', 'Resource__source_path': None, 'base_path': '/home/tsmatsuz/python_sdk2', 'creation_context': <azure.ai.ml.entities._system_data.SystemData object at 0x7f2ce8fa9fa0>, 'serialize': <msrest.serialization.Serializer object at 0x7f2ce8ffbd00>, 'version': '1', 'latest_version': None, 'path': 'azureml://subscriptions/b3ae1c15-4fef-4362-8c3a-5d804cdeb18d/resourceGroups/AML-rg/workspaces/ws01/datastores/workspaceblobstore/paths/LocalUpload/137eb5837c85f2a0f2959021fdeb9038/mnist_tf_model', 'datastore': None, 'utc_time_created': None, 'flavors': None, 'arm_type': 'model_version', 'type': 'custom_model'})

### [Optional] Step 5 : Train with your own environment

**This is not mandatory. (You can skip this section.)**

You can also build your own environment from a custom Docker image.<br>
Here we create a new Docker environment for running scripts, and run the same training with this environment.

First I create a conda dependencies YAML file and save it as ```04_conda_pydata.yml```.

In [13]:
%%writefile 04_conda_pydata.yml
name: project_environment
dependencies:
- python=3.8
- pip:
  - tensorflow-gpu==2.10.0
channels:
- anaconda
- conda-forge

Writing 04_conda_pydata.yml


Register a custom environment (named ```test-remote-gpu-env```) in AML with the previous conda configuration.

In [14]:
from azure.ai.ml.entities import Environment

myenv = Environment(
    name="test-remote-gpu-env",
    description="This is example",
    conda_file="04_conda_pydata.yml",
    image="mcr.microsoft.com/azureml/openmpi4.1.0-cuda11.2-cudnn8-ubuntu20.04:latest",
)
myenv = ml_client.environments.create_or_update(myenv)

Go to the [AML Studio UI](https://ml.azure.com/) and click "Environments". Next, click the "Custom environments" tab and select the above environment.<br>
Wait until the environment's image build status shows as succeeded.

![Environment status](https://tsmatz.github.io/images/github/azure-ml-tensorflow-complete-sample/20221220_Environment_Status.jpg)

Run the training script with the above custom environment.

The first run takes a long time (over 30 minutes), because it pulls the base image, builds a new image (the custom environment), starts nodes in the cluster, and runs the script.

In [15]:
# create the command
job = command(
    code="./script",
    command="python train.py --data_folder ${{inputs.mnist_tf}}/train",
    inputs={
        "mnist_tf": Input(
            type="uri_folder",
            path="mnist_data@latest",
        ),
    },
    environment="test-remote-gpu-env@latest",
    compute="myvm01",
    display_name="tf_remote_experiment",
    experiment_name="tf_remote_experiment",
    description="This is example",
)

# submit the command
returned_job = ml_client.create_or_update(job)

### Step 6 : Remove AML compute

**You don't need to remove your AML compute** to save money, because the nodes are automatically terminated when the cluster is inactive.<br>
But if you want to clean up, run the following.

In [None]:
ml_client.compute.begin_delete("myvm01")

Deleting compute myvm01 


......