# Exercise06 : Track Logs and Metrics

Here we add logging capabilities to our source code, and then check the collected logs and metrics.


## Variable Settings

Replace the bracketed placeholder strings below to set the required variables.

> Note : If you set defaults with ```az configure --defaults``` as follows, you can omit the ```--resource-group``` and ```--workspace-name``` options in each ```az ml``` command.<br>
> ```az configure --defaults group=$resource_group workspace=$aml_workspace```

In [1]:
my_resource_group = "{AML-RESOURCE-GROUP-NAME}"
my_workspace = "{AML-WORKSPACE-NAME}"

## Change your source code for experimentation logging

When you submit a job with the Azure Machine Learning CLI v2, **the MLflow tracking URI and experiment name are set automatically, so MLflow logging is redirected to your AML workspace**.<br>
Therefore, change the source code from "[Exercise03 : Just Train in Your Working Machine](./exercise03_train_simple.ipynb)" to track logs and metrics with MLflow as follows. (The modified lines are marked with the comment "```##### Modified```".)

> Note : For details about MLflow and Azure ML integration, see [this repository](https://github.com/tsmatz/mlflow-azureml).
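To see where MLflow logging will go, you can inspect the environment variables that MLflow reads. The following is a minimal sketch, not part of the original exercise. It assumes that the job runtime exposes the standard MLflow variables ```MLFLOW_TRACKING_URI``` and ```MLFLOW_EXPERIMENT_NAME```, and that an AML tracking URI starts with ```azureml://```. The helper name ```describe_mlflow_env``` is hypothetical.

```python
import os


def describe_mlflow_env(env=None):
    """Summarize the MLflow-related environment variables.

    Assumption: an Azure ML tracking URI uses the "azureml://" scheme.
    """
    env = os.environ if env is None else env
    uri = env.get("MLFLOW_TRACKING_URI")
    experiment = env.get("MLFLOW_EXPERIMENT_NAME")
    if uri is None:
        target = "local default (MLFLOW_TRACKING_URI is not set)"
    elif uri.startswith("azureml://"):
        target = "Azure ML workspace"
    else:
        target = "custom tracking server"
    return {"tracking_uri": uri, "experiment": experiment, "target": target}


# On a local machine this usually reports the local default.
print(describe_mlflow_env())
```

Calling this at the top of the training script prints the destination into the job's standard output, which can help when metrics do not appear where you expect them.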

In [2]:
import os
script_folder = './script'
os.makedirs(script_folder, exist_ok=True)

In [3]:
%%writefile script/train_experiment.py
import os
import argparse
import tensorflow as tf

import mlflow ##### Modified
mlflow.tensorflow.autolog() ##### Modified

### You can also log manually as follows (here we use autolog() instead)
# mlflow.log_params({
#     'learning_rate': FLAGS.learning_rate,
#     '1st_layer': FLAGS.first_layer,
#     '2nd_layer': FLAGS.second_layer})
# mlflow.log_metrics(
#     {'training_accuracy': result_accuracy, 'training_loss': result_loss},
#     step=result_step)

# device test
print("##### List of available GPU #####")
print(tf.config.list_physical_devices("GPU"))

# parse arguments
parser = argparse.ArgumentParser()
parser.add_argument(
    "--data_folder",
    type=str,
    default="./data/train",
    help="Folder path for input data")
parser.add_argument(
    "--model_folder",
    type=str,
    default="./outputs",  # AML experiments outputs folder
    help="Folder path for model output")
parser.add_argument(
    "--learning_rate",
    type=float,
    default=0.001,
    help="Learning Rate")
parser.add_argument(
    "--first_layer",
    type=int,
    default=128,
    help="Neuron number for the first hidden layer")
parser.add_argument(
    "--second_layer",
    type=int,
    default=64,
    help="Neuron number for the second hidden layer")
parser.add_argument(
    "--epochs_num",
    type=int,
    default=6,
    help="Number of epochs")
FLAGS, unparsed = parser.parse_known_args()

# build model
model = tf.keras.models.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(FLAGS.first_layer, activation="relu"),
    tf.keras.layers.Dense(FLAGS.second_layer, activation="relu"),
    tf.keras.layers.Dense(10)
])
model.compile(
    optimizer=tf.keras.optimizers.Adam(FLAGS.learning_rate),
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=[tf.keras.metrics.SparseCategoricalAccuracy()],
)

# run training
train_data = tf.data.experimental.load(FLAGS.data_folder)
model.fit(
    train_data.shuffle(1000).batch(128).prefetch(tf.data.AUTOTUNE),
    epochs=FLAGS.epochs_num
)

# save model and variables
model_path = os.path.join(FLAGS.model_folder, "mnist_tf_model")
model.save(model_path)
print("current working directory : ", os.getcwd())
print("model folder : ", model_path)

Writing script/train_experiment.py
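The script above relies on ```autolog()```. If you prefer the manual logging shown in its comments, one option is to replay the per-epoch values after training. The following is a sketch under assumptions: the helper ```log_history``` is hypothetical, and it expects a dict of per-epoch lists, which is the shape of the ```history``` attribute on the object returned by Keras ```model.fit()```. The logging function is passed in, so the example runs here with a stand-in instead of MLflow.

```python
def log_history(history, log_metrics):
    """Send per-epoch metrics to a logging callable.

    history: dict mapping a metric name to a list of per-epoch values.
    log_metrics: callable taking (metrics_dict, step=int),
                 for example mlflow.log_metrics.
    Returns the number of epochs that were logged.
    """
    num_epochs = max((len(values) for values in history.values()), default=0)
    for epoch in range(num_epochs):
        metrics = {
            name: float(values[epoch])
            for name, values in history.items()
            if epoch < len(values)
        }
        log_metrics(metrics, step=epoch)
    return num_epochs


# Stand-in logger so that the sketch runs without MLflow installed.
recorded = []


def fake_logger(metrics, step=None):
    recorded.append((step, metrics))


logged = log_history({"loss": [0.5, 0.3], "accuracy": [0.8, 0.9]}, fake_logger)
print(logged, recorded)
```

In the training script, this would be called as ```log_history(history.history, mlflow.log_metrics)``` with ```history = model.fit(...)```. Use either this or ```autolog()```, not both, to avoid recording the same metrics twice.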


## Train on remote VM

As you learned in "[Exercise04 : Train on Remote GPU Virtual Machine](./exercise04_train_remote.ipynb)", run this script on AML remote compute.<br>
(Here we use a general-purpose CPU machine instead of a GPU machine.)

1. Create AML compute.

> Note : By setting an appropriate duration in the ```--idle-time-before-scale-down``` option, you can keep the cluster from scaling down right after the training has finished. (Otherwise, it scales down 120 seconds after the training has finished, and the next training will be slow to start because the cluster must be resized.)

In [4]:
!az ml compute create --name myvm02 \
  --resource-group $my_resource_group \
  --workspace-name $my_workspace \
  --type amlcompute \
  --min-instances 0 \
  --max-instances 1 \
  --size Standard_D2_v2

{
  "id": "/subscriptions/b3ae1c15-4fef-4362-8c3a-5d804cdeb18d/resourceGroups/rg-AML/providers/Microsoft.MachineLearningServices/workspaces/ws01/computes/myvm02",
  "idle_time_before_scale_down": 120,
  "location": "eastus",
  "max_instances": 1,
  "min_instances": 0,
  "name": "myvm02",
  "network_settings": {},
  "provisioning_state": "Succeeded",
  "resourceGroup": "rg-AML",
  "size": "STANDARD_D2_V2",
  "ssh_public_access_enabled": true,
  "tier": "dedicated",
  "type": "amlcompute"
}
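The JSON above can also be checked programmatically, for example before submitting a job. The following sketch parses a few of the fields shown in the output. It assumes that you have captured the JSON text yourself (for instance from ```az ml compute show --name myvm02```) and that it has the same fields as above. The helper name ```compute_summary``` is hypothetical.

```python
import json


def compute_summary(compute_json):
    """Extract readiness and scale-down settings from compute JSON text."""
    info = json.loads(compute_json)
    return {
        "name": info["name"],
        "ready": info.get("provisioning_state") == "Succeeded",
        "scales_to_zero": info.get("min_instances") == 0,
        "idle_seconds": info.get("idle_time_before_scale_down"),
    }


# A trimmed copy of the output shown above.
sample = """
{
  "idle_time_before_scale_down": 120,
  "max_instances": 1,
  "min_instances": 0,
  "name": "myvm02",
  "provisioning_state": "Succeeded"
}
"""
summary = compute_summary(sample)
print(summary)
```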

2. Create a custom environment.<br>
As mentioned above, MLflow tracking is configured automatically by AML CLI v2. For MLflow logging, the ```mlflow``` and ```azureml-mlflow``` packages must be installed in the environment as follows.<br>
Here I create my own environment, but you can also use an AML built-in environment (curated environment), in which MLflow is already installed.

In [5]:
%%writefile 06_conda_pydata_for_logging.yml
name: project_environment
dependencies:
- python=3.8
- pip:
  - tensorflow==2.10.0
  - mlflow
  - azureml-mlflow
channels:
- anaconda
- conda-forge

Writing 06_conda_pydata_for_logging.yml
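Because logging fails if either package is missing, it can be worth checking the dependency list before registering the environment. The following sketch works on the conda specification expressed as a Python dict that mirrors the file above. Loading the YAML file itself is left out, and the helper name ```missing_pip_packages``` is hypothetical.

```python
import re


def missing_pip_packages(conda_spec, required):
    """Return the required pip packages that the conda spec does not list."""
    installed = set()
    for dep in conda_spec.get("dependencies", []):
        # The pip section is a nested dict: {"pip": [...]}.
        if isinstance(dep, dict):
            for requirement in dep.get("pip", []):
                # Drop version specifiers such as "==2.10.0" or ">=1.0".
                name = re.split(r"[=<>!~\[ ]", requirement, maxsplit=1)[0]
                installed.add(name.lower())
    return [pkg for pkg in required if pkg.lower() not in installed]


spec = {
    "name": "project_environment",
    "dependencies": [
        "python=3.8",
        {"pip": ["tensorflow==2.10.0", "mlflow", "azureml-mlflow"]},
    ],
    "channels": ["anaconda", "conda-forge"],
}
missing = missing_pip_packages(spec, ["mlflow", "azureml-mlflow"])
print(missing)
```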


In [6]:
%%writefile 06_env_register.yml
$schema: https://azuremlschemas.azureedge.net/latest/environment.schema.json
name: test-remote-cpu-env-for-logging
image: mcr.microsoft.com/azureml/openmpi4.1.0-ubuntu20.04
conda_file: 06_conda_pydata_for_logging.yml
description: This is example

Writing 06_env_register.yml


In [7]:
!az ml environment create --file 06_env_register.yml \
  --resource-group $my_resource_group \
  --workspace-name $my_workspace

{
  "conda_file": {
    "channels": [
      "anaconda",
      "conda-forge"
    ],
    "dependencies": [
      "python=3.8",
      {
        "pip": [
          "tensorflow==2.10.0",
          "mlflow",
          "azureml-mlflow"
        ]
      }
    ],
    "name": "project_environment"
  },
  "creation_context": {
    "created_at": "2022-10-04T06:05:58.628838+00:00",
    "created_by": "Tsuyoshi Matsuzaki",
    "created_by_type": "User",
    "last_modified_at": "2022-10-04T06:05:58.628838+00:00",
    "last_modified_by": "Tsuyoshi Matsuzaki",
    "last_modified_by_type": "User"
  },
  "description": "This is example",
  "id": "azureml:/subscriptions/b3ae1c15-4fef-4362-8c3a-5d804cdeb18d/resourceGroups/rg-AML/providers/Microsoft.MachineLearningServices/workspaces/ws01/environments/test-remote-cpu-env-for-logging/versions/1",
  "image": "mcr.microsoft.com/azureml/openmpi4.1.0-ubuntu20.04",
  "name": "test-remote-cpu-env-for-logging",
  "os_type": "linux",
  "

3. Run the script in the custom environment above, in which ```mlflow``` and ```azureml-mlflow``` are installed.

> Note : This example also uses the registered data asset named ```mnist_data```, which is mounted on the compute target. Run "[Exercise02 : Prepare Data](./exercise02_prepare_data.ipynb)" beforehand to prepare the data.

In [8]:
%%writefile 06_train_experiment_job.yml
$schema: https://azuremlschemas.azureedge.net/latest/commandJob.schema.json
code: script
command: >-
  python train_experiment.py
  --data_folder ${{inputs.mnist_tf}}/train
inputs:
  mnist_tf:
    type: uri_folder
    path: azureml:mnist_data@latest
environment: azureml:test-remote-cpu-env-for-logging@latest
compute: azureml:myvm02
display_name: tf_remote_experiment02
experiment_name: tf_remote_experiment02
description: This is example

Writing 06_train_experiment_job.yml


In [9]:
!az ml job create --file 06_train_experiment_job.yml \
  --resource-group $my_resource_group \
  --workspace-name $my_workspace

{
  "code": "azureml:/subscriptions/b3ae1c15-4fef-4362-8c3a-5d804cdeb18d/resourceGroups/rg-AML/providers/Microsoft.MachineLearningServices/workspaces/ws01/codes/7c6cb5a7-40e7-4e84-9627-a794d3e2e9a5/versions/1",
  "command": "python train_experiment.py --data_folder ${{inputs.mnist_tf}}/train",
  "compute": "azureml:myvm02",
  "creation_context": {
    "created_at": "2022-10-04T06:06:54.599550+00:00",
    "created_by": "Tsuyoshi Matsuzaki",
    "created_by_type": "User"
  },
  "description": "This is example",
  "display_name": "tf_remote_experiment02",
  "environment": "azureml:test-remote-cpu-env-for-logging:1",
  "environment_variables": {},
  "experiment_name": "tf_remote_experiment02",
  "id": "azureml:/subscriptions/b3ae1c15-4fef-4362-8c3a-5d804cdeb18d/resourceGroups/rg-AML/providers/Microsoft.MachineLearningServices/workspaces/ws01/jobs/sincere_melon_x3xb0svbks",
  "inputs": {
    "mnist_tf": {
      "mode": "ro_mount",
      "path": "azureml:mnist_data:1",
      "type": "uri_fol

## See logs in AML Studio UI

Go to [AML Studio UI](https://ml.azure.com/).<br>
Click "Jobs" and select "tf_remote_experiment02". You can then see the recorded metrics as follows.

![AML Experiment Metrics](https://tsmatz.github.io/images/github/azure-ml-tensorflow-complete-sample/20220225_Experiment_Metrics.jpg)
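To follow the same job from the command line instead of the Studio UI, you need the job name, which is the last segment of the ```id``` field in the ```az ml job create``` output. The following sketch extracts it. The helper name ```job_name_from_id``` is hypothetical, and the subscription ID in the sample is a placeholder.

```python
def job_name_from_id(job_id):
    """Return the job name from an AML job resource ID."""
    _, separator, tail = job_id.rpartition("/jobs/")
    name = tail.split("/", 1)[0]
    if not separator or not name:
        raise ValueError("not a job resource ID: " + job_id)
    return name


sample_id = (
    "azureml:/subscriptions/<subscription-id>/resourceGroups/rg-AML"
    "/providers/Microsoft.MachineLearningServices/workspaces/ws01"
    "/jobs/sincere_melon_x3xb0svbks"
)
job_name = job_name_from_id(sample_id)
print(job_name)
```

The resulting name can then be passed to commands such as ```az ml job show --name``` or ```az ml job stream --name```.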

## Remove AML compute

**You don't need to remove your AML compute** to save money, because the nodes are automatically terminated when the cluster is idle.<br>
But if you want to clean up, run the following.

In [18]:
!az ml compute delete --name myvm02 \
  --resource-group $my_resource_group \
  --workspace-name $my_workspace \
  --yes

Deleting compute myvm02 
.....................................Done.
(3m 7s)
