# Exercise06 : Track Logs and Metrics

Here we add logging to our source code, then check the collected logs and metrics.

*back to [index](https://github.com/tsmatz/azureml-tutorial/)*

## Initialize MLClient

Replace the bracketed strings below with your subscription ID, resource group name, and AML workspace name.<br>
(Note that creating ```MLClient``` does not connect to the AML workspace. Client initialization is lazy, so the connection is made on the first call that needs it.)

In [1]:
from azure.ai.ml import MLClient
from azure.identity import DeviceCodeCredential, TokenCachePersistenceOptions

# When you run on remote
cache_opt = TokenCachePersistenceOptions(allow_unencrypted_storage=True)
cred = DeviceCodeCredential(cache_persistence_options=cache_opt)

# # When you run on Azure ML Notebook
# from azure.identity import DefaultAzureCredential
# cred = DefaultAzureCredential()

# Get a handle to the workspace
ml_client = MLClient(
    credential=cred,
    subscription_id="{SUBSCRIPTION ID}",
    resource_group_name="{RESOURCE GROUP NAME}",
    workspace_name="{AML WORKSPACE NAME}",
)
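
Because initialization is lazy, a wrong subscription ID or workspace name only surfaces on the first real request. The following is a minimal sketch of forcing that request early. `check_workspace` is a hypothetical helper name (not part of the SDK), and the sketch assumes that `ml_client.workspaces.get()` and the `workspace_name` property behave as in the `azure-ai-ml` v2 SDK.

```python
def check_workspace(ml_client):
    """Force the first service call so credential or naming errors show up early.

    `check_workspace` is a hypothetical helper, not part of azure-ai-ml.
    """
    # MLClient is lazy: this is the first call that actually contacts the workspace.
    ws = ml_client.workspaces.get(ml_client.workspace_name)
    return ws.name, ws.location
```

Calling `check_workspace(ml_client)` right after the cell above should trigger the device-code sign-in immediately, instead of at the first compute or job request.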

## Change your source code for experimentation logging

When a job is submitted with Azure Machine Learning CLI v2 (or the Python SDK v2 used in this notebook), **the MLflow tracking URI and experiment name are set automatically, and MLflow logging is redirected to your AML workspace**.<br>
Therefore, change your source code in "[Exercise03 : Just Train in Your Working Machine](./exercise03_train_simple.ipynb)" to track logs and metrics with MLflow as follows. (The lines marked with "```##### Modified```" are the modified ones.)

> Note : For details about MLflow and Azure ML integration, see [this repository](https://github.com/tsmatz/mlflow-azureml).
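
To see what "automatically set" means in practice, the sketch below lists the MLflow-related environment variables visible to a process. The variable names are standard MLflow settings. That an AML job populates all three of them is an assumption of this sketch, and `mlflow_settings` is a hypothetical helper name.

```python
import os

# Standard MLflow environment variables. That an AML job sets all three is an
# assumption of this sketch; print them inside a job to confirm.
MLFLOW_VARS = ("MLFLOW_TRACKING_URI", "MLFLOW_EXPERIMENT_NAME", "MLFLOW_RUN_ID")


def mlflow_settings(env=None):
    """Return the MLflow-related variables that are set in `env`."""
    env = os.environ if env is None else env
    return {name: env[name] for name in MLFLOW_VARS if name in env}


if __name__ == "__main__":
    settings = mlflow_settings()
    if settings:
        for name, value in settings.items():
            print(f"{name} = {value}")
    else:
        print("No MLflow variables are set in this process.")
```

Adding these lines to the top of the training script would print the values into the job's standard output, which is one way to confirm where the metrics are being sent.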

In [2]:
import os
script_folder = './script'
os.makedirs(script_folder, exist_ok=True)

In [3]:
%%writefile script/train_experiment.py
import os
import argparse
import tensorflow as tf

import mlflow ##### Modified
mlflow.tensorflow.autolog() ##### Modified

### You can also manually log as follows (Here we use autolog())
# mlflow.log_params({
#     'learning_rate': FLAGS.learning_rate,
#     '1st_layer': FLAGS.first_layer,
#     '2nd_layer': FLAGS.second_layer})
# mlflow.log_metrics(
#     {'training_accuracy': result_accuracy, 'training_loss': result_loss},
#     step=result_step)

# device test
print("##### List of available GPU #####")
print(tf.config.list_physical_devices("GPU"))

# parse arguments
parser = argparse.ArgumentParser()
parser.add_argument(
    "--data_folder",
    type=str,
    default="./data/train",
    help="Folder path for input data")
parser.add_argument(
    "--model_folder",
    type=str,
    default="./outputs",  # AML experiments outputs folder
    help="Folder path for model output")
parser.add_argument(
    "--learning_rate",
    type=float,
    default=0.001,
    help="Learning Rate")
parser.add_argument(
    "--first_layer",
    type=int,
    default=128,
    help="Neuron number for the first hidden layer")
parser.add_argument(
    "--second_layer",
    type=int,
    default=64,
    help="Neuron number for the second hidden layer")
parser.add_argument(
    "--epochs_num",
    type=int,
    default=6,
    help="Number of epochs")
FLAGS, unparsed = parser.parse_known_args()

# build model
model = tf.keras.models.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(FLAGS.first_layer, activation="relu"),
    tf.keras.layers.Dense(FLAGS.second_layer, activation="relu"),
    tf.keras.layers.Dense(10)
])
model.compile(
    optimizer=tf.keras.optimizers.Adam(FLAGS.learning_rate),
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=[tf.keras.metrics.SparseCategoricalAccuracy()],
)

# run training
train_data = tf.data.experimental.load(FLAGS.data_folder)
model.fit(
    train_data.shuffle(1000).batch(128).prefetch(tf.data.AUTOTUNE),
    epochs=FLAGS.epochs_num
)

# save model and variables
model_path = os.path.join(FLAGS.model_folder, "mnist_tf_model")
model.save(model_path)
print("current working directory : ", os.getcwd())
print("model folder : ", model_path)

Writing script/train_experiment.py
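
The commented-out block in the script shows manual logging with `mlflow.log_params` and `mlflow.log_metrics`. As a minimal sketch of that alternative, the helper below replays a Keras-style history dictionary as one logging call per epoch. `log_history` is a hypothetical name, and the logging function is passed in as a parameter so the helper can be tried without MLflow installed.

```python
def log_history(history, log_fn):
    """Replay a Keras-style history dict as one log call per epoch.

    history: e.g. model.fit(...).history, a dict of metric name -> list of values.
    log_fn:  a callable with the signature of mlflow.log_metrics(metrics, step=...).
    """
    num_epochs = max((len(values) for values in history.values()), default=0)
    for epoch in range(num_epochs):
        # Collect every metric that has a value for this epoch.
        metrics = {
            name: float(values[epoch])
            for name, values in history.items()
            if epoch < len(values)
        }
        log_fn(metrics, step=epoch)
    return num_epochs
```

In the training script this would be used as `log_history(model.fit(...).history, mlflow.log_metrics)`, in place of the `mlflow.tensorflow.autolog()` call rather than in addition to it.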


## Train on remote VM

As you learned in "[Exercise04 : Train on Remote GPU Virtual Machine](./exercise04_train_remote.ipynb)", run this script on AML remote compute.<br>
(Here we use a general-purpose CPU machine instead of a GPU machine.)

1. Create AML compute.

> Note : By setting an appropriate duration in the ```idle_time_before_scale_down``` parameter, you can keep the cluster from scaling down right after training finishes. (Otherwise, it scales down 120 seconds after training has finished, and the next training will be slow to start because the cluster must be resized.)

In [4]:
from azure.ai.ml.entities import AmlCompute

try:
    compute_target = ml_client.compute.get("myvm02")
    print("found existing: ", compute_target.name)
except Exception:
    print("creating new.")
    compute_target = AmlCompute(
        name="myvm02",
        type="amlcompute",
        size="Standard_D2_v2",
        min_instances=0,
        max_instances=1,
        tier="Dedicated",
    )
    # begin_create_or_update() returns a poller; result() waits until provisioning completes
    compute_target = ml_client.begin_create_or_update(compute_target).result()

To sign in, use a web browser to open the page https://microsoft.com/devicelogin and enter the code AY43LWJGG to authenticate.
creating new.
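
As a sketch of the note above, the following builds the same compute settings with a longer idle timeout. `compute_settings` is a hypothetical helper name, and the 30-minute value is only an example. The sketch assumes that `AmlCompute` takes `idle_time_before_scale_down` in seconds.

```python
def compute_settings(name, idle_minutes):
    """Build keyword arguments for AmlCompute with a custom idle timeout."""
    return {
        "name": name,
        "type": "amlcompute",
        "size": "Standard_D2_v2",
        "min_instances": 0,
        "max_instances": 1,
        "tier": "Dedicated",
        # The timeout is given in seconds, so convert from minutes.
        "idle_time_before_scale_down": idle_minutes * 60,
    }


settings = compute_settings("myvm02", idle_minutes=30)
print(settings["idle_time_before_scale_down"])
# With azure-ai-ml installed: compute_target = AmlCompute(**settings)
```

A longer timeout trades a faster start for the next job against paying for idle nodes in the meantime.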


2. Create a custom environment.<br>
As mentioned above, MLflow tracking is configured automatically in AML CLI v2. For MLflow logging, the ```mlflow``` and ```azureml-mlflow``` packages must be installed in the environment as follows.<br>
Here I create a custom environment, but you can also use an AML built-in environment (curated environment) in which MLflow is already installed.

In [5]:
%%writefile 06_conda_pydata_for_logging.yml
name: project_environment
dependencies:
- python=3.8
- pip:
  - tensorflow==2.10.0
  - mlflow
  - azureml-mlflow
channels:
- anaconda
- conda-forge

Writing 06_conda_pydata_for_logging.yml


In [6]:
from azure.ai.ml.entities import Environment

myenv = Environment(
    name="test-remote-cpu-env-for-logging",
    description="This is example",
    conda_file="06_conda_pydata_for_logging.yml",
    image="mcr.microsoft.com/azureml/openmpi4.1.0-ubuntu20.04",
)
myenv = ml_client.environments.create_or_update(myenv)

Go to [AML Studio UI](https://ml.azure.com/) and click "Environments". Next, click the "Custom environments" tab and select the environment created above.<br>
Wait until the environment image build status shows as succeeded.

![Environment status](https://tsmatz.github.io/images/github/azure-ml-tensorflow-complete-sample/20221220_Environment_Status.jpg)

3. Run the script in the custom environment above, in which ```mlflow``` and ```azureml-mlflow``` are already installed.

> Note : In this example, I also use the registered data asset named ```mnist_data```, which is mounted on your compute target. Run "[Exercise02 : Prepare Data](./exercise02_prepare_data.ipynb)" for data preparation.

In [7]:
from azure.ai.ml import command, Input

# create the command
job = command(
    code="./script",
    command="python train_experiment.py --data_folder ${{inputs.mnist_tf}}/train",
    inputs={
        "mnist_tf": Input(
            type="uri_folder",
            path="mnist_data@latest",
        ),
    },
    environment="test-remote-cpu-env-for-logging@latest",
    compute="myvm02",
    display_name="tf_remote_experiment02",
    experiment_name="tf_remote_experiment02",
    description="This is example",
)

# submit the command
returned_job = ml_client.create_or_update(job)

Uploading script (0.01 MBs): 100%|████████████████████████████████████████| 6402/6402 [00:00<00:00, 134283.17it/s]

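Submitting the job returns immediately, while the training itself runs on the remote compute. A minimal sketch of waiting for the result from the notebook follows. `wait_for_job` is a hypothetical helper name, and the sketch assumes that `ml_client.jobs.stream()` and `ml_client.jobs.get()` behave as in the `azure-ai-ml` v2 SDK.

```python
def wait_for_job(ml_client, job_name):
    """Stream a job's logs until it finishes, then return its final status."""
    # jobs.stream() blocks and prints the job's log output as it arrives.
    ml_client.jobs.stream(job_name)
    # Re-fetch the job to read its terminal status (for example "Completed").
    return ml_client.jobs.get(job_name).status
```

With the cell above this would be called as `wait_for_job(ml_client, returned_job.name)`. The job page in AML Studio should also be reachable through `returned_job.studio_url`.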


## See logs in AML Studio UI

Go to [AML Studio UI](https://ml.azure.com/).<br>
Click "Jobs" and select "tf_remote_experiment02". You can then see the recorded metrics as follows.

![AML Experiment Metrics](https://tsmatz.github.io/images/github/azure-ml-tensorflow-complete-sample/20220225_Experiment_Metrics.jpg)
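
The same metrics can also be read programmatically instead of through the UI. The sketch below assumes an MLflow client object that exposes `get_metric_history()` and `get_run()`, as `mlflow.tracking.MlflowClient` does. `metric_series` and `final_metrics` are hypothetical helper names, and which metric keys exist (for example `loss`) depends on what autologging recorded.

```python
def metric_series(client, run_id, key):
    """Return [(step, value), ...] for one metric of a run, ordered by step."""
    history = client.get_metric_history(run_id, key)
    return sorted((m.step, m.value) for m in history)


def final_metrics(client, run_id):
    """Return the last logged value of every metric of a run."""
    return dict(client.get_run(run_id).data.metrics)
```

To use these against the workspace, point MLflow at it first, for example with `mlflow.set_tracking_uri(ml_client.workspaces.get(ml_client.workspace_name).mlflow_tracking_uri)`, and create `client = mlflow.tracking.MlflowClient()`. This sketch further assumes that the AML job name (`returned_job.name`) can be used as the MLflow run ID.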

## Remove AML compute

**You don't need to remove your AML compute** to save money, because its nodes are terminated automatically when the cluster is idle.<br>
If you still want to clean up, run the following.

In [8]:
ml_client.compute.begin_delete("myvm02")

Deleting compute myvm02 


............................................................

Done.
(5m 3s)

