# Exercise06 : Track Logs and Metrics

Here we add logging to our source code, then check the collected logs and metrics.

*back to [index](https://github.com/tsmatz/azureml-tutorial/)*

## Initialize MLClient

Replace the bracketed strings below with your subscription ID, resource group name, and AML workspace name.<br>
(Note that creating an ```MLClient``` does not connect to the AML workspace. Client initialization is lazy, and the connection is made on the first call that needs it.)

In [1]:
from azure.ai.ml import MLClient
from azure.identity import DeviceCodeCredential

# When you run on remote
cred = DeviceCodeCredential()

# # When you run on Azure ML Notebook
# from azure.identity import DefaultAzureCredential
# cred = DefaultAzureCredential()

# Get a handle to the workspace
ml_client = MLClient(
    credential=cred,
    subscription_id="{SUBSCRIPTION ID}",
    resource_group_name="{RESOURCE GROUP NAME}",
    workspace_name="{AML WORKSPACE NAME}",
)



## Change your source code for experimentation logging

When you submit a job with Azure Machine Learning CLI v2 (or Python SDK v2), **the MLflow tracking URI and experiment name are set automatically, so MLflow logging is directed to your AML workspace**.<br>
Therefore, change the source code from "[Exercise03 : Just Train in Your Working Machine](./exercise03_train_simple.ipynb)" to track logs and metrics with MLflow as follows. (The modified lines are marked with the comment "```##### Modified```".)

> Note : For details about MLflow and Azure ML integration, see [this repository](https://github.com/tsmatz/mlflow-azureml).
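
To make the automatic configuration more concrete, the following sketch lists the MLflow-related environment variables visible to a process. The four variable names are ones MLflow itself reads for its defaults; which of them Azure ML actually populates inside a job is an assumption here, and the helper name `mlflow_env_settings` is made up for illustration.

```python
import os

# Environment variables that MLflow reads for its defaults.
MLFLOW_ENV_KEYS = (
    "MLFLOW_TRACKING_URI",
    "MLFLOW_EXPERIMENT_NAME",
    "MLFLOW_EXPERIMENT_ID",
    "MLFLOW_RUN_ID",
)

def mlflow_env_settings(environ=None):
    """Return the MLflow-related variables that are set (hypothetical helper)."""
    environ = os.environ if environ is None else environ
    return {key: environ[key] for key in MLFLOW_ENV_KEYS if key in environ}

if __name__ == "__main__":
    settings = mlflow_env_settings()
    if settings:
        for key, value in settings.items():
            print(f"{key} = {value}")
    else:
        print("No MLflow variables are set in this process.")
```

Calling this at the top of a training script is one way to confirm, from the job's standard output, where MLflow will send its logs.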

In [2]:
import os
script_folder = './script'
os.makedirs(script_folder, exist_ok=True)

In [3]:
%%writefile script/train_experiment.py
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import sys
import os
import shutil
import argparse
import math

import tensorflow as tf

import mlflow ##### Modified

FLAGS = None
batch_size = 100

#
# define functions for Estimator
#

def _my_input_fn(filepath, num_epochs):
    # image - 784 (=28 x 28) grayscale pixel values, scaled to floats in [0, 1]
    # label - digit (0, 1, ..., 9)
    data_queue = tf.train.string_input_producer(
        [filepath],
        num_epochs = num_epochs) # data is repeated num_epochs times; OutOfRangeError is raised when it is exhausted
    data_reader = tf.TFRecordReader()
    _, serialized_exam = data_reader.read(data_queue)
    data_exam = tf.parse_single_example(
        serialized_exam,
        features={
            'image_raw': tf.FixedLenFeature([], tf.string),
            'label': tf.FixedLenFeature([], tf.int64)
        })
    data_image = tf.decode_raw(data_exam['image_raw'], tf.uint8)
    data_image.set_shape([784])
    data_image = tf.cast(data_image, tf.float32) * (1. / 255)
    data_label = tf.cast(data_exam['label'], tf.int32)
    data_batch_image, data_batch_label = tf.train.batch(
        [data_image, data_label],
        batch_size=batch_size)
    return {'inputs': data_batch_image}, data_batch_label

def _get_input_fn(filepath, num_epochs):
    return lambda: _my_input_fn(filepath, num_epochs)

def _my_model_fn(features, labels, mode):
    # with tf.device(...): # You can set device if using GPUs

    # define network and inference
    # (simple 2 fully connected hidden layer : 784->128->64->10)
    with tf.name_scope('hidden1'):
        weights = tf.Variable(
            tf.truncated_normal(
                [784, FLAGS.first_layer],
                stddev=1.0 / math.sqrt(float(784))),
            name='weights')
        biases = tf.Variable(
            tf.zeros([FLAGS.first_layer]),
            name='biases')
        hidden1 = tf.nn.relu(tf.matmul(features['inputs'], weights) + biases)
    with tf.name_scope('hidden2'):
        weights = tf.Variable(
            tf.truncated_normal(
                [FLAGS.first_layer, FLAGS.second_layer],
                stddev=1.0 / math.sqrt(float(FLAGS.first_layer))),
            name='weights')
        biases = tf.Variable(
            tf.zeros([FLAGS.second_layer]),
            name='biases')
        hidden2 = tf.nn.relu(tf.matmul(hidden1, weights) + biases)
    with tf.name_scope('softmax_linear'):
        weights = tf.Variable(
            tf.truncated_normal(
                [FLAGS.second_layer, 10],
                stddev=1.0 / math.sqrt(float(FLAGS.second_layer))),
            name='weights')
        biases = tf.Variable(
            tf.zeros([10]),
            name='biases')
        logits = tf.matmul(hidden2, weights) + biases
 
    # compute evaluation metrics
    predicted_indices = tf.argmax(input=logits, axis=1)
    if mode != tf.estimator.ModeKeys.PREDICT:
        label_indices = tf.cast(labels, tf.int32)
        accuracy = tf.metrics.accuracy(label_indices, predicted_indices)
        tf.summary.scalar('accuracy', accuracy[1]) # output to TensorBoard 
        loss = tf.losses.sparse_softmax_cross_entropy(
            labels=labels,
            logits=logits)
 
    # define operations
    if mode == tf.estimator.ModeKeys.TRAIN:
        #global_step = tf.train.create_global_step()
        #global_step = tf.contrib.framework.get_or_create_global_step()
        global_step = tf.train.get_or_create_global_step()        
        optimizer = tf.train.GradientDescentOptimizer(
            learning_rate=FLAGS.learning_rate)
        train_op = optimizer.minimize(
            loss=loss,
            global_step=global_step)
        # Ask for accuracy and loss at each step ##### Modified
        class _CustomLoggingHook(tf.train.SessionRunHook): ##### Modified
            def before_run(self, run_context): ##### Modified
                return tf.train.SessionRunArgs([accuracy[1], loss, global_step]) ##### Modified
            def after_run(self, run_context, run_values): ##### Modified
                result_accuracy, result_loss, result_step = run_values.results ##### Modified
                if result_step % 10 == 0 : ##### Modified
                    # cast NumPy scalars to plain Python types before logging
                    mlflow.log_metrics(
                        {'training_accuracy': float(result_accuracy), 'training_loss': float(result_loss)},
                        step=int(result_step)) ##### Modified
        return tf.estimator.EstimatorSpec(
            mode,
            training_chief_hooks=[_CustomLoggingHook()], ##### Modified
            loss=loss,
            train_op=train_op)
    if mode == tf.estimator.ModeKeys.EVAL:
        eval_metric_ops = {
            'accuracy': accuracy
        }
        return tf.estimator.EstimatorSpec(
            mode,
            loss=loss,
            eval_metric_ops=eval_metric_ops)
    if mode == tf.estimator.ModeKeys.PREDICT:
        probabilities = tf.nn.softmax(logits, name='softmax_tensor')
        predictions = {
            'classes': predicted_indices,
            'probabilities': probabilities
        }
        export_outputs = {
            'prediction': tf.estimator.export.PredictOutput(predictions)
        }
        return tf.estimator.EstimatorSpec(
            mode,
            predictions=predictions,
            export_outputs=export_outputs)

def _my_serving_input_fn():
    inputs = {'inputs': tf.placeholder(tf.float32, [None, 784])}
    return tf.estimator.export.ServingInputReceiver(inputs, inputs)

#
# Main
#

parser = argparse.ArgumentParser()
parser.add_argument(
    '--data_folder',
    type=str,
    default='./data',
    help='Folder path for input data')
parser.add_argument(
    '--chkpoint_folder',
    type=str,
    default='./logs',  # AML experiments logs folder
    help='Folder path for checkpoint files')
parser.add_argument(
    '--model_folder',
    type=str,
    default='./outputs',  # AML experiments outputs folder
    help='Folder path for model output')
parser.add_argument(
    '--learning_rate',
    type=float,
    default=0.07,
    help='Learning Rate')
parser.add_argument(
    '--first_layer',
    type=int,
    default=128,
    help='Neuron number for the first hidden layer')
parser.add_argument(
    '--second_layer',
    type=int,
    default=64,
    help='Neuron number for the second hidden layer')
FLAGS, unparsed = parser.parse_known_args()

# clean the checkpoint and model folders if they exist
if os.path.exists(FLAGS.chkpoint_folder) :
    for file_name in os.listdir(FLAGS.chkpoint_folder):
        file_path = os.path.join(FLAGS.chkpoint_folder, file_name)
        if os.path.isfile(file_path):
            os.remove(file_path)
        elif os.path.isdir(file_path):
            shutil.rmtree(file_path)
if os.path.exists(FLAGS.model_folder) :
    for file_name in os.listdir(FLAGS.model_folder):
        file_path = os.path.join(FLAGS.model_folder, file_name)
        if os.path.isfile(file_path):
            os.remove(file_path)
        elif os.path.isdir(file_path):
            shutil.rmtree(file_path)

# read TF_CONFIG
run_config = tf.estimator.RunConfig()

# create Estimator
mnist_fullyconnected_classifier = tf.estimator.Estimator(
    model_fn=_my_model_fn,
    model_dir=FLAGS.chkpoint_folder,
    config=run_config)
train_spec = tf.estimator.TrainSpec(
    input_fn=_get_input_fn(os.path.join(FLAGS.data_folder, 'train.tfrecords'), 2),
    max_steps=60000 * 2 // batch_size)  # integer step count
eval_spec = tf.estimator.EvalSpec(
    input_fn=_get_input_fn(os.path.join(FLAGS.data_folder, 'test.tfrecords'), 1),
    steps=10000 * 1 // batch_size,  # integer step count
    start_delay_secs=0)

# run !
eval_res = tf.estimator.train_and_evaluate(
    mnist_fullyconnected_classifier,
    train_spec,
    eval_spec
)

# save model and variables
model_dir = mnist_fullyconnected_classifier.export_savedmodel(
    export_dir_base = FLAGS.model_folder,
    serving_input_receiver_fn = _my_serving_input_fn)
print('current working directory is ', os.getcwd())
print('model is saved ', model_dir)

# send logs to AML ##### Modified
mlflow.log_params({
    'learning_rate': FLAGS.learning_rate,
    '1st_layer': FLAGS.first_layer,
    '2nd_layer': FLAGS.second_layer}) ##### Modified
mlflow.log_metrics({
    'final_accuracy': eval_res[0]['accuracy'],
    'final_loss': eval_res[0]['loss']}) ##### Modified

Writing script/train_experiment.py
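
The hook in the script sends metrics only on every 10th step, which keeps the number of logging calls small. The same pattern can be sketched without TensorFlow. In the block below, `ThrottledMetricLogger` and `fake_log_metrics` are hypothetical names used only for illustration; `fake_log_metrics` stands in for `mlflow.log_metrics` so that the sketch runs without a tracking server.

```python
class ThrottledMetricLogger:
    """Forward metrics to `log_fn` only on every `every_n`-th step."""

    def __init__(self, log_fn, every_n=10):
        self._log_fn = log_fn
        self._every_n = every_n

    def __call__(self, step, **metrics):
        step = int(step)  # NumPy integers become plain int
        if step % self._every_n != 0:
            return False
        # Cast values to plain floats before handing them to the logger.
        self._log_fn(
            {name: float(value) for name, value in metrics.items()},
            step=step)
        return True


# Stand-in for mlflow.log_metrics(metrics, step=...).
recorded = []

def fake_log_metrics(metrics, step=None):
    recorded.append((step, metrics))


log = ThrottledMetricLogger(fake_log_metrics, every_n=10)
for step in range(1, 31):
    log(step, training_loss=1.0 / step)

print([step for step, _ in recorded])  # [10, 20, 30]
```

In a real run you would pass `mlflow.log_metrics` as `log_fn` instead of the stand-in.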


## Train on remote VM

As you learned in "[Exercise04 : Train on Remote GPU Virtual Machine](./exercise04_train_remote.ipynb)", run this script on AML remote compute.<br>
(Here we use a general-purpose CPU machine instead of a GPU machine.)

1. Create AML compute.

> Note : By setting an appropriate duration in the ```idle_time_before_scale_down``` parameter, you can keep the cluster from scaling down right after training has finished. (Otherwise, it scales down 120 seconds after the training has finished, and the next training will be slow to start because of cluster resizing.)

In [4]:
from azure.ai.ml.entities import AmlCompute

try:
    compute_target = ml_client.compute.get("myvm02")
    print("found existing: ", compute_target.name)
except Exception:
    print("creating new.")
    compute_target = AmlCompute(
        name="myvm02",
        type="amlcompute",
        size="Standard_D2_v2",
        min_instances=0,
        max_instances=1,
        tier="Dedicated",
    )
    # begin_create_or_update returns a poller; result() waits for the cluster
    compute_target = ml_client.begin_create_or_update(compute_target).result()

To sign in, use a web browser to open the page https://microsoft.com/devicelogin and enter the code EV2WAYL8R to authenticate.
creating new.
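
The note on ```idle_time_before_scale_down``` can be sketched as follows. The settings mirror the compute cell above, and only the idle time is new. The value of 30 minutes is an assumed example, not a recommendation, and the import is guarded only so that the sketch also runs where `azure-ai-ml` is not installed.

```python
# Assumed example value: keep idle nodes for 30 minutes (in seconds).
IDLE_SECONDS = 30 * 60

# Same cluster settings as in the cell above, plus the idle time.
compute_settings = dict(
    name="myvm02",
    size="Standard_D2_v2",
    min_instances=0,
    max_instances=1,
    tier="Dedicated",
    idle_time_before_scale_down=IDLE_SECONDS,
)

try:
    from azure.ai.ml.entities import AmlCompute
except ImportError:
    # Lets the sketch run where azure-ai-ml is not installed.
    AmlCompute = None

if AmlCompute is not None:
    compute_with_idle = AmlCompute(**compute_settings)
    # Submit with:
    # ml_client.compute.begin_create_or_update(compute_with_idle).result()
```

A longer idle time trades a little extra node time for faster starts when you submit several jobs in a row.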


2. Create a custom environment.<br>
As mentioned above, MLflow tracking is configured automatically for jobs submitted with AML CLI v2. For MLflow logging, the ```mlflow``` and ```azureml-mlflow``` packages must be installed in the environment as follows.

In [5]:
%%writefile 06_conda_pydata_for_logging.yml
name: project_environment
dependencies:
- python=3.6
- pip:
  - tensorflow==1.15
  - mlflow
  - azureml-mlflow
channels:
- anaconda
- conda-forge

Writing 06_conda_pydata_for_logging.yml


In [6]:
from azure.ai.ml.entities import Environment

myenv = Environment(
    name="test-remote-cpu-env-for-logging",
    description="This is example",
    conda_file="06_conda_pydata_for_logging.yml",
    image="mcr.microsoft.com/azureml/openmpi3.1.2-ubuntu18.04",
)
myenv = ml_client.environments.create_or_update(myenv)

3. Run the script in the custom environment above, in which ```mlflow``` and ```azureml-mlflow``` are already installed.

> Note : This example also uses the registered data asset (train.tfrecords, test.tfrecords) named ```mnist_tfrecords_data```, which is mounted on your compute target. Run "[Exercise02 : Prepare Data](./exercise02_prepare_data.ipynb)" beforehand to prepare the data.

In [7]:
from azure.ai.ml import command, Input

# create the command
job = command(
    code="./script",
    command="python train_experiment.py --data_folder ${{inputs.mnist_tf}}",
    inputs={
        "mnist_tf": Input(
            type="uri_folder",
            path="mnist_tfrecords_data@latest",
        ),
    },
    environment="test-remote-cpu-env-for-logging@latest",
    compute="myvm02",
    display_name="tf_remote_experiment02",
    experiment_name="tf_remote_experiment02",
    description="This is example",
)

# submit the command
returned_job = ml_client.create_or_update(job)

Uploading script (0.02 MBs): 100%|██████████| 22932/22932 [00:00<00:00, 639592.10it/s]



## See logs in AML Studio UI

Go to [AML Studio UI](https://ml.azure.com/).<br>
Click "Jobs" and select "tf_remote_experiment02". You can then see the recorded metrics as follows.

![AML Experiment Metrics](https://tsmatz.github.io/images/github/azure-ml-tensorflow-complete-sample/20220225_Experiment_Metrics.jpg)
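
The same metrics can also be read programmatically. The sketch below assumes a client object that behaves like `mlflow.tracking.MlflowClient`, whose `get_metric_history(run_id, key)` returns entries with `step` and `value` attributes. The helper `metric_series`, the stand-in classes, and the run id are hypothetical and exist only so that the block runs without a workspace.

```python
from collections import namedtuple


def metric_series(client, run_id, key):
    """Return [(step, value), ...] for one metric, sorted by step."""
    history = client.get_metric_history(run_id, key)
    return sorted((m.step, m.value) for m in history)


# Stand-ins for MlflowClient and its Metric entries.
FakeMetric = namedtuple("FakeMetric", ["step", "value"])


class FakeClient:
    def get_metric_history(self, run_id, key):
        # Deliberately unsorted to show that the helper orders by step.
        return [FakeMetric(20, 0.91), FakeMetric(10, 0.85)]


series = metric_series(FakeClient(), "hypothetical-run-id", "training_accuracy")
print(series)  # [(10, 0.85), (20, 0.91)]
```

Against a real workspace, you would first point MLflow at the workspace tracking URI and pass an `MlflowClient()` instead of the stand-in. Using the job name (`returned_job.name`) as the run id is an assumption to verify in your own workspace.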

## Remove AML compute

**You don't need to remove your AML compute** to save money, because the nodes are terminated automatically when they are inactive.<br>
But if you want to clean up, run the following.

In [8]:
ml_client.compute.begin_delete("myvm02")

Deleting compute myvm02 


............................................................

Done.
(5m 3s)

