# Exercise07 : Hyperparameter Tuning (Sweep Job)

AML provides a framework-independent hyperparameter tuning capability.<br>
You can quickly search for optimal parameters by scaling out training workloads. This capability works with the metrics logged to AML.

*back to [index](https://github.com/tsmatz/azureml-tutorial/)*

## Variable Settings

Replace the bracketed strings below to set the required variables.

In [1]:
my_resource_group = "{AML-RESOURCE-GROUP-NAME}"
my_workspace = "{AML-WORKSPACE-NAME}"

## Save your training code

First, you must save your training code.<br>
Here I use the source code from "[Exercise06 : Track Logs and Metrics](./exercise06_experimentation.ipynb)", which sends logs to the AML run history. (These metrics are tracked by the hyperparameter tuning (sweep) job.)

Create the ```script``` directory.

In [2]:
import os
script_folder = './script'
os.makedirs(script_folder, exist_ok=True)

Save the source code as ```./script/train_experiment.py```.<br>
This is exactly the same source code as the one in "[Exercise06 : Track Logs and Metrics](./exercise06_experimentation.ipynb)".

In [3]:
%%writefile script/train_experiment.py
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import sys
import os
import shutil
import argparse
import math

import tensorflow as tf

import mlflow

FLAGS = None
batch_size = 100

#
# define functions for Estimator
#

def _my_input_fn(filepath, num_epochs):
    # image - 784 (=28 x 28) grayscale pixel values, scaled to floats in [0, 1]
    # label - digit (0, 1, ..., 9)
    data_queue = tf.train.string_input_producer(
        [filepath],
        num_epochs = num_epochs) # data is repeated and it raises OutOfRange when data is over
    data_reader = tf.TFRecordReader()
    _, serialized_exam = data_reader.read(data_queue)
    data_exam = tf.parse_single_example(
        serialized_exam,
        features={
            'image_raw': tf.FixedLenFeature([], tf.string),
            'label': tf.FixedLenFeature([], tf.int64)
        })
    data_image = tf.decode_raw(data_exam['image_raw'], tf.uint8)
    data_image.set_shape([784])
    data_image = tf.cast(data_image, tf.float32) * (1. / 255)
    data_label = tf.cast(data_exam['label'], tf.int32)
    data_batch_image, data_batch_label = tf.train.batch(
        [data_image, data_label],
        batch_size=batch_size)
    return {'inputs': data_batch_image}, data_batch_label

def _get_input_fn(filepath, num_epochs):
    return lambda: _my_input_fn(filepath, num_epochs)

def _my_model_fn(features, labels, mode):
    # with tf.device(...): # You can set device if using GPUs

    # define network and inference
    # (simple 2 fully connected hidden layer : 784->128->64->10)
    with tf.name_scope('hidden1'):
        weights = tf.Variable(
            tf.truncated_normal(
                [784, FLAGS.first_layer],
                stddev=1.0 / math.sqrt(float(784))),
            name='weights')
        biases = tf.Variable(
            tf.zeros([FLAGS.first_layer]),
            name='biases')
        hidden1 = tf.nn.relu(tf.matmul(features['inputs'], weights) + biases)
    with tf.name_scope('hidden2'):
        weights = tf.Variable(
            tf.truncated_normal(
                [FLAGS.first_layer, FLAGS.second_layer],
                stddev=1.0 / math.sqrt(float(FLAGS.first_layer))),
            name='weights')
        biases = tf.Variable(
            tf.zeros([FLAGS.second_layer]),
            name='biases')
        hidden2 = tf.nn.relu(tf.matmul(hidden1, weights) + biases)
    with tf.name_scope('softmax_linear'):
        weights = tf.Variable(
            tf.truncated_normal(
                [FLAGS.second_layer, 10],
                stddev=1.0 / math.sqrt(float(FLAGS.second_layer))),
            name='weights')
        biases = tf.Variable(
            tf.zeros([10]),
            name='biases')
        logits = tf.matmul(hidden2, weights) + biases
 
    # compute evaluation metrics
    predicted_indices = tf.argmax(input=logits, axis=1)
    if mode != tf.estimator.ModeKeys.PREDICT:
        label_indices = tf.cast(labels, tf.int32)
        accuracy = tf.metrics.accuracy(label_indices, predicted_indices)
        tf.summary.scalar('accuracy', accuracy[1]) # output to TensorBoard 
        loss = tf.losses.sparse_softmax_cross_entropy(
            labels=labels,
            logits=logits)
 
    # define operations
    if mode == tf.estimator.ModeKeys.TRAIN:
        #global_step = tf.train.create_global_step()
        #global_step = tf.contrib.framework.get_or_create_global_step()
        global_step = tf.train.get_or_create_global_step()        
        optimizer = tf.train.GradientDescentOptimizer(
            learning_rate=FLAGS.learning_rate)
        train_op = optimizer.minimize(
            loss=loss,
            global_step=global_step)
        # Ask for accuracy and loss in each step
        class _CustomLoggingHook(tf.train.SessionRunHook):
            def before_run(self, run_context):
                return tf.train.SessionRunArgs([accuracy[1], loss, global_step])
            def after_run(self, run_context, run_values):
                result_accuracy, result_loss, result_step = run_values.results
                if result_step % 10 == 0 :
                    mlflow.log_metrics(
                        {'training_accuracy': result_accuracy, 'training_loss': result_loss},
                        step=result_step)
        return tf.estimator.EstimatorSpec(
            mode,
            training_chief_hooks=[_CustomLoggingHook()],
            loss=loss,
            train_op=train_op)
    if mode == tf.estimator.ModeKeys.EVAL:
        eval_metric_ops = {
            'accuracy': accuracy
        }
        return tf.estimator.EstimatorSpec(
            mode,
            loss=loss,
            eval_metric_ops=eval_metric_ops)
    if mode == tf.estimator.ModeKeys.PREDICT:
        probabilities = tf.nn.softmax(logits, name='softmax_tensor')
        predictions = {
            'classes': predicted_indices,
            'probabilities': probabilities
        }
        export_outputs = {
            'prediction': tf.estimator.export.PredictOutput(predictions)
        }
        return tf.estimator.EstimatorSpec(
            mode,
            predictions=predictions,
            export_outputs=export_outputs)

def _my_serving_input_fn():
    inputs = {'inputs': tf.placeholder(tf.float32, [None, 784])}
    return tf.estimator.export.ServingInputReceiver(inputs, inputs)

#
# Main
#

parser = argparse.ArgumentParser()
parser.add_argument(
    '--data_folder',
    type=str,
    default='./data',
    help='Folder path for input data')
parser.add_argument(
    '--chkpoint_folder',
    type=str,
    default='./logs',  # AML experiments logs folder
    help='Folder path for checkpoint files')
parser.add_argument(
    '--model_folder',
    type=str,
    default='./outputs',  # AML experiments outputs folder
    help='Folder path for model output')
parser.add_argument(
    '--learning_rate',
    type=float,
    default='0.07',
    help='Learning Rate')
parser.add_argument(
    '--first_layer',
    type=int,
    default='128',
    help='Neuron number for the first hidden layer')
parser.add_argument(
    '--second_layer',
    type=int,
    default='64',
    help='Neuron number for the second hidden layer')
FLAGS, unparsed = parser.parse_known_args()

# clean checkpoint and model folder if exists
if os.path.exists(FLAGS.chkpoint_folder) :
    for file_name in os.listdir(FLAGS.chkpoint_folder):
        file_path = os.path.join(FLAGS.chkpoint_folder, file_name)
        if os.path.isfile(file_path):
            os.remove(file_path)
        elif os.path.isdir(file_path):
            shutil.rmtree(file_path)
if os.path.exists(FLAGS.model_folder) :
    for file_name in os.listdir(FLAGS.model_folder):
        file_path = os.path.join(FLAGS.model_folder, file_name)
        if os.path.isfile(file_path):
            os.remove(file_path)
        elif os.path.isdir(file_path):
            shutil.rmtree(file_path)

# read TF_CONFIG
run_config = tf.estimator.RunConfig()

# create Estimator
mnist_fullyconnected_classifier = tf.estimator.Estimator(
    model_fn=_my_model_fn,
    model_dir=FLAGS.chkpoint_folder,
    config=run_config)
train_spec = tf.estimator.TrainSpec(
    input_fn=_get_input_fn(os.path.join(FLAGS.data_folder, 'train.tfrecords'), 2),
    max_steps=60000 * 2 / batch_size)
eval_spec = tf.estimator.EvalSpec(
    input_fn=_get_input_fn(os.path.join(FLAGS.data_folder, 'test.tfrecords'), 1),
    steps=10000 * 1 / batch_size,
    start_delay_secs=0)

# run !
eval_res = tf.estimator.train_and_evaluate(
    mnist_fullyconnected_classifier,
    train_spec,
    eval_spec
)

# save model and variables
model_dir = mnist_fullyconnected_classifier.export_savedmodel(
    export_dir_base = FLAGS.model_folder,
    serving_input_receiver_fn = _my_serving_input_fn)
print('current working directory is ', os.getcwd())
print('model is saved ', model_dir)

# send logs to AML
mlflow.log_params({
    'learning_rate': FLAGS.learning_rate,
    '1st_layer': FLAGS.first_layer,
    '2nd_layer': FLAGS.second_layer})
mlflow.log_metrics({
    'final_accuracy': eval_res[0]['accuracy'],
    'final_loss': eval_res[0]['loss']})

Overwriting script/train_experiment.py
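
The sweep job in this exercise passes each hyperparameter to the script as a command-line argument. As a minimal sketch (not part of the training script itself), the following shows how the three tunable arguments are parsed with ```argparse``` in the same way as above. The helper name ```parse_hyperparameters``` and the ```--extra``` argument are hypothetical and used only for illustration.

```python
import argparse

def parse_hyperparameters(argv):
    """Parse the three tunable arguments, mirroring train_experiment.py."""
    parser = argparse.ArgumentParser()
    parser.add_argument('--learning_rate', type=float, default=0.07)
    parser.add_argument('--first_layer', type=int, default=128)
    parser.add_argument('--second_layer', type=int, default=64)
    # parse_known_args() returns unrecognized arguments instead of failing
    flags, unparsed = parser.parse_known_args(argv)
    return flags, unparsed

# '--extra x' is a hypothetical argument that the parser does not know
flags, unparsed = parse_hyperparameters(
    ['--learning_rate', '0.05', '--first_layer', '125', '--extra', 'x'])
print(flags.learning_rate, flags.first_layer, flags.second_layer)
print(unparsed)
```

Arguments that are not given on the command line (here ```--second_layer```) fall back to their defaults, so the script also runs outside of a sweep job.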


## Create AML compute

Create an AML compute cluster as the computing environment.<br>
Here I create a cluster with a maximum of 4 instances so that the sweep job can scale out.

> Note : By setting an appropriate duration in the ```--idle-time-before-scale-down``` option, you can keep the cluster from scaling down right after a training has finished. (Otherwise, it scales down 120 seconds after the training has finished, and the next training is slow to start because the cluster must be resized.)

In [4]:
!az ml compute create --name hypertest01 \
  --resource-group $my_resource_group \
  --workspace-name $my_workspace \
  --type amlcompute \
  --min-instances 0 \
  --max-instances 4 \
  --size Standard_D2_v2

{
  "id": "/subscriptions/b3ae1c15-4fef-4362-8c3a-5d804cdeb18d/resourceGroups/AML-rg/providers/Microsoft.MachineLearningServices/workspaces/ws01/computes/hypertest01",
  "idle_time_before_scale_down": 120,
  "location": "eastus",
  "max_instances": 4,
  "min_instances": 0,
  "name": "hypertest01",
  "network_settings": {},
  "provisioning_state": "Succeeded",
  "resourceGroup": "AML-rg",
  "size": "STANDARD_D2_V2",
  "ssh_public_access_enabled": true,
  "tier": "dedicated",
  "type": "amlcompute"
}

## Create AML environment

As mentioned in "[Exercise06 : Track Logs and Metrics](./exercise06_experimentation.ipynb)", we should use an environment with ```mlflow``` and ```azureml-mlflow``` installed.

**If you have already created the custom environment in [Exercise06](./exercise06_experimentation.ipynb), you don't need to run the following commands.** (The custom environment already exists.)

In [None]:
%%writefile 06_conda_pydata_for_logging.yml
name: project_environment
dependencies:
- python=3.6
- pip:
  - tensorflow-gpu==1.15
  - mlflow
  - azureml-mlflow
channels:
- anaconda
- conda-forge

In [None]:
%%writefile 06_env_register.yml
$schema: https://azuremlschemas.azureedge.net/latest/environment.schema.json
name: test-remote-cpu-env-for-logging
image: mcr.microsoft.com/azureml/openmpi3.1.2-ubuntu18.04
conda_file: 06_conda_pydata_for_logging.yml
description: This is example

In [None]:
!az ml environment create --file 06_env_register.yml \
  --resource-group $my_resource_group \
  --workspace-name $my_workspace

## Submit a job with hyperparameter search

Now submit a job in which multiple training runs are executed with different hyperparameters.<br>
In this example, we monitor training accuracy across 3 arguments: ```--learning_rate```, ```--first_layer```, and ```--second_layer```. Each argument can take 3 different values (so 27 combinations are possible), but here I set a maximum of 20 trials, in which the argument values are picked at random.
<br>These trials are parallelized across the 4 nodes created above to speed up the search.
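
To make these numbers concrete, here is a minimal sketch that enumerates the search space and draws 20 random trials. It only illustrates the arithmetic. The independent draw per argument (which can repeat a combination) and the fixed seed are my assumptions, and this is not how AML implements random sampling.

```python
import itertools
import math
import random

# Same values as the search space used in this exercise
search_space = {
    'learning_rate': [0.01, 0.05, 0.9],
    'first_layer': [100, 125, 150],
    'second_layer': [30, 60, 90],
}

# Full grid : 3 x 3 x 3 combinations
all_combinations = [
    dict(zip(search_space, values))
    for values in itertools.product(*search_space.values())
]
print(len(all_combinations))  # 27

# Random sampling : pick one value per argument for each trial
# (assumption: independent draws, so this sketch can repeat a combination)
rng = random.Random(0)  # fixed seed only to make the illustration repeatable
max_total_trials = 20
trials = [
    {name: rng.choice(values) for name, values in search_space.items()}
    for _ in range(max_total_trials)
]
print(trials[0])

# With 4 trials running at once, 20 trials need at least 5 rounds
max_concurrent_trials = 4
rounds = math.ceil(max_total_trials / max_concurrent_trials)
print(rounds)  # 5
```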

For ```sampling_algorithm```, you can use ```grid```, ```random```, or ```bayesian```.<br>
You can also specify an early termination policy (```early_termination```), with which a trial is terminated if its primary metric falls outside of some threshold. (Here we don't apply early termination.)
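
As a rough illustration of such a threshold, the following sketch applies a slack rule relative to the best trial so far, which is the idea behind a bandit-style policy. The function name, the slack value, and the metric values are hypothetical. This is a simplified sketch for a "maximize" goal, not the AML implementation.

```python
def should_terminate(trial_metric, best_metric, slack_factor):
    """Return True if a trial falls too far below the best trial (goal: maximize)."""
    # A trial is allowed to lag behind the best one by the slack factor
    threshold = best_metric / (1.0 + slack_factor)
    return trial_metric < threshold

# Hypothetical values : best accuracy so far is 0.8, slack factor is 0.2,
# so the threshold is 0.8 / 1.2 (about 0.667)
print(should_terminate(0.6, 0.8, 0.2))  # True
print(should_terminate(0.7, 0.8, 0.2))  # False
```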

> Note : In this example, I also use the registered data asset (train.tfrecords, test.tfrecords) named ```mnist_tfrecords_data```, which is mounted on the compute target. Run "[Exercise02 : Prepare Data](./exercise02_prepare_data.ipynb)" beforehand to prepare the data.

In [5]:
%%writefile 07_hyperparam_job.yml
$schema: https://azuremlschemas.azureedge.net/latest/sweepJob.schema.json
type: sweep
trial:
  code: script
  command: >-
    python train_experiment.py
    --data_folder ${{inputs.mnist_tf}}
    --learning_rate ${{search_space.learning_rate}}
    --first_layer ${{search_space.first_layer}}
    --second_layer ${{search_space.second_layer}}
  environment: azureml:test-remote-cpu-env-for-logging@latest
inputs:
  mnist_tf:
    type: uri_folder
    path: azureml:mnist_tfrecords_data@latest
compute: azureml:hypertest01
sampling_algorithm: random
search_space:
  learning_rate:
    type: choice
    values: [0.01, 0.05, 0.9]
  first_layer:
    type: choice
    values: [100, 125, 150]
  second_layer:
    type: choice
    values: [30, 60, 90]
objective:
  goal: maximize
  primary_metric: training_accuracy
limits:
  max_total_trials: 20
  max_concurrent_trials: 4
display_name: hyperdrive_test
experiment_name: hyperdrive_test
description: This is example

Writing 07_hyperparam_job.yml
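
The ```${{...}}``` placeholders in ```command``` are filled in for each trial. The following sketch mimics this with a plain string substitution for one sampled trial. The helper ```render_command```, the mount path ```/mnt/data```, and the sampled values are hypothetical and only for illustration. AML resolves these expressions by itself when the job runs.

```python
import re

command_template = (
    "python train_experiment.py "
    "--data_folder ${{inputs.mnist_tf}} "
    "--learning_rate ${{search_space.learning_rate}} "
    "--first_layer ${{search_space.first_layer}} "
    "--second_layer ${{search_space.second_layer}}"
)

def render_command(template, values):
    """Replace each ${{name}} placeholder with the matching value."""
    def substitute(match):
        return str(values[match.group(1)])
    return re.sub(r"\$\{\{\s*([\w.]+)\s*\}\}", substitute, template)

# Hypothetical values for a single trial
values = {
    'inputs.mnist_tf': '/mnt/data',  # made-up mount path
    'search_space.learning_rate': 0.05,
    'search_space.first_layer': 125,
    'search_space.second_layer': 60,
}
rendered = render_command(command_template, values)
print(rendered)
```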


In [6]:
!az ml job create --file 07_hyperparam_job.yml \
  --resource-group $my_resource_group \
  --workspace-name $my_workspace

Uploading script (0.02 MBs): 100%|████| 22737/22737 [00:00<00:00, 614732.36it/s]

{
  "compute": "azureml:hypertest01",
  "creation_context": {
    "created_at": "2022-06-07T02:59:55.460612+00:00",
    "created_by": "Tsuyoshi Matsuzaki",
    "created_by_type": "User"
  },
  "description": "This is example",
  "display_name": "hyperdrive_test",
  "experiment_name": "hyperdrive_test",
  "id": "azureml:/subscriptions/b3ae1c15-4fef-4362-8c3a-5d804cdeb18d/resourceGroups/AML-rg/providers/Microsoft.MachineLearningServices/workspaces/ws01/jobs/quiet_boat_r3cvz6k5c3",
  "inputs": {
    "mnist_tf": {
      "mode": "ro_mount",
      "path": "azureml://locations/eastus/workspaces/e3065a8e-03f5-431f-a3d9-976175f54379/data/mnist_tfrecords_data/versions/1",
      "type": "uri_folder"
    }
  },
  "limits": {
    "max_concurrent_trials": 4,
    "max_total_trials": 20,
    "timeout": 5184000
  },
  "name": "quiet_boat_r3cvz6k5c3",
  "objective": {
    "goal": "maximize",
    "primary_met

## View logs

You can view logs and metrics in jobs on [Azure ML studio UI](https://ml.azure.com/).

![AML Hyperdrive Metrics](https://tsmatz.github.io/images/github/azure-ml-tensorflow-complete-sample/20220225_Hyperdrive_Metrics.jpg)
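
The sweep job compares trials by the primary metric and goal declared under ```objective```. The following sketch shows that comparison on its own. The helper ```best_trial``` and all trial names and metric values are made up for illustration and are not results from this job.

```python
def best_trial(trials, primary_metric, goal):
    """Pick the trial with the best primary metric for the given goal."""
    pick = max if goal == 'maximize' else min
    return pick(trials, key=lambda trial: trial['metrics'][primary_metric])

# Made-up trials (not real results)
trials = [
    {'name': 'trial_a', 'metrics': {'training_accuracy': 0.91}},
    {'name': 'trial_b', 'metrics': {'training_accuracy': 0.95}},
    {'name': 'trial_c', 'metrics': {'training_accuracy': 0.88}},
]
best = best_trial(trials, primary_metric='training_accuracy', goal='maximize')
print(best['name'])  # trial_b
```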

## Remove AML compute

**You don't need to remove your AML compute** to save money, because the nodes are automatically terminated when the cluster is inactive.<br>
But if you want to clean up, run the following command.

In [7]:
!az ml compute delete --name hypertest01 \
  --resource-group $my_resource_group \
  --workspace-name $my_workspace \
  --yes

Deleting compute hypertest01 
...........Done.
(0m 56s)
