# Exercise07 : Hyperparameter Tuning

Azure Machine Learning (AML) provides a framework-independent hyperparameter tuning capability (HyperDrive).    
It monitors a metric, such as accuracy, that your training script writes to the AML run logs.

*back to [index](https://github.com/tsmatz/azureml-tutorial/)*

## Save your training code

First, save your training code.    
Here we use the source code from "[Exercise06 : Experimentation Logs and Outputs](./exercise06_experimentation.ipynb)", which periodically sends logs to the AML run history.

Create the ```script``` directory.

In [1]:
import os
script_folder = './script'
os.makedirs(script_folder, exist_ok=True)

Save the source code as ```./script/train_experiment.py```.

In [2]:
%%writefile script/train_experiment.py
import os
import argparse
import tensorflow as tf

from azureml.core.run import Run

# Get run when running in remote
if 'run' not in locals():
    run = Run.get_context()

# parse arguments
parser = argparse.ArgumentParser()
parser.add_argument(
    "--data_folder",
    type=str,
    default="./data",
    help="Folder path for input data")
parser.add_argument(
    "--model_folder",
    type=str,
    default="./outputs",  # AML experiments outputs folder
    help="Folder path for model output")
parser.add_argument(
    "--learning_rate",
    type=float,
    default=0.001,
    help="Learning Rate")
parser.add_argument(
    "--first_layer",
    type=int,
    default=128,
    help="Neuron number for the first hidden layer")
parser.add_argument(
    "--second_layer",
    type=int,
    default=64,
    help="Neuron number for the second hidden layer")
parser.add_argument(
    "--epochs_num",
    type=int,
    default=6,
    help="Number of epochs")
FLAGS, unparsed = parser.parse_known_args()

# build model
model = tf.keras.models.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(FLAGS.first_layer, activation="relu"),
    tf.keras.layers.Dense(FLAGS.second_layer, activation="relu"),
    tf.keras.layers.Dense(10)
])
model.compile(
    optimizer=tf.keras.optimizers.Adam(FLAGS.learning_rate),
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=[tf.keras.metrics.SparseCategoricalAccuracy()],
)

# Create custom callback
class CustomOutputCallback(tf.keras.callbacks.Callback):
    def on_epoch_end(self, epoch, logs=None):
        # Note : To record many values at once, prefer log_list() over repeated log() calls.
        run.log('training_accuracy', logs["sparse_categorical_accuracy"])
        run.log('training_loss', logs["loss"])
    def on_train_end(self, logs=None):
        run.log('final_accuracy', logs["sparse_categorical_accuracy"])
        run.log('final_loss', logs["loss"])

# run training
train_data_path = os.path.join(FLAGS.data_folder, "train")
train_data = tf.data.experimental.load(train_data_path)
model.fit(
    train_data.shuffle(1000).batch(128).prefetch(tf.data.AUTOTUNE),
    epochs=FLAGS.epochs_num,
    callbacks=[CustomOutputCallback()]
)

# save model and variables
model_path = os.path.join(FLAGS.model_folder, "mnist_tf_model")
model.save(model_path)
print("current working directory : ", os.getcwd())
print("model folder : ", model_path)

# send logs to AML
run.log('learning_rate', FLAGS.learning_rate)
run.log('1st_layer', FLAGS.first_layer)
run.log('2nd_layer', FLAGS.second_layer)

Writing script/train_experiment.py
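
The tunable settings reach this script as command-line flags. The following is a minimal standalone sketch of that parsing step, using only the standard library. The `parse_hyperparameters` helper and the `--extra_flag` argument are hypothetical names introduced for illustration, and the sketch assumes each trial passes its sampled values as flags.

```python
import argparse

def parse_hyperparameters(argv):
    """Parse the tunable arguments in the same style as train_experiment.py."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--learning_rate", type=float, default=0.001)
    parser.add_argument("--first_layer", type=int, default=128)
    parser.add_argument("--second_layer", type=int, default=64)
    # parse_known_args() returns (namespace, leftovers) and does not
    # fail when it meets flags the parser does not know about.
    return parser.parse_known_args(argv)

# Hypothetical command line for one trial.
flags, unparsed = parse_hyperparameters(
    ["--learning_rate", "0.005", "--first_layer", "125", "--extra_flag", "x"])

print(flags.learning_rate, flags.first_layer, flags.second_layer)
print(unparsed)
```

Flags that are not supplied keep their defaults (here `--second_layer` stays at 64), and unknown flags end up in `unparsed` instead of raising an error.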


## Get workspace setting

Before starting, you must read your configuration settings. (See "[Exercise01 : Prepare Config Settings](./exercise01_prepare_config.ipynb)".)

In [3]:
from azureml.core import Workspace
import azureml.core

ws = Workspace.from_config()

## Create AML compute

Create an AML compute cluster to use as the compute environment.

In [4]:
from azureml.core.compute import ComputeTarget, AmlCompute
from azureml.core.compute_target import ComputeTargetException

try:
    compute_target = ComputeTarget(workspace=ws, name='hypertest01')
    print('found existing:', compute_target.name)
except ComputeTargetException:
    print('creating new.')
    compute_config = AmlCompute.provisioning_configuration(
        vm_size='Standard_D2_v2',
        min_nodes=0,
        max_nodes=4)
    compute_target = ComputeTarget.create(ws, 'hypertest01', compute_config)
    compute_target.wait_for_completion(show_output=True)

creating new.
SucceededProvisioning operation finished, operation "Succeeded"
Succeeded
AmlCompute wait for completion finished

Minimum number of nodes requested have been provisioned


## Prepare Dataset

You can mount your dataset (see "[Exercise02 : Prepare Data](./exercise02_prepare_data.ipynb)") on your AML compute.<br>
Here we get a reference to the registered dataset.

In [5]:
from azureml.core import Dataset

dataset = Dataset.get_by_name(ws, 'mnist_dataset', version='latest')

# # For using unregistered data, see below
# from azureml.core import Datastore
# from azureml.core import Dataset
# ds = ws.get_default_datastore()
# ds_paths = [(ds, 'tfdata/')]
# dataset = Dataset.File.from_files(path = ds_paths)

## Generate Hyperparameter Sampling

Define how to explore the script's arguments (the arguments in ```train_experiment.py```).<br>
You can choose from ```GridParameterSampling```, ```RandomParameterSampling```, and ```BayesianParameterSampling```.

In [6]:
from azureml.train.hyperdrive import (
    BanditPolicy,
    HyperDriveConfig,
    PrimaryMetricGoal,
    RandomParameterSampling,
    choice,
)

param_sampling = RandomParameterSampling(
    {
        '--learning_rate': choice(0.001, 0.005, 0.009),
        '--first_layer': choice(100, 125, 150),
        '--second_layer': choice(30, 60, 90)
    }
)
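
To get a feel for the size of this search space, here is a rough pure-Python sketch (it does not use the AML SDK). It enumerates every combination of the three `choice` lists and then draws configurations at random. The `sample_random` helper is a hypothetical stand-in and is not how the service is implemented. Draws are independent here, so duplicates can occur in this sketch.

```python
import itertools
import random

# Same discrete choices as in the param_sampling cell above.
search_space = {
    "--learning_rate": [0.001, 0.005, 0.009],
    "--first_layer": [100, 125, 150],
    "--second_layer": [30, 60, 90],
}

# Every possible combination (what an exhaustive grid would visit).
grid = list(itertools.product(*search_space.values()))
print(len(grid))  # 3 * 3 * 3 = 27

def sample_random(space, rng):
    """Draw one configuration, picking each value independently and uniformly."""
    return {name: rng.choice(values) for name, values in space.items()}

rng = random.Random(0)  # fixed seed only to make the sketch repeatable
trials = [sample_random(search_space, rng) for _ in range(20)]
print(trials[0])
```

With 27 possible combinations, a budget of 20 runs cannot cover the whole grid, which is one reason to prefer random sampling over an exhaustive grid when the budget is limited.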

## Generate script run config

In [7]:
from azureml.core import Environment, Experiment, ScriptRunConfig

# generate script run config
tf_env = Environment.get(workspace=ws, name='AzureML-tensorflow-2.7-ubuntu20.04-py38-cuda11-gpu')
src = ScriptRunConfig(
    source_directory='./script',
    script='train_experiment.py',
    arguments=['--data_folder', dataset.as_mount()],
    compute_target=compute_target,
    environment=tf_env
)

## Generate HyperDrive config

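Generate the run config with an early termination policy (```BanditPolicy```). With ```evaluation_interval=2``` and ```slack_factor=0.1```, the policy is applied every 2 times the primary metric is reported, and a run is terminated when its best metric falls below the best-performing run's metric divided by 1.1 (that is, 1 + slack factor).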

In [8]:
# early termination :
# every 2 metric reports, stop runs whose best primary metric is below (best run's metric) / (1 + 0.1)
policy = BanditPolicy(evaluation_interval=2, slack_factor=0.1)

# generate run config
hd_config = HyperDriveConfig(
    run_config=src,
    hyperparameter_sampling=param_sampling,
    primary_metric_name='training_accuracy',
    primary_metric_goal=PrimaryMetricGoal.MAXIMIZE, 
    policy=policy,
    max_total_runs=20,
    max_concurrent_runs=4)
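
The slack-factor rule can be sketched in a few lines of plain Python. This is an illustration of the documented threshold formula for a metric that is maximized, not the service's implementation. The `bandit_should_terminate` function and the accuracy values are hypothetical.

```python
def bandit_should_terminate(run_best, overall_best, slack_factor):
    """Return True when a run falls outside the allowed slack (goal: maximize)."""
    threshold = overall_best / (1 + slack_factor)
    return run_best < threshold

best_so_far = 0.95   # hypothetical best training_accuracy among all runs
slack_factor = 0.1   # same value as in the BanditPolicy above

print(round(best_so_far / (1 + slack_factor), 4))                # 0.8636
print(bandit_should_terminate(0.80, best_so_far, slack_factor))  # True
print(bandit_should_terminate(0.90, best_so_far, slack_factor))  # False
```

A smaller slack factor tightens the threshold and stops more runs early. A larger one is more tolerant of slow starters.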

## Run script and wait for completion

This starts training with up to 4 runs in parallel (```max_concurrent_runs=4```). You can scale this as you like.

In [9]:
from azureml.core import Experiment

experiment = Experiment(workspace=ws, name='hyperdrive_test')
run = experiment.submit(config=hd_config)
run.wait_for_completion(show_output=True)

RunId: HD_c91cafed-2b1f-48db-bc54-5cce23c04164
Web View: https://ml.azure.com/runs/HD_c91cafed-2b1f-48db-bc54-5cce23c04164?wsid=/subscriptions/b3ae1c15-4fef-4362-8c3a-5d804cdeb18d/resourcegroups/rg-AML/workspaces/ws01&tid=72f988bf-86f1-41af-91ab-2d7cd011db47

Streaming azureml-logs/hyperdrive.txt

[2022-10-05T06:46:26.812462][GENERATOR][INFO]Trying to sample '4' jobs from the hyperparameter space
[2022-10-05T06:46:27.9599891Z][SCHEDULER][INFO]Scheduling job, id='HD_c91cafed-2b1f-48db-bc54-5cce23c04164_0' 
[2022-10-05T06:46:28.0513886Z][SCHEDULER][INFO]Scheduling job, id='HD_c91cafed-2b1f-48db-bc54-5cce23c04164_1' 
[2022-10-05T06:46:28.2502985Z][SCHEDULER][INFO]Scheduling job, id='HD_c91cafed-2b1f-48db-bc54-5cce23c04164_2' 
[2022-10-05T06:46:28.3475659Z][SCHEDULER][INFO]Scheduling job, id='HD_c91cafed-2b1f-48db-bc54-5cce23c04164_3' 
[2022-10-05T06:46:28.300717][GENERATOR][INFO]Successfully sampled '4' jobs, they will soon be submitted to the execution target.
[2022-10-05T06:46:28.451151

{'runId': 'HD_c91cafed-2b1f-48db-bc54-5cce23c04164',
 'target': 'hypertest01',
 'status': 'Completed',
 'startTimeUtc': '2022-10-05T06:46:25.957497Z',
 'endTimeUtc': '2022-10-05T07:20:36.891027Z',
 'services': {},
 'properties': {'primary_metric_config': '{"name":"training_accuracy","goal":"maximize"}',
  'resume_from': 'null',
  'runTemplate': 'HyperDrive',
  'azureml.runsource': 'hyperdrive',
  'platform': 'AML',
  'ContentSnapshotId': 'ce12060e-a47f-4525-baf5-da31c0477399',
  'user_agent': 'python/3.8.10 (Linux-5.15.0-1020-azure-x86_64-with-glibc2.29) msrest/0.7.1 Hyperdrive.Service/1.0.0 Hyperdrive.SDK/core.1.46.0',
  'space_size': '27',
  'score': '0.9703999757766724',
  'best_child_run_id': 'HD_c91cafed-2b1f-48db-bc54-5cce23c04164_5',
  'best_metric_status': 'Succeeded',
  'best_data_container_id': 'dcid.HD_c91cafed-2b1f-48db-bc54-5cce23c04164_5'},
 'inputDatasets': [],
 'outputDatasets': [],
 'runDefinition': {'configuration': None,
  'attribution': None,
  'telemetryValues': {'

## View logs

You can view logs and metrics in Experiments on [Azure ML studio UI](https://ml.azure.com/).

![AML Hyperdrive Metrics](https://tsmatz.github.io/images/github/azure-ml-tensorflow-complete-sample/20220225_Hyperdrive_Metrics.jpg)

In your notebook, you can also view them with the AML run history widget as follows.

In [10]:
from azureml.widgets import RunDetails
RunDetails(run_instance=run).show()

_HyperDriveWidget(widget_settings={'childWidgetDisplay': 'popup', 'send_telemetry': False, 'log_level': 'INFO'…

You can also explore the metrics with your Python code.

In [11]:
allmetrics = run.get_metrics()
print(allmetrics)

{'HD_c91cafed-2b1f-48db-bc54-5cce23c04164_19': {'training_accuracy': [0.8514666557312012, 0.92208331823349, 0.9394833445549011, 0.9505500197410583, 0.9579499959945679, 0.9628333449363708], 'training_loss': [2.786658525466919, 0.5159221291542053, 0.30762091279029846, 0.22386091947555542, 0.17581500113010406, 0.1487947553396225], 'final_accuracy': 0.9628333449363708, 'final_loss': 0.1487947553396225, 'learning_rate': 0.001, '1st_layer': 100, '2nd_layer': 90}, 'HD_c91cafed-2b1f-48db-bc54-5cce23c04164_18': {'training_accuracy': [0.8007333278656006, 0.8979833126068115, 0.9146166443824768, 0.9243166446685791, 0.9292666912078857, 0.9310833215713501], 'training_loss': [5.462890625, 0.3702734708786011, 0.3083052337169647, 0.2728637754917145, 0.2532891035079956, 0.24299181997776031], 'final_accuracy': 0.9310833215713501, 'final_loss': 0.24299181997776031, 'learning_rate': 0.009, '1st_layer': 150, '2nd_layer': 90}, 'HD_c91cafed-2b1f-48db-bc54-5cce23c04164_15': {'training_accuracy': [0.85523331165
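
Because the result is a plain dictionary keyed by child run ID, it can be ranked with ordinary Python. The sketch below uses a trimmed sample shaped like the output above: the run labels are shortened, hypothetical keys, while the values are copied from that output. The `rank_runs` helper is an illustrative name and is not part of the SDK.

```python
# Trimmed sample shaped like the run.get_metrics() output above.
sample_metrics = {
    "run_19": {"final_accuracy": 0.9628333449363708,
               "learning_rate": 0.001, "1st_layer": 100, "2nd_layer": 90},
    "run_18": {"final_accuracy": 0.9310833215713501,
               "learning_rate": 0.009, "1st_layer": 150, "2nd_layer": 90},
}

def rank_runs(metrics, key="final_accuracy"):
    """Sort runs by a metric, best first, skipping runs that never logged it."""
    runs = [(run_id, m) for run_id, m in metrics.items() if key in m]
    return sorted(runs, key=lambda item: item[1][key], reverse=True)

ranked = rank_runs(sample_metrics)
best_id, best = ranked[0]
print(best_id, best["final_accuracy"])
print({k: best[k] for k in ("learning_rate", "1st_layer", "2nd_layer")})
```

With the real result, pass `allmetrics` instead of `sample_metrics`. The SDK's `HyperDriveRun` also provides `get_best_run_by_primary_metric()`, which returns the best child run directly.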

## Remove AML compute

In [13]:
# Delete cluster (nodes) and remove from AML workspace
mycompute = AmlCompute(workspace=ws, name='hypertest01')
mycompute.delete()