# Exercise07 : Hyperparameter Tuning (Sweep Job)

AML provides a framework-independent hyperparameter tuning capability.<br>
You can quickly search for optimal parameters by scaling out training workloads. This capability also works with the metrics logged in AML.

*back to [index](https://github.com/tsmatz/azureml-tutorial/)*

## Initialize MLClient

Replace the bracketed strings below with your subscription ID, resource group name, and AML workspace name.<br>
(Note that creating ```MLClient``` does not connect to the AML workspace. The client initialization is lazy.)

In [1]:
from azure.ai.ml import MLClient
from azure.identity import DeviceCodeCredential, TokenCachePersistenceOptions

# When you run on remote
cache_opt = TokenCachePersistenceOptions(allow_unencrypted_storage=True)
cred = DeviceCodeCredential(cache_persistence_options=cache_opt)

# # When you run on Azure ML Notebook
# from azure.identity import DefaultAzureCredential
# cred = DefaultAzureCredential()

# Get a handle to the workspace
ml_client = MLClient(
    credential=cred,
    subscription_id="{SUBSCRIPTION ID}",
    resource_group_name="{RESOURCE GROUP NAME}",
    workspace_name="{AML WORKSPACE NAME}",
)

## Save your training code

First, save your training code.<br>
Here I use the source code from "[Exercise06 : Track Logs and Metrics](./exercise06_experimentation.ipynb)", which sends logs to the AML run history. (These metrics are tracked by the hyperparameter tuning (sweep) job.)

Create the ```script``` directory.

In [2]:
import os
script_folder = './script'
os.makedirs(script_folder, exist_ok=True)

Save the source code as ```./script/train_experiment.py```.<br>
This is exactly the same source code as in "[Exercise06 : Track Logs and Metrics](./exercise06_experimentation.ipynb)".

In [3]:
%%writefile script/train_experiment.py
import os
import argparse
import tensorflow as tf

import mlflow ##### Modified
mlflow.tensorflow.autolog() ##### Modified

### You can also manually log as follows (Here we use autolog())
# mlflow.log_params({
#     'learning_rate': FLAGS.learning_rate,
#     '1st_layer': FLAGS.first_layer,
#     '2nd_layer': FLAGS.second_layer})
# mlflow.log_metrics(
#     {'training_accuracy': result_accuracy, 'training_loss': result_loss},
#     step=result_step)

# device test
print("##### List of available GPU #####")
print(tf.config.list_physical_devices("GPU"))

# parse arguments
parser = argparse.ArgumentParser()
parser.add_argument(
    "--data_folder",
    type=str,
    default="./data/train",
    help="Folder path for input data")
parser.add_argument(
    "--model_folder",
    type=str,
    default="./outputs",  # AML experiments outputs folder
    help="Folder path for model output")
parser.add_argument(
    "--learning_rate",
    type=float,
    default="0.001",
    help="Learning Rate")
parser.add_argument(
    "--first_layer",
    type=int,
    default="128",
    help="Neuron number for the first hidden layer")
parser.add_argument(
    "--second_layer",
    type=int,
    default="64",
    help="Neuron number for the second hidden layer")
parser.add_argument(
    "--epochs_num",
    type=int,
    default="6",
    help="Number of epochs")
FLAGS, unparsed = parser.parse_known_args()

# build model
model = tf.keras.models.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(FLAGS.first_layer, activation="relu"),
    tf.keras.layers.Dense(FLAGS.second_layer, activation="relu"),
    tf.keras.layers.Dense(10)
])
model.compile(
    optimizer=tf.keras.optimizers.Adam(FLAGS.learning_rate),
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=[tf.keras.metrics.SparseCategoricalAccuracy()],
)

# run training
train_data = tf.data.experimental.load(FLAGS.data_folder)
model.fit(
    train_data.shuffle(1000).batch(128).prefetch(tf.data.AUTOTUNE),
    epochs=FLAGS.epochs_num
)

# save model and variables
model_path = os.path.join(FLAGS.model_folder, "mnist_tf_model")
model.save(model_path)
print("current working directory : ", os.getcwd())
print("model folder : ", model_path)

Overwriting script/train_experiment.py
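
The sweep job passes each sampled hyperparameter to the script as a command-line flag, so the ```argparse``` setup above is what makes the script tunable. The following sketch uses a trimmed-down parser (not the full script) to show how flags override the defaults and how ```parse_known_args``` tolerates extra arguments. The ```--unused_flag``` argument is hypothetical and exists only for this illustration.

```python
import argparse

# Trimmed-down copy of the parser in train_experiment.py (illustration only)
parser = argparse.ArgumentParser()
parser.add_argument("--learning_rate", type=float, default="0.001")
parser.add_argument("--first_layer", type=int, default="128")
parser.add_argument("--second_layer", type=int, default="64")

# Simulate the argument list that one sweep trial might pass.
# "--unused_flag" is a hypothetical argument that the parser does not define.
argv = ["--learning_rate", "0.005", "--first_layer", "150", "--unused_flag", "1"]
flags, unparsed = parser.parse_known_args(argv)

print(flags.learning_rate)  # 0.005 (converted to float)
print(flags.first_layer)    # 150 (converted to int)
print(flags.second_layer)   # 64 (string default converted by type=int)
print(unparsed)             # ['--unused_flag', '1']
```

Because unknown arguments are returned in ```unparsed``` instead of raising an error, the script keeps working even if the job command carries flags it does not use.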


## Create AML compute

Create an AML compute cluster as the computing environment.<br>
Here I create a cluster with a maximum of 4 instances so that the sweep job can scale out.

> Note : By setting an appropriate duration in the ```idle_time_before_scale_down``` parameter, you can keep the cluster from scaling down right after training finishes. (Otherwise, it scales down 120 seconds after training finishes, and the next training will be slow to start because the cluster must be resized.)

In [4]:
from azure.ai.ml.entities import AmlCompute

try:
    compute_target = ml_client.compute.get("hypertest01")
    print("found existing: ", compute_target.name)
except Exception:
    print("creating new.")
    compute_target = AmlCompute(
        name="hypertest01",
        type="amlcompute",
        size="Standard_D2_v2",
        min_instances=0,
        max_instances=4,
        tier="Dedicated",
    )
    # begin_create_or_update returns a poller; result() waits until the cluster is created
    compute_target = ml_client.begin_create_or_update(compute_target).result()

To sign in, use a web browser to open the page https://microsoft.com/devicelogin and enter the code AX9T4BVTB to authenticate.
creating new.
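
To apply the note above on ```idle_time_before_scale_down```, one option is to collect the cluster settings in a dictionary and pass them to ```AmlCompute```. This is only a sketch: the name ```compute_settings``` is introduced here for illustration, and the 1800-second value is an assumed example, not a recommendation.

```python
# Keyword arguments for AmlCompute; the names mirror the cell above.
compute_settings = {
    "name": "hypertest01",
    "type": "amlcompute",
    "size": "Standard_D2_v2",
    "min_instances": 0,
    "max_instances": 4,
    "tier": "Dedicated",
    # Seconds a node may stay idle before the cluster scales down.
    # 1800 (30 minutes) is an assumed example value.
    "idle_time_before_scale_down": 1800,
}

# In the notebook you would then create the cluster with:
#   compute_target = AmlCompute(**compute_settings)
#   ml_client.begin_create_or_update(compute_target)
print(compute_settings["idle_time_before_scale_down"] / 60, "minutes")
```

A longer idle time keeps nodes warm between jobs at the cost of paying for idle nodes during that window.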


## Create AML environment

As mentioned in "[Exercise06 : Track Logs and Metrics](./exercise06_experimentation.ipynb)", we need an environment with ```mlflow``` and ```azureml-mlflow``` installed.

**If you have already created the custom environment in [Exercise06](./exercise06_experimentation.ipynb), you don't need to run the following cells.** (The custom environment already exists.)

In [None]:
%%writefile 06_conda_pydata_for_logging.yml
name: project_environment
dependencies:
- python=3.8
- pip:
  - tensorflow-gpu==2.10.0
  - mlflow
  - azureml-mlflow
channels:
- anaconda
- conda-forge

In [None]:
from azure.ai.ml.entities import Environment

myenv = Environment(
    name="test-remote-cpu-env-for-logging",
    description="This is example",
    conda_file="06_conda_pydata_for_logging.yml",
    image="mcr.microsoft.com/azureml/openmpi4.1.0-ubuntu20.04",
)
myenv = ml_client.environments.create_or_update(myenv)

Go to the [AML Studio UI](https://ml.azure.com/) and click "Environments". Next, click the "Custom environments" tab and select the environment created above.<br>
Wait until the environment image build status shows as succeeded.

![Environment status](https://tsmatz.github.io/images/github/azure-ml-tensorflow-complete-sample/20221220_Environment_Status.jpg)

## Submit a job with hyperparameter search

Now submit a job in which multiple training runs (trials) are executed with different hyperparameters.<br>
In this example, we monitor the training accuracy for 3 arguments: ```--learning_rate```, ```--first_layer```, and ```--second_layer```. Each argument can take 3 different values, so 27 combinations are possible, but here I limit the job to a maximum of 20 trials, in which the argument values are picked at random.<br>
These trials are parallelized across the 4 nodes created above to speed things up.
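
The arithmetic behind these numbers can be checked with a short plain-Python sketch. It enumerates the full grid, draws 20 distinct combinations, and computes how many rounds are needed with 4 concurrent trials. The drawing step is only an illustration: AML's random sampler is not guaranteed to behave exactly like ```random.sample```.

```python
import itertools
import math
import random

# Candidate values used later in this exercise
search_space = {
    "learning_rate": [0.001, 0.005, 0.009],
    "first_layer": [100, 125, 150],
    "second_layer": [30, 60, 90],
}

# Full grid: every combination of the three arguments
grid = list(itertools.product(*search_space.values()))
print(len(grid))  # 27

max_total_trials = 20
max_concurrent_trials = 4

# Illustration only: draw 20 distinct combinations from the grid
random.seed(0)  # fixed seed so the sketch is repeatable
trials = random.sample(grid, max_total_trials)

# With 4 trials running at a time, 20 trials need at least 5 rounds
rounds = math.ceil(max_total_trials / max_concurrent_trials)
print(rounds)  # 5
```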

First, we define a regular command job without hyperparameter (sweep) settings.

> Note : In this example, I also use the registered data asset named ```mnist_data```, which is mounted on the compute target. Run "[Exercise02 : Prepare Data](./exercise02_prepare_data.ipynb)" beforehand to prepare the data.

In [5]:
from azure.ai.ml import command, Input

job = command(
    code="./script",
    command="python train_experiment.py --data_folder ${{inputs.mnist_tf}}/train --learning_rate ${{inputs.learning_rate}} --first_layer ${{inputs.first_layer}} --second_layer ${{inputs.second_layer}}",
    inputs={
        "mnist_tf": Input(
            type="uri_folder",
            path="mnist_data@latest",
        ),
        "learning_rate": 0.001,
        "first_layer": 100,
        "second_layer": 30,
    },
    environment="test-remote-cpu-env-for-logging@latest",
    compute="hypertest01",
    display_name="hyperdrive_test",
    experiment_name="hyperdrive_test",
    description="This is example",
)
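
The ```${{inputs.<name>}}``` placeholders in ```command``` are resolved by AML for each run. The following plain-Python sketch only mimics that substitution to show what the command line of one trial could look like. The function ```render_command``` and the mount path ```/mnt/mnist_data``` are hypothetical, and AML's actual resolution logic is not shown here.

```python
import re


def render_command(template: str, inputs: dict) -> str:
    """Replace each ${{inputs.NAME}} placeholder with the value of inputs[NAME]."""
    return re.sub(
        r"\$\{\{inputs\.(\w+)\}\}",
        lambda m: str(inputs[m.group(1)]),
        template,
    )


template = (
    "python train_experiment.py --data_folder ${{inputs.mnist_tf}}/train "
    "--learning_rate ${{inputs.learning_rate}} "
    "--first_layer ${{inputs.first_layer}} "
    "--second_layer ${{inputs.second_layer}}"
)

# Hypothetical values for a single trial; the data path is a made-up mount point
trial_inputs = {
    "mnist_tf": "/mnt/mnist_data",
    "learning_rate": 0.005,
    "first_layer": 125,
    "second_layer": 60,
}

print(render_command(template, trial_inputs))
```

Each trial of the sweep job runs the same template with a different set of input values.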

Next, we apply the sweep settings to the job defined above.

With the objective settings (```primary_metric``` and ```goal```), the accuracy of each training run is tracked and maximized. (```sparse_categorical_accuracy``` is the metric name logged by MLflow tracking in this training.)

For ```sampling_algorithm```, you can use ```grid```, ```random```, or ```bayesian```.<br>
You can also specify an early termination policy (```early_termination```), which stops a training run when its primary metric falls outside a given threshold. (Here we don't apply early termination.)

In [6]:
from azure.ai.ml.sweep import Choice

# Customize inputs for sweep
job_for_sweep = job(
    learning_rate=Choice(values=[0.001, 0.005, 0.009]),
    first_layer=Choice(values=[100, 125, 150]),
    second_layer=Choice(values=[30, 60, 90]),
)

# Apply sweep for parameters
sweep_job = job_for_sweep.sweep(
    compute="hypertest01",
    sampling_algorithm="random",
    primary_metric="sparse_categorical_accuracy",
    goal="Maximize",
)
sweep_job.set_limits(max_total_trials=20, max_concurrent_trials=4)
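
As an illustration of the threshold idea behind early termination, the sketch below implements a slack-factor rule in plain Python: when maximizing, a trial is stopped if its metric falls below ```best / (1 + slack_factor)```. This mirrors how the bandit policy is commonly described, but it is not the AML implementation. The function ```should_terminate``` and the sample accuracies are hypothetical.

```python
def should_terminate(metric: float, best_metric: float, slack_factor: float) -> bool:
    """Return True if a trial is outside the allowed slack (goal: maximize)."""
    threshold = best_metric / (1.0 + slack_factor)
    return metric < threshold


# Hypothetical accuracies reported by four trials at the same evaluation step
accuracies = {"trial_a": 0.97, "trial_b": 0.95, "trial_c": 0.80, "trial_d": 0.62}
best = max(accuracies.values())

# slack_factor=0.2 gives a threshold of 0.97 / 1.2 (about 0.808)
stopped = [name for name, acc in accuracies.items() if should_terminate(acc, best, 0.2)]
print(stopped)  # ['trial_c', 'trial_d']
```

In the SDK, a policy object such as ```BanditPolicy``` from ```azure.ai.ml.sweep``` can be assigned to ```sweep_job.early_termination``` before submitting the job. Check the parameter names against the SDK version you use.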

Now let's submit the job.

In [7]:
returned_job = ml_client.create_or_update(sweep_job)

## View logs

Click the link in the following output.

In [8]:
returned_job

| Experiment | Name | Type | Status | Details Page |
|---|---|---|---|---|
| python_sdk2 | neat_spoon_4r4xqk7mgd | sweep | Running | Link to Azure Machine Learning studio |


You can then view the logs and metrics of the jobs in the [Azure ML studio UI](https://ml.azure.com/).<br>
(Select the "Trials" tab.)

![AML Hyperdrive Metrics](https://tsmatz.github.io/images/github/azure-ml-tensorflow-complete-sample/20220225_Hyperdrive_Metrics.jpg)

## Remove AML compute

**You don't need to remove your AML compute** to save money, because the nodes are automatically released when the cluster is idle.<br>
If you still want to clean up, run the following.

In [10]:
ml_client.compute.begin_delete("hypertest01")

Deleting compute hypertest01 


.........................

Done.
(2m 7s)

