# Tuning Hyperparameters

Many machine learning algorithms require *hyperparameters*: parameter values that influence training but can't be determined from the training data itself. For example, when training a logistic regression model, you can use a *regularization rate* hyperparameter to counteract overfitting; when training a convolutional neural network, you can use hyperparameters such as *learning rate* and *batch size* to control how weights are adjusted and how many data items are processed in each mini-batch, respectively. The choice of hyperparameter values can significantly affect both the performance of a trained model and the time taken to train it, so you often need to try multiple combinations to find the best one.

In this case, you'll use a simple example of a logistic regression model with a single hyperparameter, but the principles apply to any kind of model you can train with Azure Machine Learning.
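
To illustrate how hyperparameters differ from learned parameters, here is a minimal sketch in plain Python (it does not use Azure Machine Learning). The function `train_linear` and the toy data are hypothetical and exist only for this illustration: the weight `w` is learned from the data, while `learning_rate` and `batch_size` are chosen before training and decide whether training converges at all.

```python
import random


def train_linear(xs, ys, learning_rate, batch_size, epochs=200, seed=0):
    """Fit y ~ w * x with mini-batch gradient descent and return the weight w."""
    rng = random.Random(seed)
    w = 0.0  # model parameter: learned from the data
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            # Gradient of the mean squared error over this mini-batch.
            grad = sum(2 * (w * xs[i] - ys[i]) * xs[i] for i in batch) / len(batch)
            w -= learning_rate * grad
    return w


xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [3.0 * x for x in xs]  # toy data generated with a true weight of 3

good = train_linear(xs, ys, learning_rate=0.01, batch_size=2)
bad = train_linear(xs, ys, learning_rate=0.2, batch_size=5)
print('learning_rate=0.01, batch_size=2 ->', good)
print('learning_rate=0.2,  batch_size=5 ->', bad)
```

With the smaller learning rate the weight settles close to 3; with the larger one each update overshoots and the weight diverges, which is why hyperparameter values usually have to be searched rather than guessed.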

## Before You Start

Before you start this lab, ensure that you have completed the *Create an Azure Machine Learning Workspace* and *Create a Compute Instance* tasks in [Lab 1: Getting Started with Azure Machine Learning](./labdocs/Lab01.md). Then open this notebook in Jupyter on your Compute Instance.

## Connect to Your Workspace

The first thing you need to do is to connect to your workspace using the Azure ML SDK.

> **Note**: You may be prompted to authenticate. If so, copy the code, click the link provided to sign in to your Azure subscription, and then return to this notebook.

In [12]:
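# Import the subpackages explicitly: a bare `import azureml` does not
# guarantee that `azureml.core` is loaded.
import azureml.core
import azureml.core.compute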
from azureml import widgets
from azureml.train import hyperdrive, sklearn
import os


In [2]:
ws = azureml.core.Workspace.from_config()
print(f'Ready to use Azure ML {azureml.core.VERSION} to work with {ws.name}')

Ready to use Azure ML 1.12.0 to work with workspace


## Prepare Data for an Experiment

In this lab, you'll use a dataset containing details of diabetes patients. Run the cell below to create this dataset (if you already created it, the code will create a new version).

In [4]:
default_ds = ws.get_default_datastore()

print('Dataset already registered.')

Dataset already registered.


## Prepare a Training Script

Let's start by creating a folder for the training script you'll use to train a logistic regression model.

In [6]:
experiment_folder = 'diabetes-training-hyperdrive'
os.makedirs(experiment_folder, exist_ok=True)

print('Folder ready.')

Folder ready.


Now create the Python script to train the model. This must include:

- A parameter for each hyperparameter you want to optimize (in this case, there's only the regularization hyperparameter)
- Code to log the performance metric you want to optimize for (in this case, you'll log both AUC and accuracy, so you can choose to optimize the model for either of these)

In [7]:
%%writefile $experiment_folder/diabetes_training.py
import os
import argparse
import joblib
import azureml.core  # needed so that azureml.core.Run resolves
import pandas as pd
import numpy as np
from sklearn import model_selection, linear_model, metrics

parser = argparse.ArgumentParser()
parser.add_argument(
    '--regularization', type=float, dest='reg_rate', default=0.01, 
    help='regularization rate'
)
args = parser.parse_args()
reg = args.reg_rate

run = azureml.core.Run.get_context()

print('Loading Data...')
diabetes: pd.DataFrame = run.input_datasets['diabetes'].to_pandas_dataframe()

X = diabetes[
        [
            'Pregnancies', 'PlasmaGlucose', 'DiastolicBloodPressure',
            'TricepsThickness', 'SerumInsulin', 'BMI', 'DiabetesPedigree',
            'Age',
        ]
    ].to_numpy()
y = diabetes['Diabetic'].to_numpy()

X_train, X_test, y_train, y_test = model_selection.train_test_split(
    X, y, test_size=0.30, random_state=0
)

print('Training a logistic regression model with regularization rate of', reg)
run.log('Regularization Rate',  reg)
model = (
    linear_model.LogisticRegression(C=1/reg, solver="liblinear")
    .fit(X_train, y_train)
)

y_hat = model.predict(X_test)
acc = np.average(y_hat == y_test)
print('Accuracy:', acc)
run.log('Accuracy', acc)

y_scores = model.predict_proba(X_test)
auc = metrics.roc_auc_score(y_test, y_scores[:, 1])
print('AUC:', auc)
run.log('AUC', auc)

os.makedirs('outputs', exist_ok=True)
joblib.dump(model, 'outputs/diabetes_model.pkl')

run.complete()


Writing diabetes-training-hyperdrive/diabetes_training.py
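
The training script receives the hyperparameter as a command-line argument and passes its inverse to scikit-learn, whose `C` parameter is the inverse of regularization strength. The sketch below reproduces only that argument handling so it can be tried locally without a workspace. `parse_reg_rate` and `to_inverse_strength` are hypothetical helper names, the explicit argument lists stand in for whatever a tuning run would pass, and the positivity check is an added guard that the script itself does not have.

```python
import argparse


def parse_reg_rate(argv):
    """Parse --regularization the same way the training script does."""
    parser = argparse.ArgumentParser()
    parser.add_argument('--regularization', type=float, dest='reg_rate',
                        default=0.01, help='regularization rate')
    return parser.parse_args(argv).reg_rate


def to_inverse_strength(reg_rate):
    """Convert a regularization rate to the value passed as C (C = 1 / rate)."""
    # Added guard (not in the original script): avoid dividing by zero.
    if reg_rate <= 0:
        raise ValueError('regularization rate must be positive')
    return 1 / reg_rate


for argv in ([], ['--regularization', '0.1'], ['--regularization', '1.0']):
    rate = parse_reg_rate(argv)
    print(argv, '-> reg_rate =', rate, ', C =', to_inverse_strength(rate))
```

A larger regularization rate therefore means a smaller `C` and a more strongly regularized model.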


## Prepare a Compute Target

One of the benefits of cloud compute is that it scales on-demand, enabling you to provision enough compute resources to process multiple runs of an experiment in parallel, each with different hyperparameter values.

The code below retrieves an existing Azure Machine Learning compute cluster from your workspace, so make sure you have created one before running it.

> **Important**: Change the value of `cluster_name` (*susumu-cluster* in the code below) to the unique name of your compute cluster before running it!

In [11]:
cluster_name = "susumu-cluster"

training_cluster = azureml.core.compute.ComputeTarget(ws, cluster_name)
print('Found existing cluster, use it.')
training_cluster.wait_for_completion(show_output=True)


Found existing cluster, use it.
Succeeded
AmlCompute wait for completion finished

Minimum number of nodes requested have been provisioned


## Run a *Hyperdrive* Experiment

Azure Machine Learning includes a hyperparameter tuning capability through *Hyperdrive* experiments. These experiments launch multiple child runs, each with a different hyperparameter combination. The run producing the best model (as determined by the logged target performance metric for which you want to optimize) can be identified, and its trained model selected for registration and deployment.
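
To make "a different hyperparameter combination" concrete, grid sampling can be pictured as the Cartesian product of every list of choices. The sketch below is plain Python, not the Hyperdrive API: `expand_grid` is a hypothetical helper, and the `--solver` values are an illustrative second hyperparameter that this lab does not tune.

```python
from itertools import product


def expand_grid(space):
    """Return one dict per combination of the choices in `space`."""
    names = list(space)
    return [dict(zip(names, values))
            for values in product(*(space[name] for name in names))]


# The six regularization rates this lab searches over.
single = expand_grid({'--regularization': [0.001, 0.005, 0.01, 0.05, 0.1, 1.0]})
print(len(single), 'combinations')

# Hypothetical second hyperparameter, for illustration only.
double = expand_grid({
    '--regularization': [0.001, 0.005, 0.01, 0.05, 0.1, 1.0],
    '--solver': ['liblinear', 'lbfgs'],
})
print(len(double), 'combinations, e.g.', double[0])
```

With one hyperparameter and six choices there are six combinations, which is why `max_total_runs=6` covers the whole grid in this lab. Each additional hyperparameter multiplies the number of child runs.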

In [13]:
params = hyperdrive.GridParameterSampling(
    {
        '--regularization': hyperdrive.choice([
            0.001, 0.005, 0.01, 0.05, 0.1, 1.0
        ])
    }
)


diabetes_ds = ws.datasets.get('diabetes dataset')

hyper_estimator = sklearn.SKLearn(
    experiment_folder,
    inputs=[diabetes_ds.as_named_input('diabetes')],
    pip_packages=['azureml-sdk'],
    entry_script='diabetes_training.py',
    compute_target=training_cluster,
)

config = hyperdrive.HyperDriveConfig(
    hyperparameter_sampling=params,
    primary_metric_name='AUC',
    primary_metric_goal=hyperdrive.PrimaryMetricGoal.MAXIMIZE,
    max_total_runs=6,
    estimator=hyper_estimator,
)

experiment = azureml.core.Experiment(ws, 'diabetes_training_hyperdrive')
run = experiment.submit(config)

widgets.RunDetails(run).show()
run.wait_for_completion()


_HyperDriveWidget(widget_settings={'childWidgetDisplay': 'popup', 'send_telemetry': False, 'log_level': 'INFO'…

{'runId': 'HD_6dbe371e-486c-4b49-b64f-d7dcee1ca0d6',
 'target': 'susumu-cluster',
 'status': 'Completed',
 'startTimeUtc': '2020-08-18T19:05:04.275508Z',
 'endTimeUtc': '2020-08-18T19:23:39.377112Z',
 'properties': {'primary_metric_config': '{"name": "AUC", "goal": "maximize"}',
  'resume_from': 'null',
  'runTemplate': 'HyperDrive',
  'azureml.runsource': 'hyperdrive',
  'platform': 'AML',
  'ContentSnapshotId': '670349e7-0671-470a-b484-31aff5d42644',
  'score': '0.856969468262725',
  'best_child_run_id': 'HD_6dbe371e-486c-4b49-b64f-d7dcee1ca0d6_5',
  'best_metric_status': 'Succeeded'},
 'inputDatasets': [],
 'outputDatasets': [],
 'logFiles': {'azureml-logs/hyperdrive.txt': 'https://workspace9901294163.blob.core.windows.net/azureml/ExperimentRun/dcid.HD_6dbe371e-486c-4b49-b64f-d7dcee1ca0d6/azureml-logs/hyperdrive.txt?sv=2019-02-02&sr=b&sig=X5C87suxshkgGFgk%2B8W2hH1hXYSbUJQHy%2B1Fdkcw0uQ%3D&st=2020-08-18T19%3A13%3A43Z&se=2020-08-19T03%3A23%3A43Z&sp=r'}}

You can view the experiment run status in the widget above. You can also view the main Hyperdrive experiment run and its child runs in [Azure Machine Learning studio](https://ml.azure.com).

> **Note**: The widget may not refresh. You'll see summary information displayed below the widget when the run has completed.

## Determine the Best Performing Run

When all of the runs have finished, you can find the best one based on the performance metric you specified (in this case, the one with the best AUC).

In [14]:
for child_run in run.get_children_sorted_by_primary_metric():
    print(child_run)

best_run = run.get_best_run_by_primary_metric()
best_run_metrics = best_run.get_metrics()
parameter_values = best_run.get_details()['runDefinition']['arguments']

print('Best Run Id: ', best_run.id)
print(' -AUC:', best_run_metrics['AUC'])
print(' -Accuracy:', best_run_metrics['Accuracy'])
print(' -Regularization Rate:', parameter_values)

{'run_id': 'HD_6dbe371e-486c-4b49-b64f-d7dcee1ca0d6_5', 'hyperparameters': '{"--regularization": 1.0}', 'best_primary_metric': 0.856969468262725, 'status': 'Completed'}
{'run_id': 'HD_6dbe371e-486c-4b49-b64f-d7dcee1ca0d6_4', 'hyperparameters': '{"--regularization": 0.1}', 'best_primary_metric': 0.8568613016622707, 'status': 'Completed'}
{'run_id': 'HD_6dbe371e-486c-4b49-b64f-d7dcee1ca0d6_1', 'hyperparameters': '{"--regularization": 0.005}', 'best_primary_metric': 0.8568570988700241, 'status': 'Completed'}
{'run_id': 'HD_6dbe371e-486c-4b49-b64f-d7dcee1ca0d6_3', 'hyperparameters': '{"--regularization": 0.05}', 'best_primary_metric': 0.8568436056949162, 'status': 'Completed'}
{'run_id': 'HD_6dbe371e-486c-4b49-b64f-d7dcee1ca0d6_2', 'hyperparameters': '{"--regularization": 0.01}', 'best_primary_metric': 0.8568309973181761, 'status': 'Completed'}
{'run_id': 'HD_6dbe371e-486c-4b49-b64f-d7dcee1ca0d6_0', 'hyperparameters': '{"--regularization": 0.001}', 'best_primary_metric': 0.8568283429230729
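
Each entry returned by `get_children_sorted_by_primary_metric()` is a dictionary like those printed above, with the hyperparameters stored as a JSON string. As a minimal sketch, assuming that shape, the snippet below decodes the string and picks the best entry for a `maximize` goal. The three records are copied from the output above, and `best_child` is a hypothetical helper rather than part of the SDK.

```python
import json

# Three of the child-run records printed above, copied verbatim.
children = [
    {'run_id': 'HD_6dbe371e-486c-4b49-b64f-d7dcee1ca0d6_4',
     'hyperparameters': '{"--regularization": 0.1}',
     'best_primary_metric': 0.8568613016622707, 'status': 'Completed'},
    {'run_id': 'HD_6dbe371e-486c-4b49-b64f-d7dcee1ca0d6_5',
     'hyperparameters': '{"--regularization": 1.0}',
     'best_primary_metric': 0.856969468262725, 'status': 'Completed'},
    {'run_id': 'HD_6dbe371e-486c-4b49-b64f-d7dcee1ca0d6_1',
     'hyperparameters': '{"--regularization": 0.005}',
     'best_primary_metric': 0.8568570988700241, 'status': 'Completed'},
]


def best_child(records, maximize=True):
    """Return (run_id, hyperparameters dict, metric) for the best completed run."""
    completed = [r for r in records if r['status'] == 'Completed']
    pick = max if maximize else min
    best = pick(completed, key=lambda r: r['best_primary_metric'])
    return (best['run_id'], json.loads(best['hyperparameters']),
            best['best_primary_metric'])


run_id, hyperparameters, auc = best_child(children)
print(run_id, hyperparameters, auc)
```

Decoding the JSON string gives the regularization rate as a number, which is easier to reuse than the raw argument list.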

Now that you've found the best run, you can register the model it trained.

In [15]:
best_run.register_model(
    model_path='outputs/diabetes_model.pkl', model_name='diabetes_model',
    tags={'Training context':'Hyperdrive'},
    properties={
        'AUC': best_run_metrics['AUC'], 
        'Accuracy': best_run_metrics['Accuracy']
    },
)

for model in azureml.core.Model.list(ws):
    print(model.name, 'version:', model.version)
    for tag_name, tag in model.tags.items():
        print(f'\t{tag_name}: {tag}')
    for prop_name, prop in model.properties.items():
        print(f'\t{prop_name}: {prop}')
    print('\n')


diabetes_model version: 11
	Training context: Hyperdrive
	AUC: 0.856969468262725
	Accuracy: 0.7891111111111111


diabetes_model version: 10
	Training context: Inline Training
	AUC: 0.8770884123588237
	Accuracy: 0.8893333333333333


diabetes_model version: 9
	Training context: Inline Training
	AUC: 0.8778421812030448
	Accuracy: 0.8903333333333333


diabetes_model version: 8
	Training context: Pipeline


diabetes_model version: 7
	Training context: Pipeline


diabetes_model version: 6
	Training context: Pipeline


diabetes_model version: 5
	Training context: Parameterized SKLearn Estimator
	AUC: 0.8483904671874223
	Accuracy: 0.7736666666666666


diabetes_model version: 4
	Training context: Parameterized SKLearn Estimator
	AUC: 0.8483904671874223
	Accuracy: 0.7736666666666666


diabetes_model version: 3
	Training context: Estimator
	AUC: 0.8484929598487486
	Accuracy: 0.774


diabetes_model version: 2
	Training context: Estimator
	AUC: 0.8483377282451863
	Accuracy: 0.774


diabetes_model v

> **More Information**: For more information about Hyperdrive, see the [Azure ML documentation](https://docs.microsoft.com/azure/machine-learning/how-to-tune-hyperparameters).