# Hyperparameter Tuning using Hyperdrive

All relevant steps for hyperparameter tuning with Hyperdrive are implemented as Python functions in the file `functions.py`, which keeps this notebook uncluttered. The scikit-learn and Azure dependencies are imported there as well.

In [1]:
import azureml.core
from functions import get_workspace, get_data, get_compute_cluster, get_hyperd_environment, run_hyperd, \
    show_and_test_local_hyperd_model, deploy_hyperd_model, test_deployed_hyperd_model

# Check core SDK version number
print("SDK version:", azureml.core.VERSION)

SDK version: 1.42.0


In [2]:
ws = get_workspace()
print(ws.name, ws.resource_group, ws.location, ws.subscription_id, sep = '\n')

quick-starts-ws-199863
aml-quickstarts-199863
southcentralus
48a74bb7-9950-4cc1-9caa-5d50f995cc55


## Dataset

This notebook uses the Adult dataset from the UCI Machine Learning Repository. The task is to predict a person's income class (over 50k or not) from that person's features. For a more thorough description of the dataset, see the `README.md` file. The function `get_data()` called below downloads the training and test data, preprocesses it (the suffix `hyperd` indicates that categorical features are encoded as integers during preprocessing) and stores the data in a blob store of the current workspace. For details, see the comments in the file `functions.py`.

In [3]:
_, test_ds = get_data(suffix='hyperd')

Loading datasets from workspace ...
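
The integer encoding mentioned above can be illustrated with a minimal, standard-library sketch. This is not the code from `functions.py`; the helper names `fit_integer_encoding` and `encode_column` and the toy values are hypothetical and only show the idea of replacing each category by an integer code.

```python
def fit_integer_encoding(values):
    """Map each distinct category to an integer code (codes follow sorted order)."""
    return {value: code for code, value in enumerate(sorted(set(values)))}


def encode_column(values, mapping):
    """Replace every categorical value by its integer code."""
    return [mapping[value] for value in values]


# Toy column in the spirit of the Adult dataset (illustrative values only)
workclass = ["Private", "State-gov", "Private", "Self-emp-inc"]

mapping = fit_integer_encoding(workclass)
encoded = encode_column(workclass, mapping)

print(mapping)  # {'Private': 0, 'Self-emp-inc': 1, 'State-gov': 2}
print(encoded)  # [0, 2, 0, 1]
```

In a sketch like this, the mapping would be fitted on the training data and reused unchanged for the test data, so that the same category always receives the same code. A category that only occurs in the test data raises a `KeyError` here and would need explicit handling.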


## Hyperdrive Run

The `run_hyperd()` function starts the hyperparameter tuning run based on the provided configuration settings. The function also stores the best random forest model found under `./outputs/best_model_hyperdrive.pkl`. For details and the reasoning behind the parameter choices, see the `README.md` file. For details about how the Hyperdrive run is started, see the file `functions.py`.
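
The configuration used in this section combines random parameter sampling with a Bandit early-termination policy. The following standard-library sketch mimics both ideas locally, without Azure. The helper names are hypothetical, and the termination rule is the documented one for a maximized metric: a run is stopped when its best metric falls below the best metric of all runs divided by `1 + slack_factor`. The accuracy values are made up for illustration.

```python
import random


def sample_hyperparameters(rng):
    """Draw one configuration from the search space used in this notebook."""
    return {
        "n_estimators": rng.choice(range(2, 100)),
        "max_depth": rng.choice(range(2, 10)),
        "max_features": rng.choice(range(1, 14)),
        "min_samples_leaf": rng.uniform(0.01, 0.1),
    }


def bandit_cutoff(best_metric, slack_factor):
    """Lowest metric value a run may report without being terminated."""
    return best_metric / (1.0 + slack_factor)


def should_terminate(run_metric, best_metric, slack_factor):
    """Apply the Bandit rule for a metric that is to be maximized."""
    return run_metric < bandit_cutoff(best_metric, slack_factor)


rng = random.Random(0)  # fixed seed only to make the sketch repeatable
print(sample_hyperparameters(rng))

best_accuracy = 0.86  # illustrative value, not a result from this run
print(round(bandit_cutoff(best_accuracy, 0.1), 4))   # 0.7818
print(should_terminate(0.75, best_accuracy, 0.1))    # True
print(should_terminate(0.80, best_accuracy, 0.1))    # False
```

With `slack_factor=0.1`, a run therefore has to stay within roughly 9% (relative) of the best accuracy seen so far to keep running.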

In [4]:
from azureml.core import ScriptRunConfig
from azureml.train.hyperdrive.run import PrimaryMetricGoal
from azureml.train.hyperdrive.policy import BanditPolicy
from azureml.train.hyperdrive.sampling import RandomParameterSampling
from azureml.train.hyperdrive.runconfig import HyperDriveConfig
from azureml.train.hyperdrive.parameter_expressions import choice, uniform
    
# Setup a random parameter sampling for random forest models
ps = RandomParameterSampling({
    '--n_estimators': choice(range(2, 100)),  # number of decision trees in the forest
    '--max_depth': choice(range(2, 10)),      # maximum depth of each decision tree
    '--max_features': choice(range(1, 14)),   # maximum number of features considered when looking for the best split
    '--min_samples_leaf': uniform(0.01, 0.1)  # minimum fraction of samples per leaf
    })

# Choose a bandit policy for early stopping
policy = BanditPolicy(evaluation_interval=2, slack_factor=0.1) # apply the policy at every second evaluation interval,
                                                               # stop runs whose best accuracy is below
                                                               # (best accuracy of all runs) / (1 + slack_factor)
        
# Run functions.py as the training script; its main() function trains one model
src = ScriptRunConfig(
    source_directory=".",
    script="functions.py",
    compute_target=get_compute_cluster(),
    environment=get_hyperd_environment()
    )
        
# Setup a hyperdrive config
hd_config = HyperDriveConfig(
    run_config=src,
    hyperparameter_sampling=ps,
    policy=policy,
    primary_metric_name='accuracy',  # choose accuracy as the primary metric for easier comparison with published results
    primary_metric_goal=PrimaryMetricGoal.MAXIMIZE, # accuracy should be maximized
    max_total_runs=100, # try 100 different hyperparameter combinations in total
    max_concurrent_runs=3
    )

run_hyperd(hd_config)

Found existing cluster, use it.


_HyperDriveWidget(widget_settings={'childWidgetDisplay': 'popup', 'send_telemetry': False, 'log_level': 'INFO'…

RunId: HD_cf7be3e3-d6c5-4699-82ed-4527fb0c6967
Web View: https://ml.azure.com/runs/HD_cf7be3e3-d6c5-4699-82ed-4527fb0c6967?wsid=/subscriptions/48a74bb7-9950-4cc1-9caa-5d50f995cc55/resourcegroups/aml-quickstarts-199863/workspaces/quick-starts-ws-199863&tid=660b3398-b80e-49d2-bc5b-ac1dc93b5254

Streaming azureml-logs/hyperdrive.txt

"<START>[2022-06-30T12:34:40.884979][API][INFO]Experiment created<END>\n""<START>[2022-06-30T12:34:41.544867][GENERATOR][INFO]Trying to sample '3' jobs from the hyperparameter space<END>\n"<START>[2022-06-30T12:34:42.3352423Z][SCHEDULER][INFO]Scheduling job, id='HD_cf7be3e3-d6c5-4699-82ed-4527fb0c6967_0'<END><START>[2022-06-30T12:34:42.4649286Z][SCHEDULER][INFO]Scheduling job, id='HD_cf7be3e3-d6c5-4699-82ed-4527fb0c6967_1'<END>"<START>[2022-06-30T12:34:42.545547][GENERATOR][INFO]Successfully sampled '3' jobs, they will soon be submitted to the execution target.<END>\n"<START>[2022-06-30T12:34:42.6128518Z][SCHEDULER][INFO]Scheduling job, id='HD_cf7be3e3-d6c5-4

## Best Model

The function `show_and_test_local_hyperd_model` loads the best model from the Hyperdrive run, prints its properties and assesses performance on an independent test set. For details, see the file `functions.py`.

In [None]:
show_and_test_local_hyperd_model(test_ds)
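
Since accuracy is the primary metric of the Hyperdrive run, the assessment on the test set boils down to comparing predicted and true labels. A minimal sketch of that computation follows; the function `accuracy` and the toy labels are illustrative and not taken from `functions.py`, and the encoding 1 = income over 50k is an assumption.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("at least one sample is required")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


# Toy labels (assumed encoding: 1 = income over 50k, 0 = otherwise)
y_true = [0, 1, 0, 0, 1]
y_pred = [0, 1, 1, 0, 1]

print(accuracy(y_true, y_pred))  # 0.8 (4 of 5 predictions are correct)
```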

## Model Deployment and Test

The best model found by Hyperdrive is first registered in the workspace with `Model.register()`. The `deploy_hyperd_model()` function then deploys the registered model on a compute instance. For more details, see the file `functions.py`.

In [None]:
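from azureml.core.model import Model

# Path under which run_hyperd() stored the best model (see section "Hyperdrive Run")
HYPERDRIVE_MODEL_PATH = './outputs/best_model_hyperdrive.pkl'

# Register model
model = Model.register(ws,
    model_name='adult-hyperd-model',
    description='Model for the Adult dataset from UCI machine learning repository',
    model_path=HYPERDRIVE_MODEL_PATH)

# Deploy the registered model as a web service
deploy_hyperd_model(model)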

The `test_deployed_hyperd_model()` function takes a row from the test set, encodes it as a JSON string and sends a corresponding HTTP request to the endpoint. Below, we request predictions for the first ten samples of the test set. For more details, see the file `functions.py`.

In [None]:
for row in range(0, 10):
    test_deployed_hyperd_model(test_ds, row)
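
How a single row can be turned into such a request is sketched below with the standard library only. This is not the implementation from `functions.py`: the helper `build_scoring_request`, the scoring URI, the feature names and values, and the payload layout `{"data": [...]}` are all assumptions; the layout the endpoint actually expects is defined by its scoring script. The request is only built here, not sent.

```python
import json
import urllib.request


def build_scoring_request(scoring_uri, row):
    """Encode one test row as JSON and wrap it in an HTTP POST request."""
    payload = json.dumps({"data": [row]}).encode("utf-8")  # assumed payload layout
    return urllib.request.Request(
        scoring_uri,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical endpoint and feature values, for illustration only
scoring_uri = "http://localhost:8890/score"
row = {"age": 39, "workclass": 5, "hours_per_week": 40}

request = build_scoring_request(scoring_uri, row)
print(request.get_method())          # POST
print(request.data.decode("utf-8"))  # JSON body that would be sent
```

Against a live endpoint, the request could then be sent with `urllib.request.urlopen(request)` and the prediction read from the response body.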

Finally, the compute cluster, web service and registered model are deleted from the current workspace. For details, see the file `functions.py`.

In [None]:
from functions import clean_up

clean_up(automl=False)