# Hyperparameter Tuning using HyperDrive

In [1]:
from azureml.core import Workspace, Experiment, ScriptRunConfig, Environment
from azureml.widgets import RunDetails
from azureml.train.hyperdrive.run import PrimaryMetricGoal
from azureml.train.hyperdrive.policy import BanditPolicy
from azureml.train.hyperdrive.sampling import RandomParameterSampling
from azureml.train.hyperdrive.runconfig import HyperDriveConfig
from azureml.train.hyperdrive.parameter_expressions import uniform, choice
import os

## Dataset

The Car Evaluation dataset comes from the UCI Machine Learning Repository. It describes cars by six categorical attributes covering price and technical characteristics, and each car is assigned to one of four acceptability classes.
<br>The attributes are:
<br>1. Cost of buying the car (low, med, high, vhigh)
<br>2. Maintenance cost of the car (low, med, high, vhigh)
<br>3. Number of doors (2, 3, 4, 5more)
<br>4. Number of passengers the car can accommodate (2, 4, more)
<br>5. Luggage space in the car (small, med, big)
<br>6. Safety of the car (low, med, high)
<br>The target class has 4 possible values: unacc, acc, good, vgood (ordered from the worst category to the best).
<br>The dataset can be accessed from the following link: https://archive.ics.uci.edu/ml/datasets/Car+Evaluation
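
Because all six attributes are categorical, a training script has to encode them numerically before fitting a model. The following is a minimal sketch of one-hot encoding, assuming the comma-separated column order documented by UCI. The sample rows, the short column names, and the helper `one_hot` are illustrative only, and the actual `train.py` may encode the data differently.

```python
import csv
import io

# Category levels per attribute, in the assumed column order of the UCI file.
FEATURE_LEVELS = {
    "buying": ["low", "med", "high", "vhigh"],
    "maint": ["low", "med", "high", "vhigh"],
    "doors": ["2", "3", "4", "5more"],
    "persons": ["2", "4", "more"],
    "lug_boot": ["small", "med", "big"],
    "safety": ["low", "med", "high"],
}

# Illustrative rows in the dataset's layout: six attributes, then the class label.
SAMPLE = """vhigh,vhigh,2,2,small,low,unacc
low,med,4,more,big,high,vgood
"""


def one_hot(row):
    """Expand one raw row into a flat 0/1 feature vector plus its label."""
    *values, label = row
    vector = []
    for levels, value in zip(FEATURE_LEVELS.values(), values):
        vector.extend(1 if value == level else 0 for level in levels)
    return vector, label


rows = list(csv.reader(io.StringIO(SAMPLE)))
encoded = [one_hot(row) for row in rows]
n_columns = len(encoded[0][0])
print(n_columns)  # 4 + 4 + 4 + 3 + 3 + 3 one-hot columns
```

Under this encoding the six raw attributes expand to more than six numeric columns, which is one way a value such as 10 or 15 for `--max_features` can be valid for the classifier.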

In [2]:
ws = Workspace.from_config()
experiment_name = 'Car_Evaluation_Hyperdrive'

experiment=Experiment(ws, experiment_name)

run = experiment.start_logging()

Performing interactive authentication. Please follow the instructions on the terminal.
To sign in, use a web browser to open the page https://microsoft.com/devicelogin and enter the code ESE8RESEE to authenticate.
You have logged in. Now let us find all the subscriptions to which you have access...
Interactive authentication successfully completed.


In [3]:
from azureml.core.compute import ComputeTarget, AmlCompute
from azureml.core.compute_target import ComputeTargetException

# Create the compute cluster, or reuse it if it already exists.
# vm_size is "Standard_D2_V2" and max_nodes is no greater than 4.
cluster_name = "project-cluster"

try:
    # Look up the same name that is used for creation below.
    cpu_cluster = ComputeTarget(workspace=ws, name=cluster_name)
    print('Found existing cluster, use it.')
except ComputeTargetException:
    compute_config = AmlCompute.provisioning_configuration(vm_size='STANDARD_D2_V2', max_nodes=4)
    cpu_cluster = ComputeTarget.create(ws, cluster_name, compute_config)

cpu_cluster.wait_for_completion(show_output=True)

Creating
Succeeded
AmlCompute wait for completion finished

Minimum number of nodes requested have been provisioned


In [4]:
%%writefile conda_dependencies.yml
 dependencies:
 - python=3.6.2
 - scikit-learn
 - pip:
    - azureml-defaults

Writing conda_dependencies.yml


## HyperDrive Configuration

We use a Random Forest classifier because this is a classification problem. A Random Forest is an ensemble learning method that combines the predictions of multiple decision trees, and it is widely used for both classification and regression problems.

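The early termination policy (Bandit Policy) compares runs against the best-performing run so far and cancels any run whose primary metric falls outside the allowed slack. Here the policy is evaluated at every second interval (`evaluation_interval=2`) with a slack factor of 0.1. Stopping unpromising runs early saves compute resources while leaving the best-performing runs untouched.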
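
The slack-factor rule can be illustrated with a short sketch. For a metric that is maximized, Azure ML documents the cut-off as the best metric divided by (1 + slack factor); runs below that value are cancelled. The function name `bandit_should_terminate` and the accuracy values are hypothetical and are not part of the SDK or of this experiment's results.

```python
def bandit_should_terminate(run_metric, best_metric, slack_factor):
    """Return True if a run falls outside the allowed slack (maximize goal)."""
    threshold = best_metric / (1 + slack_factor)
    return run_metric < threshold


best_so_far = 0.98   # hypothetical best accuracy at this evaluation interval
slack_factor = 0.1   # same value as the notebook's BanditPolicy

threshold = best_so_far / (1 + slack_factor)
print(round(threshold, 4))
print(bandit_should_terminate(0.85, best_so_far, slack_factor))
print(bandit_should_terminate(0.95, best_so_far, slack_factor))
```

With a slack factor of 0.1, a run has to stay within roughly 91% of the best accuracy reported so far to keep running.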

In the HyperDrive configuration, we use random parameter sampling, which automates the process of trying out different combinations of hyperparameter values by drawing each value at random from its search space. In this case, we tune 4 hyperparameters:
(1) the number of estimators, (2) the maximum depth of the decision trees, (3) the criterion used to measure the quality of a split, and (4) the maximum number of features considered when making a split.
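
To get a feel for how much of the search space a random sampler covers, the sketch below enumerates the full grid and draws random configurations from it using only the standard library. It uses the same discrete choices that the notebook passes to `RandomParameterSampling`; the helper `sample_configuration` and the fixed seed are assumptions made for illustration and do not reproduce HyperDrive's internal sampler.

```python
import itertools
import random

# The same discrete choices the notebook defines for its four hyperparameters.
search_space = {
    "--n_estimators": [50, 100, 150, 200, 250],
    "--max_depth": [10, 20, 30, 40, 50],
    "--criterion": ["gini", "entropy"],
    "--max_features": [5, 10, 15],
}

# Every possible combination, for comparison with the sampling budget.
grid = list(itertools.product(*search_space.values()))
print(len(grid))  # 5 * 5 * 2 * 3 combinations


def sample_configuration(rng):
    """Draw one value per hyperparameter, uniformly and independently."""
    return {name: rng.choice(values) for name, values in search_space.items()}


rng = random.Random(0)  # fixed seed only to make the sketch repeatable
trials = [sample_configuration(rng) for _ in range(20)]
print(trials[0])
```

A budget of 20 runs therefore explores only a fraction of the 150 possible combinations, which is the trade-off random sampling makes in exchange for lower cost.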

The primary metric used to measure the performance of the model is Accuracy. While accuracy is a standard metric, other metrics such as the AUC score could have been used instead. To limit resource consumption, we cap the experiment at 20 total runs with at most 5 runs executing concurrently. These values can be increased if more resources are available.
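
The notebook does not show `train.py`, but the training script has to accept the sampled hyperparameters as command-line arguments and log a metric whose name matches the primary metric name exactly. A minimal sketch of the argument handling follows; the default values and the function name `parse_args` are assumptions, not the author's actual script.

```python
import argparse


def parse_args(argv=None):
    """Parse the hyperparameters HyperDrive appends to the script command."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--n_estimators", type=int, default=100)
    parser.add_argument("--max_depth", type=int, default=None)
    parser.add_argument("--criterion", type=str, default="gini")
    parser.add_argument("--max_features", type=int, default=None)
    return parser.parse_args(argv)


# Arguments in the form HyperDrive passes them: each flag followed by its value.
args = parse_args(["--criterion", "entropy", "--max_depth", "40",
                   "--max_features", "10", "--n_estimators", "250"])
print(args.n_estimators, args.max_depth, args.criterion, args.max_features)

# Inside an Azure ML run, the script would then report the metric under the
# exact name given as primary_metric_name, for example:
#     from azureml.core.run import Run
#     Run.get_context().log("Accuracy", accuracy)
```

If the logged name differs from the configured primary metric name, even only in capitalization, HyperDrive has no values to rank the runs by.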


In [5]:
env = Environment.from_conda_specification(name="sklearn-env", file_path="./conda_dependencies.yml")

# TODO: Create an early termination policy. This is not required if you are using Bayesian sampling.
early_termination_policy = BanditPolicy(evaluation_interval=2, slack_factor=0.1)

#TODO: Create the different params that you will be using during training
param_sampling = RandomParameterSampling(
    {
        '--n_estimators': choice(50,100,150,200,250),
        '--max_depth': choice(10,20,30,40,50),
        '--criterion': choice("gini","entropy"),
        '--max_features': choice(5,10,15)
    }
)

if "training" not in os.listdir():
    os.mkdir("./training")

#TODO: Create your estimator and hyperdrive config
estimator = ScriptRunConfig(source_directory = ".", compute_target = cpu_cluster, script = 'train.py', environment = env)

hyperdrive_run_config = HyperDriveConfig(run_config = estimator, 
                             hyperparameter_sampling = param_sampling,
                             policy = early_termination_policy,
                             primary_metric_name = 'Accuracy', 
                             primary_metric_goal = PrimaryMetricGoal.MAXIMIZE, 
                             max_total_runs = 20,
                             max_concurrent_runs = 5)

In [6]:
#TODO: Submit your experiment

hyperdrive_run = experiment.submit(hyperdrive_run_config)

## Run Details

Here we monitor the model training process using the RunDetails widget, which displays the training logs in near real time.

In [7]:
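RunDetails(hyperdrive_run).show()

# Note: `run` is the interactive run created by experiment.start_logging() above,
# not the HyperDrive run. To block on the sweep itself, call
# hyperdrive_run.wait_for_completion(show_output=True) instead.
run.wait_for_completion(show_output=True)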

_HyperDriveWidget(widget_settings={'childWidgetDisplay': 'popup', 'send_telemetry': False, 'log_level': 'INFO'…

RunId: 7f4492ac-7048-4ff7-bbc9-0a8f6f66ddd9
Web View: https://ml.azure.com/experiments/Car_Evaluation_Hyperdrive/runs/7f4492ac-7048-4ff7-bbc9-0a8f6f66ddd9?wsid=/subscriptions/3d1a56d2-7c81-4118-9790-f85d1acf0c77/resourcegroups/aml-quickstarts-136623/workspaces/quick-starts-ws-136623

Execution Summary
RunId: 7f4492ac-7048-4ff7-bbc9-0a8f6f66ddd9
Web View: https://ml.azure.com/experiments/Car_Evaluation_Hyperdrive/runs/7f4492ac-7048-4ff7-bbc9-0a8f6f66ddd9?wsid=/subscriptions/3d1a56d2-7c81-4118-9790-f85d1acf0c77/resourcegroups/aml-quickstarts-136623/workspaces/quick-starts-ws-136623



{'runId': '7f4492ac-7048-4ff7-bbc9-0a8f6f66ddd9',
 'target': 'local',
 'status': 'Canceled',
 'startTimeUtc': '2021-01-30T17:01:31.757601Z',
 'endTimeUtc': '2021-01-30T17:22:39.964Z',
 'properties': {'ContentSnapshotId': '928c31f6-0bfd-4259-91d3-047cfed4bc84'},
 'inputDatasets': [],
 'outputDatasets': [],
 'logFiles': {},
 'submittedBy': 'ODL_User 136623'}

## Best Model

Once the training process is complete, we retrieve the best-performing run, i.e. the one with the highest value of our primary metric, Accuracy.
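
Conceptually, picking the best run is a maximum over the primary metric logged by each child run. The sketch below shows that selection on plain dictionaries; the run names and accuracy values are made up for illustration, and `best_run_by_metric` is a hypothetical helper rather than an SDK function.

```python
# Hypothetical child-run results; the run names and accuracies are made up.
child_runs = {
    "child_run_a": {"Accuracy": 0.91},
    "child_run_b": {"Accuracy": 0.97},
    "child_run_c": {"Accuracy": 0.94},
}


def best_run_by_metric(runs, metric_name, maximize=True):
    """Return the (run_id, metrics) pair with the best value of metric_name."""
    # Runs that never logged the metric cannot be ranked, so skip them.
    scored = {rid: m for rid, m in runs.items() if metric_name in m}
    pick = max if maximize else min
    best_id = pick(scored, key=lambda rid: scored[rid][metric_name])
    return best_id, scored[best_id]


best_id, best_metrics = best_run_by_metric(child_runs, "Accuracy")
print(best_id, best_metrics["Accuracy"])
```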

In [8]:
import joblib
# Get your best run and save the model from that run.

best_run = hyperdrive_run.get_best_run_by_primary_metric()
best_run_metrics = best_run.get_metrics()
print("Accuracy: {}".format(best_run_metrics['Accuracy']))

Accuracy: 0.9807321772639692


In [9]:
best_run.get_details()

{'runId': 'HD_0410efed-f233-41c4-a7ed-0f02440ccf63_1',
 'target': 'project-cluster',
 'status': 'Completed',
 'startTimeUtc': '2021-01-30T17:10:23.5897Z',
 'endTimeUtc': '2021-01-30T17:12:54.953118Z',
 'properties': {'_azureml.ComputeTargetType': 'amlcompute',
  'ContentSnapshotId': 'a0847bbe-abec-4b60-8e63-56e797a91ac1',
  'ProcessInfoFile': 'azureml-logs/process_info.json',
  'ProcessStatusFile': 'azureml-logs/process_status.json'},
 'inputDatasets': [],
 'outputDatasets': [],
 'runDefinition': {'script': 'train.py',
  'command': '',
  'useAbsolutePath': False,
  'arguments': ['--criterion',
   'entropy',
   '--max_depth',
   '40',
   '--max_features',
   '10',
   '--n_estimators',
   '250'],
  'sourceDirectoryDataStore': None,
  'framework': 'Python',
  'communicator': 'None',
  'target': 'project-cluster',
  'dataReferences': {},
  'data': {},
  'outputData': {},
  'jobName': None,
  'maxRunDurationSeconds': 2592000,
  'nodeCount': 1,
  'priority': None,
  'credentialPassthrough': 

In [10]:
#TODO: Save the best model
best_run.register_model("car-evaluation-hyperdrive-best-model","outputs/capstone-hyperdrive-model.joblib")

Model(workspace=Workspace.create(name='quick-starts-ws-136623', subscription_id='3d1a56d2-7c81-4118-9790-f85d1acf0c77', resource_group='aml-quickstarts-136623'), name=car-evaluation-hyperdrive-best-model, id=car-evaluation-hyperdrive-best-model:1, version=1, tags={}, properties={})

## Model Deployment

Because the Random Forest model achieved a lower accuracy than the model produced by the Azure AutoML run, we do not deploy this model.