# Hyperparameter Tuning using HyperDrive

Import Dependencies. In the cell below, import all the dependencies that you will need to complete the project.

In [14]:
import os
import shutil

import azureml.core
from azureml.widgets import RunDetails
from azureml.core.workspace import Workspace
from azureml.core.experiment import Experiment
from azureml.core.compute import ComputeTarget, AmlCompute
from azureml.core.compute_target import ComputeTargetException
from azureml.train.hyperdrive.policy import BanditPolicy
from azureml.train.hyperdrive.sampling import RandomParameterSampling
from azureml.train.hyperdrive.runconfig import HyperDriveConfig
from azureml.train.hyperdrive.parameter_expressions import uniform, choice
from azureml.train.sklearn import SKLearn
from azureml.train.hyperdrive.run import PrimaryMetricGoal

# Check core SDK version number
print("SDK version:", azureml.core.VERSION)

SDK version: 1.33.0


## Initialize Workspace
Initialize a workspace object from the persisted configuration. Make sure the config file is present at `./config.json`.
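For reference, `Workspace.from_config()` reads a small JSON file holding the subscription ID, resource group, and workspace name. The following is a minimal sketch of producing such a file. The helper name `write_workspace_config` and the placeholder values are hypothetical and only for illustration; the file is written to a temporary folder so nothing in the project is overwritten.

```python
import json
import os
import tempfile


def write_workspace_config(folder, subscription_id, resource_group, workspace_name):
    """Write a config.json containing the three workspace identifiers."""
    config = {
        "subscription_id": subscription_id,
        "resource_group": resource_group,
        "workspace_name": workspace_name,
    }
    path = os.path.join(folder, "config.json")
    with open(path, "w") as f:
        json.dump(config, f, indent=4)
    return path


# Placeholder values (assumptions); substitute your own workspace details.
demo_dir = tempfile.mkdtemp()
config_path = write_workspace_config(
    demo_dir, "<subscription-id>", "<resource-group>", "<workspace-name>"
)

# Read the file back to confirm its shape.
with open(config_path) as f:
    loaded_config = json.load(f)
print(loaded_config)
```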

In [15]:
ws = Workspace.from_config()

print('Workspace name: ' + ws.name,
      'Azure region: ' + ws.location,
      'Subscription id: ' + ws.subscription_id,
      'Resource group: ' + ws.resource_group, sep='\n')

Workspace name: workspace-rvl
Azure region: westeurope
Subscription id: b17f1c19-34a2-47b8-a207-40ea477828fc
Resource group: resource-group-rvl


## Create an Azure ML experiment
Let's create an experiment named "hyperparameter_tuning-experiment" and a folder to hold the training scripts. The script runs will be recorded under the experiment in Azure.

In [16]:
experiment_name = 'hyperparameter_tuning-experiment'
project_folder = './hyperparameter-tuning--project'

experiment = Experiment(ws, experiment_name)
experiment

Name,Workspace,Report Page,Docs Page
hyperparameter_tuning-experiment,workspace-rvl,Link to Azure Machine Learning studio,Link to Documentation


In [17]:
run = experiment.start_logging()

## Create or Attach an AmlCompute cluster
We need a compute target for the HyperDrive run. The provisioning configuration uses `vm_size="Standard_D2_V2"` and sets `max_nodes` to 4.

In [18]:
aml_compute_cluster_name = "cpu-cluster"

# Reuse the cluster if it already exists; otherwise create it
try:
    aml_compute = ComputeTarget(workspace=ws, name=aml_compute_cluster_name)
    print("Found existing cluster, use it.")

except ComputeTargetException:
    compute_config = AmlCompute.provisioning_configuration(vm_size="Standard_D2_V2",
                                                           min_nodes=0,
                                                           max_nodes=4)
    aml_compute = ComputeTarget.create(workspace=ws,
                                       name=aml_compute_cluster_name,
                                       provisioning_configuration=compute_config)
    
aml_compute.wait_for_completion(show_output=True)

Found existing cluster, use it.
Succeeded
AmlCompute wait for completion finished

Minimum number of nodes requested have been provisioned


In [19]:
compute_targets = ws.compute_targets

for i, key in enumerate(compute_targets):
    print(f"{i+1}. Compute target\n\tname: {compute_targets[key].name}\n\tType: {compute_targets[key].type}")


1. Compute target
	name: compute-instance-rvl
	Type: ComputeInstance
2. Compute target
	name: cpu-cluster
	Type: AmlCompute


## Dataset

In the cell below, write code to access the data you will be using in this project. Remember that the dataset needs to be external.

The external data is loaded in the `train.py` script. Here is an excerpt from the script:

```python
from azureml.data.dataset_factory import TabularDatasetFactory

# Create a TabularDataset from the external CSV using TabularDatasetFactory
url_path = "https://archive.ics.uci.edu/ml/machine-learning-databases/00519/heart_failure_clinical_records_dataset.csv"
dataset = TabularDatasetFactory.from_delimited_files(path=url_path)
data_df = dataset.to_pandas_dataframe()
```

## Hyperdrive Configuration

The pipeline consists of a custom-coded scikit-learn logistic regression model stored in the `train.py` script and a HyperDrive run sweeping over the model parameters. The following steps are part of the pipeline:

- Data preprocessing 
- Splitting data into train and test sets
- Setting logistic regression parameters:
    - `--C` - Inverse of regularization strength
    - `--max_iter` - Maximum number of iterations to converge
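The two parameters above reach the training script as command-line arguments. A minimal sketch of how a script could read them with `argparse` follows. This is not the actual content of `train.py`; the function name `parse_training_args` and the default values are assumptions chosen for illustration.

```python
import argparse


def parse_training_args(argv=None):
    """Parse the two hyperparameters passed on the command line."""
    parser = argparse.ArgumentParser()
    # Defaults are assumptions for this sketch, not values taken from train.py.
    parser.add_argument("--C", type=float, default=1.0,
                        help="Inverse of regularization strength")
    parser.add_argument("--max_iter", type=int, default=100,
                        help="Maximum number of iterations to converge")
    return parser.parse_args(argv)


# Simulate the arguments of one sampled configuration.
args = parse_training_args(["--C", "0.5", "--max_iter", "25"])
print(args.C, args.max_iter)
```

The parsed values would then typically be passed to the model constructor, for example as the `C` and `max_iter` arguments of scikit-learn's `LogisticRegression`.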

`RandomParameterSampling` defines random sampling over a hyperparameter search space. In this sampling algorithm, parameter values are chosen from a set of discrete values or from a distribution over a continuous range. Its advantage over grid search, which evaluates every combination of parameter values, is that it explores the space with a fixed budget of runs and therefore needs much less time.

For the inverse of regularization strength (`--C`), I chose a uniform distribution with min=0.0001 and max=1.0. For the maximum number of iterations (`--max_iter`), I supplied a discrete set of values (5, 25, 50, 100, 200, 500, 1000).
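To make the difference between random sampling and grid search concrete, here is a small standard-library sketch. It mimics the idea of random sampling with Python's `random` module rather than calling HyperDrive, and it uses the search space from this notebook's HyperDrive configuration cell. The 20-point grid for `C` is an arbitrary assumption made only for comparison.

```python
import itertools
import random

MAX_ITER_CHOICES = [5, 25, 50, 100, 200, 500, 1000]


def sample_configuration(rng):
    """Draw one configuration: continuous C, discrete max_iter."""
    return {
        "--C": rng.uniform(0.0001, 1.0),
        "--max_iter": rng.choice(MAX_ITER_CHOICES),
    }


rng = random.Random(0)  # fixed seed so the sketch is repeatable
random_configs = [sample_configuration(rng) for _ in range(50)]

# A grid must discretize C first; 20 values of C times 7 values of max_iter.
c_grid = [0.05 * k for k in range(1, 21)]
grid_configs = list(itertools.product(c_grid, MAX_ITER_CHOICES))

print(len(random_configs), len(grid_configs))  # 50 140
```

With random sampling the number of runs is set directly by the budget (50 here), while the grid size grows multiplicatively with every parameter that is added or refined.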

`BanditPolicy` defines an early termination policy based on a slack criterion, together with a frequency and delay interval for evaluation. Runs whose primary metric falls outside the slack range of the best run so far are terminated early, so compute is not spent on poorly performing configurations.
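The slack criterion can be illustrated with a few lines of plain Python. This is a simplified sketch of the rule for a metric that is maximized: a run is cancelled when its metric, enlarged by the slack factor, is still below the best metric seen so far. It ignores the evaluation interval and delay, and the function name `bandit_should_terminate` and the accuracy values are hypothetical.

```python
def bandit_should_terminate(run_metric, best_metric, slack_factor):
    """Return True if a run falls outside the slack band (maximization goal)."""
    return run_metric * (1 + slack_factor) < best_metric


best_accuracy = 0.90  # hypothetical best accuracy reported so far
decisions = {
    accuracy: bandit_should_terminate(accuracy, best_accuracy, slack_factor=0.1)
    for accuracy in (0.85, 0.80)
}
print(decisions)
```

With `slack_factor=0.1`, a run at 0.85 stays alive because 0.85 × 1.1 = 0.935 is above 0.90, while a run at 0.80 is cancelled because 0.80 × 1.1 = 0.88 is below it.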


In [26]:
# Create an early termination policy
early_termination_policy = BanditPolicy(evaluation_interval=2, slack_factor=0.1)

# Create the different params that you will be using during training
param_sampling = RandomParameterSampling(
    {
        "--C": uniform(0.0001, 1.0),
        "--max_iter": choice(5, 25, 50, 100, 200, 500, 1000)
    }
)

script_folder = "./training"

# Create the script folder if it does not exist yet
os.makedirs(script_folder, exist_ok=True)
    
shutil.copy('./train.py', script_folder)

# Create your estimator and hyperdrive config
estimator = SKLearn(source_directory=script_folder,
                    compute_target=aml_compute,
                    entry_script="train.py")

# Create a HyperDriveConfig using the estimator, hyperparameter sampler, and policy.
hyperdrive_run_config = HyperDriveConfig(estimator=estimator,
                                         hyperparameter_sampling=param_sampling,
                                         primary_metric_name="Accuracy",
                                         primary_metric_goal=PrimaryMetricGoal.MAXIMIZE,
                                         max_total_runs=50,
                                         max_concurrent_runs=4,
                                         policy=early_termination_policy)



In [27]:
# Submit your experiment
hyperdrive_run = experiment.submit(config=hyperdrive_run_config)



## Run Details

In the cell below, we use the `RunDetails` widget to monitor the HyperDrive run and its child runs.

In [28]:
RunDetails(hyperdrive_run).show()
hyperdrive_run.wait_for_completion(show_output=True)


RunId: HD_7fd07718-c5c5-42f5-ae1f-bd2f62a56380
Web View: https://ml.azure.com/runs/HD_7fd07718-c5c5-42f5-ae1f-bd2f62a56380?wsid=/subscriptions/b17f1c19-34a2-47b8-a207-40ea477828fc/resourcegroups/resource-group-rvl/workspaces/workspace-rvl&tid=0f823349-2c12-431b-a03c-b2c0a43d6fb4

Streaming azureml-logs/hyperdrive.txt

"<START>[2021-08-23T08:43:49.242592][API][INFO]Experiment created<END>\n""<START>[2021-08-23T08:43:49.753258][GENERATOR][INFO]Trying to sample '4' jobs from the hyperparameter space<END>\n""<START>[2021-08-23T08:43:50.015454][GENERATOR][INFO]Successfully sampled '4' jobs, they will soon be submitted to the execution target.<END>\n"

Execution Summary
RunId: HD_7fd07718-c5c5-42f5-ae1f-bd2f62a56380
Web View: https://ml.azure.com/runs/HD_7fd07718-c5c5-42f5-ae1f-bd2f62a56380?wsid=/subscriptions/b17f1c19-34a2-47b8-a207-40ea477828fc/resourcegroups/resource-group-rvl/workspaces/workspace-rvl&tid=0f823349-2c12-431b-a03c-b2c0a43d6fb4



{'runId': 'HD_7fd07718-c5c5-42f5-ae1f-bd2f62a56380',
 'target': 'cpu-cluster',
 'status': 'Completed',
 'startTimeUtc': '2021-08-23T08:43:48.979783Z',
 'endTimeUtc': '2021-08-23T09:32:44.712927Z',
 'properties': {'primary_metric_config': '{"name": "Accuracy", "goal": "maximize"}',
  'resume_from': 'null',
  'runTemplate': 'HyperDrive',
  'azureml.runsource': 'hyperdrive',
  'platform': 'AML',
  'ContentSnapshotId': '80240a8f-8685-407e-aded-545908fb3f27',
  'user_agent': 'python/3.6.9 (Linux-5.4.0-1055-azure-x86_64-with-debian-buster-sid) msrest/0.6.21 Hyperdrive.Service/1.0.0 Hyperdrive.SDK/core.1.33.0',
  'score': '0.9166666666666666',
  'best_child_run_id': 'HD_7fd07718-c5c5-42f5-ae1f-bd2f62a56380_29',
  'best_metric_status': 'Succeeded'},
 'inputDatasets': [],
 'outputDatasets': [],
 'logFiles': {'azureml-logs/hyperdrive.txt': 'https://workspacervl4370776545.blob.core.windows.net/azureml/ExperimentRun/dcid.HD_7fd07718-c5c5-42f5-ae1f-bd2f62a56380/azureml-logs/hyperdrive.txt?sv=2019-0

## Best Model

In the cell below, we get the best run from the HyperDrive experiment and display its metrics and parameters.

In [37]:
best_run = hyperdrive_run.get_best_run_by_primary_metric()
best_run_metrics = best_run.get_metrics()
parameter_values = best_run.get_details()['runDefinition']['arguments']

print("Best Experiment Run:")
print(f" Best Run Id: {best_run.id}")
print(f" Accuracy: {best_run_metrics['Accuracy']}")
print(f" Regularization Strength: {best_run_metrics['Regularization Strength:']}")
print(f" Max iterations: {best_run_metrics['Max iterations:']}")


Best Experiment Run:
 Best Run Id: HD_7fd07718-c5c5-42f5-ae1f-bd2f62a56380_29
 Accuracy: 0.9166666666666666
 Regularization Strength: 0.5443967516166649
 Max iterations: 25
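The `parameter_values` variable retrieved above holds the command-line arguments of the best run. Assuming it is a flat list that alternates flags and values (an assumption about its shape, not something shown in the output), it could be turned into a dictionary as sketched below. The helper name `arguments_to_dict` and the example list are hypothetical.

```python
def arguments_to_dict(arguments):
    """Pair up a flat ['--flag', 'value', ...] list into {'--flag': 'value'}."""
    return dict(zip(arguments[0::2], arguments[1::2]))


# Hypothetical argument list, shaped like the values printed above.
example_arguments = ["--C", "0.5443967516166649", "--max_iter", "25"]
best_params = arguments_to_dict(example_arguments)
print(best_params)
```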


In [42]:
best_run

Experiment,Id,Type,Status,Details Page,Docs Page
hyperparameter_tuning-experiment,HD_7fd07718-c5c5-42f5-ae1f-bd2f62a56380_29,azureml.scriptrun,Completed,Link to Azure Machine Learning studio,Link to Documentation


Register the best model and download it.

In [47]:
best_model = best_run.register_model(model_name='hyperdrive_model.pkl',
                                     model_path='./outputs/')

best_model.download(target_dir='', exist_ok=True)

'outputs/model_hyppar_tunning.joblib'