# Hyperparameter Tuning using HyperDrive

TODO: Import Dependencies. In the cell below, import all the dependencies that you will need to complete the project.

In [7]:
from azureml.core.compute import AmlCompute
from azureml.core.compute import ComputeTarget
from azureml.core.compute_target import ComputeTargetException
from azureml.core.experiment import Experiment
from azureml.core.workspace import Workspace


# HyperDrive dependencies
from azureml.widgets import RunDetails
from azureml.train.sklearn import SKLearn
from azureml.train.hyperdrive.run import PrimaryMetricGoal
from azureml.train.hyperdrive.policy import BanditPolicy
from azureml.train.hyperdrive.sampling import RandomParameterSampling
from azureml.train.hyperdrive.runconfig import HyperDriveConfig
from azureml.train.hyperdrive.parameter_expressions import normal, uniform, choice
import os

## Dataset

TODO: Get data. In the cell below, write code to access the data you will be using in this project. Remember that the dataset needs to be external.

In [4]:
ws = Workspace.from_config()
experiment_name = 'hyperdrive_sayed_exp'
experiment=Experiment(ws, experiment_name)
experiment

| Name | Workspace | Report Page | Docs Page |
| --- | --- | --- | --- |
| hyperdrive_sayed_exp | quick-starts-ws-137382 | Link to Azure Machine Learning studio | Link to Documentation |


In [5]:
# Choosing a name for the CPU cluster
amlcompute_cluster_name = "sayed-cluster"

# Reuse the cluster if it already exists; otherwise create it
try:
    compute_target = ComputeTarget(workspace=ws, name=amlcompute_cluster_name)
    print('Found existing cluster, use it.')
except ComputeTargetException:
    compute_config = AmlCompute.provisioning_configuration(vm_size='STANDARD_D2_V2',# for GPU, use "STANDARD_NC6"
                                                           #vm_priority = 'lowpriority', # optional
                                                           max_nodes=4)
    compute_target = ComputeTarget.create(ws, amlcompute_cluster_name, compute_config)

compute_target.wait_for_completion(show_output=True, min_node_count = 1, timeout_in_minutes = 3)

Creating
Succeeded................................
AmlCompute wait for completion finished

Wait timeout has been reached
Current provisioning state of AmlCompute is "Succeeded" and current node count is "0"


## Hyperdrive Configuration

TODO: Explain the model you are using and the reason for choosing the different hyperparameters, termination policy and config settings.


I chose a LogisticRegression model to predict the dependent variable. This is a classification problem, so logistic regression is a natural fit.
For parameter sampling I chose random sampling, in which hyperparameter values are drawn at random from the defined search space. Random sampling also supports early termination of low-performing runs. For the early stopping policy I chose a Bandit policy with a slack factor of 0.1, evaluated every 2 intervals, which terminates any run whose primary metric is not within the slack factor of the best-performing run.
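
To make the slack factor concrete: for a metric that is being maximized, the Azure ML documentation describes the Bandit rule as cancelling a run whose metric falls below `best / (1 + slack_factor)`. The sketch below restates that rule in plain Python. It is not the Azure ML implementation, and the function names `bandit_cutoff` and `should_terminate` are hypothetical.

```python
def bandit_cutoff(best_metric: float, slack_factor: float) -> float:
    """Lowest value a run may report without being cancelled (MAXIMIZE goal)."""
    return best_metric / (1.0 + slack_factor)


def should_terminate(run_metric: float, best_metric: float,
                     slack_factor: float = 0.1) -> bool:
    """Return True if the run falls outside the allowed slack of the best run."""
    return run_metric < bandit_cutoff(best_metric, slack_factor)
```

With a best accuracy of 0.90 and a slack factor of 0.1, the cutoff is roughly 0.818, so a run reporting 0.80 would be cancelled while a run reporting 0.85 would continue.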

The two hyperparameters tuned with HyperDrive are described below:
  * '--C': choice(0.01,5,20,100,500)
       * This parameter is the inverse of regularization strength: larger values give weaker regularization and smaller values give stronger regularization. The candidate values were 0.01, 5, 20, 100 and 500.
  * '--max_iter': choice(10,50,100,150,200)
       * The maximum number of iterations. The candidate values were 10, 50, 100, 150 and 200.
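
The two `choice` lists define a small discrete search space. As a rough illustration of what random sampling over that space means, the sketch below uses only the Python standard library. It is not how HyperDrive is implemented: the helper names `grid_size` and `sample_configuration` are hypothetical, and the sketch draws each value independently, so it may repeat a combination.

```python
import random

# Same candidate values as the search space described above.
search_space = {
    "--C": [0.01, 5, 20, 100, 500],
    "--max_iter": [10, 50, 100, 150, 200],
}


def grid_size(space):
    """Number of distinct configurations in a purely discrete search space."""
    size = 1
    for values in space.values():
        size *= len(values)
    return size


def sample_configuration(space, rng):
    """Pick one value per hyperparameter, uniformly at random."""
    return {name: rng.choice(values) for name, values in space.items()}


rng = random.Random(0)  # fixed seed only to make the sketch repeatable
samples = [sample_configuration(search_space, rng) for _ in range(20)]
print(grid_size(search_space))  # 5 * 5 = 25 possible configurations
```

With 25 possible combinations and a budget of 20 runs, most of the space can be covered without running an exhaustive grid search.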


In [21]:
script_folder = './'
script='train.py'

# Specifying parameter sampler
ps = RandomParameterSampling(
     {
        '--C': choice(0.01,5,20,100,500), 
        '--max_iter': choice(10,50,100,150,200)
     }
)

# Creating an early termination policy. Random sampling supports one; Bayesian sampling does not.
early_termination_policy = BanditPolicy(slack_factor = 0.1, evaluation_interval=2)

# "Accuracy" is the primary metric to maximize; the name must match the metric logged by train.py
primary_metric_name="Accuracy"
primary_metric_goal=PrimaryMetricGoal.MAXIMIZE
max_total_runs=20
max_concurrent_runs=4

if "training" not in os.listdir():
    os.mkdir("./training")

# Creating a SKLearn estimator for use with train.py
estimator = SKLearn(script_folder,
        compute_target=compute_target, 
        entry_script=script)


# Creating a HyperDriveConfig using the estimator, hyperparameter sampler, and policy.
hyperdrive_config = HyperDriveConfig(estimator = estimator,
                             hyperparameter_sampling=ps,
                             policy=early_termination_policy,
                             primary_metric_name=primary_metric_name,
                             primary_metric_goal=primary_metric_goal,
                             max_total_runs=max_total_runs,
                             max_concurrent_runs=max_concurrent_runs)



In [22]:
# Submitting the HyperDrive run to the experiment
hyperdrive_run = experiment.submit(config = hyperdrive_config, show_output=True)
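
Each child run receives its sampled values as command-line arguments to the entry script. `train.py` itself is not shown in this notebook, so the following is only a sketch of how such a script might read `--C` and `--max_iter` with `argparse`. The function name `parse_args` and the default values (1.0 and 100) are assumptions, not taken from the actual script.

```python
import argparse


def parse_args(argv=None):
    """Parse the hyperparameters that HyperDrive passes on the command line."""
    parser = argparse.ArgumentParser(
        description="Hypothetical argument handling for train.py")
    # Defaults are assumptions, used only when a flag is omitted.
    parser.add_argument("--C", type=float, default=1.0,
                        help="Inverse of regularization strength")
    parser.add_argument("--max_iter", type=int, default=100,
                        help="Maximum number of iterations")
    return parser.parse_args(argv)


# Example: the arguments one sampled configuration would produce.
args = parse_args(["--C", "5", "--max_iter", "50"])
print(args.C, args.max_iter)
```

The parsed values would then be passed to the model, and the script would need to log its score under the exact name `Accuracy` so that HyperDrive can compare runs on the primary metric.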



## Run Details


TODO: In the cell below, use the `RunDetails` widget to show the different experiments.

In [23]:
RunDetails(hyperdrive_run).show()

_HyperDriveWidget(widget_settings={'childWidgetDisplay': 'popup', 'send_telemetry': False, 'log_level': 'INFO'…

## Best Model

TODO: In the cell below, get the best model from the hyperdrive experiments and display all the properties of the model.

In [24]:
best_run = hyperdrive_run.get_best_run_by_primary_metric()
best_run_metrics = best_run.get_metrics()
best_run_model_names = best_run.get_file_names()

print('Best Run Id: ', best_run.id)
print('\n Accuracy:', best_run_metrics['Accuracy'])
print('\n best_run_model_names:',best_run_model_names)
print('\n best_run:',best_run.get_details())

Best Run Id:  HD_3057fb41-d9de-4598-95f6-dcf9565495e2_0

 Accuracy: 0.9166666666666666

 best_run_model_names: ['azureml-logs/55_azureml-execution-tvmps_e47703c908bcdc1887aafcd7cccdf00b8f1b90f8f55549c623ad5f951267b032_d.txt', 'azureml-logs/65_job_prep-tvmps_e47703c908bcdc1887aafcd7cccdf00b8f1b90f8f55549c623ad5f951267b032_d.txt', 'azureml-logs/70_driver_log.txt', 'azureml-logs/75_job_post-tvmps_e47703c908bcdc1887aafcd7cccdf00b8f1b90f8f55549c623ad5f951267b032_d.txt', 'azureml-logs/process_info.json', 'azureml-logs/process_status.json', 'logs/azureml/107_azureml.log', 'logs/azureml/job_prep_azureml.log', 'logs/azureml/job_release_azureml.log', 'outputs/mymodel.joblib']

 best_run: {'runId': 'HD_3057fb41-d9de-4598-95f6-dcf9565495e2_0', 'target': 'sayed-cluster', 'status': 'Completed', 'startTimeUtc': '2021-02-06T01:49:29.936097Z', 'endTimeUtc': '2021-02-06T02:02:56.12678Z', 'properties': {'_azureml.ComputeTargetType': 'amlcompute', 'ContentSnapshotId': 'd902d17f-7d39-4f2f-a8c2-7c64c3851450

In [25]:
best_run

| Experiment | Id | Type | Status | Details Page | Docs Page |
| --- | --- | --- | --- | --- | --- |
| hyperdrive_sayed_exp | HD_3057fb41-d9de-4598-95f6-dcf9565495e2_0 | azureml.scriptrun | Completed | Link to Azure Machine Learning studio | Link to Documentation |


In [26]:
# Registering the best model in the workspace
model = best_run.register_model(model_name='best_model_sayed', 
                           model_path='outputs/mymodel.joblib')