# Hyperparameter Tuning using HyperDrive

In [1]:
import joblib

from azureml.core import Workspace, Experiment, Dataset
from azureml.core.compute import AmlCompute, ComputeTarget
from azureml.core.compute_target import ComputeTargetException
from azureml.core import Environment, ScriptRunConfig

from azureml.train.hyperdrive.policy import NoTerminationPolicy
from azureml.train.hyperdrive.sampling import BayesianParameterSampling
from azureml.train.hyperdrive.runconfig import HyperDriveConfig
from azureml.train.hyperdrive.run import PrimaryMetricGoal
from azureml.train.hyperdrive.parameter_expressions import choice, uniform

from azureml.widgets import RunDetails

## Dataset

### Set up experiment

In [2]:
ws = Workspace.from_config()
experiment_name = "creditcard-experiment"
project_folder = './creditcard-hyperdrive-project'

experiment = Experiment(ws, experiment_name)
run = experiment.start_logging()

### Connect to Compute

In [3]:
amlcompute_cluster_name = "automl-cls"

try:
    compute_target = ComputeTarget(workspace=ws, name=amlcompute_cluster_name)
    print("Found existing cluster, use it.")
except ComputeTargetException:
    compute_config = AmlCompute.provisioning_configuration(vm_size="STANDARD_D2_v3", max_nodes=4)
    compute_target = ComputeTarget.create(ws, amlcompute_cluster_name, compute_config)

compute_target.wait_for_completion(show_output=True, min_node_count=0, timeout_in_minutes=10)

Found existing cluster, use it.
Succeeded
AmlCompute wait for completion finished

Minimum number of nodes requested have been provisioned


### Get Dataset

In [4]:
key = "creditcard-dataset"
description = "Credit Card - Dealing from Imbalance Datasets from https://www.kaggle.com/datasets/mlg-ulb/creditcardfraud"

if key in ws.datasets:
    print("Found existing dataset, use it.")
    dataset = ws.datasets[key]  # already registered
else:
    # CSV hosted on GitHub for download
    example_data = "https://media.githubusercontent.com/media/satriawadhipurusa/ml-dataset-collection/master/Fraud-Detection/creditcard-fraud.csv"
    dataset = Dataset.Tabular.from_delimited_files(example_data)
    dataset = dataset.register(workspace=ws, name=key, description=description)

Found existing dataset, use it.


In [5]:
dataset.to_pandas_dataframe().head()

| Unnamed: 0 | Time | V1 | V2 | V3 | V4 | V5 | V6 | V7 | V8 | V9 | ... | V21 | V22 | V23 | V24 | V25 | V26 | V27 | V28 | Amount | Class |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | -1.359807 | -0.072781 | 2.536347 | 1.378155 | -0.338321 | 0.462388 | 0.239599 | 0.098698 | 0.363787 | ... | -0.018307 | 0.277838 | -0.110474 | 0.066928 | 0.128539 | -0.189115 | 0.133558 | -0.021053 | 149.62 | 0 |
| 1 | 0.0 | 1.191857 | 0.266151 | 0.16648 | 0.448154 | 0.060018 | -0.082361 | -0.078803 | 0.085102 | -0.255425 | ... | -0.225775 | -0.638672 | 0.101288 | -0.339846 | 0.16717 | 0.125895 | -0.008983 | 0.014724 | 2.69 | 0 |
| 2 | 1.0 | -1.358354 | -1.340163 | 1.773209 | 0.37978 | -0.503198 | 1.800499 | 0.791461 | 0.247676 | -1.514654 | ... | 0.247998 | 0.771679 | 0.909412 | -0.689281 | -0.327642 | -0.139097 | -0.055353 | -0.059752 | 378.66 | 0 |
| 3 | 1.0 | -0.966272 | -0.185226 | 1.792993 | -0.863291 | -0.010309 | 1.247203 | 0.237609 | 0.377436 | -1.387024 | ... | -0.1083 | 0.005274 | -0.190321 | -1.175575 | 0.647376 | -0.221929 | 0.062723 | 0.061458 | 123.5 | 0 |
| 4 | 2.0 | -1.158233 | 0.877737 | 1.548718 | 0.403034 | -0.407193 | 0.095921 | 0.592941 | -0.270533 | 0.817739 | ... | -0.009431 | 0.798278 | -0.137458 | 0.141267 | -0.20601 | 0.502292 | 0.219422 | 0.215153 | 69.99 | 0 |


## HyperDrive Configuration

TODO: Explain the model you are using and the reason for choosing the different hyperparameters, termination policy and config settings.

We will use HyperDrive to search for the best hyperparameters of the model. The model is a support vector classifier (`SVC`), which separates the classes with a hyperplane. It suits anomaly detection such as **fraud** because its `class_weight` parameter can be set to compensate for an imbalanced dataset.

The following SVC hyperparameters are tuned:

* `gamma` (kernel coefficient): 0.01 - 100
* `C` (regularization): 0.01 - 100
* `class_weight`: `{0: 0.05, 1: 0.95}`, `{0: 0.1, 1: 0.9}`, `{0: 0.25, 1: 0.75}`

These three parameters are essential to SVC, and both `gamma` and `C` span a very large range (0.01 - 100). Since we are also limited in time and budget, we use `BayesianParameterSampling` to make the search more informed: the sampler learns from previous runs and picks new samples that are expected to improve the objective, which here means maximizing the primary metric. Because Bayesian sampling does not support early termination policies, `NoTerminationPolicy` is used.
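To get a feel for how wide this range is, the sketch below draws values uniformly from 0.01 to 100 with Python's standard `random` module. It is only an illustration of a uniform distribution, not the HyperDrive sampler itself; the seed and sample count are arbitrary assumptions.

```python
import random

LOW, HIGH = 0.01, 100.0

# Fixed seed so the illustration is repeatable; the value 0 is arbitrary.
rng = random.Random(0)
samples = [rng.uniform(LOW, HIGH) for _ in range(100_000)]

# Share of draws that land below 1.0, compared with the exact value for a uniform law.
fraction_below_one = sum(s < 1.0 for s in samples) / len(samples)
expected_fraction = (1.0 - LOW) / (HIGH - LOW)

print(f"observed: {fraction_below_one:.4f}, expected: {expected_fraction:.4f}")
```

Only about 1% of uniform draws fall below 1.0, so small values of `gamma` and `C` are rarely explored by chance. This is one reason a sampler that learns from earlier runs can help on a limited budget.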

Finally, we set the primary metric to **`AUC_Weighted`** instead of accuracy because it is better suited to this kind of imbalanced dataset. Automated ML used the same metric previously, so the two approaches can be compared on the same ground. For the remaining settings, we set `max_total_runs` and `max_duration_minutes` generously (60 each), since Bayesian sampling usually takes longer than random or grid search.
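The choice of an AUC metric over accuracy can be illustrated with a toy example. The sketch below uses made-up labels and a plain binary ROC AUC written in pure Python; it is not Azure ML's exact `AUC_Weighted` computation, only a demonstration of why accuracy is misleading on imbalanced data.

```python
def accuracy(y_true, y_pred):
    """Fraction of labels predicted correctly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def roc_auc(y_true, scores):
    """Probability that a random positive outranks a random negative (ties count half)."""
    positives = [s for t, s in zip(y_true, scores) if t == 1]
    negatives = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))


# Toy labels: 998 legitimate transactions and 2 frauds (made-up numbers).
y_true = [0] * 998 + [1] * 2

# A useless model that gives every transaction the same score and never flags fraud.
constant_scores = [0.0] * len(y_true)
constant_preds = [0] * len(y_true)

acc = accuracy(y_true, constant_preds)
auc = roc_auc(y_true, constant_scores)
print(f"accuracy={acc:.3f}, AUC={auc:.3f}")
```

The useless model scores 99.8% accuracy but only 0.5 AUC, the same as random guessing, which is why an AUC-based metric is the more honest target here.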

In [20]:
# Bayesian sampling does not support early termination, so no policy is applied.
early_termination_policy = NoTerminationPolicy()

# Hyperparameter search space used during training.
param_sampling = BayesianParameterSampling({
    "gamma": uniform(0.01, 100),
    "C": uniform(0.01, 100),
    "class_weight": choice(
        "{0: 0.05, 1: 0.95}",
        "{0: 0.1, 1: 0.9}",
        "{0: 0.25, 1: 0.75}")
})

# Training environment, script run configuration, and HyperDrive config.
environment = Environment.from_conda_specification(name="sklearn-env", file_path="conda.yaml")
arguments = [
    "--gamma",
    1.0,
    "--C",
    1.0,
    "--class_weight",
    "{0: 0.05, 1: 0.95}"
]
estimator = ScriptRunConfig(source_directory=".",
                            script="./training/train.py",
                            arguments=arguments,
                            environment=environment,
                            compute_target=compute_target)

hyperdrive_run_config = HyperDriveConfig(
    hyperparameter_sampling=param_sampling,
    primary_metric_name="AUC_Weighted",
    primary_metric_goal=PrimaryMetricGoal.MAXIMIZE,
    run_config=estimator,
    policy=early_termination_policy,
    max_total_runs=60,
    max_concurrent_runs=2,
    max_duration_minutes=60
)
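The hyperparameters above reach the training script as command-line strings, including `class_weight`, which is written as a dict literal. The notebook does not show `training/train.py`, so the following is only a sketch of how such a script might parse them; `parse_class_weight` and `build_parser` are hypothetical names, not part of the project.

```python
import argparse
import ast


def parse_class_weight(text):
    """Turn a string such as "{0: 0.05, 1: 0.95}" into a dict without using eval."""
    try:
        value = ast.literal_eval(text)
    except (ValueError, SyntaxError) as exc:
        raise argparse.ArgumentTypeError(f"not a valid literal: {text!r}") from exc
    if not isinstance(value, dict):
        raise argparse.ArgumentTypeError(f"expected a dict literal, got {text!r}")
    return {int(label): float(weight) for label, weight in value.items()}


def build_parser():
    """Hypothetical argument parser mirroring the flags used in this notebook."""
    parser = argparse.ArgumentParser(description="Sketch of train.py argument handling")
    parser.add_argument("--gamma", type=float, default=1.0)
    parser.add_argument("--C", type=float, default=1.0)
    parser.add_argument("--class_weight", type=parse_class_weight, default=None)
    return parser


# Simulate one sampled configuration being passed on the command line.
args = build_parser().parse_args(
    ["--gamma", "0.275", "--C", "44.808", "--class_weight", "{0: 0.25, 1: 0.75}"]
)
print(args.gamma, args.C, args.class_weight)
```

With `argparse`, a flag that appears twice keeps its last value, so a default placed earlier in the argument list is overridden by a later occurrence of the same flag.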

In [21]:
# Submit the HyperDrive experiment.
remote_run = experiment.submit(hyperdrive_run_config)

## Run Details

The child runs show that some runs achieve a much higher metric than others. It is the Bayesian sampler's job to find the parameters that produce the best run. The best run reaches an `AUC_Weighted` of **0.919** with **gamma = 0.275**, **C = 44.808**, and a `class_weight` of `{0: 0.25, 1: 0.75}`. This is lower than the Automated ML result, so we will deploy the Automated ML model instead of this one.

In [22]:
RunDetails(remote_run).show()


In [23]:
remote_run.wait_for_completion(show_output=True)

RunId: HD_cfdc1f99-cc38-4233-8925-784c29868da5
Web View: https://ml.azure.com/runs/HD_cfdc1f99-cc38-4233-8925-784c29868da5?wsid=/subscriptions/5e4d75b9-5b13-49fb-8306-ae971a3c14b1/resourcegroups/mlops-resource/workspaces/mlops-demo&tid=f336fb5b-9257-44b3-a041-3897edf080c9

Streaming azureml-logs/hyperdrive.txt

<START>[2022-04-27T11:13:40.299235][API][INFO]Experiment created<END>
<START>[2022-04-27T11:13:41.856318][GENERATOR][INFO]Trying to sample '2' jobs from the hyperparameter space<END>
<START>[2022-04-27T11:13:43.0230877Z][SCHEDULER][INFO]Scheduling job, id='HD_cfdc1f99-cc38-4233-8925-784c29868da5_0'<END>
<START>[2022-04-27T11:13:43.096764][GENERATOR][INFO]Successfully sampled '2' jobs, they will soon be submitted to the execution target.<END>
<START>[2022-04-27T11:13:43.1286956Z][SCHEDULER][INFO]Scheduling job, id='HD_cfdc1f99-cc38-4233-8925-784c29868da5_1'<END>

Execution Summary
RunId: HD_cfdc1f99-cc38-4233-8925-784c29868da5
Web View: https://ml.azure.com/runs/HD_cfdc1f

{'runId': 'HD_cfdc1f99-cc38-4233-8925-784c29868da5',
 'target': 'automl-cls',
 'status': 'Completed',
 'startTimeUtc': '2022-04-27T11:13:40.040744Z',
 'endTimeUtc': '2022-04-27T12:14:45.96542Z',
 'services': {},
 'properties': {'primary_metric_config': '{"name": "AUC_Weighted", "goal": "maximize"}',
  'resume_from': 'null',
  'runTemplate': 'HyperDrive',
  'azureml.runsource': 'hyperdrive',
  'platform': 'AML',
  'ContentSnapshotId': '6b427597-adfa-4b37-90aa-2d9d03658203',
  'user_agent': 'python/3.8.5 (Linux-5.4.0-1074-azure-x86_64-with-glibc2.10) msrest/0.6.21 Hyperdrive.Service/1.0.0 Hyperdrive.SDK/core.1.40.0',
  'space_size': 'infinite_space_size',
  'score': '0.9193005539344387',
  'best_child_run_id': 'HD_cfdc1f99-cc38-4233-8925-784c29868da5_4',
  'best_metric_status': 'Succeeded'},
 'inputDatasets': [],
 'outputDatasets': [],
 'logFiles': {'azureml-logs/hyperdrive.txt': 'https://mlopsdemstorage50a93f777.blob.core.windows.net/azureml/ExperimentRun/dcid.HD_cfdc1f99-cc38-4233-8925
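The dictionary returned by `wait_for_completion` can be summarized programmatically. The sketch below works on an abridged copy of the output above (only the fields it needs), and `summarize` is a hypothetical helper rather than an SDK function.

```python
import json
from datetime import datetime

# Abridged copy of the run details printed above (only the fields used here).
run_details = {
    "runId": "HD_cfdc1f99-cc38-4233-8925-784c29868da5",
    "status": "Completed",
    "startTimeUtc": "2022-04-27T11:13:40.040744Z",
    "endTimeUtc": "2022-04-27T12:14:45.96542Z",
    "properties": {
        "primary_metric_config": '{"name": "AUC_Weighted", "goal": "maximize"}',
        "score": "0.9193005539344387",
        "best_child_run_id": "HD_cfdc1f99-cc38-4233-8925-784c29868da5_4",
    },
}

TIME_FORMAT = "%Y-%m-%dT%H:%M:%S.%fZ"


def summarize(details):
    """Extract the primary metric, best score, best child run, and duration."""
    props = details["properties"]
    metric = json.loads(props["primary_metric_config"])  # stored as a JSON string
    started = datetime.strptime(details["startTimeUtc"], TIME_FORMAT)
    ended = datetime.strptime(details["endTimeUtc"], TIME_FORMAT)
    return {
        "metric_name": metric["name"],
        "best_score": float(props["score"]),  # the score is stored as a string
        "best_child_run_id": props["best_child_run_id"],
        "duration_minutes": (ended - started).total_seconds() / 60,
    }


summary = summarize(run_details)
print(summary)
```

The computed duration is about 61 minutes, which matches the `max_duration_minutes=60` limit set in the HyperDrive config.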

## Best Model

In [35]:
best_hyperdrive_run = remote_run.get_best_run_by_primary_metric()
best_hyperdrive_run.properties

{'_azureml.ComputeTargetType': 'amlctrain',
 'ContentSnapshotId': '6b427597-adfa-4b37-90aa-2d9d03658203',
 'ProcessInfoFile': 'azureml-logs/process_info.json',
 'ProcessStatusFile': 'azureml-logs/process_status.json'}

In [29]:
best_hyperdrive_run = remote_run.get_best_run_by_primary_metric()

print(f"Best HyperDrive Run:\n\n{best_hyperdrive_run}")
print("==============")
namefile = "outputs/model.joblib"
best_hyperdrive_run.download_file(namefile, namefile) # save the best model
best_hyperdrive_model = joblib.load(namefile)  # joblib opens and closes the file itself

print(f"Best HyperDrive Model:\n\n{best_hyperdrive_model}")

Best HyperDrive Run:

Run(Experiment: creditcard-experiment,
Id: HD_cfdc1f99-cc38-4233-8925-784c29868da5_4,
Type: azureml.scriptrun,
Status: Completed)
Best HyperDrive Model:

Pipeline(memory=None,
         steps=[('preprocessor',
                 ColumnTransformer(n_jobs=None, remainder='drop',
                                   sparse_threshold=0.3,
                                   transformer_weights=None,
                                   transformers=[('numeric',
                                                  Pipeline(memory=None,
                                                           steps=[('imputer',
                                                                   SimpleImputer(add_indicator=False,
                                                                                 copy=True,
                                                                                 fill_value=None,
                                                                                 m

## Model Deployment

Remember you have to deploy only one of the two models you trained but you still need to register both the models. Perform the steps in the rest of this notebook only if you wish to deploy this model.

TODO: In the cell below, register the model, create an inference config and deploy the model as a web service.

In [31]:
# Register the model
best_hyperdrive_run.register_model(model_name="credit-fraud-model", model_path="outputs/model.joblib")

Model(workspace=Workspace.create(name='mlops-demo', subscription_id='5e4d75b9-5b13-49fb-8306-ae971a3c14b1', resource_group='mlops-resource'), name=credit-fraud-model, id=credit-fraud-model:2, version=2, tags={}, properties={})

**Submission Checklist**
- I have registered the model.
- I have deployed the model with the best accuracy as a webservice.
- I have tested the webservice by sending a request to the model endpoint.
- I have deleted the webservice and shutdown all the computes that I have used.
- I have taken a screenshot showing the model endpoint as active.
- The project includes a file containing the environment details.

