# Hyperparameter Tuning using HyperDrive

Going ahead with the project, we want to use HyperDrive to train a model and tune its hyperparameters for the problem we are trying to solve (giving trading signals on Bitcoin price movements). To achieve this, we start by importing the necessary dependencies and setting up our AzureML workspace details, along with our experiment and environment. We end this first part by creating the compute cluster we will use to train and tune our model.

In [1]:
from azureml.core import Workspace, Experiment, Environment, Datastore, Dataset
import os
import pandas as pd

# Setting up the workspace
# From a config.json file
ws = Workspace.from_config()

# From a known workspace
# workspace_name = os.environ.get('WORKSPACE_NAME', 'udacity-projects')
# ws = Workspace.get(name=workspace_name)

print('Workspace name: ' + ws.name, 
      'Azure region: ' + ws.location, 
      'Subscription id: ' + ws.subscription_id, 
      'Resource group: ' + ws.resource_group, sep = '\n')

# Setup the experiment
experiment_name = os.environ.get('EXPERIMENT_NAME', 'az-capstone-hd')
exp = Experiment(workspace=ws, name=experiment_name)

# Setup the environment
# From a Conda specification file
env = Environment.from_conda_specification(name = "azcapstone", file_path = "envs/env.yml")

# From a pip requirements file
# env = Environment.from_pip_requirements(name = "az-ml", file_path = "path-to-pip-requirements-file")

# Registering and building the environment
env = env.register(workspace=ws)
env_build = env.build(workspace=ws)

# Enable logs
run = exp.start_logging()

Workspace name: quick-starts-ws-186496
Azure region: southcentralus
Subscription id: 2c48c51c-bd47-40d4-abbe-fb8eabd19c8c
Resource group: aml-quickstarts-186496


Creating relevant resources...

In [2]:
from azureml.core.compute import ComputeTarget, AmlCompute

# Setup the compute cluster
# Use vm_size = "Standard_D2_V2" in your provisioning configuration.
# max_nodes should be no greater than 4.
compute_name = os.environ.get('CLUSTER_NAME', 'hd-cluster')
# Environment variables are strings when set, so cast the node counts to int
compute_min_nodes = int(os.environ.get('CLUSTER_MIN_NODES', 0))
compute_max_nodes = int(os.environ.get('CLUSTER_MAX_NODES', 4))
vm_size = os.environ.get('CLUSTER_SKU', 'STANDARD_D2_V2')

# Verify if the compute cluster exists
if compute_name in ws.compute_targets:
    compute_target = ws.compute_targets[compute_name]
    if compute_target and type(compute_target) is AmlCompute:
        print('found compute target. just use it. ' + compute_name)
else:
    print('creating a new compute target...')
    provisioning_config = AmlCompute.provisioning_configuration(
        vm_size=vm_size,
        min_nodes=compute_min_nodes,
        max_nodes=compute_max_nodes)

    # create the cluster
    compute_target = ComputeTarget.create(ws, compute_name, provisioning_config)

    # poll for a minimum number of nodes and for a specific timeout.
    # if no min node count is provided it will use the scale settings for the cluster
    compute_target.wait_for_completion(show_output=True, min_node_count=None, timeout_in_minutes=30)

    # For a more detailed view of current AmlCompute status, use get_status()
    print(compute_target.get_status().serialize())

found compute target. just use it. hd-cluster
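
The cluster settings above are read from environment variables, and `os.environ.get` returns a string whenever the variable is set. The snippet below is a minimal sketch of a helper that converts such values to integers. The name `env_int` and the `EXAMPLE_*` variables are hypothetical and used only for illustration.

```python
import os


def env_int(name, default):
    """Read an integer setting from the environment, falling back to `default`."""
    raw = os.environ.get(name)
    return default if raw is None else int(raw)


# Simulate a variable that was exported in the shell (hypothetical name).
os.environ['EXAMPLE_MAX_NODES'] = '4'

max_nodes = env_int('EXAMPLE_MAX_NODES', 2)        # variable is set: parsed from '4'
min_nodes = env_int('EXAMPLE_MIN_NODES_UNSET', 0)  # variable is not set: default is used

# The raw value is a str, the converted value is an int.
print(type(os.environ['EXAMPLE_MAX_NODES']).__name__, type(max_nodes).__name__)
```

Converting at the point of reading keeps the provisioning call free of type handling.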


## Dataset

Now we read the dataset that we sourced externally in our [first notebook](1-data-sourcing.ipynb) and drop the label columns we will not predict, to avoid data leakage.

In [3]:
df = pd.read_csv('data/ndf.csv')
print('dataset shape: ', df.shape)
print('columns:\n', df.columns)

dataset shape:  (2703, 34)
columns:
 Index(['Date', 'shangai', 'btc', 'crude oil', 'euro', 'gold', 'silver', 'ftse',
       'spy', 'hsi', 'nasdaq', 'nikkei', 'rates', 'open', 'high', 'low', 'MA4',
       'MA50', 'MA80', 'stochRSI', 'RSI', 'btc_std_dev', 'std_dif', 'vol_btc',
       'hashrate', 'difficulty', 'transactions', 't_cost', 'y_returns',
       'y_close', 'y_c', 'y_returns_shift', 'y_c_shift', 'y_close_shift'],
      dtype='object')


In [4]:
drop_col_list = ['y_close', 'y_close_shift', 'y_returns_shift'] # our label will be y_c_shift
df.drop(columns=drop_col_list, inplace=True)
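
As a quick sanity check on the drop above, the following sketch lists which `y_`-prefixed columns remain next to the label. It works on the column names printed earlier and does not touch the data. Treating the `y_` prefix as a marker for label-related columns is our assumption here, so the result is only a prompt for review.

```python
# Column names copied from the output of the cell that prints df.columns.
columns = ['Date', 'shangai', 'btc', 'crude oil', 'euro', 'gold', 'silver', 'ftse',
           'spy', 'hsi', 'nasdaq', 'nikkei', 'rates', 'open', 'high', 'low', 'MA4',
           'MA50', 'MA80', 'stochRSI', 'RSI', 'btc_std_dev', 'std_dif', 'vol_btc',
           'hashrate', 'difficulty', 'transactions', 't_cost', 'y_returns',
           'y_close', 'y_c', 'y_returns_shift', 'y_c_shift', 'y_close_shift']

drop_col_list = ['y_close', 'y_close_shift', 'y_returns_shift']
target = 'y_c_shift'

# Columns that survive the drop.
remaining = [c for c in columns if c not in drop_col_list]

# Assumption: a 'y_' prefix marks a label-related column worth a second look.
to_review = [c for c in remaining if c.startswith('y_') and c != target]

print(len(remaining), 'columns remain; review for leakage:', to_review)
```

Any column listed for review should only contain information available at prediction time.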

In [5]:
# Register the dataset
datastore = ws.get_default_datastore()
dataset = Dataset.Tabular.register_pandas_dataframe(df, datastore, "hd-dataset", show_progress=True)
df = dataset.to_pandas_dataframe()

Validating arguments.
Arguments validated.
Successfully obtained datastore reference and path.
Uploading file to managed-dataset/bb161871-a2f1-42c0-ae3e-04b56f3c8196/
Successfully uploaded file to datastore.
Creating and registering a new dataset.
Successfully created and registered a new dataset.


## Hyperdrive Configuration

In order to compare our model with the one obtained with AutoML, we need to set up our experiment with the same performance metric. For this reason we use accuracy: in our case, neither false positives nor false negatives have an outcome bad enough to justify selecting precision, recall or another metric.
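
To make the metric choice concrete, here is a minimal sketch that computes accuracy, precision and recall by hand. The label lists are made up for illustration and are not taken from the project data.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, fn, tn) for a binary problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    return tp, fp, fn, tn


# Illustrative labels only.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

tp, fp, fn, tn = confusion_counts(y_true, y_pred)

accuracy = (tp + tn) / len(y_true)  # share of all predictions that are correct
precision = tp / (tp + fp)          # penalises false positives
recall = tp / (tp + fn)             # penalises false negatives

print(accuracy, precision, recall)
```

Accuracy weighs both error types equally, which matches the reasoning above.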

For tuning our model, we use random parameter sampling over the parameters "C" (inverse of regularization strength) and "max_iter" (maximum iterations). Random sampling is a good way to cover a range of values without requiring too many computational resources.
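
The idea behind random sampling can be sketched with the standard library alone. The snippet below draws configurations from the same value ranges used in this notebook. It only illustrates the concept and is not how HyperDrive samples internally. The helper name `sample_config` and the seed are our own choices.

```python
import random

# Same candidate values as the search space defined in this notebook.
c_values = [x * 0.001 for x in range(1, 1000)]
max_iter_values = range(100, 500)


def sample_config(rng):
    """Draw one hyperparameter combination uniformly at random."""
    return {'C': rng.choice(c_values), 'max_iter': rng.choice(max_iter_values)}


rng = random.Random(0)  # fixed seed so the sketch is repeatable
samples = [sample_config(rng) for _ in range(5)]

# Size of the full grid, for comparison with the number of runs we can afford.
grid_size = len(c_values) * len(max_iter_values)

for s in samples:
    print(s)
print('full grid size:', grid_size)
```

With 399,600 possible combinations, an exhaustive grid search is out of reach, while a fixed budget of random draws still covers the ranges reasonably well.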

As a termination policy, we picked the BanditPolicy, which uses a slack criterion to terminate runs early. You can read more about it in [AzureML docs](https://docs.microsoft.com/en-us/python/api/azureml-train-core/azureml.train.hyperdrive.banditpolicy?view=azure-ml-py), but basically it terminates a run if its accuracy does not fall within a given slack of the best accuracy obtained so far.
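
The slack rule can be written down in a few lines. This sketch assumes a metric that is maximized and a `slack_factor` criterion. In that case, a run is cancelled when its metric falls below `best / (1 + slack_factor)`. The function name and the example values are hypothetical, and the sketch ignores evaluation intervals and delays.

```python
def bandit_should_terminate(run_metric, best_metric, slack_factor=0.1):
    """Return True if a run falls outside the allowed slack of the best run.

    Assumes the primary metric is being maximized.
    """
    threshold = best_metric / (1 + slack_factor)
    return run_metric < threshold


best_so_far = 0.55  # illustrative best accuracy so far

# With slack_factor=0.1 the cut-off is roughly 0.50 for a best accuracy of 0.55.
print(bandit_should_terminate(0.45, best_so_far))  # below the cut-off
print(bandit_should_terminate(0.52, best_so_far))  # within the slack
```

A smaller slack factor cancels runs more aggressively, which saves compute but can discard runs that would have recovered later.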

The model that we will be training can be found in the [`train.py` file](scripts/train.py), where we set up the training run by importing the necessary libraries, preparing and splitting the data, and finally setting up the main training function. For this task we picked scikit-learn's `LogisticRegression`, as it works well for classification tasks where we do not have a large amount of data.
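
HyperDrive hands each sampled value to the training script as a command-line argument named after the sampler key. The sketch below shows how a script could receive them with `argparse`. It is not the actual content of `train.py`: the default values and the function name `parse_args` are assumptions made for illustration.

```python
import argparse


def parse_args(argv=None):
    """Parse the arguments a tuning run would pass to the training script."""
    parser = argparse.ArgumentParser()
    parser.add_argument('--ds_name', type=str, required=True,
                        help='Name of the registered dataset')
    # Defaults below are placeholders, not values taken from train.py.
    parser.add_argument('--C', type=float, default=1.0,
                        help='Inverse of regularization strength')
    parser.add_argument('--max_iter', type=int, default=100,
                        help='Maximum number of solver iterations')
    return parser.parse_args(argv)


# Simulate the arguments of one sampled configuration.
args = parse_args(['--ds_name', 'hd-dataset', '--C', '0.045', '--max_iter', '270'])
print(args.ds_name, args.C, args.max_iter)
```

Declaring the types in the parser means the model receives a float and an int rather than raw strings.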

In [6]:
from azureml.widgets import RunDetails
# from azureml.train.sklearn import SKLearn
from azureml.train.hyperdrive.run import PrimaryMetricGoal
from azureml.train.hyperdrive.policy import BanditPolicy
from azureml.train.hyperdrive.sampling import RandomParameterSampling
from azureml.train.hyperdrive.runconfig import HyperDriveConfig
from azureml.train.hyperdrive.parameter_expressions import uniform, choice
from azureml.core import ScriptRunConfig

# Setup hyperparameter tuning

# Specify parameter sampler
ps = RandomParameterSampling(
    {
        'C': choice([x*0.001 for x in range(1,1000)]),
        'max_iter': choice(range(100, 500))
    }
)

# Specify a termination Policy
policy = BanditPolicy(slack_factor=0.1)

# Get the previously registered environment
# env = Environment.get(workspace=ws, name="az-ml")

# Create a ScriptRunConfig for use with train.py and pass in the environment
est = ScriptRunConfig(
    source_directory="./scripts",
    script="train.py",
    arguments=['--ds_name', 'hd-dataset'], # dataset.name],
    compute_target=compute_target,
    environment=env)

# Create a HyperDriveConfig using the run configuration, hyperparameter sampler, and policy.
hyperdrive_run_config = HyperDriveConfig(
    run_config=est,
    hyperparameter_sampling=ps,
    policy=policy,
    primary_metric_name="accuracy",
    primary_metric_goal=PrimaryMetricGoal.MAXIMIZE,
    max_total_runs=100,
    max_concurrent_runs=4)

In [7]:
# Submit the HyperDrive run to the experiment.
hyperdrive_run = exp.submit(hyperdrive_run_config)

## Run Details

In the cell below, we use the `RunDetails` widget to follow the results we get for the different hyperparameter combinations.

In [8]:
RunDetails(hyperdrive_run).show()

_HyperDriveWidget(widget_settings={'childWidgetDisplay': 'popup', 'send_telemetry': False, 'log_level': 'INFO'…

## Best Model

Now it's time to show off our best model from the experiment. In the cells below we retrieve the best run and display all of its properties; the model itself is registered and saved locally in the next section.

In [10]:
import joblib

# Get best run and save the model from that run.
best_run = hyperdrive_run.get_best_run_by_primary_metric()
best_run_metrics = best_run.get_metrics()
parameter_values = best_run.get_details()['runDefinition']['arguments']
print(best_run_metrics)

{'regularization strength:': 0.045, 'max iterations:': 270, 'accuracy': 0.5502958579881657}
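
The `arguments` entry of the run definition is a flat list of flags and values. Here is a minimal sketch for turning such a list into a dictionary, assuming flags and values strictly alternate. The example list is illustrative, shaped after the metrics printed above, and is not a captured value of `parameter_values`.

```python
def arguments_to_dict(arguments):
    """Convert ['--name', 'value', ...] into {'name': 'value', ...}.

    Assumes every flag is followed by exactly one value.
    """
    names = [flag.lstrip('-') for flag in arguments[::2]]
    values = arguments[1::2]
    return dict(zip(names, values))


# Illustrative list only (assumed shape).
example_arguments = ['--ds_name', 'hd-dataset', '--C', '0.045', '--max_iter', '270']

parsed = arguments_to_dict(example_arguments)
print(parsed)
```

The values stay as strings here and would need casting before any numeric use.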


In [11]:
# Show the best run's id and metrics
print('Best Run Id: ', best_run.id)

for i in best_run_metrics:
    print(i, best_run_metrics[i])

# model = best_run.register_model(model_name='capstone-hd', model_path='./outputs/capstone-hd.joblib')
# model.download(target_dir="models", exist_ok=True)

Best Run Id:  HD_aaac09a6-eb80-4ee4-8b6f-eb3961138879_43
regularization strength: 0.045
max iterations: 270
accuracy 0.5502958579881657


## Model Deployment

Since we already deployed the AutoML model in the previous notebook, in this experiment we will only register the best HyperDrive model. The cell below does exactly this: it registers the model in AzureML for further use.

In [12]:
model_name = 'capstone-hd'
description = "HyperDrive-tuned model for predicting day-ahead Bitcoin price movements"
tags = None
model = best_run.register_model(model_name=model_name, model_path='outputs/model-hd.joblib',
                                description=description, tags=tags)
model.download(target_dir="models", exist_ok=True)

ModelPathNotFoundException: ModelPathNotFoundException:
	Message: Could not locate the provided model_path models/capstone-hd.joblib in the set of files uploaded to the run: ['logs/azureml/dataprep/backgroundProcess.log', 'logs/azureml/dataprep/backgroundProcess_Telemetry.log', 'logs/azureml/dataprep/rslex.log', 'outputs/model-hd.joblib', 'system_logs/cs_capability/cs-capability.log', 'system_logs/hosttools_capability/hosttools-capability.log', 'system_logs/lifecycler/execution-wrapper.log', 'system_logs/lifecycler/lifecycler.log', 'system_logs/lifecycler/vm-bootstrapper.log', 'user_logs/std_log.txt']
                See https://aka.ms/run-logging for more details.
	InnerException None
	ErrorResponse 
{
    "error": {
        "message": "Could not locate the provided model_path models/capstone-hd.joblib in the set of files uploaded to the run: ['logs/azureml/dataprep/backgroundProcess.log', 'logs/azureml/dataprep/backgroundProcess_Telemetry.log', 'logs/azureml/dataprep/rslex.log', 'outputs/model-hd.joblib', 'system_logs/cs_capability/cs-capability.log', 'system_logs/hosttools_capability/hosttools-capability.log', 'system_logs/lifecycler/execution-wrapper.log', 'system_logs/lifecycler/lifecycler.log', 'system_logs/lifecycler/vm-bootstrapper.log', 'user_logs/std_log.txt']\n                See https://aka.ms/run-logging for more details."
    }
}
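
A `ModelPathNotFoundException` like the one above means the `model_path` passed to `register_model` does not match any file uploaded to the run. The sketch below shows one way to check a path against the run's file list before registering. The file list is copied from the error message. In a live session it could presumably be obtained with `best_run.get_file_names()`. The helper name and the `.joblib` filter are our assumptions.

```python
def find_model_files(run_files, prefix='outputs/', suffix='.joblib'):
    """Return the uploaded files that look like serialized models."""
    return [f for f in run_files if f.startswith(prefix) and f.endswith(suffix)]


# File list copied from the error message above.
run_files = [
    'logs/azureml/dataprep/backgroundProcess.log',
    'logs/azureml/dataprep/backgroundProcess_Telemetry.log',
    'logs/azureml/dataprep/rslex.log',
    'outputs/model-hd.joblib',
    'system_logs/cs_capability/cs-capability.log',
    'system_logs/hosttools_capability/hosttools-capability.log',
    'system_logs/lifecycler/execution-wrapper.log',
    'system_logs/lifecycler/lifecycler.log',
    'system_logs/lifecycler/vm-bootstrapper.log',
    'user_logs/std_log.txt',
]

requested = 'models/capstone-hd.joblib'  # the path reported in the error
candidates = find_model_files(run_files)

if requested not in run_files:
    print('not found:', requested)
    print('candidate model files:', candidates)
```

The only candidate is `outputs/model-hd.joblib`, which is the path to pass as `model_path`.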

In [None]:
# Since we already demonstrated deployment with AutoML, in this case we will only register it (cell above),
# no need for deployment, nor testing the endpoint in this case

What we do need to do, though, is delete the compute target we used for training and tuning!

In [None]:
compute_target.delete()
print('Compute cluster deleted!')

**Submission Checklist**
- (DONE) I have registered the model.
- (N/A, done with AutoML) I have deployed the model with the best accuracy as a webservice.
- (N/A, done with AutoML) I have tested the webservice by sending a request to the model endpoint.
- (DONE) I have deleted the webservice and shut down all the computes that I have used.
- I have taken a screenshot showing the model endpoint as active.
- (DONE) The project includes a file containing the environment details (see `envs` directory).

