# Lab 2 - Hyperparameter Tuning with `Hyperdrive`

The goal of this lab is to show how to use *Hyperdrive*, a feature of the Azure Machine Learning (AML) service, for scale-out hyperparameter tuning.

Azure Machine Learning lets you automate hyperparameter exploration efficiently, saving significant time and resources. You specify the range of hyperparameter values and a maximum number of training runs. The system then launches multiple simultaneous runs with different parameter configurations and finds the configuration that performs best on the metric you choose. Poorly performing training runs are terminated early, and the compute they free up is used to explore other hyperparameter configurations.

You will continue working on the same scenario as in Lab 1.

## Connect to the AML workspace

Check the version of the AML SDK.

In [2]:
# Verify AML SDK Installed

import azureml.core
print("SDK Version:", azureml.core.VERSION)

SDK Version: 1.0.23


Connect to the workspace.

In [3]:
from azureml.core import Workspace

# Connect to workspace

ws = Workspace.from_config()
print(ws.name, ws.resource_group, ws.location, ws.subscription_id, sep='\n')

Found the config file in: /home/byteb/events/MachineLearningOps/.azureml/config.json


If you run your code in unattended mode, i.e., where you can't give a user input, then we recommend to use ServicePrincipalAuthentication or MsiAuthentication.
Please refer to aka.ms/aml-notebook-auth for different authentication mechanisms in azureml-sdk.


MLOpsFatosIsmali
DSIMLOpsHack
westeurope
051aa254-957d-4431-a6df-6caa8963bdd7


## Prepare Hyperdrive run
### Create a training script
We will use a training script similar to the one from Lab 1. The only difference is that, in addition to tuning the regularization parameter **C**, we will also search for the best *Logistic Regression* solver.

In [4]:
import os
script_folder = './script'
os.makedirs(script_folder, exist_ok=True)

In [5]:
%%writefile $script_folder/train.py

import argparse
import os
import numpy as np
import pandas as pd

from sklearn.externals import joblib
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_validate

from azureml.core.run import Run

# Retrieve command line arguments
parser = argparse.ArgumentParser()
parser.add_argument('--data-folder', type=str,  help='data folder mounting point')
parser.add_argument('--filename', type=str,  help='training file name')
parser.add_argument('--C', type=float, help='regularization')
parser.add_argument('--solver', type=str, help='Algorithm to use in the optimization problem')
args = parser.parse_args()

# Configure a path to training data
data_folder = os.path.join(args.data_folder, 'datasets')
print('Loading data from: ', data_folder)
data_csv_path = os.path.join(data_folder, args.filename)

# Load the dataset
df = pd.read_csv(data_csv_path)

# Preprocess the data
feature_columns = [
                   # Demographic
                   'age', 
                   'job', 
                   'education', 
                   'marital',  
                   'housing', 
                   'loan', 
                   # Previous campaigns
                   'month',
                   'campaign',
                   'poutcome',
                   # Economic indicators
                   'emp_var_rate',
                   'cons_price_idx',
                   'cons_conf_idx',
                   'euribor3m',
                   'nr_employed']

df = df[feature_columns + ['y']]
df_train = pd.get_dummies(df, drop_first=True).astype(dtype='float')

# Create logistic regression estimator
lr = LogisticRegression(solver=args.solver, C=args.C, max_iter=300, class_weight='balanced')

# Logistic regression requires feature scaling
scaler = StandardScaler()

# Create a training pipeline
pipeline = Pipeline(steps=[('scaler', scaler),
                           ('lr', lr)])


# Train and evaluate the model using cross validation
X = df_train.drop('y', axis=1)
y = df_train.y

# Evaluate metric(s) by cross-validation
print("Starting training using {} solver and C={}".format(args.solver, args.C))
scoring = ['accuracy', 'recall']
scores = cross_validate(pipeline, X, y, 
                        cv=10, 
                        return_train_score=False,
                        scoring=scoring)

cv_accuracy = np.mean(scores['test_accuracy'])
cv_recall = np.mean(scores['test_recall'])

print("CV accuracy: ", cv_accuracy)
print("CV recall: ", cv_recall)

# Persist the metrics in Azure ML Experiment
# Acquire the current run and log run parameters and performance measures
run = Run.get_context()
run.log("Solver", args.solver)
run.log("C", args.C)
run.log("val_accuracy", cv_accuracy)
run.log("val_recall", cv_recall)


# Train the model on a full dataset
trained_pipeline = pipeline.fit(X, y)

# Serialize the model to ./outputs directory so that it can be automatically copied to Azure ML Experiment
print("Saving the model to outputs ...")
joblib.dump(value=trained_pipeline, filename='outputs/model.pkl')

Writing ./script/train.py


### Connect to Azure ML Compute

We are reusing the cluster created in Lab 1. If you removed the cluster, the code snippet below re-creates it.

In [6]:
# Connect to the Azure ML Compute cluster, creating it if it does not exist
from azureml.core.compute import AmlCompute
from azureml.core.compute import ComputeTarget
import os

# choose a name for your cluster
cluster_name = "cpu-cluster"
cluster_min_nodes = 1
cluster_max_nodes = 3
vm_size = "STANDARD_DS11_V2"

# Check if the cluster exists. If yes connect to it
if cluster_name in ws.compute_targets:
    compute_target = ws.compute_targets[cluster_name]
    if compute_target and type(compute_target) is AmlCompute:
        print('Found existing compute target, using this compute target instead of creating:  ' + cluster_name)
    else:
        print("Error: A compute target with name ",cluster_name," was found, but it is not of type AmlCompute.")
else:
    print('Creating a new compute target...')
    provisioning_config = AmlCompute.provisioning_configuration(vm_size = vm_size, 
                                                                min_nodes = cluster_min_nodes, 
                                                                max_nodes = cluster_max_nodes)

    # create the cluster
    compute_target = ComputeTarget.create(ws, cluster_name, provisioning_config)
    
    # can poll for a minimum number of nodes and for a specific timeout. 
    # if no min node count is provided it will use the scale settings for the cluster
    compute_target.wait_for_completion(show_output=True, min_node_count=None, timeout_in_minutes=20)
    
    # For a more detailed view of the current AmlCompute cluster status, use the 'status' property
    print(compute_target.status.serialize())

Found existing compute target, using this compute target instead of creating:  cpu-cluster


### Connect to the default datastore

In Lab 1, we uploaded the training files to the default datastore.

In [7]:
ds = ws.get_default_datastore()
print(ds.datastore_type, ds.account_name, ds.container_name)

AzureBlob mlopsfatosisma6452541516 azureml-blobstore-ba8ddba8-ba9b-45c0-a53f-08c3c660a28d


### Define the hyperparameter space

The hyperparameter space is the range of values defined for each hyperparameter. `Hyperdrive` supports several strategies for sampling this space, including random sampling, grid sampling, and Bayesian sampling.

In this lab we are going to utilize grid sampling.

In [8]:
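# Import only the names used in this lab instead of a wildcard import
from azureml.train.hyperdrive import (GridParameterSampling, HyperDriveRunConfig,
                                      NoTerminationPolicy, PrimaryMetricGoal, choice)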

ps = GridParameterSampling(
    {
        '--solver': choice('lbfgs', 'newton-cg'),
        '--C': choice(0.001, 0.002, 0.005, 0.01, 0.5, 1, 2, 3)
    }
)
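To make the size of this grid concrete, here is a minimal pure-Python sketch that enumerates every combination of the two `choice` lists. It does not use the AML SDK; `expand_grid` is a hypothetical helper written for illustration, and the order in which `Hyperdrive` actually schedules the configurations is not something this sketch claims to reproduce.

```python
from itertools import product

# Same discrete choices as the grid defined above.
search_space = {
    "--solver": ["lbfgs", "newton-cg"],
    "--C": [0.001, 0.002, 0.005, 0.01, 0.5, 1, 2, 3],
}


def expand_grid(space):
    """Return every combination of the listed values as a list of dicts."""
    names = list(space)
    return [dict(zip(names, values)) for values in product(*space.values())]


configurations = expand_grid(search_space)
print(len(configurations))  # 2 solvers x 8 values of C = 16
print(configurations[0])    # {'--solver': 'lbfgs', '--C': 0.001}
```

The grid holds 16 configurations, which is why a run budget of 16 is enough to cover it completely.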


### Create an estimator object

Note that `script_params` omits the command-line parameters defined in the hyperparameter grid (`--solver` and `--C`). `Hyperdrive` supplies those for each run.

In [9]:
from azureml.train.estimator import Estimator

script_params = {
    '--data-folder': ds.as_mount(),
    '--filename': 'banking_train.csv'
}

est_config = Estimator(source_directory=script_folder,
                       script_params=script_params,
                       compute_target=compute_target,
                       entry_script='train.py',
                       conda_packages=['scikit-learn', 'pandas'])
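The sketch below illustrates how the fixed arguments and one sampled configuration could come together on the command line of `train.py`. It assumes that the sampled values are appended as extra `--flag value` pairs after the fixed ones; the mount path is a placeholder, and `build_parser` simply mirrors the argument definitions in the training script.

```python
import argparse


def build_parser():
    """Mirror the arguments that train.py declares."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--data-folder", type=str)
    parser.add_argument("--filename", type=str)
    parser.add_argument("--C", type=float)
    parser.add_argument("--solver", type=str)
    return parser


# Fixed arguments, as in script_params. The mount path is a placeholder.
fixed_args = ["--data-folder", "/hypothetical/mount", "--filename", "banking_train.csv"]

# One sampled configuration from the grid (assumed to be appended as flag/value pairs).
sampled = {"--solver": "lbfgs", "--C": 0.001}
sampled_args = [str(token) for pair in sampled.items() for token in pair]

args = build_parser().parse_args(fixed_args + sampled_args)
print(args.solver, args.C)  # lbfgs 0.001
```

Because the script reads everything through `argparse`, it does not need to know whether a value came from `script_params` or from the sampler.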

### Specify early termination policy

An early termination policy automatically stops poorly performing runs. This frees compute resources for exploring other parameter configurations.

Azure Machine Learning service supports the following early termination policies:

- Bandit policy
- Median stopping policy
- Truncation selection policy
- No termination policy

In this lab we are going to utilize **No termination policy**.



In [10]:
policy = NoTerminationPolicy()
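For contrast with the no-termination policy used here, the following pure-Python sketch shows the kind of rule a Bandit policy applies when it is configured with a slack factor. It assumes a metric that is being maximized and a threshold of `best_metric / (1 + slack_factor)`. The function name and the metric values are hypothetical, and the sketch ignores evaluation intervals and delays.

```python
def bandit_should_terminate(run_metric, best_metric, slack_factor):
    """Return True if a run falls outside the allowed slack of the best run.

    Assumes the primary metric is being maximized.
    """
    threshold = best_metric / (1 + slack_factor)
    return run_metric < threshold


# Hypothetical intermediate metric values, for illustration only.
best_so_far = 0.8
for metric in (0.78, 0.70, 0.60):
    print(metric, bandit_should_terminate(metric, best_so_far, slack_factor=0.2))
```

With a best value of 0.8 and a slack factor of 0.2, the threshold is roughly 0.667, so only the run reporting 0.60 would be stopped. Note that `train.py` logs `val_recall` once, at the end of each run, so a policy that compares intermediate values would have little to act on in this lab.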

### Configure experiment

Now we are ready to configure a `Hyperdrive` experiment. In addition to the settings defined in the sections above, we also set resource allocation constraints (the maximum number of training runs and the maximum number of concurrent runs) and the primary optimization metric.

If you revisit the training script, you will notice that it logs cross-validation accuracy and recall at the end of every run. As you may remember from Lab 1, we want to minimize the number of false negatives: customers who were wrongly identified as having a low propensity to buy. We therefore want a model with high recall.
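As a reminder of why recall is the right metric for this goal, here is a small worked example. Recall is the share of actual positives (customers who would buy) that the model identifies, so every false negative lowers it. The counts are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def recall(true_positives, false_negatives):
    """Recall = TP / (TP + FN): the share of actual buyers the model finds."""
    return true_positives / (true_positives + false_negatives)


# Hypothetical counts, for illustration only.
print(recall(true_positives=60, false_negatives=40))  # 0.6
print(recall(true_positives=90, false_negatives=10))  # 0.9
```

Reducing the false negatives from 40 to 10 among the same 100 actual buyers raises recall from 0.6 to 0.9.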

In [11]:
htc = HyperDriveRunConfig(estimator=est_config, 
                          hyperparameter_sampling=ps,
                          policy=policy,
                          primary_metric_name="val_recall", 
                          primary_metric_goal=PrimaryMetricGoal.MAXIMIZE, 
                          max_total_runs=16,
                          max_concurrent_runs=3)
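The two resource constraints also give a rough idea of how the job will unfold. The sketch below is a back-of-the-envelope estimate that assumes all runs take about the same time, which will not hold exactly in practice.

```python
import math

max_total_runs = 16
max_concurrent_runs = 3

# Minimum number of sequential "waves" if every run takes equally long.
waves = math.ceil(max_total_runs / max_concurrent_runs)
print(waves)  # 6
```

Five full waves of three runs plus one final run cover all 16 configurations. The concurrency limit of 3 matches the maximum node count of the cluster configured earlier.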

Create a new experiment to capture the `Hyperdrive` runs.

In [12]:
experiment_name = 'propensity_to_buy_hyperdrive'

from azureml.core import Experiment
exp = Experiment(workspace=ws, name=experiment_name)

### Submit the run

Finally, launch the Hyperdrive job.

In [13]:
hdr = exp.submit(config=htc)
hdr

| Experiment | Id | Type | Status | Details Page | Docs Page |
| --- | --- | --- | --- | --- | --- |
| propensity_to_buy_hyperdrive | propensity_to_buy_hyperdrive_1555508196466 | hyperdrive | Running | Link to Azure Portal | Link to Documentation |


### Monitor the job

In [14]:
from azureml.widgets import RunDetails

RunDetails(hdr).show()

_HyperDriveWidget(widget_settings={'childWidgetDisplay': 'popup', 'send_telemetry': False, 'log_level': 'INFO'…

In [15]:
hdr.wait_for_completion(show_output=True) # specify True for a verbose log

RunId: propensity_to_buy_hyperdrive_1555508196466

Execution Summary
RunId: propensity_to_buy_hyperdrive_1555508196466



{'runId': 'propensity_to_buy_hyperdrive_1555508196466',
 'target': 'cpu-cluster',
 'status': 'Completed',
 'endTimeUtc': '2019-04-17T13:43:36.000Z',
 'properties': {'primary_metric_config': '{"name": "val_recall", "goal": "maximize"}',
  'runTemplate': 'HyperDrive',
  'azureml.runsource': 'hyperdrive',
  'ContentSnapshotId': '0e9605ae-dcb1-44be-9792-612db1c86e7d'},
 'logFiles': {'azureml-logs/hyperdrive.txt': 'https://mlopsfatosisma6452541516.blob.core.windows.net/azureml/ExperimentRun/dcid.propensity_to_buy_hyperdrive_1555508196466/azureml-logs/hyperdrive.txt?sv=2018-03-28&sr=b&sig=FbFmzQ05UI8MnansWG%2Ft%2BHzcX2f5rPFuxI4oXDQzxYo%3D&st=2019-04-17T13%3A36%3A36Z&se=2019-04-17T21%3A46%3A36Z&sp=r'}}

## Find and retrieve the best model

When all the jobs finish, you can find the one with the best value of the primary metric, in our case *recall*.

In [16]:
best_run = hdr.get_best_run_by_primary_metric()

In [17]:
best_run_metrics = best_run.get_metrics()
parameter_values = best_run.get_details()['runDefinition']['arguments']

print('Best Run Id: ', best_run.id)
print('\n Validation recall:', best_run_metrics['val_recall'])
print('\n solver:',parameter_values[5])
print('\n C:',parameter_values[7])

Best Run Id:  propensity_to_buy_hyperdrive_1555508196466_0

 Validation recall: 0.6400870938758948

 solver: lbfgs

 C: 0.001
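Reading the hyperparameters by position (`parameter_values[5]` and `parameter_values[7]`) depends on the exact order of the argument list. A less fragile option is to pair each flag with the value that follows it. The sketch below uses a hypothetical helper, `arguments_to_dict`, and an example list in the shape those indices imply; it assumes the list strictly alternates between flags and values.

```python
def arguments_to_dict(arguments):
    """Pair each '--flag' with the value that follows it.

    Assumes the list strictly alternates: flag, value, flag, value, ...
    """
    return {
        flag: value
        for flag, value in zip(arguments[::2], arguments[1::2])
        if str(flag).startswith("--")
    }


# Hypothetical argument list; the mount path is a placeholder.
example_arguments = [
    "--data-folder", "<mount path>",
    "--filename", "banking_train.csv",
    "--solver", "lbfgs",
    "--C", "0.001",
]

params = arguments_to_dict(example_arguments)
print(params["--solver"], params["--C"])  # lbfgs 0.001
```

In the notebook, the same helper could be applied to `parameter_values`, after which the solver and `C` can be looked up by name rather than by index.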


## Register the best model

The last step in the training script wrote the file `model.pkl` to the `outputs` directory. As noted before, `outputs` is a special directory: all content in it is automatically uploaded to your workspace and appears in the run record in the experiment.

You can register the model so that it can later be queried, examined, and deployed.

In [18]:
tags = {"CreatedBy": "HyperDrive"}
model_name = 'propensity_to_buy_predictor'

model = best_run.register_model(model_name=model_name, 
                                model_path='outputs/model.pkl',
                                tags=tags)

print(model.name, model.id, model.version, sep = '\t')

propensity_to_buy_predictor	propensity_to_buy_predictor:1	1


## Test the best model

In [19]:
from azureml.core.model import Model
from sklearn.externals import joblib

model_name = 'propensity_to_buy_predictor'
model_path = Model.get_model_path(model_name, _workspace=ws)
model = joblib.load(model_path)

In [20]:
import numpy as np
import pandas as pd
import os
from sklearn.metrics import accuracy_score, recall_score
from azureml.core.model import Model
from sklearn.externals import joblib

# Rehydrate the model from Model Registry
model_name = 'propensity_to_buy_predictor'
model_path = Model.get_model_path(model_name, _workspace=ws)
model = joblib.load(model_path)

# Load a test dataset
folder = '../datasets'
filename = 'banking_test.csv'
pathname = os.path.join(folder, filename)
df = pd.read_csv(pathname, delimiter=',')
feature_columns = [
                   # Demographic
                   'age', 
                   'job', 
                   'education', 
                   'marital',  
                   'housing', 
                   'loan', 
                   # Previous campaigns
                   'month',
                   'campaign',
                   'poutcome',
                   # Economic indicators
                   'emp_var_rate',
                   'cons_price_idx',
                   'cons_conf_idx',
                   'euribor3m',
                   'nr_employed']
df_test = df[feature_columns + ['y']]
df_test = pd.get_dummies(df_test, drop_first=True).astype(dtype='float')

# Score the test dataset and calculate performance metrics
y_pred = model.predict(df_test.drop('y', axis=1))
print("Test accuracy: ", accuracy_score(df_test.y, y_pred))
print("Test recall: ", recall_score(df_test.y, y_pred))

Test accuracy:  0.8197378004369993
Test recall:  0.6228448275862069


## Next Step

In the next lab, you will operationalize the model developed in this lab.