# Udacity ML Azure, Project 1: Optimizing a Machine Learning Pipeline

Two approaches to optimizing a machine learning pipeline are trialled here and their results compared:
1. Hyperdrive: Build and optimize an Azure ML pipeline using the Python SDK and a provided Scikit-learn model.
2. Azure AutoML.

## Initialize Workspace

In [1]:
from azureml.core import Workspace, Experiment

# Note: this requires the workspace name to be known at the point of editing this notebook:
#ws = Workspace.get(name="udacity-project")

# An alternative approach is to use the "ws = Workspace.from_config()" method.
# This fetches the workspace details from the config file created by Azure when the session starts.
# https://docs.microsoft.com/en-us/python/api/azureml-core/azureml.core.workspace.workspace?view=azure-ml-py
ws = Workspace.from_config()
exp = Experiment(workspace=ws, name="udacity-project")

print('Workspace name: ' + ws.name, 
      'Azure region: ' + ws.location, 
      'Subscription id: ' + ws.subscription_id, 
      'Resource group: ' + ws.resource_group, sep = '\n')

run = exp.start_logging()

Workspace name: quick-starts-ws-139623
Azure region: southcentralus
Subscription id: 9a7511b8-150f-4a58-8528-3e7d50216c31
Resource group: aml-quickstarts-139623
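
`Workspace.from_config()` reads the workspace details from a `config.json` file. As a rough illustration only, the sketch below writes and re-reads a file of that shape in a temporary directory. The values are copied from the output above; the three key names are those of the standard downloadable Azure ML config file, and the `config_path` location is purely hypothetical.

```python
import json
import tempfile
from pathlib import Path

# Values copied from the cell output above. The key names follow the
# config.json file that Azure ML makes available for a workspace.
config = {
    "subscription_id": "9a7511b8-150f-4a58-8528-3e7d50216c31",
    "resource_group": "aml-quickstarts-139623",
    "workspace_name": "quick-starts-ws-139623",
}

# Write to a throwaway directory (hypothetical location) so that no real
# configuration file is overwritten.
config_path = Path(tempfile.mkdtemp()) / "config.json"
config_path.write_text(json.dumps(config, indent=2))

# Read the file back and check that the expected keys are present.
loaded = json.loads(config_path.read_text())
required_keys = ("subscription_id", "resource_group", "workspace_name")
missing = [key for key in required_keys if key not in loaded]
print("Missing keys:", missing)
```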


# Hyperdrive
## 1. Create a compute cluster

In [2]:
from azureml.core.compute import ComputeTarget, AmlCompute
from azureml.core.compute_target import ComputeTargetException  # Use when verifying whether cluster already exists

# Create the compute cluster.
# Project constraints: vm_size = "Standard_D2_V2" and max_nodes no greater than 4.

# Useful refs for verifying whether cluster exists and provisioning a compute cluster:
# https://docs.microsoft.com/en-us/python/api/azureml-core/azureml.core.compute.computetarget?view=azure-ml-py
# https://github.com/Azure/MachineLearningNotebooks/blob/master/how-to-use-azureml/training/train-on-amlcompute/train-on-amlcompute.ipynb

# Set the compute cluster name
cluster_name = "azml-cluster"

# Verify that cluster does not exist already
# If cluster already exists use it, otherwise initialize the cluster (exception thrown if cluster does not exist)
try:
    azml_cluster = ComputeTarget(ws, cluster_name)  # exists
    print('Found existing cluster, use it.')
except ComputeTargetException:
    # Set up the compute cluster config
    compute_config = AmlCompute.provisioning_configuration(vm_size = "Standard_D2_V2", max_nodes = 4)
    # Create the compute cluster
    azml_cluster = ComputeTarget.create(ws, cluster_name, compute_config)

# Wait for the current provisioning operation to finish on the cluster.
azml_cluster.wait_for_completion(show_output = True)

# Display status information about the AmlCompute target:
# Fetch AmlComputeStatus object using ComputeTarget's get_status() method
# AmlComputeStatus: Represents detailed status information about an AmlCompute target.
# Convert the AmlComputeStatus object into a JSON serialized dictionary using serialize().
# Ref: https://docs.microsoft.com/en-us/python/api/azureml-core/azureml.core.compute.amlcompute.amlcomputestatus
print(azml_cluster.get_status().serialize())


Creating
Succeeded
AmlCompute wait for completion finished

Minimum number of nodes requested have been provisioned
{'currentNodeCount': 0, 'targetNodeCount': 0, 'nodeStateCounts': {'preparingNodeCount': 0, 'runningNodeCount': 0, 'idleNodeCount': 0, 'unusableNodeCount': 0, 'leavingNodeCount': 0, 'preemptedNodeCount': 0}, 'allocationState': 'Steady', 'allocationStateTransitionTime': '2021-02-28T16:53:16.462000+00:00', 'errors': None, 'creationTime': '2021-02-28T16:53:12.527524+00:00', 'modifiedTime': '2021-02-28T16:53:28.686709+00:00', 'provisioningState': 'Succeeded', 'provisioningStateTransitionTime': None, 'scaleSettings': {'minNodeCount': 0, 'maxNodeCount': 4, 'nodeIdleTimeBeforeScaleDown': 'PT120S'}, 'vmPriority': 'Dedicated', 'vmSize': 'STANDARD_D2_V2'}


## 2. Set up the Hyperdrive configuration

In [4]:
from azureml.widgets import RunDetails
from azureml.train.sklearn import SKLearn
from azureml.train.hyperdrive.run import PrimaryMetricGoal
from azureml.train.hyperdrive.policy import BanditPolicy
from azureml.train.hyperdrive.sampling import RandomParameterSampling
from azureml.train.hyperdrive.runconfig import HyperDriveConfig
from azureml.train.hyperdrive.parameter_expressions import uniform, choice
import os

# Specify parameter sampler
# https://docs.microsoft.com/en-us/python/api/azureml-train-core/azureml.train.hyperdrive.randomparametersampling
param_dict = {
    "--C" : uniform(0.001, 2.0),
    "--max_iter" : choice(10,25,50,100,150,200,250)
}
ps = RandomParameterSampling(param_dict)

# Specify a Policy
# https://docs.microsoft.com/en-us/python/api/azureml-train-core/azureml.train.hyperdrive.banditpolicy
policy = BanditPolicy(evaluation_interval=5, slack_factor=0.2)

# Ensure that a local "training" folder exists
os.makedirs("training", exist_ok=True)

# Create a SKLearn estimator for use with train.py
# https://docs.microsoft.com/en-us/python/api/azureml-train-core/azureml.train.sklearn.sklearn
est = SKLearn(source_directory=".",
              compute_target=azml_cluster, 
              vm_size="Standard_D2_V2", 
              vm_priority="dedicated", 
              entry_script="train.py")

# Create a HyperDriveConfig using the estimator, hyperparameter sampler, and policy.
# https://docs.microsoft.com/en-us/python/api/azureml-train-core/azureml.train.hyperdrive.hyperdriveconfig
hyperdrive_config = HyperDriveConfig(hyperparameter_sampling=ps, 
                                     primary_metric_name="Accuracy", 
                                     primary_metric_goal=PrimaryMetricGoal.MAXIMIZE, 
                                     policy=policy, 
                                     estimator=est,
                                     max_total_runs=16,
                                     max_concurrent_runs=4)

'SKLearn' estimator is deprecated. Please use 'ScriptRunConfig' from 'azureml.core.script_run_config' with your own defined environment or the AzureML-Tutorial curated environment.
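
To give a feel for what random sampling over this search space produces, here is a minimal pure-Python sketch. It is not the SDK's implementation: `sample_configuration` is a hypothetical helper, and the standard library's `random` module stands in for the `uniform` and `choice` parameter expressions.

```python
import random

MAX_ITER_OPTIONS = [10, 25, 50, 100, 150, 200, 250]


def sample_configuration(rng):
    """Draw one hyperparameter configuration (hypothetical stand-in for the sampler)."""
    return {
        "--C": rng.uniform(0.001, 2.0),               # continuous range, as in uniform(0.001, 2.0)
        "--max_iter": rng.choice(MAX_ITER_OPTIONS),   # discrete set, as in choice(...)
    }


rng = random.Random(0)  # fixed seed so the sketch is repeatable
configs = [sample_configuration(rng) for _ in range(16)]  # mirrors max_total_runs=16

for cfg in configs[:3]:
    print(cfg)
```

Each child run receives one such configuration as command-line arguments to `train.py`.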


## 3. Submit Hyperdrive run

In [5]:
# Submit your hyperdrive run to the experiment and show run details with the widget.
hd_run = exp.submit(hyperdrive_config)
RunDetails(hd_run).show()



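
While the child runs execute, the `BanditPolicy` configured earlier decides which of them to stop early. The arithmetic can be sketched as follows for a maximization goal: a run is terminated when its metric falls below the best metric divided by `(1 + slack_factor)`. The function name and the example metric values are hypothetical, and the sketch ignores `evaluation_interval`, which controls how often the check is applied.

```python
def is_terminated(run_metric, best_metric, slack_factor):
    """Return True if a run falls outside the allowed slack (maximization goal)."""
    threshold = best_metric / (1 + slack_factor)
    return run_metric < threshold


best_so_far = 0.91   # hypothetical best accuracy reported so far
slack_factor = 0.2   # same value as in the policy above

threshold = best_so_far / (1 + slack_factor)
print("Termination threshold:", round(threshold, 4))

for accuracy in (0.70, 0.80, 0.90):  # hypothetical accuracies of other runs
    print(accuracy, "terminated" if is_terminated(accuracy, best_so_far, slack_factor) else "kept")
```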

## 4. Get the best Hyperdrive run and save the model

In [6]:
import joblib
# Get your best run and save the model from that run.
best_run = hd_run.get_best_run_by_primary_metric()
best_run_metrics = best_run.get_metrics()

print("best_run_metrics:",best_run_metrics)

best_run_metrics: {'Regularization Strength:': 1.6807881339560455, 'Max iterations:': 25, 'Accuracy': 0.9107044325915027}


In [8]:
# Implementation for saving the model guided by information here:
# https://knowledge.udacity.com/questions/424266
# https://www.kaggle.com/pankaj1234/azure-machine-learning-model-training 
# https://towardsdatascience.com/azure-machine-learning-service-train-a-model-df72c6b5dc
print("Filenames:",best_run.get_file_names()) # print the filenames
print("Run Details:",best_run.get_details()['runDefinition']['arguments']) # print the run details

Filenames: ['azureml-logs/55_azureml-execution-tvmps_8b7bb7d43ee67b8a9ad25f1b41545ffa8886017ff5dd93d3d5458687683e8cbc_d.txt', 'azureml-logs/65_job_prep-tvmps_8b7bb7d43ee67b8a9ad25f1b41545ffa8886017ff5dd93d3d5458687683e8cbc_d.txt', 'azureml-logs/70_driver_log.txt', 'azureml-logs/75_job_post-tvmps_8b7bb7d43ee67b8a9ad25f1b41545ffa8886017ff5dd93d3d5458687683e8cbc_d.txt', 'azureml-logs/process_info.json', 'azureml-logs/process_status.json', 'logs/azureml/103_azureml.log', 'logs/azureml/job_prep_azureml.log', 'logs/azureml/job_release_azureml.log', 'outputs/my_model.joblib']
Run Details: ['--C', '1.6807881339560455', '--max_iter', '25']
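
The arguments listed above are passed to `train.py`, which is not shown in this notebook. A minimal sketch of how such a script might parse them is given below; the function name, default values and help texts are assumptions, not the project's actual code.

```python
import argparse


def parse_training_args(argv):
    """Parse the two hyperparameters supplied by Hyperdrive (hypothetical sketch)."""
    parser = argparse.ArgumentParser(description="Hypothetical argument parsing for train.py")
    parser.add_argument("--C", type=float, default=1.0,
                        help="Inverse of regularization strength")
    parser.add_argument("--max_iter", type=int, default=100,
                        help="Maximum number of solver iterations")
    return parser.parse_args(argv)


# Arguments of the best run, as printed above
args = parse_training_args(["--C", "1.6807881339560455", "--max_iter", "25"])
print(args.C, args.max_iter)
```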


In [10]:
print("Best Run ID:",best_run.id)
print("Accuracy:",best_run_metrics["Accuracy"])

Best Run ID: HD_17aefb7a-a94c-4e0b-8507-ec4a2938f198_0
Accuracy: 0.9107044325915027


In [11]:
# Display details of the best run
best_run

| Experiment | Id | Type | Status | Details Page | Docs Page |
| --- | --- | --- | --- | --- | --- |
| udacity-project | HD_17aefb7a-a94c-4e0b-8507-ec4a2938f198_0 | azureml.scriptrun | Completed | Link to Azure Machine Learning studio | Link to Documentation |


In [12]:
# Download the model from the best run to the local file system using the run object's download_file() method.
# The model is referenced by its explicit path in the run's outputs (see the file list above) rather than
# by its position in the list, which would break if the order of the files changed.
os.makedirs("outputs", exist_ok=True)  # Ensure that there is a local outputs folder
best_run.download_file("outputs/my_model.joblib", output_file_path='./outputs/my_best_hd_model.joblib')  # Download

In [13]:
# Verify that best model has been retrieved
joblib.load('./outputs/my_best_hd_model.joblib')

The sklearn.linear_model.logistic module is  deprecated in version 0.22 and will be removed in version 0.24. The corresponding classes / functions should instead be imported from sklearn.linear_model. Anything that cannot be imported from sklearn.linear_model is now part of the private API.
Trying to unpickle estimator LogisticRegression from version 0.20.3 when using version 0.22.2.post1. This might lead to breaking code or invalid results. Use at your own risk.
From version 0.24, get_params will raise an AttributeError if a parameter cannot be retrieved as an instance attribute. Previously it would return None.


LogisticRegression(C=1.6807881339560455, class_weight=None, dual=False,
                   fit_intercept=True, intercept_scaling=1, l1_ratio=None,
                   max_iter=25, multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=None, solver='warn', tol=0.0001, verbose=0,
                   warm_start=False)

In [14]:
# Register the model
best_hyperdrive_model = best_run.register_model(
    model_name="best_hyperdrive_model",
    model_path="./outputs/my_model.joblib",  # Cloud path
    tags=best_run.get_metrics()
)

# AutoML

## 0. Install missing packages

In [None]:
# Retrieving the best AutoML run and model (step 5) fails with:
# AttributeError: /anaconda/envs/azureml_py36/lib/libxgboost.so: undefined symbol: XGBoosterUnserializeFromBuffer
# It was suggested that this is due to XGBoost not being pre-installed.
# Relevant Udacity Knowledge thread:
# https://knowledge.udacity.com/questions/504246
# Suggested solution:
# import sys
# !conda install --yes --prefix {sys.prefix} py-xgboost
# !conda install -y anaconda py-xgboost

import xgboost
print(xgboost.__version__)

## 1. Read in the dataset

In [15]:
from azureml.data.dataset_factory import TabularDatasetFactory

# Create TabularDataset using TabularDatasetFactory
# Data is available at: 
# "https://automlsamplenotebookdata.blob.core.windows.net/automl-sample-notebook-data/bankmarketing_train.csv"

raw_data_url = "https://automlsamplenotebookdata.blob.core.windows.net/automl-sample-notebook-data/bankmarketing_train.csv"
ds = TabularDatasetFactory.from_delimited_files(raw_data_url)

## 2. Prepare the data

Clean the data and divide into training and test sets.

In [16]:
import pandas as pd
from sklearn.model_selection import train_test_split
from train import clean_data

# Use the clean_data function to clean the data (as per project instructions).
x, y = clean_data(ds)

# Split into train and test set (as per project instructions)
# Note: nothing has been specified in the project instructions or the project rubric with respect to 
# the use of the test set...
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.33, random_state=8, stratify=y)

# Training data as DataFrame (joining input and target feature again into one DataFrame)
df_train = pd.concat([x_train,y_train], axis=1)




## 3. Set the parameters of the AutoMLConfig

In [21]:
from azureml.train.automl import AutoMLConfig

# Get the default datastore
datastore = ws.get_default_datastore()

# Convert the DataFrame into a TabularDataset
train_data = TabularDatasetFactory.register_pandas_dataframe(df_train, datastore, "train_data")

# Set parameters for AutoMLConfig
# NOTE: DO NOT CHANGE THE experiment_timeout_minutes PARAMETER OR YOUR INSTANCE WILL TIME OUT.
# If you wish to run the experiment longer, you will need to run this notebook in your own
# Azure tenant, which will incur personal costs.
automl_config = AutoMLConfig(
    experiment_timeout_minutes=30,
    task='classification',
    primary_metric='accuracy',
    training_data=train_data,
    label_column_name='y',
    n_cross_validations=5,
    compute_target=azml_cluster)



Validating arguments.
Arguments validated.
Successfully obtained datastore reference and path.
Uploading file to managed-dataset/2494b52e-9225-4fb7-94b9-d3fb3fbd26e9/
Successfully uploaded file to datastore.
Creating and registering a new dataset.
Successfully created and registered a new dataset.
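
The `n_cross_validations=5` setting asks AutoML to score each candidate model with 5-fold cross-validation. The sketch below shows, in plain Python, how row indices can be partitioned into folds. It is illustrative only: `k_fold_indices` is a hypothetical helper, and AutoML's own splitting logic (which may shuffle or stratify) is not reproduced here.

```python
def k_fold_indices(n_samples, n_folds):
    """Yield (train_indices, validation_indices) for each fold, without shuffling."""
    # Distribute the remainder so that fold sizes differ by at most one row.
    fold_sizes = [n_samples // n_folds] * n_folds
    for i in range(n_samples % n_folds):
        fold_sizes[i] += 1

    start = 0
    for size in fold_sizes:
        validation = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, validation
        start += size


folds = list(k_fold_indices(n_samples=12, n_folds=5))
for fold_number, (train, validation) in enumerate(folds, start=1):
    print(f"Fold {fold_number}: {len(train)} training rows, validation rows {validation}")
```

Every row is used for validation exactly once, so the reported metric is an average over five held-out subsets rather than a single split.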


## 4. Submit the AutoML run

In [22]:
from azureml.widgets import RunDetails  # Imported again because AutoML was rerun in a second attempt
# Submit your automl run
experiment = Experiment(workspace=ws, name="automl_expt")
automl_run = experiment.submit(config=automl_config, show_output=True)
# View progress
RunDetails(automl_run).show()
automl_run.wait_for_completion(show_output=True)

Running on remote.
No run_configuration provided, running on azml-cluster with default configuration
Running on remote compute: azml-cluster
Parent Run ID: AutoML_a34fcb14-5db4-4ecb-8c25-d48d762eb4d7

Current status: FeaturesGeneration. Generating features for the dataset.
Current status: DatasetFeaturization. Beginning to fit featurizers and featurize the dataset.
Current status: DatasetBalancing. Performing class balancing sweeping
Current status: DatasetCrossValidationSplit. Generating individually featurized CV splits.
Current status: ModelSelection. Beginning model selection.

****************************************************************************************************
DATA GUARDRAILS: 

TYPE:         Class balancing detection
STATUS:       ALERTED
DESCRIPTION:  To decrease model bias, please cancel the current run and fix balancing problem.
              Learn more about imbalanced data: https://aka.ms/AutomatedMLImbalancedData
DETAILS:      Imbalanced data can lead to a f

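
The class balancing alert matters because accuracy, the primary metric used here, can look good on imbalanced data even for a useless model. A small sketch with hypothetical class counts (not the real counts of this dataset) illustrates the effect: a baseline that always predicts the majority class scores high accuracy while finding none of the positive cases.

```python
# Hypothetical labels: 900 negatives and 100 positives (not the real dataset counts)
labels = [0] * 900 + [1] * 100

# Baseline "model" that always predicts the majority class
baseline_predictions = [0] * len(labels)

correct = sum(t == p for t, p in zip(labels, baseline_predictions))
accuracy = correct / len(labels)

true_positives = sum(t == 1 and p == 1 for t, p in zip(labels, baseline_predictions))
recall_positive = true_positives / sum(t == 1 for t in labels)

print("Accuracy:", accuracy)
print("Recall on the positive class:", recall_positive)
```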

## 5. Retrieve and save the best model

In [None]:
# Find the version of xgboost in use by the AutoML - look in the logs

# INSTALL the matching version from this notebook

# import it again:
import importlib
importlib.reload(xgboost)
print(xgboost.__version__)

In [24]:
# Retrieve and save your best automl model.
best_automl_run, best_automl_model = automl_run.get_output()

AttributeError: /anaconda/envs/azureml_py36/lib/libxgboost.so: undefined symbol: XGBoosterUnserializeFromBuffer

In [None]:
print("Filenames (best AutoML run):",best_automl_run.get_file_names()) # print the filenames
print(best_automl_run.get_tags())

In [None]:
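# Download the model
# Assumption: AutoML stores the fitted pipeline at outputs/model.pkl within the run.
# The file is referenced by name rather than by list position, and a full local file name is given.
best_automl_run.download_file('outputs/model.pkl', output_file_path='./outputs/automl_best_model.pkl')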

In [None]:
# Alternatively use joblib (testing to see whether both approaches work)
joblib.dump(value=best_automl_model, filename='outputs/automl_best_model.joblib')

In [None]:
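# model_path is the path of the model within the run's files (cloud path), not a local path.
# Assumption: AutoML stores the fitted pipeline at outputs/model.pkl within the run.
registered_best_automl_model = best_automl_run.register_model(
    model_name = 'automl_best_model', 
    model_path = 'outputs/model.pkl',
    tags=best_automl_run.get_metrics())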

## 6. Additional step - test the best AutoML model 

Use the test set to make predictions using the trained model and display the results.

In [None]:
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
# https://docs.microsoft.com/en-us/azure/machine-learning/tutorial-auto-train-models
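# Predict with the fitted model returned by get_output();
# the registered Model object is only a reference to the stored file and has no predict() method.
y_predict = best_automl_model.predict(x_test)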

In [None]:
# Prepare the target values for use in scoring
y_true = y_test.values.flatten().tolist()

In [None]:
# https://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html
tn, fp, fn, tp = confusion_matrix(y_true, y_predict).ravel()
print("True Negative Count:", tn)
print("False Positive Count:", fp)
print("False Negative Count:", fn)
print("True Positive Count:", tp)

In [None]:
# https://scikit-learn.org/stable/modules/generated/sklearn.metrics.classification_report.html
target_names = ['NO', 'YES']
print(classification_report(y_true, y_predict, target_names=target_names))
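
The four confusion matrix counts are sufficient to derive the positive-class figures that `classification_report` prints. A minimal sketch follows; the counts are hypothetical (the cells above were not run), and `summarize_counts` is an illustrative helper rather than a scikit-learn function.

```python
def summarize_counts(tn, fp, fn, tp):
    """Derive accuracy, precision, recall and F1 for the positive class from raw counts."""
    total = tn + fp + fn + tp
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Hypothetical counts for illustration only
metrics = summarize_counts(tn=850, fp=50, fn=60, tp=40)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With imbalanced classes, precision and recall for the minority class are usually more informative than accuracy alone.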

## Clean up deployed resources.

In [None]:
# Delete the compute cluster 
azml_cluster.delete()