# Udacity ML Azure, Project 1: Optimizing a Machine Learning Pipeline

This notebook trials and compares two means of optimizing a machine learning pipeline:
* 1. Hyperdrive: Build and optimize an Azure ML pipeline using the Python SDK and a provided Scikit-learn model.
* 2. Azure AutoML.

The results of these two approaches are compared.

In [None]:
from azureml.core import Workspace, Experiment

# Note: this requires the workspace name to be known at the point of editing this notebook:
#ws = Workspace.get(name="udacity-project")

# An alternative approach is to use the "ws = Workspace.from_config()" method.
# This fetches the workspace details from the config file created by Azure when the session starts.
# https://docs.microsoft.com/en-us/python/api/azureml-core/azureml.core.workspace.workspace?view=azure-ml-py
ws = Workspace.from_config()
exp = Experiment(workspace=ws, name="udacity-project")

print('Workspace name: ' + ws.name, 
      'Azure region: ' + ws.location, 
      'Subscription id: ' + ws.subscription_id, 
      'Resource group: ' + ws.resource_group, sep = '\n')

run = exp.start_logging()

# Hyperdrive
## 1. Create a compute cluster

In [None]:
from azureml.core.compute import ComputeTarget, AmlCompute
from azureml.core.compute_target import ComputeTargetException  # Use when verifying whether cluster already exists

# Create the compute cluster.
# The provisioning configuration uses vm_size = "Standard_D2_V2"; max_nodes should be no greater than 4.

# Useful refs for verifying whether cluster exists and provisioning a compute cluster:
# https://docs.microsoft.com/en-us/python/api/azureml-core/azureml.core.compute.computetarget?view=azure-ml-py
# https://github.com/Azure/MachineLearningNotebooks/blob/master/how-to-use-azureml/training/train-on-amlcompute/train-on-amlcompute.ipynb

# Set the compute cluster name
cluster_name = "azml-cluster"

# Verify that cluster does not exist already
# If cluster already exists use it, otherwise initialize the cluster (exception thrown if cluster does not exist)
try:
    azml_cluster = ComputeTarget(ws, cluster_name)  # exists
    print('Found existing cluster, use it.')
except ComputeTargetException:
    # Set up the compute cluster config
    compute_config = AmlCompute.provisioning_configuration(vm_size = "Standard_D2_V2", max_nodes = 4)
    # Create the compute cluster
    azml_cluster = ComputeTarget.create(ws, cluster_name, compute_config)

# Wait for the current provisioning operation to finish on the cluster.
azml_cluster.wait_for_completion(show_output = True)

# Display status information about the AmlCompute target:
# Fetch AmlComputeStatus object using ComputeTarget's get_status() method
# AmlComputeStatus: Represents detailed status information about an AmlCompute target.
# Convert the AmlComputeStatus object into a JSON serialized dictionary using serialize().
# Ref: https://docs.microsoft.com/en-us/python/api/azureml-core/azureml.core.compute.amlcompute.amlcomputestatus
print(azml_cluster.get_status().serialize())


## 2. Set up the Hyperdrive configuration

In [None]:
from azureml.widgets import RunDetails
from azureml.train.sklearn import SKLearn
from azureml.train.hyperdrive.run import PrimaryMetricGoal
from azureml.train.hyperdrive.policy import BanditPolicy
from azureml.train.hyperdrive.sampling import RandomParameterSampling
from azureml.train.hyperdrive.runconfig import HyperDriveConfig
from azureml.train.hyperdrive.parameter_expressions import uniform, choice
import os

# Specify parameter sampler
# https://docs.microsoft.com/en-us/python/api/azureml-train-core/azureml.train.hyperdrive.randomparametersampling
param_dict = {
    "--C" : uniform(0.001, 2.0),
    "--max_iter" : choice(10,25,50,100,150,200,250)
}
ps = RandomParameterSampling(param_dict)

# Specify a Policy
# https://docs.microsoft.com/en-us/python/api/azureml-train-core/azureml.train.hyperdrive.banditpolicy
policy = BanditPolicy(evaluation_interval=5, slack_factor=0.2)

# Create the training directory if it does not already exist
os.makedirs("./training", exist_ok=True)

# Create a SKLearn estimator for use with train.py
# https://docs.microsoft.com/en-us/python/api/azureml-train-core/azureml.train.sklearn.sklearn
est = SKLearn(source_directory=".",
              compute_target=azml_cluster, 
              vm_size="Standard_D2_V2", 
              vm_priority="dedicated", 
              entry_script="train.py")

# Create a HyperDriveConfig using the estimator, hyperparameter sampler, and policy.
# https://docs.microsoft.com/en-us/python/api/azureml-train-core/azureml.train.hyperdrive.hyperdriveconfig
hyperdrive_config = HyperDriveConfig(hyperparameter_sampler=ps, 
                                     primary_metric_name="Accuracy", 
                                     primary_metric_goal=PrimaryMetricGoal.MAXIMIZE, 
                                     policy=policy, 
                                     estimator=est,
                                     max_total_runs=16,
                                     max_concurrent_runs=4)

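As a rough illustration of what random sampling over the search space above means, the sketch below draws configurations by hand using only the standard library. It is not how HyperDrive is implemented; `sample_config` is a hypothetical helper and the seed is arbitrary.

```python
import random

# Search space mirroring param_dict above: C is continuous, max_iter is discrete.
C_RANGE = (0.001, 2.0)
MAX_ITER_CHOICES = [10, 25, 50, 100, 150, 200, 250]


def sample_config(rng):
    """Draw one hyperparameter configuration (hypothetical helper)."""
    return {
        "--C": rng.uniform(*C_RANGE),
        "--max_iter": rng.choice(MAX_ITER_CHOICES),
    }


rng = random.Random(0)  # arbitrary seed, used only for repeatability
configs = [sample_config(rng) for _ in range(16)]  # 16 matches max_total_runs

# Show the first few sampled configurations
for cfg in configs[:3]:
    print(cfg)
```

Each configuration is drawn independently, so nothing guarantees that every `--max_iter` value is tried within 16 runs.
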
## 3. Submit Hyperdrive run

In [None]:
# Submit your hyperdrive run to the experiment and show run details with the widget.
hd_run = exp.submit(hyperdrive_config)
RunDetails(hd_run).show()

In [None]:
hd_run.wait_for_completion(show_output = True)

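While the runs execute, the bandit policy configured earlier decides which of them to stop early. The sketch below shows the slack-factor comparison as I understand it from the BanditPolicy documentation, assuming the primary metric is maximized. `should_terminate` and the accuracies are hypothetical, not values from this experiment.

```python
def should_terminate(run_metric, best_metric, slack_factor):
    """Return True if a run falls outside the slack allowed around the best run.

    Assumes the primary metric is being maximized.
    """
    return run_metric * (1 + slack_factor) < best_metric


# Hypothetical accuracies reported by four runs at one evaluation point
accuracies = {"run_a": 0.91, "run_b": 0.89, "run_c": 0.75, "run_d": 0.60}
best = max(accuracies.values())

decisions = {
    name: should_terminate(acc, best, slack_factor=0.2)
    for name, acc in accuracies.items()
}
print(decisions)
```

With `slack_factor=0.2`, a run survives as long as its metric is at least roughly the best metric divided by 1.2. The `evaluation_interval` parameter controls how often this check is applied.
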
## 4. Get the best Hyperdrive run and save the model

In [None]:
import joblib
# Get your best run and save the model from that run.
best_run = hd_run.get_best_run_by_primary_metric()
best_run_metrics = best_run.get_metrics()

# Implementation for saving the model guided by information here:
# https://knowledge.udacity.com/questions/424266
# https://github.com/Azure/MachineLearningNotebooks/blob/a039166b901b90b6fa5a364c728aca7936e3789d/how-to-use-azureml/ml-frameworks/scikit-learn/train-hyperparameter-tune-deploy-with-sklearn/train-hyperparameter-tune-deploy-with-sklearn.ipynb
# Download the model from the best run
print("Filenames:",best_run.get_file_names()) # print the filenames
print("Run Details:",best_run.get_details()['runDefinitions']['arguments']) # print the run details
best_run.download_file(best_run.get_file_names()[-1], output_file_path='./outputs/')

print("Best Run ID:",best_run.id)
print("Accuracy:",best_run_metrics["Accuracy"])  # key matches primary_metric_name in the HyperDriveConfig

In [None]:
best_run

In [None]:
# Verify that best model has been retrieved
joblib.load('./outputs/my_model.joblib')

# Register the model
best_hyperdrive_model = best_run.register_model(
    model_name="best_hyperdrive_model",
    model_path="./outputs/my_model.joblib",
    tags=best_run.get_metrics()
)

# ------------------------------------------------------------------------------------------------------------
# AutoML
## 1. Read in dataset

In [None]:
from azureml.data.dataset_factory import TabularDatasetFactory

# Create TabularDataset using TabularDatasetFactory
# Data is available at: 
# "https://automlsamplenotebookdata.blob.core.windows.net/automl-sample-notebook-data/bankmarketing_train.csv"

raw_data_url = "https://automlsamplenotebookdata.blob.core.windows.net/automl-sample-notebook-data/bankmarketing_train.csv"
ds = TabularDatasetFactory.from_delimited_files(raw_data_url)

## 2. Clean the data

Clean the data and divide into training and test sets.

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
from train import clean_data

# Use the clean_data function to clean the data.
x, y = clean_data(ds)

# Split into train and test set
# Note: nothing has been specified in the rubric with respect to use of the test set...
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.33, random_state=8, stratify=y)

# Training data as DataFrame (joining input and target feature again into one DataFrame)
df_train = pd.concat([x_train,y_train], axis=1)

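The `stratify=y` argument keeps the class ratio the same in the training and test sets, which matters when one class is much rarer than the other. The following is a minimal standard-library sketch of that idea, not scikit-learn's implementation. `stratified_split` and the labels are hypothetical.

```python
import random
from collections import Counter, defaultdict


def stratified_split(labels, test_size, seed):
    """Split indices so each class keeps roughly the same share in both parts."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Hypothetical imbalanced labels: 90 negatives, 10 positives
labels = [0] * 90 + [1] * 10
train_idx, test_idx = stratified_split(labels, test_size=0.3, seed=8)

print(Counter(labels[i] for i in train_idx))
print(Counter(labels[i] for i in test_idx))
```

Both parts keep the 9:1 ratio of the full label list, whereas a plain random split could leave the test set with very few positives.
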
## 3. Set the parameters of the AutoMLConfig

In [None]:
from azureml.train.automl import AutoMLConfig

# Set parameters for AutoMLConfig
# NOTE: DO NOT CHANGE THE experiment_timeout_minutes PARAMETER OR YOUR INSTANCE WILL TIME OUT.
# If you wish to run the experiment longer, you will need to run this notebook in your own
# Azure tenant, which will incur personal costs.
automl_config = AutoMLConfig(
    experiment_timeout_minutes=30,
    task='classification',
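    primary_metric='accuracy',  # AutoML expects the lowercase metric name
    # Note: with a remote compute_target, AutoML expects training_data to be a TabularDataset
    # rather than a pandas DataFrame, so df_train may first need registering as a dataset.
    training_data=df_train,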
    label_column_name='y',
    n_cross_validations=5,
    compute_target=azml_cluster)

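Setting `n_cross_validations=5` asks AutoML to score each candidate model on five train/validation splits of the training data. The sketch below shows plain k-fold index splitting with the standard library. AutoML's own splitting may differ (for example by shuffling), and `k_fold_indices` is a hypothetical helper.

```python
def k_fold_indices(n_samples, n_folds):
    """Yield (train_indices, validation_indices) for each fold."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most one
    fold_sizes = [
        n_samples // n_folds + (1 if i < n_samples % n_folds else 0)
        for i in range(n_folds)
    ]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size


# Hypothetical dataset of 12 samples split into 5 folds
folds = list(k_fold_indices(n_samples=12, n_folds=5))
for train, validation in folds:
    print(len(train), validation)
```

Every sample is used for validation exactly once, and the reported metric is typically the average over the folds.
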
## 4. Submit the AutoML run

In [2]:
# Submit your automl run
experiment = Experiment(ws, "automl_expt")  # ws is the Workspace object created at the top of the notebook
automl_run = experiment.submit(config=automl_config, show_output=True)
# View progress
RunDetails(automl_run).show()

## 5. Retrieve and save the best model

In [None]:
# Retrieve and save your best automl model.
best_run, best_model = automl_run.get_output()
best_run.register_model(model_name = 'automl_best_model.pkl', model_path = './outputs/')

## 6. Additional step - test the best AutoML model 

In [None]:
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
# https://docs.microsoft.com/en-us/azure/machine-learning/tutorial-auto-train-models
# https://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html
# https://scikit-learn.org/stable/modules/generated/sklearn.metrics.classification_report.html
y_predict = best_model.predict(x_test)
y_actual = y_test.values.flatten().tolist()

tn, fp, fn, tp = confusion_matrix(y_actual, y_predict).ravel()
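# target_names follow the sorted label values; this assumes clean_data encodes "no" as 0 and "yes" as 1
target_names = ['NO', 'YES']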
print(classification_report(y_actual, y_predict, target_names=target_names))

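For a binary problem, `confusion_matrix(...).ravel()` returns the counts in the order tn, fp, fn, tp. The sketch below counts them by hand and derives precision, recall and accuracy for the positive class. The labels and predictions are hypothetical, not results from this project, and `binary_confusion_counts` is an illustrative helper.

```python
def binary_confusion_counts(y_true, y_pred):
    """Count tn, fp, fn, tp for labels encoded as 0 (negative) and 1 (positive)."""
    tn = fp = fn = tp = 0
    for actual, predicted in zip(y_true, y_pred):
        if actual == 1 and predicted == 1:
            tp += 1
        elif actual == 0 and predicted == 0:
            tn += 1
        elif actual == 0 and predicted == 1:
            fp += 1
        else:
            fn += 1
    return tn, fp, fn, tp


# Hypothetical labels and predictions
y_true_demo = [0, 0, 0, 0, 1, 1, 1, 0, 1, 0]
y_pred_demo = [0, 0, 1, 0, 1, 0, 1, 0, 1, 0]

tn, fp, fn, tp = binary_confusion_counts(y_true_demo, y_pred_demo)
precision = tp / (tp + fp)          # share of predicted positives that are correct
recall = tp / (tp + fn)             # share of actual positives that are found
accuracy = (tp + tn) / len(y_true_demo)
print(tn, fp, fn, tp, precision, recall, accuracy)
```

On an imbalanced dataset, precision and recall for the minority class are usually more informative than accuracy alone.
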
## -------------------------------------------------------------------------------------------------------------------------------
## Clean up deployed resources.

In [None]:
# Delete the compute cluster 
azml_cluster.delete()