# Udacity Azure Machine Learning Engineer - Project 1

**Name:** Bob Peck

**Date:** March 5, 2023

This is the Jupyter Notebook associated with the Udacity Azure Machine Learning Project 1. The objective of this project is to compare a custom-coded model (using Scikit-learn Logistic Regression) and an AutoML model. For the custom-coded model, I'll use HyperDrive to optimize the hyperparameters, targeting *accuracy* as the primary metric. For the AutoML model, I'll supply the same dataset and let AutoML select the best model and hyperparameters. I'll limit the time for optimization simply to manage costs on the compute.

## Setup the workspace, compute and experiment

These next following sections I'll provision the components necessary to conduct the ML experiments.

- Get a reference to the previously provisioned ML Workspace
- Setup the compute cluster for the ML experiments

Once those are complete, then I'll begin to define the first experiment - the custom-coded ML HyperDrive experiment.

In [10]:
from azureml.core import Workspace, Experiment

ws = Workspace.from_config()

print('Workspace name: ' + ws.name, 
      'Azure region: ' + ws.location, 
      'Subscription id: ' + ws.subscription_id, 
      'Resource group: ' + ws.resource_group, sep = '\n')


Workspace name: winequality
Azure region: eastus
Subscription id: 4f84b10b-f2d2-47a8-8dbb-52b1a8dc25de
Resource group: cip-ee


In [11]:
from azureml.core.compute import ComputeTarget, AmlCompute
from azureml.core.compute_target import ComputeTargetException

# 
# This provisioning uses the STANDARD_D2_V2 vm size for cost management purposes.
# We could have selected a larger vm for the cluster for more compute to conduct more concurrent experiments
# 

cluster_name = "bank-marketing-cluster"
compute_config = AmlCompute.provisioning_configuration(vm_size="STANDARD_D2_V2", min_nodes=0, max_nodes=4)

try:
    my_compute_target = ComputeTarget(workspace=ws, name=cluster_name)
    print('Found existing compute target.')
except ComputeTargetException:
    my_compute_target = ComputeTarget.create(ws, cluster_name, compute_config)
    my_compute_target.wait_for_completion(show_output=True)



Found existing compute target.


## Setup the HyperDrive experiment

This section sets up the HyperDrive experiment. Key hyperparameters to experiment with are the values for *C* and *max_iter*

Analysis of values for C and max_iter (following is from [scikit-learn documentation](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html)):

- C is the inverse of regularization strength; must be a positive float. Like in support vector machines, smaller values specify stronger regularization. 
- max_iter is maximum number of iterations taken for the solvers to converge.

Observations:

- Tested a variety of values for C, ranging from 0.01 to 100. Lower values produced higher accuracy scores, hence narrowed the range on later runs to 0.001 to 1.
- Tested a variety of values for max_iter, ranging from 100 to 1600. Seemingly past ~1000, the algorithm failed to gain any more accuracy during further iterations. Optimal value seems to be around 600-800, depending on the C value.
- Tested higher values for max_total_runs, but didn't observe any higher accuracy with more runs.


In [12]:
from azureml.train.sklearn import SKLearn
from azureml.train.hyperdrive.run import PrimaryMetricGoal
from azureml.train.hyperdrive.policy import BanditPolicy
from azureml.train.hyperdrive.sampling import RandomParameterSampling
from azureml.train.hyperdrive.runconfig import HyperDriveConfig
from azureml.train.hyperdrive.parameter_expressions import choice, uniform
from azureml.core import Environment, ScriptRunConfig
import os

# Specify parameter sampler
ps = RandomParameterSampling({
    'C': uniform(0.001, 1),
    'max_iter': choice(100, 200, 400, 800)
})

# Specify a Policy
policy = BanditPolicy(slack_factor=0.2, evaluation_interval=1)

if "training" not in os.listdir():
    os.mkdir("./training")

# Setup environment for your training run
sklearn_env = Environment.from_conda_specification(name='sklearn-env', file_path='conda_dependencies.yml')

# Create a ScriptRunConfig Object to specify the configuration details of your training job
src = ScriptRunConfig(source_directory='.', script='train.py', environment=sklearn_env, compute_target=my_compute_target)

# Create a HyperDriveConfig using the src object, hyperparameter sampler, and policy.
hyperdrive_config = HyperDriveConfig(run_config=src,
                                     hyperparameter_sampling=ps,
                                     primary_metric_name='Accuracy',
                                     primary_metric_goal=PrimaryMetricGoal.MAXIMIZE,
                                     max_total_runs=50,
                                     max_concurrent_runs=4,
                                     policy=policy)




### Execute the HyperDrive experiment 

With all the hyperparameters set, we now submit the experiment for execution.

I've chosen to use *wait_for_completion()* method to prevent the next section from executing prior to this being done. This was a personal choice and not specifically needed for successful experiments.

### Results

The LogisticRegression ML model with given hyperparameters seems to find a maximum accuracy score around 90.9%. 

In [4]:
# Submit your hyperdrive run to the experiment and show run details with the widget.
from azureml.widgets import RunDetails

exp = Experiment(workspace=ws, name="udacity-project")

# Submit the HyperDriveConfig object to run the experiment
hyperdrive_run = exp.submit(config=hyperdrive_config, show_output=True)

# Use the RunDetails widget to display the run details
RunDetails(hyperdrive_run).show()
hyperdrive_run.wait_for_completion(show_output=False)


_HyperDriveWidget(widget_settings={'childWidgetDisplay': 'popup', 'send_telemetry': True, 'log_level': 'INFO',…

{'runId': 'HD_07139f9c-ccc0-4b10-9593-219a2c63f855',
 'target': 'bank-marketing-cluster',
 'status': 'Completed',
 'startTimeUtc': '2023-03-04T02:19:06.247193Z',
 'endTimeUtc': '2023-03-04T02:36:05.290382Z',
 'services': {},
 'properties': {'primary_metric_config': '{"name":"Accuracy","goal":"maximize"}',
  'resume_from': 'null',
  'runTemplate': 'HyperDrive',
  'azureml.runsource': 'hyperdrive',
  'platform': 'AML',
  'ContentSnapshotId': '27b36c0c-ffac-4b7c-ac0a-a7c0d10e923b',
  'user_agent': 'python/3.8.10 (Linux-5.15.0-1031-azure-x86_64-with-glibc2.17) msrest/0.7.1 Hyperdrive.Service/1.0.0 Hyperdrive.SDK/core.1.48.0',
  'space_size': 'infinite_space_size',
  'score': '0.909711684370258',
  'best_child_run_id': 'HD_07139f9c-ccc0-4b10-9593-219a2c63f855_9',
  'best_metric_status': 'Succeeded',
  'best_data_container_id': 'dcid.HD_07139f9c-ccc0-4b10-9593-219a2c63f855_9'},
 'inputDatasets': [],
 'outputDatasets': [],
 'runDefinition': {'configuration': None,
  'attribution': None,
  'te

KeyError: 'log_files'

### Save the best model from HyperDrive experiment.

In [5]:
import joblib
# Get your best run and save the model from that run.

### YOUR CODE HERE ###
best_run = hyperdrive_run.get_best_run_by_primary_metric()
print(best_run.get_details()['runDefinition']['arguments'])
print(best_run.get_file_names())

best_run.register_model(model_name='hyperdrive-bank', model_path='outputs/model.joblib')

['--C', '0.600213072142369', '--max_iter', '800']
['logs/azureml/dataprep/0/backgroundProcess.log', 'logs/azureml/dataprep/0/backgroundProcess_Telemetry.log', 'logs/azureml/dataprep/0/rslex.log.2023-03-04-02', 'outputs/model.joblib', 'system_logs/cs_capability/cs-capability.log', 'system_logs/hosttools_capability/hosttools-capability.log', 'system_logs/lifecycler/execution-wrapper.log', 'system_logs/lifecycler/lifecycler.log', 'system_logs/metrics_capability/metrics-capability.log', 'system_logs/snapshot_capability/snapshot-capability.log', 'user_logs/std_log.txt']


Model(workspace=Workspace.create(name='winequality', subscription_id='4f84b10b-f2d2-47a8-8dbb-52b1a8dc25de', resource_group='cip-ee'), name=hyperdrive-bank, id=hyperdrive-bank:1, version=1, tags={}, properties={})

##

## Setup the AutoML experiment

These next sections setup the AutoML experiment for execution using the same data.

For AutoML, no model is explicity chosen by the ML engineer - the AutoML capabilities select the best model and hyperparameter combinations. This greatly speeds the delivery of an optimal ML model for the given dataset and objectives.

### Prepare the data

Here we prepare the data by 

1. retrieving it from the URI and creating a TabularDataset object.
2. Cleaning the data as in the previous experiment
3. Joining the x and y dataframes back together and converting them into a TabularDataset for AutoML purposes

While this is likely not an optimal process, I'll use it here for expedience with the given code.

In [27]:
from azureml.data.dataset_factory import TabularDatasetFactory

# Create TabularDataset using TabularDatasetFactory
# Data is available at: 
# "https://automlsamplenotebookdata.blob.core.windows.net/automl-sample-notebook-data/bankmarketing_train.csv"

ds = TabularDatasetFactory.from_delimited_files(path="https://automlsamplenotebookdata.blob.core.windows.net/automl-sample-notebook-data/bankmarketing_train.csv", separator=",")


In [29]:
from train import clean_data

# Use the clean_data function to clean your data.
x, y = clean_data(ds)

x_complete = x.join(y)

default_ds = ws.get_default_datastore()
x_tab_ds = TabularDatasetFactory.register_pandas_dataframe(dataframe=x_complete, target=default_ds, name="Bank Marketing Data", show_progress=False)

Validating arguments.
Arguments validated.
Successfully obtained datastore reference and path.
Uploading file to managed-dataset/028d3623-5eea-442d-ae05-ec950b43d05b/
Column header contains '.' This period will be translated to '_' as we write the data out to parquet files: 'emp.var.rate' -> 'emp_var_rate'
Column header contains '.' This period will be translated to '_' as we write the data out to parquet files: 'cons.price.idx' -> 'cons_price_idx'
Column header contains '.' This period will be translated to '_' as we write the data out to parquet files: 'cons.conf.idx' -> 'cons_conf_idx'
Column header contains '.' This period will be translated to '_' as we write the data out to parquet files: 'nr.employed' -> 'nr_employed'
Column header contains '.' This period will be translated to '_' as we write the data out to parquet files: 'job_admin.' -> 'job_admin_'
Column header contains '.' This period will be translated to '_' as we write the data out to parquet files: 'education_basic.4y'

### Setup parameters for the AutoML experiment

I found this section to have the most options to consider - thankfully Microsoft provides great documentation on [*How to Configure AutoML Training*](https://learn.microsoft.com/en-us/azure/machine-learning/how-to-configure-auto-train#primary-metric)

Selections made here include:

- task --> classification
- primary_metric --> accuracy
- cross_validations --> 5

AutoML does the rest!


In [30]:
from azureml.train.automl import AutoMLConfig

# Set parameters for AutoMLConfig
# NOTE: DO NOT CHANGE THE experiment_timeout_minutes PARAMETER OR YOUR INSTANCE WILL TIME OUT.
# If you wish to run the experiment longer, you will need to run this notebook in your own
# Azure tenant, which will incur personal costs.
automl_config = AutoMLConfig(
    experiment_timeout_minutes=30,
    task="classification",
    compute_target=my_compute_target,
    primary_metric="accuracy",
    training_data=x_tab_ds,
    label_column_name="y",
    n_cross_validations=3)

### Submit the AutoML job

AutoML goes and does its work now. 

**Observations:** Primary observation is that AutoML selected the same "best" algorithm each time. The first run I tried, selected a *VotingEnsaemble* ML algorithm as the best (highest accuracy). The second run also selected *VotingEnsemble* as the best algorithm. Further experiments may select a different algorithm with additional time allocated (future work).

**Results:** the AutoML experiment was able to achieve a slightly higher accuracy score, ~91.7 utilizing a VotingEnsemble

In [31]:
# Submit your automl run

remote_run = exp.submit(automl_config, show_output=True)

Submitting remote run.
No run_configuration provided, running on bank-marketing-cluster with default configuration
Running on remote compute: bank-marketing-cluster


Experiment,Id,Type,Status,Details Page,Docs Page
udacity-project,AutoML_fc24f7e8-ccc8-4c9a-a9ec-41ee14047023,automl,NotStarted,Link to Azure Machine Learning studio,Link to Documentation



Current status: FeaturesGeneration. Generating features for the dataset.
Current status: DatasetFeaturization. Beginning to fit featurizers and featurize the dataset.
Current status: DatasetBalancing. Performing class balancing sweeping
Current status: DatasetCrossValidationSplit. Generating individually featurized CV splits.
Current status: ModelSelection. Beginning model selection.

********************************************************************************************
DATA GUARDRAILS: 

TYPE:         Class balancing detection
STATUS:       ALERTED
DESCRIPTION:  To decrease model bias, please cancel the current run and fix balancing problem.
              Learn more about imbalanced data: https://aka.ms/AutomatedMLImbalancedData
DETAILS:      Imbalanced data can lead to a falsely perceived positive effect of a model's accuracy because the input data has bias towards one class.
+------------------------------+--------------------------------+-------------------------------------

### Save the best model

Final step is to save the best model (as measured by accuracy as the primary metric).

In [37]:
import joblib
# Get your best run and save the model from that run.

best_run = remote_run.get_best_child(metric='accuracy')
print(best_run.get_details()['runDefinition']['arguments'])
print(best_run.get_file_names())

best_run.register_model(model_name='automl-bank', model_path='outputs/model.plk')


[]
['accuracy_table', 'automl_driver.py', 'confusion_matrix', 'explanation/06f74946/classes.interpret.json', 'explanation/06f74946/expected_values.interpret.json', 'explanation/06f74946/features.interpret.json', 'explanation/06f74946/global_names/0.interpret.json', 'explanation/06f74946/global_rank/0.interpret.json', 'explanation/06f74946/global_values/0.interpret.json', 'explanation/06f74946/local_importance_values_sparse.interpret.json', 'explanation/06f74946/per_class_names/0.interpret.json', 'explanation/06f74946/per_class_rank/0.interpret.json', 'explanation/06f74946/per_class_values/0.interpret.json', 'explanation/06f74946/rich_metadata.interpret.json', 'explanation/06f74946/true_ys_viz.interpret.json', 'explanation/06f74946/visualization_dict.interpret.json', 'explanation/06f74946/ys_pred_proba_viz.interpret.json', 'explanation/06f74946/ys_pred_viz.interpret.json', 'explanation/69397d88/classes.interpret.json', 'explanation/69397d88/eval_data_viz.interpret.json', 'explanation/69

ModelPathNotFoundException: ModelPathNotFoundException:
	Message: Could not locate the provided model_path outputs/model.plk in the set of files uploaded to the run: ['accuracy_table', 'automl_driver.py', 'confusion_matrix', 'explanation/06f74946/classes.interpret.json', 'explanation/06f74946/expected_values.interpret.json', 'explanation/06f74946/features.interpret.json', 'explanation/06f74946/global_names/0.interpret.json', 'explanation/06f74946/global_rank/0.interpret.json', 'explanation/06f74946/global_values/0.interpret.json', 'explanation/06f74946/local_importance_values_sparse.interpret.json', 'explanation/06f74946/per_class_names/0.interpret.json', 'explanation/06f74946/per_class_rank/0.interpret.json', 'explanation/06f74946/per_class_values/0.interpret.json', 'explanation/06f74946/rich_metadata.interpret.json', 'explanation/06f74946/true_ys_viz.interpret.json', 'explanation/06f74946/visualization_dict.interpret.json', 'explanation/06f74946/ys_pred_proba_viz.interpret.json', 'explanation/06f74946/ys_pred_viz.interpret.json', 'explanation/69397d88/classes.interpret.json', 'explanation/69397d88/eval_data_viz.interpret.json', 'explanation/69397d88/expected_values.interpret.json', 'explanation/69397d88/features.interpret.json', 'explanation/69397d88/global_names/0.interpret.json', 'explanation/69397d88/global_rank/0.interpret.json', 'explanation/69397d88/global_values/0.interpret.json', 'explanation/69397d88/local_importance_values.interpret.json', 'explanation/69397d88/per_class_names/0.interpret.json', 'explanation/69397d88/per_class_rank/0.interpret.json', 'explanation/69397d88/per_class_values/0.interpret.json', 'explanation/69397d88/rich_metadata.interpret.json', 'explanation/69397d88/true_ys_viz.interpret.json', 'explanation/69397d88/visualization_dict.interpret.json', 'explanation/69397d88/ys_pred_proba_viz.interpret.json', 'explanation/69397d88/ys_pred_viz.interpret.json', 'logs/azureml/azureml_automl.log', 'outputs/conda_env_v_1_0_0.yml', 'outputs/engineered_feature_names.json', 'outputs/env_dependencies.json', 'outputs/featurization_summary.json', 'outputs/generated_code/conda_environment.yaml', 'outputs/generated_code/script.py', 'outputs/generated_code/script_run_notebook.ipynb', 'outputs/internal_cross_validated_models.pkl', 'outputs/model.pkl', 'outputs/pipeline_graph.json', 'outputs/run_id.txt', 'outputs/scoring_file_pbi_v_1_0_0.py', 'outputs/scoring_file_v_1_0_0.py', 'outputs/scoring_file_v_2_0_0.py', 'system_logs/cs_capability/cs-capability.log', 'system_logs/hosttools_capability/hosttools-capability.log', 'system_logs/lifecycler/execution-wrapper.log', 'system_logs/lifecycler/lifecycler.log', 'system_logs/metrics_capability/metrics-capability.log', 'system_logs/snapshot_capability/snapshot-capability.log', 'user_logs/std_log.txt']
                See https://aka.ms/run-logging for more details.
	InnerException None
	ErrorResponse 
{
    "error": {
        "message": "Could not locate the provided model_path outputs/model.plk in the set of files uploaded to the run: ['accuracy_table', 'automl_driver.py', 'confusion_matrix', 'explanation/06f74946/classes.interpret.json', 'explanation/06f74946/expected_values.interpret.json', 'explanation/06f74946/features.interpret.json', 'explanation/06f74946/global_names/0.interpret.json', 'explanation/06f74946/global_rank/0.interpret.json', 'explanation/06f74946/global_values/0.interpret.json', 'explanation/06f74946/local_importance_values_sparse.interpret.json', 'explanation/06f74946/per_class_names/0.interpret.json', 'explanation/06f74946/per_class_rank/0.interpret.json', 'explanation/06f74946/per_class_values/0.interpret.json', 'explanation/06f74946/rich_metadata.interpret.json', 'explanation/06f74946/true_ys_viz.interpret.json', 'explanation/06f74946/visualization_dict.interpret.json', 'explanation/06f74946/ys_pred_proba_viz.interpret.json', 'explanation/06f74946/ys_pred_viz.interpret.json', 'explanation/69397d88/classes.interpret.json', 'explanation/69397d88/eval_data_viz.interpret.json', 'explanation/69397d88/expected_values.interpret.json', 'explanation/69397d88/features.interpret.json', 'explanation/69397d88/global_names/0.interpret.json', 'explanation/69397d88/global_rank/0.interpret.json', 'explanation/69397d88/global_values/0.interpret.json', 'explanation/69397d88/local_importance_values.interpret.json', 'explanation/69397d88/per_class_names/0.interpret.json', 'explanation/69397d88/per_class_rank/0.interpret.json', 'explanation/69397d88/per_class_values/0.interpret.json', 'explanation/69397d88/rich_metadata.interpret.json', 'explanation/69397d88/true_ys_viz.interpret.json', 'explanation/69397d88/visualization_dict.interpret.json', 'explanation/69397d88/ys_pred_proba_viz.interpret.json', 'explanation/69397d88/ys_pred_viz.interpret.json', 'logs/azureml/azureml_automl.log', 'outputs/conda_env_v_1_0_0.yml', 'outputs/engineered_feature_names.json', 'outputs/env_dependencies.json', 'outputs/featurization_summary.json', 'outputs/generated_code/conda_environment.yaml', 'outputs/generated_code/script.py', 'outputs/generated_code/script_run_notebook.ipynb', 'outputs/internal_cross_validated_models.pkl', 'outputs/model.pkl', 'outputs/pipeline_graph.json', 'outputs/run_id.txt', 'outputs/scoring_file_pbi_v_1_0_0.py', 'outputs/scoring_file_v_1_0_0.py', 'outputs/scoring_file_v_2_0_0.py', 'system_logs/cs_capability/cs-capability.log', 'system_logs/hosttools_capability/hosttools-capability.log', 'system_logs/lifecycler/execution-wrapper.log', 'system_logs/lifecycler/lifecycler.log', 'system_logs/metrics_capability/metrics-capability.log', 'system_logs/snapshot_capability/snapshot-capability.log', 'user_logs/std_log.txt']\n                See https://aka.ms/run-logging for more details."
    }
}