# Automated ML

## General setup

In [1]:
# Imports
from azureml.core import Workspace, Experiment, Model
from azureml.core.compute import ComputeTarget, AmlCompute
from azureml.core.dataset import Dataset
from azureml.core.compute_target import ComputeTargetException
from azureml.train.automl import AutoMLConfig
from azureml.widgets import RunDetails
from azureml.core.model import InferenceConfig
from azureml.core.environment import Environment
from azureml.core.webservice import AciWebservice
import joblib
import json
import os  # needed for os.makedirs in later cells
import requests
import pandas

In [3]:
# Create the compute cluster that will carry out the automated ML run
ws = Workspace.from_config()
compute_name = "udacity-cluster"
try:
    compute = ComputeTarget(workspace=ws, name=compute_name)
    print('Compute cluster {} already exists!'.format(compute_name))
except ComputeTargetException:
    config = AmlCompute.provisioning_configuration(vm_size='STANDARD_D2_V2', max_nodes=4)
    compute = ComputeTarget.create(ws, compute_name, config)
    
compute.wait_for_completion()

Compute cluster udacity-cluster already exists!


## Dataset

### Overview
The dataset was generated by the notebook https://github.com/zgoey/azure_ml_capstone/blob/master/generate_data.ipynb from the file:

https://unpkg.com/color-name-list/dist/colornames.csv. 

The notebook labels each color with one of the basic shades from the set {White, Black, Grey, Yellow, Red, Blue, Green, Brown, Pink, Orange, Purple} (see https://thelandofcolor.com/11-basic-color-names/). It does so by taking the shade that occurs last in the color name, so the color 'Azure Green Blue', for instance, is assigned the label 'Blue'. The result is a list of labeled RGB triplets, each assigned to one of the basic shade classes.
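
The labeling rule described above can be sketched as follows. This is a minimal illustration, not the code from the generating notebook: the function name `label_shade` is hypothetical, and it assumes a case-insensitive substring search, which would also match a shade inside a longer word (for example "red" in "Bored").

```python
# Basic shades listed in the text above.
BASIC_SHADES = ["White", "Black", "Grey", "Yellow", "Red", "Blue",
                "Green", "Brown", "Pink", "Orange", "Purple"]


def label_shade(color_name):
    """Return the basic shade that occurs last in color_name, or None.

    Assumption: matching is a case-insensitive substring search; the
    original notebook may tokenize color names differently.
    """
    lowered = color_name.lower()
    best_shade, best_pos = None, -1
    for shade in BASIC_SHADES:
        pos = lowered.rfind(shade.lower())  # start of the last occurrence, -1 if absent
        if pos > best_pos:
            best_shade, best_pos = shade, pos
    return best_shade


print(label_shade("Azure Green Blue"))  # Blue
print(label_shade("Midnight"))          # None
```

Names that contain none of the basic shades return `None` here; how the original notebook treats such names is not stated in this document.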

We have uploaded the output of the notebook to:

https://github.com/zgoey/azure_ml_capstone/blob/master/color_shades.csv

and we will download it in raw form from this repo into our Azure workspace. It will then be used to train a classifier that assigns basic color shades to RGB triplets. Color-blind people can use such a classifier to determine what color they are looking at. The end application that we have in mind is something like http://www.hikarun.com/e/, which, however, only runs under Windows. The advantage of having a web service do the classification is that it can be accessed from a much wider range of devices.

In [4]:
# Get the data: reuse the registered dataset if it exists, otherwise register the external CSV
dataset_name = 'color_shades'
if dataset_name in ws.datasets.keys():
    dataset = ws.datasets[dataset_name]
else:
    url = "https://raw.githubusercontent.com/zgoey/azure_ml_capstone/master/color_shades.csv"
    dataset = Dataset.File.from_files(url)
    dataset.register(workspace=ws, name=dataset_name,
                     description='RGB values labeled with color shade names',
                     create_new_version=True)

datastore = ws.get_default_datastore()
os.makedirs('data', exist_ok=True)
dataset.download(target_path='data', overwrite=True)[0]
datastore.upload(src_dir='data', target_path='data')
tabular_dataset = Dataset.Tabular.from_delimited_files(path = [(datastore, ('data/color_shades.csv'))])

Uploading an estimated of 1 files
Uploading data/color_shades.csv
Uploaded data/color_shades.csv, 1 files out of an estimated total of 1
Uploaded 1 files
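
Before handing the file to AutoML, a quick sanity check of the CSV can catch malformed rows. The sketch below is an illustration under stated assumptions: the column names `Red`, `Green`, `Blue` and `Shade` are taken from the request payload and the label column used later in this notebook, the helper `check_color_rows` is hypothetical, and the sample text is toy data rather than the real file.

```python
import csv
import io

EXPECTED_COLUMNS = ["Red", "Green", "Blue", "Shade"]  # assumed header names


def check_color_rows(csv_text):
    """Parse CSV text and return its rows, raising ValueError on bad data."""
    reader = csv.DictReader(io.StringIO(csv_text))
    missing = [c for c in EXPECTED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError("missing columns: {}".format(missing))
    rows = []
    for row in reader:
        # Every channel must be an integer in the 8-bit range.
        for channel in ("Red", "Green", "Blue"):
            value = int(row[channel])
            if not 0 <= value <= 255:
                raise ValueError("{} out of range: {}".format(channel, value))
        rows.append(row)
    return rows


sample = "Red,Green,Blue,Shade\n255,10,2,Red\n0,0,0,Black\n"  # toy data, not the real file
print(len(check_color_rows(sample)))
```

For the real data, the same function could be applied to the contents of `data/color_shades.csv` after the download step.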


In [5]:
ws = Workspace.from_config()

# choose a name for experiment
experiment_name = 'udacity-capstone'

experiment = Experiment(ws, experiment_name)

## AutoML Configuration

We now set up an AutoML experiment with the task set to classification, since we want to distinguish a limited set of 11 labels. The compute target is the cluster that we created earlier in this notebook, and the training data is the dataset that we just downloaded from our GitHub repo. The target column is "Shade", since that is what we wish to predict.

In the AutoML settings, we choose accuracy as the primary metric, which is the most common measure for classification tasks. We apply 5-fold cross-validation to get a more stable accuracy estimate than a single train/validation split would give. To make sure that the experiment does not run forever (thereby incurring unreasonable costs), we limit its run time to 1 hour. Finally, we set the maximum number of concurrent iterations to four, matching the maximum of four nodes of our compute cluster.
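
To make the idea behind 5-fold cross-validation concrete, here is a minimal pure-Python sketch. It is not how AutoML builds its folds internally (AutoML may, for example, shuffle or stratify the data); the function names `k_fold_indices` and `mean_cv_accuracy` are hypothetical and only illustrate that every sample is used for validation exactly once and that the per-fold accuracies are averaged.

```python
def k_fold_indices(n_samples, n_folds=5):
    """Yield (train_idx, valid_idx) pairs; each sample is validated exactly once."""
    # Spread the remainder over the first folds so fold sizes differ by at most one.
    fold_sizes = [n_samples // n_folds + (1 if i < n_samples % n_folds else 0)
                  for i in range(n_folds)]
    start = 0
    for size in fold_sizes:
        valid_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, valid_idx
        start += size


def mean_cv_accuracy(fold_accuracies):
    """Average the per-fold accuracies into one cross-validated estimate."""
    return sum(fold_accuracies) / len(fold_accuracies)


folds = list(k_fold_indices(n_samples=12, n_folds=5))
for train_idx, valid_idx in folds:
    print(len(train_idx), len(valid_idx))
```

Averaging over five validation folds reduces the dependence of the estimate on one particular split, which is the stability argument made above.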

In [6]:
# AutoML settings: 5-fold cross-validation, accuracy as primary metric, 1-hour timeout
automl_settings = {
    "n_cross_validations": 5,
    "primary_metric": 'accuracy',
    "experiment_timeout_hours": 1.0,
    "max_concurrent_iterations": 4,
}

# AutoML config: classification of the 'Shade' column on the compute cluster
automl_config = AutoMLConfig(task = 'classification',
                             compute_target = compute,
                             training_data = tabular_dataset,
                             label_column_name = 'Shade',
                             **automl_settings)

In [7]:
# Submit the experiment
remote_run = experiment.submit(automl_config)

Running on remote.


## Run Details

In the cell below, we use the `RunDetails` widget to show the different runs. As is often seen in AutoML experiments, the best model found is an ensemble classifier (StackEnsembleClassifier), which is in line with what the literature on ensemble learning describes (see https://en.wikipedia.org/wiki/Ensemble_learning).

In [8]:
RunDetails(remote_run).show()

_AutoMLWidget(widget_settings={'childWidgetDisplay': 'popup', 'send_telemetry': False, 'log_level': 'INFO', 's…

_UserRunWidget(widget_settings={'childWidgetDisplay': 'popup', 'send_telemetry': False, 'log_level': 'INFO', '…

In [9]:
remote_run.wait_for_completion()

{'runId': 'AutoML_6cf35dbd-5e50-4a63-a772-0d1be50f3904',
 'target': 'udacity-cluster',
 'status': 'Completed',
 'startTimeUtc': '2021-02-16T14:57:57.281433Z',
 'endTimeUtc': '2021-02-16T16:13:56.936709Z',
 'properties': {'num_iterations': '1000',
  'training_type': 'TrainFull',
  'acquisition_function': 'EI',
  'primary_metric': 'accuracy',
  'train_split': '0',
  'acquisition_parameter': '0',
  'num_cross_validation': '5',
  'target': 'udacity-cluster',
  'DataPrepJsonString': '{\\"training_data\\": \\"{\\\\\\"blocks\\\\\\": [{\\\\\\"id\\\\\\": \\\\\\"f52c786e-0dc0-4980-9e1f-ccad573f4c86\\\\\\", \\\\\\"type\\\\\\": \\\\\\"Microsoft.DPrep.GetDatastoreFilesBlock\\\\\\", \\\\\\"arguments\\\\\\": {\\\\\\"datastores\\\\\\": [{\\\\\\"datastoreName\\\\\\": \\\\\\"workspaceblobstore\\\\\\", \\\\\\"path\\\\\\": \\\\\\"data/color_shades.csv\\\\\\", \\\\\\"resourceGroup\\\\\\": \\\\\\"capstone\\\\\\", \\\\\\"subscription\\\\\\": \\\\\\"c0e92620-6229-4209-b236-c48f10a3d133\\\\\\", \\\\\\"workspac

## Best Model

In the cells below, we retrieve the best model from the AutoML experiment and display its metrics and pipeline steps.



In [10]:
# Retrieve and save your best automl model.
best_automl_run, best_automl_model = remote_run.get_output()
print('Best model metrics:\n', best_automl_run.get_metrics(), '\n')
print('Best model steps:\n', best_automl_model.steps, '\n')

Best model metrics:
 {'f1_score_weighted': 0.8092782810618608, 'precision_score_micro': 0.8105011933174223, 'log_loss': 0.607989939509952, 'norm_macro_recall': 0.7304223132853682, 'AUC_weighted': 0.9779051226205958, 'accuracy': 0.8105011933174223, 'recall_score_micro': 0.8105011933174223, 'precision_score_macro': 0.7700711473682904, 'matthews_correlation': 0.7812119842534787, 'precision_score_weighted': 0.8105093614395967, 'AUC_macro': 0.9735161254116786, 'AUC_micro': 0.9796606307779061, 'f1_score_micro': 0.8105011933174223, 'average_precision_score_macro': 0.8133978936290689, 'average_precision_score_weighted': 0.8654654546995131, 'weighted_accuracy': 0.8593384849728235, 'average_precision_score_micro': 0.8778217396223912, 'recall_score_macro': 0.7549293757139711, 'f1_score_macro': 0.7604385312244206, 'balanced_accuracy': 0.7549293757139711, 'recall_score_weighted': 0.8105011933174223, 'confusion_matrix': 'aml://artifactId/ExperimentRun/dcid.AutoML_6cf35dbd-5e50-4a63-a772-0d1be50f3904
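
In the metrics above, `accuracy` is about 0.81, while `balanced_accuracy` is about 0.755 and equal to `recall_score_macro`. One possible reading is that some shades are recognized less reliably than others. The sketch below shows how the two numbers can diverge; the labels are toy data chosen for illustration, not results from this experiment, and the helper functions are hypothetical.

```python
def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label equals the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def balanced_accuracy(y_true, y_pred):
    """Mean of the per-class recalls, so every class counts equally."""
    recalls = []
    for label in sorted(set(y_true)):
        idx = [i for i, t in enumerate(y_true) if t == label]
        correct = sum(y_pred[i] == label for i in idx)
        recalls.append(correct / len(idx))
    return sum(recalls) / len(recalls)


# Toy labels: a frequent class predicted perfectly, a rare class half wrong.
y_true = ["Blue"] * 8 + ["Pink"] * 2
y_pred = ["Blue"] * 8 + ["Blue", "Pink"]

print(accuracy(y_true, y_pred))           # 0.9
print(balanced_accuracy(y_true, y_pred))  # 0.75
```

Because balanced accuracy weights every class equally, errors on a rare class pull it down more than they pull down plain accuracy.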

In [18]:
# Zoom in on best model to get full view on estimators
print(best_automl_model.steps[-1][1].get_params(deep=False))

{'base_learners': [('155', Pipeline(memory=None,
         steps=[('standardscalerwrapper',
                 <azureml.automl.runtime.shared.model_wrappers.StandardScalerWrapper object at 0x7f320de90a20>),
                ('xgboostclassifier',
                 XGBoostClassifier(base_score=0.5, booster='gbtree',
                                   colsample_bylevel=1, colsample_bynode=1,
                                   colsample_bytree=1, eta=0.3, gamma=0.01,
                                   learning_rate=0.1, max_delta_step=0,
                                   max_depth=6, max_leaves=7,
                                   min_child_weight=1, missing=nan,
                                   n_estimators=100, n_jobs=1, nthread=None,
                                   objective='multi:softprob', random_state=0,
                                   reg_alpha=1.4583333333333335,
                                   reg_lambda=0.625, scale_pos_weight=1,
                                   seed=N

In [19]:
# Save the best model
os.makedirs('models', exist_ok=True)
joblib.dump(value=best_automl_model, filename="models/automl_color_shades.pkl")


In [20]:
# Save scoring script (needed for deployment)
best_automl_run.download_file('outputs/scoring_file_v_1_0_0.py', 'automl_score.py')

In [21]:
# Save environment (needed for deployment)
best_automl_run.download_file('outputs/conda_env_v_1_0_0.yml', 'automl_env.yml')

## Model Deployment

We now register the model, create an inference config and deploy the model as a web service.

In [22]:
model_name = best_automl_run.properties['model_name']
description = 'Best model for color shade classification found by AutoML'
tags = None
model = remote_run.register_model(model_name = model_name, description = description, tags = tags)
automl_env = Environment.from_conda_specification(name="automl_env", file_path="automl_env.yml")
inference_config = InferenceConfig(entry_script='automl_score.py',environment= automl_env)

aciconfig = AciWebservice.deploy_configuration(cpu_cores = 1, 
                                               memory_gb = 1, 
                                               description = 'AutoML for color shade classification')

aci_service_name = 'automl-color-shade'
print(aci_service_name)
aci_service = Model.deploy(ws, aci_service_name, [model], inference_config, aciconfig)
aci_service.wait_for_deployment(True)
print(aci_service.state)

automl-color-shade
Tips: You can try get_logs(): https://aka.ms/debugimage#dockerlog or local deployment: https://aka.ms/debugimage#debug-locally to debug if deployment takes longer than 10 minutes.
Running...........................................................................................................................................................................................
Succeeded
ACI service creation operation finished, operation "Succeeded"
Healthy


In the cell below, we send a request to the web service that we deployed, in order to test it.

In [24]:
data = {
    "data":
    [
        {
            'Red': "255",
            'Green': "10",
            'Blue': "2",
        },
    ],
}


input_data = json.dumps(data)
headers = {"Content-Type": "application/json"}
resp = requests.post(aci_service.scoring_uri, input_data, headers=headers)
print(resp.json())


{"result": ["Red"]}
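
The request body used above can also be built programmatically for several colors at once. The following is a small sketch that mirrors the payload format of the request cell (a `data` list of records with `Red`, `Green` and `Blue` given as strings); the helper name `build_scoring_payload` is hypothetical, and whether the service accepts more than one record per request is an assumption that is not verified here.

```python
import json


def build_scoring_payload(rgb_triples):
    """Serialize (red, green, blue) triples into the JSON body used above."""
    records = []
    for red, green, blue in rgb_triples:
        for value in (red, green, blue):
            if not 0 <= int(value) <= 255:
                raise ValueError("channel value out of range: {}".format(value))
        # Values are sent as strings, as in the request cell above.
        records.append({"Red": str(red), "Green": str(green), "Blue": str(blue)})
    return json.dumps({"data": records})


payload = build_scoring_payload([(255, 10, 2), (0, 0, 255)])
print(payload)
```

The resulting string could then be posted in the same way as `input_data` in the cell above, that is, with `requests.post(aci_service.scoring_uri, payload, headers=headers)`.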


In the cell below, we print the logs of the web service and delete the service.

In [25]:
print(aci_service.get_logs())
aci_service.delete()

2021-02-17T01:09:20,537627155+00:00 - rsyslog/run 
2021-02-17T01:09:20,540695633+00:00 - iot-server/run 
2021-02-17T01:09:20,536201972+00:00 - gunicorn/run 
2021-02-17T01:09:20,559814944+00:00 - nginx/run 
/usr/sbin/nginx: /azureml-envs/azureml_7785023fceb74e4facc1b1a577b1faf9/lib/libcrypto.so.1.0.0: no version information available (required by /usr/sbin/nginx)
/usr/sbin/nginx: /azureml-envs/azureml_7785023fceb74e4facc1b1a577b1faf9/lib/libcrypto.so.1.0.0: no version information available (required by /usr/sbin/nginx)
/usr/sbin/nginx: /azureml-envs/azureml_7785023fceb74e4facc1b1a577b1faf9/lib/libssl.so.1.0.0: no version information available (required by /usr/sbin/nginx)
/usr/sbin/nginx: /azureml-envs/azureml_7785023fceb74e4facc1b1a577b1faf9/lib/libssl.so.1.0.0: no version information available (required by /usr/sbin/nginx)
/usr/sbin/nginx: /azureml-envs/azureml_7785023fceb74e4facc1b1a577b1faf9/lib/libssl.so.1.0.0: no version information available (required by /usr/sbin/nginx)
rsyslogd

## Cleanup

In [28]:
# Clean up compute cluster
try:
    compute = ComputeTarget(workspace=ws, name=compute_name)
    try:
        compute.delete()
        # Only wait for the deletion if the delete request was accepted
        compute.wait_for_completion(show_output=True, is_delete_operation=True)
    except ComputeTargetException as e:
        print(e.message)
        print("Failed to clean up compute cluster {}!".format(compute_name))
except ComputeTargetException:
    print('Compute cluster {} no longer exists!'.format(compute_name))

Deleting..Current provisioning state of AmlCompute is "Deleting"

Current provisioning state of AmlCompute is "Deleting"

..Current provisioning state of AmlCompute is "Deleting"

.....
SucceededProvisioning operation finished, operation "Succeeded"
