# Automated ML

Importing all modules needed.

In [1]:
from azureml.core import Workspace, Experiment, Dataset, Model
from azureml.core.environment import Environment
from azureml.core.model import InferenceConfig
from azureml.data.dataset_factory import TabularDatasetFactory
from train import clean_data
from azureml.train.automl import AutoMLConfig
from azureml.core.compute import ComputeTarget, AmlCompute
from azureml.core.compute_target import ComputeTargetException
import pandas as pd
import os
from azureml.widgets import RunDetails
from azureml.core.webservice import AciWebservice
import requests
import json

Loading the workspace.

In [2]:
ws = Workspace.from_config()

# choose a name for experiment
experiment_name = 'automl'

experiment=Experiment(ws, experiment_name)

## Dataset

### Overview

The dataset used is the [UCI Glass Identification](https://archive.ics.uci.edu/ml/datasets/Glass+Identification) dataset. All data loading and cleaning is done by the [train.py](https://github.com/reis-r/nd00333-capstone/blob/master/train.py) script, which is also the script used by our Hyperdrive run. The objective is to classify the glass type according to its composition and other characteristics. This dataset was chosen because it needs little cleaning and is a well-known dataset for experimenting with machine learning.
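As a rough illustration of the raw file layout, here is a dependency-free sketch of splitting `glass.data` rows into features and labels. It is not the `clean_data` implementation from `train.py`: the helper name `parse_glass_rows` is hypothetical, the column names follow the UCI attribute listing, and dropping the `Id` column is an assumption based on the cleaned table shown in the next cell.

```python
import csv
import io

# Column order of glass.data according to the UCI attribute listing;
# the file itself has no header row.
COLUMNS = ["Id", "RI", "Na", "Mg", "Al", "Si", "K", "Ca", "Ba", "Fe", "Type"]


def parse_glass_rows(text):
    """Split raw glass.data text into feature dicts and integer class labels.

    Hypothetical helper for illustration only; train.py may clean the data differently.
    """
    features, labels = [], []
    for row in csv.reader(io.StringIO(text)):
        if not row:  # skip blank lines
            continue
        record = dict(zip(COLUMNS, row))
        record.pop("Id")  # assumption: the row identifier is not used as a feature
        labels.append(int(record.pop("Type")))
        features.append({name: float(value) for name, value in record.items()})
    return features, labels


# Two rows in the raw file format, using values from the start of the dataset
sample = (
    "1,1.52101,13.64,4.49,1.10,71.78,0.06,8.75,0.00,0.00,1\n"
    "2,1.51761,13.89,3.60,1.36,72.73,0.48,7.83,0.00,0.00,1\n"
)
features, labels = parse_glass_rows(sample)
print(features[0]["RI"], labels)
```

In the notebook itself the same split is obtained from `clean_data(ds)`, which returns `x` and `y` as pandas objects.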

In [3]:
# Select the data URL
data_url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/glass/glass.data'
ds = TabularDatasetFactory.from_delimited_files(path=data_url, header=False)
x, y = clean_data(ds)

# Passing X= and y= to AutoMLConfig is deprecated, so features and label are joined into one frame
# See more at the official docs: https://docs.microsoft.com/en-us/python/api/azureml-train-automl-client/azureml.train.automl.automlconfig.automlconfig?view=azure-ml-py
training_data = x.join(y)
training_data.head()

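|   | RI | Na | Mg | Al | Si | K | Ca | Ba | Fe | Type |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.52101 | 13.64 | 4.49 | 1.1 | 71.78 | 0.06 | 8.75 | 0.0 | 0.0 | 1 |
| 1 | 1.51761 | 13.89 | 3.6 | 1.36 | 72.73 | 0.48 | 7.83 | 0.0 | 0.0 | 1 |
| 2 | 1.51618 | 13.53 | 3.55 | 1.54 | 72.99 | 0.39 | 7.78 | 0.0 | 0.0 | 1 |
| 3 | 1.51766 | 13.21 | 3.69 | 1.29 | 72.61 | 0.57 | 8.22 | 0.0 | 0.0 | 1 |
| 4 | 1.51742 | 13.27 | 3.62 | 1.24 | 73.08 | 0.55 | 8.07 | 0.0 | 0.0 | 1 |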


An AutoML remote run cannot be created from a pandas DataFrame, so the cleaned data is first written to CSV, uploaded to the workspace datastore, and loaded back as a tabular dataset.

In [4]:
# Write the cleaned data to a local folder (created if it does not exist yet)
os.makedirs("automl-data", exist_ok=True)

training_data.to_csv("automl-data/prepared-automl.csv", index=False)

# Upload the CSV to the default datastore and load it back as a tabular dataset
datastore = ws.get_default_datastore()
datastore.upload(src_dir="automl-data", target_path="automl-data")
training_data = Dataset.Tabular.from_delimited_files(path=[(datastore, "automl-data/prepared-automl.csv")])

Uploading an estimated of 1 files
Uploading automl-data/prepared-automl.csv
Uploaded automl-data/prepared-automl.csv, 1 files out of an estimated total of 1
Uploaded 1 files


## AutoML Configuration

The AutoML run is configured with accuracy as the primary metric, so that its result can be compared with the Hyperdrive-optimized SVC model from the other notebook in the project.
The concurrency settings were chosen according to the compute cluster used: there should be no more concurrent iterations than the cluster has nodes (four here). An experiment timeout of one hour was set so that the experiment does not take too long, which also helps to avoid extra charges.
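The sizing rule above can be checked before a run is submitted. The following is a minimal sketch and not part of the original notebook: `check_concurrency` is a hypothetical helper, and falling back to 1 when `max_concurrent_iterations` is unset is an assumption about the default.

```python
def check_concurrency(automl_settings, max_nodes):
    """Return the requested concurrency, or raise if it exceeds the cluster size.

    Hypothetical helper: compares the AutoML setting against the number of
    nodes the compute cluster may scale to.
    """
    # Assumption: a missing setting means one iteration at a time
    requested = automl_settings.get("max_concurrent_iterations", 1)
    if requested > max_nodes:
        raise ValueError(
            f"max_concurrent_iterations={requested} exceeds the cluster's "
            f"max_nodes={max_nodes}"
        )
    return requested


# Values matching the configuration used in this notebook
settings = {"max_concurrent_iterations": 4, "max_cores_per_iteration": 1}
print(check_concurrency(settings, max_nodes=4))
```

Calling the helper with a cluster smaller than the requested concurrency raises a `ValueError`, which surfaces the mismatch before any compute time is billed.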

In [5]:
# Create a compute cluster
cluster_name = "automl"
# Check if a compute cluster already exists
try:
    print("Trying to connect to an existing cluster...")
    compute_cluster = ComputeTarget(workspace=ws, name=cluster_name)
except ComputeTargetException:
    print("Creating a compute cluster...")
    compute_configuration = AmlCompute.provisioning_configuration(vm_size="STANDARD_D2_V2", max_nodes=4)
    compute_cluster = ComputeTarget.create(ws, cluster_name, compute_configuration)
    compute_cluster.wait_for_completion(show_output=True)
print("Success!")

Trying to connect to an existing cluster...
Creating a compute cluster...
Creating
Succeeded
AmlCompute wait for completion finished

Minimum number of nodes requested have been provisioned
Success!


In [6]:
automl_settings = {
    "n_cross_validations": 3,
    "primary_metric": 'accuracy',
    "enable_early_stopping": True,
    "experiment_timeout_hours": 1.0,
    "max_concurrent_iterations": 4,
    "max_cores_per_iteration": 1,
}

# Configure the AutoML classification run on the prepared dataset
automl_config = AutoMLConfig(task="classification",
                             compute_target = compute_cluster,
                             training_data=training_data,
                             label_column_name="Type",
                             **automl_settings)

In [7]:
# Submit the experiment to the remote compute cluster
remote_run = experiment.submit(automl_config)

Running on remote.


## Run Details

We can watch the development of the training using a widget:

In [8]:
# See the progress of the run

RunDetails(remote_run).show()

_AutoMLWidget(widget_settings={'childWidgetDisplay': 'popup', 'send_telemetry': False, 'log_level': 'INFO', 's…

## Best Model

In [9]:
best_run, fitted_model = remote_run.get_output()
best_run

| Experiment | Id | Type | Status | Details Page | Docs Page |
|---|---|---|---|---|---|
| automl | AutoML_fe62415d-cb89-433a-8b59-ba4737fbd123_36 | azureml.scriptrun | Completed | Link to Azure Machine Learning studio | Link to Documentation |


In [10]:
print("Best run:")
print(best_run)
print("\n")
print("Fitted model:")
print(fitted_model)

Best run:
Run(Experiment: automl,
Id: AutoML_fe62415d-cb89-433a-8b59-ba4737fbd123_36,
Type: azureml.scriptrun,
Status: Completed)


Fitted model:
Pipeline(memory=None,
         steps=[('datatransformer',
                 DataTransformer(enable_dnn=None, enable_feature_sweeping=None,
                                 feature_sweeping_config=None,
                                 feature_sweeping_timeout=None,
                                 featurization_config=None, force_text_dnn=None,
                                 is_cross_validation=None,
                                 is_onnx_compatible=None, logger=None,
                                 observer=None, task=None, working_dir=None)),
                ('prefittedsoftvotingclassifier',...
                                                                                                    min_samples_leaf=0.01,
                                                                                                    min_samples_split=0.103

We obtained an accuracy of 79%, which is better than our Hyperdrive run, so this is the model we will deploy. First, we retrieve and save the best-performing model along with the scoring file and the conda environment we will use:

In [11]:
best_run.get_file_names()

['accuracy_table',
 'automl_driver.py',
 'azureml-logs/55_azureml-execution-tvmps_02fe4edc80e81ddddf29320c64cc9a52594985ec7366a74892f4e06c6cc7b49a_d.txt',
 'azureml-logs/65_job_prep-tvmps_02fe4edc80e81ddddf29320c64cc9a52594985ec7366a74892f4e06c6cc7b49a_d.txt',
 'azureml-logs/70_driver_log.txt',
 'azureml-logs/75_job_post-tvmps_02fe4edc80e81ddddf29320c64cc9a52594985ec7366a74892f4e06c6cc7b49a_d.txt',
 'azureml-logs/process_info.json',
 'azureml-logs/process_status.json',
 'confusion_matrix',
 'logs/azureml/106_azureml.log',
 'logs/azureml/azureml_automl.log',
 'logs/azureml/dataprep/python_span_62c633af-5c29-4fa3-9bca-d87fd0475412.jsonl',
 'logs/azureml/dataprep/python_span_e301ce47-2ea8-41c7-9fd1-735024f0a78d.jsonl',
 'logs/azureml/job_prep_azureml.log',
 'logs/azureml/job_release_azureml.log',
 'outputs/conda_env_v_1_0_0.yml',
 'outputs/env_dependencies.json',
 'outputs/internal_cross_validated_models.pkl',
 'outputs/model.pkl',
 'outputs/pipeline_graph.json',
 'outputs/scoring_file_v_

In [12]:
# Create the local output folder if it does not exist yet
os.makedirs("outputs", exist_ok=True)

pickle_filename = "outputs/model.pkl"
scoring_file = "outputs/scoring_file_v_1_0_0.py"
conda_file = "outputs/conda_env_v_1_0_0.yml"
best_run.download_file(pickle_filename, output_file_path="outputs/automl_model.pkl")
best_run.download_file(scoring_file, output_file_path="outputs/automl_scoring.py")
best_run.download_file(conda_file, output_file_path="outputs/automl_conda.yml")
print("Best model saved.")

Best model saved.


## Model deployment

For deployment, first we need to register our model:

In [13]:
model = remote_run.register_model(model_name="automl-glass", description="Glass Type by AutoML")
print(remote_run.model_id)

automl-glass


In [14]:
conda_env = Environment.from_conda_specification(name = "automl_env",
                                                 file_path = "outputs/automl_conda.yml")

In [15]:
inference_config = InferenceConfig(entry_script="outputs/automl_scoring.py", environment=conda_env)

In [16]:
# Deploy the model as a web service to an Azure Container Instance (ACI) with key-based authentication
service_name = 'glass-prediction-automl'
aci_config = AciWebservice.deploy_configuration(cpu_cores=1,
                                                memory_gb=1,
                                                auth_enabled=True,
                                                enable_app_insights=True)

In [17]:
service = Model.deploy(workspace=ws,
                       name=service_name,
                       models=[model],
                       inference_config=inference_config,
                       deployment_config=aci_config,
                       overwrite=True)
service.wait_for_deployment(show_output=True)

Tips: You can try get_logs(): https://aka.ms/debugimage#dockerlog or local deployment: https://aka.ms/debugimage#debug-locally to debug if deployment takes longer than 10 minutes.
Running........................................................................................................................................................................................
Succeeded
ACI service creation operation finished, operation "Succeeded"


In [18]:
print(service.state)

Healthy


In [19]:
print(service.get_logs())

2021-01-30T17:43:06,864940480+00:00 - iot-server/run 
2021-01-30T17:43:06,867891589+00:00 - rsyslog/run 
2021-01-30T17:43:06,867087787+00:00 - gunicorn/run 
2021-01-30T17:43:06,889410761+00:00 - nginx/run 
rsyslogd: /azureml-envs/azureml_4f26b5b9300bb3a3b89415c34dafac63/lib/libuuid.so.1: no version information available (required by rsyslogd)
/usr/sbin/nginx: /azureml-envs/azureml_4f26b5b9300bb3a3b89415c34dafac63/lib/libcrypto.so.1.0.0: no version information available (required by /usr/sbin/nginx)
/usr/sbin/nginx: /azureml-envs/azureml_4f26b5b9300bb3a3b89415c34dafac63/lib/libcrypto.so.1.0.0: no version information available (required by /usr/sbin/nginx)
/usr/sbin/nginx: /azureml-envs/azureml_4f26b5b9300bb3a3b89415c34dafac63/lib/libssl.so.1.0.0: no version information available (required by /usr/sbin/nginx)
/usr/sbin/nginx: /azureml-envs/azureml_4f26b5b9300bb3a3b89415c34dafac63/lib/libssl.so.1.0.0: no version information available (required by /usr/sbin/nginx)
/usr/sbin/nginx: /azureml

## Consuming the model

In [20]:
auth_key = service.get_keys()[0]
scoring_uri = service.scoring_uri

In [21]:
# One sample in the format expected by the scoring script (first row of the dataset)
data = {
    "data": [
        {
            "RI": 1.52101,
            "Na": 13.64,
            "Mg": 4.49,
            "Al": 1.1,
            "Si": 71.78,
            "K": 0.06,
            "Ca": 8.75,
            "Ba": 0.0,
            "Fe": 0.0
        }
    ]
}
input_data = json.dumps(data)

In [22]:
headers = {'Content-Type': 'application/json'}

In [23]:
# Set the authorization header
headers['Authorization'] = f'Bearer {auth_key}'

In [24]:
# Make the request and display the response
resp = requests.post(scoring_uri, input_data, headers=headers)
print(resp.json())

{"result": [1]}
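The request above can be wrapped in two small helpers so that other clients build the payload and read the answer the same way. This is a sketch under stated assumptions: `build_request` and `parse_prediction` are hypothetical names, and the response is assumed to be either a dict or a JSON-encoded string with a `result` key, as in the output shown above.

```python
import json

# Feature names expected by the deployed model, in the order used in this notebook
FEATURES = ["RI", "Na", "Mg", "Al", "Si", "K", "Ca", "Ba", "Fe"]


def build_request(sample, auth_key):
    """Return the JSON body and headers for a single-sample scoring request."""
    missing = [name for name in FEATURES if name not in sample]
    if missing:
        raise ValueError(f"missing features: {missing}")
    body = json.dumps({"data": [{name: sample[name] for name in FEATURES}]})
    headers = {
        "Content-Type": "application/json",
        "Authorization": f"Bearer {auth_key}",
    }
    return body, headers


def parse_prediction(payload):
    """Return the list of predicted classes from a decoded response.

    Assumption: the service may return the result as a JSON-encoded string,
    so a string payload is decoded once more before reading "result".
    """
    if isinstance(payload, str):
        payload = json.loads(payload)
    return payload["result"]


sample = {"RI": 1.52101, "Na": 13.64, "Mg": 4.49, "Al": 1.1, "Si": 71.78,
          "K": 0.06, "Ca": 8.75, "Ba": 0.0, "Fe": 0.0}
body, headers = build_request(sample, auth_key="<your-key>")  # placeholder key
print(parse_prediction('{"result": [1]}'))
```

With a live service, `body` and `headers` would be passed to `requests.post(scoring_uri, data=body, headers=headers)` and the decoded response handed to `parse_prediction`.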


In [25]:
# Delete the compute target for cost-saving
compute_cluster.delete()

In [26]:
# Delete the web service deployment
service.delete()

In [27]:
print(compute_cluster.get_status())

Current provisioning state of AmlCompute is "Deleting"

None
