# Automated ML

In [1]:
from azureml.core import Workspace, Experiment, Dataset
from azureml.data.dataset_factory import TabularDatasetFactory
from train import clean_data
from azureml.train.automl import AutoMLConfig
from azureml.core.compute import ComputeTarget, AmlCompute
from azureml.core.compute_target import ComputeTargetException
import pandas as pd
import os
from azureml.widgets import RunDetails

In [2]:
ws = Workspace.from_config()

# Choose a name for the experiment
experiment_name = 'automl'

experiment = Experiment(ws, experiment_name)

## Dataset

### Overview

The dataset used is the [UCI Glass Identification](https://archive.ics.uci.edu/ml/datasets/Glass+Identification) dataset. All data importing and cleaning is done by the [train.py](https://github.com/reis-r/nd00333-capstone/blob/master/train.py) script, which is also the script used by the Hyperdrive run. The objective is to classify the glass type according to its composition and other characteristics. This dataset was chosen because it needs little cleaning and is a well-known dataset for experimenting with machine learning.
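
To give a rough idea of the cleaning step, here is a minimal sketch in plain Python. It is an assumption, not the actual `clean_data` from train.py, which is not reproduced in this notebook and may work differently. The helper name `split_features_and_label` is hypothetical. The raw UCI file has no header and starts each row with an `Id` column, followed by nine features and the `Type` label; the sketch drops `Id` and separates features from labels.

```python
import csv
import io

# Column layout of the raw glass.data file: Id, nine features, then the label.
FEATURES = ["RI", "Na", "Mg", "Al", "Si", "K", "Ca", "Ba", "Fe"]
COLUMNS = ["Id"] + FEATURES + ["Type"]


def split_features_and_label(raw_text):
    """Parse header-less CSV text and return (feature rows, labels).

    Hypothetical helper for illustration; not the project's clean_data.
    """
    x, y = [], []
    for record in csv.reader(io.StringIO(raw_text)):
        if not record:
            continue  # skip blank lines
        row = dict(zip(COLUMNS, record))
        # Keep only the nine features; the Id column is dropped here.
        x.append({name: float(row[name]) for name in FEATURES})
        y.append(int(row["Type"]))
    return x, y


# First two rows of the raw file.
sample = (
    "1,1.52101,13.64,4.49,1.10,71.78,0.06,8.75,0.00,0.00,1\n"
    "2,1.51761,13.89,3.60,1.36,72.73,0.48,7.83,0.00,0.00,1\n"
)
x, y = split_features_and_label(sample)
print(x[0]["RI"], y[0])
```

Dropping `Id` matters because it is only a row counter and says nothing about the glass itself.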

In [3]:
# Select the data URL
data_url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/glass/glass.data'
ds = TabularDatasetFactory.from_delimited_files(path=data_url, header=False)
x, y = clean_data(ds)

# Use of X= and y= are deprecated on AutoMLConfig instantiation
# See more at the official docs: https://docs.microsoft.com/en-us/python/api/azureml-train-automl-client/azureml.train.automl.automlconfig.automlconfig?view=azure-ml-py
training_data = x.join(y)
training_data.head()

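| | RI | Na | Mg | Al | Si | K | Ca | Ba | Fe | Type |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.52101 | 13.64 | 4.49 | 1.1 | 71.78 | 0.06 | 8.75 | 0.0 | 0.0 | 1 |
| 1 | 1.51761 | 13.89 | 3.6 | 1.36 | 72.73 | 0.48 | 7.83 | 0.0 | 0.0 | 1 |
| 2 | 1.51618 | 13.53 | 3.55 | 1.54 | 72.99 | 0.39 | 7.78 | 0.0 | 0.0 | 1 |
| 3 | 1.51766 | 13.21 | 3.69 | 1.29 | 72.61 | 0.57 | 8.22 | 0.0 | 0.0 | 1 |
| 4 | 1.51742 | 13.27 | 3.62 | 1.24 | 73.08 | 0.55 | 8.07 | 0.0 | 0.0 | 1 |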


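A remote AutoML job cannot be created from a pandas DataFrame, so the cleaned data is written to a CSV file, uploaded to the default datastore, and read back as a tabular dataset.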

In [4]:
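# Create a dataset from the clean data
os.makedirs("automl-data", exist_ok=True)

# index=False keeps the DataFrame index out of the CSV,
# so it cannot end up as an extra feature column in the uploaded dataset
training_data.to_csv('automl-data/prepared-automl.csv', index=False)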

datastore = ws.get_default_datastore()
datastore.upload(src_dir='automl-data', target_path='automl-data')
training_data = Dataset.Tabular.from_delimited_files(path = [(datastore, ('automl-data/prepared-automl.csv'))])

Uploading an estimated of 1 files
Uploading automl-data/prepared-automl.csv
Uploaded automl-data/prepared-automl.csv, 1 files out of an estimated total of 1
Uploaded 1 files


## AutoML Configuration

The AutoML run is configured with accuracy as the primary metric so that it can be compared with the Hyperdrive-optimized SVC model from the other notebook in the project.
The concurrency settings were chosen according to the compute cluster used: there should be no more concurrent iterations than processing units (nodes) available on the compute cluster. An experiment timeout of one hour was set so that the experiment does not take too long, which also helps to avoid extra charges.
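
The concurrency rule can be made explicit in code. The sketch below is only an illustration: `build_automl_settings` is a hypothetical helper, not part of the Azure ML SDK, and the keys it returns are the same ones used in this notebook's `automl_settings` dictionary. It caps the requested concurrency at the number of nodes in the cluster.

```python
def build_automl_settings(max_nodes, requested_concurrency, timeout_hours=1.0):
    """Return AutoML settings with concurrency capped at the cluster size.

    Hypothetical helper for illustration; the keys mirror this notebook's settings.
    """
    if max_nodes < 1:
        raise ValueError("max_nodes must be at least 1")
    return {
        "n_cross_validations": 3,
        "primary_metric": "accuracy",
        "enable_early_stopping": True,
        "experiment_timeout_hours": timeout_hours,
        # Never ask for more parallel iterations than the cluster has nodes.
        "max_concurrent_iterations": min(requested_concurrency, max_nodes),
        "max_cores_per_iteration": 1,
    }


settings = build_automl_settings(max_nodes=4, requested_concurrency=8)
print(settings["max_concurrent_iterations"])  # prints 4
```

Deriving the value from the cluster size keeps the two settings from drifting apart if the cluster is resized later.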

In [5]:
# Create a compute cluster
cluster_name = "automl"
# Check if a compute cluster already exists
try:
    print("Trying to connect to an existing cluster...")
    compute_cluster = ComputeTarget(workspace=ws, name=cluster_name)
except ComputeTargetException:
    print("Creating a compute cluster...")
    compute_configuration = AmlCompute.provisioning_configuration(vm_size='STANDARD_D2_V2', max_nodes=4)
    compute_cluster = ComputeTarget.create(ws, cluster_name, compute_configuration)
    compute_cluster.wait_for_completion(show_output=True)
print("Success!")

Trying to connect to an existing cluster...
Creating a compute cluster...
Creating
Succeeded
AmlCompute wait for completion finished

Minimum number of nodes requested have been provisioned
Success!


In [6]:
automl_settings = {
    "n_cross_validations": 3,
    "primary_metric": 'accuracy',
    "enable_early_stopping": True,
    "experiment_timeout_hours": 1.0,
    "max_concurrent_iterations": 4,
    "max_cores_per_iteration": 1,
}

# Configure the AutoML run
automl_config = AutoMLConfig(task="classification",
                             compute_target = compute_cluster,
                             training_data=training_data,
                             label_column_name="Type",
                             **automl_settings)
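
To make `n_cross_validations` and the `accuracy` metric more concrete, here is a small plain-Python sketch of what 3-fold cross-validation with accuracy involves. It is a simplified illustration under assumptions, not AutoML's implementation: the helper names are hypothetical, the folds are contiguous and unshuffled, and the way AutoML actually splits the data may differ. The row count of 214 is the size of the UCI Glass dataset.

```python
def k_fold_indices(n_samples, k):
    """Split range(n_samples) into k contiguous folds whose sizes differ by at most one."""
    if k < 2 or k > n_samples:
        raise ValueError("k must be between 2 and n_samples")
    base, extra = divmod(n_samples, k)
    folds, start = [], 0
    for i in range(k):
        size = base + (1 if i < extra else 0)  # the first `extra` folds get one more row
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must be non-empty and of equal length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


folds = k_fold_indices(n_samples=214, k=3)
print([len(fold) for fold in folds])  # prints [72, 71, 71]

# Each fold is held out once while the model trains on the other two;
# the reported score is the mean accuracy over the held-out folds.
print(accuracy([1, 1, 2, 7], [1, 2, 2, 7]))  # prints 0.75
```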

In [7]:
# Submit the experiment
remote_run = experiment.submit(automl_config)

Running on remote.


## Run Details

In [8]:
# See the progress of the run

RunDetails(remote_run).show()

_AutoMLWidget(widget_settings={'childWidgetDisplay': 'popup', 'send_telemetry': False, 'log_level': 'INFO', 's…

## Best Model

In [10]:
best_run, fitted_model = remote_run.get_output()
best_run

| Experiment | Id | Type | Status | Details Page | Docs Page |
|---|---|---|---|---|---|
| automl | AutoML_89b818da-96ca-4f06-81a5-999e5a71dd1a_36 | azureml.scriptrun | Completed | Link to Azure Machine Learning studio | Link to Documentation |


In [11]:
print("Best run:")
print(best_run)
print("\n")
print("Fitted model:")
print(fitted_model)

Best run:
Run(Experiment: automl,
Id: AutoML_89b818da-96ca-4f06-81a5-999e5a71dd1a_36,
Type: azureml.scriptrun,
Status: Completed)


Fitted model:
Pipeline(memory=None,
         steps=[('datatransformer',
                 DataTransformer(enable_dnn=None, enable_feature_sweeping=None,
                                 feature_sweeping_config=None,
                                 feature_sweeping_timeout=None,
                                 featurization_config=None, force_text_dnn=None,
                                 is_cross_validation=None,
                                 is_onnx_compatible=None, logger=None,
                                 observer=None, task=None, working_dir=None)),
                ('prefittedsoftvotingclassifier',...
                                                                                               n_jobs=1,
                                                                                               nthread=None,
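
The step name `prefittedsoftvotingclassifier` in the pipeline above suggests that the best model is a soft-voting ensemble. As a rough illustration of the idea, and not of AutoML's actual implementation, soft voting averages the class probabilities predicted by several models (optionally weighted) and picks the class with the highest average. The function name `soft_vote` and the probabilities below are made up for this example.

```python
def soft_vote(probabilities, weights=None):
    """Average per-model class probabilities; return (winning class, averaged probabilities).

    `probabilities` is a list with one dict per model, mapping class label to probability.
    Hypothetical helper for illustration only.
    """
    if not probabilities:
        raise ValueError("at least one model output is required")
    if weights is None:
        weights = [1.0] * len(probabilities)
    if len(weights) != len(probabilities):
        raise ValueError("weights and probabilities must have the same length")
    total = sum(weights)
    averaged = {
        label: sum(w * p[label] for w, p in zip(weights, probabilities)) / total
        for label in probabilities[0]
    }
    return max(averaged, key=averaged.get), averaged


# Made-up outputs of three models for one sample (glass types 1, 2 and 7).
model_outputs = [
    {1: 0.6, 2: 0.3, 7: 0.1},
    {1: 0.2, 2: 0.7, 7: 0.1},
    {1: 0.3, 2: 0.6, 7: 0.1},
]
label, averaged = soft_vote(model_outputs)
print(label)  # prints 2
```

In this example the first model prefers type 1, but the averaged probabilities favor type 2 because the other two models are more confident.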
                                

In [12]:
os.makedirs("outputs", exist_ok=True)

# Download the best run's pickled model from the run outputs to a local file
pickle_filename = "outputs/model.pkl"
best_run.download_file(pickle_filename, output_file_path="outputs/hiperdrive-model.pkl")
print("Best model saved.")

Best model saved.


In [13]:
# Delete the compute target for cost-saving
compute_cluster.delete()

In [19]:
print(compute_cluster.get_status())

Current provisioning state of AmlCompute is "Deleting"

None
