## Training a Machine Learning Model with Automated ML


In this notebook we'll be using Azure Automated ML to train a machine learning model capable of determining the best cluster for a COVID-19 scientific article. It builds upon the work done in the *Data Preparation* notebook.

We'll import the Azure ML SDK modules we need and do a quick sanity check on the SDK version.

In [1]:
import azureml.core
from azureml.core import Dataset, Workspace, Experiment
from azureml.core.compute import ComputeTarget, AmlCompute
from azureml.train.automl import AutoMLConfig
from azureml.train.automl.run import AutoMLRun
from azureml.widgets import RunDetails
from azureml.automl.core.featurization.featurizationconfig import FeaturizationConfig

print("AML SDK version:", azureml.core.VERSION)

AML SDK version: 1.48.0


We'll start by retrieving the ML workspace used to manage our work.

In [2]:
# Retrieve your ML workspace
ws = Workspace.from_config()
print(ws)

Workspace.create(name='mlw-gai1-f4xzq', subscription_id='23529470-ba17-4d8a-9f0c-064e63a49c33', resource_group='rg-gai1-f4xzq')


Before we can launch an Automated ML run, we need a compute cluster. If one already exists, we'll reuse it; otherwise we'll create a new one.

In [3]:
# The name of the compute cluster
compute_name = 'aml-compute-cpu'
# The minimum and maximum number of nodes of the compute cluster
compute_min_nodes = 0
# A higher maximum lets Automated ML run more iterations in parallel, but it also increases your costs
compute_max_nodes = 4

vm_size = 'STANDARD_DS3_V2'

# Check existing compute targets in the workspace for a compute with this name
if compute_name in ws.compute_targets:
    compute_target = ws.compute_targets[compute_name]
    if isinstance(compute_target, AmlCompute):
        print(f'Found existing compute target: {compute_name}')
else:
    print(f'A new compute target is needed: {compute_name}')
    provisioning_config = AmlCompute.provisioning_configuration(vm_size = vm_size,
                                                                min_nodes = compute_min_nodes, 
                                                                max_nodes = compute_max_nodes)

    # Create the cluster
    compute_target = ComputeTarget.create(ws, compute_name, provisioning_config)
    
    # Wait for provisioning to complete
    compute_target.wait_for_completion(show_output=True, min_node_count=None, timeout_in_minutes=20)


Found existing compute target: aml-compute-cpu


## Configuring the Automated ML experiment

We'll use the `COVID19Articles_Train` dataset that we registered in the previous notebook for training the model. In order to speed up training we'll ignore all columns except the word vectors calculated using Doc2Vec.

In [4]:
# Retrieve the COVID19Articles_Train dataset from the workspace
train_data = Dataset.get_by_name(ws, 'COVID19Articles_Train')

# Ignore all columns except the word vectors
columns_to_ignore = ['sha', 'source_x', 'title', 'doi', 'pmcid', 'pubmed_id', 'license', 'abstract', 'publish_time', 'authors', 'journal', 'mag_id',
                     'who_covidence_id', 'arxiv_id', 'pdf_json_files', 'pmc_json_files', 'url', 's2_id' ]
train_data = train_data.drop_columns(columns_to_ignore) 


# Configure Automated ML
automl_config = AutoMLConfig(task = "classification",
                             # Use weighted area under curve metric to evaluate the models
                             primary_metric='AUC_weighted',
                             
                             # Use all columns except the ones we decided to ignore
                             training_data = train_data,
                             
                             # The values we're trying to predict are in the `cluster` column
                             label_column_name = 'cluster',
                             
                             # Evaluate the model with 5-fold cross validation
                             n_cross_validations=5,
                             
                             # The experiment should be stopped after 15 minutes, to minimize cost
                             experiment_timeout_hours=.25,
                             
                             # Automated ML can train at most 4 models at the same time; this is also capped by the compute cluster's maximum number of nodes
                             max_concurrent_iterations=4,
                             
                             # An iteration should be stopped if it takes more than 5 minutes
                             iteration_timeout_minutes=5,
                             
                             compute_target=compute_target
                            )

Once the Automated ML run is configured, we can submit it to one of the workspace's experiments. Note that this step should take around 15 minutes, according to the `experiment_timeout_hours` setting.
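
The 15-minute figure follows from the values passed to `AutoMLConfig`. As a rough sketch of how the three time-related settings interact, the snippet below converts the experiment timeout to minutes and estimates how many iterations fit in the budget if every iteration ran until its own timeout. It deliberately ignores featurization, cluster start-up, and scheduling overhead, so treat the result as a back-of-the-envelope number rather than what Automated ML will actually do.

```python
import math

# Values copied from the AutoMLConfig above.
experiment_timeout_hours = 0.25
iteration_timeout_minutes = 5
max_concurrent_iterations = 4

experiment_timeout_minutes = experiment_timeout_hours * 60

# Assumption: every iteration runs until its own timeout, so each "wave" of
# concurrent iterations occupies iteration_timeout_minutes of the budget.
waves = math.floor(experiment_timeout_minutes / iteration_timeout_minutes)
worst_case_iterations = waves * max_concurrent_iterations

print(f"Experiment budget: {experiment_timeout_minutes:.0f} minutes")
print(f"Iterations that fit if each one hits its timeout: {worst_case_iterations}")
```

In practice most iterations finish well before their timeout, so the real run will usually try more models than this estimate suggests.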

**NOTE**:

If this is the first time you are launching an experiment run in the Azure Machine Learning workspace, additional time will be needed to start the Compute Cluster and deploy the container images required to execute.

In [5]:
# Use the `COVID19_Classification` experiment
exp = Experiment(ws, 'COVID19_Classification')
run = exp.submit(automl_config, show_output=True)

# Retrieve the best performing run and its corresponding model from the aggregated Automated ML run
best_run, best_model = run.get_output()

Submitting remote run.
No run_configuration provided, running on aml-compute-cpu with default configuration
Running on remote compute: aml-compute-cpu


| Experiment | Id | Type | Status | Details Page | Docs Page |
|---|---|---|---|---|---|
| COVID19_Classification | AutoML_6562f9c9-6719-467b-9eaf-28445c88169e | automl | NotStarted | Link to Azure Machine Learning studio | Link to Documentation |



Current status: FeaturesGeneration. Generating features for the dataset.
Current status: DatasetCrossValidationSplit. Generating individually featurized CV splits.
Current status: ModelSelection. Beginning model selection.

********************************************************************************************
DATA GUARDRAILS: 

TYPE:         Class balancing detection
STATUS:       ALERTED
DESCRIPTION:  To decrease model bias, please cancel the current run and fix balancing problem.
              Learn more about imbalanced data: https://aka.ms/AutomatedMLImbalancedData
DETAILS:      Imbalanced data can lead to a falsely perceived positive effect of a model's accuracy because the input data has bias towards one class.
+------------------------------+--------------------------------+--------------------------------------+
|Size of the smallest class    |Name/Label of the smallest class|Number of samples in the training data|
|1                             |6, 8, 9                 
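
The guardrail output above reports that the smallest classes contain a single sample. One way to spot this before submitting a run is to count the labels yourself. The sketch below uses a made-up list of cluster labels, not the real dataset; with the real data the labels would come from the `cluster` column, for example after loading the dataset with `train_data.to_pandas_dataframe()`, assuming it fits in memory.

```python
from collections import Counter

# Hypothetical labels for illustration only; not taken from COVID19Articles_Train.
labels = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 6, 8, 9]
n_cross_validations = 5  # same value as in the AutoMLConfig

counts = Counter(labels)

# A class with fewer samples than folds cannot appear in every validation fold.
too_small = sorted(label for label, n in counts.items() if n < n_cross_validations)

for label, n in sorted(counts.items()):
    print(f"cluster {label}: {n} samples")
print("Classes smaller than the number of folds:", too_small)
```

Classes flagged this way are candidates for merging, resampling, or collecting more data before training.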

After the Automated ML run has finished, we can visualize its models and see how they measure up according to several metrics. Remember, the higher the *AUC_weighted*, the better.
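
To make the metric more concrete, here is a minimal pure-Python sketch of a weighted AUC, assuming the usual definition: compute a one-vs-rest AUC for each class, then average those scores weighted by the number of true samples in each class. The helper names and the tiny set of predictions are made up for illustration; this is not Automated ML's own implementation.

```python
from collections import Counter

def binary_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count as half."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC is undefined without both positive and negative samples")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def auc_weighted(labels, probabilities):
    """One-vs-rest AUC per class, averaged with weights equal to each class's sample count.

    probabilities[i][c] is the predicted probability of class c for sample i.
    """
    counts = Counter(labels)
    total = 0.0
    for cls, n in counts.items():
        y_true = [1 if y == cls else 0 for y in labels]
        scores = [p[cls] for p in probabilities]
        total += n * binary_auc(y_true, scores)
    return total / len(labels)

# Made-up predictions for three classes, for illustration only.
labels = [0, 0, 1, 1, 2]
probabilities = [
    [0.7, 0.2, 0.1],
    [0.4, 0.5, 0.1],
    [0.3, 0.6, 0.1],
    [0.2, 0.5, 0.3],
    [0.1, 0.3, 0.6],
]
score = auc_weighted(labels, probabilities)
print(f"AUC_weighted: {score:.3f}")
```

Because the average is weighted by class size, large clusters dominate the score, which is worth keeping in mind given the class imbalance reported by the data guardrails.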

In [None]:
RunDetails(run).show()