In this notebook, we show how to perform text classification by fine-tuning a BERT-based model, running the fine-tuning procedure as a distributed training job on Azure ML.

For more details about distributed training on Azure ML, please see [here]( https://github.com/microsoft/DistributedDeepLearning/).

Please note that this notebook was created in a hosted Jupyter environment on Azure ML. This environment already has all the packages we need here, such as NumPy, Pandas, Scikit-Learn, and PyTorch. For details about this hosted environment, please see [here]( https://docs.microsoft.com/en-us/azure/machine-learning/tutorial-1st-experiment-sdk-setup).

In [1]:
from azureml.core.authentication import InteractiveLoginAuthentication

from azureml.core import Workspace, Dataset, Experiment, Run, Environment
from azureml.core.compute import ComputeTarget, AmlCompute
from azureml.core.compute_target import ComputeTargetException
from azureml.core.model import Model
from azureml.core.conda_dependencies import CondaDependencies

from azureml.train.dnn import PyTorch
from azureml.train.hyperdrive import GridParameterSampling
from azureml.train.hyperdrive import HyperDriveConfig
from azureml.train.hyperdrive import PrimaryMetricGoal
from azureml.train.hyperdrive.parameter_expressions import choice

from azureml.core.runconfig import MpiConfiguration

from azureml.widgets import RunDetails

import pandas as pd

To be able to interact with Azure ML, we first need to get a reference to our [workspace]( https://docs.microsoft.com/en-us/azure/machine-learning/concept-workspace).

We use the [Azure ML SDK]( https://docs.microsoft.com/en-us/python/api/overview/azureml-sdk/?view=azure-ml-py) for that. If you don’t have it installed into your development environment, please follow the instructions [here]( https://docs.microsoft.com/en-us/azure/machine-learning/how-to-configure-environment#local). If you want to run the code on a managed VM instance, which already has the SDK, please see [here]( https://docs.microsoft.com/en-us/azure/machine-learning/tutorial-1st-experiment-sdk-setup).

You need to replace the values for *subscription_id*, *resource_group*, and *workspace_name* with the values for your own corresponding resources.

In [2]:
interactive_auth = InteractiveLoginAuthentication()

subscription_id = '<your azure subscription id>'
resource_group = '<your azure ml workspace resource group>'
workspace_name = '<your azure ml workspace name>'

workspace = Workspace(subscription_id=subscription_id, resource_group=resource_group, workspace_name=workspace_name,
                      auth=interactive_auth)

Here we instantiate an [Experiment]( https://docs.microsoft.com/en-us/azure/machine-learning/service/concept-azure-machine-learning-architecture#experiments) object, which will later be used to submit our model training execution.

In [3]:
exp = Experiment(workspace = workspace, name = 'bert_text_classification_distributed')

The next step is to create our remote [Compute Target]( https://docs.microsoft.com/en-us/azure/machine-learning/service/concept-azure-machine-learning-architecture#compute-targets).

Here we create one of the type [Azure Machine Learning Compute]( https://docs.microsoft.com/en-us/azure/machine-learning/service/how-to-set-up-training-targets#amlcompute). Once created, this resource is persisted and accessible by its name in subsequent calls.

In [4]:
cluster_name = 'aml-compute-01'

try:
    compute_target = ComputeTarget(workspace = workspace, name = cluster_name)
    print('Found existing compute target')
except ComputeTargetException:
    print('Creating a new compute target...')
    compute_config = AmlCompute.provisioning_configuration(vm_size = 'STANDARD_NC6', min_nodes = 8, max_nodes = 8)
    compute_target = ComputeTarget.create(workspace, cluster_name, compute_config)
    compute_target.wait_for_completion(show_output = True, min_node_count = 8, timeout_in_minutes = 20)

Creating a new compute target...
Creating
Succeeded....................
AmlCompute wait for completion finished
Minimum number of nodes requested have been provisioned


Now we create an [Estimator]( https://docs.microsoft.com/en-us/azure/machine-learning/service/concept-azure-machine-learning-architecture#estimators) object, which facilitates the creation of run configurations by defining the run script, its parameters, and the target run environment.

The AML service provides a generic Estimator, as well as specialized ones that facilitate the use of several popular Python ML packages. Here we use the [PyTorch Estimator]( https://docs.microsoft.com/en-us/python/api/azureml-train-core/azureml.train.estimator?view=azure-ml-py), as we are going to train a PyTorch-based model.

The code that runs on the remote compute target is specified in the *entry_script* field. It is essentially the same code that we use in the notebook *02-data-classification*, where we explain how to perform the fine-tuning step by step, but here it omits the visualizations and adds the necessary argument parsing, Azure ML-specific logging, and saving of the model artifacts of interest.

In [5]:
script_folder = './training_script'

script_params = {
    '--dataset_name': 'Consumer Complaints Dataset'
}

estimator = PyTorch(source_directory = script_folder,
                    compute_target = compute_target,
                    entry_script = 'train_horovod.py',
                    script_params = script_params,
                    use_gpu = True,
                    node_count=8,
                    process_count_per_node=1,
                    distributed_training=MpiConfiguration(),
                    pip_packages = ['scikit-learn', 'transformers', 'azureml-dataprep[fuse,pandas]'])
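
The contents of *train_horovod.py* are not shown in this notebook, so the following is only a minimal sketch of the kind of argument parsing such an entry script might contain. The flag names mirror *script_params* above and the hyperparameter names used later in this notebook; the `parse_args` helper, the types, and the default values are assumptions made for illustration.

```python
import argparse


def parse_args(argv=None):
    """Parse the command-line arguments that the remote run passes to the script."""
    parser = argparse.ArgumentParser()
    # Passed through script_params in the estimator definition
    parser.add_argument('--dataset_name', type=str, required=True)
    # Hyperparameters; the defaults here are assumptions, not values from the real script
    parser.add_argument('--batch_size', type=int, default=32)
    parser.add_argument('--learning_rate', type=float, default=1e-5)
    parser.add_argument('--adam_epsilon', type=float, default=1e-8)
    parser.add_argument('--num_epochs', type=int, default=5)
    return parser.parse_args(argv)


# Simulate the argument list that the remote run would receive
args = parse_args(['--dataset_name', 'Consumer Complaints Dataset', '--batch_size', '32'])
print(args.dataset_name, args.batch_size, args.learning_rate)
```

Accepting an explicit `argv` list keeps the parser easy to exercise locally before submitting a remote run.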



We are going to submit our Estimator for a remote run on AML Compute. Instead of doing that directly, we will wrap it using the [HyperDrive]( https://docs.microsoft.com/en-us/azure/machine-learning/service/how-to-tune-hyperparameters) functionality for automated model hyperparameter search.

The first step is to define how to sample the hyperparameter space. AML service provides several strategies already built in. Here we will use standard [Grid Sampling]( https://docs.microsoft.com/en-us/azure/machine-learning/service/how-to-tune-hyperparameters#sampling-the-hyperparameter-space).

The hyperparameter space is defined by the [choice]( https://docs.microsoft.com/en-us/python/api/azureml-train-core/azureml.train.hyperdrive.parameter_expressions?view=azure-ml-py) function. For simplicity, and to illustrate a single combination of parameters, we pass only one value to each *choice*. You could pass as many values as you want to each one.

In [6]:
param_sampling = GridParameterSampling({
    'batch_size': choice(32),
    'learning_rate': choice(1e-5),
    'adam_epsilon': choice(1e-8),
    'num_epochs': choice(5)})
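
To get a feel for how quickly a grid grows once each *choice* holds several values, the sketch below enumerates a grid in plain Python. It does not use the Azure ML SDK; the `search_space` values and the `expand_grid` helper are hypothetical and only illustrate the combinatorics.

```python
from itertools import product

# Hypothetical search space with several candidate values per hyperparameter
search_space = {
    'batch_size': [16, 32],
    'learning_rate': [1e-5, 3e-5, 5e-5],
    'adam_epsilon': [1e-8],
    'num_epochs': [3, 5],
}


def expand_grid(space):
    """Return one dict per combination of hyperparameter values."""
    names = list(space)
    return [dict(zip(names, values))
            for values in product(*(space[name] for name in names))]


grid = expand_grid(search_space)
print(len(grid))  # 2 * 3 * 1 * 2 = 12 combinations
```

With a grid like this one, the *max_total_runs* setting in the HyperDrive configuration would presumably need to be at least 12 for every combination to be tried.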

After defining the estimator and grid sampling strategy, we can pass them to the [Hyper Drive configuration]( https://docs.microsoft.com/en-us/python/api/azureml-train-core/azureml.train.hyperdrive.hyperdriveconfig?view=azure-ml-py) object. There are several options to configure here, such as the [termination policy]( https://docs.microsoft.com/en-us/azure/machine-learning/service/how-to-tune-hyperparameters#specify-early-termination-policy), [resources]( https://docs.microsoft.com/en-us/azure/machine-learning/service/how-to-tune-hyperparameters#allocate-resources) to allocate the job on, and the [primary metric]( https://docs.microsoft.com/en-us/azure/machine-learning/service/how-to-tune-hyperparameters#specify-primary-metric) to be optimized.

In [7]:
hyperdrive_run_config = HyperDriveConfig(estimator = estimator,
                                         hyperparameter_sampling = param_sampling,
                                         policy = None,
                                         primary_metric_name = 'validation loss',
                                         primary_metric_goal = PrimaryMetricGoal.MINIMIZE,
                                         max_total_runs = 1,
                                         max_concurrent_runs = 1)

The remaining step is to submit the Experiment defined before, passing the configuration for the hyperparameter search.

In [8]:
hyperdrive_run = exp.submit(hyperdrive_run_config)

We can monitor the execution through a Jupyter [graphical widget]( https://docs.microsoft.com/en-us/azure/machine-learning/service/how-to-tune-hyperparameters#visualize-experiment), available through the *RunDetails* class.

In [9]:
RunDetails(hyperdrive_run).show()

_HyperDriveWidget(widget_settings={'childWidgetDisplay': 'popup', 'send_telemetry': False, 'log_level': 'INFO'…

After all executions generated by the hyperparameter search finish, we can inspect them and print the hyperparameters and corresponding model performance metrics in a table.

Here the table is sorted by validation loss in ascending order, so the best model comes first.

In [10]:
hyperdrive_run.wait_for_completion(show_output = False)

children = list(hyperdrive_run.get_children())
metricslist = {}
i = 0

for single_run in children:
    results = {k: v for k, v in single_run.get_metrics().items() if isinstance(v, float)}
    parameters = single_run.get_details()['runDefinition']['arguments']
    try:
        results['batch_size'] = parameters[3]
        results['learning_rate'] = parameters[5]
        results['adam_epsilon'] = parameters[7]
        results['num_epochs'] = parameters[9]
        metricslist[i] = results
        i += 1
    except IndexError:
        # Skip runs whose argument list is shorter than expected
        pass

rundata = pd.DataFrame(metricslist).sort_index(axis = 1).T.sort_values(by = ['validation loss'], ascending = True)
rundata

| | adam_epsilon | batch_size | learning_rate | num_epochs | validation loss |
|---|---|---|---|---|---|
| 0 | 1e-08 | 32 | 1e-05 | 5 | 0.489663 |
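
Reading the hyperparameters by fixed position (indices 3, 5, 7, and 9) breaks as soon as the order or number of script arguments changes. A possible alternative, sketched below, pairs each flag with the value that follows it. The `arguments_to_dict` helper and the example list are hypothetical; the list is laid out the way the indices in the cell above imply, and the values are assumed to be strings.

```python
def arguments_to_dict(arguments):
    """Pair each '--flag' with the value that immediately follows it."""
    parsed = {}
    for flag, value in zip(arguments, arguments[1:]):
        flag, value = str(flag), str(value)
        if flag.startswith('--') and not value.startswith('--'):
            parsed[flag[2:]] = value
    return parsed


# Hypothetical arguments list, matching the positions used in the cell above
example_arguments = ['--dataset_name', 'Consumer Complaints Dataset',
                     '--batch_size', '32', '--learning_rate', '1e-05',
                     '--adam_epsilon', '1e-08', '--num_epochs', '5']

params = arguments_to_dict(example_arguments)
print(params['batch_size'], params['num_epochs'])
```

In the loop above, the same helper could be applied to `single_run.get_details()['runDefinition']['arguments']`, after which the hyperparameters can be looked up by name.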


We can also directly access the best run from our HyperDrive execution, along with its generated log files and the outputs we create explicitly.

We can save this run ID and use it later to retrieve the best run and the associated artifacts saved during its execution.

In [11]:
best_run = hyperdrive_run.get_best_run_by_primary_metric()
best_run.id

'bert_text_classification_distributed_1580183281896998_0'

All files that we write to the special "outputs" folder are made available for each hyperdrive run. Here we list those generated by the best run.

In [15]:
run = Run(Experiment(workspace = workspace, name = 'bert_text_classification_distributed'), 'bert_text_classification_distributed_1580183281896998_0')
run.get_file_names()

['azureml-logs/55_azureml-execution-tvmps_40be941cd6e7fdbb4a4c29f7d88615d0234114e775eac1b9759e345478708556_d.txt',
 'azureml-logs/55_azureml-execution-tvmps_6ac160721fbb4a462c53968dfe47dc899a2747e4f07ee85869d6942530b2d55c_d.txt',
 'azureml-logs/55_azureml-execution-tvmps_c4aa518317696ff9d5fcc5b1903280575b50c5bd08c89d02701ae6c40888ca78_d.txt',
 'azureml-logs/55_azureml-execution-tvmps_c696483192492da13bd69d5e0fc181d5b1850480a5e33e52e16f4cf8b65c87a3_d.txt',
 'azureml-logs/55_azureml-execution-tvmps_ce123e5e7010f950bfba0584d0f6585a5dfee40a3ffeb359f3692e49628260af_d.txt',
 'azureml-logs/55_azureml-execution-tvmps_d641a9cc30742828d4156e8cc219c536752fc449b75da6e5131d604a8381ec31_d.txt',
 'azureml-logs/55_azureml-execution-tvmps_f29b0667db008a7626bd458db1f1ed62a414471b40aa64cbdb5fcf2220384a21_d.txt',
 'azureml-logs/55_azureml-execution-tvmps_f64683dc0d6661da73ef63e9bd5144d6a5689db1de46424730bd71aa5bb0bc10_d.txt',
 'azureml-logs/65_job_prep-tvmps_40be941cd6e7fdbb4a4c29f7d88615d0234114e775eac1b

Because we didn’t configure the compute target to scale down automatically (*min_nodes* is set to 8), we delete it explicitly. This will stop and delete all associated compute resources.

In [16]:
compute_target.delete()

We can then retrieve the saved model, corresponding configurations, and logged metrics that we explicitly saved in the run script and use them to recreate the model and evaluate it on the test data, as we did in the previous notebook when showing the fine-tuning process step-by-step.

In [17]:
import os

model_folder = './model_aml'
os.makedirs(model_folder, exist_ok = True)

for f in run.get_file_names()[-8:]:
    run.download_file(f, model_folder)
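
Selecting the last eight entries of the file list relies on how the list happens to be ordered. Since the model artifacts are written to the special "outputs" folder, one alternative is to filter the names by that prefix. The sketch below assumes the names returned by `run.get_file_names()` begin with `outputs/` for those files; the `select_output_files` helper and the example file names are hypothetical.

```python
def select_output_files(file_names, prefix='outputs/'):
    """Keep only the files that live under the given folder prefix."""
    return [name for name in file_names if name.startswith(prefix)]


# Hypothetical file names, for illustration only
example_names = [
    'azureml-logs/70_driver_log.txt',
    'outputs/config.json',
    'outputs/pytorch_model.bin',
]

print(select_output_files(example_names))
```

In the cell above, the loop could then iterate over `select_output_files(run.get_file_names())` instead of slicing the list.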