# Training Models

The central goal of machine learning is to train predictive models that applications can use. In Azure Machine Learning, you can use scripts to train models with common machine learning frameworks such as Scikit-Learn, TensorFlow, PyTorch, SparkML, and others. You can run these training scripts as experiments to track metrics and outputs, in particular the trained models.

## Before You Start

Before you start this lab, ensure that you have completed the *Create an Azure Machine Learning Workspace* and *Create a Compute Instance* tasks in [Lab 1: Getting Started with Azure Machine Learning](./labdocs/Lab01.md). Then open this notebook in Jupyter on your Compute Instance.

## Connect to Your Workspace

The first thing you need to do is to connect to your workspace using the Azure ML SDK.

> **Note**: If you do not have a current authenticated session with your Azure subscription, you'll be prompted to authenticate. Follow the instructions to authenticate using the code provided.

In [1]:
import azureml.core
from azureml.core import Workspace

# Load the workspace from the saved config file
ws = Workspace.from_config()
print('Ready to use Azure ML {} to work with {}'.format(azureml.core.VERSION, ws.name))

Ready to use Azure ML 1.17.0 to work with MS-DP100-CERT-PRACTICE
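
`Workspace.from_config()` reads the workspace details from a `config.json` file. As a minimal sketch, assuming the usual three keys (`subscription_id`, `resource_group`, `workspace_name`), the following checks that such a file is complete before you try to connect. The helper name `check_workspace_config` and the placeholder values are hypothetical and used only for illustration.

```python
import json
import os
import tempfile

# Keys that a workspace config file is assumed to contain
REQUIRED_KEYS = {'subscription_id', 'resource_group', 'workspace_name'}

def check_workspace_config(path):
    """Return a sorted list of required keys missing from the config file."""
    with open(path) as f:
        config = json.load(f)
    return sorted(REQUIRED_KEYS - config.keys())

# Placeholder values only; substitute your own workspace details
sample = {
    'subscription_id': '<your-subscription-id>',
    'resource_group': '<your-resource-group>',
    'workspace_name': '<your-workspace-name>',
}

# Write the sample to a temporary folder and validate it
with tempfile.TemporaryDirectory() as tmp:
    config_path = os.path.join(tmp, 'config.json')
    with open(config_path, 'w') as f:
        json.dump(sample, f)
    missing = check_workspace_config(config_path)

print('Missing keys:', missing)
```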


In [2]:
for compute_name in ws.compute_targets:
    compute = ws.compute_targets[compute_name]
    print(compute.name, ":", compute.type)

DP100-LearningPath1 : ComputeInstance
DP100-Cmpt-Clust : AmlCompute


## Create a Training Script

You're going to use a Python script to train a machine learning model based on the diabetes data, so let's start by creating a folder for the script and data files.

In [2]:
import os, shutil

# Create a folder for the experiment files
training_folder = 'diabetes-training'
os.makedirs(training_folder, exist_ok=True)

# Copy the data file into the experiment folder
shutil.copy('data/diabetes.csv', os.path.join(training_folder, "diabetes.csv"))

'diabetes-training\\diabetes.csv'

Now you're ready to create the training script and save it in the folder.

In [5]:
%%writefile $training_folder/diabetes_training.py
# Import libraries
from azureml.core import Run
import pandas as pd
import numpy as np
import joblib
import os
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.metrics import roc_curve

# Get the experiment run context
run = Run.get_context()

# load the diabetes dataset
print("Loading Data...")
diabetes = pd.read_csv('diabetes.csv')

# Separate features and labels
X, y = diabetes[['Pregnancies','PlasmaGlucose','DiastolicBloodPressure','TricepsThickness','SerumInsulin','BMI','DiabetesPedigree','Age']].values, diabetes['Diabetic'].values

# Split data into training set and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=0)

# Set regularization hyperparameter
reg = 0.01

# Train a logistic regression model
print('Training a logistic regression model with regularization rate of', reg)
run.log('Regularization Rate', float(reg))  # np.float was removed in NumPy 1.24; the built-in float is equivalent
model = LogisticRegression(C=1/reg, solver="liblinear").fit(X_train, y_train)

# calculate accuracy
y_hat = model.predict(X_test)
acc = np.average(y_hat == y_test)
print('Accuracy:', acc)
run.log('Accuracy', float(acc))

# calculate AUC
y_scores = model.predict_proba(X_test)
auc = roc_auc_score(y_test,y_scores[:,1])
print('AUC: ' + str(auc))
run.log('AUC', float(auc))

# Save the trained model in the outputs folder
os.makedirs('outputs', exist_ok=True)
joblib.dump(value=model, filename='outputs/diabetes_model.pkl')

run.complete()

Overwriting diabetes-training/diabetes_training.py
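
To make the two logged metrics concrete, here is a small sketch that computes them by hand in plain Python. The labels and probabilities are invented toy values, not the diabetes data, and a 0.5 decision threshold is assumed. Accuracy is the fraction of predictions that match the labels (what `np.average(y_hat == y_test)` computes), and AUC is the chance that a randomly chosen positive case scores higher than a randomly chosen negative one.

```python
# Toy labels and predicted probabilities (made up for illustration only)
y_true = [0, 0, 1, 1, 0, 1]
y_prob = [0.10, 0.40, 0.35, 0.80, 0.20, 0.90]

# Accuracy: apply an assumed 0.5 threshold, then take the fraction of correct predictions
y_pred = [1 if p >= 0.5 else 0 for p in y_prob]
accuracy = sum(int(p == t) for p, t in zip(y_pred, y_true)) / len(y_true)

# AUC: compare every positive score with every negative score;
# a higher positive score counts as 1, a tie counts as 0.5
pos = [p for p, t in zip(y_prob, y_true) if t == 1]
neg = [p for p, t in zip(y_prob, y_true) if t == 0]
wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
auc = wins / (len(pos) * len(neg))

print('Accuracy:', accuracy)
print('AUC:', auc)
```

With these toy values, five of six predictions are correct and eight of nine positive/negative pairs are ranked correctly.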


## Use an Estimator to Run the Script as an Experiment

You can run experiment scripts using a **RunConfiguration** and a **ScriptRunConfig**, or you can use an **Estimator**, which abstracts both of these configurations in a single object.
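
For comparison, the **ScriptRunConfig** route might look roughly like the sketch below. It assumes an SDK version in which `ScriptRunConfig` accepts `compute_target` and `environment` arguments. The function name `build_script_run_config`, the environment name, and the package list are illustrative choices rather than part of this lab.

```python
def build_script_run_config(source_directory, compute_target_name):
    """Build a ScriptRunConfig that runs diabetes_training.py with scikit-learn available."""
    # Imported lazily so this function can be defined without the Azure ML SDK installed
    from azureml.core import Environment, ScriptRunConfig
    from azureml.core.conda_dependencies import CondaDependencies

    # Hypothetical environment name; the packages the script imports are listed explicitly
    env = Environment(name='diabetes-sklearn-env')
    env.python.conda_dependencies = CondaDependencies.create(
        conda_packages=['scikit-learn', 'pandas'],
        pip_packages=['azureml-defaults'])

    # Combine the script, the compute target, and the environment in one configuration
    return ScriptRunConfig(source_directory=source_directory,
                           script='diabetes_training.py',
                           compute_target=compute_target_name,
                           environment=env)
```

The returned object could then be passed to `experiment.submit(config=...)` in the same way as an estimator.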

In this case, we'll use a generic **Estimator** object to run the training experiment. Note that the default environment for this estimator does not include the **scikit-learn** package, so you need to add it explicitly to the configuration. The conda environment is built on demand the first time the estimator is used and cached for future runs that use the same configuration, so the first run takes a little longer. Subsequent runs reuse the cached environment and complete more quickly.

In [9]:
from azureml.train.estimator import Estimator
from azureml.core import Experiment

# Create an estimator
estimator = Estimator(source_directory=training_folder,
                      entry_script='diabetes_training.py',
                      compute_target='DP100-LearningPath1',
                      conda_packages=['scikit-learn']
                      )

# Create an experiment
experiment_name = 'diabetes-training'
experiment = Experiment(workspace = ws, name = experiment_name)

# Run the experiment based on the estimator
run = experiment.submit(config=estimator)
run.wait_for_completion(show_output=True)



RunId: diabetes-training_1604602741_dc67a0af
Web View: https://ml.azure.com/experiments/diabetes-training/runs/diabetes-training_1604602741_dc67a0af?wsid=/subscriptions/90b02aed-4374-4b38-901e-7a6a8b9ce3b9/resourcegroups/DP100/workspaces/MS-DP100-CERT-PRACTICE

Streaming azureml-logs/20_image_build_log.txt

2020/11/05 18:59:17 Downloading source code...
2020/11/05 18:59:19 Finished downloading source code
2020/11/05 18:59:19 Creating Docker network: acb_default_network, driver: 'bridge'
2020/11/05 18:59:20 Successfully set up Docker network: acb_default_network
2020/11/05 18:59:20 Setting up Docker configuration...
2020/11/05 18:59:21 Successfully set up Docker configuration
2020/11/05 18:59:21 Logging in to registry: 242f6b50a5634f0595d4cf914edab868.azurecr.io
2020/11/05 18:59:22 Successfully logged into 242f6b50a5634f0595d4cf914edab868.azurecr.io
2020/11/05 18:59:22 Executing step ID: acb_step_0. Timeout(sec): 5400, Working directory: '', Network: 'acb_default_network'
2020/11/05 18:


mkl-2019.4           | 204.1 MB  | ########## | 100% 

libedit-3.1          | 171 KB    |            |   0% 
libedit-3.1          | 171 KB    | ########## | 100% 

readline-7.0         | 387 KB    |            |   0% 
readline-7.0         | 387 KB    | ########## | 100% 

six-1.15.0           | 13 KB     |            |   0% 
six-1.15.0           | 13 KB     | ########## | 100% 

mkl-service-2.3.0    | 208 KB    |            |   0% 
mkl-service-2.3.0    | 208 KB    | ########## | 100% 

wheel-0.35.1         | 36 KB     |            |   0% 
wheel-0.35.1         | 36 KB     | ########## | 100% 
Preparing transaction: ...working... done
Verifying transaction: ...working... done
Executing transaction: ...working... done

Ran pip subprocess with arguments:
['/azureml-envs/azureml_4b824bcb98517d791c41923f24d65461/bin/python', '-m', 'pip', 'install', '-U', '-r', '/azureml-environment-setup/condaenv.px_h7am0.requirements.txt']
Pip subprocess output:
Collecting azureml-defaults
  Downloading az

Removing intermediate container ced8c70f90b3
 ---> 79aa50423bbc
Step 9/14 : ENV PATH /azureml-envs/azureml_4b824bcb98517d791c41923f24d65461/bin:$PATH
 ---> Running in 127b2b2f9cc0
Removing intermediate container 127b2b2f9cc0
 ---> de575a25d3e5
Step 10/14 : ENV AZUREML_CONDA_ENVIRONMENT_PATH /azureml-envs/azureml_4b824bcb98517d791c41923f24d65461
 ---> Running in afd5b0aa5a80
Removing intermediate container afd5b0aa5a80
 ---> 9944e87bb342
Step 11/14 : ENV LD_LIBRARY_PATH /azureml-envs/azureml_4b824bcb98517d791c41923f24d65461/lib:$LD_LIBRARY_PATH
 ---> Running in dfda30dc2fc2
Removing intermediate container dfda30dc2fc2
 ---> 62e273468236
Step 12/14 : COPY azureml-environment-setup/spark_cache.py azureml-environment-setup/log4j.properties /azureml-environment-setup/
 ---> 7926db0d9af4
Step 13/14 : ENV AZUREML_ENVIRONMENT_IMAGE True
 ---> Running in 9809166c1a13
Removing intermediate container 9809166c1a13
 ---> c6dee4b94833
Step 14/14 : CMD ["bash"]
 ---> Running in f881ec8056ae
Removing 

{'runId': 'diabetes-training_1604602741_dc67a0af',
 'target': 'DP100-LearningPath1',
 'status': 'Completed',
 'startTimeUtc': '2020-11-05T19:05:57.042147Z',
 'endTimeUtc': '2020-11-05T19:08:06.379124Z',
 'properties': {'_azureml.ComputeTargetType': 'amlcompute',
  'ContentSnapshotId': '4ef95129-e7a7-4e33-b3b9-92767b2dfeb5',
  'azureml.git.repository_uri': 'https://github.com/yuvrajpandya/azure-ml-labs.git',
  'mlflow.source.git.repoURL': 'https://github.com/yuvrajpandya/azure-ml-labs.git',
  'azureml.git.branch': 'master',
  'mlflow.source.git.branch': 'master',
  'azureml.git.commit': '12175095c48386678739e95d47821d87dfc6a1af',
  'mlflow.source.git.commit': '12175095c48386678739e95d47821d87dfc6a1af',
  'azureml.git.dirty': 'True',
  'ProcessInfoFile': 'azureml-logs/process_info.json',
  'ProcessStatusFile': 'azureml-logs/process_status.json'},
 'inputDatasets': [],
 'outputDatasets': [],
 'runDefinition': {'script': 'diabetes_training.py',
  'command': [],
  'useAbsolutePath': False,


As with any experiment run, you can use the **RunDetails** widget to view information about the run and get a link to it in Azure Machine Learning studio.

In [10]:
from azureml.widgets import RunDetails

RunDetails(run).show()

_UserRunWidget(widget_settings={'childWidgetDisplay': 'popup', 'send_telemetry': False, 'log_level': 'INFO', '…

You can also retrieve the metrics and outputs from the **Run** object.

In [11]:
# Get logged metrics
metrics = run.get_metrics()
for key in metrics.keys():
    print(key, metrics.get(key))
print('\n')
for file in run.get_file_names():
    print(file)

Regularization Rate 0.01
Accuracy 0.774
AUC 0.8484929598487486


azureml-logs/20_image_build_log.txt
azureml-logs/55_azureml-execution-tvmps_46c3950b5d38b68aae5c0d09e1823d9afec73f14b0de396927a0fa84a3d234b8_d.txt
azureml-logs/65_job_prep-tvmps_46c3950b5d38b68aae5c0d09e1823d9afec73f14b0de396927a0fa84a3d234b8_d.txt
azureml-logs/70_driver_log.txt
azureml-logs/75_job_post-tvmps_46c3950b5d38b68aae5c0d09e1823d9afec73f14b0de396927a0fa84a3d234b8_d.txt
azureml-logs/process_info.json
azureml-logs/process_status.json
logs/azureml/101_azureml.log
logs/azureml/job_prep_azureml.log
logs/azureml/job_release_azureml.log
outputs/diabetes_model.pkl


## Register the Trained Model

Note that the outputs of the experiment include the trained model file (**diabetes_model.pkl**). You can register this model in your Azure Machine Learning workspace, making it possible to track model versions and retrieve them later.

In [12]:
from azureml.core import Model

# Register the model
run.register_model(model_path='outputs/diabetes_model.pkl', model_name='diabetes_model',
                   tags={'Training context':'Estimator'},
                   properties={'AUC': run.get_metrics()['AUC'], 'Accuracy': run.get_metrics()['Accuracy']})

# List registered models
for model in Model.list(ws):
    print(model.name, 'version:', model.version)
    for tag_name in model.tags:
        tag = model.tags[tag_name]
        print ('\t',tag_name, ':', tag)
    for prop_name in model.properties:
        prop = model.properties[prop_name]
        print ('\t',prop_name, ':', prop)
    print('\n')

diabetes_model version: 1
	 Training context : Estimator
	 AUC : 0.8484929598487486
	 Accuracy : 0.774


amlstudio-predict-auto-price version: 1
	 CreatedByAMLStudio : true


AutoML91841e3d60 version: 1




## Create a Parameterized Training Script

You can increase the flexibility of your training experiment by adding parameters to your script, enabling you to repeat the same training experiment with different settings. In this case, you'll add a parameter for the regularization rate used by the Logistic Regression algorithm when training the model.

Again, let's start by creating a folder for the parameterized script and the training data.

In [3]:
import os, shutil

# Create a folder for the experiment files
training_folder = 'diabetes-training-params'
os.makedirs(training_folder, exist_ok=True)

# Copy the data file into the experiment folder
shutil.copy('data/diabetes.csv', os.path.join(training_folder, "diabetes.csv"))

'diabetes-training-params\\diabetes.csv'

Now let's create a script containing a parameter for the regularization rate hyperparameter.

In [4]:
%%writefile $training_folder/diabetes_training.py
# Import libraries
from azureml.core import Run
import pandas as pd
import numpy as np
import joblib
import os
import argparse
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.metrics import roc_curve

# Get the experiment run context
run = Run.get_context()

# Set regularization hyperparameter
parser = argparse.ArgumentParser()
parser.add_argument('--reg_rate', type=float, dest='reg', default=0.01)
args = parser.parse_args()
reg = args.reg

# load the diabetes dataset
print("Loading Data...")
diabetes = pd.read_csv('diabetes.csv')

# Separate features and labels
X, y = diabetes[['Pregnancies','PlasmaGlucose','DiastolicBloodPressure','TricepsThickness','SerumInsulin','BMI','DiabetesPedigree','Age']].values, diabetes['Diabetic'].values

# Split data into training set and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=0)

# Train a logistic regression model
print('Training a logistic regression model with regularization rate of', reg)
run.log('Regularization Rate', float(reg))  # np.float was removed in NumPy 1.24; the built-in float is equivalent
model = LogisticRegression(C=1/reg, solver="liblinear").fit(X_train, y_train)

# calculate accuracy
y_hat = model.predict(X_test)
acc = np.average(y_hat == y_test)
print('Accuracy:', acc)
run.log('Accuracy', float(acc))

# calculate AUC
y_scores = model.predict_proba(X_test)
auc = roc_auc_score(y_test,y_scores[:,1])
print('AUC: ' + str(auc))
run.log('AUC', float(auc))

# Save the trained model in the outputs folder
os.makedirs('outputs', exist_ok=True)
joblib.dump(value=model, filename='outputs/diabetes_model.pkl')

run.complete()

Writing diabetes-training-params/diabetes_training.py
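
The argument handling in this script can be tried on its own, without submitting an experiment. The sketch below reuses the same `add_argument` call and passes explicit argument lists in place of the real command line. The wrapper function `parse_reg_rate` is a hypothetical helper added for illustration. It also shows the value that ends up in `C`, which scikit-learn treats as the inverse of the regularization strength.

```python
import argparse

def parse_reg_rate(argv):
    """Parse --reg_rate from a list of arguments, as the training script does."""
    parser = argparse.ArgumentParser()
    parser.add_argument('--reg_rate', type=float, dest='reg', default=0.01)
    return parser.parse_args(argv).reg

# No arguments: the default regularization rate is used
default_reg = parse_reg_rate([])

# An explicit value, passed as strings the way a command line would supply it
custom_reg = parse_reg_rate(['--reg_rate', '0.1'])

# The script passes C=1/reg to LogisticRegression, so a larger rate means a smaller C
for reg in (default_reg, custom_reg):
    print('reg =', reg, '-> C =', 1 / reg)
```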


## Use a Framework-Specific Estimator

You used a generic **Estimator** class to run the training script, but you can also take advantage of framework-specific estimators that include environment definitions for common machine learning frameworks. In this case, you're using Scikit-Learn, so you can use the **SKLearn** estimator. This means that you don't need to specify the **scikit-learn** package in the configuration.

> **Note**: Once again, the training experiment uses a new environment, which must be created the first time it is run.

In [6]:
from azureml.core import Experiment
from azureml.train.sklearn import SKLearn
from azureml.widgets import RunDetails

# Create an estimator
estimator = SKLearn(source_directory=training_folder,
                    entry_script='diabetes_training.py',
                    script_params = {'--reg_rate': 0.1},
                    compute_target='DP100-LearningPath1'
                    )

# Create an experiment
experiment_name = 'diabetes-training'
experiment = Experiment(workspace = ws, name = experiment_name)

# Run the experiment
run = experiment.submit(config=estimator)

# Show the run details while running
RunDetails(run).show()
run.wait_for_completion()



_UserRunWidget(widget_settings={'childWidgetDisplay': 'popup', 'send_telemetry': False, 'log_level': 'INFO', '…

{'runId': 'diabetes-training_1604718475_ce992e9b',
 'target': 'DP100-LearningPath1',
 'status': 'Completed',
 'startTimeUtc': '2020-11-07T03:24:47.541098Z',
 'endTimeUtc': '2020-11-07T03:26:24.165479Z',
 'properties': {'_azureml.ComputeTargetType': 'amlcompute',
  'ContentSnapshotId': '9b15c36a-fa49-4ddb-a089-2aa4de03bad6',
  'azureml.git.repository_uri': 'https://github.com/yuvrajpandya/azure-ml-labs.git',
  'mlflow.source.git.repoURL': 'https://github.com/yuvrajpandya/azure-ml-labs.git',
  'azureml.git.branch': 'master',
  'mlflow.source.git.branch': 'master',
  'azureml.git.commit': '12175095c48386678739e95d47821d87dfc6a1af',
  'mlflow.source.git.commit': '12175095c48386678739e95d47821d87dfc6a1af',
  'azureml.git.dirty': 'True',
  'ProcessInfoFile': 'azureml-logs/process_info.json',
  'ProcessStatusFile': 'azureml-logs/process_status.json'},
 'inputDatasets': [],
 'outputDatasets': [],
 'runDefinition': {'script': 'diabetes_training.py',
  'command': [],
  'useAbsolutePath': False,


Once again, you can get the metrics and outputs from the run.

In [7]:
# Get logged metrics
metrics = run.get_metrics()
for key in metrics.keys():
    print(key, metrics.get(key))
print('\n')
for file in run.get_file_names():
    print(file)

Regularization Rate 0.1
Accuracy 0.7736666666666666
AUC 0.8483904671874223


azureml-logs/20_image_build_log.txt
azureml-logs/55_azureml-execution-tvmps_46c3950b5d38b68aae5c0d09e1823d9afec73f14b0de396927a0fa84a3d234b8_d.txt
azureml-logs/65_job_prep-tvmps_46c3950b5d38b68aae5c0d09e1823d9afec73f14b0de396927a0fa84a3d234b8_d.txt
azureml-logs/70_driver_log.txt
azureml-logs/75_job_post-tvmps_46c3950b5d38b68aae5c0d09e1823d9afec73f14b0de396927a0fa84a3d234b8_d.txt
azureml-logs/process_info.json
azureml-logs/process_status.json
logs/azureml/100_azureml.log
logs/azureml/job_prep_azureml.log
logs/azureml/job_release_azureml.log
outputs/diabetes_model.pkl


## Register a New Version of the Model

Now that you've trained a new model, you can register it as a new version in the workspace.

In [8]:
from azureml.core import Model

# Register the model
run.register_model(model_path='outputs/diabetes_model.pkl', model_name='diabetes_model',
                   tags={'Training context':'Parameterized SKLearn Estimator'},
                   properties={'AUC': run.get_metrics()['AUC'], 'Accuracy': run.get_metrics()['Accuracy']})

# List registered models
for model in Model.list(ws):
    print(model.name, 'version:', model.version)
    for tag_name in model.tags:
        tag = model.tags[tag_name]
        print ('\t',tag_name, ':', tag)
    for prop_name in model.properties:
        prop = model.properties[prop_name]
        print ('\t',prop_name, ':', prop)
    print('\n')

diabetes_model version: 2
	 Training context : Parameterized SKLearn Estimator
	 AUC : 0.8483904671874223
	 Accuracy : 0.7736666666666666


diabetes_model version: 1
	 Training context : Estimator
	 AUC : 0.8484929598487486
	 Accuracy : 0.774


amlstudio-predict-auto-price version: 1
	 CreatedByAMLStudio : true


AutoML91841e3d60 version: 1
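
Because each version carries its metrics as properties, you can compare versions after the fact. The sketch below works on plain dictionaries that mirror the listing above, not on real `Model` objects, and assumes property values come back as strings, so they are converted with `float` before comparison. The helper `best_version` is hypothetical.

```python
# Entries copied from the listing above; property values are kept as strings (an assumption)
registered = [
    {'name': 'diabetes_model', 'version': 2,
     'properties': {'AUC': '0.8483904671874223', 'Accuracy': '0.7736666666666666'}},
    {'name': 'diabetes_model', 'version': 1,
     'properties': {'AUC': '0.8484929598487486', 'Accuracy': '0.774'}},
    {'name': 'amlstudio-predict-auto-price', 'version': 1, 'properties': {}},
]

def best_version(models, name, metric):
    """Return the entry with the given name that has the highest value for the metric."""
    candidates = [m for m in models
                  if m['name'] == name and metric in m['properties']]
    return max(candidates, key=lambda m: float(m['properties'][metric]))

best = best_version(registered, 'diabetes_model', 'AUC')
print('Best AUC: version', best['version'], 'with AUC', best['properties']['AUC'])
```

Here the two versions differ only in the fourth decimal place of AUC, so version 1 is selected by a very small margin.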




## Clean Up

If you've finished exploring, you can close this notebook and shut down your Compute Instance.