# Lab 4 - Model Training with AutomatedML

In this lab you will use the automated machine learning (*AutomatedML*) capabilities within the Azure Machine Learning service.

Automated machine learning picks an algorithm and hyperparameters for you and generates a model ready for deployment. 



![AutomatedML](https://github.com/jakazmie/images-for-hands-on-labs/raw/master/automated-machine-learning.png)



We will continue with the same scenario as in Lab 1 and Lab 2.

## Connect to the workspace

In [1]:
# Verify AML SDK Installed
# view version history at https://pypi.org/project/azureml-sdk/#history 
import azureml.core
print("SDK Version:", azureml.core.VERSION)

SDK Version: 1.0.23


In [2]:
from azureml.core import Workspace

# Read the workspace config from file
ws = Workspace.from_config()
print(ws.name, ws.resource_group, ws.location, ws.subscription_id, sep='\n')

If you run your code in unattended mode, i.e., where you can't give a user input, then we recommend to use ServicePrincipalAuthentication or MsiAuthentication.
Please refer to aka.ms/aml-notebook-auth for different authentication mechanisms in azureml-sdk.


Found the config file in: /home/byteb/events/MachineLearningOps/.azureml/config.json
MLOpsFatosIsmali
DSIMLOpsHack
westeurope
051aa254-957d-4431-a6df-6caa8963bdd7


## Train a model using AutomatedML


To train a model using AutoML, you only need to provide a configuration that defines items such as the type of model (classification or regression), the performance metric to optimize, the exit criteria (maximum training time, number of iterations, and target performance), any algorithms that should not be used, and the path to which the results are written. This configuration is specified using the `AutoMLConfig` class, which then drives the submission of an experiment via `experiment.submit`. When AutoML finishes the parent run, you can get the best-performing run and model from the returned run object by calling `run.get_output()`.

### Create/Get Azure ML Compute cluster

We are reusing the cluster created in Lab 1. If you removed the cluster, the code snippet below re-creates it.

In [3]:
# Create (or connect to) an Azure ML Compute cluster
from azureml.core.compute import AmlCompute
from azureml.core.compute import ComputeTarget
import os

# choose a name for your cluster
cluster_name = "cpu-cluster"
cluster_min_nodes = 1
cluster_max_nodes = 3
vm_size = "STANDARD_DS11_V2"

# Check if the cluster exists. If yes connect to it
if cluster_name in ws.compute_targets:
    compute_target = ws.compute_targets[cluster_name]
    if compute_target and type(compute_target) is AmlCompute:
        print('Found existing compute target, using this compute target instead of creating:  ' + cluster_name)
    else:
        print("Error: A compute target with name ",cluster_name," was found, but it is not of type AmlCompute.")
else:
    print('Creating a new compute target...')
    provisioning_config = AmlCompute.provisioning_configuration(vm_size = vm_size, 
                                                                min_nodes = cluster_min_nodes, 
                                                                max_nodes = cluster_max_nodes)

    # create the cluster
    compute_target = ComputeTarget.create(ws, cluster_name, provisioning_config)
    
    # can poll for a minimum number of nodes and for a specific timeout. 
    # if no min node count is provided it will use the scale settings for the cluster
    compute_target.wait_for_completion(show_output=True, min_node_count=None, timeout_in_minutes=20)
    
    # For a more detailed view of the current Azure ML Compute cluster status, use the 'status' property
    print(compute_target.status.serialize())

Found existing compute target, using this compute target instead of creating:  cpu-cluster


### Create Get Data script

If you run your Automated ML experiments on remote compute, which is our scenario, the data fetch must be wrapped in a separate Python script that implements a `get_data()` function. This script runs on the remote compute where the Automated ML experiment is executed. Using `get_data()` eliminates the need to fetch the data over the wire for each iteration.

In [4]:
import os
project_folder = './project'
script_name = 'get_data.py'
os.makedirs(project_folder, exist_ok=True)

In [9]:
%%writefile $project_folder/get_data.py
import pandas as pd
import numpy as np
import os


def get_data():
 
    # Load the dataset
    data_folder = os.environ["AZUREML_DATAREFERENCE_workspaceblobstore"]
    file_name = os.path.join(data_folder, 'banking_train.csv')
    df = pd.read_csv(file_name)

    # Preprocess the data
    feature_columns = [
                   # Demographic
                   'age', 
                   'job', 
                   'education', 
                   'marital',  
                   'housing', 
                   'loan', 
                   # Previous campaigns
                   'month',
                   'campaign',
                   'poutcome',
                   # Economic indicators
                   'emp_var_rate',
                   'cons_price_idx',
                   'cons_conf_idx',
                   'euribor3m',
                   'nr_employed']

    df = df[feature_columns + ['y']]
    features = df.drop(['y'], axis=1)                                         
    
    # Flatten labels
    labels = np.ravel(df.y)    
    
    return { "X" : features, "y" : labels}


Writing ./project/get_data.py
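
Before submitting the experiment, it can help to sanity-check the shape of the dictionary that `get_data()` returns. The sketch below mirrors the feature/label split from the script on a tiny synthetic frame. The helper name `split_features_and_labels`, the `unused_column` column, and the sample values are made up for illustration and are not part of the lab's dataset.

```python
import numpy as np
import pandas as pd


def split_features_and_labels(df, feature_columns, label_column="y"):
    """Return a dictionary with the same {"X": ..., "y": ...} shape as get_data()."""
    subset = df[feature_columns + [label_column]]
    features = subset.drop([label_column], axis=1)
    labels = np.ravel(subset[label_column])  # flatten labels to a 1-D array
    return {"X": features, "y": labels}


# Tiny synthetic stand-in for banking_train.csv (values are illustrative only).
sample = pd.DataFrame({
    "age": [49, 52, 26],
    "job": ["services", "retired", "unemployed"],
    "campaign": [2, 3, 7],
    "unused_column": ["a", "b", "c"],
    "y": [0, 0, 1],
})

data = split_features_and_labels(sample, ["age", "job", "campaign"])
print(data["X"].shape, data["y"].shape)
```

The features come back as a DataFrame without the label column, and the labels as a flat array with one entry per row.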


### Configure datastore and data reference

The training files have been uploaded to the workspace's default datastore during the previous labs. We will configure AutomatedML to automatically download the files onto the nodes of the cluster.

In [5]:
from azureml.core import Datastore
from azureml.core.runconfig import DataReferenceConfiguration

ds = ws.get_default_datastore()
print("Using the default datastore for training data: ")
print(ds.name, ds.datastore_type, ds.account_name, ds.container_name)

dr = DataReferenceConfiguration(datastore_name=ds.name, 
                   path_on_datastore='datasets', 
                   path_on_compute='datasets',
                   mode='download', # download files from datastore to compute target
                   overwrite=True)


Using the default datastore for training data: 
workspaceblobstore AzureBlob mlopsfatosisma6452541516 azureml-blobstore-ba8ddba8-ba9b-45c0-a53f-08c3c660a28d


### Create Docker run configuration
We will run Automated ML jobs in a custom Docker image that includes the dependencies required by the `get_data()` script.

In [6]:
from azureml.core.runconfig import RunConfiguration
from azureml.core.conda_dependencies import CondaDependencies
from azureml.core import Run
from azureml.core import ScriptRunConfig

# create a new RunConfig object
run_config = RunConfiguration(framework="python")

# Running Automated ML jobs on an Azure ML Compute cluster requires Docker.
run_config.environment.docker.enabled = True

# Set compute target to Azure ML Compute cluster
run_config.target = compute_target

# Set data references
run_config.data_references = {ds.name: dr}


### Fix lightgbm (for Mac users only)
The following code fixes a lightgbm dependency issue. Run it only if you are on a Mac.

In [None]:
import sys
!conda install --yes --prefix {sys.prefix} -c conda-forge lightgbm


### Configure Automated ML run.

Automated ML runs can be controlled using a number of configuration parameters. 


For our run we will use the following configuration:
- Train a classification task
- Execute at most 25 iterations
- Use *normalized macro recall* as the primary performance metric
- Use 5-fold cross validation for model evaluation
- Run the iterations on 3 nodes of a cluster
- Use 1 core per iteration
- Automatically pre-process data
- Exit if the primary metric is higher than 0.99
- Limit the model selection to *SVM*, *LogisticRegression*, *LightGBM*, *TensorFlowDNN* and *RandomForest* models

We configured the last setting to demonstrate the *whitelisting* capabilities of *AutomatedML*. Unless you have a strong basis for excluding or choosing certain models, you are usually better off leaving the decision to *AutomatedML*, assuming that you have enough time and resources to run through many (more than 100) iterations.
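
To make the primary metric more concrete, here is a minimal sketch of normalized macro recall. It assumes the usual definition: macro-averaged recall rescaled so that random guessing scores 0 and a perfect classifier scores 1, with the chance level taken as 1 / number of classes. The function names are illustrative and are not part of the Azure ML SDK.

```python
def macro_recall(y_true, y_pred):
    """Average the per-class recall over all classes present in y_true."""
    recalls = []
    for cls in sorted(set(y_true)):
        total = sum(1 for t in y_true if t == cls)
        hits = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
        recalls.append(hits / total)
    return sum(recalls) / len(recalls)


def normalized_macro_recall(y_true, y_pred):
    """Rescale macro recall so that the chance level maps to 0 (assumed definition)."""
    chance = 1.0 / len(set(y_true))
    return (macro_recall(y_true, y_pred) - chance) / (1.0 - chance)


# Imbalanced toy labels: 8 negatives, 2 positives.
y_true = [0] * 8 + [1] * 2

# A model that always predicts the majority class is 80% accurate
# but earns no credit under normalized macro recall.
always_negative = [0] * 10
majority_score = normalized_macro_recall(y_true, always_negative)

# A model that finds 6 of 8 negatives and 1 of 2 positives.
mixed = [0, 0, 0, 0, 0, 0, 1, 1, 1, 0]
mixed_score = normalized_macro_recall(y_true, mixed)

print(majority_score, mixed_score)
```

On imbalanced data such as this propensity-to-buy scenario, the metric rewards models that recover the minority class rather than models that simply predict the majority class.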


We have configured our run to automatically pre-process data.

As a result, the following data preprocessing steps are performed automatically:
1.	Drop high cardinality or no variance features
    * Drop features with no useful information from training and validation sets. These include features with all values missing, same value across all rows or with extremely high cardinality (e.g., hashes, IDs or GUIDs).
1.	Missing value imputation
    *	For numerical features, impute missing values with average of values in the column.
    *	For categorical features, impute missing values with most frequent value.
1.	Generate additional features
    * For DateTime features: Year, Month, Day, Day of week, Day of year, Quarter, Week of the year, Hour, Minute, Second.
    * For Text features: Term frequency based on word unigram, bi-grams, and tri-gram, Count vectorizer.
1.	Transformations and encodings
    * Numeric features with very few unique values transformed into categorical features.
    * Depending on cardinality of categorical features, perform label encoding or (hashing) one-hot encoding.
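
The following sketch imitates a few of the steps listed above with plain pandas: dropping a no-variance column, mean and most-frequent imputation, and one-hot encoding. It is an illustration on a made-up frame, not the implementation that Automated ML uses internally.

```python
import pandas as pd

# Made-up sample with missing values and a constant column.
frame = pd.DataFrame({
    "age": [30.0, None, 50.0, 40.0],
    "job": ["services", "retired", "services", None],
    "constant": [1, 1, 1, 1],
})

# Drop features that carry no information (a single distinct value).
frame = frame.loc[:, frame.nunique(dropna=False) > 1].copy()

numeric_cols = frame.select_dtypes(include="number").columns
categorical_cols = frame.columns.difference(numeric_cols)

# Numerical features: impute with the column mean.
frame[numeric_cols] = frame[numeric_cols].fillna(frame[numeric_cols].mean())

# Categorical features: impute with the most frequent value.
for col in categorical_cols:
    frame[col] = frame[col].fillna(frame[col].mode().iloc[0])

# Low-cardinality categorical features: one-hot encode.
encoded = pd.get_dummies(frame, columns=list(categorical_cols))
print(encoded)
```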


In [7]:
from azureml.train.automl import AutoMLConfig
from azureml.train.automl.run import AutoMLRun
import logging


automl_config = AutoMLConfig(run_configuration = run_config,
                             task = 'classification',
                             debug_log = 'automl_errors.log',
                             primary_metric = 'norm_macro_recall',
                             iterations = 25,
                             n_cross_validations = 5,
                             max_concurrent_iterations = cluster_max_nodes,
                             max_cores_per_iteration = 1,
                             preprocess = True,
                             experiment_exit_score = 0.99,
                             #blacklist_models = ['KNN','MultinomialNaiveBayes', 'BernoulliNaiveBayes'],
                             whitelist_models = ['LogisticRegression', 'RandomForest', 'LightGBM', 'SVM', 'TensorFlowDNN'],
                             verbosity = logging.INFO,
                             path = project_folder,
                             data_script = os.path.join(project_folder, script_name))



### Run AutomatedML job.

In [10]:
from azureml.core import Experiment

experiment_name = "propensity_to_buy_automatedml"
exp = Experiment(ws, experiment_name)
tags = {"Desc": "automated ml"}
run = exp.submit(config=automl_config, tags=tags)
run

Experiment,Id,Type,Status,Details Page,Docs Page
propensity_to_buy_automatedml,AutoML_1e07dc57-2221-444f-a89b-3e205c5e7696,automl,Preparing,Link to Azure Portal,Link to Documentation


The call to `exp.submit` returns an `AutoMLRun` object that can be used to track the run.

Since the call is asynchronous, it reports a **Preparing** or **Running** state as soon as the job is started.

Here is what's happening while you wait:

- **Image creation**: A Docker image is created matching the Python environment specified by the RunConfiguration. The image is uploaded to the workspace. This happens only once for each Python environment since the container is cached for subsequent runs.  During image creation, logs are streamed to the run history. You can monitor the image creation progress using these logs.

- **Scaling**: If the remote cluster requires more nodes to execute the run than currently available, additional nodes are added automatically. 

- **Running**: In this stage, Automated ML takes over and starts running the training iterations.



You can check the progress of a running job in multiple ways: Azure Portal, AML Widgets or streaming logs.
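
As a lightweight alternative to the options above, the run status can also be polled from code. The sketch below is a generic polling helper. `poll_until_done` and the set of terminal state names are assumptions made for illustration, and a stand-in status source replaces the real run so that the snippet works without a workspace.

```python
import time

# Assumed names of terminal run states; adjust to what your SDK version reports.
TERMINAL_STATES = {"Completed", "Failed", "Canceled"}


def poll_until_done(get_status, interval_seconds=30, max_polls=120, sleep=time.sleep):
    """Call get_status() until it returns a terminal state or max_polls is reached."""
    status = None
    for _ in range(max_polls):
        status = get_status()
        print("Run status:", status)
        if status in TERMINAL_STATES:
            break
        sleep(interval_seconds)
    return status


# Stand-in status source that walks through typical states.
fake_statuses = iter(["Preparing", "Running", "Running", "Completed"])
final_status = poll_until_done(lambda: next(fake_statuses), interval_seconds=0)
```

With a real run, you would pass the run's own status method instead of the stand-in, for example `poll_until_done(run.get_status)`.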

### Monitor the run.

We will use AML Widget to monitor the run. The widget will first report a "loading" status while running the first iteration. After completing the first iteration, an auto-updating graph and table will be shown. The widget will refresh once per minute, so you should see the graph update as child runs complete.

The widget is asynchronous - it does not block the notebook. You can execute other cells while the widget is running.

**Note:** The widget displays a link at the bottom. Use this link to open a web interface to explore the individual run details.

In [11]:
from azureml.widgets import RunDetails
RunDetails(run).show()


### Cancelling Runs

You can cancel ongoing remote runs using the `cancel` and `cancel_iteration` functions.

In [None]:
# Cancel the ongoing experiment and stop scheduling new iterations.
# run.cancel()

# Cancel iteration 1 and move onto iteration 2.
# run.cancel_iteration(1)

### Analyze the run

You can use SDK methods to fetch all the child runs and inspect the individual metrics logged for each iteration.

In [12]:
import pandas as pd

children = list(run.get_children())
metricslist = {}
for child in children:
    properties = child.get_properties()
    metrics = {k: v for k, v in child.get_metrics().items() if isinstance(v, float)}
    metricslist[int(properties['iteration'])] = metrics

rundata = pd.DataFrame(metricslist).sort_index(axis=1)
rundata

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,15,16,17,18,19,20,21,22,23,24
AUC_macro,0.78,0.77,0.78,0.8,0.77,0.79,0.78,0.8,0.78,0.79,...,0.79,0.79,0.79,0.79,0.79,0.79,0.79,0.8,0.77,0.79
AUC_micro,0.78,0.77,0.78,0.8,0.77,0.79,0.78,0.8,0.78,0.79,...,0.79,0.79,0.79,0.79,0.79,0.79,0.79,0.8,0.77,0.79
AUC_weighted,0.78,0.77,0.78,0.8,0.77,0.79,0.78,0.8,0.78,0.79,...,0.79,0.79,0.79,0.79,0.79,0.79,0.79,0.8,0.77,0.79
accuracy,0.73,0.72,0.89,0.9,0.73,0.89,0.89,0.9,0.89,0.83,...,0.9,0.9,0.9,0.9,0.82,0.9,0.9,0.9,0.72,0.84
average_precision_score_macro,0.35,0.33,0.38,0.45,0.35,0.42,0.36,0.44,0.33,0.44,...,0.44,0.44,0.44,0.44,0.43,0.44,0.44,0.45,0.34,0.44
average_precision_score_micro,0.35,0.33,0.38,0.45,0.35,0.42,0.36,0.44,0.33,0.44,...,0.44,0.44,0.44,0.44,0.43,0.44,0.44,0.45,0.34,0.44
average_precision_score_weighted,0.35,0.33,0.38,0.45,0.35,0.42,0.36,0.44,0.33,0.44,...,0.44,0.44,0.44,0.44,0.43,0.44,0.44,0.45,0.34,0.44
balanced_accuracy,0.72,0.72,0.52,0.59,0.72,0.5,0.5,0.58,0.5,0.74,...,0.6,0.6,0.6,0.6,0.74,0.59,0.6,0.59,0.72,0.74
f1_score_macro,0.6,0.59,0.51,0.63,0.6,0.47,0.47,0.61,0.47,0.67,...,0.63,0.63,0.63,0.63,0.67,0.62,0.63,0.63,0.59,0.69
f1_score_micro,0.73,0.72,0.89,0.9,0.73,0.89,0.89,0.9,0.89,0.83,...,0.9,0.9,0.9,0.9,0.82,0.9,0.9,0.9,0.72,0.84
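
The per-iteration metrics can also be queried directly, for example to see which iteration wins under a different metric. The sketch below uses a dictionary shaped like `metricslist`, filled with a few accuracy and balanced accuracy values copied from the table above. The helper `best_iteration` is hypothetical and not an SDK function.

```python
def best_iteration(metrics_by_iteration, metric, higher_is_better=True):
    """Return the iteration number with the best value for the given metric."""
    candidates = {
        iteration: metrics[metric]
        for iteration, metrics in metrics_by_iteration.items()
        if metric in metrics
    }
    if not candidates:
        raise KeyError(metric)
    pick = max if higher_is_better else min
    return pick(candidates, key=candidates.get)


# Same shape as metricslist: {iteration: {metric_name: value}}.
sample_metrics = {
    0: {"accuracy": 0.73, "balanced_accuracy": 0.72},
    2: {"accuracy": 0.89, "balanced_accuracy": 0.52},
    3: {"accuracy": 0.90, "balanced_accuracy": 0.59},
    9: {"accuracy": 0.83, "balanced_accuracy": 0.74},
}

best_by_accuracy = best_iteration(sample_metrics, "accuracy")
best_by_balanced = best_iteration(sample_metrics, "balanced_accuracy")
print(best_by_accuracy, best_by_balanced)
```

In this sample the two metrics disagree: the most accurate iteration is not the one with the best balanced accuracy, which is why the choice of primary metric matters on imbalanced data.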


### Waiting until the run finishes

The `wait_for_completion` method blocks until the run finishes.

In [None]:
# Wait until the run finishes.
# run.wait_for_completion(show_output = True)

## Explore the results

### Retrieve the best model

Below we select the best pipeline from our iterations. The get_output method returns the best run and the fitted model. The Model includes the pipeline and any pre-processing. Overloads on get_output allow you to retrieve the best run and fitted model for any logged metric or for a particular iteration.

In [13]:
best_run, fitted_model = run.get_output()
print(best_run)
print(fitted_model)

Run(Experiment: propensity_to_buy_automatedml,
Id: AutoML_1e07dc57-2221-444f-a89b-3e205c5e7696_24,
Type: azureml.scriptrun,
Status: Completed)
Pipeline(memory=None,
     steps=[('datatransformer', DataTransformer(is_onnx_compatible=None, logger=None, task=None)), ('prefittedsoftvotingclassifier', PreFittedSoftVotingClassifier(classification_labels=None,
               estimators=[('13', Pipeline(memory=None,
     steps=[('maxabsscaler', MaxAbsScaler(copy=True)), ('l...666666666667, 0.06666666666666667, 0.13333333333333333, 0.06666666666666667, 0.13333333333333333]))])


#### Best model on any other metric

Show the run and the model with the best `AUC_weighted` value:

In [14]:
lookup_metric = "AUC_weighted"
specific_run, specific_model = run.get_output(metric = lookup_metric)
print(specific_run)
print(specific_model)

Run(Experiment: propensity_to_buy_automatedml,
Id: AutoML_1e07dc57-2221-444f-a89b-3e205c5e7696_7,
Type: azureml.scriptrun,
Status: Completed)
Pipeline(memory=None,
     steps=[('datatransformer', DataTransformer(is_onnx_compatible=None, logger=None, task=None)), ('StandardScalerWrapper', <automl.client.core.common.model_wrappers.StandardScalerWrapper object at 0x7fade01b1518>), ('LightGBMClassifier', <automl.client.core.common.model_wrappers.LightGBMClassifier object at 0x7fade01a56a0>)])


#### Model from a Specific Iteration

In [15]:
iteration = 3
third_run, third_model = run.get_output(iteration=iteration)
print(third_run)
print(third_model)

Run(Experiment: propensity_to_buy_automatedml,
Id: AutoML_1e07dc57-2221-444f-a89b-3e205c5e7696_3,
Type: azureml.scriptrun,
Status: Completed)
Pipeline(memory=None,
     steps=[('datatransformer', DataTransformer(is_onnx_compatible=None, logger=None, task=None)), ('MaxAbsScaler', MaxAbsScaler(copy=True)), ('LightGBMClassifier', <automl.client.core.common.model_wrappers.LightGBMClassifier object at 0x7fade01f34a8>)])


### Test the model
Load the test data



In [16]:
import pandas as pd
import os

folder = '../datasets'
filename = 'banking_test.csv'
pathname = os.path.join(folder, filename)
df = pd.read_csv(pathname, delimiter=r'\s*,\s*', header=0, encoding='ascii', engine='python')


feature_columns = [
                   # Demographic
                   'age',
                   'job', 
                   'education', 
                   'marital',  
                   'housing', 
                   'loan', 
                   # Previous campaigns
                   'month',
                   'campaign',
                   'poutcome',
                   # Economic indicators
                   'emp_var_rate',
                   'cons_price_idx',
                   'cons_conf_idx',
                   'euribor3m',
                   'nr_employed']

df_test = df[feature_columns + ['y']]
df_test.head()


Unnamed: 0,age,job,education,marital,housing,loan,month,campaign,poutcome,emp_var_rate,cons_price_idx,cons_conf_idx,euribor3m,nr_employed,y
0,49,services,basic.4y,married,no,no,nov,2,failure,-0.1,93.2,-42.0,4.08,5195.8,0
1,52,retired,basic.9y,married,no,no,jul,3,nonexistent,1.4,93.92,-42.7,4.96,5228.1,0
2,72,retired,university.degree,divorced,no,no,aug,1,nonexistent,-2.9,92.2,-31.4,0.88,5076.2,0
3,26,unemployed,high.school,married,yes,no,jul,7,nonexistent,-1.7,94.22,-40.3,0.82,4991.6,1
4,38,management,university.degree,married,no,yes,nov,1,nonexistent,-0.1,93.2,-42.0,4.02,5195.8,0


Test the best model

In [17]:
from sklearn.metrics import accuracy_score, recall_score

#feature_columns = feature_columns + ['y']
print(df_test.columns)

y_pred = fitted_model.predict(df_test)


print("Accuracy: ", accuracy_score(df_test.y, y_pred))
print("Recall: ", recall_score(df_test.y, y_pred))

Index(['age', 'job', 'education', 'marital', 'housing', 'loan', 'month',
       'campaign', 'poutcome', 'emp_var_rate', 'cons_price_idx',
       'cons_conf_idx', 'euribor3m', 'nr_employed', 'y'],
      dtype='object')
Accuracy:  0.8401310997815004
Recall:  0.5980603448275862


## Register the best performing model for later use and deployment

The best model can now be registered into *Model Registry*. 

If neither `metric` nor `iteration` is specified in the `register_model` call, the iteration with the best primary metric is registered.

You can annotate the model with arbitrary tags.


In [18]:
# notice the use of the root run (not best_run) to register the best model
tags = {"Department": "Marketing"}
model = run.register_model(description='AutoML trained propensity to buy classifier',
                          tags=tags)


Registering model AutoML1e07dc572best
