Copyright (c) Microsoft Corporation. All rights reserved.

Licensed under the MIT License.


# Training Pipeline - Automated ML
_**Training many models using Automated Machine Learning**_

---

This notebook demonstrates how to train and register 11,973 models using Automated Machine Learning. We use the AutoMLPipelineBuilder to parallelize the training of these models. The notebook uses an orange juice sales dataset to predict the orange juice quantity for each brand and each store. For more information about the data, refer to the Data Preparation notebook.

<span style="color:red"><b>NOTE: There are limits on how many runs can execute in parallel per workspace. We currently recommend setting the parallelism to a maximum of 320 runs per experiment per workspace. Raising the parallelism beyond this limit may cause Too Many Requests errors (HTTP 429).</b></span>
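
If requests are throttled anyway, a common client-side mitigation is to retry with exponential backoff. The sketch below is generic Python and not part of the Azure ML SDK: `submit_with_backoff`, `TooManyRequests`, and the delay values are hypothetical names and numbers chosen for illustration, and the callable you pass in would need to raise (or be adapted to raise) an exception when it receives an HTTP 429 response.

```python
import random
import time


class TooManyRequests(Exception):
    """Hypothetical exception standing in for an HTTP 429 response."""


def submit_with_backoff(submit, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call submit(); on TooManyRequests, wait base_delay * 2**attempt plus jitter, then retry."""
    for attempt in range(max_attempts):
        try:
            return submit()
        except TooManyRequests:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            # Exponential delay with random jitter so parallel clients do not retry in lockstep.
            delay = base_delay * (2 ** attempt) + random.uniform(0, base_delay)
            sleep(delay)
```

The `sleep` parameter exists only so the waiting can be replaced in tests; in normal use the default `time.sleep` applies.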

<span style="color:red"><b>Please install the latest version of the SDK so that the AutoML dependencies are consistent.</b></span>

In [1]:
# !pip install --upgrade azureml-sdk

In [2]:
# !pip install --upgrade azureml-train-automl

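Install the azure-ai-ml package (Azure ML Python SDK v2), which the workspace and pipeline cells below use. The azureml-contrib-automl-pipeline-steps package, which is needed for the many models training step, has its own (commented-out) install cell later in the notebook.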

In [3]:
!pip install azure-ai-ml



### Prerequisites

At this point, you should have already:

1. Created your AML Workspace using the [00_Setup_AML_Workspace notebook](../../00_Setup_AML_Workspace.ipynb)
2. Run [01_Data_Preparation.ipynb](../../01_Data_Preparation.ipynb) to create the dataset

## 1.0 Set up workspace and dataset

In [4]:
from azure.ai.ml import MLClient, Input, command, Output
from azure.ai.ml.dsl import pipeline
from azure.ai.ml.automl import forecasting
from azure.ai.ml.entities._job.automl.tabular.forecasting_settings import ForecastingSettings

# add environment variable to enable private preview feature
import os
os.environ["AZURE_ML_CLI_PRIVATE_FEATURES_ENABLED"] = "true"

In [5]:
import os
import datetime

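# Service principal credentials are read from the environment by DefaultAzureCredential.
# Do not hard-code secrets in the notebook: set AZURE_TENANT_ID, AZURE_CLIENT_ID and
# AZURE_CLIENT_SECRET in the shell (or load them from a secret store) before starting the kernel.
for var in ("AZURE_TENANT_ID", "AZURE_CLIENT_ID", "AZURE_CLIENT_SECRET"):
    if var not in os.environ:
        print(f"{var} is not set in the environment.")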

subscription_id=os.getenv("SUBSCRIPTION_ID", default="80a3336a-33ac-4098-a7e7-64eb71d80cee")
resource_group=os.getenv("RESOURCE_GROUP", default="tgrgml")
workspace_name=os.getenv("WORKSPACE_NAME", default="mlw-basic-prod-202209110348")
workspace_region=os.getenv("WORKSPACE_REGION", default="australiaeast")

In [6]:
#Set up MLClient

from azure.ai.ml import MLClient
from azure.identity import DefaultAzureCredential, InteractiveBrowserCredential

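try:
    credential = DefaultAzureCredential()
    # Constructing the credential does not validate it; request a token to check that it works
    credential.get_token("https://management.azure.com/.default")
except Exception as e:
    # Fall back to an interactive browser login if no default credential is usable
    credential = InteractiveBrowserCredential()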

ml_client = MLClient(credential, subscription_id, resource_group, workspace_name)

In [7]:
#Check workspace values

import pandas as pd

workspace = ml_client.workspaces.get(name=ml_client.workspace_name)

output = {}
output["Workspace"] = ml_client.workspace_name
output["Subscription ID"] = ml_client.subscription_id
output["Resource Group"] = workspace.resource_group
output["Location"] = workspace.location
pd.set_option("display.max_colwidth", None)
outputDf = pd.DataFrame(data=output, index=[""])
outputDf.T

| | |
|-|-|
|Workspace|mlw-basic-prod-202209110348|
|Subscription ID|80a3336a-33ac-4098-a7e7-64eb71d80cee|
|Resource Group|tgrgml|
|Location|australiaeast|


## 2.0 Call the registered filedataset

We use 11,973 datasets and AutoMLPipelineBuilder to build 11,973 time-series models that predict the quantity for each store and brand.

Each dataset contains two years of orange juice sales data for one brand, with 7 columns and 122 rows.

You will need to register the datasets in the workspace first. The Data Preparation notebook demonstrates how to register two datasets to the workspace.

The registered 'oj_data_small' file dataset contains the first 10 csv files and 'oj_data' contains all 11,973 csv files. You can pass either filedst_10_models_input or filedst_all_models_inputs to the AutoMLPipelineBuilder.

We recommend **starting with filedst_10_models** to make sure everything runs successfully, then scaling up to filedst_all_models.

In [8]:
#Get datasets
train_ds = ml_client.data.get("train", "1")
inference_ds = ml_client.data.get("inference", "1")

In [9]:
# from azureml.core.dataset import Dataset

# filedst_10_models = Dataset.get_by_name(ws, name='oj_data_small_train')
# filedst_10_models_input = filedst_10_models.as_named_input('train_10_models')

#filedst_all_models = Dataset.get_by_name(ws, name='oj_data_train')
#filedst_all_models_inputs = filedst_all_models.as_named_input('train_all_models')

## 3.0 Build the training pipeline
Now that the dataset, workspace, and datastore are set up, we can put together a pipeline for training.

### Choose a compute target

Currently AutoMLPipelineBuilder only supports AMLCompute. You can change to a different compute cluster if one fails.

This is the compute target we will pass into our AutoMLPipelineBuilder.

In [10]:
cpu_cluster_name = "tg-cpu-cluster"
compute = ml_client.compute.get(cpu_cluster_name)
print(compute)

AmlCompute({'type': 'amlcompute', 'created_on': None, 'provisioning_state': 'Succeeded', 'provisioning_errors': None, 'name': 'tg-cpu-cluster', 'description': None, 'tags': {}, 'properties': {}, 'id': '/subscriptions/80a3336a-33ac-4098-a7e7-64eb71d80cee/resourceGroups/tgrgml/providers/Microsoft.MachineLearningServices/workspaces/mlw-basic-prod-202209110348/computes/tg-cpu-cluster', 'base_path': './', 'creation_context': None, 'serialize': <msrest.serialization.Serializer object at 0x7faeb8626380>, 'resource_id': None, 'location': 'australiaeast', 'size': 'STANDARD_DS3_V2', 'min_instances': 0, 'max_instances': 2, 'idle_time_before_scale_down': 120.0, 'identity': None, 'ssh_public_access_enabled': True, 'ssh_settings': None, 'network_settings': <azure.ai.ml.entities._compute.compute.NetworkSettings object at 0x7faec8eea740>, 'tier': 'dedicated'})


In [11]:
# Already created in Notebook 00 
# 
#  from azure.ai.ml.entities import ComputeInstance, AmlCompute

# # Choose a name for your cluster.
# amlcompute_cluster_name = "aml-cpucluster"

# found = False
# # Check if this compute target already exists in the workspace.
# cts = ws.compute_targets
# if amlcompute_cluster_name in cts and cts[amlcompute_cluster_name].type == 'AmlCompute':
#     found = True
#     print('Found existing compute target.')
#     compute = cts[amlcompute_cluster_name]
    
# if not found:
#     print('Creating a new compute target...')
#     provisioning_config = AmlCompute.provisioning_configuration(vm_size='STANDARD_D16S_V3',
#                                                            min_nodes=2,
#                                                            max_nodes=20)
#     # Create the cluster.
#     compute = ComputeTarget.create(ws, amlcompute_cluster_name, provisioning_config)
    
# print('Checking cluster status...')
# # Can poll for a minimum number of nodes and for a specific timeout.
# # If no min_node_count is provided, it will use the scale settings for the cluster.
# compute.wait_for_completion(show_output = True, min_node_count = None, timeout_in_minutes = 20)
    
# For a more detailed view of current AmlCompute status, use get_status().

## Train

This dictionary defines the [AutoML settings](https://docs.microsoft.com/en-us/python/api/azureml-train-automl-client/azureml.train.automl.automlconfig.automlconfig?view=azure-ml-py#parameters). For this forecasting task, we add the name of the time column and the maximum forecast horizon.

|Property|Description|
|-|-|
|**task**|forecasting|
|**primary_metric**|This is the metric that you want to optimize.<br> Forecasting supports the following primary metrics <br><i>spearman_correlation</i><br><i>normalized_root_mean_squared_error</i><br><i>r2_score</i><br><i>normalized_mean_absolute_error</i>|
|**blocked_models**|Models in blocked_models won't be used by AutoML. All supported models can be found at [here](https://docs.microsoft.com/en-us/python/api/azureml-train-automl-client/azureml.train.automl.constants.supportedmodels.forecasting?view=azure-ml-py).|
|**iterations**|Number of models to train. This is optional but gives you greater control.|
|**iteration_timeout_minutes**|Maximum amount of time in minutes that a model can train. This is optional and depends on the dataset; experiment to find approximate training times for your data. For the OJ dataset we set it to 20 minutes.|
|**experiment_timeout_hours**|Maximum amount of time in hours that the experiment can take before it terminates.|
|**label_column_name**|The name of the label column.|
|**n_cross_validations**|Number of cross validation splits. Rolling Origin Validation is used to split time-series in a temporally consistent way.|
|**enable_early_stopping**|Flag to enable early termination if the score is not improving in the short term.|
|**time_column_name**|The name of your time column.|
|**max_horizon**|The number of periods out you would like to predict past your training data. Periods are inferred from your data.|
|**grain_column_names**|The column names used to uniquely identify timeseries in data that has multiple rows with the same timestamp.|
|**partition_column_names**|The names of columns used to group your models. For timeseries, the groups must not split up individual time-series. That is, each group must contain one or more whole time-series.|
|**track_child_runs**|Flag to disable tracking of child runs. Only best run (metrics and model) is tracked if the flag is set to False.|
|**pipeline_fetch_max_batch_size**|Determines how many pipelines (training algorithms) to fetch at a time for training. This helps reduce throttling when training at large scale.|
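
To make `partition_column_names` concrete: the settings above group models by the partition columns, so each group has to hold whole time series. The following standard-library sketch groups rows by `Store` and `Brand`. The rows are made-up toy values (not taken from the OJ dataset), and `partition_rows` is a hypothetical helper, not an SDK function.

```python
from collections import defaultdict


def partition_rows(rows, partition_column_names):
    """Group row dicts by the tuple of their partition-column values."""
    partitions = defaultdict(list)
    for row in rows:
        key = tuple(row[name] for name in partition_column_names)
        partitions[key].append(row)
    return dict(partitions)


# Toy rows for illustration only.
rows = [
    {"WeekStarting": "1990-06-14", "Store": 1, "Brand": "a", "Quantity": 10},
    {"WeekStarting": "1990-06-21", "Store": 1, "Brand": "a", "Quantity": 12},
    {"WeekStarting": "1990-06-14", "Store": 1, "Brand": "b", "Quantity": 7},
    {"WeekStarting": "1990-06-14", "Store": 2, "Brand": "a", "Quantity": 9},
]

partitions = partition_rows(rows, ["Store", "Brand"])
for key, group in sorted(partitions.items()):
    print(key, len(group), "rows")
```

Because every row of a given store and brand lands in the same group, no individual time series is split across groups.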

In [12]:
%pip install --pre mltable

ERROR: Ignored the following versions that require a different python version: 0.1.0b1 Requires-Python >=3.6,<3.9; 0.1.0b2 Requires-Python >=3.6,<3.9; 0.1.0b3 Requires-Python >=3.6,<3.9
ERROR: Could not find a version that satisfies the requirement mltable (from versions: none)
ERROR: No matching distribution found for mltable
Note: you may need to restart the kernel to use updated packages.


In [18]:
# load mltable from a local folder (mltable.load expects a folder that contains an MLTable file)
from mltable import load
tbl = load('./data/upload_train_data/')

# load mltable from an azureml datastore uri
tbl = load('azureml://subscriptions/80a3336a-33ac-4098-a7e7-64eb71d80cee/resourcegroups/tgrgml/workspaces/mlw-basic-prod-202209110348/datastores/<datastore-name>/paths/<mltable-path-on-datastore>/')

In [None]:
from azure.ai.ml.dsl import pipeline
# Define pipeline
@pipeline(
    description="AutoML Forecasting Pipeline",
    )
def automl_forecasting(
    forecasting_train_data,
):
    # define forecasting settings
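    # The data is weekly (time column "WeekStarting"), so the hourly frequency "H" is dropped and the frequency is left to be inferred
    forecasting_settings = ForecastingSettings(time_column_name="WeekStarting", forecast_horizon=12, target_lags=[12], target_rolling_window_size=4)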
    
    # define the automl forecasting task with automl function
    forecasting_node = forecasting(
        training_data=forecasting_train_data,
        target_column_name="Quantity",
        primary_metric="normalized_root_mean_squared_error",
        n_cross_validations=2,
        forecasting_settings=forecasting_settings,
        # currently need to specify the "mlflow_model" output explicitly to reference it in following nodes
        outputs={"best_model": Output(type="mlflow_model")},
    )
    
    forecasting_node.set_limits(timeout_minutes=180)

    # define command function for registering the model
    command_func = command(
        inputs=dict(
            model_input_path = Input(type="mlflow_model"),
            model_base_name= "forecasting_example_model"
        ),
        code="./register.py",
        command="python register.py " +
                "--model_input_path ${{inputs.model_input_path}} " +
                "--model_base_name ${{inputs.model_base_name}}" ,

        environment="AzureML-sklearn-0.24-ubuntu18.04-py37-cpu:1"
    )
    register_model = command_func(model_input_path = forecasting_node.outputs.best_model)

data_folder = "./data"
pipeline = automl_forecasting(
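    # AutoML training data must be an MLTable folder (one that contains an MLTable file), not a uri_file
    forecasting_train_data=Input(path=f"{data_folder}/upload_train_data/", type="mltable"),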
)

# set pipeline level compute
pipeline.settings.default_compute="tg-cpu-cluster"

Class ForecastingJob: This is an experimental class, and may change at any time. Please see https://aka.ms/azuremlexperimental for more information.


In [None]:
pipeline_job = ml_client.jobs.create_or_update(
    pipeline, experiment_name="many-models-forecasting-experiment"
)
pipeline_job
ml_client.jobs.stream(pipeline_job.name)

RunId: gifted_map_50vl03487x
Web View: https://ml.azure.com/runs/gifted_map_50vl03487x?wsid=/subscriptions/80a3336a-33ac-4098-a7e7-64eb71d80cee/resourcegroups/tgrgml/workspaces/mlw-basic-prod-202209110348

Streaming logs/azureml/executionlogs.txt

[2022-09-17 11:36:21Z] Submitting 1 runs, first five are: a29bdadb:b6f300bd-314e-4219-a256-d5212220a88c
[2022-09-17 11:39:37Z] Execution of experiment failed, update experiment status and cancel running nodes.

Execution Summary
RunId: gifted_map_50vl03487x
Web View: https://ml.azure.com/runs/gifted_map_50vl03487x?wsid=/subscriptions/80a3336a-33ac-4098-a7e7-64eb71d80cee/resourcegroups/tgrgml/workspaces/mlw-basic-prod-202209110348


JobException: Exception : 
 {
    "error": {
        "code": "UserError",
        "message": "Pipeline has some failed steps. See child run or execution logs for more details.",
        "message_format": "Pipeline has some failed steps. {0}",
        "message_parameters": {},
        "reference_code": "PipelineHasStepJobFailed",
        "details": []
    },
    "environment": "australiaeast",
    "location": "australiaeast",
    "time": "2022-09-17T11:39:37.812836Z",
    "component_name": ""
} 
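
The error above only reports that a step failed; the details are in the child jobs. A minimal sketch for locating them follows. It assumes that `ml_client.jobs.list` accepts a `parent_job_name` argument and that each returned job exposes `name` and `status`, which you should verify against your installed `azure-ai-ml` version. `failed_child_jobs` is a hypothetical helper, not part of the SDK.

```python
def failed_child_jobs(ml_client, parent_job_name):
    """Return the names of the child jobs of a pipeline job whose status is 'Failed'."""
    children = ml_client.jobs.list(parent_job_name=parent_job_name)
    return [job.name for job in children if job.status == "Failed"]


# Usage in this notebook (requires a live workspace connection):
# for name in failed_child_jobs(ml_client, pipeline_job.name):
#     print(name)
#     ml_client.jobs.download(name, all=True)  # fetch logs and outputs for inspection
```

The helper takes the client as a parameter so that it can be exercised with a stub object when no workspace is available.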

In [None]:
# from azure.ai.ml import automl

# # Create the AutoML forecasting job with the related factory-function.

# forecasting_job = automl.forecasting(
#     compute=compute_name,
#     name="mm-forecasting-job",
#     experiment_name=exp_name,
#     training_data=train_ds,
#     # validation_data = my_validation_data_input,
#     target_column_name="Quantity",
#     primary_metric="NormalizedRootMeanSquaredError",
#     n_cross_validations=3,
#     enable_model_explainability=True
# )

# # Limits are all optional
# forecasting_job.set_limits(
#     timeout_minutes=600,
#     trial_timeout_minutes=20,
#     max_trials=max_trials,
#     # max_concurrent_trials = 4,
#     # max_cores_per_trial: -1,
#     enable_early_termination=True,
# )

# # Specialized properties for Time Series Forecasting training
# forecasting_job.set_forecast_settings(
#     time_column_name="WeekStarting",
#     forecast_horizon=24,
#     frequency="H",
#     target_lags=[12],
#     target_rolling_window_size=4,
#     # ADDITIONAL FORECASTING TRAINING PARAMS ---
#     # time_series_id_column_names=["tid1", "tid2", "tid2"],
#     # short_series_handling_config=ShortSeriesHandlingConfiguration.DROP,
#     # use_stl="season",
#     # seasonality=3,
# )

# # Training properties are optional
# forecasting_job.set_training(blocked_training_algorithms=["ExtremeRandomTrees"])

Class ForecastingJob: This is an experimental class, and may change at any time. Please see https://aka.ms/azuremlexperimental for more information.


In [None]:
# import logging

# partition_column_names = ['Store', 'Brand']

# automl_settings = {
#     "task" : 'forecasting',
#     "primary_metric" : 'normalized_root_mean_squared_error',
#     "iteration_timeout_minutes" : 10, # Change this based on the dataset. Explore how long training takes before setting this value
#     "iterations" : 15,
#     "experiment_timeout_hours" : 1,
#     "label_column_name" : 'Quantity',
#     "n_cross_validations" : 3,
#     # "verbosity" : logging.INFO, 
#     # "debug_log": 'automl_oj_sales_debug.txt',
#     "time_column_name": 'WeekStarting',
#     "max_horizon" : 20,
#     "track_child_runs": False,
#     "partition_column_names": partition_column_names,
#     "grain_column_names": ['Store', 'Brand'],
#     "pipeline_fetch_max_batch_size": 15
# }

### Build many model training steps

AutoMLPipelineBuilder is used to build the many models train step. You will need to determine the number of workers and nodes appropriate for your use case. The process_count_per_node is based on the number of cores of the compute VM. The node_count determines the number of master nodes to use; increasing the node count speeds up the training process.

* <b>experiment</b>: Current experiment.

* <b>automl_settings</b>: AutoML settings dictionary.

* <b>train_data</b>: Train dataset.

* <b>compute_target</b>: Compute target for training.

* <b>partition_column_names</b>: Partition column names.

* <b>node_count</b>: The number of compute nodes to be used for running the user script. We recommend starting with 3 and increasing the node_count if training takes too long.

* <b>process_count_per_node</b>: The number of processes per node.

* <b>run_invocation_timeout</b>: The run() method invocation timeout in seconds. Set the timeout to the maximum training time of one AutoML run (with some buffer); the default is 60 seconds.

* <b>output_datastore</b>: Output datastore to output the training results.

* <b>train_env (Optional)</b>: An optional environment definition to use for training.

<span style="color:red"><b>NOTE: There are limits on how many runs can execute in parallel per workspace. We currently recommend setting the parallelism to a maximum of 320 runs per experiment per workspace. Raising the parallelism beyond this limit may cause Too Many Requests errors (HTTP 429).</b></span>
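
As a rough sizing aid, the parallelism ceiling and the timeout guidance above can be checked before submitting. This sketch assumes that the number of concurrent AutoML runs is approximately `node_count * process_count_per_node` and that a 100-second buffer is adequate. Both assumptions, and the helper name `check_parallel_settings`, are illustrative and not SDK behavior.

```python
MAX_PARALLEL_RUNS = 320  # recommended ceiling from the note above


def check_parallel_settings(node_count, process_count_per_node,
                            experiment_timeout_hours, buffer_seconds=100):
    """Return (parallel_runs, run_invocation_timeout), or raise if over the ceiling."""
    parallel_runs = node_count * process_count_per_node
    if parallel_runs > MAX_PARALLEL_RUNS:
        raise ValueError(
            f"{parallel_runs} parallel runs exceeds the recommended {MAX_PARALLEL_RUNS}"
        )
    # One AutoML run may take up to experiment_timeout_hours, so allow that plus a buffer.
    run_invocation_timeout = int(experiment_timeout_hours * 3600) + buffer_seconds
    return parallel_runs, run_invocation_timeout


print(check_parallel_settings(node_count=2, process_count_per_node=8,
                              experiment_timeout_hours=1))  # → (16, 3700)
```

With two nodes, eight processes per node, and a one-hour experiment timeout, this gives 16 parallel runs and a 3700-second invocation timeout.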


In [None]:
#!pip install azureml.contrib.automl.pipeline.steps

In [None]:
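# This cell uses the v1 SDK. `experiment`, `automl_settings`, `partition_column_names`,
# `filedst_10_models_input`, and `dstore` must be defined before running it; the cells
# above that define some of them are commented out.
from azureml.contrib.automl.pipeline.steps import AutoMLPipelineBuilder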

train_steps = AutoMLPipelineBuilder.get_many_models_train_steps(experiment=experiment,
                                                                automl_settings=automl_settings,
                                                                train_data=filedst_10_models_input,
                                                                compute_target=compute,
                                                                partition_column_names=partition_column_names,
                                                                node_count=2,
                                                                process_count_per_node=8,
                                                                run_invocation_timeout=3700,
                                                                output_datastore=dstore)

## 4.0 Run the training pipeline

### Submit the pipeline to run

Next we submit our pipeline to run. The whole training pipeline takes about 1h 11m using a STANDARD_D16S_V3 VM with our current AutoMLPipelineBuilder setting.

In [None]:
from azureml.pipeline.core import Pipeline
#from azureml.widgets import RunDetails

pipeline = Pipeline(workspace=ws, steps=train_steps)
run = experiment.submit(pipeline)
#RunDetails(run).show()

You can run the following command if you'd like to monitor the training process in the Jupyter notebook. It streams logs live while training.

**Note**: This command may not work for Notebook VM, however it should work on your local laptop.

In [None]:
run.wait_for_completion(show_output=True)

Successfully trained and registered Automated ML models.

## 5.0 Review outputs of the training pipeline

The training pipeline will train and register models to the Workspace. You can review trained models in the Azure Machine Learning Studio under 'Models'.
If there are any issues with training, go to the 'many-models-training' run under the pipeline run and explore the logs under 'Logs'.
You can look at the stdout and stderr output under logs/user/worker/<ip> for more details.


## 6.0 Get list of AutoML runs along with registered model names and tags

The following code snippet will iterate through all the automl runs for the experiment and list the details.

**Framework** - AutoML, **Dataset** - input data set, **Run** - AutoML run id, **Status** - AutoML run status,  **Model** - Registered model name, **Tags** - Tags for model, **StartTime** - Start time, **EndTime** - End time, **ErrorType** - ErrorType, **ErrorCode** - ErrorCode, **ErrorMessage** - Error Message

In [None]:
from scripts.helper import get_training_output
import os

training_results_name = "training_results"
training_output_name = "many_models_training_output"

training_file = get_training_output(run, training_results_name, training_output_name)
all_columns = ["Framework", "Dataset", "Run", "Status", "Model", "Tags", "StartTime", "EndTime" , "ErrorType", "ErrorCode", "ErrorMessage" ]
df = pd.read_csv(training_file, delimiter=" ", header=None, names=all_columns)
training_csv_file = "training.csv"
df.to_csv(training_csv_file)
print("Training output has", df.shape[0], "rows. Please open", os.path.abspath(training_csv_file), "to browse through all the output.")
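
To see at a glance how many models trained successfully, the `Status` column can be tallied. The sketch below uses only the standard library and assumes the same layout as the cell above (space-delimited, no header row, columns in the order of `all_columns`). `count_statuses` is a hypothetical helper and the sample lines are made up.

```python
import csv
import io
from collections import Counter

ALL_COLUMNS = ["Framework", "Dataset", "Run", "Status", "Model", "Tags",
               "StartTime", "EndTime", "ErrorType", "ErrorCode", "ErrorMessage"]


def count_statuses(lines):
    """Count rows per Status in space-delimited training output."""
    reader = csv.DictReader(lines, fieldnames=ALL_COLUMNS, delimiter=" ")
    return Counter(row["Status"] for row in reader)


# Made-up sample in the assumed layout; real files come from get_training_output.
sample = io.StringIO(
    'AutoML d1 run1 Completed m1 "{}" t0 t1 None None None\n'
    'AutoML d2 run2 Failed None "{}" t0 t1 UserError E1 "bad data"\n'
)
print(count_statuses(sample))
```

With the DataFrame from the cell above, `df["Status"].value_counts()` gives the same tally.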

## 7.0 Publish and schedule the pipeline (Optional)

### 7.1 Publish the pipeline

Once you have a pipeline you're happy with, you can publish a pipeline so you can call it programmatically later on. See this [tutorial](https://docs.microsoft.com/en-us/azure/machine-learning/how-to-create-your-first-pipeline#publish-a-pipeline) for additional information on publishing and calling pipelines.

In [None]:
# published_pipeline = pipeline.publish(name = 'automl_train_many_models',
#                                      description = 'train many models',
#                                      version = '1',
#                                      continue_on_step_failure = False)

### 7.2 Schedule the pipeline
You can also [schedule the pipeline](https://docs.microsoft.com/en-us/azure/machine-learning/how-to-schedule-pipelines) to run on a time-based or change-based schedule. This could be used to automatically retrain models every month or based on another trigger such as data drift.

In [None]:
# from azureml.pipeline.core import Schedule, ScheduleRecurrence
    
# training_pipeline_id = published_pipeline.id

# recurrence = ScheduleRecurrence(frequency="Month", interval=1, start_time="2020-01-01T09:00:00")
# recurring_schedule = Schedule.create(ws, name="automl_training_recurring_schedule", 
#                             description="Schedule Training Pipeline to run on the first day of every month",
#                             pipeline_id=training_pipeline_id, 
#                             experiment_name=experiment.name, 
#                             recurrence=recurrence)

## 8.0 Bookkeeping of workspace (Optional)

### 8.1 Cancel any runs that are running

Use the following cell to cancel any runs that are still running in a given experiment.

In [None]:
# from scripts.helper import cancel_runs_in_experiment
# failed_experiment =  'Please modify this and enter the experiment name'
# # Please note that the following script cancels all the currently running runs in the experiment
# cancel_runs_in_experiment(ws, failed_experiment)