# Exercise09 : ML Pipeline

With an AML pipeline, you can create ML workflows for purposes such as the following.

- You can build a retraining pipeline for MLOps integration.
- You can build a batch-scoring pipeline instead of the real-time scoring in "[Exercise08 : Publish as a Web Service](./exercise08_publish_model.ipynb)".

> Note : See [here](https://docs.microsoft.com/en-us/azure/architecture/reference-architectures/ai/mlops-python) for the reference architecture integrating with CI/CD tools.

An ML pipeline can be invoked in the following ways.

- Time-based schedule invocation
- On-demand invocation by the published endpoint (REST)
- Trigger-based invocation, such as a file change or other combined events (with Azure Event Grid, Azure Logic Apps, etc.)

In this exercise, we create a simple training pipeline that returns model metrics as top-level (pipeline) outputs.

*back to [index](https://github.com/tsmatz/azureml-tutorial/)*

## 1. Set Variables

Replace the bracketed strings below to set the required variables.

By using ```az login --service-principal```, you can run the ML pipeline from CI/CD utilities (such as GitHub Actions) without an interactive login UI.

> Note : With the following ```az configure --defaults``` command, you can omit the ```--resource-group``` and ```--workspace-name``` options in each ```az ml``` command.<br>
> ```az configure --defaults group=$resource_group workspace=$aml_workspace```

In [1]:
my_resource_group = "{AML-RESOURCE-GROUP-NAME}"
my_workspace = "{AML-WORKSPACE-NAME}"
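
For unattended runs (for instance, in a CI/CD job), the ```az login --service-principal``` call mentioned above can be assembled from environment variables. The following is a minimal sketch: the environment variable names (```AZURE_CLIENT_ID```, ```AZURE_CLIENT_SECRET```, ```AZURE_TENANT_ID```) and the helper names are assumptions for illustration, not part of this exercise. It only builds the argument lists; pass them to ```subprocess.run``` if you want to execute them.

```python
import os

def build_login_command(env=os.environ):
    """Build the argument list for a non-interactive service principal login.

    The environment variable names are hypothetical; use whatever
    names your CI/CD tool provides.
    """
    return [
        "az", "login", "--service-principal",
        "--username", env["AZURE_CLIENT_ID"],
        "--password", env["AZURE_CLIENT_SECRET"],
        "--tenant", env["AZURE_TENANT_ID"],
    ]

def build_defaults_command(resource_group, workspace):
    """Build the `az configure --defaults` call shown in the note above."""
    return [
        "az", "configure", "--defaults",
        f"group={resource_group}",
        f"workspace={workspace}",
    ]

# Example with placeholder values (no command is executed here).
example_env = {
    "AZURE_CLIENT_ID": "<app-id>",
    "AZURE_CLIENT_SECRET": "<secret>",
    "AZURE_TENANT_ID": "<tenant-id>",
}
login_command = build_login_command(example_env)
defaults_command = build_defaults_command("AML-rg", "ws01")
print(" ".join(defaults_command))
```

Keeping the secret in an environment variable (rather than in the notebook) avoids committing credentials to source control.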

## 2. Create compute

Create a new AML compute cluster for running the pipeline.

When the pipeline is invoked, the compute is started. When the pipeline completes, the compute automatically scales down to zero nodes.

In [2]:
!az ml compute create --name mycluster01 \
  --resource-group $my_resource_group \
  --workspace-name $my_workspace \
  --type amlcompute \
  --min-instances 0 \
  --max-instances 1 \
  --size Standard_D2_v2

{
  "id": "/subscriptions/b3ae1c15-4fef-4362-8c3a-5d804cdeb18d/resourceGroups/AML-rg/providers/Microsoft.MachineLearningServices/workspaces/ws01/computes/mycluster01",
  "idle_time_before_scale_down": 120,
  "location": "eastus",
  "max_instances": 1,
  "min_instances": 0,
  "name": "mycluster01",
  "network_settings": {},
  "provisioning_state": "Succeeded",
  "resourceGroup": "AML-rg",
  "size": "STANDARD_D2_V2",
  "ssh_public_access_enabled": true,
  "tier": "dedicated",
  "type": "amlcompute"
}
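
The scale-to-zero behavior described above depends on ```min_instances``` being 0. As a small sketch, the relevant fields of the CLI output can be checked in Python; the JSON below is a subset copied from the output above, and the helper name is illustrative.

```python
import json

# Subset of the `az ml compute create` output shown above.
compute_json = """
{
  "idle_time_before_scale_down": 120,
  "max_instances": 1,
  "min_instances": 0,
  "name": "mycluster01",
  "provisioning_state": "Succeeded",
  "type": "amlcompute"
}
"""

def scales_to_zero(compute):
    """Return True if the cluster is provisioned and may release all nodes when idle."""
    return (
        compute.get("provisioning_state") == "Succeeded"
        and compute.get("min_instances") == 0
    )

compute = json.loads(compute_json)
print(compute["name"], "scales to zero:", scales_to_zero(compute))
print("idle_time_before_scale_down:", compute["idle_time_before_scale_down"])
```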

## 3. Create an environment

First, create a custom environment (with TensorFlow 1.15) to run scripts.

In [6]:
%%writefile 09_conda_pydata.yml
name: project_environment
dependencies:
- python=3.6
- pip:
  - tensorflow==1.15
channels:
- anaconda
- conda-forge

Writing 09_conda_pydata.yml


In [7]:
%%writefile 09_env_register.yml
$schema: https://azuremlschemas.azureedge.net/latest/environment.schema.json
name: test-remote-cpu-env
image: mcr.microsoft.com/azureml/openmpi3.1.2-ubuntu18.04
conda_file: 09_conda_pydata.yml
description: This is example

Writing 09_env_register.yml


In [8]:
!az ml environment create --file 09_env_register.yml \
  --resource-group $my_resource_group \
  --workspace-name $my_workspace

{
  "conda_file": {
    "channels": [
      "anaconda",
      "conda-forge"
    ],
    "dependencies": [
      "python=3.6",
      {
        "pip": [
          "azureml-defaults",
          "tensorflow==1.15"
        ]
      }
    ],
    "name": "project_environment"
  },
  "creation_context": {
    "created_at": "2022-06-07T05:29:52.216059+00:00",
    "created_by": "Tsuyoshi Matsuzaki",
    "created_by_type": "User",
    "last_modified_at": "2022-06-07T05:29:52.216059+00:00",
    "last_modified_by": "Tsuyoshi Matsuzaki",
    "last_modified_by_type": "User"
  },
  "description": "This is example",
  "id": "azureml:/subscriptions/b3ae1c15-4fef-4362-8c3a-5d804cdeb18d/resourceGroups/AML-rg/providers/Microsoft.MachineLearningServices/workspaces/ws01/environments/test-remote-cpu-env/versions/1",
  "image": "mcr.microsoft.com/azureml/openmpi3.1.2-ubuntu18.04",
  "name": "test-remote-cpu-env",
  "os_type": "linux",
  "resourceGroup": "AML-rg",
  "tags": {},
  "

## 4. Save scripts

In this example, I create a pipeline for model training and evaluation.<br>
In this pipeline, the following steps will be executed.

1. The model is trained.
2. The model accuracy is evaluated. The model metrics are set as the pipeline's output.

The source code for each step is saved as follows.

- training script ```./pipeline_script/train.py```
- evaluation script ```./pipeline_script/evaluate.py```

The trained model is saved in a sub folder (```generated_model```) of the model dir, which is passed into the next step. The evaluation step then writes the model metrics to a file as JSON text.
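
The hand-off between the two steps is purely file based: the training step writes into the folder it receives as ```--model_folder```, and the evaluation step reads from the same folder. Here is a minimal local sketch of that contract. The temporary folders stand in for the pipeline-managed ones, and the ```fake_train``` and ```fake_evaluate``` functions (with a dummy file and dummy metric values) are stand-ins for the real scripts.

```python
import json
import os
import tempfile

MODEL_SUBFOLDER = "generated_model"  # same sub folder name as in the scripts

def fake_train(model_folder):
    """Stand-in for train.py: place a model under <model_folder>/generated_model."""
    dest_path = os.path.join(model_folder, MODEL_SUBFOLDER)
    os.makedirs(dest_path, exist_ok=True)
    with open(os.path.join(dest_path, "saved_model.pb"), "w") as f:
        f.write("dummy model")  # placeholder content, not a real SavedModel
    return dest_path

def fake_evaluate(model_folder, output_info):
    """Stand-in for evaluate.py: check the model folder, write metrics as JSON."""
    model_path = os.path.join(model_folder, MODEL_SUBFOLDER)
    if not os.path.isdir(model_path):
        raise FileNotFoundError(model_path)
    metrics = {"accuracy": 0.0, "loss": 0.0}  # dummy values for the sketch
    with open(output_info, "w") as f:
        json.dump(metrics, f)
    return metrics

with tempfile.TemporaryDirectory() as model_dir, \
        tempfile.TemporaryDirectory() as info_dir:
    fake_train(model_dir)
    metrics_file = os.path.join(info_dir, "metrics.txt")
    fake_evaluate(model_dir, metrics_file)
    with open(metrics_file) as f:
        written = json.load(f)

print(written)
```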

In [3]:
import os
script_folder = './pipeline_script'
os.makedirs(script_folder, exist_ok=True)

In [4]:
%%writefile pipeline_script/train.py
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import sys
import os
import shutil
import argparse
import math
import json

import tensorflow as tf

FLAGS = None
batch_size = 100

#
# Define functions for Estimator
#

def _my_input_fn(filepath, num_epochs):
    # image - 784 (=28 x 28) grey-scale values, scaled to floats in [0, 1]
    # label - digit (0, 1, ..., 9)
    data_queue = tf.train.string_input_producer(
        [filepath],
        num_epochs = num_epochs) # data is read num_epochs times; OutOfRange is raised when it is exhausted
    data_reader = tf.TFRecordReader()
    _, serialized_exam = data_reader.read(data_queue)
    data_exam = tf.parse_single_example(
        serialized_exam,
        features={
            'image_raw': tf.FixedLenFeature([], tf.string),
            'label': tf.FixedLenFeature([], tf.int64)
        })
    data_image = tf.decode_raw(data_exam['image_raw'], tf.uint8)
    data_image.set_shape([784])
    data_image = tf.cast(data_image, tf.float32) * (1. / 255)
    data_label = tf.cast(data_exam['label'], tf.int32)
    data_batch_image, data_batch_label = tf.train.batch(
        [data_image, data_label],
        batch_size=batch_size)
    return {'inputs': data_batch_image}, data_batch_label

def _get_input_fn(filepath, num_epochs):
    return lambda: _my_input_fn(filepath, num_epochs)

def _my_model_fn(features, labels, mode):
    # with tf.device(...): # You can set device if using GPUs

    # define network and inference
    # (simple 2 fully connected hidden layer : 784->128->64->10)
    with tf.name_scope('hidden1'):
        weights = tf.Variable(
            tf.truncated_normal(
                [784, FLAGS.first_layer],
                stddev=1.0 / math.sqrt(float(784))),
            name='weights')
        biases = tf.Variable(
            tf.zeros([FLAGS.first_layer]),
            name='biases')
        hidden1 = tf.nn.relu(tf.matmul(features['inputs'], weights) + biases)
    with tf.name_scope('hidden2'):
        weights = tf.Variable(
            tf.truncated_normal(
                [FLAGS.first_layer, FLAGS.second_layer],
                stddev=1.0 / math.sqrt(float(FLAGS.first_layer))),
            name='weights')
        biases = tf.Variable(
            tf.zeros([FLAGS.second_layer]),
            name='biases')
        hidden2 = tf.nn.relu(tf.matmul(hidden1, weights) + biases)
    with tf.name_scope('softmax_linear'):
        weights = tf.Variable(
            tf.truncated_normal(
                [FLAGS.second_layer, 10],
                stddev=1.0 / math.sqrt(float(FLAGS.second_layer))),
            name='weights')
        biases = tf.Variable(
            tf.zeros([10]),
            name='biases')
        logits = tf.matmul(hidden2, weights) + biases
 
    # compute evaluation metrics
    predicted_indices = tf.argmax(input=logits, axis=1)
    if mode != tf.estimator.ModeKeys.PREDICT:
        label_indices = tf.cast(labels, tf.int32)
        accuracy = tf.metrics.accuracy(label_indices, predicted_indices)
        tf.summary.scalar('accuracy', accuracy[1]) # output to TensorBoard
 
        loss = tf.losses.sparse_softmax_cross_entropy(
            labels=labels,
            logits=logits)
 
    # define operations
    if mode == tf.estimator.ModeKeys.TRAIN:
        #global_step = tf.train.create_global_step()
        #global_step = tf.contrib.framework.get_or_create_global_step()
        global_step = tf.train.get_or_create_global_step()        
        optimizer = tf.train.GradientDescentOptimizer(
            learning_rate=FLAGS.learning_rate)
        train_op = optimizer.minimize(
            loss=loss,
            global_step=global_step)
        return tf.estimator.EstimatorSpec(
            mode,
            loss=loss,
            train_op=train_op)
    if mode == tf.estimator.ModeKeys.EVAL:
        eval_metric_ops = {
            'accuracy': accuracy
        }
        return tf.estimator.EstimatorSpec(
            mode,
            loss=loss,
            eval_metric_ops=eval_metric_ops)
    if mode == tf.estimator.ModeKeys.PREDICT:
        probabilities = tf.nn.softmax(logits, name='softmax_tensor')
        predictions = {
            'classes': predicted_indices,
            'probabilities': probabilities
        }
        export_outputs = {
            'prediction': tf.estimator.export.PredictOutput(predictions)
        }
        return tf.estimator.EstimatorSpec(
            mode,
            predictions=predictions,
            export_outputs=export_outputs)

def _my_serving_input_fn():
    inputs = {'inputs': tf.placeholder(tf.float32, [None, 784])}
    return tf.estimator.export.ServingInputReceiver(inputs, inputs)

_my_evaluation_input_fn = (tf.estimator.experimental.build_raw_supervised_input_receiver_fn(
    {'inputs': tf.placeholder(dtype=tf.float32, shape=[None, 784])},
    tf.placeholder(dtype=tf.int32, shape=[None])))

#
# Main
#

parser = argparse.ArgumentParser()
parser.add_argument(
    '--data_folder',
    type=str,
    default='./data',
    help='Folder path for input data')
parser.add_argument(
    '--chkpoint_folder',
    type=str,
    default='./logs',  # AML experiments logs folder
    help='Folder path for checkpoint files')
parser.add_argument(
    '--model_temp',
    type=str,
    default='./outputs',
    help='Folder path for model temporary output')
parser.add_argument(
    '--model_folder',
    type=str,
    help='Folder path for model output')
parser.add_argument(
    '--learning_rate',
    type=float,
    default='0.07',
    help='Learning Rate')
parser.add_argument(
    '--first_layer',
    type=int,
    default='128',
    help='Neuron number for the first hidden layer')
parser.add_argument(
    '--second_layer',
    type=int,
    default='64',
    help='Neuron number for the second hidden layer')
FLAGS, unparsed = parser.parse_known_args()

# Clean checkpoint and model folder if exists
if os.path.exists(FLAGS.chkpoint_folder) :
    for file_name in os.listdir(FLAGS.chkpoint_folder):
        file_path = os.path.join(FLAGS.chkpoint_folder, file_name)
        if os.path.isfile(file_path):
            os.remove(file_path)
        elif os.path.isdir(file_path):
            shutil.rmtree(file_path)
if os.path.exists(FLAGS.model_temp) :
    for file_name in os.listdir(FLAGS.model_temp):
        file_path = os.path.join(FLAGS.model_temp, file_name)
        if os.path.isfile(file_path):
            os.remove(file_path)
        elif os.path.isdir(file_path):
            shutil.rmtree(file_path)

# Read TF_CONFIG
run_config = tf.estimator.RunConfig()

# Create Estimator
mnist_fullyconnected_classifier = tf.estimator.Estimator(
    model_fn=_my_model_fn,
    model_dir=FLAGS.chkpoint_folder,
    config=run_config)
train_spec = tf.estimator.TrainSpec(
    input_fn=_get_input_fn(os.path.join(FLAGS.data_folder, 'train.tfrecords'), 2),
    max_steps=60000 * 2 // batch_size)  # 2 epochs over 60000 examples
eval_spec = tf.estimator.EvalSpec(
    input_fn=_get_input_fn(os.path.join(FLAGS.data_folder, 'test.tfrecords'), 1),
    steps=10000 * 1 // batch_size,  # 1 epoch over 10000 examples
    start_delay_secs=0)

# Run training !
tf.estimator.train_and_evaluate(
    mnist_fullyconnected_classifier,
    train_spec,
    eval_spec
)

# Save model and parameters
model_folder = mnist_fullyconnected_classifier.experimental_export_all_saved_models(
    export_dir_base=FLAGS.model_temp,
    input_receiver_fn_map={
        tf.estimator.ModeKeys.EVAL: _my_evaluation_input_fn,
        tf.estimator.ModeKeys.PREDICT: _my_serving_input_fn
    })
print('current working directory is ', os.getcwd())

# Copy model to model_folder
model_folder_path = model_folder.decode("utf-8")
model_folder_name = os.path.basename(model_folder_path)
dest_path = os.path.join(FLAGS.model_folder, "generated_model")
shutil.move(model_folder_path, dest_path)
print('model is saved ', dest_path)

Writing pipeline_script/train.py
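
The step counts passed to ```TrainSpec``` and ```EvalSpec``` follow from the dataset size, the number of epochs, and the batch size. Here is a small sketch of that arithmetic, assuming the standard MNIST split of 60,000 training and 10,000 test examples (the helper name is illustrative).

```python
def steps_for(num_examples, num_epochs, batch_size):
    """Number of full batches needed to read the data num_epochs times."""
    return (num_examples * num_epochs) // batch_size

batch_size = 100
train_steps = steps_for(60000, 2, batch_size)  # 2 epochs over the training set
eval_steps = steps_for(10000, 1, batch_size)   # 1 epoch over the test set
print(train_steps, eval_steps)  # prints: 1200 100
```

Integer division (```//```) keeps the result an ```int```, whereas ```/``` would yield the float ```1200.0``` in Python 3.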


In [5]:
%%writefile pipeline_script/evaluate.py
import os
import argparse
import json

import tensorflow as tf

FLAGS = None
batch_size = 100

def _my_input_fn(filepath, num_epochs):
    # image - 784 (=28 x 28) grey-scale values, scaled to floats in [0, 1]
    # label - digit (0, 1, ..., 9)
    data_queue = tf.train.string_input_producer(
        [filepath],
        num_epochs = num_epochs) # data is read num_epochs times; OutOfRange is raised when it is exhausted
    data_reader = tf.TFRecordReader()
    _, serialized_exam = data_reader.read(data_queue)
    data_exam = tf.parse_single_example(
        serialized_exam,
        features={
            'image_raw': tf.FixedLenFeature([], tf.string),
            'label': tf.FixedLenFeature([], tf.int64)
        })
    data_image = tf.decode_raw(data_exam['image_raw'], tf.uint8)
    data_image.set_shape([784])
    data_image = tf.cast(data_image, tf.float32) * (1. / 255)
    data_label = tf.cast(data_exam['label'], tf.int32)
    data_batch_image, data_batch_label = tf.train.batch(
        [data_image, data_label],
        batch_size=batch_size)
    return {'inputs': data_batch_image}, data_batch_label

def _get_input_fn(filepath, num_epochs):
    return lambda: _my_input_fn(filepath, num_epochs)

parser = argparse.ArgumentParser()
parser.add_argument(
    '--data_folder',
    type=str,
    default='./data',
    help='Folder path for input data')
parser.add_argument(
    '--model_folder',
    type=str,
    default='./model',
    help='Folder path for model base dir')
parser.add_argument(
    '--output_info',
    type=str,
    default='./output_info',
    help='File path for model metrics output (JSON text)')
FLAGS, unparsed = parser.parse_known_args()

# Load model
model_folder_path = os.path.join(FLAGS.model_folder, "generated_model")
est = tf.contrib.estimator.SavedModelEstimator(model_folder_path)

# Evaluate and output !
eval_results = est.evaluate(
    input_fn=_get_input_fn(os.path.join(FLAGS.data_folder, 'test.tfrecords'), 1),
    steps=10000 * 1 // batch_size)
print(
    "Accuracy: {}, Loss: {}".format(
        eval_results['metrics/accuracy'], eval_results['loss']
    )
)
output_info = {
    'accuracy' : float(eval_results['metrics/accuracy']),
    'loss' : float(eval_results['loss'])
}
output_json = json.dumps(output_info)
# Write metrics as JSON text (the file is closed automatically)
with open(FLAGS.output_info, "w") as f:
    f.write(output_json)

Writing pipeline_script/evaluate.py
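
Both scripts read their settings with ```parse_known_args()```, which returns the recognized options together with a list of leftover arguments instead of failing on unknown ones. Here is a minimal standalone sketch of that behavior (the ```--unknown_flag``` argument and the ```/tmp/m``` path are made up for the demonstration).

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--model_folder", type=str, default="./model")
# A string default is converted through `type`, just like a command-line value.
parser.add_argument("--learning_rate", type=float, default="0.07")

flags, unparsed = parser.parse_known_args(
    ["--model_folder", "/tmp/m", "--unknown_flag", "1"]
)
print(flags.model_folder, flags.learning_rate, unparsed)
```

With ```parse_args()``` instead, the unknown argument would make the script exit with an error.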


## 5. Build and Run ML pipeline

Now let's compose the pipeline in YAML and submit a job for it.

> Note : In this example, I also use the registered data asset (train.tfrecords, test.tfrecords) named ```mnist_tfrecords_data```, which is mounted on your compute target. Run "[Exercise02 : Prepare Data](./exercise02_prepare_data.ipynb)" for dataset preparation.

In [8]:
%%writefile 09_training_pipeline_job.yml
$schema: https://azuremlschemas.azureedge.net/latest/pipelineJob.schema.json
type: pipeline
display_name: training-pipeline01
experiment_name: training-pipeline01
compute: azureml:mycluster01
inputs:
  mnist_tf:
    type: uri_folder
    path: azureml:mnist_tfrecords_data@latest
outputs:
  model_metrics:
jobs:
  train_model:
    name: train_model
    display_name: train_model
    command: >-
      python train.py
      --data_folder ${{inputs.tf_dataset}}
      --model_folder ${{outputs.model_dir}}
    code: pipeline_script
    environment: azureml:test-remote-cpu-env@latest
    inputs:
      tf_dataset: ${{parent.inputs.mnist_tf}}
    outputs:
      model_dir:
  evaluate_model:
    name: evaluate_model
    display_name: evaluate_model
    command: >-
      python evaluate.py
      --data_folder ${{inputs.tf_dataset}}
      --model_folder ${{inputs.model_dir}}
      --output_info ${{outputs.model_info}}/metrics.txt
    code: pipeline_script
    environment: azureml:test-remote-cpu-env@latest
    inputs:
      tf_dataset: ${{parent.inputs.mnist_tf}}
      model_dir: ${{parent.jobs.train_model.outputs.model_dir}}
    outputs:
      model_info: ${{parent.outputs.model_metrics}}

Writing 09_training_pipeline_job.yml
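
In this YAML, the execution order is not declared explicitly: ```evaluate_model``` runs after ```train_model``` because one of its inputs is bound to ```parent.jobs.train_model.outputs.model_dir```. The following sketch illustrates that idea by extracting the upstream job names from the binding expressions. It is a simplified illustration with a hand-copied dictionary and a regular expression of my own, not how Azure ML itself parses pipelines.

```python
import re

# Input bindings copied by hand from 09_training_pipeline_job.yml
job_inputs = {
    "train_model": {
        "tf_dataset": "${{parent.inputs.mnist_tf}}",
    },
    "evaluate_model": {
        "tf_dataset": "${{parent.inputs.mnist_tf}}",
        "model_dir": "${{parent.jobs.train_model.outputs.model_dir}}",
    },
}

# Matches bindings of the form ${{parent.jobs.<job>.outputs.<name>}}
UPSTREAM = re.compile(r"\$\{\{parent\.jobs\.(\w+)\.outputs\.\w+\}\}")

def upstream_jobs(inputs):
    """Return the set of jobs whose outputs feed the given inputs."""
    found = set()
    for binding in inputs.values():
        match = UPSTREAM.fullmatch(binding)
        if match:
            found.add(match.group(1))
    return found

dependencies = {name: upstream_jobs(inputs) for name, inputs in job_inputs.items()}
print(dependencies)
```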


Submit a job to run this pipeline.

In [9]:
!az ml job create --file 09_training_pipeline_job.yml \
  --resource-group $my_resource_group \
  --workspace-name $my_workspace

Uploading pipeline_script (0.01 MBs): 100%|█| 10154/10154 [00:00<00:00, 358227.6

{
  "compute": "azureml:mycluster01",
  "creation_context": {
    "created_at": "2022-06-07T07:10:55.905612+00:00",
    "created_by": "Tsuyoshi Matsuzaki",
    "created_by_type": "User"
  },
  "display_name": "training-pipeline01",
  "experiment_name": "training-pipeline01",
  "id": "azureml:/subscriptions/b3ae1c15-4fef-4362-8c3a-5d804cdeb18d/resourceGroups/AML-rg/providers/Microsoft.MachineLearningServices/workspaces/ws01/jobs/loyal_grass_ln86f4gx22",
  "inputs": {
    "mnist_tf": {
      "mode": "ro_mount",
      "path": "azureml:/subscriptions/b3ae1c15-4fef-4362-8c3a-5d804cdeb18d/resourceGroups/AML-rg/providers/Microsoft.MachineLearningServices/workspaces/ws01/data/mnist_tfrecords_data/versions/1",
      "type": "uri_folder"
    }
  },
  "jobs": {
    "evaluate_model": {
      "$schema": "{}",
      "code": "{}",
      "command": "{}",
      "component": "azureml:ce483600-4a0f-8caa-e0c6-

Go to [AML studio UI](https://ml.azure.com/) and see pipeline results in jobs. (See below.)

![Pipeline results](https://tsmatz.github.io/images/github/azure-ml-tensorflow-complete-sample/20220225_Experiment_Pipeline.jpg)

You can extract the model metrics from the pipeline outputs.<br>
If the metrics meet your criteria in this training pipeline, you can then invoke the next stage in your MLOps integration.
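
Here is a minimal sketch of such a check, assuming the ```model_metrics``` output has already been downloaded to a local folder (for example with ```az ml job download```). The threshold value, the function names, and the demonstration values are hypothetical; the file name ```metrics.txt``` and the JSON keys come from the pipeline definition and ```evaluate.py``` above.

```python
import json
import os
import tempfile

def load_metrics(path):
    """Read the JSON text written by evaluate.py."""
    with open(path) as f:
        return json.load(f)

def passes_gate(metrics, min_accuracy):
    """Return True when the model is accurate enough to move to the next stage."""
    return metrics["accuracy"] >= min_accuracy

# Demonstration with a made-up metrics file (values are not real results).
with tempfile.TemporaryDirectory() as tmp:
    metrics_path = os.path.join(tmp, "metrics.txt")
    with open(metrics_path, "w") as f:
        json.dump({"accuracy": 0.5, "loss": 1.0}, f)
    demo_metrics = load_metrics(metrics_path)

print(passes_gate(demo_metrics, min_accuracy=0.9))  # prints: False
```

In a CI/CD job, you could exit with a non-zero status when the check fails so that later stages are skipped.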

## 6. Remove Compute

You don't need to remove your AML compute to save money, because the nodes are automatically terminated when the cluster is inactive.<br>
But if you want to clean up, run the following.

In [10]:
!az ml compute delete --name mycluster01 \
  --resource-group $my_resource_group \
  --workspace-name $my_workspace \
  --yes

Deleting compute mycluster01 
.................................................Done.
(4m 8s)
