This is a modified version of https://github.com/Azure/MachineLearningNotebooks/blob/master/how-to-use-azureml/machine-learning-pipelines/parallel-run/tabular-dataset-inference-iris.ipynb
Valerie Carey modified this script to demonstrate a serious Azure bug involving models with vs. without datasets.

Original Link Copyright (c) Microsoft Corporation. All rights reserved.
Licensed under the MIT License.



<h1> Parallel Run Azure Bug With Model Including a Dataset - Summary </h1>


I am finding that a ParallelRunStep hangs forever when an argument called "--model_name" is given the name of a registered model that contains one or more datasets.  The parallel run never schedules any mini batches; it hangs without reporting an error.  The hang occurs whether or not the script ever uses the model.  Here I show a dummy script that accepts arguments and does nothing with them.  The run completes if "--model_name" names a registered model without datasets, but hangs forever for the same model registered with a dataset.

In addition, the argument "--model_name" seems to be reserved, even though that is not documented anywhere as far as I can tell.  If you pass some dummy string to the argument called "--model_name", the run fails with an error.  This is true even if that argument isn't actually used anywhere in the script, or if you didn't intend it to refer to a registered model.


<h4> Prerequisites </h4>

I have deleted the parts of the example code that create a workspace, blob storage, a compute target, etc.  I assume you have these resources already.  You can modify the code below to connect to your own resources.

<h2> Connect to workspace </h2>
Modify the code below to connect to your own workspace

In [1]:
# Check core SDK version number. I had 1.20.0
import azureml.core

print("SDK version:", azureml.core.VERSION)

SDK version: 1.20.0


In [2]:
# Enter your info here!
WORKSPACE_NAME = 'YOUR-INFO-HERE'
WORKSPACE_SUBSCRIPTION_ID = "YOUR-INFO-HERE"
WORKSPACE_RESOURCE_GROUP = "YOUR-INFO-HERE"

from azureml.core.workspace import Workspace

ws = Workspace.get(name=WORKSPACE_NAME,
               subscription_id=WORKSPACE_SUBSCRIPTION_ID,
               resource_group=WORKSPACE_RESOURCE_GROUP)

<h2> Set your datastore and compute </h2>
Modify the code below to specify your own datastore and blob storage locations

In [3]:
from azureml.core import Datastore

# input datastore
iris_data = Datastore.get(ws, 'YOUR-INFO-HERE')

# path on the datastore where you will store the model object and data files
iris_data_path = 'YOUR/INFO/HERE'

In [4]:
# Name of the compute target which should be provisioned already, and have at least 2 nodes
COMPUTE_TARGET_NAME = 'YOUR-INFO-HERE'

<h2> Set output folder </h2>
I use the same datastore for outputs as inputs.  You could put in your own information

In [5]:
from azureml.pipeline.core import PipelineData
output_folder = PipelineData(name='inferences', datastore=iris_data)

<h2> Get the data </h2>
Use the "iris" dataset as a simple input. This data isn't "used" here, except to enable a parallel run to happen (I batch on this data)

In [6]:
# Get a temporary location for storage of downloaded items
import os        # needed below for os.path.join and os.listdir
import tempfile
iris_data_tmpdir = tempfile.mkdtemp()
print(iris_data_tmpdir)

/tmp/tmp93ua7pvt


In [7]:
# Get the iris dataset and save it as CSV

from sklearn import datasets

iris = datasets.load_iris()

import pandas as pd
iris_df = pd.DataFrame(data=iris['data'], columns = iris['feature_names'])
print(len(iris_df))

# Save as CSV
iris_data_local_csv =  os.path.join(iris_data_tmpdir, 'iris.csv')
iris_df.to_csv(iris_data_local_csv, sep = ',', index = False)


150


In [8]:
# list temporary directory contents
os.listdir(iris_data_tmpdir)

['iris.csv']

<h4> Create tabular dataset for batching </h4>
Move the iris data to blob storage and create a dataset

In [9]:
# Move files to blob storage
iris_data.upload_files(files=[iris_data_local_csv],
                              target_path=iris_data_path, overwrite=True)

Uploading an estimated of 1 files
Uploading /tmp/tmp93ua7pvt/iris.csv
Uploaded /tmp/tmp93ua7pvt/iris.csv, 1 files out of an estimated total of 1
Uploaded 1 files


$AZUREML_DATAREFERENCE_b4016c6fa500442fa98018b1713f41d7

In [10]:
# Create dataset
from azureml.core.dataset import Dataset

iris_ds_csv = Dataset.Tabular.from_delimited_files(path=[(iris_data,  
                                                           '/'.join([iris_data_path, 'iris.csv']))], 
                                                           validate=False)

<h2> Register Models </h2>
Note that the models are never used in the code.  However, how a model is registered determines whether a simple test script runs, so I register the iris model in two different ways.
The data comes from the sklearn iris dataset; the pretrained model file comes from
https://github.com/Azure-Samples/Machine-Learning-Operationalization/blob/master/samples/python/code/iris/model.pkl

In [11]:
# Get the model object from a Github repo

import requests 
model_url = 'https://github.com/Azure-Samples/Machine-Learning-Operationalization/raw/master/samples/python/code/iris/model.pkl'
r = requests.get(model_url, allow_redirects=True)
model_data_local = os.path.join(iris_data_tmpdir, 'model.pkl')

open(model_data_local, 'wb').write(r.content)

924

<h4> Register usual model </h4>

In [12]:
from azureml.core.model import Model

model1 = Model.register(model_path = model_data_local,
                       model_name = "iris-prs", 
                       tags = {'pretrained': "iris"},
                       workspace = ws)

Registering model iris-prs


<h4> Register model plus a dataset </h4>
I register the same model, but attach a dataset.  I use the same dataset as above, although you could add any dataset, or more than one.  Somehow adding a dataset makes the parallel run step hang, even if the model isn't used ¯\_(ツ)_/¯

In [13]:
json_dummy_file = os.path.join(iris_data_tmpdir, '12_file_name.json')
with open(json_dummy_file, 'w') as fp: 
    pass

In [45]:
model2 = Model.register(model_path = model_data_local,
                       model_name = "iris-prs2", 
                        datasets = [('data', iris_ds_csv)],
                       tags = {'pretrained': "iris"},
                       workspace = ws)

Registering model iris-prs2


<h2> Create experiment script </h2>
I use this notebook to output a .py file. The script doesn't actually do anything with the input data.  The purpose is to show that the run finishes with certain arguments and hangs forever with others.

In [15]:
# Get a temporary location for the generated entry script
import tempfile
py_tmpdir = tempfile.mkdtemp()
print(py_tmpdir)

py_outfile = os.path.join(py_tmpdir, 'iris_score.py')

/tmp/tmpzv_n48e3


In [16]:
%%writefile $py_outfile
import io
import argparse
import pandas as pd

from azureml_user.parallel_run import EntryScript

def init():

    logger = EntryScript().logger
    logger.info("init() is called.")

    # Define 2 input parameters to show special behavior of "model_name" string
    parser = argparse.ArgumentParser(description="Iris model serving")
    parser.add_argument('--arg_name', dest="arg_name", required=False)
    parser.add_argument('--model_name', dest="model_name", required=False)  # separate dest so it does not overwrite arg_name
    args, unknown_args = parser.parse_known_args()


def run(input_data):
    # Return nonsense data frame
    result=pd.DataFrame({'A' : [1]})
    return result


Writing /tmp/tmpzv_n48e3/iris_score.py


In [17]:
os.listdir(py_tmpdir)

['iris_score.py']
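
To check that the entry script itself treats "--model_name" as an ordinary string, the argument parsing can be exercised locally, without Azure. This is a minimal sketch: it rebuilds only the parser from iris_score.py (with a separate `dest` per flag, which is my assumption about the intent) and feeds it an argument list like the ones passed to ParallelRunStep. The `--extra` flag is hypothetical and stands in for whatever additional arguments the framework may append.

```python
import argparse


def build_parser():
    # Same two optional arguments as in iris_score.py
    parser = argparse.ArgumentParser(description="Iris model serving")
    parser.add_argument("--arg_name", dest="arg_name", required=False)
    parser.add_argument("--model_name", dest="model_name", required=False)
    return parser


def parse(argv):
    # parse_known_args ignores flags the parser does not declare
    return build_parser().parse_known_args(argv)


args, unknown = parse(["--model_name", "this_is_not_a_real_model", "--extra", "1"])
print(args.model_name, args.arg_name, unknown)
```

Nothing in this parsing looks up a model, so any model lookup triggered by "--model_name" has to happen outside the user script.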

<h2> Set up for batch run </h2>

In [18]:
# Set the environment
from azureml.core import Environment
from azureml.core.runconfig import CondaDependencies

predict_conda_deps = CondaDependencies.create(pip_packages=["scikit-learn==0.20.3",
                                                            "azureml-core", "azureml-dataset-runtime[pandas,fuse]"])

predict_env = Environment(name="predict_environment")
predict_env.python.conda_dependencies = predict_conda_deps
predict_env.docker.enabled = True
predict_env.spark.precache_packages = False

In [19]:
# Configure the parallel run

from azureml.pipeline.steps import ParallelRunStep, ParallelRunConfig

parallel_run_config = ParallelRunConfig(
    source_directory=py_tmpdir,
    entry_script='iris_score.py',  # the user script to run against each input
    mini_batch_size='1KB',
    error_threshold=-1,       # Don't worry about errors as I return dummy data
    output_action='append_row',
    append_row_file_name="iris_outputs.txt",
    environment=predict_env,
    compute_target=COMPUTE_TARGET_NAME, 
    node_count=2,
    run_invocation_timeout=600
)

<h2> Test 1 - Passing Case with Simple Registered Model </h2>
I pass the name of a model which contains no datasets to the script. Note that this works fine, and uses 2-3 mini batches/processes

In [20]:
# Create the pipeline step
distributed_csv_iris_step = ParallelRunStep(
    name='example-iris-csv',
    inputs=[iris_ds_csv.as_named_input('iris_data')],
    output=output_folder,
    parallel_run_config=parallel_run_config,
    arguments=['--model_name', 'iris-prs'], # Passing simple model's name
    allow_reuse=False
)

In [21]:
# Run the pipeline
from azureml.core import Experiment
from azureml.pipeline.core import Pipeline

pipeline = Pipeline(workspace=ws, steps=[distributed_csv_iris_step])

pipeline_run = Experiment(ws, 'iris-prs').submit(pipeline)

Created step example-iris-csv [eab74b8b][bead9866-bfc7-441a-aee0-fe2790422b2e], (This step will run and generate new outputs)
Submitted PipelineRun 4ca25603-c3cd-4804-8aad-c648a5473c63
Link to Azure Machine Learning Portal: https://ml.azure.com/experiments/iris-prs/runs/4ca25603-c3cd-4804-8aad-c648a5473c63?wsid=/subscriptions/47a09743-a743-494e-945d-4022653e134e/resourcegroups/rg-flex-flightrisk-data-001/workspaces/mlw-flightrisk-dev


In [None]:
# Wait for the run to complete
pipeline_run.wait_for_completion(show_output=False)

At this point you can view the experiment, and it should be successful.  Note that more than one mini batch is used!  I get 3 mini batches.  The script does nothing, which is fine.  The important thing is that it finishes.
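
The count of 3 mini batches is plausible given mini_batch_size='1KB'. The sketch below is a rough estimate only: it assumes that 1KB means 1024 bytes and that a tabular input is split by approximate size, and it uses a synthetic 150-row CSV shaped like iris.csv rather than the real file. The actual partitioning is internal to ParallelRunStep, and the helper names here are my own.

```python
import math


def parse_size(text):
    """Convert '1KB', '10MB' or a plain byte count into bytes (1KB = 1024 assumed)."""
    units = {"KB": 1024, "MB": 1024 ** 2, "GB": 1024 ** 3}
    text = str(text).strip().upper()
    for suffix, factor in units.items():
        if text.endswith(suffix):
            return int(float(text[: -len(suffix)]) * factor)
    return int(text)


def estimate_mini_batches(data_bytes, mini_batch_size):
    """Rough count of mini batches if data is split by approximate size."""
    return max(1, math.ceil(data_bytes / parse_size(mini_batch_size)))


# Synthetic stand-in for iris.csv: one header line plus 150 identical rows
header = "sepal length (cm),sepal width (cm),petal length (cm),petal width (cm)\n"
row = "5.1,3.5,1.4,0.2\n"
synthetic_csv = header + row * 150

estimated = estimate_mini_batches(len(synthetic_csv.encode("utf-8")), "1KB")
print(estimated)
```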

<h2> Test 2 - Hanging Case for Model Containing Datasets </h2>
The parallel run step hangs forever when run with a model that contains a dataset.

In [49]:
# Create the pipeline step
distributed_csv_iris_step2 = ParallelRunStep(
    name='example-iris-csv',
    inputs=[iris_ds_csv.as_named_input('iris_data')],
    output=output_folder,
    parallel_run_config=parallel_run_config,
    arguments=['--model_name', 'iris-prs2'],   # Using model with dataset's name
    allow_reuse=False
)

In [47]:
pipeline2 = Pipeline(workspace=ws, steps=[distributed_csv_iris_step2])

pipeline_run2 = Experiment(ws, 'iris-prs').submit(pipeline2)

Created step example-iris-csv [7c2bb11a][1fb2b8c1-6bb6-4ce8-9c31-b57108218ec2], (This step will run and generate new outputs)
Submitted PipelineRun dedf58d4-c082-4cc1-88b7-bf2594c4a4f7
Link to Azure Machine Learning Portal: https://ml.azure.com/experiments/iris-prs/runs/dedf58d4-c082-4cc1-88b7-bf2594c4a4f7?wsid=/subscriptions/47a09743-a743-494e-945d-4022653e134e/resourcegroups/rg-flex-flightrisk-data-001/workspaces/mlw-flightrisk-dev


In [48]:
# Wait for the run to complete
pipeline_run2.wait_for_completion(show_output=False)

PipelineRunId: dedf58d4-c082-4cc1-88b7-bf2594c4a4f7
Link to Azure Machine Learning Portal: https://ml.azure.com/experiments/iris-prs/runs/dedf58d4-c082-4cc1-88b7-bf2594c4a4f7?wsid=/subscriptions/47a09743-a743-494e-945d-4022653e134e/resourcegroups/rg-flex-flightrisk-data-001/workspaces/mlw-flightrisk-dev
{'runId': 'dedf58d4-c082-4cc1-88b7-bf2594c4a4f7', 'status': 'Canceled', 'startTimeUtc': '2021-03-19T14:37:58.401337Z', 'endTimeUtc': '2021-03-19T14:48:25.470937Z', 'properties': {'azureml.runsource': 'azureml.PipelineRun', 'runSource': 'SDK', 'runType': 'SDK', 'azureml.parameters': '{}'}, 'inputDatasets': [], 'outputDatasets': [], 'logFiles': {'logs/azureml/executionlogs.txt': 'https://mlwflightriskd8499840071.blob.core.windows.net/azureml/ExperimentRun/dcid.dedf58d4-c082-4cc1-88b7-bf2594c4a4f7/logs/azureml/executionlogs.txt?sv=2019-02-02&sr=b&sig=Ejqole4owHBjsk5Dy8wgnyRwKsWJk1ec%2FrqbQNrWV2U%3D&st=2021-03-19T14%3A30%3A21Z&se=2021-03-19T22%3A40%3A21Z&sp=r', 'logs/azureml/stderrlogs.txt'

'Canceled'
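
Because a run like this does not finish on its own, it can help to wait with a deadline instead of blocking indefinitely. The helper below is a generic sketch, not part of the Azure ML SDK: `wait_with_deadline` and `TERMINAL_STATES` are my own names, and the set of terminal status strings is an assumption. It is demonstrated with a fake status function so it runs without a workspace.

```python
import time

# Status strings assumed to mean "the run is over"; adjust to what your SDK reports
TERMINAL_STATES = {"Completed", "Finished", "Failed", "Canceled"}


def wait_with_deadline(get_status, timeout_seconds, poll_seconds=10.0):
    """Poll get_status() until a terminal state is seen or the deadline passes.

    Returns the terminal status, or None if the deadline passed first.
    """
    deadline = time.monotonic() + timeout_seconds
    while True:
        status = get_status()
        if status in TERMINAL_STATES:
            return status
        if time.monotonic() >= deadline:
            return None
        time.sleep(poll_seconds)


# Fake run that is canceled on the third poll
statuses = iter(["Running", "Running", "Canceled"])
finished = wait_with_deadline(lambda: next(statuses), timeout_seconds=5, poll_seconds=0)

# Fake run that never leaves "Running": the helper gives up and returns None
stuck = wait_with_deadline(lambda: "Running", timeout_seconds=0, poll_seconds=0)
print(finished, stuck)
```

With the run objects in this notebook, `get_status` could be `pipeline_run2.get_status`, and a `None` result could be followed by `pipeline_run2.cancel()`; I am assuming both behave as in SDK 1.20.0.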

This run never finishes.  If you look at the step's logs, there is a file called log/job_progress_overview.XXXXXXXX.txt (the X's are digits, apparently a date).  If you open that log, you will see something like:
<pre>
  2021-03-19T14:39:25.891499 Start the simulator.
  2021-03-19T14:39:26.798246 The overviewer on 10.0.0.7 started.
  2021-03-19T14:39:57.211507 Scheduled 0 mini batches in 31 seconds.
  2021-03-19T14:39:57.311094 Processed 0 mini batches in 0:00:31.
  2021-03-19T14:40:07.444707 Scheduled 0 mini batches in 42 seconds.
  2021-03-19T14:40:07.536334 Processed 0 mini batches in 0:00:42.
  ...
  2021-03-19T14:47:59.784273 Scheduled 0 mini batches in 514 seconds.
  2021-03-19T14:47:59.905612 Processed 0 mini batches in 0:08:34.
  2021-03-19T14:48:10.040731 Scheduled 0 mini batches in 524 seconds.
  2021-03-19T14:48:10.180070 Processed 0 mini batches in 0:08:44.
  </pre>
From the above, and comparing to logs from successful runs, it appears that the scheduling never happens.  The error occurs around this step - but you get no "error" just a hang.  "User" logs are created and you can see the entry_script_log's, which show that the init() function in the script was called.
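
The pattern in that log can be checked mechanically. The sketch below parses the "Scheduled N mini batches in S seconds." lines shown above and flags a run whose latest line still reports zero scheduled mini batches after a chosen time. The line format is taken only from the excerpt in this notebook, and the function names and the 300-second threshold are my own assumptions.

```python
import re

# Matches lines such as: 2021-03-19T14:39:57.211507 Scheduled 0 mini batches in 31 seconds.
SCHEDULED_RE = re.compile(r"^\s*(\S+)\s+Scheduled (\d+) mini batches in (\d+) seconds\.")


def scheduling_progress(log_text):
    """Return (timestamp, scheduled_count, elapsed_seconds) for each 'Scheduled' line."""
    rows = []
    for line in log_text.splitlines():
        match = SCHEDULED_RE.match(line)
        if match:
            rows.append((match.group(1), int(match.group(2)), int(match.group(3))))
    return rows


def looks_stalled(log_text, min_elapsed_seconds=300):
    """True if the latest 'Scheduled' line still shows 0 mini batches after the threshold."""
    rows = scheduling_progress(log_text)
    if not rows:
        return False
    _, count, elapsed = rows[-1]
    return count == 0 and elapsed >= min_elapsed_seconds


sample = """\
  2021-03-19T14:39:25.891499 Start the simulator.
  2021-03-19T14:39:57.211507 Scheduled 0 mini batches in 31 seconds.
  2021-03-19T14:39:57.311094 Processed 0 mini batches in 0:00:31.
  2021-03-19T14:48:10.040731 Scheduled 0 mini batches in 524 seconds.
  2021-03-19T14:48:10.180070 Processed 0 mini batches in 0:08:44.
"""

print(scheduling_progress(sample))
print(looks_stalled(sample))
```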

<h2> Test 3: Passing Case for Model Name Argument Not Called model_name </h2>
The "--model_name" argument seems to be some sort of reserved word: the same model name works if you rename the argument from "--model_name" to "--arg_name".

In [50]:
# Create the pipeline step
distributed_csv_iris_step3 = ParallelRunStep(
    name='example-iris-csv',
    inputs=[iris_ds_csv.as_named_input('iris_data')],
    output=output_folder,
    parallel_run_config=parallel_run_config,
    arguments=['--arg_name', 'iris-prs2'],   # Using model with dataset's name again
    allow_reuse=False
)

In [51]:
pipeline3 = Pipeline(workspace=ws, steps=[distributed_csv_iris_step3])

pipeline_run3 = Experiment(ws, 'iris-prs').submit(pipeline3)

Created step example-iris-csv [95b854ba][268993fa-e787-47b3-a8f2-7a20841352bc], (This step will run and generate new outputs)
Submitted PipelineRun fcf717eb-b630-4f11-ba41-31419000c5e0
Link to Azure Machine Learning Portal: https://ml.azure.com/experiments/iris-prs/runs/fcf717eb-b630-4f11-ba41-31419000c5e0?wsid=/subscriptions/47a09743-a743-494e-945d-4022653e134e/resourcegroups/rg-flex-flightrisk-data-001/workspaces/mlw-flightrisk-dev


In [None]:
# Wait for the run to complete
pipeline_run3.wait_for_completion(show_output=False)

This seems fine.  "model_name" is a reserved argument name?  That's news to me! I tried to find where that is documented and failed
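
Until the behavior is documented, one defensive option is to check the argument list for the problematic flag before building a step. This is a sketch under my own assumptions: the set of suspect flags contains only "--model_name" because that is the only name I have observed to behave this way, and `find_suspect_flags` is a hypothetical helper, not an SDK function.

```python
# Flags observed in this notebook to get special treatment; not an official list
SUSPECT_FLAGS = {"--model_name"}


def find_suspect_flags(arguments, suspect=SUSPECT_FLAGS):
    """Return the entries of an arguments list whose flag name is in the suspect set."""
    found = []
    for item in arguments:
        # Handle both ['--flag', 'value'] and ['--flag=value'] styles
        if isinstance(item, str) and item.split("=", 1)[0] in suspect:
            found.append(item)
    return found


print(find_suspect_flags(["--model_name", "iris-prs2"]))
print(find_suspect_flags(["--arg_name", "iris-prs2"]))
```

A non-empty result is a cue to rename the flag, as in Test 3, or to make sure the value names a registered model without datasets.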

<h2> Test 4: Error Case when model_name Not a Registered Model </h2>
If you pass in an arbitrary string for the "--model_name" argument, the run will error out after a long wait.  This is true even for my test script, where the argument is unused!  The user needs to be aware that model_name is a reserved argument and must refer to an actual model!  The time it takes to error out is long enough that I thought this case was also a hang for a while.

In [53]:
# Create the pipeline step
distributed_csv_iris_step4 = ParallelRunStep(
    name='example-iris-csv',
    inputs=[iris_ds_csv.as_named_input('iris_data')],
    output=output_folder,
    parallel_run_config=parallel_run_config,
    arguments=['--model_name', 'this_is_not_a_real_model'],   # Pass a string that's not a model name
    allow_reuse=False
)

In [54]:
pipeline4 = Pipeline(workspace=ws, steps=[distributed_csv_iris_step4])

pipeline_run4 = Experiment(ws, 'iris-prs').submit(pipeline4)

Created step example-iris-csv [446c0950][9985c7bb-6c03-413c-9fa3-83cb8bd74256], (This step will run and generate new outputs)
Submitted PipelineRun 56b9bc91-1973-4069-8748-eee7d8ed10cb
Link to Azure Machine Learning Portal: https://ml.azure.com/experiments/iris-prs/runs/56b9bc91-1973-4069-8748-eee7d8ed10cb?wsid=/subscriptions/47a09743-a743-494e-945d-4022653e134e/resourcegroups/rg-flex-flightrisk-data-001/workspaces/mlw-flightrisk-dev


In [None]:
# Wait for the run to complete
pipeline_run4.wait_for_completion(show_output=False)

At this point you get what looks like another hang, but the behavior is slightly different.  The "user" entry script logs aren't created; it appears no part of the script is called. If you look in logs/sys/node_launcher/error.txt, you see:
<pre>
  {
    "error": {
        "message": "ModelNotFound: Model with name this_is_not_a_real_model not found in provided workspace"
        }
  }
</pre>
This shows the process is looking for a model - the model_name argument is reserved and assumptions are made about its meaning.   Note for the "model with dataset" hang (test 2), there is no error file in logs/sys/node_launcher.
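
If you download that error file, the missing model name can be pulled out with a few lines of standard-library code. This is a sketch that assumes error.txt always has the JSON shape and message wording shown above; `missing_model_name` is a hypothetical helper name.

```python
import json
import re


def missing_model_name(error_text):
    """Return the model name from a ModelNotFound error payload, or None if absent."""
    payload = json.loads(error_text)
    message = payload.get("error", {}).get("message", "")
    match = re.match(r"ModelNotFound: Model with name (\S+) not found", message)
    return match.group(1) if match else None


# Contents of logs/sys/node_launcher/error.txt as shown above
error_txt = """
{
  "error": {
      "message": "ModelNotFound: Model with name this_is_not_a_real_model not found in provided workspace"
      }
}
"""

print(missing_model_name(error_txt))
```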