# Azure Machine Learning Pipelines with Auto ML
In this demonstration, we will look at how to discover high-performing models with Azure Machine Learning pipelines.
This demonstration is adapted from the following AutoML pipeline tutorial to fit the regression scenario for this activity:
1. Use automated ML in an Azure Machine Learning pipeline in Python - https://docs.microsoft.com/en-us/azure/machine-learning/how-to-use-automlstep-in-pipelines

## Referencing Machine Learning workspace from config file created in previous steps
In this step, we retrieve the details of the previously created Machine Learning workspace from the config file.

#### Run the cell below only if you are running the notebook locally and you created the workspace using the portal. Uncomment the code and replace subscription_id, resource_group and workspace_name with your own values.

In [None]:
from azureml.core import Workspace

#subscription_id = '<subscription_id>'
#resource_group  = '<resource_group>'
#workspace_name  = '<workspace_name>'

#try:
#    ws = Workspace(subscription_id = subscription_id, resource_group = resource_group, workspace_name = workspace_name)
#    ws.write_config()
#    print('Library configuration succeeded')
#except:
#    print('Workspace not found')

In [None]:
import azureml.core
from azureml.core import Workspace, Datastore

ws = Workspace.from_config()

In [None]:
ws.get_details()

## Getting Datastore, Blobstore and Filestore in the workspace
In this step, we will define the datastore, blob store and file store.

In [None]:
# Default datastore 
def_data_store = ws.get_default_datastore()

# Get the blob storage associated with the workspace
def_blob_store = Datastore(ws, "workspaceblobstore")

# Get file storage associated with the workspace
def_file_store = Datastore(ws, "workspacefilestore")

Examine the data store details

In [None]:
def_data_store

## Upload dataset and register the dataset in the workspace
In the following steps, we will upload the training and test data to the workspace blob store and register a dataset that can be used later in the pipeline.

In [None]:
def_blob_store.upload_files(
    ["./LengthOfStay_trimmed.csv"],
    target_path="train-dataset",
    overwrite=True)

The code below creates a TabularDataset from the CSV file uploaded to the blob store and registers it with the workspace under the name 'patient_los_dataset'. Because create_new_version is set to True, a new version is registered if a dataset with that name already exists. Finally, the function Dataset.get_by_name() retrieves the registered dataset and assigns it to patient_los_dataset.

In [None]:
from azureml.core import Dataset
ws = Workspace.from_config()
datastore = Datastore.get(ws, 'workspaceblobstore')
patient_los_dataset = Dataset.Tabular.from_delimited_files([(datastore, 'train-dataset/LengthOfStay_trimmed.csv')])
patient_los_dataset.register(workspace=ws,
                             name='patient_los_dataset',
                             description='patient los data',
                             create_new_version=True)

patient_los_dataset = Dataset.get_by_name(ws, 'patient_los_dataset')


The code below lists the names of the datasets registered in the workspace, which confirms that 'patient_los_dataset' was registered.

In [None]:
ws.datasets.keys()

## Create compute cluster
In the step below, we will create a compute target to run the pipeline

In [None]:
from azureml.core import Datastore
from azureml.core.compute import AmlCompute, ComputeTarget

datastore = ws.get_default_datastore()

compute_name = 'cpu-cluster'
if compute_name not in ws.compute_targets:
    print('creating a new compute target...')
    provisioning_config = AmlCompute.provisioning_configuration(vm_size='STANDARD_D2_V2',
                                                                max_nodes=4)
    compute_target = ComputeTarget.create(ws, compute_name, provisioning_config)

    compute_target.wait_for_completion(
        show_output=True, min_node_count=None, timeout_in_minutes=20)

    # Show the result
    print(compute_target.get_status().serialize())

compute_target = ws.compute_targets[compute_name]

The below code prints metadata about the patient los dataset that we uploaded in the earlier steps

In [None]:
print(patient_los_dataset)

## Configure the training run
This step is to make sure that the remote training run has all the dependencies that are required by the training steps. Dependencies and the runtime context are set by creating and configuring a RunConfiguration object.

In [None]:
#!pip install ruamel.yaml==0.17.4 --user

In [None]:
#``ruamel.yaml<=0.15``

This step configures the training run. The runtime context is set by creating and configuring a RunConfiguration object. Here we set the compute target created earlier.

In [None]:
from azureml.core.runconfig import RunConfiguration
from azureml.core.conda_dependencies import CondaDependencies
from azureml.core import Environment 

aml_run_config = RunConfiguration()
# Use just-specified compute target ("cpu-cluster")
aml_run_config.target = compute_target

USE_CURATED_ENV = False
if USE_CURATED_ENV:
    curated_environment = Environment.get(workspace=ws, name="AzureML-Tutorial")
    aml_run_config.environment = curated_environment
else:
    aml_run_config.environment.python.user_managed_dependencies = False
    
    # Add some packages relied on by data prep step
    aml_run_config.environment.python.conda_dependencies = CondaDependencies.create(
        conda_packages=['pandas','scikit-learn'], 
        pip_packages=['azureml-sdk[automl,explain]', 'azureml-dataprep[fuse,pandas]'], 
        pin_sdk_version=False)

## Preparing data for Auto ML regression
In this step, we prepare the data by dropping columns that won't be used for prediction. This can be extended further to do complete data preparation.

In [None]:
%%writefile dataprep.py
from azureml.core import Run

import argparse
import os

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq


def prepare_train_x(df):
    # Drop columns that will not be used for prediction
    train_x = df.drop(['vdate'], axis=1)
    return train_x


parser = argparse.ArgumentParser()
parser.add_argument('--output_path', dest='output_path', required=True)
args = parser.parse_args()

# The dataset passed to the step with as_named_input("patient_los_dataset")
patient_los_dataset = Run.get_context().input_datasets['patient_los_dataset']
df_train = patient_los_dataset.to_pandas_dataframe()

prepare_train_x_df = prepare_train_x(df_train)

# Create the parent directory, then write the prepared data as a Parquet file
os.makedirs(os.path.dirname(args.output_path), exist_ok=True)
pq.write_table(pa.Table.from_pandas(prepare_train_x_df), args.output_path)

print(f"Wrote prepared data to {args.output_path}")

## Define Data preparation step for pipeline
In this step, we define the data preparation step using the Python file created earlier.

The data preparation code described above must be associated with a PythonScriptStep object to be used with a pipeline. The path to which the Parquet output is written is generated by a PipelineData object. The resources prepared earlier, such as the ComputeTarget, the RunConfiguration, and the 'patient_los_dataset' Dataset, are used to complete the specification.
The prepped_data_path object is created by calling as_dataset() on that PipelineData object, which makes it a PipelineOutputFileDataset. Notice that it's specified in the arguments parameter. If you review the previous step, you'll see that within the data preparation code, the value of the argument '--output_path' is the path to which the Parquet file is written.
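
To make the hand-off concrete, here is a minimal sketch of how a script can receive the value of '--output_path' and prepare the location before writing. It uses only the standard library. The temporary path is a hypothetical stand-in for the location the pipeline would supply at run time, and parse_output_path is an illustrative helper, not part of the notebook.

```python
import argparse
import os
import tempfile


def parse_output_path(argv):
    """Parse --output_path with the same argparse setup as dataprep.py."""
    parser = argparse.ArgumentParser()
    parser.add_argument('--output_path', dest='output_path', required=True)
    return parser.parse_args(argv).output_path


# Hypothetical stand-in for the path the pipeline injects at run time.
fake_pipeline_path = os.path.join(tempfile.mkdtemp(), 'prepped', 'data.parquet')
output_path = parse_output_path(['--output_path', fake_pipeline_path])

# Create the parent directory so that a later write to output_path succeeds.
os.makedirs(os.path.dirname(output_path), exist_ok=True)
parent_exists = os.path.isdir(os.path.dirname(output_path))
```

In the real pipeline, the step's arguments list supplies the value, so the script never needs to know where the data store is mounted.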

In [None]:
from azureml.pipeline.core import PipelineData, InputPortBinding, Pipeline
from azureml.pipeline.steps import PythonScriptStep

#datastore = Datastore.get(ws, 'workspaceblobstore')

prepped_data_path = PipelineData("patient_los_dataset",def_data_store,"direct").as_dataset()


dataprep_step = PythonScriptStep(
    name="dataprep", 
    script_name="dataprep.py", 
    compute_target=compute_target, 
    runconfig=aml_run_config,
    arguments=["--output_path", prepped_data_path],
    inputs=[patient_los_dataset.as_named_input("patient_los_dataset")],
    outputs=[prepped_data_path],
    allow_reuse=True
)

## Train with AutoMLStep

Configuring an automated ML pipeline step is done with the AutoMLConfig class. This flexible class is described in Configure automated ML experiments in Python. Data input and output are the only aspects of configuration that require special attention in an ML pipeline. Input and output for AutoMLConfig in pipelines is discussed in detail below. Beyond data, an advantage of ML pipelines is the ability to use different compute targets for different steps. You might choose to use a more powerful ComputeTarget only for the automated ML process. Doing so is as straightforward as assigning a more powerful RunConfiguration to the AutoMLConfig object's run_configuration parameter.

## Send data to AutoML Step
The snippet below creates a high-performing PipelineOutputTabularDataset from the PipelineOutputFileDataset output of the data preparation step.

In [None]:
prepped_data_path


In [None]:
# type(prepped_data_path) == PipelineOutputFileDataset
# type(prepped_data) == PipelineOutputTabularDataset
prepped_data = prepped_data_path.parse_parquet_files(file_extension=None)

## Specify Automated ML Outputs
The outputs of the AutoMLStep are the final metric scores of the higher-performing model and that model itself. To use these outputs in further pipeline steps, prepare PipelineData objects to receive them. The snippet below creates the two PipelineData objects for the metrics and model output. Each is named, assigned to the default datastore retrieved earlier, and associated with the particular type of TrainingOutput from the AutoMLStep. Because we assign pipeline_output_name on these PipelineData objects, their values will be available not just from the individual pipeline step, but from the pipeline as a whole.

In [None]:
from azureml.pipeline.core import TrainingOutput

metrics_data = PipelineData(name='metrics_data',
                           datastore=datastore,
                           pipeline_output_name='metrics_output',
                           training_output=TrainingOutput(type='Metrics'))
model_data = PipelineData(name='best_model_data',
                           datastore=datastore,
                           pipeline_output_name='model_output',
                           training_output=TrainingOutput(type='Model'))

## Configure and Create Automated ML Pipeline Step
Once the inputs and outputs are defined, it's time to create the AutoMLConfig and AutoMLStep. The details of the configuration will depend on your task. In this case, it is a regression task that predicts the length of stay (the lengthofstay column) from the patient LOS dataset.
The automl_settings dictionary is passed to the AutoMLConfig constructor as kwargs. The other parameters aren't complex:

- task is set to regression for this example. Other valid values are classification and forecasting
- path and debug_log describe the path to the project and a local file to which debug information will be written
- compute_target is the previously defined compute_target that, in this example, is an inexpensive CPU-based machine. If you're using AutoML's Deep Learning facilities, you would want to change the compute target to be GPU-based
- featurization is set to auto. Indicates that as part of preprocessing, data guardrails and featurization steps are performed automatically. This is the default option.
- label_column_name indicates which column we are interested in predicting
- training_data is set to the PipelineOutputTabularDataset object made from the output of the data preparation step

The AutoMLStep itself takes the AutoMLConfig and has, as outputs, the PipelineData objects created to hold the metrics and model data.

In this example, the automated ML process will perform cross-validations on the training_data. You can control the number of cross-validations with the n_cross_validations argument. If you've already split your training data as part of your data preparation steps, you can set validation_data to its own Dataset.
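
As an illustration of what n_cross_validations = 5 means, the sketch below splits row indices into five folds so that every row is used for validation exactly once. This is a plain-Python approximation written for this explanation. It is not how automated ML performs its splits internally, which may differ (for example, by shuffling the rows), and k_fold_indices is a hypothetical helper name.

```python
def k_fold_indices(n_rows, n_folds):
    """Yield (train_indices, validation_indices) for each of n_folds folds."""
    indices = list(range(n_rows))
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_rows // n_folds + (1 if i < n_rows % n_folds else 0)
                  for i in range(n_folds)]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size


# Ten rows and five folds: each fold validates on two rows and trains on eight.
folds = list(k_fold_indices(n_rows=10, n_folds=5))
```

Each candidate model is then trained once per fold, and its scores are combined across folds, which is why a larger number of cross-validations lengthens the experiment.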

In [None]:
from azureml.train.automl import AutoMLConfig
from azureml.pipeline.steps import AutoMLStep
import logging
automl_settings = {
       "n_cross_validations":5,
       "primary_metric": 'r2_score',
       "enable_early_stopping": True,
       #change the timeout for shorter run of the experiment if there is no time to demonstrate
       "experiment_timeout_hours": 1.0,
       "max_concurrent_iterations": 4,
       "max_cores_per_iteration": -1,
       "verbosity": logging.INFO
   }

automl_config = AutoMLConfig(task = 'regression',
                               path = '.',
                               compute_target = compute_target,
                               training_data = prepped_data,
                           #    run_configuration = aml_run_config,
                               featurization = 'auto',
                               debug_log = 'automated_ml_errors.log',
                               label_column_name = 'lengthofstay',
                               **automl_settings
                               )

train_step = AutoMLStep(name='AutoMLregression',
    automl_config=automl_config,
    passthru_automl_config=False,
    outputs=[metrics_data,model_data],
    allow_reuse=True)

## Register the model created by automated ML
The last step in a basic ML pipeline is registering the created model. By adding the model to the workspace's model registry, it will be available in the portal and can be versioned. To register the model, write another PythonScriptStep that takes the model_data output of the AutoMLStep (the two cells below perform these steps).

A model is registered in a Workspace. You're probably familiar with using Workspace.from_config() to log on to your workspace on your local machine, but there's another way to get the workspace from within a running ML pipeline. The Run.get_context() retrieves the active Run. This run object provides access to many important objects, including the Workspace used here.


In [None]:
%%writefile register_model.py
from azureml.core.model import Model, Dataset
from azureml.core.run import Run, _OfflineRun
from azureml.core import Workspace
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--model_name", required=True)
parser.add_argument("--model_path", required=True)
args = parser.parse_args()

print(f"model_name : {args.model_name}")
print(f"model_path: {args.model_path}")

run = Run.get_context()
ws = Workspace.from_config() if isinstance(run, _OfflineRun) else run.experiment.workspace

model = Model.register(workspace=ws,
                       model_path=args.model_path,
                       model_name=args.model_name)

print("Registered version {0} of model {1}".format(model.version, model.name))

The model-registering PythonScriptStep uses a PipelineParameter for one of its arguments. Pipeline parameters are arguments to pipelines that can be easily set at run-submission time. Once declared, they're passed as normal arguments.

In [None]:
from azureml.pipeline.core.graph import PipelineParameter

# The model name with which to register the trained model in the workspace.
model_name = PipelineParameter("model_name", default_value="LOSPredict")

register_step = PythonScriptStep(script_name="register_model.py",
                                       name="register_model",
                                       allow_reuse=False,
                                       arguments=["--model_name", model_name, "--model_path", model_data],
                                       inputs=[model_data],
                                       compute_target=compute_target,
                                       runconfig=aml_run_config)

## Create and run the automated ML pipeline
Creating and running a pipeline that contains the AutoML Step

The code below combines the data preparation, automated ML, and model-registering steps into a Pipeline object. It then creates an Experiment object. The Experiment constructor will retrieve the named experiment if it exists or create it if necessary. It submits the Pipeline to the Experiment, creating a Run object that will asynchronously run the pipeline. The wait_for_completion() function blocks until the run completes.

In [None]:
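import azureml._restclient.snapshots_client
from azureml.pipeline.core import Pipeline
from azureml.core import Experiment

# Raise the snapshot size limit so that large files in the source directory do not block submission
azureml._restclient.snapshots_client.SNAPSHOT_MAX_SIZE_BYTES = 20000000000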


pipeline = Pipeline(ws, [dataprep_step, train_step, register_step])
#pipeline = Pipeline(ws, [dataprep_step,train_step])

experiment = Experiment(workspace=ws, name='los_automl_pipeline')

run = experiment.submit(pipeline, show_output=True)
run.wait_for_completion()

## Examine pipeline results
Once the run completes, you can retrieve PipelineData objects that have been assigned a pipeline_output_name. You can download the results and load them for further processing. Downloaded files are written to the subdirectory azureml/{run.id}/. The metrics file is JSON-formatted and can be converted into a Pandas dataframe for examination.
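
As a rough sketch of working with the metrics once they are loaded, the snippet below picks the child run with the highest r2_score using only the standard library. It assumes the JSON maps each child run ID to a dictionary of metric names, with each metric stored as a one-element list. That layout is an assumption, so check it against your own file. The run IDs and scores are made-up sample values, and best_run is a hypothetical helper.

```python
import json

# Made-up sample in the assumed layout: {run_id: {metric_name: [value]}}.
sample_metrics_json = json.dumps({
    "run_0": {"r2_score": [0.71], "root_mean_squared_error": [1.20]},
    "run_1": {"r2_score": [0.93], "root_mean_squared_error": [0.61]},
    "run_2": {"r2_score": [0.88], "root_mean_squared_error": [0.79]},
})


def best_run(metrics_json, metric="r2_score"):
    """Return (run_id, score) for the run with the highest value of metric."""
    metrics = json.loads(metrics_json)
    scores = {run_id: values[metric][0]
              for run_id, values in metrics.items()
              if metric in values}
    best_id = max(scores, key=scores.get)
    return best_id, scores[best_id]


best_id, best_score = best_run(sample_metrics_json)
```

Taking the maximum only makes sense for metrics where higher is better, such as r2_score. For an error metric, the minimum would be the right choice.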

In [None]:
metrics_output_port = run.get_pipeline_output('metrics_output')
model_output_port = run.get_pipeline_output('model_output')

metrics_output_port.download('.', show_progress=True)
model_output_port.download('.', show_progress=True)

Replace the run ID in metrics_filename with the GUID-like directory name created by the download above. For example, the code below uses the GUID from the metrics_data download output: azureml/ff426d72-94eb-4fa6-a42e-280c76e20d91/metrics_data
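
If you would rather not copy the GUID by hand, one option is to search for the downloaded file. The sketch below assumes the azureml/{run.id}/ layout described earlier. find_downloaded_outputs is a hypothetical helper, and the demonstration runs against a throwaway directory with a made-up run ID rather than a real download.

```python
import glob
import os
import tempfile


def find_downloaded_outputs(output_name, base_dir='.'):
    """Return every <base_dir>/azureml/<run id>/<output_name> path, sorted."""
    pattern = os.path.join(base_dir, 'azureml', '*', output_name)
    return sorted(glob.glob(pattern))


# Build a throwaway directory tree that mimics a download (made-up run ID).
demo_dir = tempfile.mkdtemp()
demo_run_dir = os.path.join(demo_dir, 'azureml', 'hypothetical-run-id')
os.makedirs(demo_run_dir)
with open(os.path.join(demo_run_dir, 'metrics_data'), 'w') as f:
    f.write('{}')

matches = find_downloaded_outputs('metrics_data', base_dir=demo_dir)
```

In the notebook, the first entry returned by find_downloaded_outputs('metrics_data') could stand in for the hard-coded path, provided only one download is present under azureml/.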


In [None]:
import pandas as pd
import json

#metrics_filename = metrics_output._path_on_datastore
metrics_filename = 'azureml/ff426d72-94eb-4fa6-a42e-280c76e20d91/metrics_data'
with open(metrics_filename) as f:
    metrics_output_result = f.read()
   
deserialized_metrics_output = json.loads(metrics_output_result)
df = pd.DataFrame(deserialized_metrics_output)
df

The model file loaded below can be used for inferencing, metrics analysis, and so on. Replace the run ID portion of model_filename with the directory name created under azureml/ by the download.

In [None]:
import pickle
import os
#from sklearn.preprocessing import Imputer

#from azureml.automl.core._shared_package_legacy_import import _import_all_legacy_submodules
#model_filename = model_output._path_on_datastore
# Change the run ID portion to the directory created under <parent directory>/azureml
model_filename = 'azureml/ff426d72-94eb-4fa6-a42e-280c76e20d91/best_model_data'

with open(model_filename, "rb") as f:
    best_model = pickle.load(f)


best_model