Dev Environment Setup
---------------------
This is the *one-time* setup. 

1. Create a conda environment named `azure` with the required SDK packages installed, and use that environment to run the notebook.
2. Download the workspace config to `~/.azure/ygong/config.json` so that the workspace can be instantiated from the config without supplying the subscription, resource group, etc. every time.

```bash
# set up the conda environment for the Azure CLI and SDKs
conda create -n azure python=3.10
conda activate azure
pip install azure-cli # 2.55.0 as of the Feb 2024 installation
pip install azure-ai-ml
pip install azure-identity
pip install azureml-core

# pip install azureml-sdk # fails with a dependency issue
```
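
For step 2, the workspace config is a small JSON file. The sketch below assumes only the three keys this notebook reads later (`subscription_id`, `resource_group`, `workspace_name`); it loads the file and fails early if one of them is missing. The helper name `load_workspace_config` is hypothetical, and a real config file may contain additional keys.

```python
import json
from pathlib import Path

# Keys this notebook reads from the config; a real file may contain more.
REQUIRED_KEYS = ("subscription_id", "resource_group", "workspace_name")


def load_workspace_config(path):
    """Load a workspace config.json and verify the required keys exist."""
    config_path = Path(path).expanduser()  # expands a leading "~"
    with config_path.open("r", encoding="utf-8") as f:
        config = json.load(f)
    missing = [key for key in REQUIRED_KEYS if key not in config]
    if missing:
        raise KeyError(f"{config_path} is missing keys: {missing}")
    return config
```

It could be called as `load_workspace_config("~/.azure/ygong/config.json")`, which avoids hard-coding the home directory.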

Analysis Goal
-------------
Competitive analysis of the AzureML training CUJ (critical user journey), covering:
* compute resource creation and management
* runtime environment management
* workflow integration
* interactive development experience
* metrics and monitoring

In [20]:
import json

file_path = "/Users/yu.gong/.azure/ygong/config.json"
with open(file_path, 'r') as file:
    data = json.load(file)


from azure.identity import DefaultAzureCredential
from azure.ai.ml import MLClient
from azure.ai.ml.entities import AmlCompute, ComputeInstance, VmSize
subscription_id = data['subscription_id']
resource_group = data['resource_group']
workspace_name = data['workspace_name']

credential = DefaultAzureCredential()
ml_client = MLClient(credential, subscription_id, resource_group, workspace_name)

# from azureml.core import Workspace
# from azureml.core.compute import ComputeTarget, AmlCompute
# from azureml.core.compute_target import ComputeTargetException

# ws = Workspace.from_config("~/.azure/ygong")


Initialize the Compute Target
-----------------------------

The compute target instantiated here is an "Azure Machine Learning compute cluster", which scales up and down automatically.

The notable cluster creation settings are:
1. `min_nodes` and `max_nodes` bound the elasticity of the cluster.
2. `vm_priority` (`dedicated` or `lowpriority`) is fixed at cluster creation time.
3. *No runtime* is specified at cluster creation; the runtime environment is chosen per job.
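
The same settings can also be expressed with the v2 SDK (`azure-ai-ml`), which this notebook already uses for `MLClient`. The following is a sketch under assumptions, not the method used in the cell below: `cluster_settings` and `create_cluster` are hypothetical helpers, and the `tier` strings `"Dedicated"` and `"LowPriority"` should be verified against the installed SDK version. Note that nothing in the settings names a runtime or image.

```python
def cluster_settings(name, vm_size="STANDARD_D2_V2", min_nodes=0,
                     max_nodes=4, low_priority=False):
    """Build keyword arguments for azure.ai.ml.entities.AmlCompute."""
    if not 0 <= min_nodes <= max_nodes:
        raise ValueError("expected 0 <= min_nodes <= max_nodes")
    return {
        "name": name,
        "size": vm_size,
        "min_instances": min_nodes,  # 0 lets the cluster scale down to zero
        "max_instances": max_nodes,  # upper bound when jobs queue up
        # Assumed tier strings; the tier is chosen when the cluster is created.
        "tier": "LowPriority" if low_priority else "Dedicated",
    }


def create_cluster(ml_client, settings):
    """Create or update the cluster; needs azure-ai-ml and a live workspace."""
    from azure.ai.ml.entities import AmlCompute  # imported lazily on purpose

    cluster = AmlCompute(**settings)
    # begin_create_or_update returns a poller; result() blocks until it finishes.
    return ml_client.compute.begin_create_or_update(cluster).result()
```

With a live workspace, `create_cluster(ml_client, cluster_settings("cpu-test"))` would request a dedicated cluster that scales between 0 and 4 nodes.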

In [21]:
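from azureml.core import Workspace
# Note: this v1 AmlCompute shadows the v2 AmlCompute imported from azure.ai.ml.entities above
from azureml.core.compute import ComputeTarget, AmlCompute
from azureml.core.compute_target import ComputeTargetException

# The v1 SDK calls below need a Workspace object; load it from the same config file
ws = Workspace.from_config(path=file_path)

cluster_name = "cpu-test"
try:
    # Reuse the cluster if it already exists in the workspace
    cpu_cluster = ComputeTarget(workspace=ws, name=cluster_name)
    print('Found existing cluster, use it.')
except ComputeTargetException:
    # Otherwise create an autoscaling cluster with up to 4 nodes
    compute_config = AmlCompute.provisioning_configuration(vm_size='STANDARD_D2_V2',
                                                           max_nodes=4)
    cpu_cluster = ComputeTarget.create(ws, cluster_name, compute_config)
cpu_cluster.wait_for_completion(show_output=True)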

Found existing cluster, use it.
Succeeded
AmlCompute wait for completion finished

Minimum number of nodes requested have been provisioned


Submit a Job 
------------
1. `compute` is optional; if it is not specified, the job runs on serverless compute.
2. `environment` can be a custom image or a curated environment.




In [23]:
from azure.ai.ml import command
from azure.ai.ml import Input

registered_model_name = "credit_defaults_model"

job = command(
    inputs=dict(
        data=Input(
            type="uri_file",
            path="https://azuremlexamples.blob.core.windows.net/datasets/credit_card/default_of_credit_card_clients.csv",
        ),
        test_train_ratio=0.2,
        learning_rate=0.25,
        registered_model_name=registered_model_name,
    ),
    code="./",  # location of source code
    command="python train.py --data ${{inputs.data}} --test_train_ratio ${{inputs.test_train_ratio}} --learning_rate ${{inputs.learning_rate}} --registered_model_name ${{inputs.registered_model_name}}",
    environment="AzureML-sklearn-1.0-ubuntu20.04-py38-cpu@latest",
    display_name="credit_default_prediction",
    # compute=cluster_name
)
ml_client.create_or_update(job)



Uploading competitive-analysis (0.02 MBs): 100%|██████████| 24755/24755 [00:00<00:00, 72637.24it/s]



| Experiment | Name | Type | Status | Details Page |
|---|---|---|---|---|
| competitive-analysis | silver_fox_6z2gplqjkw | command | Starting | Link to Azure Machine Learning studio |
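
The `command` string passes each job input to `train.py` as a command-line flag. `train.py` itself is not shown in this notebook, so the following is only a sketch of how such a script might parse those four flags with `argparse`; the argument types and the `parse_args` helper are assumptions.

```python
import argparse


def parse_args(argv=None):
    """Parse the flags that the job's command line passes to train.py."""
    parser = argparse.ArgumentParser(description="credit default training (sketch)")
    parser.add_argument("--data", type=str, required=True,
                        help="path of the input CSV file")
    parser.add_argument("--test_train_ratio", type=float, required=True,
                        help="fraction of rows held out for testing")
    parser.add_argument("--learning_rate", type=float, required=True)
    parser.add_argument("--registered_model_name", type=str, required=True,
                        help="name under which the model is registered")
    # argv=None makes argparse read sys.argv, as it would inside the job
    return parser.parse_args(argv)
```

At run time AzureML replaces each `${{inputs.…}}` placeholder before the script starts, so the script only ever sees concrete values such as a local file path or `0.2`.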


Create Workflow
---------------


In [26]:
from azure.ai.ml.entities import Data
from azure.ai.ml.constants import AssetTypes

data_v1 = "initial"
data_name = "credit-card"
my_data = Data(name=data_name, version=data_v1, description="Credit card data",
    path="./azure-pipline/data/default_of_credit_card_clients.csv", type=AssetTypes.URI_FILE,)

# create the data asset if it doesn't already exist
try:
    data_asset = ml_client.data.get(name=data_name, version=data_v1)
    print(f"Data asset already exists. Name: {my_data.name}, version: {my_data.version}")
except Exception:  # assumed here to mean the asset is not registered yet
    ml_client.data.create_or_update(my_data)

Uploading default_of_credit_card_clients.csv (< 1 MB): 100%|██████████| 2.90M/2.90M [00:01<00:00, 1.67MB/s]



In [30]:
# get a handle to the registered data asset; its URI is available as credit_data.path
credit_data = ml_client.data.get(name=data_name, version=data_v1)

from azure.ai.ml.entities import Environment
import os

custom_env_name = "aml-scikit-learn"

pipeline_job_env = Environment(
    name=custom_env_name,
    description="Custom environment for Credit Card Defaults pipeline",
    tags={"scikit-learn": "0.24.2"},
    conda_file=os.path.join("azure-pipline/env/", "conda.yaml"),
    image="mcr.microsoft.com/azureml/openmpi4.1.0-ubuntu20.04:latest",
    version="0.3.0",
)
pipeline_job_env = ml_client.environments.create_or_update(pipeline_job_env)

print(f"Environment with name {pipeline_job_env.name} is registered to workspace, the environment version is {pipeline_job_env.version}")

Environment with name aml-scikit-learn is registered to workspace, the environment version is 0.3.0


In [31]:
from azure.ai.ml import command
from azure.ai.ml import Input, Output

data_prep_component = command(
    name="data_prep_credit_defaults",
    display_name="Data preparation for training",
    description="reads a .csv input and splits it into train and test sets",
    inputs={
        "data": Input(type="uri_folder"),
        "test_train_ratio": Input(type="number"),
    },
    outputs=dict(
        train_data=Output(type="uri_folder", mode="rw_mount"),
        test_data=Output(type="uri_folder", mode="rw_mount"),
    ),
    # The source folder of the component
    code="./azure-pipline/components/data_prep",
    command="""python data_prep.py \
            --data ${{inputs.data}} --test_train_ratio ${{inputs.test_train_ratio}} \
            --train_data ${{outputs.train_data}} --test_data ${{outputs.test_data}} \
            """,
    environment=f"{pipeline_job_env.name}:{pipeline_job_env.version}",
)
data_prep_component = ml_client.create_or_update(data_prep_component.component)

from azure.ai.ml import load_component

# Loading the component from the yml file
train_component = load_component(source=os.path.join("./azure-pipline/components/train", "train.yml"))

# Now we register the component to the workspace
train_component = ml_client.create_or_update(train_component)




Uploading data_prep (0.0 MBs): 100%|██████████| 1384/1384 [00:00<00:00, 18580.97it/s]

Uploading train (0.0 MBs): 100%|██████████| 3394/3394 [00:00<00:00, 11958.24it/s]
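
`data_prep.py` is not included in this notebook. As a rough sketch of what a script with this interface could do, the function below splits a CSV file into train and test files inside two output folders, using only the standard library. The function name `split_csv`, the output file name `data.csv`, the single header row, and the fixed seed are all assumptions made for illustration.

```python
import csv
import random
from pathlib import Path


def split_csv(data_path, test_train_ratio, train_dir, test_dir, seed=42):
    """Write a shuffled train/test split of a CSV that has one header row."""
    with open(data_path, newline="", encoding="utf-8") as f:
        rows = list(csv.reader(f))
    header, body = rows[0], rows[1:]

    random.Random(seed).shuffle(body)  # seeded shuffle keeps the split repeatable
    n_test = int(len(body) * test_train_ratio)
    parts = {train_dir: body[n_test:], test_dir: body[:n_test]}

    for out_dir, part in parts.items():
        out_path = Path(out_dir)
        out_path.mkdir(parents=True, exist_ok=True)
        with open(out_path / "data.csv", "w", newline="", encoding="utf-8") as f:
            writer = csv.writer(f)
            writer.writerow(header)  # each split keeps the header
            writer.writerows(part)
    return len(body) - n_test, n_test  # (train rows, test rows)
```

In the component, `train_dir` and `test_dir` would be the mounted paths that AzureML substitutes for `${{outputs.train_data}}` and `${{outputs.test_data}}`.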



In [32]:
from azure.ai.ml import dsl, Input, Output


@dsl.pipeline(
    compute="serverless",  # "serverless" value runs pipeline on serverless compute
    description="E2E data_prep-train pipeline",
)
def credit_defaults_pipeline(
    pipeline_job_data_input,
    pipeline_job_test_train_ratio,
    pipeline_job_learning_rate,
    pipeline_job_registered_model_name,
):
    # call data_prep_component like a Python function with its own inputs
    data_prep_job = data_prep_component(
        data=pipeline_job_data_input,
        test_train_ratio=pipeline_job_test_train_ratio,
    )

    # call train_component like a Python function with its own inputs
    train_job = train_component(
        train_data=data_prep_job.outputs.train_data,  # note: using outputs from previous step
        test_data=data_prep_job.outputs.test_data,  # note: using outputs from previous step
        learning_rate=pipeline_job_learning_rate,  # note: using a pipeline input as parameter
        registered_model_name=pipeline_job_registered_model_name,
    )

    # a pipeline returns a dictionary of outputs;
    # the keys become the pipeline's output identifiers
    return {
        "pipeline_job_train_data": data_prep_job.outputs.train_data,
        "pipeline_job_test_data": data_prep_job.outputs.test_data,
    }
registered_model_name = "credit_defaults_model"

# Let's instantiate the pipeline with the parameters of our choice
pipeline = credit_defaults_pipeline(
    pipeline_job_data_input=Input(type="uri_file", path=credit_data.path),
    pipeline_job_test_train_ratio=0.25,
    pipeline_job_learning_rate=0.05,
    pipeline_job_registered_model_name=registered_model_name,
)

In [33]:
pipeline_job = ml_client.jobs.create_or_update(
    pipeline,
    # Project's name
    experiment_name="e2e_registered_components",
)
ml_client.jobs.stream(pipeline_job.name)


RunId: goofy_mango_m314hnym79
Web View: https://ml.azure.com/runs/goofy_mango_m314hnym79?wsid=/subscriptions/8e39df30-d249-4143-a081-aa974968d4b8/resourcegroups/azureml-test/workspaces/test

Streaming logs/azureml/executionlogs.txt

[2024-02-25 19:35:20Z] Submitting 1 runs, first five are: 22f08bfb:7214e119-f1ce-4aec-b8ce-ca6b02a406d6
[2024-02-25 19:37:06Z] Completing processing run id 7214e119-f1ce-4aec-b8ce-ca6b02a406d6.
[2024-02-25 19:37:06Z] Submitting 1 runs, first five are: 376da5f6:4868083a-34a9-4663-b079-872b08e7fdb5
[2024-02-25 19:39:05Z] Completing processing run id 4868083a-34a9-4663-b079-872b08e7fdb5.

Execution Summary
RunId: goofy_mango_m314hnym79
Web View: https://ml.azure.com/runs/goofy_mango_m314hnym79?wsid=/subscriptions/8e39df30-d249-4143-a081-aa974968d4b8/resourcegroups/azureml-test/workspaces/test
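
`ml_client.jobs.stream` blocks while printing logs. When only the final state matters, polling is an alternative. The sketch below assumes that `ml_client.jobs.get(name).status` returns a string and that `Completed`, `Failed`, and `Canceled` are the terminal states; both assumptions should be checked against the installed SDK. `wait_for_job` is a hypothetical helper that accepts any zero-argument status function, so it can be tried without a workspace.

```python
import time

# Assumed terminal job states; verify against the SDK in use.
TERMINAL_STATES = {"Completed", "Failed", "Canceled"}


def wait_for_job(get_status, poll_seconds=30.0, timeout_seconds=3600.0):
    """Poll get_status() until it returns a terminal state or time runs out."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        status = get_status()
        if status in TERMINAL_STATES:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job still in state {status!r} after timeout")
        time.sleep(poll_seconds)
```

For the pipeline above, a call might look like `wait_for_job(lambda: ml_client.jobs.get(pipeline_job.name).status)`.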

