# Train and hyperparameter tune with RAPIDS

## Prerequisites

Create an Azure ML [Workspace](https://docs.microsoft.com/azure/machine-learning/service/concept-azure-machine-learning-architecture#workspace) and set up the environment on your local computer following the steps in [??????], or run this notebook in a Compute Instance.


In [1]:
# verify Azure ML SDK version

%pip show azure-ai-ml

Name: azure-ai-ml
Version: 1.2.0
Summary: Microsoft Azure Machine Learning Client Library for Python
Home-page: https://github.com/Azure/azure-sdk-for-python
Author: Microsoft Corporation
Author-email: azuresdkengsysadmins@microsoft.com
License: MIT License
Location: /anaconda/envs/azureml_py310_sdkv2/lib/python3.10/site-packages
Requires: azure-common, azure-core, azure-mgmt-core, azure-storage-blob, azure-storage-file-datalake, azure-storage-file-share, colorama, isodate, jsonschema, marshmallow, msrest, opencensus-ext-azure, pydash, pyjwt, pyyaml, strictyaml, tqdm, typing-extensions
Required-by: 
Note: you may need to restart the kernel to use updated packages.


## Initialize workspace

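Initialize the `MLClient` class to get a handle to the workspace you created in the prerequisites step. The cell below passes the subscription ID, resource group, and workspace name explicitly. Alternatively, `MLClient.from_config(credential, path)`
creates the client from the details stored in `config.json`.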
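
As a minimal sketch of the `config.json` route, the helper below writes a file with the three keys that `MLClient.from_config` reads (`subscription_id`, `resource_group`, `workspace_name`). The helper name `write_workspace_config` and the `.azureml` folder in the usage comment are assumptions for illustration, not part of the SDK.

```python
import json
from pathlib import Path


def write_workspace_config(folder, subscription_id, resource_group, workspace_name):
    """Write a config.json holding the workspace details (hypothetical helper)."""
    config = {
        "subscription_id": subscription_id,
        "resource_group": resource_group,
        "workspace_name": workspace_name,
    }
    path = Path(folder) / "config.json"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(config, indent=2))
    return path


# Usage (placeholder values are assumptions; substitute your own workspace details):
# config_path = write_workspace_config(
#     ".azureml", "<subscription-id>", "<resource-group>", "<workspace-name>"
# )
# from azure.ai.ml import MLClient
# from azure.identity import DefaultAzureCredential
# ml_client = MLClient.from_config(
#     credential=DefaultAzureCredential(), path=str(config_path)
# )
```

Keeping the identifiers in a file rather than in the notebook makes it easier to share the notebook without exposing a subscription ID.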

In [2]:
from azure.ai.ml import MLClient
from azure.identity import DefaultAzureCredential


# Get a handle to the workspace
ml_client = MLClient(
    credential=DefaultAzureCredential(),
    subscription_id="fc4f4a6b-4041-4b1c-8249-854d68edcf62",
    resource_group_name="rapidsai-deployment",
    workspace_name="rapids-aml-cluster",
)

print(
    "Workspace name: " + ml_client.workspace_name,
    "Subscription id: " + ml_client.subscription_id,
    "Resource group: " + ml_client.resource_group_name,
    sep="\n",
)

Workspace name: rapids-aml-cluster
Subscription id: fc4f4a6b-4041-4b1c-8249-854d68edcf62
Resource group: rapidsai-deployment


## Access data from Datastore URI

In this example, we will use 20 million rows (samples) of the airline dataset. The [datastore URI](https://learn.microsoft.com/en-us/azure/machine-learning/how-to-access-data-interactive?tabs=adls#access-data-from-a-datastore-uri-like-a-filesystem-preview) below references a data storage location (path) containing the parquet files. You can download the files to your local computer or mount them on your AML compute.


In [3]:
datastore_name = "workspaceartifactstore"
dataset = "airline_20000000.parquet"

# Datastore uri format:
data_uri = f"azureml://subscriptions/{ml_client.subscription_id}/resourcegroups/{ml_client.resource_group_name}/workspaces/{ml_client.workspace_name}/datastores/{datastore_name}/paths/{dataset}"

print("data uri:", "\n", data_uri)

data uri: 
 azureml://subscriptions/fc4f4a6b-4041-4b1c-8249-854d68edcf62/resourcegroups/rapidsai-deployment/workspaces/rapids-aml-cluster/datastores/workspaceartifactstore/paths/airline_20000000.parquet
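
The URI follows a fixed layout, so a typo in any segment is easy to catch before a job is submitted. The sketch below builds and parses a URI of the same shape as the one printed above. `build_datastore_uri` and `parse_datastore_uri` are hypothetical helpers written for this illustration, not SDK functions.

```python
import re

# Layout of the datastore URI used in this notebook.
URI_PATTERN = re.compile(
    r"^azureml://subscriptions/(?P<subscription>[^/]+)"
    r"/resourcegroups/(?P<resource_group>[^/]+)"
    r"/workspaces/(?P<workspace>[^/]+)"
    r"/datastores/(?P<datastore>[^/]+)"
    r"/paths/(?P<path>.+)$"
)


def build_datastore_uri(subscription, resource_group, workspace, datastore, path):
    """Assemble a datastore URI from its parts."""
    return (
        f"azureml://subscriptions/{subscription}/resourcegroups/{resource_group}"
        f"/workspaces/{workspace}/datastores/{datastore}/paths/{path}"
    )


def parse_datastore_uri(uri):
    """Split a datastore URI into its parts, or raise ValueError if malformed."""
    match = URI_PATTERN.match(uri)
    if match is None:
        raise ValueError(f"not a datastore URI: {uri!r}")
    return match.groupdict()


# Placeholder values; in the notebook these come from ml_client.
uri = build_datastore_uri(
    "<subscription-id>", "<resource-group>", "<workspace-name>",
    "workspaceartifactstore", "airline_20000000.parquet",
)
print(parse_datastore_uri(uri)["path"])
```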


## Create AML compute

You will need to create a [compute target](https://docs.microsoft.com/azure/machine-learning/service/concept-azure-machine-learning-architecture#compute-target) for training your model. In this notebook, we will use Azure ML managed compute ([AmlCompute](https://docs.microsoft.com/azure/machine-learning/service/how-to-set-up-training-targets#amlcompute)) for our remote training using a dynamically scalable pool of compute resources.

The cluster below scales up to 5 nodes for hyperparameter optimization; you can modify `max_instances` based on the quota available in your region. As with other Azure ML services, AmlCompute is subject to limits. This [article](https://docs.microsoft.com/en-us/azure/machine-learning/service/how-to-manage-quotas) details the default limits and how to request more quota.

`size` specifies the virtual machine type and size used in the cluster. RAPIDS requires NVIDIA Pascal or newer architecture, so
you will need to select a compute target from one of the
[GPU virtual machines in Azure](https://docs.microsoft.com/en-us/azure/virtual-machines/sizes-gpu) provisioned with P100, P40, or V100 GPUs: `NC_v2`, `NC_v3`, `ND`, or `ND_v2`.

Let's create an `AmlCompute` cluster of `Standard_NC12s_v3` GPU VMs:

In [4]:
from azure.ai.ml.entities import AmlCompute

# specify aml compute name.
gpu_compute_target = "gpu-cluster"

try:
    # let's see if the compute target already exists
    gpu_target = ml_client.compute.get(gpu_compute_target)
    print(f"found compute target. Will use {gpu_compute_target}")
except Exception:
    # the lookup failed (for example, the cluster does not exist yet), so create it
    print("Creating a new gpu compute target...")

    gpu_target = AmlCompute(
        name=gpu_compute_target,
        type="amlcompute",
        size="STANDARD_NC12S_V3",
        max_instances=5,
        idle_time_before_scale_down=300,
    )
    ml_client.compute.begin_create_or_update(gpu_target).result()

    print(
        f"AMLCompute with name {gpu_target.name} is created, the compute size is {gpu_target.size}"
    )

found compute target. Will use gpu-cluster



## Prepare training script

Create a project directory with your code to run on the remote resource. This includes the training script and additional files your training script depends on. In this example, the training script is provided:

`train_rapids.py`: entry script for the RAPIDS environment. It loads the dataset into a cuDF DataFrame, trains a Random Forest model, and runs inference using cuML.

Copy the training script `train_rapids.py` into your project directory:

In [None]:
import os

project_folder = "./train_rapids"  # create folder in same dir
os.makedirs(project_folder, exist_ok=True)

In [None]:
import shutil


# the scripts live in ../code relative to the notebook's working directory
notebook_path = os.path.realpath(os.path.join(os.getcwd(), "..", "code"))
rapids_script = os.path.join(notebook_path, "train_rapids.py")
azure_script = os.path.join(notebook_path, "rapids_csp_azure.py")


shutil.copy(rapids_script, project_folder)
shutil.copy(azure_script, project_folder)
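
The contents of `train_rapids.py` are not reproduced in this notebook. As a rough illustration of how an entry script might read its arguments, the sketch below parses the same flag names that the job command later in this notebook passes. The types and default values are assumptions, and the real script may differ.

```python
import argparse


def parse_args(argv=None):
    """Parse the flags used by the training job (illustrative sketch only)."""
    parser = argparse.ArgumentParser(description="RAPIDS Random Forest training")
    parser.add_argument("--data_dir", type=str, required=True, help="path to the parquet data")
    parser.add_argument("--n_bins", type=int, default=32, help="number of histogram bins")
    parser.add_argument("--compute", type=str, default="single-GPU", help="single-GPU or multi-GPU")
    parser.add_argument("--cv_folds", type=int, default=5, help="number of cross-validation folds")
    parser.add_argument("--n_estimators", type=int, default=50, help="number of trees")
    parser.add_argument("--max_depth", type=int, default=10, help="maximum tree depth")
    parser.add_argument("--max_features", type=float, default=1.0, help="fraction of features per split")
    return parser.parse_args(argv)


# Simulate the command line that a job would produce.
args = parse_args(["--data_dir", "/tmp/airline_20000000.parquet", "--n_estimators", "100"])
print(args.n_estimators, args.max_depth, args.max_features)  # 100 10 1.0
```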

## Train model on the remote compute

Now that you have your data and training script prepared, you are ready to train on your remote compute:

### Create experiment

Track all the runs in your workspace under a single experiment name.

In [None]:
experiment_name = "test_rapids_gpu_cluster"

### Use Custom Docker Image

We'll use a [custom](https://learn.microsoft.com/en-us/azure/machine-learning/how-to-manage-environments-v2?tabs=python#create-an-environment-from-a-docker-image) RAPIDS Docker image to set up the environment. The image is available in the [rapidsai/rapidsai repo](https://hub.docker.com/r/rapidsai/rapidsai/) on DockerHub.

In [None]:
from azure.ai.ml.entities import Environment, BuildContext

env_docker_image = Environment(
    build=BuildContext(path="./docker"),
    name="rapids-docker-image-2302",
    description="Rapids v23.02 Environment",
)

ml_client.environments.create_or_update(env_docker_image)

### Submit the training job 

We will configure and run a training job using the `command` function. A [command](https://learn.microsoft.com/en-us/python/api/azure-ai-ml/azure.ai.ml?view=azure-python#azure-ai-ml-command) can be run as a standalone job or used as a function inside pipelines.
`inputs` is a dictionary of values that are substituted into the `${{inputs.<name>}}` placeholders of the command line passed to the training script.


In [None]:
from azure.ai.ml import command, Input


command_job = command(
    environment="rapids-docker-image-2302:2",
    experiment_name=experiment_name,
    code=project_folder,
    command=(
        "python train_rapids.py --data_dir ${{inputs.data_dir}} --n_bins ${{inputs.n_bins}} "
        "--compute ${{inputs.compute}} --cv_folds ${{inputs.cv_folds}} "
        "--n_estimators ${{inputs.n_estimators}} --max_depth ${{inputs.max_depth}} "
        "--max_features ${{inputs.max_features}}"
    ),
    inputs={
        "data_dir": Input(type="uri_file", path=data_uri),
        "n_bins": 32,
        "compute": "single-GPU",  # multi-GPU for algorithms via Dask
        "cv_folds": 5,
        "n_estimators": 50,
        "max_depth": 10,
        "max_features": 1.0,
    },
    compute="gpu-cluster",
)


# submit the command
returned_job = ml_client.jobs.create_or_update(command_job)

# get a URL for the status of the job
returned_job.studio_url
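
To make the placeholder mechanism concrete, here is a small sketch that mimics how `${{inputs.<name>}}` entries in a command string are filled from a dictionary. This only illustrates the idea and is not Azure ML's implementation. `render_command` is a hypothetical helper. For data inputs such as `uri_file`, the service substitutes a path on the compute rather than the raw value.

```python
import re


def render_command(template, inputs):
    """Replace each ${{inputs.<name>}} placeholder with str(inputs[name])."""

    def substitute(match):
        # match.group(1) is the input name captured after "inputs."
        return str(inputs[match.group(1)])

    return re.sub(r"\$\{\{\s*inputs\.(\w+)\s*\}\}", substitute, template)


template = "python train_rapids.py --n_bins ${{inputs.n_bins}} --max_depth ${{inputs.max_depth}}"
print(render_command(template, {"n_bins": 32, "max_depth": 10}))
# python train_rapids.py --n_bins 32 --max_depth 10
```

A placeholder with no matching key raises `KeyError` in this sketch, which mirrors the need to declare every referenced input in `inputs`.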

## Tune model hyperparameters

We can optimize our model's hyperparameters and improve the accuracy using Azure Machine Learning's hyperparameter tuning capabilities.

### Start a hyperparameter sweep

Let's define the hyperparameter space to sweep over. We will tune `n_estimators`, `max_depth` and `max_features` parameters. In this example we will use random sampling to try different configuration sets of hyperparameters and maximize `Accuracy`.
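
Conceptually, random sampling draws each hyperparameter independently for every trial. The pure-Python sketch below illustrates that idea over the same ranges used in this notebook. It is not the Azure ML sampler, and the tuple-based `search_space` encoding and `sample_configuration` helper are assumptions made for the illustration.

```python
import random

# Search space mirroring the sweep in this notebook:
# ("choice", values) picks one value; ("uniform", low, high) draws a float.
search_space = {
    "n_estimators": ("choice", list(range(50, 500))),
    "max_depth": ("choice", list(range(5, 19))),
    "max_features": ("uniform", 0.2, 1.0),
}


def sample_configuration(space, rng):
    """Draw one hyperparameter configuration at random."""
    config = {}
    for name, spec in space.items():
        kind = spec[0]
        if kind == "choice":
            config[name] = rng.choice(spec[1])
        elif kind == "uniform":
            config[name] = rng.uniform(spec[1], spec[2])
        else:
            raise ValueError(f"unknown distribution: {kind}")
    return config


rng = random.Random(0)  # fixed seed so the draws are repeatable
trials = [sample_configuration(search_space, rng) for _ in range(10)]
for trial in trials:
    print(trial)
```

Each printed dictionary corresponds to one trial that a sweep would launch with those values.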

In [None]:
from azure.ai.ml.sweep import Choice, Uniform

# calling the command job with new input values returns a copy to sweep over
command_job_for_sweep = command_job(
    n_estimators=Choice(values=list(range(50, 500))),
    max_depth=Choice(values=list(range(5, 19))),
    max_features=Uniform(min_value=0.2, max_value=1.0),
)

# apply the sweep parameter to obtain the sweep_job
sweep_job = command_job_for_sweep.sweep(
    compute="gpu-cluster",
    sampling_algorithm="random",
    primary_metric="Accuracy",
    goal="Maximize",
)

# define the limits for this sweep (timeout is in seconds)
sweep_job.set_limits(max_total_trials=10, max_concurrent_trials=5, timeout=300)

This will launch the RAPIDS training script with parameters that were specified in the cell above.

In [None]:
# submit the hpo job
returned_sweep_job = ml_client.create_or_update(sweep_job)

# get a URL for the status of the job
returned_sweep_job.studio_url

## Monitor SweepJob runs

Monitor and view the progress of the machine learning training runs with MLflow.

### Find and register best model

In [None]:
# Download the best trial's model output.
# `jobs.download` returns None and writes the files under the current directory by default.
ml_client.jobs.download(returned_sweep_job.name, output_name="best_model")

List the model files uploaded during the run:

Register the folder (and all files in it) as a model named `train-rapids` under the workspace for deployment.
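
The two steps above could be sketched as follows. `list_model_files` and `register_model_folder` are hypothetical helpers, and the `./named-outputs/best_model` location in the usage comment is an assumption about where the download lands. The registration uses `Model` with `AssetTypes.CUSTOM_MODEL` and `ml_client.models.create_or_update`, imported inside the function so the listing helper works without the SDK installed.

```python
import os


def list_model_files(root):
    """Return the relative paths of all files under root, sorted."""
    files = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for filename in filenames:
            full_path = os.path.join(dirpath, filename)
            files.append(os.path.relpath(full_path, root))
    return sorted(files)


def register_model_folder(ml_client, folder, name="train-rapids"):
    """Register a local folder as a custom model in the workspace."""
    from azure.ai.ml.constants import AssetTypes
    from azure.ai.ml.entities import Model

    model = Model(
        path=folder,
        name=name,
        type=AssetTypes.CUSTOM_MODEL,
        description="Best model from the RAPIDS hyperparameter sweep",
    )
    return ml_client.models.create_or_update(model)


# Usage (the folder location is an assumption; check where the download landed):
# model_folder = "./named-outputs/best_model"
# print(list_model_files(model_folder))
# registered = register_model_folder(ml_client, model_folder)
# print(registered.name, registered.version)
```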

## Delete cluster

In [None]:
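# Uncomment to delete the cluster once you no longer need it:
# ml_client.compute.begin_delete(gpu_compute_target).wait()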