# Train and hyperparameter tune with RAPIDS

## Prerequisites

- Create an Azure ML Workspace and set up the environment on your local computer by following the steps in [Azure README.md](https://github.com/rapidsai/cloud-ml-examples/blob/main/azure/README.md)

In [None]:
# !pip install azure-ai-ml

In [None]:
# check Azure ML SDK version

!pip show azure-ai-ml

## Initialize workspace

Load and initialize a [Workspace](https://docs.microsoft.com/azure/machine-learning/service/concept-azure-machine-learning-architecture#workspace) object from the existing workspace you created in the prerequisites step.

In [None]:
from azure.ai.ml import MLClient
from azure.identity import DefaultAzureCredential

# Enter details of your Azure Machine Learning workspace
subscription_id = '<SUBSCRIPTION_ID>'
resource_group = '<RESOURCE_GROUP>'
workspace = '<AZUREML_WORKSPACE_NAME>'
datastore_name = '<datastore>'
path_on_datastore = 'airline_20000000.parquet'

# connect to the workspace
ml_client = MLClient(DefaultAzureCredential(), subscription_id, resource_group, workspace)

## Create a FileDataset

In this example, we will use 20 million rows (samples) of the [airline dataset](http://kt.ijs.si/elena_ikonomovska/data.html). The [FileDataset](https://docs.microsoft.com/en-us/python/api/azureml-core/azureml.data.filedataset?view=azure-ml-py) below references parquet files that have been uploaded to a public [Azure Blob storage](https://docs.microsoft.com/en-us/azure/storage/blobs/storage-blobs-overview). You can download the files to your local computer or mount them on your AML compute.

In [None]:
# from azureml.fsspec import AzureMachineLearningFileSystem

# fs = AzureMachineLearningFileSystem("azureml://subscriptions/fc4f4a6b-4041-4b1c-8249-854d68edcf62/resourcegroups/rapids-deployment-doc/workspaces/skirui-azureml-rapids/datastores/workspaceartifactstore/paths/airline_20000000.parquet")
# fs.ls()

In [None]:
# long-form Datastore uri format:
uri = f"azureml://subscriptions/{subscription_id}/resourcegroups/{resource_group}/workspaces/{workspace}/datastores/{datastore_name}/paths/{path_on_datastore}"
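The long-form URI is a fixed template filled with five values, so a typo or an unfilled `<PLACEHOLDER>` only surfaces later when the filesystem is opened. A minimal sketch of a helper that builds the URI and fails early instead; the function name `build_datastore_uri` and the placeholder check are illustrative assumptions and are not part of the Azure SDK.

```python
def build_datastore_uri(subscription_id, resource_group, workspace,
                        datastore_name, path_on_datastore):
    """Return the long-form azureml:// URI for a path on a datastore."""
    parts = {
        "subscription_id": subscription_id,
        "resource_group": resource_group,
        "workspace": workspace,
        "datastore_name": datastore_name,
        "path_on_datastore": path_on_datastore,
    }
    # Reject empty values and template placeholders such as '<SUBSCRIPTION_ID>'.
    for key, value in parts.items():
        if not value or (value.startswith("<") and value.endswith(">")):
            raise ValueError(f"{key} is not set: {value!r}")
    return (
        f"azureml://subscriptions/{subscription_id}"
        f"/resourcegroups/{resource_group}"
        f"/workspaces/{workspace}"
        f"/datastores/{datastore_name}"
        f"/paths/{path_on_datastore}"
    )
```

The returned string can be passed to `AzureMachineLearningFileSystem` in the same way as the hand-written `uri`.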

In [None]:
from azureml.fsspec import AzureMachineLearningFileSystem

fs = AzureMachineLearningFileSystem(uri)
fs.ls()

## Create AML compute

You will need to create a [compute target](https://docs.microsoft.com/azure/machine-learning/service/concept-azure-machine-learning-architecture#compute-target) for training your model. In this notebook, we will use Azure ML managed compute ([AmlCompute](https://docs.microsoft.com/azure/machine-learning/service/how-to-set-up-training-targets#amlcompute)) for our remote training using a dynamically scalable pool of compute resources.

The cluster created below scales up to `max_instances` nodes for hyperparameter optimization; you can modify `max_instances` based on the available quota in the desired region. As with other Azure services, there are limits on AmlCompute. This [article](https://docs.microsoft.com/en-us/azure/machine-learning/service/how-to-manage-quotas) includes details on the default limits and how to request more quota.

`size` describes the virtual machine type and size that will be used in the cluster. RAPIDS requires NVIDIA Pascal or newer architecture, so you will need to specify compute targets from one of the `NC_v2`, `NC_v3`, `ND` or `ND_v2` [GPU virtual machines in Azure](https://docs.microsoft.com/en-us/azure/virtual-machines/sizes-gpu); these VMs are provisioned with P100, P40 and V100 GPUs.

Let's create an `AmlCompute` cluster of `Standard_NC12s_v3` GPU VMs:

In [1]:
from azure.ai.ml.entities import AmlCompute

# choose a name for your AML compute cluster
gpu_compute_target = "gpu-cluster"

# check if desired amlcompute exists, if not create new
try:
    ml_client.compute.get(gpu_compute_target)
except Exception:
    print("Creating a new gpu compute target...")

    gpu_compute = AmlCompute(
        name=gpu_compute_target,
        size="STANDARD_NC12S_V3",
        min_instances=0,
        max_instances=4,
    )
    ml_client.compute.begin_create_or_update(gpu_compute).result()


## Prepare training script

Create a project directory that contains your code, including the training script and additional files / dependencies. In this example, the training script is provided:
<br>
`train_rapids.py` - entry script for the RAPIDS Estimator that loads the dataset into a cuDF DataFrame, trains a Random Forest, and runs inference using cuML.

In [None]:
import os

project_folder = "./train_rapids"  # create folder in same dir
os.makedirs(project_folder, exist_ok=True)

We will log some metrics by using the `Run` object within the training script:

```python
from azureml.core.run import Run
run = Run.get_context()
```
 
We will also log the parameters and highest accuracy the model achieves:

```python
run.log('Accuracy', float(accuracy))
```

These run metrics will become particularly important when we begin hyperparameter tuning our model in the 'Tune model hyperparameters' section.

Copy the training script `train_rapids.py` into your project directory:

In [None]:
notebook_path = os.path.realpath(
    "__file__" + "/../../code"
)  # dir containing the training scripts
rapids_script = os.path.join(notebook_path, "train_rapids.py")
azure_script = os.path.join(notebook_path, "rapids_csp_azure.py")

In [None]:
import shutil

shutil.copy(rapids_script, project_folder)
shutil.copy(azure_script, project_folder)

In [None]:
notebook_path
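If `notebook_path` does not resolve to the directory that holds the scripts, `shutil.copy` stops at the first missing file with a bare path error. A minimal sketch of a stricter copy step that checks every script before copying anything; the helper name `copy_scripts` is a hypothetical one introduced here for illustration.

```python
import shutil
from pathlib import Path


def copy_scripts(source_dir, dest_dir, script_names):
    """Copy the named scripts into dest_dir, failing early if any is missing."""
    source_dir, dest_dir = Path(source_dir), Path(dest_dir)
    # Collect all missing files first so the error lists every problem at once.
    missing = [name for name in script_names if not (source_dir / name).is_file()]
    if missing:
        raise FileNotFoundError(f"missing in {source_dir}: {missing}")
    dest_dir.mkdir(parents=True, exist_ok=True)
    # shutil.copy returns the path of the newly created file.
    return [Path(shutil.copy(source_dir / name, dest_dir)) for name in script_names]
```

For this notebook the call would be `copy_scripts(notebook_path, project_folder, ["train_rapids.py", "rapids_csp_azure.py"])`.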

## Train model on the remote compute

Now that you have your data and training script prepared, you are ready to train on your remote compute.

### Create experiment

Create an [Experiment](https://docs.microsoft.com/azure/machine-learning/service/concept-azure-machine-learning-architecture#experiment) to track all the runs in your workspace.

### Use Custom Docker Image

We'll be using a custom Docker image to set up the environment. This is available in [rapidsai/rapidsai-cloud-ml on DockerHub](https://hub.docker.com/r/rapidsai/rapidsai-cloud-ml/tags?page=1&ordering=last_updated). This image contains all necessary packages to run the example on Azure.

In [None]:
from azure.ai.ml.entities import Environment

my_env = Environment(
    image="",  # base image to use
    name="",  # name of the environment
    description="Rapids v23.02 docker container",
)

ml_client.environments.create_or_update(my_env)  # register the environment

In [None]:
from azureml.core import Environment
from azureml.core.runconfig import DockerConfiguration

environment_name = "rapids_hpo"
env = Environment(environment_name)

# enable docker
docker_config = DockerConfiguration(use_docker=True)

# rapids-cloud-ml image is available in Docker Hub
env.docker.base_image = "rapidsai/rapidsai-cloud-ml:latest"

# use rapids environment in the container, don't build a new conda environment
env.python.user_managed_dependencies = True

### Prepare RAPIDS Training Script 

We will use the [ScriptRunConfig](https://docs.microsoft.com/en-us/python/api/azureml-core/azureml.core.scriptrunconfig?view=azure-ml-py) class to submit our job, because [Estimators](https://docs.microsoft.com/en-us/azure/machine-learning/how-to-migrate-from-estimators-to-scriptrunconfig) have been deprecated.

`arguments` is a list of command-line arguments to pass to the training script, given as alternating flags and values.
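On the receiving side, the training script has to turn those command-line arguments back into typed values. The actual `train_rapids.py` is not shown in this notebook, so the following is only a sketch of how such a script might parse them with `argparse`. The flag names come from this notebook; the function name `parse_args`, the types, and the defaults for the tuned hyperparameters are assumptions.

```python
import argparse


def parse_args(argv=None):
    """Parse the flags this notebook passes to the training script."""
    parser = argparse.ArgumentParser(description="RAPIDS training script (sketch)")
    # Fixed arguments supplied through `arguments`.
    parser.add_argument("--data_dir", type=str, required=True)
    parser.add_argument("--n_bins", type=int, default=32)
    parser.add_argument("--compute", type=str, default="single-GPU")
    parser.add_argument("--cv_folds", type=int, default=5)
    # Hyperparameters supplied by the sweep; these defaults are hypothetical.
    parser.add_argument("--n_estimators", type=int, default=100)
    parser.add_argument("--max_depth", type=int, default=16)
    parser.add_argument("--max_features", type=float, default=1.0)
    return parser.parse_args(argv)
```

Passing `argv=None` makes `argparse` read `sys.argv`, which is what happens when the job launches the script; passing an explicit list makes the parser easy to check locally.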

In [None]:
# airline_ds=fs

In [None]:
# from azureml.core import ScriptRunConfig

# arguments = [
#     '--data_dir', airline_ds,
#     '--n_bins', 32,
#     '--compute', 'single-GPU', # set to multi-GPU for algorithms via Dask
#     '--cv_folds', 5,
# ]

# src = ScriptRunConfig(source_directory=project_folder,
#                       arguments=arguments,
#                       compute_target=gpu_cluster,
#                       script='train_rapids.py',
#                       environment=env, #docker is the environment
#                       docker_runtime_config=docker_config)

## Tune model hyperparameters

We can optimize our model's hyperparameters and improve the accuracy using Azure Machine Learning's hyperparameter tuning capabilities.

### Start a hyperparameter sweep

Let's define the hyperparameter space to sweep over. We will tune `n_estimators`, `max_depth` and `max_features` parameters. In this example we will use random sampling to try different configuration sets of hyperparameters and maximize `Accuracy`.
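To make random sampling concrete, here is a plain-Python sketch of what drawing configurations from this space amounts to. It is an illustration only and not how HyperDrive is implemented; `SEARCH_SPACE`, `sample_configs`, and the `seed` parameter are names assumed for this example.

```python
import random

# Each entry draws one value: a discrete choice or a continuous uniform sample.
SEARCH_SPACE = {
    "--n_estimators": lambda rng: rng.choice(range(50, 500)),
    "--max_depth": lambda rng: rng.choice(range(5, 19)),
    "--max_features": lambda rng: rng.uniform(0.2, 1.0),
}


def sample_configs(n_runs, seed=0):
    """Draw n_runs independent configurations from SEARCH_SPACE."""
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    return [
        {name: draw(rng) for name, draw in SEARCH_SPACE.items()}
        for _ in range(n_runs)
    ]


for config in sample_configs(3):
    print(config)
```

Each dictionary corresponds to one child run: its keys are the flags appended to the training script's command line and its values are the sampled settings.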

In [None]:
from azureml.train.hyperdrive.runconfig import HyperDriveConfig
from azureml.train.hyperdrive.sampling import RandomParameterSampling
from azureml.train.hyperdrive.run import PrimaryMetricGoal
from azureml.train.hyperdrive.parameter_expressions import choice, loguniform, uniform

param_sampling = RandomParameterSampling(
    {
        "--n_estimators": choice(range(50, 500)),
        "--max_depth": choice(range(5, 19)),
        "--max_features": uniform(0.2, 1.0),
    }
)

hyperdrive_run_config = HyperDriveConfig(
    run_config=src,
    hyperparameter_sampling=param_sampling,
    primary_metric_name="Accuracy",
    primary_metric_goal=PrimaryMetricGoal.MAXIMIZE,
    max_total_runs=10,
    max_concurrent_runs=5,
)

This will launch the RAPIDS training script with parameters that were specified in the cell above.

In [None]:
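from azureml.core import Experiment

# start the HyperDrive run
# NOTE: `ws` (an azureml.core Workspace) and `experiment_name` must be defined before running this cell
run = Experiment(ws, experiment_name).submit(hyperdrive_run_config)
run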

## Monitor HyperDrive runs

Monitor and view the progress of the machine learning training run with a [Jupyter widget](https://docs.microsoft.com/en-us/python/api/azureml-widgets/azureml.widgets?view=azure-ml-py). The widget is asynchronous and provides live updates every 10-15 seconds until the job completes.

In [None]:
from azureml.widgets import RunDetails

RunDetails(run).show()

In [None]:
run.wait_for_completion(show_output=True)

In [None]:
# run.cancel()

### Find and register best model

In [None]:
best_run = run.get_best_run_by_primary_metric()
print(best_run.get_details()["runDefinition"]["arguments"])
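The printed arguments are easier to read as a mapping from hyperparameter name to value. A minimal sketch, assuming `best_run.get_details()["runDefinition"]["arguments"]` is a flat list of alternating `--flag` and value entries, like the `arguments` list passed to the training script; the helper name `arguments_to_dict` is hypothetical.

```python
def arguments_to_dict(arguments):
    """Pair up ['--flag', value, ...] into {'flag': value}.

    Assumes every flag is followed by exactly one value (no boolean switches).
    """
    params = {}
    items = iter(arguments)
    for item in items:
        if isinstance(item, str) and item.startswith("--"):
            # Consume the next entry as this flag's value.
            params[item[2:]] = next(items, None)
    return params


example = ["--data_dir", "data", "--n_estimators", "120", "--max_features", "0.5"]
print(arguments_to_dict(example))
```

Values come back exactly as they are stored in the list, so convert them with `int` or `float` before reusing them.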

List the model files uploaded during the run:

In [None]:
print(best_run.get_file_names())

Register the model file as a model named `train-rapids` under the workspace for deployment:

In [None]:
# model = best_run.register_model(model_name='train-rapids', model_path='outputs/model-rapids.joblib')

## Delete cluster

In [None]:
# ml_client.compute.begin_delete(gpu_compute_target).wait()