# Training RAPIDS + Optuna experiment on Azure

## Prerequisites
- Create an Azure ML Workspace and set up the environment on your local computer by following the steps in the [Azure README.md](https://github.com/rapidsai/cloud-ml-examples/blob/main/azure/README.md)

In [None]:
# verify installation and check Azure ML SDK version
import azureml.core

print('SDK version:', azureml.core.VERSION)

In [None]:
from azureml.core.workspace import Workspace

ws = Workspace.from_config()
print('Workspace name: ' + ws.name, 
      'Azure region: ' + ws.location, 
      'Subscription id: ' + ws.subscription_id, 
      'Resource group: ' + ws.resource_group, sep = '\n')

datastore = ws.get_default_datastore()
print("Default datastore's name: {}".format(datastore.name))

## Create AML compute

You will need to create a [compute target](https://docs.microsoft.com/azure/machine-learning/service/concept-azure-machine-learning-architecture#compute-target) for training your model. In this notebook, we will use Azure ML managed compute ([AmlCompute](https://docs.microsoft.com/azure/machine-learning/service/how-to-set-up-training-targets#amlcompute)) for our remote training using a dynamically scalable pool of compute resources.

The cluster below is capped at 5 nodes for hyperparameter optimization; you can modify `max_nodes` based on the available quota in the desired region. Like other Azure services, the Azure Machine Learning service places limits on certain resources (e.g. AmlCompute). [This article](https://docs.microsoft.com/en-us/azure/machine-learning/service/how-to-manage-quotas) includes details on the default limits and how to request more quota.

`vm_size` describes the virtual machine type and size that will be used in the cluster. RAPIDS requires an NVIDIA Pascal or newer GPU architecture, so you will need to specify compute targets from one of the `NC_v2`, `NC_v3`, `ND` or `ND_v2` families of [GPU virtual machines in Azure](https://docs.microsoft.com/en-us/azure/virtual-machines/sizes-gpu); these VMs are provisioned with P100, P40 and V100 GPUs. Let's create an `AmlCompute` cluster of `Standard_NC6s_v3` GPU VMs:

In [None]:
from azureml.core.compute import ComputeTarget, AmlCompute
from azureml.core.compute_target import ComputeTargetException

# choose a name for your cluster
gpu_cluster_name = 'gpu-cluster'

if gpu_cluster_name in ws.compute_targets:
    gpu_cluster = ws.compute_targets[gpu_cluster_name]
    if gpu_cluster and type(gpu_cluster) is AmlCompute:
        print('Found compute target. Will use {0} '.format(gpu_cluster_name))
else:
    print('creating new cluster')
    # the vm_size parameter below can be changed to any RAPIDS-supported VM type;
    # use a VM type with more than one GPU for the multi-GPU option, e.g. Standard_NC12s_v3
    provisioning_config = AmlCompute.provisioning_configuration(
        vm_size='Standard_NC6s_v3',
        max_nodes=5,
        idle_seconds_before_scaledown=300,
    )
    
    # create the cluster
    gpu_cluster = ComputeTarget.create(ws, gpu_cluster_name, provisioning_config)
    
    # can poll for a minimum number of nodes and for a specific timeout 
    # if no min node count is provided it uses the scale settings for the cluster
    gpu_cluster.wait_for_completion(show_output=True, min_node_count=None, timeout_in_minutes=20)
    
# use get_status() to get a detailed status for the current cluster 
print(gpu_cluster.get_status().serialize())
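
The VM-family constraint described above can be checked before provisioning. The following is a minimal sketch, not part of the Azure ML SDK: the helper `is_rapids_capable` and the regular expressions are assumptions based on the usual `Standard_<family><size>s[_vN]` naming of the `NC_v2`, `NC_v3`, `ND` and `ND_v2` families, so treat it as a quick sanity check rather than an authoritative list.

```python
import re

# Assumed naming patterns for the GPU families named in this notebook
# (NC_v2, NC_v3, ND, ND_v2). Illustrative only, not an official Azure list.
RAPIDS_VM_PATTERNS = [
    re.compile(r"^Standard_NC\d+r?s_v[23]$"),  # NC_v2 and NC_v3
    re.compile(r"^Standard_ND\d+r?s(_v2)?$"),  # ND and ND_v2
]


def is_rapids_capable(vm_size: str) -> bool:
    """Return True if vm_size matches one of the assumed RAPIDS-capable families."""
    return any(pattern.match(vm_size) for pattern in RAPIDS_VM_PATTERNS)


# Print the verdict for a few candidate sizes.
for size in ("Standard_NC6s_v3", "Standard_NC12s_v3", "Standard_NC6"):
    print(size, is_rapids_capable(size))
```

A name such as `Standard_NC6` (the original NC series, which predates Pascal) does not match these patterns, so passing it as `vm_size` would be flagged before any cluster is requested.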

## Create experiment

Create an [Experiment](https://docs.microsoft.com/azure/machine-learning/service/concept-azure-machine-learning-architecture#experiment) to track all the runs in your workspace.

In [None]:
from azureml.core import Experiment

experiment_name = 'optuna_rapids'
experiment = Experiment(ws, name=experiment_name)

The [Environment class](https://docs.microsoft.com/en-us/python/api/azureml-core/azureml.core.environment.environment?view=azure-ml-py) allows you to build a Docker image and customize the system that you will use for training. We will build a container image using a RAPIDS container as base image and install necessary packages. This build is necessary only the first time and will take about 15 minutes. The image will be added to your Azure Container Registry and the environment will be cached after the first run, as long as the environment definition remains the same.

In [None]:
from azureml.core import Environment

# create the environment
rapids_env = Environment('rapids_env')

# create the environment inside a Docker container
rapids_env.docker.enabled = True

# specify docker steps as a string. Alternatively, load the string from a file
dockerfile = """
FROM rapidsai/rapidsai:21.06-cuda10.2-runtime-ubuntu18.04-py3.7
RUN apt-get update && \
apt-get install -y fuse && \
apt-get install -y libssl1.0.0 libssl-dev && \
source activate rapids && \
pip install azureml-sdk==1.13.0 && \
pip install azureml-widgets && \
pip install optuna && \
pip install dask_optuna && \
pip install fusepy
"""

# set base image to None since the image is defined by dockerfile
rapids_env.docker.base_image = None
rapids_env.docker.base_dockerfile = dockerfile

# use rapids environment in the container
rapids_env.python.user_managed_dependencies = True

## Acquiring Data

1. Download data from Kaggle - [BNP Paribas Cardif Claims Management](https://www.kaggle.com/c/bnp-paribas-cardif-claims-management/data)
2. Place only the unzipped `train.csv` file in `data_dir` (by default this is set to `'data/'`)
3. The following cell will upload it to your default workspace datastore under `path_on_datastore`.

In [None]:
from azureml.core.dataset import Dataset

data_dir = "data/"
path_on_datastore = 'bnp_upload'

datastore.upload(src_dir=data_dir, target_path=path_on_datastore, overwrite=False, show_progress=True)

ds_data = datastore.path(path_on_datastore)
dataset = Dataset.File.from_files(ds_data)
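
Step 2 above asks for only `train.csv` in `data_dir`, because the upload sends the whole directory to the datastore. A small pre-flight check, run before the upload cell, might look like the sketch below. The helper name `check_data_dir` is hypothetical and uses only the standard library.

```python
from pathlib import Path


def check_data_dir(data_dir: str, expected: str = "train.csv") -> Path:
    """Raise if data_dir holds anything other than the expected file; return its path."""
    root = Path(data_dir)
    if not root.is_dir():
        raise FileNotFoundError(f"{root} does not exist or is not a directory")
    # Collect every entry (files and sub-directories) directly under data_dir.
    names = sorted(entry.name for entry in root.iterdir())
    if names != [expected]:
        raise ValueError(f"expected only {expected!r} in {root}, found {names}")
    return root / expected
```

Calling `check_data_dir(data_dir)` before `datastore.upload(...)` would catch stray files, such as the original zip archive or a notebook checkpoint folder, before they are uploaded.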

In [None]:
from azureml.core import ScriptRunConfig

# mount the uploaded dataset on the compute node and pass the mount path to the script
script_params = ['--data_dir', dataset.as_named_input('bnp_input').as_mount()]

project_folder = "project_folder/"
src = ScriptRunConfig(source_directory=project_folder,
                      script='train_optuna.py',
                      arguments=script_params,
                      compute_target=gpu_cluster_name,
                      environment=rapids_env)
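
The run configuration passes the mounted dataset path to `train_optuna.py` through `--data_dir`. That script is not shown in this notebook, so the following is only a sketch of how its entry point might read the argument and locate `train.csv`. The function names `parse_args` and `find_training_file` are hypothetical, and the recursive search is an assumption that covers both a flat and a nested mount layout.

```python
import argparse
from pathlib import Path


def parse_args(argv=None):
    """Parse the command-line arguments supplied by ScriptRunConfig."""
    parser = argparse.ArgumentParser(description="Hypothetical entry point for train_optuna.py")
    parser.add_argument("--data_dir", type=str, required=True,
                        help="Path where the dataset is mounted on the compute node")
    return parser.parse_args(argv)


def find_training_file(data_dir: str, filename: str = "train.csv") -> Path:
    """Return the first file named `filename` found anywhere under data_dir."""
    matches = sorted(Path(data_dir).rglob(filename))
    if not matches:
        raise FileNotFoundError(f"{filename} not found under {data_dir}")
    return matches[0]
```

Inside the training script, `find_training_file(parse_args().data_dir)` would then give the path to hand to the data-loading code.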


In [None]:
from azureml.widgets import RunDetails

# submit the run and monitor its progress in the notebook widget
run = experiment.submit(config=src)
RunDetails(run).show()


In [None]:
run.wait_for_completion(show_output=False)