# Multi-node multi-GPU example on Azure using dask-cloudprovider

[Dask Cloud Provider](https://cloudprovider.dask.org/en/latest/) is a native cloud integration for Dask. It helps manage Dask clusters on different cloud platforms. In this notebook, we will look at how we can use the package to set up an Azure cluster and run a multi-node, multi-GPU example with [RAPIDS](https://rapids.ai/). RAPIDS provides a suite of libraries that accelerate data science pipelines entirely on the GPU. As we will see in this notebook, this can be scaled to multiple nodes using Dask.

For the purposes of this demo, we will use a part of the [NYC Taxi Dataset (Yellow Taxi) from Azure Open Datasets](https://docs.microsoft.com/en-us/azure/open-datasets/dataset-taxi-yellow?tabs=azureml-opendatasets). The goal is to predict the fare amount for a trip given its pickup and dropoff times and coordinates.

Before running the notebook, run the following commands in a terminal to set up the Azure CLI:
```
pip install azure-cli
az login
```
Then follow the instructions in the prompt to finish setting up the account.

The packages needed for this notebook are listed in the cell below. Uncomment and run the cell to install them.

In [None]:
# !pip install "dask-cloudprovider[azure]"
# !pip install azureml-core

# # Run the statements below one after the other in order.
# !pip install azureml-opendatasets
# !pip install --upgrade pandas

In [None]:
import math
from datetime import datetime
from math import asin, cos, pi, sin, sqrt

import cudf
import dask
import dask_cudf
import numpy as np

# This is a package in preview.
from azureml.opendatasets import NycTlcYellow
from cuml.dask.common import utils as dask_utils
from cuml.dask.ensemble import RandomForestRegressor
from cuml.metrics import mean_squared_error
from dask.distributed import Client, wait
from dask_cloudprovider.azure import AzureVMCluster
from dask_ml.model_selection import train_test_split
from dateutil import parser

# Azure cluster set up

Let us now set up the [Azure cluster](https://cloudprovider.dask.org/en/latest/azure.html) using `AzureVMCluster` from Dask Cloud Provider. To do this, you'll first need to set up a Resource Group, a Virtual Network and a Security Group on Azure. [Learn more about how you can set this up](https://cloudprovider.dask.org/en/latest/azure.html#resource-groups). Note that you can also set it up using the Azure portal directly.

Once you have set it up, plug in the names of the resources you created in the cell below. Finally, note that we use the RAPIDS Docker image to build the VM and use `dask_cuda.CUDAWorker` to run within the VM.

In [None]:
location = "SOUTH CENTRAL US"
resource_group = "RAPIDS-MNMG"
vnet = "dask-vnet"
security_group = "dask-nsg"

vm_size = "Standard_NC12s_v3"
docker_image = "rapidsai/rapidsai:21.06-cuda11.0-runtime-ubuntu18.04-py3.8"
docker_args = '--shm-size=256m'
worker_class = "dask_cuda.CUDAWorker"
worker_options = {'rmm-managed-memory':True}
 
n_workers = 2

## Set up Azure Marketplace VM

We'll use [NVIDIA GPU-Optimized Image for AI and HPC](https://azuremarketplace.microsoft.com/en-us/marketplace/apps/nvidia.ngc_azure_17_11?tab=overview) VM from the Azure Marketplace.  This is a customized image that has all the necessary dependencies and NVIDIA drivers preinstalled. 

This step might require you to accept the [Azure Marketplace Image Terms](https://docs.microsoft.com/en-us/cli/azure/vm/image/terms?view=azure-cli-latest). To accept the terms, run the following command in a cell:
```python
! az vm image terms accept --urn "nvidia:ngc_azure_17_11:ngc-base-version-21-02-2:21.02.2" --verbose
```

_Note: This requires `dask-cloudprovider>=2021.6.0`_

In [None]:
dask.config.set({"logging.distributed": "info",
                 "cloudprovider.azure.azurevm.marketplace_plan": {
                     "publisher": "nvidia",
                     "name": "ngc-base-version-21-02-2",
                     "product": "ngc_azure_17_11",
                     "version": "21.02.2"
                }})
vm_image = ""
config = dask.config.get("cloudprovider.azure.azurevm", {})
config

In [None]:
%%time

cluster = AzureVMCluster(
    location=location,
    resource_group=resource_group,
    vnet=vnet,
    security_group=security_group,
    vm_image=vm_image,
    vm_size=vm_size,
    docker_image=docker_image,
    worker_class=worker_class,
    n_workers=n_workers,
    security=True,
    docker_args=docker_args,
    worker_options=worker_options,
    debug=False,
    bootstrap=False, # This is to prevent the cloud init jinja2 script from running in the custom VM.
)

# Data Cleanup

The data needs to be cleaned up before it can be used in a meaningful way. We keep only the columns we need and cast them to appropriate data types so that the data is ready for computation with cuML.

In [None]:
#create a list of columns & dtypes the df must have
must_haves = {
 'tpepPickupDateTime': 'datetime64[ms]',
 'tpepDropoffDateTime': 'datetime64[ms]',
 'passengerCount': 'int32',
 'tripDistance': 'float32',
 'startLon': 'float32',
 'startLat': 'float32',
 'rateCodeId': 'int32',
 'endLon': 'float32',
 'endLat': 'float32',
 'fareAmount': 'float32'
}

In [None]:
def clean(df_part, must_haves):
    """
    This function performs the various clean up tasks for the data
    and returns the cleaned dataframe.
    """
    # iterate through columns in this df partition
    for col in df_part.columns:
        # drop anything not in our expected list
        if col not in must_haves:
            df_part = df_part.drop(col, axis=1)
            continue

        # fixes datetime error found by Ty Mckercher and fixed by Paul Mahler
        if df_part[col].dtype == 'object' and col in ['tpepPickupDateTime', 'tpepDropoffDateTime']:
            df_part[col] = df_part[col].astype('datetime64[ms]')
            continue

        # if column was read as a string, recast as float
        if df_part[col].dtype == 'object':
            df_part[col] = df_part[col].fillna('-1')
            df_part[col] = df_part[col].astype('float32')
        else:
            # downcast from 64bit to 32bit types
            # GPUs are generally faster on 32bit ops
            if 'int' in str(df_part[col].dtype):
                df_part[col] = df_part[col].astype('int32')
            if 'float' in str(df_part[col].dtype):
                df_part[col] = df_part[col].astype('float32')
            df_part[col] = df_part[col].fillna(-1)
    return df_part

# Add Interesting Features

We'll add new features by applying user-defined functions to the dataframe. We'll make use of [apply_rows](https://docs.rapids.ai/api/cudf/stable/api.html#cudf.core.dataframe.DataFrame.apply_rows), which is similar to pandas' `apply` function. The `apply_rows` operation is [JIT compiled by Numba](https://numba.pydata.org/numba-doc/dev/cuda/kernels.html) into GPU kernels.

The kernels we define are:
1. Haversine distance: This is used for calculating the great-circle distance between the pickup and dropoff points.

2. Day of the week: This can be useful information for determining the fare cost.

The `add_features` function combines the two to produce a new dataframe that has the added features.
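
As a quick CPU-side sanity check of the haversine formula, here is a minimal plain-Python sketch. The function name `haversine_km` is hypothetical and not part of the notebook; it assumes the same Earth radius of 6371 km that the kernel uses.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371  # same radius the GPU kernel uses


def haversine_km(start_lat, start_lon, end_lat, end_lon):
    """Great-circle distance in kilometers between two (lat, lon) points in degrees."""
    lat_1, lon_1 = radians(start_lat), radians(start_lon)
    lat_2, lon_2 = radians(end_lat), radians(end_lon)
    dlat = lat_2 - lat_1
    dlon = lon_2 - lon_1
    a = sin(dlat / 2) ** 2 + cos(lat_1) * cos(lat_2) * sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


# One degree of longitude along the equator is roughly 111.19 km.
print(round(haversine_km(0.0, 0.0, 0.0, 1.0), 2))
```

Comparing a few rows of the GPU output against a reference like this can catch swapped latitude/longitude arguments early.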

In [None]:
def haversine_distance_kernel(startLat, startLon, endLat, endLon, h_distance):
    for i, (x_1, y_1, x_2, y_2) in enumerate(zip(startLat, startLon, endLat, endLon,)):
        x_1 = pi/180 * x_1
        y_1 = pi/180 * y_1
        x_2 = pi/180 * x_2
        y_2 = pi/180 * y_2
        
        dlon = y_2 - y_1
        dlat = x_2 - x_1
        a = sin(dlat/2)**2 + cos(x_1) * cos(x_2) * sin(dlon/2)**2
        
        c = 2 * asin(sqrt(a)) 
        r = 6371 # Radius of earth in kilometers
        
        h_distance[i] = c * r

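def day_of_the_week_kernel(day, month, year, day_of_week):
    # Zeller's congruence: 0 = Saturday, 1 = Sunday, ..., 6 = Friday.
    # Assumes years 2000-2099 (the century term c is fixed at 20).
    for i, (d_1, m_1, y_1) in enumerate(zip(day, month, year)):
        # January and February count as months 13 and 14 of the previous year
        if month[i] <3:
            shift = 12
        else:
            shift = 0
        Y = year[i] - (month[i] < 3)
        y = Y - 2000
        c = 20
        d = day[i]
        m = month[i] + shift + 1
        day_of_week[i] = (d + math.floor(m*2.6) + y + (y//4) + (c//4) -2*c)%7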
        
def add_features(df):
    df['hour'] = df['tpepPickupDateTime'].dt.hour
    df['year'] = df['tpepPickupDateTime'].dt.year
    df['month'] = df['tpepPickupDateTime'].dt.month
    df['day'] = df['tpepPickupDateTime'].dt.day
    df['diff'] = df['tpepDropoffDateTime'].astype('int32') - df['tpepPickupDateTime'].astype('int32')
    
    df['pickup_latitude_r'] = df['startLat']//.01*.01
    df['pickup_longitude_r'] = df['startLon']//.01*.01
    df['dropoff_latitude_r'] = df['endLat']//.01*.01
    df['dropoff_longitude_r'] = df['endLon']//.01*.01
    
    df = df.drop('tpepDropoffDateTime', axis=1)
    df = df.drop('tpepPickupDateTime', axis =1)
    
    
    df = df.apply_rows(haversine_distance_kernel,
                   incols=['startLat', 'startLon', 'endLat', 'endLon'],
                   outcols=dict(h_distance=np.float32),
                   kwargs=dict())
    
    
    df = df.apply_rows(day_of_the_week_kernel,
                      incols=['day', 'month', 'year'],
                      outcols=dict(day_of_week=np.float32),
                      kwargs=dict())
    
    
    df['is_weekend'] = (df['day_of_week']<2)
    return df
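
The day-of-week kernel has the shape of Zeller's congruence, which numbers days starting from 0 = Saturday (so `day_of_week < 2` selects weekends). Below is a minimal CPU sketch of that formula, checked against Python's `datetime`. `zeller_day_of_week` is a hypothetical helper for illustration only and is not part of the notebook.

```python
from datetime import date, timedelta


def zeller_day_of_week(day, month, year):
    """Zeller's congruence for the Gregorian calendar.

    Returns 0 for Saturday, 1 for Sunday, ..., 6 for Friday.
    """
    # January and February are treated as months 13 and 14 of the previous year.
    if month < 3:
        month += 12
        year -= 1
    k = year % 100   # year within the century
    j = year // 100  # zero-based century
    return (day + (13 * (month + 1)) // 5 + k + k // 4 + j // 4 - 2 * j) % 7


# Compare against the standard library for every day of 2018.
# date.weekday() uses Monday = 0, so shift by 2 to match Zeller's numbering.
current = date(2018, 1, 1)
mismatches = 0
while current.year == 2018:
    expected = (current.weekday() + 2) % 7
    if zeller_day_of_week(current.day, current.month, current.year) != expected:
        mismatches += 1
    current += timedelta(days=1)
print("mismatches:", mismatches)
```

A check like this is most useful for January and February dates, where the month shift matters.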


In [None]:
def scale_workers(client, n_workers, n_gpus_per_worker, timeout=300):
    import time
    client.cluster.scale(n_workers)
    m = len(client.has_what().keys())    
    start = end = time.perf_counter_ns()
    while ((m != n_workers*n_gpus_per_worker) and (((end - start) / 1e9) < timeout) ):
        time.sleep(5)
        m = len(client.has_what().keys())
        end = time.perf_counter_ns()
    if (((end - start) / 1e9) >= timeout):
        raise RuntimeError(f"Failed to rescale cluster in {timeout} sec. "
              "Try increasing timeout for very large containers, and verify available compute resources.")

## Client set up

The cell below creates a [Dask Client](https://distributed.dask.org/en/latest/client.html) connected to the cluster we defined earlier in the notebook, which runs on the Azure VMs. Note that we have to scale the cluster; to do that, we'll use the `scale_workers` function. This is the step where the workers are allocated.
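
The poll-until-ready pattern that `scale_workers` follows can be tried without a cluster. The sketch below is an illustration only: `wait_until` and `workers_ready` are hypothetical names, and the predicate is a stand-in for a real check on the number of connected workers.

```python
import time


def wait_until(predicate, timeout=300, interval=5):
    """Poll `predicate` until it returns True or `timeout` seconds elapse."""
    deadline = time.monotonic() + timeout
    while not predicate():
        if time.monotonic() >= deadline:
            raise RuntimeError(f"Condition not met within {timeout} sec.")
        time.sleep(interval)


# Stand-in for "all workers have joined": becomes true on the third check.
checks = {"count": 0}


def workers_ready():
    checks["count"] += 1
    return checks["count"] >= 3


wait_until(workers_ready, timeout=2, interval=0.01)
print("checks made:", checks["count"])
```

With a real client, the predicate could compare `len(client.has_what())` against the expected number of GPU workers, as `scale_workers` does.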

In [None]:
client = Client(cluster)
# Scale workers and wait for workers to be up and running
# Number of GPUs per node for the VM we've spun up is 2
scale_workers(client, n_workers, 2, timeout=600) # Run this just once per cluster
client.wait_for_workers(n_workers)
client

## Machine Learning Workflow

Once workers become available, we can run the rest of our workflow:

- read and clean the data
- add features
- split into training and validation sets
- fit a random forest (RF) model
- predict on the validation set
- compute RMSE

Note that for better performance we should ideally perform hyperparameter optimization (HPO).

Refer to the notebooks in the repository for how to perform automated HPO [using RayTune](https://github.com/rapidsai/cloud-ml-examples/blob/main/ray/notebooks/Ray_RAPIDS_HPO.ipynb) and [using Optuna](https://github.com/rapidsai/cloud-ml-examples/blob/main/optuna/notebooks/optuna_rapids.ipynb).

Let's get started by reading the data into the notebook.



In [None]:
end_date = parser.parse('2018-06-01')
start_date = parser.parse('2018-05-01')
nyc_tlc = NycTlcYellow(start_date=start_date, end_date=end_date)
nyc_tlc_df = nyc_tlc.to_pandas_dataframe()

Let's look at the data locally to see what we're dealing with. We see that there are columns for pickup and dropoff times, distance, latitude, longitude, and so on. This is the information we'll use to estimate the trip fare amount.

In [None]:
nyc_tlc_df.head()

This is a pandas dataframe. We'll convert it into a dask_cudf dataframe to distribute it across all available Dask workers.

In [None]:
# As mentioned before, our VMs each have 2 GPUs, so we will partition among n_workers*2
df = dask_cudf.from_cudf(cudf.from_pandas(nyc_tlc_df), npartitions=n_workers * 2)

This step cleans up the data with the functions defined earlier, adds new features and splits the data into training and validation sets.

In [None]:
# Drop unneeded columns and cast the rest to 32-bit types
df = clean(df, must_haves)

# Add new features
taxi_df = df.map_partitions(add_features)

taxi_df = taxi_df.dropna()
taxi_df = taxi_df.astype("float32")

# Split into training and validation sets
X, y = taxi_df.drop(["fareAmount"], axis=1), taxi_df["fareAmount"].astype('float32')
X_train, X_test, y_train, y_test = train_test_split(X, y, shuffle=True)

To train the RandomForestRegressor, we need to persist the data across the available workers. cuML provides a function (`cuml.dask.common.utils.persist_across_workers`, imported above as `dask_utils`) that makes this easier for dask_cudf dataframes.

In [None]:
workers = client.has_what().keys()
X_train, y_train = dask_utils.persist_across_workers(client, [X_train, y_train], workers=workers)

cu_dask_rf = RandomForestRegressor(ignore_empty_partitions=True)
cu_dask_rf = cu_dask_rf.fit(X_train, y_train)
wait(cu_dask_rf.rfs)

Predict taxi fares using the trained model and compute the RMSE.
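
For reference, RMSE is the square root of the mean of the squared differences between actual and predicted values. Here is a minimal plain-Python sketch of that definition; the `rmse` helper and the fare values are made up for illustration and do not come from the dataset.

```python
from math import sqrt


def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return sqrt(sum(squared_errors) / len(squared_errors))


# Made-up fares in dollars, for illustration only.
actual_fares = [10.0, 12.5, 7.0, 30.0]
predicted_fares = [11.0, 12.5, 9.0, 26.0]
print(rmse(actual_fares, predicted_fares))
```

Because the errors are squared before averaging, a single large miss (the last fare here) dominates the score, and the result is in the same units as the fare.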

In [None]:
y_pred = cu_dask_rf.predict(X_test)
score = mean_squared_error(y_pred.compute().to_array(), y_test.compute().to_array())
print("Workflow Complete - RMSE: ", np.sqrt(score))

# Clean Up

Close out the client and cluster.

Note: Do not forget to also delete the Network Security Group and Virtual Network you created.

In [None]:
client.close()
cluster.close()