# Multi-node multi-GPU example on Azure using dask-cloudprovider

[Dask Cloud Provider](https://cloudprovider.dask.org/en/latest/) is a native cloud integration for Dask. It helps manage Dask clusters on different cloud platforms. In this notebook, we will look at how to use the package to set up an Azure cluster and run a multi-node, multi-GPU example with [RAPIDS](https://rapids.ai/). RAPIDS provides a suite of libraries that accelerate data science pipelines entirely on the GPU. As we will see in this notebook, these pipelines can be scaled to multiple nodes using Dask.

For the purposes of this demo, we will use the NYC Taxi Dataset and run on a part of it.

Before running the notebook, run the following commands in a terminal to set up the Azure CLI:
```
pip install azure-cli
az login
```

The packages needed for this notebook are listed in the cell below. Uncomment and run the cell to install them.

In [None]:
# !pip install "dask-cloudprovider[azure]"
# !pip install "dask-cloudprovider[azure]" --upgrade
# !pip install --upgrade azure-mgmt-network azure-mgmt-compute
# !pip install gcsfs
# !pip install dask_xgboost
# !pip install azureml

In [2]:
from dask_cuda import LocalCUDACluster
from dask.distributed import Client
import dask
import cudf
import dask_cudf

from dask_ml.model_selection import train_test_split
from cuml.metrics import mean_squared_error

from cuml.dask.ensemble import RandomForestRegressor
from cuml.dask.common import utils as dask_utils

import numpy as np
import pandas as pd
import os
from urllib.request import urlretrieve
import gzip

In [3]:
import numpy as np
import numba, xgboost, socket
import dask, dask_cudf
from dask.distributed import Client, wait

# Azure cluster set up

Let us now set up the [Azure cluster](https://cloudprovider.dask.org/en/latest/azure.html) using `AzureVMCluster` from Dask Cloud Provider. To do this, you'll first need to create a Resource Group, a Virtual Network and a Security Group on Azure. [Learn more about how you can set these up](https://cloudprovider.dask.org/en/latest/azure.html#resource-groups). Note that you can also create them directly in the Azure portal.

Once they exist, plug the names of the entities you created into the cell below. Finally, note that we use the RAPIDS Docker image to build the VM and use `dask_cuda.CUDAWorker` as the worker class that runs within the VM.
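
Because the cell below starts with empty placeholders, it can help to check them before requesting any Azure resources. The following is a minimal sketch and not part of dask-cloudprovider: `missing_settings` is a hypothetical helper, and the example values are made up for illustration.

```python
def missing_settings(**settings):
    """Return the names of settings that are None, empty, or whitespace-only."""
    return [
        name
        for name, value in settings.items()
        if value is None or not str(value).strip()
    ]


# Hypothetical values for illustration; substitute the names you created.
example = {
    "location": "westus2",
    "resource_group": "my-dask-rg",
    "vnet": "my-dask-vnet",
    "security_group": "",  # deliberately left empty
}

print(missing_settings(**example))  # ['security_group']
```

An empty list means every placeholder has been filled in, so it is reasonable to go on and create the cluster.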

In [4]:
location = ""
resource_group = ""
vnet = ""
security_group = ""

vm_size = "Standard_NC12s_v3"
docker_image = "rapidsai/rapidsai:cuda10.2-runtime-ubuntu18.04-py3.8"
worker_class = "dask_cuda.CUDAWorker"
 
n_workers = 1
env_vars = {"EXTRA_PIP_PACKAGES": "gcsfs"}

In [6]:
from distributed import Client
from dask_cloudprovider.azure import AzureVMCluster

cluster = AzureVMCluster(
    location=location,
    resource_group=resource_group,
    vnet=vnet,
    security_group=security_group,
    vm_size=vm_size,
    docker_image=docker_image,
    worker_class=worker_class,
    env_vars=env_vars,
)

Creating scheduler instance
Assigned public IP
Network interface ready
Creating VM
Created VM dask-b505b93e-scheduler
Waiting for scheduler to run
Scheduler is running



Let's look at the data locally to see what we're dealing with. We will make use of the data from 2014 for the purposes of the demo.

In [16]:
base_path = 'gcs://anaconda-public-data/nyc-taxi/csv/'
tmp_df = dask_cudf.read_csv(base_path+'2014/yellow_tripdata_2014*.csv', n_rows=1000)

tmp_df.head().to_pandas()

Unnamed: 0,vendor_id,pickup_datetime,dropoff_datetime,passenger_count,trip_distance,pickup_longitude,pickup_latitude,rate_code,store_and_fwd_flag,dropoff_longitude,dropoff_latitude,payment_type,fare_amount,surcharge,mta_tax,tip_amount,tolls_amount,total_amount
0,CMT,2014-01-09 20:45:25,2014-01-09 20:52:31,1,0.7,-73.99477,40.736828,1,N,-73.982227,40.73179,CRD,6.5,0.5,0.5,1.4,0.0,8.9
1,CMT,2014-01-09 20:46:12,2014-01-09 20:55:12,1,1.4,-73.982392,40.773382,1,N,-73.960449,40.763995,CRD,8.5,0.5,0.5,1.9,0.0,11.4
2,CMT,2014-01-09 20:44:47,2014-01-09 20:59:46,2,2.3,-73.98857,40.739406,1,N,-73.986626,40.765217,CRD,11.5,0.5,0.5,1.5,0.0,14.0
3,CMT,2014-01-09 20:44:57,2014-01-09 20:51:40,1,1.7,-73.960213,40.770464,1,N,-73.979863,40.77705,CRD,7.5,0.5,0.5,1.7,0.0,10.2
4,CMT,2014-01-09 20:47:09,2014-01-09 20:53:32,1,0.9,-73.995371,40.717248,1,N,-73.984367,40.720524,CRD,6.0,0.5,0.5,1.75,0.0,8.75


In [17]:
del tmp_df

# Data Cleanup

The data needs to be cleaned up before it can be used in a meaningful way. We first rename some columns to cleaner names (for instance, some years use `tpep_dropoff_datetime` instead of `dropoff_datetime`). We also define the data type that each column must be read as.

In [8]:
# list of column names that need to be re-mapped
remap = {}
remap['tpep_pickup_datetime'] = 'pickup_datetime'
remap['tpep_dropoff_datetime'] = 'dropoff_datetime'
remap['ratecodeid'] = 'rate_code'

#create a list of columns & dtypes the df must have
must_haves = {
 'pickup_datetime': 'datetime64[ms]',
 'dropoff_datetime': 'datetime64[ms]',
 'passenger_count': 'int32',
 'trip_distance': 'float32',
 'pickup_longitude': 'float32',
 'pickup_latitude': 'float32',
 'rate_code': 'int32',
 'dropoff_longitude': 'float32',
 'dropoff_latitude': 'float32',
 'fare_amount': 'float32'
}

In [9]:
def clean(df_part, remap, must_haves):
    """
    This function performs the various clean up tasks for the data
    and returns the cleaned dataframe.
    """
    tmp = {col:col.strip().lower() for col in list(df_part.columns)}
    df_part = df_part.rename(columns=tmp)
    # rename using the supplied mapping
    df_part = df_part.rename(columns=remap)
    # iterate through columns in this df partition
    for col in df_part.columns:
        # drop anything not in our expected list
        if col not in must_haves:
            df_part = df_part.drop(col, axis=1)
            continue

        # fixes datetime error found by Ty Mckercher and fixed by Paul Mahler
        if df_part[col].dtype == 'object' and col in ['pickup_datetime', 'dropoff_datetime']:
            df_part[col] = df_part[col].astype('datetime64[ms]')
            continue

        # if column was read as a string, recast as float
        if df_part[col].dtype == 'object':
            # fill missing values on the string column itself before casting
            df_part[col] = df_part[col].fillna('-1')
            df_part[col] = df_part[col].astype('float32')
        else:
            # downcast from 64bit to 32bit types
            # Tesla T4 are faster on 32bit ops
            if 'int' in str(df_part[col].dtype):
                df_part[col] = df_part[col].astype('int32')
            if 'float' in str(df_part[col].dtype):
                df_part[col] = df_part[col].astype('float32')
            df_part[col] = df_part[col].fillna(-1)
    return df_part
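
The renaming steps at the top of `clean` can be tried on plain column names without a GPU. The sketch below uses the same `remap` dictionary as the notebook; `normalize_columns` and the sample header are hypothetical and exist only to illustrate the two-stage rename (strip and lowercase first, then apply the mapping).

```python
def normalize_columns(columns, remap):
    """Strip and lowercase each name, then apply the rename mapping."""
    cleaned = [col.strip().lower() for col in columns]
    return [remap.get(col, col) for col in cleaned]


# Same mapping as in the notebook.
remap = {
    "tpep_pickup_datetime": "pickup_datetime",
    "tpep_dropoff_datetime": "dropoff_datetime",
    "ratecodeid": "rate_code",
}

# Hypothetical header with inconsistent case and stray whitespace.
raw_header = ["VendorID", " tpep_pickup_datetime", "RateCodeID ", "fare_amount"]

print(normalize_columns(raw_header, remap))
# ['vendorid', 'pickup_datetime', 'rate_code', 'fare_amount']
```

Lowercasing before remapping is what lets a single `ratecodeid` key match `RateCodeID`, `ratecodeid` and similar spellings.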

# Add Interesting Features

We'll add new features by making use of "user defined functions" on the dataframe. We'll make use of [apply_rows](https://docs.rapids.ai/api/cudf/stable/api.html#cudf.core.dataframe.DataFrame.apply_rows), which is similar to Pandas' apply function. The `apply_rows` operation is [JIT compiled by numba](https://numba.pydata.org/numba-doc/dev/cuda/kernels.html) into GPU kernels.

The kernels we define are:
1. Haversine distance: This is used for calculating the total trip distance.
2. Day of the week: This can be useful information for determining the fare cost.

The `add_features` function combines the two to produce a new dataframe that has the added features.
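
To get a feel for the haversine feature before it runs on the GPU, here is a CPU-only sketch of the same formula for a single pair of points. `haversine_km` is an illustrative name, not part of the notebook or of cuDF, and the clamp on `a` is an added safeguard against floating-point rounding that the kernel does not have. The coordinates are taken from the first row of the sample output shown earlier.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371  # same radius as used in the kernel


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = sin(dlat / 2) ** 2 + cos(lat1) * cos(lat2) * sin(dlon / 2) ** 2
    a = min(1.0, a)  # guard against rounding slightly above 1
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


# Pickup and dropoff coordinates from the first sample row.
distance = haversine_km(40.736828, -73.99477, 40.73179, -73.982227)
print(f"{distance:.2f} km")
```

Note that this is the straight-line distance between pickup and dropoff, so it complements the metered `trip_distance` column rather than replacing it.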

In [10]:
import math
from math import cos, sin, asin, sqrt, pi

def haversine_distance_kernel(pickup_latitude, pickup_longitude, dropoff_latitude, dropoff_longitude, h_distance):
    for i, (x_1, y_1, x_2, y_2) in enumerate(zip(pickup_latitude, pickup_longitude, dropoff_latitude, dropoff_longitude)):
        x_1 = pi/180 * x_1
        y_1 = pi/180 * y_1
        x_2 = pi/180 * x_2
        y_2 = pi/180 * y_2
        
        dlon = y_2 - y_1
        dlat = x_2 - x_1
        a = sin(dlat/2)**2 + cos(x_1) * cos(x_2) * sin(dlon/2)**2
        
        c = 2 * asin(sqrt(a)) 
        r = 6371 # Radius of earth in kilometers
        
        h_distance[i] = c * r

def day_of_the_week_kernel(day, month, year, day_of_week):
    for i, (d_1, m_1, y_1) in enumerate(zip(day, month, year)):
        if month[i] <3:
            shift = month[i]
        else:
            shift = 0
        Y = year[i] - (month[i] < 3)
        y = Y - 2000
        c = 20
        d = day[i]
        m = month[i] + shift + 1
        day_of_week[i] = (d + math.floor(m*2.6) + y + (y//4) + (c//4) -2*c)%7
        
def add_features(df):
    df['hour'] = df['pickup_datetime'].dt.hour
    df['year'] = df['pickup_datetime'].dt.year
    df['month'] = df['pickup_datetime'].dt.month
    df['day'] = df['pickup_datetime'].dt.day
    df['diff'] = df['dropoff_datetime'].astype('int32') - df['pickup_datetime'].astype('int32')
    
    df['pickup_latitude_r'] = df['pickup_latitude']//.01*.01
    df['pickup_longitude_r'] = df['pickup_longitude']//.01*.01
    df['dropoff_latitude_r'] = df['dropoff_latitude']//.01*.01
    df['dropoff_longitude_r'] = df['dropoff_longitude']//.01*.01
    
    df = df.drop('pickup_datetime', axis=1)
    df = df.drop('dropoff_datetime', axis =1)
    
    
    df = df.apply_rows(haversine_distance_kernel,
                   incols=['pickup_latitude', 'pickup_longitude', 'dropoff_latitude', 'dropoff_longitude'],
                   outcols=dict(h_distance=np.float32),
                   kwargs=dict())
    
    
    df = df.apply_rows(day_of_the_week_kernel,
                      incols=['day', 'month', 'year'],
                      outcols=dict(day_of_week=np.float32),
                      kwargs=dict())
    
    
    df['is_weekend'] = (df['day_of_week']<2)
    return df
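
The `day_of_the_week_kernel` above resembles Zeller's congruence, in which 0 and 1 stand for Saturday and Sunday (which would explain the `day_of_week < 2` test for `is_weekend`). That reading is an assumption about the kernel's intent. As a CPU-side cross-check, the sketch below implements the textbook Gregorian form of Zeller's congruence; the function names are illustrative and this is not a drop-in replacement for the GPU kernel.

```python
from datetime import date, timedelta


def zeller_day_of_week(year, month, day):
    """Zeller's congruence (Gregorian): 0=Saturday, 1=Sunday, ..., 6=Friday."""
    if month < 3:
        # January and February count as months 13 and 14 of the previous year.
        month += 12
        year -= 1
    k = year % 100   # year within the century
    j = year // 100  # zero-based century
    return (day + (13 * (month + 1)) // 5 + k + k // 4 + j // 4 - 2 * j) % 7


def is_weekend(year, month, day):
    """True for Saturday (0) and Sunday (1) in Zeller's numbering."""
    return zeller_day_of_week(year, month, day) < 2


# 2014-01-09, the pickup date in the sample rows, was a Thursday.
print(zeller_day_of_week(2014, 1, 9))  # 5
```

Comparing a few kernel outputs against a reference like this (or against `datetime.date.weekday`) is a cheap way to validate the feature before training on it.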

# Train RF model

We are now ready to fit a Random Forest on the data to predict the fare for the trip.

In [11]:
cu_rf_params = {
    'n_estimators': 100,
    'max_depth': 16,
}

The cell below creates a client for the cluster we defined earlier in the notebook. Note the call to `cluster.scale`: this is the step where the workers are allocated.

Once the workers become available, we can run the rest of our workflow: reading and cleaning the data, splitting it into training and validation sets, fitting an RF model, and predicting on the validation set. We print the root mean squared error (RMSE) for this problem. Note that for better performance we should ideally perform hyperparameter optimization (HPO).


Refer to the notebooks in the repository for how to perform automated HPO [using RayTune](https://github.com/rapidsai/cloud-ml-examples/blob/main/ray/notebooks/Ray_RAPIDS_HPO.ipynb) and [using Optuna](https://github.com/rapidsai/cloud-ml-examples/blob/main/optuna/notebooks/optuna_rapids.ipynb).

In [14]:
with Client(cluster) as client:
    import dask_cudf
    cluster.scale(2)
    client.wait_for_workers(2)
    from cuml.dask.ensemble import RandomForestRegressor

    base_path = 'gcs://anaconda-public-data/nyc-taxi/csv/'
    df_2014 = dask_cudf.read_csv(base_path+'2014/yellow_tripdata_2014*.csv', n_rows=100000)

    df_2014 = clean(df_2014, remap, must_haves)

    taxi_df = df_2014.map_partitions(add_features)

    taxi_df = taxi_df.dropna()
    taxi_df = taxi_df.astype("float32")
    X, y = taxi_df.drop(["fare_amount"], axis=1), taxi_df["fare_amount"].astype('float32')

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, shuffle=True)

    workers = client.has_what().keys()

    X_train, X_test, y_train, y_test = dask_utils.persist_across_workers(client,
                                                           [X_train, X_test, y_train, y_test],
                                                           workers=workers)
    cu_dask_rf = RandomForestRegressor(**cu_rf_params, ignore_empty_partitions=True)
    cu_dask_rf = cu_dask_rf.fit(X_train, y_train)

    y_pred = cu_dask_rf.predict(X_test)
    
    _y_pred, _y_test = y_pred.compute().to_array(), y_test.compute().to_array()
    
    score = mean_squared_error(_y_pred, _y_test)
    print("RMSE: ", np.sqrt(score))

RMSE:  3.3357165
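
The value printed above is the square root of the mean squared error, so it is expressed in the same units as `fare_amount`. As a minimal sketch of that calculation in plain Python (the `rmse` helper and the fares below are made up for illustration and are not taken from the run above):

```python
from math import sqrt


def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    return sqrt(mse)


# Made-up fares: errors of -0.5, 0.5 and -1.0 give an MSE of 0.5.
actual = [6.5, 8.5, 11.5]
predicted = [7.0, 8.0, 12.5]

print(round(rmse(actual, predicted), 4))  # 0.7071
```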


In [15]:
client.close()
cluster.close()

Terminated VM dask-b505b93e-worker-11afc90d
Removed disks for VM dask-b505b93e-worker-11afc90d
Deleted network interface
Terminated VM dask-b505b93e-worker-dbb420e7
Removed disks for VM dask-b505b93e-worker-dbb420e7
Deleted network interface
Terminated VM dask-b505b93e-scheduler
Removed disks for VM dask-b505b93e-scheduler
Deleted network interface
Unassigned public IP
