# Random forest classification

## Dask + RAPIDS GPU cluster with Snowflake

<table>
    <tr>
        <td>
            <img src="https://docs.dask.org/en/latest/_images/dask_horizontal.svg" width="300">
        </td>
        <td>
            <img src="https://rapids.ai/assets/images/RAPIDS-logo-purple.svg" width="300">
        </td>
        <td>
            <img src="https://upload.wikimedia.org/wikipedia/commons/thumb/f/ff/Snowflake_Logo.svg/1280px-Snowflake_Logo.svg.png" width="300">
        </td>
    </tr>
</table>

In [1]:
import os

MODEL_PATH = 'models'
os.makedirs(MODEL_PATH, exist_ok=True)  # create the model directory if it is missing

numeric_feat = [
    'pickup_weekday', 
    'pickup_weekofyear', 
    'pickup_hour', 
    'pickup_week_hour', 
    'pickup_minute', 
    'passenger_count',
]
categorical_feat = [
    "pickup_longitude", 
    "pickup_latitude", 
    "dropoff_longitude", 
    "dropoff_latitude",
    #'pickup_taxizone_id', 
    #'dropoff_taxizone_id',
]
features = numeric_feat + categorical_feat
y_col = 'high_tip'

# Initialize Dask GPU cluster

In [2]:
import os
import time
import datetime
import warnings
import pandas as pd

import dask.dataframe as dd
from dask.distributed import Client, progress, wait
from dask import persist, delayed

import cudf, cuml
import dask_cudf as cudd

warnings.simplefilter("ignore")

In [3]:
#client.close()

In [4]:
n_workers = 2  # minimum number of workers to wait for before continuing
# Address of an already running Dask scheduler
cluster = "172.17.0.2:8786"
client = Client(cluster)
client

Connection method: Direct
Dashboard: http://172.17.0.2:8787/status

Comm: tcp://172.17.0.2:8786,Workers: 3
Dashboard: http://172.17.0.2:8787/status,Total threads: 48
Started: 43 minutes ago,Total memory: 188.45 GiB

Comm: tcp://172.17.0.2:34581,Total threads: 16
Dashboard: http://172.17.0.2:37415/status,Memory: 62.82 GiB
Nanny: tcp://172.17.0.2:34639
Local directory: /rapids/notebooks/host/benchmark_gpu_randomforest/random_forest/dask-worker-space/worker-nek61h9x
GPU: Tesla T4,GPU memory: 15.00 GiB
Tasks executing: 0,Tasks in memory: 0
Tasks ready: 0,Tasks in flight: 0
CPU usage: 2.0%,Last seen: Just now
Memory usage: 0.97 GiB,Spilled bytes: 0 B
Read bytes: 6.43 kiB,Write bytes: 11.42 kiB

Comm: tcp://172.17.0.2:40805,Total threads: 16
Dashboard: http://172.17.0.2:37733/status,Memory: 62.82 GiB
Nanny: tcp://172.17.0.2:38989
Local directory: /rapids/notebooks/host/benchmark_gpu_randomforest/random_forest/dask-worker-space/worker-j6ki78zt
GPU: Tesla T4,GPU memory: 15.00 GiB
Tasks executing: 0,Tasks in memory: 0
Tasks ready: 0,Tasks in flight: 0
CPU usage: 2.0%,Last seen: Just now
Memory usage: 318.10 MiB,Spilled bytes: 0 B
Read bytes: 6.43 kiB,Write bytes: 11.43 kiB

Comm: tcp://172.17.0.2:40959,Total threads: 16
Dashboard: http://172.17.0.2:35601/status,Memory: 62.82 GiB
Nanny: tcp://172.17.0.2:33307
Local directory: /rapids/notebooks/host/benchmark_gpu_randomforest/random_forest/dask-worker-space/worker-i4s64wyr
GPU: Tesla T4,GPU memory: 15.00 GiB
Tasks executing: 0,Tasks in memory: 0
Tasks ready: 0,Tasks in flight: 0
CPU usage: 2.0%,Last seen: Just now
Memory usage: 4.02 GiB,Spilled bytes: 0 B
Read bytes: 6.43 kiB,Write bytes: 11.42 kiB


Open the dashboard (linked above) and watch it while you execute commands; it shows which tasks are running across the cluster. A couple of other dashboard pages worth viewing, for GPU memory and GPU utilization, are not listed on the navbar, so we grab direct links for those below.

In [5]:
!nvidia-smi

Thu Dec 29 22:17:39 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.65.01    Driver Version: 515.65.01    CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            On   | 00000000:00:0D.0 Off |                    0 |
| N/A   38C    P0    27W /  70W |    101MiB / 15360MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla T4            On   | 00000000:00:0E.0 Off |                    0 |
| N/A   32C    P8    15W /  70W |      4MiB / 15360MiB |      0%      Default |
|       

In [6]:
from IPython.display import display, HTML

gpu_links = f'''
<b>GPU Dashboard links</b>
<ul>
<li><a href="{client.dashboard_link}/individual-gpu-memory" target="_blank">GPU memory</a></li>
<li><a href="{client.dashboard_link}/individual-gpu-utilization" target="_blank">GPU utilization</a></li>
</ul>
'''
display(HTML(gpu_links))

If you created your cluster in this notebook, it might take a few minutes for all of your nodes to become available. Run the cell below to block until at least `n_workers` workers are ready.

>**Pro tip**: Create and/or start your cluster from the "Dask" page in Saturn if you want to get a head start!

In [7]:
client.wait_for_workers(n_workers=n_workers)

# Load data and feature engineering

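Load the 2015 yellow taxi trip data for this exercise. Note that the cell below reads the Parquet dataset with `dask.dataframe.read_parquet`, which produces pandas-backed partitions; the `dask_cudf.read_parquet` variant, which loads directly into GPU memory, is left commented out.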

In [8]:

nyc_datatype = {'VendorID': 'string',
                'passenger_count': 'int32',
                'trip_distance': 'float32',
                'pickup_longitude': 'float32',
                'pickup_latitude': 'float32',
                'RateCodeID': 'string',
                'store_and_fwd_flag': 'string',
                'dropoff_longitude': 'float32',
                'dropoff_latitude': 'float32',
                'payment_type': 'string',
                'fare_amount': 'float32',
                'extra': 'float32',
                'mta_tax': 'float32',
                'tip_amount': 'float32',
                'tolls_amount': 'float32',
                'improvement_surcharge': 'float32',
                'total_amount':'float32' }

In [9]:
df = dd.read_parquet("/home/cloud/dataset/nyc-taxi/yellow_tripdata_2015.parquet")
#df = cudd.read_parquet("/home/cloud/dataset/nyc-taxi/yellow_tripdata_2015.parquet/*.parquet")
#                    ,split_row_groups=True )

df = df.astype(nyc_datatype)

In [10]:
df = df.persist()

progress(df)


In [11]:
%time wait(df)
%time print(df.passenger_count.sum().compute())

CPU times: user 7.08 ms, sys: 0 ns, total: 7.08 ms
Wall time: 6.4 ms
245566747
CPU times: user 27.4 ms, sys: 3.56 ms, total: 31 ms
Wall time: 242 ms


In [12]:
df

npartitions=353
VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,pickup_longitude,pickup_latitude,RateCodeID,store_and_fwd_flag,dropoff_longitude,dropoff_latitude,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount
string,datetime64[ns],datetime64[ns],int32,float32,float32,float32,string,string,float32,float32,string,float32,float32,float32,float32,float32,float32,float32


In [13]:
#df.divisions
#df.partitions[10].compute()
df.VendorID
#df["passenger_count"].compute()

Dask Series Structure:
npartitions=353
    string
       ...
     ...  
       ...
       ...
Name: VendorID, dtype: string
Dask Name: getitem, 706 tasks

In [None]:
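# Convert the pandas-backed Dask DataFrame to dask_cudf.

# df.compute() would gather the full dataset onto the client as a pandas
# DataFrame without moving anything to the GPU, so only inspect the dtypes here.
df.dtypes

In [None]:
# dask_cudf.from_cudf expects a single cudf.DataFrame, not a Dask DataFrame,
# so convert each pandas partition to cuDF and then repartition.
ds = df.map_partitions(cudf.from_pandas).repartition(npartitions=100)
ds.head(n=3)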

In [None]:
cols = ['VendorID', 'tpep_pickup_datetime', 'tpep_dropoff_datetime',
        'passenger_count', 'trip_distance', 'RateCodeID', 'store_and_fwd_flag',
        'pickup_longitude', 'pickup_latitude', 'dropoff_longitude', 'dropoff_latitude', 'payment_type', 'fare_amount', 'extra',
        'mta_tax', 'tip_amount', 'tolls_amount', 'improvement_surcharge',
        'total_amount']

taxi = df[cols]

taxi['pickup_weekday'] = taxi.tpep_pickup_datetime.dt.weekday
taxi['pickup_weekofyear'] = taxi.tpep_pickup_datetime.dt.isocalendar().week
taxi['pickup_hour'] = taxi.tpep_pickup_datetime.dt.hour
taxi['pickup_minute'] = taxi.tpep_pickup_datetime.dt.minute
taxi['pickup_week_hour'] = (taxi.pickup_weekday * 24) + taxi.pickup_hour
taxi['store_and_fwd_flag'] = (taxi.store_and_fwd_flag == 'Y').astype(float)
#taxi = taxi.fillna(-1)

X = taxi[features].astype('float32')
y = taxi['total_amount']
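
The time-based features derived above can be checked on a single timestamp with the standard library. This is only an illustrative sketch: `pickup_features` and the example timestamp are hypothetical and not part of the notebook, and it assumes the pandas convention of Monday as weekday 0.

```python
from datetime import datetime


def pickup_features(ts: datetime) -> dict:
    """Derive the pickup time features for one timestamp (hypothetical helper)."""
    weekday = ts.weekday()  # Monday=0 ... Sunday=6, same convention as .dt.weekday
    hour = ts.hour
    return {
        "pickup_weekday": weekday,
        "pickup_weekofyear": ts.isocalendar()[1],  # ISO week number
        "pickup_hour": hour,
        "pickup_minute": ts.minute,
        "pickup_week_hour": weekday * 24 + hour,  # hour within the week, 0..167
    }


# Hypothetical pickup time: Saturday 17 October 2015, 09:33
example = pickup_features(datetime(2015, 10, 17, 9, 33))
print(example)
```

`pickup_week_hour` combines the weekday and the hour into one value, so a model can separate, say, Saturday 09:00 from Monday 09:00 with a single split.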

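Dask evaluates computations [lazily](https://tutorial.dask.org/01x_lazy.html): nothing is read or computed until a result is requested. We therefore call `persist` to trigger the data loading and feature processing and to keep the results in cluster memory (GPU memory when the partitions are cuDF-backed).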

In [None]:
X = X.fillna(-1)
y = y.fillna(-1)

X, y = persist(X, y)

%time _ = wait([X, y])
%time len(X)

In [None]:
X.dtypes


In [None]:
taxi_train = X
print(f'Num rows: {len(taxi_train)}, Size: {taxi_train.memory_usage(deep=True).compute().sum() / 1e6} MB')
taxi_train.groupby('pickup_weekday')['pickup_weekday'].count().compute()

In [18]:
from cuml.dask.ensemble import RandomForestRegressor

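# Note: the cuML version used here rejects `seed` (see the RuntimeError below);
# `random_state` is the keyword it accepts.
rf = RandomForestRegressor(n_estimators=100, max_depth=10, seed=42)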
%time _ = rf.fit(X, y)

print("done")

  out = getattr(getattr(obj, accessor, obj), attr)
Key:       _construct_rf-522db6c7-9a16-4daf-b095-95565931563d
Function:  _construct_rf
args:      ()
kwargs:    {'n_estimators': 25, 'random_state': 50, 'max_depth': 10, 'seed': 42}
Ex

RuntimeError: 4 of 4 worker jobs failed: (' The variable ', 'seed', ' is not supported in cuML, please read the cuML documentation at (https://docs.rapids.ai/api/cuml/nightly/api.html#random-forest) for more information'), (' The variable ', 'seed', ' is not supported in cuML, please read the cuML documentation at (https://docs.rapids.ai/api/cuml/nightly/api.html#random-forest) for more information'), (' The variable ', 'seed', ' is not supported in cuML, please read the cuML documentation at (https://docs.rapids.ai/api/cuml/nightly/api.html#random-forest) for more information'), (' The variable ', 'seed', ' is not supported in cuML, please read the cuML documentation at (https://docs.rapids.ai/api/cuml/nightly/api.html#random-forest) for more information')

Exception ignored in: <bound method RandomForestRegressor.__del__ of RandomForestRegressor()>
Traceback (most recent call last):
  File "cuml/ensemble/randomforestregressor.pyx", line 322, in cuml.ensemble.randomforestregressor.RandomForestRegressor.__del__
  File "cuml/ensemble/randomforestregressor.pyx", line 326, in cuml.ensemble.randomforestregressor.RandomForestRegressor._reset_forest_data
  File "cuml/common/base.pyx", line 269, in cuml.common.base.Base.__getattr__
AttributeError: rf_forest
Exception ignored in: <bound method RandomForestRegressor.__del__ of RandomForestRegressor()>
Traceback (most recent call last):
  File "cuml/ensemble/randomforestregressor.pyx", line 322, in cuml.ensemble.randomforestregressor.RandomForestRegressor.__del__
  File "cuml/ensemble/randomforestregressor.pyx", line 326, in cuml.ensemble.randomforestregressor.RandomForestRegressor._reset_forest_data
  File "cuml/common/base.pyx", line 269, in cuml.common.base.Base.__getattr__
AttributeError: rf_for
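
The `RuntimeError` above says that `seed` is not supported, while the worker kwargs in the same message already include `random_state`. One way to guard against this is to rename the keyword before constructing the estimator. The sketch below is an assumption-laden illustration: `normalize_rf_kwargs` is a hypothetical helper, not a cuML function, and it only manipulates a plain dictionary.

```python
def normalize_rf_kwargs(kwargs: dict) -> dict:
    """Rename a legacy `seed` keyword to `random_state` (hypothetical helper)."""
    fixed = dict(kwargs)  # copy so the caller's dictionary is left untouched
    if "seed" in fixed:
        seed = fixed.pop("seed")
        # Keep an explicitly supplied random_state; otherwise reuse the seed value.
        fixed.setdefault("random_state", seed)
    return fixed


params = normalize_rf_kwargs({"n_estimators": 100, "max_depth": 10, "seed": 42})
print(params)

# Usage against a live cluster (needs cuML, so it is left commented out):
# rf = RandomForestRegressor(**params)
```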

In [22]:
#taxi_train = taxi[features + [y_col]]
#taxi_train[features] = taxi_train[features].astype("float32").fillna(-1)
#taxi_train[y_col] = taxi_train[y_col].astype("int32").fillna(-1)

In [19]:
taxi_train = taxi_train.persist()
_ = wait(taxi_train)

In [20]:
print(f'Num rows: {len(taxi_train)}, Size: {taxi_train.memory_usage(deep=True).compute().sum() / 1e6} MB')

Num rows: 300698204, Size: 10825.135344 MB
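
The reported size can be sanity-checked by hand. The sketch below assumes the frame holds nine 4-byte columns (eight `float32` features plus one `int32` label, as in the `taxi_train.head()` output) and that the index contributes no memory; both are assumptions rather than something the notebook states.

```python
n_rows = 300_698_204      # row count reported above
n_float32_features = 8    # assumed: eight float32 feature columns
n_int32_labels = 1        # assumed: one int32 label column (high_tip)
bytes_per_value = 4       # float32 and int32 both use 4 bytes

size_bytes = n_rows * (n_float32_features + n_int32_labels) * bytes_per_value
size_mb = size_bytes / 1e6  # same unit as the notebook output (bytes / 1e6)
print(f"{size_bytes} bytes = {size_mb} MB")
```

Under these assumptions the product matches the printed size, which suggests the whole frame is stored as fixed-width 4-byte values.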


In [21]:
taxi_train.groupby('high_tip')['high_tip'].count().compute()

high_tip
1    151325359
0    149372845
Name: high_tip, dtype: int64
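
The class counts above show a nearly balanced target. As a quick check, the sketch below computes the share of positive rows and the accuracy of always predicting the majority class; the variable names are illustrative only.

```python
counts = {1: 151_325_359, 0: 149_372_845}  # high_tip counts from the output above

total = sum(counts.values())
positive_share = counts[1] / total
majority_baseline = max(counts.values()) / total  # accuracy of always predicting the larger class

print(f"rows: {total}")
print(f"positive share: {positive_share:.4f}")
print(f"majority-class baseline accuracy: {majority_baseline:.4f}")
```

The total equals the row count reported for `taxi_train`, and with classes this close to 50/50 a trivial classifier scores only slightly above one half.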

In [22]:
taxi_train.head()

Unnamed: 0,pickup_weekday,pickup_weekofyear,pickup_hour,pickup_week_hour,pickup_minute,passenger_count,pickup_taxizone_id,dropoff_taxizone_id,high_tip
0,5.0,42.0,0.0,120.0,32.0,1.0,113.0,230.0,0
1,5.0,42.0,9.0,129.0,33.0,2.0,238.0,239.0,0
2,5.0,42.0,9.0,129.0,45.0,2.0,239.0,163.0,0
3,5.0,42.0,7.0,127.0,48.0,1.0,158.0,231.0,1
4,5.0,42.0,8.0,128.0,7.0,1.0,209.0,232.0,1


# Train model

In [23]:
from cuml.dask.ensemble import RandomForestClassifier
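# `random_state` replaces `seed`, which the cuML version used earlier in this notebook rejects
rfc = RandomForestClassifier(n_estimators=100, max_depth=10, random_state=42)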

In [24]:
%%time
_ = rfc.fit(taxi_train[features], taxi_train[y_col])

CPU times: user 1.72 s, sys: 253 ms, total: 1.97 s
Wall time: 7.33 s


## Calculate metrics on test set

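Use a different date range for the test set. Note that `get_dates`, `load`, `conn_info`, and `query` are not defined in this notebook, so the cells below depend on helpers defined elsewhere.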

In [29]:
test_dates = get_dates('2020-01-01', '2020-03-01')
taxi_test = cudd.from_delayed([load(conn_info, query, day) for day in test_dates])

In [30]:
taxi_test = taxi_test[features + [y_col]]
taxi_test[features] = taxi_test[features].astype("float32").fillna(-1)
taxi_test[y_col] = taxi_test[y_col].astype("int32").fillna(-1)

In [31]:
taxi_test = taxi_test.persist()
_ = wait(taxi_test)

<br>

Convert to a single-GPU DataFrame using `compute()`, because the Dask+RAPIDS implementation doesn't yet have `roc_auc_score`.

In [32]:
from cuml.metrics import roc_auc_score

preds = rfc.predict_proba(taxi_test[features])[1]
roc_auc_score(taxi_test[y_col].compute(), preds.compute())

0.5315331220626831
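
To put the score in context: ROC AUC is the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, so 0.5 corresponds to random ranking and 1.0 to a perfect ranking. The sketch below is a plain-Python illustration of that definition on toy data. It is not how `cuml.metrics.roc_auc_score` is implemented, and the function name `roc_auc` is hypothetical.

```python
def roc_auc(y_true, scores):
    """Pairwise ROC AUC: fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair. Quadratic in the input size, so it is
    only suitable for small illustrative inputs.
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("ROC AUC needs at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


perfect = roc_auc([0, 0, 1, 1], [0.1, 0.2, 0.8, 0.9])   # every positive outranks every negative
chance = roc_auc([0, 1, 0, 1], [0.5, 0.5, 0.5, 0.5])    # constant scores carry no information
mixed = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])    # one of four pairs is misordered
print(perfect, chance, mixed)
```

Read this way, the value of about 0.53 obtained above means the model ranks test rows only slightly better than chance.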