# High tip classification - Random forest

## Dask + RAPIDS

 <img src="https://images.exxactcorp.com/CMS/landing-page/resource-center/supported-software/deep-learning/rapids/Rapids-Logo-lg.png" width="400" />

**Hardware**: 20 nodes - g4dn.xlarge (4 CPU, 16GB RAM; 1 GPU, 16GB GPU RAM)

In [1]:
from ml_utils import MLUtils

ml_utils = MLUtils(
    ml_task='high_tip',
    tool='rapids',
    model='random_forest',
)

# Load data and feature engineering

In [2]:
from dask.distributed import Client, wait
from dask import persist
from dask_saturn import SaturnCluster

cluster = SaturnCluster(n_workers=20, scheduler_size='xlarge', worker_size='g4dnxlarge')
client = Client(cluster)
cluster

[2020-08-07 23:26:21] INFO - dask-saturn | Cluster is ready



In [3]:
from IPython.display import display, HTML

gpu_links = f'''
<b>GPU Dashboard links</b>
<ul>
<li><a href="{client.dashboard_link}/individual-gpu-memory" target="_blank">GPU memory</a></li>
<li><a href="{client.dashboard_link}/individual-gpu-utilization" target="_blank">GPU utilization</a></li>
</ul>
'''
display(HTML(gpu_links))

In [3]:
import numpy as np
import pandas as pd
import dask_cudf

In [4]:
%%time
tip_train = dask_cudf.read_parquet(f'{ml_utils.taxi_path}/data/ml/tip_train')
len(tip_train)

CPU times: user 2.72 s, sys: 664 ms, total: 3.39 s
Wall time: 13 s


219892717

In [5]:
tip_train.head()

| | id | pickup_datetime | dropoff_datetime | pickup_taxizone_id | dropoff_taxizone_id | pickup_weekday | pickup_weekofyear | pickup_hour | pickup_minute | pickup_week_hour | passenger_count | tip_fraction |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2b7d0e26212043c38b8ab3a7693965f1 | 2016-07-10 06:56:05 | 2016-07-10 06:59:53 | 263.0 | 236.0 | 6 | 27 | 6 | 56 | 150 | 1.0 | 0.6 |
| 1 | 979d4717e9e04c44bf62959308348860 | 2016-07-10 10:50:18 | 2016-07-10 10:55:21 | 142.0 | 163.0 | 6 | 27 | 10 | 50 | 154 | 5.0 | 0.226667 |
| 2 | 07fd0421b17f4bf78a8cf09a3dabcac7 | 2016-07-10 10:50:18 | 2016-07-10 11:08:38 | 74.0 | 66.0 | 6 | 27 | 10 | 50 | 154 | 1.0 | 0.0 |
| 3 | 8d039019a6d94d7e81ce4bcbd3a04f36 | 2016-07-10 10:50:19 | 2016-07-10 10:55:38 | 161.0 | 142.0 | 6 | 27 | 10 | 50 | 154 | 1.0 | 0.166667 |
| 4 | d71237b38d5e4d39b9c43c665080d8da | 2016-07-10 10:50:19 | 2016-07-10 11:08:54 | 226.0 | 261.0 | 6 | 27 | 10 | 50 | 154 | 1.0 | 0.206957 |


<br>

With Dask + RAPIDS we can train on the full dataset, so there is no need to sample here.

# Random forest classifier

We'll train a classifier that can predict "high-tip" rides: those where the tip percentage is above 25%. We convert `X` to 32-bit floats for GPU processing.

In [6]:
features = ml_utils.tip_vars.features
y_col = ml_utils.tip_vars.y_col
y_clf = ml_utils.tip_vars.y_clf
high_tip = ml_utils.tip_vars.high_tip

tip_train[y_clf] = (tip_train[y_col] > high_tip).astype('int32')
X = tip_train[features].astype('float32')
y = tip_train[y_clf]
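
As a small worked illustration of the labeling rule in the cell above, the sketch below applies it to the five `tip_fraction` values shown in the `tip_train.head()` output, using only the Python standard library. It assumes that `ml_utils.tip_vars.high_tip` equals 0.25, which matches the ">25%" rule in the text but is not shown in this notebook. The `to_float32` helper is hypothetical and only mimics what a 32-bit cast does to a single value.

```python
import struct

# tip_fraction values copied from the tip_train.head() output above
tip_fractions = [0.6, 0.226667, 0.0, 0.166667, 0.206957]

# Assumption: the threshold stored in ml_utils.tip_vars.high_tip is 0.25,
# matching the ">25%" rule stated in the text.
HIGH_TIP = 0.25

# Same comparison as (tip_train[y_col] > high_tip).astype('int32'),
# applied to plain Python floats
labels = [int(frac > HIGH_TIP) for frac in tip_fractions]
print(labels)  # [1, 0, 0, 0, 0]


def to_float32(value):
    """Round-trip a 64-bit Python float through 32-bit storage (hypothetical helper)."""
    return struct.unpack("f", struct.pack("f", value))[0]


# Small integer-valued columns such as zone IDs and hours survive the cast exactly
print(to_float32(263.0) == 263.0)  # True
```

Only the first ride exceeds the threshold, so it is the only one labeled as high-tip.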

In [7]:
y.value_counts().compute()

0    161579881
1     58312836
Name: high_tip, dtype: int64
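
The counts above show that the classes are imbalanced. A quick bit of arithmetic, sketched here in plain Python with the numbers copied from the output, gives the share of high-tip rides and the accuracy a trivial "always predict 0" baseline would reach.

```python
# Counts copied from the y.value_counts() output above
n_negative = 161_579_881  # label 0: not a high-tip ride
n_positive = 58_312_836   # label 1: high-tip ride

total = n_negative + n_positive
positive_share = n_positive / total
majority_baseline_accuracy = n_negative / total

print(total)                                 # 219892717
print(f"{positive_share:.1%}")               # 26.5%
print(f"{majority_baseline_accuracy:.1%}")   # 73.5%
```

The total matches the row count returned by `len(tip_train)`, so every row received a label.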

Calling `persist()` runs all the data loading and pre-processing and keeps the results in the workers' memory, ready for model training. `wait()` blocks until that work has finished.

In [8]:
%%time
X, y = persist(X, y)
_ = wait([X, y])

CPU times: user 701 ms, sys: 21.9 ms, total: 723 ms
Wall time: 12.8 s


In [9]:
from cuml.dask.ensemble import RandomForestClassifier

rfc = RandomForestClassifier(n_estimators=100, max_depth=10, seed=42)

In [10]:
%%time
_ = rfc.fit(X, y)

CPU times: user 521 ms, sys: 173 ms, total: 694 ms
Wall time: 2.11 s


## Save model

In [11]:
# ml_utils.write_model(rfc)

## Predict on test set

We then calculate metrics and save both the predictions and the metrics to S3.

In [12]:
tip_test = dask_cudf.read_parquet(f'{ml_utils.taxi_path}/data/ml/tip_test')

tip_test[y_clf] = (tip_test[y_col] > high_tip).astype('int32')
X_test = tip_test[features].astype('float32')
y_test = tip_test[y_clf]

preds = tip_test[['id', y_clf]].copy()
preds.columns = ['id', 'actual']
preds['predicted'] = rfc.predict_proba(X_test[features])[1]

In [13]:
%%time
preds = preds.persist()
_ = wait(preds)

CPU times: user 127 ms, sys: 7.16 ms, total: 134 ms
Wall time: 2.17 s


In [14]:
preds.head()

| | id | actual | predicted |
|---|---|---|---|
| 0 | 11d37225291e4ba4aa7eebad9f7b272a | 0 | 0.0 |
| 1 | 8bc943e88a654bb2a0861f393c7ba491 | 0 | 0.0 |
| 2 | 3431b20f16c5456aa29fa45bccad6f47 | 0 | 0.0 |
| 3 | 52cf7e80dd104f84a81b488c1f78d531 | 0 | 0.0 |
| 4 | e0ae8ce6dbff40b195332af413d24b0e | 0 | 0.0 |


In [15]:
%%time
import s3fs
s3 = s3fs.S3FileSystem()

path = f'{ml_utils.taxi_path}/ml_results/predictions/{ml_utils.ml_task}__{ml_utils.tool}__{ml_utils.model}'
print(path)
if s3.exists(path):
    s3.rm(path, recursive=True)

preds.to_parquet(path, write_index=False)

s3://saturn-titan/nyc-taxi/ml_results/predictions/high_tip__rapids__random_forest
CPU times: user 52.7 ms, sys: 71 µs, total: 52.8 ms
Wall time: 3.78 s


<br>

Convert to a single-GPU DataFrame using `compute()`, because the Dask + RAPIDS implementation doesn't yet have `roc_auc_score`.

In [16]:
from cuml.metrics import roc_auc_score

preds_cudf = preds.compute()
auc = roc_auc_score(preds_cudf.actual, preds_cudf.predicted)
ml_utils.write_metric_df('auc', auc)

| | ml_task | tool | model | metric | value |
|---|---|---|---|---|---|
| 0 | high_tip | rapids | random_forest | auc | 0.50078 |
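
To make the AUC metric above easier to read, here is a minimal pure-Python sketch of what ROC AUC measures: the probability that a randomly chosen positive example gets a higher score than a randomly chosen negative one, with ties counted as half. This is an illustration only, not cuML's implementation. The `roc_auc` function and the toy scores are made up for the example, and the pairwise loop is far too slow for data of this size.

```python
def roc_auc(actual, predicted):
    """Pairwise ROC AUC: P(score of a positive > score of a negative), ties count 0.5."""
    pos = [p for a, p in zip(actual, predicted) if a == 1]
    neg = [p for a, p in zip(actual, predicted) if a == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Toy labels and scores, invented for illustration
actual = [0, 0, 1, 1]
print(roc_auc(actual, [0.1, 0.4, 0.35, 0.8]))  # 0.75
print(roc_auc(actual, [0.0, 0.0, 0.0, 0.0]))   # 0.5
```

A value of 1.0 means every positive is ranked above every negative, while 0.5 corresponds to scores that rank positives no better than chance, as in the second call where every score is identical.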
