# Random forest classification

## Dask + RAPIDS GPU cluster with Snowflake

<table>
    <tr>
        <td>
            <img src="https://docs.dask.org/en/latest/_images/dask_horizontal.svg" width="300">
        </td>
        <td>
            <img src="https://rapids.ai/assets/images/RAPIDS-logo-purple.svg" width="300">
        </td>
        <td>
            <img src="https://upload.wikimedia.org/wikipedia/commons/thumb/f/ff/Snowflake_Logo.svg/1280px-Snowflake_Logo.svg.png" width="300">
        </td>
    </tr>
</table>

In [2]:
import os

MODEL_PATH = 'models'
if not os.path.exists(MODEL_PATH):
    os.makedirs(MODEL_PATH)
    
numeric_feat = [
    'pickup_weekday', 
    'pickup_weekofyear', 
    'pickup_hour', 
    'pickup_week_hour', 
    'pickup_minute', 
    'passenger_count',
]
categorical_feat = [
    'pickup_taxizone_id', 
    'dropoff_taxizone_id',
]
features = numeric_feat + categorical_feat
y_col = 'high_tip'

# Initialize Dask GPU cluster

In [3]:
from dask.distributed import Client, wait
import time
from dask import persist
from dask_saturn import SaturnCluster

n_workers = 20
cluster = SaturnCluster(n_workers=n_workers, scheduler_size='g4dnxlarge', worker_size='g4dnxlarge')
client = Client(cluster)


[2020-11-09 15:58:03] INFO - dask-saturn | Cluster is ready


Open the dashboard (link above) and watch it while you execute commands; you'll see which tasks are running across the cluster. A couple of other dashboard pages, for GPU memory and GPU utilization, are worth viewing but are not listed on the navbar, so we grab direct links for those below.

In [4]:
from IPython.display import display, HTML

gpu_links = f'''
<b>GPU Dashboard links</b>
<ul>
<li><a href="{client.dashboard_link}/individual-gpu-memory" target="_blank">GPU memory</a></li>
<li><a href="{client.dashboard_link}/individual-gpu-utilization" target="_blank">GPU utilization</a></li>
</ul>
'''
display(HTML(gpu_links))

If you created your cluster in this notebook, it might take a few minutes for all your nodes to become available. Run the cell below to block until all workers are ready.

>**Pro tip**: Create and/or start your cluster from the "Dask" page in Saturn if you want to get a head start!

In [5]:
client.wait_for_workers(n_workers=n_workers)

# Load data and feature engineering

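Load three years of data (2017 through 2019) for this exercise. Note that we now load the data with Dask+RAPIDS: each day is queried from Snowflake in its own delayed task, converted to a `cudf` DataFrame, and the pieces are combined with `dask_cudf.from_delayed` (instead of a single pandas load such as `pd.read_csv`).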

In [6]:
import os
import datetime
import pandas as pd
import dask.dataframe as dd
import cudf
import dask_cudf as cudd
import warnings
warnings.simplefilter("ignore")

import snowflake.connector

SNOWFLAKE_ACCOUNT = os.environ['SNOWFLAKE_ACCOUNT']
SNOWFLAKE_USER = os.environ['SNOWFLAKE_USER']
SNOWFLAKE_PASSWORD = os.environ['SNOWFLAKE_PASSWORD']

SNOWFLAKE_WAREHOUSE = os.environ['SNOWFLAKE_WAREHOUSE']
TAXI_DATABASE = os.environ['TAXI_DATABASE']
TAXI_SCHEMA = os.environ['TAXI_SCHEMA']

conn_info = {
    'account': SNOWFLAKE_ACCOUNT,
    'user': SNOWFLAKE_USER,
    'password': SNOWFLAKE_PASSWORD,
    'warehouse': SNOWFLAKE_WAREHOUSE,
    'database': TAXI_DATABASE,
    'schema': TAXI_SCHEMA,
}
conn = snowflake.connector.connect(**conn_info)


In [14]:
from dask import delayed

query = """
SELECT 
    pickup_taxizone_id,
    dropoff_taxizone_id,
    passenger_count,
    DIV0(tip_amount, fare_amount) > 0.2 AS high_tip,
    DAYOFWEEKISO(pickup_datetime) - 1 AS pickup_weekday,
    WEEKOFYEAR(pickup_datetime) AS pickup_weekofyear,
    HOUR(pickup_datetime) AS pickup_hour,
    (pickup_weekday * 24) + pickup_hour AS pickup_week_hour,
    MINUTE(pickup_datetime) AS pickup_minute
FROM taxi_yellow2
WHERE
    DATE(pickup_datetime) = %s
"""

@delayed
def load(conn_info, query, day):
    with snowflake.connector.connect(**conn_info) as conn:
        taxi = conn.cursor().execute(query, day).fetch_pandas_all()
        taxi.columns = [x.lower() for x in taxi.columns]
        taxi = cudf.from_pandas(taxi)
        return taxi
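The feature expressions in the query above can be mirrored in plain Python to check what each column means. This is a minimal sketch, not part of the pipeline: `pickup_features` is a hypothetical helper, it assumes Snowflake's default ISO week semantics for `WEEKOFYEAR`, and it does not handle NULL inputs.

```python
from datetime import datetime


def pickup_features(pickup_datetime, tip_amount, fare_amount):
    """Hypothetical pure-Python mirror of the SQL feature expressions."""
    # DAYOFWEEKISO returns 1 (Monday) .. 7 (Sunday); subtracting 1 gives 0..6.
    weekday = pickup_datetime.isoweekday() - 1
    hour = pickup_datetime.hour
    # DIV0 returns 0 instead of raising an error when the divisor is 0.
    tip_fraction = tip_amount / fare_amount if fare_amount != 0 else 0.0
    return {
        "pickup_weekday": weekday,
        # Assumes ISO week numbering (Snowflake's default policy).
        "pickup_weekofyear": pickup_datetime.isocalendar()[1],
        "pickup_hour": hour,
        # Hour of the week: 0 (Monday 00:00) .. 167 (Sunday 23:00).
        "pickup_week_hour": weekday * 24 + hour,
        "pickup_minute": pickup_datetime.minute,
        "high_tip": tip_fraction > 0.2,
    }


# A Saturday ride just after midnight with a 25% tip.
example = pickup_features(datetime(2019, 10, 19, 0, 32), tip_amount=3.0, fare_amount=12.0)
print(example)
```

A ride with a zero fare gets a tip fraction of 0 and therefore `high_tip = False`, which is the behavior `DIV0` gives in the query.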

In [15]:
def get_dates(start, end):
    date_query = """
    SELECT
        DISTINCT(DATE(pickup_datetime)) as date 
    FROM taxi_yellow
    WHERE
        pickup_datetime BETWEEN %s and %s
    """
    dates_df = conn.cursor().execute(date_query, (start, end))
    columns = [x[0] for x in dates_df.description]
    dates_df = pd.DataFrame(dates_df.fetchall(), columns=columns)
    return dates_df['DATE'].tolist()

dates = get_dates('2017-01-01', '2019-12-31')


In [16]:
taxi = cudd.from_delayed([load(conn_info, query, day) for day in dates])

Dask evaluates computations [lazily](https://tutorial.dask.org/01x_lazy.html), so nothing has been loaded yet. We define the feature processing and then persist the dataframe, which runs the data loading and processing on the cluster and keeps the result in GPU memory.

In [18]:
taxi_train = taxi[features + [y_col]]
taxi_train[features] = taxi_train[features].astype("float32").fillna(-1)
taxi_train[y_col] = taxi_train[y_col].astype("int32").fillna(-1)

In [19]:
taxi_train = taxi_train.persist()
_ = wait(taxi_train)
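To make the lazy-versus-persisted distinction concrete, here is a toy pure-Python analogy. `LazyValue` is a hypothetical class written only for this illustration; it is not how Dask is implemented. It shows that without persisting, every `compute()` repeats the work, while a persisted value is computed once and reused.

```python
class LazyValue:
    """Toy stand-in for a lazy collection; not part of Dask."""

    def __init__(self, func):
        self._func = func
        self._cached = None
        self._persisted = False
        self.calls = 0  # how many times the underlying work actually ran

    def compute(self):
        # A persisted value is served from the cache without redoing the work.
        if self._persisted:
            return self._cached
        self.calls += 1
        return self._func()

    def persist(self):
        # Run the work once and keep the result for later calls.
        if not self._persisted:
            self._cached = self.compute()
            self._persisted = True
        return self


def expensive_load():
    return [x * 2 for x in range(5)]


lazy = LazyValue(expensive_load)
lazy.compute()
lazy.compute()
calls_without_persist = lazy.calls

persisted = LazyValue(expensive_load).persist()
persisted.compute()
persisted.compute()
calls_with_persist = persisted.calls

print(calls_without_persist, calls_with_persist)
```

In the real API, `persist()` returns right away while the cluster works in the background, which is why the cell above calls `wait()` before continuing.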

In [20]:
print(f'Num rows: {len(taxi_train)}, Size: {taxi_train.memory_usage(deep=True).compute().sum() / 1e6} MB')

Num rows: 300698204, Size: 10825.135344 MB
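The reported size can be sanity-checked by hand: after the casts above, each row holds eight `float32` features and one `int32` label, at 4 bytes each. A back-of-the-envelope check, assuming the index and null masks contribute nothing and the rows are spread evenly over the 20 workers:

```python
n_rows = 300_698_204
n_features = 8        # numeric_feat + categorical_feat
bytes_per_value = 4   # float32 and int32 are both 4 bytes wide
n_workers = 20

# Eight features plus the label column.
bytes_per_row = (n_features + 1) * bytes_per_value
total_mb = n_rows * bytes_per_row / 1e6
# Assumes an even spread of partitions, which is not guaranteed.
per_worker_mb = total_mb / n_workers

print(f"{bytes_per_row} bytes/row, {total_mb:.1f} MB total, {per_worker_mb:.1f} MB per worker")
```

At 36 bytes per row the total agrees with the size printed above, and the per-worker share comes to a little over half a gigabyte.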


In [21]:
taxi_train.groupby('high_tip')['high_tip'].count().compute()

high_tip
1    151325359
0    149372845
Name: high_tip, dtype: int64
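The counts above imply that the two classes are almost perfectly balanced. A quick check of the positive rate, which is also the accuracy you would get by always predicting the majority class:

```python
# Class counts copied from the output above.
counts = {1: 151_325_359, 0: 149_372_845}

total = sum(counts.values())
positive_rate = counts[1] / total
# Accuracy of a trivial model that always predicts the most common class.
majority_baseline = max(counts.values()) / total

print(f"total={total}, positive_rate={positive_rate:.4f}, baseline={majority_baseline:.4f}")
```

Because the split is so close to 50/50, no class reweighting is needed here, and any useful model has to beat an accuracy of roughly one half.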

In [22]:
taxi_train.head()

| | pickup_weekday | pickup_weekofyear | pickup_hour | pickup_week_hour | pickup_minute | passenger_count | pickup_taxizone_id | dropoff_taxizone_id | high_tip |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 5.0 | 42.0 | 0.0 | 120.0 | 32.0 | 1.0 | 113.0 | 230.0 | 0 |
| 1 | 5.0 | 42.0 | 9.0 | 129.0 | 33.0 | 2.0 | 238.0 | 239.0 | 0 |
| 2 | 5.0 | 42.0 | 9.0 | 129.0 | 45.0 | 2.0 | 239.0 | 163.0 | 0 |
| 3 | 5.0 | 42.0 | 7.0 | 127.0 | 48.0 | 1.0 | 158.0 | 231.0 | 1 |
| 4 | 5.0 | 42.0 | 8.0 | 128.0 | 7.0 | 1.0 | 209.0 | 232.0 | 1 |


# Train model

In [23]:
from cuml.dask.ensemble import RandomForestClassifier
rfc = RandomForestClassifier(n_estimators=100, max_depth=10, seed=42)
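cuML's documentation describes its Dask random forest as embarrassingly parallel: each worker builds roughly `n_estimators / n_workers` trees on the partitions it holds locally. A rough sketch of that split follows; `trees_per_worker` is a hypothetical helper, not a cuML function, and cuML's exact handling of remainders may differ.

```python
def trees_per_worker(n_estimators, n_workers):
    """Hypothetical helper: spread trees as evenly as possible over workers."""
    base, extra = divmod(n_estimators, n_workers)
    # The first `extra` workers each take one additional tree.
    return [base + (1 if i < extra else 0) for i in range(n_workers)]


allocation = trees_per_worker(100, 20)
print(allocation)
```

With 100 trees on 20 workers, each worker trains 5 trees and sees only its own share of the data, so it helps that the training data was persisted across the whole cluster first.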

In [24]:
%%time
_ = rfc.fit(taxi_train[features], taxi_train[y_col])

CPU times: user 1.72 s, sys: 253 ms, total: 1.97 s
Wall time: 7.33 s


## Calculate metrics on test set

Use a different, later date range (early 2020) for the test set.

In [29]:
test_dates = get_dates('2020-01-01', '2020-03-01')
taxi_test = cudd.from_delayed([load(conn_info, query, day) for day in test_dates])

In [30]:
taxi_test = taxi_test[features + [y_col]]
taxi_test[features] = taxi_test[features].astype("float32").fillna(-1)
taxi_test[y_col] = taxi_test[y_col].astype("int32").fillna(-1)

In [31]:
taxi_test = taxi_test.persist()
_ = wait(taxi_test)

<br>

Convert the labels and predictions to single-GPU objects using `compute()`, because the Dask+RAPIDS implementation doesn't yet have `roc_auc_score`.

In [32]:
from cuml.metrics import roc_auc_score

preds = rfc.predict_proba(taxi_test[features])[1]
roc_auc_score(taxi_test[y_col].compute(), preds.compute())

0.5315331220626831
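For intuition about what the score above measures: ROC AUC is the probability that a randomly chosen positive example is scored higher than a randomly chosen negative one, with ties counted as half. A minimal pure-Python sketch of that definition follows. It is quadratic in the number of rows, so it is for illustration only, and `pairwise_auc` is a hypothetical helper, not the cuML implementation.

```python
def pairwise_auc(y_true, scores):
    """Hypothetical helper: ROC AUC as the fraction of correctly ordered pairs."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("ROC AUC needs at least one positive and one negative example")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(pos) * len(neg))


auc = pairwise_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(auc)
```

A value of 0.5 corresponds to a random ranking, so the 0.53 above means the model ranks high-tip rides only slightly better than chance.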