# Random forest classification

## Dask + RAPIDS GPU cluster

<table>
    <tr>
        <td>
            <img src="https://docs.dask.org/en/latest/_images/dask_horizontal.svg" width="300">
        </td>
        <td>
            <img src="https://rapids.ai/assets/images/RAPIDS-logo-purple.svg" width="300">
        </td>
    </tr>
</table>

In [None]:
import os

MODEL_PATH = 'models'
os.makedirs(MODEL_PATH, exist_ok=True)
    
numeric_feat = [
    'pickup_weekday', 
    'pickup_hour', 
    'pickup_week_hour', 
    'pickup_minute', 
    'passenger_count',
]
categorical_feat = [
    'PULocationID', 
    'DOLocationID',
]
features = numeric_feat + categorical_feat
y_col = 'high_tip'

# Initialize Dask GPU cluster

In [None]:
from dask.distributed import Client, wait
from dask_saturn import SaturnCluster

n_workers = 3
cluster = SaturnCluster(
    n_workers=n_workers,
    scheduler_size='medium',
    worker_size='g4dnxlarge'
)
client = Client(cluster)
cluster

Open the dashboard (link above) and watch it while you execute commands; it shows which tasks are running across the cluster. A couple of other dashboard pages worth viewing, for GPU memory and utilization, are not listed on the navbar, so we grab direct links to them below.

In [None]:
from IPython.display import display, HTML

gpu_links = f'''
<b>GPU Dashboard links</b>
<ul>
<li><a href="{client.dashboard_link}/individual-gpu-memory" target="_blank">GPU memory</a></li>
<li><a href="{client.dashboard_link}/individual-gpu-utilization" target="_blank">GPU utilization</a></li>
</ul>
'''
display(HTML(gpu_links))

If you created your cluster here in this notebook, it might take a few minutes for all your nodes to become available. You can run the cell below to block until all nodes are ready.

>**Pro tip**: Create and/or start your cluster from the "Dask" page in Saturn if you want to get a head start!

In [None]:
client.wait_for_workers(n_workers=n_workers)

# Load data and feature engineering

Load a full month for this exercise. Note that we are now loading the data with Dask+RAPIDS (`dask_cudf.read_csv` instead of `pd.read_csv`).

In [None]:
import dask_cudf

taxi = dask_cudf.read_csv(
    's3://nyc-tlc/trip data/yellow_tripdata_2019-01.csv',
    parse_dates=['tpep_pickup_datetime', 'tpep_dropoff_datetime'],
    storage_options={'anon': True},
    assume_missing=True,
)

In [None]:
print(f'Num rows: {len(taxi)}, Size: {taxi.memory_usage(deep=True).sum().compute() / 1e6} MB')

In [None]:
def prep_df(df: dask_cudf.DataFrame) -> dask_cudf.DataFrame:
    '''
    Generate features from a raw taxi dataframe.
    Features are cast to 32-bit floats for GPU processing.
    '''
    df = df[df.fare_amount > 0]  # avoid divide-by-zero
    df['tip_fraction'] = df.tip_amount / df.fare_amount
    df['high_tip'] = (df['tip_fraction'] > 0.2) # class label
    
    df['pickup_weekday'] = df.tpep_pickup_datetime.dt.weekday
    df['pickup_hour'] = df.tpep_pickup_datetime.dt.hour
    df['pickup_week_hour'] = (df.pickup_weekday * 24) + df.pickup_hour
    df['pickup_minute'] = df.tpep_pickup_datetime.dt.minute
    df = df[features + [y_col]].astype('float32').fillna(-1)
    df[y_col] = df[y_col].astype('int32')
    
    return df
    
taxi_train = prep_df(taxi)
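
To make the derived columns concrete, here is a minimal sketch that computes the same features for a single trip using only the Python standard library. The helper `trip_features` and the example trip are hypothetical and are not part of the notebook's pipeline. The sketch assumes `dt.weekday` follows the pandas convention (Monday = 0), which `datetime.weekday()` shares.

```python
from datetime import datetime


def trip_features(pickup: datetime, fare_amount: float, tip_amount: float) -> dict:
    """Compute the derived columns from prep_df for one trip (illustration only)."""
    # prep_df filters out fare_amount <= 0 before dividing; this sketch assumes a positive fare
    tip_fraction = tip_amount / fare_amount
    weekday = pickup.weekday()  # Monday = 0 ... Sunday = 6
    return {
        'tip_fraction': tip_fraction,
        'high_tip': int(tip_fraction > 0.2),  # class label
        'pickup_weekday': weekday,
        'pickup_hour': pickup.hour,
        'pickup_week_hour': weekday * 24 + pickup.hour,  # hour of the week, 0..167
        'pickup_minute': pickup.minute,
    }


# Hypothetical trip: Wednesday 2019-01-02 at 17:45, $10.00 fare, $2.50 tip
example = trip_features(datetime(2019, 1, 2, 17, 45), fare_amount=10.0, tip_amount=2.5)
print(example)
```

For this trip the tip fraction is 0.25, so the label is 1, and `pickup_week_hour` is 2 * 24 + 17 = 65.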

Dask performs computations in a [lazy manner](https://tutorial.dask.org/01x_lazy.html): the calls above only build a task graph. We persist the dataframe so that data loading and feature processing actually run and the results are held in GPU memory for the steps that follow.
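
The idea behind persisting can be sketched with a toy class in plain Python. This is an analogy only, not Dask's API or implementation: `LazyValue` and `expensive_load` are hypothetical names, and real Dask runs the work asynchronously on the cluster and returns a new collection from `persist()`.

```python
class LazyValue:
    """Toy stand-in for a lazy collection: it stores a recipe, not a result."""

    def __init__(self, func):
        self._func = func
        self._cached = None
        self._persisted = False
        self.runs = 0  # counts how often the recipe actually executes

    def compute(self):
        # Once persisted, return the stored result without re-running the recipe
        if self._persisted:
            return self._cached
        self.runs += 1
        return self._func()

    def persist(self):
        # Run the recipe once and keep the result for later calls
        self._cached = self.compute()
        self._persisted = True
        return self


def expensive_load():
    # Placeholder for "read the CSV and engineer features"
    return [x * x for x in range(5)]


lazy = LazyValue(expensive_load)  # nothing has run yet
lazy.compute()
lazy.compute()
runs_without_persist = lazy.runs  # the recipe ran once per compute() call

lazy.persist()  # one more run, result is now cached
lazy.compute()
lazy.compute()
runs_after_persist = lazy.runs  # no further runs after persisting
```

Without persisting, every downstream step (class counts, `head()`, model fitting) would repeat the load and feature work; after persisting, they all reuse the stored result.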

In [None]:
%%time
taxi_train = taxi_train.persist()
_ = wait(taxi_train)

In [None]:
taxi_train.groupby('high_tip')['high_tip'].count().compute()

In [None]:
taxi_train.head()

# Train model

In [None]:
from cuml.dask.ensemble import RandomForestClassifier
rfc = RandomForestClassifier(n_estimators=100, max_depth=10, seed=42)

In [None]:
%%time
_ = rfc.fit(taxi_train[features], taxi_train[y_col])

# Save model

In [None]:
# not yet supported with cuml.dask

## Calculate metrics on test set

Use a different month as the test set.

In [None]:
taxi_test = dask_cudf.read_csv(
    's3://nyc-tlc/trip data/yellow_tripdata_2019-02.csv',
    parse_dates=['tpep_pickup_datetime', 'tpep_dropoff_datetime'],
    storage_options={'anon': True},
    assume_missing=True,
)

taxi_test = prep_df(taxi_test)

<br>

Convert the labels and predictions to single-GPU objects using `compute()`, because the Dask+RAPIDS implementation doesn't yet provide `roc_auc_score`.
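
For intuition about what the metric measures: ROC AUC equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as one half. The sketch below computes it by brute force over all positive/negative pairs in plain Python. The function `roc_auc_pairwise` and the toy data are hypothetical; this is an illustration of the definition, not how `cuml.metrics.roc_auc_score` is implemented, and its quadratic cost makes it unsuitable for the full dataset.

```python
def roc_auc_pairwise(y_true, y_score):
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        raise ValueError('ROC AUC needs at least one positive and one negative label')

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0  # positive ranked above negative
            elif p == n:
                wins += 0.5  # ties count as half
    return wins / (len(pos) * len(neg))


# Toy data: two low-tip trips (0) and two high-tip trips (1)
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc = roc_auc_pairwise(labels, scores)
print(auc)
```

Here three of the four positive/negative pairs are ordered correctly, so the result is 0.75. A value of 0.5 corresponds to random ranking and 1.0 to perfect separation.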

In [None]:
from cuml.metrics import roc_auc_score

# column 1 holds the predicted probability of the positive class (high_tip == 1)
preds = rfc.predict_proba(taxi_test[features])[1]
roc_auc_score(taxi_test[y_col].compute(), preds.compute())