# Random forest classification (single GPU)

<img src="https://rapids.ai/assets/images/RAPIDS-logo-purple.svg" width="400">

This notebook describes a machine learning training workflow using the famous [NYC Taxi Dataset](https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page). That dataset contains information on taxi trips in New York City.

In this exercise, you'll use `cudf` to load a subset of the data and `cuml` to answer this classification question:

> based on characteristics that can be known at the beginning of a trip, will this trip result in a high tip?

## Use RAPIDS libraries

RAPIDS is a collection of libraries that let you use NVIDIA GPUs to accelerate machine learning workflows. This exercise uses the following RAPIDS packages:
    
* `cudf`: data frame manipulation, similar to `pandas` and `numpy`
* `cuml`: machine learning training and evaluation, similar to `scikit-learn`

These libraries are already available in the `saturncloud/saturn-gpu` images by default. For more information on RAPIDS, see ["Getting Started"](https://rapids.ai/start.html) in the NVIDIA docs.

### Monitoring GPU Usage

This tutorial aims to teach you how to take advantage of the GPU for data science workflows. To prove to yourself that RAPIDS is utilizing the GPU, it's important to understand how to monitor that utilization while your code is running. If you already know how to do that, skip on to the next section.

<details><summary>(click here to learn how to monitor resource utilization)</summary>

**Monitoring CPU and Main Memory**
    
CPUs and GPUs are two different types of processors, and the GPU has its own dedicated memory. Many data science libraries that claim to offer GPU acceleration accomplish their tasks with a mix of CPU and GPU use, so it's important to monitor both to see what that code is doing.
    
To monitor CPU utilization and the amount of free main memory (memory available to the CPU), you can use `htop`.
    
Open a new terminal and run `htop`. It displays an auto-updating dashboard of CPU utilization and memory usage.
    
**Monitoring GPU and GPU memory**
    
Open a new terminal and run the following command.
    
```shell
watch -n 5 nvidia-smi
```
    
This command refreshes the output in the terminal every 5 seconds. It shows information such as:
    
* current CUDA version
* NVIDIA driver version
* internal temperature
* current utilization of GPU memory
* list of processes (if any) currently running on the GPU, and how much GPU memory they're consuming
    
If you'd prefer a simpler view, consider `gpustat`, which tracks only temperature, GPU utilization, and memory. It is not available by default in the Saturn GPU images, but you can install it from PyPI.

```shell
pip install gpustat
```
    
Then run it:
    
```shell
gpustat -cp --watch
```
    
Whichever option you choose, leave these terminals with the monitoring process running while you work, so you can see how the code below uses the available resources.
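
If you would rather log GPU usage from Python than watch a terminal, `nvidia-smi` can also emit machine-readable CSV through its `--query-gpu` and `--format` options. The sketch below is one possible approach and not part of the original workflow: the helper names `parse_gpu_stats` and `query_gpu_stats` are hypothetical, the sample line is made up, and `query_gpu_stats` assumes `nvidia-smi` is on the `PATH`.

```python
import subprocess

# Fields requested from nvidia-smi; `nvidia-smi --help-query-gpu` lists the others.
QUERY_FIELDS = "utilization.gpu,memory.used,memory.total"


def parse_gpu_stats(csv_text: str) -> list:
    """Parse `--format=csv,noheader,nounits` output into one dict per GPU."""
    stats = []
    for line in csv_text.strip().splitlines():
        util, used, total = (float(v) for v in line.split(","))
        stats.append({"util_pct": util, "mem_used_mib": used, "mem_total_mib": total})
    return stats


def query_gpu_stats() -> list:
    """Run nvidia-smi once and return the parsed stats (needs an NVIDIA driver)."""
    out = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY_FIELDS}", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_gpu_stats(out)


# Made-up sample line, so the parser can be tried on a machine without a GPU.
sample = "37, 2048, 16160\n"
print(parse_gpu_stats(sample))
```

Calling `query_gpu_stats()` before and after a training cell gives a rough before/after picture of GPU memory use. The parser assumes every field is numeric, so a field reported as `[N/A]` would need extra handling.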

</details>

<hr>

## Load data

This example is designed to run quickly on modest resources, so it loads only a single month of taxi data for training.

The code below loads the data into a `cudf` DataFrame. This is similar to a `pandas` DataFrame, but it lives in GPU memory and most operations on it run on the GPU.

In [2]:
import cudf

taxi = cudf.read_csv(
    'https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2019-01.csv',
    parse_dates=['tpep_pickup_datetime', 'tpep_dropoff_datetime']
)

In [4]:
print(f'Num rows: {len(taxi)}, Size: {taxi.memory_usage(deep=True).sum() / 1e6} MB')

Num rows: 7667792, Size: 1082.117204 MB


In [7]:
def prep_df(df: cudf.DataFrame, target_col: str) -> cudf.DataFrame:
    '''
    Generate features from a raw taxi dataframe.
    Use 32-bit precision for GPU processing.
    '''
    numeric_feat = [
        'pickup_weekday', 
        'pickup_hour', 
        'pickup_week_hour', 
        'pickup_minute', 
        'passenger_count',
    ]
    categorical_feat = [
        'PULocationID', 
        'DOLocationID',
    ]
    features = numeric_feat + categorical_feat

    # add target
    df = df[df.fare_amount > 0]  # avoid divide-by-zero
    df['tip_fraction'] = df.tip_amount / df.fare_amount
    df[target_col] = (df['tip_fraction'] > 0.2)
    
    # add features
    df['pickup_weekday'] = df.tpep_pickup_datetime.dt.weekday
    df['pickup_hour'] = df.tpep_pickup_datetime.dt.hour
    df['pickup_week_hour'] = (df.pickup_weekday * 24) + df.pickup_hour
    df['pickup_minute'] = df.tpep_pickup_datetime.dt.minute
    
    # drop unused columns
    df = df[features + [target_col]].astype('float32').fillna(-1)
    
    # convert target to int32 for efficiency (it's just 0s and 1s)
    df[target_col] = df[target_col].astype('int32')
    
    return df
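
To make the feature arithmetic in `prep_df` concrete, here is a minimal pure-Python sketch that applies the same formulas to a single trip. The function name `trip_features` and the trip values are hypothetical and chosen only for illustration; the real pipeline performs these steps column-wise on the GPU with `cudf`.

```python
from datetime import datetime


def trip_features(pickup: datetime, fare_amount: float, tip_amount: float) -> dict:
    """Mirror the prep_df arithmetic for one trip (fare_amount must be > 0)."""
    weekday = pickup.weekday()  # Monday == 0, the same convention as .dt.weekday
    hour = pickup.hour
    return {
        "pickup_weekday": weekday,
        "pickup_hour": hour,
        "pickup_week_hour": weekday * 24 + hour,  # hour of the week, 0..167
        "pickup_minute": pickup.minute,
        "high_tip": int(tip_amount / fare_amount > 0.2),  # strictly more than 20%
    }


# Hypothetical trip: Tuesday 2019-01-01 at 00:46, $7.00 fare, $1.65 tip.
example = trip_features(datetime(2019, 1, 1, 0, 46), fare_amount=7.0, tip_amount=1.65)
print(example)
```

Because the comparison is strict, a tip of exactly 20% of the fare is labeled `0`, not `1`.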

In [8]:
target_col = 'high_tip'
taxi_train = prep_df(
    df=taxi,
    target_col=target_col
)

In [9]:
taxi_train.high_tip.value_counts()

1    4001782
0    3656453
Name: high_tip, dtype: int32
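
The class counts above are worth a quick sanity check before training. The short calculation below, a sketch that only reuses the numbers printed earlier in this notebook, derives the share of high-tip trips (which is also the accuracy of a trivial model that always predicts "high tip") and the number of rows removed by the `fare_amount > 0` filter.

```python
# Counts copied from the value_counts() output above.
n_high, n_low = 4_001_782, 3_656_453
n_raw = 7_667_792  # rows loaded before filtering

n_kept = n_high + n_low
positive_share = n_high / n_kept  # accuracy of always predicting "high tip"
n_dropped = n_raw - n_kept        # rows removed by the fare_amount > 0 filter

print(f"rows kept: {n_kept}, dropped: {n_dropped}")
print(f"share of high-tip trips: {positive_share:.1%}")
```

The classes are close to balanced, so a useful model needs to do noticeably better than this majority-class baseline.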

In [10]:
taxi_train.head()

| | pickup_weekday | pickup_hour | pickup_week_hour | pickup_minute | passenger_count | PULocationID | DOLocationID | high_tip |
|---|---|---|---|---|---|---|---|---|
| 0 | 1.0 | 0.0 | 24.0 | 46.0 | 1.0 | 151.0 | 239.0 | 1 |
| 1 | 1.0 | 0.0 | 24.0 | 59.0 | 1.0 | 239.0 | 246.0 | 0 |
| 2 | 4.0 | 13.0 | 109.0 | 48.0 | 3.0 | 236.0 | 236.0 | 0 |
| 3 | 2.0 | 15.0 | 63.0 | 52.0 | 5.0 | 193.0 | 193.0 | 0 |
| 4 | 2.0 | 15.0 | 63.0 | 56.0 | 5.0 | 193.0 | 193.0 | 0 |


## Train model

In [11]:
from cuml.ensemble import RandomForestClassifier
rfc = RandomForestClassifier(
    n_estimators=100,
    max_depth=10,
    random_state=42  # replaces the `seed` argument used by older cuML releases
)

In [13]:
%%time

features = [c for c in taxi_train.columns if c != target_col]

_ = rfc.fit(
    taxi_train[features],
    taxi_train[target_col]
)

CPU times: user 12.8 s, sys: 4.98 s, total: 17.8 s
Wall time: 6 s


## Save model

In [None]:
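import cloudpickle

# Assumption: MODEL_PATH is not defined earlier in this notebook; set it to any existing, writable directory.
MODEL_PATH = '.'

with open(f'{MODEL_PATH}/random_forest_rapids.pkl', 'wb') as f:
    cloudpickle.dump(rfc, f)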
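
To reuse the saved model later, load the file back with `pickle.load`, which can read files written by `cloudpickle`. The sketch below shows the save/load round trip with a plain dictionary standing in for the model, so it runs without a GPU. The helper names `save_model` and `load_model` are hypothetical, and loading the real model assumes `cuml` is importable in the loading environment.

```python
import os
import pickle
import tempfile


def save_model(model, path: str) -> None:
    """Serialize `model` to `path` (the notebook calls cloudpickle.dump the same way)."""
    with open(path, "wb") as f:
        pickle.dump(model, f)


def load_model(path: str):
    """Load a pickled model; only unpickle files you trust."""
    with open(path, "rb") as f:
        return pickle.load(f)


# Round-trip a stand-in object in a temporary directory.
stand_in = {"n_estimators": 100, "max_depth": 10}
with tempfile.TemporaryDirectory() as tmp_dir:
    model_file = os.path.join(tmp_dir, "random_forest_rapids.pkl")
    save_model(stand_in, model_file)
    restored = load_model(model_file)

print(restored == stand_in)
```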

## Calculate metrics on test set

Use a different month of data as the test set, so the model is evaluated on trips it did not see during training.

In [None]:
taxi_test = cudf.read_csv(
    'https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2019-02.csv',
    parse_dates=['tpep_pickup_datetime', 'tpep_dropoff_datetime']
)

taxi_test = prep_df(taxi_test, target_col=target_col)

In [None]:
from cuml.metrics import roc_auc_score

# column 1 holds the predicted probability of the positive class (high_tip == 1)
preds = rfc.predict_proba(taxi_test[features])[1]
roc_auc_score(taxi_test[target_col], preds)
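
For intuition about what this metric measures: ROC AUC equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The pure-Python sketch below computes it by comparing every positive/negative pair. It is an illustration with made-up labels and scores, not how `cuml` implements the metric, and the quadratic pair count makes it unsuitable for millions of rows.

```python
def roc_auc(labels, scores) -> float:
    """Probability that a random positive outscores a random negative (ties count half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Made-up example: two negatives, two positives.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(roc_auc(labels, scores))
```

A value of 0.5 corresponds to random ranking and 1.0 to a perfect ranking of high-tip trips above the rest.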