# Use RAPIDS on a single GPU

This notebook describes a machine learning training workflow using the famous [NYC Taxi Dataset](https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page). That dataset contains information on taxi trips in New York City.

In this exercise, you'll use `cudf` to load a subset of the data and `cuml` to answer this classification question:

> based on characteristics that can be known at the beginning of a trip, will this trip result in a high tip?

## Use RAPIDS libraries

RAPIDS is a collection of libraries which enable you to take advantage of NVIDIA GPUs to accelerate machine learning workflows. This exercise uses the following RAPIDS packages:
    
* [`cudf`](https://github.com/rapidsai/cudf): data frame manipulation, similar to `pandas` and `numpy`
* [`cuml`](https://github.com/rapidsai/cuml): machine learning training and evaluation, similar to `scikit-learn`

<hr>

## Load data

This example is designed to run quickly on modest resources, so it loads just a single month of taxi data for training.

The code below loads the data into a `cudf` data frame. This is similar to a `pandas` dataframe, but it lives in GPU memory and most operations on it are done on the GPU.

In [None]:
import cudf

taxi = cudf.read_csv(
    "https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2019-01.csv",
    parse_dates=["tpep_pickup_datetime", "tpep_dropoff_datetime"],
)

The code below computes the size of this dataset in memory. One month is about 7.6 million rows and 1.08 GB.

In [None]:
print(f"Num rows: {len(taxi)}, Size: {round(taxi.memory_usage(deep=True).sum() / 1e9, 2)} GB")

You can examine the structure of the data with `cudf` commands:

`.head()` = view the first few rows

In [None]:
taxi.head()

`.dtypes` = list all the columns and the type of data in them

In [None]:
taxi.dtypes

<hr>

## Prep for Training

In [None]:
def prep_df(df: cudf.DataFrame, target_col: str) -> cudf.DataFrame:
    """
    Prepare a raw taxi dataframe for training.
        * computes the target ('high_tip')
        * adds features
        * removes unused features

    Casts all numeric values to 32-bit types, for efficiency and
    because some older versions of CUDA / ``cudf``, ``cuml``
    did not support 64-bit types in training data.
    """
    numeric_feat = [
        "pickup_weekday",
        "pickup_hour",
        "pickup_week_hour",
        "pickup_minute",
        "passenger_count",
    ]
    categorical_feat = [
        "PULocationID",
        "DOLocationID",
    ]
    features = numeric_feat + categorical_feat

    # add target
    df = df[df.fare_amount > 0]  # avoid divide-by-zero
    df["tip_fraction"] = df.tip_amount / df.fare_amount
    df[target_col] = df["tip_fraction"] > 0.2

    # add features
    df["pickup_weekday"] = df.tpep_pickup_datetime.dt.weekday
    df["pickup_hour"] = df.tpep_pickup_datetime.dt.hour
    df["pickup_week_hour"] = (df.pickup_weekday * 24) + df.pickup_hour
    df["pickup_minute"] = df.tpep_pickup_datetime.dt.minute

    # drop unused columns
    df = df[features + [target_col]].astype("float32").fillna(-1)

    # convert target to int32 for efficiency (it's just 0s and 1s)
    df[target_col] = df[target_col].astype("int32")

    return df
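
To make the feature logic concrete, here is a minimal sketch that reproduces the same calculations for a single trip using only the Python standard library. The trip values are hypothetical, and `datetime.weekday()` is used as a stand-in for the `.dt.weekday` accessor, assuming both follow the Monday-is-0 convention.

```python
from datetime import datetime

# Hypothetical trip, not taken from the dataset
pickup = datetime(2019, 1, 1, 13, 45)  # a Tuesday
fare_amount = 12.50
tip_amount = 3.00

# Target: did the tip exceed 20% of the fare?
tip_fraction = tip_amount / fare_amount
high_tip = int(tip_fraction > 0.2)

# Time features, mirroring the expressions in prep_df
pickup_weekday = pickup.weekday()  # Monday = 0, so Tuesday = 1
pickup_hour = pickup.hour
pickup_week_hour = pickup_weekday * 24 + pickup_hour  # hours since Monday 00:00
pickup_minute = pickup.minute

print(high_tip, pickup_weekday, pickup_hour, pickup_week_hour, pickup_minute)
```

`pickup_week_hour` folds the weekday and hour into a single value from 0 to 167, which lets a tree split on "time within the week" directly.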

Run the code below to get a new data frame, `taxi_train`, that can be used directly for model training.

In [None]:
target_col = "high_tip"

taxi_train = prep_df(df=taxi, target_col=target_col)

`taxi_train` is a `cudf` dataframe that will be passed to a machine learning model. Since this is a binary classification task, examine the distribution of 1s and 0s in the target before proceeding. This can be done with the `.value_counts()` method.

In [None]:
taxi_train.high_tip.value_counts()
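
The class balance matters because it sets the baseline that any model has to beat: a model that always predicts the most common class already achieves that class's share as its accuracy. The sketch below illustrates the idea with hypothetical labels in plain Python; it is not computed from the taxi data.

```python
from collections import Counter

# Hypothetical labels: 1 = high tip, 0 = not a high tip
labels = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1]

counts = Counter(labels)
positive_rate = counts[1] / len(labels)

# Accuracy of a model that always predicts the most common class
majority_baseline = max(counts.values()) / len(labels)

print(f"positive rate: {positive_rate:.2f}, majority baseline: {majority_baseline:.2f}")
```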

Before going further, check the first few rows of the dataset to make sure that the features look reasonable.

In [None]:
taxi_train.head()

Now that the dataframe has been processed, check its size in memory again.

In [None]:
print(
    f"Num rows: {len(taxi_train)}, Size: {round(taxi_train.memory_usage(deep=True).sum() / 1e9, 2)} GB"
)

As you can see above, removing unused columns dropped the size of the training data to 0.31 GB, about one third the size of the raw data.
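
A back-of-the-envelope check makes that figure plausible. The sketch below assumes that roughly 7.6 million rows survive the filter, that every column is 4 bytes wide after the casts in `prep_df`, and that the filtered frame carries an 8-byte integer index that is counted in the total. The index assumption is a guess, not something this notebook states.

```python
n_rows = 7_600_000   # approximate, from the raw row count reported earlier
n_features = 7       # 5 numeric + 2 categorical, all cast to float32
bytes_per_value = 4  # float32 and int32 are both 4 bytes wide
index_bytes = 8      # assumption: an int64 index is kept after filtering

# +1 accounts for the int32 target column
bytes_per_row = (n_features + 1) * bytes_per_value + index_bytes
estimated_gb = n_rows * bytes_per_row / 1e9

print(f"{bytes_per_row} bytes per row, about {estimated_gb:.2f} GB")
```

Under these assumptions the estimate lands near 0.30 GB, close to the measured 0.31 GB.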

<hr>

## Train a Model

Now that the data have been prepped, it's time to build a model!

For this task, we'll use the `RandomForestClassifier` from `cuml`. If you've never used a random forest or need a refresher, consult ["Forests of randomized trees"](https://scikit-learn.org/stable/modules/ensemble.html#forest) in the `scikit-learn` documentation.

In [None]:
from cuml.ensemble import RandomForestClassifier

rfc = RandomForestClassifier(n_estimators=100, max_depth=10, n_streams=4)

In [None]:
%%time

features = [c for c in taxi_train.columns if c != target_col]

_ = rfc.fit(taxi_train[features], taxi_train[target_col])

<hr>

## Calculate metrics on test set

We'll use another month of taxi data (February 2019) for the test set.

In [None]:
taxi_test = cudf.read_csv(
    "https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2019-02.csv",
    parse_dates=["tpep_pickup_datetime", "tpep_dropoff_datetime"],
)

Before creating predictions on this new dataset, it has to be transformed in exactly the same way that the original training data were prepared. Thankfully, you've already wrapped that transformation logic in a function!

In [None]:
taxi_test = prep_df(taxi_test, target_col=target_col)

In [None]:
from cuml.metrics import roc_auc_score

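# predict_proba returns one column per class; column 1 holds the
# predicted probability of the positive class (high_tip == 1)
preds = rfc.predict_proba(taxi_test[features])[1]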
roc_auc_score(taxi_test[target_col], preds)
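
ROC AUC can be read as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The following is a minimal pure-Python sketch of that definition, for intuition only. `pairwise_auc` is a hypothetical helper, not part of `cuml`, and its quadratic cost makes it unsuitable for millions of rows.

```python
def pairwise_auc(y_true, y_score):
    """Fraction of (positive, negative) pairs ranked correctly; ties count as half."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical labels and scores
auc = pairwise_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(auc)  # 0.75
```

A value of 0.5 corresponds to random ranking and 1.0 to a perfect ranking, so the score is not distorted by class imbalance the way raw accuracy can be.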

<hr>

## Next Steps

In this tutorial, you learned how to train a model for a binary classification task using `cuml`.

If you want to train a much larger model (think `max_depth=16, n_estimators=10000`), use a much larger dataset, or both, a single machine might not be enough. Try [this notebook](./02-rapids-dask.ipynb) to learn how to use Dask to take advantage of multi-machine, multi-GPU training.

<hr>