# Batch Training with Ray Datasets

# Introduction

Batch training and tuning are common tasks in simple machine learning use-cases such as time series forecasting. They require fitting of simple models on multiple data batches corresponding to locations, products, etc.

Batch training in the context of this notebook is understood as creating the same model(s) for different and separate datasets or subsets of a dataset. This notebook showcases how to conduct batch training using [Ray Dataset](https://docs.ray.io/en/latest/data/dataset.html).

![Batch training diagram](./images/batch-training.svg)

For the data, we will use the [NYC Taxi dataset](https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page).  This popular tabular dataset contains historical taxi pickups by timestamp and location in NYC. To demonstrate batch training & tuning, we will simplify the data to a linear regression problem to predict `trip_duration` and use scikit-learn.

To demonstrate how data and training can be batch-parallelized, we will train a separate model for each dropoff location. This means we can use the `dropoff_location_id` column in the dataset to group the dataset into data batches. Then we will fit a separate model for each batch.
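As a rough illustration of the idea, independent of Ray, the sketch below groups a small made-up DataFrame by `dropoff_location_id` and fits one straight line per group. The toy values and the `fit_slope` helper are hypothetical, and `numpy.polyfit` stands in for the scikit-learn model used later in this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: duration grows linearly with distance in each location.
toy = pd.DataFrame(
    {
        "dropoff_location_id": [1, 1, 1, 2, 2, 2],
        "trip_distance": [1.0, 2.0, 3.0, 1.0, 2.0, 3.0],
        "trip_duration": [300.0, 600.0, 900.0, 200.0, 400.0, 600.0],
    }
)


def fit_slope(group: pd.DataFrame) -> float:
    # Fit duration = slope * distance + intercept for a single data batch.
    slope, _intercept = np.polyfit(group["trip_distance"], group["trip_duration"], deg=1)
    return float(slope)


# One independent "model" (here just a slope) per dropoff location.
slopes = {
    loc_id: fit_slope(group) for loc_id, group in toy.groupby("dropoff_location_id")
}
print(slopes)
```

Each group is fitted without reference to any other group, which is what makes the work easy to parallelize.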

# Contents
In this tutorial, you will learn about:
 1. [Basics of Ray Datasets](#dataset)
 2. [How to perform batch training with Ray Datasets](#train_func)

# Walkthrough

Let’s start by importing a few required libraries, including open-source [Ray](https://github.com/ray-project/ray) itself!

In [None]:
from typing import Tuple, List, Union, Optional, Callable
import os
import random
import time
import pandas as pd
import numpy as np
import pyarrow.dataset as pds
from pyarrow import fs
from pyarrow import parquet as pq
from ray.data import Dataset

In [None]:
import ray

ray.init(ignore_reinit_error=True)

In [None]:
# For benchmarking purposes, we can print the times of various operations.
# In order to reduce clutter in the output, this is set to False by default.
PRINT_TIMES = False


def print_time(msg: str):
    if PRINT_TIMES:
        print(msg)

In [None]:
# To speed things up, we'll only use a small subset of the full dataset consisting of the last two months of 2019.
# You can choose to use the full dataset for 2018-2019 by setting the SMOKE_TEST variable to False.

SMOKE_TEST = True

## Introduction to Ray Datasets <a class="anchor" id="dataset"></a>

[Ray Datasets](datasets) are the standard way to load and exchange data in Ray libraries and applications. We will use the [Ray Dataset APIs](dataset-api) to read the data and quickly inspect it.

First, we will define some global variables we will use throughout the notebook, such as the list of S3 links to the files making up the dataset and the possible location IDs.

In [None]:
# Define some global variables.
target = "trip_duration"
s3_partitions = pds.dataset(
    "s3://anonymous@air-example-data/ursa-labs-taxi-data/by_year/",
    partitioning=["year", "month"],
)

s3_files = [f"s3://{file}" for file in s3_partitions.files]

# Obtain all location IDs
location_ids = (
    pq.read_table(s3_files[0], columns=["pickup_location_id"])["pickup_location_id"]
    .unique()
    .to_pylist()
)

starting_idx = -2 if SMOKE_TEST else 0

s3_files = s3_files[starting_idx:]
print(f"NYC Taxi using {len(s3_files)} file(s)!")
print(f"s3_files: {s3_files}")
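The `starting_idx` logic above relies on Python slice semantics: a start index of `-2` keeps the last two entries, while `0` keeps the whole list. A minimal sketch, using hypothetical file names in place of the real S3 paths:

```python
# Hypothetical file names standing in for the S3 paths listed above.
files = [f"2019/{month:02d}/data.parquet" for month in range(1, 13)]


def select_files(all_files: list, smoke_test: bool) -> list:
    # -2 keeps only the last two files; 0 keeps every file.
    starting_idx = -2 if smoke_test else 0
    return all_files[starting_idx:]


smoke_subset = select_files(files, smoke_test=True)
full_set = select_files(files, smoke_test=False)
print(smoke_subset)
print(len(full_set))
```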

Next, we will call `ray.data.read_parquet` to create a Ray dataset from a list of S3 URIs. This will read the files in parallel onto the Ray cluster.

In [None]:
ds = ray.data.read_parquet(s3_files)
ds

### Ray Dataset statistics

Let's get some basic statistics about our newly created Ray Dataset.

Parquet stores the number of rows per file in the Parquet metadata, so we can get the number of rows in `ds` without triggering a full data read.

In [None]:
print(f"Number of rows: {ds.count()}")

Ray Datasets can also pull the size in bytes from the Parquet metadata, without triggering a data read. This can differ significantly from the actual in-memory size.

In [None]:
print(f"Size bytes (from parquet metadata): {ds.size_bytes()}")

We can also trigger full reading of the dataset and inspect the real size in bytes.

In [None]:
print(f"Size bytes (from full data read): {ds.fully_executed().size_bytes()}")

Let's fetch the schema from the underlying Parquet metadata.

In [None]:
print("\nSchema data types:")
data_types = list(zip(ds.schema().names, ds.schema().types))
for name, dtype in data_types:
    print(f"{name}: {dtype}")

Finally, we can take a peek at a sample row:

In [None]:
print("\nLook at a sample row:")
ds.take(1)

### Filter on Read - Projection and Filter Pushdown

Note that Ray Datasets' Parquet reader supports projection (column selection) and row filter pushdown, where we push the column selection and the row-based filter down to the Parquet read. If we specify column selection at Parquet read time, the unselected columns won't even be read from disk! This can save a lot of memory, especially with big datasets, and allow us to avoid OOM issues.

The row-based filter is specified via [Arrow's dataset field expressions](https://arrow.apache.org/docs/6.0/python/generated/pyarrow.dataset.Expression.html#pyarrow.dataset.Expression). 

**Best practice is to filter as much as you can directly in the Ray Dataset `read_parquet()` statement!**

Normally, there is some data exploration to determine the cleaning steps. Let's just assume we know the data cleaning steps are:
- Drop trips with a non-positive trip distance, fare, or passenger count, and trips lasting 1 minute or less.
- Drop 2 unknown zones: `[264, 265]`.
- Calculate trip duration and add it as a new column.


In [None]:
def pushdown_read_data(files_list: list, sample_ids: list) -> Dataset:
    filter_expr = (
        (pds.field("passenger_count") > 0)
        & (pds.field("trip_distance") > 0)
        & (pds.field("fare_amount") > 0)
        & (~pds.field("pickup_location_id").isin([264, 265]))
        & (~pds.field("dropoff_location_id").isin([264, 265]))
        & (pds.field("dropoff_location_id").isin(sample_ids))
    )

    dataset = ray.data.read_parquet(
        files_list,
        columns=[
            "pickup_at",
            "dropoff_at",
            "pickup_location_id",
            "dropoff_location_id",
            "passenger_count",
            "trip_distance",
            "fare_amount",
        ],
        filter=filter_expr,
    )

    return dataset

In [None]:
# Test the pushdown_read_data function
pushdown_ds = pushdown_read_data(s3_files, location_ids)

print(f"Number rows: {pushdown_ds.count()}")
# Display some metadata about the dataset.
print("\nMetadata: ")
print(pushdown_ds)
# Fetch the schema from the underlying Parquet metadata.
print("\nSchema:")
print(pushdown_ds.schema())
# Take a peek at a single row
print("\nLook at a sample row:")
pushdown_ds.take(1)

We can use `to_pandas` to convert a Ray Dataset into a pandas DataFrame and inspect that.

```{tip}
Converting a Ray Dataset to pandas is not recommended with large data sizes, as it will load all the data into the memory of a single node. To help avoid OOM errors, `to_pandas` will by default only convert the first 10000 rows.
```

In [None]:
df = pushdown_ds.to_pandas()
df[["dropoff_location_id", "trip_distance"]].groupby("dropoff_location_id").count()

### Custom data transform functions

Ray Datasets allow you to specify custom data transform functions using familiar syntax, such as pandas. These [custom functions, or UDFs (user-defined functions)](transforming_datasets) can be called using `Dataset.map_batches(my_UDF)`. You can specify the data processing API you are using with the `batch_format` parameter. The transformation is applied to the data batches in parallel.

```{tip}
You may need to call `Dataset.repartition(n)` first to split the Dataset into more blocks internally. The upper bound of parallelism is the number of blocks.
```

Available data processing APIs you can specify in the `batch_format` parameter include `"pandas"`, `"pyarrow"`, `"numpy"`, and `"native"` to avoid converting data at all. Tabular data will be passed into your UDF by default as a pandas DataFrame. Tensor data will be passed into your UDF as a numpy array. Here, we will use `batch_format="pandas"` explicitly.

In [None]:
# A Pandas DataFrame UDF for transforming the underlying blocks of a Dataset in parallel.
def transform_batch(df: pd.DataFrame) -> pd.DataFrame:
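    # total_seconds() keeps whole days and the sign, unlike .dt.seconds.
    df["trip_duration"] = (df["dropoff_at"] - df["pickup_at"]).dt.total_seconds()
    # Keep trips longer than 1 minute; copy so we do not modify a slice.
    df = df[df["trip_duration"] > 60].copy()
    df.drop(["dropoff_at", "pickup_at", "pickup_location_id"], axis=1, inplace=True)
    df["dropoff_location_id"] = df["dropoff_location_id"].fillna(-1)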
    return df
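One detail worth checking in a UDF like this is how the duration is derived from the two timestamps. In pandas, `.dt.seconds` returns only the seconds component of a timedelta (between 0 and 86399), whereas `.dt.total_seconds()` returns the full signed duration. The sketch below compares the two on three made-up trips: a normal one, one longer than a day, and one whose dropoff precedes its pickup.

```python
import pandas as pd

# Hypothetical trips: 5 minutes, 1 day + 10 minutes, and -5 minutes (bad record).
trips = pd.DataFrame(
    {
        "pickup_at": pd.to_datetime(
            ["2019-12-01 10:00:00", "2019-12-01 23:50:00", "2019-12-01 10:00:00"]
        ),
        "dropoff_at": pd.to_datetime(
            ["2019-12-01 10:05:00", "2019-12-03 00:00:00", "2019-12-01 09:55:00"]
        ),
    }
)

delta = trips["dropoff_at"] - trips["pickup_at"]

# Seconds component only: days are ignored and negative values wrap around.
seconds_component = delta.dt.seconds.tolist()
# Full signed duration in seconds.
total_seconds = delta.dt.total_seconds().tolist()

print(seconds_component)
print(total_seconds)
```

With the seconds component, the bad record would pass a `> 60` filter, while the full duration correctly marks it as negative.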

In [None]:
# Test the transform UDF function.
print(f"Number of rows before transformation: {pushdown_ds.count()}")

# batch_format="pandas" tells Datasets to provide the transformer with batches
# represented as Pandas DataFrames.
pushdown_ds = pushdown_ds.map_batches(transform_batch, batch_format="pandas")

# Verify row count.
print(f"Number of rows after transformation: {pushdown_ds.count()}")

### Tidying up

We'll delete the datasets we have been using in order to free up memory in our Ray cluster.

In [None]:
del ds
del pushdown_ds

To make our code easier to read, let's summarize the data processing functions again here.

In [None]:
# Filter parquet data using Ray Datasets read_parquet()
def pushdown_read_data(files_list: list, sample_ids: list) -> Dataset:

    start = time.time()

    filter_expr = (
        (pds.field("passenger_count") > 0)
        & (pds.field("trip_distance") > 0)
        & (pds.field("fare_amount") > 0)
        & (~pds.field("pickup_location_id").isin([264, 265]))
        & (~pds.field("dropoff_location_id").isin([264, 265]))
        & (pds.field("dropoff_location_id").isin(sample_ids))
    )

    dataset = ray.data.read_parquet(
        files_list,
        columns=[
            "pickup_at",
            "dropoff_at",
            "pickup_location_id",
            "dropoff_location_id",
            "passenger_count",
            "trip_distance",
            "fare_amount",
        ],
        filter=filter_expr,
    )

    data_loading_time = time.time() - start
    print_time(f"Data loading time: {data_loading_time:.2f} seconds")
    return dataset


# A Pandas DataFrame UDF for transforming the underlying blocks of a Dataset in parallel.
def transform_batch(df: pd.DataFrame) -> pd.DataFrame:
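    # total_seconds() keeps whole days and the sign, unlike .dt.seconds.
    df["trip_duration"] = (df["dropoff_at"] - df["pickup_at"]).dt.total_seconds()
    # Keep trips longer than 1 minute; copy so we do not modify a slice.
    df = df[df["trip_duration"] > 60].copy()
    df.drop(["dropoff_at", "pickup_at", "pickup_location_id"], axis=1, inplace=True)
    df["dropoff_location_id"] = df["dropoff_location_id"].fillna(-1)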
    return df

## Batch training with Ray Datasets <a class="anchor" id="train_func"></a>

Now that we have learned more about our data and written a pandas UDF to transform our data, we are ready to train a model on batches of this data in parallel.

To simplify the model training part, we will use linear regression in Scikit-learn.  
- We will use the `dropoff_location_id` column in the dataset to group the dataset into data batches. 
- Then we will fit a separate model for each batch to predict `trip_duration`.

In [None]:
import sklearn
from sklearn.base import BaseEstimator
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from ray.train.sklearn import SklearnTrainer, SklearnPredictor
from ray.train.batch_predictor import BatchPredictor

### Define training functions

We want to fit a linear regression model to the trip duration for each drop-off location. For scoring, we will calculate mean absolute error on the validation set, and report that as model error per drop-off location.
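For reference, mean absolute error is the average of the absolute differences between the true and predicted values. The sketch below computes it by hand with NumPy on made-up numbers. The helper name `mean_absolute_error_np` is hypothetical and is only meant to show what the scikit-learn metric measures.

```python
import numpy as np


def mean_absolute_error_np(y_true, y_pred) -> float:
    # Average of |true - predicted| over all samples.
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(y_true - y_pred)))


# Hypothetical trip durations in seconds: absolute errors are 10, 10 and 30.
mae = mean_absolute_error_np([100, 200, 300], [110, 190, 330])
print(f"MAE: {mae:.2f} seconds")
```

Because the error is expressed in the same unit as the target, a value here reads directly as "seconds off, on average".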

We define the `fit_and_score_sklearn` function [<b><i>as a Ray task</i></b>](https://docs.ray.io/en/latest/ray-core/tasks.html), where each scikit-learn training task consumes one batch of data. Ray automatically distributes Ray tasks across your Ray cluster to make use of parallel compute.


In [None]:
# Ray task to fit and score a scikit-learn model.
@ray.remote
def fit_and_score_sklearn(
    train_df: pd.DataFrame, test_df: pd.DataFrame, model: BaseEstimator
) -> Tuple[BaseEstimator, float]:

    # Assemble train/test pandas dfs
    train_X = train_df[["passenger_count", "trip_distance", "fare_amount"]]
    train_y = train_df.trip_duration
    test_X = test_df[["passenger_count", "trip_distance", "fare_amount"]]
    test_y = test_df.trip_duration

    # Start training.
    model = model.fit(train_X, train_y)
    pred_y = model.predict(test_X)
    error = mean_absolute_error(test_y, pred_y)

    return model, error

The `train_and_evaluate` function contains the logic for train-test splitting and fitting of multiple models in parallel on each data batch, for purposes of comparison. Thanks to this, we can evaluate several models and choose the best one for each data batch.

This function takes one batch of Ray Dataset data as input. The train and test splits of the batch are placed into Ray's distributed shared-memory object store, using the command [<b><i>ray.put()</i></b>](https://docs.ray.io/en/latest/ray-core/objects.html). Then the remote `fit_and_score_sklearn` Ray task is launched for every model at once. The return values of `fit_and_score_sklearn` are all retrieved outside the loop using a single [<b><i>ray.get()</i></b>](https://docs.ray.io/en/latest/ray-core/tasks/patterns/ray-get-loop.html) call.

In [None]:
def train_and_evaluate(
    the_df: pd.DataFrame, models: List[BaseEstimator]
) -> pd.DataFrame:

    # Check that the input DataFrame is big enough for training.
    if len(the_df) < 4:
        print_time("Input DataFrame is empty or has fewer than 4 rows")
        # Return an empty DataFrame so the caller can detect the skipped group.
        return pd.DataFrame()

    # Use positional access, since the index may not start at 0 after filtering.
    loc_id = the_df["dropoff_location_id"].iloc[0]

    start = time.time()

    # Train / test split
    # Randomly split the data into 80/20 train/test.
    train_df, test_df = train_test_split(the_df, test_size=0.2, shuffle=True)

    # We put the train & test dataframes into Ray object store
    # so that they can be reused by all models fitted here.
    # https://docs.ray.io/en/latest/ray-core/tips-for-first-time.html#tip-3-avoid-passing-same-object-repeatedly-to-remote-tasks
    train_ref = ray.put(train_df)
    test_ref = ray.put(test_df)

    # Launch a fit and score task for each model.
    results = ray.get(
        [fit_and_score_sklearn.remote(train_ref, test_ref, model) for model in models]
    )
    # Sort by error so that the best model for this location comes first.
    results.sort(key=lambda x: x[1])

    # Assemble loc_id, the best model, and its error in a pandas DataFrame
    results = [loc_id] + list(results[0])
    results_return = pd.DataFrame(columns=["location_id", "model", "error"])
    results_return.loc[0] = results

    training_time = time.time() - start
    print_time(f"Training time for LocID {loc_id}: {training_time:.2f} seconds")

    return results_return
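The "launch everything first, collect once afterwards" shape of `train_and_evaluate` can be sketched without a cluster. The example below uses Python's standard-library thread pool, not Ray, purely as an analogy: `pool.submit` plays the role of `.remote()`, the shared list plays the role of the object placed in the object store, and gathering the results after the submission loop mirrors the single `ray.get()` call. The `score` function and its inputs are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def score(shared_data: list, scale: float) -> float:
    """Hypothetical stand-in for a fit-and-score task."""
    return scale * sum(shared_data)


shared_data = [1.0, 2.0, 3.0]  # shared input, created once and reused by every task
scales = [0.5, 1.0, 2.0]  # stands in for the list of models

with ThreadPoolExecutor(max_workers=3) as pool:
    # Submit every task before waiting on any of them.
    futures = [pool.submit(score, shared_data, scale) for scale in scales]
    # Collect all results after the submission loop, in submission order.
    results = [future.result() for future in futures]

print(results)
```

Waiting on each result inside the submission loop would serialize the work, which is the pattern the linked Ray documentation warns against.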

Recall how we wrote a data transform <i>UDF_func</i> using pandas syntax? It was called with the pattern:
- `rds.map_batches(UDF_func, batch_format)`

Similarly, a groupby-agg function can be used later when we perform a Ray Dataset <b>groupby</b>. Below, the function <i>agg_func</i> will be called using the pattern:
- `rds.groupby(key).map_groups(agg_func, batch_format)`
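The contract for such a per-group function can be previewed in plain pandas: it receives the rows of one group as a DataFrame and returns a DataFrame, here with exactly one row. The sketch below is only an analogy for the groupby pattern. The `summarize_group` helper, the toy data, and the `mean_duration` column are hypothetical, and a group that is too small gets a NaN placeholder instead of a result.

```python
import pandas as pd


def summarize_group(group: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical per-group function that returns one row per group."""
    loc_id = group["dropoff_location_id"].iloc[0]
    if len(group) < 2:
        # Too little data: emit a placeholder row instead of failing.
        mean_duration = float("nan")
    else:
        mean_duration = float(group["trip_duration"].mean())
    return pd.DataFrame({"location_id": [loc_id], "mean_duration": [mean_duration]})


toy = pd.DataFrame(
    {
        "dropoff_location_id": [1, 1, 2],
        "trip_duration": [100.0, 300.0, 500.0],
    }
)

# Apply the function to each group and stack the one-row results.
rows = [summarize_group(group) for _, group in toy.groupby("dropoff_location_id")]
summary = pd.concat(rows, ignore_index=True)
print(summary)
```

Returning a placeholder row keeps the output at one row per group, so failed groups remain visible in the final table.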

In [None]:
# possible to add more trainers in this list
MODELS = [LinearRegression()]

In [None]:
# A Pandas DataFrame aggregation function for processing grouped batches of Ray Dataset data.
def agg_func(the_df: pd.DataFrame):

    ret = pd.DataFrame()

    # Handle errors in data groups
    try:
        # Transform the input DataFrame, then train and evaluate on the result
        ret = train_and_evaluate(transform_batch(the_df), MODELS)
    except Exception as e:
        # Report the error instead of silently swallowing it
        print(f"Error while processing group: {e}")

    # Process null data groups
    if ret is None or ret.shape[0] == 0:
        # Use positional access, since the index may not start at 0
        loc_id = the_df["dropoff_location_id"].iloc[0]
        print(f"failed on {loc_id}")
        # assemble a null entry
        ret = [loc_id, None, None]
        results_return = pd.DataFrame(columns=["location_id", "model", "error"])
        results_return.loc[0] = ret
        return results_return

    return ret

### Run batch training using `map_groups`

Finally, the main "driver code" reads the Parquet files (each file corresponds to one month of NYC taxi data) into a Ray Dataset called `ds`. Then we use Ray Dataset <b>groupby</b> to map each group into a batch of data on which `agg_func` can run, using the pattern [groupby-map_groups(agg_func, batch_format)](https://docs.ray.io/en/latest/data/api/grouped_dataset.html). `map_groups` applies `agg_func` to each group, and the groups can be processed in parallel.

In [None]:
# Driver code to run this.
start = time.time()

# Read data into Ray Dataset
ds = pushdown_read_data(s3_files, location_ids)

# Use Ray Dataset groupby.map_groups() to parallel process each group
# Returns a Ray Dataset
results = ds.groupby("dropoff_location_id").map_groups(agg_func, batch_format="pandas")
print_time(f"groupby.map_groups() returned: {results}")

total_time_taken = time.time() - start
print(f"Total number of models: {results.count()}")
print(f"TOTAL TIME TAKEN: {total_time_taken:.2f} seconds")
print(results)

In [None]:
# sort values by location id
results_df = results.to_pandas(limit=float("inf"))
results_df.sort_values(by=["location_id"], ascending=True, inplace=True)
results_df