# Batch Training with Ray Core

```{tip}
We strongly recommend using [Ray Datasets](data_user_guide) and [AIR Trainers](air-trainers) to develop batch training. They let you build it faster and more easily, and they provide built-in benefits such as an autoscaling actor pool. If you think your use case cannot be supported by Ray Datasets or AIR, we'd love to get your feedback, e.g. through a [Ray GitHub issue](https://github.com/ray-project/ray/issues).
```

Batch training and tuning are common tasks in simple machine learning use-cases such as time series forecasting. They require fitting of simple models on multiple data batches corresponding to locations, products, etc. This notebook showcases how to conduct batch training on the [NYC Taxi Dataset](https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page) using only Ray Core and stateless Ray tasks.

Batch training in the context of this notebook is understood as creating the same model(s) for different and separate datasets or subsets of a dataset. This task is naively parallelizable and can be easily scaled with Ray.

![Batch training diagram](./images/batch-training.svg)

# Contents
In this tutorial, you will learn about:
 1. Reading parquet data,
 2. Creating Ray tasks to preprocess, train and evaluate data batches,
 3. Dividing data into batches and spawning a Ray task for each batch to be run in parallel,
 4. The runtime-memory tradeoff between delegating data loading to each task as opposed to centralized data loading.

# Walkthrough

Our task is to create separate time series models for each pickup location. We can use the `pickup_location_id` column in the dataset to group the dataset into data batches. We will then fit models for each batch and choose the best one.

Let’s start by importing Ray and initializing a local Ray cluster.

In [1]:
from typing import Callable, Optional, List, Union, Tuple, Iterable
import time
import numpy as np
import pandas as pd

from sklearn.base import BaseEstimator
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

import pyarrow as pa
from pyarrow import fs
from pyarrow import dataset as ds
from pyarrow import parquet as pq
import pyarrow.compute as pc

In [2]:
import ray

ray.init(ignore_reinit_error=True)

Python version: 3.8.5
Ray version: 2.0.0
Dashboard: http://console.anyscale-staging.com/api/v2/sessions/ses_ZmHebxHaZpYkw9x9efJ5wBVX/services?redirect_to=dashboard


For benchmarking purposes, we can print the times of various operations. In order to reduce clutter in the output, this is set to False by default.

In [3]:
PRINT_TIMES = False


def print_time(msg: str):
    if PRINT_TIMES:
        print(msg)

To speed things up, we'll only use a small subset of the full dataset: the last two monthly files, both from 2019. You can choose to use the full dataset for 2018-2019 by setting the `SMOKE_TEST` variable to False.

In [4]:
SMOKE_TEST = True

The `read_data` function reads a Parquet file and uses a push-down predicate on `pickup_location_id` to extract only the data batch we want to fit a model on. By having each task read the data and extract its batch separately, we keep memory utilization low, as opposed to requiring each task to load the entire partition into memory first.

We are using PyArrow to read the file, as it supports push-down predicates to be applied during file reading. This lets us avoid having to load an entire file into memory, which could cause an OOM error with a large dataset. After the dataset is loaded, we convert it to pandas so that it can be used for training with scikit-learn.

In [5]:
def read_data(file: str, pickup_location_id: int) -> pd.DataFrame:
    return pq.read_table(
        file,
        filters=[("pickup_location_id", "=", pickup_location_id)],
        columns=[
            "pickup_at",
            "dropoff_at",
            "pickup_location_id",
            "dropoff_location_id",
        ],
    ).to_pandas()

As we will be using the NYC Taxi dataset, we define a simple batch transformation function to set correct data types, calculate the trip duration and fill missing values.

In [6]:
def transform_batch(df: pd.DataFrame) -> pd.DataFrame:
    df["pickup_at"] = pd.to_datetime(df["pickup_at"], format="%Y-%m-%d %H:%M:%S")
    df["dropoff_at"] = pd.to_datetime(df["dropoff_at"], format="%Y-%m-%d %H:%M:%S")
    df["trip_duration"] = (df["dropoff_at"] - df["pickup_at"]).dt.seconds
    df["pickup_location_id"] = df["pickup_location_id"].fillna(-1)
    df["dropoff_location_id"] = df["dropoff_location_id"].fillna(-1)
    return df
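
One detail of `transform_batch` is worth knowing: pandas' `.dt.seconds` returns only the seconds component of a timedelta and drops whole days, while `.dt.total_seconds()` returns the full duration. The sketch below uses two made-up trips to show the difference. Whether multi-day trips actually occur in the taxi data is not checked here, and the notebook keeps `.dt.seconds`.

```python
import pandas as pd

# Hypothetical trips: one lasts 30 minutes, the other 1 day and 1 hour.
trips = pd.DataFrame(
    {
        "pickup_at": pd.to_datetime(["2019-06-01 10:00:00", "2019-06-01 10:00:00"]),
        "dropoff_at": pd.to_datetime(["2019-06-01 10:30:00", "2019-06-02 11:00:00"]),
    }
)
elapsed = trips["dropoff_at"] - trips["pickup_at"]

# .dt.seconds keeps only the seconds component (0 to 86399) and drops whole days.
seconds_component = elapsed.dt.seconds.tolist()
# .dt.total_seconds() returns the full duration as a float.
total_seconds = elapsed.dt.total_seconds().tolist()

print(seconds_component)  # [1800, 3600]
print(total_seconds)  # [1800.0, 90000.0]
```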

We will be fitting scikit-learn models on data batches. We define a Ray task `fit_and_score_sklearn` that fits the model and calculates mean absolute error on the validation set. We will be treating this as a simple regression problem where we want to predict the relationship between the drop-off location and the trip duration.

In [7]:
# Ray task to fit and score a scikit-learn model.
@ray.remote
def fit_and_score_sklearn(
    train: pd.DataFrame, test: pd.DataFrame, model: BaseEstimator
) -> Tuple[BaseEstimator, float]:
    train_X = train[["dropoff_location_id"]]
    train_y = train["trip_duration"]
    test_X = test[["dropoff_location_id"]]
    test_y = test["trip_duration"]

    # Start training.
    model = model.fit(train_X, train_y)
    pred_y = model.predict(test_X)
    error = mean_absolute_error(test_y, pred_y)
    return model, error
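
For readers unfamiliar with the metric, mean absolute error is the average of the absolute differences between the true and predicted values. The following sketch computes it by hand on made-up durations. The helper name `mean_absolute_error_by_hand` is hypothetical and is not part of scikit-learn.

```python
from typing import Sequence


def mean_absolute_error_by_hand(
    y_true: Sequence[float], y_pred: Sequence[float]
) -> float:
    """Average of the absolute differences between targets and predictions."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical trip durations in seconds and a model's predictions for them.
durations = [600.0, 900.0, 1200.0]
predictions = [660.0, 840.0, 1500.0]

# (60 + 60 + 300) / 3 = 140
mae = mean_absolute_error_by_hand(durations, predictions)
print(mae)  # 140.0
```

A lower value means the predicted trip durations are, on average, closer to the observed ones.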

Next, we will define a `train_and_evaluate` Ray task which contains all the logic necessary to load a data batch, transform it, split it into train and test sets, and fit and evaluate models on it. We make sure to return the file and location id used so that we can map the fitted models back to them.

For data loading and processing, we are using the `read_data` and `transform_batch` functions we have defined earlier.

You may notice that we have previously defined `fit_and_score_sklearn` as a Ray task as well and set it to be executed from inside `train_and_evaluate`. This allows us to dynamically create a [tree of tasks](task-pattern-tree-of-tasks), ensuring that the cluster resources are fully utilized. Without this pattern, each `train_and_evaluate` would need to be assigned several CPU cores for the model fitting in advance, meaning that if certain models finished faster, all CPU cores assigned to the batch would still stay occupied. Thankfully, Ray is able to deal with nested parallelism in tasks without the need for any extra logic, allowing us to simplify the code and make the best use of our cluster.

In [8]:
def train_and_evaluate_internal(
    df: pd.DataFrame, models: List[BaseEstimator], pickup_location_id: int = 0
) -> Optional[List[Tuple[BaseEstimator, float]]]:
    # We need at least 4 rows to create a train / test split.
    if len(df) < 4:
        print(f"Dataframe for LocID: {pickup_location_id} is empty or smaller than 4")
        return None

    # Train / test split.
    train, test = train_test_split(df)

    # We put the train & test dataframes into Ray object store
    # so that they can be reused by all models fitted here.
    # https://docs.ray.io/en/latest/ray-core/tips-for-first-time.html#tip-3-avoid-passing-same-object-repeatedly-to-remote-tasks
    train_ref = ray.put(train)
    test_ref = ray.put(test)

    # Launch a fit and score task for each model.
    results = ray.get(
        [fit_and_score_sklearn.remote(train_ref, test_ref, model) for model in models]
    )
    results.sort(key=lambda x: x[1])  # sort by error
    return results


@ray.remote
def train_and_evaluate(
    file_name: str,
    pickup_location_id: int,
    models: List[BaseEstimator],
) -> Tuple[str, int, Optional[List[Tuple[BaseEstimator, float]]]]:
    start_time = time.time()
    data = read_data(file_name, pickup_location_id)
    data_loading_time = time.time() - start_time
    print_time(f"Data loading time for LocID: {pickup_location_id}: {data_loading_time}")

    # Perform transformation
    start_time = time.time()
    data = transform_batch(data)
    transform_time = time.time() - start_time
    print_time(f"Data transform time for LocID: {pickup_location_id}: {transform_time}")

    # Perform training & evaluation for each model
    start_time = time.time()
    results = train_and_evaluate_internal(data, models, pickup_location_id)
    training_time = time.time() - start_time
    print_time(f"Training time for LocID: {pickup_location_id}: {training_time}")

    return (
        file_name,
        pickup_location_id,
        results,
    )

Finally, the `run_batch_training` driver function generates one task per pickup location for each Parquet file it receives (with each file corresponding to one month). We define the function to take in a list of models, so that we can evaluate them all and choose the best one for each batch. The function blocks when it reaches `ray.get()` and waits for tasks to return their results.

In [9]:
def run_batch_training(files: List[str], models: List[BaseEstimator]):
    print("Starting run...")
    start = time.time()

    # Store task references
    task_refs = []
    for file in files:
        try:
            locdf = pq.read_table(file, columns=["pickup_location_id"])
        except Exception:
            continue
        pickup_location_ids = locdf["pickup_location_id"].unique()

        for pickup_location_id in pickup_location_ids:
            # Cast PyArrow scalar to Python if needed.
            try:
                pickup_location_id = pickup_location_id.as_py()
            except Exception:
                pass
            task_refs.append(train_and_evaluate.remote(file, pickup_location_id, models))

    # Block to obtain results from each task
    results = ray.get(task_refs)

    taken = time.time() - start
    count = len(results)
    # Each result is a (file, location id, models) tuple. If the models
    # entry is None, there weren't enough records to train.
    results_not_none = [x for x in results if x[2] is not None]
    count_not_none = len(results_not_none)

    # Sleep a moment for nicer output
    time.sleep(1)
    print("", flush=True)
    print(
        f"Number of pickup locations: {count}"
    )
    print(
        f"Number of pickup locations with enough records to train: {count_not_none}"
    )
    print(
        f"Number of models trained: {count_not_none * len(models)}"
    )
    print(f"TOTAL TIME TAKEN: {taken:.2f} seconds")
    return results

We obtain the partitions of the dataset from an S3 bucket so that we can pass them to `run_batch_training`.

In [10]:
# Obtain the dataset. Each month is a separate file.
dataset = ds.dataset(
    "s3://anonymous@air-example-data/ursa-labs-taxi-data/by_year/",
    partitioning=["year", "month"],
)
starting_idx = -2 if SMOKE_TEST else 0
files = [f"s3://{file}" for file in dataset.files][starting_idx:]
print(f"Obtained {len(files)} files!")

Obtained 2 files!


We can now run our script. The output is a list of tuples in the following format: `(file name, pickup location id, list of models and their MAE scores)`. For brevity, we will print out the first 10 tuples.

In [11]:
from sklearn.linear_model import LinearRegression

results = run_batch_training(files, models=[LinearRegression()])
print(results[:10])

Starting run...
(train_and_evaluate pid=101595) Dataframe for LocID: 214 is empty or smaller than 4
(train_and_evaluate pid=101755) Dataframe for LocID: 176 is empty or smaller than 4
(train_and_evaluate pid=102727) Dataframe for LocID: 204 is empty or smaller than 4

Total number of models (all tasks): 522 (522)
TOTAL TIME TAKEN: 56.86 seconds
[('s3://air-example-data/ursa-labs-taxi-data/by_year/2019/05/data.parquet/359c21b3e28f40328e68cf66f7ba40e2_000000.parquet', 145, [(LinearRegression(), 878.257446386803)]), ('s3://air-example-data/ursa-labs-taxi-data/by_year/2019/05/data.parquet/359c21b3e28f40328e68cf66f7ba40e2_000000.parquet', 161, [(LinearRegression(), 749.1613025647059)]), ('s3://air-example-data/ursa-labs-taxi-data/by_year/2019/05/data.parquet/359c21b3e28f40328e68cf66f7ba40e2_000000.parquet', 163, [(LinearRegression(), 744.2731741154661)]), ('s3://air-example-data/ursa-labs-taxi-data/by_year/2019/05/data.parquet/359c21b3e28f40328e68cf66f7ba40e2_000000.parquet', 193, [(LinearR

We can also provide multiple scikit-learn models to our `run_batch_training` function. The models for each batch are returned sorted by error, so the best one comes first. A common use-case here would be to define several models of the same type with different hyperparameters.

In [12]:
from sklearn.tree import DecisionTreeRegressor

results = run_batch_training(
    files,
    models=[
        LinearRegression(),
        DecisionTreeRegressor(),
        DecisionTreeRegressor(splitter="random"),
    ],
)
print(results[:10])

Starting run...


(train_and_evaluate pid=109536) Dataframe for LocID: 214 is empty or smaller than 4
(train_and_evaluate pid=113949) Dataframe for LocID: 176 is empty or smaller than 4
(train_and_evaluate pid=114454) Dataframe for LocID: 204 is empty or smaller than 4

Total number of models (all tasks): 1566 (522)
TOTAL TIME TAKEN: 77.74 seconds
[('s3://air-example-data/ursa-labs-taxi-data/by_year/2019/05/data.parquet/359c21b3e28f40328e68cf66f7ba40e2_000000.parquet', 145, [(DecisionTreeRegressor(), 618.3188066672511), (DecisionTreeRegressor(splitter='random'), 619.603168803982), (LinearRegression(), 848.0861910655393)]), ('s3://air-example-data/ursa-labs-taxi-data/by_year/2019/05/data.parquet/359c21b3e28f40328e68cf66f7ba40e2_000000.parquet', 161, [(DecisionTreeRegressor(), 578.1630239752242), (DecisionTreeRegressor(splitter='random'), 578.1630239752242), (LinearRegression(), 736.7727357111179)]), ('s3://air-example-data/ursa-labs-taxi-data/by_year/2019/05/data.parquet/359c21b3e28f40328e68cf66f7ba40e2_
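
Because each batch's list is sorted by error, picking the best model per batch is a matter of taking the first entry. The sketch below assumes the tuple format shown in the printed output. The file names and error values are made up, strings stand in for fitted models, and `best_per_batch` is a hypothetical helper, not part of the notebook.

```python
from typing import Dict, List, Optional, Tuple

# Same shape as the tuples returned by run_batch_training, with strings
# standing in for fitted models and made-up paths and error values.
Result = Tuple[str, int, Optional[List[Tuple[str, float]]]]

example_results: List[Result] = [
    ("file_a.parquet", 145, [("tree", 618.3), ("linear", 848.1)]),
    ("file_a.parquet", 214, None),  # too few rows, nothing was trained
    ("file_b.parquet", 145, [("linear", 700.0), ("tree", 710.5)]),
]


def best_per_batch(
    results: List[Result],
) -> Dict[Tuple[str, int], Tuple[str, float]]:
    """Map (file, location id) to the lowest-error (model, error) pair."""
    best = {}
    for file_name, location_id, scored_models in results:
        if not scored_models:
            continue  # skip batches that were too small to train on
        # Each list is already sorted by error, so the first entry is the best.
        best[(file_name, location_id)] = scored_models[0]
    return best


best_models = best_per_batch(example_results)
print(best_models[("file_a.parquet", 145)])  # ('tree', 618.3)
```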

## Centralized data loading using Ray object store

So far, to ensure that the data always fits in memory, each task has read the files independently and extracted the desired data batch. This, however, negatively impacts the runtime. If we have sufficient memory in our Ray cluster, we can instead load each partition once, extract the batches, and save them in the [Ray object store](objects-in-ray), dramatically reducing the time required at the cost of higher memory usage.

Notice that we never call `ray.get()` on the data batch references produced by `read_into_object_store`. Instead, we pass each reference itself as an argument to the `train_and_evaluate.remote` dispatch, [allowing for the data to stay in the object store until it is actually needed](ray-pass-large-arg-by-value). This avoids a situation where all the data would be loaded into the memory of the process calling `ray.get()`.

You can use the Ray Dashboard to compare the memory usage between the previous approach and this one.

In [15]:
# Redefine the train_and_evaluate task to use in-memory data.
# We still keep file_name and pickup_location_id for identification purposes.
@ray.remote
def train_and_evaluate(
    data: pd.DataFrame,
    file_name: str,
    pickup_location_id: int,
    models: List[BaseEstimator],
) -> Tuple[str, int, Optional[List[Tuple[BaseEstimator, float]]]]:
    # Perform transformation
    start_time = time.time()
    data = transform_batch(data)
    transform_time = time.time() - start_time
    print_time(f"Data transform time for LocID: {pickup_location_id}: {transform_time}")

    return (
        file_name,
        pickup_location_id,
        train_and_evaluate_internal(data, models, pickup_location_id),
    )


@ray.remote
def read_into_object_store(file: str) -> List[Tuple[pa.Scalar, ray.ObjectRef]]:
    print(f"Loading {file}")
    # Read the entire file into memory.
    try:
        locdf = pq.read_table(
            file,
            columns=[
                "pickup_at",
                "dropoff_at",
                "pickup_location_id",
                "dropoff_location_id",
            ],
        )
    except Exception:
        return []

    pickup_location_ids = locdf["pickup_location_id"].unique()

    data_batch_refs = []
    for pickup_location_id in pickup_location_ids:
        # Put each data batch as a separate dataframe into Ray object store.
        data_batch_refs.append(
            (
                pickup_location_id,
                ray.put(
                    locdf.filter(
                        pc.field("pickup_location_id") == pickup_location_id
                    ).to_pandas()
                ),
            )
        )

    return data_batch_refs


def run_batch_training_with_object_store(files: List[str], models: List[BaseEstimator]):
    print("Starting run...")
    start = time.time()

    # Store task references
    task_refs = []

    # Use a SPREAD scheduling strategy to load each
    # file on a separate node as an OOM safeguard.
    # This is not foolproof though! We can also specify a resource
    # requirement for memory, if we know what is the maximum
    # memory requirement for a single file.
    read_into_object_store_spread = read_into_object_store.options(
        scheduling_strategy="SPREAD"
    )
    read_tasks = [read_into_object_store_spread.remote(file) for file in files]

    # Dictionary of references to data batches with file names as keys
    data_batch_refs_by_file = {}
    for file_id, data_batch_refs in enumerate(ray.get(read_tasks)):
        data_batch_refs_by_file[files[file_id]] = data_batch_refs

    for file, data_batch_refs in data_batch_refs_by_file.items():
        for pickup_location_id, data_batch_ref in data_batch_refs:
            # Cast PyArrow scalar to Python if needed.
            try:
                pickup_location_id = pickup_location_id.as_py()
            except Exception:
                pass
            task_refs.append(train_and_evaluate.remote(data_batch_ref, file, pickup_location_id, models))

    # Block to obtain results from each task
    results = ray.get(task_refs)

    taken = time.time() - start
    count = len(results)
    # Each result is a (file, location id, models) tuple. If the models
    # entry is None, there weren't enough records to train.
    results_not_none = [x for x in results if x[2] is not None]
    count_not_none = len(results_not_none)

    # Sleep a moment for nicer output
    time.sleep(1)
    print("", flush=True)
    print(
        f"Number of pickup locations: {count}"
    )
    print(
        f"Number of pickup locations with enough records to train: {count_not_none}"
    )
    print(
        f"Number of models trained: {count_not_none * len(models)}"
    )
    print(f"TOTAL TIME TAKEN: {taken:.2f} seconds")
    return results

In [16]:
results = run_batch_training_with_object_store(files, models=[LinearRegression()])
print(results[:10])

Starting run...


(read_into_object_store pid=115283) Loading s3://air-example-data/ursa-labs-taxi-data/by_year/2019/05/data.parquet/359c21b3e28f40328e68cf66f7ba40e2_000000.parquet
(read_into_object_store pid=114454) Loading s3://air-example-data/ursa-labs-taxi-data/by_year/2019/06/data.parquet/ab5b9d2b8cc94be19346e260b543ec35_000000.parquet


(train_and_evaluate pid=117506) Dataframe for LocID: 214 is empty or smaller than 4
(train_and_evaluate pid=114924) Dataframe for LocID: 204 is empty or smaller than 4
(train_and_evaluate pid=117506) Dataframe for LocID: 176 is empty or smaller than 4

Total number of models (all tasks): 522 (522)
TOTAL TIME TAKEN: 12.03 seconds
[('s3://air-example-data/ursa-labs-taxi-data/by_year/2019/05/data.parquet/359c21b3e28f40328e68cf66f7ba40e2_000000.parquet', 145, [(LinearRegression(), 802.2276313581212)]), ('s3://air-example-data/ursa-labs-taxi-data/by_year/2019/05/data.parquet/359c21b3e28f40328e68cf66f7ba40e2_000000.parquet', 161, [(LinearRegression(), 772.1983771933358)]), ('s3://air-example-data/ursa-labs-taxi-data/by_year/2019/05/data.parquet/359c21b3e28f40328e68cf66f7ba40e2_000000.parquet', 163, [(LinearRegression(), 748.7126342594678)]), ('s3://air-example-data/ursa-labs-taxi-data/by_year/2019/05/data.parquet/359c21b3e28f40328e68cf66f7ba40e2_000000.parquet', 193, [(LinearRegression(), 83

We can see that this approach finished training much faster (12.03 seconds versus 56.86 seconds for the same single-model run above), but it would not have been possible if the dataset were too large to fit into our cluster memory. Therefore, this pattern is only recommended if the data you are working with is small. Otherwise, it is recommended to load the data inside the tasks right before it's used.