# Batch Training with Ray Core

```{tip}
We strongly recommend using [Ray Datasets](data_user_guide) and [AIR Trainers](air-trainers) to develop batch training. They let you build it faster and more easily, and provide built-in benefits such as an auto-scaling actor pool. If you think your use case cannot be supported by Ray Datasets or AIR, we'd love to get your feedback, e.g. through a [Ray GitHub issue](https://github.com/ray-project/ray/issues).
```

Batch training and tuning are common tasks in simple machine learning use-cases such as time series forecasting. They require fitting of simple models on multiple data batches corresponding to locations, products, etc. This notebook showcases how to conduct batch training on the [NYC Taxi Dataset](https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page) using only Ray Core and stateless Ray tasks.

# Walkthrough

Our task is to create separate time series models for each pickup location. We can use the `pickup_location_id` column in the dataset to group the dataset into data batches. We will then fit models for each batch and choose the best one.

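Let’s start by importing the libraries we need, then importing Ray and initializing it. `ray.init()` starts a local Ray cluster unless the `RAY_ADDRESS` environment variable points to an existing one, as it does in the output below.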

In [1]:
from typing import Callable, Optional, List, Union, Tuple, Iterable
import time
import numpy as np
import pandas as pd

from sklearn.base import BaseEstimator
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

import pyarrow as pa
from pyarrow import fs
from pyarrow import dataset as ds
from pyarrow import parquet as pq
import pyarrow.compute as pc

In [2]:
import ray

ray.init(ignore_reinit_error=True)

2022-09-28 10:54:49,267	INFO worker.py:1223 -- Using address localhost:9031 set in the environment variable RAY_ADDRESS
2022-09-28 10:54:49,978	INFO worker.py:1333 -- Connecting to existing Ray cluster at address: 172.31.95.254:9031...
2022-09-28 10:54:49,990	INFO worker.py:1509 -- Connected to Ray cluster. View the dashboard at https://console.anyscale-staging.com/api/v2/sessions/ses_ZmHebxHaZpYkw9x9efJ5wBVX/services?redirect_to=dashboard
2022-09-28 10:54:50,008	INFO packaging.py:342 -- Pushing file package 'gcs://_ray_pkg_1da1c9c508cd1778096dc3a7918737d6.zip' (4.85MiB) to Ray cluster...
2022-09-28 10:54:50,052	INFO packaging.py:351 -- Successfully pushed file package 'gcs://_ray_pkg_1da1c9c508cd1778096dc3a7918737d6.zip'.


Python version: 3.8.5
Ray version: 2.0.0
Dashboard: http://console.anyscale-staging.com/api/v2/sessions/ses_ZmHebxHaZpYkw9x9efJ5wBVX/services?redirect_to=dashboard


For benchmarking purposes, we can print the times of various operations. In order to reduce clutter in the output, this is set to False by default.

In [3]:
PRINT_TIMES = False


def print_time(msg: str):
    if PRINT_TIMES:
        print(msg)

For testing purposes, we'll only use a small subset of the full dataset. You can choose to use the full dataset by setting the `SMOKE_TEST` variable to False.

In [4]:
SMOKE_TEST = True

As we will be using the NYC Taxi dataset, we define a simple batch transformation function to set correct data types, calculate the trip duration and fill missing values.

In [5]:
# A pandas DataFrame function that transforms a single data batch.
def transform_batch(df: pd.DataFrame) -> pd.DataFrame:
    df["pickup_at"] = pd.to_datetime(df["pickup_at"], format="%Y-%m-%d %H:%M:%S")
    df["dropoff_at"] = pd.to_datetime(df["dropoff_at"], format="%Y-%m-%d %H:%M:%S")
    df["trip_duration"] = (df["dropoff_at"] - df["pickup_at"]).dt.seconds
    df["pickup_location_id"] = df["pickup_location_id"].fillna(-1)
    df["dropoff_location_id"] = df["dropoff_location_id"].fillna(-1)
    return df
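
One detail of this transformation is worth checking on a toy example: `.dt.seconds` returns only the seconds component of each timedelta (a value from 0 to 86399), so any whole days are dropped, whereas `.dt.total_seconds()` returns the full duration. The sketch below uses made-up timestamps, not rows from the taxi dataset, and the source does not state whether the dataset contains trips longer than a day.

```python
import pandas as pd

# Made-up timestamps for illustration only.
df = pd.DataFrame(
    {
        "pickup_at": ["2019-05-01 10:00:00", "2019-05-01 23:50:00"],
        "dropoff_at": ["2019-05-01 10:15:30", "2019-05-03 00:05:00"],
    }
)
pickup = pd.to_datetime(df["pickup_at"], format="%Y-%m-%d %H:%M:%S")
dropoff = pd.to_datetime(df["dropoff_at"], format="%Y-%m-%d %H:%M:%S")
delta = dropoff - pickup

# Seconds component only: the second trip's whole day is not counted.
seconds_component = delta.dt.seconds.tolist()
# Full duration in seconds, including days.
total_seconds = delta.dt.total_seconds().tolist()

print(seconds_component)  # [930, 900]
print(total_seconds)  # [930.0, 87300.0]
```

If trips spanning more than 24 hours matter for your use case, `.dt.total_seconds()` may be the safer choice.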

We will be fitting scikit-learn models on data batches. We define a Ray task `fit_and_score_sklearn` that fits the model and calculates the mean absolute error (MAE) on the validation set. We will be treating this as a simple regression problem where we want to predict the trip duration from the drop-off location.

In [6]:
# Ray task to fit and score a scikit-learn model.
@ray.remote
def fit_and_score_sklearn(
    train: pd.DataFrame, test: pd.DataFrame, model: BaseEstimator
) -> Tuple[BaseEstimator, float]:
    train_X = train[["dropoff_location_id"]]
    train_y = train["trip_duration"]
    test_X = test[["dropoff_location_id"]]
    test_y = test["trip_duration"]

    # Start training.
    model = model.fit(train_X, train_y)
    pred_y = model.predict(test_X)
    error = mean_absolute_error(test_y, pred_y)
    return model, error
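
The body of this task is ordinary scikit-learn code, so it can be sanity-checked locally before it is dispatched with Ray. The following is a minimal sketch on made-up, perfectly linear data (not the taxi dataset) and does not involve Ray at all:

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error

# Made-up toy data: trip_duration is exactly 100 * dropoff_location_id.
train = pd.DataFrame(
    {"dropoff_location_id": [1, 2, 3, 4], "trip_duration": [100, 200, 300, 400]}
)
test = pd.DataFrame({"dropoff_location_id": [5, 6], "trip_duration": [500, 600]})

# Same steps as in the task: fit on train, predict on test, score with MAE.
model = LinearRegression().fit(train[["dropoff_location_id"]], train["trip_duration"])
pred_y = model.predict(test[["dropoff_location_id"]])
error = mean_absolute_error(test["trip_duration"], pred_y)
print(f"MAE: {error:.6f}")
```

On real data the error will of course be far from zero, as the outputs later in this notebook show.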

The `train_and_evaluate` function contains the logic for train-test splitting and fitting of multiple models in parallel on each data batch, for purposes of comparison. Thanks to this, we can evaluate several models and choose the best one for each data batch.

In [7]:
def train_and_evaluate(
    df: pd.DataFrame, models: List[BaseEstimator], i: int = 0
) -> Optional[List[Tuple[BaseEstimator, float]]]:
    # We need at least 4 rows to create a train / test split.
    if len(df) < 4:
        print_time(f"Dataframe for LocID: {i} is empty or smaller than 4")
        return None

    start = time.time()

    # Train / test split.
    train, test = train_test_split(df)

    # We put the train & test dataframes into Ray object store
    # so that they can be reused by all models fitted here.
    # https://docs.ray.io/en/latest/ray-core/tips-for-first-time.html#tip-3-avoid-passing-same-object-repeatedly-to-remote-tasks
    train_ref = ray.put(train)
    test_ref = ray.put(test)

    # Launch a fit and score task for each model.
    results = ray.get(
        [fit_and_score_sklearn.remote(train_ref, test_ref, model) for model in models]
    )
    results.sort(key=lambda x: x[1])  # sort by error

    time_taken = time.time() - start
    print_time(f"Training time for LocID: {i}: {time_taken}")
    return results

The `read_data` function reads a Parquet file and uses a push-down predicate to extract the data batch we want to fit a model on using the provided index to group the rows. By having each task read the data and extract batches separately, we ensure that memory utilization is minimal - as opposed to requiring each task to load the entire partition into memory first.

In [8]:
def read_data(file: str, i: int) -> pd.DataFrame:
    return pq.read_table(
        file,
        filters=[("pickup_location_id", "=", i)],
        columns=[
            "pickup_at",
            "dropoff_at",
            "pickup_location_id",
            "dropoff_location_id",
        ],
    ).to_pandas()

The `task` Ray task contains all logic necessary to load a data batch, transform it and fit and evaluate models on it.

You may notice that we have previously defined `fit_and_score_sklearn` as a Ray task as well and set it to be executed from inside `task`. This allows us to dynamically create a {doc}`tree of tasks </ray-core/tasks/patterns/tree-of-tasks>`, ensuring that the cluster resources are fully utilized. Without this pattern, each `task` would need to be assigned several CPU cores for the model fitting, meaning that if certain models finished faster, those CPU cores would still stay occupied. Thankfully, Ray is able to deal with nested parallelism in tasks without the need for any extra logic, allowing us to simplify the code.

In [9]:
@ray.remote
def task(
    data: Union[str, pd.DataFrame],
    file_name: str,
    i: int,
    models: List[BaseEstimator],
    load_data_func: Optional[Callable] = None,
) -> Tuple[str, int, Optional[List[Tuple[BaseEstimator, float]]]]:
    if load_data_func:
        start_time = time.time()
        data = load_data_func(data, i)
        data_loading_time = time.time() - start_time
        print_time(f"Data loading time for LocID: {i}: {data_loading_time}")

    # Cast PyArrow scalar to Python if needed.
    try:
        i = i.as_py()
    except Exception:
        pass

    # Perform transformation
    start_time = time.time()
    data = transform_batch(data)
    transform_time = time.time() - start_time
    print_time(f"Data transform time for LocID: {i}: {transform_time}")

    return file_name, i, train_and_evaluate(data, models, i)

The `task_generator` generator dispatches tasks and yields references to them. Each task will be run in parallel on a separate batch as determined by the `pickup_location_id` column in the provided file. Ray will handle scheduling automatically.

In [10]:
def task_generator(
    files: List[str], models: List[BaseEstimator]
) -> Iterable[ray.ObjectRef]:
    for file in files:
        try:
            locdf = pq.read_table(file, columns=["pickup_location_id"])
        except Exception:
            continue
        loc_list = locdf["pickup_location_id"].unique()

        for i in loc_list:
            yield task.remote(file, file, i, models, read_data)

Finally, the `run` driver function obtains the partitions of the dataset from an S3 bucket and generates tasks for each Parquet file it contains (with each file corresponding to one month). We define the function to take in a list of models, so that we can evaluate them all and choose the best one for each batch. To avoid overloading the cluster and causing out-of-memory (OOM) errors, we use `ray.wait()` to limit the number of in-flight tasks. See details about this design pattern in {doc}`/ray-core/patterns/limit-tasks`.

In [11]:
def run(task_generator: Callable, models: List[BaseEstimator]):
    print("Starting run")
    start = time.time()

    # Obtain the dataset. Each month is a separate file.
    s3 = fs.S3FileSystem(region="us-east-2")
    dataset = ds.dataset(
        "ursa-labs-taxi-data/", filesystem=s3, partitioning=["year", "month"]
    )
    starting_idx = -2 if SMOKE_TEST else 0
    files = [f"s3://{file}" for file in dataset.files][starting_idx:]

    # NOTE: Set this to a number large enough to enable good parallelization
    # (at least the number of CPUs in the cluster, and usually it can be larger).
    # In practice, you can start with sys.maxsize (i.e. no limit) and scale down
    # if a massive number of tasks overloads the nodes or causes OOM errors.
    import sys

    max_in_flight_tasks = sys.maxsize
    result_refs = []
    results = []

    # Launch all training tasks.
    for ref in task_generator(files, models):
        # Apply backpressure: when more than max_in_flight_tasks tasks are pending,
        # we wait with ray.wait() until enough object refs are ready.
        if len(result_refs) > max_in_flight_tasks:
            num_ready = len(result_refs) - max_in_flight_tasks
            newly_completed, result_refs = ray.wait(result_refs, num_returns=num_ready)
            for completed_ref in newly_completed:
                results.append(ray.get(completed_ref))

        result_refs.append(ref)

    # Wait for the remaining pending tasks to complete.
    newly_completed, result_refs = ray.wait(result_refs, num_returns=len(result_refs))
    results.extend(ray.get(newly_completed))

    taken = time.time() - start
    count = len(results)
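    # Each result is (file name, location id, scores); scores is None
    # for batches that were too small to train on.
    results_not_none = [x for x in results if x[2] is not None]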
    count_not_none = len(results_not_none)

    # Sleep a moment for nicer output
    time.sleep(1)
    print("", flush=True)
    print(f"Total number of models (all tasks): {count_not_none} ({count})")
    print(f"TOTAL TIME TAKEN: {taken:.2f} seconds")
    return results

We can now run our script. The output is a list of tuples in the following format: `(file name, pickup location ID, list of models and their MAE scores)`.

In [12]:
from sklearn.linear_model import LinearRegression

run(task_generator, models=[LinearRegression()])

Starting run

Total number of models (all tasks): 522 (522)
TOTAL TIME TAKEN: 49.75 seconds


[('s3://ursa-labs-taxi-data/2019/05/data.parquet',
  145,
  [(LinearRegression(), 853.9676856746265)]),
 ('s3://ursa-labs-taxi-data/2019/05/data.parquet',
  161,
  [(LinearRegression(), 756.4251596787243)]),
 ('s3://ursa-labs-taxi-data/2019/05/data.parquet',
  163,
  [(LinearRegression(), 758.7382755797829)]),
 ('s3://ursa-labs-taxi-data/2019/05/data.parquet',
  193,
  [(LinearRegression(), 787.3141708976565)]),
 ('s3://ursa-labs-taxi-data/2019/05/data.parquet',
  260,
  [(LinearRegression(), 646.1338185757664)]),
 ('s3://ursa-labs-taxi-data/2019/05/data.parquet',
  56,
  [(LinearRegression(), 1406.420030002027)]),
 ('s3://ursa-labs-taxi-data/2019/05/data.parquet',
  79,
  [(LinearRegression(), 679.5541727445309)]),
 ('s3://ursa-labs-taxi-data/2019/05/data.parquet',
  90,
  [(LinearRegression(), 656.3225432427536)]),
 ('s3://ursa-labs-taxi-data/2019/05/data.parquet',
  162,
  [(LinearRegression(), 692.3266591503689)]),
 ('s3://ursa-labs-taxi-data/2019/05/data.parquet',
  50,
  [(Linear

We can also provide multiple scikit-learn models to our `run` function. The results for each batch are sorted by error, so the best model comes first. A common use-case here would be to define several models of the same type with different hyperparameters.

In [13]:
from sklearn.tree import DecisionTreeRegressor

run(
    task_generator,
    models=[
        LinearRegression(),
        DecisionTreeRegressor(),
        DecisionTreeRegressor(splitter="random"),
    ],
)

Starting run
Total number of models (all tasks): 522 (522)
TOTAL TIME TAKEN: 41.95 seconds


[('s3://ursa-labs-taxi-data/2019/05/data.parquet',
  145,
  [(DecisionTreeRegressor(splitter='random'), 572.3720500312127),
   (DecisionTreeRegressor(), 573.7375520469991),
   (LinearRegression(), 837.6952634729321)]),
 ('s3://ursa-labs-taxi-data/2019/05/data.parquet',
  161,
  [(DecisionTreeRegressor(), 597.5412417883275),
   (DecisionTreeRegressor(splitter='random'), 597.5504013732001),
   (LinearRegression(), 756.9249015987074)]),
 ('s3://ursa-labs-taxi-data/2019/05/data.parquet',
  163,
  [(DecisionTreeRegressor(), 588.3253443144102),
   (DecisionTreeRegressor(splitter='random'), 588.3576322443986),
   (LinearRegression(), 757.6811787609638)]),
 ('s3://ursa-labs-taxi-data/2019/05/data.parquet',
  193,
  [(DecisionTreeRegressor(splitter='random'), 649.9896195529243),
   (DecisionTreeRegressor(), 650.0184839062366),
   (LinearRegression(), 815.1586578607721)]),
 ('s3://ursa-labs-taxi-data/2019/05/data.parquet',
  260,
  [(DecisionTreeRegressor(splitter='random'), 521.7869711190898),


## Loading data once into Ray object store

In order to ensure that the data can always fit in memory, each task reads the files independently and extracts the desired data batch. This, however, negatively impacts the runtime. If we have sufficient memory in our Ray cluster, we can instead load each partition once, extract the batches, and save them in the [Ray object store](objects-in-ray), dramatically reducing the runtime at the cost of higher memory usage.

Notice that we call `ray.get()` only on the results of `read_into_object_store`, which are lists of object references, and never on the batch references themselves. Instead, we pass each batch reference as the argument to the `task.remote` dispatch, [allowing for the data to stay in the object store until it is actually needed](tip-avoid-same-object-in-remote). This avoids a situation where all the data would be loaded into the memory of the process calling `ray.get()`.

You can use the Ray Dashboard to compare the memory usage between the previous approach and this one.

In [18]:
from ray.util.placement_group import placement_group, remove_placement_group
from ray.util.scheduling_strategies import PlacementGroupSchedulingStrategy


@ray.remote
def read_into_object_store(file: str) -> List[Tuple[pa.Scalar, ray.ObjectRef]]:
    print(f"Loading {file}")
    # Read the entire file into memory.
    try:
        locdf = pq.read_table(
            file,
            columns=[
                "pickup_at",
                "dropoff_at",
                "pickup_location_id",
                "dropoff_location_id",
            ],
        )
    except Exception:
        return []

    loc_list = locdf["pickup_location_id"].unique()

    group_refs = []
    for i in loc_list:
        # Put each data batch as a separate dataframe into Ray object store.
        group_refs.append(
            (i, ray.put(locdf.filter(pc.field("pickup_location_id") == i).to_pandas()))
        )

    return group_refs


def task_generator_with_object_store(
    files: List[str], models: List[BaseEstimator]
) -> Iterable[ray.ObjectRef]:
    # Use a placement group with a SPREAD strategy to load each
    # file on a separate node as an OOM safeguard.
    # This is not foolproof though! We can also specify a resource
    # requirement for memory, if we know what is the maximum
    # memory requirement for a single file.
    pg = placement_group([{"CPU": 1}] * len(files), strategy="SPREAD")
    ray.get(pg.ready())

    read_into_object_store_pg = read_into_object_store.options(
        scheduling_strategy=PlacementGroupSchedulingStrategy(placement_group=pg)
    )
    load_tasks = [read_into_object_store_pg.remote(file) for file in files]
    group_refs = {}
    for i, refs in enumerate(ray.get(load_tasks)):
        group_refs[files[i]] = refs
    remove_placement_group(pg)

    for file, refs in group_refs.items():
        for i, ref in refs:
            yield task.remote(ref, file, i, models)

In [19]:
run(task_generator_with_object_store, models=[LinearRegression()])

Starting run
(read_into_object_store pid=104976, ip=172.31.80.226) Loading s3://ursa-labs-taxi-data/2019/06/data.parquet
(read_into_object_store pid=102476, ip=172.31.67.35) Loading s3://ursa-labs-taxi-data/2019/05/data.parquet



Total number of models (all tasks): 522 (522)
TOTAL TIME TAKEN: 25.02 seconds


[('s3://ursa-labs-taxi-data/2019/05/data.parquet',
  145,
  [(LinearRegression(), 749.5870180494717)]),
 ('s3://ursa-labs-taxi-data/2019/05/data.parquet',
  161,
  [(LinearRegression(), 730.9939763112341)]),
 ('s3://ursa-labs-taxi-data/2019/05/data.parquet',
  163,
  [(LinearRegression(), 752.6143725415071)]),
 ('s3://ursa-labs-taxi-data/2019/05/data.parquet',
  193,
  [(LinearRegression(), 877.4961002488161)]),
 ('s3://ursa-labs-taxi-data/2019/05/data.parquet',
  260,
  [(LinearRegression(), 636.7529178428583)]),
 ('s3://ursa-labs-taxi-data/2019/05/data.parquet',
  56,
  [(LinearRegression(), 1383.1499081324398)]),
 ('s3://ursa-labs-taxi-data/2019/05/data.parquet',
  79,
  [(LinearRegression(), 708.8218565941361)]),
 ('s3://ursa-labs-taxi-data/2019/05/data.parquet',
  90,
  [(LinearRegression(), 634.7344296966268)]),
 ('s3://ursa-labs-taxi-data/2019/05/data.parquet',
  162,
  [(LinearRegression(), 690.5440360771848)]),
 ('s3://ursa-labs-taxi-data/2019/05/data.parquet',
  50,
  [(Linea

We can see that this approach allowed us to finish training much faster, but it would not have been possible if the dataset were too large to fit into our cluster memory. Therefore, this pattern is only recommended if the data you are working with is small. Otherwise, it is recommended to load the data inside the tasks right before it's used.