
[docs] Remove trainer references from preprocessors #38348

Merged: 15 commits, Aug 16, 2023
4 changes: 3 additions & 1 deletion doc/source/data/doc_code/preprocessors.py
@@ -56,13 +56,15 @@
valid_dataset = ray.data.from_items([{"x": x, "y": 2 * x} for x in range(1, 32, 3)])

preprocessor = MinMaxScaler(["x"])
+ preprocessor.fit(train_dataset)
+ train_dataset = preprocessor.transform(train_dataset)
+ valid_dataset = preprocessor.transform(valid_dataset)

trainer = XGBoostTrainer(
    label_column="y",
    params={"objective": "reg:squarederror"},
    scaling_config=ScalingConfig(num_workers=2),
    datasets={"train": train_dataset, "valid": valid_dataset},
-   preprocessor=preprocessor,
)
result = trainer.fit()
# __trainer_end__
94 changes: 14 additions & 80 deletions doc/source/data/preprocessors.rst
@@ -5,7 +5,14 @@ Using Preprocessors

Data preprocessing is a common technique for transforming raw data into features for a machine learning model.
In general, you may want to apply the same preprocessing logic to your offline training data and online inference data.
- Ray AIR provides several common preprocessors out of the box and interfaces to define your own custom logic.

+ This page covers *preprocessors*, which are a higher-level API on top of existing Ray Data operations like ``map_batches``,
+ targeted towards tabular and structured data use cases.

+ The recommended way to perform preprocessing is to :ref:`use existing Ray Data operations <transforming_data>` instead
+ of preprocessors. However, if you are working with tabular data, you should consider using Ray Data preprocessors.

Review comment (Contributor): Can we update the wording here? It's a little confusing to say that the recommended way to preprocess is existing Ray Data operations, but then to say you should consider built-in preprocessors. Maybe just: "While Ray Data supports generic transformations on datasets, for tabular data it also provides built-in preprocessors."
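As a quick illustration of that recommendation, here is a minimal, hypothetical ``map_batches`` transformation (an editor's sketch, not part of this diff; it assumes only a local Ray installation and the dict-of-NumPy batch format):

    import ray

    ds = ray.data.from_items([{"x": x, "y": 2 * x} for x in range(32)])

    def add_features(batch: dict) -> dict:
        # Derive a new column from an existing one; called once per batch.
        batch["x_squared"] = batch["x"] ** 2
        return batch

    ds = ds.map_batches(add_features, batch_format="numpy")
    print(ds.take(2))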



.. https://docs.google.com/drawings/d/1ZIbsXv5vvwTVIEr2aooKxuYJ_VL7-8VMNlRinAiPaTI/edit

@@ -15,15 +22,7 @@ Ray AIR provides several common preprocessors out of the box and interfaces to define your own custom logic.
Overview
--------

- The most common way of using a preprocessor is by passing it as an argument to the constructor of a Ray Train :ref:`Trainer <train-docs>` in conjunction with a :ref:`Ray Data dataset <data>`.
- For example, the following code trains a model with a preprocessor that normalizes the data.
-
- .. literalinclude:: doc_code/preprocessors.py
-    :language: python
-    :start-after: __trainer_start__
-    :end-before: __trainer_end__
-
- The ``Preprocessor`` class with four public methods that can we used separately from a trainer:
+ The ``Preprocessor`` class has four public methods:

#. ``fit()``: Compute state information about a :class:`Dataset <ray.data.Dataset>` (e.g., the mean or standard deviation of a column)
and save it to the ``Preprocessor``. This information is used to perform ``transform()``, and the method is typically called on a
@@ -55,84 +54,19 @@ Finally, call ``transform_batch`` on a single batch of data.
:start-after: __preprocessor_transform_batch_start__
:end-before: __preprocessor_transform_batch_end__
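To make these methods concrete, here is a minimal end-to-end sketch (an editor's addition, not part of this diff; it assumes a local Ray installation and uses the same ``MinMaxScaler`` as the surrounding examples):

    import pandas as pd
    import ray
    from ray.data.preprocessors import MinMaxScaler

    ds = ray.data.from_items([{"x": x, "y": 2 * x} for x in range(8)])

    preprocessor = MinMaxScaler(["x"])
    preprocessor.fit(ds)                        # compute the min/max of column "x"
    ds_scaled = preprocessor.transform(ds)      # apply the fitted state to a Dataset
    batch = pd.DataFrame({"x": [1.0, 3.0]})
    print(preprocessor.transform_batch(batch))  # apply it to an in-memory batch
    # fit_transform() combines fit() and transform() in one call.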

- Life of an AIR preprocessor
- ---------------------------
-
- Now that we've gone over the basics, let's dive into how ``Preprocessor``\s fit into an end-to-end application built with AIR.
- The diagram below depicts an overview of the main steps of a ``Preprocessor``:
-
- #. Passed into a ``Trainer`` to ``fit`` and ``transform`` input ``Dataset``\s
- #. Saved as a ``Checkpoint``
- #. Reconstructed in a ``Predictor`` to ``fit_batch`` on batches of data
+ The most common way to use a preprocessor is to apply it to a :ref:`Ray Data dataset <data>`, which is then passed to a Ray Train :ref:`Trainer <train-docs>`. See also:
-
- .. figure:: images/air-preprocessor.svg
-
- Throughout this section we'll go through this workflow in more detail, with code examples using XGBoost.
- The same logic is applicable to other machine learning framework integrations as well.
-
- Trainer
- ~~~~~~~
-
- The journey of the ``Preprocessor`` starts with the :class:`Trainer <ray.train.trainer.BaseTrainer>`.
- If the ``Trainer`` is instantiated with a ``Preprocessor``, then the following logic is executed when ``Trainer.fit()`` is called:
-
- #. If a ``"train"`` ``Dataset`` is passed in, then the ``Preprocessor`` calls ``fit()`` on it.
- #. The ``Preprocessor`` then calls ``transform()`` on all ``Dataset``\s, including the ``"train"`` ``Dataset``.
- #. The ``Trainer`` then performs training on the preprocessed ``Dataset``\s.
-
- .. literalinclude:: doc_code/preprocessors.py
-    :language: python
-    :start-after: __trainer_start__
-    :end-before: __trainer_end__
-
- .. note::
-
-     If you're passing a ``Preprocessor`` that is already fitted, it is refitted on the ``"train"`` ``Dataset``.
-     Adding the functionality to support passing in a fitted Preprocessor is being tracked
-     `here <https://github.com/ray-project/ray/issues/25299>`__.
-
- .. TODO: Remove the note above once the issue is resolved.
-
- Tune
- ~~~~
-
- If you're using ``Ray Tune`` for hyperparameter optimization, be aware that each ``Trial`` instantiates its own copy of
- the ``Preprocessor`` and the fitting and transforming logic occur once per ``Trial``.
-
- Checkpoint
- ~~~~~~~~~~
-
- ``Trainer.fit()`` returns a ``Result`` object which contains a ``Checkpoint``.
- If a ``Preprocessor`` is passed into the ``Trainer``, then it is saved in the ``Checkpoint`` along with any fitted state.
-
- As a sanity check, let's confirm the ``Preprocessor`` is available in the ``Checkpoint``. In practice, you don't need to check.
-
- .. literalinclude:: doc_code/preprocessors.py
-    :language: python
-    :start-after: __checkpoint_start__
-    :end-before: __checkpoint_end__
-
- Predictor
- ~~~~~~~~~
-
- A ``Predictor`` can be constructed from a saved ``Checkpoint``. If the ``Checkpoint`` contains a ``Preprocessor``,
- then the ``Preprocessor`` calls ``transform_batch`` on input batches prior to performing inference.
-
- In the following example, we show the batch inference flow.
-
- .. literalinclude:: doc_code/preprocessors.py
-    :language: python
-    :start-after: __predictor_start__
-    :end-before: __predictor_end__
+ * :ref:`Working with PyTorch user guide <working_with_pytorch>`
+ * Ray Train's data preprocessing and ingest section for :ref:`PyTorch <data-ingest-torch>`
+ * Ray Train's data preprocessing and ingest section for :ref:`LightGBM/XGBoost <data-ingest-gbdt>`

Types of preprocessors
----------------------

Built-in preprocessors
~~~~~~~~~~~~~~~~~~~~~~

- Ray AIR provides a handful of preprocessors out of the box.
+ Ray Data provides a handful of preprocessors out of the box.

**Generic preprocessors**

41 changes: 41 additions & 0 deletions doc/source/train/distributed-xgboost-lightgbm.rst
@@ -190,6 +190,47 @@ machines have 16 CPUs in addition to the 4 GPUs, each actor should have
:end-before: __gpu_xgboost_end__


.. _data-ingest-gbdt:

How to preprocess data for training?
------------------------------------

The recommended way to preprocess data for training is to use Ray Data operations such as ``map_batches``.

However, particularly for tabular data, Ray Data comes with out-of-the-box :ref:`preprocessors <air-preprocessors>` that implement common feature preprocessing operations.
You can use these with Ray Train Trainers by applying them to the dataset before passing it to a Trainer. For example:

.. testcode::

    import ray

    from ray.data.preprocessors import MinMaxScaler
    from ray.train.xgboost import XGBoostTrainer
    from ray.train import ScalingConfig

    train_dataset = ray.data.from_items([{"x": x, "y": 2 * x} for x in range(0, 32, 3)])
    valid_dataset = ray.data.from_items([{"x": x, "y": 2 * x} for x in range(1, 32, 3)])

    preprocessor = MinMaxScaler(["x"])
    preprocessor.fit(train_dataset)
    train_dataset = preprocessor.transform(train_dataset)
    valid_dataset = preprocessor.transform(valid_dataset)

    trainer = XGBoostTrainer(
        label_column="y",
        params={"objective": "reg:squarederror"},
        scaling_config=ScalingConfig(num_workers=2),
        datasets={"train": train_dataset, "valid": valid_dataset},
    )
    result = trainer.fit()


.. testoutput::
    :hide:

    ...
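For comparison, the same scaling can be expressed with plain Ray Data operations. The following is a rough, hypothetical sketch (not from this PR) that computes the global statistics with ``Dataset.min``/``Dataset.max`` and applies them in ``map_batches``:

    import ray

    train_dataset = ray.data.from_items([{"x": x, "y": 2 * x} for x in range(0, 32, 3)])

    # Compute the global statistics once, up front.
    x_min, x_max = train_dataset.min("x"), train_dataset.max("x")

    def min_max_scale(batch):
        # batch is a dict of NumPy arrays.
        batch["x"] = (batch["x"] - x_min) / (x_max - x_min)
        return batch

    train_dataset = train_dataset.map_batches(min_max_scale, batch_format="numpy")

The built-in ``MinMaxScaler`` above handles exactly this bookkeeping for you.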


How to optimize XGBoost memory usage?
-------------------------------------

50 changes: 50 additions & 0 deletions doc/source/train/user-guides/data-loading-preprocessing.rst
@@ -415,6 +415,56 @@ If your model is sensitive to shuffle quality, call :meth:`Dataset.random_shuffle`

For more information on how to optimize shuffling, and which approach to choose, see the :ref:`Optimize shuffling guide <optimizing_shuffles>`.

Preprocessing Data
------------------

Review comment (Contributor), suggested change: rename the section heading from "Preprocessing Data" to "Preprocessing Tabular Data".

The recommended way to preprocess data for training is to use Ray Data operations such as ``map_batches``.
See the :ref:`Ray Data Working with PyTorch guide <working_with_pytorch>` for more details.

However, particularly for tabular data, you can also use Ray Data :ref:`preprocessors <air-preprocessors>`, which implement common data preprocessing operations.
You can use these with Ray Train Trainers by applying them to the dataset before passing it to a Trainer. For example:

.. testcode::

    import numpy as np

    import ray
    from ray.air import session  # needed for session.get_dataset_shard below
    from ray.train import ScalingConfig
    from ray.train.torch import TorchTrainer
    from ray.data.preprocessors import Concatenator, Chain, StandardScaler

    dataset = ray.data.read_csv("s3://anonymous@air-example-data/breast_cancer.csv")

    # Create a preprocessor to scale some columns and concatenate the result.
    preprocessor = Chain(
        StandardScaler(columns=["mean radius", "mean texture"]),
        Concatenator(exclude=["target"], dtype=np.float32),
    )
    dataset = preprocessor.fit_transform(dataset)  # this will be applied lazily

    def train_loop_per_worker():
        # Get an iterator to the dataset we passed in below.
        it = session.get_dataset_shard("train")
        for _ in range(2):
            # Prefetch 10 batches at a time.
            for batch in it.iter_batches(batch_size=128, prefetch_batches=10):
                print("Do some training on batch", batch)

    my_trainer = TorchTrainer(
        train_loop_per_worker,
        scaling_config=ScalingConfig(num_workers=2),
        datasets={"train": dataset},
    )
    my_trainer.fit()


.. testoutput::
    :hide:

    ...
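Because the fitted ``Preprocessor`` keeps its state, it can also be reused outside of training, for example on in-memory batches at inference time. A minimal sketch continuing the example above (the column values here are hypothetical):

    import pandas as pd

    # A small inference-time batch with the columns the preprocessor expects.
    batch = pd.DataFrame(
        {"mean radius": [14.2, 20.6], "mean texture": [19.1, 29.3], "target": [0, 1]}
    )
    print(preprocessor.transform_batch(batch))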


Reproducibility
---------------
When developing or hyperparameter-tuning models, reproducibility is important during data ingest so that data ingest does not affect model quality. Follow these three steps to enable reproducibility: