
[docs] Remove trainer references from preprocessors #38348

Merged: 15 commits, Aug 16, 2023
4 changes: 3 additions & 1 deletion doc/source/data/doc_code/preprocessors.py
@@ -56,13 +56,15 @@
valid_dataset = ray.data.from_items([{"x": x, "y": 2 * x} for x in range(1, 32, 3)])

preprocessor = MinMaxScaler(["x"])
+ preprocessor.fit(train_dataset)
+ train_dataset = preprocessor.transform(train_dataset)
+ valid_dataset = preprocessor.transform(valid_dataset)

trainer = XGBoostTrainer(
    label_column="y",
    params={"objective": "reg:squarederror"},
    scaling_config=ScalingConfig(num_workers=2),
    datasets={"train": train_dataset, "valid": valid_dataset},
-   preprocessor=preprocessor,
)
result = trainer.fit()
# __trainer_end__
94 changes: 14 additions & 80 deletions doc/source/data/preprocessors.rst
@@ -5,7 +5,14 @@ Using Preprocessors

Data preprocessing is a common technique for transforming raw data into features for a machine learning model.
In general, you may want to apply the same preprocessing logic to your offline training data and online inference data.
- Ray AIR provides several common preprocessors out of the box and interfaces to define your own custom logic.

+ This page covers *preprocessors*, which are a higher-level API on top of existing Ray Data operations like ``map_batches``,
+ targeted towards tabular and structured data use cases.

+ The recommended way to perform preprocessing is to :ref:`use existing Ray Data operations <transforming_data>` instead
+ of preprocessors. However, if you are working with tabular data, you should consider using Ray Data preprocessors.

Review comment (Contributor): Can we update the wording here? It's a little confusing to say that the recommended way to preprocess is existing Ray Data operations, but then to say you should consider built-in preprocessors. Maybe just: "While Ray Data supports generic transformations on datasets, for tabular data it also provides built-in preprocessors."
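As a quick illustration of that recommendation, here is a minimal, hypothetical ``map_batches`` transformation (an editor's sketch, not part of this diff; it assumes only a local Ray installation and the dict-of-NumPy batch format):

    import ray

    ds = ray.data.from_items([{"x": x, "y": 2 * x} for x in range(32)])

    def add_features(batch: dict) -> dict:
        # Derive a new column from an existing one; called once per batch.
        batch["x_squared"] = batch["x"] ** 2
        return batch

    ds = ds.map_batches(add_features, batch_format="numpy")
    print(ds.take(2))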



.. https://docs.google.com/drawings/d/1ZIbsXv5vvwTVIEr2aooKxuYJ_VL7-8VMNlRinAiPaTI/edit

@@ -15,15 +22,7 @@ Ray AIR provides several common preprocessors out of the box and interfaces to define your own custom logic.
Overview
--------

- The most common way of using a preprocessor is by passing it as an argument to the constructor of a Ray Train :ref:`Trainer <train-docs>` in conjunction with a :ref:`Ray Data dataset <data>`.
- For example, the following code trains a model with a preprocessor that normalizes the data.
-
- .. literalinclude:: doc_code/preprocessors.py
-    :language: python
-    :start-after: __trainer_start__
-    :end-before: __trainer_end__
-
- The ``Preprocessor`` class with four public methods that can we used separately from a trainer:
+ The ``Preprocessor`` class has four public methods:

#. ``fit()``: Compute state information about a :class:`Dataset <ray.data.Dataset>` (e.g., the mean or standard deviation of a column)
and save it to the ``Preprocessor``. This information is used to perform ``transform()``, and the method is typically called on a
@@ -55,84 +54,19 @@ Finally, call ``transform_batch`` on a single batch of data.
:start-after: __preprocessor_transform_batch_start__
:end-before: __preprocessor_transform_batch_end__
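To make these methods concrete, here is a minimal end-to-end sketch (an editor's addition, not part of this diff; it assumes a local Ray installation and uses the same ``MinMaxScaler`` as the surrounding examples):

    import pandas as pd
    import ray
    from ray.data.preprocessors import MinMaxScaler

    ds = ray.data.from_items([{"x": x, "y": 2 * x} for x in range(8)])

    preprocessor = MinMaxScaler(["x"])
    preprocessor.fit(ds)                        # compute the min/max of column "x"
    ds_scaled = preprocessor.transform(ds)      # apply the fitted state to a Dataset
    batch = pd.DataFrame({"x": [1.0, 3.0]})
    print(preprocessor.transform_batch(batch))  # apply it to an in-memory batch
    # fit_transform() combines fit() and transform() in one call.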

- Life of an AIR preprocessor
- ---------------------------
-
- Now that we've gone over the basics, let's dive into how ``Preprocessor``\s fit into an end-to-end application built with AIR.
- The diagram below depicts an overview of the main steps of a ``Preprocessor``:
-
- #. Passed into a ``Trainer`` to ``fit`` and ``transform`` input ``Dataset``\s
- #. Saved as a ``Checkpoint``
- #. Reconstructed in a ``Predictor`` to ``fit_batch`` on batches of data
+ The most common way to use a preprocessor is to apply it to a :ref:`Ray Data dataset <data>`, which is then passed to a Ray Train :ref:`Trainer <train-docs>`. See also:
-
- .. figure:: images/air-preprocessor.svg
-
- Throughout this section we'll go through this workflow in more detail, with code examples using XGBoost.
- The same logic is applicable to other machine learning framework integrations as well.
-
- Trainer
- ~~~~~~~
-
- The journey of the ``Preprocessor`` starts with the :class:`Trainer <ray.train.trainer.BaseTrainer>`.
- If the ``Trainer`` is instantiated with a ``Preprocessor``, then the following logic is executed when ``Trainer.fit()`` is called:
-
- #. If a ``"train"`` ``Dataset`` is passed in, then the ``Preprocessor`` calls ``fit()`` on it.
- #. The ``Preprocessor`` then calls ``transform()`` on all ``Dataset``\s, including the ``"train"`` ``Dataset``.
- #. The ``Trainer`` then performs training on the preprocessed ``Dataset``\s.
-
- .. literalinclude:: doc_code/preprocessors.py
-    :language: python
-    :start-after: __trainer_start__
-    :end-before: __trainer_end__
-
- .. note::
-
-     If you're passing a ``Preprocessor`` that is already fitted, it is refitted on the ``"train"`` ``Dataset``.
-     Adding the functionality to support passing in a fitted Preprocessor is being tracked
-     `here <https://github.com/ray-project/ray/issues/25299>`__.
-
- .. TODO: Remove the note above once the issue is resolved.
-
- Tune
- ~~~~
-
- If you're using ``Ray Tune`` for hyperparameter optimization, be aware that each ``Trial`` instantiates its own copy of
- the ``Preprocessor`` and the fitting and transforming logic occur once per ``Trial``.
-
- Checkpoint
- ~~~~~~~~~~
-
- ``Trainer.fit()`` returns a ``Result`` object which contains a ``Checkpoint``.
- If a ``Preprocessor`` is passed into the ``Trainer``, then it is saved in the ``Checkpoint`` along with any fitted state.
-
- As a sanity check, let's confirm the ``Preprocessor`` is available in the ``Checkpoint``. In practice, you don't need to check.
-
- .. literalinclude:: doc_code/preprocessors.py
-    :language: python
-    :start-after: __checkpoint_start__
-    :end-before: __checkpoint_end__
-
- Predictor
- ~~~~~~~~~
-
- A ``Predictor`` can be constructed from a saved ``Checkpoint``. If the ``Checkpoint`` contains a ``Preprocessor``,
- then the ``Preprocessor`` calls ``transform_batch`` on input batches prior to performing inference.
-
- In the following example, we show the batch inference flow.
-
- .. literalinclude:: doc_code/preprocessors.py
-    :language: python
-    :start-after: __predictor_start__
-    :end-before: __predictor_end__
+ * :ref:`Working with PyTorch user guide <working_with_pytorch>`
+ * Ray Train's data preprocessing and ingest section for :ref:`PyTorch <data-ingest-torch>`
+ * Ray Train's data preprocessing and ingest section for :ref:`LightGBM/XGBoost <data-ingest-gbdt>`

Types of preprocessors
----------------------

Built-in preprocessors
~~~~~~~~~~~~~~~~~~~~~~

- Ray AIR provides a handful of preprocessors out of the box.
+ Ray Data provides a handful of preprocessors out of the box.

**Generic preprocessors**

41 changes: 41 additions & 0 deletions doc/source/train/distributed-xgboost-lightgbm.rst
@@ -190,6 +190,47 @@ machines have 16 CPUs in addition to the 4 GPUs, each actor should have
:end-before: __gpu_xgboost_end__


.. _data-ingest-gbdt:

How to preprocess data for training?
------------------------------------

The recommended way to preprocess data for training is to use Ray Data operations such as ``map_batches``.

However, particularly for tabular data, Ray Data comes with out-of-the-box :ref:`preprocessors <air-preprocessors>` that implement common feature preprocessing operations.
You can use these with Ray Train Trainers by applying them to the dataset before passing it to a Trainer. For example:

.. testcode::

    import ray

    from ray.data.preprocessors import MinMaxScaler
    from ray.train.xgboost import XGBoostTrainer
    from ray.train import ScalingConfig

    train_dataset = ray.data.from_items([{"x": x, "y": 2 * x} for x in range(0, 32, 3)])
    valid_dataset = ray.data.from_items([{"x": x, "y": 2 * x} for x in range(1, 32, 3)])

    preprocessor = MinMaxScaler(["x"])
    preprocessor.fit(train_dataset)
    train_dataset = preprocessor.transform(train_dataset)
    valid_dataset = preprocessor.transform(valid_dataset)

    trainer = XGBoostTrainer(
        label_column="y",
        params={"objective": "reg:squarederror"},
        scaling_config=ScalingConfig(num_workers=2),
        datasets={"train": train_dataset, "valid": valid_dataset},
    )
    result = trainer.fit()


.. testoutput::
    :hide:

    ...
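For comparison, the same scaling can be expressed with plain Ray Data operations. The following is a rough, hypothetical sketch (not from this PR) that computes the global statistics with ``Dataset.min``/``Dataset.max`` and applies them in ``map_batches``:

    import ray

    train_dataset = ray.data.from_items([{"x": x, "y": 2 * x} for x in range(0, 32, 3)])

    # Compute the global statistics once, up front.
    x_min, x_max = train_dataset.min("x"), train_dataset.max("x")

    def min_max_scale(batch):
        # batch is a dict of NumPy arrays.
        batch["x"] = (batch["x"] - x_min) / (x_max - x_min)
        return batch

    train_dataset = train_dataset.map_batches(min_max_scale, batch_format="numpy")

The built-in ``MinMaxScaler`` above handles exactly this bookkeeping for you.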


How to optimize XGBoost memory usage?
-------------------------------------

50 changes: 50 additions & 0 deletions doc/source/train/user-guides/data-loading-preprocessing.rst
@@ -415,6 +415,56 @@ If your model is sensitive to shuffle quality, call :meth:`Dataset.random_shuffle`

For more information on how to optimize shuffling, and which approach to choose, see the :ref:`Optimize shuffling guide <optimizing_shuffles>`.

Preprocessing Data
------------------

Review comment (Contributor), suggested change: rename the section heading from "Preprocessing Data" to "Preprocessing Tabular Data".

The recommended way to preprocess data for training is to use Ray Data operations such as ``map_batches``.
See the :ref:`Ray Data Working with PyTorch guide <working_with_pytorch>` for more details.

However, particularly for tabular data, you can also use Ray Data :ref:`preprocessors <air-preprocessors>`, which implement common data preprocessing operations.
You can use these with Ray Train Trainers by applying them to the dataset before passing it to a Trainer. For example:

.. testcode::

    import numpy as np

    import ray
    from ray.air import session  # needed for session.get_dataset_shard below
    from ray.train import ScalingConfig
    from ray.train.torch import TorchTrainer
    from ray.data.preprocessors import Concatenator, Chain, StandardScaler

    dataset = ray.data.read_csv("s3://anonymous@air-example-data/breast_cancer.csv")

    # Create a preprocessor to scale some columns and concatenate the result.
    preprocessor = Chain(
        StandardScaler(columns=["mean radius", "mean texture"]),
        Concatenator(exclude=["target"], dtype=np.float32),
    )
    dataset = preprocessor.fit_transform(dataset)  # this will be applied lazily

    def train_loop_per_worker():
        # Get an iterator to the dataset we passed in below.
        it = session.get_dataset_shard("train")
        for _ in range(2):
            # Prefetch 10 batches at a time.
            for batch in it.iter_batches(batch_size=128, prefetch_batches=10):
                print("Do some training on batch", batch)

    my_trainer = TorchTrainer(
        train_loop_per_worker,
        scaling_config=ScalingConfig(num_workers=2),
        datasets={"train": dataset},
    )
    my_trainer.fit()


.. testoutput::
    :hide:

    ...
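Because the fitted ``Preprocessor`` keeps its state, it can also be reused outside of training, for example on in-memory batches at inference time. A minimal sketch continuing the example above (the column values here are hypothetical):

    import pandas as pd

    # A small inference-time batch with the columns the preprocessor expects.
    batch = pd.DataFrame(
        {"mean radius": [14.2, 20.6], "mean texture": [19.1, 29.3], "target": [0, 1]}
    )
    print(preprocessor.transform_batch(batch))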


Reproducibility
---------------
When developing or hyperparameter-tuning models, reproducibility is important during data ingest so that data ingest does not affect model quality. Follow these three steps to enable reproducibility: