Data preprocessing is a common technique for transforming raw data into features for a machine learning model. In general, you may want to apply the same preprocessing logic to your offline training data and online inference data. Ray AIR provides several common preprocessors out of the box and interfaces to define your own custom logic.
The most common way of using a preprocessor is by passing it as an argument to the constructor of a Trainer <air-trainers>, in conjunction with a Ray Dataset <data>. For example, the following code trains a model with a preprocessor that normalizes the data.
doc_code/preprocessors.py
The Preprocessor class has four public methods that can be used separately from a Trainer:

- fit(): Compute state information about a Dataset <ray.data.Dataset> (e.g., the mean or standard deviation of a column) and save it to the Preprocessor. This information is used to perform transform(), and the method is typically called on a training dataset.
- transform(): Apply a transformation to a Dataset. If the Preprocessor is stateful, then fit() must be called first. This method is typically called on training, validation, and test datasets.
- transform_batch(): Apply a transformation to a single batch <ray.train.predictor.DataBatchType> of data. This method is typically called on online or offline inference data.
- fit_transform(): Syntactic sugar for calling both fit() and transform() on a Dataset.
To show these methods in action, let's walk through a basic example. First, we'll set up two simple Ray Datasets.
doc_code/preprocessors.py
Next, fit the Preprocessor on one Dataset, and then transform both Datasets with this fitted information.
doc_code/preprocessors.py
Finally, call transform_batch on a single batch of data.
doc_code/preprocessors.py
Now that we've gone over the basics, let's dive into how Preprocessors fit into an end-to-end application built with AIR. The diagram below depicts an overview of the main steps of a Preprocessor:
- Passed into a Trainer to fit and transform input Datasets
- Saved as a Checkpoint
- Reconstructed in a Predictor to transform_batch on batches of data
Throughout this section we'll go through this workflow in more detail, with code examples using XGBoost. The same logic is applicable to other machine learning framework integrations as well.
The journey of the Preprocessor starts with the Trainer <ray.train.trainer.BaseTrainer>. If the Trainer is instantiated with a Preprocessor, then the following logic is executed when Trainer.fit() is called:
- If a "train" Dataset is passed in, then the Preprocessor calls fit() on it.
- The Preprocessor then calls transform() on all Datasets, including the "train" Dataset.
- The Trainer then performs training on the preprocessed Datasets.
doc_code/preprocessors.py
Note
If you're passing a Preprocessor that is already fitted, it is refitted on the "train" Dataset. Adding the functionality to support passing in a fitted Preprocessor is being tracked here.
If you're using Ray Tune for hyperparameter optimization, be aware that each Trial instantiates its own copy of the Preprocessor, and the fitting and transforming logic occurs once per Trial.
Trainer.fit() returns a Result object which contains a Checkpoint. If a Preprocessor is passed into the Trainer, then it is saved in the Checkpoint along with any fitted state.
As a sanity check, let's confirm the Preprocessor is available in the Checkpoint. In practice, you don't need to check.
doc_code/preprocessors.py
A Predictor can be constructed from a saved Checkpoint. If the Checkpoint contains a Preprocessor, then the Preprocessor calls transform_batch on input batches prior to performing inference.

In the following example, we show the Batch Predictor flow. The same logic applies to the Online Inference flow <air-key-concepts-online-inference>.
doc_code/preprocessors.py
Ray AIR provides a handful of preprocessors out of the box.
Generic preprocessors
- ray.data.preprocessors.BatchMapper
- ray.data.preprocessors.Chain
- ray.data.preprocessors.Concatenator
- ray.data.preprocessor.Preprocessor
- ray.data.preprocessors.SimpleImputer
Categorical encoders
- ray.data.preprocessors.Categorizer
- ray.data.preprocessors.LabelEncoder
- ray.data.preprocessors.MultiHotEncoder
- ray.data.preprocessors.OneHotEncoder
- ray.data.preprocessors.OrdinalEncoder
Feature scalers
- ray.data.preprocessors.MaxAbsScaler
- ray.data.preprocessors.MinMaxScaler
- ray.data.preprocessors.Normalizer
- ray.data.preprocessors.PowerTransformer
- ray.data.preprocessors.RobustScaler
- ray.data.preprocessors.StandardScaler
Text encoders
- ray.data.preprocessors.CountVectorizer
- ray.data.preprocessors.HashingVectorizer
- ray.data.preprocessors.Tokenizer
- ray.data.preprocessors.FeatureHasher
Utilities
ray.data.Dataset.train_test_split
The type of preprocessor you use depends on what your data looks like. This section provides tips on handling common data formats.
Most models expect numerical inputs. To represent your categorical data in a way your model can understand, encode categories using one of the preprocessors described below.
Categorical Data Type | Example | Preprocessor
---|---|---
Labels | "cat", "dog", "airplane" | ~ray.data.preprocessors.LabelEncoder
Ordered categories | "bs", "md", "phd" | ~ray.data.preprocessors.OrdinalEncoder
Unordered categories | "red", "green", "blue" | ~ray.data.preprocessors.OneHotEncoder
Lists of categories | ("sci-fi", "action"), ("action", "comedy", "animated") | ~ray.data.preprocessors.MultiHotEncoder
Note
If you're using LightGBM, you don't need to encode your categorical data. Instead, use ~ray.data.preprocessors.Categorizer to convert your data to pandas.CategoricalDtype.
To ensure your model behaves properly, normalize your numerical data. Reference the table below to determine which preprocessor to use.
Data Property | Preprocessor |
---|---|
Your data is approximately normal | ~ray.data.preprocessors.StandardScaler |
Your data is sparse | ~ray.data.preprocessors.MaxAbsScaler |
Your data contains many outliers | ~ray.data.preprocessors.RobustScaler |
Your data isn't normal, but you need it to be | ~ray.data.preprocessors.PowerTransformer |
You need unit-norm rows | ~ray.data.preprocessors.Normalizer |
You aren't sure what your data looks like | ~ray.data.preprocessors.MinMaxScaler |
Warning
These preprocessors operate on numeric columns. If your dataset contains columns of type ~ray.air.util.tensor_extensions.pandas.TensorDtype, you may need to implement a custom preprocessor <air-custom-preprocessors>.

Additionally, if your model expects a tensor or ndarray, create a tensor using ~ray.data.preprocessors.Concatenator.
Tip
Built-in feature scalers like ~ray.data.preprocessors.StandardScaler don't work on ~ray.air.util.tensor_extensions.pandas.TensorDtype columns, so apply ~ray.data.preprocessors.Concatenator after feature scaling. Combine feature scaling and concatenation into a single preprocessor with ~ray.data.preprocessors.Chain.
doc_code/preprocessors.py
A document-term matrix is a table that describes text data, often used in natural language processing.
To generate a document-term matrix from a collection of documents, use ~ray.data.preprocessors.HashingVectorizer or ~ray.data.preprocessors.CountVectorizer. If you already know the frequency of tokens and want to store the data in a document-term matrix, use ~ray.data.preprocessors.FeatureHasher.
Requirement | Preprocessor |
---|---|
You care about memory efficiency | ~ray.data.preprocessors.HashingVectorizer |
You care about model interpretability | ~ray.data.preprocessors.CountVectorizer |
If your dataset contains missing values, replace them with ~ray.data.preprocessors.SimpleImputer.
doc_code/preprocessors.py
If you need to apply more than one preprocessor, compose them together with ~ray.data.preprocessors.Chain.

~ray.data.preprocessors.Chain applies fit and transform sequentially. For example, if you construct Chain(preprocessorA, preprocessorB), then preprocessorB.transform is applied to the result of preprocessorA.transform.
doc_code/preprocessors.py
If you want to implement a custom preprocessor that needs to be fit, extend the ~ray.data.preprocessor.Preprocessor base class.
doc_code/preprocessors.py
If your preprocessor doesn't need to be fit, construct a ~ray.data.preprocessors.BatchMapper to apply a UDF in parallel over your data. ~ray.data.preprocessors.BatchMapper can drop, add, or modify columns, and you can specify a batch_size to control the size of the data batches provided to your UDF.
doc_code/preprocessors.py