
Using Preprocessors

Data preprocessing is a common technique for transforming raw data into features for a machine learning model. In general, you may want to apply the same preprocessing logic to your offline training data and online inference data. Ray AIR provides several common preprocessors out of the box and interfaces to define your own custom logic.


Overview

The most common way to use a preprocessor is to pass it into the constructor of a Trainer <air-trainers> together with a Ray Dataset <data>. For example, the following code trains a model with a preprocessor that normalizes the data.

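A minimal sketch along the lines of the example in doc_code/preprocessors.py (the dataset, column names, and XGBoost parameters here are illustrative):

    import ray
    from ray.air.config import ScalingConfig
    from ray.data.preprocessors import MinMaxScaler
    from ray.train.xgboost import XGBoostTrainer

    train_dataset = ray.data.from_items([{"x": x, "y": 2 * x} for x in range(100)])
    valid_dataset = ray.data.from_items([{"x": x, "y": 2 * x} for x in range(100, 120)])

    # Scale the "x" column to [0, 1]. The Trainer fits the preprocessor on the
    # "train" dataset and transforms every dataset before training begins.
    preprocessor = MinMaxScaler(columns=["x"])

    trainer = XGBoostTrainer(
        label_column="y",
        params={"objective": "reg:squarederror"},
        scaling_config=ScalingConfig(num_workers=2),
        datasets={"train": train_dataset, "valid": valid_dataset},
        preprocessor=preprocessor,
    )
    result = trainer.fit()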

The Preprocessor class has four public methods that can be used separately from a Trainer:

  1. fit(): Compute state information about a Dataset <ray.data.Dataset> (e.g., the mean or standard deviation of a column) and save it to the Preprocessor. This information is used to perform transform(), and the method is typically called on a training dataset.
  2. transform(): Apply a transformation to a Dataset. If the Preprocessor is stateful, then fit() must be called first. This method is typically called on training, validation, and test datasets.
  3. transform_batch(): Apply a transformation to a single batch <ray.train.predictor.DataBatchType> of data. This method is typically called on online or offline inference data.
  4. fit_transform(): Syntactic sugar for calling both fit() and transform() on a Dataset.

To show these methods in action, let's walk through a basic example. First, we'll set up two simple Ray Datasets.

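A minimal setup, with illustrative column names (the full example lives in doc_code/preprocessors.py):

    import ray

    # One dataset to fit the preprocessor on, and one to transform with the
    # fitted state.
    train_dataset = ray.data.from_items([{"value": i} for i in range(4)])
    test_dataset = ray.data.from_items([{"value": i} for i in range(4, 8)])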

Next, fit the Preprocessor on one Dataset, and then transform both Datasets with this fitted information.

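Continuing the sketch above:

    from ray.data.preprocessors import MinMaxScaler

    preprocessor = MinMaxScaler(columns=["value"])

    # fit() computes the min and max of the "value" column from the training data.
    preprocessor.fit(train_dataset)

    # transform() applies the fitted scaling to any Dataset.
    train_dataset_transformed = preprocessor.transform(train_dataset)
    test_dataset_transformed = preprocessor.transform(test_dataset)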

Finally, call transform_batch on a single batch of data.

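For example, with a pandas DataFrame as the batch (one of the supported batch types):

    import pandas as pd

    # transform_batch() applies the same fitted scaling to an in-memory batch.
    batch = pd.DataFrame({"value": list(range(8, 12))})
    batch_transformed = preprocessor.transform_batch(batch)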

Life of an AIR preprocessor

Now that we've gone over the basics, let's dive into how Preprocessors fit into an end-to-end application built with AIR. A Preprocessor moves through three main steps:

  1. Passed into a Trainer to fit and transform input Datasets
  2. Saved as a Checkpoint
  3. Reconstructed in a Predictor to call transform_batch() on batches of data

Throughout this section we'll go through this workflow in more detail, with code examples using XGBoost. The same logic is applicable to other machine learning framework integrations as well.

Trainer

The journey of the Preprocessor starts with the Trainer <ray.train.trainer.BaseTrainer>. If the Trainer is instantiated with a Preprocessor, then the following logic is executed when Trainer.fit() is called:

  1. If a "train" Dataset is passed in, then the Preprocessor calls fit() on it.
  2. The Preprocessor then calls transform() on all Datasets, including the "train" Dataset.
  3. The Trainer then performs training on the preprocessed Datasets.

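Conceptually, this logic is equivalent to the following sketch (illustrative names, not the Trainer's actual internals):

    import ray
    from ray.data.preprocessors import MinMaxScaler

    datasets = {
        "train": ray.data.from_items([{"x": x, "y": 2 * x} for x in range(100)]),
        "valid": ray.data.from_items([{"x": x, "y": 2 * x} for x in range(100, 120)]),
    }
    preprocessor = MinMaxScaler(columns=["x"])

    # 1. Fit on the "train" dataset only.
    preprocessor.fit(datasets["train"])
    # 2. Transform every dataset, including "train".
    preprocessed = {name: preprocessor.transform(ds) for name, ds in datasets.items()}
    # 3. Training then proceeds on the preprocessed datasets.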

Note

If you're passing a Preprocessor that is already fitted, it is refitted on the "train" Dataset. Adding the functionality to support passing in a fitted Preprocessor is being tracked here.

Tune

If you're using Ray Tune for hyperparameter optimization, be aware that each Trial instantiates its own copy of the Preprocessor, and the fitting and transforming logic occurs once per Trial.

Checkpoint

Trainer.fit() returns a Result object which contains a Checkpoint. If a Preprocessor is passed into the Trainer, then it is saved in the Checkpoint along with any fitted state.

As a sanity check, let's confirm the Preprocessor is available in the Checkpoint. In practice, you don't need to check.

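Continuing from the Trainer example above, where result is the object returned by trainer.fit():

    # The Checkpoint exposes the fitted preprocessor, if one was saved with it.
    checkpoint = result.checkpoint
    assert checkpoint.get_preprocessor() is not None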

Predictor

A Predictor can be constructed from a saved Checkpoint. If the Checkpoint contains a Preprocessor, then the Preprocessor calls transform_batch on input batches prior to performing inference.

In the following example, we show the Batch Predictor flow. The same logic applies to the Online Inference flow <air-key-concepts-online-inference>.

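A minimal sketch of the Batch Predictor flow, reusing the checkpoint from above (the test data is illustrative):

    import ray
    from ray.train.batch_predictor import BatchPredictor
    from ray.train.xgboost import XGBoostPredictor

    test_dataset = ray.data.from_items([{"x": x} for x in range(120, 140)])

    batch_predictor = BatchPredictor.from_checkpoint(result.checkpoint, XGBoostPredictor)
    # The saved preprocessor runs transform_batch() on each batch before inference.
    predictions = batch_predictor.predict(test_dataset)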

Types of preprocessors

Built-in preprocessors

Ray AIR provides a handful of preprocessors out of the box.

Generic preprocessors

ray.data.preprocessors.BatchMapper
ray.data.preprocessors.Chain
ray.data.preprocessors.Concatenator
ray.data.preprocessor.Preprocessor
ray.data.preprocessors.SimpleImputer

Categorical encoders

ray.data.preprocessors.Categorizer
ray.data.preprocessors.LabelEncoder
ray.data.preprocessors.MultiHotEncoder
ray.data.preprocessors.OneHotEncoder
ray.data.preprocessors.OrdinalEncoder

Feature scalers

ray.data.preprocessors.MaxAbsScaler
ray.data.preprocessors.MinMaxScaler
ray.data.preprocessors.Normalizer
ray.data.preprocessors.PowerTransformer
ray.data.preprocessors.RobustScaler
ray.data.preprocessors.StandardScaler

Text encoders

ray.data.preprocessors.CountVectorizer
ray.data.preprocessors.HashingVectorizer
ray.data.preprocessors.Tokenizer
ray.data.preprocessors.FeatureHasher

Utilities

ray.data.Dataset.train_test_split

Which preprocessor should you use?

The type of preprocessor you use depends on what your data looks like. This section provides tips on handling common data formats.

Categorical data

Most models expect numerical inputs. To represent your categorical data in a way your model can understand, encode categories using one of the preprocessors described below.

Categorical data type | Example | Preprocessor
Labels | "cat", "dog", "airplane" | ~ray.data.preprocessors.LabelEncoder
Ordered categories | "bs", "md", "phd" | ~ray.data.preprocessors.OrdinalEncoder
Unordered categories | "red", "green", "blue" | ~ray.data.preprocessors.OneHotEncoder
Lists of categories | ("sci-fi", "action"), ("action", "comedy", "animated") | ~ray.data.preprocessors.MultiHotEncoder
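For example, a minimal one-hot encoding sketch (the "color" column is illustrative):

    import ray
    from ray.data.preprocessors import OneHotEncoder

    dataset = ray.data.from_items([{"color": c} for c in ["red", "green", "blue", "red"]])

    # fit() learns the unique categories; transform() emits one indicator
    # column per category.
    encoder = OneHotEncoder(columns=["color"])
    dataset_encoded = encoder.fit_transform(dataset)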

Note

If you're using LightGBM, you don't need to encode your categorical data. Instead, use ~ray.data.preprocessors.Categorizer to convert your data to pandas.CategoricalDtype.

Numerical data

To ensure your model behaves properly, normalize your numerical data. Reference the table below to determine which preprocessor to use.

Data property | Preprocessor
Your data is approximately normal | ~ray.data.preprocessors.StandardScaler
Your data is sparse | ~ray.data.preprocessors.MaxAbsScaler
Your data contains many outliers | ~ray.data.preprocessors.RobustScaler
Your data isn't normal, but you need it to be | ~ray.data.preprocessors.PowerTransformer
You need unit-norm rows | ~ray.data.preprocessors.Normalizer
You aren't sure what your data looks like | ~ray.data.preprocessors.MinMaxScaler

Warning

These preprocessors operate on numeric columns. If your dataset contains columns of type ~ray.air.util.tensor_extensions.pandas.TensorDtype, you may need to implement a custom preprocessor <air-custom-preprocessors>.

Additionally, if your model expects a tensor or ndarray, create a tensor using ~ray.data.preprocessors.Concatenator.

Tip

Built-in feature scalers like ~ray.data.preprocessors.StandardScaler don't work on ~ray.air.util.tensor_extensions.pandas.TensorDtype columns, so apply ~ray.data.preprocessors.Concatenator after feature scaling. Combine feature scaling and concatenation into a single preprocessor with ~ray.data.preprocessors.Chain.

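A minimal sketch along these lines (the columns are illustrative, and Concatenator's exclude parameter keeps the label out of the tensor):

    import numpy as np
    import ray
    from ray.data.preprocessors import Chain, Concatenator, StandardScaler

    dataset = ray.data.from_items(
        [{"x0": float(i), "x1": float(2 * i), "y": i % 2} for i in range(8)]
    )

    # Scale the numeric feature columns first, then pack them into a single
    # tensor column the model can consume.
    preprocessor = Chain(
        StandardScaler(columns=["x0", "x1"]),
        Concatenator(exclude=["y"], dtype=np.float32),
    )
    dataset_transformed = preprocessor.fit_transform(dataset)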

Text data

A document-term matrix is a table that describes the frequency of tokens appearing in a collection of documents. It's commonly used in natural language processing.

To generate a document-term matrix from a collection of documents, use ~ray.data.preprocessors.HashingVectorizer or ~ray.data.preprocessors.CountVectorizer. If you already know the frequency of tokens and want to store the data in a document-term matrix, use ~ray.data.preprocessors.FeatureHasher.

Requirement | Preprocessor
You care about memory efficiency | ~ray.data.preprocessors.HashingVectorizer
You care about model interpretability | ~ray.data.preprocessors.CountVectorizer
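For example, a minimal CountVectorizer sketch (the documents are illustrative):

    import ray
    from ray.data.preprocessors import CountVectorizer

    dataset = ray.data.from_items(
        [{"text": "the cat sat"}, {"text": "the dog ran"}]
    )

    # fit() builds the token vocabulary; transform() emits one count column
    # per token in that vocabulary.
    vectorizer = CountVectorizer(columns=["text"])
    dataset_vectorized = vectorizer.fit_transform(dataset)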

Filling in missing values

If your dataset contains missing values, replace them with ~ray.data.preprocessors.SimpleImputer.

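A minimal sketch (the values are illustrative):

    import ray
    from ray.data.preprocessors import SimpleImputer

    dataset = ray.data.from_items([{"value": 1.0}, {"value": None}, {"value": 3.0}])

    # Replace missing values with the column mean computed during fit().
    imputer = SimpleImputer(columns=["value"], strategy="mean")
    dataset_imputed = imputer.fit_transform(dataset)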

Chaining preprocessors

If you need to apply more than one preprocessor, compose them together with ~ray.data.preprocessors.Chain.

~ray.data.preprocessors.Chain applies fit and transform sequentially. For example, if you construct Chain(preprocessorA, preprocessorB), then preprocessorB.transform is applied to the result of preprocessorA.transform.

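For example, imputing before scaling (illustrative values):

    import ray
    from ray.data.preprocessors import Chain, MinMaxScaler, SimpleImputer

    dataset = ray.data.from_items([{"value": 0.0}, {"value": None}, {"value": 4.0}])

    # Chain fits and transforms each preprocessor in order: the imputed
    # output of SimpleImputer is what MinMaxScaler sees.
    preprocessor = Chain(
        SimpleImputer(columns=["value"], strategy="mean"),
        MinMaxScaler(columns=["value"]),
    )
    dataset_transformed = preprocessor.fit_transform(dataset)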

Implementing custom preprocessors

If you want to implement a custom preprocessor that needs to be fit, extend the ~ray.data.preprocessor.Preprocessor base class.

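A minimal sketch of a fittable custom preprocessor (a simple mean-centering transform; the class and column names are illustrative):

    import pandas as pd
    from ray.data import Dataset
    from ray.data.preprocessor import Preprocessor

    class MeanCenterer(Preprocessor):
        """Center a single numeric column around its mean."""

        _is_fittable = True

        def __init__(self, column: str):
            self.column = column

        def _fit(self, dataset: Dataset) -> Preprocessor:
            # By convention, fitted state is stored in attributes ending in an
            # underscore, which marks the preprocessor as fitted.
            self.stats_ = {"mean": dataset.mean(self.column)}
            return self

        def _transform_pandas(self, df: pd.DataFrame) -> pd.DataFrame:
            df[self.column] = df[self.column] - self.stats_["mean"]
            return df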

If your preprocessor doesn't need to be fit, construct a ~ray.data.preprocessors.BatchMapper to apply a UDF in parallel over your data. ~ray.data.preprocessors.BatchMapper can drop, add, or modify columns, and you can specify a batch_size to control the size of the data batches provided to your UDF.

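A minimal sketch of a fit-free preprocessor built with BatchMapper (the UDF and column names are illustrative):

    import pandas as pd
    import ray
    from ray.data.preprocessors import BatchMapper

    def add_squared_column(batch: pd.DataFrame) -> pd.DataFrame:
        # The UDF can add, drop, or modify columns in each batch.
        batch["value_squared"] = batch["value"] ** 2
        return batch

    # batch_format declares the batch type handed to the UDF; an optional
    # batch_size argument controls how many rows each batch contains.
    preprocessor = BatchMapper(add_squared_column, batch_format="pandas")

    dataset = ray.data.from_items([{"value": i} for i in range(4)])
    dataset_transformed = preprocessor.transform(dataset)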