
[docs] Remove trainer references from preprocessors #38348

Merged
richardliaw merged 15 commits into ray-project:master on Aug 16, 2023

Conversation

@richardliaw (Contributor) commented on Aug 11, 2023

Why are these changes needed?

This PR contains the following changes:

  1. Removes the AIR trainer <> preprocessor usage and lifecycle text that is no longer relevant.
  2. Adds references and a recommendation to use the map_batches approach.
  3. Adds hooks from the Ray Train docs (especially for tabular-oriented trainers like XGBoost).
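The map_batches approach recommended above can be sketched as follows. This is an illustrative example, not code from this PR; the column name `value` and the normalization logic are hypothetical, and the Ray calls are shown only in comments since they require a Ray installation.

```python
import numpy as np

# Ray Data passes batches to map_batches as dicts of NumPy arrays by default.
# A stateless per-batch transform like this is the style the docs now recommend
# for preprocessing, instead of attaching a preprocessor to an AIR trainer.
def normalize_batch(batch: dict) -> dict:
    # Hypothetical column "value"; rescale it into [0, 1] within the batch.
    col = batch["value"].astype(np.float64)
    span = col.max() - col.min()
    batch["value"] = (col - col.min()) / span if span else np.zeros_like(col)
    return batch

# With Ray installed, this would be applied as (not executed here):
#   import ray
#   ds = ray.data.from_items([{"value": v} for v in [0.0, 5.0, 10.0]])
#   ds = ds.map_batches(normalize_batch)

if __name__ == "__main__":
    out = normalize_batch({"value": np.array([0.0, 5.0, 10.0])})
    print(out["value"].tolist())  # [0.0, 0.5, 1.0]
```

Because the transform is a plain function over batches, it composes with any other Ray Data operation rather than being tied to a trainer's lifecycle.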

Related issue number

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

doc/source/data/preprocessors.rst (outdated review threads)
This page covers *preprocessors*, which are a higher level API on top of existing Ray Data operations like `map_batches`,
targeted towards tabular and structured data use cases.

The recommended way to perform preprocessing is to :ref:`use existing Ray Data operations <transforming_data>` instead
A contributor commented:

Can we update the wording here? It's a little bit confusing right now if we say the recommended way for preprocessing is existing Ray Data operations, but then say you should consider built-in preprocessors.

Maybe just "While Ray Data supports generic transformations on datasets, for tabular data it also provides built-in preprocessors"
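To make the reviewer's distinction concrete: built-in preprocessors follow a fit-then-transform lifecycle over a dataset, unlike stateless `map_batches` transforms. The toy class below is a standalone stand-in for that lifecycle, operating on plain lists of dicts; Ray's real implementation is `ray.data.preprocessors.StandardScaler`, which operates on Datasets and is not reproduced here.

```python
import math

# Minimal stand-in for the fit/transform lifecycle that Ray Data's built-in
# tabular preprocessors (e.g. ray.data.preprocessors.StandardScaler) follow.
# This toy works on lists of dicts instead of Ray Datasets.
class ToyStandardScaler:
    def __init__(self, columns):
        self.columns = columns
        self.stats = {}

    def fit(self, rows):
        # fit() computes per-column mean and standard deviation over the
        # whole "dataset" -- the stateful step map_batches alone doesn't have.
        for col in self.columns:
            vals = [r[col] for r in rows]
            mean = sum(vals) / len(vals)
            var = sum((v - mean) ** 2 for v in vals) / len(vals)
            self.stats[col] = (mean, math.sqrt(var))
        return self

    def transform(self, rows):
        # transform() applies the fitted statistics row by row.
        out = []
        for r in rows:
            r = dict(r)
            for col in self.columns:
                mean, std = self.stats[col]
                r[col] = (r[col] - mean) / std if std else 0.0
            out.append(r)
        return out

if __name__ == "__main__":
    rows = [{"x": 1.0}, {"x": 3.0}]
    scaled = ToyStandardScaler(["x"]).fit(rows).transform(rows)
    print(scaled)  # [{'x': -1.0}, {'x': 1.0}]
```

The fitted statistics are what make preprocessors a convenience for tabular data: generic `map_batches` transforms see only one batch at a time.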

doc/source/data/preprocessors.rst (outdated review thread)
doc/source/train/distributed-xgboost-lightgbm.rst (outdated review thread)
@@ -415,6 +415,56 @@ If your model is sensitive to shuffle quality, call :meth:`Dataset.random_shuffl

For more information on how to optimize shuffling, and which approach to choose, see the :ref:`Optimize shuffling guide <optimizing_shuffles>`.

Preprocessing Data
A contributor suggested:

Suggested change:
- Preprocessing Data
+ Preprocessing Tabular Data

@richardliaw mentioned this pull request on Aug 14, 2023
@pcmoritz (Contributor) commented:

We should also update the example https://docs.ray.io/en/master/ray-air/examples/gptj_deepspeed_fine_tuning.html to remove the preprocessor stuff and instead use map_batches :)

@richardliaw merged commit 604aaa2 into ray-project:master on Aug 16, 2023
18 of 21 checks passed
harborn pushed a commit to harborn/ray that referenced this pull request Aug 17, 2023
arvind-chandra pushed a commit to lmco/ray that referenced this pull request Aug 31, 2023
vymao pushed a commit to vymao/ray that referenced this pull request Oct 11, 2023
Labels: none yet
Projects: none yet
Development: successfully merging this pull request may close these issues — none yet
6 participants