
[Docs] [Train] E2e user guide for Data+Train #37921

Merged
merged 20 commits into ray-project:master from data-train-docs on Aug 10, 2023

Conversation

@amogkam (Contributor) commented Jul 29, 2023

Adds e2e user guide for data+train.

Why are these changes needed?

Related issue number

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests; see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@amogkam requested a review from gjoliver as a code owner on July 29, 2023 22:04
@ericl (Contributor) commented Jul 30, 2023

A few high level requests:

  1. Can we move this into the Train user guide and call this "Configuring Train Datasets"? This integration is really a feature of Train. We can add a cross link from Data for discoverability.
  2. For the advanced config, can we move that to the very end of the page? Most users should really not be needing this, so we shouldn't make it the second subsection of the configuration section.
  3. Can we split the configuration section into performance and the rest? Currently it has a large number of subheadings which indicates it's too broad of a header.

@ericl added the @author-action-required label (The PR author is responsible for the next step. Remove tag to send back to the reviewer.) on Jul 30, 2023
@amogkam (Contributor, Author) commented Jul 31, 2023

@ericl

> Can we move this into the Train user guide and call this "Configuring Train Datasets"? This integration is really a feature of Train. We can add a cross link from Data for discoverability.

I think we want it in both places, right? For this page, the configurations are all specific to Ray Datasets. For Ray Train configurations, those can be in the train docs.

> For the advanced config, can we move that to the very end of the page? Most users should really not be needing this, so we shouldn't make it the second subsection of the configuration section.

Actually do we know of a use case for this advanced config? If not, then I'm inclined to just remove it.

> Can we split the configuration section into performance and the rest? Currently it has a large number of subheadings which indicates it's too broad of a header.

Sounds good, will make the change

@stephanie-wang (Contributor) left a comment

This is great! Just left some nits for clarity.

If you are training on GPUs and have an expensive CPU preprocessing operation, this may bottleneck training throughput.

If your preprocessed Dataset is small enough to fit in object store memory, the easiest thing to do is to *materialize* the preprocessed dataset in Ray object store memory, by calling :meth:`materialize() <ray.data.Dataset.materialize>` on the preprocessed dataset. This tells Ray Data to compute the entire preprocessed dataset and pin it in Ray object store memory. As a result, when iterating over the dataset repeatedly, the preprocessing operations do not need to be re-run. However, the trade-off is that if the preprocessed data is too large to fit into Ray object store memory, this will greatly decrease performance as data needs to be spilled to disk.
@stephanie-wang (Contributor) commented Aug 1, 2023
Suggested change
If your preprocessed Dataset is small enough to fit in object store memory, the easiest thing to do is to *materialize* the preprocessed dataset in Ray object store memory, by calling :meth:`materialize() <ray.data.Dataset.materialize>` on the preprocessed dataset. This tells Ray Data to compute the entire preprocessed dataset and pin it in Ray object store memory. As a result, when iterating over the dataset repeatedly, the preprocessing operations do not need to be re-run. However, the trade-off is that if the preprocessed data is too large to fit into Ray object store memory, this will greatly decrease performance as data needs to be spilled to disk.
If your preprocessed Dataset is small enough to fit in RAM, *materialize* the preprocessed dataset in Ray's built-in object store memory, by calling :meth:`materialize() <ray.data.Dataset.materialize>` on the preprocessed dataset. This approach tells Ray Data to compute the entire preprocessed dataset and pin it in Ray object store memory. As a result, when iterating over the dataset repeatedly, the preprocessing operations do not need to be re-run.
However, if the preprocessed data is too large to fit into Ray object store memory, this approach greatly decreases performance as data needs to be spilled to and read back from disk.

Contributor
Add a note that randomized transforms should be applied after the materialize() call?

@amogkam (Contributor, Author) commented Aug 1, 2023

It needs to be able to fit in object store memory, right? Which is only a portion of total RAM in the cluster.

Contributor

I added this wording because "Ray object store" is jargon-y for non-core users. We could change the wording to "comfortably fit" and specify the default fraction that core uses for the object store (I think 30%).

Contributor Author

Clarified the default is 30%. But we do talk about object store memory throughout the configuration section. Object store memory is a foundational concept users would have to know to maximize performance.
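
For readers following this thread, here is a minimal, illustrative sketch of the materialization pattern being discussed. The input path, column name, and preprocessing function are hypothetical; `Dataset.materialize()` and `randomize_block_order()` are the Ray Data calls referenced in the quoted doc text.

```python
import ray

# Hypothetical input path, for illustration only.
ds = ray.data.read_parquet("s3://my-bucket/train")

def preprocess(batch):
    # Stand-in for an expensive CPU-only transformation.
    batch["feature"] = batch["feature"] * 2
    return batch

ds = ds.map_batches(preprocess)

# Compute the whole preprocessed dataset once and pin it in Ray object store
# memory (by default roughly 30% of each node's RAM). Only worthwhile if the
# preprocessed data fits; otherwise blocks spill to disk and performance drops.
ds = ds.materialize()

# Per the review note above: apply randomized, per-epoch transforms *after*
# materialize() so they are not frozen into the cached dataset.
ds = ds.randomize_block_order()
```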


Adding CPU-only nodes to your cluster
#####################################
If you are bottlenecked on expensive CPU preprocessing and the preprocessed Dataset is too large to fit in object store memory, then the above tip will not work.
Contributor

Suggested change
If you are bottlenecked on expensive CPU preprocessing and the preprocessed Dataset is too large to fit in object store memory, then the above tip will not work.
If you are bottlenecked on expensive CPU preprocessing and the preprocessed Dataset is too large to fit in object store memory, then the above tip will not work if your disk I/O is also too slow.

Contributor Author

I think we should just discourage this entirely? I think for pretty much all use cases, disk speed will not be fast enough for disk spilling to be worth it compared to adding more CPU nodes, is that not the case?

Contributor

I think that depends on cluster configuration. NVMe drives, for example, can support GB/s.

@ericl (Contributor) commented Aug 1, 2023

As discussed offline, let's have this live in the Train docs since this is documenting a pure Train feature. However, it is good to xref from Data.

> Actually do we know of a use case for this advanced config? If not, then I'm inclined to just remove it.

Please don't remove this as we designed this based on user feedback for advanced use cases.

* :meth:`randomize_block_order <ray.data.Dataset.randomize_block_order>`
* `local_shuffle_seed` argument to :meth:`iter_batches <ray.data.DataIterator.iter_batches>`

**Step 3:** Follow the best practices for enabling reproducibility for your training framework of choice. For example, see the `Pytorch reproducibility guide <https://pytorch.org/docs/stable/notes/randomness.html>`_.
Contributor
Suggested change
**Step 3:** Follow the best practices for enabling reproducibility for your training framework of choice. For example, see the `Pytorch reproducibility guide <https://pytorch.org/docs/stable/notes/randomness.html>`_.
**Step 3:** Follow the best practices for enabling reproducibility for your training framework of choice. For example, see `Pytorch reproducibility guide <https://pytorch.org/docs/stable/notes/randomness.html>`_.
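
For context, a minimal sketch of the seeding hooks referenced in the quoted doc text. The dataset, buffer size, and seed values are illustrative; the parameter names are the `seed` and `local_shuffle_seed` arguments mentioned above, as exposed by Ray Data at the time of this PR.

```python
import ray

ds = ray.data.range(10_000)  # illustrative dataset

# Seeded block-level shuffle (one of the options listed above).
ds = ds.randomize_block_order(seed=42)

# Seeded local shuffle while iterating over batches.
for batch in ds.iter_batches(
    batch_size=256,
    local_shuffle_buffer_size=10_000,
    local_shuffle_seed=42,
):
    ...  # feed the batch to the training loop
```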

@angelinalg (Contributor) commented:
Nice work on the writing, Team!

@amogkam (Contributor, Author) commented Aug 1, 2023

> For the advanced config, can we move that to the very end of the page? Most users should really not be needing this, so we shouldn't make it the second subsection of the configuration section.

This is now a subheading under "Customizing how to split datasets"; it is no longer top-level.

> Can we split the configuration section into performance and the rest? Currently it has a large number of subheadings which indicates it's too broad of a header.

I removed the outermost section.

@amogkam changed the title from "[Docs] [Data] [Train] E2e user guide for Data+Train" to "[Docs] [Train] E2e user guide for Data+Train" on Aug 8, 2023
@amogkam removed the @author-action-required label (The PR author is responsible for the next step. Remove tag to send back to the reviewer.) on Aug 10, 2023
@amogkam merged commit 3712e8b into ray-project:master on Aug 10, 2023
75 of 78 checks passed
@amogkam deleted the data-train-docs branch on August 10, 2023 05:31
shrekris-anyscale pushed a commit to shrekris-anyscale/ray that referenced this pull request Aug 10, 2023
NripeshN pushed a commit to NripeshN/ray that referenced this pull request Aug 15, 2023
harborn pushed a commit to harborn/ray that referenced this pull request Aug 17, 2023
harborn pushed a commit to harborn/ray that referenced this pull request Aug 17, 2023
arvind-chandra pushed a commit to lmco/ray that referenced this pull request Aug 31, 2023
vymao pushed a commit to vymao/ray that referenced this pull request Oct 11, 2023
7 participants