
[Docs] [Train] E2e user guide for Data+Train #37921

Merged
merged 20 commits into ray-project:master from data-train-docs on Aug 10, 2023

Conversation

@amogkam (Contributor) commented Jul 29, 2023

Adds e2e user guide for data+train.

Why are these changes needed?

Related issue number

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests; see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@amogkam requested a review from gjoliver as a code owner on July 29, 2023 22:04
@ericl (Contributor) commented Jul 30, 2023

A few high level requests:

  1. Can we move this into the Train user guide and call this "Configuring Train Datasets"? This integration is really a feature of Train. We can add a cross link from Data for discoverability.
  2. For the advanced config, can we move that to the very end of the page? Most users should really not be needing this, so we shouldn't make it the second subsection of the configuration section.
  3. Can we split the configuration section into performance and the rest? Currently it has a large number of subheadings which indicates it's too broad of a header.

@ericl added the @author-action-required label (The PR author is responsible for the next step. Remove tag to send back to the reviewer.) on Jul 30, 2023
@amogkam (Contributor, Author) commented Jul 31, 2023

@ericl

> Can we move this into the Train user guide and call this "Configuring Train Datasets"? This integration is really a feature of Train. We can add a cross link from Data for discoverability.

I think we want it in both places, right? For this page, the configurations are all specific to Ray Datasets. For Ray Train configurations, those can be in the train docs.

> For the advanced config, can we move that to the very end of the page? Most users should really not be needing this, so we shouldn't make it the second subsection of the configuration section.

Actually do we know of a use case for this advanced config? If not, then I'm inclined to just remove it.

> Can we split the configuration section into performance and the rest? Currently it has a large number of subheadings which indicates it's too broad of a header.

Sounds good, will make the change

@stephanie-wang (Contributor) left a comment

This is great! Just left some nits for clarity.

If you are training on GPUs and have an expensive CPU preprocessing operation, this may bottleneck training throughput.

If your preprocessed Dataset is small enough to fit in object store memory, the easiest thing to do is to *materialize* the preprocessed dataset in Ray object store memory, by calling :meth:`materialize() <ray.data.Dataset.materialize>` on the preprocessed dataset. This tells Ray Data to compute the entire preprocessed dataset and pin it in Ray object store memory. As a result, when iterating over the dataset repeatedly, the preprocessing operations do not need to be re-run. However, the trade-off is that if the preprocessed data is too large to fit into Ray object store memory, this will greatly decrease performance as data needs to be spilled to disk.
@stephanie-wang (Contributor) commented Aug 1, 2023
Suggested change
If your preprocessed Dataset is small enough to fit in object store memory, the easiest thing to do is to *materialize* the preprocessed dataset in Ray object store memory, by calling :meth:`materialize() <ray.data.Dataset.materialize>` on the preprocessed dataset. This tells Ray Data to compute the entire preprocessed dataset and pin it in Ray object store memory. As a result, when iterating over the dataset repeatedly, the preprocessing operations do not need to be re-run. However, the trade-off is that if the preprocessed data is too large to fit into Ray object store memory, this will greatly decrease performance as data needs to be spilled to disk.
If your preprocessed Dataset is small enough to fit in RAM, *materialize* the preprocessed dataset in Ray's built-in object store memory, by calling :meth:`materialize() <ray.data.Dataset.materialize>` on the preprocessed dataset. This approach tells Ray Data to compute the entire preprocessed dataset and pin it in Ray object store memory. As a result, when iterating over the dataset repeatedly, the preprocessing operations do not need to be re-run.
However, if the preprocessed data is too large to fit into Ray object store memory, this approach greatly decreases performance as data needs to be spilled to and read back from disk.

Contributor
Add a note that randomized transforms should be applied after the materialize() call?

@amogkam (Contributor, Author) commented Aug 1, 2023

It needs to be able to fit in object store memory, right? Which is only a portion of total RAM in the cluster.

Contributor

I added this wording because "Ray object store" is jargon-y for non-core users. We could change the wording to "comfortably fit" and specify the default fraction that core uses for the object store (I think 30%).

Contributor Author

Clarified the default is 30%. But we do talk about object store memory throughout the configuration section. Object store memory is a foundational concept users would have to know to maximize performance.
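
For readers following this thread, here is a minimal, illustrative sketch of the materialization pattern being discussed. The input path, column name, and preprocessing function are hypothetical; `Dataset.materialize()` and `randomize_block_order()` are the Ray Data calls referenced in the quoted doc text.

```python
import ray

# Hypothetical input path, for illustration only.
ds = ray.data.read_parquet("s3://my-bucket/train")

def preprocess(batch):
    # Stand-in for an expensive CPU-only transformation.
    batch["feature"] = batch["feature"] * 2
    return batch

ds = ds.map_batches(preprocess)

# Compute the whole preprocessed dataset once and pin it in Ray object store
# memory (by default roughly 30% of each node's RAM). Only worthwhile if the
# preprocessed data fits; otherwise blocks spill to disk and performance drops.
ds = ds.materialize()

# Per the review note above: apply randomized, per-epoch transforms *after*
# materialize() so they are not frozen into the cached dataset.
ds = ds.randomize_block_order()
```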


Adding CPU-only nodes to your cluster
#####################################
If you are bottlenecked on expensive CPU preprocessing and the preprocessed Dataset is too large to fit in object store memory, then the above tip will not work.
Contributor

Suggested change
If you are bottlenecked on expensive CPU preprocessing and the preprocessed Dataset is too large to fit in object store memory, then the above tip will not work.
If you are bottlenecked on expensive CPU preprocessing and the preprocessed Dataset is too large to fit in object store memory, then the above tip will not work if your disk I/O is also too slow.

Contributor Author

I think we should just discourage this entirely? I think for pretty much all use cases, disk speed will not be fast enough for disk spilling to be worth it compared to adding more CPU nodes, is that not the case?

Contributor

I think that depends on cluster configuration. NVMe drives, for example, can support GB/s.

@ericl (Contributor) commented Aug 1, 2023

As discussed offline, let's have this live in the Train docs since this is documenting a pure Train feature. However, it is good to xref from Data.

> Actually do we know of a use case for this advanced config? If not, then I'm inclined to just remove it.

Please don't remove this as we designed this based on user feedback for advanced use cases.

* :meth:`randomize_block_order <ray.data.Dataset.randomize_block_order>`
* `local_shuffle_seed` argument to :meth:`iter_batches <ray.data.DataIterator.iter_batches>`

**Step 3:** Follow the best practices for enabling reproducibility for your training framework of choice. For example, see the `Pytorch reproducibility guide <https://pytorch.org/docs/stable/notes/randomness.html>`_.
Contributor
Suggested change
**Step 3:** Follow the best practices for enabling reproducibility for your training framework of choice. For example, see the `Pytorch reproducibility guide <https://pytorch.org/docs/stable/notes/randomness.html>`_.
**Step 3:** Follow the best practices for enabling reproducibility for your training framework of choice. For example, see `Pytorch reproducibility guide <https://pytorch.org/docs/stable/notes/randomness.html>`_.
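
For context, a minimal sketch of the seeding hooks referenced in the quoted doc text. The dataset, buffer size, and seed values are illustrative; the parameter names are the `seed` and `local_shuffle_seed` arguments mentioned above, as exposed by Ray Data at the time of this PR.

```python
import ray

ds = ray.data.range(10_000)  # illustrative dataset

# Seeded block-level shuffle (one of the options listed above).
ds = ds.randomize_block_order(seed=42)

# Seeded local shuffle while iterating over batches.
for batch in ds.iter_batches(
    batch_size=256,
    local_shuffle_buffer_size=10_000,
    local_shuffle_seed=42,
):
    ...  # feed the batch to the training loop
```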

@angelinalg (Contributor) commented:
Nice work on the writing, Team!

@amogkam (Contributor, Author) commented Aug 1, 2023

> For the advanced config, can we move that to the very end of the page? Most users should really not be needing this, so we shouldn't make it the second subsection of the configuration section.

This is now a subheading under "Customizing how to split datasets"; it is no longer top-level.

> Can we split the configuration section into performance and the rest? Currently it has a large number of subheadings which indicates it's too broad of a header.

I removed the outermost section.

@amogkam changed the title from "[Docs] [Data] [Train] E2e user guide for Data+Train" to "[Docs] [Train] E2e user guide for Data+Train" on Aug 8, 2023
@amogkam removed the @author-action-required label (The PR author is responsible for the next step. Remove tag to send back to the reviewer.) on Aug 10, 2023
@amogkam merged commit 3712e8b into ray-project:master on Aug 10, 2023
75 of 78 checks passed
@amogkam deleted the data-train-docs branch on August 10, 2023 05:31
shrekris-anyscale pushed a commit to shrekris-anyscale/ray that referenced this pull request Aug 10, 2023
NripeshN pushed a commit to NripeshN/ray that referenced this pull request Aug 15, 2023
harborn pushed a commit to harborn/ray that referenced this pull request Aug 17, 2023
harborn pushed a commit to harborn/ray that referenced this pull request Aug 17, 2023
arvind-chandra pushed a commit to lmco/ray that referenced this pull request Aug 31, 2023
vymao pushed a commit to vymao/ray that referenced this pull request Oct 11, 2023
7 participants