[Docs] [Train] E2e user guide for Data+Train #37921
Conversation
A few high-level requests:
I think we want it in both places, right? For this page, the configurations are all specific to Ray Datasets. Ray Train configurations can live in the Train docs.
Actually, do we know of a use case for this advanced config? If not, I'm inclined to just remove it.
Sounds good, will make the change.
This is great! Just left some nits for clarity.
doc/source/data/training_ingest.rst
Outdated
If you are training on GPUs and have an expensive CPU preprocessing operation, this may bottleneck training throughput.

If your preprocessed Dataset is small enough to fit in object store memory, the easiest thing to do is to *materialize* the preprocessed dataset in Ray object store memory, by calling :meth:`materialize() <ray.data.Dataset.materialize>` on the preprocessed dataset. This tells Ray Data to compute the entire preprocessed dataset and pin it in Ray object store memory. As a result, when iterating over the dataset repeatedly, the preprocessing operations do not need to be re-run. However, the trade-off is that if the preprocessed data is too large to fit into Ray object store memory, this will greatly decrease performance as data needs to be spilled to disk.
Suggested change:
If your preprocessed Dataset is small enough to fit in RAM, *materialize* the preprocessed dataset in Ray's built-in object store memory, by calling :meth:`materialize() <ray.data.Dataset.materialize>` on the preprocessed dataset. This approach tells Ray Data to compute the entire preprocessed dataset and pin it in Ray object store memory. As a result, when iterating over the dataset repeatedly, the preprocessing operations do not need to be re-run.
However, if the preprocessed data is too large to fit into Ray object store memory, this approach greatly decreases performance as data needs to be spilled to and read back from disk.
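To make the pattern concrete, here's a minimal sketch of the materialize approach being discussed; the data path and `preprocess` function are hypothetical stand-ins:

```python
import ray

# Hypothetical input path, standing in for your real training data.
ds = ray.data.read_parquet("s3://my-bucket/train-data")

def preprocess(batch):
    # Stand-in for an expensive CPU-bound transform.
    batch["feature"] = batch["feature"] * 2
    return batch

# materialize() executes the full preprocessing pipeline once and pins the
# result in Ray object store memory, so repeated epochs skip the re-run.
ds = ds.map_batches(preprocess).materialize()
```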
Add a note that randomized transforms should be applied after the materialize() call?
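If that note lands, the recommended ordering might look like this sketch (continuing the hypothetical `ds` and `preprocess` above): materialize the deterministic work first, then apply the randomized transform so it still re-runs each epoch:

```python
# Deterministic preprocessing: computed once and cached.
ds = ds.map_batches(preprocess).materialize()

# Randomized transform applied after materialize(), so the cached
# preprocessed data is reused while the shuffle stays fresh per epoch.
ds = ds.randomize_block_order()
```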
It needs to be able to fit in object store memory, right? That's only a portion of the total RAM in the cluster.
I added this wording because "Ray object store" is jargon-y for non-core users. We could change the wording to "comfortably fit" and specify the default fraction that core uses for the object store (I think 30%).
Clarified the default is 30%. But we do talk about object store memory throughout the configuration section. Object store memory is a foundational concept users would have to know to maximize performance.
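For readers tuning this, the object store size can also be set explicitly rather than relying on the default fraction of node RAM; a small sketch (the 4 GiB value is arbitrary):

```python
import ray

# Override the default object store sizing (a fraction of node RAM,
# 30% by default per the discussion above) with an explicit byte count.
ray.init(object_store_memory=4 * 1024**3)  # 4 GiB
```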
doc/source/data/training_ingest.rst
Outdated
Adding CPU-only nodes to your cluster
#####################################
If you are bottlenecked on expensive CPU preprocessing and the preprocessed Dataset is too large to fit in object store memory, then the above tip will not work.
Suggested change:
If you are bottlenecked on expensive CPU preprocessing and the preprocessed Dataset is too large to fit in object store memory, then the above tip will not work if your disk I/O is also too slow.
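For context, this tip works because Ray Data schedules preprocessing tasks on any free CPUs in the cluster, including CPU-only nodes, while Train pins its workers to GPU nodes. A rough sketch of the Train side, assuming a Ray version where ScalingConfig lives under ray.train:

```python
from ray.train import ScalingConfig

# Train workers claim GPUs; Ray Data preprocessing tasks automatically
# spread across the remaining CPUs, including any CPU-only nodes you add.
scaling_config = ScalingConfig(num_workers=4, use_gpu=True)
```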
I think we should just discourage this entirely? For pretty much all use cases, disk speed will not be fast enough for disk spilling to be worth it compared to adding more CPU nodes. Is that not the case?
I think that depends on cluster configuration. NVMe drives, for example, can support GB/s throughput.
As discussed offline, let's have this live in the Train docs, since it documents a pure Train feature. However, it's good to cross-reference it from the Data docs.
Please don't remove this, as we designed it based on user feedback for advanced use cases.
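For reference, the advanced split configuration under discussion is exposed through Train's DataConfig in recent Ray versions; a hedged sketch, assuming datasets_to_split accepts a list of dataset names:

```python
import ray.train

# Only shard the "train" dataset across Train workers; any other datasets
# (e.g., a small validation set) are sent whole to every worker.
data_config = ray.train.DataConfig(datasets_to_split=["train"])
```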
doc/source/data/training_ingest.rst
Outdated
* :meth:`randomize_block_order <ray.data.Dataset.randomize_block_order>`
* `local_shuffle_seed` argument to :meth:`iter_batches <ray.data.DataIterator.iter_batches>`

**Step 3:** Follow the best practices for enabling reproducibility for your training framework of choice. For example, see the `Pytorch reproducibility guide <https://pytorch.org/docs/stable/notes/randomness.html>`_.
Suggested change:
**Step 3:** Follow the best practices for enabling reproducibility for your training framework of choice. For example, see `Pytorch reproducibility guide <https://pytorch.org/docs/stable/notes/randomness.html>`_.
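Pulling the listed operations and Step 3 together, a reproducibility sketch might look like the following (the seed values and data path are arbitrary/hypothetical):

```python
import torch
import ray

ds = ray.data.read_parquet("s3://my-bucket/train-data")  # hypothetical path

# Seed Ray Data's randomized operations.
ds = ds.randomize_block_order(seed=42)
it = ds.iterator()
batches = it.iter_batches(
    batch_size=32,
    local_shuffle_buffer_size=1000,
    local_shuffle_seed=42,
)

# Step 3: framework-level reproducibility (PyTorch shown here).
torch.manual_seed(42)
```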
Nice work on the writing, team!
This is now a subheading under "Customizing how to split datasets"; it is no longer top-level.
I removed the outermost section.
Why are these changes needed?
Adds e2e user guide for data+train.

Related issue number

Checks
- I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
- I've run scripts/format.sh to lint the changes in this PR.
- If I've added a new method in Tune, I've added it in doc/source/tune/api/ under the corresponding .rst file.