[Data] [Doc] Add Ray Data Execution Configurations doc page #44105

Merged
merged 6 commits into from Mar 20, 2024

@scottjlee scottjlee commented Mar 18, 2024

Why are these changes needed?

Add a page to describe the various configurations for Ray Data from ExecutionOptions and DataContext.

New page: https://anyscale-ray--44105.com.readthedocs.build/en/44105/data/execution-configurations.html

Related issue number

Checks

  • I've signed off every commit(by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Signed-off-by: Scott Lee <sjl@anyscale.com>
Comment on lines 7 to 8
Ray Data provides a number of configurations that can be used to control various aspects
of Ray Dataset execution. These configurations can be modified by the user using
Member

Suggested change
- Ray Data provides a number of configurations that can be used to control various aspects
- of Ray Dataset execution. These configurations can be modified by the user using
+ Ray Data provides a number of configurations that control various aspects
+ of Ray Dataset execution. You can modify these configurations by using

===============================================

The :class:`~ray.data.ExecutionOptions` class is used to configure options during Ray Dataset execution.
To use it, you can modify the attributes in the current :class:`~ray.data.DataContext` object's `execution_options`. For example:
Member

Suggested change
- To use it, you can modify the attributes in the current :class:`~ray.data.DataContext` object's `execution_options`. For example:
+ To use it, modify the attributes in the current :class:`~ray.data.DataContext` object's `execution_options`. For example:

Comment on lines 18 to 21
.. code-block::

    ctx = ray.data.DataContext.get_current()
    ctx.execution_options.verbose_progress = True
Member

Not sure if `code-block` will syntax highlight correctly.

Suggested change
- .. code-block::
-
-     ctx = ray.data.DataContext.get_current()
-     ctx.execution_options.verbose_progress = True
+ .. testcode::
+     :hide:
+
+     import ray
+
+ .. testcode::
+
+     ctx = ray.data.DataContext.get_current()
+     ctx.execution_options.verbose_progress = True

ctx = ray.data.DataContext.get_current()
ctx.execution_options.verbose_progress = True

* `resource_limits`: Set a soft limit on the resource usage during execution. Auto-detected by default.
Member

When would you want to set a limit on resource usage?

Contributor Author

added an example of such a case:

For example, if there are other parts of the code which require some minimum amount of resources, you may want to limit the amount of resources that Ray Data uses.

Member

For my own understanding, when would you want to use `resource_limits` over `exclude_resources` if you have other code that uses resources?

Contributor Author

I think they are two ways of controlling the same overall concept: `exclude_resources` sets aside the resources used by non-Ray Data workloads, while `resource_limits` caps the resources that Ray Data itself uses.
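
To make the distinction in this thread concrete, here is a minimal sketch of both approaches. The helpers `cap_data_cpus`, `exclude_cpus`, and `make_stand_in_ctx` are hypothetical names introduced only for illustration. The stand-in context mimics the `execution_options.resource_limits` and `execution_options.exclude_resources` attributes so the snippet runs without a Ray installation. Direct assignment to `.cpu` is an assumption based on the Ray documentation from around the time of this PR; other Ray releases may require a different way of updating these objects. The sketch sets only one of the two options per context, on the assumption that Ray Data may reject configurations that set both for the same resource type.

```python
from types import SimpleNamespace


def cap_data_cpus(ctx, max_cpus):
    """Hypothetical helper: soft-cap the CPUs that Ray Data itself may use."""
    ctx.execution_options.resource_limits.cpu = max_cpus


def exclude_cpus(ctx, cpus_used_elsewhere):
    """Hypothetical helper: set aside CPUs that other workloads need."""
    ctx.execution_options.exclude_resources.cpu = cpus_used_elsewhere


def make_stand_in_ctx():
    """Stand-in for ray.data.DataContext.get_current(), so this runs without Ray."""
    return SimpleNamespace(
        execution_options=SimpleNamespace(
            resource_limits=SimpleNamespace(cpu=None),
            exclude_resources=SimpleNamespace(cpu=0),
        )
    )


# Approach 1: cap what Ray Data may use.
capped = make_stand_in_ctx()
cap_data_cpus(capped, max_cpus=8)

# Approach 2: subtract what other code needs from the pool.
carved_out = make_stand_in_ctx()
exclude_cpus(carved_out, cpus_used_elsewhere=2)

print(capped.execution_options.resource_limits.cpu)
print(carved_out.execution_options.exclude_resources.cpu)
```

With Ray installed, the same helpers would be called as `cap_data_cpus(ray.data.DataContext.get_current(), 8)`, before the dataset is executed.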

* `resource_limits`: Set a soft limit on the resource usage during execution. Auto-detected by default.
* `exclude_resources`: Amount of resources to exclude from Ray Data. Set this if you have other workloads running on the same cluster. Note:

* If using Ray Data with Ray Train, training resources are automatically excluded. Otherwise, off by default.
Member

Suggested change
- * If using Ray Data with Ray Train, training resources are automatically excluded. Otherwise, off by default.
+ * If you're using Ray Data with Ray Train, training resources are automatically excluded. Otherwise, off by default.

Configuring :class:`~ray.data.DataContext`
==========================================
The :class:`~ray.data.DataContext` class is used to configure more general options for Ray Data usage, such as observability/logging options,
error handling/retry behavior, and internal data formats. To use it, you can modify the attributes in the current :class:`~ray.data.DataContext` object. For example:
Member

Suggested change
- error handling/retry behavior, and internal data formats. To use it, you can modify the attributes in the current :class:`~ray.data.DataContext` object. For example:
+ error handling/retry behavior, and internal data formats. To use it, modify the attributes in the current :class:`~ray.data.DataContext` object. For example:

Comment on lines 41 to 44
.. code-block::

    ctx = ray.data.DataContext.get_current()
    ctx.verbose_stats_logs = True
Member

Suggested change
- .. code-block::
-
-     ctx = ray.data.DataContext.get_current()
-     ctx.verbose_stats_logs = True
+ .. testcode::
+     :hide:
+
+     import ray
+
+ .. testcode::
+
+     ctx = ray.data.DataContext.get_current()
+     ctx.verbose_stats_logs = True

Comment on lines 46 to 47
Many of the options in :class:`~ray.data.DataContext` are intended for advanced use cases or for debugging purposes,
and most users should not need to modify them. However, some of the most important options are:
Member

Suggested change
- Many of the options in :class:`~ray.data.DataContext` are intended for advanced use cases or for debugging purposes,
- and most users should not need to modify them. However, some of the most important options are:
+ Many of the options in :class:`~ray.data.DataContext` are intended for advanced use cases or debugging,
+ and most users shouldn't need to modify them. However, some of the most important options are:

* `verbose_stats_logs`: Whether stats logs should be verbose. This includes fields such as ``extra_metrics`` in the stats output, which are excluded by default. Off by default.
* `log_internal_stack_trace_to_stdout`: Whether to include internal Ray Data/Ray Core code stack frames when logging to ``stdout``. The full stack trace is always written to the Ray Data log file. Off by default.

For more details on each of the preceding options, see the API documentation for :class:`~ray.data.DataContext`.
Member

Suggested change
- For more details on each of the preceding options, see the API documentation for :class:`~ray.data.DataContext`.
+ For more details on each of the preceding options, see :class:`~ray.data.DataContext`.
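
The observability options quoted in this thread (`verbose_stats_logs`, `log_internal_stack_trace_to_stdout`, and `verbose_progress` under `execution_options`) can be switched on together when debugging. The following is a minimal sketch: `enable_debug_logging` is a hypothetical helper, and the `SimpleNamespace` object is only a stand-in for `ray.data.DataContext.get_current()` so the snippet runs without Ray installed.

```python
from types import SimpleNamespace


def enable_debug_logging(ctx):
    """Hypothetical helper: turn on the verbose observability options."""
    # Include extra fields (such as extra_metrics) in the stats output.
    ctx.verbose_stats_logs = True
    # Also show internal Ray Data / Ray Core stack frames on stdout.
    ctx.log_internal_stack_trace_to_stdout = True
    # More detailed progress reporting during execution.
    ctx.execution_options.verbose_progress = True
    return ctx


# Stand-in context with the documented defaults (everything off).
debug_ctx = SimpleNamespace(
    verbose_stats_logs=False,
    log_internal_stack_trace_to_stdout=False,
    execution_options=SimpleNamespace(verbose_progress=False),
)
enable_debug_logging(debug_ctx)
print(debug_ctx.verbose_stats_logs, debug_ctx.execution_options.verbose_progress)
```

With Ray installed, the call would be `enable_debug_logging(ray.data.DataContext.get_current())`, made before creating or executing the dataset.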

@@ -17,6 +17,7 @@ show you how achieve several tasks.
inspecting-data
iterating-over-data
saving-data
execution-configurations
Member

Wondering if we should move this lower since the page seems advanced. Wdyt?

Contributor Author

good point, moved the section after "working with X" sections. i didn't want to label the page explicitly as "advanced" since some of these options (verbose logging, max_errored_blocks) can be pretty useful for simple use cases like reading files from S3.
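
As an illustration of the "simple use case" mentioned here, a minimal sketch of setting `max_errored_blocks` before reading files follows. The helper `tolerate_bad_blocks` is hypothetical, the value 5 is an arbitrary illustrative choice, and the `SimpleNamespace` object stands in for `ray.data.DataContext.get_current()` so the snippet runs without Ray. The default of 0 in the stand-in is an assumption about Ray's default (fail on the first errored block).

```python
from types import SimpleNamespace


def tolerate_bad_blocks(ctx, max_errored_blocks):
    """Hypothetical helper: allow some blocks to fail without failing the job."""
    if not isinstance(max_errored_blocks, int):
        raise TypeError("max_errored_blocks must be an integer")
    ctx.max_errored_blocks = max_errored_blocks
    return ctx


# Stand-in context; the default of 0 is an assumption, not confirmed by this PR.
read_ctx = SimpleNamespace(max_errored_blocks=0)
tolerate_bad_blocks(read_ctx, max_errored_blocks=5)
print(read_ctx.max_errored_blocks)
```

With Ray installed, this would be called as `tolerate_bad_blocks(ray.data.DataContext.get_current(), 5)` before a read such as one from S3, so that a handful of unreadable files do not abort the whole job.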

@bveeramani bveeramani merged commit 8cc3c0c into ray-project:master Mar 20, 2024
5 checks passed
scottjlee added a commit to scottjlee/ray that referenced this pull request Mar 20, 2024
…ect#44105)

Add a page to describe the various configurations for Ray Data from ExecutionOptions and DataContext.

Signed-off-by: Scott Lee <sjl@anyscale.com>
@scottjlee scottjlee mentioned this pull request Mar 20, 2024
8 tasks
khluu pushed a commit that referenced this pull request Mar 20, 2024
…44172)

Cherry-pick #44105. Docs-only change.

Add a page to describe the various configurations for Ray Data from ExecutionOptions and DataContext.

Signed-off-by: Scott Lee <sjl@anyscale.com>
stephanie-wang pushed a commit to stephanie-wang/ray that referenced this pull request Mar 27, 2024
…ect#44105)

Add a page to describe the various configurations for Ray Data from ExecutionOptions and DataContext.

Signed-off-by: Scott Lee <sjl@anyscale.com>