Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Data] Account internal inqueue to previous operator's memory usage #42851

Closed
wants to merge 18 commits into from

Conversation

bveeramani
Copy link
Member

@bveeramani bveeramani commented Jan 31, 2024

Stacked on

Why are these changes needed?

Related issue number

Checks

  • I've signed off every commit(by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

raulchen and others added 17 commits January 29, 2024 17:02
Signed-off-by: Hao Chen <chenh1024@gmail.com>
Signed-off-by: Hao Chen <chenh1024@gmail.com>
Signed-off-by: Hao Chen <chenh1024@gmail.com>
Signed-off-by: Hao Chen <chenh1024@gmail.com>
Signed-off-by: Hao Chen <chenh1024@gmail.com>
Signed-off-by: Hao Chen <chenh1024@gmail.com>
Signed-off-by: Hao Chen <chenh1024@gmail.com>
Signed-off-by: Hao Chen <chenh1024@gmail.com>
Signed-off-by: Hao Chen <chenh1024@gmail.com>
Signed-off-by: Hao Chen <chenh1024@gmail.com>
Signed-off-by: Hao Chen <chenh1024@gmail.com>
Signed-off-by: Hao Chen <chenh1024@gmail.com>
Signed-off-by: Hao Chen <chenh1024@gmail.com>
Signed-off-by: Hao Chen <chenh1024@gmail.com>
Signed-off-by: Hao Chen <chenh1024@gmail.com>
Signed-off-by: Balaji Veeramani <balaji@anyscale.com>
Signed-off-by: Balaji Veeramani <balaji@anyscale.com>
# Spilled memory size in the object store.
obj_store_mem_spilled: int = field(
default=0, metadata={"map_only": True, "export_metric": True}
)

obj_store_mem_internal_inqueue: int = 0
obj_store_mem_internal_outqueue: int = 0
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These metrics currently only work for map ops. I'm wondering if we can also extend them to also support other ops. So 1) we can report them to the dashboard for all ops; 2) we don't have 2 places reporting memory metrics (here and internal_inqueue_memory_usage).

The on_input_received function is already called on all ops. I think we'll also need to add other callback APIs like on_input_freed, on_output_generated, etc.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I thought about that too.

OpRuntimeMetrics is tightly coupled with MapOperator, and this PR introduces a layer of indirection with PhysicalOperator.*_memory_usage() and OpRuntimeMetrics.obj_store_mem_*.

One concern I have is that if we rely on metrics directly, we need to ensure that they're updated correctly for all operators (for example, we need to make sure all operators call on_output_generated, or else obj_store_mem_internal_outqueue will be incorrect). But, I'm not sure how we can enforce that with the PhysicalOperator interface.

While I agree multiple abstractions for memory reporting isn't ideal, I think it might be the easiest way to implement this? Wdyt?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Documenting tradeoffs from offline discussion:

Disadvantages of adding methods to PhysicalOperator:

  • OpRuntimeMetrics is tightly coupled with MapOperator
  • Layer of indirection for MapOperator
  • Multiple abstractions for memory reporting

Disadvantages of using OpRuntimeMetrics only:

  • Updating metrics is implicitly required (not part of interface)

Given these tradeoffs, I've opened a new PR with the second approach https://github.com/ray-project/ray/pull/42930/files

Signed-off-by: Balaji Veeramani <balaji@anyscale.com>
@bveeramani
Copy link
Member Author

Closing in favor of #42930

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants