[core] Release streaming generator task metadata based on ref counter #43584

Closed
wants to merge 12 commits into from

stephanie-wang
Contributor

@stephanie-wang stephanie-wang commented Mar 1, 2024

Why are these changes needed?

Update streaming generator metadata management to use the standard ref counting path. Changes:

  • Task metadata is removed once all refs returned by the generator can no longer be reconstructed, instead of through an explicit delete call when the generator ref is deleted. The latter is incorrect because even if the generator ref has been deleted, the task may still get re-executed if the refs returned by the generator are reconstructed. Now we instead delete the stream metadata when the corresponding task metadata is GCed, so we can rely on the normal ref counting protocol to GC streams.
  • Refs returned by the generator are initially added with local ref count=0, and contained in the generator ref. The local ref count is incremented when the ref is returned to the caller.
  • Do not hold the TaskManager lock when calling ReferenceCounter methods, since doing so can cause deadlock.
  • Fixes some typos
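
A minimal sketch of the ref-counting idea described in the list above, for illustration only. `ToyRefCounter` and all of its methods are hypothetical names and do not mirror Ray's actual `ReferenceCounter` or `TaskManager` interfaces. It assumes a simplified model in which a yielded ref is "in scope" whenever its local ref count is above zero, or while the generator ref that contains it is still alive. Real lineage ref counting is more involved.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <unordered_map>
#include <unordered_set>
#include <utility>

// Toy model only: hypothetical names, not Ray's ReferenceCounter API.
class ToyRefCounter {
 public:
  using OnRelease = std::function<void(const std::string &)>;

  // Register a generator ref. The callback stands in for "GC the task
  // metadata and, with it, the stream metadata".
  void AddGenerator(const std::string &gen_id, OnRelease on_release) {
    Generator gen;
    gen.on_release = std::move(on_release);
    generators_.emplace(gen_id, std::move(gen));
  }

  // A yielded ref starts with local ref count 0 and is contained in the
  // generator ref, which keeps it alive until it is handed out.
  void AddYieldedRef(const std::string &gen_id, const std::string &ref_id) {
    generators_.at(gen_id).yielded.insert(ref_id);
    local_counts_[ref_id] = 0;
    owner_[ref_id] = gen_id;
  }

  // Returning the ref to the caller increments its local ref count.
  void ReturnToCaller(const std::string &ref_id) { ++local_counts_.at(ref_id); }

  // The caller drops one local reference to a yielded ref.
  void RemoveLocalRef(const std::string &ref_id) {
    int &count = local_counts_.at(ref_id);
    assert(count > 0);
    --count;
    MaybeRelease(owner_.at(ref_id));
  }

  // Deleting the generator ref alone does not release the metadata.
  void DeleteGenerator(const std::string &gen_id) {
    generators_.at(gen_id).alive = false;
    MaybeRelease(gen_id);
  }

 private:
  struct Generator {
    bool alive = true;
    std::unordered_set<std::string> yielded;
    OnRelease on_release;
  };

  // Taken by value because the maps holding the id are mutated below.
  void MaybeRelease(std::string gen_id) {
    auto it = generators_.find(gen_id);
    if (it == generators_.end() || it->second.alive) return;
    for (const auto &ref_id : it->second.yielded) {
      if (local_counts_.at(ref_id) > 0) return;  // a ref is still in scope
    }
    OnRelease on_release = std::move(it->second.on_release);
    for (const auto &ref_id : it->second.yielded) {
      local_counts_.erase(ref_id);
      owner_.erase(ref_id);
    }
    generators_.erase(it);
    if (on_release) on_release(gen_id);
  }

  std::unordered_map<std::string, Generator> generators_;
  std::unordered_map<std::string, int> local_counts_;
  std::unordered_map<std::string, std::string> owner_;
};
```

In this model, deleting the generator ref while a returned ref is still held does not fire the release callback. It fires only after the last returned ref is dropped, which is the ordering the first bullet asks for.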

Related issue number

Closes #39151.

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Signed-off-by: Stephanie Wang <swang@cs.berkeley.edu>
@stephanie-wang stephanie-wang requested a review from a team as a code owner March 1, 2024 01:06
Signed-off-by: Stephanie Wang <swang@cs.berkeley.edu>
@stephanie-wang stephanie-wang added the release-blocker P0 Issue that blocks the release label Mar 1, 2024
Comment on lines +419 to +420
// caller. The executor side should just assume everything is consumed if it
// is -1.
Contributor Author

@rkooo567 is this right? This code seems to suggest otherwise.

Signed-off-by: Stephanie Wang <swang@cs.berkeley.edu>
@rkooo567
Contributor

rkooo567 commented Mar 1, 2024

Let me finish review asap

Contributor

@raulchen raulchen left a comment

could you also remove this in this PR?
It's a data-level workaround for this issue

@stephanie-wang
Contributor Author

could you also remove this in this PR? It's a data-level workaround for this issue

It would be better if we can keep the changes separate. This PR is risky as is.

Also, if there is a workaround in Data, should we still mark this as a release blocker? It's not complete yet and the amount of refactoring needed is going to be significant.

…lock

Signed-off-by: Stephanie Wang <swang@cs.berkeley.edu>
@stephanie-wang stephanie-wang removed the release-blocker P0 Issue that blocks the release label Mar 5, 2024
@rkooo567
Contributor

rkooo567 commented Mar 5, 2024

Btw, premerge seems to fail. Should I review before it passes CI?

ASSERT_TRUE(retry_signal_called);
CompletePendingStreamingTask(spec, caller_address, 2);
}
// TEST_F(TaskManagerTest, TestObjectRefStreamDeletedStreamIgnored) {
Contributor

Are these tests temporarily disabled?

Contributor Author

Yes, working on putting these back before we merge. I made some changes to TaskManager interface so these were failing to compile.

Please review even though premerge is failing! Thanks!

Signed-off-by: Stephanie Wang <swang@cs.berkeley.edu>
stephanie-wang added a commit that referenced this pull request Mar 8, 2024
Fix lineage reconstruction bug for streaming generators by only garbage-collecting stream metadata once the refs' lineage has gone out of scope. Changes:

  • When the generator goes out of scope, add it to a list of streaming generator tasks that we scan periodically.
  • For each generator task, check whether the task and streaming metadata can be removed. They can be removed if the generator task and all generated return refs have gone out of scope.
  • Fixes an existing potential leak where the task completes after the generator ref and returned refs have gone out of scope, by deleting the task metadata with the stream metadata.
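
The periodic scan described above could look roughly like the following sketch. `PeriodicStreamGc` and its methods are hypothetical names chosen for illustration, not the code in the referenced commit. It assumes each stream tracks only a count of returned refs that are still in scope, and it leaves out task completion.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Toy model only: hypothetical names, not the merged implementation.
class PeriodicStreamGc {
 public:
  // Track a stream and how many of its returned refs are still in scope.
  void AddStream(const std::string &gen_id, int refs_in_scope) {
    refs_in_scope_[gen_id] = refs_in_scope;
  }

  // One returned ref of this stream has gone out of scope.
  void OnRefOutOfScope(const std::string &gen_id) {
    int &count = refs_in_scope_.at(gen_id);
    assert(count > 0);
    --count;
  }

  // The generator ref went out of scope: queue the stream for later scans
  // instead of deleting its metadata right away.
  void OnGeneratorOutOfScope(const std::string &gen_id) {
    pending_.push_back(gen_id);
  }

  // One periodic pass. Removes every queued stream whose returned refs are
  // all out of scope and returns how many streams were removed.
  std::size_t Scan() {
    std::vector<std::string> still_pending;
    std::size_t removed = 0;
    for (const auto &gen_id : pending_) {
      auto it = refs_in_scope_.find(gen_id);
      if (it == refs_in_scope_.end()) continue;  // already removed
      if (it->second == 0) {
        refs_in_scope_.erase(it);
        ++removed;
      } else {
        still_pending.push_back(gen_id);  // check again on the next pass
      }
    }
    pending_.swap(still_pending);
    return removed;
  }

  bool HasStream(const std::string &gen_id) const {
    return refs_in_scope_.count(gen_id) > 0;
  }

 private:
  std::unordered_map<std::string, int> refs_in_scope_;
  std::vector<std::string> pending_;
};
```

The trade-off in this shape is that metadata lingers until the next scan runs, rather than being freed the moment the last ref goes away.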

#43584 is probably better long-term as it refactors stream metadata to be GCed through the normal ref counting path, and lineage can be GCed eagerly. However, this fix is safer.

Related issue number

Closes #39151.

---------

Signed-off-by: Stephanie Wang <swang@cs.berkeley.edu>
@stephanie-wang
Contributor Author

Closing for now in favor of #43772 but we should revisit once the TaskManager <> RefCounter deadlock situation is resolved. The merged PR is less ideal because it requires periodic GC and may have edge cases depending on order of task completion vs references out of scope.
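
For context on the deadlock concern, here is a minimal sketch of the general pattern of not holding one component's lock while calling into another. `CallbackRefCounter` and `ToyTaskManager` are hypothetical stand-ins, not Ray's classes. The assumption is that the ref counter may invoke a callback that re-enters the task manager and takes its mutex.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <mutex>
#include <string>
#include <unordered_set>
#include <utility>
#include <vector>

// Hypothetical ref counter whose RemoveRef may call back into its client.
class CallbackRefCounter {
 public:
  void SetOnDelete(std::function<void(const std::string &)> cb) {
    on_delete_ = std::move(cb);
  }
  void RemoveRef(const std::string &ref_id) {
    assert(on_delete_ && "callback must be set before use");
    on_delete_(ref_id);
  }

 private:
  std::function<void(const std::string &)> on_delete_;
};

// Hypothetical task manager guarding its state with a non-recursive mutex.
class ToyTaskManager {
 public:
  explicit ToyTaskManager(CallbackRefCounter &rc) : rc_(rc) {
    rc_.SetOnDelete([this](const std::string &ref_id) { OnRefDeleted(ref_id); });
  }

  void AddRef(const std::string &ref_id) {
    std::lock_guard<std::mutex> lock(mu_);
    refs_.insert(ref_id);
  }

  void ReleaseAll() {
    std::vector<std::string> to_release;
    {
      // Copy what is needed while holding the lock...
      std::lock_guard<std::mutex> lock(mu_);
      to_release.assign(refs_.begin(), refs_.end());
    }
    // ...and call out only after mu_ is released. Calling RemoveRef inside
    // the scope above would re-lock mu_ in OnRefDeleted and deadlock.
    for (const auto &ref_id : to_release) {
      rc_.RemoveRef(ref_id);
    }
  }

  std::size_t NumRefs() {
    std::lock_guard<std::mutex> lock(mu_);
    return refs_.size();
  }

 private:
  // Re-entrant path: invoked by the ref counter, takes mu_ itself.
  void OnRefDeleted(const std::string &ref_id) {
    std::lock_guard<std::mutex> lock(mu_);
    refs_.erase(ref_id);
  }

  CallbackRefCounter &rc_;
  std::mutex mu_;
  std::unordered_set<std::string> refs_;
};
```

The cost of this pattern is that state can change between the snapshot and the call-outs, which is one reason ordering edge cases are hard to rule out.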

Successfully merging this pull request may close these issues.

[Core][Streaming Generator] Lineage reconstruction is not working if generator ref is GC'ed.
4 participants