Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Enable distributed ref counting by default #7628

Merged
merged 24 commits into from Mar 19, 2020

Conversation

stephanie-wang
Copy link
Contributor

@stephanie-wang stephanie-wang commented Mar 16, 2020

Why are these changes needed?

Turn on distributed ref counting by default. This can be turned off with:

ray.init(_internal_config=json.dumps({
    "distributed_ref_counting_enabled": 0,
  }))

This also turns on eager eviction for objects by default. This means that all copies of an object will be proactively evicted once the ID is no longer in scope on any workers. This can be turned off with:

ray.init(_internal_config=json.dumps({
    "free_objects_period_milliseconds,": -1,
  }))

Related issue number

Checks

@ericl
Copy link
Contributor

ericl commented Mar 16, 2020

Can we also enable eager eviction at the same time?

@AmplabJenkins
Copy link

Can one of the admins verify this patch?

@simon-mo simon-mo self-assigned this Mar 16, 2020
@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/23245/
Test PASSed.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/23246/
Test PASSed.

@stephanie-wang
Copy link
Contributor Author

Note that I had to add a DrainAndShutdown method to the ReferenceCounter, similar to the one for TaskManager. This is to make sure that if the worker is doing a clean exit and still has some objects that it owns, that it will wait for their refs to go out of scope before exiting. This became necessary because eager eviction is now enabled in this PR.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/23254/
Test FAILed.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/23251/
Test FAILed.

@zhijunfu
Copy link
Contributor

@stephanie-wang Sounds great. Just a few questions on this:

  • Are there any scenarios that are not yet supported with this? Or are there any limitations so far?
  • Are there any noticeable impacts with this enabled beside eager eviction? e.g. are there performance regressions .etc?

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/23257/
Test FAILed.

@edoakes
Copy link
Contributor

edoakes commented Mar 17, 2020

@zhijunfu

One limitation is that workloads that previously relied on LRU to function (e.g., workloads like IPython that keep references around forever even if they aren't used) will not work (and same goes for the previous two releases). However, they should be able to revert to the old LRU behavior by turning off the object_pinning_enabled flag.

We have not seen substantial performance regressions, but will be keeping an eye out for this while running the release tests (cc @simon-mo).

@edoakes
Copy link
Contributor

edoakes commented Mar 17, 2020

@stephanie-wang the diff is showing as +1314/-813 LOC, not sure why.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/23286/
Test FAILed.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/23297/
Test FAILed.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/23304/
Test FAILed.

@stephanie-wang
Copy link
Contributor Author

@stephanie-wang Sounds great. Just a few questions on this:

* Are there any scenarios that are not yet supported with this? Or are there any limitations so far?

Please see the design doc for this.

* Are there any noticeable impacts with this enabled beside eager eviction? e.g. are there performance regressions .etc?

On my laptop, the PR for eager eviction (#7220) seemed to incur ~10% overhead for workloads with many plasma objects. The goal is to have this in for the next ray release so we can run the performance benchmarks from the release testing.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/23338/
Test FAILed.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/23340/
Test FAILed.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/23355/
Test FAILed.

@stephanie-wang stephanie-wang merged commit b499100 into ray-project:master Mar 19, 2020
@stephanie-wang stephanie-wang deleted the enable-ref-counting branch March 19, 2020 05:39
@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/23363/
Test FAILed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet