[core] Eagerly evict objects that are no longer in scope #7220

Merged
merged 4 commits into ray-project:master on Feb 20, 2020

Conversation

stephanie-wang
Contributor

Why are these changes needed?

Add an option to eagerly evict copies of objects that are no longer in scope, according to the owner of the object, to reduce plasma's memory footprint. The eviction is done by the raylets. When the raylet that is pinning the object ID hears from the owner that it is OK to unpin, the raylet adds the object ID to a list of objects to free. Once the list reaches a configured size or has not been flushed after a configured time, the list will be flushed by sending a FreeObjects request to all other object managers. This in turn triggers a Delete of the object in plasma.
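
As a rough illustration of the batching logic described above (a Python sketch with made-up names; the real implementation lives in the C++ raylet):

import time

class FreeObjectsBuffer:
    # Illustration only: buffers unpinned object IDs and flushes them in batches.
    def __init__(self, batch_size, period_s, flush_fn):
        self.batch_size = batch_size  # free_objects_batch_size
        self.period_s = period_s      # free_objects_period_milliseconds / 1000
        self.flush_fn = flush_fn      # e.g. broadcast a FreeObjects request to all object managers
        self.buffer = []
        self.last_flush = time.monotonic()

    def on_unpinned(self, object_id):
        # Called when the owner tells this raylet the object is out of scope.
        self.buffer.append(object_id)
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def maybe_flush(self):
        # Called periodically (in this PR, from the raylet heartbeat).
        if self.buffer and time.monotonic() - self.last_flush >= self.period_s:
            self.flush()

    def flush(self):
        self.flush_fn(self.buffer)  # downstream, this triggers a Delete of each object in plasma
        self.buffer = []
        self.last_flush = time.monotonic()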

This PR leaves the feature off by default. To enable it, make sure object_pinning_enabled is on, then set free_objects_period_milliseconds to a non-negative value in the backend config; this sets the time period between attempts to free objects. free_objects_batch_size sets the maximum list size before a flush is triggered. Note that if the application uses serialized ObjectIDs, i.e. an ObjectID created in one process and passed by reference to another process, then it is recommended to also turn on distributed_ref_counting_enabled, or else the application may receive spurious "object lost" errors.

Example to enable:

import json
import ray

ray.init(_internal_config=json.dumps({
    "free_objects_period_milliseconds": 1000,  # Flush freed objects every second.
    "free_objects_batch_size": 100,  # Flush if >=100 objects were unpinned locally since the last flush.
    "distributed_ref_counting_enabled": 1,  # Recommended when serialized ObjectIDs are used.
}))


@AmplabJenkins

Can one of the admins verify this patch?

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/22095/

@edoakes
Contributor

Looks great!

Might've preferred first having a version that doesn't batch at all, but given that it's implemented, this seems good.

@@ -312,6 +314,12 @@ void NodeManager::Heartbeat() {
last_debug_dump_at_ms_ = now_ms;
}

// Evict all copies of freed objects from the cluster.
if (free_objects_period_ > 0 &&
Contributor

Doing this in the heartbeat means the flush period needs to be much larger than (or a multiple of) the heartbeat period; it might be better to just have a separate timer on the event loop. Otherwise, this should be noted in the config comments.
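
To illustrate the suggestion (a hypothetical Python/asyncio sketch reusing the FreeObjectsBuffer sketch from the description above; the raylet's actual event loop is boost::asio in C++): a dedicated periodic task lets the flush period be chosen independently of the heartbeat period.

import asyncio

async def free_objects_flusher(buffer, period_ms):
    # Hypothetical dedicated timer, decoupled from the heartbeat:
    # flush whatever has accumulated every period_ms milliseconds.
    while True:
        await asyncio.sleep(period_ms / 1000)
        if buffer.buffer:
            buffer.flush()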

@virtualluke
Contributor

object_pinning_enabled is on by default now?

Is there an API to add an object to the "list of objects to free" that will get evicted on the time or batch-size threshold mentioned in this PR?

Thanks for the work on this.

@stephanie-wang
Contributor Author

object_pinning_enabled is on by default now?

Yes, object_pinning_enabled is on by default, I believe as of the last release.

Is there an api to add an object to the "list of objects to free" that will get evicted on the time or batch size threshold mentioned in this PR?

You could call del object in Python and, assuming that's the last remaining reference, the object will automatically be added to the list. If that doesn't work for you, you can also try the internal API ray.internal.free, which immediately attempts to free whichever objects you pass in. Since this API is internal, though, it may not be supported in the future.
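
For example (assuming ray.init was called with the config shown in the description; ray.internal.free is the internal API mentioned above and may change):

import numpy as np
import ray

x_id = ray.put(np.zeros(10**6))
# Option 1: drop the last reference; the owner notifies the raylet, which adds
# the object to its free list and evicts it on the next batch/period flush.
del x_id

# Option 2 (internal, unsupported): ask for the objects to be freed immediately.
y_id = ray.put(np.zeros(10**6))
ray.internal.free([y_id])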

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/22127/

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/22135/

@stephanie-wang
Contributor Author

TestMemoryScheduling::testTuneWorkerHeapLimit failure looks like it's from master.

@stephanie-wang stephanie-wang merged commit 7e3819a into ray-project:master Feb 20, 2020
@stephanie-wang stephanie-wang deleted the eager-eviction branch February 20, 2020 04:51