
[core] Option to fallback to LRU on OutOfMemory #7410

Merged
merged 23 commits into ray-project:master from lru-fallback on Mar 14, 2020

Conversation

stephanie-wang
Contributor

Why are these changes needed?

Object pinning evicts objects based on the current reference count. If object pinning is off, each plasma store evicts objects independently in LRU order. This PR modifies the behavior when object_pinning_enabled is off so that we trigger GC on all worker nodes before falling back to LRU, favoring the eviction of unreachable objects over objects that are still in use.

TODO: Upgrade to plasma once this PR is merged.
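
For reference, the fallback described above is enabled through the internal config, as the error message in this diff shows. A minimal sketch (the 1 GB store size is only for illustration):

```python
import json

import ray

# Sketch based on the error message in this PR: with object pinning disabled,
# each plasma store falls back to evicting objects in LRU order when it fills up.
ray.init(
    object_store_memory=10**9,  # illustrative 1 GB object store
    _internal_config=json.dumps({"object_pinning_enabled": 0}),
)
```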


@AmplabJenkins

Can one of the admins verify this patch?

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/22643/

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/22645/

"available with ray.init(object_store_memory=<bytes>). "
"You can also try setting an option to fallback to LRU eviction "
"when the object store is full by calling ray.init("
"_internal_config=json.dumps({\"object_pinning_enabled\": 0})). "
Contributor

We shouldn't ship with such an ugly message; how about ray.init(lru_evict=True) or something similar?
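
For comparison, the friendlier spelling proposed here would look something like the following (hypothetical at the time of this review; the flag does not exist yet):

```python
import ray

# Proposed, more discoverable spelling of the same LRU fallback (hypothetical).
ray.init(lru_evict=True)
```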

  store_client_.Create(object_id.ToPlasmaId(), evict_if_full, data_size,
                       metadata ? metadata->Data() : nullptr,
                       metadata ? metadata->Size() : 0, &arrow_buffer);
  // Always try to evict after the first attempt.
Contributor

This is a little bit scary because it seems plausible to me that the global GC broadcast + python GC + unpin message + unpinning from plasma could take over 1s on a busy cluster. Maybe we should only evict in the last attempt of the exponential backoff?

Contributor

I think evicting after a short timeout is the right decision here since it ensures reasonable performance under moderate memory pressure. If the cluster is under such high memory pressure that the timing matters, I don't think the workload will be stable under any settings.

Contributor

That's a good point, but 1s still feels low enough that CPU/network contention delaying messages could cause eviction without "real" memory pressure that could be solved by a gc.collect().

Contributor

I think we can resolve that by periodically running gc.collect(), perhaps every 5 minutes. That would avoid ever hitting 100% memory usage under normal circumstances.

Perhaps we can also choose a middle value like 5 seconds before eviction starts. 30 seconds is much too long.
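
A tiny sketch of the periodic-collection idea above (the helper and the 5-minute interval are illustrative, not part of this PR):

```python
import gc
import threading

def start_periodic_gc(interval_s: float = 300.0) -> None:
    """Illustrative helper: run gc.collect() on a timer so unreferenced
    ObjectIDs are released well before the object store hits 100% usage."""
    def _tick() -> None:
        gc.collect()
        timer = threading.Timer(interval_s, _tick)
        timer.daemon = True  # don't keep the process alive just for GC
        timer.start()
    _tick()
```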

Contributor

Agreed, 30s is definitely too long.

Contributor

Maybe we could have a higher timeout the first time we fall back to LRU (e.g., 5s) and then reduce it after the first time LRU proves necessary. That seems like a good compromise: it avoids the condition I mentioned while still giving reasonable perf for workloads with high memory pressure or that work fine with LRU (e.g., ipython).

Contributor Author

Hmm, I'm okay with extending the LRU timeout, but it'd be good to choose the exact number and subsequent behavior based on some real measurements; otherwise it's just guesswork, and it will complicate the code unnecessarily.

As a start, how about we just increase object_store_full_initial_delay_ms to at least 5s in the startup config if object pinning is disabled?

Contributor

That would effectively mean that every put takes 5s once the object store fills up in the ipython use case, right? If that's the case, then I'd say let's just leave it at 1s for now.
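
To make the timing being debated concrete, here is a rough Python sketch of the retry loop around the plasma Create call. It is not the actual C++ implementation; create_in_plasma and trigger_global_gc are hypothetical stand-ins for the store client call and the cluster-wide GC broadcast:

```python
import time

# Values from the discussion above: a 1s initial delay
# (object_store_full_initial_delay_ms), with eviction allowed on every
# attempt after the first.
INITIAL_DELAY_S = 1.0
MAX_RETRIES = 10  # illustrative


def create_with_lru_fallback(create_in_plasma, trigger_global_gc):
    """Rough sketch of the create/retry behavior discussed in this thread."""
    delay = INITIAL_DELAY_S
    for attempt in range(MAX_RETRIES):
        # Per the diff comment above, evict_if_full is passed on every attempt
        # after the first, so LRU eviction only kicks in after GC has had a chance.
        if create_in_plasma(evict_if_full=(attempt > 0)):
            return True
        trigger_global_gc()   # broadcast gc.collect() to all workers to unpin objects
        time.sleep(delay)     # exponential backoff between attempts
        delay *= 2
    return False
```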

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/22674/

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/22681/

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/22778/

@zhijunfu zhijunfu self-requested a review March 6, 2020 00:43
@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/22904/

@stephanie-wang stephanie-wang changed the title [core][wip] Option to fallback to LRU on OutOfMemory [core] Option to fallback to LRU on OutOfMemory Mar 9, 2020
Contributor
@edoakes edoakes left a comment

LGTM

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/22911/

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/22974/

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/22983/

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/23042/

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/23046/

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/23053/

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/23116/

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/23165/

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/23183/

@stephanie-wang stephanie-wang merged commit 5354931 into ray-project:master Mar 14, 2020
@stephanie-wang stephanie-wang deleted the lru-fallback branch March 14, 2020 18:28