[core] Option to fallback to LRU on OutOfMemory #7410
Conversation
This reverts commit 44aded5.
Branch force-pushed from f899f3d to 97c1220.
This reverts commit b6359fe.
Branch force-pushed from 97c1220 to 3ed8efc.
Can one of the admins verify this patch?
python/ray/exceptions.py (Outdated)
"available with ray.init(object_store_memory=<bytes>). " | ||
"You can also try setting an option to fallback to LRU eviction " | ||
"when the object store is full by calling ray.init(" | ||
"_internal_config=json.dumps({\"object_pinning_enabled\": 0})). " |
We shouldn't ship with such an ugly message; how about ray.init(lru_evict=True) or similar?
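For illustration, here is what the two options would look like from user code. This is only a sketch: lru_evict is the reviewer's proposed name, not an implemented parameter.

```python
import json
import ray

# How this PR exposes the LRU fallback today:
ray.init(_internal_config=json.dumps({"object_pinning_enabled": 0}))

# The cleaner API proposed in review (hypothetical at this point):
# ray.init(lru_evict=True)
```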
```cpp
store_client_.Create(object_id.ToPlasmaId(), evict_if_full, data_size,
                     metadata ? metadata->Data() : nullptr,
                     metadata ? metadata->Size() : 0, &arrow_buffer);
// Always try to evict after the first attempt.
```
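To make the retry semantics concrete, here is a minimal Python sketch of the loop this snippet sits in, assuming a create() callable that returns success. The function name, retry count, and fixed delay are illustrative, not Ray's actual internals (the real delay is configurable and debated below).

```python
import time

def create_with_fallback(create, object_id, num_retries=10, delay_s=1.0):
    # Sketch only: the first attempt never evicts; once it has failed,
    # every retry passes evict_if_full=True ("always try to evict after
    # the first attempt").
    evict_if_full = False
    for _ in range(num_retries):
        if create(object_id, evict_if_full):
            return True
        evict_if_full = True
        time.sleep(delay_s)
    return False
```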
This is a little bit scary because it seems plausible to me that the global GC broadcast + python GC + unpin message + unpinning from plasma could take over 1s on a busy cluster. Maybe we should only evict in the last attempt of the exponential backoff?
I think evicting after a short timeout is the right decision here since it ensures reasonable performance under moderate memory pressure. If the cluster is under such high memory pressure that the timing matters, I don't think the workload will be stable under any settings.
That's a good point, but 1s still feels low enough that CPU/network contention delaying messages could cause eviction to happen without "real" memory pressure that could be solved by a gc.collect().
I think we can resolve that by periodically running gc.collect(), perhaps every 5 minutes. That would avoid ever hitting 100% memory usage under normal circumstances.
Perhaps we can also choose a middle value like 5 seconds before eviction starts. 30 seconds is much too long.
Agreed, 30s is definitely too long.
Maybe we could have a higher timeout for the first time we fall back to LRU (e.g., 5s) and then reduce after the first time LRU is necessary. Seems like a good compromise to avoid the condition I mentioned but still have reasonable perf for workloads with high memory pressure/that work with LRU (e.g., ipython)
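A tiny sketch of that idea, with all names and constants hypothetical: keep a longer delay until LRU eviction has fired once, then shorten it.

```python
class LruFallbackDelay:
    """Hypothetical policy: 5s before the first LRU fallback, 1s afterwards."""

    def __init__(self, first_delay_s=5.0, later_delay_s=1.0):
        self.first_delay_s = first_delay_s
        self.later_delay_s = later_delay_s
        self.lru_used = False

    def next_delay_s(self):
        # Calling this means we are about to fall back to LRU; subsequent
        # fallbacks use the shorter delay.
        if self.lru_used:
            return self.later_delay_s
        self.lru_used = True
        return self.first_delay_s
```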
Hmm, I'm okay with extending the LRU to a longer time, but it'd be good to choose the exact number and behavior afterwards based on some real measurements; otherwise it's just guesswork and it will complicate the code unnecessarily.
As a start, how about we just increase object_store_full_initial_delay_ms to at least 5s in the startup config if object pinning is disabled?
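If that key can be set the same way as object_pinning_enabled above (an assumption; the exact mechanism isn't shown in this thread), the startup config would look something like:

```python
import json
import ray

# Assumption: object_store_full_initial_delay_ms is settable through
# _internal_config like object_pinning_enabled; 5000 ms is the value
# floated above.
ray.init(_internal_config=json.dumps({
    "object_pinning_enabled": 0,
    "object_store_full_initial_delay_ms": 5000,
}))
```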
That would effectively mean that every put would take 5s once the object store fills in the ipython use case, right? If that's the case then I'd say let's just leave it at 1s for now.
LGTM
This reverts commit 98f01c6.
Why are these changes needed?
Object pinning will evict objects based on the current ref count. If object pinning is off, each plasma store evicts objects independently in LRU order. This PR modifies the behavior when object_pinning_enabled is off so that we trigger GC at all worker nodes before we fall back to LRU, favoring eviction of unreachable objects over objects still in use.

TODO: Upgrade to plasma once this PR is merged.
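In outline, the new fallback path looks like this (a sketch only; function names are illustrative, not Ray's actual internals):

```python
def on_object_store_full(worker_nodes, try_create):
    # 1. Ask every worker node to run garbage collection so unreachable
    #    ObjectRefs are released and their objects freed from plasma.
    for node in worker_nodes:
        node.trigger_global_gc()
    # 2. Retry the create; only if it still fails does the plasma store
    #    evict objects in LRU order.
    if not try_create(evict_if_full=False):
        try_create(evict_if_full=True)
```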
Checks
I've run scripts/format.sh to lint the changes in this PR.