[core] Trigger global gc when plasma store is under pressure. #15775
Conversation
Why don't we implement this in …
@@ -313,6 +313,10 @@ class ObjectManager : public ObjectManagerInterface,

  int64_t GetMemoryCapacity() const { return config_.object_store_memory; }

  double GetUsedMemoryPercentage() const {
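The hunk cuts off before the helper's body; here is a minimal sketch of what it presumably computes, with `used_memory_` as a hypothetical field name (the real implementation may query the store differently):

```cpp
// Sketch only: used_memory_ is a hypothetical counter of current plasma
// usage; the actual field or accessor in ObjectManager may differ.
double GetUsedMemoryPercentage() const {
  if (config_.object_store_memory <= 0) {
    return 0.0;  // No capacity configured; report no pressure.
  }
  return static_cast<double>(used_memory_) /
         static_cast<double>(config_.object_store_memory);
}
```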
I think that we already have something like this in the resource manager that you could reuse.
@stephanie-wang I only see a resource manager in the GCS, which probably doesn't fit here. Is that the one you mentioned?
I was thinking of the ClusterResourceScheduler, although I'm not totally sure it's there. @ericl should know since I think he added the memory availability reporting for the autoscaler.
@@ -519,20 +519,27 @@ void NodeManager::FillResourceReport(rpc::ResourcesData &resources_data) {
  cluster_resource_scheduler_->FillResourceUsage(resources_data);
  cluster_task_manager_->FillResourceUsage(resources_data);

  // If plasma store is under high pressure, we should try to schedule a global gc.
  bool plasma_high_pressure =
      object_manager_.GetUsedMemoryPercentage() > high_plasma_storage_usage_;
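The hunk ends before showing how the flag is used; a plausible continuation, assuming the report proto carries a `should_global_gc` field and a throttler guards the trigger (`TriggerGlobalGC()`, `global_gc_throttler_`, and `should_global_gc_` are names inferred from this thread, not necessarily the merged code):

```cpp
// Sketch of how the pressure flag might feed the resource report.
if (plasma_high_pressure && global_gc_throttler_.AbleToRun()) {
  TriggerGlobalGC();  // Marks should_global_gc_ so the next report broadcasts it.
}
if (should_global_gc_) {
  resources_data.set_should_global_gc(true);
  should_global_gc_ = false;
}
```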
Should we adjust the interval based on the current memory usage instead of using a static threshold/interval?
Also, what are the guarantees for when this method gets called? Is it on a timer?
Ah, I've talked with @ericl about this; the conclusion is that 30s should probably be the minimum interval, to avoid overhead on the cluster. The default GC interval on the local node is 10 min here. And this change issues a global GC, which is different from that one.
Besides, we think the global GC should maybe be sent from the GCS, since that's the place that can exercise better global control. For example, if all nodes are at a similar level of memory usage, global GC will be triggered on all of them at the same time, which means N*N message passing that isn't necessary. Although the trigger goes with the heartbeat, the heartbeat is forced to broadcast whenever the global_gc bit is set to true.
This goes with the resource report, which runs periodically (code).
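To make the GCS-based idea concrete, here is a rough sketch of what a GCS-side trigger could look like; every name in it is hypothetical, since this is the proposed enhancement rather than anything in this PR:

```cpp
// Hypothetical GCS-side aggregation: one fan-in of N reports and one
// fan-out of N commands, instead of every pressured node forcing its
// heartbeat out to all N peers (the N*N case described above).
void MaybeTriggerGlobalGcFromGcs() {
  bool any_node_pressured = false;
  for (const auto &entry : latest_reports_) {  // node_id -> last report
    if (entry.second.object_store_used_fraction() > kHighPlasmaUsage) {
      any_node_pressured = true;
      break;
    }
  }
  if (any_node_pressured && global_gc_throttler_.AbleToRun()) {
    BroadcastGlobalGcToAllNodes();  // Single coordinated trigger.
  }
}
```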
Ah I see, thanks for explaining!
Yeah for the report, I was just worried we might miss the GC call because this handler doesn't get called soon enough. This seems fine, though.
I like the GCS-based report idea. Can you create an enhancement issue? Would there be any problem with that approach?
> Ah I see, thanks for explaining!
> Yeah for the report, I was just worried we might miss the GC call because this handler doesn't get called soon enough. This seems fine, though.
I agree with you. I think if memory usage is close to 100%, we probably can't do anything. It helps if usage bumps to 90% suddenly and then increases slowly; that's why, even with this change, it won't fully fix the issue.
This is more like higher-level logic, I think. It's a little strange to put global GC logic into the plasma store, which is a single-node in-memory store; I'd actually prefer moving the GC trigger out of the store. Another reason is that ProcessRequests only runs while processing plasma queries, so if there are no queries it won't trigger GC (I guess :p). Consider this case: the plasma store is under high pressure and every object is in use. GC gets triggered and doesn't help. Afterwards, some objects are released in the application layer, but before the app-level GC is scheduled there is no traffic to the plasma store, which leaves the store under high memory pressure. So I think this is the better place for it.
Also, is it possible to write a Python test for this?
Sure. I was actually thinking about how to add a unit test and realized it isn't easy to do. A Python e2e test totally makes sense.
src/ray/raylet/node_manager.cc (outdated)
auto now = absl::GetCurrentTimeNanos();
if ((should_local_gc_ || now - last_local_gc_ns_ > local_gc_interval_ns_) &&
    now - last_local_gc_ns_ > local_gc_min_interval_ns_) {
if ((should_local_gc_ || (absl::GetCurrentTimeNanos() -
Do we need absl::GetCurrentTimeNanos() - local_gc_throttler_.LastRunTime() here? Doesn't AbleToRun have the same logic in it?
There are two things here: 1) prevent it from running if it runs too frequently; 2) force it to run if it hasn't run for a while. AbleToRun covers 1); 2) is not covered.
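Concretely, the two conditions might combine like this (a sketch; DoLocalGC() is a stand-in for the actual GC call):

```cpp
// 1) AbleToRun() throttles: it fails if a GC ran less than the minimum
//    interval ago.
// 2) The elapsed-time check forces a run once local_gc_interval_ns_ has
//    passed with no explicit request.
bool overdue = absl::GetCurrentTimeNanos() - local_gc_throttler_.LastRunTime() >
               local_gc_interval_ns_;
if ((should_local_gc_ || overdue) && local_gc_throttler_.AbleToRun()) {
  DoLocalGC();  // Hypothetical stand-in for the local GC call.
  should_local_gc_ = false;
}
```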
#include "ray/util/throttler.h"

TEST(ThrottlerTest, BasicTest) {
This means the test takes 20 seconds, right? Can we just use a logical timer like PullManager's get_time?
+1
Let me check it
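For illustration, here's what a logical-clock version of the test might look like, assuming the Throttler gains an injectable now-function; the constructor parameter and the AbleToRun semantics described in the comments are assumptions, not the current API:

```cpp
#include <cstdint>

#include "gtest/gtest.h"
#include "ray/util/throttler.h"

// Sketch: drive the throttler with a logical clock so the test needs no
// real sleeps. Assumes a constructor taking a now-function and an
// AbleToRun() that records the run when it returns true.
TEST(ThrottlerTest, LogicalClock) {
  int64_t fake_now = 100;  // Nonzero so the first call is already "late".
  Throttler throttler(/*interval_ns=*/10, [&fake_now]() { return fake_now; });
  EXPECT_TRUE(throttler.AbleToRun());   // 100 - 0 >= 10: runs, records t=100.
  fake_now += 5;                        // t=105: only 5 ns since last run.
  EXPECT_FALSE(throttler.AbleToRun());
  fake_now += 6;                        // t=111: 11 ns since last run.
  EXPECT_TRUE(throttler.AbleToRun());
}
```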
src/ray/util/throttler.h (outdated)
class Throttler {
 public:
  explicit Throttler(uint64_t interval_ns)
      : last_run_ns_(absl::GetCurrentTimeNanos()), interval_ns_(interval_ns) {}
Can we set last_run_ns_ to zero by default?
(This allows the first trigger to happen immediately)
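Putting both review suggestions together (a zero-initialized last_run_ns_ so the first trigger fires immediately, plus an injectable clock for testing), a self-contained sketch of the class might look like this; it is illustrative, not the merged implementation:

```cpp
#include <cstdint>
#include <functional>
#include <utility>

#include "absl/time/clock.h"

// Sketch of the suggested design: last_run_ns_ defaults to zero so the
// first AbleToRun() succeeds right away, and an optional now-function
// makes the class testable without sleeping.
class Throttler {
 public:
  explicit Throttler(int64_t interval_ns,
                     std::function<int64_t()> now_fn = nullptr)
      : last_run_ns_(0), interval_ns_(interval_ns), now_fn_(std::move(now_fn)) {}

  // Returns true (and records a run) if at least interval_ns_ has elapsed
  // since the last recorded run.
  bool AbleToRun() {
    const int64_t now = Now();
    if (now - last_run_ns_ >= interval_ns_) {
      last_run_ns_ = now;
      return true;
    }
    return false;
  }

  // Records a run that happened outside AbleToRun(), e.g. a forced GC.
  void RunNow() { last_run_ns_ = Now(); }

  int64_t LastRunTime() const { return last_run_ns_; }

 private:
  int64_t Now() const {
    return now_fn_ ? now_fn_() : absl::GetCurrentTimeNanos();
  }

  int64_t last_run_ns_;
  int64_t interval_ns_;
  std::function<int64_t()> now_fn_;
};
```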
Can you rebase to fix the CI issues?
The Windows build is failing with this error:
(07:04:40) ERROR: BUILD.bazel:969:8: Couldn't build file _objs/throttler_test/throttler_test.obj: C++ compilation of rule '//:throttler_test' failed (Exit 2): cl.exe failed: error executing command
Thanks for the reminder! Sorry, I was too busy with other things and totally forgot about this :( I checked the Windows failure, and I think it's an important check; I'm surprised that Linux doesn't raise such an error.
Why are these changes needed?
To avoid object spilling when the object store is full, we trigger global GC earlier to release memory.
Related issue number
Closes #15550
Checks
I've run scripts/format.sh to lint the changes in this PR.