Work stealing #10607

goliaro · 2020-09-06T05:46:39Z

Why are these changes needed?

These changes allow idle workers to steal non-actor tasks from other workers that are overloaded. In particular, when a worker is done with its own work and there are no more tasks in the owner's task queue, the owner looks for a suitable victim and, if it finds one, initiates work stealing on behalf of the idle worker. In addition, when work stealing mode is enabled, RequestNewWorkerIfNeeded requests new workers not only when there are more tasks in the owner's queue (and the pipelines to the current workers are all full), but also when any of the existing workers has stealable tasks. This approach is called "Eager Worker Requesting Mode".

Work stealing is only available among workers that share the same owner and are working on tasks with the same SchedulingKey.
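
As a rough illustration of the Eager Worker Requesting condition described above, the sketch below restates it as a single predicate. The `LeaseEntry` struct and the `ShouldRequestNewWorker` function are hypothetical names made up for this example; they are not the actual Ray types or the code in this PR.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical, simplified view of one leased worker (not the Ray type).
struct LeaseEntry {
  std::size_t tasks_in_flight = 0;  // tasks already pushed to this worker
  std::size_t stealable_tasks = 0;  // queued tasks that could still be stolen
};

// Returns true if the owner should ask for another worker.
bool ShouldRequestNewWorker(std::size_t owner_queue_size,
                            const std::vector<LeaseEntry> &workers,
                            std::size_t max_tasks_in_flight,
                            bool work_stealing_enabled) {
  bool pipelines_full = true;
  bool any_stealable = false;
  for (const auto &worker : workers) {
    if (worker.tasks_in_flight < max_tasks_in_flight) pipelines_full = false;
    if (worker.stealable_tasks > 0) any_stealable = true;
  }
  // Baseline rule: tasks are waiting and no current worker can take more.
  if (owner_queue_size > 0 && pipelines_full) return true;
  // Eager mode: also request a worker when there is something to steal.
  return work_stealing_enabled && any_stealable;
}
```

With work stealing disabled, the predicate reduces to the baseline rule; with it enabled, an empty owner queue no longer prevents a request as long as some worker still holds stealable tasks.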

Related issue number

This PR supersedes the (now closed) PR #10135. It also includes the changes introduced with PR #10225 (Keeping pipelines full), now merged into master.

Checks

- [x] I've run scripts/format.sh to lint the changes in this PR.
- I've included any doc changes needed for https://docs.ray.io/en/latest/.
- I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failure rates at https://ray-travis-tracker.herokuapp.com/.
- Testing Strategy
  - Unit tests
  - Release tests
  - This PR is not tested (please justify below)

stephanie-wang · 2020-09-07T18:31:04Z

src/ray/core_worker/transport/direct_actor_transport.cc

+    }
+
+    // Set the "stolen" bool flag to true
+    reverse_it->second = true;


Instead of storing a flag on the executor side, can we store the callback, then delete the task/reply to the owner immediately? That would help to make sure that we reply to the owner as quickly as possible.
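
A minimal sketch of this suggestion, assuming the executor keeps each queued task together with its reply callback. `ExecutorQueue`, `TaskId`, and `ReplyCallback` are hypothetical stand-ins, not the classes in direct_actor_transport.cc.

```cpp
#include <cstddef>
#include <functional>
#include <list>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-ins for the real task ID and reply callback types.
using TaskId = std::string;
using ReplyCallback = std::function<void(bool was_stolen)>;

class ExecutorQueue {
 public:
  void Enqueue(TaskId id, ReplyCallback reply) {
    queue_.emplace_back(std::move(id), std::move(reply));
  }

  // Removes up to `n` tasks from the back of the queue. Each removed task is
  // answered right away through its stored callback, so no entry stays
  // behind with a "stolen" flag waiting to be noticed later.
  std::vector<TaskId> Steal(std::size_t n) {
    std::vector<TaskId> stolen;
    for (std::size_t i = 0; i < n && !queue_.empty(); ++i) {
      auto entry = std::move(queue_.back());
      queue_.pop_back();
      entry.second(/*was_stolen=*/true);  // reply to the owner immediately
      stolen.push_back(std::move(entry.first));
    }
    return stolen;
  }

  std::size_t Size() const { return queue_.size(); }

 private:
  std::list<std::pair<TaskId, ReplyCallback>> queue_;
};
```

The sketch is single-threaded; a real receiver would have to hold its mutex around both operations.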

stephanie-wang · 2020-09-07T18:37:45Z

src/ray/core_worker/transport/direct_actor_transport.cc

+  absl::MutexLock lock(&mu_);
+
+  size_t half = non_actor_task_queue_.size() / 2;
+  RAY_CHECK(half >= 0);


I don't think this assertion is meaningful (size_t is always non-negative).
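
For illustration, the point can be checked at compile time. `TasksToSteal` is a hypothetical helper written for this example; it mirrors the `size() / 2` computation in the diff but is not part of the PR.

```cpp
#include <cstddef>
#include <type_traits>

// std::size_t is an unsigned integer type, so `x >= 0` holds for every value
// and a runtime check of that condition can never fail.
static_assert(std::is_unsigned<std::size_t>::value,
              "size_t cannot represent negative values");

// Number of tasks to hand over when stealing half of a queue.
// Integer division rounds down, so queues of size 0 or 1 yield 0.
constexpr std::size_t TasksToSteal(std::size_t queue_size) {
  return queue_size / 2;
}
```

A check that could actually fail here would be on the result being zero, since a queue with fewer than two tasks has nothing to give away.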

stephanie-wang · 2020-09-07T18:40:06Z

src/ray/core_worker/transport/direct_actor_transport.cc

+    reverse_it->second = true;
+
+    // Add the task's TaskSpecification to the StealWork RPC reply
+    reply->add_tasks_stolen()->CopyFrom(reverse_it->first.GetMessage());


Carrying on the conversation from the previous PR:

Currently, I don't think the owner keeps a {TaskId -> TaskSpecification} mapping for the tasks that are in-flight to a worker. In fact, right after calling PushNormalTask to push a task to a worker, OnWorkerIdle pops the task's TaskSpec from the relevant owner's queue. So once we push a task, the owner will not see the task's TaskSpec until the worker responds to the owner and activates the PushNormalTask callback (which captures the TaskSpec).

I think you can get the task spec from the owner's TaskManager, which caches the specs for all tasks that are still running. This may actually be relevant for performance because we can save bandwidth by not sending the full task spec back to the owner.
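
A sketch of what this could look like on the owner side, assuming the StealWork reply carried only task IDs. `SpecCache`, `TaskSpec`, and `ResolveStolen` are hypothetical names for this example and do not reflect the TaskManager interface.

```cpp
#include <string>
#include <unordered_map>
#include <vector>

using TaskId = std::string;

// Hypothetical, simplified task specification.
struct TaskSpec {
  TaskId id;
  std::string payload;
};

// Hypothetical owner-side cache of specs for tasks that are still running.
class SpecCache {
 public:
  void Add(const TaskSpec &spec) { specs_[spec.id] = spec; }

  // Returns nullptr if the task is unknown (for example, already finished).
  const TaskSpec *Find(const TaskId &id) const {
    auto it = specs_.find(id);
    return it == specs_.end() ? nullptr : &it->second;
  }

 private:
  std::unordered_map<TaskId, TaskSpec> specs_;
};

// Rebuilds the stolen tasks from their IDs alone; unknown IDs are skipped.
std::vector<TaskSpec> ResolveStolen(const SpecCache &cache,
                                    const std::vector<TaskId> &stolen_ids) {
  std::vector<TaskSpec> specs;
  for (const auto &id : stolen_ids) {
    if (const TaskSpec *spec = cache.Find(id)) specs.push_back(*spec);
  }
  return specs;
}
```

Only the IDs cross the wire in this scheme; the full specs never leave the owner a second time.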

stephanie-wang · 2020-09-07T18:41:30Z

src/ray/core_worker/transport/direct_actor_transport.h

@@ -523,6 +542,14 @@ class CoreWorkerDirectTaskReceiver {
  /// Queue of pending requests per actor handle.
  /// TODO(ekl) GC these queues once the handle is no longer active.
  std::unordered_map<WorkerID, SchedulingQueue> scheduling_queue_;
+  /// The Worker ID of the worker running this task receiver
+  WorkerID this_worker_id_;


We should try not to add fields that are only used for debugging purposes. I think you can get the worker ID from other parts of the same log.

stephanie-wang · 2020-09-07T18:54:27Z

src/ray/core_worker/transport/direct_actor_transport.cc

@@ -320,6 +320,31 @@ void CoreWorkerDirectTaskReceiver::HandlePushTask(
    return;
  }

+  if (!task_spec.IsActorTask() && !task_spec.IsActorCreationTask()) {


Hmm I'm not sure if these handlers are guaranteed to execute in the same order that they were posted in. I think it will work for now since there is only one thread executing the event loop, but that could be a problem with multiple threads since then the queue may not match the order of callbacks. It may be safer to store the original send_reply_callback onto the queue first (basically the same code that was already in HandlePushTask), then execute from the queue in this handler.
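
A minimal sketch of that ordering idea, with a hypothetical `Receiver` class: the request is recorded on the queue before any handler is created, and every handler runs whatever is at the head of the queue rather than a task it captured itself.

```cpp
#include <deque>
#include <functional>
#include <string>
#include <utility>

class Receiver {
 public:
  using Reply = std::function<void(const std::string &task)>;

  // Called when a push request arrives. The request goes onto the queue
  // first; the returned handler is what would be posted to the event loop.
  std::function<void()> HandlePush(std::string task, Reply reply) {
    queue_.emplace_back(std::move(task), std::move(reply));
    // The handler carries no task of its own: it always runs the queue head.
    return [this] { ExecuteNext(); };
  }

 private:
  void ExecuteNext() {
    if (queue_.empty()) return;  // the task was removed in the meantime
    auto entry = std::move(queue_.front());
    queue_.pop_front();
    entry.second(entry.first);
  }

  std::deque<std::pair<std::string, Reply>> queue_;
};
```

Execution order then follows queue order even if the handlers fire out of order. The sketch has no locking, so a multi-threaded event loop would still need a mutex around the queue.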

stephanie-wang · 2020-09-07T19:16:52Z

src/ray/common/ray_config_def.h

+
+/// Enable stealing of non-actor tasks among workers that are associated with the same
+/// owner
+RAY_CONFIG(bool, work_stealing_enabled, false)


Is it possible to remove this flag? That is, we always enable work stealing, but it doesn't actually kick in unless max_tasks_in_flight is > 0?

stephanie-wang · 2020-09-07T19:21:30Z

src/ray/core_worker/transport/direct_task_transport.cc

+    if (victim_addr.worker_id == thief_addr.worker_id ||
+        ((candidate_lease_entry.stealable_tasks.size() >
+          victim_lease_entry.stealable_tasks.size()) &&
+         candidate_addr.worker_id != thief_addr.worker_id)) {


Hmm I'm not really sure if I understand this if condition! Won't this find the worker with the minimum stealable tasks? I thought that we would want the worker with the maximum number. Also, I don't quite understand this part: victim_addr.worker_id == thief_addr.worker_id.

It would be good to add some comments here to explain the criteria for the victim worker.
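
For reference, the criteria I would expect can be sketched as follows. `PickVictim` and the map of per-worker counts are hypothetical and only illustrate the intended rule (most stealable tasks, never the thief); this is not the code in this PR.

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>

// Returns the ID of the worker with the most stealable tasks, never the
// thief itself. Returns an empty string if no worker has anything to steal.
std::string PickVictim(
    const std::unordered_map<std::string, std::size_t> &stealable_per_worker,
    const std::string &thief_id) {
  std::string victim;
  std::size_t best = 0;
  for (const auto &kv : stealable_per_worker) {
    if (kv.first == thief_id) continue;  // a worker cannot steal from itself
    if (kv.second > best) {              // keep the maximum, not the minimum
      best = kv.second;
      victim = kv.first;
    }
  }
  return victim;
}
```

Ties are broken by iteration order here, which is unspecified for `std::unordered_map`.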

stephanie-wang · 2020-09-07T19:23:21Z

src/ray/core_worker/transport/direct_task_transport.cc

+  }
+  rpc::WorkerAddress victim_addr = *victim_it;
+
+  // Check that the victim is a suitable one


I think that we could simplify these checks to just checking if the victim has more than one task in flight. The thief should have 0 so that will also make sure the worker doesn't steal from itself.

stephanie-wang · 2020-09-07T19:26:56Z

src/ray/core_worker/transport/direct_task_transport.cc

+          auto res = thief_entry.stealable_tasks.emplace(task_spec.TaskId());
+          RAY_CHECK(res.second);
+          executing_tasks_.emplace(task_spec.TaskId(), thief_addr);
+          PushNormalTask(thief_addr, client, scheduling_key, task_spec,


How about reusing some of the existing code, to cut down on the additional logic? You could push the stolen tasks back onto the main queue, then call OnWorkerIdle for the thief worker, right? That would also help to make sure that we never exceed the maximum tasks in flight.
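
Roughly, and with hypothetical names, the reuse could look like the sketch below. The `Owner` struct is a simplified model for this example; its `OnWorkerIdle` only mimics the role of the real method and does not share its signature.

```cpp
#include <cstddef>
#include <deque>
#include <string>
#include <vector>

// Hypothetical, simplified owner that dispatches tasks to one worker.
struct Owner {
  std::deque<std::string> task_queue;  // tasks not yet pushed to any worker
  std::size_t max_tasks_in_flight = 1;

  // Fills a worker's pipeline from the main queue, never above the cap.
  // Returns the tasks that were pushed to the worker.
  std::vector<std::string> OnWorkerIdle(std::size_t worker_tasks_in_flight) {
    std::vector<std::string> pushed;
    while (!task_queue.empty() &&
           worker_tasks_in_flight + pushed.size() < max_tasks_in_flight) {
      pushed.push_back(task_queue.front());
      task_queue.pop_front();
    }
    return pushed;
  }

  // Stolen tasks take no special path: they rejoin the main queue and the
  // usual dispatch logic hands them to the thief.
  std::vector<std::string> OnTasksStolen(const std::vector<std::string> &stolen,
                                         std::size_t thief_tasks_in_flight) {
    task_queue.insert(task_queue.end(), stolen.begin(), stolen.end());
    return OnWorkerIdle(thief_tasks_in_flight);
  }
};
```

Because the thief is served by the same dispatch path as any idle worker, the in-flight cap is enforced in one place only.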

stephanie-wang · 2020-09-07T19:29:57Z

src/ray/core_worker/transport/direct_task_transport.cc

+  worker_to_lease_entry_.erase(addr);
+}
+
+void CoreWorkerDirectTaskSubmitter::StealWorkIfNeeded(


This method is pretty long! I left some suggestions on how to simplify it, but if it's still ~50+ lines, we should think about other ways to pull out parts of the logic.

goliaro · 2020-09-26T20:46:35Z

I will continue working on this PR once the code in PR #11051 is merged. The code in #11051 will facilitate stealing from a victim's task queue.

goliaro · 2021-01-19T23:30:58Z

Continuing this work in the new PR #13570!

Gabriele Oliaro added 13 commits September 2, 2020 23:47

one commit work stealin. all tests passing locally

3b978dd

fixed bug preventing build on mac

def1a91

refactore code in CoreWorker and CoreWorkerDirectTaskReceiver to keep…

5dddba0

… the CoreWorkerDirectTaskReceiver mutex private

added eager worker requesting mode, pre-debugging direct_task_transpo…

8319ae7

…rt_test. builds on mac

bug fix in RequestNewWorkersIfNeeded

61d5afc

added benchmarking code

1066e6f

bug fix

e32539c

fixed work stealing bugs; checked performance

178dd94

stealing tasks from end of queue in direct_actor_transport.cc

823b792

bug fix

0244250

linting

1b0bea2

fixed merge conflicts

df98e02

more linting

cc6fc50

goliaro mentioned this pull request Sep 6, 2020

[WIP] Work Stealing and Eager Worker Requesting Mode #10135

Closed

5 tasks

bug fix

50ba962

stephanie-wang requested changes Sep 7, 2020

ffbin self-assigned this Sep 8, 2020

Gabriele Oliaro added 3 commits September 16, 2020 21:33

Merge branch 'master' into work_stealing_official

2faade6

first change, debugging now

fa3b47e

Merge branch 'master' into work_stealing_official

bb62641

rkooo567 self-assigned this Sep 26, 2020

goliaro mentioned this pull request Sep 26, 2020

Queueing non-actor tasks at the workers #11051

Merged

2 tasks

Gabriele Oliaro added 7 commits November 18, 2020 11:09

wrote code to enable cancellation of queued non-actor tasks

b0b3799

minor changes

e55d3b5

bug fixes

dd033bd

added comments

5f331c8

rev1

6f6a32d

linting

4ebfae5

Merge branch 'master' into cancel_queued_tasks

09a820f

Gabriele Oliaro and others added 21 commits November 21, 2020 17:02

Merge branch 'cancel_queued_tasks' of github.com:gabrieleoliaro/ray i…

bd70ffe

…nto cancel_queued_tasks

making ActorSchedulingQueue::CancelTaskIfFound raise a fatal error

2965eda

bug fix

134af60

Merge branch 'master' of github.com:ray-project/ray into cancel_queue…

d0f30b6

…d_tasks

added two unit tests

a2640be

linting

9c73c14

iterating through pending_normal_tasks starting from end

8961c4f

fixup! iterating through pending_normal_tasks starting from end

bafb8e6

fixup! fixup! iterating through pending_normal_tasks starting from end

331acbf

Merge branch 'master' into cancel_queued_tasks

58e1022

post merge fixes

ff05231

added debugging instructions, pulled Accept() out of guarded loop

1cd1ef9

removed debugging instructions, linting

855dcfc

first commit

15560ec

Merge pull request #3 from ijrsvt/increase-debugability-of-test-cancel

ee7c8ee

Increase debugability of test cancel

Merge branch 'master' into cancel_queued_tasks

6531e33

Merge branch 'master' into cancel_queued_tasks

4247050

lint

44a17ce

fixed merge issue

1bf220e

lint

ac6db80

working to fix merge conflicts (still wip)

deaca3d

goliaro closed this Jan 19, 2021

goliaro deleted the work_stealing_official branch January 19, 2021 18:02

goliaro restored the work_stealing_official branch January 19, 2021 23:30

goliaro deleted the work_stealing_official branch January 24, 2021 14:52
