
[xray] Resubmit tasks that fail to be forwarded #2645

Merged

Conversation

stephanie-wang
Contributor

What do these changes do?

Tasks that fail to be forwarded are enqueued after a timeout, to be scheduled again. However, for actor tasks, this causes the tasks to hang since they may get enqueued at the wrong node and therefore never get executed. This PR resubmits the failed tasks instead.
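For illustration, a minimal sketch of the new failure path described above, assuming the forward-failure handler has the task and its uncommitted lineage in scope (the exact SubmitTask arguments here are an assumption, not the actual diff):

// Old behavior (approximate): re-enqueue the task locally after
// node_manager_forward_task_retry_timeout_milliseconds(), which can leave an
// actor task queued at a node that will never execute it.
// New behavior in this PR (approximate): resubmit the task so that it goes
// through scheduling again.
SubmitTask(task, uncommitted_lineage);  // sketch only; argument list assumed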

Related issue number

This should fix the hanging in #2626.

Contributor

@atumanov atumanov left a comment


In principle, this looks reasonable to me, given the current retry mechanism in place. I was wondering if it might be better to just re-enqueue as placeable the tasks that the raylet failed to forward? That way, the policy may make a different decision for this task next time around, which might be better than attempting to resubmit to the same raylet. I believe this would actually be consistent/similar to what you're trying to do in this PR (by calling SubmitTask), except we don't have to wait for node_manager_forward_task_retry_timeout_milliseconds().

@stephanie-wang
Contributor Author

Hmm @atumanov, I think we could do that, but we would then have to change the logic depending on whether it's an actor method. Actor methods shouldn't become placeable since they have to be retried at the same node. One thing we could do is set the timer only if it's an actor method? What do you think? You have a better understanding of how the scheduling code should fit together than I do.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/7436/
Test FAILed.

@atumanov
Contributor

@stephanie-wang, yes, I would retry forwarding actor tasks (because their placement is unambiguously determined) and re-enqueue regular tasks as placeable without delay. Another reason for the latter is that it's more work-preserving, since placeable tasks are (or will soon be) distributed to heartbeating raylets that have available capacity (in the scheduling PR #2420).
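A hedged sketch of the split proposed here (the actor-task predicate and the retry helper are illustrative names, not code from this PR):

if (spec.IsActorTask()) {  // illustrative predicate
  // Actor placement is fixed, so keep retrying the forward to the same
  // raylet after node_manager_forward_task_retry_timeout_milliseconds().
  RetryForwardTask(task, node_id);  // hypothetical helper
} else {
  // A regular task can run anywhere; make it placeable again immediately so
  // the policy can pick a different raylet on the next scheduling pass.
  local_queues_.QueuePlaceableTasks({task});
}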

@stephanie-wang
Contributor Author

@atumanov, that makes sense. I'll update the PR to reflect that.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/7441/
Test FAILed.

@robertnishihara
Collaborator

If I understand correctly, the benefit of treating regular tasks differently is that we can execute them more quickly in the event of a node failure (essentially the 10-second window between when a node dies and when the monitor broadcasts its death). The disadvantage is that we have two code paths to maintain and test, is that right?

I prefer fewer code paths, so would lean toward not special casing regular tasks, but am ok with the current approach.

@robertnishihara
Collaborator

@stephanie-wang this seems to be hanging in component_failures_test.py in xray. Is this a known issue or possibly a bug in this PR?

@stephanie-wang
Contributor Author

@robertnishihara, I think that's a known bug. It shows up in master (this commit) too :( Will have to look into it more separately.

@robertnishihara
Collaborator

jenkins, retest this please

Contributor

@atumanov atumanov left a comment


There's a TODO inside ForwardTask to handle failure to forward. We might want to remove it:

// TODO(atumanov): caller must handle ForwardTask failure.

// Timer killing will receive the boost::asio::error::operation_aborted,
// we only handle the timeout event.
RAY_CHECK(!error);
RAY_LOG(DEBUG) << "Retrying ForwardTask for task " << task_id;
Contributor

I think we should update the log message here, as we're no longer retrying the forward. How about "resubmitting task due to a failed forward"?
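For example, something along these lines, reusing the task_id variable from the snippet above (exact wording up to the author):

RAY_LOG(DEBUG) << "Resubmitting task " << task_id << " due to a failed forward";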

@robertnishihara
Copy link
Collaborator

robertnishihara commented Aug 15, 2018

I can reproduce the hanging quite easily on this branch with

RAY_USE_XRAY=1 python -m pytest test/component_failures_test.py::ComponentFailureTest::testRayletFailed

I'll take a look at what is going on.

} else {
// The task is not for an actor and may therefore be placed on another
// node immediately.
local_queues_.QueuePlaceableTasks({task});
Collaborator

@stephanie-wang can you explain the rationale for the change here from EnqueuePlaceableTask(task); to local_queues_.QueuePlaceableTasks({task});?

Reverting that change fixes the hanging issue.

However, that does cause me to see messages of the form

/Users/rkn/Workspace/ray/src/ray/common/client_connection.cc:160: [node manager]ProcessMessage with type 6 took 107 ms 

which makes me think we may still want to use a timer so that we don't loop here over and over if the scheduling policy keeps making the same decision.

Contributor Author

Ah, nice catch! I should've tried running it locally first :)

The behavior I was hoping for was that the task would become "placeable" again, meaning that the node could choose some other node to schedule it on. Unfortunately, as of this PR, the node only schedules tasks when NodeManager::ScheduleTasks is called, so I think what's happening is that the task is getting stuck in the "placeable" queue.

I don't necessarily want to change it back to EnqueuePlaceableTask since the task may not be feasible on that node. I think that #2420 will guarantee that tasks in the "placeable" queue will eventually get scheduled, so it will probably work after that PR is merged. Maybe @atumanov can weigh in on this?

Contributor

EnqueuePlaceableTask basically makes the raylet take ownership of the task, by putting it either into the waiting or ready queue. In contrast, QueuePlaceableTasks will re-enqueue the task as placeable, so that we can make another decision about its placement. We should remember that this code executes when ForwardTask fails, which likely happens because the remote raylet died. In such a case, I would expect that our cluster resources data structure is updated, removing the dead raylet from the resource map. When that happens, the policy will no longer "make the same decision".

In the test you are referencing, @robertnishihara, it might be hanging because the cluster resource map is not updated to remove the dead raylet. We should double check that this mechanism is actually firing and the cluster resource map is updated. However, I am against automatically taking ownership of the task and bypassing the scheduling policy when we fail to forward it. Letting the scheduling policy decide what to do with that task should be strictly better.
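A hedged restatement of that contrast in code comments (the comments paraphrase this explanation; they are not taken from the implementation):

// Take local ownership: place the task directly into this raylet's
// waiting or ready queue.
EnqueuePlaceableTask(task);

// Hand the task back to the scheduling policy: mark it placeable again so a
// (possibly different) raylet can be chosen on the next scheduling pass.
local_queues_.QueuePlaceableTasks({task});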

Contributor

And in response to @stephanie-wang: yes, it's likely that the task gets stuck as placeable, which is fixed in #2420. Another possibility (if the policy is firing) is that the dead raylet is the only place this task can run (I don't know the details of the test), and the dead raylet's resources are not cleaned up on raylet failure. We need to be absolutely certain that the cleanup does happen.

Contributor Author

Regarding the dead raylet's resources not getting cleaned up, it will happen eventually, but it may take a while since the monitor by default waits 10s before declaring the raylet dead. So it's definitely possible that the task will be retried at the dead raylet multiple times.

Collaborator

@atumanov, are you saying that the dead raylet should be removed from the cluster resources data structure as soon as a write to that raylet fails? It seems far simpler to wait for the monitor to broadcast that the raylet is dead (which could take 10 seconds, and in that time, the scheduling policy could make the same decision over and over).

If we do local_queues_.QueuePlaceableTasks({task});, then don't we need to follow that up with a call to ScheduleTasks();?
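A hedged sketch of that follow-up (whether calling ScheduleTasks() directly at this point is safe is the open question in this thread, not something this sketch settles):

// Make the task placeable again so the policy can choose a new node...
local_queues_.QueuePlaceableTasks({task});
// ...and trigger a scheduling pass so the placeable queue is actually
// drained; otherwise the task can sit there until some other event calls
// ScheduleTasks().
ScheduleTasks();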

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/7498/
Test PASSed.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/7510/
Test FAILed.

@robertnishihara
Collaborator

Linting passed at https://travis-ci.org/stephanie-wang/ray.

@robertnishihara robertnishihara merged commit e3e0cfc into ray-project:master Aug 16, 2018
@robertnishihara robertnishihara deleted the fix-dead-actor-hanging branch August 16, 2018 07:13
@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/7520/
Test FAILed.
