
[xray] Resubmit tasks that fail to be forwarded #2645

Merged

Conversation

stephanie-wang
Contributor

What do these changes do?

Tasks that fail to be forwarded are enqueued after a timeout, to be scheduled again. However, for actor tasks, this causes the tasks to hang since they may get enqueued at the wrong node and therefore never get executed. This PR resubmits the failed tasks instead.
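For illustration, a minimal sketch of the new failure path described above, assuming the forward-failure handler has the task and its uncommitted lineage in scope (the exact SubmitTask arguments here are an assumption, not the actual diff):

// Old behavior (approximate): re-enqueue the task locally after
// node_manager_forward_task_retry_timeout_milliseconds(), which can leave an
// actor task queued at a node that will never execute it.
// New behavior in this PR (approximate): resubmit the task so that it goes
// through scheduling again.
SubmitTask(task, uncommitted_lineage);  // sketch only; argument list assumed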

Related issue number

This should fix the hanging in #2626.

Contributor

@atumanov atumanov left a comment


In principle, this looks reasonable to me, given the current retry mechanism in place. I was wondering if it might be better to just re-enqueue as placeable the tasks that the raylet failed to forward? That way, the policy may make a different decision for this task next time around, which might be better than attempting to resubmit to the same raylet. I believe this would actually be consistent/similar to what you're trying to do in this PR (by calling SubmitTask), except we don't have to wait for node_manager_forward_task_retry_timeout_milliseconds().

@stephanie-wang
Contributor Author

Hmm @atumanov, I think we could do that, but we would then have to change the logic depending on whether it's an actor method. Actor methods shouldn't become placeable since they have to be retried at the same node. One thing we could do is set the timer only if it's an actor method? What do you think? You have a better understanding of how the scheduling code should fit together than I do.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/7436/
Test FAILed.

@atumanov
Contributor

@stephanie-wang, yes, I would retry forwarding actor tasks (because their placement is unambiguously determined) and re-enqueue regular tasks as placeable without delay. Another reason for the latter is that it's more work-preserving, since placeable tasks are (or will soon be) distributed to heartbeating raylets that have available capacity (in the scheduling PR #2420).
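A hedged sketch of the split proposed here (the actor-task predicate and the retry helper are illustrative names, not code from this PR):

if (spec.IsActorTask()) {  // illustrative predicate
  // Actor placement is fixed, so keep retrying the forward to the same
  // raylet after node_manager_forward_task_retry_timeout_milliseconds().
  RetryForwardTask(task, node_id);  // hypothetical helper
} else {
  // A regular task can run anywhere; make it placeable again immediately so
  // the policy can pick a different raylet on the next scheduling pass.
  local_queues_.QueuePlaceableTasks({task});
}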

@stephanie-wang
Contributor Author

@atumanov, that makes sense. I'll update the PR to reflect that.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/7441/
Test FAILed.

@robertnishihara
Collaborator

If I understand correctly, the benefit of treating regular tasks differently is that we can execute them more quickly in the event of a node failure (essentially the 10-second window between when a node dies and when the monitor broadcasts its death). The disadvantage is that we have two code paths to maintain and test, is that right?

I prefer fewer code paths, so would lean toward not special casing regular tasks, but am ok with the current approach.

@robertnishihara
Collaborator

@stephanie-wang this seems to be hanging in component_failures_test.py in xray. Is this a known issue or possibly a bug in this PR?

@stephanie-wang
Contributor Author

@robertnishihara, I think that's a known bug. It shows up in master (this commit) too :( Will have to look into it more separately.

@robertnishihara
Collaborator

jenkins, retest this please

Contributor

@atumanov atumanov left a comment


There's a TODO inside ForwardTask to handle failure to forward. We might want to remove it:

// TODO(atumanov): caller must handle ForwardTask failure.

// Timer killing will receive the boost::asio::error::operation_aborted,
// we only handle the timeout event.
RAY_CHECK(!error);
RAY_LOG(DEBUG) << "Retrying ForwardTask for task " << task_id;
Contributor

I think we should update the log message here, as we're no longer retrying the forward. How about "resubmitting task due to a failed forward"?
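For example, something along these lines, reusing the task_id variable from the snippet above (exact wording up to the author):

RAY_LOG(DEBUG) << "Resubmitting task " << task_id << " due to a failed forward";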

@robertnishihara
Copy link
Collaborator

robertnishihara commented Aug 15, 2018

I can reproduce the hanging quite easily on this branch with

RAY_USE_XRAY=1 python -m pytest test/component_failures_test.py::ComponentFailureTest::testRayletFailed

I'll take a look at what is going on.

} else {
// The task is not for an actor and may therefore be placed on another
// node immediately.
local_queues_.QueuePlaceableTasks({task});
Collaborator

@stephanie-wang can you explain the rationale for the change here from EnqueuePlaceableTask(task); to local_queues_.QueuePlaceableTasks({task});?

Reverting that change fixes the hanging issue.

However, that does cause me to see messages of the form

/Users/rkn/Workspace/ray/src/ray/common/client_connection.cc:160: [node manager]ProcessMessage with type 6 took 107 ms 

which makes me think we may still want to use a timer so that we don't loop here over and over if the scheduling policy keeps making the same decision.

Contributor Author

Ah, nice catch! I should've tried running it locally first :)

The behavior I was hoping for was that the task would become "placeable" again, meaning that the node could choose some other node to schedule it on. Unfortunately, as of this PR, the node only schedules tasks when NodeManager::ScheduleTasks is called, so I think what's happening is that the task is getting stuck in the "placeable" queue.

I don't necessarily want to change it back to EnqueuePlaceableTask since the task may not be feasible on that node. I think that #2420 will guarantee that tasks in the "placeable" queue will eventually get scheduled, so it will probably work after that PR is merged. Maybe @atumanov can weigh in on this?

Contributor

EnqueuePlaceableTask basically makes the raylet take ownership of the task, by putting it either into the waiting or ready queue. In contrast, QueuePlaceableTasks will re-enqueue the task as placeable, so that we can make another decision about its placement. We should remember that this code executes when ForwardTask fails, which likely happens because the remote raylet died. In such a case, I would expect that our cluster resources data structure is updated, removing the dead raylet from the resource map. When that happens, the policy will no longer "make the same decision".

In the test you are referencing, @robertnishihara, it might be hanging because the cluster resource map is not updated to remove the dead raylet. We should double check that this mechanism is actually firing and the cluster resource map is updated. However, I am against automatically taking ownership of the task and bypassing the scheduling policy when we fail to forward it. Letting the scheduling policy decide what to do with that task should be strictly better.
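A hedged restatement of that contrast in code comments (the comments paraphrase this explanation; they are not taken from the implementation):

// Take local ownership: place the task directly into this raylet's
// waiting or ready queue.
EnqueuePlaceableTask(task);

// Hand the task back to the scheduling policy: mark it placeable again so a
// (possibly different) raylet can be chosen on the next scheduling pass.
local_queues_.QueuePlaceableTasks({task});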

Contributor

And in response to @stephanie-wang: yes, it's likely that the task gets stuck as placeable, which is fixed in #2420. Another possibility (if the policy is firing) is that the dead raylet is the only place this task can run (I don't know the details of the test), and the dead raylet's resources are not cleaned up on raylet failure. We need to be absolutely certain that the cleanup does happen.

Contributor Author

Regarding the dead raylet's resources not getting cleaned up, it will happen eventually, but it may take a while since the monitor by default waits 10s before declaring the raylet dead. So it's definitely possible that the task will be retried at the dead raylet multiple times.

Collaborator

@atumanov, are you saying that the dead raylet should be removed from the cluster resources data structure as soon as a write to that raylet fails? It seems far simpler to wait for the monitor to broadcast that the raylet is dead (which could take 10 seconds, and in that time, the scheduling policy could make the same decision over and over).

If we do local_queues_.QueuePlaceableTasks({task});, then don't we need to follow that up with a call to ScheduleTasks();?
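A hedged sketch of that follow-up (whether calling ScheduleTasks() directly at this point is safe is the open question in this thread, not something this sketch settles):

// Make the task placeable again so the policy can choose a new node...
local_queues_.QueuePlaceableTasks({task});
// ...and trigger a scheduling pass so the placeable queue is actually
// drained; otherwise the task can sit there until some other event calls
// ScheduleTasks().
ScheduleTasks();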

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/7498/
Test PASSed.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/7510/
Test FAILed.

@robertnishihara
Collaborator

Linting passed at https://travis-ci.org/stephanie-wang/ray.

@robertnishihara robertnishihara merged commit e3e0cfc into ray-project:master Aug 16, 2018
@robertnishihara robertnishihara deleted the fix-dead-actor-hanging branch August 16, 2018 07:13
@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/7520/
Test FAILed.
