Ensure final task info set on worker node failure #18175
Conversation
@arhimondr @jamesgpearce PTAL and help me understand:
@@ -1030,8 +1030,8 @@ private synchronized void fatalUnacknowledgedFailure(Throwable cause)
        updateTaskInfo(getTaskInfo().withTaskStatus(taskStatus));
    }
    else {
        // Let the status callbacks trigger the cleanup command remotely after switching states
Where was the "trigger the cleanup command remotely" supposed to take place?
I do see that it's possible for a specific sequence of operations around [...].

Your change addresses this by not updating the [...]. To me, the more correct fix would seem to be a change that handles failures inside of [...].
Yeah - it makes total sense. I missed the [...]. Still, I do not understand why, without the fix, we are observing those tasks for which there is a discrepancy between [...].
@pettyjamesm PTAL if this is more-or-less in line with what you were thinking :)
@@ -1001,14 +1003,18 @@ public void onFailure(Throwable t)
    }, executor);
}

private synchronized void fatalUnacknowledgedFailure(Throwable cause) {
I think making this a top level method is more likely to cause confusion, because this should only be valid to call from within the failure handler in doScheduleAsyncCleanupRequest. Maybe it should be a method defined there and more specifically named, something like failedToCleanUpRemoteTask(TrinoTransportException)?

That way, we know that the update is always guarded by the check for getTaskInfo().getTaskStatus().getState().isDone(), which avoids the potential for replacing some other final state that TaskInfo might have had set on it already, and also avoids accumulating multiple redundant TrinoTransportException failures reported from other calls (which will all likely fail very similarly).
By the way, @losipiuk - it would be great if we knew what the initial failure was in this scenario, because there might be another bug where that exception should be a [...].
I added logging for debugging, so I know what the exact exception was: [...]. We could fix the code to translate exceptions of such shape [...]. We still need the original fix as a general mechanism, though.
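For readers skimming the thread, here is a minimal sketch of the method shape suggested a couple of comments up. The name failedToCleanUpRemoteTask and the isDone() guard come from that suggestion; everything else (the simplified TaskState/TaskInfo types, the AtomicReference) is hypothetical scaffolding, not Trino's actual code:

```java
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical model of the suggestion: the failure handler is reachable only
// from the cleanup path, and every update is guarded by the "already done"
// check. TaskState and TaskInfo below are stand-ins, not Trino's types.
final class CleanupFailureSketch
{
    enum TaskState
    {
        RUNNING, FAILED;

        boolean isDone()
        {
            return this == FAILED;
        }
    }

    record TaskInfo(TaskState state) {}

    private final AtomicReference<TaskInfo> taskInfo = new AtomicReference<>(new TaskInfo(TaskState.RUNNING));

    void doScheduleAsyncCleanupRequest(RuntimeException transportFailure)
    {
        // ... issue the cleanup request; on failure, invoke the locally scoped handler
        failedToCleanUpRemoteTask(transportFailure);
    }

    // Not a top-level API: only meaningful when cleanup itself has failed
    private void failedToCleanUpRemoteTask(RuntimeException cause)
    {
        taskInfo.updateAndGet(info -> info.state().isDone()
                ? info // never replace a final state that was already set
                : new TaskInfo(TaskState.FAILED)); // single transition; no redundant failures accumulate
    }
}
```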
Force-pushed from ec44e72 to 87a2e80
if (taskStatus.getState().isDone()) {
    log.warn("Task %d already in terminal state %s; cannot overwrite with FAILED due to %s", taskStatus.getTaskId(), taskStatus.getState(), cause);
}
TaskStatus failedTaskStatus = failWith(taskStatus, FAILED, failures);
@pettyjamesm PTAL. I wonder if we should still use failedTaskStatus in updateTaskInfo if taskStatus.getState().isDone() is true. Or should we just propagate whatever taskStatus is to taskInfo.
I changed the code to use the original taskStatus if we cannot override it - seems more correct.
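A small self-contained sketch of that resolution, assuming simplified stand-in types (the real change lives in HttpRemoteTask and uses Trino's TaskStatus and failWith):

```java
import java.util.List;
import java.util.logging.Logger;

// If the task already reached a terminal state, keep and propagate the original
// status instead of overwriting it with FAILED. Types are stand-ins for Trino's.
final class FinalStatusSketch
{
    private static final Logger log = Logger.getLogger(FinalStatusSketch.class.getName());

    enum State
    {
        RUNNING, FINISHED, FAILED;

        boolean isDone()
        {
            return this != RUNNING;
        }
    }

    record TaskStatus(State state, List<String> failures) {}

    static TaskStatus resolveStatusForTaskInfo(TaskStatus taskStatus, String cause)
    {
        if (taskStatus.state().isDone()) {
            log.warning("Task already in terminal state %s; cannot overwrite with FAILED due to %s"
                    .formatted(taskStatus.state(), cause));
            return taskStatus; // use the original terminal status
        }
        // analogous to failWith(taskStatus, FAILED, failures)
        return new TaskStatus(State.FAILED, List.of(cause));
    }
}
```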
Actually, the exception which caused the original issue should probably not be treated as [...].
@@ -960,7 +960,7 @@ public void onSuccess(JsonResponse&lt;TaskInfo&gt; result)
        // if cleanup operation has not at least started task termination, mark the task failed
        TaskState taskState = getTaskInfo().getTaskStatus().getState();
        if (!taskState.isTerminatingOrDone()) {
-           fatalUnacknowledgedFailure(new TrinoTransportException(REMOTE_TASK_ERROR, fromUri(request.getUri()), format("Unable to %s task at %s, last known state was: %s", action, request.getUri(), taskState)));
+           fatalAsyncCleanupFailure(new TrinoTransportException(REMOTE_TASK_ERROR, fromUri(request.getUri()), format("Unable to %s task at %s, last known state was: %s", action, request.getUri(), taskState)));
Should fatalUnacknowledgedFailure lose the instanceof TrinoTransportException check now?
It does not matter much, I think. With TrinoTransportException we have extra remoteHost information, which can still be propagated via TaskInfo to the user via the else code path in fatalAsyncCleanupFailure. It may be of some use for debugging - but for sure not crucial. I have no strong opinion. @pettyjamesm?
Technically it's still possible for a fatal unacknowledged failure to be triggered with TrinoTransportException before cleanup is attempted, and when that occurs we can transition the TaskInfo to failed immediately, which should be preferable (cleanup will still be attempted, but we're not anticipating that succeeding and we don't need to wait for it to fail before reflecting that in TaskInfo).
Agreed, a 5xx response isn't necessarily a fatal exception and we don't need to reclassify that particular failure.
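A rough illustration of the two paths described in the exchange above. The stand-in TransportException plays the role of Trino's TrinoTransportException, and the two runnables stand in for the real update logic:

```java
// Transport failures observed before cleanup mean the worker is presumed
// unreachable, so TaskInfo can be failed immediately; other failures defer to
// the cleanup request's own failure handler.
final class FatalFailureRouting
{
    static class TransportException extends RuntimeException {}

    static void onFatalUnacknowledgedFailure(Throwable cause, Runnable failTaskInfoNow, Runnable deferToCleanupHandler)
    {
        if (cause instanceof TransportException) {
            failTaskInfoNow.run(); // cleanup is still attempted, but we don't wait for it to fail
        }
        else {
            deferToCleanupHandler.run(); // let the cleanup/status callbacks finish the job
        }
    }
}
```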
LGTM, except for a few minor suggested stylistic / clarity changes.
@arhimondr is there any particular reason to use a final task info listener here (Lines 1209 to 1210 in 5bfe919) instead of updating it to:

task.addStateChangeListener(taskStatus -> eventQueue.add(new RemoteTaskCompletedEvent(taskStatus)));
@pettyjamesm updated
Answering myself for the sake of future readers. For the case when a task completes successfully, we need to wait for [...].
CI: #18226
With the current logic in HttpRemoteTask.fatalUnacknowledgedFailure() it was possible that we only set FAILED status in the TaskStatus in ContinuousTaskStatusFetcher, but the TaskInfo in TaskInfoFetcher would not be updated. It happened when we entered the else branch of the (cause instanceof TrinoTransportException) condition; TaskInfoFetcher.taskInfo was not updated then. Subsequent calls to fatalUnacknowledgedFailure were then no-ops, due to the !taskStatus.getState().isDone() condition at the top. Meanwhile, at this point TaskInfoFetcher may no longer be able to update the task info (the worker node can be gone already), so it remains stale forever.

This resulted in a split-brain scenario where parts of the system observed the task as failed, while others (those depending on TaskInfo) reported the task in some other state (I observed either ABORTING or FAILING, but other people reported RUNNING too). It was mostly visible in the UI, which reports task information based on TaskInfo.

The bigger problem it caused was that queries hung forever. The reason is that EventDrivenFaultTolerantQueryScheduler uses finalTaskInfo delivery as the trigger for remote task completion logic.
Investigated it together with @findepi
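To make the hang mechanism described above concrete, here is a minimal, hypothetical simulation; all names are stand-ins, and in Trino the actual trigger is the final-TaskInfo listener registered by EventDrivenFaultTolerantQueryScheduler:

```java
import java.util.Optional;
import java.util.concurrent.CompletableFuture;

// TaskStatus reaches FAILED, but TaskInfo is never moved to a final state, so
// the future the scheduler effectively waits on never completes: a hang.
public class HangSimulation
{
    public static void main(String[] args)
    {
        CompletableFuture<String> finalTaskInfo = new CompletableFuture<>();
        String taskStatusState = "FAILED"; // the status fetcher observed the failure
        Optional<String> finalTaskInfoState = Optional.empty(); // the info fetcher was never updated

        // Scheduler-side completion trigger: fires only when a final TaskInfo is delivered
        finalTaskInfoState.ifPresent(finalTaskInfo::complete);

        System.out.println("TaskStatus: " + taskStatusState);
        System.out.println("Query completes: " + finalTaskInfo.isDone()); // prints false
    }
}
```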