
FTE stage gets stuck in Pending #18603

Closed
Konstantinos-Chaitas-PubNative opened this issue Aug 9, 2023 · 6 comments · Fixed by #20021

Comments


Konstantinos-Chaitas-PubNative commented Aug 9, 2023

I have successfully enabled FTE (mode TASK) on Trino v422, with GCS as the exchange storage. The cluster is running on Kubernetes with HPA. I can verify that FTE is working as expected: the retryPolicy on the queries is correctly set to TASK, and I can also see files being written to the configured bucket.
However, during my testing I encountered an issue when the cluster loses workers that were actively used by a running query. The query becomes idle and makes no progress, ultimately failing with a timeout. Specifically, a particular stage of the query gets stuck in the PENDING state and no further updates occur. I also noticed the coordinator was printing the following log line very frequently:
"io.trino.server.remotetask.RequestErrorTracker - Error getting info for task."
To reproduce this behaviour, you can try setting the maximum number of workers to 2 and manually killing one of them while the query is running. Please let me know if any additional input is needed.

Relevant slack thread: https://trinodb.slack.com/archives/C02UY6G5TGC/p1690205830887239
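
For reference, a minimal sketch of the kind of configuration described above, following the fault-tolerant execution documentation; the bucket and key file path are placeholders, and the exact property names should be verified against the docs for the Trino version in use:

```properties
# etc/config.properties (coordinator and workers) -- enable task-level retries
retry-policy=TASK

# etc/exchange-manager.properties -- filesystem-based exchange spooling to GCS
# (bucket name and key file path are placeholders)
exchange-manager.name=filesystem
exchange.base-directories=gs://my-trino-exchange-bucket
exchange.gcs.json-key-file-path=/etc/trino/gcs-key.json
```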

@linzebing
Member

cc @losipiuk

@Konstantinos-Chaitas-PubNative
Author

@linzebing @losipiuk Is there any update here? Thanks

@losipiuk
Member

To reproduce this behaviour, you can try setting the maximum number of workers to 2 and manually killing one of them while the query is running. Please let me know if any additional input is needed

Thanks @Konstantinos-Chaitas-PubNative. Can you be a bit more specific here?

  • What do you mean by setting the maximum number of workers to 2? Do you mean a k8s deployment setting or some Trino config?
  • What method do you use for killing the worker?
  • Is the existence of the HPA related to the problem in any way? Is the HPA doing any work while you observe the problem?

@Konstantinos-Chaitas-PubNative
Author

Hey @losipiuk, let me know if you need more information

  • What do you mean by setting the maximum number of workers to 2? Do you mean a k8s deployment setting or some Trino config?

I mean setting the maximum number of workers in the K8s HPA configuration to 2 (see the sketch after this comment).

  • What method do you use for killing the worker?

I just kill the pod manually via K8s, e.g. kubectl delete pod ....

  • Is the existence of the HPA related to the problem in any way? Is the HPA doing any work while you observe the problem?

I do not think so; the HPA just controls the min/max number of workers and scales them up/down when necessary.
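
For illustration, a minimal sketch of the setup described above (deployment name, namespace, and pod name are placeholders, not the actual resources):

```yaml
# HorizontalPodAutoscaler capping the worker pool at 2 replicas
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: trino-worker
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: trino-worker
  minReplicas: 1
  maxReplicas: 2
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 80
```

and, while a query is running:

```shell
# kill one of the two workers mid-query (pod name is a placeholder)
kubectl delete pod trino-worker-0 -n trino
```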

@losipiuk
Member

  • What type of query was running on the cluster when you killed the pod?
  • What connectors were used, and how were they configured (especially, if you used Hive/Iceberg/Delta, which object store was in use)?

@Konstantinos-Chaitas-PubNative
Author

  • What type of query was running on the cluster when you killed the pod?

I couldn't find any correlation there; I was running simple SELECT * FROM ... queries.

  • What connectors were used, and how were they configured (especially, if you used Hive/Iceberg/Delta, which object store was in use)?

I am using the Hive catalog with GCS storage
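
For reference, a minimal sketch of such a catalog, assuming the Hive connector's built-in GCS support; the metastore URI and key file path are placeholders, and the property names should be checked against the docs for v422:

```properties
# etc/catalog/hive.properties -- Hive catalog backed by GCS (placeholder values)
connector.name=hive
hive.metastore.uri=thrift://hive-metastore:9083
hive.gcs.json-key-file-path=/etc/trino/gcs-key.json
```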
