FTE stage gets stuck in Pending #18603
Comments
cc @losipiuk
@linzebing @losipiuk Is there any update here? Thanks
Thanks @Konstantinos-Chaitas-PubNative. Can you be a bit more specific here?
Hey @losipiuk, let me know if you need more information.
I mean setting the max number of workers in the K8s HPA configuration to 2.
I just kill the pod manually via K8s, e.g.:
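For illustration, a minimal sketch of deleting a worker pod with the Kubernetes Python client; the pod name and namespace are placeholders and depend on your deployment:

```python
from kubernetes import client, config

# Load credentials from the local kubeconfig.
config.load_kube_config()

# Delete one active Trino worker pod; name/namespace are hypothetical.
v1 = client.CoreV1Api()
v1.delete_namespaced_pod(name="trino-worker-1", namespace="trino")
```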
I do not think so; the HPA just controls the min/max number of workers and scales them up/down when necessary.
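For reference, a minimal sketch of pinning the HPA to exactly two workers, again with the Kubernetes Python client; the HPA name and namespace are hypothetical:

```python
from kubernetes import client, config

config.load_kube_config()

# Patch the worker HPA so the cluster runs with exactly two workers.
autoscaling = client.AutoscalingV1Api()
autoscaling.patch_namespaced_horizontal_pod_autoscaler(
    name="trino-worker",  # placeholder HPA name
    namespace="trino",    # placeholder namespace
    body={"spec": {"minReplicas": 2, "maxReplicas": 2}},
)
```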
What type of query was running on the cluster when you killed the pod? What connectors were used, and how were they configured? (Especially, if you used Hive/Iceberg/Delta, which object store was in use?)
I couldn't find any correlation to that; I was running simple queries.
I am using the Hive catalog with GCS storage.
I have successfully enabled FTE (mode TASK) on Trino v422, with GCS as the exchange storage. The cluster is running on Kubernetes with HPA. I can verify that FTE is working as expected by examining the `retryPolicy` of the queries, which is correctly set to TASK, and I can also see files being written to the configured bucket.

However, during my testing I encountered an issue when the cluster loses workers that were actively used by a running query. This causes the query to become idle and make no progress, ultimately leading to a timeout failure. Specifically, a particular stage of the query gets stuck in `Pending` mode, and no further updates occur. I also noticed the coordinator was printing the following log line very frequently: `io.trino.server.remotetask.RequestErrorTracker - Error getting info for task.`
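As a side note, a sketch of one way the `retryPolicy` can be checked, assuming an unauthenticated dev coordinator, a hypothetical query ID, and that the query JSON served by the coordinator includes the policy field (as it does in recent Trino versions):

```python
import requests

# Placeholder coordinator URL and query ID; adjust for your deployment.
COORDINATOR = "http://localhost:8080"
QUERY_ID = "20230726_000000_00000_xxxxx"

# Fetch the full query JSON from the coordinator and read the retry policy.
info = requests.get(
    f"{COORDINATOR}/v1/query/{QUERY_ID}",
    headers={"X-Trino-User": "admin"},  # simple name-only auth assumed
).json()
print(info.get("retryPolicy"))  # expected "TASK" when FTE is enabled
```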
To reproduce this behaviour, you can try setting the maximum number of workers to 2 and manually killing one of them while the query is running; a sketch of the query side follows below. Please let me know if any additional input is needed.
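A sketch of that reproduction, assuming the Python `trino` client and placeholder connection details; kill one worker pod (as in the earlier snippet) while this is running:

```python
import trino

# Placeholder connection details for the test cluster.
conn = trino.dbapi.connect(
    host="localhost", port=8080, user="admin",
    catalog="hive", schema="default",
)
cur = conn.cursor()

# Any query long enough to still be running when a worker is killed.
cur.execute("SELECT count(*) FROM some_large_table")

# With this bug, the call below hangs while one stage stays in Pending.
print(cur.fetchall())
```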
Relevant Slack thread: https://trinodb.slack.com/archives/C02UY6G5TGC/p1690205830887239