
FTE stage gets stuck in Pending #18603

Closed
Konstantinos-Chaitas-PubNative opened this issue Aug 9, 2023 · 6 comments · Fixed by #20021

Comments


Konstantinos-Chaitas-PubNative commented Aug 9, 2023

I have successfully enabled FTE (mode TASK) on Trino v422, with GCS as the exchange storage. The cluster is running on Kubernetes with HPA. I can verify that FTE is working as expected: the retryPolicy on the queries is correctly set to TASK, and I can also see files being written to the configured bucket.
However, during my testing I encountered an issue when the cluster loses workers that were actively used by a running query. The query becomes idle and makes no progress, ultimately failing with a timeout. Specifically, a particular stage of the query gets stuck in the PENDING state and no further updates occur. I also noticed the coordinator was printing the following log line very frequently:
"io.trino.server.remotetask.RequestErrorTracker - Error getting info for task."
To reproduce this behaviour, you can try setting the maximum number of workers to 2 and manually killing one of them while the query is running. Please let me know if any additional input is needed.

Relevant slack thread: https://trinodb.slack.com/archives/C02UY6G5TGC/p1690205830887239
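
For reference, a minimal sketch of the kind of configuration described above, following the fault-tolerant execution documentation; the bucket and key file path are placeholders, and the exact property names should be verified against the docs for the Trino version in use:

```properties
# etc/config.properties (coordinator and workers) -- enable task-level retries
retry-policy=TASK

# etc/exchange-manager.properties -- filesystem-based exchange spooling to GCS
# (bucket name and key file path are placeholders)
exchange-manager.name=filesystem
exchange.base-directories=gs://my-trino-exchange-bucket
exchange.gcs.json-key-file-path=/etc/trino/gcs-key.json
```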

@linzebing
Member

cc @losipiuk

@Konstantinos-Chaitas-PubNative
Author

@linzebing @losipiuk Is there any update here? Thanks

@losipiuk
Member

To reproduce this behaviour, you can try setting the maximum number of workers to 2 and manually killing one of them while the query is running. Please let me know if any additional input is needed

Thanks @Konstantinos-Chaitas-PubNative. Can you be a bit more specific here?

  • What do you mean by setting the maximum number of workers to 2? Do you mean a k8s deployment setting or some Trino config?
  • What method do you use for killing the worker?
  • Is the existence of the HPA related to the problem in any way? Is the HPA doing any work while you observe the problem?

@Konstantinos-Chaitas-PubNative
Author

Hey @losipiuk, let me know if you need more information

  • What do you mean by setting the maximum number of workers to 2? Do you mean a k8s deployment setting or some Trino config?

I mean setting the maximum number of workers in the K8s HPA configuration to 2 (see the sketch after this comment).

  • What method do you use for killing the worker?

I just kill the pod manually via K8s, e.g. kubectl delete pod ....

  • Is the existence of the HPA related to the problem in any way? Is the HPA doing any work while you observe the problem?

I do not think so; the HPA just controls the min/max number of workers and scales them up/down when necessary.
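
For illustration, a minimal sketch of the setup described above (deployment name, namespace, and pod name are placeholders, not the actual resources):

```yaml
# HorizontalPodAutoscaler capping the worker pool at 2 replicas
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: trino-worker
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: trino-worker
  minReplicas: 1
  maxReplicas: 2
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 80
```

and, while a query is running:

```shell
# kill one of the two workers mid-query (pod name is a placeholder)
kubectl delete pod trino-worker-0 -n trino
```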

@losipiuk
Member

  • What type of query was running on the cluster when you killed the pod?
  • What connectors were used, and how were they configured (especially, if you used Hive/Iceberg/Delta, which object store was in use)?

@Konstantinos-Chaitas-PubNative
Author

  • What type of query was running on the cluster when you killed the pod?

I couldn't find any correlation there; I was running simple SELECT * FROM ... queries.

  • What connectors were used, and how were they configured (especially, if you used Hive/Iceberg/Delta, which object store was in use)?

I am using the Hive catalog with GCS storage
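
For reference, a minimal sketch of such a catalog, assuming the Hive connector's built-in GCS support; the metastore URI and key file path are placeholders, and the property names should be checked against the docs for v422:

```properties
# etc/catalog/hive.properties -- Hive catalog backed by GCS (placeholder values)
connector.name=hive
hive.metastore.uri=thrift://hive-metastore:9083
hive.gcs.json-key-file-path=/etc/trino/gcs-key.json
```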
