scheduler pending #3518

Open
jorahbi opened this issue Jun 13, 2024 · 5 comments

Comments

@jorahbi

jorahbi commented Jun 13, 2024

Please provide an in-depth description of the question you have:

What do you think about this question?:
What causes this issue? Is it because Volcano fails to pick up modifications to the job YAML in a timely manner? What operation is needed to restore normal scheduling? The current workaround is to restart the scheduler, after which scheduling resumes; sometimes deleting and recreating the job also lets it be scheduled normally. The message "Retry to delete job xxxxxx" keeps appearing in the log.
[Screenshot: object-change]
Environment:

  • Volcano Version: 1.8.2
  • Kubernetes version (use kubectl version): 1.27
  • Cloud provider or hardware configuration:
  • OS (e.g. from /etc/os-release): Ubuntu 22.04
  • Kernel (e.g. uname -a):
  • Install tools:
  • Others:
@googs1025
Member

Does this error affect your use of Volcano?
This is my understanding; please correct me if I am wrong.
This should be a transient error, since the error returned on operation conflict will trigger another reconcile, where the controller fetches the latest version of the resource from the apiserver and tries the update again. This will repeat until the update goes through.
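
The retry-on-conflict behaviour described above can be sketched as follows. This is a minimal, self-contained simulation and not Volcano's or Kubernetes' actual code: `FakeStore`, `ConflictError`, `update_with_retry`, and the `before_update` hook are hypothetical names introduced only for illustration, and the integer version stands in for a resource version.

```python
class ConflictError(Exception):
    """Raised when an update carries a stale resource version."""


class FakeStore:
    """Hypothetical in-memory stand-in for an API server with optimistic concurrency."""

    def __init__(self):
        self._objects = {}  # name -> (resource_version, spec)

    def create(self, name, spec):
        self._objects[name] = (1, dict(spec))

    def get(self, name):
        version, spec = self._objects[name]
        return version, dict(spec)  # return a copy, like a fresh read

    def update(self, name, version, spec):
        current_version, _ = self._objects[name]
        if version != current_version:
            raise ConflictError(f"{name}: version {version} is stale")
        self._objects[name] = (current_version + 1, dict(spec))


def update_with_retry(store, name, mutate, max_attempts=5, before_update=None):
    """Re-read the object and re-apply `mutate` until the update is accepted."""
    for attempt in range(1, max_attempts + 1):
        version, spec = store.get(name)  # always start from the latest copy
        mutate(spec)
        if before_update is not None:
            before_update(attempt)  # hook used below to simulate a concurrent writer
        try:
            store.update(name, version, spec)
            return attempt
        except ConflictError:
            continue  # stale version: fetch again and retry
    raise RuntimeError(f"gave up updating {name} after {max_attempts} attempts")


store = FakeStore()
store.create("job-a", {"phase": "Pending", "retries": 0})


def concurrent_writer(attempt):
    # Another client modifies the object during the first attempt only.
    if attempt == 1:
        version, spec = store.get("job-a")
        spec["retries"] = 1
        store.update("job-a", version, spec)


attempts = update_with_retry(
    store,
    "job-a",
    lambda spec: spec.update(phase="Running"),
    before_update=concurrent_writer,
)
print(attempts, store.get("job-a"))
```

In this sketch the first attempt hits a conflict and the second succeeds on top of the other writer's change, which is why a conflict on its own is usually harmless.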

@jorahbi
Author

jorahbi commented Jun 17, 2024

The pending state sometimes lasts for a long time. I think this is caused by an inconsistency between the scheduler's cached resources and the actual K8S resources, because scheduling succeeds after restarting Volcano or after deleting the job and resubmitting it.
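
To make the suspected cache inconsistency concrete, here is a small sketch of what such drift would look like. It is purely illustrative: `find_cache_drift` and the two snapshots are hypothetical, and nothing here assumes Volcano exposes its cache in this form.

```python
def find_cache_drift(cached, live):
    """Compare two {name: resource_version} snapshots and report the differences."""
    # Present in both, but the cache holds an older version.
    stale = sorted(n for n in cached if n in live and cached[n] != live[n])
    # Still in the cache although the object is gone from the cluster.
    orphaned = sorted(n for n in cached if n not in live)
    # Exists in the cluster but never reached the cache.
    missing = sorted(n for n in live if n not in cached)
    return {"stale": stale, "orphaned": orphaned, "missing": missing}


# Hypothetical snapshots: what the scheduler believes vs. what the cluster holds.
cached = {"job-a": "101", "job-b": "205", "job-c": "300"}
live = {"job-a": "101", "job-b": "210", "job-d": "410"}

drift = find_cache_drift(cached, live)
print(drift)
```

An orphaned or stale entry of this kind would explain why a restart, which rebuilds the cache from scratch, makes scheduling work again.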

@Monokaix
Member

Monokaix commented Jun 20, 2024

First, an update conflict on a job is a normal case, and the scheduler will retry scheduling it.
Second, retrying the job deletion is normal when a pod of the job is deleted: the job in the cache is finally deleted once the pod has been removed from etcd, that is, once the pod's graceful deletion has finished.
Also, you can try v1.9.0; PR #3144 fixed some problems with slow retries of job deletion. @jorahbi
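
The second point can be pictured with a rough sketch like the one below. It assumes a simple retry queue and is not Volcano's implementation: `process_delete_queue`, the `on_round` hook, and the log wording are hypothetical and only mimic the behaviour described, where a job leaves the cache once its pods are gone.

```python
from collections import deque


def process_delete_queue(cache, pods_by_job, queue, max_rounds=10, on_round=None):
    """Retry cache deletion; a job is dropped only once it has no pods left."""
    log = []
    for round_no in range(1, max_rounds + 1):
        if not queue:
            break
        if on_round is not None:
            on_round(round_no)  # hook used below to simulate pod termination
        for _ in range(len(queue)):
            job = queue.popleft()
            if pods_by_job.get(job):
                # Pods still exist (e.g. graceful deletion in progress): try again later.
                log.append(f"round {round_no}: retry to delete job {job}")
                queue.append(job)
            else:
                cache.discard(job)
                log.append(f"round {round_no}: deleted job {job} from cache")
    return log


cache = {"job-a"}
pods_by_job = {"job-a": ["pod-0"]}
queue = deque(["job-a"])


def finish_termination(round_no):
    # The pod is finally removed from the store in round 3.
    if round_no == 3:
        pods_by_job["job-a"].clear()


log = process_delete_queue(cache, pods_by_job, queue, on_round=finish_termination)
print("\n".join(log))
```

If a pod were never removed, this loop would keep logging retries until `max_rounds`, which resembles the repeated "Retry to delete job" messages reported in this issue.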

@jorahbi
Author

jorahbi commented Aug 28, 2024

This is a vcjob. Under what circumstances would this task need to be deleted? As a result, scheduling keeps failing until the Volcano scheduler pod is restarted.

[Screenshot: version]
vc-scheduler log
[Screenshot: vc-scheduler]
vc-controller log
[Screenshot: vc-controllers]

@jorahbi
Author

jorahbi commented Aug 28, 2024

@Monokaix @lowang-bh Could you please help clear up my doubts? Thank you very much.


3 participants