Jobs failing due to stalling more than the allowable limit don't respect the retry strategy. #651

Closed
Derbdale opened this issue Jul 20, 2021 · 2 comments
Labels
enhancement New feature or request

Comments

@Derbdale

Expected Behaviour
When a job stalls more than the configured limit, I would expect it to be retried in "x minutes" as per the retry strategy. I would expect the stall counter to be reset and the job to be queued to be tried again later (perhaps my processor has gone down and will be back up in 3 minutes' time?).

Actual Behaviour
When a job stalls more than the configured limit, it fails straight away without respecting the retry strategy.

I've had a look around and can't find anywhere that explicitly states what the intended behaviour is. If this functionality has to be implemented manually when desired, that should probably be documented somewhere.

@manast manast added the enhancement New feature or request label Jul 21, 2021
@manast

manast commented Jul 21, 2021

The rationale is that a job stalling is a very rare situation, and if it happens several times for the same job then we had better fail it and leave it to be inspected by an operator. A retry strategy may be used for retrying a job that is failing transiently; maybe it is calling a third-party API that is down at the moment, so we want to retry the job a couple of times before giving up. I'm not sure that having the retry mechanism also apply to a job that failed due to stalling is the behaviour we want...
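
To make the distinction concrete, here is a minimal sketch of the decision described above. It is not the library's code: `FailureKind`, `RetryPolicy`, `Decision` and `decide` are hypothetical names, and exponential backoff is only one assumed example of a retry strategy.

```typescript
// Hypothetical model for illustration only; none of these names are the library's API.
type FailureKind = "transient" | "stalled";

interface RetryPolicy {
  maxAttempts: number; // total attempts allowed for transiently failing jobs
  baseDelayMs: number; // delay before the first retry
}

type Decision =
  | { action: "retry"; delayMs: number }
  | { action: "fail"; reason: string };

// attemptsMade counts the attempts already finished, so it is 1 after the first failure.
function decide(kind: FailureKind, attemptsMade: number, policy: RetryPolicy): Decision {
  if (kind === "stalled") {
    // Stalling past the limit bypasses the retry strategy and fails the job
    // so that an operator can inspect it.
    return { action: "fail", reason: "stalled more than the allowable limit" };
  }
  if (attemptsMade >= policy.maxAttempts) {
    return { action: "fail", reason: "retry attempts exhausted" };
  }
  // Assumed exponential backoff: base, 2 * base, 4 * base, ...
  return { action: "retry", delayMs: policy.baseDelayMs * Math.pow(2, attemptsMade - 1) };
}
```

With `maxAttempts: 3`, a transiently failing job gets two retries before it is failed, while a job that exceeded the stall limit is failed immediately regardless of the policy.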

@Derbdale

Thanks Manuel,
I appreciate that this is likely an uncommon situation to be in, so I'm happy to put my own retry functionality in for my use-case.

This is probably just a case of documenting this behavior so that it's clear not to expect the usual retries on stalled jobs.
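
For anyone else landing here, this is roughly the shape of the manual retry I have in mind. It is only a sketch under assumptions: `FailedJobInfo`, `StallRetryOptions` and `planStallRetry` are hypothetical names rather than library API, and it assumes a stalled-limit failure can be recognised from the failure reason text (check the exact message your version produces).

```typescript
// Hypothetical shapes for illustration only; none of these names come from the library.
interface FailedJobInfo {
  id: string;
  failedReason: string;
  stallRetries: number; // how many times this wrapper has already re-queued the job
}

interface StallRetryOptions {
  maxStallRetries: number; // cap, so a permanently stalling job still ends up failed
  delayMs: number;         // how long to wait before the job is tried again
}

type StallRetryPlan =
  | { requeue: true; delayMs: number; stallRetries: number }
  | { requeue: false };

// Assumption: stalled-limit failures carry a reason containing this fragment.
const STALLED_REASON_FRAGMENT = "stalled more than";

function planStallRetry(job: FailedJobInfo, opts: StallRetryOptions): StallRetryPlan {
  const failedByStall =
    job.failedReason.toLowerCase().indexOf(STALLED_REASON_FRAGMENT) !== -1;
  if (!failedByStall || job.stallRetries >= opts.maxStallRetries) {
    // Either an ordinary failure (left to the normal retry strategy) or the cap was hit.
    return { requeue: false };
  }
  // Re-queue as a fresh delayed job so the stall counter effectively starts over.
  return { requeue: true, delayMs: opts.delayMs, stallRetries: job.stallRetries + 1 };
}
```

The caller would run this from whatever failure hook is available and, when `requeue` is true, add the job back with the returned delay (for example `3 * 60 * 1000` for the "back up in 3 minutes" case) while carrying `stallRetries` along in the job data.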

Just for extra info, the reason I'm getting stalled jobs fairly often is likely my setup: I have multiple "worker" processes in a Kubernetes cluster which are scaled based on queue size...
Sometimes the CPU on the cluster can spike, which causes the CPU to become "compressed" on the worker processes and in turn results in a stall until the high-CPU task has ended. There is likely more optimization I can do on the server side to reduce this too.
