Jobs failing due to stalling more than the allowable limit don't respect the retry strategy. #651

Derbdale · 2021-07-20T12:18:55Z

Expected Behaviour
When a job stalls more than the configured limit, I would expect it to be retried in "x minutes" as per the retry strategy. I would expect the stall counter to be reset and the job to be queued up to be tried again later (perhaps my processor has gone down and will be back up in 3 minutes' time?)

Actual Behaviour
When a job stalls more than the configured limit, it fails straight away without respecting the retry strategy.

I've had a look around and can't find anywhere that explicitly states what the intended behavior is. If this functionality has to be implemented manually when desired, that should probably be documented somewhere.

manast · 2021-07-21T08:09:40Z

The rationale is that a job stalling is a very rare situation, and if it happens several times for the same job then we had better fail it and leave it to be inspected by an operator. A retry strategy may be used for retrying a job that is transiently failing, for example one that calls a third-party API that is down at the moment, so we want to retry the job a couple of times before giving up. I'm not sure that applying the retry mechanism to a job that failed due to stalling is the behaviour we want...

Derbdale · 2021-07-21T08:18:40Z

Thanks Manuel,
I appreciate that this is likely an uncommon situation to be in, so I'm happy to put my own retry functionality in for my use-case.

This is probably just a case of documenting this behavior so that it's clear not to expect the usual retries on stalled jobs.

Just for extra info, the reason I'm seeing stalled jobs fairly often is likely my setup: I have multiple "worker" processes in a Kubernetes cluster which are scaled based on queue size...
Sometimes the CPU on the cluster can spike, which causes the CPU to become "compressed" on the worker processes and in turn results in a stall until the high-CPU task has ended. There is likely more optimization I can do on the server side to reduce this too.
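
For anyone else who ends up adding their own retry for stalled jobs, as mentioned above, here is a minimal sketch of the decision logic only. It is not the library's API: every name here (`FailedJobInfo`, `StallRetryPolicy`, `decideStallRetry`) is hypothetical, and it assumes the caller can read the job's failure reason, can tell a stall failure apart by a substring of that reason, and keeps its own count of how many times a job has already been re-queued after stalling.

```typescript
// Hypothetical shapes for illustration; none of these names come from a queue library.
interface FailedJobInfo {
  id: string;
  failedReason: string;
  // How many times the caller has already re-queued this job after a stall failure.
  stallRetries: number;
}

interface StallRetryPolicy {
  maxStallRetries: number;
  baseDelayMs: number;
  // Assumption: a substring that appears in the failure reason of stalled jobs.
  // Check the actual message your queue produces before relying on it.
  stalledReasonMarker: string;
}

type RetryDecision = { retry: true; delayMs: number } | { retry: false };

// Decide whether a job that failed should be re-queued, and after what delay.
// Only stall failures are considered; other failures are left to the normal retry strategy.
function decideStallRetry(job: FailedJobInfo, policy: StallRetryPolicy): RetryDecision {
  const failedByStall = job.failedReason.includes(policy.stalledReasonMarker);
  if (!failedByStall || job.stallRetries >= policy.maxStallRetries) {
    return { retry: false };
  }
  // Exponential backoff: baseDelayMs, then 2x, 4x, ... for each earlier stall retry.
  return { retry: true, delayMs: policy.baseDelayMs * 2 ** job.stallRetries };
}
```

The caller would run this from whatever "job failed" hook is available and, when `retry` is true, add the job back with the returned delay and an incremented `stallRetries`. Capping the count keeps a job that stalls every time from looping forever, which preserves the "leave it for an operator" behaviour once the cap is reached.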

manast added the "enhancement" (New feature or request) label on Jul 21, 2021

Derbdale closed this as completed Jul 21, 2021
