Expected Behaviour
When a job stalls more than the configured limit, I would expect it to be retried in "x minutes" as per the retry strategy: the stall counter is reset and the job is queued to run again later (perhaps my processor has gone down and will be back up in three minutes' time?).
Actual Behaviour
When a job stalls more than the configured limit, it fails straight away without respecting the retry strategy.
I've had a look around and can't find anywhere that explicitly states the intended behaviour. If this functionality has to be implemented manually when desired, that should probably be documented somewhere.
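To make the expected behaviour concrete, here is a minimal sketch of the requeue decision I'd expect when the stall limit is hit. All names here (`handleStallLimitExceeded`, the job fields, the backoff shape) are hypothetical illustrations of the logic, not Bull's actual internals:

```javascript
// Sketch of the expected behaviour: a job that exceeds the stall limit is
// rescheduled according to the retry strategy instead of failing outright.
// Hypothetical names; this is not Bull's real API.
function handleStallLimitExceeded(job, backoff) {
  if (job.attemptsMade >= job.maxAttempts) {
    // Retries exhausted: failing permanently is reasonable here.
    return { action: 'fail', reason: 'retries exhausted' };
  }
  return {
    action: 'requeue',
    // Reset the stall counter so the next run gets a fresh allowance.
    stallCount: 0,
    // Delay per the configured strategy, e.g. a fixed 3-minute backoff
    // or exponential growth with the attempt count.
    delayMs:
      backoff.type === 'fixed'
        ? backoff.delay
        : backoff.delay * 2 ** job.attemptsMade,
  };
}
```

With a fixed 3-minute backoff, a job on its second attempt would be requeued with `delayMs: 180000` and a reset stall counter, matching the "back up in 3 minutes" scenario above.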
The rationale is that a job stalling is a very rare situation, and if it happens several times for the same job, we are better off failing it and leaving it to be inspected by an operator. A retry strategy is meant for a job that is failing transiently, e.g. it calls a third-party API that is down at the moment, so we want to retry a couple of times before giving up. I'm not sure that applying the retry mechanism to a job that failed due to stalling is the behaviour we want...
Thanks Manuel,
I appreciate that this is likely an uncommon situation to be in, so I'm happy to put my own retry functionality in for my use-case.
This is probably just a case of documenting this behavior so that it's clear not to expect the usual retries on stalled jobs.
Just for extra info, the reason I'm getting fairly common stalled jobs is likely due to my setup - I have multiple "worker" processes in a kubernetes cluster which are scaled based on queue size...
Sometimes CPU usage on the cluster spikes, which throttles ("compresses") the CPU available to the worker processes and in turn causes a stall until the high-CPU task has ended. There is likely more optimisation I can do on the server side to reduce this too.
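The manual retry I mention above could be sketched like this. The decision helper is hypothetical; the wiring assumes Bull's `failed` event and `Job#retry()`, and that a stall failure is distinguishable by its error message, all of which should be checked against the Bull version you actually run:

```javascript
// Decide whether a failed job should be manually re-queued because it
// stalled, rather than because its processor threw. Hypothetical helper.
function shouldRetryStalled(errMessage, attemptsMade, maxRetries) {
  // Assumption: stall failures carry "stalled" in the error message.
  return /stalled/i.test(errMessage) && attemptsMade < maxRetries;
}

// Illustrative wiring into a Bull queue (not run here; requires Redis):
// queue.on('failed', async (job, err) => {
//   if (shouldRetryStalled(err.message, job.attemptsMade, 5)) {
//     await job.retry(); // puts the job back into the wait list
//   }
// });
```

Capping the manual retries (the `maxRetries` guard) matters: without it, a job that stalls deterministically would loop forever instead of ever surfacing to an operator.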