Job retry RFC #50078
Conversation
@cachedout Maybe a description would help.
[summary]: #summary

This feature enables Salt to detect when a minion did not run a job and, when the minion comes back online,
the job is published to that minion.
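As I read it, the flow is: remember jobs for minions that failed to return, then republish them when those minions reconnect. A minimal sketch of that flow in plain Python, under my own assumptions — the names `pending_jobs`, `record_missing_returns`, and `on_minion_start` are hypothetical, not Salt's real LocalClient/cache API:

```python
# Hypothetical sketch of the proposed retry flow (not Salt's real API).
pending_jobs = {}  # minion_id -> list of job payloads awaiting retry

def record_missing_returns(jid, payload, expected, returned):
    """Cache the job for every minion that did not return."""
    for minion in set(expected) - set(returned):
        pending_jobs.setdefault(minion, []).append(dict(payload, jid=jid))

def on_minion_start(minion_id, publish):
    """When a minion comes back online, republish its cached jobs."""
    for job in pending_jobs.pop(minion_id, []):
        publish(minion_id, job)

# Usage: web2 never returned, so its job is replayed when it restarts.
sent = []
record_missing_returns("jid-1", {"fun": "state.apply"},
                       expected=["web1", "web2"], returned=["web1"])
on_minion_start("web2", lambda minion, job: sent.append((minion, job["fun"])))
```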
Clarification needed: how does this cope with the cached job? Say the minion was terminated during the job. Should the job be republished, or retried from the cache? How reliable is the cache here? (I have seen the cache lose jobs, though.)
rfcs/0003-job-retry.md (outdated)
# Motivation
[motivation]: #motivation

For a long time, people have complained that Salt has no way to send a job to a minion that was not availalble
typo: available

Jobs to be retried are inserted into the cache by the LocalClient when it determines that a minion which it expected
to return did not.
This should also cope with minion restarts. For example, if one restarts salt-minion because of configuration changes but concurrently also schedules a package list refresh, the package list is not refreshed because that job is lost during the restart.
rfcs/0003-job-retry.md (outdated)
All operations will be gated behind a configuration option called `job_retry` which will default to being
disabled.

We propose a `job_retry` engine be created which can run on the master. It is the job of this engine to detect
There is too much clashing "job" terminology that means different things. 😉 Maybe "the purpose of this engine is to detect minion 'start' events..." instead.
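For what it's worth, such an engine would presumably watch for Salt's minion start events, which fire with tags of the form `salt/minion/<minion_id>/start`. A tiny sketch of matching those tags (the helper name `is_minion_start` is my own; a real engine would read tags off the event bus):

```python
import fnmatch

# Salt fires an event tagged salt/minion/<minion_id>/start when a
# minion comes up; the engine would key retries off this pattern.
MINION_START_TAG = "salt/minion/*/start"

def is_minion_start(tag):
    """Return True if an event tag looks like a minion start event."""
    return fnmatch.fnmatch(tag, MINION_START_TAG)
```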
## Alternatives
[alternatives]: #alternatives

No alternatives considered but ideas welcome.
My alternative try here:

1. The Master throws a job at the Minion, but first it travels from the Master's job cache to the Minion's job cache.
2. The Minion says "I've got it!"
3. The Master deletes the job from its cache.
4. The Minion looks into the "job cache" and polls it.
5. If a job is found, the Minion performs it and commits its deletion only when it is guaranteed finished.
6. In case the Minion has been restarted/crashed/killed/ripped by a power-down etc., the job is still there.
7. The Master is no longer needed here, so the Minion can just wake back up and continue from step 5.

Why I think this could be better than the original proposal:

- It does not need to be explicitly dependent on the Master.
- It can work with the LocalClient.
- Great benefit for Salt SSH! We should definitely reuse this mechanism for Salt SSH. I am thinking of something in an Emacs <-> emacsclient fashion for concurrent Salt SSH calls. In this case, if an already-running Salt SSH process is performing something, another Salt SSH spin won't create a whole isolated Minion within a local Master like `--local`, but will first check whether one is already running. Then it should just open the Minion job cache, put another job inside, and quit. The previous, already-running process will continue polling the job cache until it is empty, and thus will find the new job and perform it! This would resolve the chronic problem of simultaneously fired Salt SSH commands under the same user name.
```
[master]--->[job cache]---+
                          |
                          |  <<<< ...and then John Wick accidentally
+-------------------------+       shoots the network cable...
|
V
[job cache]--->[minion]
```
So then the "job cache" on the minion would have it.
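Steps 4–6 above can be sketched as a crash-safe, file-backed minion-side cache: the job file is deleted only after the job has finished, so a crash mid-job leaves the job on disk for the next poll. This is a toy illustration under my own assumptions — the class name `MinionJobCache` and the one-file-per-job layout are invented, not Salt's actual cache subsystem:

```python
import json
import os
import tempfile
import uuid

class MinionJobCache:
    """Toy minion-side job cache: one JSON file per pending job.

    Deletion is committed only after the job ran to completion,
    so restarts/crashes leave unfinished jobs in place (step 5/6).
    File names are random, so this sketch does not guarantee order.
    """

    def __init__(self, path):
        self.path = path
        os.makedirs(path, exist_ok=True)

    def put(self, job):
        # The master (or a concurrent Salt SSH call) enqueues a job.
        name = os.path.join(self.path, uuid.uuid4().hex + ".json")
        with open(name, "w") as fh:
            json.dump(job, fh)

    def poll(self, run):
        # The minion polls the cache and runs whatever it finds.
        for entry in sorted(os.listdir(self.path)):
            full = os.path.join(self.path, entry)
            with open(full) as fh:
                run(json.load(fh))
            os.remove(full)  # commit: only after run() returned

# Usage: a job survives in the cache until it is actually executed.
cache = MinionJobCache(tempfile.mkdtemp())
cache.put({"fun": "pkg.refresh_db"})
done = []
cache.poll(done.append)
```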
What if we used the Queue subsystem instead, and queued up jobs on a per-minion basis so that the jobs can be replayed in order when a minion comes online?
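The per-minion ordering idea can be sketched with plain Python FIFOs — this is only an illustration of the semantics, not the actual Queue subsystem API, and `enqueue`/`replay` are hypothetical names:

```python
from collections import defaultdict, deque

# One FIFO of pending jobs per minion, so replay preserves order.
queues = defaultdict(deque)

def enqueue(minion, job):
    """Record a job the minion missed, in arrival order."""
    queues[minion].append(job)

def replay(minion, publish):
    """Republish this minion's jobs in the order they were queued."""
    while queues[minion]:
        publish(minion, queues[minion].popleft())

# Usage: jobs queued while web1 was down come back in order.
enqueue("web1", "pkg.refresh_db")
enqueue("web1", "state.apply")
order = []
replay("web1", lambda minion, job: order.append(job))
```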
I really like the use of the engine system. I would like to see us dogfood the queue system instead of using the cache subsystem; personally I think that is a better format to store the job information.
2c. You want to be able to filter this, so it needs a filter option, with the default being "states" only. Also, it should squash the same request with the same parameters by default.
I concur with @damon-atkins, but more flexibility would be even nicer: could any job have a way to specify whether it wants to retry or not (maybe reuse the metadata mechanism)?
A use case of ours requires that we keep a queue of jobs that need to run sequentially, as well as jobs that can run in parallel (read-only jobs). I believe the Oxygen option PROCESS_COUNT_MAX, when set to 1, can give the queuing mechanism for sequential jobs (combined with the retry option in this RFC). If jobs could be tagged in a way that prevents retry, that would give us the flexibility to queue/retry only the jobs tagged as retry. @rares-pop
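Combining the suggestions above (filter defaulting to states only, per-job metadata opt-in/out, and squashing identical requests), a hedged sketch of what the retry decision could look like — `should_retry` and `squash` are invented names, and the `metadata`/`arg` job fields are assumptions about the job payload shape:

```python
def should_retry(job, allow=("state.",)):
    """Decide whether a missed job should be retried.

    Default filter: only state runs are retried, unless the job's
    metadata opts in or out explicitly (reusing the metadata idea).
    """
    meta = job.get("metadata", {})
    if "retry" in meta:
        return bool(meta["retry"])
    return job["fun"].startswith(allow)

def squash(jobs):
    """Drop repeated requests with identical function and arguments,
    keeping the first occurrence (order otherwise preserved)."""
    seen, out = set(), []
    for job in jobs:
        key = (job["fun"], tuple(job.get("arg", ())))
        if key not in seen:
            seen.add(key)
            out.append(job)
    return out

# Usage: two identical state runs collapse into one queued retry.
deduped = squash([
    {"fun": "state.apply", "arg": ["web"]},
    {"fun": "state.apply", "arg": ["web"]},
    {"fun": "test.ping"},
])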