Job retry RFC #50078
Merged · 3 commits merged into saltstack:develop on Nov 16, 2018
Conversation

cachedout (Contributor)

No description provided.

@cachedout added the "ZRETIRED - RFC retired, see SEP repo" label Oct 16, 2018
@cachedout cachedout requested review from a team October 16, 2018 15:26
@isbm (Contributor) commented Oct 16, 2018

@cachedout Maybe a description would help.

[summary]: #summary

This feature enables Salt to detect when a minion did not run a job and, when the minion comes back online,
publish the job to that minion.

Review comment (Contributor):

Clarification needed: how does this cope with the cached job? Say the minion was terminated during the job. Should the job be republished, or retried from the cache? And how reliable is the cache here? (I have seen the cache lose jobs, though.)

# Motivation
[motivation]: #motivation

For a long time, people have complained that Salt has no way to send a job to a minion that was not availalble

Review comment (Contributor):

typo: available


Jobs to be retried are inserted into the cache by the LocalClient when it determines that a minion which it expected
to return did not.

Review comment (Contributor):

This should also cope with minion restarts. For example, if one restarts salt-minion because of configuration changes but concurrently also schedules a package list refresh, the package list is not refreshed because that job is lost during the restart.
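
For the caching step quoted above, a minimal sketch (not the RFC's actual implementation) might look like the following, assuming the generic `salt.cache` interface; the bank name `job_retry/<minion>` and the helper functions are hypothetical:

```python
# Hypothetical sketch: record jobs an expected minion never returned for,
# using salt.cache; the bank layout here is invented for illustration.
import salt.cache

def cache_missed_job(opts, jid, minion_id, load):
    """Store the original publish payload so it can be replayed later."""
    cache = salt.cache.factory(opts)
    cache.store("job_retry/{}".format(minion_id), jid, load)

def pending_jobs(opts, minion_id):
    """Return the payloads still waiting to be retried for one minion."""
    cache = salt.cache.factory(opts)
    bank = "job_retry/{}".format(minion_id)
    return {jid: cache.fetch(bank, jid) for jid in cache.list(bank)}
```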

All operations will be gated behind a configuration option called `job_retry` which will default to being
disabled.

We propose a `job_retry` engine be created which can run on the master. It is the job of this engine to detect

Review comment (Contributor):

There is too much clashing "job" terminology that means different things. 😉 Maybe "the purpose of this engine is to detect minion 'start' events....." instead.
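
A minimal sketch of what such an engine might look like, assuming minion start events arrive on the master event bus under the `salt/minion/<id>/start` tag and that cached jobs are republished through `LocalClient`; the cache bank follows the hypothetical layout sketched earlier, and none of this is the final design:

```python
# Hypothetical job_retry engine: wait for minion "start" events and
# republish any jobs cached for that minion. Engines run their start()
# function on the master and get the master opts injected as __opts__.
import fnmatch
import salt.cache
import salt.client
import salt.utils.event

def start():
    # The whole feature is gated behind the job_retry option (default off).
    if not __opts__.get("job_retry", False):
        return

    event_bus = salt.utils.event.get_event(
        "master", sock_dir=__opts__["sock_dir"], opts=__opts__, listen=True
    )
    cache = salt.cache.factory(__opts__)
    local = salt.client.get_local_client(mopts=__opts__)

    while True:
        evt = event_bus.get_event(full=True)
        if not evt or not fnmatch.fnmatch(evt["tag"], "salt/minion/*/start"):
            continue
        minion_id = evt["tag"].split("/")[2]
        bank = "job_retry/{}".format(minion_id)   # hypothetical bank name
        for jid in cache.list(bank):
            load = cache.fetch(bank, jid)
            # Republish the original function and arguments to this minion only.
            local.cmd_async(minion_id, load["fun"], load.get("arg", []))
            cache.flush(bank, jid)
```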

## Alternatives
[alternatives]: #alternatives

No alternatives considered but ideas welcome.

@isbm (Contributor) commented Oct 16, 2018:

My alternative try here (a minimal minion-side sketch follows below):

  1. Master throws the job at the minion, but first it travels from the Master's job cache to the Minion's job cache.
  2. Minion says "I've got it!"
  3. Master deletes the job from its cache.
  4. Minion looks into its "job cache" and polls it.
  5. If a job is found, the Minion performs it and commits its deletion only once it is guaranteed finished.
  6. In case the Minion has been restarted/crashed/killed/ripped away by a power-down etc., the job is still there.
  7. The Master is no longer needed here, so the Minion can simply wake back up and continue from step 5.

Why I think this could be better than the original proposal:

  • It does not need to be explicitly dependent on the Master.
  • It can work with LocalClient.
  • Great benefit for Salt SSH! We should definitely reuse this mechanism for Salt SSH. I am thinking of something like the Emacs <-> emacs-client fashion for concurrent Salt SSH calls. If an already running Salt SSH process is performing something, another Salt SSH invocation would not create the whole isolated Minion within a local Master (as --local does), but would check whether one is already running. It would then just open the Minion Job Cache, put another job inside, and quit. The previous, already running process keeps polling the Job Cache until it is empty, and thus finds the new job and performs it. This would resolve the chronic problem of simultaneously fired Salt SSH commands under the same user name.
[master]--->[job cache]---+
                          |
                          |       <<<< ...and then John Wick accidentally
  +-----------------------+               shoots the network cable...
  |
  V
[job cache]--->[minion]

So then "job cache" on the minion would have it.

Review comment (Contributor):

What if we used the Queue subsystem instead, and queued up jobs on a per-minion basis so that the jobs can be replayed in order when a minion comes online?

@gtmanfred (Contributor) left a comment:

I really like the use of the engine system. I would like to see us dogfood the queue system instead of using the cache subsystem; personally I think that is a better format in which to store the job information.
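
A rough sketch of that queue-based alternative, assuming the existing `queue` runner functions (`queue.insert`, `queue.list_length`, `queue.pop`) driven through `RunnerClient`; the per-minion queue name and the shape of the queued item are hypothetical:

```python
# Hypothetical sketch: store missed jobs in Salt's queue subsystem, one queue
# per minion, so they can be replayed in order when the minion comes back.
import salt.config
import salt.runner

def queue_missed_job(minion_id, load):
    """Append the original publish payload to this minion's retry queue."""
    opts = salt.config.master_config("/etc/salt/master")
    runner = salt.runner.RunnerClient(opts)
    runner.cmd("queue.insert", ["job_retry_{}".format(minion_id), load])

def replay_queued_jobs(minion_id):
    """Drain the per-minion queue in order once the minion is back online."""
    opts = salt.config.master_config("/etc/salt/master")
    runner = salt.runner.RunnerClient(opts)
    queue = "job_retry_{}".format(minion_id)
    length = runner.cmd("queue.list_length", [queue])
    return runner.cmd("queue.pop", [queue], kwarg={"quantity": length}) if length else []
```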

@damon-atkins (Contributor) commented Oct 18, 2018

2c. You want to be able to filter this, so it needs a filter option, with the default being "states" only.
E.g. `salt \* test.ping` or `salt \* disk.usage` should not be replayed, etc.

Also, it should squash identical requests with the same parameters by default.
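
A sketch of what that filter and squash behaviour might look like; the whitelist default, its contents, and the dedup key are all invented for illustration:

```python
# Hypothetical filter + squash logic for deciding which missed jobs to replay.
import json

DEFAULT_WHITELIST = ["state.apply", "state.highstate", "state.sls"]

def should_retry(load, whitelist=None):
    """Only replay whitelisted functions (states by default)."""
    whitelist = whitelist or DEFAULT_WHITELIST
    return load["fun"] in whitelist   # e.g. test.ping, disk.usage are skipped

def squash(loads):
    """Collapse repeated requests that share a function and arguments."""
    seen = set()
    unique = []
    for load in loads:
        key = (load["fun"], json.dumps(load.get("arg", []), sort_keys=True))
        if key not in seen:
            seen.add(key)
            unique.append(load)
    return unique
```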

@rares-pop (Contributor)

I concur with @damon-atkins, but more flexibility would be even nicer: could any job have a way to specify whether or not it wants to be retried (maybe by reusing the metadata mechanism)?

@adelcast (Contributor)

A use case of ours requires that we keep a queue of jobs that need to run sequentially, as well as jobs that can run in parallel (read-only jobs). I believe the Oxygen option PROCESS_COUNT_MAX, when set to 1, can provide the queuing mechanism for sequential jobs (combined with the retry option in this RFC). If jobs could be tagged in a way that prevents retry, that could give us the flexibility to queue/retry only the jobs that are tagged for retry. @rares-pop
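
Building on the suggestion to reuse the metadata mechanism, a per-job opt-in/opt-out could be expressed roughly like this; the `retry` key and its default are invented for the example:

```python
# Hypothetical check against the metadata already attached to a job, e.g.
#   salt '*' state.apply metadata='{"retry": false}'
# so individual jobs can opt out of (or into) being replayed.

def retry_allowed(load, default=True):
    """Respect a per-job metadata flag that decides whether to replay it."""
    metadata = load.get("metadata") or {}
    return bool(metadata.get("retry", default))
```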

@rallytime rallytime merged commit ae4f0da into saltstack:develop Nov 16, 2018
alexey-zhukovin pushed a commit to alexey-zhukovin/salt that referenced this pull request May 5, 2020
@alexey-zhukovin added the "has master-port" (port to master has been created) label May 5, 2020
dwoz pushed a commit that referenced this pull request May 9, 2020