Job retry RFC #50078
Merged · 3 commits merged into saltstack:develop on Nov 16, 2018
Conversation

cachedout (Contributor)

No description provided.

@cachedout added the "ZRETIRED - RFC retired, see SEP repo" label Oct 16, 2018
@cachedout cachedout requested review from a team October 16, 2018 15:26
@isbm (Contributor) commented Oct 16, 2018

@cachedout Maybe a description would help.

[summary]: #summary

This feature enables Salt to detect when a minion did not run a job and, when the minion comes back online,
publish the job to that minion.

Review comment (Contributor):

Clarification needed: how does this cope with the cached job? Say the minion was terminated during the job. Should the job be republished, or retried from the cache? And how reliable is the cache here? (I have seen the cache lose jobs, though.)

# Motivation
[motivation]: #motivation

For a long time, people have complained that Salt has no way to send a job to a minion that was not availalble

Review comment (Contributor):

typo: available


Jobs to be retried are inserted into the cache by the LocalClient when it determines that a minion which it expected
to return did not.

Review comment (Contributor):

This should also cope with minion restarts. For example, if one restarts salt-minion because of configuration changes but concurrently also schedules a package list refresh, the package list is not refreshed because that job is lost during the restart.
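
For the caching step quoted above, a minimal sketch (not the RFC's actual implementation) might look like the following, assuming the generic `salt.cache` interface; the bank name `job_retry/<minion>` and the helper functions are hypothetical:

```python
# Hypothetical sketch: record jobs an expected minion never returned for,
# using salt.cache; the bank layout here is invented for illustration.
import salt.cache

def cache_missed_job(opts, jid, minion_id, load):
    """Store the original publish payload so it can be replayed later."""
    cache = salt.cache.factory(opts)
    cache.store("job_retry/{}".format(minion_id), jid, load)

def pending_jobs(opts, minion_id):
    """Return the payloads still waiting to be retried for one minion."""
    cache = salt.cache.factory(opts)
    bank = "job_retry/{}".format(minion_id)
    return {jid: cache.fetch(bank, jid) for jid in cache.list(bank)}
```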

All operations will be gated behind a configuration option called `job_retry` which will default to being
disabled.

We propose a `job_retry` engine be created which can run on the master. It is the job of this engine to detect

Review comment (Contributor):

There is too much clashing "job" terminology that means different things. 😉 Maybe "the purpose of this engine is to detect minion 'start' events....." instead.
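
A minimal sketch of what such an engine might look like, assuming minion start events arrive on the master event bus under the `salt/minion/<id>/start` tag and that cached jobs are republished through `LocalClient`; the cache bank follows the hypothetical layout sketched earlier, and none of this is the final design:

```python
# Hypothetical job_retry engine: wait for minion "start" events and
# republish any jobs cached for that minion. Engines run their start()
# function on the master and get the master opts injected as __opts__.
import fnmatch
import salt.cache
import salt.client
import salt.utils.event

def start():
    # The whole feature is gated behind the job_retry option (default off).
    if not __opts__.get("job_retry", False):
        return

    event_bus = salt.utils.event.get_event(
        "master", sock_dir=__opts__["sock_dir"], opts=__opts__, listen=True
    )
    cache = salt.cache.factory(__opts__)
    local = salt.client.get_local_client(mopts=__opts__)

    while True:
        evt = event_bus.get_event(full=True)
        if not evt or not fnmatch.fnmatch(evt["tag"], "salt/minion/*/start"):
            continue
        minion_id = evt["tag"].split("/")[2]
        bank = "job_retry/{}".format(minion_id)   # hypothetical bank name
        for jid in cache.list(bank):
            load = cache.fetch(bank, jid)
            # Republish the original function and arguments to this minion only.
            local.cmd_async(minion_id, load["fun"], load.get("arg", []))
            cache.flush(bank, jid)
```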

## Alternatives
[alternatives]: #alternatives

No alternatives considered but ideas welcome.

@isbm (Contributor) commented Oct 16, 2018:

My alternative try here (a minimal minion-side sketch follows below):

  1. Master throws the job at the minion, but first it travels from the Master's job cache to the Minion's job cache.
  2. Minion says "I've got it!"
  3. Master deletes the job from its cache.
  4. Minion looks into its "job cache" and polls it.
  5. If a job is found, the Minion performs it and commits its deletion only once it is guaranteed finished.
  6. In case the Minion has been restarted/crashed/killed/ripped away by a power-down etc., the job is still there.
  7. The Master is no longer needed here, so the Minion can simply wake back up and continue from step 5.

Why I think this could be better than the original proposal:

  • It does not need to be explicitly dependent on the Master.
  • It can work with LocalClient.
  • Great benefit for Salt SSH! We should definitely reuse this mechanism for Salt SSH. I am thinking of something like the Emacs <-> emacs-client fashion for concurrent Salt SSH calls. If an already running Salt SSH process is performing something, another Salt SSH invocation would not create the whole isolated Minion within a local Master (as --local does), but would check whether one is already running. It would then just open the Minion Job Cache, put another job inside, and quit. The previous, already running process keeps polling the Job Cache until it is empty, and thus finds the new job and performs it. This would resolve the chronic problem of simultaneously fired Salt SSH commands under the same user name.
[master]--->[job cache]---+
                          |
                          |       <<<< ...and then John Wick accidentally
  +-----------------------+               shoots the network cable...
  |
  V
[job cache]--->[minion]

So then "job cache" on the minion would have it.

Review comment (Contributor):

What if we used the Queue subsystem instead, and queued up jobs on a per-minion basis so that the jobs can be replayed in order when a minion comes online?

@gtmanfred (Contributor) left a comment:

I really like the use of the engine system. I would like to see us dogfood the queue system instead of using the cache subsystem; personally I think that is a better format in which to store the job information.
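
A rough sketch of that queue-based alternative, assuming the existing `queue` runner functions (`queue.insert`, `queue.list_length`, `queue.pop`) driven through `RunnerClient`; the per-minion queue name and the shape of the queued item are hypothetical:

```python
# Hypothetical sketch: store missed jobs in Salt's queue subsystem, one queue
# per minion, so they can be replayed in order when the minion comes back.
import salt.config
import salt.runner

def queue_missed_job(minion_id, load):
    """Append the original publish payload to this minion's retry queue."""
    opts = salt.config.master_config("/etc/salt/master")
    runner = salt.runner.RunnerClient(opts)
    runner.cmd("queue.insert", ["job_retry_{}".format(minion_id), load])

def replay_queued_jobs(minion_id):
    """Drain the per-minion queue in order once the minion is back online."""
    opts = salt.config.master_config("/etc/salt/master")
    runner = salt.runner.RunnerClient(opts)
    queue = "job_retry_{}".format(minion_id)
    length = runner.cmd("queue.list_length", [queue])
    return runner.cmd("queue.pop", [queue], kwarg={"quantity": length}) if length else []
```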

@damon-atkins (Contributor) commented Oct 18, 2018

2c. You want to be able to filter this, so it needs a filter option, with the default being "states" only.
E.g. `salt \* test.ping` or `salt \* disk.usage` should not be replayed, etc.

Also, it should squash identical requests with the same parameters by default.
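
A sketch of what that filter and squash behaviour might look like; the whitelist default, its contents, and the dedup key are all invented for illustration:

```python
# Hypothetical filter + squash logic for deciding which missed jobs to replay.
import json

DEFAULT_WHITELIST = ["state.apply", "state.highstate", "state.sls"]

def should_retry(load, whitelist=None):
    """Only replay whitelisted functions (states by default)."""
    whitelist = whitelist or DEFAULT_WHITELIST
    return load["fun"] in whitelist   # e.g. test.ping, disk.usage are skipped

def squash(loads):
    """Collapse repeated requests that share a function and arguments."""
    seen = set()
    unique = []
    for load in loads:
        key = (load["fun"], json.dumps(load.get("arg", []), sort_keys=True))
        if key not in seen:
            seen.add(key)
            unique.append(load)
    return unique
```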

@rares-pop (Contributor)

I concur with @damon-atkins, but more flexibility would be even nicer: could any job have a way to specify whether or not it wants to be retried (maybe by reusing the metadata mechanism)?

@adelcast (Contributor)

A use case of ours requires that we keep a queue of jobs that need to run sequentially, as well as jobs that can run in parallel (read-only jobs). I believe the Oxygen option PROCESS_COUNT_MAX, when set to 1, can provide the queuing mechanism for sequential jobs (combined with the retry option in this RFC). If jobs could be tagged in a way that prevents retry, that could give us the flexibility to queue/retry only the jobs that are tagged for retry. @rares-pop
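
Building on the suggestion to reuse the metadata mechanism, a per-job opt-in/opt-out could be expressed roughly like this; the `retry` key and its default are invented for the example:

```python
# Hypothetical check against the metadata already attached to a job, e.g.
#   salt '*' state.apply metadata='{"retry": false}'
# so individual jobs can opt out of (or into) being replayed.

def retry_allowed(load, default=True):
    """Respect a per-job metadata flag that decides whether to replay it."""
    metadata = load.get("metadata") or {}
    return bool(metadata.get("retry", default))
```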

@rallytime rallytime merged commit ae4f0da into saltstack:develop Nov 16, 2018
alexey-zhukovin pushed a commit to alexey-zhukovin/salt that referenced this pull request May 5, 2020
@alexey-zhukovin added the "has master-port" (port to master has been created) label May 5, 2020
dwoz pushed a commit that referenced this pull request May 9, 2020