[FR] add wait timer to batch mode #1038
I will put this on post-1.0.0; I think it is a good idea.
I know this has been marked for post-1.0, but is there any chance of squeezing it in sooner? If not, we'll look at creating a quick wrapper; I wouldn't have a clue where to even start coding it into the main app.
I am happy to bump it up; the problem I have right now is that I am beating down bugs and have a lot of backlogged features. The batch system also does not run in fixed groups, but instead makes sure that n minions are working at a time.
Thanks for your response thatch45, fully understand your time commitments; we'll have a look at either adding to batch.py or writing a quick wrapper. Our problem is that for things like tomcat, the init script returns almost immediately, but the daemon can take up to 30 seconds to fully start before it will respond to requests. In cases like that, we need an extra buffer to ensure there's no drop in service.
That in and of itself is something to consider: whether we should have some provision to wait for daemons to "really" start up. Many daemons behave this way.
I believe this is also how some of the newer init replacements like systemd and upstart function. They return as soon as they have spawned the various bits that then go on to spawn the services. It might be worth bumping up for that alone. Just my 2 cents.
I have applications with a similar problem: after a restart they may take a few minutes to preload caches. Since that preload (and other tasks) happens in a uWSGI worker process, salt will not wait for it to finish. A delay would be a simple mechanism to use here, but perhaps this is better handled with a verify-state method that can check whether 1) a restart is still running, or 2) it failed in some way, for example a daemon that started but aborted after backgrounding.
I have been doing more work in this area recently and will try to revisit this for 0.16.0 |
I was wondering if there is any update on the progress of the wait timer for batch.
+1 For now I'm coping by adding an explicit sleep to the command to run: cmd.run "restart foo; sleep 30". This naturally works for calls like cmd.run but not state.sls.
Issues get forgotten from time to time. I'll put this on the current milestone so we see it more often. |
Thank you for your reply. Really much appreciated. |
Bump, nudge & tickle :) Would still love to see this feature. The reasons are exactly as explained above (in a distributed/load-balanced service, you don't want to "shock" all the nodes by restarting at once) |
So the idea would be instead of starting the next batch immediately after the first batch finished and returned, we would wait 30 seconds after receiving all the returns? Or would it be just a hard 30 second wait between batches, without waiting for returns? Just want to clarify. |
@basepi - IMO a hard wait between batches runs isn't intuitive as a user would need to add up 'time to restart the service' + 'service start time' in order to identify a reasonable wait period (although it could be renamed as something like a frequency parameter). Waiting after a batch runs and returns seems more sensible because then it's just the 'service start time' to account for - I think that would also tie in better with logic where if returns aren't successful, then it doesn't continue. |
I suppose that makes sense. One thing you could do as a workaround for now is add a sleep to the end of your state run to just bake this delay in. |
FWIW I hadn't imagined it per-batch, but really per minion, within batch context. In salt the batch system maintains a window of running minions. That means if you say in "batches of 2" and it executes on 2 minions - when one finishes, the next one is scheduled in that "slot" right away, so you always have 2 minions working. Think "2 workers" that are being fed work to do (each work unit being a minion) independent of each other, until there is no more work to do. I had imagined the batch delay to be a delay the "worker" applies, after executing the work on a minion, before announcing "ok I'm done" and accepting more work. That way, I can say "with a sleep of 30 seconds" to allow my just-restarted service to go back into service, and this is easy to reason about with any number of minions and any amount of batch size. |
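The sliding-window behavior described above can be sketched in plain Python. This is a hypothetical illustration of the proposed per-minion post-job delay, not Salt's actual batch implementation; the function name `run_batched` and the stubbed job are made up:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def run_batched(minions, batch_size, delay, job):
    """Run `job` on every minion, at most `batch_size` at a time,
    sleeping `delay` seconds in a slot after each job completes
    before that slot accepts more work."""
    def worker(minion):
        result = job(minion)
        time.sleep(delay)  # the proposed per-minion post-job wait
        return result

    # The pool maintains the "window": as soon as one minion finishes
    # (and its delay elapses), the next one is scheduled in that slot.
    with ThreadPoolExecutor(max_workers=batch_size) as pool:
        return dict(zip(minions, pool.map(worker, minions)))

# Example: a stubbed "restart" on five minions, two at a time, 0.1s wait.
results = run_batched([f"web{i}" for i in range(5)], 2, 0.1,
                      lambda m: f"{m}: restarted")
```

The delay lives inside the worker slot rather than between whole batches, which is why it reasons the same way for any number of minions and any batch size.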
Makes sense! Thanks for the explanation. |
@basepi So after a bit of rubber ducking I realized this actually, from my perspective, has nothing to do with batches per se. If we think of it as a "post-minion-state-sleep", it'd work naturally within or without batches, again over any number of minions and optionally any batch size. |
@minaguib Thanks for the update -- I still think that implementing this inside of the batch system is the right place, since it's still the primary use case, and that will help it to apply to not only states, but also remote execution calls. |
👍 |
Would love to see this implemented! Strangely enough just implemented the same solution as @minaguib and then thought "There must be a Salt-way" and found this issue 👍 |
this would be quite convenient. we run on 60k+ physical nodes. depending on call (i.e. cmd.run "service openvswitch-switch restart") we have to get kinda crafty with our salt commands if we want to avoid data plane hits. e.g. bash script with built in wait timer. |
A colleague of mine discovered a work-around. It turns out that the salt command allows invoking different functions and arguments in the same invocation, which we can use to combine state.sls with a cmd.run sleep. The syntax is a bit awkward, and the order is also weird, but it is good enough for our use case: the cmd.run("sleep 60") gets executed before the state.sls(statename).
+1 |
+1 |
No news, but thanks for the bump, I've put some eyes on it. |
Related to this, I think it would be useful if there were a way to ping a web server before moving on. Sometimes one may implement a /ping path that returns 200 if the server is up and ready to serve requests. This could be more reliable and quicker than a fixed sleep. For example, what if there is a bug? It'd be nice to catch it immediately before rolling it out to all instances. Maybe the service.running state could have a test option of some sort?
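As a rough illustration of that idea, polling a health endpoint until it answers 200 can replace a fixed sleep. This is a sketch only; the function name, URL shape, and timeout values are invented here and are not a Salt API:

```python
import time
import urllib.request
import urllib.error

def wait_until_healthy(url, timeout=30.0, interval=1.0):
    """Return True once `url` answers HTTP 200, or False if
    `timeout` seconds elapse first."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # server not accepting connections yet; keep polling
        time.sleep(interval)
    return False
```

Compared to a hard sleep, this returns as soon as the service is actually ready, and a False return gives you a chance to stop the rollout instead of blindly moving to the next batch.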
@matthayes I'm not convinced that this belongs in batch. I think it's doable without batch mode, and it's hard to generalize inside of batch mode in a useful way.
Here are some resources on custom modules: https://docs.saltstack.com/en/latest/ref/modules/. Does that make sense? Also, the original issue is now implemented, so I'm going to close this issue.
Thanks @basepi , that makes sense. For point 2, how can the minion check its own web service in a state run? Would there need to be a state that checks the web service is responding when it is evaluated? Does such a state exist that can be used? If I were to write a custom module I would need to use something like state.orchestrate as in point 1 right? |
I just noticed you can call modules from states. Is this how you'd suggest doing it if I pursued point 2? It seems like I could write a custom module that makes an HTTP call and fails if the response is not 200. Then it seems I could do a normal state run with a wait between each node and it would work great. |
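A minimal sketch of such a custom execution module might look like the following. The module and function names are hypothetical; a real Salt module would typically raise salt.exceptions.CommandExecutionError, but a plain exception is used here to keep the example self-contained:

```python
# Hypothetical custom execution module, e.g. saved under _modules/ on the
# master and synced to minions with saltutil.sync_modules.
import urllib.request
import urllib.error

def assert_up(url):
    """Return True if `url` answers HTTP 200; raise otherwise, so a
    state run that calls this fails instead of silently continuing."""
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            status = resp.status
    except urllib.error.URLError as exc:
        # Covers connection refused and non-2xx responses (HTTPError
        # is a subclass of URLError).
        raise RuntimeError(f"health check failed for {url}: {exc}")
    if status != 200:
        raise RuntimeError(f"health check for {url} returned {status}")
    return True
```

From a state run, this could then be invoked via the module.run state after the service restart, so a failing health check fails the run for that minion before the rollout proceeds.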
You could also easily write a custom state module. It's not covered in as much detail in that YouTube video, but it's just as easy. But yes, you have the idea spot on.
Saw this issue when searching for the feature. As it has not been released yet, I figured out an easy hack for this case: simply run a sleep afterwards.
For highstate, I just put a cmd.run sleep state at the end of the top file with the same idea. Leaving it here as it could be helpful until this feature is released.
That's a decent hack if you want to arbitrarily wait (assuming you don't know how long the service will take to come back up).
Feature Request.
There are times when an application can restart almost immediately, but doesn't necessarily start responding for x seconds. In these cases, I'd like the ability to add a wait timer in the batch call.
e.g.
salt -G 'role:webserver' -b 2 --batch-wait 30 cmd.run 'service tomcat restart'
Where it would iterate through all servers matching the grain, performing two at a time, with a 30-second wait between each batch.
Many thanks