
[FR] add wait timer to batch mode #1038

Closed
halberom opened this issue Apr 3, 2012 · 35 comments
Labels: Core, Feature, fixed-pls-verify, ZRELEASED - Boron

halberom commented Apr 3, 2012

Feature Request.

There are times when an application can restart almost immediately, but doesn't necessarily start responding for x seconds. In these cases, I'd like the ability to add a wait timer in the batch call.

e.g.
salt -G 'role:webserver' -b 2 --batch-wait 30 cmd.run 'service tomcat restart'

Where it would iterate through all servers matching the grain, running 2 at a time with a 30-second wait between each batch.

Many thanks


thatch45 commented Apr 3, 2012

I will put this on post-1.0.0; I think it is a good idea.


halberom commented Aug 6, 2012

I know this has been marked for post-1.0; is there any chance of squeezing it in sooner? If not, we'll look at creating a quick wrapper - I wouldn't have a clue where to even start coding it into the main app.


thatch45 commented Aug 6, 2012

I am happy to bump it up; the problem I have right now is that I am beating down bugs and have a lot of backlogged features. Note that the batch system does not execute in discrete groups; rather, it makes sure that n minions are working at any given time.
But the place to put the code would be here:
https://github.com/saltstack/salt/blob/develop/salt/cli/batch.py
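A minimal sketch of the per-batch wait being requested (run_batch, run_batches, and the wait parameter are hypothetical names for illustration, not Salt's actual batch code):

import time

def run_batch(batch):
    # Stand-in for dispatching the command to one batch of minions
    # and blocking until all of their returns come back.
    print("running on:", batch)

def run_batches(minions, batch_size, wait=0):
    # Proposed behavior: sleep after each batch returns, before the
    # next batch starts; no sleep is needed after the final batch.
    for i in range(0, len(minions), batch_size):
        run_batch(minions[i:i + batch_size])
        if wait and i + batch_size < len(minions):
            time.sleep(wait)

run_batches(["web1", "web2", "web3", "web4"], batch_size=2, wait=30)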


halberom commented Aug 7, 2012

Thanks for your response @thatch45, I fully understand your time commitments; we'll have a look at either adding to batch.py or writing a quick wrapper. Our problem is that for things like Tomcat, the init script returns almost immediately, but the daemon can take up to 30 seconds to fully start before it responds to requests. In cases like that, we need an extra buffer to ensure there are no drops in service.


thatch45 commented Aug 8, 2012

That in and of itself is worth considering: whether we should have some provision to wait for daemons to "really" start up. A lot of daemons behave this way.


SEJeff commented Nov 1, 2012

I believe this is also how some of the newer init replacements like systemd and upstart behave: they return as soon as they have spawned the various bits that then spawn the services. It might be worth bumping this up for that reason alone. Just my 2 cents.

ghost assigned thatch45 Nov 8, 2012

wichert commented Apr 15, 2013

I have applications with a similar problem: after a restart they may take a few minutes to preload caches. Since that preload (and other tasks) happens in a uWSGI worker process, Salt will not wait for it to finish. A delay would be a simple mechanism to use here, but perhaps this is better handled with a verify-state method that can check whether 1) a restart is still running, or 2) it failed in some way, for example a daemon that started but aborted after backgrounding.

thatch45 commented

I have been doing more work in this area recently and will try to revisit this for 0.16.0


nikosg commented Aug 16, 2013

I was wondering if there is any update on the progress of the wait timer for batch mode.
Thank you.

minaguib commented

+1

For now I'm coping by adding an explicit sleep to the command to run: cmd.run "restart foo; sleep 30". This naturally works for things like cmd.run but not state.sls.


basepi commented Aug 21, 2013

Issues get forgotten from time to time. I'll put this on the current milestone so we see it more often.


nikosg commented Aug 27, 2013

Thank you for your reply. Really much appreciated.
Will keep a close eye on this thread.

basepi modified the milestones: Helium, Hydrogen Release Feb 4, 2014
basepi modified the milestones: Approved, Helium Apr 21, 2014
minaguib commented

Bump, nudge & tickle :)

Would still love to see this feature. The reasons are exactly as explained above (in a distributed/load-balanced service, you don't want to "shock" all the nodes by restarting at once)


basepi commented Jul 23, 2014

So the idea would be instead of starting the next batch immediately after the first batch finished and returned, we would wait 30 seconds after receiving all the returns? Or would it be just a hard 30 second wait between batches, without waiting for returns? Just want to clarify.

halberom commented

@basepi - IMO a hard wait between batch runs isn't intuitive, since a user would need to add up 'time to restart the service' plus 'service start time' to pick a reasonable wait period (although it could be renamed to something like a frequency parameter). Waiting after a batch runs and returns seems more sensible, because then only the 'service start time' needs accounting for. I think that would also tie in better with logic where, if the returns aren't successful, the run doesn't continue.


basepi commented Jul 23, 2014

I suppose that makes sense. One thing you could do as a workaround for now is add a sleep to the end of your state run to just bake this delay in.

minaguib commented

FWIW I hadn't imagined it per-batch, but really per-minion, within the batch context.

In Salt, the batch system maintains a window of running minions. That means if you run in "batches of 2" and it executes on 2 minions, then when one finishes, the next one is scheduled in that "slot" right away, so you always have 2 minions working. Think "2 workers" that are fed work to do (each work unit being a minion) independently of each other, until there is no more work left.

I had imagined the batch delay as a delay the "worker" applies after executing the work on a minion, before announcing "OK, I'm done" and accepting more work.

That way, I can say "with a sleep of 30 seconds" to allow my just-restarted service to go back into service, and this is easy to reason about with any number of minions and any batch size.
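A minimal sketch of that sliding-window model (the names and the 30-second delay are illustrative, not Salt's API): two workers pull minions off a shared queue, and each holds its slot for the delay before taking more work.

import time
from concurrent.futures import ThreadPoolExecutor

def run_on_minion(minion, delay=30):
    # Stand-in for executing the job on one minion...
    print("running on:", minion)
    # ...then holding this worker's slot for `delay` seconds before it
    # announces "done" and accepts the next minion.
    time.sleep(delay)

# "Batches of 2": two workers process the minion queue independently.
with ThreadPoolExecutor(max_workers=2) as pool:
    list(pool.map(run_on_minion, ["web1", "web2", "web3", "web4"]))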


basepi commented Jul 24, 2014

Makes sense! Thanks for the explanation.

minaguib commented

@basepi So after a bit of rubber ducking I realized this actually, from my perspective, has nothing to do with batches per se.

If we think of it as a "post-minion-state sleep", it would work naturally with or without batches, again over any number of minions and optionally any batch size.


basepi commented Aug 1, 2014

@minaguib Thanks for the update -- I still think the batch system is the right place to implement this, since it's still the primary use case, and that will help it apply not only to states but also to remote execution calls.


lpgauth commented Aug 7, 2014

👍


rawkode commented Aug 28, 2014

Would love to see this implemented! Strangely enough, I had just implemented the same solution as @minaguib and then thought "there must be a Salt way" and found this issue.

👍

abrahamrhoffman commented

This would be quite convenient. We run on 60k+ physical nodes. Depending on the call (e.g. cmd.run "service openvswitch-switch restart"), we have to get kind of crafty with our Salt commands if we want to avoid data-plane hits, e.g. a bash script with a built-in wait timer.

minaguib commented

A colleague of mine discovered a workaround.

It turns out that the salt command allows invoking different functions, each with its own arguments, in a single call. We can use this to combine state.sls with a cmd.run sleep:

salt server state.sls,cmd.run statename,"sleep 60"

The syntax is a bit awkward:

fun1,fun2  argtofun1,argtofun2

The order is also weird, but good enough for our use case - the cmd.run("sleep 60") gets executed before the state.sls(statename).


rothgar commented May 21, 2015

+1
My use case is to run something on all hosts once per hour but to spread the load in batches. Ideally we could run on 1/6 of our infrastructure every 10 minutes and then have the job repeat.
The way batch currently works, it tries to finish as soon as possible, which is not desirable in all cases.
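(Assuming the proposed wait option lands, and given that -b already accepts percentage batch sizes, something like the following could approximate that cadence; some_command stands in for the actual job, and note the wait would start only after each batch returns, so the overall period drifts with job runtime:)

salt '*' -b 17% --batch-wait 600 cmd.run 'some_command'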

bartlaarhoven commented

+1
Any news on the progress of this feature request?

basepi added the Core label Nov 11, 2015
basepi assigned DmitryKuzmenko and unassigned thatch45 Nov 11, 2015

basepi commented Nov 11, 2015

No news, but thanks for the bump, I've put some eyes on it.

matthayes commented

Related to this, I think it would be useful if there was a way to ping a web server before moving on. Sometimes one implements a /ping path that returns 200 when the server is up and ready to serve requests. This could be more reliable and quicker than a fixed sleep. For example, what if there is a bug? It'd be nice to catch it immediately, before rolling out to all instances. Maybe the service.running state could have a test option of some sort?


lpgauth commented Dec 11, 2015

@matthayes 👍


basepi commented Dec 11, 2015

@matthayes I'm not convinced that this belongs in batch. I think it's doable without batch mode, and it's hard to generalize inside of batch mode in a useful way.

  1. You could use state.orchestrate, target a subset of the minions to roll out changes, and then have a separate piece of the orchestration, probably using a custom module, check whether the changes were successful.
  2. You could also just use a normal state run and have the minion check its own web service. Again, if you need custom checks or a loop with a custom timeout to check the service, a custom module is probably the way to go (a sketch follows below).
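A minimal sketch of such a custom execution module, dropped into _modules/ (the module name webcheck, the /ping endpoint, and the defaults are illustrative assumptions; Python 3 stdlib is used for brevity):

# _modules/webcheck.py
import time
import urllib.error
import urllib.request

def wait_ready(url='http://localhost:8080/ping', timeout=60, interval=2):
    '''
    Return True once url answers with HTTP 200, or False if timeout
    seconds pass first.  CLI example: salt '*' webcheck.wait_ready
    '''
    deadline = time.time() + timeout
    while time.time() < deadline:
        try:
            if urllib.request.urlopen(url, timeout=5).getcode() == 200:
                return True
        except (urllib.error.URLError, OSError):
            # Not up yet (connection refused, non-2xx, timeout); keep polling.
            pass
        time.sleep(interval)
    return False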

Here are some resources on custom modules:

https://docs.saltstack.com/en/latest/ref/modules/
https://youtu.be/7CxJGglQhxQ

Does that make sense?

Also, the original issue is now implemented, so I'm going to close this issue.
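For anyone landing here later: the option appears in subsequent Salt releases as --batch-wait (seconds to wait after a batch's minions return before the next batch is scheduled), so the original example becomes something like:

salt -G 'role:webserver' -b 2 --batch-wait 30 cmd.run 'service tomcat restart'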

basepi closed this as completed Dec 11, 2015
basepi added the fixed-pls-verify label Dec 11, 2015
matthayes commented

Thanks @basepi, that makes sense. For point 2, how can the minion check its own web service in a state run? Would there need to be a state that checks that the web service is responding when it is evaluated? Does such a state exist? If I were to write a custom module, I would need to use something like state.orchestrate as in point 1, right?

matthayes commented

I just noticed you can call modules from states. Is this how you'd suggest doing it if I pursued point 2? It seems like I could write a custom module that makes an HTTP call and fails if the response is not 200. Then it seems I could do a normal state run with a wait between each node and it would work great.


basepi commented Dec 11, 2015

You could also easily write a custom state module (sketched below). It's not covered in as much detail in that YouTube video, but it's just as easy. But yes, you have the idea spot on.
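A minimal sketch of such a custom state module, reusing the hypothetical webcheck execution module sketched above (all names illustrative; __salt__ is the loader-injected dunder that real state modules use):

# _states/webcheck.py
def ready(name, url='http://localhost:8080/ping', timeout=60):
    # name is the state ID; nothing is changed, we only verify readiness.
    ret = {'name': name, 'changes': {}, 'result': False, 'comment': ''}
    if __salt__['webcheck.wait_ready'](url=url, timeout=timeout):
        ret['result'] = True
        ret['comment'] = '{0} answered with HTTP 200'.format(url)
    else:
        ret['comment'] = '{0} did not come up within {1}s'.format(url, timeout)
    return ret

An SLS could then call webcheck.ready right after the service.running state for the daemon.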

shinshenjs commented

Saw this issue when searching for the feature. As it has not been released yet, I figured out an easy hack for this case: simply run a sleep afterwards.

salt -G 'app:whatever' -b 1 cmd.run "echo test; sleep 300"

For highstate, I just put a cmd.run state at the end of the top file with the same idea.

Leaving it here as it could be helpful until this feature is released.


rothgar commented Mar 17, 2016

That's a decent hack if you want an arbitrary wait (assuming you don't know how long echo test will take to run). My use case was to run a command everywhere twice per hour. It would be nice to have Salt figure out the batch size to make sure the command is triggered (it doesn't have to finish) the correct number of times in a given timespan.
