[FR] add wait timer to batch mode #1038
I will put this on post-1.0.0; I think it is a good idea.
I know this has been marked for post-1.0, but is there any chance of squeezing it in sooner? If not, we'll look at creating a quick wrapper; I wouldn't have a clue where to even start coding it into the main app.
I am happy to bump it up; the problem I have right now is that I am beating down bugs and have a lot of backlogged features. The batch system also does not run in fixed groups, but instead makes sure that n minions are working at a time.
Thanks for your response thatch45, fully understand your time commitments; we'll have a look at either adding to batch.py or writing a quick wrapper. Our problem is that for things like tomcat, the init script returns almost immediately, but the daemon can take up to 30 seconds to fully start before it will respond to requests. In cases like that, we need an extra buffer to ensure there's no drop in service.
That in and of itself is something to consider: whether we should have some provision to wait for daemons to "really" start up. Many daemons behave this way.
I believe this is also how some of the newer init replacements like systemd and upstart function. They return as soon as they have spawned the various bits that then go on to spawn the services. It might be worth bumping up for that alone. Just my 2 cents.
I have applications with a similar problem: after a restart they may take a few minutes to preload caches. Since that preload (and other tasks) happens in a uWSGI worker process, salt will not wait for it to finish. A delay would be a simple mechanism to use here, but perhaps this is better handled with a verify-state method that can check whether 1) a restart is still running, or 2) it failed in some way, for example a daemon that started but aborted after backgrounding.
I have been doing more work in this area recently and will try to revisit this for 0.16.0 |
I was wondering if there is any update on the progress of the wait timer for batch.
+1 For now I'm coping by adding an explicit sleep to the command to run: cmd.run "restart foo; sleep 30". This naturally works for calls like cmd.run but not state.sls.
Issues get forgotten from time to time. I'll put this on the current milestone so we see it more often. |
Thank you for your reply. Really much appreciated. |
Bump, nudge & tickle :) Would still love to see this feature. The reasons are exactly as explained above (in a distributed/load-balanced service, you don't want to "shock" all the nodes by restarting at once) |
So the idea would be instead of starting the next batch immediately after the first batch finished and returned, we would wait 30 seconds after receiving all the returns? Or would it be just a hard 30 second wait between batches, without waiting for returns? Just want to clarify. |
@basepi - IMO a hard wait between batches runs isn't intuitive as a user would need to add up 'time to restart the service' + 'service start time' in order to identify a reasonable wait period (although it could be renamed as something like a frequency parameter). Waiting after a batch runs and returns seems more sensible because then it's just the 'service start time' to account for - I think that would also tie in better with logic where if returns aren't successful, then it doesn't continue. |
I suppose that makes sense. One thing you could do as a workaround for now is add a sleep to the end of your state run to just bake this delay in. |
FWIW I hadn't imagined it per-batch, but really per minion, within batch context. In salt the batch system maintains a window of running minions. That means if you say in "batches of 2" and it executes on 2 minions - when one finishes, the next one is scheduled in that "slot" right away, so you always have 2 minions working. Think "2 workers" that are being fed work to do (each work unit being a minion) independent of each other, until there is no more work to do. I had imagined the batch delay to be a delay the "worker" applies, after executing the work on a minion, before announcing "ok I'm done" and accepting more work. That way, I can say "with a sleep of 30 seconds" to allow my just-restarted service to go back into service, and this is easy to reason about with any number of minions and any amount of batch size. |
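The sliding-window behavior described above can be sketched in plain Python. This is a hypothetical illustration of the proposed per-minion post-job delay, not Salt's actual batch implementation; the function name `run_batched` and the stubbed job are made up:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def run_batched(minions, batch_size, delay, job):
    """Run `job` on every minion, at most `batch_size` at a time,
    sleeping `delay` seconds in a slot after each job completes
    before that slot accepts more work."""
    def worker(minion):
        result = job(minion)
        time.sleep(delay)  # the proposed per-minion post-job wait
        return result

    # The pool maintains the "window": as soon as one minion finishes
    # (and its delay elapses), the next one is scheduled in that slot.
    with ThreadPoolExecutor(max_workers=batch_size) as pool:
        return dict(zip(minions, pool.map(worker, minions)))

# Example: a stubbed "restart" on five minions, two at a time, 0.1s wait.
results = run_batched([f"web{i}" for i in range(5)], 2, 0.1,
                      lambda m: f"{m}: restarted")
```

The delay lives inside the worker slot rather than between whole batches, which is why it reasons the same way for any number of minions and any batch size.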
Makes sense! Thanks for the explanation. |
@basepi So after a bit of rubber ducking I realized this actually, from my perspective, has nothing to do with batches per se. If we think of it as a "post-minion-state-sleep", it'd work naturally within or without batches, again over any number of minions and optionally any batch size. |
@minaguib Thanks for the update -- I still think that implementing this inside of the batch system is the right place, since it's still the primary use case, and that will help it to apply to not only states, but also remote execution calls. |
👍 |
Would love to see this implemented! Strangely enough just implemented the same solution as @minaguib and then thought "There must be a Salt-way" and found this issue 👍 |
this would be quite convenient. we run on 60k+ physical nodes. depending on call (i.e. cmd.run "service openvswitch-switch restart") we have to get kinda crafty with our salt commands if we want to avoid data plane hits. e.g. bash script with built in wait timer. |
A colleague of mine discovered a work-around. It turns out that the salt command allows invoking different functions and arguments in the same invocation, which we can use to combine state.sls with a cmd.run sleep. The syntax is a bit awkward, and the order is also weird, but it is good enough for our use case: the cmd.run("sleep 60") gets executed before the state.sls(statename).
+1 |
+1 |
No news, but thanks for the bump, I've put some eyes on it. |
Related to this, I think it would be useful if there were a way to ping a web server before moving on. Sometimes one may implement a /ping path that returns 200 if the server is up and ready to serve requests. This could be more reliable and quicker than a fixed sleep. For example, what if there is a bug? It'd be nice to catch it immediately before rolling it out to all instances. Maybe the service.running state could have a test option of some sort?
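As a rough illustration of that idea, polling a health endpoint until it answers 200 can replace a fixed sleep. This is a sketch only; the function name, URL shape, and timeout values are invented here and are not a Salt API:

```python
import time
import urllib.request
import urllib.error

def wait_until_healthy(url, timeout=30.0, interval=1.0):
    """Return True once `url` answers HTTP 200, or False if
    `timeout` seconds elapse first."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # server not accepting connections yet; keep polling
        time.sleep(interval)
    return False
```

Compared to a hard sleep, this returns as soon as the service is actually ready, and a False return gives you a chance to stop the rollout instead of blindly moving to the next batch.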
@matthayes I'm not convinced that this belongs in batch. I think it's doable without batch mode, and it's hard to generalize inside of batch mode in a useful way.
Here are some resources on custom modules: https://docs.saltstack.com/en/latest/ref/modules/. Does that make sense? Also, the original issue is now implemented, so I'm going to close this issue.
Thanks @basepi , that makes sense. For point 2, how can the minion check its own web service in a state run? Would there need to be a state that checks the web service is responding when it is evaluated? Does such a state exist that can be used? If I were to write a custom module I would need to use something like state.orchestrate as in point 1 right? |
I just noticed you can call modules from states. Is this how you'd suggest doing it if I pursued point 2? It seems like I could write a custom module that makes an HTTP call and fails if the response is not 200. Then it seems I could do a normal state run with a wait between each node and it would work great. |
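A minimal sketch of such a custom execution module might look like the following. The module and function names are hypothetical; a real Salt module would typically raise salt.exceptions.CommandExecutionError, but a plain exception is used here to keep the example self-contained:

```python
# Hypothetical custom execution module, e.g. saved under _modules/ on the
# master and synced to minions with saltutil.sync_modules.
import urllib.request
import urllib.error

def assert_up(url):
    """Return True if `url` answers HTTP 200; raise otherwise, so a
    state run that calls this fails instead of silently continuing."""
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            status = resp.status
    except urllib.error.URLError as exc:
        # Covers connection refused and non-2xx responses (HTTPError
        # is a subclass of URLError).
        raise RuntimeError(f"health check failed for {url}: {exc}")
    if status != 200:
        raise RuntimeError(f"health check for {url} returned {status}")
    return True
```

From a state run, this could then be invoked via the module.run state after the service restart, so a failing health check fails the run for that minion before the rollout proceeds.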
You could also easily write a custom state module. It's not covered in as much detail in that YouTube video, but it's just as easy. But yes, you have the idea spot on.
Saw this issue when searching for the feature. As it has not been released yet, I figured out an easy hack for this case: simply run a sleep afterwards.
For highstate, I just put a cmd.run sleep state at the end of the top file with the same idea. Leaving it here as it could be helpful until this feature is released.
That's a decent hack if you want to arbitrarily wait (assuming you don't know how long the service will take to come back up).
Feature Request.
There are times when an application can restart almost immediately, but doesn't necessarily start responding for x seconds. In these cases, I'd like the ability to add a wait timer in the batch call.
e.g.
salt -G 'role:webserver' -b 2 --batch-wait 30 cmd.run 'service tomcat restart'
Where it would iterate through all servers matching the grain, performing two at a time, with a 30-second wait between each batch.
Many thanks