Proper way to upgrade salt-minions / salt-master packages without losing minion connectivity #7997
You may need my fix (already on git/develop) for #7987.
Was this always a problem or just something specific to the v.17 -> v.17.1 upgrade? I've personally never got it to work reliably across all minion upgrades with past version upgrades but I had assumed I was going about it the wrong way. I just tried an alternative upgrade approach. Unfortunately, it didn't work either.
The package upgrade seems to stop the running service. |
This process has never been as consistent or stable as we would like it to be. But the first thing you need to do is make sure all of your minions have ZMQ 3.2 or higher. That minion that you listed above with ZMQ 2 is definitely going to cause problems with keeping the connection alive or reconnecting. The rest of the process tends to depend on the init system in question and a lot of other factors. As soon as we get the general bug count under control, we want to dedicate some resources to solving this upgrade problem for good. |
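A quick way to see which minions are still on the old library (while they still respond) is to ask them directly; `test.versions_report` also includes the ZeroMQ/pyzmq versions in recent releases:

```
# Print the ZeroMQ library version from each minion's Python (Python 2 era syntax):
salt '*' cmd.run 'python -c "import zmq; print zmq.zmq_version()"'

# Or pull the full dependency report, which includes ZeroMQ and pyzmq:
salt '*' test.versions_report
```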
I don't have much of a choice here unless I deviate from the repos. ZMQ3 is only available in epel for RHEL6. I have the latest version of ZMQ offered for RHEL5 which happens to be the 2.2 version listed above. Minion01 replicates our legacy RHEL5 nodes (which we have a lot of) and I'm using it as a control for testing issues. Minion00 is running RHEL6 and mimics the configuration of our latest app/service deployments (and where we'd like to transition everything once I have enough cycles to complete migrations). |
I should also say, you're right. Most of my upgrade and losing minions when the salt-master reboots woes have been with RHEL5 and the legacy ZMQ. That said, this latest upgrade to v.17.1 hasn't worked cleanly for any client. Every single upgrade has stopped the salt-minion service and requires me to login and reboot the services. Can I get around this by using salt-ssh to restart the minion services? I could use an example of how to use salt-ssh. I'm not entirely clear what user / keys / password it uses (or if this is all hidden under the covers with salt's key management system) or how to issue commands. The docs I've run across don't have too many examples of executing shell commands. I'm presuming using the -r flag (raw) works much like "ssh -t"? |
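For the salt-ssh question: it logs in over plain SSH using whatever user, key, or password is defined per host in the roster file (it does not use the minion key system), and `-r`/`--raw` runs a raw shell command much like plain ssh. A hedged example, assuming a roster entry named minion00 exists:

```
# /etc/salt/roster defines the SSH targets (host, user, key or password per entry).
# With such an entry in place, a raw command can restart a stopped minion:
salt-ssh 'minion00' -r 'service salt-minion restart'
```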
Hi, I have a workaround for this: for my Linux minions I set up a crontab to restart the minion every night.
On Windows, same thing: I add a scheduled task to restart the minion.
Regards
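On Linux that can be as small as a single cron entry; the time and paths below are just an example:

```
# /etc/cron.d/restart-salt-minion -- nightly restart of the minion service
30 3 * * * root /sbin/service salt-minion restart
```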
That's actually quite clever and I'm going to steal that idea. That said, it would not have helped with the v.17 to v.17.1 upgrade via epel-testing packages. From what I'm seeing, the package upgrade itself stops the running minion service (whether via yum or from within salt's framework -- which makes sense since salt calls yum's methods) which is very peculiar behavior (I don't recall seeing this happen before with any daemon installed from rpm packages). This seems to be a new upgrade artifact that I don't remember seeing before but I've only done 4-5 upgrades so far. I'm going to run the upgrade in verbose mode and see if I can find any other artifacts but need to troubleshoot some 10GbE networking problems we're having first. |
Maybe I dismissed the idea too early... I could extend your example and poll to see if the service is running (every 5 minutes or something) and, if it is not, start the salt-minion. That seems a little like overkill but would resolve this specific issue. I should still be able to use salt-ssh to log in to all of my minions and start the minion service manually, right?
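The polling variant is also just a one-line cron job; the interval and service commands are illustrative:

```
# /etc/cron.d/salt-minion-watchdog -- start the minion only if it is not running
*/5 * * * * root /sbin/service salt-minion status >/dev/null 2>&1 || /sbin/service salt-minion start
```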
@shantanub, I looked at the spec file. There is some code to stop salt-minion before the upgrade and, if it is an upgrade, to restart the minion with `service salt-minion condrestart`. I don't know why this part of the code doesn't work. Before the 0.17 version, the minion wasn't stopped during an upgrade.
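For reference, the usual shape of that scriptlet logic is sketched below; this is the general pattern being described, not the actual salt.spec. rpm passes a count in `$1`: in `%preun` it is 0 for an erase and 1 for an upgrade, and in `%postun` it is 1 or more for an upgrade.

```
# %preun (sketch): stop the minion before the old package is removed
/sbin/service salt-minion stop >/dev/null 2>&1

# %postun (sketch): on an upgrade, restart the minion only if it is still marked as running
if [ "$1" -ge 1 ]; then
    /sbin/service salt-minion condrestart >/dev/null 2>&1
fi
```

Note that condrestart only restarts a service whose subsystem lock still exists, so if the stop in %preun already removed it, the condrestart in %postun becomes a no-op -- which would match the behavior described in this thread.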
Strange. I wonder if we changed something between 0.17.0 and 0.17.1 that would have caused this change. @equinoxefr I can't see any recent changes that changed whether the salt process was stopped or not. Looks like the stop of the salt-minion has been in there for awhile (at least for the 0.16 release, I didn't go earlier). Just wondering where you were looking to see that change. @shantanub You've had successful upgrades from epel before, then? But not 0.17.0-0.17.1? |
@basepi I didn't see any change in the code, but I have seen a change in the behavior of the minion upgrade (on Linux RPM). Now with 0.17.1 the salt-minion is stopped and not restarted. Perhaps the piece of code that uses the state of the RPM operation doesn't work. 0.16.0 -> 0.16.3 = minion not restarted. I did my tests on CentOS 6.4. I don't know why, but something has changed ;-)
Hrm, well, it doesn't appear to be in the spec file, so it must be somewhere else. Maybe the init script? |
@basepi That's correct. I've always upgraded and used the epel/epel-testing rpms to install salt so far. This is the first time I've noticed the minion stopped after/during the upgrade (this is something that would have been obvious since I would have to login to every minion and restart the service). The upgrade itself seems to have executed fine in every other regard that I can tell (no errors, etc...). Now, I have in the past lost all of the rhel5 minions when the salt-master service is restarted. I haven't had a problem with that in a few versions, but as I mentioned above, I very much would like to depart from rhel5 as soon as possible. |
Upon upgrading to v.17.2, it looks like the minion restarts as a part of the upgrade just fine. Has this issue with the package upgrade been resolved? As a fail-safe, I "start" my minions every 5 minutes via cron just in case they're down for some reason or another. I'll need to add the windows specific scheduler task as well. I would still like a definitive guide for how-to-upgrade minions and the salt-master. We're moving salt to production once v.17 is available in epel (as opposed to epel-testing), and I'd very much like upgrades to go smoothly. |
In general, the upgrades themselves tend to go swimmingly. The problem is the restart after the upgrade. We have an open issue specifically for the restarting of the minion: #5721 The issue also varies in severity from system to system (specifically between different init systems). Making it so the salt minion can restart itself consistently is high on our priority list. |
Are you sure the minion upgrade/restart problem isn't just a bug in the rpm post install script? That's where it should be restarted ... did you have a look at the source rpm? |
Restarting the minion process from inside a state running against that minion process (e.g. with a service / watch type state rule) will always fail. It's hard to imagine getting it to work without either a moderate re-architecture of the salt workflow or some cumbersome custom code inside the state machine to handle this special case. That said, the following has been allowing flawless upgrades for me since I started deploying minions around 0.16.1, up to 0.17.4 we're running now:
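A minimal sketch of that approach (the state IDs, pinned version, and wait time are illustrative, and it assumes the at daemon is installed and running on the minions):

```
# Upgrade the package, then hand the service restart to at(1) so it happens
# outside the Salt run that triggered it.
salt-minion:
  pkg.installed:
    - version: 0.17.4

schedule-minion-restart:
  cmd.wait:
    - name: echo 'service salt-minion restart' | at now + 5 minutes
    - watch:
      - pkg: salt-minion
```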
Obviously, you can set the wait time to whatever you want -- just make sure it's long enough that the minion proc doesn't get whacked during the current run... |
Using `order: last` on that restart state should make sure it runs at the very end of the highstate.
Ah, 'order: last' is a damned fine idea. Ashamed I didn't think of it.
Well, I can't believe I've never thought of using `at` for this.
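Folding that suggestion into the sketch above, the scheduling state just gains an order declaration (again, the names are illustrative):

```
schedule-minion-restart:
  cmd.wait:
    - name: echo 'service salt-minion restart' | at now + 5 minutes
    - order: last
    - watch:
      - pkg: salt-minion
```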
@tkwilliams: Woah cool. That's really helpful. Now you mentioned you guys watch your minion file as well. Does that mean you specify the contents of the minion file for every host/vm? Are you just pulling the fqdn from the environment for the contents of that file or doing something else? I'm having a lot of trouble with renaming minions (I have the master copy keys and change the contents of the minion file on the minion but a simple service restart of the salt-minion doesn't seem to be sufficient to get the master talking to the minion with the new minion hostname). This is an artifact of our kickstart setup. All of our hosts start up with a name that looks like "preconfig-macaddr.domain.org" where macaddr is the macaddress of the primary kickstarted interface (aa-bb-cc-dd-ee-ff). We then set the hostname and role of hosts via script but this has been a little painful since salt doesn't seem to readily want to move to the new hostname. Rebooting the host/VM after the change seems to work but I'd prefer not to have to do that. I'll experiment with this little wrinkle and see if it helps with renaming minions. |
The "at now" trick didn't work for renaming minions on rhel6 unfortunately. Something is caching the old hostname even though I've changed it just about everywhere I can imagine. Oh well, back to restarting vms/hosts when I rename them. |
Are you also deleting the /etc/salt/minion_id file so that the minion isn't caching the old name? |
Nope. I'm putting the new hostname in that file and it doesn't seem to be doing anything without a reboot. |
I've actually passed along to Seth the actual scripts I'm using to perform the name change. Feel free to see if I'm doing something silly. He thinks there may be a timing issue I've glossed over. |
You do need to restart the minion to change the minion ID. Don't know if "without a reboot" meant system reboot or minion restart. |
@basepi: I mean a system reboot/restart of the minion host/vm is required. As I mentioned restarting the minion service post name change even with the 'at now +1 minutes', several different sleep lengths, etc... doesn't get the minion to show up on the master as up. An interesting factoid is I can restart the minion reverting the keys on the master back to the original minion hostname and the minion shows up on the master just fine (this is without changing the contents of the minion_id file or the actual hostname of the minion which now point to the new hostname). So something is being cached somewhere and I'm not sure why/what. I don't run nscd if that matters. |
Just so we're all on the same page here's what I do: I have 2 hostname change scripts. One that resides on and is called from the master and one on each minion. The master calls the minion script as a part of its script via salt's remote execution framework: rename-minion.sh script run on salt-master:
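Roughly, the master-side script does something like the following; every path, the helper location, and the key handling here are assumptions for illustration, not the poster's actual script:

```
#!/bin/bash
# rename-minion.sh (illustrative sketch) -- run on the salt-master
# Usage: rename-minion.sh <old_minion_id> <new_minion_id>
old="$1"; new="$2"

# Copy the accepted key so the minion can reconnect under its new ID.
cp "/etc/salt/pki/master/minions/${old}" "/etc/salt/pki/master/minions/${new}"

# Tell the minion (still reachable under its old ID) to change its own hostname.
salt "${old}" cmd.run "/usr/local/sbin/set_hostname.sh ${new}"

# Removing the old key afterwards is where the ordering gets tricky (see the discussion below).
salt-key -y -d "${old}"
```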
set_hostname.sh script called on minion:
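And the minion-side script would look roughly like this; the RHEL-style network file and the at-based restart are assumptions based on the description above, not the actual script:

```
#!/bin/bash
# set_hostname.sh (illustrative sketch) -- executed on the minion by the master
new="$1"

hostname "${new}"
sed -i "s/^HOSTNAME=.*/HOSTNAME=${new}/" /etc/sysconfig/network
echo "${new}" > /etc/salt/minion_id

# Restart the minion out-of-band so the cmd.run that invoked this script can return first.
echo 'service salt-minion restart' | at now + 1 minute
```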
I've commented out the sleeps and the minion service start since they weren't doing anything post name change (in favor of a full host/VM restart of the minion), but I did experiment with a number of combinations of sleep times, calls to restart, and stop/start of the minion service, to no avail.
I think I see what's going on here. I think what you need to do is delete the keys on the master before the minion is restarted with the new ID. Nevermind, though, that particular issue should be resolved with a minion restart, not a minion system restart. Still, would be something to try. |
Umm.. how exactly do I target the minion if I delete its key before restarting it? That nested remote-execution call to the minion will never execute. Are you implying this can't be done without an out-of-band restart of the salt minion (via salt-ssh or some other method?)? |
@andrejohansson Do any of the workarounds in this fellow's repo work on 2012? https://github.com/markuskramerIgitt/LearnSalt/blob/master/learn-run-as.sls |
We ended up creating a batch file to uninstall and reinstall the salt minion, since simply upgrading in place had weird behaviors on 2014.7.x to newer 2014.7.x. We then schedule the batch file using the method above which with /ru "SYSTEM" works quite well. The one thing we added to the batch was backing up the minion.pem and minion.pub (could also add the minion [.conf]) so that when it talks back to the master it isn't colliding with its old key, it is reusing it. We also had to trigger a service start after we installed the new version otherwise we are unable to connect to it from the salt master. |
@dragon788 yes, I ended up doing something similar, but I haven't saved the key files yet. Smart!
I've found the reboot necessary because even in the newest 2015.5.x releases, nssm.exe sometimes won't be deleted by the uninstaller and remains active in c:\salt. This can prevent successful installs and startups later. I've found the 10-minute wait necessary because sometimes the installer won't start the minion after it exits.
Under systemd, this upgrade of salt-minion via salt is really a pain in the neck. Setting the unit's KillMode to process helps there; I guess salt-minion.service should be modified in this way. Meanwhile, I deploy the file /etc/systemd/system/salt-minion.service.d/killmode.conf with KillMode=process in it.
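The drop-in itself only needs a [Service] section with KillMode=process; one way to put it in place and have systemd pick it up (the heredoc is just for illustration):

```
mkdir -p /etc/systemd/system/salt-minion.service.d
cat > /etc/systemd/system/salt-minion.service.d/killmode.conf <<'EOF'
[Service]
KillMode=process
EOF
systemctl daemon-reload
```

With KillMode=process, restarting the unit only kills the main minion process, so a package transaction the minion spawned is left to finish.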
It allows me to properly run the salt-minion package upgrade from within a Salt run without the service being killed out from under the job.
@douardda Thanks for the update. Looks like your pull request is merged, and I'm working on merging it forward today. |
Having this issue too on SmartOS/Solaris. There should probably be a |
Hm, this issue is labeled "documentation", but the discussion isn't really about documentation. Is this labeled properly?
This is the relevant comment: #7997 (comment) |
Ideally there would be a 'reload' command instead that would properly reload everything for the salt-master/salt-minion without actually stopping and starting the process. |
@sjorge That would be nice but that's a pretty tall order. We'd have to refresh the opts dict everywhere and that's non-trivial. |
Can't the bigger brush be used? Close the sockets on the current process, fork a new copy, then exit. Since all listening sockets are closed, the new minion process should start fine while the old one can still send a reply to the master that all is OK.
The reload sounds really similar to how nginx handles config changes: it lets the sockets from the old config survive and spawns new ones with the new config until the old ones all expire. One thing we've noticed is that upgrading on a Debian-based system forces the service restart due to how Debian derivatives handle services in general, i.e. "you requested this be installed so we are enabling/starting the service NOW!". We have worked around this when preseeding salt-minion on machines by using the policy-rc.d trick, which basically prevents a service from starting during apt-get operations, though it shouldn't affect running services. This could then possibly be followed up by a `salt-call --local service.restart` as mentioned above to flip from the old version (in memory) to the new version (on disk). This is completely untested; I'm just going through my watched issues and seeing if I've found any new creative ways to fix them.
I'm wondering if we should close this in favor of #5721? |
I need to update out-of-date (0.17.5) Ubuntu Minions from an up-to-date (2016.3.2) salt-master; Minion-ID must remain the same, minion key must remain the same, minion cache should remain the same.
Documentation at https://docs.saltstack.com/en/latest/ref/modules/all/salt.modules.saltutil.html#salt.modules.saltutil.update seems not to have been updated since 2014. Does `saltutil.update` still apply here?
The relatively new minion config option |
I've solved this by having salt simply fork off an upgrade script that lets the minion return instantly and has the service restart in the background. The script is Ubuntu 14.04 specific but could easily be adapted.

minion-upgrade.sls:

```
/tmp/salt-minion-upgrade-deb.sh:
  cmd.script:
    - source: salt://salt/upgrade-minion-deb.sh
```

minion-upgrade-deb.sh:

```
#!/bin/bash
# This script forks off and runs in the background so salt can continue
{
    DEBIAN_FRONTEND=noninteractive apt-get install -y -o Dpkg::Options::=--force-confold salt-minion
    service salt-minion restart
} >>/var/log/salt/minion-upgrade.log 2>&1 &
disown
```
Just to complete the ideas: |
Speaking only about restarting the Minion, I really like the solution from here: #5721

```
salt '*' cmd.run_bg 'sleep 10; service salt-minion restart'
```

As for Debian 8 and Ubuntu 16 with systemd, masking the service around the package upgrade works:

```
salt -C 'G@init:systemd and G@os_family:Debian' service.mask salt-minion
salt -C 'G@init:systemd and G@os_family:Debian' pkg.install salt-minion refresh=True
salt -C 'G@init:systemd and G@os_family:Debian' service.unmask salt-minion
```

There is another solution for Upstart and SysV init -- using /usr/sbin/policy-rc.d:

```
salt -C '( G@init:upstart or G@init:sysvinit ) and G@os_family:Debian' file.manage_file \
    /usr/sbin/policy-rc.d '' '{}' '' '{}' root root '755' base '' contents=''
salt -C '( G@init:upstart or G@init:sysvinit ) and G@os_family:Debian' file.append \
    /usr/sbin/policy-rc.d '#!/bin/sh' 'exit 101'
salt -C '( G@init:upstart or G@init:sysvinit ) and G@os_family:Debian' pkg.install \
    salt-minion refresh=True
salt -C '( G@init:upstart or G@init:sysvinit ) and G@os_family:Debian' file.remove \
    /usr/sbin/policy-rc.d
```

I've found that this is the most reliable way to get the Salt Minion upgraded properly. I've also discovered that restarting Minions just with:

```
salt '*' service.restart salt-minion
```

works like a charm with a recent Salt version. Need to do more testing, but I think using `service.restart` this way may be the simplest option.
Hi, I had been trying to restart a Windows minion for ages, and finally worked out a way to get it to work every time: `salt '*' cmd.run_bg 'Restart-Service salt-minion' shell=powershell`
Thanks @Trouble123 👍 That worked. |
What about And more, I see that |
Fix #7997: describe how to upgrade Salt Minion in a proper way
We ran through the right way of doing this in salt training with Seth but I think I'm still missing something. I'm not sure if this is a bug or if I've missed something. I tried to run through the upgrade the master first / use salt to upgrade the minion service steps to upgrade from v.17 to v.17.1 of salt and ended up with losing access to most of my minions.
Long story short, I need a reliable way of upgrading all of the salt-minion and salt-master packages without losing access to the minions. From what I can tell, every time I perform such an upgrade I lose access to some if not all of my minions and need to log in to each host/VM and restart the salt-minion service. This is doable in test/dev where we have 30 nodes being managed, but not when I move this infrastructure to prod where I have over 200 nodes to manage. I need the upgrade path not to break the remote execution framework established between minions and master.
So without further ado, here's what I did (a rough sketch of the commands follows the list):
1. Update the master.
2. Restart the master and minion on my master VM.
3. Try to upgrade some of my test minion VMs.
4. Log in to each minion VM and restart the salt-minion service.
5. Now I can ping the VMs again.
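For concreteness, the commands behind those steps would look roughly like this; the targeting globs and the epel-testing repo flag are assumptions, not the exact commands that were run:

```
# On the master VM: upgrade the master and its local minion from epel-testing
yum --enablerepo=epel-testing -y upgrade salt-master salt-minion
service salt-master restart
service salt-minion restart

# From the master: upgrade the test minions' packages through Salt itself
salt 'minion0*' pkg.install salt-minion refresh=True

# The minions then stop responding until salt-minion is started on each one by hand
salt 'minion0*' test.ping
```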
Versions reports:
You'll notice that the upgrade proceeded correctly. The packages were upgraded, but the salt-minion services were not restarted as a part of the upgrade process (for both minion VMs - one is RHEL5 and the other is RHEL6). Unfortunately, I didn't think to run the upgrade packages command in verbose mode at the time.
Do I need to find some external remote-execution method to restart all of the minions post-upgrade (mussh/omnitty, etc...)? This is probably not a bug but it's still very frustrating... I'm unlikely to upgrade again until I can figure out how to do this properly.