Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[systemd] saltstack (git master version) can't start with systemd 216 #15721

Closed
aboe76 opened this issue Sep 11, 2014 · 11 comments
Closed

[systemd] saltstack (git master version) can't start with systemd 216 #15721

aboe76 opened this issue Sep 11, 2014 · 11 comments
Labels
Bug broken, incorrect, or confusing behavior severity-low 4th level, cosemtic problems, work around exists
Milestone

Comments

@aboe76
Copy link
Contributor

aboe76 commented Sep 11, 2014

If I start salt-master with systemctl start salt-master it will hang for a long time.

This is the output of journalctl -u salt-master -l

sep 11 22:21:16 salt.example.com systemd[1]: salt-master.service: Got notification message from PID 19888, but reception only permitted for main PID 19868
sep 11 22:22:46 salt.example.com systemd[1]: salt-master.service start operation timed out. Terminating.
sep 11 22:22:46 salt.example.com salt-master[19868]: Process Process-4:
sep 11 22:22:46 salt.example.com salt-master[19868]: Traceback (most recent call last):
sep 11 22:22:46 salt.example.com salt-master[19868]: File "/usr/lib/python2.7/multiprocessing/process.py", line 258, in _bootstrap
sep 11 22:22:46 salt.example.com salt-master[19868]: self.run()
sep 11 22:22:46 salt.example.com salt-master[19868]: File "/usr/lib/python2.7/multiprocessing/process.py", line 114, in run
sep 11 22:22:46 salt.example.com salt-master[19868]: self._target(*self._args, **self._kwargs)
sep 11 22:22:46 salt.example.com salt-master[19868]: File "/usr/lib/python2.7/site-packages/salt/master.py", line 316, in run_reqserver
sep 11 22:22:46 salt.example.com salt-master[19868]: reqserv.run()
sep 11 22:22:46 salt.example.com salt-master[19868]: File "/usr/lib/python2.7/site-packages/salt/master.py", line 513, in run
sep 11 22:22:46 salt.example.com salt-master[19868]: self.__bind()
sep 11 22:22:46 salt.example.com salt-master[19868]: File "/usr/lib/python2.7/site-packages/salt/master.py", line 507, in __bind
sep 11 22:22:46 salt.example.com salt-master[19868]: self.process_manager.run()
sep 11 22:22:46 salt.example.com salt-master[19868]: File "/usr/lib/python2.7/site-packages/salt/utils/process.py", line 180, in run
sep 11 22:22:46 salt.example.com salt-master[19868]: pid, exit_status = os.wait()
sep 11 22:22:46 salt.example.com salt-master[19868]: File "/usr/lib/python2.7/site-packages/salt/utils/process.py", line 205, in kill_children
sep 11 22:22:46 salt.example.com salt-master[19868]: del self._process_map[pid]
sep 11 22:22:46 salt.example.com salt-master[19868]: KeyError: 19879
sep 11 22:22:46 salt.example.com salt-master[19868]: Exception AttributeError: "'ReqServer' object has no attribute 'clients'" in <bound method ReqServer.__del__ of <salt.master.ReqServer object at 0x7fb6b6
sep 11 22:22:46 salt.example.com salt-master[19868]: [ERROR   ] An un-handled exception was caught by salt's global exception handler:
sep 11 22:22:46 salt.example.com salt-master[19868]: OSError: [Errno 4] Interrupted system call
sep 11 22:22:46 salt.example.com salt-master[19868]: Traceback (most recent call last):
sep 11 22:22:46 salt.example.com salt-master[19868]: File "/usr/bin/salt-master", line 10, in <module>
sep 11 22:22:46 salt.example.com salt-master[19868]: salt_master()
sep 11 22:22:46 salt.example.com salt-master[19868]: File "/usr/lib/python2.7/site-packages/salt/scripts.py", line 49, in salt_master
sep 11 22:22:46 salt.example.com salt-master[19868]: master.start()
sep 11 22:22:46 salt.example.com salt-master[19868]: File "/usr/lib/python2.7/site-packages/salt/__init__.py", line 138, in start
sep 11 22:22:46 salt.example.com salt-master[19868]: self.master.start()
sep 11 22:22:46 salt.example.com salt-master[19868]: File "/usr/lib/python2.7/site-packages/salt/master.py", line 319, in start
sep 11 22:22:46 salt.example.com salt-master[19868]: process_manager.run()
sep 11 22:22:46 salt.example.com salt-master[19868]: File "/usr/lib/python2.7/site-packages/salt/utils/process.py", line 180, in run
sep 11 22:22:46 salt.example.com salt-master[19868]: pid, exit_status = os.wait()
sep 11 22:22:46 salt.example.com salt-master[19868]: OSError: [Errno 4] Interrupted system call
sep 11 22:22:46 salt.example.com salt-master[19868]: Traceback (most recent call last):
sep 11 22:22:46 salt.example.com salt-master[19868]: File "/usr/bin/salt-master", line 10, in <module>
sep 11 22:22:46 salt.example.com salt-master[19868]: salt_master()
sep 11 22:22:46 salt.example.com salt-master[19868]: File "/usr/lib/python2.7/site-packages/salt/scripts.py", line 49, in salt_master
sep 11 22:22:46 salt.example.com salt-master[19868]: master.start()
sep 11 22:22:46 salt.example.com salt-master[19868]: File "/usr/lib/python2.7/site-packages/salt/__init__.py", line 138, in start
sep 11 22:22:46 salt.example.com salt-master[19868]: self.master.start()
sep 11 22:22:46 salt.example.com salt-master[19868]: File "/usr/lib/python2.7/site-packages/salt/master.py", line 319, in start
sep 11 22:22:46 salt.example.com salt-master[19868]: process_manager.run()
sep 11 22:22:46 salt.example.com salt-master[19868]: File "/usr/lib/python2.7/site-packages/salt/utils/process.py", line 180, in run
sep 11 22:22:46 salt.example.com salt-master[19868]: pid, exit_status = os.wait()
sep 11 22:22:46 salt.example.com salt-master[19868]: OSError: [Errno 4] Interrupted system call
sep 11 22:22:46 salt.example.com systemd[1]: salt-master.service: main process exited, code=exited, status=1/FAILURE
sep 11 22:22:46 salt.example.com systemd[1]: Failed to start The Salt Master Server.
sep 11 22:22:46 salt.example.com systemd[1]: Unit salt-master.service entered failed state.

If I'm correct this means that systemd notify expects that only the mainpid will be the one that sends the notify, not all the forked processes.

this is from the following patch:
systemd/systemd@336c6e4

So it looks like salt needs to send the mainpid with the notifies.
I think this line: systemd.daemon.notify('READY=1')
here: https://github.com/saltstack/salt/blob/develop/salt/master.py#L471

needs to change to something (sorry not good with pythoncode myself)
systemd.daemon.notify('READY=1','MAINPID=(some pythoncode that prints the mainpid of its process)')

@aboe76
Copy link
Contributor Author

aboe76 commented Sep 11, 2014

@gtmanfred as requested

@gtmanfred
Copy link
Contributor

@terminalmage can you take a look at this too since we worked on this part of the systemd stuff

@gtmanfred
Copy link
Contributor

@thatch45 @SmithSamuelM can yall or someone else verify for me where in the code the __bind for zmq and raet are done?

If we move the systemd.daemon.notify('READY=1') to here https://github.com/saltstack/salt/blob/develop/salt/__init__.py#L124 would the zmq socket be done __bind ing? since it is after the daemonize_if_required function?

diff --git a/salt/__init__.py b/salt/__init__.py
index 4da5735..69b7293 100644
--- a/salt/__init__.py
+++ b/salt/__init__.py
@@ -48,6 +48,11 @@ from salt.exceptions import SaltSystemExit, MasterExit
 # leaves us alone and stops complaining about an un-used import
 logger = salt.log.setup.logging.getLogger(__name__)

+try:
+    import systemd.daemon
+    HAS_PYTHON_SYSTEMD = True
+except ImportError:
+    HAS_PYTHON_SYSTEMD = False

 class Master(parsers.MasterOptionParser):
     '''
@@ -122,6 +127,12 @@ class Master(parsers.MasterOptionParser):
             self.master = salt.daemons.flo.IofloMaster(self.config)
         self.daemonize_if_required()
         self.set_pidfile()
+        try:
+            if HAS_PYTHON_SYSTEMD and systemd.daemon.booted():
+                systemd.daemon.notify('READY=1')
+        except SystemError:
+            # Daemon wasn't started by systemd
+            pass

     def start(self):
         '''
diff --git a/salt/master.py b/salt/master.py
index 93c0225..2ddd778 100644
--- a/salt/master.py
+++ b/salt/master.py
@@ -56,12 +56,6 @@ try:
 except ImportError:
     HAS_HALITE = False

-try:
-    import systemd.daemon
-    HAS_PYTHON_SYSTEMD = True
-except ImportError:
-    HAS_PYTHON_SYSTEMD = False
-

 log = logging.getLogger(__name__)

@@ -466,13 +460,6 @@ class ReqServer(object):

         self.workers.bind(self.w_uri)

-        try:
-            if HAS_PYTHON_SYSTEMD and systemd.daemon.booted():
-                systemd.daemon.notify('READY=1')
-        except SystemError:
-            # Daemon wasn't started by systemd
-            pass
-
         while True:
             try:
                 zmq.device(zmq.QUEUE, self.clients, self.workers)

Thanks
Daniel

@rallytime rallytime added Bug broken, incorrect, or confusing behavior severity-low 4th level, cosemtic problems, work around exists labels Sep 17, 2014
@rallytime rallytime added this to the Approved milestone Sep 17, 2014
@rallytime
Copy link
Contributor

Thanks for the report @aboe76! Looks like @gtmanfred already has some ideas.

The __bind code for zmq is located in salt/master.py and for raet it's in salt/daemons/flo/core.py. Does that help?

@gtmanfred
Copy link
Contributor

It does not.

The problem is that the master process that runs the __bind, is not the main master process, which is why the systemd.daemon.notify is complaining.

My change would move the notify to after the __bind is run, but the parent master process is the one that runs it, but i don't know if it is going to be run at a point that actually is after all of the sockets are done with __bind and the master process is completly started.

@aboe76
Copy link
Contributor Author

aboe76 commented Sep 19, 2014

@gtmanfred isn't it possible to find out what the parent master process pid is and just add it to the systemd.daemon.notify ? I think python is smart enough to find out what it parent pid is...
https://docs.python.org/2/library/os.html#os.getppid

@gtmanfred
Copy link
Contributor

python-systemd does not support sending the MAINPID from what I can see. It would need to have notifyf wrapped. Even if it did work, there is another process between the __bind() and the MAINPID, in my trying they getppid() at that place and the was not the same pid that was placed in set_pidfile.

@aboe76
Copy link
Contributor Author

aboe76 commented Sep 23, 2014

@gtmanfred, thanks for trying, so if only the MAINPID process can do systemd.daemon.notify, than there needs to be some logic change to have all other processes report back to the MAINPID process before it sends the systemd.daemon.notify, but that sound like something stable...

@thatch45
Copy link
Member

I would rather pull this out before we cut the release then deal with it, also, that should be the main process, but has it been daemonized?

@gtmanfred
Copy link
Contributor

it appears that the point where it does the __bind is no longer the main pid.

@aboe76
Copy link
Contributor Author

aboe76 commented Sep 26, 2014

hi @gtmanfred, thanks for the fix, the salt-master starts with systemd notify, but it still logs this:

sep 26 23:29:18 salt.example systemd[1]: salt-master.service: Got notification message from PID 24614, but reception only permitted for main PID 24608

if I look closer:

systemctl status salt-master -l
● salt-master.service - The Salt Master Server
   Loaded: loaded (/usr/lib/systemd/system/salt-master.service; enabled)
   Active: active (running) since vr 2014-09-26 23:29:18 CEST; 14s ago
 Main PID: 24608 (salt-master)
   CGroup: /system.slice/salt-master.service
           ├─24608 /usr/bin/python2 /usr/bin/salt-master
           ├─24611 /usr/bin/python2 /usr/bin/salt-master
           ├─24612 /usr/bin/python2 /usr/bin/salt-master
           ├─24613 /usr/bin/python2 /usr/bin/salt-master
           ├─24614 /usr/bin/python2 /usr/bin/salt-master
           ├─24616 /usr/bin/python2 /usr/bin/salt-master
           ├─24621 /usr/bin/python2 /usr/bin/salt-master
           ├─24624 /usr/bin/python2 /usr/bin/salt-master
           ├─24625 /usr/bin/python2 /usr/bin/salt-master
           ├─24628 /usr/bin/python2 /usr/bin/salt-master
           └─24633 /usr/bin/python2 /usr/bin/salt-master

but it starts without and a minion can connect to it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug broken, incorrect, or confusing behavior severity-low 4th level, cosemtic problems, work around exists
Projects
None yet
Development

No branches or pull requests

4 participants