New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add log_exceptions decorator and wrap Maintenance.run. #30383

Closed
wants to merge 1 commit into
base: develop
from

Conversation

Projects
None yet
2 participants
@wdoekes
Copy link
Contributor

wdoekes commented Jan 15, 2016

In order to debug problems with unhandled errors, it is beneficial if
dying subprocesses write their fatal exception to the log.

This changeset adds a decorator to wrap the run() function, and wraps
the Maintenance run method. Other run() methods could benefit from
this wrapping as well.


When debugging the problem which is fixed by #30382 it would have
helped if the maintenance subprocess had output its exception to
the salt/master log. This patch should fix that it does.

Add log_exceptions decorator and wrap Maintenance.run.
In order to debug problems with unhandled errors, it is beneficial if
dying subprocesses write their fatal exception to the log.

This changeset adds a decorator to wrap the run() function, and wraps
the Maintenance run method.  Other run() methods could benefit from
this wrapping as well.
@s0undt3ch

This comment has been minimized.

Copy link
Member

s0undt3ch commented Jan 15, 2016

Se do have an uncaught exceptions handler in salt, however, as you pointed out, its not enabled on processes that we start. So instead of wrapping every run method as you suggest, we should instead do whatever is necessary to make that handler work for processes thatbsakt starts...

@s0undt3ch

This comment has been minimized.

Copy link
Member

s0undt3ch commented Jan 15, 2016

@wdoekes is there a way I can ask you to confirm if #30388 would allow you to catch the problem you fixed in #30382 please?

@wdoekes

This comment has been minimized.

Copy link
Contributor Author

wdoekes commented Jan 15, 2016

(oh wait, grepped for the wrong thing.. let me get back to you on this one)

@s0undt3ch

This comment has been minimized.

Copy link
Member

s0undt3ch commented Jan 15, 2016

You should grep instead for __global_logging_exception_handler which you should have, but I'll try to force the suggested ValueError.

@wdoekes

This comment has been minimized.

Copy link
Contributor Author

wdoekes commented Jan 15, 2016

Well.. that didn't work out at all. My process.py looked very different from develop, so I ended up doing this:

--- /usr/lib/python2.7/dist-packages/salt/log/setup.py.orig 2016-01-14 00:01:38.333600053 +0100
+++ /usr/lib/python2.7/dist-packages/salt/log/setup.py  2016-01-15 17:31:05.335495222 +0100
@@ -649,6 +649,9 @@ def __global_logging_exception_handler(e
     '''
     This function will log all python exceptions.
     '''
+    with open('/tmp/test', 'a') as f:
+        f.write('x\n')
+        
     # Log the exception
     logging.getLogger(__name__).error(
         'An un-handled exception was caught by salt\'s global exception '
@@ -664,5 +667,11 @@ def __global_logging_exception_handler(e
     __GLOBAL_EXCEPTION_HANDLER(exc_type, exc_value, exc_traceback)


+def setup_global_exceptions_handler():
+    if sys.excepthook is not __global_logging_exception_handler:
+        # Set our own exception handler as the one to use
+        sys.excepthook = __global_logging_exception_handler
+
+
 # Set our own exception handler as the one to use
-sys.excepthook = __global_logging_exception_handler
+setup_global_exceptions_handler()

And this:

--- /usr/lib/python2.7/dist-packages/salt/master.py.orig    2016-01-14 00:10:41.218945899 +0100
+++ /usr/lib/python2.7/dist-packages/salt/master.py 2016-01-15 17:37:13.118832679 +0100
@@ -45,6 +45,7 @@ import salt.minion
 import salt.search
 import salt.key
 import salt.fileserver
+import salt.log.setup
 import salt.utils.atomicfile
 import salt.utils.event
 import salt.utils.verify
@@ -183,6 +184,12 @@ class Master(SMaster):
         controller for the Salt master. This is where any data that needs to
         be cleanly maintained from the master is maintained.
         '''
+        salt.log.setup.setup_global_exceptions_handler()
+        import time
+        time.sleep(5)
+        log.warning('_clear_old_jobs no exception at all')
+        raise ValueError('pena')
+
         jid_root = os.path.join(self.opts['cachedir'], 'jobs')
         search = salt.search.Search(self.opts)
         last = int(time.time())

But that didn't log anything. (I added the write to /tmp/test later because I didn't notice anything happening.)

And then I raised an exception from the Master instead:

@@ -331,6 +338,9 @@ class Master(SMaster):
         enable_sigusr1_handler()

         self.__set_max_open_files()
+        log.error('HERE')
+        log.error(repr(sys.excepthook))
+        raise ValueError('test')
         clear_old_jobs_proc = multiprocessing.Process(
             target=self._clear_old_jobs)
         clear_old_jobs_proc.start()

The HERE is logged, and so is the excepthook:

2016-01-15 17:33:28,698 [salt.master      ][ERROR   /14802] HERE
2016-01-15 17:33:28,699 [salt.master      ][ERROR   /14802] <function __global_logging_exception_handler at 0x7fe1a8fb38c0>

But the __global_logging_exception_handler does not seem to get called at all.

@s0undt3ch

This comment has been minimized.

Copy link
Member

s0undt3ch commented Jan 15, 2016

So, actually, on latest develop, without my proposed PR in place, if you raise a ValueError in salt.master.Maintenance.run(), it will get logged...

What version of salt were you using when you were bitten by what you fixed in #30382

@wdoekes

This comment has been minimized.

Copy link
Contributor Author

wdoekes commented Jan 15, 2016

What version of salt were you using when you were bitten by what you fixed in #30382

0.17.5+ds-1 (Ubuntu Trusty)

As for calling sys.excepthook, I see this:

https://mail.python.org/pipermail/python-list/2012-June/625258.html
http://bugs.python.org/issue1230540
which ends with a fix, that looks like this:

@@ -178,11 +179,25 @@ class Master(SMaster):
         SMaster.__init__(self, opts)

     def _clear_old_jobs(self):
+        try:
+            self.__clear_old_jobs()
+        except (KeyboardInterrupt, SystemExit):
+            raise
+        except:
+            sys.excepthook(*sys.exc_info())
+
+    def __clear_old_jobs(self):

Which works. But that's basically what the log_exceptions decorator did.

@wdoekes

This comment has been minimized.

Copy link
Contributor Author

wdoekes commented Jan 15, 2016

So, actually, on latest develop, without my proposed PR in place, if you raise a ValueError in salt.master.Maintenance.run(), it will get logged...

Oh. Well. That should've logged the issue I stumbled upon too then.

That begs the question whether this issue has been fixed since 0.17, or whether my system has some other excepthook issue..

@wdoekes

This comment has been minimized.

Copy link
Contributor Author

wdoekes commented Jan 15, 2016

Additional info:
If I place this SMaster.__init__:

        import time
        time.sleep(5)
        log.warning('log works')
        raise ValueError()

then I get both the warning and the exception.

If I move that bit to Master.start or to Master._clear_old_jobs, then I get only my warning, but no exception. (I.e. the excepthook is not called.)

@s0undt3ch

This comment has been minimized.

Copy link
Member

s0undt3ch commented Jan 15, 2016

Placing that value error on salt.master.Master.start() also shows me the value error.

Can you get us the --versions-report of the salt code you're running on please?

@wdoekes

This comment has been minimized.

Copy link
Contributor Author

wdoekes commented Jan 15, 2016

$ salt-master --versions-report
           Salt: 0.17.5
         Python: 2.7.6 (default, Jun 22 2015, 17:58:13)
         Jinja2: 2.7.2
       M2Crypto: 0.21.1
 msgpack-python: 0.3.0
   msgpack-pure: Not Installed
       pycrypto: 2.6.1
         PyYAML: 3.10
          PyZMQ: 14.0.1
            ZMQ: 4.0.4
@wdoekes

This comment has been minimized.

Copy link
Contributor Author

wdoekes commented Jan 15, 2016

Okay, so, I took a similar Ubuntu Trusty, and installed the 0.17.5 from repo: same results.

Then I took the development version from git (af26817) and ran it:

sudo env PYTHONPATH=`pwd` ./scripts/salt-master -d

It's better than it was, but I don't get the backtrace here either.

Changeset:

--- a/salt/master.py
+++ b/salt/master.py
@@ -202,6 +202,10 @@ class Maintenance(SignalHandlingMultiprocessingProcess):
         This is where any data that needs to be cleanly maintained from the
         master is maintained.
         '''
+        import time; time.sleep(2)
+        log.warning('warning')
+        raise ValueError('error')
+
         salt.utils.appendproctitle('Maintenance')

         # init things that need to be done after the process is forked

Log output:

2016-01-15 21:36:05,077 [salt.transport.zeromq                            ][INFO    ][14533] Worker binding to socket ipc:///var/run/salt/master/workers.ipc
2016-01-15 21:36:05,176 [salt.transport.zeromq                            ][INFO    ][14532] Worker binding to socket ipc:///var/run/salt/master/workers.ipc
2016-01-15 21:36:07,025 [salt.master                                      ][WARNING ][15167] warning
2016-01-15 21:36:07,033 [salt.utils.process                               ][INFO    ][14476] Process <class 'salt.master.Maintenance'> (15167) died with exit status None, restarting...
2016-01-15 21:36:10,040 [salt.master                                      ][WARNING ][15189] warning
2016-01-15 21:36:10,047 [salt.utils.process                               ][INFO    ][14476] Process <class 'salt.master.Maintenance'> (15189) died with exit status None, restarting...
2016-01-15 21:36:13,057 [salt.master                                      ][WARNING ][15190] warning

A couple of things stand out here:

  • it restarts the daemon -- which is better than nothing
  • it does print which job got lost, which is also an improvement
  • it does not print a traceback -- similar to the 0.17.5 version
@wdoekes

This comment has been minimized.

Copy link
Contributor Author

wdoekes commented Jan 15, 2016

import functools
import multiprocessing
import os
import six
import sys

def global_handler(type_, value, traceback):
    sys.stderr.write('STDERR [{}]: {}, {}, {}\n'.format(
        os.getpid(), type_, value, traceback))

def log_exceptions(log_object, where):
    def wrapper(function):
        @functools.wraps(function)
        def _wrapped(*args, **kwargs):
            try:
                return function(*args, **kwargs)
            except:
                exc = sys.exc_info()
                try:
                    msg = 'Got exception inside {}'.format(where)
                    sys.stderr.write('STDERR: {}\n'.format(msg))
                    # log_object.exception('Got exception inside {}'.format(
                    #     where))
                except:
                    pass
                six.reraise(*exc)
        return _wrapped

    return wrapper

class Process(multiprocessing.Process):
    # @log_exceptions(None, 'Process.run')
    def run(self):
        sys.excepthook = global_handler
        print('RUNNING[{}]'.format(os.getpid()))
        raise ValueError('error in child process')

# sys.excepthook = global_handler

print('STARTED[{}]'.format(os.getpid()))
# raise ValueError('error in master process')
p = Process()
p.start()
p.join()
print('DONE   [{}]'.format(os.getpid()))

There, whatever I try, I cannot get the excepthook to work from the subprocess.

The 'error in master process' gets handled just fine by the excepthook (when uncommented).

The 'error in child process', not so. With python2.7 nor with python3.4.

@s0undt3ch

This comment has been minimized.

Copy link
Member

s0undt3ch commented Jan 16, 2016

First, I'm sorry. You had already mentioned that you were running 0.17 when I asked you for the versions report.

Only recently, multiprocessing logging was added to salt, and by recently, the develop branch. This code hasn't been released yet in any stable version.
So, only the next major release of salt will properly handle this situations automatically.

This multiprocessing logging that I'm talking about is what enables processes started by salt to continue to log to the parent process which is the one which actually writes to log files and the console.

In conclusion, the next major version of salt will tell you about these issues.

Thank You for your time and patience with getting us to the right conclusion.

I believe we can close this pull request?

@wdoekes

This comment has been minimized.

Copy link
Contributor Author

wdoekes commented Jan 16, 2016

I believe we can close this pull request?

Well. No.

I'm telling you, I tried the development version -- which is the "newest version", yes? And yes, I get a log of the dying child, but I don't get a backtrace.

Quoting myself:

Then I took the development version from git (af26817) and ran it:
...
Process <class 'salt.master.Maintenance'> (15167) died with exit status None, restarting...

^-- that, is a "log of the event", but not a backtrace.

Quoting myself:

  • it restarts the daemon -- which is better than nothing
  • it does print which job got lost, which is also an improvement

^-- yes, better than 0.17.5

  • it does not print a traceback -- similar to the 0.17.5 version

^-- no, no backtrace. equally bad as 0.17.5

Do you really get a backtrace? Or are you simply getting the "Process has exited" message?

This multiprocessing logging that I'm talking about is what enables processes started by salt to continue to log to the parent process which is the one which actually writes to log files and the console.

I already get logs from the children with 0.17.5.

The logging I don't get, with 0.17.5 nor with 'development', is sys.excepthook calling.

@s0undt3ch

This comment has been minimized.

Copy link
Member

s0undt3ch commented Jan 16, 2016

❯ git pull upstream --tags                                                                                                                                                  develop dc95b28 ✗
remote: Counting objects: 36, done.
remote: Total 36 (delta 26), reused 26 (delta 26), pack-reused 10
Unpacking objects: 100% (36/36), done.
From github.com:saltstack/salt
   4b8bdd0..85aa70a  2014.7     -> upstream/2014.7
   a26c10a..5a923b3  2015.5     -> upstream/2015.5
   7775d65..1bace55  2015.8     -> upstream/2015.8
 * [new tag]         v2015.5.8  -> v2015.5.8
 * [new tag]         v2015.5.9  -> v2015.5.9
 * [new tag]         v2015.8.3  -> v2015.8.3
You asked to pull from the remote 'upstream', but did not specify
a branch. Because this is not the default configured remote
for your current branch, you must specify a branch on the command line.
~/projects/SaltStack/salt/develop virtualenv:PytestSalt27 exitcode:1
❯ git describe                                                                                                                                                              develop dc95b28 ✗
v2015.8.3-5104-gdc95b28
~/projects/SaltStack/salt/develop virtualenv:PytestSalt27 
❯  
@s0undt3ch

This comment has been minimized.

Copy link
Member

s0undt3ch commented Jan 16, 2016

❯ git diff                                                                                                                                                                  develop dc95b28 ✗
diff --git a/salt/master.py b/salt/master.py
index 455b598..65f3a9f 100644
--- a/salt/master.py
+++ b/salt/master.py
@@ -202,6 +202,7 @@ class Maintenance(SignalHandlingMultiprocessingProcess):
         This is where any data that needs to be cleanly maintained from the
         master is maintained.
         '''
+        1/0
         salt.utils.appendproctitle('Maintenance')

         # init things that need to be done after the process is forked
@@ -552,6 +553,8 @@ class Master(SMaster):
             # No custom signal handling was added, install our own
             signal.signal(signal.SIGINT, self._handle_signals)

+        1/0
+
         self.process_manager.run()

     def _handle_signals(self, signum, sigframe):  # pylint: disable=unused-argument
~/projects/SaltStack/salt/develop virtualenv:PytestSalt27 
❯  
❯ salt-master -c ~/projects/SaltStack/raas/.config                                                                                                                          develop dc[32/176]
[INFO    ][MainProcess       ][salt.cli.daemons:129 ] Setting up the Salt Master
[INFO    ][MainProcess       ][salt.daemons.masterapi                                           :215 ] Preparing the vampas key for local communication
[INFO    ][MainProcess       ][salt.cli.daemons                                                 :163 ] The salt master is starting up
[INFO    ][MainProcess       ][salt.master                                                      :476 ] salt-master is starting as user 'vampas'
[INFO    ][MainProcess       ][salt.master                                                      :367 ] Current values for max open files soft/hard setting: 1024/4096
[INFO    ][MainProcess       ][salt.master                                                      :378 ] The value for the 'max_open_files' setting, 100000, is higher than what the user running
 salt is allowed to raise to, 4096. Defaulting to 4096.
[INFO    ][MainProcess       ][salt.master                                                      :384 ] Raising max open files value to 4096
[INFO    ][MainProcess       ][salt.master                                                      :390 ] New values for max open files soft/hard values: 4096/4096
[INFO    ][MainProcess       ][salt.master                                                      :490 ] Creating master process manager
[INFO    ][MainProcess       ][salt.master                                                      :493 ] Creating master publisher process
[INFO    ][MainProcess       ][salt.master                                                      :499 ] Creating master event publisher process
[INFO    ][Process-2         ][salt.transport.zeromq                                            :637 ] Starting the Salt Publisher on tcp://0.0.0.0:4505
[INFO    ][Process-2         ][salt.transport.zeromq                                            :641 ] Starting the Salt Puller on ipc:///home/vampas/projects/SaltStack/raas/.config/cache/.sa
lt-unix/publish_pull.ipc
[INFO    ][MainProcess       ][salt.master                                                      :504 ] Creating master maintenance process
[INFO    ][MainProcess       ][salt.master                                                      :508 ] Creating master reactor process
Process Maintenance-5:
Traceback (most recent call last):
  File "/home/vampas/.dotfiles/.ext/pyenv/versions/2.7.8/lib/python2.7/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/home/vampas/projects/SaltStack/salt/develop/salt/master.py", line 205, in run
    1/0
ZeroDivisionError: integer division or modulo by zero
[INFO    ][MainProcess       ][salt.master                                                      :512 ] Creating master event return process
[INFO    ][MainProcess       ][salt.master                                                      :540 ] Creating master request server process
[ERROR   ][MainProcess       ][salt.log.setup                                                   :1034] An un-handled exception was caught by salt's global exception handler:
ZeroDivisionError: integer division or modulo by zero                                                                                                                                   [4/176]
Traceback (most recent call last):
  File "/home/vampas/.dotfiles/.ext/pyenv/versions/PytestSalt27/bin/salt-master", line 6, in <module>
    exec(compile(open(__file__).read(), __file__, 'exec'))
  File "/home/vampas/projects/SaltStack/salt/develop/scripts/salt-master", line 22, in <module>
    salt_master()
  File "/home/vampas/projects/SaltStack/salt/develop/salt/scripts.py", line 47, in salt_master
    master.start()
  File "/home/vampas/projects/SaltStack/salt/develop/salt/cli/daemons.py", line 164, in start
    self.master.start()
  File "/home/vampas/projects/SaltStack/salt/develop/salt/master.py", line 556, in start
    1/0
ZeroDivisionError: integer division or modulo by zero
Traceback (most recent call last):
  File "/home/vampas/.dotfiles/.ext/pyenv/versions/PytestSalt27/bin/salt-master", line 6, in <module>
    exec(compile(open(__file__).read(), __file__, 'exec'))
  File "/home/vampas/projects/SaltStack/salt/develop/scripts/salt-master", line 22, in <module>
    salt_master()
  File "/home/vampas/projects/SaltStack/salt/develop/salt/scripts.py", line 47, in salt_master
    master.start()
  File "/home/vampas/projects/SaltStack/salt/develop/salt/cli/daemons.py", line 164, in start
    self.master.start()
  File "/home/vampas/projects/SaltStack/salt/develop/salt/master.py", line 556, in start
    1/0
ZeroDivisionError: integer division or modulo by zero
[INFO    ][Process-8:1       ][salt.transport.zeromq                                            :466 ] Setting up the master communication server
[ERROR   ][EventReturn-7     ][salt.utils.event                                                 :1071] Could not store events [{'tag': 'salt/event_listen/start', 'data': {'_stamp': '2016-01-1
6T13:37:01.305293'}}]. Returner raised exception: [Errno 111] Connection refused
[INFO    ][MWorker-8:3       ][salt.transport.zeromq                                            :524 ] Worker binding to socket ipc:///home/vampas/projects/SaltStack/raas/.config/cache/.salt-
unix/workers.ipc
[INFO    ][MWorker-8:2       ][salt.transport.zeromq                                            :524 ] Worker binding to socket ipc:///home/vampas/projects/SaltStack/raas/.config/cache/.salt-
unix/workers.ipc

I'm getting the tracebacks....

@wdoekes

This comment has been minimized.

Copy link
Contributor Author

wdoekes commented Jan 16, 2016

I'm getting the tracebacks....

Yes. But not how you think.

Thank you for that log output, as it explains everything.

Let us go back to the excepthook:
https://github.com/saltstack/salt/blob/develop/salt/log/setup.py#L1029

Do you see any An un-handled exception was caught by salt\'s global exception... in your log message? I don't, and as far as I can tell, that's because we're looking at stderr.

Try running the salt-master with -d, and checking the log file instead. If I'm correct, you'll find no tracebacks in your log.

@s0undt3ch

This comment has been minimized.

Copy link
Member

s0undt3ch commented Jan 16, 2016

I wiped my master log file completely, and started the master in daemon mode, the logfile log level is default, INFO.
The contents of the log file can be found here.

I did however notice that I only got the traceback from the master's process. I didn't get the traceback from the maintenance process because python hides that from the except hook.

I will need some more time to cook a solution similar to yours without the need to decorate all run or start methods.

@s0undt3ch

This comment has been minimized.

Copy link
Member

s0undt3ch commented Jan 16, 2016

#30388 has been updated to log uncaught exception on multiprocessing processes that salt starts. Care to take the updated code for a spin?

@wdoekes

This comment has been minimized.

Copy link
Contributor Author

wdoekes commented Jan 16, 2016

I'm glad we agree.

That new code is more to my liking. I didn't run it, but it looks like it should work (if you add a reraise, as noted there).

Thanks for fixing! I'll close this one out.

@wdoekes wdoekes closed this Jan 16, 2016

@s0undt3ch

This comment has been minimized.

Copy link
Member

s0undt3ch commented Jan 16, 2016

Thank you for your time and patience in getting us to the best result.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment