Log filled with "Exception occurred while Subscriber handling stream: Already reading" #49147

Open
furgerf opened this issue Aug 16, 2018 · 45 comments

Comments

@furgerf commented Aug 16, 2018

Description of Issue/Question

Hi, on a server running salt-minion and salt-master, the minion's log file filled up (quite quickly) today with the following message:

2018-08-12 06:25:10,762 [salt.transport.ipc][ERROR   ][25008] Exception occurred while Subscriber handling stream: Already reading
2018-08-12 06:25:10,762 [salt.transport.ipc][ERROR   ][25008] Exception occurred while Subscriber handling stream: Already reading
2018-08-12 06:25:10,762 [salt.transport.ipc][ERROR   ][25008] Exception occurred while Subscriber handling stream: Already reading
2018-08-12 06:25:10,763 [salt.transport.ipc][ERROR   ][25008] Exception occurred while Subscriber handling stream: Already reading
2018-08-12 06:25:10,763 [salt.transport.ipc][ERROR   ][25008] Exception occurred while Subscriber handling stream: Already reading
2018-08-12 06:25:10,763 [salt.transport.ipc][ERROR   ][25008] Exception occurred while Subscriber handling stream: Already reading
2018-08-12 06:25:10,763 [salt.transport.ipc][ERROR   ][25008] Exception occurred while Subscriber handling stream: Already reading
2018-08-12 06:25:10,763 [salt.transport.ipc][ERROR   ][25008] Exception occurred while Subscriber handling stream: Already reading
2018-08-12 06:25:10,763 [salt.transport.ipc][ERROR   ][25008] Exception occurred while Subscriber handling stream: Already reading
2018-08-12 06:25:10,764 [salt.transport.ipc][ERROR   ][25008] Exception occurred while Subscriber handling stream: Already reading

Stopping the minion service didn't stop the process from writing to the log file, nor did SIGTERM; I only got rid of it with SIGKILL (PID 25008 here):

$ ps aux | grep [p]ython
root     24851  0.0  0.1 101600  6048 ?        S    Aug10   0:00 /usr/bin/python /usr/bin/salt-minion
root     24854  0.0  0.2 548420 11264 ?        Sl   Aug10   0:09 /usr/bin/python /usr/bin/salt-minion
root     25005  0.0  0.1 101600  6412 ?        S    Aug10   0:00 /usr/bin/python /usr/bin/salt-minion
root     25008 99.9  0.6 549964 27508 ?        Sl   Aug10 8403:05 /usr/bin/python /usr/bin/salt-minion

Setup

I don't know what's relevant here...

Steps to Reproduce Issue

Don't know either :/

Versions Report

Salt Version:
           Salt: 2017.7.2

Dependency Versions:
           cffi: Not Installed
       cherrypy: Not Installed
       dateutil: 1.5
      docker-py: Not Installed
          gitdb: 0.5.4
      gitpython: 0.3.2 RC1
          ioflo: Not Installed
         Jinja2: 2.7.2
        libgit2: Not Installed
        libnacl: Not Installed
       M2Crypto: Not Installed
           Mako: 0.9.1
   msgpack-pure: Not Installed
 msgpack-python: 0.4.6
   mysql-python: 1.2.3
      pycparser: Not Installed
       pycrypto: 2.6.1
   pycryptodome: Not Installed
         pygit2: Not Installed
         Python: 2.7.6 (default, Nov 23 2017, 15:49:48)
   python-gnupg: Not Installed
         PyYAML: 3.10
          PyZMQ: 14.4.0
           RAET: Not Installed
          smmap: 0.8.2
        timelib: Not Installed
        Tornado: 4.2.1
            ZMQ: 4.0.5

System Versions:
           dist: Ubuntu 14.04 trusty
         locale: ANSI_X3.4-1968
        machine: x86_64
        release: 3.19.0-80-generic
         system: Linux
        version: Ubuntu 14.04 trusty

The minion, master, and common packages are all the same version.

@Ch3LL (Contributor) commented Aug 16, 2018

Was there possibly anything before these log messages appeared that might be more telling as to why this occurred?

Is there any other reason why this might have occurred? Were you running a particular command, state, etc.?

@Ch3LL added the Info Needed label Aug 16, 2018
@Ch3LL added this to the Blocked milestone Aug 16, 2018
@furgerf (Author) commented Aug 16, 2018

As far as I know, no manual activity caused this. This looks like the beginning; there don't seem to be any earlier messages:

2018-08-10 13:01:57,661 [salt.log.setup   ][WARNING ][24854] Temporary logging is already configured
2018-08-10 13:01:57,720 [salt.log.setup   ][WARNING ][24854] Console logging already configured
2018-08-10 13:01:57,721 [salt.log.setup   ][WARNING ][24854] Console logging already configured
2018-08-10 13:01:57,723 [salt.log.setup   ][WARNING ][24854] Logfile logging already configured
2018-08-10 13:01:57,727 [salt.log.setup   ][ERROR   ][24854] An un-handled exception was caught by salt's global exception handler:
IOError: [Errno 5] Input/output error
Traceback (most recent call last):
  File "/usr/bin/salt-minion", line 26, in <module>
    salt_minion()
  File "/usr/lib/python2.7/dist-packages/salt/scripts.py", line 204, in salt_minion
    minion.start()
  File "/usr/lib/python2.7/dist-packages/salt/cli/daemons.py", line 340, in start
    super(Minion, self).start()
  File "/usr/lib/python2.7/dist-packages/salt/utils/parsers.py", line 1032, in start
    self.prepare()
  File "/usr/lib/python2.7/dist-packages/salt/cli/daemons.py", line 299, in prepare
    self.shutdown(1)
  File "/usr/lib/python2.7/dist-packages/salt/cli/daemons.py", line 397, in shutdown
    self.__class__.__name__, (exitmsg or '')).strip()))
  File "/usr/lib/python2.7/dist-packages/salt/utils/parsers.py", line 1045, in shutdown
    self.exit(exitcode, exitmsg)
  File "/usr/lib/python2.7/dist-packages/salt/utils/parsers.py", line 268, in exit
    optparse.OptionParser.exit(self, status, msg)
  File "/usr/lib/python2.7/optparse.py", line 1573, in exit
    sys.stderr.write(msg)
IOError: [Errno 5] Input/output error
2018-08-10 13:02:01,413 [salt.log.setup   ][WARNING ][25008] Temporary logging is already configured
2018-08-10 13:02:01,473 [salt.log.setup   ][WARNING ][25008] Console logging already configured
2018-08-10 13:02:01,473 [salt.log.setup   ][WARNING ][25008] Console logging already configured
2018-08-10 13:02:01,475 [salt.log.setup   ][WARNING ][25008] Logfile logging already configured
2018-08-10 13:02:01,477 [salt.transport.ipc][ERROR   ][25008] Exception occurred while Subscriber handling stream: Already reading
2018-08-10 13:02:01,477 [salt.transport.ipc][ERROR   ][25008] Exception occurred while Subscriber handling stream: Already reading

The master's log file contains many stack traces because it's unable to create temp files, as well as messages like these:

2018-08-14 11:57:41,801 [salt.utils.gitfs ][WARNING ][13836] Update lock file is present for gitfs remote 'http://XXX', skipping. If this warning persists, it is possible that the update process was interrupted, but the lock could also have been manually set. Removing /var/cache/salt/master/gitfs/67557cee86485c172576f530bfb03bbdb4c87c5971858ce20b19388e07b85606/.git/update.lk or running 'salt-run cache.clear_git_lock gitfs type=update' will allow updates to continue for this remote.
2018-08-14 11:58:42,687 [salt.utils.gitfs ][WARNING ][13836] gitfs_global_lock is enabled and update lockfile /var/cache/salt/master/gitfs/67557cee86485c172576f530bfb03bbdb4c87c5971858ce20b19388e07b85606/.git/update.lk is present for gitfs remote 'http://XXX'.

Sorry, I could've added that right at the start. Hope this helps!

@Ch3LL (Contributor) commented Aug 17, 2018

Ping @saltstack/team-transport: any ideas here, or follow-up questions, since this might be difficult to replicate?

@DmitryKuzmenko (Contributor) commented Aug 20, 2018

This is a race condition somewhere in the event bus client usage on the minion side. We use IPCMessageSubscriber to read the event bus. It's a singleton, so there can be two instances of the bus client sharing one subscriber, and since it's asynchronous, one tries to read while another read hasn't finished yet. I can't say more without a detailed analysis.
@furgerf is it easily reproducible on your side? Would it be possible to ask you to reproduce this on patched minion code to get better logging?
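
For illustration, here is a minimal standalone sketch of what "Already reading" means at the stream level. It is not Salt code; it only assumes Tornado 4.x (as in the versions report above) and a POSIX socketpair. Two readers share one IOStream, and the second read_bytes() call fails while the first is still pending:

import socket

from tornado import gen, ioloop, iostream

@gen.coroutine
def main():
    a, b = socket.socketpair()
    stream = iostream.IOStream(a)    # one shared stream, like the singleton subscriber
    first = stream.read_bytes(1)     # first client starts reading; no data has arrived yet
    try:
        yield stream.read_bytes(1)   # second client reads the same stream concurrently
    except Exception as exc:
        print("second read failed: %s" % exc)   # prints "Already reading"
    b.send(b"x")                     # unblock the first reader
    yield first

ioloop.IOLoop.current().run_sync(main)

In the Salt case the two readers would be two event clients sharing the singleton IPCMessageSubscriber, hence the flood of this error on every read attempt.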

@furgerf (Author) commented Aug 20, 2018

Hi @DmitryKuzmenko, no, unfortunately I have no idea how to reproduce it. From your description, it sounds like you might be able to trigger this more easily by creating a bunch of bus clients and having them read from the same subscriber, if that's an option in the relevant code.
Let me know if I can help with more info.

@DmitryKuzmenko (Contributor) commented Aug 20, 2018

@furgerf it's OK.
As for the problem, I think we could try to catch this by code review. There are not many places where that subscriber is used.
Did you do something on that minion right before? I see a strange exception report in your log 4 seconds before the issue happened. Did you restart the minion manually or by cron? Did the minion execute some state or scheduled task?

@Ch3LL I propose to mark this as a bug and put it in the backlog. If we get more reports here, we can increase the severity.

@furgerf (Author) commented Aug 21, 2018

@DmitryKuzmenko I've looked into the logs and see that there was some activity before the problem started: someone ran salt '*' cmd.run 'curl ifconfig.me'. There were no restarts, though.

Like I mentioned before, this server contains both a minion and a master; the master manages around 40 minions in total. About half of them are currently connected.

@DmitryKuzmenko (Contributor) commented Aug 21, 2018

@furgerf thank you for the info. I'll review the related code when I have time. If you reproduce it again, any new information would be much appreciated.

@viq (Contributor) commented Nov 9, 2018

Any news on this? I was just bitten by it on a couple of minions, e.g.:

# salt-call --versions-report
Salt Version:
           Salt: 2018.3.3
 
Dependency Versions:
           cffi: Not Installed
       cherrypy: Not Installed
       dateutil: 2.2
      docker-py: Not Installed
          gitdb: Not Installed
      gitpython: Not Installed
          ioflo: Not Installed
         Jinja2: 2.7.3
        libgit2: Not Installed
        libnacl: Not Installed
       M2Crypto: 0.21.1
           Mako: Not Installed
   msgpack-pure: Not Installed
 msgpack-python: 0.4.2
   mysql-python: Not Installed
      pycparser: Not Installed
       pycrypto: 2.6.1
   pycryptodome: Not Installed
         pygit2: Not Installed
         Python: 2.7.9 (default, Sep 25 2018, 20:42:16)
   python-gnupg: Not Installed
         PyYAML: 3.11
          PyZMQ: 14.4.0
           RAET: Not Installed
          smmap: Not Installed
        timelib: Not Installed
        Tornado: 4.4.3
            ZMQ: 4.0.5
 
System Versions:
           dist: debian 8.11 
         locale: UTF-8
        machine: x86_64
        release: 3.16.0-7-amd64
         system: Linux
        version: debian 8.11 

@rochen-joshua commented Nov 14, 2018

We're getting the same here in our setup; about 10 or so of our minions are hitting this and filling up drive space rapidly.

@rochen-joshua commented Nov 14, 2018

FYI: we have about 300 or so minions, so it's not global, and reinstalling did not fix the issue.

@tomwalsh (Contributor) commented Nov 20, 2018

We started getting this as well. Only a few minions, but it fills up the drive rather quickly, causing other issues.

@Ch3LL (Contributor) commented Nov 26, 2018

@DmitryKuzmenko it seems more people are reporting this, so I will increase the severity. Any idea when you can get to this issue? I know it's currently on your backlog due to higher-priority issues.

@DmitryKuzmenko (Contributor) commented Nov 27, 2018

I had some time today to work on this. I confirmed what I wrote here above. Tomorrow I'll provide a description and a solution.

@DmitryKuzmenko (Contributor) commented Nov 28, 2018

I don't have any code yet, but I promised to post an update, so here it is: the main problem is that IPCMessageSubscriber is a singleton. It's used by event objects that can be instantiated anywhere just to wait for some event, handle it, and forget about it. So we can have a number of event instances at once, and all of them share one subscriber. If several instances run read_async simultaneously, they all call stream.read_bytes, which fails for every subsequent call while the first one isn't done yet. Moreover, to conform with the subscriber pattern, all the instances have to get the same data when something is received.

The simplest way to fix this is to make IPCMessageSubscriber a non-singleton. For that we'd need to make IPCClient a non-singleton and slightly rewrite IPCMessageClient to implement the singleton behavior itself (or make it a non-singleton too). This could have a performance impact as well.

Another idea is to split the singleton: a non-singleton handler class strongly references the singleton instance, while the singleton weakly references the handlers. This would let the singleton know who is waiting for the event data and pass the data to all event objects.
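
A rough sketch of that second idea (SharedSubscriber and SubscriberHandler are hypothetical names, not the actual Salt implementation): the shared reader keeps only weak references to per-client handlers and fans each received payload out to every live handler, so only one read is ever pending on the underlying stream.

import weakref

class SharedSubscriber(object):
    """Singleton-like reader; keeps only weak references to its handlers."""

    def __init__(self):
        self._handlers = weakref.WeakSet()

    def register(self, handler):
        self._handlers.add(handler)

    def dispatch(self, payload):
        # Called once for every message read from the underlying stream.
        for handler in list(self._handlers):
            handler.deliver(payload)

class SubscriberHandler(object):
    """Non-singleton per-client object; strongly references the shared reader."""

    def __init__(self, shared):
        self._shared = shared    # strong reference keeps the reader alive
        self._inbox = []
        shared.register(self)

    def deliver(self, payload):
        self._inbox.append(payload)

    def read(self):
        return self._inbox.pop(0) if self._inbox else None

shared = SharedSubscriber()
h1, h2 = SubscriberHandler(shared), SubscriberHandler(shared)
shared.dispatch({"tag": "test/event"})
assert h1.read() == h2.read() == {"tag": "test/event"}

When a short-lived event client is garbage-collected, the WeakSet drops its handler automatically, so the shared reader never delivers to dead objects.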

Give me some time to think about the best solution. I'll provide one soon.

@Ch3LL (Contributor) commented Feb 12, 2019

@DmitryKuzmenko I know you are working on the Tornado stuff. Are you going to be able to get to this soon? If not, we can re-assign it to someone else.

@DmitryKuzmenko (Contributor) commented Feb 25, 2019

I'll take care of this.

@vandernorth commented Feb 26, 2019

We're using CentOS 7, and updating and restarting our minions with kernel 3.10.0-957.5.1.el7.x86_64 seems to have solved this problem for us. The affected minions had kernels between 3.10.0-5xx and 3.10.0-8xx.

Edit: This might not be kernel-related; it might just have been the reboot. After the update to salt-minion-2019.2.0-1.el7 it seemed there were two minions running on some of our machines.

@sfozz commented Feb 26, 2019

+1 here; we're seeing it on a selection of CentOS 7 systems as well. Interestingly, the most recent one was also running cPanel.

Salt Version:
           Salt: 2019.2.0

Dependency Versions:
           cffi: Not Installed
       cherrypy: Not Installed
       dateutil: Not Installed
      docker-py: Not Installed
          gitdb: Not Installed
      gitpython: Not Installed
          ioflo: Not Installed
         Jinja2: 2.7.2
        libgit2: Not Installed
        libnacl: Not Installed
       M2Crypto: Not Installed
           Mako: Not Installed
   msgpack-pure: Not Installed
 msgpack-python: 0.5.6
   mysql-python: Not Installed
      pycparser: Not Installed
       pycrypto: 2.6.1
   pycryptodome: Not Installed
         pygit2: Not Installed
         Python: 2.7.5 (default, Oct 30 2018, 23:45:53)
   python-gnupg: Not Installed
         PyYAML: 3.11
          PyZMQ: 15.3.0
           RAET: Not Installed
          smmap: Not Installed
        timelib: Not Installed
        Tornado: 4.2.1
            ZMQ: 4.1.4

System Versions:
           dist: centos 7.6.1810 Core
         locale: UTF-8
        machine: x86_64
        release: 3.10.0-957.5.1.el7.x86_64
         system: Linux
        version: CentOS Linux 7.6.1810 Core

@DejanKolar commented Mar 20, 2019

Updating the kernel to the latest release didn't solve the problem.

@DmitryKuzmenko (Contributor) commented Mar 20, 2019

@DejanKolar the PR is not merged yet. I'm waiting for the CI system to be fixed so the tests can be re-run. I expect this to be done in a couple of days.

@KChandrashekhar modified the milestones: March 29th, Priority Apr 5, 2019
@gsagwan commented Apr 7, 2019

I am also facing the same issue after upgrading to 2018.3.3.

@sfishback commented Apr 11, 2019

I observed the same issue after upgrading from salt-2018.3.0-1.el7 to salt-2019.2.0-1.el7 today. Our monitoring noticed /var was almost full. Doing a "systemctl stop salt-minion.service" did not work. I had to do a "kill -9" on two remaining orphaned salt-minion daemons to get the error messages in /var/log/salt/minion to stop.

@sfishback commented Apr 11, 2019

Also, for what it's worth...
We're running RHEL 7 kernel 3.10.0-957.el7.x86_64. The kernel-3.10.0-957.10.1.el7.x86_64 package is installed, but the system hasn't been rebooted yet to apply it.

@jimmyat commented May 1, 2019

I'm still seeing this issue on a large number of servers. Any idea when the fix will be available?

@afletch (Contributor) commented May 1, 2019

Perhaps @DmitryKuzmenko can update everyone on which release this will go into.

@sfozz commented May 1, 2019

I've added the following on cPanel servers, which appears to have addressed this:

cpanel_local:
  file.directory:
    - name:     /etc/cpanel/local
    - makedirs: True

cpanel_ignored:
  file.managed:
    - name:     /etc/cpanel/local/ignore_outdated_services
    - contents:
      - salt-minion

@DmitryKuzmenko (Contributor) commented May 7, 2019

Currently the fix is in all branches (2018.3, 2019.2, and develop). It will be released in the upcoming builds.

@anitakrueger (Contributor) commented May 14, 2019

Could you reference the PR or commit, by any chance? I got bitten by this today and it'd be interesting to see what the fix looks like. Thanks!

@DmitryKuzmenko (Contributor) commented May 14, 2019

@anitakrueger it was a long road, but it was finally done with #52445 and #52564.

@anitakrueger (Contributor) commented May 14, 2019

Awesome, much appreciated!

@elipsion commented May 22, 2019

We also encountered this late last evening; roughly 2-4% of our 250 minions were affected.

Any idea when the fix will hit http://repo.saltstack.com/?

@DmitryKuzmenko (Contributor) commented May 23, 2019

@elipsion in June, I think.

@doesitblend added the ZD label Jun 28, 2019
@doesitblend (Contributor) commented Jun 28, 2019

ZD-3869

@perfecto25 commented Oct 17, 2019

I had the same issue running Salt 2018.3.4 on CentOS 7.

It kept writing to the minion log every millisecond and caused one of my hosts to run out of disk space.

I think the reason was multiple salt processes running at the same time; you can see PID 1114 has a strange "1-20:07.." in the time column, maybe a zombie?

root@testbox /var# ps -ef | grep salt
root      1114     1 30 Oct11 ?        1-20:07:34 /usr/bin/python /usr/bin/salt-minion
root     10122  1114  0 Oct15 ?        00:00:00 /usr/bin/python /usr/bin/salt-minion
root     25791     1  7 11:18 ?        00:00:14 /usr/bin/python /usr/bin/salt-minion
root     26009     1  0 11:19 ?        00:00:00 /usr/bin/python /usr/bin/salt-minion
root     26165 24719 10 11:21 pts/0    00:00:00 grep --color=auto salt

Killing all salt processes and restarting the minion fixed the issue. Luckily this was a test box; if it had been production and this had caused a disk space issue, it would have been a much bigger problem.

Not sure how to reproduce it again.
