Salt-master crippled after receiving too much output from salt command. #55180

Closed
H20-17 opened this issue Oct 31, 2019 · 11 comments · Fixed by #58283
Labels
cannot-reproduce (cannot be replicated with info/context provided), Pending-Discussion (the issue or pull request needs more discussion before it can be closed or merged), stale
Milestone

Comments

@H20-17

H20-17 commented Oct 31, 2019

Description of Issue

I have a salt-master (v2019.2.2 py3) and a salt-minion (v2019.2.2 py3). The minion is a Windows salt-minion; I have no idea if that makes any difference. If the salt-master receives too much output from the minion, the master is crippled and does not auto-recover. Instead it has to be restarted.

Setup

No special setup other than as described in the Description

Steps to Reproduce Issue

Run the following command

/usr/local/bin/salt -t 100 windows_minion cmd.powershell_all ps

That command will translate to

ps | convertto-json

on the minion, which produces thousands of lines of output (in my case approx 35000 lines of output).

I don't have any opinion on whether or not the master should be able to handle this volume of output. Either it should or it shouldn't, but the master shouldn't be taken down by it. If the master is receiving too much output, it should just cut the minion off and handle that gracefully.
What I observe is that the master just hangs or times out (I'm not sure which) and it doesn't print the job response (although jobs.lookup_jid does show you something).

That in itself isn't a big issue, but the thing is that the master never recovers. The salt command on the master does not work after this and you have to restart the master.

Versions Report

Salt Version:
           Salt: 2019.2.2

Dependency Versions:
           cffi: 1.11.5
       cherrypy: Not Installed
       dateutil: 2.8.0
      docker-py: Not Installed
          gitdb: 2.0.3
      gitpython: 2.1.11
          ioflo: Not Installed
         Jinja2: 2.10.1
        libgit2: Not Installed
        libnacl: Not Installed
       M2Crypto: Not Installed
           Mako: Not Installed
   msgpack-pure: Not Installed
 msgpack-python: 0.6.1
   mysql-python: Not Installed
      pycparser: 2.14
       pycrypto: 2.6.1
   pycryptodome: Not Installed
         pygit2: Not Installed
         Python: 3.7.4 (default, Jul  9 2019, 16:32:37)
   python-gnupg: Not Installed
         PyYAML: 3.13
          PyZMQ: 17.0.0
           RAET: Not Installed
          smmap: 2.0.3
        timelib: Not Installed
        Tornado: 4.5.3
            ZMQ: 4.3.2

System Versions:
           dist: fedora 30 Thirty
         locale: UTF-8
        machine: x86_64
        release: 5.3.7-200.fc30.x86_64
         system: Linux
        version: Fedora 30 Thirty

@frogunder
Contributor

@H20-17 Thank you for reporting this issue.

I am however not able to replicate it.

For me, running that command produces about 32,000 lines of output and takes about 15 seconds. The master is still alive and can continue to run commands.
I have a Fedora 30 py3 master and Win 2019 py3 minion.

@frogunder frogunder added this to the Blocked milestone Nov 1, 2019
@frogunder frogunder added the cannot-reproduce cannot be replicated with info/context provided label Nov 1, 2019
@H20-17
Author

H20-17 commented Nov 1, 2019

I tried this again today and got the same behaviour as when I reported this. I also see the following error in the salt-master log:

Nov 01 13:05:48 OURSERGORSALT.MYORG.ORG salt-master[21849]: Total run time:  84.985 s[ERROR   ] Exception occurred while handling stream: 1563469 exceeds max_bin_len(1048576)

It looks like some problem is occurring that isn't getting handled properly. I didn't succeed in getting it to compromise the functioning of the salt-master again today, but it happened yesterday.
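
For anyone who wants to reproduce the limit itself outside of Salt, here is a minimal sketch (an illustration, not code from this thread) assuming msgpack >= 0.6 and < 1.0, where the per-object defaults dropped to 1 MiB:

import msgpack

# Pack a 2 MiB bytes object; unpacking it with the 0.6.x defaults trips
# the 1 MiB max_bin_len limit, just like the error in the log above.
payload = msgpack.packb(b"x" * (2 * 1024 * 1024), use_bin_type=True)

try:
    msgpack.unpackb(payload, raw=False)
except ValueError as exc:
    # e.g. "2097152 exceeds max_bin_len(1048576)" on msgpack 0.6.x
    print(exc)

# Raising the limit explicitly avoids the exception on those versions:
msgpack.unpackb(payload, raw=False, max_bin_len=2 * 1024 * 1024)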

@H20-17
Author

H20-17 commented Nov 1, 2019

It still seems to trigger the condition where the salt-master is crippled. It's not definite, though, since I can't reproduce it 100% reliably. After this happens once, about one in five salt commands times out. Repeating it will probably increase the frequency.

@frogunder
Contributor

@saltstack/team-core @saltstack/team-windows

Can any of you take a look at this? Thanks.

@frogunder frogunder added the Pending-Discussion The issue or pull request needs more discussion before it can be closed or merged label Nov 1, 2019
@arizvisa
Contributor

arizvisa commented Nov 5, 2019

@H20-17, can you check whether the solutions/discussions in #54884 or #53230 solve your problem? Some versions of msgpack impose outrageously low default limits to prevent denial-of-service attacks, which results in that particular exception being thrown. Pinning msgpack to a different version will likely resolve the exception you're seeing.

The thing that really sucks is that msgpack literally switched the kwargs used for this between minor versions, so it's not too easy for saltstack to support every single version of msgpack out there.

Maybe warnings should be thrown when using weird/untested versions of a module or we should bump the msgpack version in requirements/base.txt?
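
As a rough sketch of the kind of version warning suggested above (hypothetical; the tested range below is made up, not taken from requirements/base.txt):

import warnings
import msgpack

# Hypothetical "known good" range; the real bounds would have to come from
# whatever Salt actually tests against.
TESTED_MIN = (0, 5, 6)
TESTED_MAX = (1, 0, 0)

if not (TESTED_MIN <= msgpack.version < TESTED_MAX):
    warnings.warn(
        "msgpack %s has not been validated against this Salt release; "
        "unpacking limits may differ between versions" % (msgpack.version,)
    )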

@H20-17
Author

H20-17 commented Nov 5, 2019

I'm sure it is msgpack, and I'm not blaming the Salt people for whatever is going on, but something like a DoS occurs right now and I don't understand why the salt-master software isn't dealing with it anyway. (However, I don't know the underlying details, so maybe I'm asking too much.) I'm OK with this being closed now since it seems to be the focus of several other open issues.

@xeacott
Contributor

xeacott commented Nov 5, 2019

I just set this environment up and my master is having no problem. I ran the command back-to-back and it seems fine. Is it possible this is a system issue or a network issue outside of Salt? Just thinking outside the box, but I piped the output to a file and it came back as 4.07 MB. That doesn't seem like anything the master would trip over.

@stale

stale bot commented Jan 7, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

If this issue is closed prematurely, please leave a comment and we will gladly reopen the issue.

@stale stale bot added the stale label Jan 7, 2020
@arizvisa
Contributor

arizvisa commented Jan 7, 2020

This should probably be closed as a msgpack issue... but @xeacott: to test these msgpack-related issues, keep in mind that the msgpack payload gets compressed, so to properly test these boundaries the state data needs to contain non-compressible data in order to exercise this particular code path.

The best way to generate non-compressible data is with white noise or output from a block cipher (which is nearly random as well). If you're in a POSIX environment, you can feed the /dev/random device through tr -cd '[:alpha:]' to filter it down to YAML-compatible characters.

For example, to generate a 1.5 MiB (mostly) non-compressible kwarg to put into a state so it gets packed and compressed by msgpack:

$ dd if=/dev/random | tr -cd '[:alpha:]' | dd bs=1024 count=1536
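
A quick way to check (an illustration added here, with zlib standing in for whatever compression sits in front of msgpack) that data built this way still crosses the 1 MiB default after packing and compressing:

import os
import zlib
import msgpack

# Stand-in for the dd/tr output above: ~3 MiB of incompressible bytes.
blob = os.urandom(3 * 1024 * 1024)

packed = msgpack.packb(blob, use_bin_type=True)
compressed = zlib.compress(packed)

# Random data barely compresses, so the payload stays well above the 1 MiB
# default limit that msgpack 0.6.x enforces when unpacking.
print(len(compressed) > 1024 * 1024)   # True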

@stale

stale bot commented Jan 7, 2020

Thank you for updating this issue. It is no longer marked as stale.

@stale stale bot removed the stale label Jan 7, 2020
@stale

stale bot commented Feb 6, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

If this issue is closed prematurely, please leave a comment and we will gladly reopen the issue.

@stale stale bot added the stale label Feb 6, 2020
oeuftete pushed a commit to oeuftete/salt that referenced this issue Oct 6, 2020
Salt 3000 introduced a dependency on Python msgpack >= 0.6. In msgpack 0.6, its developers changed the max length of a string that could be unpacked from 2 GiB to 1 MiB; they then changed it to 100 MiB in msgpack 1.0. So when using msgpack >= 0.6 and < 1.0, any message larger than 1 MiB throws an error when unpacked. For example, a file state which tries to send a file larger than 1 MiB will get the following error:

Unable to manage file: 1048688 exceeds max_str_len(1048576)

This error could send the master into a tailspin where subsequent file
operations take a long time or just time out.

This patch sets the keyword argument max_buffer_size for the Unpacker to 100 MiB, which is now the msgpack 1.0 default. It will allow people to use msgpack versions >= 0.6 and < 1.0 without this crazy low limit. It should be discussed whether we go back to the 2 GiB size from the pre-0.6 days (my vote), or just anything larger than 100 MiB. This code does a version check, and will only set the keyword on msgpack >= 0.6 and < 1.0.

The test checks that we have a buffer size of at least 100 MiB in case the Python msgpack devs decide to lower this value again and break things. We want to enforce a decent minimum.

fixes saltstack#53230, fixes saltstack#54884,
fixes saltstack#55180, fixes saltstack#55184,
fixes saltstack#57026, fixes saltstack#57921
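
The shape of that version-gated change might look something like the following (an illustrative sketch only; the actual patch in #58283 may differ in where and how the kwarg is applied):

import msgpack

def build_unpacker(**kwargs):
    # Only the 0.6.x series needs the workaround: 1.0 already defaults
    # max_buffer_size to 100 MiB, and pre-0.6 allowed up to 2 GiB.
    if (0, 6, 0) <= msgpack.version < (1, 0, 0):
        kwargs.setdefault("max_buffer_size", 100 * 1024 * 1024)
    return msgpack.Unpacker(raw=False, **kwargs)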
twangboy pushed a commit to psyer/salt that referenced this issue Oct 6, 2020
twangboy pushed a commit to psyer/salt that referenced this issue Oct 6, 2020
dwoz pushed a commit that referenced this issue Oct 8, 2020
dwoz pushed a commit that referenced this issue Oct 8, 2020