Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] Failed to send msg SaltReqTimeoutError('Message timed out') after upgrade 3003.5 -> 3005.1 #63582

Open
3 of 9 tasks
gyro666 opened this issue Jan 29, 2023 · 13 comments
Open
3 of 9 tasks
Labels
Bug broken, incorrect, or confusing behavior Duplicate Duplicate of another issue or PR - will be closed needs-triage Performance Regression The issue is a bug that breaks functionality known to work in previous releases. Transport

Comments

@gyro666
Copy link

gyro666 commented Jan 29, 2023

Description
intermittently minion state runs into above timeout
run time is then delayed 60s or more.

IMPORTANT: behavior starts after 4-7days of running and goes away after master restart.

usually after a timeout, state.apply continues.
This can cause huge delays with complex high states. Some can run longer then hour. (instead of minutes)

no obvious errors are reported in the logs on either master or minion side.

Setup
1 master, 58 minions (debian10)
onedir installation (but it was the same with the old style)

issue is observed within same subnet (no firewalls)
there is no iptables rules on the salt-master or minion

~# sed '/^#/d;/^$/d' /etc/salt/master

interface: (redacted primary NIC IP)
keep_jobs: 720
archive_jobs: True
max_open_files: 100000 # added to resolve issue no effect
worker_threads: 10 # added to resolve issue no effect
zmq_backlog: 1000 # added to resolve issue no effect
pub_hwm: 1000
ipc_write_buffer: 'dynamic' # added to resolve issue no effect
batch_safe_limit: 100 # added to resolve issue no effect
batch_safe_size: 8 # added to resolve issue no effect
file_roots:
  base:
    - /srv/salt
pillar_cache: False
reactor:
  - 'salt/minion/*/start':
    - /srv/reactor/sync_all.sls
log_level_logfile: trace
tcp_keepalive: True
mysql.host: '127.0.0.1'
mysql.user: '(redacted)'
mysql.pass: '(redacted)'
mysql.db: 'salt'
mysql.port: 3306
master_job_cache: mysql
~# cat /srv/reactor/sync_all.sls
sync_grains:
  local.saltutil.sync_all:
    - tgt: {{ data['id'] }}
~# tree /etc/salt/master.d
/etc/salt/master.d
└── vmware_yaml.conf

0 directories, 1 file

cat salt/tests/event_test123.sls

event_test123:
  event.send

Please be as specific as possible and give set-up details.

  • on-prem machine
  • VM (VMware)
  • VM running on a cloud service, please be explicit and add details
  • container (Kubernetes, Docker, containerd, etc. please specify)
  • or a combination, please be explicit
  • jails if it is FreeBSD
  • classic packaging
  • onedir packaging
  • used bootstrap to install

Steps to Reproduce the behavior
run (but also anything else like test.ping timeouts)

~# time salt-call state.apply tests/event_test123 -l trace test=true
... (snip)
[DEBUG   ] LazyLoaded jinja.render
[DEBUG   ] The functions from module 'yaml' are being loaded by dir() on the loaded module
[DEBUG   ] LazyLoaded yaml.render
[TRACE   ] ReqChannel send crypt load={'saltenv': 'base', 'prefix': '', 'cmd': '_file_list'}
[TRACE   ] ReqChannel send crypt load={'path': 'tests/event_test123.sls', 'saltenv': 'base', 'cmd': '_file_hash'}

... (timeout 60s)

[TRACE   ] Failed to send msg SaltReqTimeoutError('Message timed out')
[TRACE   ] ReqChannel send crypt load={'path': 'tests/event_test123.sls', 'saltenv': 'base', 'cmd': '_file_hash', 'nonce': 'd40f71acfb6641c4a4245f2ecdb0b2ea'}
[TRACE   ] ReqChannel send crypt load={'path': 'tests/event_test123.sls', 'saltenv': 'base', 'cmd': '_file_find'}
[DEBUG   ] In saltenv 'base', looking at rel_path 'tests/event_test123.sls' to resolve 'salt://tests/event_test123.sls'
[DEBUG   ] In saltenv 'base', ** considering ** path '/var/cache/salt/minion/files/base/tests/event_test123.sls' to resolve 'salt://tests/event_test123.sls'
[DEBUG   ] compile template: /var/cache/salt/minion/files/base/tests/event_test123.sls
[DEBUG   ] Jinja search path: ['/var/cache/salt/minion/files/base']
[DEBUG   ] Initializing new AsyncAuth for ('/etc/salt/pki/minion', '(redacted)', 'tcp://(redacted):4506')
[PROFILE ] Time (in seconds) to render '/var/cache/salt/minion/files/base/tests/event_test123.sls' using 'jinja' renderer: 0.008725404739379883
[DEBUG   ] Rendered data from file: /var/cache/salt/minion/files/base/tests/event_test123.sls:
event_test123:
  event.send

[PROFILE ] Time (in seconds) to render '/var/cache/salt/minion/files/base/tests/event_test123.sls' using 'yaml' renderer: 0.0001895427703857422
[DEBUG   ] The functions from module 'config' are being loaded by dir() on the loaded module
[DEBUG   ] LazyLoaded config.option
... (snip) 
local:
----------
          ID: event_test123
    Function: event.send
      Result: None
     Comment: Event would have been fired
     Started: 16:24:09.026465
    Duration: 0.927 ms
     Changes:
              ----------
              data:
                  None
              tag:
                  event_test123

Summary for local
------------
Succeeded: 1 (unchanged=1, changed=1)
Failed:    0
------------
Total states run:     1
Total run time:   0.927 ms

real    1m1.888s
user    0m0.870s
sys     0m0.281s

at the same time event log on the master show:

salt/auth       {
    "_stamp": "2023-01-25T15:42:03.171180",
    "act": "accept",
    "id": "(redacted)",
    "pub": "-----BEGIN PUBLIC KEY-----(redacted)-----END PUBLIC KEY-----",
    "result": true
}
minion/refresh/(redacted)  {
    "Minion data cache refresh": "(redacted)",
    "_stamp": "2023-01-25T15:42:03.601606"
}

... (timeout)

salt/job/20230125154303962005/ret/(redacted)       {
    "_stamp": "2023-01-25T15:43:03.967454",
    "arg": [
        "tests/event_test123",
        "test=true"
    ],
    "cmd": "_return",
    "fun": "state.apply",
    "fun_args": [
        "tests/event_test123",
        "test=true"
    ],
    "id": "(redacted)",
    "jid": "20230125154303962005",
    "out": "highstate",
    "retcode": 0,
    "return": {
        "event_|-event_test123_|-event_test123_|-send": {
            "__id__": "event_test123",
            "__run_num__": 0,
            "__sls__": "tests/event_test123",
            "changes": {
                "data": null,
                "tag": "event_test123"
            },
            "comment": "Event would have been fired",
            "duration": 1.163,
            "name": "event_test123",
            "result": null,
            "start_time": "16:43:03.968992"
        }
    },
    "tgt": "(redacted)",
    "tgt_type": "glob"
}

I can't include full traces as I don't have same version in the staging ENV, however I was able to perform stack traces on few occasions.

strace parser link: https://gitlab.com/gitlab-com/support/toolbox/strace-parser

strace done with:

ps auxwwu | grep "salt/run/run master" | awk '{print"-p " $2}' | xargs strace -fttTyyy -s 1024 -o /tmp/slow-salt.strace8

Here is a clear cut example of a succsefful run (no delays ~7s):

./strace-parser slow-salt.strace5 summary

Top 25 PIDs by Active Time
-----------

  pid         actv (ms)     wait (ms)     user (ms)    total (ms)    % of actv     syscalls     children
  -------    ----------    ----------    ----------    ----------    ---------    ---------    ---------
  1217          141.783      4642.889       466.752      5251.424       27.34%         8766            0
  1195          123.776      4136.611       236.396      4496.783       23.87%         2371            9
  2695           33.465         0.000        29.629        63.094        6.45%          621            0
  2699           27.311         0.000        28.591        55.902        5.27%          619            0
  2697           26.759         0.000        24.724        51.483        5.16%          620            0
  2701           24.789         0.000        24.817        49.606        4.78%          621            0
  2703           24.651         0.000        24.595        49.246        4.75%          619            0
  2709           24.275         0.000        24.371        48.646        4.68%          619            0
  2705           24.246         0.000        24.313        48.559        4.68%          619            0
  2707           24.165         0.000        24.390        48.555        4.66%          619            0
  1190           20.136      5286.575        17.244      5323.955        3.88%          301            0
  1202            7.715      5302.287        13.825      5323.827        1.49%          181            0
  1191            3.214      5308.719         9.035      5320.968        0.62%          101            1
  1185            2.037      5306.695         6.438      5315.170        0.39%           63            4
  1216            1.877      3747.416        21.825      3771.118        0.36%           97            1
  1204            1.867      5308.335        13.469      5323.671        0.36%          496            0
  1183            1.401      5374.442         4.928      5380.771        0.27%           60            0
  2693            0.459         0.156         2.381         2.996        0.09%           16            0
  2692            0.356      1010.606         3.239      1014.201        0.07%           24            0
  2694            0.355       717.827         1.682       719.864        0.07%           20            0
  2684            0.331      2747.060         1.291      2748.682        0.06%            7            0
  2713            0.289         0.199         1.807         2.295        0.06%           20            0
  1208            0.268      3768.628         0.949      3769.845        0.05%           11            0
  1179            0.263      5377.113         3.477      5380.853        0.05%           45            0
  1180            0.262      5309.562         2.206      5312.031        0.05%           21            0

PIDs   98
real   0m7.272s
user   0m1.043s
sys    0m0.518s

here is a 60s delayed run:

./strace-parser slow-salt.strace4 summary                                                                                                                                                          1 

Top 25 PIDs by Active Time
-----------

  pid         actv (ms)     wait (ms)     user (ms)    total (ms)    % of actv     syscalls     children
  -------    ----------    ----------    ----------    ----------    ---------    ---------    ---------
  1187          310.092     23521.801       356.283     24188.176       48.84%         3269           22
  2666           81.964         0.000       103.157       185.121       12.91%         2666            0
  2657           68.165         0.000        79.302       147.467       10.74%         2665            0
  2683           35.160         0.124       104.403       139.687        5.54%         2050            0
  2656           15.618         0.000        42.349        57.967        2.46%         1062            0
  2667           13.969         0.000        29.215        43.184        2.20%          651            0
  575             8.386     66074.359        51.820     66134.562        1.32%          724            1
  2655            7.907         0.000        18.592        26.499        1.25%          430            0
  2639            7.237         0.000        10.976        18.213        1.14%          219            0
  2640            6.856         0.000         9.809        16.665        1.08%          213            0
  2653            5.340         0.000        10.043        15.383        0.84%          263            0
  2638            5.201         0.000        10.878        16.079        0.82%          210            0
  2646            5.175         0.000         8.692        13.867        0.81%          211            0
  1185            5.132     64603.148        25.008     64633.289        0.81%          193           14
  2641            4.889         0.000         8.275        13.164        0.77%          213            0
  2647            4.819         0.000         8.979        13.798        0.76%          220            0
  2643            4.726         0.000         9.039        13.765        0.74%          220            0
  2642            4.649         0.000         8.553        13.202        0.73%          211            0
  2648            4.612         0.000         8.533        13.145        0.73%          211            0
  2649            4.598         0.000         8.976        13.574        0.72%          220            0
  2654            3.918         0.000         7.340        11.258        0.62%          175            0
  2682            3.831      2001.160        24.773      2029.764        0.60%          218            1
  2650            3.356         0.000         4.887         8.243        0.53%          130            0
  2651            3.248         0.000         4.870         8.118        0.51%          132            0
  2652            3.150         0.000         4.957         8.107        0.50%          130            0

PIDs   122
real   1m8.412s
user   0m1.200s
sys    0m0.634s

here is a bit of insight into the most delayed pid run there

PID 575

  724 syscalls, active time: 8.386ms, user time: 51.820ms, total time: 66134.562ms
  start time: 14:45:01.113204    end time: 14:46:07.247768

  syscall                 count    total (ms)      max (ms)      avg (ms)      min (ms)    errors
  -----------------    --------    ----------    ----------    ----------    ----------    --------
  select                      8     66073.844     10010.049      8259.230       461.675    ERESTARTNOHAND: 1
  clone                       1         2.003         2.003         2.003         2.003
  getdents64                 88         1.603         0.105         0.018         0.005
  openat                    120         1.092         0.017         0.009         0.006    ENOENT: 45   ENOTDIR: 1
  read                       63         0.743         0.045         0.012         0.006
  fstat                     104         0.625         0.018         0.006         0.005
  close                      75         0.580         0.021         0.008         0.005
  stat                       62         0.527         0.024         0.009         0.006    ENOENT: 12
  wait4                      47         0.513         0.116         0.011         0.005
  lseek                      65         0.392         0.007         0.006         0.006
  write                      17         0.335         0.054         0.020         0.008
  ioctl                      45         0.274         0.008         0.006         0.006    ENOTTY: 30
  getpid                     19         0.137         0.017         0.007         0.005
  rt_sigaction                8         0.050         0.007         0.006         0.006
  pipe2                       2         0.025         0.017         0.013         0.008
  ---------------

  Child PIDs:  2682

PID 2682

  218 syscalls, active time: 3.831ms, user time: 24.773ms, total time: 2029.764ms
  start time: 14:46:07.243889    end time: 14:46:09.273653

  syscall                 count    total (ms)      max (ms)      avg (ms)      min (ms)    errors
  -----------------    --------    ----------    ----------    ----------    ----------    --------
  futex                       6      2001.160      1000.104       333.527         0.011    EAGAIN: 1   ETIMEDOUT: 2
  getdents64                 28         0.646         0.079         0.023         0.011
  openat                     27         0.520         0.025         0.019         0.016    ENOENT: 2
  stat                       24         0.448         0.023         0.019         0.016    ENOENT: 2
  read                       16         0.440         0.059         0.027         0.011
  fstat                      34         0.420         0.017         0.012         0.009
  close                      26         0.397         0.021         0.015         0.011
  lseek                      22         0.310         0.026         0.014         0.011
  ioctl                      11         0.142         0.016         0.013         0.010    ENOTTY: 11
  write                       5         0.127         0.033         0.025         0.021
  getpid                      7         0.120         0.021         0.017         0.015
  clone                       1         0.053         0.053         0.053         0.053
  umask                       2         0.032         0.019         0.016         0.013
  set_robust_list             1         0.030         0.030         0.030         0.030
  rt_sigaction                2         0.027         0.016         0.014         0.011
  mmap                        1         0.024         0.024         0.024         0.024
  getrandom                   1         0.024         0.024         0.024         0.024
  munmap                      1         0.023         0.023         0.023         0.023
  prctl                       1         0.019         0.019         0.019         0.019
  mkdir                       1         0.016         0.016         0.016         0.016    EEXIST: 1
  gettid                      1         0.013         0.013         0.013         0.013
  ---------------

  Parent PID:  575
  Threads:  2683
  Child PIDs:  2683

PID 2683

  2050 syscalls, active time: 35.160ms, user time: 104.403ms, total time: 139.687ms
  start time: 14:46:07.271803    end time: 14:46:07.411490

  syscall                 count    total (ms)      max (ms)      avg (ms)      min (ms)    errors
  -----------------    --------    ----------    ----------    ----------    ----------    --------
  stat                      855        15.450         0.090         0.018         0.011    ENOENT: 29
  getdents64                386         6.409         0.055         0.017         0.010
  openat                    195         3.734         0.075         0.019         0.015
  lstat                     188         3.326         0.035         0.018         0.014
  close                     195         2.917         0.117         0.015         0.011
  fstat                     195         2.567         0.044         0.013         0.010
  write                      13         0.343         0.054         0.026         0.020
  read                        6         0.133         0.034         0.022         0.017
  futex                       3         0.124         0.051         0.041         0.029
  getpid                      3         0.088         0.037         0.029         0.016
  fcntl                       4         0.048         0.012         0.012         0.012
  set_robust_list             1         0.037         0.037         0.037         0.037
  gettid                      1         0.034         0.034         0.034         0.034
  ioctl                       2         0.028         0.016         0.014         0.012    ENOTTY: 2
  lseek                       2         0.024         0.012         0.012         0.012
  mprotect                    1         0.022         0.022         0.022         0.022
  ---------------

  Parent PID:  2682
  Threads:  2682

  

Expected behavior
Behave reliable consistent without timeouts and delays.

Screenshots
(n/a)

Versions Report

salt --versions-report (Provided by running salt --versions-report. Please also mention any differences in master/minion versions.) No differences between master/minion versions (3005.1)
:~# salt --versions-report
Salt Version:
          Salt: 3005.1

Dependency Versions:
          cffi: 1.14.6
      cherrypy: unknown
      dateutil: 2.8.1
     docker-py: Not Installed
         gitdb: Not Installed
     gitpython: Not Installed
        Jinja2: 3.1.0
       libgit2: Not Installed
      M2Crypto: Not Installed
          Mako: Not Installed
       msgpack: 1.0.2
  msgpack-pure: Not Installed
  mysql-python: Not Installed
     pycparser: 2.21
      pycrypto: 2.6.1
  pycryptodome: 3.9.8
        pygit2: Not Installed
        Python: 3.9.16 (main, Jan  6 2023, 22:50:03)
  python-gnupg: 0.4.8
        PyYAML: 5.4.1
         PyZMQ: 23.2.0
         smmap: Not Installed
       timelib: 0.2.4
       Tornado: 4.5.3
           ZMQ: 4.3.4

System Versions:
          dist: debian 10 buster
        locale: utf-8
       machine: x86_64
       release: 4.19.0-23-amd64
        system: Linux
       version: Debian GNU/Linux 10 buster

Additional context

Add any other context about the problem here.
None

@gyro666 gyro666 added Bug broken, incorrect, or confusing behavior needs-triage labels Jan 29, 2023
@welcome
Copy link

welcome bot commented Jan 29, 2023

Hi there! Welcome to the Salt Community! Thank you for making your first contribution. We have a lengthy process for issues and PRs. Someone from the Core Team will follow up as soon as possible. In the meantime, here’s some information that may help as you continue your Salt journey.
Please be sure to review our Code of Conduct. Also, check out some of our community resources including:

There are lots of ways to get involved in our community. Every month, there are around a dozen opportunities to meet with other contributors and the Salt Core team and collaborate in real time. The best way to keep track is by subscribing to the Salt Community Events Calendar.
If you have additional questions, email us at saltproject@vmware.com. We’re glad you’ve joined our community and look forward to doing awesome things with you!

@gyro666
Copy link
Author

gyro666 commented Jan 29, 2023

honorable mentions:
#63310 .. it's only ~50 minions .. mostly not doing anything also performance tuning made no diff
#63325 solid network connection and network traces do not show any networking issues, just a long pause
#62881 the issue is observed intermittently at any time with no other high load activity
oh and most importantly .. with 3003.5 was working just fine. (same number of minions)

@gyro666 gyro666 changed the title [BUG] Failed to send msg SaltReqTimeoutError('Message timed out') [BUG] Failed to send msg SaltReqTimeoutError('Message timed out') after upgrade 3003.5 -> 3005.1 Jan 29, 2023
@OrangeDog
Copy link
Contributor

This is a duplicate of the already mentioned regressions. Good to have more data though.

@OrangeDog OrangeDog added Regression The issue is a bug that breaks functionality known to work in previous releases. Performance Transport Duplicate Duplicate of another issue or PR - will be closed labels Jan 30, 2023
@frebib
Copy link
Contributor

frebib commented Jan 31, 2023

Out of curiosity, does TCP transport have this problem? In my/our experience, it's much more reliable than zmq

@gyro666
Copy link
Author

gyro666 commented Feb 2, 2023

@OrangeDog could you please reference/link duplicate(s)?

@OrangeDog
Copy link
Contributor

@gyro666 you already did

@BlackMetalz
Copy link

BlackMetalz commented Apr 17, 2023

Confirmed on 3005.1, happens randomly

@aarnaud
Copy link
Contributor

aarnaud commented Apr 19, 2023

Same for me, seem link to the latency between minion and master

@BlackMetalz
Copy link

Confirmed

@jeffrysleddens
Copy link

We were having the same issues in our, relatively small (~250 minions), Salt environment. For us swapping from the zeromq transport to the tcp transport solved the problem.

The issues we were seeing when we were still on the zeromq transport were exactly the same as described in the first issue. After a while we got more and more timeout issues and minions "not responding" and when we would restart the master everything worked fine again for a while. Now with the tcp transport everything keeps running great without the issues.

@krax13
Copy link

krax13 commented Aug 24, 2023

Confirmed in SUSE Manager. It happens about weekly. Changing to TCP transport is not an option, because the proxy does not support this method.

@dwoz
Copy link
Contributor

dwoz commented Dec 18, 2023

@gyro666 @krax13
This should be resolved in 3006.5. Can you please confirm?

@alrf
Copy link

alrf commented Feb 1, 2024

Newly installed minion can't connect (test.ping) to the salt-master, both are 3006.5.
The same Failed to send msg SaltReqTimeoutError('Message timed out') message in the logs.
The minion can connect to 4505, 4506 ports on the salt-master.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug broken, incorrect, or confusing behavior Duplicate Duplicate of another issue or PR - will be closed needs-triage Performance Regression The issue is a bug that breaks functionality known to work in previous releases. Transport
Projects
None yet
Development

No branches or pull requests

9 participants