
[BUG] Possible concurrency issue with reactor/orchestrate running on master only #57626

Closed
dpizzle opened this issue Jun 10, 2020 · 6 comments
Labels: Bug (broken, incorrect, or confusing behavior) · Phosphorus (v3005.0 release code name and version) · Reactor · severity-high (2nd-top severity: seen by most users, causes major problems)
dpizzle commented Jun 10, 2020

Description
I have a Salt master configured with 3 reactor/orchestrate pairs. The master listens for messages from network devices via napalm-logs, then performs three tasks:

  1. updates a MariaDB database with the event details

  2. updates the db with the host details

  3. runs a Python script which connects to the network device and collects further information to add to the db.

As the number of messages has increased, I'm noticing that more of the jobs are failing with one of the two errors below. The failures are sporadic and not tied to specific network devices.

Can state.orchestrate only run one job at a time, so that if a second state.orchestrate is called from the reactor while the first is still running, the second job fails?

"The function \"state.orchestrate\" is running as PID 6668 and was started at 2020, Jun 10 19:11:53.798538 with jid 20200610191153798538"
"Rendering SLS 'base:orchestrate.register_host_in_db' failed: Jinja variable 'salt.utils.context.NamespacedDictWrapper object' has no attribute 'data'"

Setup

/etc/salt # more master.d/reactors.conf
reactor:
  - 'napalm/syslog/*/*/*':
    - salt://reactor/register_host_in_db.sls
    - salt://reactor/register_event_in_db.sls

  - 'napalm/syslog/junos/PAM_AUTH/*':
    - salt://reactor/interrogate_node.sls
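For context when reasoning about the concurrency question here: the master processes matched reactions on a pool of worker threads, governed by the master config options `reactor_worker_threads` and `reactor_worker_hwm`, so multiple `runner.state.orchestrate` reactions can fire at the same time. A sketch of the relevant knobs (illustrative values, not a recommended fix):

```yaml
# Master config sketch (illustrative values, not a fix).
# Matched reactor events are handed to a pool of worker threads,
# so several runner.state.orchestrate reactions may run concurrently.
reactor_worker_threads: 10   # size of the reactor worker-thread pool
reactor_worker_hwm: 10000    # high-water mark for the reactor event queue
```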
/etc/salt # more reactor/register_host_in_db.sls

register host in db:
  runner.state.orchestrate:
    - args:
      - mods: orchestrate.register_host_in_db
      - pillar:
          data: {{ data|json }}
/etc/salt # more orchestrate/register_host_in_db.sls
register host in db - {{ hostname }}:
  mysql_query.run:
    - database: db
    - query: |
        INSERT ...
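The second error above suggests the orchestrate SLS is sometimes rendered without the pillar the reactor injected. Not a fix for the concurrency itself, but a defensive sketch of the SLS above (assuming the `data` pillar key passed by the reactor and the `host` field from the napalm-logs payload) that skips the render cleanly instead of raising a Jinja attribute error:

```yaml
{# Defensive rendering sketch: 'data' is the pillar key injected by
   the reactor; 'host' follows the napalm-logs payload shape used
   elsewhere in this report. If the pillar is missing, render nothing
   rather than fail on attribute access. #}
{% set event = pillar.get('data') %}
{% if event %}
register host in db - {{ event['host'] }}:
  mysql_query.run:
    - database: db
    - query: |
        INSERT ...
{% endif %}
```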

Versions Report

salt --versions-report
Salt Version:
           Salt: 2019.2.2

Dependency Versions:
           cffi: 1.14.0
       cherrypy: Not Installed
       dateutil: Not Installed
      docker-py: Not Installed
          gitdb: Not Installed
      gitpython: Not Installed
          ioflo: Not Installed
         Jinja2: 2.10
        libgit2: Not Installed
        libnacl: Not Installed
       M2Crypto: Not Installed
           Mako: Not Installed
   msgpack-pure: Not Installed
 msgpack-python: 1.0.0
   mysql-python: Not Installed
      pycparser: 2.20
       pycrypto: 2.6.1
   pycryptodome: Not Installed
         pygit2: Not Installed
         Python: 3.6.9 (default, Oct 17 2019, 11:10:22)
   python-gnupg: Not Installed
         PyYAML: 3.13
          PyZMQ: 19.0.1
           RAET: Not Installed
          smmap: Not Installed
        timelib: Not Installed
        Tornado: 4.5.3
            ZMQ: 4.3.2

System Versions:
           dist:
         locale: UTF-8
        machine: x86_64
        release: 3.10.0-1062.12.1.el7.x86_64
         system: Linux
        version: Not Installed

Additional Information

I've upgraded to 3000.3 and still experience the same behaviour.

@dpizzle dpizzle added the Bug broken, incorrect, or confusing behavior label Jun 10, 2020

dpizzle commented Jun 11, 2020

To remove the 3 tasks from the troubleshooting, I've replaced them with a single task that writes to the log file, and I still have the same issue.

/etc/salt # more master.d/reactors.conf
reactor:
  - 'napalm/syslog/*/*/*':
    - salt://reactor/debugger.sls
/etc/salt # more reactor/debugger.sls
debug:
  runner.state.orchestrate:
    - args:
      - mods: orchestrate.debugger
      - pillar:
          data: {{ data|json }}
/etc/salt # more orchestrate/debugger.sls
{# log host and pillar data to /var/log/salt/master #}
{% do salt.log.warning("orch.debugger: host:" + pillar['data']['host']|string|lower + " data:" + pillar['data']|string) %}

fix empty state - debugger:
  cmd.run:
    - name: /bin/true
    - unless: /bin/true
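As an aside, the empty-state workaround above can also be written with the `test` state module rather than a no-op `cmd.run` (an equivalent sketch):

```yaml
# Equivalent no-op state: satisfies the "no states found" check
# without shelling out to /bin/true.
fix empty state - debugger:
  test.succeed_without_changes:
    - name: debugger noop
```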

@xeacott xeacott changed the title Possible concurrency issue with reactor/orchestrate running on master only [BUG] Possible concurrency issue with reactor/orchestrate running on master only Jun 17, 2020

xeacott commented Jun 17, 2020

Interesting, thanks for the report @dpizzle. Unfortunately I don't have much experience with reactors, so I'll ask whether anyone on @saltstack/team-core is able to assist here. 😄

@xeacott xeacott added severity-high 2nd top severity, seen by most users, causes major problems Reactor labels Jun 17, 2020
@xeacott xeacott added this to the Approved milestone Jun 17, 2020
@sagetherage sagetherage modified the milestones: Approved, Aluminium Jul 29, 2020
@sagetherage sagetherage added the Aluminium Release Post Mg and Pre Si label Jul 29, 2020
@sagetherage sagetherage added Silicon v3004.0 Release code name and removed Aluminium Release Post Mg and Pre Si labels Feb 25, 2021
@sagetherage sagetherage modified the milestones: Aluminium, Silicon Feb 25, 2021
garethgreenaway commented

Based on the conversation in #54045, it seems that #56513 might resolve this one with the Aluminium release.


oeuftete commented Mar 4, 2021

> Based on the conversation in #54045, it seems that #56513 might resolve this one with the Aluminium release.

Yep, although I believe #58853 is the fix that will ultimately go into Aluminium.

garethgreenaway commented

Following up on this one to see whether anyone has had a chance to upgrade to 3002 and can verify whether that release resolves this issue.

@sagetherage sagetherage modified the milestones: Silicon, Approved Aug 12, 2021
@garethgreenaway garethgreenaway added Phosphorus v3005.0 Release code name and version and removed Silicon v3004.0 Release code name labels Sep 20, 2021

Ch3LL commented Oct 25, 2021

Closing this due to no response. Please open a new issue if PR #56513 does not resolve the problem.

@Ch3LL Ch3LL closed this as completed Oct 25, 2021