
Reactor - No matching sls found for 'orch.minion-onBoot' in env 'base' #47539

Closed
richie-tt opened this issue May 8, 2018 · 35 comments
Labels
Bug broken, incorrect, or confusing behavior fixed-pls-verify fix is linked, bug author to confirm fix P2 Priority 2 Reactor severity-medium 3rd level, incorrect or bad functionality, confusing and lacks a work around ZD The issue is related to a Zendesk customer support ticket.

Comments


richie-tt commented May 8, 2018

Description of Issue/Question

After a deep investigation into why I randomly receive the following message:

2018-05-08 15:05:55,227 [salt.template    :62  ][ERROR   ][18028] Template was specified incorrectly: False

I figured out that it happens only when 2 or more minions hit the reactor at the same time.
It looks like some cache is cleared and then not regenerated in time when the second minion hits the Reactor.

salt/run/20180508150554686064/ret       {
    "_stamp": "2018-05-08T13:05:57.144721", 
    "fun": "runner.state.orchestrate", 
    "fun_args": [
        "orch.minion-onBoot", 
        {
            "pillar": {
                "event_data": {
                    "_stamp": "2018-05-08T13:05:54.546766", 
                    "cmd": "_minion_event", 
                    "data": "Minion web220c-stg-1.inet started at Tue May  8 13:05:53 2018", 
                    "id": "web220c-stg-1.inet", 
                    "pretag": null, 
                    "tag": "salt/minion/web220c-stg-1.inet/start"
                }, 
                "event_id": "web220c-stg-1.inet"
            }
        }
    ], 
    "jid": "20180508150554686064", 
    "return": {
        "data": {
            "salt01-tech.inet_master": [
                "No matching sls found for 'orch.minion-onBoot' in env 'base'"
            ]
        }, 
        "outputter": "highstate", 
        "retcode": 1
    }, 
    "success": false, 
    "user": "Reactor"
}

I have an AWS Auto Scaling Group which deploys VMs to the cloud and does the basic minion setup (updates the master address, sets up grains, and starts the service).
After that, the Salt master runs the highstate. The problem is that the Reactor works perfectly when there is only 1 new minion, but if I deploy 2 or more new machines at the same time, the Reactor only works for the first one that comes in; for the others the Reactor throws an error that the SLS was not found.

To me it looks like some cache is cleared after a Reactor hit and is not regenerated in time.

The "communication" flow looks like this:
minion -> reactor -> orchestrate -> highstate for minion

Setup

Reactor config
reactor:
  - 'salt/minion/*/start':
    - /opt/salt/salt-pillar/reactor/minion-onStart.sls
Reactor state:
log_new_minion:
  local.cmd.run:
    - name: log new minion
    - tgt: 'apps:saltmaster'
    - expr_form: grain
    - arg:
      - 'echo "[{{ data['id'] }}][minion started]  A new Minion has (re)born on $(date). Say hello to him ({{ tag }})" >> /var/log/salt/reactor.log'

hello_state:
  runner.state.orchestrate:
    - mods: orch.minion-onBoot
    - kwarg:
        pillar:
          event_data: {{ data|json }}
          event_id: {{ data['id'] }}
Orchestrate
{%- set message = "A new Minion has (re)born.\nSay *hello* to him :thinking_face:" %}
{% set service = "Reactor-NEW Minion" %}

{%- set data = salt.pillar.get("event_data") %}
{%- set id = salt.pillar.get("event_id") %}
{%- import "system/module_slack.sls" as slack %}

{% do slack.sendSlackMessage(service, None, id, message, 2) %}


boot_check_version:
  salt.state:
    - tgt: {{ data["id"] }}
    - sls: prepare.version

Steps to Reproduce Issue

Create 2-3 new VMs

Versions Report

Salt Version:
           Salt: 2017.7.5
 
Dependency Versions:
           cffi: 1.11.5
       cherrypy: unknown
       dateutil: 2.7.2
      docker-py: Not Installed
          gitdb: Not Installed
      gitpython: Not Installed
          ioflo: Not Installed
         Jinja2: 2.8
        libgit2: Not Installed
        libnacl: Not Installed
       M2Crypto: Not Installed
           Mako: Not Installed
   msgpack-pure: Not Installed
 msgpack-python: 0.5.1
   mysql-python: Not Installed
      pycparser: 2.18
       pycrypto: 2.6.1
   pycryptodome: Not Installed
         pygit2: Not Installed
         Python: 2.7.5 (default, Aug  4 2017, 00:39:18)
   python-gnupg: 0.3.8
         PyYAML: 3.11
          PyZMQ: 15.3.0
           RAET: Not Installed
          smmap: Not Installed
        timelib: Not Installed
        Tornado: 4.2.1
            ZMQ: 4.1.4
 
System Versions:
           dist: centos 7.4.1708 Core
         locale: UTF-8
        machine: x86_64
        release: 3.10.0-693.21.1.el7.x86_64
         system: Linux

Ch3LL commented May 8, 2018

I'm having a hard time replicating this. Does this occur every time you create 2 minions? If you just restart two minions at the same time, do you see the same behavior? That should trigger the reactor as well.

@Ch3LL Ch3LL added cannot-reproduce cannot be replicated with info/context provided info-needed waiting for more info labels May 8, 2018
@Ch3LL Ch3LL added this to the Blocked milestone May 8, 2018

richie-tt commented May 9, 2018

Hi @Ch3LL, thank you for your involvement in this case.
I believe it is hard to reproduce because after a manual restart of the minions everything works fine, I think because the time between reactor hits is longer.

To be honest, I was surprised that this issue came up only now. Then I realized that a week ago I upgraded Salt to version 2018.3.0 (there was a reported problem with token generation for Vault), then downgraded to 2017.7.5, and the problem showed up. After moving back to the initial version, 2017.7.4, everything worked normally. In my opinion the bug was introduced in version 2017.7.5.

I will try to prepare some easy steps to reproduce it, because I noticed the same problem when there were 10 hits to the API, which also calls the orchestration.


Ch3LL commented May 9, 2018

Yeah, if you can include an easy use case that would be awesome. I tried replicating by using the states you provided and starting up 2 minions at the same time to try to hit the reactor at the exact same time.

Also, in the meantime while you are preparing the use case: @saltstack/team-core, any ideas on why this might be occurring?


Ch3LL commented May 11, 2018

ping @DmitryKuzmenko can you take a look here?

@johje349

I can confirm that the problem was introduced in 2017.7.5. Works fine in 2017.7.4.

@calc84maniac

I can confirm getting the same error fairly consistently on 2018.3.0 when running an orchestrate in reaction to cloud created events while deploying a cloud in parallel. The error seems to go away if I configure reactor_worker_threads: 1 though that's certainly not ideal. Seems like there could be some multi-threading issue with either the reactor's cached RunnerClient or the orchestration runner specifically.


calc84maniac commented May 21, 2018

I did a bit of digging into how this all works, and the problem might be that the global multiprocessing option is propagated when creating the RunnerClient shared between worker threads, meaning that by default the LazyLoader created underneath will not be thread-safe, despite being used in a threaded environment. Perhaps multiprocessing should be forced to false in the reactor's options?

Edit: By this I mean in the ReactWrap class, which is specifically using threads.

Changing this line to the following appears to alleviate my issues:

self.opts = opts.copy()
self.opts['multiprocessing'] = False

@anitakrueger
Contributor

@calc84maniac This sounds like the issue here. In my case, I am trying to create 2 cloud servers in quick succession, where both salt/cloud/*/created events kick off orchestrations via the reactor. I have tried to set 'multiprocessing' to False like you suggested, but my reactor still dies.

Our setup relies heavily on the reactor for post-creation processing of cloud instances, which now fails because the reactor throws the error above. We also have other reactor events that are now not being processed :( The only thing that fixes the situation is a salt-master restart.

Where would I set reactor_worker_threads: 1? In the master config? Also what happens in that case when multiple events arrive at the same time?


hupebln commented May 31, 2018

I am in the same boat and have disabled multiprocessing for now, but I cannot really say yet whether the problem comes back.
@calc84maniac, have you changed it in the code itself? Why not just set it in the config? Doesn't that work?
This issue is a real blocker for me at the moment, as almost all of my processes use reactors in some way.

@calc84maniac

Sorry, after further testing it looks like that didn't fully fix the problem after all. But I still feel like it's some kind of race condition with the loader, regardless.

reactor_worker_threads: 1 in the master config never gave me problems, but in that case events are queued and executed in sequence, which may not be acceptable in terms of response time.
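For reference, a minimal sketch of what that looks like in the master config (only the reactor option itself comes from this thread; restart the master after changing it):

# /etc/salt/master (excerpt)
# Serializes reactor jobs: more stable under event bursts, but slower to react.
reactor_worker_threads: 1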

@richie-tt
Author

Hi @Ch3LL

I am sorry for the late answer; for the last few weeks I have been very busy implementing a new project.

First way to reproduce:

I prepared some steps to reproduce this issue, but I am not sure if this is the best approach, because it may be related to another issue/bug. I mentioned before that I had this problem with hits to the API (which also uses the reactor), but I solved it with a workaround. Maybe this is useless; someone needs to verify whether this approach is valid.

Config of reactor

rest_cherrypy:
  port: 4080
  host: 0.0.0.0
  webhook_url: /hook
  webhook_disable_auth: True
  ssl_crt: /etc/pki/tls/certs/salt01-tech.inet.crt
  ssl_key: /etc/pki/tls/private/salt01-tech.inet.key
  log_access_file: /var/log/salt/access_api.log
  log_error_file: /var/log/salt/error_api.log

 reactor:
 ### bug test
  - 'salt/netapi/hook/bugtest':
    - /opt/salt/salt-pillar/reactor/bugtest.sls

Reactor

bugtest_event:
  runner.state.orchestrate:
    - args:
      - mods: orch.bugtest
      - pillar:
          event_data: {{ data|json }}

orchestration

{%- set fqdn = salt.grains.get("fqdn") %}
{%- set hook = salt.pillar.get("event_data",False) %}

{%- import "system/module_slack.sls" as slack %}

{%- set service = "Salt-Orchestration - Percona" %}

{%- if hook != False %}
  {%- set RemoteAddr = hook["headers"]["Remote-Addr"] %}

cmd.run:
  salt.function:
    - tgt: 'salt01-tech.inet'
    - arg:
      - echo test

{%- else %}
Echo_percona:
  cmd.run:
    - name: echo "No Data"

{% endif %}

I ran the following command on 2 VMs:

for i in {1..100}; do curl -sSk https://salt01-tech.inet:4080/hook/bugtest -H 'Accept: application/x-yaml' -d 'cluster: sql-prd-shared1'; done

Debug log

2018-06-10 18:59:08,358 [salt.utils.lazy  :97  ][DEBUG   ][24046] LazyLoaded jinja.render
2018-06-10 18:59:08,358 [salt.utils.event :728 ][DEBUG   ][24046] Sending event: tag = salt/run/20180610185908347361/new; data = {'fun': 'runner.state.orchestrate', 'fun_args': [{'pillar': OrderedDict([('event_data', OrderedDict([('_stamp', '2018-06-10T16:54:34.530484'), ('body', ''), ('headers', OrderedDict([('Accept', 'application/x-yaml'), ('Content-Length', '24'), ('Content-Type', 'application/x-www-form-urlencoded'), ('Host', 'salt01-tech.inet:4080'), ('Remote-Addr', '10.201.34.125'), ('User-Agent', 'curl/7.29.0')])), ('post', OrderedDict([('cluster: sql-prd-shared1', '')]))]))]), 'mods': 'orch.bugtest'}], 'jid': '20180610185908347361', 'user': 'Reactor', '_stamp': '2018-06-10T16:59:08.357805'}
2018-06-10 18:59:08,362 [salt.loaded.int.rawmodule.state:2036][DEBUG   ][25798] Remaining event matches: -975
2018-06-10 18:59:08,361 [salt.utils.lazy  :97  ][DEBUG   ][24046] LazyLoaded yaml.render
2018-06-10 18:59:08,361 [salt.config      :1954][DEBUG   ][24046] Reading configuration from /etc/salt/master
2018-06-10 18:59:08,364 [salt.fileclient  :1072][DEBUG   ][24046] Could not find file 'salt://orch/bugtest.sls' in saltenv 'base'
2018-06-10 18:59:08,369 [salt.fileclient  :1072][DEBUG   ][24046] Could not find file 'salt://orch/bugtest/init.sls' in saltenv 'base'
2018-06-10 18:59:08,369 [salt.template    :48  ][DEBUG   ][24046] compile template: False
2018-06-10 18:59:08,370 [salt.template    :62  ][ERROR   ][24046] Template was specified incorrectly: False
2018-06-10 18:59:08,377 [salt.utils.lazy  :97  ][DEBUG   ][24046] LazyLoaded config.get
2018-06-10 18:59:08,378 [salt.loaded.int.render.yaml:76  ][DEBUG   ][24046] Results of YAML rendering: 
OrderedDict([('include', ['directory.consul', 'directory.consul_cache']), ('directory', OrderedDict([('UID', OrderedDict([('saltmaster', OrderedDict([('uid', 'root')]))]))]))])
2018-06-10 18:59:08,378 [salt.template    :26  ][PROFILE ][24046] Time (in seconds) to render '/opt/salt/salt-pillar/pillar/directory/saltmaster.sls' using 'yaml' renderer: 0.0239288806915
2018-06-10 18:59:08,380 [salt.template    :48  ][DEBUG   ][24046] compile template: /opt/salt/salt-pillar/pillar/directory/consul.sls
2018-06-10 18:59:08,379 [salt.utils.event :728 ][DEBUG   ][24046] Sending event: tag = salt/run/20180610185848992206/ret; data = {'fun_args': [{'pillar': OrderedDict([('event_data', OrderedDict([('_stamp', '2018-06-10T16:54:33.650618'), ('body', ''), ('headers', OrderedDict([('Accept', 'application/x-yaml'), ('Content-Length', '24'), ('Content-Type', 'application/x-www-form-urlencoded'), ('Host', 'salt01-tech.inet:4080'), ('Remote-Addr', '10.201.34.125'), ('User-Agent', 'curl/7.29.0')])), ('post', OrderedDict([('cluster: sql-prd-shared1', '')]))]))]), 'mods': 'orch.bugtest'}], 'jid': '20180610185848992206', 'return': {'outputter': 'highstate', 'data': {'salt01-tech.inet_master': ["No matching sls found for 'orch.bugtest' in env 'base'"]}, 'retcode': 1}, 'success': False, '_stamp': '2018-06-10T16:59:08.379624', 'user': 'Reactor', 'fun': 'runner.state.orchestrate'}

state.event

[DEBUG   ] Remaining event matches: -975
salt/run/20180610185848992206/ret       {
    "_stamp": "2018-06-10T16:59:08.379624", 
    "fun": "runner.state.orchestrate", 
    "fun_args": [
        {
            "mods": "orch.bugtest", 
            "pillar": {
                "event_data": {
                    "_stamp": "2018-06-10T16:54:33.650618", 
                    "body": "", 
                    "headers": {
                        "Accept": "application/x-yaml", 
                        "Content-Length": "24", 
                        "Content-Type": "application/x-www-form-urlencoded", 
                        "Host": "salt01-tech.omarsys.inet:4080", 
                        "Remote-Addr": "10.201.34.125", 
                        "User-Agent": "curl/7.29.0"
                    }, 
                    "post": {
                        "cluster: sql-prd-shared1": ""
                    }
                }
            }
        }
    ], 
    "jid": "20180610185848992206", 
    "return": {
        "data": {
            "salt01-tech.omarsys.inet_master": [
                "No matching sls found for 'orch.bugtest' in env 'base'"
            ]
        }, 
        "outputter": "highstate", 
        "retcode": 1
    }, 
    "success": false, 
    "user": "Reactor"
}

Please do not judge my code here; I just wanted to create something that reproduces the issue :)

What is important:

  • This issue was reproduced on version 2017.7.4; for now I don't have a testing environment to install 2017.7.5.
  • When I remove {%- import "system/module_slack.sls" as slack %}, the problem doesn't appear any more. I noticed that it was usually a problem with hits to the API.

Second way to reproduce:

My idea is that you could put a task in the crontab on 2 VMs that hits the Reactor at the same time (if NTP is synchronized), let's say every 2 minutes; this should work, but I didn't test it. I will try it in the next few days and let you know the results. A sketch of such a crontab entry is below.
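Purely as an illustration (untested, reusing the webhook URL and payload from the curl loop above), the entry on each VM could look like:

# /etc/crontab on both VMs -- both fire on the same minute boundary
*/2 * * * * root curl -sSk https://salt01-tech.inet:4080/hook/bugtest -H 'Accept: application/x-yaml' -d 'cluster: sql-prd-shared1'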


Ch3LL commented Jun 11, 2018

Thanks for the additional details; that will help us dive in.

ping @DmitryKuzmenko just a friendly reminder to add this to your bug list to fix :)

@DmitryKuzmenko
Contributor

@Ch3LL thank you. It's already in my icebox, just not the top priority at this moment, sorry. I'll work on it right after a few critical tasks.


Ch3LL commented Jun 15, 2018

No worries, just wanted to make sure it was on your radar. Thanks :)

@rickh563

ZD-2593

@rickh563 rickh563 added the ZD The issue is related to a Zendesk customer support ticket. label Jun 28, 2018
cloudlab-autobuild pushed a commit to salt-formulas/salt-formula-salt that referenced this issue Jul 25, 2018
There is issue [1] with reactor in salt 2017, one
of the workarounds is to set reactor workers to 1

[1] saltstack/salt#47539

Change-Id: I47d76cc1dc5d0afe6d8b215e2d32cdbab3ac1a8c
Related-Prod: https://mirantis.jira.com/browse/PROD-21463
@anitakrueger
Contributor

Is there any update on this issue please? Our infrastructure heavily relies on the reactor and I have to restart the salt-master process multiple times a day to get it working again :(


Ch3LL commented Aug 6, 2018

There is no update currently, but it is assigned to an engineer. I don't believe @DmitryKuzmenko is working on it right now, as there are higher-priority issues.

@anitakrueger
Contributor

Is there any more information needed?


Ch3LL commented Aug 6, 2018

Nope, you're right, I need to re-label this. Thanks :)

@Ch3LL Ch3LL added Bug broken, incorrect, or confusing behavior severity-medium 3rd level, incorrect or bad functionality, confusing and lacks a work around labels Aug 6, 2018
@Ch3LL Ch3LL removed the info-needed waiting for more info label Aug 6, 2018
@Ch3LL Ch3LL modified the milestones: Blocked, Approved Aug 6, 2018
cloudlab-autobuild pushed a commit to salt-formulas/salt-formula-reclass that referenced this issue Aug 7, 2018
Currently salt doesn't allow to get confirmation on minion upon
successfull reactor execution on event. However there can be issues with
reactor in salt 2017.7 [1] or reactor register state can fail if pillar
failed to render, so node registration confirmation maybe needed. In order
to enable this functionality add node_confirm_registration parameter to
event data with value true.

[1] saltstack/salt#47539
Change-Id: I1abc4fdf172e018dcdb48abcf61532fb51ad3660
Related-Prod: https://mirantis.jira.com/browse/PROD-21463
@anitakrueger
Contributor

@Ch3LL @DmitryKuzmenko, is there any chance someone could look at this, please?
I've tried setting reactor_worker_threads: 1 in /etc/salt/master and it does seem to "fix" the problem, but then things become so slow that they actually time out for me. For example, I create 3 new minions in the cloud and they all produce events on startup, which then have to be handled sequentially.
@johje349 said this might have been introduced after 2017.7.4? Any idea where to look?

@anitakrueger
Contributor

Unfortunately after creating many more minions in the cloud for my salt master, I have to report that setting reactor_worker_threads: 1 has not fixed the problem. The reactor still crashes when 2 or more events come in at the same time.

@rallytime
Contributor

Thank you very much for the updates @anitakrueger. This is definitely still on our radar and @DmitryKuzmenko will be working on this as soon as he can.

@cachedout
Contributor

@anitakrueger I believe this fix may help: https://github.com/saltstack/salt/pull/46641/files

Are you willing to test it?

@gtmanfred
Contributor

This change was made between 2017.7.2 and 2017.7.3, #43817

Are any of yall experiencing this issue using multiple environments for your state or orchestrate files?

Thanks,
Daniel

@gtmanfred
Contributor

I have replicated this issue.

@gtmanfred gtmanfred removed the cannot-reproduce cannot be replicated with info/context provided label Oct 2, 2018
@anitakrueger
Contributor

@cachedout I've looked at https://github.com/saltstack/salt/pull/46641/files, but we are running 2018.3.2 and this seems to be already merged in there.

@gtmanfred We only use the base environment. I could never work multiple environments out properly for our use case.
I'm glad you could reproduce it.

@gtmanfred
Contributor

So, here is what I have found.

What is happening is that the master worker threads are getting overloaded. These are the threads the master uses to respond to fileserver requests from the minions and from anything else that uses the remote fileserver. The "No matching sls found for" message appears when fileclient.cache_file returns False inside of this get_state function. Now, what is happening with @1Ricardo is that the import inside of Jinja causes an extra call to the fileserver, which makes the master work harder, which is why the problem appears to go away when you remove that import. But if you double the number of events you put on the event bus, even with that import removed, the problem comes back.

So there is really only one workaround: increase worker_threads so that the salt master can handle all of the fileserver requests that you expect to come in at one time.
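
For illustration, that workaround is just a master config change (the value below is only an example; size it to the number of simultaneous fileserver requests you expect, then restart the master):

# /etc/salt/master (excerpt)
worker_threads: 24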

As for solutions that we can do, we are talking about two.

  1. Implement an exponential backoff for requests to the fileserver: specify a number of times a request can be made, and if it is still failing after, say, 5 requests, report that the file is not there. But this has a couple of caveats. One: it will make it much more difficult to troubleshoot performance problems, because there could be random sleeps. The bigger one: the sleep would block one of the reactor worker threads.

  2. I believe this is the better solution: since state.orchestrate should only ever be run on the master, we can switch the state renderer over to using the local fileclient. This will allow the orchestration runner, which is usually run by the reactor, to react more quickly to events and not have to depend on worker threads to distribute jobs.

@gtmanfred
Contributor

If I can figure out why this has to call get_state twice, and get it to only call it once without making things worse, using the local fileclient for state.orchestrate called from the master should be easy to implement.

diff --git a/salt/state.py b/salt/state.py
index fcacaeb935..04bbe4abe8 100644
--- a/salt/state.py
+++ b/salt/state.py
@@ -3443,6 +3443,8 @@ class BaseHighState(object):
         errors = []
         if not local:
             state_data = self.client.get_state(sls, saltenv)
+            if state_data.get('dest', False) is False:
+                state_data = self.client.get_state(sls, saltenv)
             fn_ = state_data.get('dest', False)
         else:
             fn_ = sls
@@ -4056,8 +4058,8 @@ class HighState(BaseHighState):
             mocked=False,
             loader='states',
             initial_pillar=None):
-        self.opts = opts
-        self.client = salt.fileclient.get_file_client(self.opts)
+        self.opts = copy.deepcopy(opts)
+        self.opts['file_client'] = 'local' if self.opts['__role'] == 'master' else self.opts['file_client']
         BaseHighState.__init__(self, opts)
         self.state = State(self.opts,
                            pillar_override,


xadina commented Oct 3, 2018

I have been hit by this too. I've put in place a simple check to notify me when this happens, until the fix is in place; I am sharing it here in case someone wants to make use of it:

reactor.conf:

  - 'monitoring/ping':
    - salt://reactor/pong.sls

salt/reactor/pong.sls:

pong_back:
  local.file.write:
    - name: pong back
    - tgt: {{ data['id'] }}
    - args:
      - path: {{ data['data']['file'] }}
      - args: {{ data['data']['text'] }}

check script:

#!/bin/bash
# Reactor health check: send a test event and verify that the reaction in
# reactor/pong.sls wrote the random string back to a temp file on this minion.

string=$(cat /dev/urandom | tr -cd 'a-z0-9' | head -c 10)
file=$(mktemp)

sendevent ()
{
# Fire the monitoring/ping event that the reactor listens for.
output=$(sudo salt-call event.send 'monitoring/ping' text=${string} file=${file} --retcode-passthrough 2>&1)
result=$?
if [ ${result} != 0 ]
  then
    status=2
    echo "Failure occured whilst sending event: [${output}]"
  else
    sleep 2
    contents=$(cat ${file} 2>/dev/null)
    if [ "${string}" == "${contents}" ]
      then
        status=0
        echo "OK: the Reactor is working"
      else
        status=2
        echo "Reactor Failure: check salt master!"
    fi
fi
}

sendevent
exit ${status}

@gtmanfred
Contributor

OK, so the other part of the problem is that, with all these calls happening at exactly the same time, they are all trying to open file handles to cache the same file to the same location. So we need to add some kind of file lock to make sure multiple reactor threads do not write to the cache at the same time. That should solve this issue completely.
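
Not the actual patch, but a minimal sketch of the kind of locking meant here (the helper name and the sidecar lock file are made up for illustration; real code would live in Salt's fileclient):

import contextlib
import fcntl
import os

@contextlib.contextmanager
def locked_cache_write(dest):
    # Illustrative helper: hold an exclusive lock on a sidecar lock file so
    # only one reactor worker writes a given cache destination at a time.
    fd = os.open(dest + '.lk', os.O_CREAT | os.O_RDWR)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX)   # blocks until the lock is free
        yield
    finally:
        fcntl.flock(fd, fcntl.LOCK_UN)
        os.close(fd)

# Usage sketch: serialize concurrent writers of the same cached file.
# with locked_cache_write(dest):
#     with open(dest, 'wb') as fp:
#         fp.write(contents)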

gtmanfred added a commit to gtmanfred/salt that referenced this issue Oct 4, 2018
If the __role is master, then we should already have all the configs
needed to get the files from the local machine, and from gitfs instead
of having to request them from the fileserver on the master.

Fixes the first part of saltstack#47539
@gtmanfred gtmanfred self-assigned this Oct 9, 2018
@gtmanfred
Contributor

OK, we have figured out what is happening and, hopefully, a fix for it, but it is going to take a bit to finish. @terminalmage is working on a fix.

To render pillar files, we use the same logic that is used to render files from file_roots: we munge the data in opts so that file_roots becomes pillar_roots and run it through the same code; those options are overwritten here. As you can see above that line, we do a deep copy of the opts dictionary, so it should not affect the opts used when pulling states. However, even with this, that assignment still sometimes appears to affect the rendering of states for the reactor.
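
Purely to illustrate the kind of race being described (this is not Salt's actual code; the sleeps and paths are invented):

import threading
import time

# Shared opts dict, as seen by both the pillar renderer and the state renderer.
opts = {'file_roots': {'base': ['/srv/salt']},
        'pillar_roots': {'base': ['/srv/pillar']}}

def render_pillar():
    saved = opts['file_roots']
    opts['file_roots'] = opts['pillar_roots']   # munge file_roots to pillar_roots
    time.sleep(0.01)                            # pretend to render a pillar SLS
    opts['file_roots'] = saved

def find_state():
    time.sleep(0.005)
    # If this runs while render_pillar() holds the munged value, the orchestration
    # SLS is looked up under /srv/pillar and "No matching sls found" is the result.
    print('looking up orch/bugtest.sls under', opts['file_roots']['base'])

t1 = threading.Thread(target=render_pillar)
t2 = threading.Thread(target=find_state)
t1.start(); t2.start(); t1.join(); t2.join()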

Here is what we found yesterday.

In one thread, when trying to grab the orchestration file (and sometimes the reactor state file), we see it trying to pull from the pillar roots.

2018-10-09 16:23:06,911 {139939631068928}[salt.loaded.int.fileserver.roots][CRITICAL] FILEPATH: /srv/salt/test.sls
2018-10-09 16:23:06,920 {139939631068928}[salt.loaded.int.fileserver.roots][CRITICAL] THERE7
2018-10-09 16:23:06,933 {139939631068928}[salt.loaded.int.fileserver.roots][CRITICAL] FILEPATH: /srv/salt/orch/bugtest.sls
2018-10-09 16:23:06,941 {139939631068928}[salt.loaded.int.fileserver.roots][CRITICAL] THERE7
2018-10-09 16:23:13,159 {139939631068928}[salt.fileserver  ][CRITICAL] LOAD: {u'path': u'orch/bugtest.sls', u'saltenv': u'base', u'cmd': u'_file_hash'}
2018-10-09 16:23:13,165 {139939631068928}[salt.loaded.int.fileserver.roots][CRITICAL] FILEPATH: /srv/pillar/orch/bugtest.sls
2018-10-09 16:23:13,167 {139939631068928}[salt.loaded.int.fileserver.roots][CRITICAL] FILEPATH: /srv/spm/pillar/orch/bugtest.sls
2018-10-09 16:23:13,168 {139939631068928}[salt.loaded.int.fileserver.roots][CRITICAL] THERE8
2018-10-09 16:23:13,170 {139939631068928}[salt.fileserver  ][CRITICAL] HERE2: {u'path': u'', u'rel': u''}
2018-10-09 16:23:13,171 {139939631068928}[salt.loaded.int.fileserver.roots][CRITICAL] FILEPATH: /srv/pillar/orch/bugtest.sls
2018-10-09 16:23:13,172 {139939631068928}[salt.loaded.int.fileserver.roots][CRITICAL] FILEPATH: /srv/spm/pillar/orch/bugtest.sls
2018-10-09 16:23:13,173 {139939631068928}[salt.loaded.int.fileserver.roots][CRITICAL] THERE8
2018-10-09 16:23:13,176 {139939631068928}[salt.fileserver  ][CRITICAL] LOAD: {u'path': u'orch/bugtest/init.sls', u'saltenv': u'base', u'cmd': u'_file_hash'}
2018-10-09 16:23:13,177 {139939631068928}[salt.loaded.int.fileserver.roots][CRITICAL] FILEPATH: /srv/pillar/orch/bugtest/init.sls
2018-10-09 16:23:13,179 {139939631068928}[salt.loaded.int.fileserver.roots][CRITICAL] FILEPATH: /srv/spm/pillar/orch/bugtest/init.sls
2018-10-09 16:23:13,181 {139939631068928}[salt.loaded.int.fileserver.roots][CRITICAL] THERE8
2018-10-09 16:23:13,183 {139939631068928}[salt.fileserver  ][CRITICAL] HERE2: {u'path': u'', u'rel': u''}
2018-10-09 16:23:13,184 {139939631068928}[salt.loaded.int.fileserver.roots][CRITICAL] FILEPATH: /srv/pillar/orch/bugtest/init.sls
2018-10-09 16:23:13,186 {139939631068928}[salt.loaded.int.fileserver.roots][CRITICAL] FILEPATH: /srv/spm/pillar/orch/bugtest/init.sls
2018-10-09 16:23:13,189 {139939631068928}[salt.loaded.int.fileserver.roots][CRITICAL] THERE8
2018-10-09 16:23:13,190 {139939631068928}[salt.state       ][CRITICAL] GET_STATE: {}

So, our plan is to break this out so that it does not need to overwrite file_roots to render pillar data. That is something Erik started earlier this year but hasn't finished yet. We hope to get that code into 2018.3 for the 2018.3.4 release.

Now, as a workaround, there are two ways to get around this bug:

  1. Copy all the reactor and orchestration states that you use from your master to /srv/pillar, or wherever your pillar roots are, so that they can be found there if this bug occurs.

  2. Move to using an external fileserver for these files. This bug only affects the roots fileserver, so if you put your reactor and orchestration files in gitfs or another fileserver backend, they won't hit this bug (see the config sketch below).

If you put your reactor into gitfs too, you can reference it with salt://

 reactor:
 ### bug test
  - 'salt/netapi/hook/bugtest':
    - salt://reactor/bugtest.sls
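
And, roughly, the gitfs side in the master config (the repository URL below is only a placeholder):

# /etc/salt/master (excerpt) -- serve reactor/orchestration files from gitfs
fileserver_backend:
  - gitfs
  - roots
gitfs_remotes:
  - https://github.com/example/salt-reactor-states.git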

Thanks for your patience.
Daniel

@anitakrueger
Contributor

Daniel, thank you SOOOO much for figuring this out! This has really been bugging us as we use the reactor heavily. I will give workaround #1 a go and report back.

@gtmanfred
Contributor

👍 happy to help.

Erik has a commit that seems to fix the issue, and it should be included in 2018.3.4.

@terminalmage
Contributor

I've opened #49987, see that PR for details.

@rallytime rallytime added the fixed-pls-verify fix is linked, bug author to confirm fix label Oct 10, 2018
@gtmanfred
Contributor

The above fix has been merged and it will be in the 2018.3.4 release.

#49987

Thank you for your patience with helping us troubleshoot this.
Daniel
