
Reactor - No matching sls found for 'orch.minion-onBoot' in env 'base' #47539

Closed
richie-tt opened this issue May 8, 2018 · 35 comments
Labels
Bug broken, incorrect, or confusing behavior fixed-pls-verify fix is linked, bug author to confirm fix P2 Priority 2 Reactor severity-medium 3rd level, incorrect or bad functionality, confusing and lacks a work around ZD The issue is related to a Zendesk customer support ticket.

Comments


richie-tt commented May 8, 2018

Description of Issue/Question

After a deep investigation into why I randomly receive the following message:

2018-05-08 15:05:55,227 [salt.template    :62  ][ERROR   ][18028] Template was specified incorrectly: False

I figured out that it happens only when 2 or more minions hit the reactor at the same time.
It looks like some cache is cleared and then not regenerated in time when the second minion hits the Reactor.

salt/run/20180508150554686064/ret       {
    "_stamp": "2018-05-08T13:05:57.144721", 
    "fun": "runner.state.orchestrate", 
    "fun_args": [
        "orch.minion-onBoot", 
        {
            "pillar": {
                "event_data": {
                    "_stamp": "2018-05-08T13:05:54.546766", 
                    "cmd": "_minion_event", 
                    "data": "Minion web220c-stg-1.inet started at Tue May  8 13:05:53 2018", 
                    "id": "web220c-stg-1.inet", 
                    "pretag": null, 
                    "tag": "salt/minion/web220c-stg-1.inet/start"
                }, 
                "event_id": "web220c-stg-1.inet"
            }
        }
    ], 
    "jid": "20180508150554686064", 
    "return": {
        "data": {
            "salt01-tech.inet_master": [
                "No matching sls found for 'orch.minion-onBoot' in env 'base'"
            ]
        }, 
        "outputter": "highstate", 
        "retcode": 1
    }, 
    "success": false, 
    "user": "Reactor"
}

I have an AWS Auto Scaling Group which deploys VMs to the cloud and does the basic minion setup (updates the master address, sets up grains, and starts the service).
After that, the Salt master runs the highstate. The problem is that the Reactor works perfectly when there is only 1 new minion, but if I deploy 2 or more new machines at the same time, the Reactor only works for the first one that comes in; for the others the Reactor throws an error that the SLS was not found.

To me it looks like some cache is cleared after a Reactor hit and is not regenerated in time.

The "communication" flow looks like this:
minion -> reactor -> orchestrate -> highstate for minion

Setup

Reactor config
reactor:
  - 'salt/minion/*/start':
    - /opt/salt/salt-pillar/reactor/minion-onStart.sls
Reactor state:
log_new_minion:
  local.cmd.run:
    - name: log new minion
    - tgt: 'apps:saltmaster'
    - expr_form: grain
    - arg:
      - 'echo "[{{ data['id'] }}][minion started]  A new Minion has (re)born on $(date). Say hello to him ({{ tag }})" >> /var/log/salt/reactor.log'

hello_state:
  runner.state.orchestrate:
    - mods: orch.minion-onBoot
    - kwarg:
        pillar:
          event_data: {{ data|json }}
          event_id: {{ data['id'] }}
Orchestrate
{%- set message = "A new Minion has (re)born.\nSay *hello* to him :thinking_face:" %}
{% set service = "Reactor-NEW Minion" %}

{%- set data = salt.pillar.get("event_data") %}
{%- set id = salt.pillar.get("event_id") %}
{%- import "system/module_slack.sls" as slack %}

{% do slack.sendSlackMessage(service, None, id, message, 2) %}


boot_check_version:
  salt.state:
    - tgt: {{ data["id"] }}
    - sls: prepare.version

Steps to Reproduce Issue

Create 2-3 new VMs

Versions Report

Salt Version:
           Salt: 2017.7.5
 
Dependency Versions:
           cffi: 1.11.5
       cherrypy: unknown
       dateutil: 2.7.2
      docker-py: Not Installed
          gitdb: Not Installed
      gitpython: Not Installed
          ioflo: Not Installed
         Jinja2: 2.8
        libgit2: Not Installed
        libnacl: Not Installed
       M2Crypto: Not Installed
           Mako: Not Installed
   msgpack-pure: Not Installed
 msgpack-python: 0.5.1
   mysql-python: Not Installed
      pycparser: 2.18
       pycrypto: 2.6.1
   pycryptodome: Not Installed
         pygit2: Not Installed
         Python: 2.7.5 (default, Aug  4 2017, 00:39:18)
   python-gnupg: 0.3.8
         PyYAML: 3.11
          PyZMQ: 15.3.0
           RAET: Not Installed
          smmap: Not Installed
        timelib: Not Installed
        Tornado: 4.2.1
            ZMQ: 4.1.4
 
System Versions:
           dist: centos 7.4.1708 Core
         locale: UTF-8
        machine: x86_64
        release: 3.10.0-693.21.1.el7.x86_64
         system: Linux

Ch3LL commented May 8, 2018

I'm having a hard time replicating this. Does this occur every time you create 2 minions? If you just restart two minions at the same time, do you see the same behavior? That should trigger the reactor as well.

@Ch3LL Ch3LL added cannot-reproduce cannot be replicated with info/context provided info-needed waiting for more info labels May 8, 2018
@Ch3LL Ch3LL added this to the Blocked milestone May 8, 2018

richie-tt commented May 9, 2018

Hi @Ch3LL, thank you for your involvement in this case.
I believe it is hard to reproduce because after a manual restart of the minions everything works fine, I think because the time between reactor hits is longer.

To be honest, I was surprised that this issue came up only now. Then I realized that a week ago I upgraded Salt to version 2018.3.0 (there was a reported problem with token generation for Vault), then downgraded to 2017.7.5, and the problem showed up. After moving back to the initial version, 2017.7.4, everything worked normally. In my opinion the bug was introduced in version 2017.7.5.

I will try to prepare some easy steps to reproduce it, because I noticed the same problem when there were 10 hits to the API, which also calls the orchestration.


Ch3LL commented May 9, 2018

Yeah, if you can include an easy use case that would be awesome. I tried replicating by using the states you provided and starting up 2 minions at the same time to try to hit the reactor at the exact same time.

Also, in the meantime while you are preparing the use case: @saltstack/team-core, any ideas on why this might be occurring?


Ch3LL commented May 11, 2018

ping @DmitryKuzmenko can you take a look here?

@johje349

I can confirm that the problem was introduced in 2017.7.5. Works fine in 2017.7.4.

@calc84maniac

I can confirm getting the same error fairly consistently on 2018.3.0 when running an orchestrate in reaction to cloud created events while deploying a cloud in parallel. The error seems to go away if I configure reactor_worker_threads: 1 though that's certainly not ideal. Seems like there could be some multi-threading issue with either the reactor's cached RunnerClient or the orchestration runner specifically.


calc84maniac commented May 21, 2018

I did a bit of digging into how this all works, and the problem might be that the global multiprocessing option is propagated when creating the RunnerClient shared between worker threads, meaning that by default the LazyLoader created underneath will not be thread-safe, despite being used in a threaded environment. Perhaps multiprocessing should be forced to false in the reactor's options?

Edit: By this I mean in the ReactWrap class, which is specifically using threads.

Changing this line to the following appears to alleviate my issues:

self.opts = opts.copy()
self.opts['multiprocessing'] = False

@anitakrueger
Contributor

@calc84maniac This sounds like the issue here. In my case, I am trying to create 2 cloud servers in quick succession, where both salt/cloud/*/created events kick off orchestrations via the reactor. I have tried to set 'multiprocessing' to False like you suggested, but my reactor still dies.

Our setup relies heavily on the reactor for post-creation processing of cloud instances, which now fails because the reactor throws the error above. We also have other reactor events that are now not being processed :( The only thing that fixes the situation is a salt-master restart.

Where would I set reactor_worker_threads: 1? In the master config? Also what happens in that case when multiple events arrive at the same time?


hupebln commented May 31, 2018

I am in the same boat and have disabled multiprocessing for now, but I cannot really say yet whether the problem comes back.
@calc84maniac, have you changed it in the code itself? Why not just set it in the config? Doesn't that work?
This issue is a real blocker for me at the moment, as almost all of my processes use reactors in some way.

@calc84maniac

Sorry, after further testing it looks like that didn't fully fix the problem after all. But I still feel like it's some kind of race condition with the loader, regardless.

reactor_worker_threads: 1 in the master config never gave me problems, but in that case events are queued and executed in sequence, which may not be acceptable in terms of response time.
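For reference, a minimal sketch of what that looks like in the master config (only the reactor option itself comes from this thread; restart the master after changing it):

# /etc/salt/master (excerpt)
# Serializes reactor jobs: more stable under event bursts, but slower to react.
reactor_worker_threads: 1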

@richie-tt
Author

Hi @Ch3LL

I am sorry for the late answer; for the last few weeks I have been very busy implementing a new project.

First way to reproduce:

I prepared some steps to reproduce this issue, but I am not sure if this is the best approach, because it may be related to another issue/bug. I mentioned before that I had this problem with hits to the API (which also uses the reactor), but I solved it with a workaround. Maybe this is useless; someone needs to verify whether this approach is valid.

Config of reactor

rest_cherrypy:
  port: 4080
  host: 0.0.0.0
  webhook_url: /hook
  webhook_disable_auth: True
  ssl_crt: /etc/pki/tls/certs/salt01-tech.inet.crt
  ssl_key: /etc/pki/tls/private/salt01-tech.inet.key
  log_access_file: /var/log/salt/access_api.log
  log_error_file: /var/log/salt/error_api.log

 reactor:
 ### bug test
  - 'salt/netapi/hook/bugtest':
    - /opt/salt/salt-pillar/reactor/bugtest.sls

Reactor

bugtest_event:
  runner.state.orchestrate:
    - args:
      - mods: orch.bugtest
      - pillar:
          event_data: {{ data|json }}

orchestration

{%- set fqdn = salt.grains.get("fqdn") %}
{%- set hook = salt.pillar.get("event_data",False) %}

{%- import "system/module_slack.sls" as slack %}

{%- set service = "Salt-Orchestration - Percona" %}

{%- if hook != False %}
  {%- set RemoteAddr = hook["headers"]["Remote-Addr"] %}

cmd.run:
  salt.function:
    - tgt: 'salt01-tech.inet'
    - arg:
      - echo test

{%- else %}
Echo_percona:
  cmd.run:
    - name: echo "No Data"

{% endif %}

I ran the following command on 2 VMs:

for i in {1..100}; do curl -sSk https://salt01-tech.inet:4080/hook/bugtest -H 'Accept: application/x-yaml' -d 'cluster: sql-prd-shared1'; done

Debug log

2018-06-10 18:59:08,358 [salt.utils.lazy  :97  ][DEBUG   ][24046] LazyLoaded jinja.render
2018-06-10 18:59:08,358 [salt.utils.event :728 ][DEBUG   ][24046] Sending event: tag = salt/run/20180610185908347361/new; data = {'fun': 'runner.state.orchestrate', 'fun_args': [{'pillar': OrderedDict([('event_data', OrderedDict([('_stamp', '2018-06-10T16:54:34.530484'), ('body', ''), ('headers', OrderedDict([('Accept', 'application/x-yaml'), ('Content-Length', '24'), ('Content-Type', 'application/x-www-form-urlencoded'), ('Host', 'salt01-tech.inet:4080'), ('Remote-Addr', '10.201.34.125'), ('User-Agent', 'curl/7.29.0')])), ('post', OrderedDict([('cluster: sql-prd-shared1', '')]))]))]), 'mods': 'orch.bugtest'}], 'jid': '20180610185908347361', 'user': 'Reactor', '_stamp': '2018-06-10T16:59:08.357805'}
2018-06-10 18:59:08,362 [salt.loaded.int.rawmodule.state:2036][DEBUG   ][25798] Remaining event matches: -975
2018-06-10 18:59:08,361 [salt.utils.lazy  :97  ][DEBUG   ][24046] LazyLoaded yaml.render
2018-06-10 18:59:08,361 [salt.config      :1954][DEBUG   ][24046] Reading configuration from /etc/salt/master
2018-06-10 18:59:08,364 [salt.fileclient  :1072][DEBUG   ][24046] Could not find file 'salt://orch/bugtest.sls' in saltenv 'base'
2018-06-10 18:59:08,369 [salt.fileclient  :1072][DEBUG   ][24046] Could not find file 'salt://orch/bugtest/init.sls' in saltenv 'base'
2018-06-10 18:59:08,369 [salt.template    :48  ][DEBUG   ][24046] compile template: False
2018-06-10 18:59:08,370 [salt.template    :62  ][ERROR   ][24046] Template was specified incorrectly: False
2018-06-10 18:59:08,377 [salt.utils.lazy  :97  ][DEBUG   ][24046] LazyLoaded config.get
2018-06-10 18:59:08,378 [salt.loaded.int.render.yaml:76  ][DEBUG   ][24046] Results of YAML rendering: 
OrderedDict([('include', ['directory.consul', 'directory.consul_cache']), ('directory', OrderedDict([('UID', OrderedDict([('saltmaster', OrderedDict([('uid', 'root')]))]))]))])
2018-06-10 18:59:08,378 [salt.template    :26  ][PROFILE ][24046] Time (in seconds) to render '/opt/salt/salt-pillar/pillar/directory/saltmaster.sls' using 'yaml' renderer: 0.0239288806915
2018-06-10 18:59:08,380 [salt.template    :48  ][DEBUG   ][24046] compile template: /opt/salt/salt-pillar/pillar/directory/consul.sls
2018-06-10 18:59:08,379 [salt.utils.event :728 ][DEBUG   ][24046] Sending event: tag = salt/run/20180610185848992206/ret; data = {'fun_args': [{'pillar': OrderedDict([('event_data', OrderedDict([('_stamp', '2018-06-10T16:54:33.650618'), ('body', ''), ('headers', OrderedDict([('Accept', 'application/x-yaml'), ('Content-Length', '24'), ('Content-Type', 'application/x-www-form-urlencoded'), ('Host', 'salt01-tech.inet:4080'), ('Remote-Addr', '10.201.34.125'), ('User-Agent', 'curl/7.29.0')])), ('post', OrderedDict([('cluster: sql-prd-shared1', '')]))]))]), 'mods': 'orch.bugtest'}], 'jid': '20180610185848992206', 'return': {'outputter': 'highstate', 'data': {'salt01-tech.inet_master': ["No matching sls found for 'orch.bugtest' in env 'base'"]}, 'retcode': 1}, 'success': False, '_stamp': '2018-06-10T16:59:08.379624', 'user': 'Reactor', 'fun': 'runner.state.orchestrate'}

state.event

[DEBUG   ] Remaining event matches: -975
salt/run/20180610185848992206/ret       {
    "_stamp": "2018-06-10T16:59:08.379624", 
    "fun": "runner.state.orchestrate", 
    "fun_args": [
        {
            "mods": "orch.bugtest", 
            "pillar": {
                "event_data": {
                    "_stamp": "2018-06-10T16:54:33.650618", 
                    "body": "", 
                    "headers": {
                        "Accept": "application/x-yaml", 
                        "Content-Length": "24", 
                        "Content-Type": "application/x-www-form-urlencoded", 
                        "Host": "salt01-tech.omarsys.inet:4080", 
                        "Remote-Addr": "10.201.34.125", 
                        "User-Agent": "curl/7.29.0"
                    }, 
                    "post": {
                        "cluster: sql-prd-shared1": ""
                    }
                }
            }
        }
    ], 
    "jid": "20180610185848992206", 
    "return": {
        "data": {
            "salt01-tech.omarsys.inet_master": [
                "No matching sls found for 'orch.bugtest' in env 'base'"
            ]
        }, 
        "outputter": "highstate", 
        "retcode": 1
    }, 
    "success": false, 
    "user": "Reactor"
}

Please do not judge my code here; I just wanted to create something that reproduces the issue :)

What is important:

  • This issue was reproduced on version 2017.7.4; for now I don't have a testing environment to install 2017.7.5.
  • When I remove {%- import "system/module_slack.sls" as slack %}, the problem doesn't appear any more. I noticed that it was usually a problem with hits to the API.

Second way to reproduce:

My idea is that you could put a task in the crontab on 2 VMs that hits the Reactor at the same time (if NTP is synchronized), let's say every 2 minutes; this should work, but I didn't test it. I will try it in the next few days and let you know the results. A sketch of such a crontab entry is below.
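Purely as an illustration (untested, reusing the webhook URL and payload from the curl loop above), the entry on each VM could look like:

# /etc/crontab on both VMs -- both fire on the same minute boundary
*/2 * * * * root curl -sSk https://salt01-tech.inet:4080/hook/bugtest -H 'Accept: application/x-yaml' -d 'cluster: sql-prd-shared1'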


Ch3LL commented Jun 11, 2018

Thanks for the additional details; that will help us dive in.

ping @DmitryKuzmenko just a friendly reminder to add this to your bug list to fix :)

@DmitryKuzmenko
Contributor

@Ch3LL thank you. It's already in my icebox, just not the top priority at this moment, sorry. I'll work on it right after a few critical tasks.


Ch3LL commented Jun 15, 2018

No worries, just wanted to make sure it was on your radar. Thanks :)

@rickh563

ZD-2593

@rickh563 rickh563 added the ZD The issue is related to a Zendesk customer support ticket. label Jun 28, 2018
cloudlab-autobuild pushed a commit to salt-formulas/salt-formula-salt that referenced this issue Jul 25, 2018
There is issue [1] with reactor in salt 2017, one
of the workarounds is to set reactor workers to 1

[1] saltstack/salt#47539

Change-Id: I47d76cc1dc5d0afe6d8b215e2d32cdbab3ac1a8c
Related-Prod: https://mirantis.jira.com/browse/PROD-21463
@anitakrueger
Contributor

Is there any update on this issue please? Our infrastructure heavily relies on the reactor and I have to restart the salt-master process multiple times a day to get it working again :(


Ch3LL commented Aug 6, 2018

There is no update currently, but it is assigned to an engineer. I don't believe @DmitryKuzmenko is working on it right now, as there are higher-priority issues.

@anitakrueger
Contributor

Is there any more information needed?


Ch3LL commented Aug 6, 2018

Nope, you're right, I need to re-label this. Thanks :)

@Ch3LL Ch3LL added Bug broken, incorrect, or confusing behavior severity-medium 3rd level, incorrect or bad functionality, confusing and lacks a work around labels Aug 6, 2018
@Ch3LL Ch3LL removed the info-needed waiting for more info label Aug 6, 2018
@Ch3LL Ch3LL modified the milestones: Blocked, Approved Aug 6, 2018
cloudlab-autobuild pushed a commit to salt-formulas/salt-formula-reclass that referenced this issue Aug 7, 2018
Currently salt doesn't allow to get confirmation on minion upon
successfull reactor execution on event. However there can be issues with
reactor in salt 2017.7 [1] or reactor register state can fail if pillar
failed to render, so node registration confirmation maybe needed. In order
to enable this functionality add node_confirm_registration parameter to
event data with value true.

[1] saltstack/salt#47539
Change-Id: I1abc4fdf172e018dcdb48abcf61532fb51ad3660
Related-Prod: https://mirantis.jira.com/browse/PROD-21463
@anitakrueger
Contributor

@Ch3LL @DmitryKuzmenko, is there any chance someone could look at this, please?
I've tried setting reactor_worker_threads: 1 in /etc/salt/master and it does seem to "fix" the problem, but then things become so slow that they actually time out for me. For example, I create 3 new minions in the cloud and they all produce events on startup, which then have to be handled sequentially.
@johje349 said this might have been introduced after 2017.7.4? Any idea where to look?

@anitakrueger
Contributor

Unfortunately after creating many more minions in the cloud for my salt master, I have to report that setting reactor_worker_threads: 1 has not fixed the problem. The reactor still crashes when 2 or more events come in at the same time.

@rallytime
Contributor

Thank you very much for the updates @anitakrueger. This is definitely still on our radar and @DmitryKuzmenko will be working on this as soon as he can.

@cachedout
Contributor

@anitakrueger I believe this fix may help: https://github.com/saltstack/salt/pull/46641/files

Are you willing to test it?

@gtmanfred
Contributor

This change was made between 2017.7.2 and 2017.7.3, #43817

Are any of yall experiencing this issue using multiple environments for your state or orchestrate files?

Thanks,
Daniel

@gtmanfred
Contributor

I have replicated this issue.

@gtmanfred gtmanfred removed the cannot-reproduce cannot be replicated with info/context provided label Oct 2, 2018
@anitakrueger
Contributor

@cachedout I've looked at https://github.com/saltstack/salt/pull/46641/files, but we are running 2018.3.2 and this seems to be already merged in there.

@gtmanfred We only use the base environment. I could never work multiple environments out properly for our use case.
I'm glad you could reproduce it.

@gtmanfred
Contributor

So, here is what I have found.

What is happening is that the master worker threads are getting overloaded. These are the threads the master uses to respond to fileserver requests from the minions and from anything else that uses the remote fileserver. The "No matching sls found for" message appears when fileclient.cache_file returns False inside of this get_state function. Now, what is happening with @1Ricardo is that the import inside of Jinja causes an extra call to the fileserver, which makes the master work harder, which is why the problem appears to go away when you remove that import. But if you double the number of events you put on the event bus, even with that import removed, the problem comes back.

So there is really only one workaround: increase worker_threads so that the salt master can handle all of the fileserver requests that you expect to come in at one time.
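
For illustration, that workaround is just a master config change (the value below is only an example; size it to the number of simultaneous fileserver requests you expect, then restart the master):

# /etc/salt/master (excerpt)
worker_threads: 24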

As for solutions that we can do, we are talking about two.

  1. Implement an exponential backoff for requests to the fileserver: specify a number of times a request can be made, and if it is still failing after, say, 5 requests, report that the file is not there. But this has a couple of caveats. One: it will make it much more difficult to troubleshoot performance problems, because there could be random sleeps. The bigger one: the sleep would block one of the reactor worker threads.

  2. I believe this is the better solution: since state.orchestrate should only ever be run on the master, we can switch the state renderer over to using the local fileclient. This will allow the orchestration runner, which is usually run by the reactor, to react more quickly to events and not have to depend on worker threads to distribute jobs.

@gtmanfred
Contributor

If I can figure out why this has to call get_state twice, and get it to only call it once without making things worse, using the local fileclient for state.orchestrate called from the master should be easy to implement.

diff --git a/salt/state.py b/salt/state.py
index fcacaeb935..04bbe4abe8 100644
--- a/salt/state.py
+++ b/salt/state.py
@@ -3443,6 +3443,8 @@ class BaseHighState(object):
         errors = []
         if not local:
             state_data = self.client.get_state(sls, saltenv)
+            if state_data.get('dest', False) is False:
+                state_data = self.client.get_state(sls, saltenv)
             fn_ = state_data.get('dest', False)
         else:
             fn_ = sls
@@ -4056,8 +4058,8 @@ class HighState(BaseHighState):
             mocked=False,
             loader='states',
             initial_pillar=None):
-        self.opts = opts
-        self.client = salt.fileclient.get_file_client(self.opts)
+        self.opts = copy.deepcopy(opts)
+        self.opts['file_client'] = 'local' if self.opts['__role'] == 'master' else self.opts['file_client']
         BaseHighState.__init__(self, opts)
         self.state = State(self.opts,
                            pillar_override,


xadina commented Oct 3, 2018

I have been hit by this too. I've put in place a simple check to notify me when this happens, until the fix is in place; I am sharing it here in case someone wants to make use of it:

reactor.conf:

  - 'monitoring/ping':
    - salt://reactor/pong.sls

salt/reactor/pong.sls:

pong_back:
  local.file.write:
    - name: pong back
    - tgt: {{ data['id'] }}
    - args:
      - path: {{ data['data']['file'] }}
      - args: {{ data['data']['text'] }}

check script:

#!/bin/bash
# Reactor health check: send a test event and verify that the reaction in
# reactor/pong.sls wrote the random string back to a temp file on this minion.

string=$(cat /dev/urandom | tr -cd 'a-z0-9' | head -c 10)
file=$(mktemp)

sendevent ()
{
# Fire the monitoring/ping event that the reactor listens for.
output=$(sudo salt-call event.send 'monitoring/ping' text=${string} file=${file} --retcode-passthrough 2>&1)
result=$?
if [ ${result} != 0 ]
  then
    status=2
    echo "Failure occured whilst sending event: [${output}]"
  else
    sleep 2
    contents=$(cat ${file} 2>/dev/null)
    if [ "${string}" == "${contents}" ]
      then
        status=0
        echo "OK: the Reactor is working"
      else
        status=2
        echo "Reactor Failure: check salt master!"
    fi
fi
}

sendevent
exit ${status}

@gtmanfred
Contributor

OK, so the other part of the problem is that, with all these calls happening at exactly the same time, they are all trying to open file handles to cache the same file to the same location. So we need to add some kind of file lock to make sure multiple reactor threads do not write to the cache at the same time. That should solve this issue completely.
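
Not the actual patch, but a minimal sketch of the kind of locking meant here (the helper name and the sidecar lock file are made up for illustration; real code would live in Salt's fileclient):

import contextlib
import fcntl
import os

@contextlib.contextmanager
def locked_cache_write(dest):
    # Illustrative helper: hold an exclusive lock on a sidecar lock file so
    # only one reactor worker writes a given cache destination at a time.
    fd = os.open(dest + '.lk', os.O_CREAT | os.O_RDWR)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX)   # blocks until the lock is free
        yield
    finally:
        fcntl.flock(fd, fcntl.LOCK_UN)
        os.close(fd)

# Usage sketch: serialize concurrent writers of the same cached file.
# with locked_cache_write(dest):
#     with open(dest, 'wb') as fp:
#         fp.write(contents)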

gtmanfred added a commit to gtmanfred/salt that referenced this issue Oct 4, 2018
If the __role is master, then we should already have all the configs
needed to get the files from the local machine, and from gitfs instead
of having to request them from the fileserver on the master.

Fixes the first part of saltstack#47539
@gtmanfred gtmanfred self-assigned this Oct 9, 2018
@gtmanfred
Contributor

OK, we have figured out what is happening and, hopefully, a fix for it, but it is going to take a bit to finish. @terminalmage is working on a fix.

To render pillar files, we use the same logic that is used to render files from file_roots: we munge the data in opts so that file_roots becomes pillar_roots and run it through the same code; those options are overwritten here. As you can see above that line, we do a deep copy of the opts dictionary, so it should not affect the opts used when pulling states. However, even with this, that assignment still sometimes appears to affect the rendering of states for the reactor.
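
Purely to illustrate the kind of race being described (this is not Salt's actual code; the sleeps and paths are invented):

import threading
import time

# Shared opts dict, as seen by both the pillar renderer and the state renderer.
opts = {'file_roots': {'base': ['/srv/salt']},
        'pillar_roots': {'base': ['/srv/pillar']}}

def render_pillar():
    saved = opts['file_roots']
    opts['file_roots'] = opts['pillar_roots']   # munge file_roots to pillar_roots
    time.sleep(0.01)                            # pretend to render a pillar SLS
    opts['file_roots'] = saved

def find_state():
    time.sleep(0.005)
    # If this runs while render_pillar() holds the munged value, the orchestration
    # SLS is looked up under /srv/pillar and "No matching sls found" is the result.
    print('looking up orch/bugtest.sls under', opts['file_roots']['base'])

t1 = threading.Thread(target=render_pillar)
t2 = threading.Thread(target=find_state)
t1.start(); t2.start(); t1.join(); t2.join()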

Here is what we found yesterday.

In one thread, when trying to grab the orchestration file (and sometimes the reactor state file), we see it trying to pull from the pillar roots.

2018-10-09 16:23:06,911 {139939631068928}[salt.loaded.int.fileserver.roots][CRITICAL] FILEPATH: /srv/salt/test.sls
2018-10-09 16:23:06,920 {139939631068928}[salt.loaded.int.fileserver.roots][CRITICAL] THERE7
2018-10-09 16:23:06,933 {139939631068928}[salt.loaded.int.fileserver.roots][CRITICAL] FILEPATH: /srv/salt/orch/bugtest.sls
2018-10-09 16:23:06,941 {139939631068928}[salt.loaded.int.fileserver.roots][CRITICAL] THERE7
2018-10-09 16:23:13,159 {139939631068928}[salt.fileserver  ][CRITICAL] LOAD: {u'path': u'orch/bugtest.sls', u'saltenv': u'base', u'cmd': u'_file_hash'}
2018-10-09 16:23:13,165 {139939631068928}[salt.loaded.int.fileserver.roots][CRITICAL] FILEPATH: /srv/pillar/orch/bugtest.sls
2018-10-09 16:23:13,167 {139939631068928}[salt.loaded.int.fileserver.roots][CRITICAL] FILEPATH: /srv/spm/pillar/orch/bugtest.sls
2018-10-09 16:23:13,168 {139939631068928}[salt.loaded.int.fileserver.roots][CRITICAL] THERE8
2018-10-09 16:23:13,170 {139939631068928}[salt.fileserver  ][CRITICAL] HERE2: {u'path': u'', u'rel': u''}
2018-10-09 16:23:13,171 {139939631068928}[salt.loaded.int.fileserver.roots][CRITICAL] FILEPATH: /srv/pillar/orch/bugtest.sls
2018-10-09 16:23:13,172 {139939631068928}[salt.loaded.int.fileserver.roots][CRITICAL] FILEPATH: /srv/spm/pillar/orch/bugtest.sls
2018-10-09 16:23:13,173 {139939631068928}[salt.loaded.int.fileserver.roots][CRITICAL] THERE8
2018-10-09 16:23:13,176 {139939631068928}[salt.fileserver  ][CRITICAL] LOAD: {u'path': u'orch/bugtest/init.sls', u'saltenv': u'base', u'cmd': u'_file_hash'}
2018-10-09 16:23:13,177 {139939631068928}[salt.loaded.int.fileserver.roots][CRITICAL] FILEPATH: /srv/pillar/orch/bugtest/init.sls
2018-10-09 16:23:13,179 {139939631068928}[salt.loaded.int.fileserver.roots][CRITICAL] FILEPATH: /srv/spm/pillar/orch/bugtest/init.sls
2018-10-09 16:23:13,181 {139939631068928}[salt.loaded.int.fileserver.roots][CRITICAL] THERE8
2018-10-09 16:23:13,183 {139939631068928}[salt.fileserver  ][CRITICAL] HERE2: {u'path': u'', u'rel': u''}
2018-10-09 16:23:13,184 {139939631068928}[salt.loaded.int.fileserver.roots][CRITICAL] FILEPATH: /srv/pillar/orch/bugtest/init.sls
2018-10-09 16:23:13,186 {139939631068928}[salt.loaded.int.fileserver.roots][CRITICAL] FILEPATH: /srv/spm/pillar/orch/bugtest/init.sls
2018-10-09 16:23:13,189 {139939631068928}[salt.loaded.int.fileserver.roots][CRITICAL] THERE8
2018-10-09 16:23:13,190 {139939631068928}[salt.state       ][CRITICAL] GET_STATE: {}

So, our plan is to break this out so that it does not need to overwrite file_roots to render pillar data. That is something Erik started earlier this year but hasn't finished yet. We hope to get that code into 2018.3 for the 2018.3.4 release.

Now, as a workaround, there are two ways to get around this bug:

  1. Copy all the reactor and orchestration states that you use from your master to /srv/pillar, or wherever your pillar roots are, so that they can be found there if this bug occurs.

  2. Move to using an external fileserver for these files. This bug only affects the roots fileserver, so if you put your reactor and orchestration files in gitfs or another fileserver backend, they won't hit this bug (see the config sketch below).

If you put your reactor into gitfs too, you can reference it with salt://

 reactor:
 ### bug test
  - 'salt/netapi/hook/bugtest':
    - salt://reactor/bugtest.sls
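
And, roughly, the gitfs side in the master config (the repository URL below is only a placeholder):

# /etc/salt/master (excerpt) -- serve reactor/orchestration files from gitfs
fileserver_backend:
  - gitfs
  - roots
gitfs_remotes:
  - https://github.com/example/salt-reactor-states.git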

Thanks for your patience.
Daniel

@anitakrueger
Contributor

Daniel, thank you SOOOO much for figuring this out! This has really been bugging us as we use the reactor heavily. I will give workaround #1 a go and report back.

@gtmanfred
Contributor

👍 happy to help.

Erik has a commit that seems to fix the issue, and it should be included in 2018.3.4.

@terminalmage
Contributor

I've opened #49987, see that PR for details.

@rallytime rallytime added the fixed-pls-verify fix is linked, bug author to confirm fix label Oct 10, 2018
@gtmanfred
Contributor

The above fix has been merged and it will be in the 2018.3.4 release.

#49987

Thank you for your patience with helping us troubleshoot this.
Daniel
