Reactor alert on highstate fail #28569

Closed
andrejohansson opened this issue Nov 4, 2015 · 14 comments
Labels
Core relates to code central or existential to Salt Feature new functionality including changes to functionality and code refactors, etc. State-Compiler ZD The issue is related to a Zendesk customer support ticket. ZRELEASED - Boron

@andrejohansson

Just as in this post and in this example, I am trying to make a reactor state that triggers on failed highstate runs so that I can send various alerts.

I have the following files:
master config

# Reactors
reactor:
  # Listen for job returns; if a highstate failed, alert
  - 'salt/job/*/ret/*':
    - salt://reactor/hipchat-highstate-fail.sls

_hipchat-highstate-fail.sls_

# if it's a highstate function
# note: we cannot use "not data['success']" since it's always true, and retcode is always 0

# check if the highstate had errors
# how?!
{%- if data['fun'] == 'state.highstate' and hasErrors -%}

# execute an alert only if the runner found an error
hipchat_highstate_alert:
  local.hipchat.send_message:
    - tgt: "myminion"
    - kwarg:
        room_id: "MyRoom"
        from_name: "SaltStack"
        color: "red"
        notify: "True"
        message: "Highstate failed on minion {{ data['id'] }}."

{%- endif -%}

Things work well until you want to filter out just the failing runs. Is there really no simpler way than resorting to custom Jinja templating or custom runners? I would expect something like listening on failed jobs only: salt/job/*/failure/*

So the question I'm asking is: How can I make a reactor state that runs only on failed highstate runs?

*Versions report*
(I'm in the early access enterprise program, don't remember how the versions are mapped)

root@prs-eu-sama:~# salt --versions-report
           Salt: 4.0.2
         Python: 2.7.6 (default, Jun 22 2015, 17:58:13)
         Jinja2: 2.7.2
       M2Crypto: 0.21.1
 msgpack-python: 0.3.0
   msgpack-pure: Not Installed
       pycrypto: 2.6.1
        libnacl: Not Installed
         PyYAML: 3.10
          ioflo: Not Installed
          PyZMQ: 14.0.1
           RAET: Not Installed
            ZMQ: 4.0.4
           Mako: Not Installed
@andrejohansson
Author

I got it to work with the following hack, but my question remains: is there no cleaner way?

# see http://stackoverflow.com/questions/4870346/can-a-jinja-variables-scope-extend-beyond-in-an-inner-block
{% set exists = [] %}

{%- for state, result in data['return'].iteritems() -%}
    {%- if not result['result'] -%}
        {% do exists.append(1) %}
    {%- endif -%}
{%- endfor %}


# execute an alert only if the highstate had some error
{% if exists %}
...
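
The check that the Jinja hack performs can also be sketched in plain Python, for example as the core of a custom runner. This is only a minimal sketch under stated assumptions: `highstate_failed` is a hypothetical helper name, the input is assumed to be the `data['return']` dict of a highstate job, and a non-dict return (for instance a list of error strings) is assumed to mean the run could not be compiled. Unlike the hack above, it counts only `result: False` as a failure, so a `None` result is not flagged.

```python
def highstate_failed(ret):
    """Return True if any state in a highstate return reports result False.

    `ret` is assumed to be the value of data['return'] from the job event.
    """
    # Assumption: a return that is not a dict (e.g. a list of error
    # strings) means the highstate could not be compiled at all.
    if not isinstance(ret, dict):
        return True
    return any(
        isinstance(state, dict) and state.get("result") is False
        for state in ret.values()
    )


# Illustrative data only; the state IDs below are made up.
sample = {
    "file_|-/etc/motd_|-/etc/motd_|-managed": {"result": True},
    "pkg_|-nginx_|-nginx_|-installed": {"result": False},
}
print(highstate_failed(sample))  # True
```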

@DanyC97

DanyC97 commented Nov 4, 2015

This is an interesting idea. I could replace HipChat with Slack and get something cool... +1

@andrejohansson
Author

@DanyC97, enjoy! Let me know if you create something cool. This seems like a working concept for applying alerting to your environment in a global fashion. I do not know how performant it is, though.

@basepi
Contributor

basepi commented Nov 9, 2015

Are we defining "failure" as "any state returned result: False"?

I think "failed highstate" is probably a very common use-case for the reactor. And I don't think there's an easy way to see whether a highstate failed or not. I think currently data['success'] is cueing off of whether salt had any errors, not whether the highstate had any errors. We should either change that behavior or add a new field.

Any thoughts on better ways to implement this @thatch45 or @cachedout?

@basepi
Contributor

basepi commented Nov 9, 2015

Also, could you please cross-post this issue using your Zendesk login if you haven't already @andrejohansson? You should have one since you're using an enterprise version.

@basepi basepi added Pending-Discussion The issue or pull request needs more discussion before it can be closed or merged Core relates to code central or existential to Salt Feature new functionality including changes to functionality and code refactors, etc. ZRELEASED - Boron ZD The issue is related to a Zendesk customer support ticket. State-Compiler labels Nov 9, 2015
@basepi basepi self-assigned this Nov 9, 2015
@basepi basepi added this to the B 8 milestone Nov 9, 2015
@thatch45
Contributor

thatch45 commented Nov 9, 2015

Hmm, I thought I had already put something into the master which fired events for failed states (result: False). But similarly, we could add a hook in the reactor, or even in the natural event processing, to add some state failure event.

@basepi
Contributor

basepi commented Nov 9, 2015

Yeah, I think it would be useful to have a general-case failure event as well, so you don't have to be worried about emitting multiple pages, or multiple messages in hipchat or whatever. Thoughts, @andrejohansson?
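
One way to picture the "single message per run" idea is to collapse all failed states of one highstate return into one alert text. The following is a rough sketch, not an existing Salt feature: `failure_summary` is a hypothetical helper, and the message wording is made up for illustration.

```python
def failure_summary(minion_id, ret):
    """Build one alert message covering every failed state, or None.

    `ret` is assumed to be the data['return'] dict of a highstate job.
    """
    # Assumption: a non-dict return carries error text for the whole run.
    if not isinstance(ret, dict):
        return "Highstate failed on minion {0}: {1}".format(minion_id, ret)
    failed = sorted(
        state_id
        for state_id, state in ret.items()
        if isinstance(state, dict) and state.get("result") is False
    )
    if not failed:
        return None  # nothing to report, so no alert is sent
    return "Highstate failed on minion {0}: {1} state(s) failed ({2})".format(
        minion_id, len(failed), ", ".join(failed)
    )
```

A caller would send the returned string once per job, however many states failed, and skip the alert entirely when the result is `None`.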

@andrejohansson
Author

@basepi if data['success'] is signaling salt errors, what is then retcode signaling today? And yes, my hack currently defines "failure" as "any state returned result: False".

I will send a mail to my contact regarding your Zendesk proposal, I don't think I've gotten a login yet, I have to double check.

A general failure event feels like the natural way. Maybe it would also be possible to extend the targeting mechanism for events? Today I can react to event names and use wildcards (e.g. 'salt/job/*/ret/*'). Would it be possible to add the function name as well as the result? Or could you in some way incorporate the same flexible targeting as for minions? I understand there must be performance considerations here, but something like:

'salt/job/*/ret/*/highstate'
'salt/job/*/ret/*/highstate/failure'
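
To illustrate how such tags would behave under wildcard matching, here is a small sketch that assumes reactor tags are compared with shell-style globs in the manner of Python's `fnmatch` module. That matching rule is an assumption for illustration; `tag_matches` and the job ID are hypothetical.

```python
from fnmatch import fnmatchcase


def tag_matches(tag, pattern):
    """Return True if an event tag matches a glob-style reactor pattern."""
    return fnmatchcase(tag, pattern)


current_tag = "salt/job/20151104120000000000/ret/myminion"
proposed_tag = current_tag + "/highstate/failure"

print(tag_matches(current_tag, "salt/job/*/ret/*"))                     # True
print(tag_matches(current_tag, "salt/job/*/ret/*/highstate/failure"))   # False
# In fnmatch, '*' also matches '/', so the broad pattern catches the longer tag too.
print(tag_matches(proposed_tag, "salt/job/*/ret/*"))                    # True
```

Under this assumption, a reactor already subscribed to 'salt/job/*/ret/*' would also fire on the longer, more specific tags.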

@basepi
Contributor

basepi commented Nov 10, 2015

Yeah, it would be an additional event (we don't want to change the existing tags). Or it would be another field inside of the existing event.

In at least 2015.8, I'm pretty sure the retcode does change on failures, so we might actually be able to just cue off of that instead. I'll have to do some testing.

@basepi basepi modified the milestones: B 7, B 8 Nov 24, 2015
@basepi basepi modified the milestones: B 5, B 7 Dec 14, 2015
@DanyC97

DanyC97 commented Dec 19, 2015

@basepi do you have a full example of how it works in the new implementation/version (2015.8)?
Thanks

@basepi
Contributor

basepi commented Jan 4, 2016

@DanyC97 this has not yet been implemented.

@basepi basepi modified the milestones: B 5, B 4 Jan 19, 2016
@basepi basepi removed this from the B 4 milestone Feb 2, 2016
@basepi basepi modified the milestones: B 3, B 4 Feb 2, 2016
@basepi basepi assigned DmitryKuzmenko and unassigned basepi Feb 6, 2016
@basepi
Contributor

basepi commented Feb 6, 2016

@DmitryKuzmenko The reactor shouldn't need any modifications here. What we want to look at are the events which accompany a state run in salt. We want to put an indication in the return event for a state run to show that the run had failures in it. If we can make sure the "success" piece of the event data for the state run correctly shows True or False based on whether there were failures, then users will be able to write reactors which can detect failures.

Ping me on Slack if you have more questions.

@DmitryKuzmenko
Contributor

@andrejohansson

As I understand it, the job return data contains a 'retcode' field that must be non-zero if there are errors in state execution. So the following could be used for your case:

{%- if data['fun'] == 'state.highstate' and data['retcode'] != 0 -%}
# Send to hipchat here
{%- endif -%}

If the retcode is zero even though state execution had errors, it's a bug and we have to fix it. I'm now trying to figure out when the retcode could possibly come back as a false zero.
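
The same condition can be written as a small Python predicate, which may be handy for testing the rule against captured event data. This is a sketch only: `should_alert` is a hypothetical name, and the event is assumed to be a dict carrying the `fun`, `retcode` and `id` fields used in the snippets above.

```python
def should_alert(data):
    """Return True for a highstate job return with a non-zero retcode."""
    # A missing retcode is treated as 0 here; that default is an assumption.
    return data.get("fun") == "state.highstate" and data.get("retcode", 0) != 0


# Illustrative events; the field values are made up.
failed_run = {"fun": "state.highstate", "retcode": 2, "id": "myminion"}
clean_run = {"fun": "state.highstate", "retcode": 0, "id": "myminion"}
other_job = {"fun": "test.ping", "retcode": 1, "id": "myminion"}

print(should_alert(failed_run))  # True
print(should_alert(clean_run))   # False
print(should_alert(other_job))   # False
```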

@DmitryKuzmenko
Contributor

@andrejohansson I've just fixed an issue with the retcode.
I'm not sure whether you've run into it, but it could be related.
All the details are in PR #31164.

If you find any problem getting the execution results with the retcode value as I described above, please create a new issue.

@DmitryKuzmenko DmitryKuzmenko removed the Pending-Discussion The issue or pull request needs more discussion before it can be closed or merged label Feb 16, 2016
5 participants