Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Can't get batch mode and --failhard to work as expected #29643

Closed
matthayes opened this issue Dec 11, 2015 · 67 comments
Closed

Can't get batch mode and --failhard to work as expected #29643

matthayes opened this issue Dec 11, 2015 · 67 comments
Assignees
Labels
Bug broken, incorrect, or confusing behavior Core relates to code central or existential to Salt fixed-pls-verify fix is linked, bug author to confirm fix P3 Priority 3 severity-medium 3rd level, incorrect or bad functionality, confusing and lacks a work around State-Compiler ZD The issue is related to a Zendesk customer support ticket. ZRELEASED - Boron
Milestone

Comments

@matthayes
Copy link

I'm using a command like this below to execute state.highstate on one machine at a time, with the intention that it stop on the first failure. But, it's not stopping even though there is a failure.

sudo salt -b 1 --failhard -G 'roles:foo' state.highstate

The cause of the failure is a http_test.ping custom module function I wrote that checks that an HTTP call is succeeding. I gave it a bogus path to force it to fail. I'm testing this against two minions. Looking at the minion logs I can see the calls are failing. But, the output I get from the salt command above is essentially:

Executing run on ['some-machine-A']

retcode:
    0

Executing run on ['some-machine-B']

some-machine-B:    <-- highlighted red

(... lots of SLS state output ...)

----------
ID: Ping the web app    <-- all highlighted red
    Function: http_test.ping
      Result: False
     Comment: Call to http://localhost:80/pingping failed. Error: HTTP 404: NOT FOUND. HTTP status: 404
     Started: 22:52:49.686395
    Duration: 57.519 ms
     Changes: 

(... more SLS state output ...)

So, you can see that the Ping the web app call is failing and has Result: False. This happens for both machines. But, the retcode for both machines is apparently 0. I ran the same command again with --out=yaml appended, which showed me that retcode is 0 for both machines.

Maybe this isn't a problem with the batch mode but instead a problem with how retcode is determined. Am I missing something? Shouldn't the retcode be non-zero since the state has Result: False?

@matthayes
Copy link
Author

I've also tried following the events by running sudo salt-run state.event pretty=True. Here I also see that retcode is 0 for both state.highstate executions. How is the value of retcode determined? Is it not based on the state success or failure?

@matthayes
Copy link
Author

As a test I tried adding some logging to get_retcode that I saw mentioned in #22855 . What I found is that the final else is executing, meaning that the default retcode of 0 is returned. I also tried dumping ret. It looks like it only consists of keys corresponding to each state that is executed. I don't see any useful return values that could be used to detect failure.

def get_retcode(ret):
    '''
    Determine a retcode for a given return
    '''
    retcode = 0
    # if there is a dict with retcode, use that
    if isinstance(ret, dict) and ret.get('retcode', 0) != 0:
        log.error("Found retcode: " + ret['retcode'])
        return ret['retcode']
    # if its a boolean, False means 1
    elif isinstance(ret, bool) and not ret:
        log.error("Found bool value, returning 1")
        return 1
    else:
        log.error("No retcode found, returning default: " + str(retcode))
    return retcode

Also, if I remove -b 1 --failhard from the command, get_retcode is logging different output (below). This time there is a retcode with value 0, which still doesn't seem right since one of the states has a false result.

{
    "out": "highstate",
    "ret": {  ... ommitted .. },
    "retcode": 0
}

Also this seems to indicate a bug in how get_retcode is called in these two cases. In batch mode it appears that the value of ret is passed to get_retcode, which doesn't have retcode. But when not in batch mode the parent object that contains ret is passed to it, which does contain retcode.

@matthayes
Copy link
Author

@jacksontj thoughts on this?

@matthayes
Copy link
Author

@basepi , wanted to bring this to your attention as it is related to your suggestion in #1038 .

@matthayes
Copy link
Author

To summarize, I think there are two issues:

  1. The retcode coming back from the minion for state.highstate is zero even though one of the result values is false.
  2. In batch mode, even if we were getting back a non-zero value for retcode we would never know because the wrong object is passed to get_retcode in salt.py.

@lorengordon
Copy link
Contributor

Does salt support --retcode-passthrough or is that only an option for salt-call? (I only use masterless, so I don't have a master to check with...)

https://docs.saltstack.com/en/latest/ref/cli/salt-call.html#cmdoption-salt-call--retcode-passthrough

@matthayes
Copy link
Author

Not that I've seen. I think that's only supported by salt-call.

@basepi basepi added Bug broken, incorrect, or confusing behavior severity-medium 3rd level, incorrect or bad functionality, confusing and lacks a work around Core relates to code central or existential to Salt P3 Priority 3 State-Compiler labels Dec 14, 2015
@basepi basepi added this to the Approved milestone Dec 14, 2015
@basepi
Copy link
Contributor

basepi commented Dec 14, 2015

thanks for the ping, @matthayes. This looks like something we need to get fixed.

@matthayes
Copy link
Author

I can help diagnose the issue on my side but I need some pointers for what code to check. I've been adding logging to dig deeper but I've reached the limit of my familiarity with the Salt code. Can you point me to where in the code the minion returns the retcode to the master? Or let me know if there is something I can do to help find the problem. I was working on automating some deployment but now I'm stuck until this is fixed so I am motivated to identify the problem :)

@basepi
Copy link
Contributor

basepi commented Dec 15, 2015

If I'm not mistaken, modules can set their retcode in the special __context__ dictionary that is loaded into each module. Were I testing this myself I would first just add a line to state.highstate and set that retcode, and see if it started coming through. It might be as easy as modifying state.highstate to return the correct retcode.....

If I'm not mistaken, the minion pulls that retcode here:

salt/salt/minion.py

Lines 1166 to 1169 in bbc1034

ret['retcode'] = minion_instance.functions.pack['__context__'].get(
'retcode',
0
)

@matthayes
Copy link
Author

Thanks, I've added some more logging.

At the end of the highstate method in salt/modules/state.py, the value of __context__['retcode'] is 2. This is consistent with what I saw when I ran salt-call manually on the minion machine. But, in _thread_return in salt/minion.py, minion_instance.functions.pack['__context__'].get('retcode') has no value at the point where this code runs:

                ret['retcode'] = minion_instance.functions.pack['__context__'].get(
                    'retcode',
                    0
                )

So, it is setting retcode to zero.

@matthayes
Copy link
Author

Actually, correction: this is the content of minion_instance.functions.pack['__context__']:

{'retcode': 0}

I thought it had no value because I was using if and 0 evaluates as false.

@matthayes
Copy link
Author

So, what is the relationship between __context__ and minion_instance.functions.pack['__context__']? Should these have the same content? As you can see above, minion_instance.functions.pack['__context__'] in salt/minion.py only has the retcode value, which was set to 0 before the call was made. But when I log the entire contents of __context__ in the highstate method in salt/modules/state.py it is huge and has a lot of data in it.

@basepi
Copy link
Contributor

basepi commented Dec 15, 2015

I feel like they should have the same content but I haven't spent enough time digging through the context around __context__ (haha) to know for sure. Perhaps it's different, or the side effects are not propagating as expected.....

@matthayes
Copy link
Author

What should I investigate next then? What is the purpose of this pack object anyways? Can __context__ not be accessed at this point? Where is the propagation supposed to happen?

@basepi
Copy link
Contributor

basepi commented Dec 15, 2015

The loader inserts the globals in the pack dictionary into functions, at call time I think? I haven't had time today to look around more for more leads to give you, unfortunately. I'll see if I can find some time this afternoon.

@matthayes
Copy link
Author

Who knows the most about how this pack/lazy-loader system is supposed to work? I logged the object ID of the context objects to see if they were the same object and the IDs don't match. That is, the object ID of minion_instance.functions.pack['__context__'] in minion.py's _thread_return is different than the value of id(__context__) in modules/state.py's highstate method.

@matthayes
Copy link
Author

Okay, well here's something else interesting. I changed minion.py's _thread_return to set the retcode like this:

minion_instance.functions.pack['__context__']['retcode'] = 123

At the beginning of modules/state.py's highstate method, the value of __context__['retcode'] is 123. At the end of this method it is set to 2, overridden because of the one failure I forced. At the end of minion.py's _thread_return the value of retcode is still 123.

I'm assuming the purpose of minion_instance.functions.pack['__context__'] is to serve as a base object for __context__ into which data is added during execution time for each module. If it's a base object then I expect a copy of it should be made rather than modifying the original. If this is the case we shouldn't assume that by reading the value of minion_instance.functions.pack['__context__']['retcode'] after calling the function in the execution module that we'll get the value set into __context__['retcode']. We'll just get the initial starting value.

@matthayes
Copy link
Author

I got it working with a very small change. I just don't know if this is correct so would like someone with knowledge of __context__ to review.

I figured in minion.py that the __context__ in pack is used to construct the __context__ the function uses at runtime, and that this latter __context__ may be accessible some way. Inspecting func I found I could get the retcode like this:

func.func_globals['__context__'].get('retcode', 0)

This gives me back the expected value of 2. When I update the code in minion.py as below then the master gets the right retcode back.

ret['retcode'] = func.func_globals['__context__'].get('retcode', 0)

I can prepare a pull request and test it, but I'd like to know: is this the right solution? One thing that concerns me in general is whether this __context__ instance is unique to the call to this function. I want to make sure that if we set it to a non-zero value that something else won't override it.

@basepi
Copy link
Contributor

basepi commented Dec 16, 2015

Interesting! __context__ is supposed to be the same object across all calls in the same job in salt. So a single state run will have the same __context__ which will persist throughout. This allows us to do memoization, stuff like that. However, I think only the first-called execution module function is supposed to set the retcode. So I don't think you need to worry about losing your retcode there.

I guess my main question is why the __context__ in the pack object isn't the same object. I feel like it should be. If you look a little higher in the same function in minion.py, you'll see that we set the retcode to 0 right before we make the actual function call. Then we expect to be able to retrieve the retcode if the function sets one right after the call. Something weird is going on in there, we're creating a new context dictionary or something somewhere in the stack. That's really what I'd like to track down, I think. I'm going to spend the next half hour going through code seeing if I can figure that out.

@basepi
Copy link
Contributor

basepi commented Dec 16, 2015

@matthayes what version of salt are you running?

@matthayes
Copy link
Author

salt 2015.8.3 (Beryllium)

@basepi
Copy link
Contributor

basepi commented Dec 16, 2015

Thanks.

@basepi
Copy link
Contributor

basepi commented Dec 16, 2015

Alright, next piece of debugging: can you log the actual object address of __context__, both in minion.py and in state.py? Basically, I want to verify that they are different objects in each place, whereas they're supposed to be the same object. You can use id() to get the memory address of an object in python. I just want to make sure I'm on the right path, because my code dive is coming up empty.

@danlsgiga
Copy link
Contributor

Ok @basepi, I was wrong and you were right! I think I found the issue. As you said, 'retcode' is indeed in the minion response. The issue is exactly here:
https://github.com/saltstack/salt/blob/2015.8/salt/cli/batch.py#L197

The retcode key is outside the data['ret'] dict.

The return batch.run() is giving to get_retcode is this:

{'test_|-abort-execution_|-foo_|-fail_without_changes': {'comment': 'Failure!', 'name': 'foo', 'start_time': '12:37:44.192329', 'result': False, 'duration': 0.931, '__run_num__': 0, 'changes': {}}}

After changing the mentioned line to

yield {minion: data}

The return batch.run() is giving to get_retcode is this:

{'retcode': 2, 'ret': {'test_|-abort-execution_|-foo_|-fail_without_changes': {'comment': 'Failure!', 'name': 'foo', 'start_time': '12:39:22.453917', 'result': False, 'duration': 0.765, '__run_num__': 0, 'changes': {}}}, 'out': 'highstate'}

And now the --failhard is working as expected!

@matthayes
Copy link
Author

Hey @basepi I'm noticing another strange issue with failhard that affects test mode. I'm not sure if this is related. I can open another issue if you think that's best. I'm executing state.highstate from the master with test=True. When I have failhard: True in the master config, it doesn't execute all the salt states (only one out of about fifty). If I remove failhard: True and bounce the master, then it does execute all the states. Note that when I remove test=True and actually execute the highstate for real, it doesn't matter whether I have failhard enabled or not; all the states execute either way.

@basepi
Copy link
Contributor

basepi commented Jan 15, 2016

@matthayes That's a weird one. Do any of the states it does run fail? It's possible for things to fail even in test mode.

@basepi
Copy link
Contributor

basepi commented Jan 15, 2016

(But to answer your question, probably a separate issue.)

@matthayes
Copy link
Author

Nope, no failures for the states that ran.

@DmitryKuzmenko
Copy link
Contributor

@matthayes I've just fixed this. All details provided in the PR #31164

@DmitryKuzmenko DmitryKuzmenko added the fixed-pls-verify fix is linked, bug author to confirm fix label Feb 12, 2016
@danlsgiga
Copy link
Contributor

This may fix an issue I've opened for a few months already. #24996
Thanks for the fix @DmitryKuzmenko

@DmitryKuzmenko
Copy link
Contributor

@danlsgiga you're welcome!

@meggiebot meggiebot added ZD The issue is related to a Zendesk customer support ticket. ZRELEASED - Boron labels Feb 12, 2016
@meggiebot meggiebot modified the milestones: B 3, Approved Feb 12, 2016
@matthayes
Copy link
Author

@DmitryKuzmenko thanks for fixing this!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug broken, incorrect, or confusing behavior Core relates to code central or existential to Salt fixed-pls-verify fix is linked, bug author to confirm fix P3 Priority 3 severity-medium 3rd level, incorrect or bad functionality, confusing and lacks a work around State-Compiler ZD The issue is related to a Zendesk customer support ticket. ZRELEASED - Boron
Projects
None yet
Development

No branches or pull requests

6 participants