Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve retcode reporting in salt/salt-call CLI commands #48361

Merged
merged 17 commits into from Aug 3, 2018

Conversation

Projects
None yet
7 participants
@terminalmage
Copy link
Contributor

commented Jun 28, 2018

What this PR does

Improves how the salt and salt-call CLI commands set exit codes.

Refs: #47732

For salt:

The following cases will result in a nonzero exit status for salt

  • Any exception raised by the minion function will result in salt
  • If the return data is a dictionary and contains either a result or success key, and its value is False

For salt-call:

The following cases will result in a nonzero exit status for salt-call (assuming it was invoked using --retcode-passthrough):

  • If the return data is a dictionary and contains either a result or success key, and its value is False

Further points of discussion needed before merge

There is some asymmetry to how salt and salt-call handle exit codes:

For salt, when the master sees that one of its minions has set a nonzero retcode in the return data, it does a sys.exit(11). When salt-call is invoked with --retcode-passthrough, it will return whatever retcode was set in __context__['retcode'], otherwise it just returns a 0 exit status (assuming no exception was raised).

For salt, when a minion catches an exception, it sets a retcode in the return event sent back to the master. As noted above, any one minion that generated a nonzero retcode will result in the master exiting with a return code of 11. However, for salt-call, any exception raised just results in a sys.exit(salt.defaults.exitcodes.EX_GENERIC) (wth that retcode being 1).

To represent this all in a table:

Result salt exit code salt-call exit code salt-call --retcode-passthrough exit code
Nonzero retcode set in __context__ 11 0 Whatever was set in __context__['retcode']
Exception raised 11 1 1
result or success is False 11 0 1

I feel like we should normalize this somehow. Here are my ideas:

  • --retcode-passthrough should be deprecated and salt-call's default behavior should be to return a nonzero retcode when there is a failure case
  • The same retcode should be returned in both salt and salt-call when an exception is raised by a minion function

This is somewhat complicated by the fact that salt has to set a CLI retcode based on potentially multiple minions having failed, while salt-call only has to set a CLI retcode based on one host's result. So perhaps some asymmetry is called for. I think it would be worth creating a new exit code in salt.defaults.exitcodes which represents a caught exception, and then exiting with that code in both salt and salt-call. This gets around the fact that when running salt from the master, the master can't necessarily tell that the minion raised an exception or whether that retcode was set in __context__['retcode']. Alternatively, rather than trying to manage this simply via the retcode, it may be a good idea to set an optional key in the return event payload the minion sends back to the master, so that the master knows uneqivocally that an exception was caught. After all, if we just managed this using the retcode key in the return event payload, then one could just set __context__['retcode'] to that number and it would fool the salt CLI command into thinking that an exception had been raised.

I'm less sure what to do when __context__['retcode'] is set to a nonzero value. What if there are multiple distinct retcodes that come back to the master for different minions? Which does the salt CLI command use for its exit code?

@terminalmage terminalmage requested a review from saltstack/team-core as a code owner Jun 28, 2018

@salt-jenkins salt-jenkins requested a review from saltstack/team-suse Jun 28, 2018

@cachedout

This comment has been minimized.

Copy link
Collaborator

commented Jul 2, 2018

This is great. Let me respond to few of your comments as best I can:

Which does the salt CLI command use for its exit code?

Frankly, I think we can be somewhat arbitrary about this. We could, for example, simply agree to return the highest integer. The more important notion is that it's non-zero and there's not a ton that we can do beyond that.

--retcode-passthrough should be deprecated and salt-call's default behavior should be to return a nonzero retcode when there is a failure case

I have never liked this flag. I think it's non-intuitive and I've always felt like salt-call should just "do the right thing" by default which would be to pass through the return code of the called function. That said, I'm pretty skittish about this change to behavior. We'd be talking about something that might bite a lot of people. I'd want to discuss how we'd communicate this impending change. That said, I do think that as much parity as we can achieve between salt and salt-call is a very worthy goal.

# return data is not a dict
func_result = True
if not func_result:
retcode = 1

This comment has been minimized.

Copy link
@cachedout

cachedout Jul 2, 2018

Collaborator

We've got a lot of things setting retcodes of one. Worth doing some debug logging here so at least the reason could be inspected?

This comment has been minimized.

Copy link
@terminalmage

terminalmage Jul 2, 2018

Author Contributor

Yes, that can be arranged. The 1s I added are kind of just placeholders right now though, I would prefer to eventually be using constants defined in salt.defaults.exitcodes.

if _is_exc(exc):
raise exc(*args, **salt.utils.args.clean_kwargs(**kwargs))
else:
log.error('%s is not an exception', name)

This comment has been minimized.

Copy link
@cachedout

cachedout Jul 2, 2018

Collaborator

This is not the most intuitive error message, IMHO. As a user, I wouldn't know what to do if I saw this.

This comment has been minimized.

Copy link
@terminalmage

terminalmage Jul 2, 2018

Author Contributor

It would be accompanied by a line number in the log file. This is a function designed to be used for testing, so I didn't think we needed something that was all that verbose. The idea is that if you are running this function, you're doing it for a reason. I don't think it's something that 99.99% of non-developer users are ever going to see.

What did you have in mind?

This comment has been minimized.

Copy link
@cachedout

cachedout Jul 27, 2018

Collaborator

Ah, I failed to recognize that this was in the test module. Yes, your counter-point is sound.

log.error('%s is not an exception', name)
return False
except AttributeError:
log.error('No such exception: %s', name)

This comment has been minimized.

Copy link
@cachedout

cachedout Jul 2, 2018

Collaborator

Same as above.

This comment has been minimized.

Copy link
@cachedout

cachedout Jul 27, 2018

Collaborator

Issue resolved via above comment.

@isbm
Copy link
Contributor

left a comment

@terminalmage Right direction! Since the exit codes are always very important to any Unix scripting, I would though say my few cents here:

  • salt-call should not return 1 and 0 for the same thing. If it is success, then it is 0. But then it is not, it should be something else (TBD).
  • sys.exit(11) is still very odd and we should change that. POSIX it means "Try Again" (?), Salt Exit Codes it means "Thin redeployment failure" (??). I would propose to return an actual code, or at least stick to a generic failure and success.
  • We probably need to either change exit codes normalising for every command or at least segregate one for Salt SSH and other for the rest.
  • I would concentrate of what is returned, not how (though it is also very important). Because, again, if one gets 0 for False and True, then the whole thing makes no much sense.
  • Separate "CLI error codes" from "Salt call error codes". That is, whatever mechanism is chosen to determine what kind of "False" with whatever internal error code Minion just returned in the actual result inside JSON or so, to CLI it should be POSIX way.

I am still puzzled why we are overriding POSIX codes, instead of just allocating new ones? I.e. why we are overriding 11 with "Thin wrong deployment" instead of allocating 211 for it?

@terminalmage

This comment has been minimized.

Copy link
Contributor Author

commented Jul 2, 2018

My vote would be for a generic 1 for failure in most cases

@isbm what are your thoughts for signaling an exception raised on a remote minion? I personally like the idea of having an optional key in the return event payload rather than setting a retcode key in the return payload.

Perhaps --retcode-passthrough is worth keeping as-is. That way we can keep with the standardized retcode by default, and still let people keep the retcode from the context dunder if they want.

@isbm

This comment has been minimized.

Copy link
Contributor

commented Jul 2, 2018

Since the whole thing is quite complex here, I'd break it into steps and then summarise the following:

  • Let's use salt.defaults.exitcodes but clean them up. By this, I mean remove all of those codes and shift them to the free range, leaving untouched exit codes that are overlapping with the errno module. I would say, we should simply take the range of 200-210 (in case we need more, which I really doubt, we can go upper). Also remove strange hard-coded numbers like 11, 254 etc. We also should take care that the exit codes are the same on Unix and Windows (i.e. avoid use os.EX_OK as it is not available on Windows, but let's define one in salt.defaults.exitcodes).
  • Make salt-call, salt-run, salt-ssh and salt consistent. I see no reason to have salt-call return something else than salt or salt-ssh in case there is the same type of error (although salt-call is just for one host). So all of these can be happily scripted with Bash/KSH/ZSH by admins with no pain.
  • The passthrough parameter I would actually set to ON and remove the parameter at all. That said, we would be having passthrough always ON forever. It will work for salt-call say cmd.run noexistcomand will return 127 instead of 0 (and that's exactly we always want!). As of states/SLS, one can just set the exit code or simply remove/comment it. No need to have a case, where you set an exit code in your state, but it still doesn't work, because you did not invoke the CLI with this parameter.
  • The additional key inside the event indeed feels the right way to do it: that way we know also why thread failed. So we can also then reuse it in API (which is another topic).

Open question: How do we handle exit codes between Unix and Windows?

Problem is that Windows has up to 16000 ways to crash 😃 of those and POSIX no longer applicable here. I would think of a logic inside the salt.default.exitcodes that would say "If this is Windows, then...". Something like:

EX_ACCESS_DENIED = 5 if is_windows() else 13

@twangboy this is where you are going to shine. 😉

@isbm isbm referenced this pull request Jul 4, 2018

Closed

Exit codes cleanup #48439

@damon-atkins

This comment has been minimized.

Copy link
Member

commented Jul 4, 2018

The other option is to assign an error to a bit from 0 to 255 there are 8 bits. Therefore 8 different types of errors. If two different errors occur xor them together.

1 General Error
2 No Response from some minions
4 Some states failed
8 ?
16 ?
32 ?
64 ?
128 ?

So 6 would mean some minions did not respond and some states failed. Then a person could care about States failing and not care about minion not responding in their shell script which determines to alarm or not based on the exit code.

@isbm

This comment has been minimized.

Copy link
Contributor

commented Jul 4, 2018

This is not how error codes supposed to work.

@damon-atkins

This comment has been minimized.

Copy link
Member

commented Jul 5, 2018

Not sure if 20 plus errors codes are the way to go.

We need to known for salt \* something if:

  • everything worked (0)
  • we need to know if we should try again e.g. minion did not respond. (2)
  • we need to know if something failed and we need to review the output. (4)

Anything else is a bit more details along the same lines.

Also people like to use cmd.run to get real error code results as well, which might be an exception to the rule which only makes sense for an action against a single server. Maybe a new command called salt-cmd <stdout_file_name> <stdin_file_name> <return_code_filename> <single minion only> cmd.*

@terminalmage

This comment has been minimized.

Copy link
Contributor Author

commented Jul 10, 2018

I don't think we need a large number of error codes. At most we would need 2, in my opinion. As I see it, here are two good options:

  • All failures receive the same error code

  • Failures which result from an exception get one error code, while any failures set via __context__['retcode'] (which denote a job that ran to completion but failed without raising an uncaught exception) would get a different error code.

@damon-atkins

This comment has been minimized.

Copy link
Member

commented Jul 11, 2018

I am suggesting an extra command e.g. salt-cmd as ssh returns an error code of the remote command. However the way ssh is implemented its hard to tell if ssh had the problem or remote command has the problem. Hence the
salt-cmd <stdout_file_name> <stdin_file_name> <return_code_filename> <single minion only> cmd.*
which allows the issues to be separated.

Which will give people the scripted remote execution they want, without impacting salt \* standard behaviour.

terminalmage added some commits Jun 28, 2018

Set nonzero retcode when a function fails
Criteria include:

1. __context__['retcode'] is nonzero
2. An exception is caught
3. The return data is a dict, and has either a 'result' or 'success' key
   with a False value.
Don't iterate through retcodes twice
This use a generator comprehension to stop iteration as soon as a
nonzero retcode is found.
Add return code integration tests
This tests that we set return codes properly both for salt and salt-call
Improve code re-use in retcode tests
Also add tests for checking result/success keys
@terminalmage

This comment has been minimized.

Copy link
Contributor Author

commented Jul 25, 2018

After talking with @thatch45 earlier today on this, this is what we've decided:

  • All error cases will use EX_GENERIC (i.e. 1).
  • --retcode-passthrough will be left untouched, but salt-call will still exit with 1 when there is an error condition and --retcode-passthrough is not used. Essentially, the only function that --retcode-passthrough has will be to return what is in __context__['retcode'].
@thatch45

This comment has been minimized.

Copy link
Member

commented Jul 25, 2018

My name is @thatch45 and I endorse this conclusion.
:)

terminalmage added some commits Jul 26, 2018

Use constants for nonzero exit codes in salt/modules/state.py
This will make testing more reliable

@terminalmage terminalmage force-pushed the terminalmage:return-codes branch from db41ded to 828331f Jul 26, 2018

@terminalmage

This comment has been minimized.

Copy link
Contributor Author

commented Jul 26, 2018

OK, adjustments have been made to error handling, including adjusting some of the integration tests I added and adding a couple new ones. The updated behavior should be as follows:

Result salt exit code salt-call exit code salt-call --retcode-passthrough exit code
Nonzero retcode set in __context__ 1 1 Whatever was set in __context__['retcode']
Exception raised 1 1 1
result or success is False 1 1 1
@cachedout

This comment has been minimized.

Copy link
Collaborator

commented Jul 27, 2018

IMHO we absolutely need a page in the docs dedicated to explaining this added before we merge it.

terminalmage and others added some commits Jul 31, 2018

@rallytime

This comment has been minimized.

Copy link
Contributor

commented Aug 1, 2018

@cachedout How does this look to you now?

@rallytime rallytime added the Fluorine label Aug 1, 2018

@rallytime rallytime merged commit 76426af into saltstack:develop Aug 3, 2018

7 of 9 checks passed

codeclimate 14 issues to fix
Details
jenkins/pr/py2-centos-7 running py2-centos-7...
Details
WIP ready for review
Details
continuous-integration/jenkins/pr-merge This commit looks good
Details
jenkins/pr/docs The docs job has passed
Details
jenkins/pr/lint The lint job has passed
Details
jenkins/pr/py2-ubuntu-1604 The py2-ubuntu-1604 job has passed
Details
jenkins/pr/py3-centos-7 The py3-centos-7 job has passed
Details
jenkins/pr/py3-ubuntu-1604 The py3-ubuntu-1604 job has passed
Details

@terminalmage terminalmage deleted the terminalmage:return-codes branch Aug 6, 2018

@terminalmage terminalmage referenced this pull request Nov 6, 2018

Closed

salt exit codes #18510

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.