Join GitHub today
Exit codes #564
We are running "/usr/sbin/rear checklayout || /usr/sbin/rear mkrescue" from cron.
If EXIT_CODE gets set to a non-zero value anywhere in the code, /usr/sbin/rear writes "WARNING rc=$EXIT_CODE" to syslog. That triggers a logcheck script to notify our monitoring system - so someone can look into this.
However, if there is a change in the layout, EXIT_CODE is set to 1, in order to trigger mkrescue to run.
Unfortunately that also result in a WARNING in syslog, which creates a false alarm.
Any suggestions, on how to "fix" this?
EXIT_CODE=1 must be kept for the cron entry so that mkrescue is getting triggered.
I think in general the exit code value '1' is used for "anything wrong"
In rear-1.17.0 in usr/sbin/rear I don't see a reason why the EXIT_CODE value for cron cannot be something else.
In particular in usr/sbin/rear it seems "EXIT_CODE=1" is already used for "ERROR: The specified command '$WORKFLOW' does not exist !"
If my understanding is right I would like to suggest something like:
EXIT_CODE=0 : no issue found
EXIT_CODE=1 : default value for "anything wrong"
EXIT_CODE=2 ... EXIT_CODE=9 : reserved for specific fatal errors.
EXIT_CODE>9 : reserved to indicate a specific outcome
in particular currently used:
EXIT_CODE=11 : layout changed
For background info what I have in mind see the exit codes of CUPS backends and see CUPS filters how messages are prefixed there, see
I am actually against using exit codes as an official interface. My favourite bad example are the rsync exit codes Where 24 means that all is fine but some source files disappeared between scanning and transferring. The result is that whereever I use rsync in a script I must check the exit code and add code to handle 24 like 0.
Therefore I suggest to use exit codes > 0 only as a signal to tell the outside world that we could not do our job as expected. More advanced stuff should be done via an explicit interface so that you can write something like this
IMHO that is much more legible than for example
For the case of creating a cron job to run ReaR if the layout changed I would always opt to add stuff to rear to keep the conditionals out of the cron job. Within ReaR we have much more detail and context infos available to make decisions.
In this specific example I would rather extend ReaR to support something like
so that you can rely on ReaR to do the right thing. And if the exit code is >0 then you know for sure that there was a problem.
when exit codes are used as an interface I agree that the ranges of error exit codes and non-error exit codes must be strictly separated.
If I understand you correctly you propose some special stdout (or perhaps also stderr) messages as an official interface.
Currently I don't have an opinion if this is good or bad.
But couldn't that conflict with other output when e.g. "rear -v" is used?
Only a quick idea:
What about using specific prefixed lines on stdout as official interface like:
$ rear --very-verbose something Hello, my name is rear! How are you? INFO: I do now 'something' for you: DEBUG: Init 'something' Hello, my name is something! DEBUG: Cheking whether or not need-new-rescue-image INFO: Found new device /dev/sdX mounted as /dev/sdX9 at /foo disk layout has changed => need-new-rescue-image INTERFACE: need-new-rescue-image I am fine! ERROR: failed to read /etc/fstab Oops! I am no longer fine! I abort now... $ echo $? 1
Do I understand your "defining call backs" correctly that this is another different interface how rear could communicate with "the outer world"
If I understand it correctly there are now three proposals for an interface how rear could communicate with "the outer world":
I think this particular issue has revealed a general issue in rear that currently it is not yet well defined how rear would communicate with "the outer world".
Perhaps a new separated issue should be created for the general issue how rear would communicate with "the outer world" and this issue here is only about how rear could tell when the disk layout has changed (or when in general a new rescue image is needed, e.g. also when kernel drivers have changed).
You are right. This is a topic which should be discussed in greater detail. What I meant is that if to go beyond a simple single reply on stdout then I would rather opt for a call back system.
So far we use echo and read to communicate with the user. I think that this should be moved to a library that could then be customized to use a call back.
To go back to the original issue as opened by @clausa, I think that it would be enough to simply add a
Perhaps I am overanxious and I do not have sufficient overview knowledge of the whole rear but I fear such a feature could become a nightmare to implement and to keep it working correctly.
I think "rear mkrescue --only-if-changed" promises the user that he will get a new rear recovery system image when anything has changed that requires a new recovery system.
This means when any such change is by accident not recognized and later recovery fails because the recovery system is outdated, then the user rightfully complains that rear is guilty.
Without such a feature it is left to the user to decide if he needs a new recovery system (i.e. in case of doubt he would just make it anew).
In contrast a specific feature like "rear mkrescue --if-disklayout-changed" is perfectly o.k. from my point of view because it can be documented that this means any change in the disklayout.conf file (nothing more, nothing less).
When several such specific "rear mkrescue --if-this-changed", "rear mkrescue --if-that-changed" features are implemented, a general "rear mkrescue --if-changed" feature could be just a simpler way to call "rear mkrescue --if-this-changed --if-that-changed" but this would not mean the user gets a new recovery system image when anything has changed that requires a new recovery system.
So your concern is mostly about the wording? No problem.
In any case, if I would implement this then I would not change any existing behaviour but extend mkrescue (and possibly mkbackup) to drop out quickly if nothing relevant changed.
In any case there would be some remaining risk involved. Imagine a server beeing set up and getting OS updated over 5 years. In many cases the disk layout would not change in those 5 years. But the ReaR version and the kernel would change.
To successfully recover this system on new hardware you would really need the recent ReaR version and OS kernel from all those updates. But since you never created a new Rescue Image you will try to recover your system with a Linux and ReaR that is 5 years old.
I actually believe that this example covers a majority of use cases in a data center.
Coming back to the original issue of @clausa I can think of the following:
Looking at 7820c99 where that behaviour was added I see a reference to OVO which was apparently used somewhere to collect infos about ReaR. @gdha, could you please try to remember what was the background there? Is it really necessary to log any exit code != 0 to syslog as a warning? Or maybe it would be enough to log the actual Errors?
Bottom line, after looking at everything I start to believe that it would be more "logical" that an exit code != 0 should be logged to syslog as info and not as warning. After all, it can be part of regular operations.
@schlomo Indeed it was added on request of an important customer (who actually pay for writing code) for picking it up via OVO as there are thousands of Linux systems and nobody seemed to know how many failures there where on a weekly basis. So, what @clausa does not want to see, is not what my customer wanted... To be honest I believe it is good practice if you see a warning to always do a quick check of the log file. Better one check too many then a failed recovery afterwards...
referenced this issue
Mar 25, 2015
@gdha, are you 100% sure that your OVO customer actually needs the WARNING line in syslog for all cases where the exit code was != 0? Or do they more care about catching the ERROR lines which are generated by our Error function?
Could you talk to the OVO customer about the cases where a non-0 exit code is actually the desired behaviour? Maybe now they have some filters in their monitoring to remove those "false positives" and maybe they could also benefit from improving this topic?
@gdha That sounds great! IMHO we should drop the WARNING line altogether. A WARNING is in my experience totally useless. Either I treat it as an error and live with a lot of false positives or I ignore it and only care about errors.
I would rather like to improve ReaR to throw an Error if we think that a human should check and not bother humans in all other cases.
# /bin/showsomething --output-mode=color WARNING: output-mode=color unsupported on this terminal, using fallback output-mode=ASCII
in contrast to using the fallback silently or abort.
Perhaps for rear it could be a special case that there is nothing in between success and failure for a disaster recovery tool?
In my experience the wish for "warnings" is actually an expression of not trusting the software. As a software program, emitting a "warning" means that you want to shed the responsibility to actually decide if this is an error or not.
But the only thing you achieve as a program is to make your human waste time on false positives.
That is why I am against warnings. A program should call my attention to something only if I must take action as a result.
Your example with the color mode is perfectly showing my point: There is actually no action the user can take right now, the user can only accept it as it is. The only real "action" the user could possibly take is to use the program in a different way (from a color terminal). But since the user happens to use the program from an ASCII terminal, he probably has a reason for doing so and the warning does not help at all.
Therefore in your example the warning is a false positive because there is nothing one can do about it. I would really prefer your showsomething program to either issue an info message or - even better - to just shut up about this. If I miss the color output I will start myself to look into it. For that purpose a verbose or debug message would be much better suited.
Or put different: Warnings are only useful for people who want to run their environment manually. For those who build automation all kind of warnings are a curse.
Many thanks for your great explanation!
In particular your "emitting a 'warning' means that you want to shed the responsibility to actually decide if this is an error or not" describes exactly what at least my motivation as programmer is in such cases - as I wrote "uncertain feeling".
Your "issue an info message" is also an interesting point that makes things very clear.
The issue is not that a program must run either silently successfully or it aborts with an ERROR message, the issue is only that WARNING messages are useless.
If I understand it correctly the following would be o.k.:
# /bin/showsomething --output-mode=color INFO: output-mode=color unsupported on this terminal, using fallback output-mode=ASCII # echo $? 0 # /bin/showsomething --output-mode=color 2>/dev/null # echo $? 0 # /bin/showsomething --debug --output-mode=color DEBUG: TERM=ibmmono INFO: output-mode=color unsupported on this terminal, using fallback output-mode=ASCII # echo $? 0
added a commit
Apr 7, 2015
referenced this issue
Aug 4, 2016
Thanks for the heads-up. I'm pretty sure this change will bring the issue raised in this ticket. (as "failed" in logfiles will also raise an alarm in our systems).
I might be wrong, but IIRC the issues is merely related to how non-zero exit codes are being used. A change in the disklayout is IMHO not a failure, but it will be logged as "rear checklayout failed with exit code 1" as RC has been set to 1 in order to notify/trigger "mkrescue"
As far as I see the root cause that
Therefore one cannot distinguish between a real failure
With the new syslog messages you see at least the workflow
referenced this issue
Dec 20, 2016
rear: 12759: rear checklayout finished with zero exit code
versus (when the checklayout workflow exits with exit code 1)
rear: 10888: rear checklayout finished with exit code 1 (layout or config changed)
versus (when checklayout exits with another non-zero exit code)
rear: 11473: rear checklayout failed with exit code 2