New unique_unsorted() function #3177
Conversation
In lib/global-functions.sh added a new without_subsequent_duplicates() function that outputs the lines from stdin or from a file without duplicate lines, keeping the original ordering of the lines.
In backup/NETFS/default/500_make_backup.sh use without_subsequent_duplicates $TMP_DIR/backup-include.txt to ignore duplicate arguments that tell 'tar' and 'rsync' what should be archived, to avoid that 'tar' and 'rsync' archive the exact same things multiple times, which needlessly increases backup time and, in the case of 'tar', also the backup archive size, storage space, and backup restore time, cf. #3175 (comment)
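The described helper might be sketched like this (a minimal sketch, assuming it is built around awk's classic '!seen[$0]++' idiom, which the diff later in this conversation shows; the exact ReaR implementation may differ in details):

```shell
# Sketch (not the exact ReaR code): print the lines from a given file,
# or from stdin when no argument is given, dropping every line that was
# already seen while keeping the original ordering of the lines.
function without_subsequent_duplicates () {
    local filename="$1"
    if test "$filename" ; then
        # A non-readable file is an error for the caller:
        test -r "$filename" || return 1
        awk '!seen[$0]++' "$filename"
    else
        awk '!seen[$0]++'
    fi
}
```

It would then be called as in 500_make_backup.sh, e.g. `without_subsequent_duplicates $TMP_DIR/backup-include.txt`.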
I want to preserve the ordering to be on the safe side. I have no real-world use-case where the ordering is important; it is just my gut feeling that it matters in some special cases. The only example that comes to my mind offhandedly right now: one ordering of the arguments might behave rather differently during restore compared to another ordering (under the assumption that /database is on its own filesystem). As an exercise for the reader.
Sigh! So one variant versus the other (under the assumption that /database is on its own filesystem).
To avoid false expectations about this pull request: I think it can never be needed that the same arguments appear more than once.
Good idea, I'd suggest calling the function uniq_unsorted.
Maybe https://stackoverflow.com/a/20639730 provides an implementation that doesn't need the if check for a potential argument, as you can call the initial cat with "$@" to handle 0 or more args.
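The suggested variant might look like this (a hypothetical sketch following the linked Stack Overflow answer, not code from this pull request):

```shell
# Hypothetical variant: 'cat "$@"' outputs the given files, or copies
# stdin when no arguments are given, so no explicit 'if' check for a
# filename argument is needed.
function uniq_unsorted () {
    cat "$@" | awk '!seen[$0]++'
}
```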
Great @schlomo, I had independently exactly the same idea for the function's name!
Thank you for the better function name! I was thinking about a shorter function name.
I am wondering if there is an even better function name. And in general I don't like old *nix style names.
renamed without_subsequent_duplicates() function into unique_unsorted, see #3177 (comment)
Call unique_unsorted instead of without_subsequent_duplicates
Let the unique_unsorted() function time out after 5 seconds when there is nothing at stdin or when stdin does not get closed, to avoid that 'awk' waits endlessly for (further) input, which makes ReaR behave as if it hung up, cf. https://github.com/rear/rear/wiki/Coding-Style#beware-of-the-emptiness
For now I kept the if check for a filename argument, @schlomo.
Very good idea and it nicely fits into the code where it is used.
I'd suggest not using a timeout as a safeguard.
# Timeout after 5 seconds when there is nothing at STDIN or when STDIN does not get closed
# to avoid that 'awk' waits endlessly for (further) input which makes ReaR behave as if it hung up
# cf. https://github.com/rear/rear/wiki/Coding-Style#beware-of-the-emptiness
timeout 5 awk '!seen[$0]++'
What about a timeout of only one second?
This should be more than enough time for our use cases.
I copied the "timeout 5" from
init/default/100_check_stale_nfs_mounts.sh
where we Error out when things don't behave as they should.
I think I can add some explicit exit code handling when it times out, because "man timeout" says:
If the command times out,
and --preserve-status is not set,
then exit with status 124.
Otherwise, exit with the status of COMMAND.
so I think it is safe to assume that exit code 124 happens only when it times out, because "man awk" says:
EXIT STATUS
If the exit statement is used with a value,
then gawk exits with the numeric value given to it.
Otherwise, if there were no problems during execution,
gawk exits with the value of the C constant EXIT_SUCCESS.
This is usually zero.
If an error occurs, gawk exits with the value of the
C constant EXIT_FAILURE. This is usually one.
If gawk exits because of a fatal error, the exit status is 2.
On non-POSIX systems, this value may be mapped to EXIT_FAILURE.
so in the awk use case here (no exit statement)
the awk exit codes can only be 0, 1, or 2.
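The explicit exit code handling described here could be sketched as follows (an assumed sketch, not the merged code): without --preserve-status, 'timeout' exits with 124 only when the command timed out, and awk without an 'exit' statement only exits with 0, 1, or 2, so 124 unambiguously signals the timeout.

```shell
# Sketch: dedup stdin (or the given file) with a 5 second timeout and
# report the timeout case explicitly; any other nonzero code is from awk.
function unique_unsorted_with_timeout () {
    timeout 5 awk '!seen[$0]++' "$@"
    local rc=$?
    if test $rc -eq 124 ; then
        # 124 can only come from 'timeout' itself here:
        echo "unique_unsorted: 'awk' timed out waiting for input" >&2
    fi
    return $rc
}
```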
Checking with 'tty' won't help against too slowly "dripping" input via a pipe from a program, or when the producer program in a pipe has hung up (so it does not close its stdout).
Honestly: since I learned about 'timeout' I like to experiment with how far it actually helps us in practice. In particular because 'timeout' could be a generic method to deal with all kinds of "unexpected hangup" issues, cf.
https://github.com/rear/rear/wiki/Coding-Style#it-should-be-possible-to-run-rear-unattended
I understand why timeout seems like a good solution to many problems, however I'd like to caution against using it for internal stuff because it introduces non-deterministic behavior and that can create hard-to-analyze bugs.
For activities that interact with external components, e.g. network access, timeout is however a good idea to set an upper boundary for stuff that might otherwise block forever.
Reading and sorting a file seems to me to be an "internal" problem, so I wouldn't use a timeout solution for that.
Maybe another example: every use of cat or read in ReaR carries the same risk that it would silently wait for input from STDIN, and we don't have the habit of protecting against such cases. Therefore I would also suggest to simply ignore that scenario with your sorting function and only make sure that it handles both reading from STDIN and from a file, similar to how cat behaves.
I agree. For now I leave out checking for possible problems when the input is stdin.
Perhaps a specific check whether stdin is a tty is really the best compromise: it catches only the one most often happening misuse case without overengineering things in an attempt to find a generic solution that catches all misuse cases. But I will add such a specific check later, after thinking a bit more about it.
For now I merge it as is because its current state should be good enough, at least for now.
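The "specific check" mentioned here might look like this (a hypothetical sketch, not code from this pull request): bash's 'test -t 0' is true when file descriptor 0 (stdin) is a terminal, i.e. when the function is called interactively with nothing piped in, which is the one misuse case where awk would wait forever for typed input.

```shell
# Hypothetical stdin-is-a-terminal guard:
function unique_unsorted_guarded () {
    if test -t 0 ; then
        echo "unique_unsorted: stdin is a terminal, nothing to read" >&2
        return 1
    fi
    awk '!seen[$0]++'
}
```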
No 'timeout 5' for 'awk' when its input is stdin
Removed the useless and misleading '|| return 1': the function already returns 1 when 'test -r $filename' fails, and when awk fails it should return the awk return code, which could be 2 when there is a fatal error in awk.
@rear/contributors
Looks good to me. Maybe a silly question:
@schlomo Furthermore I would have to think about it.
Overhauled backup/NETFS/default/400_create_include_exclude_files.sh
* Now first back up the mounted filesystems, so '/' is backed up first and the basic system files get stored first in the backup, and then back up what is specified in BACKUP_PROG_INCLUDE, see #3177 (comment) and #3217 (comment)
* Report suspicious cases as LogPrintError to have the user at least informed.
* Remove duplicates in backup-[in/ex]clude.txt but keep the ordering, to avoid possibly unwanted and unexpected subtle consequences, see #3175 (comment)
* Verify that at least '/' is in backup-include.txt, see #3217
* Redirect stdout into files exactly at the command where needed instead of more global redirections, cf. "horrible coding style" in #3175 (comment)
Update backup/NETFS/default/500_make_backup.sh
* unique_unsorted is no longer needed there because backup-include.txt is already without duplicates, since unique_unsorted is now called in backup/NETFS/default/400_create_include_exclude_files.sh
Type: Enhancement
Impact: High
High impact where needed, no impact elsewhere.
Reference to related issue (URL):
Triggered by my experiments in
#3175
on how 'tar' behaves when the exact same files or directories
are provided as 'tar' arguments to be archived.
See
#3175 (comment)
But I did not yet test "rear recover".
In lib/global-functions.sh added a
new unique_unsorted() function
that outputs the lines from stdin or from a file
without duplicate lines,
keeping the ordering of the lines.