
New unique_unsorted() function #3177

Merged: 7 commits merged into master from jsmeix-without_subsequent_duplicates on Apr 12, 2024
Conversation

@jsmeix (Member) commented Mar 8, 2024

  • Type: Enhancement

  • Impact: High
    High impact where needed.
    No impact elsewhere.

  • Reference to related issue (URL):

Triggered by my experiments in
#3175
on how 'tar' behaves when exactly the same files or directories
are provided as 'tar' arguments to be archived.

  • How was this pull request tested?

See
#3175 (comment)
But I did not yet test "rear recover".

  • Description of the changes in this pull request:

In lib/global-functions.sh added a
new unique_unsorted() function
that outputs the lines from stdin or from a file
without subsequent duplicate lines,
keeping the ordering of the lines.

In lib/global-functions.sh added a
new without_subsequent_duplicates() function
that outputs the lines from stdin or from a file
without subsequent duplicate lines,
keeping the ordering of the lines.
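
For illustration, the core of such a function is the classic awk idiom that prints each line only on its first occurrence (a minimal sketch of the idea discussed in this pull request, not necessarily the exact merged code):

# Print each input line only the first time it is seen,
# which drops all subsequent duplicates but keeps the ordering:
awk '!seen[$0]++'

# Example:
# printf '%s\n' /database / /database | awk '!seen[$0]++'
# outputs '/database' and '/' (in that order) and drops the duplicate.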
@jsmeix added the 'enhancement (Adaptions and new features)' label Mar 8, 2024
@jsmeix added this to the ReaR v2.8 milestone Mar 8, 2024
@jsmeix self-assigned this Mar 8, 2024
@pcahyna (Member) commented Mar 8, 2024

@jsmeix there is sort -u added in #3175 - isn't that enough? It should be, unless we require preserving the order - do we? To us (me and @lzaoral) it seemed that the order is quite unpredictable / random anyway.

In backup/NETFS/default/500_make_backup.sh use
without_subsequent_duplicates $TMP_DIR/backup-include.txt
to ignore duplicate arguments that are provided
to 'tar' and 'rsync' for what should be archived,
to avoid that 'tar' and 'rsync' archive
exactly the same things multiple times,
which needlessly increases backup time and,
in the case of 'tar', also the backup archive size,
storage space, and backup restore time, cf.
#3175 (comment)
@jsmeix (Member Author) commented Mar 8, 2024

I want to preserve the ordering to be on the safe side.

I have no real-world use case for why the ordering is important.

It is just my gut feeling that the ordering could be crucial
in some special cases, because for 'tar' the ordering
of the arguments does make a difference
(what is stored where in the archive).

The only thing that comes to my mind offhand right now:

# tar -czf fullbackup.tgz / /database

might behave rather differently during restore compared to

# tar -czf fullbackup.tgz /database /

(under the assumption that /database is on its own filesystem)
i.e.
first the basic system gets restored (some Gigabytes),
then the database gets restored (some Terabytes)
versus
first the database gets restored (some Terabytes),
then the basic system gets restored (some Gigabytes)

As an exercise for the reader:
For both cases estimate the probability
that restore of the basic system fails
;-)

@jsmeix (Member Author) commented Mar 8, 2024

Sigh!
I guess it is ReaR's careless code that makes
the order quite unpredictable / random
when one leaves such things to ReaR's automatism,
and that is one reason why I implemented
BACKUP_ONLY_INCLUDE and BACKUP_ONLY_EXCLUDE
to give "final power to the user".

So

BACKUP_ONLY_INCLUDE="yes"
BACKUP_PROG_INCLUDE=( / /database )

versus

BACKUP_ONLY_INCLUDE="yes"
BACKUP_PROG_INCLUDE=( /database / )

(under the assumption that /database is on its own filesystem)
should behave rather differently during restore,
but I did not test it - something to do next week.

@jsmeix (Member Author) commented Mar 8, 2024

To avoid false expectations about this pull request:
This one does not help when a filesystem/btrfs subvolume
is mounted more than once at different mountpoint directories,
because it only skips exactly identical duplicated arguments
for what 'tar' and 'rsync' should archive. So this pull request
is meant only to help when, by accident, exactly the same arguments
are provided more than once to 'tar' and 'rsync'.

I think it can never be necessary to provide the same arguments
more than once to 'tar' and 'rsync',
so I think it is (hopefully) safe to always ignore
exactly identical duplicated arguments.

@pcahyna (Member) commented Mar 8, 2024

@jsmeix I would characterize it as a way to avoid sort -u in #3175 and thus preserve the order of tar arguments in cases where the order matters (e.g. an explicit BACKUP_PROG_INCLUDE setting). Correct me if I am wrong.

@schlomo (Member) left a comment:

Good idea, I'd suggest calling the function uniq_unsorted

Maybe https://stackoverflow.com/a/20639730 provides an implementation that doesn't need the if check for a potential argument as you can call the initial cat with "$@" to handle 0 or more args.
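
A sketch of that suggestion (hypothetical; 'cat' reads stdin when it gets no filename arguments, so one pipeline covers both the file case and the stdin case):

function uniq_unsorted () {
    # With zero arguments 'cat' falls back to reading stdin,
    # with filename arguments it concatenates the named files:
    cat -- "$@" | awk '!seen[$0]++'
}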

@pcahyna (Member) commented Mar 8, 2024

Great, @schlomo, I independently had exactly the same idea for the function's name!

@jsmeix (Member Author) commented Mar 11, 2024

Thank you for the better function name!
I will use uniq_unsorted.

I was thinking about a shorter function name
that still clearly tells what that function does
but had no better idea than the long verbose name.

@jsmeix (Member Author) commented Apr 9, 2024

I am wondering if an even better function name
could be unique_unsorted, with spelled-out 'unique',
to make it clearer that this function
is something "on its own" - in particular
not a wrapper or an enhanced reimplementation
of the uniq program (with all its options)?

And in general I don't like old *nix style
crppld wrds lk 'uniq_usrtd' ;-)

Renamed the without_subsequent_duplicates() function
to unique_unsorted, see
#3177 (comment)
Call unique_unsorted instead of without_subsequent_duplicates
@jsmeix changed the title from 'New without_subsequent_duplicates() function' to 'New unique_unsorted() function' Apr 11, 2024
Let the unique_unsorted() function timeout after 5 seconds
when there is nothing at STDIN or when STDIN does not get closed
to avoid that 'awk' waits endlessly for (further) input
which makes ReaR behave as if it hung up, cf.
https://github.com/rear/rear/wiki/Coding-Style#beware-of-the-emptiness
@jsmeix (Member Author) commented Apr 11, 2024

For now I kept the if check for a filename argument
because in the STDIN case I like to timeout after 5 seconds
when there is nothing at STDIN or when STDIN does not get closed
to avoid that 'awk' waits endlessly for (further) input
which would make ReaR behave as if it hung up, cf.
https://github.com/rear/rear/wiki/Coding-Style#beware-of-the-emptiness

@schlomo
I assume all this can be implemented in a single code path
but I would need time to find out how to do that properly.
If a proper implementation in a single code path
is obvious to you, please show it to me here.
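
For reference, a sketch of the shape the function has at this point, assuming the two branches and the 5-second stdin timeout described above (the exact code in lib/global-functions.sh may differ in detail):

function unique_unsorted () {
    local filename="$1"
    if test "$filename" ; then
        test -r "$filename" || return 1
        awk '!seen[$0]++' "$filename"
    else
        # Timeout after 5 seconds when there is nothing at stdin
        # or when stdin does not get closed,
        # so 'awk' cannot wait endlessly for (further) input:
        timeout 5 awk '!seen[$0]++'
    fi
}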

@schlomo (Member) left a comment:
Very good idea and it nicely fits into the code where it is used.

I'd suggest not using a timeout as a safeguard.

# Timeout after 5 seconds when there is nothing at STDIN or when STDIN does not get closed
# to avoid that 'awk' waits endlessly for (further) input which makes ReaR behave as if it hung up
# cf. https://github.com/rear/rear/wiki/Coding-Style#beware-of-the-emptiness
timeout 5 awk '!seen[$0]++'
A Member commented:

While this works as intended, it might be better to check for unintended scenarios with tty and abort if stdin is a tty?

I'm worried about difficulties identifying bugs where ReaR just hangs for 5 seconds, which could be hard to distinguish from ReaR actually doing something.
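
A sketch of such a tty check (hypothetical; ReaR's Error function is assumed as the way to abort):

# Abort instead of silently waiting for interactive input
# when stdin is a terminal, i.e. neither a file nor a pipe was provided:
test -t 0 && Error "unique_unsorted: stdin is a terminal (no input provided)"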

@jsmeix (Member Author) commented:

What about a timeout of only one second?
This should be more than enough time for our use cases.

I copied the "timeout 5" from
init/default/100_check_stale_nfs_mounts.sh
where we Error out when things don't behave as they should.

I think I can add some explicit exit code handling
when it times out because "man timeout" tells

If the command times out,
and --preserve-status is not set,
then exit with status 124.
Otherwise, exit with the status of COMMAND.

so I think it is safe to assume that exit code 124
happens only when it times out because "man awk" tells

EXIT STATUS

If the exit statement is used with a value,
then gawk exits with the numeric value given to it.

Otherwise, if there were no problems during execution,
gawk exits with the value of the C constant EXIT_SUCCESS.
This is usually zero.

If an error occurs, gawk exits with the value of the
C constant EXIT_FAILURE. This is usually one.

If gawk exits because of a fatal error, the exit status is 2.
On non-POSIX systems, this value may be mapped to EXIT_FAILURE.

so in the awk use case here (no exit statement)
the awk exit codes can be only 0, 1, or 2.
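
A sketch of such explicit exit code handling inside the stdin branch of the function (hypothetical; it relies on the exit codes quoted above and on ReaR's Error function):

timeout 5 awk '!seen[$0]++'
rc=$?
# Without --preserve-status 'timeout' exits with 124 only on a timeout,
# while 'awk' itself (without an 'exit' statement) exits only with 0, 1, or 2:
test $rc -eq 124 && Error "unique_unsorted: timed out waiting for input at stdin"
return $rc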

@jsmeix (Member Author) commented:

Checking with 'tty' won't help against
too slow "dripping" input via a pipe from a program,
or when the producer program in a pipe has hung up
(so it does not close its stdout).

@jsmeix (Member Author) commented Apr 12, 2024

Honestly:
Since I learned about 'timeout' I like to experiment with it
to see how far it actually helps us in practice.
In particular because 'timeout' could be a generic method
to deal with all kinds of "unexpected hangup" issues, cf.
https://github.com/rear/rear/wiki/Coding-Style#it-should-be-possible-to-run-rear-unattended

A Member commented:

I understand why timeout seems like a good solution to many problems, however I'd like to caution against using it for internal stuff because it introduces non-deterministic behavior and that can create hard-to-analyze bugs.

For activities that interact with external components, e.g. network access, timeout is however a good idea to set an upper boundary on stuff that might otherwise block forever.

Reading and sorting a file seems to me to be an "internal" problem, so I wouldn't use a timeout solution for that.

Maybe another example: Every use of cat or read in ReaR carries the same risk that it would silently wait for input from STDIN, and we don't have the habit of protecting against such cases. Therefore I would also suggest to simply ignore that scenario with your sorting function and only make sure that it handles both reading from STDIN and from a file, similar to how cat behaves.

@jsmeix (Member Author) commented:

I agree.

For now I leave out checking for possible problems
when input is stdin.

Perhaps a specific check if stdin is a tty
is really the best compromise, by catching
only the one most often happening misuse case
and not overengineering things with an attempt
to find a generic solution that catches all misuse cases.
But I will add such a specific check later, after
thinking a bit more about it.

For now I merge it as is because its current state
should be good enough - at least for now.

No 'timeout 5' for 'awk' when its input is stdin
Removed the useless and misleading '|| return 1' because
the function already returns 1 when 'test -r $filename' fails,
and when awk fails it should return the awk return code,
which could be 2 when there is a fatal error in awk
@jsmeix (Member Author) commented Apr 12, 2024

@rear/contributors
if there are no objections I will merge it today
a bit after 15:00 CEST (13:00 UTC).

@schlomo (Member) commented Apr 12, 2024

Looks good to me.

Maybe silly question:

awk "..." "$@" doesn't work? It would save you the if clause

@jsmeix merged commit ec90806 into master Apr 12, 2024
21 checks passed
@jsmeix deleted the jsmeix-without_subsequent_duplicates branch April 12, 2024 13:25
@jsmeix (Member Author) commented Apr 12, 2024

@schlomo
I merged it "as is" because I had tested that at least a bit
and - at least currently - I don't find time to test a rather
different implementation.

Furthermore I would have to think about
whether a check if stdin is a tty makes sense, cf.
#3177 (comment)
and, if yes, whether it then helps to have the stdin case
well separated from the "input from file" case,
cf. RFC 1925 item (5):

It is always possible to agglutinate multiple separate problems
into a single complex interdependent solution. In most cases
this is a bad idea.

jsmeix added a commit that referenced this pull request May 14, 2024
Overhauled 400_create_include_exclude_files.sh

Now back up the mounted filesystems first, so that '/' is backed up first
and the basic system files get stored first in the backup,
and then back up what is specified in BACKUP_PROG_INCLUDE,
see #3177 (comment)
and #3217 (comment)

Report suspicious cases as LogPrintError to have the user at least informed.

Remove duplicates but keep the ordering
to avoid possibly unwanted and unexpected subtle consequences,
see #3175 (comment)

Verify that at least '/' is in backup-include.txt
see #3217

Redirect stdout into files exactly at the command where needed
instead of more global redirections,
cf. "horrible coding style"
in #3175 (comment)
jsmeix added a commit that referenced this pull request May 15, 2024
Overhauled backup/NETFS/default/400_create_include_exclude_files.sh
* Now back up the mounted filesystems first, so that '/' is backed up first
and the basic system files get stored first in the backup, and
then back up what is specified in BACKUP_PROG_INCLUDE,
see #3177 (comment)
and #3217 (comment)
* Report suspicious cases as LogPrintError to have the user at least informed.
* Remove duplicates in backup-[in/ex]clude.txt but keep the ordering
to avoid possibly unwanted and unexpected subtle consequences
see #3175 (comment)
* Verify that at least '/' is in backup-include.txt
see #3217
* Redirect stdout into files exactly at the command where needed
instead of more global redirections, cf. "horrible coding style"
in #3175 (comment)

Update backup/NETFS/default/500_make_backup.sh
* In backup/NETFS/default/500_make_backup.sh
unique_unsorted is no longer needed because
backup-include.txt is already without duplicates,
since unique_unsorted is now called in
backup/NETFS/default/400_create_include_exclude_files.sh
Labels: enhancement (Adaptions and new features), fixed / solved / done
3 participants