
New unique_unsorted() function #3177

Merged: 7 commits merged into master from jsmeix-without_subsequent_duplicates on Apr 12, 2024
Conversation

@jsmeix (Member) commented Mar 8, 2024

  • Type: Enhancement

  • Impact: High
    High impact where needed.
    No impact elsewhere.

  • Reference to related issue (URL):

Triggered by my experiments in
#3175
on how 'tar' behaves when exactly the same files or directories
are provided as 'tar' arguments to be archived.

  • How was this pull request tested?

See
#3175 (comment)
But I did not yet test "rear recover".

  • Description of the changes in this pull request:

In lib/global-functions.sh added a
new unique_unsorted() function
that outputs the lines from stdin or from a file
without subsequent duplicate lines,
keeping the ordering of the lines.

In lib/global-functions.sh added a
new without_subsequent_duplicates() function
that outputs the lines from stdin or from a file
without subsequent duplicate lines,
keeping the ordering of the lines.
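
For illustration, the core of such a function is the classic awk idiom that prints each line only on its first occurrence (a minimal sketch of the idea discussed in this pull request, not necessarily the exact merged code):

# Print each input line only the first time it is seen,
# which drops all subsequent duplicates but keeps the ordering:
awk '!seen[$0]++'

# Example:
# printf '%s\n' /database / /database | awk '!seen[$0]++'
# outputs '/database' and '/' (in that order) and drops the duplicate.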
@jsmeix added the 'enhancement (Adaptions and new features)' label Mar 8, 2024
@jsmeix added this to the ReaR v2.8 milestone Mar 8, 2024
@jsmeix self-assigned this Mar 8, 2024
@pcahyna (Member) commented Mar 8, 2024

@jsmeix there is sort -u added in #3175 - isn't that enough? It should be, unless we require preserving the order - do we? To us (me and @lzaoral) it seemed that the order is quite unpredictable / random anyway.

In backup/NETFS/default/500_make_backup.sh use
without_subsequent_duplicates $TMP_DIR/backup-include.txt
to ignore duplicate arguments that are provided
to 'tar' and 'rsync' for what should be archived,
to avoid that 'tar' and 'rsync' archive
exactly the same things multiple times,
which needlessly increases backup time and,
in the case of 'tar', also the backup archive size,
storage space, and backup restore time, cf.
#3175 (comment)
@jsmeix (Member Author) commented Mar 8, 2024

I want to preserve the ordering to be on the safe side.

I have no real-world use case for why the ordering is important.

It is just my gut feeling that the ordering could be crucial
in some special cases, because for 'tar' the ordering
of the arguments does make a difference
(what is stored where in the archive).

The only thing that comes to my mind offhand right now:

# tar -czf fullbackup.tgz / /database

might behave rather differently during restore compared to

# tar -czf fullbackup.tgz /database /

(under the assumption that /database is on its own filesystem)
i.e.
first the basic system gets restored (some Gigabytes),
then the database gets restored (some Terabytes)
versus
first the database gets restored (some Terabytes),
then the basic system gets restored (some Gigabytes)

As an exercise for the reader:
For both cases estimate the probability
that restore of the basic system fails
;-)

@jsmeix (Member Author) commented Mar 8, 2024

Sigh!
I guess it is ReaR's careless code that makes
the order quite unpredictable / random
when one leaves such things to ReaR's automatism,
and that is one reason why I implemented
BACKUP_ONLY_INCLUDE and BACKUP_ONLY_EXCLUDE
to give "final power to the user".

So

BACKUP_ONLY_INCLUDE="yes"
BACKUP_PROG_INCLUDE=( / /database )

versus

BACKUP_ONLY_INCLUDE="yes"
BACKUP_PROG_INCLUDE=( /database / )

(under the assumption that /database is on its own filesystem)
should behave rather differently during restore,
but I did not test it - something to do next week.

@jsmeix (Member Author) commented Mar 8, 2024

To avoid false expectations about this pull request:
This one does not help when a filesystem/btrfs subvolume
is mounted more than once at different mountpoint directories,
because it only skips exactly identical duplicated arguments
for what 'tar' and 'rsync' should archive. So this pull request
is meant only to help when, by accident, exactly the same arguments
are provided more than once to 'tar' and 'rsync'.

I think it can never be necessary to provide the same arguments
more than once to 'tar' and 'rsync',
so I think it is (hopefully) safe to always ignore
exactly identical duplicated arguments.

@pcahyna (Member) commented Mar 8, 2024

@jsmeix I would characterize it as a way to avoid sort -u in #3175 and thus preserve the order of tar arguments in cases where the order matters (e.g. an explicit BACKUP_PROG_INCLUDE setting). Correct me if I am wrong.

@schlomo (Member) left a comment:

Good idea, I'd suggest calling the function uniq_unsorted

Maybe https://stackoverflow.com/a/20639730 provides an implementation that doesn't need the if check for a potential argument as you can call the initial cat with "$@" to handle 0 or more args.
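
A sketch of that suggestion (hypothetical; 'cat' reads stdin when it gets no filename arguments, so one pipeline covers both the file case and the stdin case):

function uniq_unsorted () {
    # With zero arguments 'cat' falls back to reading stdin,
    # with filename arguments it concatenates the named files:
    cat -- "$@" | awk '!seen[$0]++'
}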

@pcahyna (Member) commented Mar 8, 2024

Great, @schlomo, I independently had exactly the same idea for the function's name!

@jsmeix (Member Author) commented Mar 11, 2024

Thank you for the better function name!
I will use uniq_unsorted.

I was thinking about a shorter function name
that still clearly tells what that function does
but had no better idea than the long verbose name.

@jsmeix (Member Author) commented Apr 9, 2024

I am wondering if an even better function name
could be unique_unsorted, with spelled-out 'unique',
to make it clearer that this function
is something "on its own" - in particular
not a wrapper or an enhanced reimplementation
of the uniq program (with all its options)?

And in general I don't like old *nix style
crppld wrds lk 'uniq_usrtd' ;-)

Renamed the without_subsequent_duplicates() function
to unique_unsorted, see
#3177 (comment)
Call unique_unsorted instead of without_subsequent_duplicates
@jsmeix changed the title from 'New without_subsequent_duplicates() function' to 'New unique_unsorted() function' Apr 11, 2024
Let the unique_unsorted() function timeout after 5 seconds
when there is nothing at STDIN or when STDIN does not get closed
to avoid that 'awk' waits endlessly for (further) input
which makes ReaR behave as if it hung up, cf.
https://github.com/rear/rear/wiki/Coding-Style#beware-of-the-emptiness
@jsmeix (Member Author) commented Apr 11, 2024

For now I kept the if check for a filename argument
because in the STDIN case I like to timeout after 5 seconds
when there is nothing at STDIN or when STDIN does not get closed
to avoid that 'awk' waits endlessly for (further) input
which would make ReaR behave as if it hung up, cf.
https://github.com/rear/rear/wiki/Coding-Style#beware-of-the-emptiness

@schlomo
I assume all this can be implemented in a single code path
but I would need time to find out how to do that properly.
If a proper implementation in a single code path
is obvious to you, please show it to me here.
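
For reference, a sketch of the shape the function has at this point, assuming the two branches and the 5-second stdin timeout described above (the exact code in lib/global-functions.sh may differ in detail):

function unique_unsorted () {
    local filename="$1"
    if test "$filename" ; then
        test -r "$filename" || return 1
        awk '!seen[$0]++' "$filename"
    else
        # Timeout after 5 seconds when there is nothing at stdin
        # or when stdin does not get closed,
        # so 'awk' cannot wait endlessly for (further) input:
        timeout 5 awk '!seen[$0]++'
    fi
}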

@schlomo (Member) left a comment:
Very good idea and it nicely fits into the code where it is used.

I'd suggest not using a timeout as a safeguard.

# Timeout after 5 seconds when there is nothing at STDIN or when STDIN does not get closed
# to avoid that 'awk' waits endlessly for (further) input which makes ReaR behave as if it hung up
# cf. https://github.com/rear/rear/wiki/Coding-Style#beware-of-the-emptiness
timeout 5 awk '!seen[$0]++'
A Member commented:

While this works as intended, it might be better to check for unintended scenarios with tty and abort if stdin is a tty?

I'm worried about difficulties identifying bugs where ReaR just hangs for 5 seconds, which could be hard to distinguish from ReaR actually doing something.
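
A sketch of such a tty check (hypothetical; ReaR's Error function is assumed as the way to abort):

# Abort instead of silently waiting for interactive input
# when stdin is a terminal, i.e. neither a file nor a pipe was provided:
test -t 0 && Error "unique_unsorted: stdin is a terminal (no input provided)"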

@jsmeix (Member Author) commented:

What about a timeout of only one second?
This should be more than enough time for our use cases.

I copied the "timeout 5" from
init/default/100_check_stale_nfs_mounts.sh
where we Error out when things don't behave as they should.

I think I can add some explicit exit code handling
when it times out because "man timeout" tells

If the command times out,
and --preserve-status is not set,
then exit with status 124.
Otherwise, exit with the status of COMMAND.

so I think it is safe to assume that exit code 124
happens only when it times out because "man awk" tells

EXIT STATUS

If the exit statement is used with a value,
then gawk exits with the numeric value given to it.

Otherwise, if there were no problems during execution,
gawk exits with the value of the C constant EXIT_SUCCESS.
This is usually zero.

If an error occurs, gawk exits with the value of the
C constant EXIT_FAILURE. This is usually one.

If gawk exits because of a fatal error, the exit status is 2.
On non-POSIX systems, this value may be mapped to EXIT_FAILURE.

so in the awk use case here (no exit statement)
the awk exit codes can be only 0, 1, or 2.
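
A sketch of such explicit exit code handling inside the stdin branch of the function (hypothetical; it relies on the exit codes quoted above and on ReaR's Error function):

timeout 5 awk '!seen[$0]++'
rc=$?
# Without --preserve-status 'timeout' exits with 124 only on a timeout,
# while 'awk' itself (without an 'exit' statement) exits only with 0, 1, or 2:
test $rc -eq 124 && Error "unique_unsorted: timed out waiting for input at stdin"
return $rc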

@jsmeix (Member Author) commented:

Checking with 'tty' won't help against
too slow "dripping" input via a pipe from a program,
or when the producer program in a pipe has hung up
(so it does not close its stdout).

@jsmeix (Member Author) commented Apr 12, 2024

Honestly:
Since I learned about 'timeout' I like to experiment with it
to see how far it actually helps us in practice.
In particular because 'timeout' could be a generic method
to deal with all kinds of "unexpected hangup" issues, cf.
https://github.com/rear/rear/wiki/Coding-Style#it-should-be-possible-to-run-rear-unattended

A Member commented:

I understand why timeout seems like a good solution to many problems, however I'd like to caution against using it for internal stuff because it introduces non-deterministic behavior and that can create hard-to-analyze bugs.

For activities that interact with external components, e.g. network access, timeout is however a good idea to set an upper boundary on stuff that might otherwise block forever.

Reading and sorting a file seems to me to be an "internal" problem, so I wouldn't use a timeout solution for that.

Maybe another example: Every use of cat or read in ReaR carries the same risk that it would silently wait for input from STDIN, and we don't have the habit of protecting against such cases. Therefore I would also suggest to simply ignore that scenario with your sorting function and only make sure that it handles both reading from STDIN and from a file, similar to how cat behaves.

@jsmeix (Member Author) commented:

I agree.

For now I leave out checking for possible problems
when input is stdin.

Perhaps a specific check if stdin is a tty
is really the best compromise, by catching
only the one most often happening misuse case
and not overengineering things with an attempt
to find a generic solution that catches all misuse cases.
But I will add such a specific check later, after
thinking a bit more about it.

For now I merge it as is because its current state
should be good enough - at least for now.

No 'timeout 5' for 'awk' when its input is stdin
Removed the useless and misleading '|| return 1' because
the function already returns 1 when 'test -r $filename' fails,
and when awk fails it should return the awk return code,
which could be 2 when there is a fatal error in awk
@jsmeix (Member Author) commented Apr 12, 2024

@rear/contributors
if there are no objections I will merge it today
a bit after 15:00 CEST (13:00 UTC).

@schlomo (Member) commented Apr 12, 2024

Looks good to me.

Maybe silly question:

awk "..." "$@" doesn't work? It would save you the if clause

@jsmeix merged commit ec90806 into master Apr 12, 2024
21 checks passed
@jsmeix deleted the jsmeix-without_subsequent_duplicates branch April 12, 2024 13:25
@jsmeix (Member Author) commented Apr 12, 2024

@schlomo
I merged it "as is" because I had tested that at least a bit
and - at least currently - I don't find time to test a rather
different implementation.

Furthermore I would have to think about
whether a check if stdin is a tty makes sense, cf.
#3177 (comment)
and, if yes, whether it then helps to have the stdin case
well separated from the "input from file" case,
cf. RFC 1925 item (5):

It is always possible to agglutinate multiple separate problems
into a single complex interdependent solution. In most cases
this is a bad idea.

jsmeix added a commit that referenced this pull request May 14, 2024
Overhauled 400_create_include_exclude_files.sh

Now back up the mounted filesystems first, so that '/' is backed up first
and the basic system files get stored first in the backup,
and then back up what is specified in BACKUP_PROG_INCLUDE,
see #3177 (comment)
and #3217 (comment)

Report suspicious cases as LogPrintError to have the user at least informed.

Remove duplicates but keep the ordering
to avoid possibly unwanted and unexpected subtle consequences,
see #3175 (comment)

Verify that at least '/' is in backup-include.txt
see #3217

Redirect stdout into files exactly at the command where needed
instead of more global redirections,
cf. "horrible coding style"
in #3175 (comment)
jsmeix added a commit that referenced this pull request May 15, 2024
Overhauled backup/NETFS/default/400_create_include_exclude_files.sh
* Now back up the mounted filesystems first, so that '/' is backed up first
and the basic system files get stored first in the backup, and
then back up what is specified in BACKUP_PROG_INCLUDE,
see #3177 (comment)
and #3217 (comment)
* Report suspicious cases as LogPrintError to have the user at least informed.
* Remove duplicates in backup-[in/ex]clude.txt but keep the ordering
to avoid possibly unwanted and unexpected subtle consequences
see #3175 (comment)
* Verify that at least '/' is in backup-include.txt
see #3217
* Redirect stdout into files exactly at the command where needed
instead of more global redirections, cf. "horrible coding style"
in #3175 (comment)

Update backup/NETFS/default/500_make_backup.sh
* In backup/NETFS/default/500_make_backup.sh
unique_unsorted is no longer needed because
backup-include.txt is already without duplicates,
since unique_unsorted is now called in
backup/NETFS/default/400_create_include_exclude_files.sh
Labels: enhancement (Adaptions and new features), fixed / solved / done
3 participants