Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix multiple vg recreation #2691

Merged
merged 5 commits into from Oct 14, 2021
Merged

Conversation

pcahyna
Copy link
Member

@pcahyna pcahyna commented Oct 4, 2021

Pull Request Details:
  • Type: Bug Fix

  • Impact: Normal

  • Reference to related issue (URL): rmetrich@b184194#r56498696 rmetrich@311bfb3#r56498755

  • How was this pull request tested?
    Backing up and recreating a system with two VGs, one contains thin pools, the other does not.

  • Brief description of the changes in this pull request:
    Fix a problem introduced in commits b184194 and 311bfb3 (PR Issue #1380 - ReaR recovery fails when the OS contains a Thin Pool/Volume #1806) that shows up when there are multiple VGs to restore.
    Using variables create_thin_volumes_only and create_logical_volumes to propagate infromation from VG creation to LV creation does not work well in the case of multiple VGs, because the variables are global and if there are multiple VGs, their values will leak from one VG to another. The generated diskrestore.sh script does not guarantee that the LVs of a given VG are created immediately after their VG and before creating another VG. Currently, the script first creates all VGs and then all LVs, so all the LVs in all VGs will see the value of create_logical_volumes and create_thin_volumes_only from the last VG, not from their own. This matters when different VGs behave differently (typically if one has a thin pool and the other does not).
    Fix by replacing the scalar values by arrays of VG names. If a given VG is in the array, it is the equivalent of the former scalar value being 1 for the given VG, if it is not in the array, it is an equivalent of a former value of 0.
    For the create_volume_group variable the change is not needed, but do it nevertheless for symmetry with other variables.

Fix a problem introduced in commits
b184194 and
311bfb3 (PR rear#1806) that shows up when
there are multiple VGs to restore.

Using variables create_thin_volumes_only and create_logical_volumes to
propagate information from VG creation to LV creation does not work well
in the case of multiple VGs, because the variables are global and if
there are multiple VGs, their values will leak from one VG to another.
The generated diskrestore.sh script does not guarantee that the LVs of a
given VG are created immediately after their VG and before creating
another VG. Currently, the script first creates all VGs and then all LVs,
so all the LVs in all VGs will see the value of create_logical_volumes
and create_thin_volumes_only from the last VG, not from their own. This
matters when different VGs behave differently (typically if one has a
thin pool and the other does not).

Fix by replacing the scalar values by arrays of VG names. If a given VG
is in the array, it is the equivalent of the former scalar value being 1
for the given VG, if it is not in the array, it is an equivalent of a
former value of 0.

For the create_volume_group variable the change is not needed, but do it
nevertheless for symmetry with other variables.
@pcahyna pcahyna requested a review from rmetrich October 4, 2021 17:50
pcahyna referenced this pull request in rmetrich/rear Oct 4, 2021
…l/Volume

Using 'vgcfgrestore' to restore Thin LVs doesn't work, because there
isn't the metadata in the configuration file.
To be able to use Thin LVs, upon failure of 'vgcfgrestore' (or migration),
traditional 'vgcreate/lvcreate' are used instead.

This approach is sub-optimal because it requires to know all the
possible options to 'lvcreate', which is hard/impossible to do.
Nonetheless, this is better than nothing since the issue with 'lvcreate'
options was already there in case of migration.

The code using 'lvcreate' only runs upon 'vgcfgrestore' failure, which
typically happens only with Thin LVs (IMHO). Hence, on most setups, the
optimized code will still run, preventing any regression to happen.

Signed-off-by: Renaud Métrich <rmetrich@redhat.com>
@pcahyna
Copy link
Member Author

pcahyna commented Oct 6, 2021

Hello @rmetrich, you introduced the original code, could you please have a look at the change?

Copy link
Contributor

@rmetrich rmetrich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good.

@@ -30,15 +30,6 @@ create_disk() {
[[ -b "$disk" ]]
BugIfError "Disk $disk is not a block device."
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could you (while you are at it) by the way replace all *IfError function calls
by plain normal bash code, see my comment at the *IfError functions about
"Using the ...IfError functions can result unexpected behaviour in certain cases"
in usr/share/rear/lib/_input-output-functions.sh that is currently at
https://github.com/rear/rear/blob/master/usr/share/rear/lib/_input-output-functions.sh#L773

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jsmeix done. I used this opportunity to move the check for block device inside the layout script, because I like to develop the code by running the ReaR scripts outside the system to be restored, and for this I need that they do not touch the hardware (that only the generated layout script does). This way I can take a saved layout information, run a modified script on it, and examine the resulting generated layout script.

Copy link
Member

@jsmeix jsmeix Oct 8, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is OK to move checks into diskrestore.sh as you think it is best
but I think we must not call Error() or BugError() in diskrestore.sh
see #2691 (comment)

I think it works inside diskrestore.sh to call e.g. LogPrintError()
to show an error message to the user and then exit the
diskrestore.sh script normally with non zero exit code
or by calling false and let it exit via set -e.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jsmeix

call e.g. LogPrintError()
to show an error message to the user and then exit the
diskrestore.sh script normally with non zero exit code

I think it is worth to have a helper function for this. Now, the question is whether to introduce new function(s), called maybe SubshellError() and SubshellBugError() and call them from the layout script, or whether make Error() and BugError() smarter by teaching them to DTRT when called from inside the layout script (this could be achieved by setting a dedicated variable at the start of the layout script, in 540_generate_device_code.sh, and then examining it in Error() and BugError()). The advantage of the latter approach would be that in the former approach, you need to classify functions into two categories: those called from the layout (or similar) script and those called from the regular ReaR code, use SubshellError()/SubshellBugError() in the former and Error()/BugError() in the latter and never mix them together.
(The same problem arises even without introducing helpers like SubshellError()/SubshellBugError(). Even if performing what you propose manually (call e.g. LogPrintError() to show an error message to the user and then exit the diskrestore.sh script normally with non zero exit code) you still need to classify functions according to whether one should do this or one is allowed to use Error()/BugError() normally.)

Copy link
Member

@jsmeix jsmeix Oct 12, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My thinking when I found out about this problem here
was to make Error() smarter.

I think only Error() need to be enhanced
but no other functions like BugError() that call Error()
because all what matter is the code in Error() that exits
the whole running rear program with all its child processes.

I am even thinking about no longer exit the whole ReaR
during "rear recover" by default when Error() is called
because this problem here (Error exit during diskrestore.sh)
shows that likely it is basically always wrong to
exit the whole ReaR during "rear recover" because
that results basically always bad UX for the user
because I think exit the whole ReaR during "rear recover"
is never helpful for the user - what should he do in that case?
In contrast clear ERROR messages during "rear recover"
with the option to proceed (cf. "Final power to the user!")
may help the user to shomehow complete "rear recover"
and afterwards fix the issues that lead to the error messages
or deliberately ignore them, reboot, and hope for the best...

I think the "How to deal with errors during 'rear recover'?" area
should be properly investigated via a new separated issue
#2695

@pcahyna
I leave it to you if you prefer to keep the current BugError
in this pull request here or to change it into a LogPrintError.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am even thinking about no longer exit the whole ReaR
during "rear recover" by default when Error() is called
because this problem here (Error exit during diskrestore.sh)
shows that likely it is basically always wrong to
exit the whole ReaR during "rear recover" because
that results basically always bad UX for the user

Your reasoning may apply to Error() but not to BugError(). There is not much that can be done when a code bug is detected - all bets are off in this situation and nasal daemons may fly, as you can never be certain what the bug is and how to handle it. After all, calling assert() or a similar function that aborts in places that are considered unreachable is a common practice, and a good one, as it permits detecting problems early instead of continuing with bad data. So I think that some variant of BugError() that aborts ReaR will always have its uses.
Of course, this does not mean that in the layout script BugError() needs to abort the whole ReaR. It is reasonable to abort only the script and let the user fix it (the script is meant to be adjustable by the user, after all).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jsmeix

I leave it to you if you prefer to keep the current BugError
in this pull request here or to change it into a LogPrintError.

I prefer to keep it as it is and enhance Error() (or BugError()) separately to behave properly in scripts executed in subshells.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is not about subshells.
Error() should (hopefully) behave properly in scripts executed in subshells,
see my comments in its code - I treid to make it behave properly.

It is about whether or not enforced abort of whole ReaR during "rear recover"
is helpful for the user compared to provide final power to the user
to let him decide as he wants to deal with an error.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is not about subshells.

By "behave properly in scripts executed in subshells" I meant, behave properly in the case of scripts like diskrestore.sh (are there any other scripts like this, that are dynamically generated and users can review and edit them before executing?) whose failure should not abort the whole ReaR process. I did not mean an arbitrary script sourced in an arbitrary subshell, sorry for the confusion.

@jsmeix jsmeix added this to the ReaR v2.7 milestone Oct 6, 2021
After this change, nothing in 100_include_partition_code.sh requires the
actual hardware anymore - only the saved layout. This enables running
the script outside of the rescue environment for testing, if one has
the saved layout data, and thus facilitates testing and experimenting a
lot.

Functional change: if the disk is not found, it is now the layout script
that aborts with BugError, not whole ReaR.
cat >> "$LAYOUT_CODE" <<EOF

#
# Code handling disk '$disk'
#

### Disks should be block devices.
[ -b "$disk" ] || BugError "Disk $disk is not a block device."

Copy link
Member

@jsmeix jsmeix Oct 8, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think BugError should not be called here
because the code is

create_disk() {
    local component disk size label junk
    read component disk size label junk < <(grep "^disk $1 " "$LAYOUT_FILE")

    cat >> "$LAYOUT_CODE" <<EOF

#
# Code handling disk '$disk'
#

### Disks should be block devices.
[ -b "$disk" ] || BugError "Disk $disk is not a block device."

so the BugError call happens inside the diskrestore.sh script
but usually BugError exits the whole running rear program
with all its child processes in particular via the function
terminate_descendants_from_children_to_grandchildren()
while any error in the diskrestore.sh script that runs with set -e
should only exit the diskrestore.sh script and then there is
a user dialog in the still running rear program
via layout/recreate/default/200_run_layout_code.sh
about the error exit of the diskrestore.sh script
so the user could even fix the diskrestore.sh script
and let the fixed one run again.

There may be other possibly wrong calls of BugError() or Error()
within code that gets run inside the diskrestore.sh script.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jsmeix I thought about that and decided that it is OK, because in several other places the layout script can call BugError (indirectly through create_disk_label and create_disk_partition from layout-functions.sh). Of course, if those other calls are wrong, this changes things.

Copy link
Member

@jsmeix jsmeix Oct 8, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pcahyna
I had never thought before if it is OK or wrong to call Error()
(also indirectly via BugError) within the diskrestore.sh script
so I expected that this already happens somewhere directly
or indirectly in our code.
I did not check what actually happens in case of Error()
inside diskrestore.sh.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By the way:

In general it is bad to error out the whole program from inside a function.
Normally a function should return a failure value to its caller
and let the caller decide what to do in this case.

Of course there are exceptions for example the function
assert_ISO_FILE_SIZE_LIMIT in lib/output-functions.sh currently at
https://github.com/rear/rear/blob/master/usr/share/rear/lib/output-functions.sh#L28

# create_logical_volumes - LVs in the VG need to be created using the lvcreate command
# create_thin_volumes_only - when the previous condition is true,
# do not create volumes that are not thin volumes.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you so much for that great explanation.
Now everyone understands why arrays are needed
so noone would have at a later time "a great idea"
to simplify that overcomplicated looking code ;-)

Copy link
Member

@jsmeix jsmeix left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I approve it, see
#2691 (comment)

@pcahyna
Copy link
Member Author

pcahyna commented Oct 13, 2021

Thanks for the review @jsmeix and @rmetrich, I plan to merge it today or tomorrow morning.

@pcahyna pcahyna merged commit db673c4 into rear:master Oct 14, 2021
@jsmeix
Copy link
Member

jsmeix commented Oct 15, 2021

@pcahyna
thank you for your valuable enhancement!

jsmeix added a commit that referenced this pull request Feb 28, 2023
In layout/prepare/GNU/Linux/110_include_lvm_code.sh
fixed typo in comment 'grup' -> 'group' that is from
#2691
@@ -67,10 +66,35 @@ create_lvmgrp() {

local vg=${vgrp#/dev/}

# If a volume group name is in one of the following lists, it
# means that the particular condition is valid for this volume
# grup. If it is not in the list, it means that the condition is
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I fixed the typo in comment 'grup' -> 'group' via
02dad20

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants